Thoughts after building a text-adventure game using local models

Here are my thoughts so far after trying to build an LLM-powered text-adventure game that behaves sensibly. The idea is that I wanted an app that's playable without the need to edit the outputs by hand, with the ultimate test of "can my 10-year-old son have a chance in hell of playing it".

The good parts:

  • I've been playing with Llama models regularly since they came out, and they are a leap forward compared to the medieval days of BERT and GPT-2.
  • we can do a lot with just prompting, no fine-tuning required. The instruct API is a very natural way to work with text, and it kinda does what you ask.
  • I reckon a fun choose-your-own-adventure game is possible, as long as it's just ABC choices and we let the player delete or regenerate particularly bad entries. This keeps the UI complexity down (see the sketch after this list).
  • No matter your hardware architecture, you have a good chance of running accelerated inference. It might be painful to set up, but most of the time it's possible.
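
To make the "just prompting" point concrete, here's a minimal sketch of the kind of ABC loop I mean, using llama-cpp-python. The model file, system prompt, and wording are illustrative placeholders, not my actual game code:

```python
# Minimal ABC-choice loop: prompt an instruct model, let the player pick a
# choice or reroll a bad scene. Only accepted scenes enter the context.
from llama_cpp import Llama

llm = Llama(model_path="some-instruct-model.Q5_K_M.gguf",  # placeholder path
            n_ctx=4096, n_gpu_layers=-1, verbose=False)

def narrate(history: str, action: str) -> str:
    """Ask the model to continue the story after the player's last action."""
    out = llm.create_chat_completion(
        messages=[
            {"role": "system",
             "content": ("You are the narrator of a text-adventure game. "
                         "Describe the outcome in 2-3 plain sentences, then "
                         "offer exactly three choices labeled A, B and C.")},
            {"role": "user",
             "content": f"Story so far:\n{history}\n\nThe player chose: {action}"},
        ],
        max_tokens=256,
        temperature=0.8,
    )
    return out["choices"][0]["message"]["content"]

history = "You wake up in a mossy clearing."
action = "look around"
while True:
    scene = narrate(history, action)
    print(scene)
    pick = input("A/B/C to act, R to regenerate a bad scene: ").strip().upper()
    if pick == "R":
        continue                       # throw the scene away and reroll
    history += f"\n{scene}\nPlayer chose: {pick}"
    action = pick
```

The regenerate branch is what keeps the UI simple: instead of an in-place text editor, the player only ever needs one extra button.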

The specific issues I've repeatedly run into that have so far frustrated my efforts:

  • a human with a cattle prod always seems to be required to guide the model. The longer the game goes on, the more the mistakes pile up, causing the model to become unstable and fall off the rails. The ability to rewrite the text by hand is a must, which I don't mind, but it makes the UI more convoluted and the app unsuitable for a general audience. This is the number one thing that prevents my son from being able to play it.
  • shiver down my spine, the GPTisms are really strong, and they result in a frustrating experience that's only fit for tinkerers. The models are moralistic, and I can't stop them from writing flowery prose full of awkward adjectives. The models are stuck operating in a small fragment of the language latent space (the "protestant priest mode" 😛). Perhaps we need more foundation models?
  • each model I have tried struggles with some aspect of the game. For example, a model might do really well describing the consequences of actions, but fail miserably when rephrasing the location text. Among the RP models that produce interesting outputs, there is no single model I can find that runs all the prompts the way I want. Maybe I need a few tiny models, or one large model with a LoRA per use case? (There's a sketch of the routing idea after this list.)
  • model installation is still a wilderness. Thank g** for GGUF democratizing access, but there is no single good runtime that covers all use cases. Backends have edge cases, CUDA is bulky and ugly, and Metal doesn't work half the time 😏. Still, this is a gigantic improvement compared to 6 months ago...
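
For the "one model per prompt type" idea, here's a hedged sketch of how I imagine the routing would work. The task names and model files are hypothetical, and in practice you'd have to swap models in and out to stay within 32GB of VRAM:

```python
# Route each prompt type to whichever model handles it best in testing.
from llama_cpp import Llama

SPECIALISTS = {
    "consequences": "rp-model-a.Q5_K_M.gguf",  # strong at action outcomes
    "rephrase":     "rp-model-b.Q5_K_M.gguf",  # strong at rewording locations
}
_loaded: dict[str, Llama] = {}

def run_task(task: str, prompt: str) -> str:
    """Lazily load the specialist for this task and run the prompt on it."""
    if task not in _loaded:
        _loaded[task] = Llama(model_path=SPECIALISTS[task], n_ctx=4096,
                              n_gpu_layers=-1, verbose=False)
    out = _loaded[task](prompt, max_tokens=200, temperature=0.7)
    return out["choices"][0]["text"]

# e.g. run_task("rephrase", "Rewrite this room description in plain prose: ...")
```

A LoRA-per-use-case version would be the same shape, just swapping adapters on one base model instead of whole GGUF files.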

Some more context: I don't claim to be an expert model prompter, so incremental improvements are definitely possible here. I'm also limited to models that fit in 32GB of VRAM (but so would a hypothetical player be). Finally, fine-tuning on text-adventure games and MUDs could be an interesting avenue to explore, which I haven't tried yet.

In summary, it feels like there needs to be an order-of-magnitude change before apps where the LLM is the centerpiece (and not just a helper) become possible. Hopefully that isn't too far away, given the pace of advances in hardware and the science 🥳

Can someone please build an LLM-augmented MUD already? 😂