I have a bit of a surprise. I came up with a simple idea: make an LLM play a text-based game, the classic Zork. For those unfamiliar, Zork is one of the earliest interactive fiction games, originally developed in the late 1970s. It’s a text-based adventure in which the player types commands to navigate a mysterious underground world, solve puzzles, collect treasures, and overcome obstacles like trolls or locked doors. There are no graphics, only descriptions of environments, objects, and events that the player interacts with using simple text commands such as “go north,” “examine lantern,” or “attack troll.” It’s considered a classic but can be quite challenging, as it requires logical thinking and careful exploration.

On the surface, it seemed simple to implement, and indeed it was. However, my LLaMA 3.1 model with 8B parameters struggled a lot to make any meaningful progress in the game.
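The whole thing boils down to a read-generate-write loop. Here is a minimal sketch of what that can look like, assuming dfrotz (the terminal build of the Frotz interpreter) to run the game and a local Ollama server for the model; the file name, model tag, and prompt wording are illustrative placeholders, not my exact setup.

```python
# Minimal sketch: drive Zork through dfrotz, ask a local LLM for each move.
# Assumes `dfrotz` is installed and Ollama is serving llama3.1:8b locally.
import pexpect
import requests

SYSTEM = ("You are playing Zork. Reply with exactly one short game "
          "command, such as 'go north' or 'take lantern'.")

game = pexpect.spawn("dfrotz zork1.z5", encoding="utf-8")  # placeholder file name
history = []

while True:
    game.expect(">")                    # wait for the game's input prompt
    observation = game.before.strip()   # everything Zork printed this turn
    history.append(observation)

    # Keep the prompt small: only the last few turns of context.
    prompt = SYSTEM + "\n\n" + "\n".join(history[-6:]) + "\n\nNext command:"
    resp = requests.post(
        "http://localhost:11434/api/generate",  # Ollama's local endpoint
        json={"model": "llama3.1:8b", "prompt": prompt, "stream": False},
        timeout=120,
    )
    lines = resp.json()["response"].strip().splitlines()
    command = lines[0] if lines else "look"  # fall back if the reply is empty

    print(f"> {command}")
    game.sendline(command)              # feed the move back into Zork
    history.append("> " + command)      # remember our own moves too
```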

I quickly rewrote the program to use OpenAI’s GPT-4o, and it performed much better, though at a cost: it was burning $1 per hour. The cost would have been higher, but I made the app read the output aloud using TTS so I could follow the progress without needing to read everything manually.
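The TTS part is only a few lines. Here is a rough sketch, assuming the OpenAI Python SDK with its tts-1 model; the voice choice and the afplay player are my assumptions, not necessarily what the app used.

```python
# Rough sketch: read a chunk of game output aloud via OpenAI TTS.
import subprocess
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def speak(text: str, path: str = "turn.mp3") -> None:
    """Synthesize one chunk of game output and play it."""
    audio = client.audio.speech.create(model="tts-1", voice="alloy", input=text)
    audio.write_to_file(path)
    # afplay is macOS-only; swap in mpg123/ffplay on Linux.
    subprocess.run(["afplay", path], check=False)
```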

Before switching to GPT-4o, I spent the entire weekend tuning prompts with LLaMA 3.1 8B, hoping to at least get it to defeat the troll in the basement. Instead, it kept walking in circles through the forest or struggling to turn on the lantern.

Trying to save money, I decided to play middleman manually, copying the game’s output into ChatGPT and pasting its commands back. Of course, GPT-4o had no problems. I also tried GPT-4o mini, but it ran into the same kinds of issues as my smaller LLaMA 3.1 model. In contrast, the full GPT-4o breezed through the game, managing the inventory, defeating the troll, and exploring the dungeon with no trouble.

Meanwhile, Meta’s LLaMA 3.1 model with 405B parameters got lost in the forest, and Gemini got stuck on the “examine” command. Zork returns very little information when you examine most objects, and Gemini ended up gaslighting me, suggesting that I might have a buggy version of the game or was making typos in my commands.

Claude performed reasonably well, but I didn’t explore much further with it because I was tired of copy-pasting commands endlessly. I also tried LLaMA 3.1 with 70B parameters. Since it runs painfully slow on my laptop, I set it up and lay on the bed watching YouTube on my Quest 3 for two hours. When I came back, I found the model stuck in a ridiculous loop, still trying to open a door that was nailed shut.

TL;DR: I’m quite surprised that LLMs struggle so much with a seemingly simple game. Maybe one day they’ll finally get past the troll—until then, I’ll stick to YouTube.
