Yesterday, I spent some time upgrading from LLaMA 3.0 to LLaMA 3.1. I also updated the inference engine, llama.cpp, which was 3 months behind.

So far, the 8B-parameter model is way faster and smarter. The 70B model barely fits in my GPU memory and is very slow. I don’t think it’s useful for real-time tasks, but I can batch some tasks and run them overnight.
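Since the 70B model only makes sense for offline work here, this is a minimal sketch of what such an overnight batch run could look like, assuming the llama-cpp-python bindings and a local GGUF file; the model path, file names, and parameters are placeholders, not my exact setup.

```python
# Overnight batch sketch: read questions from a file, answer them one by one,
# and append results to a JSONL file. Paths and settings are hypothetical.
import json
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.1-70b-instruct-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload as many layers as fit in GPU memory
    n_ctx=8192,
    verbose=False,
)

prompts = [line.strip() for line in open("questions.txt") if line.strip()]

with open("answers.jsonl", "w") as out:
    for prompt in prompts:
        result = llm(prompt, max_tokens=512)
        out.write(json.dumps({
            "prompt": prompt,
            "answer": result["choices"][0]["text"],
        }) + "\n")
```

Writing JSONL as it goes means a crash partway through the night doesn’t lose the answers already generated.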

In the morning, I copy-pasted some questions I had previously asked ChatGPT, and surprisingly, some of the 8B-parameter model’s answers were even better. That’s amazing.

TL;DR: it’s faster and smarter, but the 70B-parameter model is a little too big for my GPU.
