Playing with llamafile and llama.cpp
Warning: This gets a little rambly, but I'm collecting a lot of my experimenting into one place. Sorry.
In my quest to toy with large language model (LLM) systems as a teacher, I went down the path of installing and using local models instead of reaching for one of the web-based services. The environmental impact is reduced, but the question of training data sources is still murky at best. I don't feel great about this part, so that will probably be another blog post.
I started with Mozilla's llamafile system, which let me download one file and then run it locally. Running a llamafile starts a process which gives access to the model through the browser via localhost and presents a ChatGPT-like interface. As far as ease of running goes, it works well - pretty much download and you're ready to go. One complaint is that the app greets you with a bunch of options to tune the model but no information on what they do or which ones would make a difference. Under the hood, it uses llama.cpp to power the interaction, and there's a helpful thread with suggested parameters.
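For anyone wanting to try it, the whole setup is just a download and a permission change. This is a rough sketch - the exact filename and Hugging Face URL below are examples of the pattern, so check Mozilla's llamafile repository for the current model files:

```shell
# Download a llamafile (one self-contained executable that bundles
# the model weights, llama.cpp, and the web UI)
curl -LO https://huggingface.co/Mozilla/mistral-7b-instruct-v0.2.llamafile/resolve/main/mistral-7b-instruct-v0.2.Q4_0.llamafile

# Mark it executable and run it
chmod +x mistral-7b-instruct-v0.2.Q4_0.llamafile
./mistral-7b-instruct-v0.2.Q4_0.llamafile
```

Running it starts a local server and opens the chat interface in your browser at `http://localhost:8080`.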
My computer is pretty much stock - no fancy GPU and modest RAM (8GB). The model loaded fine, but it was by no means snappy. Most of the forum posts are full of people running 30+ GB of RAM or higher-end GPUs. I was able to squeeze ~2.5 tokens/sec out of this machine, which was pretty slow.
To try and speed things up, I grabbed the smaller Phi-2 llamafile, which is meant to run on low-power machines. In terms of speed, it was by far the fastest, processing 15.5 tokens/sec of input text and producing 5.2 tokens/sec. I was actually kind of surprised at how well this particular model did. The prompts needed to be very specific to work well, and the responses tended to be simpler than what Mistral was returning.
I also tried out the LLaVA 1.5 model llamafile because it includes image interpretation. That brought my computer to a screeching halt while it was interpreting and describing a graph, so I think that's a nonstarter right now.
After doing more reading, I decided to try running the larger Mistral model in the command line directly rather than relying on the web interface. There was a little bit of hinting on Reddit and other forums that the CLI interaction was faster. Given that the web app is just piping information from the running model, I doubted it would make much of a difference, and indeed I didn't see any major differences.
The llamafile format always loads the web app, so to have more control over the interaction, I had to download and build llama.cpp directly. Running the Mistral model in the CLI was about the same as running it through the browser UI.
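The build-and-run steps were straightforward. This is a sketch of what I mean, assuming a separately downloaded GGUF copy of the model - the model path is a placeholder, and llama.cpp's build system and binary names have changed over time, so consult its README for your version:

```shell
# Clone and build llama.cpp from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# Run a one-shot prompt against a local GGUF model file:
# -m  path to the model weights
# -p  the prompt text
# -n  maximum number of tokens to generate
./main -m ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  -p "Write two questions asking students to explain the difference between covalent and ionic bonds." \
  -n 256
```

When it finishes, llama.cpp prints timing statistics (prompt eval and generation speeds), which is where the numbers in the table below come from.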
I prompted each model with something I would use in school:
Write two questions asking students to explain the difference between covalent and ionic bonds.
Model | UI | Total eval time (sec) | Prompt eval (tokens/sec) | Generation (tokens/sec) |
---|---|---|---|---|
Mistral 7B | CLI | 53 | 3.3 | 2.5 |
Mistral 7B | llamafile (web) | 32.2 | 7.6 | 2.1 |
LLaVA 1.5 7B | llamafile (web) | 106.8 | 4.3 | 1.6 |
Phi-2 | llamafile (web) | 17.2 | 13.5 | 3.5 |
The main thing I noticed was that the responses were more concise through the CLI because my prompts didn't have the standard preamble loaded by the web app:
This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.
The Phi-2 responses were more general and much more conversational. When I prompted Mistral for two questions, I got two high-level questions which I could potentially use:
- What are the key differences between ionic and covalent bonding, and how do these bonding types affect the properties of atoms and molecules?
- Can you provide examples of common substances that exhibit ionic and covalent bonding, and describe how these bonding types contribute to the overall structure and behavior of these substances?
Phi-2 gave me this:
User: Write two questions asking students to explain the difference between covalent and ionic bonding.
Llama: Sure! What are your thoughts on how these types of chemical bonds differ? Can you provide an example for each type of bond, please?
Follow-up Exercise 1: Provide a detailed explanation of what is meant by "covalent" in chemistry. Give examples and explain the differences between covalent bonding and ionic bonding.
I'm not sure which I prefer right now. Phi-2 tended to ramble and gave less helpful responses to prompts. I liked the way Mistral was responding, but it took so much longer. The time sunk into re-prompting Phi-2 might be the same as just waiting longer for Mistral's first responses.
As far as which UI to use: I don't do much development, so I don't really have a console open all the time, but I do like the minimal interface. On the other hand, I keep Firefox open all day, and a dedicated tab might see more use.
All that said, I also could have just written those questions myself. I'm not doing enough assessment right now to need to generate all kinds of questions, and there are already websites which will create all kinds of randomized chemistry questions with a couple of clicks.
Comments