You can now run a GPT-3 level AI model on your laptop, phone and Raspberry Pi

Ars Technica

Things move fast in AI Land. On Friday, a software developer named Georgi Gerganov created a tool called “llama.cpp” that can run Meta’s new GPT-3-class AI large language model, LLaMA, locally on a Mac laptop. Soon after, people figured out how to run LLaMA on Windows as well. Then someone got it running on a Pixel 6 phone, and next came a Raspberry Pi (albeit running very slowly).

If this continues, we may be looking at a pocket-sized ChatGPT competitor before we know it.

But let’s back up for a second, because we’re not quite there yet. (At least not today, as in literally today, March 13, 2023.) What will arrive next week, no one knows.

Since the launch of ChatGPT, some people have been frustrated by the AI model’s built-in limits that prevent it from discussing topics that OpenAI has deemed sensitive. Thus began the dream – in some circles – of an open source large language model (LLM) that anyone could run locally without censorship and without paying API fees to OpenAI.

Open source solutions (such as GPT-J) do exist, but they require a lot of GPU RAM and storage space, and none of the open source alternatives could match GPT-3-level performance on readily available consumer-level hardware.

Enter LLaMA, an LLM available in parameter sizes ranging from 7B to 65B (that’s “B” as in “billion parameters,” which are floating-point numbers stored in arrays that represent what the model “knows”). LLaMA made an intoxicating claim: that its smaller models could match OpenAI’s GPT-3, the foundational model that powers ChatGPT, in the quality and speed of their output. There was just one problem: Meta released the LLaMA code as open source, but it held back the “weights” (the trained “knowledge” stored in a neural network), offering them only to qualified researchers.
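
To put those parameter counts into perspective, here is a rough, back-of-envelope Python sketch (illustrative numbers only, ignoring the extra overhead a real checkpoint and a running model carry) of how much storage LLaMA’s raw weights would occupy at different numeric precisions:

```python
# Back-of-envelope weight-storage math for LLaMA (illustrative only; real
# checkpoints and runtime memory add overhead beyond the raw parameters).

PARAM_COUNTS = {"7B": 7e9, "65B": 65e9}
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "4-bit": 0.5}

for name, n_params in PARAM_COUNTS.items():
    sizes = ", ".join(
        f"{precision}: ~{n_params * nbytes / 1e9:.1f} GB"
        for precision, nbytes in BYTES_PER_PARAM.items()
    )
    print(f"LLaMA {name} weights -> {sizes}")

# Prints roughly:
# LLaMA 7B weights -> fp32: ~28.0 GB, fp16: ~14.0 GB, 4-bit: ~3.5 GB
# LLaMA 65B weights -> fp32: ~260.0 GB, fp16: ~130.0 GB, 4-bit: ~32.5 GB
```

Halving the bytes stored per parameter roughly halves the memory needed to hold the model, which is why the quantization tricks described below matter so much for consumer hardware.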

Flying at the speed of LLaMA

Meta’s restrictions on LLaMA didn’t last long: on March 2, someone leaked the LLaMA weights via BitTorrent. Since then, there has been an explosion of development surrounding LLaMA. Independent AI researcher Simon Willison has compared the situation to the release of Stable Diffusion, an open source image synthesis model that launched last August. Here is what he wrote in a post on his blog:

It feels to me like that Stable Diffusion moment in August started the whole new wave of interest in generative AI, which was then pushed into overdrive by the release of ChatGPT in late November.

That Stable Diffusion moment is now underway again, for large language models – the technology behind ChatGPT itself. This morning I ran a GPT-3 class language model on my own personal laptop for the first time!

AI stuff was already weird. It’s about to get even weirder.

Typically, running GPT-3 requires several data center-class A100 GPUs (also, the weights for GPT-3 are not public), but LLaMA made waves because it could run on a single powerful consumer GPU. And now, with optimizations that reduce the model size using a technique called quantization, LLaMA can be run on an M1 Mac or a smaller consumer Nvidia GPU.
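
As a rough illustration of the principle (not llama.cpp’s actual on-disk format, which differs in detail), here is a toy Python sketch of block-wise 4-bit quantization: each block of floating-point weights is stored as small integers plus a single scale factor, at the cost of some rounding error when the weights are reconstructed.

```python
import numpy as np

# Toy sketch of block-wise 4-bit quantization: each block of float weights is
# stored as signed integers in [-7, 7] plus one float scale per block. This
# shows the general idea only; llama.cpp's real 4-bit formats differ in detail.

def quantize_block(weights: np.ndarray):
    """Quantize a block of floats to 4-bit-range integers with a shared scale."""
    max_abs = float(np.abs(weights).max())
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the quantized block."""
    return q.astype(np.float32) * scale

# Quantize a random 32-weight block and measure the rounding error introduced.
block = np.random.randn(32).astype(np.float32)
q, scale = quantize_block(block)
restored = dequantize_block(q, scale)
print("max absolute rounding error:", float(np.abs(block - restored).max()))
```

That rounding error is the trade-off: the model becomes small enough to fit in a laptop’s memory, but the compressed weights are only an approximation of the originals, which is why quantization can affect output quality (more on that below).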

Developments are moving so fast that it’s sometimes difficult to keep up. (Regarding the pace of progress in AI, a fellow AI reporter told Ars, “It’s like those videos of dogs where you tip over a crate of tennis balls. [They] don’t know where to chase first and get lost in the confusion.”)

For example, here’s a list of notable LLaMA-related events based on a timeline Willison laid out in a Hacker News comment:

  • February 24, 2023: Meta AI announces LLaMA.
  • March 2, 2023: Someone leaks the LLaMA models via BitTorrent.
  • March 10, 2023: Georgi Gerganov creates llama.cpp, which can run on an M1 Mac.
  • March 11, 2023: Artem Andreenko runs LLaMA 7B (slowly) on a Raspberry Pi 4 with 4GB of RAM, at 10 sec/token.
  • March 12, 2023: LLaMA 7B runs on NPX, a node.js execution tool.
  • March 13, 2023: Someone gets llama.cpp running on a Pixel 6 phone, also very slowly.
  • March 13, 2023: Stanford releases Alpaca 7B, an instruction-tuned version of LLaMA 7B that “behaves similarly to OpenAI’s ‘text-davinci-003’” but runs on much less powerful hardware.

After obtaining the LLaMA weights ourselves, we followed Willison’s instructions and ran the 7B parameter version on an M1 MacBook Air, and it runs at a reasonable speed. You invoke it as a script on the command line with a prompt, and LLaMA does its best to complete it in a sensible way.
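
For the curious, here is a minimal sketch of what that invocation can look like when wrapped in Python. The binary name, model path, flags, and prompt below are assumptions modeled on the project’s early examples, not a definitive recipe; consult the llama.cpp README for the exact options your build provides.

```python
import subprocess

# Minimal sketch: call a locally built llama.cpp binary with a prompt and print
# its completion. Paths and flags are assumptions; check your build's README.

cmd = [
    "./main",                                  # llama.cpp's example binary
    "-m", "./models/7B/ggml-model-q4_0.bin",   # 4-bit quantized 7B weights
    "-n", "128",                               # how many tokens to generate
    "-p", "The first man on the moon was ",    # the prompt LLaMA will complete
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
print(result.stdout)
```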

A screenshot of LLaMA 7B in action on a MacBook Air running llama.cpp.

Benj Edwards / Ars Technica

There is still the question of how much quantization affects the quality of the output. In our tests, LLaMA 7B compressed down to 4-bit quantization was very impressive for something running on a MacBook Air, but still not on par with what you’d expect from ChatGPT. It’s entirely possible that better prompting techniques could yield better results.

Optimizations and fine-tuning also tend to come quickly once everyone has their hands on the code and weights, even though LLaMA is still saddled with some fairly restrictive terms of use. Stanford’s release of Alpaca today shows that fine-tuning (additional training with a specific goal in mind) can improve performance, and it’s still early days after LLaMA’s release.

At the time of writing, running LLaMA on a Mac remains a fairly technical exercise. You must install Python and Xcode and be comfortable working on the command line. Willison has good step-by-step instructions for anyone who wants to try it. But that could soon change as developers keep coding away.

As for the implications of having this technology out in the wild – nobody knows yet. While some worry about the impact of AI as a tool for spam and misinformation, Willison says, “It’s not going to go uninvented, so I think our priority should be to find the most constructive ways to use it.”

At this point, our only guarantee is that things will change soon.
