Basic LLM Inference Memory Math
Explaining why these LLMs need so much VRAM
Download More RAM? Nah, Download More VRAM!
Unfortunately, downloading more RAM won’t save you with Large Language Models; you’ll need GPU RAM (also called VRAM) that can actually fit them.
If you’re renting GPUs from AWS, Azure, or GCP, you can expect to spend at least $1 per hour. If you want to buy an A10 or better graphics card, it can set you back ~$3,000, and an A100 or H100 GPU costs even more.
But wait, why do you need these ultra-expensive graphics cards? Can’t you just use your regular graphics card to do some inference?
Well, unfortunately, if you want a smarter model, the math is not on your side.
The Math: A Back-of-the-Envelope Calculation
tl;dr: take the number of parameters in billions, multiply by 2 to get gigabytes, then add another ~2 GB for overhead
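As a quick sanity check (assuming 16-bit weights, i.e. roughly 2 bytes per parameter): a 7B-parameter model needs about 7 × 2 = 14 GB for the weights alone, plus ~2 GB of overhead, so around 16 GB of VRAM before you’ve generated a single token.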
If you’ve ever seen LLMs written out, many come with a number of “parameters” in their name. Llama 2 7B, for example, has roughly 7 billion parameters, and Falcon 40B has roughly 40 billion.
While there are many ways to estimate the memory required, I’m going to give you the back-of-the-envelope one.
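If you prefer code, here’s a minimal Python sketch of that rule of thumb. The function name and the flat 2 GB overhead default are my own illustrative choices, not an official formula:

```python
def estimate_vram_gb(num_params_billions: float,
                     bytes_per_param: float = 2.0,
                     overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate (in GB) for loading an LLM for inference.

    Assumes 16-bit weights (2 bytes per parameter) and a flat ~2 GB
    of overhead, as in the rule of thumb above. Real overhead grows
    with batch size, sequence length, and the KV cache.
    """
    weights_gb = num_params_billions * bytes_per_param
    return weights_gb + overhead_gb


# A 7B model at FP16: roughly 16 GB of VRAM.
print(estimate_vram_gb(7))    # 16.0
# A 70B model at FP16: roughly 142 GB, i.e. multiple GPUs.
print(estimate_vram_gb(70))   # 142.0
```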