Basic LLM Inference Memory Math

Matt Kornfield
3 min read · Jan 30, 2025

Explaining why these LLMs need so much VRAM


Download More RAM? Nah, Download More VRAM!

Unfortunately, downloading more RAM won’t save you with Large Language Models; you’ll need GPU RAM (also called VRAM) large enough to fit them.

If you’re renting GPUs from AWS, Azure, or GCP, you can expect to spend at least $1 per hour. If you want to buy an A10 or better graphics card, it can set you back around $3,000, and even more for an A100 or H100 GPU.

But wait, why do you need these ultra-expensive graphics cards? Can’t you just use your regular graphics card to do some inference?

Well, unfortunately, if you want a smarter model, the math is not on your side.

The Math: A Back-of-the-Envelope Calculation

tl;dr: multiply the number of parameters (in billions) by 2 to get gigabytes, then add another ~2 GB for overhead.
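For instance, assuming 16-bit (2-byte) weights, a 7-billion-parameter model needs roughly 7 × 2 = 14 GB for the weights alone, plus ~2 GB of overhead, for about 16 GB of VRAM.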

Many LLMs, if you’ve ever seen them written out, come with the number of “parameters” right in their name: a model labeled “7B,” for example, has roughly 7 billion parameters.

While there are many mechanisms to estimate the memory required, I’m going to give you the back-of-the-envelope one.
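As a rough sketch of that rule of thumb (my own illustration, assuming 16-bit weights and a flat ~2 GB of overhead), here’s what it looks like in Python:

```python
# Back-of-the-envelope VRAM estimate for LLM inference.
# Assumes 2 bytes per parameter (FP16/BF16 weights) plus a flat
# ~2 GB buffer for the runtime, activations, and KV cache.
def estimate_vram_gb(params_in_billions: float,
                     bytes_per_param: int = 2,
                     overhead_gb: float = 2.0) -> float:
    return params_in_billions * bytes_per_param + overhead_gb

print(estimate_vram_gb(7))    # ~16 GB: fits on a 24 GB card like an A10
print(estimate_vram_gb(70))   # ~142 GB: needs multiple 80 GB A100s/H100s
```

The same sketch also shows why quantization helps: drop `bytes_per_param` to 1 (8-bit) or 0.5 (4-bit) and the weight footprint shrinks proportionally.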
