One of the main difficulties people face when deploying their own AI model on a VPS is picking a plan with enough RAM. With too little, the model fails to load, crashes partway through a request, or runs so slowly that you start to regret your life choices. The secret to a successful deployment lies in RAM.
If you have ever tried running a 7B model with 4GB of RAM, you probably know what I am talking about. In this article, I want to discuss how much RAM you need for popular AI models, which factors affect it, and how to pick a VPS plan accordingly.
Why RAM Matters for AI Models
During inference, all of a language model's weights have to be stored in RAM, because the model generates its output from them. An LLM cannot practically rely on the load-as-needed approach that most other software uses: every parameter takes part in producing each token, so all of them have to be resident at once. A 7B model is not a lookup table that can be consulted one entry at a time; it is a set of large matrices holding 7 billion parameters.
This is what will happen if you do not have sufficient RAM:
- The model cannot even load. An out of memory error occurs before anything else.
- Disk swapping starts. All of a sudden, your fast NVMe disk turns out to be the bottleneck, and what is supposed to take two seconds takes two minutes.
- Your whole VPS hangs. Other applications on your server break because the kernel terminates processes to free memory.
- The accuracy of the prediction remains the same, but the speed drops. Insufficient memory does not make the model's output worse; it simply makes it slower.
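
A quick pre-flight check can catch the first failure mode before you try to load anything. The sketch below is only an illustration: it assumes a Linux host that exposes `/proc/meminfo`, and the helper names and the 1.5GB headroom default are assumptions chosen for the example, not values from any tool.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: kibibytes}."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def fits_in_ram(model_gb, meminfo_text, headroom_gb=1.5):
    """Return True if MemAvailable covers the model plus some headroom.

    headroom_gb is an assumed safety margin for the inference server itself.
    """
    available_kib = parse_meminfo(meminfo_text)["MemAvailable"]
    available_gb = available_kib * 1024 / 1e9
    return available_gb >= model_gb + headroom_gb
```

On a real server you would call it as `fits_in_ram(5.0, open("/proc/meminfo").read())`, with the first argument set to the size of the model file you plan to load.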
So choosing the right amount of memory matters no matter which framework you use to run the model.
What Factors Affect RAM Needs?
The number of parameters is the first thing that comes to mind, but it is not the only factor. A few others are equally significant.
Second in significance comes quantization, the process of reducing how many bits are used to store each weight. For instance, running a 7B model at FP16 precision requires approximately 14GB of RAM. If you lower the precision to 4 bits, you need only 4 to 5GB. This may affect the results slightly, but in most cases the difference is hard to notice.
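
The arithmetic behind those figures is simple enough to sketch. The function below is a hypothetical helper that computes only the raw weight size; real 4-bit formats also store some scaling data per block and the runtime adds its own overhead, which is presumably why the practical figure lands at 4 to 5GB rather than exactly 3.5GB.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Raw size of the weights in GB (1 GB = 1e9 bytes), excluding overhead."""
    return params_billion * bits_per_weight / 8


for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"7B at {label}: {weight_memory_gb(7, bits):.1f} GB")
```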
The model format also affects memory usage. A model run through llama.cpp consumes a different amount of memory than the same model run through the transformers module in Python. The GGUF format native to llama.cpp is known to consume the least memory for CPU inference.
There is also the context window, the cost that people don't talk about. Every token of context takes memory. A 7B model running with 2K tokens of context might use 6GB of memory, but with 32K tokens it can consume 10GB or more. Keep this in mind when designing applications that process long documents.
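
Most of that growth comes from the key/value cache, which stores one key and one value vector per layer for every token in context. The sketch below estimates its size; the default shape (32 layers, 8 key/value heads, head dimension 128, 16-bit cache entries) is an assumption meant to resemble a 7B-class model with grouped-query attention, so substitute the values from your own model's configuration.

```python
def kv_cache_gib(context_tokens, n_layers=32, n_kv_heads=8,
                 head_dim=128, bytes_per_value=2):
    """Approximate key/value cache size in GiB for one session.

    The default shape is an assumed 7B-class configuration, not a measurement.
    """
    # Factor of 2: one key vector and one value vector per layer per token.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 2**30


print(kv_cache_gib(2048))   # 2K tokens of context
print(kv_cache_gib(32768))  # 32K tokens of context
```

Under these assumptions the cache grows from a quarter of a GiB at 2K tokens to 4 GiB at 32K, which is broadly in line with the jump described above. Models without grouped-query attention keep more key/value heads and grow faster.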
Finally, concurrent sessions increase the necessary resources. The difference between one person interacting with a 7B model and ten people using the same model at once is huge. If you are building anything beyond a toy project, always keep this in mind.
The OS itself will also eat up some of your RAM. Your Linux machine alone needs 1-2GB to work properly. Whatever services you plan to run alongside your model will add to that.
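
Putting the factors together gives a rough budget. This is a sketch under a stated assumption: a single inference server loads the weights once and keeps a separate context cache per session. The function name and the example inputs are hypothetical.

```python
def total_ram_gb(weights_gb, cache_per_session_gb, sessions=1,
                 os_gb=2.0, other_services_gb=0.0):
    """Rough RAM budget: shared weights + per-session cache + OS + services."""
    return (weights_gb
            + cache_per_session_gb * sessions
            + os_gb
            + other_services_gb)


# Example inputs are illustrative: 4.5 GB of 4-bit weights, 0.25 GB cache each.
print(total_ram_gb(4.5, 0.25, sessions=1))
print(total_ram_gb(4.5, 0.25, sessions=10))
```

If you instead run one full copy of the model per user, the weights term multiplies by the number of copies as well.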
RAM Requirements for Model Sizes
This is the practical bit. Here is what the typical model sizes look like in practice:
| Model Size | Examples | RAM (Min / Comfortable) | Best Use Cases | Performance Notes |
|---|---|---|---|---|
| Small (3B) | Llama 3.2 3B, Phi 3 Mini, Gemma 2B | 4GB min / 8GB comfortable; weights 2-3GB at 4-bit | Chatbots, classification, basic summarization, prototyping, automation, beginner experiments | 10-15 tokens per second on a standard VPS. Great fit for modern CPUs. Plenty fast for conversational use. |
| Medium (7B) | Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B, Gemma 2 9B | 8GB min / 16GB comfortable; weights 5-6GB at 4-bit | Real conversational use, coding assistance, document Q&A, content generation, practical AI tools | Sweet spot for self hosting. Quality close to frontier APIs when prompted well. Mid tier VPS friendly. |
| Larger (13B) | Llama 2 13B, Qwen 2.5 14B, various fine tunes | 16GB min / 24GB comfortable; weights 9-10GB at 4-bit | Customer support bots, programming assistants, content workflows needing better consistency than 7B | Noticeable quality jump in reasoning and long form writing. 4-7 tokens per second on CPU only. |
| Large (30B) | Qwen 2.5 32B, Yi 34B, Mixtral 8x7B | 32GB min / 48GB comfortable; weights around 20GB at 4-bit | Production applications, complex reasoning, replacing paid API calls where quality matters | Resources start feeling tight. 2-4 tokens per second on CPU. Workable for non realtime, weak for chat UIs. |
| 70B+ | Llama 3.1 70B, Qwen 2.5 72B | 48GB min / 64GB+ comfortable; weights 40GB+ at 4-bit | Heavy production workloads, frontier quality requirements, advanced reasoning tasks | Technically possible without GPU but rarely practical. 1-2 tokens per second on CPU. Consider GPU instances. |
AI Model RAM Comparison Table
Here is the quick reference version of everything above.
| Model Size | Quantized RAM Need | Comfortable Total RAM | Best Use Cases | Suggested VPS Tier |
|---|---|---|---|---|
| Small (3B) | 2-3GB | 4-8GB | Hobbyist, testing, light chat | Entry VPS |
| Medium (7B) | 5-6GB | 8-16GB | Mainstream self hosting, real apps | Mid tier VPS |
| Larger (13B) | 9-10GB | 16-24GB | Better quality apps, coding tools | Higher VPS plan |
| Large (30B) | 18-20GB | 32-48GB | Production workloads, complex tasks | High RAM VPS |
| 70B+ | 40GB+ | 48-64GB+ | Heavy production, frontier quality | Top tier or dedicated |
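Numbers assume 4-bit quantization with moderate context windows. Large context windows can add several GB on top, and FP16 weights alone take roughly three times the 4-bit figures (about 14GB versus 4-5GB for a 7B model), so budget well beyond these numbers in either case.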
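
The table can also be read in the other direction: given a RAM budget, which size class fits? The lookup below is a small sketch that copies the minimum and comfortable figures from the tables above; the function name is hypothetical, and the same caveats about quantization and context size apply.

```python
from typing import Optional

# (label, minimum RAM in GB, comfortable RAM in GB), smallest to largest.
SIZE_CLASSES = [
    ("Small (3B)", 4, 8),
    ("Medium (7B)", 8, 16),
    ("Larger (13B)", 16, 24),
    ("Large (30B)", 32, 48),
    ("70B+", 48, 64),
]


def largest_model_for(ram_gb: float, comfortable: bool = True) -> Optional[str]:
    """Return the largest size class that fits in ram_gb, or None if none fits."""
    best = None
    for label, min_gb, comfortable_gb in SIZE_CLASSES:
        needed = comfortable_gb if comfortable else min_gb
        if ram_gb >= needed:
            best = label
    return best


print(largest_model_for(16))                     # comfortable fit
print(largest_model_for(16, comfortable=False))  # bare minimum fit
```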
Can You Run LLMs Without GPUs?
Of course, and this is something most people are unaware of. While the general assumption in the open-source world is that large language models need GPU inference, the reality is that CPU-only inference works just fine for many practical applications.
llama.cpp and its many derivatives have made CPU inference feasible. With an up-to-date CPU and adequate memory, models of up to 7B parameters run fast enough for interactive chat apps.
When should CPU inference be used?
- When your traffic is sparse and bursty instead of dense and continuous.
- When you are developing tools internally that do not need quick response times.
- When you need to pay predictable prices without the extra GPU charges.
- When you have batch jobs instead of real-time conversations.
When should you avoid CPU inference?
- When you have multiple simultaneous users who need fast response times.
- When you are using models larger than 30B parameters.
- When speed is valued over individual request costs.
For most users who self-host, especially newcomers, a CPU-based VPS with sufficient RAM is plenty.
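
Whether CPU inference is fast enough for your case is easiest to settle by measuring it. The helper below is a minimal sketch that times any callable returning a stream of tokens; `fake_generate` is a hypothetical stand-in, and you would replace it with a function that wraps your own inference library's streaming output.

```python
import time
from typing import Callable, Iterable


def tokens_per_second(generate: Callable[[], Iterable[str]]) -> float:
    """Time one generation run and return tokens produced per second."""
    start = time.perf_counter()
    count = sum(1 for _ in generate())
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


def fake_generate():
    """Hypothetical stand-in for a model: 20 tokens, about 10 ms apart."""
    for i in range(20):
        time.sleep(0.01)
        yield f"tok{i}"


print(f"{tokens_per_second(fake_generate):.1f} tokens/s")
```

Measure with prompts and context lengths that match real usage, since long contexts slow generation as well as consuming memory.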
Common Errors in Estimating RAM Size
I see this happen again and again in forums and support tickets.
Neglecting the operating system. Users carefully measure the RAM requirements of the model and choose a VPS with exactly that much RAM. The OS itself, the inference server, and all other processes easily consume 1-2GB. Add a buffer.
Assuming full precision. Many tutorials quote RAM requirements for FP16 or even FP32 weights. In reality, very few people run non-quantized models on VPS hardware. Plan for quantized models unless there is an explicit reason not to.
Disregarding the context window cost. After testing the model with short prompts, users move on to documents of substantial size and are surprised by the high RAM consumption. Test with realistic context sizes.
Not accounting for multiple instances. Running two 7B models consumes twice as much RAM. To run them simultaneously, either upgrade significantly or use a dedicated inference server such as vLLM or Text Generation Inference.
Sizing for the model only. Your VPS probably also runs your application code, a database, maybe a web server, and other services. The model is just one tenant.
Other VPS Specs That Matter
RAM is the headline requirement, but it is not the only thing that affects AI performance.
CPU cores and clock rate determine how fast CPU-only inference runs. More cores generally mean faster generation, particularly with llama.cpp, which makes good use of multithreading. A modern multicore CPU will outperform an older one in tokens per second.
NVMe storage matters more than people think. Models are large files, often 4GB to 40GB depending on size and quantization. Loading them from slow disks adds seconds to startup time. NVMe also matters if you end up needing swap, though you should really avoid swap for production AI workloads.
Network bandwidth matters if you are downloading models frequently or serving API responses to many users. For most self hosting, this is not the bottleneck.
Swap tuning deserves some thought as well. You can leave swap enabled as a safety net, but aggressive swapping is detrimental to the performance of large language models.
Choosing the Right VPS Plan for Your AI Model
Here’s how you can choose based on your requirements.
For the hobbyist or learner running 3B models, trying out prompts, and building toy projects: you need an entry-level VPS with 8GB of RAM. This gives you enough space for a small model, the OS, and your code. It is affordable enough to keep running and powerful enough to do something productive.
For the developer building applications on 7B models, deploying chat functionality, or integrating AI into existing applications: you need an intermediate VPS with 16GB of RAM. This can handle a 7B model along with your app and reasonable context window sizes.
For the dedicated self-hoster running 13B models, building tools that others rely on, and wanting better output quality: you need an advanced VPS with 24 to 32GB of RAM.
For production deployments of 30B+ models that serve actual users and require consistent quality: you will need a high-RAM VPS with at least 48GB. This is the point at which to consider GPU instances, but in many cases CPU plus RAM can still be cost effective.
The good thing about VPS hosting is that you can usually upgrade as your needs grow. Starting smaller and scaling up is often smarter than overprovisioning from day one.
Why UltaHost Works for Self Hosted AI
Another positive point about UltaHost VPS packages is that they offer a plan at every memory level, which makes it easy to select one that matches your needs. You are not paying for resources you will not use, whether you are running a 3B model with limited RAM or a 30B model with plenty of memory.
In addition, these plans give you the flexibility to scale up smoothly, for example from 7B models to 13B. NVMe drives in all plans keep model loading fast, and the CPU cores provide decent inference speed in CPU-only configurations.
When you host AI models yourself, an excellent upgrade path matters more than the lowest price.
Conclusion
Selecting the right VPS for AI applications comes down to making realistic estimates of the RAM you need. The main inputs are the size of the model and its quantization level. No matter how talented a coder you are, you simply won't get the job done if you don't do the math right.
Summary: 4GB for 3B models, 8GB for 7B models, 16GB for 13B models, 32GB for 30B models, and 48GB or more for 70B models. Remember to leave additional space for the OS, context, concurrent sessions, and unforeseen circumstances. Quantize your model: using the GGUF file format with 4-bit quantization saves a lot of RAM with little noticeable loss in quality.
When in doubt, go for the smaller model rather than the larger one. Most people overestimate how much model, and therefore how much memory, they need.
A well-chosen and properly configured 7B model on suitable hardware beats a poorly chosen 70B model on unsuitable hardware any day.
So go ahead, pick a model, select the plan, and conquer!