

    Large Language Models (LLMs) have gained a lot of popularity in recent years and have become a cornerstone technology in AI-powered applications. LLMs are being used in many different ways, from chatbots and virtual assistants to data analysis and creative writing. With the explosion of available models on platforms like Hugging Face, selecting the right one for your application can be overwhelming.

    In this article, we’ll break down three leading open-source LLM families — Llama, Mistral, and DeepSeek — and compare them across (1) compute requirements, (2) memory footprint, (3) latency vs. throughput trade-offs, (4) production deployment considerations, (5) safety considerations, and (6) benchmark performance. Whether you’re a beginner or an experienced AI engineer, we’ll explain the key concepts in accessible terms without sacrificing technical depth.

    1. Compute requirements of Llama, Mistral, and DeepSeek
      1.1. Model sizes and FLOPs
      Each family offers models at different parameter sizes (7B, 13B, up to ~65–70B parameters). The number of parameters directly impacts the compute (FLOPs) needed per inference. For example, the 7B models of Llama and Mistral have around 7 billion parameters, which translates to roughly ~14 billion floating-point operations per generated token (the FLOPs for a forward pass are ≈ 2P, where P is the number of parameters in the model) [1]. A much larger 70B model like Llama-2–70B requires about 140 billion FLOPs per token — about 10x more compute than a 7B model for each output token. DeepSeek’s open models come in 7B variants and a larger 67B variant (analogous to Llama’s 65–70B range) [2]. Running the 67B DeepSeek model demands nearly the same compute class as a 70B Llama, i.e. on the order of 10^11 FLOPs per generated token.
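    To make the ≈ 2P rule of thumb concrete, here is a tiny Python sketch; the rule and the parameter counts come from the text above, and the helper function is just for illustration.

```python
# Back-of-the-envelope FLOPs per generated token, using the ~2 * P rule
# described above (P = number of parameters).
def flops_per_token(num_params: float) -> float:
    return 2.0 * num_params

for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(f"{name}: ~{flops_per_token(params):.1e} FLOPs per token")
# 7B -> ~1.4e+10, 13B -> ~2.6e+10, 70B -> ~1.4e+11
```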

    1.2. Typical inference hardware
    Smaller models (7B-13B) can run on a single modern GPU, whereas the largest models need multi-GPU or specialized hardware. In practice, a Llama-3–8B or Mistral 7B (legacy) model can be served on a consumer GPU with ~12–16GB of VRAM. For example, Mistral 7B (7.3B params) needs ~15GB of GPU memory to load at full precision [3]. Llama-2–13B (13B params) roughly doubles that requirement — around 24 GB VRAM is recommended. The larger models (Llama 65B/70B or DeepSeek 67B) are far more demanding: running Llama-2 70B in 16-bit precision requires at least two high-memory GPUs. To summarize:

    7B/8B models (Llama-2–7B, Llama-3.1–8B, Mistral-7B, DeepSeek-R1-Distill-Llama-8B): 1 GPU (≈15 GB VRAM) is sufficient for FP16 inference. These can even run on some laptop GPUs or modest cloud instances.
    13B models (Llama-2–13B): 1 high-end GPU (≈24 GB VRAM) needed. May require memory optimizations or multi-GPU if only 16 GB GPUs are available.
    65B–70B models (Llama-3.1–70B, DeepSeek-67B): 2–4 GPUs or specialized accelerators are needed. These models have ~130–140 GB of weights in FP16, so they do not fit on a single GPU. Multi-GPU inference or server-class accelerators (like Intel’s Gaudi accelerator) are used in practice.
    2. Memory requirements for inference and fine-tuning
    2.1. Base memory needs
    The raw memory required grows with model size. For inference, a rule of thumb is ~2 bytes per parameter for FP16 models (plus some overhead). So, a 7B model is roughly 14–16 GB in memory, and a 13B model ~26–30 GB in FP16. In practice, Llama-2 7B can occupy ~14 GB in half precision and easily fits on a 16 GB card. And as noted, 65B+ models exceed 130 GB, hence needing multiple devices.
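    As a quick sanity check of this rule of thumb, here is a small sketch; the ~10% overhead factor is an illustrative assumption, not a measured figure.

```python
# Rule-of-thumb inference memory: bytes per parameter (2 for FP16) times the
# parameter count, plus a ~10% cushion for activations and runtime buffers.
def inference_memory_gb(num_params: float, bytes_per_param: float = 2.0,
                        overhead: float = 1.10) -> float:
    return num_params * bytes_per_param * overhead / 1e9

for name, p in [("7B", 7e9), ("13B", 13e9), ("67B", 67e9), ("70B", 70e9)]:
    print(f"{name}: ~{inference_memory_gb(p):.0f} GB in FP16")
# 7B ~15 GB, 13B ~29 GB, 67B ~147 GB, 70B ~154 GB
```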

    2.2. Memory for fine-tuning
    Fine-tuning demands additional memory for optimizer states and gradients. A full fine-tune in FP16 requires about 2–3x the model size in memory, since gradients and optimizer moments are also stored in 16-bit or 32-bit precision. For example, fine-tuning a 13B model on a 24 GB GPU will likely run OOM (out-of-memory) without strategies like gradient checkpointing or low-rank adaptation. This is why techniques like LoRA/QLoRA [4][5] are popular — they freeze most weights and train a small number of extra parameters, drastically reducing memory usage. With QLoRA (4-bit quantization + low-rank adapters), it’s possible to fine-tune 7B and 13B models on a single GPU by cutting memory requirements to a fraction of a full fine-tune. Check the LoRA and QLoRA papers for more info on low-rank adaptation for fine-tuning [4][5].
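    For illustration, a minimal QLoRA-style setup with Hugging Face Transformers and PEFT might look like the sketch below; the model id, LoRA rank, and target modules are example choices, not something prescribed by the article or the papers.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"  # example; any 7B/13B causal LM works similarly

# Load the base model in 4-bit (NF4) to shrink the frozen weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach small trainable low-rank adapters; everything else stays frozen.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only (example)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```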

    2.3. Context length and runtime memory
    Another aspect of memory is the KV cache for the attention mechanism, which grows with the number of tokens in the context. Long prompts can bloat memory usage, as the model needs to store keys/values for every layer. Mistral 7B’s sliding window attention addresses this by having each layer attend only to a fixed-size window (e.g. 4096 tokens) [6]; stacking layers lets information propagate across an effective context of up to ~131k tokens with only a modest memory increase (the model doesn’t attend over the entire long context at once). Newer DeepSeek models introduced Multi-Head Latent Attention (MLA), a technique that compresses the attention key-value cache to reduce the memory and computation needed per token [7]. In short, Mistral and DeepSeek leverage architectural improvements (sliding windows, MLA, etc.) to lower the compute needed, meaning you get more performance per FLOP relative to the original Llama design.
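    To see why long contexts get expensive, here is a rough KV-cache size estimate. The layer/head shapes below are illustrative Llama-2-7B-like values; the point about grouped-query attention (fewer KV heads, as in Mistral 7B) is shown in the last line.

```python
# Approximate KV-cache size: 2 tensors (K and V) per layer, each of shape
# [batch, num_kv_heads, seq_len, head_dim], stored in FP16 (2 bytes/element).
def kv_cache_gb(num_layers, num_kv_heads, head_dim, seq_len,
                batch_size=1, bytes_per_elem=2):
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem) / 1e9

# Illustrative Llama-2-7B-like shape: 32 layers, 32 KV heads, head_dim 128.
print(f"4k context:   ~{kv_cache_gb(32, 32, 128, 4_096):.1f} GB")
print(f"32k context:  ~{kv_cache_gb(32, 32, 128, 32_768):.1f} GB")
# Grouped-query attention with 8 KV heads (as in Mistral 7B) cuts this by ~4x:
print(f"32k, 8 KV heads: ~{kv_cache_gb(32, 8, 128, 32_768):.1f} GB")
```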

    3. Latency/Throughput: understanding the trade-off
      When serving a model in production, there is a trade-off between latency and throughput:

    Latency is the time it takes to produce a result for a single input (how quickly a chatbot responds to one user’s question).
    Throughput is how many results (or tokens) you can produce per unit time when the system is fully utilized (total tokens per second your server can generate, or responses per second if batching requests).
    These two are often at odds. If you try to maximize throughput by processing many requests or a long batch simultaneously, each individual request may see higher latency (waiting for others in the batch). On the other hand, to get the absolute lowest latency for one user, you might run the model for that user alone, under-utilizing the hardware and therefore reducing total throughput.
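    A simple way to observe this trade-off yourself is to time generation at different batch sizes. The sketch below uses Hugging Face Transformers with an example 7B model id; any causal LM behaves similarly, and the exact numbers depend on your hardware.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"   # example model id
tok = AutoTokenizer.from_pretrained(model_id)
tok.pad_token = tok.eos_token          # needed for batched generation
tok.padding_side = "left"              # decoder-only models pad on the left
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def measure(batch_size, new_tokens=64):
    prompts = ["Explain the latency/throughput trade-off."] * batch_size
    inputs = tok(prompts, return_tensors="pt", padding=True).to(model.device)
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
    elapsed = time.perf_counter() - start
    print(f"batch={batch_size}: latency={elapsed:.2f}s, "
          f"throughput={batch_size * new_tokens / elapsed:.1f} tok/s")

measure(1)   # lowest latency for a single request
measure(8)   # higher total throughput, each request waits on the batch
```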

    3.1. Why it matters for different use cases
    For interactive applications like chatbots, latency is king, as users expect a prompt response. A difference between 0.5 seconds and 2 seconds is noticeable. So, you would run the model in a mode that favors quick single-stream generation. For large-scale batch processing (translating a million documents or analyzing a large dataset), throughput (tokens processed per second) is more important than the realtime latency of any single item. In those cases, feeding the model with as large a batch as possible (or parallel streams) to keep the GPUs 100% busy will give the fastest overall job completion, even if any given document waits a bit in a queue. Smaller models (7B, 13B) have lower per-token latency than 70B models. For example, on the same GPU, a 7B model can generate dozens of tokens per second, whereas a 70B might generate only a few tokens per second because of the heavier computation in each step.

    3.2. Latency/Throughput and Use-Case Tuning
    In production deployments, the system is often configured depending on the use case. For chatbots or interactive agents, you’d run with no (or minimal) batching, prioritizing each request’s speed. For non-real-time batch jobs (like nightly data processing), you might batch dozens of inputs together to fully utilize the hardware. Modern inference frameworks even allow dynamic batching — automatically grouping incoming requests within a short time window to improve GPU utilization (boosting throughput) without adding too much delay. This can give a middle ground where latency increases slightly in exchange for a big jump in throughput.
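    As an example of the serving-side view, an engine like vLLM batches concurrent requests on the fly. A minimal sketch follows; the model id and prompts are examples.

```python
from vllm import LLM, SamplingParams

# vLLM groups incoming requests automatically (continuous batching), so a
# list of prompts is served at high GPU utilization without extra code.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", dtype="float16")
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Summarize the latency/throughput trade-off in one sentence.",
    "List three production concerns when deploying an open LLM.",
    "Explain what a KV cache is.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```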

    To summarize, chat-style and interactive applications benefit from low latency, while large-scale automated tasks favor high throughput. The models themselves don’t change, but how you run them does. Smaller Mistral and Llama models will be faster per request than a huge DeepSeek model, but if you need maximum accuracy and can tolerate some delay (or use more hardware to parallelize), the larger model may be worth the trade-off.

    4. Production deployment
      Bringing these models to production involves considerations like software support, optimization (quantization), and serving infrastructure. The good news is that Llama, Mistral, and DeepSeek models are all compatible with popular open-source tooling, and each has an active community.

    4.1. Framework compatibility
    All three model families use a Llama-like decoder-only Transformer architecture, so they are supported by frameworks such as Hugging Face Transformers out of the box. For example, one can load the DeepSeek 7B or 67B model with AutoModelForCausalLM just like a Llama [8]. This means you can use common libraries (Transformers, Accelerate, etc.) to run inference or fine-tune these models with minimal changes. All three also provide model weights via the Hugging Face Hub or direct download.
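    A minimal loading example with Transformers, using the DeepSeek 7B base model from [2]; FP16 and device_map="auto" are typical choices, not requirements.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-llm-7b-base"   # see reference [2]
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Open-source LLMs are useful because",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```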

    Deployment Examples: Here are a few common patterns for deploying these models:

    Local GPU server: Many run these models on a single GPU box (or a few GPUs) using Hugging Face’s Text Generation Inference (TGI) server or an API wrapper. This is feasible for models up to 13B on a single GPU, or larger with multi-GPU.
    Cloud inference: All three models can be deployed on cloud GPU instances. For example, AWS Bedrock offers Mistral models, and IBM’s watsonx.ai made Mistral’s 8×7B mixture-of-experts model (Mixtral) available in early 2024 (leveraging IBM’s GPU/accelerator infrastructure). DeepSeek models, being open, can similarly be hosted on AWS, GCP, or Azure VMs with A100/H100 GPUs. One can containerize the model with TensorRT-LLM or vLLM for efficiency.
    CPU and edge: 7B models (especially with 4-bit quantization) are light enough to run on high-end CPUs. Projects like Llama.cpp have enabled running Llama 7B on laptops or phones by optimizing for AVX2/AVX512 instructions. Mistral 7B, for example, has been run on CPU at reasonable speeds due to its smaller size and optimizations, making it attractive for offline or edge use cases where GPU isn’t available.
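    As an illustration of the CPU/edge path above, here is a sketch with the llama-cpp-python bindings; the GGUF file path and quantization level are placeholders for whatever quantized model you have converted locally.

```python
from llama_cpp import Llama

# The GGUF path is a placeholder; any 4-bit 7B model converted to GGUF
# works the same way on a CPU-only machine.
llm = Llama(model_path="./mistral-7b-instruct.Q4_K_M.gguf",
            n_ctx=4096, n_threads=8)

result = llm("Q: Name one advantage of on-device inference. A:",
             max_tokens=48, stop=["Q:"])
print(result["choices"][0]["text"].strip())
```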
    4.2. Quantization and framework support summary
    All these models support 8-bit and 4-bit quantization in libraries like Hugging Face Transformers (via bitsandbytes or GPTQ integration). They also integrate with serving frameworks:

    Transformers + Accelerate: easy and flexible, good for prototypes.
    vLLM: highly optimized for throughput via continuous batching and PagedAttention (Mistral has provided vLLM deployment examples).
    TensorRT-LLM: leverages NVIDIA Tensor Cores for speed, supports Llama and similar architectures.
    Habana Gaudi: an accelerator alternative to GPU, with growing support for Llama-family models in the Optimum library (more on this in the Gaudi section).
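    For reference, loading a model with 8-bit (or 4-bit) weights via bitsandbytes is a small change in Transformers. A sketch, with an example model id:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"   # example model id
quant = BitsAndBytesConfig(load_in_8bit=True)      # or load_in_4bit=True

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant, device_map="auto"
)
# An ~8B model drops from ~16 GB in FP16 to roughly half that in INT8
# (and to ~5 GB with 4-bit), usually at a small quality cost.
```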
    In practice, deploying an open model might involve converting the weights (if needed), loading on specialized hardware, and ensuring you have good monitoring and guardrails (especially since these open models don’t come with OpenAI-style monitoring by default). That brings us to the next topic: safety considerations.

    5. Safety considerations
      Open-source models generally do not come with the robust safety reinforcement learning and content filters that proprietary models (like OpenAI’s ChatGPT or Anthropic’s Claude) have. If you plan to deploy these open models in a product, you must implement safety layers on top. This could include:

    A content filtering system: using libraries or smaller models to detect hate speech, self-harm, etc. in outputs and either refuse or post-process them.
    Prompt moderation and injection scanning: ensuring user inputs don’t contain hidden instructions.
    Rate limiting and use policies to prevent automated exploitation of the model for malicious ends.
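    As a minimal illustration of the first item, an output filter built around an off-the-shelf toxicity classifier might look like the sketch below. The classifier ("unitary/toxic-bert") and the threshold are example choices, not the article’s recommendation; label names depend on the classifier you pick.

```python
from transformers import pipeline

# Example classifier, used purely for illustration.
toxicity = pipeline("text-classification", model="unitary/toxic-bert")

def safe_generate(generate_fn, prompt, threshold=0.5):
    """Wrap any text-generation callable with a simple output filter."""
    text = generate_fn(prompt)
    top = toxicity(text[:512])[0]          # classify the (truncated) output
    if top["label"].lower() == "toxic" and top["score"] >= threshold:
        return "Sorry, I can't help with that."
    return text
```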
    The community is working on alignment techniques for open models. For example, there are projects fine-tuning Llama-2 on safety instructions or using GPT-4 to judge and filter outputs (creating “referee” models). But as of 2025, open-source LLMs still significantly trail closed models on safety. If you’re planning to deploy these models, be aware that out of the box they may produce content that would be disallowed elsewhere, and it’s your responsibility to address that as needed. The flip side is flexibility — some users specifically want models with minimal filtering (for research or creative freedom), and open models fill that niche. Just be cautious not to deploy them directly to end users without guardrails if there’s risk of misuse.

    6. Benchmark performance comparisons
      Despite these models being smaller and open, they have shown impressive performance on standard benchmarks. Let’s compare Llama-3, Mistral, and DeepSeek. Each represents the best current model of its family at roughly the 7–8B scale (fitting on a single high-end GPU). We focus on their performance across standard benchmarks for knowledge & reasoning (MMLU), mathematical problem solving (GSM8K), and coding ability (HumanEval). The table below summarizes the results:

    Model                          MMLU     GSM8K     HumanEval
    Llama-3–8B                     ~68%     ~80%      ~62%
    Mistral 7B                     ~60%     ~50%      ~26%
    DeepSeek-R1-Distill-Llama-8B   ~78%     ~85.5%    ~71%

    Table: Benchmark accuracy/pass rates of top open-source ~8B models on knowledge (MMLU), math (GSM8K), and coding (HumanEval). Higher is better. Each model’s score reflects accuracy (for MMLU, GSM8K) or pass@1 rate (for HumanEval). Despite their small size, these models achieve strong results, narrowing the gap to much larger models in certain areas.
    6.1. Llama-3–8B: General-Purpose Model
    Meta’s Llama-3–8B proves to be a well-rounded, general-purpose open model that delivers strong performance across reasoning, math, and coding, while remaining compact enough to run on a single GPU. It achieves ~68% on MMLU, ~80% on GSM8K, and ~62% on HumanEval, making it one of the most capable base models in its size class. It performs reliably across diverse tasks without being particularly specialized. It’s ideal for developers seeking a versatile, instruction-following LLM for chat, question answering, and lightweight coding without compromising performance or requiring multi-GPU setups.

    6.2. Mistral 7B — Efficient Foundation with Solid Fundamentals
    Mistral 7B was the first open model to truly challenge larger competitors, outperforming Llama-2–13B on most benchmarks because of its efficient architecture choices like grouped-query and sliding-window attention. It scores ~60% on MMLU and ~50% on GSM8K, with modest coding ability (~26% HumanEval), but stands out for its great performance-to-weight ratio. Optimized for speed and lower memory use, Mistral remains a strong foundational model for resource-constrained deployments or long-context applications. While newer models have surpassed its raw performance, it’s still a favorite for fast inference and extensibility.

    6.3. DeepSeek — 8B Distilled Model Optimized for Reasoning and Code
    DeepSeek’s distilled 8B model is the top performer among open-source models at this scale, especially in math and code. Scoring ~78% on MMLU, ~85.5% on GSM8K, and ~71% on HumanEval, it rivals or exceeds the performance of older 30B+ models in these domains. This is the result of a carefully engineered training pipeline involving reasoning-focused datasets, chain-of-thought prompting, and reinforcement learning. While less balanced than Llama 3, DeepSeek excels when the use-case demands high accuracy in complex reasoning or program synthesis. It’s a top-tier choice for applications where correctness trumps speed or generality.

    6.4. Performance vs Model Size
    Even with their small sizes, these ~8B parameter models deliver surprisingly high performance on challenging benchmarks. For context, proprietary models like GPT-4 still score higher (GPT-4 exceeds 85% on MMLU), but the gap has narrowed considerably. Llama-3–8B and DeepSeek-8B are punching above their weight (pun intended). Llama 3’s MMLU score in the high 60s was once the realm of 30–70B models, and DeepSeek’s ~85% on GSM8K math approaches the performance of much larger models. Plus, the fact that you can host these models on a single GPU is a testament to the fast progress in model design and training techniques in the field.

    In summary, each model has its distinct strengths:

    Llama-3–8B is the best general-purpose small LLM, with well-rounded abilities across knowledge, reasoning, and code.
    Mistral 7B offers efficient performance, maintaining a strong baseline in understanding and reasoning tasks given its tiny footprint.
    DeepSeek 8B (distilled) is highly specialized, pushing the state of the art in mathematical reasoning and coding for an 8B model.
    All three demonstrate that mid-2025’s open 8B-scale models can deliver impressive results, often comparable to or better than older 13B–30B models, while remaining lightweight and accessible.

    7. Intel Gaudi: Running LLMs on Gaudi Accelerators

    Intel’s Gaudi 3 AI accelerator is designed specifically for deep learning workloads.
    A final note for those considering infrastructure: all of these models can also be run on Intel’s Gaudi AI accelerators (like Gaudi2 and the latest Gaudi3) hosted on the Intel® Tiber™ AI Cloud. Gaudi accelerators offer an alternative to NVIDIA GPUs with competitive price-performance for training and inference of LLMs. In fact, Intel has optimized support for Llama-family models on Gaudi: for example, Gaudi2 was shown to run Llama-2 7B, 13B, and 70B with strong performance using the Optimum Habana library [9]. Some early results even showed Gaudi3 slightly outperforming NVIDIA’s H100 on LLM throughput in certain scenarios [10].

    The takeaway is that you can deploy Llama, Mistral, or DeepSeek on cloud instances with Gaudi accelerators (offered by AWS DL1 instances) often at a lower cost. Gaudi’s architecture (large onboard memory and high memory bandwidth) is well-suited to these models, and with the growing software stack, you might see cost savings and performance gains by using Gaudi for open-source models. All the models we discussed can be converted and run on Gaudi using libraries like Hugging Face Optimum with minimal code changes. This provides more flexibility and potentially higher throughput per dollar when scaling these models in production.
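    A rough sketch of what running one of these models on Gaudi can look like with Optimum Habana follows; it assumes the Habana SynapseAI and optimum-habana software stack is installed on the instance, and the model id is only an example.

```python
import torch
import habana_frameworks.torch.core as htcore      # registers the "hpu" device
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()   # patch Transformers with Gaudi-optimized code paths

model_id = "meta-llama/Llama-2-7b-hf"   # example model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = model.to("hpu")   # Gaudi devices are exposed to PyTorch as "hpu"

inputs = tokenizer("Gaudi accelerators are", return_tensors="pt").to("hpu")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```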

    References
    [1] Inference Characteristics of Llama | Cursor, The AI Code Editor
    [2] deepseek-ai/deepseek-llm-7b-base · Hugging Face
    [3] Mistral 7B: Recipes for Fine-tuning and Quantization on Your Computer | Benjamin Marie | TDS Archive | Medium
    [4] LoRA: Low-Rank Adaptation of Large Language Models
    [5] QLoRA: Efficient Finetuning of Quantized LLMs
    [6] Mistral documentation | Hugging Face
    [7] The DeepSeek Shock: A ‘Cost-Effective’ Language Model Challenging GPT
    [8] DeepSeek LLM: Let there be answers
    [9] newsroom.intel.com/artificial-intelligence/intel-gaudi-xeon-and-ai-pc-accelerate-meta-Llama-3-genai-workloads
    [10] Intel Gaudi 3 Accelerates AI at Scale on IBM Cloud | Signal65