
Quantization in Large Language Models (LLMs)

  • Writer: Nagesh Singh Chauhan
  • 9 min read

Demystifying the Compression of Large Language Models (LLMs)



Introduction


Large Language Models are powerful, but they are also huge, slow, and expensive. A single modern LLM can require tens to hundreds of gigabytes of memory, making deployment on edge devices, CPUs, or cost-sensitive production systems challenging.


Quantization is one of the most effective techniques to bridge this gap.


This blog breaks down what quantization is, why it works, how it’s done, and where it breaks—with intuition, math, and real-world tradeoffs.


Why Quantization Exists


LLMs are dominated by matrix multiplications:


Y=XW


Where:

  • X: activation matrix

  • W: weight matrix (billions of parameters)

  • Both are typically stored as FP16 or FP32


The Problem


Issue                  | Impact
FP16/FP32 precision    | High memory footprint
Large memory bandwidth | Slow inference
GPU dependency         | High cost
Cache misses           | Latency spikes
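
To put rough numbers on this, the back-of-the-envelope calculation below uses a hypothetical 7-billion-parameter model purely for illustration:

```python
# Weight memory for a hypothetical 7B-parameter model at different precisions.
params = 7_000_000_000
bytes_per_param = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    gib = params * nbytes / (1024 ** 3)
    print(f"{fmt}: ~{gib:.1f} GiB of weights")
# FP32: ~26.1 GiB, FP16: ~13.0 GiB, INT8: ~6.5 GiB, INT4: ~3.3 GiB
```

And this counts only the weights; activations and the KV cache add more on top.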

What is Quantization in LLMs?


Quantization is a model compression and optimization technique that reduces the numerical precision of weights and activation values in a trained large language model.


Instead of storing and computing values in high-precision floating-point formats (like FP32 or FP16), quantization represents them in lower-precision formats (e.g., INT8, INT4).


Converting from a data type that can hold more information to one that holds less


This process significantly shrinks memory footprint, lowers computational cost, and enables deployment on resource-constrained hardware – often with only a small trade-off in accuracy.



A great analogy for understanding quantization is image compression. Compressing an image involves reducing its size by removing some of the information, i.e., bits of data, from it. Now, while decreasing the size of an image typically reduces its quality (to acceptable levels), it also means more images can be saved on a given device while requiring less time and bandwidth to transfer or display to a user. In a similar way, quantizing an LLM increases its portability and the number of ways it can be deployed – albeit with an acceptable sacrifice to detail or precision. 


Quantization is an important process within machine learning because reducing the number of bits required for each of a model’s weights adds up to a significant decrease in its overall size.  Consequently, quantization produces LLMs that consume less memory, require less storage space, are more energy-efficient, and are capable of faster inference. This all adds up to the critical advantage of enabling LLMs to run on a wider range of devices, including single GPUs, instead of expensive hardware featuring multiple GPUs, and, in some cases, even CPUs. 


In the context of LLMs, quantization maps continuous 32-bit floating-point parameters into discrete values with fewer bits. For example, converting FP32 values down to 8-bit integers reduces memory usage by ~4x.



LLM Quantization Types


Quantization isn’t one-size-fits-all. The major types include:


a) Post-Training Quantization (PTQ)


Quantization applied after training is complete. This is usually fast and doesn’t require retraining or access to the original training pipeline, but may sometimes slightly reduce accuracy.


b) Quantization-Aware Training (QAT)


Incorporates quantization effects during the training process so the model learns to be robust to reduced precision. This often yields better performance than PTQ at the cost of longer training times.


c) Static vs Dynamic Quantization


  • Static Quantization: Both weights and activations are quantized before inference, typically with calibration datasets.

  • Dynamic Quantization: Weights are quantized ahead of time, while activations are quantized dynamically at inference time. Dynamic approaches can offer flexibility with modest speed gains (see the sketch below).
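
As a toy illustration of the dynamic flavor, PyTorch offers a one-call dynamic quantizer for linear layers; the layer sizes below are made up and this is a minimal sketch rather than a full LLM pipeline:

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer feed-forward block (sizes are illustrative).
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

# Dynamic quantization: Linear weights become INT8 ahead of time,
# activations are quantized on the fly at inference.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 1024)
print(quantized(x).shape)   # same interface, smaller weights
```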


d) Mixed Precision and Layer-Wise Quantization


Some approaches mix bit-widths within a model — for example, keeping sensitive layers at FP16 while quantizing others to INT8 — to balance performance and efficiency.


How Does LLM Quantization Work?


At a high level, quantization maps continuous numeric values to a limited set of discrete representations. This is typically achieved with a scale and zero-point that transform high-precision values into low-precision integers and back again.



This transformation retains the relative semantics of the value but stores it using fewer bits. The scale determines the quantized resolution and zero-point aligns the integer range with the original data range.


The more bits you retain, the closer the quantized data is to the original floating-point values, but with higher memory cost. Fewer bits (e.g., 4-bit) save more space but introduce a larger quantization error.
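
A minimal NumPy sketch of this scale-and-zero-point round trip; the 8-bit width and the random weight tile are purely illustrative:

```python
import numpy as np

def quantize(x, num_bits=8):
    """Affine quantization: map floats to unsigned integers via a scale and zero-point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)       # float step between integer levels
    zero_point = int(round(qmin - x.min() / scale))   # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map integers back to approximate float values."""
    return scale * (q.astype(np.float32) - zero_point)

weights = np.random.randn(4, 4).astype(np.float32)   # stand-in for a weight tile
q, scale, zp = quantize(weights)
error = np.abs(weights - dequantize(q, scale, zp)).max()
print(f"max abs reconstruction error: {error:.4f}")   # bounded by roughly scale / 2
```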


The Math Behind Quantization


Uniform Affine Quantization


Most LLM quantization uses affine quantization:


q = round(x / s) + z        (quantize)
x ≈ s · (q − z)             (dequantize)


Where:

  • x: original float value

  • q: quantized integer

  • s: scale factor

  • z: zero point
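
For example, with illustrative numbers s = 0.02 and z = 0, the weight x = 0.517 quantizes to q = round(0.517 / 0.02) = 26 and dequantizes back to s · (q − z) = 0.52, an error of 0.003. The rounding error is at most half the scale (here 0.01), which is why smaller scales, and therefore more bits or smaller groups, mean higher fidelity.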


LLM Quantization Techniques


Here are some common quantization formats and methods used in practice:


a) INT8 / INT4 Quantization


Converts parameters into 8-bit or 4-bit integers. INT8 is widely supported on hardware and provides a good balance of speed, memory, and accuracy; INT4 is more aggressive with larger compression but needs careful handling.


b) GPTQ


A post-training quantization method that minimizes layer-wise reconstruction error using approximate second-order (Hessian-based) information, allowing reliable 4-bit weight quantization.


c) AWQ (Activation-Aware Weight Quantization)


Focuses on maintaining accurate activations while quantizing weights, which is crucial for transformer-like models.


d) SmoothQuant


Redistributes scaling between weights and activations to make quantization friendlier and preserve accuracy, even for heavy models.


e) Mixed Precision and Layer Customization


Certain layers (like layer norms, embeddings or attention heads) may be kept in higher precision (e.g., FP16) to avoid critical loss while the rest of the model is aggressively quantized.
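
A rough sketch of such a layer-wise policy; the module types and string labels are illustrative, and real toolkits express the same idea through their own configuration options (for example, lists of modules to skip):

```python
import torch.nn as nn

def precision_plan(model: nn.Module):
    """Assign a target precision per module: keep sensitive layers in FP16,
    mark the bulk Linear layers for INT8 quantization."""
    plan = {}
    for name, module in model.named_modules():
        if isinstance(module, (nn.LayerNorm, nn.Embedding)):
            plan[name] = "fp16"   # normalization and embeddings stay high precision
        elif isinstance(module, nn.Linear):
            plan[name] = "int8"   # most parameters live here; quantize them
    return plan
```

A downstream quantizer would then consume this plan when converting each module.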


Popular Quantization Formats


Quantization formats define how many bits are used to represent weights and activations and how values are distributed within those bits. In LLMs, the choice of format directly impacts memory, latency, hardware compatibility, and model quality.


FP16 (Half Precision)


  • 16-bit floating point

  • Baseline format for LLM training and inference

  • High numerical stability and accuracy

  • Large memory footprint

  • Widely supported on GPUs


INT8


  • 8-bit integer quantization

  • Most common production format

  • Typically weight-only or weight + activation quantization

  • ~2× memory reduction vs FP16

  • Minimal accuracy loss

  • Requires calibration for best results


INT4


  • 4-bit integer quantization

  • Aggressive compression for cost-sensitive inference

  • Used with group-wise or per-channel scaling

  • ~4× memory reduction vs FP16

  • Slight degradation on reasoning-heavy tasks

  • Requires careful layer handling


NF4 (NormalFloat 4)


  • Non-uniform 4-bit format optimized for LLM weights

  • Quantization bins follow a normal distribution

  • Commonly used with QLoRA-style fine-tuning

  • Better accuracy than standard INT4

  • Specialized tooling required


FP8


  • 8-bit floating point (E4M3 / E5M2 variants)

  • Supported on modern accelerators (e.g., NVIDIA Hopper)

  • Increasingly used for both training and inference

  • High throughput on compatible hardware

  • Limited availability outside latest GPUs


Quick Comparison


Format | Bits | Memory Saving | Accuracy | Common Use
FP16   | 16   | 1× (baseline) | ⭐⭐⭐⭐⭐    | Training / baseline
INT8   | 8    | ~2×           | ⭐⭐⭐⭐☆    | Production inference
INT4   | 4    | ~4×           | ⭐⭐⭐☆     | Low-cost deployment
NF4    | 4    | ~4×           | ⭐⭐⭐⭐     | LLM fine-tuning
FP8    | 8    | ~2×           | ⭐⭐⭐⭐☆    | High-end GPUs


Rule of thumb:

  • Start with INT8 for safe production use

  • Use INT4 / NF4 when memory or cost dominates

  • Consider FP8 only if your hardware fully supports it


Advantages of LLM Quantization


Quantization offers several core benefits:


✔ Smaller Model Size


By using fewer bits per parameter (e.g., 4–8 bits instead of 16–32 bits), model size drastically decreases, enabling deployment in constrained environments.


✔ Faster Inference


Lower-precision arithmetic is typically faster on specialized hardware, leading to quicker forward passes.


✔ Lower Memory Bandwidth


Smaller data means fewer memory fetches, which can be a critical bottleneck in large model inference.


✔ Cost-Effective Deployment


Reduced compute and memory results in lower cloud costs and the ability to serve more instances concurrently.


Why LLMs Are Surprisingly Quantization-Friendly


LLMs are surprisingly quantization-friendly due to a combination of statistical redundancy and architectural robustness. Modern LLMs are massively overparameterized, meaning they contain far more weights than strictly required to model language. This excess capacity creates redundancy, allowing small numerical perturbations introduced by low-bit quantization (INT8, INT4) to behave like mild noise rather than destructive errors. Because semantic information is distributed across many neurons and attention heads, individual weight precision is far less critical than the overall relational structure of the representations.


In addition, the core architecture of LLMs actively stabilizes quantization noise as signals flow through the network. Several design choices make low-precision arithmetic viable without major accuracy loss:


  • Overparameterization: Redundant weights absorb quantization error without collapsing representations

  • LayerNorm: Normalizes activations, reducing error accumulation across layers

  • Residual connections: Preserve original signal paths, limiting distortion

  • Attention mechanisms: Depend more on relative comparisons than absolute numeric precision

  • Selective precision: Keeping sensitive layers (embeddings, output projections) in higher precision preserves quality


Together, these properties explain why LLMs can remain fluent, coherent, and capable even when aggressively quantized—quantization works not because LLMs are numerically exact systems, but because they are statistically robust ones.


Practical Steps to Follow for Quantizing an LLM Model


Here’s a practical workflow many engineers follow:


1. Choose Bit-Width


Determine the target precision — e.g., INT8, INT6, or INT4 — based on accuracy requirements and deployment environment.


2. Select Quantization Strategy


Decide between PTQ and QAT:

  • PTQ for fast, no-retraining deployment,

  • QAT if you need minimal accuracy loss.


3. Calibration


Provide representative data to calibrate activation ranges and scales — especially crucial for static quantization.


4. Apply Quantization Tools


Use frameworks like Hugging Face’s transformers, bitsandbytes, or TensorRT for quantization.
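
For example, weight-only 4-bit (NF4) loading with Hugging Face transformers plus bitsandbytes takes only a few lines; the model name below is a placeholder and the config assumes a recent bitsandbytes install:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/your-7b-model"   # placeholder; any causal LM checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # store weights in 4 bits
    bnb_4bit_quant_type="nf4",               # NormalFloat4 bins instead of uniform INT4
    bnb_4bit_compute_dtype=torch.bfloat16,   # dequantize to bf16 for the matmuls
    bnb_4bit_use_double_quant=True,          # also quantize the quantization constants
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
```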


5. Validation & Tuning


Benchmark the quantized model on real inference workloads and compare accuracy against the original. Adjust bit-widths or use mixed precision if needed.
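
A minimal sanity check might compare average per-token loss on a few representative texts, assuming the original and quantized models (here called fp16_model and quant_model) are already loaded:

```python
import torch

@torch.no_grad()
def avg_loss(model, tokenizer, texts):
    """Average per-token cross-entropy on a few representative texts (lower is better)."""
    total = 0.0
    for text in texts:
        batch = tokenizer(text, return_tensors="pt").to(model.device)
        total += model(**batch, labels=batch["input_ids"]).loss.item()
    return total / len(texts)

# baseline  = avg_loss(fp16_model, tokenizer, eval_texts)
# quantized = avg_loss(quant_model, tokenizer, eval_texts)
# print(f"loss drift: {quantized - baseline:+.4f}")
```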


Understanding 4-Bit Quantization (INT4 & NF4) in Large Language Models


4-bit quantization is one of the most aggressive and impactful optimizations used in modern LLM deployment. It reduces model memory by ~4× compared to FP16, often with surprisingly small quality degradation. But achieving this safely requires careful mathematical and architectural choices.


This section explains why 4-bit works, how it’s implemented, and where it can fail.


1. What Does “4-Bit” Actually Mean?


A 4-bit number can represent only 2⁴ = 16 distinct values.

That’s it.


Compare this with:


  • FP16 → ~65,000 representable values

  • FP32 → roughly 4 billion


So the core challenge is:

How do you squeeze billions of floating-point weights into just 16 bins without breaking the model?

2. Why Naive INT4 Fails


If you uniformly quantize weights into 16 equal bins, you quickly run into problems:


  • Most LLM weights are clustered near zero

  • Uniform bins waste resolution on rarely used large values

  • Small but important weights collapse to zero


Result:


  • Loss of expressivity

  • Degraded reasoning

  • Instability in early layers


This is why plain INT4 almost never works out-of-the-box.


3. The Statistical Structure of LLM Weights


Empirically, LLM weights:


  • Are approximately normally distributed

  • Have heavy mass near zero

  • Contain a few large outliers



This observation is the foundation of modern 4-bit methods.


4. Scale + Zero-Point Are Not Enough


At 8 bits, the affine scheme above, q = round(x / s) + z, works well. At 4 bits:

  • Quantization noise is too large

  • Error is no longer “locally linear”

  • Outliers dominate the scale


So we need more structure.


5. Group-Wise Quantization (Key Breakthrough)


Instead of one scale per tensor:


  • Split weights into groups (e.g., 32 or 64 channels)

  • Each group gets its own scale

  • Limits outlier damage


Why This Works


  • Smaller dynamic range per group

  • Higher effective precision

  • Local error containment


This alone enables usable INT4 inference.
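
A compact NumPy sketch of group-wise symmetric INT4 quantization; the group size of 64 and the symmetric (zero-point-free) variant are illustrative choices, not tied to a specific library:

```python
import numpy as np

def groupwise_int4(w, group_size=64):
    """Quantize a 1-D weight slice in groups, one scale per group (symmetric range [-8, 7])."""
    w = w.reshape(-1, group_size)                        # each row is one group
    scales = np.abs(w).max(axis=1, keepdims=True) / 7    # per-group scale from the group max
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales                                     # stored as int8 here for clarity;
                                                         # real kernels pack two 4-bit values per byte

def dequant(q, scales):
    return q.astype(np.float32) * scales

w = np.random.randn(4096).astype(np.float32)
q, s = groupwise_int4(w)
print("max abs error:", np.abs(w - dequant(q, s).reshape(-1)).max())
```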


6. NF4: NormalFloat 4 (Why It Matters)


NF4 is a non-uniform 4-bit quantization scheme designed specifically for LLMs.


Core Idea


Instead of evenly spaced bins:

  • Place bins according to a normal distribution

  • More bins near zero

  • Fewer bins in the tails


Result

  • High resolution where weights actually live

  • Lower error for small but important weights


This is why NF4 consistently outperforms uniform INT4.
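
To make the non-uniform idea concrete, the sketch below places 16 levels at quantiles of a standard normal and snaps each weight to its nearest level. These are not the exact NF4 constants from the QLoRA paper, only the same underlying principle:

```python
import numpy as np
from statistics import NormalDist

# 16 levels at evenly spaced quantiles of N(0, 1): dense near zero, sparse in the tails.
probs = (np.arange(16) + 0.5) / 16
levels = np.array([NormalDist().inv_cdf(p) for p in probs])

def nf4_like_quantize(w):
    """Scale weights into the levels' range, then snap each one to the nearest level index."""
    scale = np.abs(w).max() / np.abs(levels).max()
    idx = np.abs(w[:, None] / scale - levels[None, :]).argmin(axis=1)  # nearest-level lookup
    return idx.astype(np.uint8), scale

def dequantize(idx, scale):
    return levels[idx] * scale

w = np.random.randn(1024).astype(np.float32)
idx, s = nf4_like_quantize(w)
print("max abs error:", np.abs(w - dequantize(idx, s)).max())
```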


7. Dequantization at Runtime (Hidden Cost)


Weights are stored in 4 bits, but computation typically happens in FP16: each weight is dequantized on the fly, x ≈ s · (q − z), before the matrix multiply Y = XW.


This introduces:


  • Dequantization overhead

  • Scale lookups

  • Potential memory stalls


Well-optimized kernels (e.g., fused dequant + GEMM) are essential.


8. Error Propagation in 4-Bit Models


4-bit quantization introduces structured noise, not random noise.


Error accumulates most in:


  • Early embedding layers

  • Output projection layers

  • Attention score computation


Best Practice


Keep these layers in FP16 or INT8, and quantize the rest aggressively.


This hybrid approach gives the best tradeoff.


9. Why 4-Bit Still Works in Practice


Despite extreme compression, 4-bit works because:


  • LLMs are overparameterized

  • LayerNorm stabilizes distributions

  • Attention relies on relative ranking

  • Residuals preserve signal paths


4-bit errors get averaged, normalized, and dampened across layers.


10. 4-Bit Quantization vs Fine-Tuning (QLoRA)


4-bit quantization enables fine-tuning of large models on consumer GPUs.


QLoRA approach:


  • Base model in NF4

  • LoRA adapters in FP16

  • Backprop through dequantized weights


This is how multi-billion-parameter models fit on a single GPU.
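
A sketch of that recipe using Hugging Face peft on top of the 4-bit loading shown earlier; the checkpoint name, LoRA hyperparameters, and target module names are placeholders that depend on the actual architecture:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "your-org/your-7b-model"   # placeholder checkpoint

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)

model = prepare_model_for_kbit_training(model)   # gradient checkpointing, norm casting, etc.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,      # illustrative hyperparameters
    target_modules=["q_proj", "v_proj"],         # adapter placement depends on the architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)              # frozen NF4 base + trainable bf16 adapters
model.print_trainable_parameters()
```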


11. When You Should (and Shouldn’t) Use 4-Bit


Use 4-Bit When:


  • Memory is the bottleneck

  • Serving many replicas

  • Fine-tuning large models cheaply


Avoid 4-Bit When:


  • Exact numerical reasoning matters

  • Model is small (<1B params)

  • Hardware lacks optimized kernels


4-bit quantization works because LLMs don’t need precise numbers—they need stable relative structure.

When done correctly, 4-bit quantization is not a hack—it’s a principled compression strategy aligned with the statistics of language models.


Conclusion


Quantization has evolved from a low-level optimization trick into a core enabler of scalable LLM deployment. As models grow in size and capability, the bottleneck is no longer just intelligence—it is memory, latency, and cost. By reducing numerical precision while preserving the statistical structure of model representations, quantization makes it possible to run powerful LLMs efficiently across GPUs, CPUs, and even edge devices without fundamentally compromising their behavior.


The key insight is that LLMs do not rely on exact arithmetic; they rely on robust relative signals shaped by overparameterization, normalization, and attention. When applied thoughtfully—using per-channel or group-wise scaling, modern formats like INT8, NF4, or FP8, and selective high-precision layers—quantization delivers dramatic gains with minimal tradeoffs. In practice, production-grade LLM systems are no longer “FP16 models with optimizations,” but quantized-first systems by design. Understanding quantization is therefore not optional for practitioners—it is essential infrastructure knowledge for building efficient, real-world AI systems.
