
Quantization in Large Language Models (LLMs)

  • Writer: Nagesh Singh Chauhan
  • 9 min read

Demystifying the Compression of Large Language Models (LLMs)



Introduction


Large Language Models are powerful, but they are also huge, slow, and expensive. A single modern LLM can require tens to hundreds of gigabytes of memory, making deployment on edge devices, CPUs, or cost-sensitive production systems challenging.


Quantization is one of the most effective techniques to bridge this gap.


This blog breaks down what quantization is, why it works, how it’s done, and where it breaks—with intuition, math, and real-world tradeoffs.


Why Quantization Exists


LLMs are dominated by matrix multiplications:


Y=XW


Where:

  • X: activation matrix

  • W: weight matrix (billions of parameters)

  • Both are typically stored as FP16 or FP32


The Problem


Issue                  | Impact
FP16/FP32 precision    | High memory footprint
Large memory bandwidth | Slow inference
GPU dependency         | High cost
Cache misses           | Latency spikes
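
To put rough numbers on this, the back-of-the-envelope calculation below uses a hypothetical 7-billion-parameter model purely for illustration:

```python
# Weight memory for a hypothetical 7B-parameter model at different precisions.
params = 7_000_000_000
bytes_per_param = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    gib = params * nbytes / (1024 ** 3)
    print(f"{fmt}: ~{gib:.1f} GiB of weights")
# FP32: ~26.1 GiB, FP16: ~13.0 GiB, INT8: ~6.5 GiB, INT4: ~3.3 GiB
```

And this counts only the weights; activations and the KV cache add more on top.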

What is Quantization in LLMs?


Quantization is a model compression and optimization technique that reduces the numerical precision of weights and activation values in a trained large language model.


Instead of storing and computing values in high-precision floating-point formats (like FP32 or FP16), quantization represents them in lower-precision formats (e.g., INT8, INT4).


Converting from a data type that can hold more information to one that holds less


This process significantly shrinks memory footprint, lowers computational cost, and enables deployment on resource-constrained hardware – often with only a small trade-off in accuracy.



A great analogy for understanding quantization is image compression. Compressing an image involves reducing its size by removing some of the information, i.e., bits of data, from it. Now, while decreasing the size of an image typically reduces its quality (to acceptable levels), it also means more images can be saved on a given device while requiring less time and bandwidth to transfer or display to a user. In a similar way, quantizing an LLM increases its portability and the number of ways it can be deployed – albeit with an acceptable sacrifice to detail or precision. 


Quantization is an important process within machine learning because reducing the number of bits required for each of a model’s weights adds up to a significant decrease in its overall size.  Consequently, quantization produces LLMs that consume less memory, require less storage space, are more energy-efficient, and are capable of faster inference. This all adds up to the critical advantage of enabling LLMs to run on a wider range of devices, including single GPUs, instead of expensive hardware featuring multiple GPUs, and, in some cases, even CPUs. 


In the context of LLMs, quantization maps continuous 32-bit floating-point parameters into discrete values with fewer bits. For example, converting FP32 values down to 8-bit integers reduces memory usage by ~4x.



LLM Quantization Types


Quantization isn’t one-size-fits-all. The major types include:


a) Post-Training Quantization (PTQ)


Quantization applied after training is complete. This is usually fast and doesn’t require retraining or access to the original training pipeline, but may sometimes slightly reduce accuracy.


b) Quantization-Aware Training (QAT)


Incorporates quantization effects during the training process so the model learns to be robust to reduced precision. This often yields better performance than PTQ at the cost of longer training times.


c) Static vs Dynamic Quantization


  • Static Quantization: Both weights and activations are quantized before inference, typically with calibration datasets.

  • Dynamic Quantization: Weights are quantized ahead of time, while activations are quantized dynamically at inference time. Dynamic approaches can offer flexibility with modest speed gains (see the sketch below).
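
As a toy illustration of the dynamic flavor, PyTorch offers a one-call dynamic quantizer for linear layers; the layer sizes below are made up and this is a minimal sketch rather than a full LLM pipeline:

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer feed-forward block (sizes are illustrative).
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

# Dynamic quantization: Linear weights become INT8 ahead of time,
# activations are quantized on the fly at inference.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 1024)
print(quantized(x).shape)   # same interface, smaller weights
```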


d) Mixed Precision and Layer-Wise Quantization


Some approaches mix bit-widths within a model — for example, keeping sensitive layers at FP16 while quantizing others to INT8 — to balance performance and efficiency.


How Does LLM Quantization Work?


At a high level, quantization maps continuous numeric values to a limited set of discrete representations. This is typically achieved with a scale and zero-point that transform high-precision values into low-precision integers and back again.



This transformation retains the relative semantics of the value but stores it using fewer bits. The scale determines the quantized resolution and zero-point aligns the integer range with the original data range.


The more bits you retain, the closer the quantized data is to the original floating-point values, but with higher memory cost. Fewer bits (e.g., 4-bit) save more space but introduce a larger quantization error.
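
A minimal NumPy sketch of this scale-and-zero-point round trip; the 8-bit width and the random weight tile are purely illustrative:

```python
import numpy as np

def quantize(x, num_bits=8):
    """Affine quantization: map floats to unsigned integers via a scale and zero-point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)       # float step between integer levels
    zero_point = int(round(qmin - x.min() / scale))   # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map integers back to approximate float values."""
    return scale * (q.astype(np.float32) - zero_point)

weights = np.random.randn(4, 4).astype(np.float32)   # stand-in for a weight tile
q, scale, zp = quantize(weights)
error = np.abs(weights - dequantize(q, scale, zp)).max()
print(f"max abs reconstruction error: {error:.4f}")   # bounded by roughly scale / 2
```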


The Math Behind Quantization


Uniform Affine Quantization


Most LLM quantization uses affine quantization:


q = round(x / s) + z        (quantize)
x ≈ s · (q − z)             (dequantize)


Where:

  • x: original float value

  • q: quantized integer

  • s: scale factor

  • z: zero point
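
For example, with illustrative numbers s = 0.02 and z = 0, the weight x = 0.517 quantizes to q = round(0.517 / 0.02) = 26 and dequantizes back to s · (q − z) = 0.52, an error of 0.003. The rounding error is at most half the scale (here 0.01), which is why smaller scales, and therefore more bits or smaller groups, mean higher fidelity.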


LLM Quantization Techniques


Here are some common quantization formats and methods used in practice:


a) INT8 / INT4 Quantization


Converts parameters into 8-bit or 4-bit integers. INT8 is widely supported on hardware and provides a good balance of speed, memory, and accuracy; INT4 is more aggressive with larger compression but needs careful handling.


b) GPTQ


A post-training quantization method that minimizes layer-wise reconstruction error using approximate second-order (Hessian-based) information, allowing reliable 4-bit weight quantization.


c) AWQ (Activation-Aware Weight Quantization)


Focuses on maintaining accurate activations while quantizing weights, which is crucial for transformer-like models.


d) SmoothQuant


Redistributes scaling between weights and activations to make quantization friendlier and preserve accuracy, even for heavy models.


e) Mixed Precision and Layer Customization


Certain layers (like layer norms, embeddings or attention heads) may be kept in higher precision (e.g., FP16) to avoid critical loss while the rest of the model is aggressively quantized.
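
A rough sketch of such a layer-wise policy; the module types and string labels are illustrative, and real toolkits express the same idea through their own configuration options (for example, lists of modules to skip):

```python
import torch.nn as nn

def precision_plan(model: nn.Module):
    """Assign a target precision per module: keep sensitive layers in FP16,
    mark the bulk Linear layers for INT8 quantization."""
    plan = {}
    for name, module in model.named_modules():
        if isinstance(module, (nn.LayerNorm, nn.Embedding)):
            plan[name] = "fp16"   # normalization and embeddings stay high precision
        elif isinstance(module, nn.Linear):
            plan[name] = "int8"   # most parameters live here; quantize them
    return plan
```

A downstream quantizer would then consume this plan when converting each module.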


Popular Quantization Formats


Quantization formats define how many bits are used to represent weights and activations and how values are distributed within those bits. In LLMs, the choice of format directly impacts memory, latency, hardware compatibility, and model quality.


FP16 (Half Precision)


  • 16-bit floating point

  • Baseline format for LLM training and inference

  • High numerical stability and accuracy

  • Large memory footprint

  • Widely supported on GPUs


INT8


  • 8-bit integer quantization

  • Most common production format

  • Typically weight-only or weight + activation quantization

  • ~2× memory reduction vs FP16

  • Minimal accuracy loss

  • Requires calibration for best results


INT4


  • 4-bit integer quantization

  • Aggressive compression for cost-sensitive inference

  • Used with group-wise or per-channel scaling

  • ~4× memory reduction vs FP16

  • Slight degradation on reasoning-heavy tasks

  • Requires careful layer handling


NF4 (NormalFloat 4)


  • Non-uniform 4-bit format optimized for LLM weights

  • Quantization bins follow a normal distribution

  • Commonly used with QLoRA-style fine-tuning

  • Better accuracy than standard INT4

  • Specialized tooling required


FP8


  • 8-bit floating point (E4M3 / E5M2 variants)

  • Supported on modern accelerators (e.g., NVIDIA Hopper)

  • Increasingly used for both training and inference

  • High throughput on compatible hardware

  • Limited availability outside latest GPUs


Quick Comparison


Format | Bits | Memory Saving | Accuracy | Common Use
FP16   | 16   | 1× (baseline) | ⭐⭐⭐⭐⭐    | Training / baseline
INT8   | 8    | ~2×           | ⭐⭐⭐⭐☆    | Production inference
INT4   | 4    | ~4×           | ⭐⭐⭐☆     | Low-cost deployment
NF4    | 4    | ~4×           | ⭐⭐⭐⭐     | LLM fine-tuning
FP8    | 8    | ~2×           | ⭐⭐⭐⭐☆    | High-end GPUs


Rule of thumb:

  • Start with INT8 for safe production use

  • Use INT4 / NF4 when memory or cost dominates

  • Consider FP8 only if your hardware fully supports it


Advantages of LLM Quantization


Quantization offers several core benefits:


✔ Smaller Model Size


By using fewer bits per parameter (e.g., 4–8 bits instead of 16–32 bits), model size drastically decreases, enabling deployment in constrained environments.


✔ Faster Inference


Lower-precision arithmetic is typically faster on specialized hardware, leading to quicker forward passes.


✔ Lower Memory Bandwidth


Smaller data means fewer memory fetches, which can be a critical bottleneck in large model inference.


✔ Cost-Effective Deployment


Reduced compute and memory results in lower cloud costs and the ability to serve more instances concurrently.


Why LLMs Are Surprisingly Quantization-Friendly


LLMs are surprisingly quantization-friendly due to a combination of statistical redundancy and architectural robustness. Modern LLMs are massively overparameterized, meaning they contain far more weights than strictly required to model language. This excess capacity creates redundancy, allowing small numerical perturbations introduced by low-bit quantization (INT8, INT4) to behave like mild noise rather than destructive errors. Because semantic information is distributed across many neurons and attention heads, individual weight precision is far less critical than the overall relational structure of the representations.


In addition, the core architecture of LLMs actively stabilizes quantization noise as signals flow through the network. Several design choices make low-precision arithmetic viable without major accuracy loss:


  • Overparameterization: Redundant weights absorb quantization error without collapsing representations

  • LayerNorm: Normalizes activations, reducing error accumulation across layers

  • Residual connections: Preserve original signal paths, limiting distortion

  • Attention mechanisms: Depend more on relative comparisons than absolute numeric precision

  • Selective precision: Keeping sensitive layers (embeddings, output projections) in higher precision preserves quality


Together, these properties explain why LLMs can remain fluent, coherent, and capable even when aggressively quantized—quantization works not because LLMs are numerically exact systems, but because they are statistically robust ones.


Practical Steps to Follow for Quantizing an LLM Model


Here’s a practical workflow many engineers follow:


1. Choose Bit-Width


Determine the target precision — e.g., INT8, INT6, or INT4 — based on accuracy requirements and deployment environment.


2. Select Quantization Strategy


Decide between PTQ and QAT:

  • PTQ for fast, no-retraining deployment,

  • QAT if you need minimal accuracy loss.


3. Calibration


Provide representative data to calibrate activation ranges and scales — especially crucial for static quantization.


4. Apply Quantization Tools


Use frameworks like Hugging Face’s transformers, bitsandbytes, or TensorRT for quantization.
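
For example, weight-only 4-bit (NF4) loading with Hugging Face transformers plus bitsandbytes takes only a few lines; the model name below is a placeholder and the config assumes a recent bitsandbytes install:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/your-7b-model"   # placeholder; any causal LM checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # store weights in 4 bits
    bnb_4bit_quant_type="nf4",               # NormalFloat4 bins instead of uniform INT4
    bnb_4bit_compute_dtype=torch.bfloat16,   # dequantize to bf16 for the matmuls
    bnb_4bit_use_double_quant=True,          # also quantize the quantization constants
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
```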


5. Validation & Tuning


Benchmark the quantized model on real inference workloads and compare accuracy against the original. Adjust bit-widths or use mixed precision if needed.
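
A minimal sanity check might compare average per-token loss on a few representative texts, assuming the original and quantized models (here called fp16_model and quant_model) are already loaded:

```python
import torch

@torch.no_grad()
def avg_loss(model, tokenizer, texts):
    """Average per-token cross-entropy on a few representative texts (lower is better)."""
    total = 0.0
    for text in texts:
        batch = tokenizer(text, return_tensors="pt").to(model.device)
        total += model(**batch, labels=batch["input_ids"]).loss.item()
    return total / len(texts)

# baseline  = avg_loss(fp16_model, tokenizer, eval_texts)
# quantized = avg_loss(quant_model, tokenizer, eval_texts)
# print(f"loss drift: {quantized - baseline:+.4f}")
```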


Understanding 4-Bit Quantization (INT4 & NF4) in Large Language Models


4-bit quantization is one of the most aggressive and impactful optimizations used in modern LLM deployment. It reduces model memory by ~4× compared to FP16, often with surprisingly small quality degradation. But achieving this safely requires careful mathematical and architectural choices.


This section explains why 4-bit works, how it’s implemented, and where it can fail.


1. What Does “4-Bit” Actually Mean?


A 4-bit number can represent only 2⁴ = 16 distinct values.

That’s it.


Compare this with:


  • FP16 → ~65,000 representable values

  • FP32 → roughly 4 billion


So the core challenge is:

How do you squeeze billions of floating-point weights into just 16 bins without breaking the model?

2. Why Naive INT4 Fails


If you uniformly quantize weights into 16 equal bins, you quickly run into problems:


  • Most LLM weights are clustered near zero

  • Uniform bins waste resolution on rarely used large values

  • Small but important weights collapse to zero


Result:


  • Loss of expressivity

  • Degraded reasoning

  • Instability in early layers


This is why plain INT4 almost never works out-of-the-box.


3. The Statistical Structure of LLM Weights


Empirically, LLM weights:


  • Are approximately normally distributed

  • Have heavy mass near zero

  • Contain a few large outliers



This observation is the foundation of modern 4-bit methods.


4. Scale + Zero-Point Are Not Enough


At 8 bits, the affine scheme above, q = round(x / s) + z, works well. At 4 bits:

  • Quantization noise is too large

  • Error is no longer “locally linear”

  • Outliers dominate the scale


So we need more structure.


5. Group-Wise Quantization (Key Breakthrough)


Instead of one scale per tensor:


  • Split weights into groups (e.g., 32 or 64 channels)

  • Each group gets its own scale

  • Limits outlier damage


Why This Works


  • Smaller dynamic range per group

  • Higher effective precision

  • Local error containment


This alone enables usable INT4 inference.
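
A compact NumPy sketch of group-wise symmetric INT4 quantization; the group size of 64 and the symmetric (zero-point-free) variant are illustrative choices, not tied to a specific library:

```python
import numpy as np

def groupwise_int4(w, group_size=64):
    """Quantize a 1-D weight slice in groups, one scale per group (symmetric range [-8, 7])."""
    w = w.reshape(-1, group_size)                        # each row is one group
    scales = np.abs(w).max(axis=1, keepdims=True) / 7    # per-group scale from the group max
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales                                     # stored as int8 here for clarity;
                                                         # real kernels pack two 4-bit values per byte

def dequant(q, scales):
    return q.astype(np.float32) * scales

w = np.random.randn(4096).astype(np.float32)
q, s = groupwise_int4(w)
print("max abs error:", np.abs(w - dequant(q, s).reshape(-1)).max())
```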


6. NF4: NormalFloat 4 (Why It Matters)


NF4 is a non-uniform 4-bit quantization scheme designed specifically for LLMs.


Core Idea


Instead of evenly spaced bins:

  • Place bins according to a normal distribution

  • More bins near zero

  • Fewer bins in the tails


Result

  • High resolution where weights actually live

  • Lower error for small but important weights


This is why NF4 consistently outperforms uniform INT4.
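
To make the non-uniform idea concrete, the sketch below places 16 levels at quantiles of a standard normal and snaps each weight to its nearest level. These are not the exact NF4 constants from the QLoRA paper, only the same underlying principle:

```python
import numpy as np
from statistics import NormalDist

# 16 levels at evenly spaced quantiles of N(0, 1): dense near zero, sparse in the tails.
probs = (np.arange(16) + 0.5) / 16
levels = np.array([NormalDist().inv_cdf(p) for p in probs])

def nf4_like_quantize(w):
    """Scale weights into the levels' range, then snap each one to the nearest level index."""
    scale = np.abs(w).max() / np.abs(levels).max()
    idx = np.abs(w[:, None] / scale - levels[None, :]).argmin(axis=1)  # nearest-level lookup
    return idx.astype(np.uint8), scale

def dequantize(idx, scale):
    return levels[idx] * scale

w = np.random.randn(1024).astype(np.float32)
idx, s = nf4_like_quantize(w)
print("max abs error:", np.abs(w - dequantize(idx, s)).max())
```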


7. Dequantization at Runtime (Hidden Cost)


Weights are stored in 4 bits, but computation typically happens in FP16: each weight is dequantized on the fly, x ≈ s · (q − z), before the matrix multiply Y = XW.


This introduces:


  • Dequantization overhead

  • Scale lookups

  • Potential memory stalls


Well-optimized kernels (e.g., fused dequant + GEMM) are essential.


8. Error Propagation in 4-Bit Models


4-bit quantization introduces structured noise, not random noise.


Error accumulates most in:


  • Early embedding layers

  • Output projection layers

  • Attention score computation


Best Practice


Keep these layers in FP16 or INT8, and quantize the rest aggressively.


This hybrid approach gives the best tradeoff.


9. Why 4-Bit Still Works in Practice


Despite extreme compression, 4-bit works because:


  • LLMs are overparameterized

  • LayerNorm stabilizes distributions

  • Attention relies on relative ranking

  • Residuals preserve signal paths


4-bit errors get averaged, normalized, and dampened across layers.


10. 4-Bit Quantization vs Fine-Tuning (QLoRA)


4-bit quantization enables fine-tuning of large models on consumer GPUs.


QLoRA approach:


  • Base model in NF4

  • LoRA adapters in FP16

  • Backprop through dequantized weights


This is how multi-billion-parameter models fit on a single GPU.
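
A sketch of that recipe using Hugging Face peft on top of the 4-bit loading shown earlier; the checkpoint name, LoRA hyperparameters, and target module names are placeholders that depend on the actual architecture:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "your-org/your-7b-model"   # placeholder checkpoint

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)

model = prepare_model_for_kbit_training(model)   # gradient checkpointing, norm casting, etc.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,      # illustrative hyperparameters
    target_modules=["q_proj", "v_proj"],         # adapter placement depends on the architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)              # frozen NF4 base + trainable bf16 adapters
model.print_trainable_parameters()
```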


11. When You Should (and Shouldn’t) Use 4-Bit


Use 4-Bit When:


  • Memory is the bottleneck

  • Serving many replicas

  • Fine-tuning large models cheaply


Avoid 4-Bit When:


  • Exact numerical reasoning matters

  • Model is small (<1B params)

  • Hardware lacks optimized kernels


4-bit quantization works because LLMs don’t need precise numbers—they need stable relative structure.

When done correctly, 4-bit quantization is not a hack—it’s a principled compression strategy aligned with the statistics of language models.


Conclusion


Quantization has evolved from a low-level optimization trick into a core enabler of scalable LLM deployment. As models grow in size and capability, the bottleneck is no longer just intelligence—it is memory, latency, and cost. By reducing numerical precision while preserving the statistical structure of model representations, quantization makes it possible to run powerful LLMs efficiently across GPUs, CPUs, and even edge devices without fundamentally compromising their behavior.


The key insight is that LLMs do not rely on exact arithmetic; they rely on robust relative signals shaped by overparameterization, normalization, and attention. When applied thoughtfully—using per-channel or group-wise scaling, modern formats like INT8, NF4, or FP8, and selective high-precision layers—quantization delivers dramatic gains with minimal tradeoffs. In practice, production-grade LLM systems are no longer “FP16 models with optimizations,” but quantized-first systems by design. Understanding quantization is therefore not optional for practitioners—it is essential infrastructure knowledge for building efficient, real-world AI systems.
