Model Quantization#
Model quantization is the process of converting a model’s high-precision floating-point weights and activations (for example, FP32) into a lower-precision format. This technique is essential for deploying models efficiently on NPU hardware, as it significantly reduces memory footprint and accelerates computation.
This section covers BF16, INT8 and mixed precision quantization workflows for NPU deployment.
About AMD Quark#
AMD Quark is a quantization toolkit for PyTorch and ONNX models. The toolkit provides calibration and quantization tools to prepare models for NPU deployment.
Note
AMD Quark comes pre-installed in the AMD Vitis™ AI Docker container.
Important
Use AMD Quark to quantize models to INT8 before compilation with the Vitis AI compiler. BF16 quantization does not require AMD Quark (except for optional CPU validation); it is handled automatically by the compiler.
The complete documentation for AMD Quark can be found at: https://quark.docs.amd.com
Quantization Workflows#
BF16 (Brain Float 16) Models#
The Vitis AI compiler automatically converts floating-point (FP32) models to BF16 format during NPU compilation. Provide your FP32 model directly to the compiler; no preprocessing with AMD Quark is required.
For compiler invocation details, see Model Compilation.
For most deployments, no explicit BF16 quantization is needed. If you need to debug BF16 accuracy issues on NPU, see the BF16 Quantization for CPU/GPU Validation section at the end of this page.
INT8 Quantization using AMD Quark#
INT8 quantization requires the AMD Quark toolkit before compilation. Use AMD Quark with the VINT8 configuration to convert a floating-point ONNX or PyTorch model into a quantized INT8 model for compilation with the Vitis AI compiler. This is the required workflow for targeting INT8.
For models where pure INT8 quantization degrades accuracy, AMD Quark supports mixed precision, which keeps sensitive operations in higher precision (FP32/FP16) while quantizing the rest to INT8. See YOLO INT8 quantization (mixed precision) for a practical example, and Mixed Precision Compilation for detailed compilation patterns.
List of Supported Quark Configuration Options in Vitis AI#
Only the Quark configuration options listed in this section are supported in Vitis AI. Options marked Mandatory must be set for all INT8 quantization workflows with Vitis AI 6.2. Options marked Optional are used for advanced scenarios such as mixed precision or fast fine-tuning.
| Quark Configuration Option | Required | Supported Value | Usage |
|---|---|---|---|
| | Mandatory | False | |
| | Mandatory | True | |
| | Mandatory | True, False | |
| | Mandatory | True, False | |
| VINT8 | Mandatory | | |
| | Optional | True, False | |
| | Optional | List of tuples | |
| | Optional | List of node names | |
| | Optional | PowerOfTwoMethod.MinMSE | |
| | Optional | True | |
| | Optional | True, False | |
| | Optional | See usage column | |
For configuration parameter details and supported enum values, see the AMD Quark ONNX Quantization Configuration Guide.
For a technical breakdown of how the MSE calibration algorithm calculates clipping thresholds to minimize quantization noise, see the AMD Quark PyTorch Calibration Methods Overview.
For advanced quantization workflows, see the AMD Quark ONNX documentation.
INT8 Quantization Example#
The VINT8 configuration uses symmetric INT8 activation with power-of-two scales and is designed for Vitis AI compatibility.
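As a rough illustration of what "symmetric INT8 with power-of-two scales" means numerically, the sketch below quantizes a handful of values in plain Python. The helper names, the sample values, and the choice of the smallest power-of-two scale that covers the maximum absolute value are assumptions made for illustration; this is not AMD Quark's implementation.

```python
import math


def power_of_two_scale(values, qmax=127):
    """Smallest power-of-two scale that maps max(|values|) inside [-qmax, qmax]."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(max_abs / qmax))


def quantize(values, scale, qmin=-128, qmax=127):
    """Symmetric quantization: zero point is 0, so only a scale is needed."""
    return [min(qmax, max(qmin, round(v / scale))) for v in values]


def dequantize(quantized, scale):
    """Map INT8 codes back to floating point."""
    return [q * scale for q in quantized]


# Hypothetical activation values
activations = [-3.2, -0.5, 0.0, 0.74, 2.9]
scale = power_of_two_scale(activations)
codes = quantize(activations, scale)
restored = dequantize(codes, scale)

print("scale:", scale)
print("INT8 codes:", codes)
print("restored:", restored)
```

Because the scale is a power of two, dequantization reduces to a bit shift in fixed-point hardware, which is the usual motivation for this constraint.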
The following example shows the minimum required configuration for INT8
quantization with Vitis AI 6.2. Choose one of the two calibration options
below and substitute it into the quantizer.quantize_model call.
Step 1: Configure and create the quantizer
from quark.onnx import ModelQuantizer, QConfig
# Configure quantization -- all four options are mandatory for Vitis AI 6.2
quant_config = QConfig.get_default_config("VINT8")
quant_config.global_quant_config.extra_options["Int32Bias"] = False
quant_config.global_quant_config.enable_npu_cnn = True
quant_config.global_quant_config.extra_options["DedicatedQDQPair"] = True
quant_config.global_quant_config.extra_options["QuantizeAllOpTypes"] = True
# Create quantizer
quantizer = ModelQuantizer(quant_config)
Step 2: Provide calibration data
Choose one of the following options.
Option 1: Use random data (for testing only)
quant_config.global_quant_config.extra_options["UseRandomData"] = True
quantizer.quantize_model(
    model_input="model_fp32.onnx",
    model_output="model_int8.onnx",
    calibration_data_reader=None,  # Random data used automatically
)
Option 2: Use real calibration data (recommended for production)
For best accuracy, provide a representative calibration dataset.
Implement a CalibrationDataReader subclass that yields
preprocessed input samples, then pass an instance to quantize_model.
The following minimal template is based on the AMD Quark CalibrationDataReader reference:
from onnxruntime.quantization.calibrate import CalibrationDataReader


class ImageDataReader(CalibrationDataReader):
    def __init__(self, calibration_image_folder, input_name,
                 input_height, input_width):
        self.enum_data = None
        self.input_name = input_name
        self.data_list = self._preprocess_images(
            calibration_image_folder, input_height, input_width)

    def _preprocess_images(self, image_folder, input_height,
                           input_width, batch_size=1):
        data_list = []
        # User-defined preprocessing logic (resize, normalize, transpose, ...)
        return data_list

    def get_next(self):
        # Build the iterator lazily on first use (or after rewind()).
        if self.enum_data is None:
            self.enum_data = iter(
                [{self.input_name: data} for data in self.data_list])
        return next(self.enum_data, None)

    def rewind(self):
        # Called by the quantizer to reset the data iterator between
        # calibration passes. Reset enum_data to None so that get_next()
        # restarts from the beginning of data_list.
        self.enum_data = None


calib_data_reader = ImageDataReader(
    calibration_image_folder="calib_images",
    input_name="input",
    input_height=224,
    input_width=224,
)
quantizer.quantize_model(
    model_input="model_fp32.onnx",
    model_output="model_int8.onnx",
    calibration_data_reader=calib_data_reader,
)
Calibration Dataset Selection Guidelines#
While AMD Quark leaves calibration dataset configuration entirely user-defined, adhering to the following guidelines prevents accuracy degradation on target hardware.
Quantity: Supply enough samples to adequately saturate the tensor histograms without significantly increasing compilation time. A larger calibration set improves histogram coverage but yields diminishing accuracy returns beyond a moderate sample count; balance thoroughness with practical runtime constraints.
Balance: Select an equal number of samples across every target class to prevent scale-factor bias.
Variance: Mix environmental factors such as lighting, angles, and distances. Do not use sequential video frames.
Cleanliness: Exclude corrupted, heavily blurred, or extreme anomaly frames to prevent artificial activation spikes.
Pipeline Alignment: Manually apply all production transformations, cropping, and normalizations before passing data to Quark. Quark does not preprocess raw files.
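The Balance guideline can be applied with a small selection helper. The sketch below draws the same number of files from each class with a fixed seed so the selection is reproducible. The function name, the class names, and the file names are hypothetical; AMD Quark does not provide this helper.

```python
import random


def select_balanced_samples(files_by_class, per_class, seed=0):
    """Pick `per_class` files from every class, then shuffle the result."""
    rng = random.Random(seed)
    selected = []
    for label in sorted(files_by_class):
        files = sorted(files_by_class[label])
        if len(files) < per_class:
            raise ValueError(
                f"class '{label}' has only {len(files)} files, need {per_class}")
        # Random sampling also avoids picking runs of sequential frames.
        selected.extend(rng.sample(files, per_class))
    rng.shuffle(selected)
    return selected


# Hypothetical dataset index: class label -> available image files
files_by_class = {
    "car": [f"car_{i:03d}.jpg" for i in range(40)],
    "person": [f"person_{i:03d}.jpg" for i in range(25)],
    "bicycle": [f"bicycle_{i:03d}.jpg" for i in range(10)],
}
selected = select_balanced_samples(files_by_class, per_class=8)
print(len(selected), "calibration files selected")
```

The resulting list would then feed the user-defined preprocessing step of the calibration data reader.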
Evaluating Quantization Quality#
After quantizing your model, verify that quantization has not significantly degraded output quality. AMD Quark provides an evaluation tool that compares inference results between a baseline floating-point model and the quantized model.
Generating inference results for comparison
Before running the evaluation tool, run inference with both the baseline
FP32 model and the quantized INT8 model and save the output tensors to
separate folders. The evaluation tool expects one .npy (NumPy binary)
file per sample in each folder, with consistent filenames across both
folders. The dtype and shape of each saved array must match the model output
exactly.
import numpy as np
import onnxruntime as ort
import os


def save_inference_results(model_path, data_reader, output_folder):
    os.makedirs(output_folder, exist_ok=True)
    session = ort.InferenceSession(model_path)
    output_name = session.get_outputs()[0].name
    # Restart the reader so both models see the same samples in the same order.
    data_reader.rewind()
    idx = 0
    while True:
        inputs = data_reader.get_next()
        if inputs is None:
            break
        result = session.run([output_name], inputs)[0]
        # Save as .npy -- the evaluate tool expects NumPy binary format
        np.save(os.path.join(output_folder, f"result_{idx}.npy"), result)
        idx += 1


save_inference_results("model_fp32.onnx", calib_data_reader, "output_fp32")
save_inference_results("model_int8.onnx", calib_data_reader, "output_int8")
The folder structure produced by the above script is:
output_fp32/
    result_0.npy
    result_1.npy
    ...
output_int8/
    result_0.npy
    result_1.npy
    ...
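Because the evaluation tool expects matching filenames in both folders, it can help to check the pairing before running it. The following is a minimal sketch using only the Python standard library; the helper name is hypothetical, and the demonstration creates empty placeholder files in a temporary directory rather than real inference results.

```python
import os
import tempfile


def find_unpaired_results(baseline_folder, quantized_folder):
    """Return .npy filenames that exist in only one of the two folders."""
    baseline = {f for f in os.listdir(baseline_folder) if f.endswith(".npy")}
    quantized = {f for f in os.listdir(quantized_folder) if f.endswith(".npy")}
    return sorted(baseline - quantized), sorted(quantized - baseline)


# Demonstration with placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as root:
    fp32_dir = os.path.join(root, "output_fp32")
    int8_dir = os.path.join(root, "output_int8")
    os.makedirs(fp32_dir)
    os.makedirs(int8_dir)
    for name in ("result_0.npy", "result_1.npy"):
        open(os.path.join(fp32_dir, name), "wb").close()
    open(os.path.join(int8_dir, "result_0.npy"), "wb").close()

    only_fp32, only_int8 = find_unpaired_results(fp32_dir, int8_dir)
    print("missing from INT8 folder:", only_fp32)
    print("missing from FP32 folder:", only_int8)
```

Two empty lists indicate that every sample has a result in both folders.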
Running the evaluation tool
After generating inference results on CPU/GPU from both models (as shown in the code above), run the following command inside the Docker container to evaluate quantization quality:
python3 -m quark.onnx.tools.evaluate \
--baseline_results_folder output_fp32 \
--quantized_results_folder output_int8
The tool analyzes the saved inference results and reports the following metrics:
| Metric | Description | Interpretation |
|---|---|---|
| Cosine Similarity | Measures directional similarity of output vectors (0 to 1, where 1.0 = identical direction) | Higher is better. Values above 0.98 indicate good preservation. |
| L2 Distance | Euclidean distance measuring the absolute numerical difference between output tensors. The magnitude depends on output tensor scale and size and is not comparable across models. | Lower is better. No universal threshold applies. Use as a relative indicator: compare the L2 distance before and after applying accuracy recovery techniques such as mixed precision or fast fine-tuning to confirm improvement. |
| PSNR | Peak Signal-to-Noise Ratio in decibels (dB) | Higher is better. Values above 30 dB indicate low quantization noise. Thresholds are task-dependent. |
| SSIM | Structural Similarity Index (0 to 1, where 1.0 = identical structure) | Higher is better. Values above 0.99 indicate excellent structural preservation. Thresholds are task-dependent. |
Example output
Mean cos similarity: 0.9867
Min cos similarity: 0.9761
Mean l2 distance: 11.41
Max l2 distance: 15.47
Mean psnr: 30.16
Min psnr: 26.92
Mean ssim: 0.9931
Min ssim: 0.9888
These signal-level metrics indicate whether the quantized model’s raw outputs are numerically close to the baseline. They do not measure task-level accuracy such as mean average precision (mAP) for object detection or top-1 accuracy for classification. For a complete validation methodology, see Model Accuracy Validation Methodology.
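For intuition, three of the four reported metrics can be sketched in plain Python on flattened output values. The formulas are the standard textbook definitions. The PSNR peak value (the largest absolute baseline value) is an assumption, and the evaluation tool may define or normalize these metrics differently. SSIM is omitted because it requires windowed statistics.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def l2_distance(a, b):
    """Euclidean distance between the two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def psnr(baseline, quantized):
    """10 * log10(peak^2 / MSE), with peak = max |baseline| (assumption)."""
    mse = sum((x - y) ** 2 for x, y in zip(baseline, quantized)) / len(baseline)
    if mse == 0:
        return math.inf
    peak = max(abs(x) for x in baseline)
    return 10 * math.log10(peak ** 2 / mse)


# Hypothetical flattened outputs from the FP32 and INT8 models
baseline = [0.0, 1.0, 2.0, 4.0]
quantized = [0.0, 1.0, 2.0, 3.0]

print("cosine similarity:", cosine_similarity(baseline, quantized))
print("L2 distance:", l2_distance(baseline, quantized))
print("PSNR (dB):", psnr(baseline, quantized))
```

Note how L2 distance grows with tensor magnitude and size while cosine similarity does not, which is why the table treats L2 distance only as a relative indicator.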
If these metrics indicate significant degradation, mixed precision techniques can recover performance by excluding sensitive operations from quantization. The following case study demonstrates this approach for YOLO models.
YOLO INT8 Quantization: Mixed Precision Solution#
A common issue with YOLO ONNX models is a significant drop in object detection accuracy after quantization to INT8. This often appears as missed detections and lower confidence scores compared to the original floating-point model.
Problem Description#
The accuracy degradation occurs because YOLO’s post-processing subgraph concatenates tensors with vastly different numerical ranges into a single output tensor. Confidence scores are small values between 0.0 and 1.0, while bounding box coordinates are large values representing pixel locations (for example, 0 to 640). When combined, the quantizer must choose a single INT8 scale for the entire output. This scale is dominated by the large coordinate values, which distorts the precision of small confidence scores and causes severe accuracy degradation.
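The effect of a shared scale can be reproduced with a few numbers. The sketch below quantizes hypothetical box coordinates and confidence scores first with one shared INT8 scale and then with a scale fitted to the scores alone. It uses a plain max-based symmetric scale for simplicity, which is an assumption and not the exact scale selection used by the quantizer.

```python
def quantize_dequantize(values, scale):
    """Round to INT8 codes with the given scale, then map back to float."""
    return [min(127, max(-128, round(v / scale))) * scale for v in values]


# Hypothetical post-processing outputs
boxes = [12.0, 320.5, 640.0, 48.0]   # pixel coordinates
scores = [0.05, 0.35, 0.9]           # confidence scores in [0, 1]

# Case 1: one scale for the concatenated tensor (dominated by 640.0)
combined = boxes + scores
shared_scale = max(abs(v) for v in combined) / 127
scores_shared = quantize_dequantize(combined, shared_scale)[len(boxes):]

# Case 2: a scale fitted to the scores only
own_scale = max(scores) / 127
scores_own = quantize_dequantize(scores, own_scale)

print("shared scale:", shared_scale, "->", scores_shared)
print("own scale:   ", own_scale, "->", scores_own)
```

With the shared scale, one INT8 step is about 5 units wide, so every confidence score rounds to zero; with its own scale, each score is preserved to within half a step.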
Solution: Exclude Post-Processing from Quantization#
To resolve this, prevent the post-processing subgraph from being quantized. Run the main model body in INT8 and perform the post-processing step externally in floating-point precision.
Important
The subgraph node names used in the example (for example, /model.22/Concat_3)
are model-specific. The exact names depend on the architecture of your
particular YOLO model and how it was exported to ONNX. You must inspect
your model to find the correct node names for the post-processing subgraph
you wish to exclude before running the code below.
To inspect node names, use a model visualization tool such as Netron or the ONNX Python API:
import onnx

model = onnx.load("your_model.onnx")
# Print the operator type and name of every node in the graph
for node in model.graph.node:
    print(node.op_type, node.name)
To exclude the post-processing subgraph, use the subgraphs_to_exclude option
in the Quark configuration. This option targets a connected sequence of nodes
forming a logical processing block, such as a post-processing or NMS subgraph.
Each entry is a tuple of start-node names and end-node names, as in the
example below. The excluded subgraph is automatically compiled to BF16 by the
Vitis AI compiler and runs entirely on the NPU, avoiding CPU fallback. The
compiler can also be configured to use FP16 for excluded subgraphs using
specific compiler options.
Note
AMD Quark also provides an exclude_nodes option, which targets
individual named nodes rather than connected subgraphs. For mixed precision
compilation workflows, use subgraphs_to_exclude to exclude entire
functional blocks. Use exclude_nodes only when a single isolated node
must be excluded.
from quark.onnx import ModelQuantizer, QConfig
# Configure quantization with mandatory options for Vitis AI 6.2
quant_config = QConfig.get_default_config("VINT8")
quant_config.global_quant_config.extra_options["Int32Bias"] = False
quant_config.global_quant_config.extra_options["DedicatedQDQPair"] = True
quant_config.global_quant_config.extra_options["QuantizeAllOpTypes"] = True
quant_config.global_quant_config.enable_npu_cnn = True
# Exclude the post-processing subgraph from INT8 quantization.
# Replace the node names below with the correct names for your model.
quant_config.global_quant_config.subgraphs_to_exclude = [
    (["/model.22/Concat_3"], ["/model.22/Concat_5"])
]
# Create quantizer and run quantization
quantizer = ModelQuantizer(quant_config)
quantizer.quantize_model(
    model_input="model_fp32.onnx",
    model_output="model_int8.onnx",
    calibration_data_reader=calib_data_reader,
)
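To make the (start nodes, end nodes) tuple more concrete, the sketch below collects every node that lies on a path from a start node to an end node in a toy graph. This is one plausible reading of how such a pair delimits a subgraph, stated here as an assumption; it is not taken from the AMD Quark implementation, and the node names and edges are hypothetical.

```python
def nodes_between(edges, start_nodes, end_nodes):
    """Nodes on any path from a start node to an end node (both inclusive)."""
    successors, predecessors = {}, {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
        predecessors.setdefault(dst, []).append(src)

    def reachable(seeds, neighbors):
        # Depth-first traversal over the given adjacency map
        seen, stack = set(seeds), list(seeds)
        while stack:
            for nxt in neighbors.get(stack.pop(), []):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    # Keep nodes that are downstream of a start AND upstream of an end
    return reachable(start_nodes, successors) & reachable(end_nodes, predecessors)


# Hypothetical graph: (producer node, consumer node)
edges = [
    ("Conv_0", "Concat_3"),
    ("Concat_3", "Sigmoid_0"),
    ("Concat_3", "Mul_0"),
    ("Sigmoid_0", "Concat_5"),
    ("Mul_0", "Concat_5"),
    ("Conv_0", "Relu_0"),
]
excluded = nodes_between(edges, ["Concat_3"], ["Concat_5"])
print(sorted(excluded))
```

Listing the nodes this way before quantization is a quick sanity check that the chosen start and end names enclose the whole post-processing block and nothing from the model body.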
After quantization, the excluded post-processing subgraph remains in floating-point precision in the output model. The Vitis AI compiler automatically converts the excluded FP32 subgraph to BF16 and runs it on the NPU alongside the INT8 quantized body. The excluded subgraph can also be configured to run as FP16 using compiler options. For details on how the compiler handles mixed precision subgraphs and how to control operator placement, see Mixed Precision Compilation.
After running quantization with subgraphs_to_exclude, re-run the
evaluation tool described in the Evaluating Quantization Quality section
to confirm that the cosine similarity, PSNR, and SSIM metrics have improved.
For model accuracy recovery methodology, see
Model Accuracy Validation Methodology.
INT8 Quantization for FP16 Models#
Important
The version of Quark bundled with the Vitis AI 6.2 Docker image does not support this FP16-to-INT8 quantization workflow. Do not attempt it with the bundled version. This workflow will be supported in the upcoming Quark 0.12 release.
Some ONNX models are exported in FP16 format rather than FP32. If you need to deploy such a model with INT8 precision on the NPU, you can use this workflow to selectively quantize parts of the model to VINT8 while keeping other parts in FP16. This is useful when full INT8 quantization would cause accuracy degradation in sensitive subgraphs.
To quantize an FP16 model to INT8, first convert the FP16 model to FP32, and then quantize the FP32 model to INT8. The workflow has three main steps:
Step 1: Use the Quark Shapeshifter engine to convert the entire model or selected subgraphs from FP16 to FP32.
Step 2: Configure Quark for VINT8 quantization. When the source model contains FP16 tensors, set the FP16-specific options (for example, QuantizeFP16 and UseFP32Scale) and exclude any subgraphs that must remain FP16.
Step 3: Run quantization with representative calibration data. For quick testing you can use random data, but use representative data for production calibration. Validate accuracy and performance on the target hardware.
Note
The node names used in the examples below (for example,
/backbone/stem/rbr_reparam/Conv) are specific to the YOLOv6 model
architecture. Inspect your own model to find the correct node names
using Netron or the ONNX Python API.
Step 1: Convert Selected FP16 Subgraphs to FP32#
The subgraphs you plan to quantize to VINT8 must first be converted from FP16 to FP32. Common candidates include:
Input stem convolutions – first-layer precision affects downstream accuracy.
Detection head reshapes – shape manipulation nodes benefit from stable numerical representation.
Other subgraphs that benefit from VINT8 quantization – any subgraphs where VINT8 quantization provides an accuracy advantage.
Use the Quark Shapeshifter engine to selectively convert these subgraphs while leaving the rest of the model in FP16. Define the subgraphs to convert as pairs of start-node and end-node lists:
import yaml
from quark.shapeshifter import Engine, LoadConfigFromFileOrDict

# Each entry is a [start_nodes, end_nodes] pair
subgraphs_to_include = [
    [['/backbone/stem/rbr_reparam/Conv'],
     ['/detect/Reshape_6', '/detect/Reshape_7', '/detect/Reshape_8',
      '/detect/Reshape', '/detect/Reshape_1', '/detect/Reshape_2',
      '/detect/Reshape_3', '/detect/Reshape_4', '/detect/Reshape_5']]
]
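A malformed subgraph list is easy to produce by hand, so a quick structural check before running the engine can save a failed run. The validator below is a hypothetical helper, not part of AMD Quark. It only checks that each entry is a pair of non-empty lists of node-name strings, and it does not verify that the names exist in the model.

```python
def validate_subgraph_spec(subgraphs):
    """Raise ValueError unless every entry is a [start_nodes, end_nodes] pair."""
    for index, entry in enumerate(subgraphs):
        if not isinstance(entry, (list, tuple)) or len(entry) != 2:
            raise ValueError(
                f"entry {index}: expected a [start_nodes, end_nodes] pair")
        for role, names in zip(("start", "end"), entry):
            if not isinstance(names, (list, tuple)) or not names:
                raise ValueError(
                    f"entry {index}: {role} node list is empty or not a list")
            if not all(isinstance(name, str) and name for name in names):
                raise ValueError(
                    f"entry {index}: {role} node names must be non-empty strings")
    return True


# Shortened example in the same shape as the list above
example_spec = [
    [["/backbone/stem/rbr_reparam/Conv"],
     ["/detect/Reshape", "/detect/Reshape_1"]],
]
print(validate_subgraph_spec(example_spec))
```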
Next, generate a YAML configuration file and run the Shapeshifter engine to produce the mixed FP16/FP32 model:
def generate_yaml(input_model, output_model, subgraphs_to_include):
    # Write the Shapeshifter pass configuration to a YAML file
    yaml_path = "convert_fp16_to_fp32.yaml"
    config = {
        "input_model_path": input_model,
        "passes": {
            "onnx_convert_fp16_to_fp32": {
                "convert_fp16_to_fp32": True,
                "subgraphs_to_include": subgraphs_to_include
            }
        },
        "output_model_path": output_model,
    }
    with open(yaml_path, "w", encoding="utf-8") as f:
        yaml.dump(config, f, allow_unicode=True, sort_keys=False)
    return yaml_path


input_model = "yolov6_fp16.onnx"
converted_model = "yolov6_fp16.converted.onnx"
yaml_file = generate_yaml(input_model, converted_model, subgraphs_to_include)

# Load the configuration and run the conversion
engine_config = LoadConfigFromFileOrDict(yaml_file).data
engine = Engine(config=engine_config)
engine.initialize()
engine.run()
The resulting model (yolov6_fp16.converted.onnx) is a mixed-precision
model: the selected subgraphs are now in FP32, while all other operations
remain in FP16.
Step 2: Configure VINT8 Quantization#
With the mixed FP16/FP32 model ready, configure Quark for VINT8
quantization. Two additional options are required when the source model
contains FP16 tensors: QuantizeFP16 and UseFP32Scale.
from quark.onnx import ModelQuantizer, QConfig
quant_config = QConfig.get_default_config("VINT8")
# Mandatory options for Vitis AI 6.2
quant_config.global_quant_config.extra_options["Int32Bias"] = False
quant_config.global_quant_config.enable_npu_cnn = True
# Exclude tail subgraphs (detection head -- keep in FP16)
TAIL_SUBGRAPHS = [
    (['/detect/Concat', '/detect/Concat_1'],
     ['/detect/Concat_7'])
]
quant_config.global_quant_config.subgraphs_to_exclude = TAIL_SUBGRAPHS
# FP16-specific options (REQUIRED for FP16 source models)
quant_config.global_quant_config.extra_options["QuantizeFP16"] = True
quant_config.global_quant_config.extra_options["UseFP32Scale"] = False
Step 3: Run Quantization#
You can run quantization with either random data or real calibration data. Random data is faster and suitable for initial testing. Representative calibration data produces more accurate results and is recommended for production deployments.
Option A: Random data (faster)
quant_config.global_quant_config.extra_options["UseRandomData"] = True
quantizer = ModelQuantizer(quant_config)
quantizer.quantize_model(
    model_input="yolov6_fp16.converted.onnx",
    model_output="yolov6m_quantized.onnx",
    calibration_data_reader=None,
)
Option B: Calibration data (more accurate)
Implement a CalibrationDataReader subclass whose get_next() method
yields preprocessed input samples as {input_name: np.ndarray}
dictionaries, returning None when all samples have been consumed. See
Option 2 in the INT8 Quantization Example section for the full
implementation pattern.
quantizer = ModelQuantizer(quant_config)
quantizer.quantize_model(
    model_input="yolov6_fp16.converted.onnx",
    model_output="yolov6m_quantized.onnx",
    calibration_data_reader=calib_data_reader,
)
After quantization, re-run the evaluation tool described in the Evaluating Quantization Quality section to verify accuracy.
BF16 Quantization for CPU/GPU Validation#
This section is only required when debugging BF16 accuracy issues on NPU. For standard BF16 deployments, the compiler handles conversion automatically as described in the BF16 (Brain Float 16) Models section.
See Model Accuracy Validation Methodology for when to use this workflow.
When BF16 NPU accuracy is inadequate, you can use AMD Quark to create a BF16 model for CPU/GPU validation. This isolates BF16 quantization effects from NPU-specific execution issues.
The following example shows BF16 quantization with the “with-cast” configuration, which produces Cast nodes that match the Vitis AI compiler’s BF16 implementation.
from quark.onnx import ModelQuantizer, QConfig
# 1. Start from the default BF16 config
quant_config = QConfig.get_default_config("BF16")
# 2. Enable "with-cast" mode to match NPU BF16 behavior
# This replaces QDQ nodes with Cast nodes
quant_config.global_quant_config.extra_options["BF16QDQToCast"] = True
# 3. Create quantizer and quantize
quantizer = ModelQuantizer(quant_config)
quantizer.quantize_model(
    model_input="models/model_fp32.onnx",
    model_output="models/model_bf16_cpu_validation.onnx",
    calibration_data_reader=calib_data_reader,  # optional for BF16; may be None
)
Note
Calibration data is optional for BF16 quantization but might improve
accuracy. You can pass None for calibration_data_reader if
calibration data is unavailable.
After quantization, validate the BF16 model on CPU/GPU using ONNX Runtime. Compare the results against your FP32 baseline to measure BF16 quantization loss.
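For a sense of how large BF16 rounding error can be, the sketch below converts FP32 values to BF16 precision using only the Python standard library. BF16 keeps the upper 16 bits of the FP32 bit pattern; the rounding mode used here (round to nearest, ties to even) is an assumption and may not match the compiler or the Cast nodes exactly. NaN and infinity handling is omitted.

```python
import struct


def to_bf16(value):
    """Round a float to BF16 precision and return it as a Python float."""
    # Reinterpret the FP32 encoding as a 32-bit unsigned integer
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Round to nearest, ties to even, on the 16 bits that will be dropped
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Hypothetical output values from an FP32 baseline
samples = [1.0, 3.14159265, 0.1, 1234.567, -7.3e-3]
for x in samples:
    y = to_bf16(x)
    rel_err = abs(y - x) / abs(x)
    print(f"{x!r:>12} -> {y!r:<22} relative error {rel_err:.2e}")
```

BF16 has 8 significant bits, so the relative rounding error per value is at most 2^-8 (about 0.4%). Differences between the FP32 baseline and the BF16 model that are much larger than this per-operation error usually come from error accumulating across layers.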
Important
The BF16 model created by Quark is used only for CPU/GPU validation. When compiling for NPU deployment, use the original FP32 model as input to the Vitis AI compiler. The compiler performs its own BF16 conversion during NPU compilation.
For the complete BF16 validation workflow and accuracy comparison methodology, see Model Accuracy Validation Methodology.