Quantization Parameters for VART Applications

Quantized models (for example INT8) require consistent quantization and dequantization between preprocessing, NPU inference, and post-processing. Any application built on VART—whether it uses VART-ML Runner APIs, ONNX Runtime + Vitis AI EP, or a VART-X pipeline—must align scale factors and zero-points with the compiled model.
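The consistency requirement can be illustrated with the usual affine INT8 mapping. The sketch below is not VART code. It assumes q = clamp(round(x / scale) + zero_point, -128, 127) and x ≈ (q - zero_point) * scale, and it uses the round-half-to-even behavior of Python's built-in round; the rounding mode and clamping range of a particular toolchain may differ. The function names and sample values are hypothetical.

```python
def quantize(x: float, scale: float, zero_point: int = 0) -> int:
    """Map a float to INT8 with an affine scheme (assumed, not taken from VART)."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))  # saturate to the signed 8-bit range


def dequantize(q: int, scale: float, zero_point: int = 0) -> float:
    """Inverse mapping: recover an approximate float from an INT8 value."""
    return (q - zero_point) * scale


scale = 0.5  # hypothetical per-tensor scale
q = quantize(3.2, scale)
matched = dequantize(q, scale)           # same scale on both sides
mismatched = dequantize(q, scale * 2.0)  # wrong scale on the output side

print(q, matched, mismatched)
```

With a matching scale the round trip lands within half a quantization step of the input. With a mismatched scale the error is systematic rather than a rounding effect, which is why every stage needs the parameters of the compiled model.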

By default, the compiled model artifact does not embed per-tensor quantization scales and zero-points, so vart::Runner::get_quant_parameters() returns default values rather than the model's actual parameters unless you set the compile options described under Compilation configuration. ONNX Runtime with the Vitis AI EP is unaffected because it reads scales from the source ONNX graph.

There are two ways to provide the parameters: embed them at compile time (recommended), or supply scales manually, either through the application JSON for VART-X pipelines or from the source ONNX graph for custom VART-ML paths.

Note

With mixed data types, the hardware buffer must by default contain the converted type (for example, bf16). The quantization and other dtype conversions are listed in the flexmlrt-hsi.json report. To keep float32 in the hardware buffer and have the NPU perform the conversion instead, specify use-hsi-json-file. Both are described in Tensor Format Conversions.

Compilation configuration

During model compilation, set the following options in the vaiml_config section of the Vitis AI EP configuration JSON (see Vitis AI EP Configuration File):

  • fe_experiment: "edge-quantization-in-rt=1" — keeps source quantization parameters in the compiled artifact for the preprocess path and for get_quant_parameters() at runtime.

  • experiment_features: ["SkipDequantizeRemoval"] — preserves dequantize nodes so output scales remain available to postprocess blocks and application code.

The following Vitis AI EP configuration sets both options under vaiml_config:

{
  "passes": [
    { "name": "init", "plugin": "vaip-pass_init" },
    {
      "name": "vaiml_partition",
      "plugin": "vaip-pass_vaiml_partition",
      "vaiml_config": {
        "device": "ve2-xc2ve3858",
        "fe_experiment": "edge-quantization-in-rt=1",
        "experiment_features": ["SkipDequantizeRemoval"]
      }
    }
  ],
  "target": "VAIML",
  "targets": [{ "name": "VAIML", "pass": ["init", "vaiml_partition"] }]
}

Note

Both fields must be set at compile time; they cannot be applied to an already-compiled model. The device value is illustrative—substitute the string for your target platform.

These flags apply to models compiled for any VART deployment lane (VART-ML, ONNX Runtime + EP, or VART-X). With both flags set, VART provides at runtime the scale and zero-point values embedded in the compiled model.
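If an EP configuration file already exists, the two options can be added programmatically before compilation. The following is a minimal sketch using only the Python standard library. The helper name enable_embedded_quant_params is hypothetical, and the sketch assumes that overwriting any existing fe_experiment value is acceptable; whether several fe_experiment settings can be combined is not covered here.

```python
import copy
import json


def enable_embedded_quant_params(config: dict) -> dict:
    """Return a copy of an EP config with both options set on each vaiml_config."""
    patched = copy.deepcopy(config)  # leave the caller's dictionary untouched
    for pass_entry in patched.get("passes", []):
        vaiml = pass_entry.get("vaiml_config")
        if vaiml is None:
            continue  # passes such as "init" carry no vaiml_config
        # Assumption: any previous fe_experiment value may be replaced.
        vaiml["fe_experiment"] = "edge-quantization-in-rt=1"
        features = vaiml.setdefault("experiment_features", [])
        if "SkipDequantizeRemoval" not in features:
            features.append("SkipDequantizeRemoval")
    return patched


base = {
    "passes": [
        {"name": "init", "plugin": "vaip-pass_init"},
        {
            "name": "vaiml_partition",
            "plugin": "vaip-pass_vaiml_partition",
            "vaiml_config": {"device": "ve2-xc2ve3858"},  # illustrative device
        },
    ],
    "target": "VAIML",
    "targets": [{"name": "VAIML", "pass": ["init", "vaiml_partition"]}],
}

patched = enable_embedded_quant_params(base)
print(json.dumps(patched, indent=2))
```

The helper is idempotent, so running it on a configuration that already contains both options leaves it unchanged.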

Runtime configuration

VART-X pipeline JSON (manual vs automatic scales)

When your application uses VART-X PreProcess and PostProcess modules:

If the compile flags above are not used, set scale factors explicitly in your application JSON (or read them from the source ONNX graph):

  • Preprocess — In preprocess-config, set quant-scale-factor (maps to PreProcessInfo::qt_fctr; use the reciprocal of the input tensor scale). See VART Application Development (PreProcess).

  • Postprocess — In postprocess-config, under model-params, set quant-scale-factors (one entry per output tensor, in model order) so postprocess can dequantize INT8 outputs before softmax, NMS, or other operations. See VART Application Development (PostProcess) and Post Processing Functions.
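The two settings above can be derived from scales read out of the source ONNX graph. The sketch below assumes that preprocess-config and postprocess-config sit side by side in one JSON object, and that quant-scale-factors takes the output tensor scales as they appear in the graph; the text above states the reciprocal rule only for the preprocess side, so confirm the postprocess convention against the PostProcess documentation. The helper name and the sample scales are hypothetical.

```python
import json


def build_scale_config(input_scale: float, output_scales: list[float]) -> dict:
    """Build the manual scale entries for a VART-X application JSON (layout assumed)."""
    if input_scale <= 0:
        raise ValueError("input scale must be positive")
    return {
        "preprocess-config": {
            # Reciprocal of the input tensor scale, as described for qt_fctr.
            "quant-scale-factor": 1.0 / input_scale,
        },
        "postprocess-config": {
            "model-params": {
                # One entry per output tensor, in model order (assumed: raw scales).
                "quant-scale-factors": list(output_scales),
            }
        },
    }


cfg = build_scale_config(input_scale=0.25, output_scales=[0.125, 0.5])
print(json.dumps(cfg, indent=2))
```

With an input scale of 0.25, the preprocess factor comes out as 4.0, so a normalized pixel value is multiplied by 4 before being stored as INT8.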

When the compile flags are used, manual quant scales in JSON are not required: VART queries the embedded values for both preprocess and postprocess. If quant-scale-factor or quant-scale-factors is set anyway, the JSON value overrides the value reported by the runner. This is useful for models compiled without embedding, or for experimenting with different scales without recompiling.
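The precedence rule can be written out as a small lookup. This is only a model of the documented behavior for the preprocess factor, not the VART-X implementation; the function name is hypothetical, and embedded_factor stands in for whatever value the runner would report for a model compiled with both flags.

```python
from typing import Optional


def resolve_input_scale_factor(
    preprocess_config: dict, embedded_factor: Optional[float]
) -> float:
    """Pick the preprocess scale factor: a JSON value wins over the embedded one."""
    if "quant-scale-factor" in preprocess_config:
        return float(preprocess_config["quant-scale-factor"])  # manual override
    if embedded_factor is not None:
        return embedded_factor  # value embedded at compile time
    raise ValueError(
        "no scale available: set quant-scale-factor or recompile with both flags"
    )


print(resolve_input_scale_factor({"quant-scale-factor": 8}, embedded_factor=4.0))
print(resolve_input_scale_factor({}, embedded_factor=4.0))
```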

VART-ML and custom applications

When you use VART-ML Runner directly (without VART-X JSON), query per-tensor quantization with vart::Runner::get_quant_parameters(); see VART Application Development. You can also inspect parameters with ml_vart --get-model-info; see Inspecting compiled model metadata in Reference Applications.

Without the compile flags, the query returns default values that do not reflect the model; read the scales from the source ONNX graph instead and pass them in code or JSON. With both flags set, get_quant_parameters(<tensor_name>) returns the per-tensor scale and zero-point from the compiled model.
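Once per-tensor parameters are available, whether from get_quant_parameters() or from the source ONNX graph, application code applies them before any floating-point postprocessing. The sketch below does not call the VART API: a plain dictionary keyed by tensor name stands in for the queried parameters, and the tensor name, scale, and raw values are hypothetical. It compares softmax over dequantized logits with softmax over the raw INT8 values.

```python
import math

# Hypothetical stand-in for queried parameters: tensor name -> (scale, zero_point).
quant_params = {"logits": (0.1, 0)}


def dequantize_tensor(values, scale, zero_point):
    """Convert raw INT8 values to floats with the affine formula (q - zp) * scale."""
    return [(v - zero_point) * scale for v in values]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


raw = [20, 10, 0]  # raw INT8 output values (hypothetical)
scale, zero_point = quant_params["logits"]

probs = softmax(dequantize_tensor(raw, scale, zero_point))
naive = softmax([float(v) for v in raw])  # dequantization skipped

print([round(p, 3) for p in probs])
print([round(p, 3) for p in naive])
```

Skipping dequantization leaves the ranking of the classes intact but pushes the top probability close to 1, so confidence thresholds in later stages such as NMS would behave differently.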