Vitis AI 6.2 User Guide for Versal AI Edge Series Gen2#

Vitis AI Overview#

Deploying deep learning models on edge devices requires balancing performance, power consumption, and accuracy. Vitis AI provides an integrated toolchain for optimizing models trained in PyTorch or TensorFlow and exported to ONNX format for AMD Versal AI Edge Series Gen2 hardware. The toolchain takes you from ONNX models through quantization and compilation to hardware-accelerated inference on embedded devices.

The workflow follows three sequential stages. Quantization reduces model precision to improve performance while maintaining accuracy. Compilation transforms models into optimized binaries for the Neural Processing Unit (NPU). Deployment executes these optimized models using one of three runtime options, each tailored to different application requirements. This staged approach provides a clear path from development to production while preserving flexibility in balancing accuracy, performance, and integration complexity.

Why Use Vitis AI?#

Vitis AI addresses the key challenges of edge AI deployment through hardware-software co-optimization. The NPU delivers low-latency, real-time inference with consistent, predictable performance characteristics essential for embedded applications. Power consumption is significantly reduced compared to CPU-only execution, critical for battery-powered devices and thermally constrained systems.

The toolchain supports both common CNN architectures (ResNet, MobileNet, EfficientNet) and vision transformer models (DETR, DeiT) for classification, detection, and segmentation tasks. Three precision options (INT8, BF16, FP16) let you balance performance, accuracy, and power consumption based on application requirements.

Integration is simplified through Python APIs for rapid prototyping and C++ APIs for production deployment. The runtime layer supports both ONNX Runtime and native VART interfaces, allowing you to choose the abstraction level that matches your needs. Models trained in PyTorch or TensorFlow deploy through the standard ONNX interchange format, avoiding framework lock-in.

For specialized requirements beyond standard inference, the Versal platform provides programmable hardware resources. The AI Engine (AIE) and Programmable Logic (PL) enable custom preprocessing, postprocessing, or domain-specific operations tailored to your application.

Supported Models and Scope#

The current release supports convolutional neural networks for image classification, object detection, and segmentation. Vision transformer architectures are also supported, primarily for vision tasks. Models from PyTorch and TensorFlow frameworks can be deployed after export to ONNX format. ONNX opset 20 is recommended for broadest operator support, though opsets 11 through 22 are supported.

Source models in FP32 format are automatically quantized to BF16 during compilation. For vision transformers, FP16 precision is available through explicit casting. INT8 quantization delivers maximum performance while requiring a calibration dataset.

Vitis AI Architecture#

Figure 1 illustrates the Vitis AI software stack from application frameworks to hardware execution. At the application layer, you work with familiar frameworks including ONNX, TensorFlow, and PyTorch, bringing your own CNN or transformer models. The Vitis AI toolchain processes these models through AMD Quark for quantization and the Vitis AI Compiler for hardware optimization.

Vitis AI software stack architecture from frameworks to hardware

At runtime, applications interface through either ONNX Runtime APIs with the VitisAI Execution Provider or native Vitis AI Runtime (VART) APIs. These high-level interfaces abstract the underlying Xilinx Runtime (XRT), which manages hardware resources and scheduling. At the hardware layer, the Neural Processing Unit on the Versal AI Edge Gen2 platform executes optimized model operations.

This layered architecture separates application-level concerns from low-level optimization and hardware management. Understanding where each component sits in the stack clarifies how quantization, compilation, and deployment stages interact with the overall system.

The Three-Stage Workflow#

Model deployment in Vitis AI progresses through quantization, compilation, and execution. Each stage transforms your model closer to efficient hardware execution.

Quantization#

Quantization reduces numerical precision of model weights and activations, decreasing memory footprint and improving performance. Three data types are supported: BF16, FP16, and INT8.

The compiler automatically converts FP32 models to BF16 during compilation with no additional configuration. This automatic quantization provides good performance improvement while preserving accuracy, making it the recommended starting point for most deployments.

Vision transformer models can be explicitly cast to FP16 using AMD Quark. FP16 offers higher precision that may better preserve transformer architecture accuracy compared to BF16.

For maximum performance, INT8 quantization is available through AMD Quark. This approach requires a calibration dataset to determine optimal quantization parameters but delivers the highest throughput and lowest latency. The Model Quantization section explains quantization options in detail and guides you through the AMD Quark workflow.

Compilation#

The Vitis AI compiler transforms ONNX models into optimized executables for the NPU, the specialized AI accelerator in Versal AI Edge Series Gen2 devices. During compilation, the compiler analyzes model structure, applies hardware-specific optimizations including operator fusion and memory scheduling, and partitions operations between NPU and CPU based on hardware capabilities.

Compilation is configured through a JSON file specifying the target device and optimization preferences. The compiler outputs device-specific binaries and metadata describing model input/output specifications. The Model Compilation section covers compilation configuration, optimization options, and troubleshooting.

Execution#

Two runtime options execute compiled models, each addressing different application needs and integration requirements.

ONNX Runtime with the Vitis AI Execution Provider offers the most accessible execution path, especially for developers already familiar with ONNX workflows. This runtime automatically partitions model graphs, offloading supported operations to the NPU while executing unsupported operations on the CPU. The heterogeneous execution model provides broad operator coverage and simplifies prototyping and validation workflows.

VART-ML Runtime targets production deployments requiring maximum performance. Optimized for models executing entirely on the NPU, VART-ML eliminates CPU fallback overhead and provides explicit control over buffer management and scheduling. This approach delivers lower latency and higher throughput than ONNX Runtime but requires more integration effort and assumes full NPU compatibility.

The ONNX Runtime and Vitis AI Runtime (VART) sections provide detailed guidance for each runtime with complete implementation examples. Advanced runtime techniques are covered in spatial and temporal NPU sharing, zero-copy inference, and asynchronous inference.

Next Steps#

With an understanding of the Vitis AI architecture and workflow, you can proceed to environment setup and model execution.

Getting Started

Begin with the Quick Start section, which covers system requirements, board setup, and a demonstration running pre-built models on the VEK385 evaluation board. This validates your hardware and software environment before working with your own models. For Docker-based development workflows, consult the Development Flow section.

Model Preparation

The Model Quantization section explains how to use AMD Quark for FP16 casting and INT8 quantization, including calibration strategies and accuracy validation. If you rely on automatic BF16 quantization, you can skip this initially. The Model Compilation section covers compilation configuration, optimization options, and output interpretation.

Execution

For initial execution, use ONNX Runtime, which provides the fastest path to running inference with automatic CPU/NPU partitioning. For production deployments requiring maximum performance, the Vitis AI Runtime (VART) section covers VART-ML and VART-X integration. Advanced users should consult spatial and temporal NPU sharing, zero-copy inference, and asynchronous inference.