Release Notes#
Version 6.2#
This is the first public release of the AMD Vitis™ AI 6.2 User Guide for Versal AI Edge Series Gen2.
Key Features#
Core Platform Support#
Target Hardware: VEK385 RevB and RevA (pre-production silicon) evaluation boards with Versal AI Edge Series Gen2 Adaptive SoCs
Supported Devices: XC2VE3858, XC2VE3504, XC2VE3558, XC2VE3804, XC2VE3804_SE, and XC2VE3858_SE
Docker-Based Development: Pre-built Docker image with all necessary tools for model quantization and compilation on a Linux host
Tool Versions: Vivado 2025.2, Vitis 2025.2
Model Quantization and Compilation#
ONNX Format Support: Full support for ONNX models with opset 11-20 (opset 20 recommended for optimal performance), partial support for opset 21-22
Supported Types: FP32, BF16, FP16, and INT8 models. FP32, FP16, and BF16 are unquantized floating-point types.
Quantization Workflows:
INT8 Explicit Quantization: AMD Quark toolkit for INT8 quantization with calibration and optional fast fine-tuning
BF16 Implicit Conversion: Automatic FP32-to-BF16 conversion during Vitis AI compilation without calibration
FP16 Model Conversion: AMD Quark toolkit for converting models to FP16 format
Mixed Precision Compilation: Automatic BF16 and FP16 conversion of FP32 operations to eliminate CPU fallback and improve performance
Operator Support: Comprehensive support for 2D CNN and Vision Transformer models
Data Parallelism: Support for batching multiple inputs, with each input feeding into an independent copy of the model pipeline for efficient inference
Tensor Parallelism: Ability to partition models across multiple NPU columns for improved throughput and parallelization of large models
Dynamic Batch Support: Compile models once and run with different batch sizes at inference time
Deployment Runtimes#
ONNX Runtime with AMD Vitis™ AI Execution Provider: Streamlined, framework-agnostic deployment with automatic CPU/NPU partitioning for heterogeneous execution
VART-ML Runtime: High-performance runtime optimized for fully NPU-offloaded models with zero-copy execution support. Runtime is C++ based and does not include Python APIs.
VART-X APIs: Specialized APIs for video analytics with hardware-accelerated preprocessing (resize, color conversion, normalization) and integrated postprocessing/overlay functions
Spatial and Temporal Execution: Ability to run models spatially across multiple NPU columns and temporally by pipelining execution across time for optimized resource use
Multiple API Support: Python and C++ APIs for ONNX Runtime, and C++ APIs for the VART-ML runtime
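As an illustration of the ONNX Runtime deployment path described above, the following minimal sketch builds a provider priority list that prefers the Vitis AI Execution Provider and keeps the CPU provider as a fallback. The helper name choose_providers is hypothetical and not part of any AMD or ONNX Runtime API; the provider names are the ones ONNX Runtime registers. Provider options specific to this release are not shown.

```python
from typing import Iterable, List

# Provider names as registered in ONNX Runtime.
VITIS_AI_EP = "VitisAIExecutionProvider"
CPU_EP = "CPUExecutionProvider"


def choose_providers(available: Iterable[str]) -> List[str]:
    """Hypothetical helper: Vitis AI first when present, CPU always last."""
    available = set(available)
    providers = []
    if VITIS_AI_EP in available:
        providers.append(VITIS_AI_EP)
    # Keep the CPU provider last so operators that are not offloaded
    # to the NPU can still execute.
    providers.append(CPU_EP)
    return providers
```

With ONNX Runtime installed, the resulting list could be passed as the providers argument of onnxruntime.InferenceSession, using onnxruntime.get_available_providers() as the input. Refer to the user guide for the provider options this release expects.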
Development and Analysis Tools#
AI Analyzer: Comprehensive tool for model compilation visualization and inference profiling with three key sections:
Partitioning Analysis: Visual breakdown of CPU/NPU operator assignments and GOP offloading statistics
NPU Insights: Detailed view of NPU optimization including operator fusion and memory partitioning
Performance Profiling: Inference execution analysis with latency and throughput metrics
DDR Throughput Profiling: Measure and analyze DDR memory throughput between the NoC (Network-on-Chip) and the AI Engine array, and visualize the results in AI Analyzer to identify memory access bottlenecks affecting model performance.
Integrated System Reference Design#
End-to-End Reference Design: Complete source code with hardware-accelerated preprocessing via Image Processing PL kernel, NPU inference, and CPU-based postprocessing
Hardware Preprocessing: Image Processing PL HLS kernel supporting resize, color space conversion, normalization, and cropping
NPU Execution: Full model offload to Neural Processing Unit (NPU) for supported operators
Multi-Model Support: Concurrent execution of different compiled models on spatially partitioned NPU resources, or sequential execution on a shared partition using temporal (time-multiplexed) scheduling
Zero-Copy Inference: Enabled using device-backed tensor buffers (for example, XRT buffer objects, DMA-BUF file descriptors, or CMA-backed pointers)
Boot and Deployment Options#
Multiple Boot Flows: Support for OSPI and SD Card boot flows. Alternatively, a combined OSPI and Universal Flash Storage (UFS) boot flow is also available.
Pre-built Boot Images: Ready-to-test images for both RevA and RevB boards
Board Boot Scripts: Helper scripts provided to simplify and automate the board booting process
Compiled Model Formats:
Directory structure format for flexible deployment
Flat-buffer format (.rai files) for efficient memory-mapped inference
Cross-Compilation SDK: Complete sysroot environment for building target applications on the host machine
Example Applications and Models#
Quick Start ResNet50 Demo: Ready-to-run ResNet50 example for immediate evaluation on the target board
End-to-End Tutorials: Comprehensive tutorials exercising the complete flow from model quantization through compilation to inference execution on the board
Pre-built C++ Applications: Pre-built C++ applications using ONNX Runtime and the VART-ML runtime for functional and performance evaluation
Pre-built Models: Collection of example ONNX models for quick tool evaluation (ResNet-50 and YOLOx)
Limitations#
Model and Operator Constraints#
ONNX Opset: Operators introduced after ONNX opset 22 are not supported
Operator Constraints: Certain ONNX operators are not supported and cause models to fall back to CPU execution; refer to the ONNX Operators section in the user guide for detailed operator compatibility
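The opset ranges stated in these notes (full support for opset 11-20, partial support for 21-22, nothing after 22) can be turned into a quick pre-compilation check. The sketch below is illustrative only: classify_opset is a hypothetical helper, and opsets below 11 are reported as "not listed" because these notes do not address them.

```python
def classify_opset(opset: int) -> str:
    """Hypothetical helper mapping an ONNX opset version to the support
    level described in these release notes."""
    if 11 <= opset <= 20:
        return "full"
    if 21 <= opset <= 22:
        return "partial"
    if opset > 22:
        return "unsupported"
    # Opsets below 11 are not mentioned in the notes.
    return "not listed"
```

With the onnx Python package installed, the opset of a model can be read from the opset_import field of the loaded model and passed to this function before starting a compilation.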
Quantization Constraints#
FP16-to-INT8 quantization: The version of Quark bundled with the Vitis AI 6.2 Docker image does not support the FP16-to-INT8 quantization workflow. Do not attempt this workflow with the bundled version. Support for this workflow is planned for the upcoming Quark 0.12 release.
Model Compilation Constraints#
Large ONNX Models: When a model exceeds 2 GB, it is stored using ONNX external data format (model.onnx + model.onnx.data). The full file path to model.onnx must be specified at runtime to ensure the companion .data file is correctly resolved and loaded.
Write Permissions: The cache directory must have write permissions enabled during compilation. This allows the compiler to store generated artifacts necessary for the build process.
Docker Stack Size Configuration: When launching Docker containers, use the --ulimit stack=-1:-1 option to allocate unlimited stack memory. This configuration is essential for compiling large models.
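The three compilation constraints above lend themselves to a short pre-flight script. The following is a minimal sketch under stated assumptions: all function names are hypothetical, the image name passed to docker_run_command is a placeholder for the image shipped with this release, and the flags --rm and -it are ordinary Docker options chosen for illustration rather than requirements from these notes.

```python
import tempfile
from pathlib import Path
from typing import List


def resolve_model_path(model_path: str) -> Path:
    """Return the full path to model.onnx, as the notes require at runtime."""
    path = Path(model_path).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"model not found: {path}")
    return path


def has_external_data(model_path: Path) -> bool:
    """Check for the companion file, using the naming given in the notes
    (model.onnx + model.onnx.data)."""
    return Path(str(model_path) + ".data").is_file()


def cache_dir_is_writable(cache_dir: str) -> bool:
    """Try to create a temporary file in cache_dir; report whether it worked."""
    try:
        with tempfile.TemporaryFile(dir=cache_dir):
            return True
    except OSError:
        # Covers a missing directory as well as denied permissions.
        return False


def docker_run_command(image: str) -> List[str]:
    """Build a docker run argument list with unlimited stack size.
    The image name is supplied by the caller (placeholder in this sketch)."""
    return ["docker", "run", "--rm", "-it", "--ulimit", "stack=-1:-1", image]
```

The argument list returned by docker_run_command can be handed to subprocess.run. Volume mounts and other options needed by a given setup are intentionally left out.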
Known System Issues#
Permission Requirements: Run sudo -i on the target board to avoid permission issues when creating the hardware context.
AI Analyzer DDR Throughput Profiling Analysis Not Displayed in GUI: Enhanced profiling JSON files are generated successfully but do not appear in the AI Analyzer GUI.
Workaround: Copy record_timer*json and onnxruntime_profile_*json from analyzed_data/mlprofiler_ddr_merge/ to analyzed_data/ and relaunch AI Analyzer.
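The copy step of this workaround can be scripted. The sketch below is one possible way to do it, assuming the analyzed_data directory layout named in the workaround; the function name apply_ddr_profile_workaround is hypothetical, and the file patterns are used exactly as given in these notes.

```python
import shutil
from pathlib import Path
from typing import List


def apply_ddr_profile_workaround(analyzed_data: str) -> List[str]:
    """Copy the profiling JSON files from mlprofiler_ddr_merge/ up into
    analyzed_data/ and return the names of the files that were copied."""
    root = Path(analyzed_data)
    merge_dir = root / "mlprofiler_ddr_merge"
    copied = []
    # Patterns taken verbatim from the workaround text.
    for pattern in ("record_timer*json", "onnxruntime_profile_*json"):
        for src in sorted(merge_dir.glob(pattern)):
            shutil.copy2(src, root / src.name)
            copied.append(src.name)
    return copied
```

After the function returns, relaunch AI Analyzer so that it picks up the copied files.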