Vitis AI 5.1 Developer Guide#

Note

Vitis AI 5.1 is the first public release of Vitis AI with support for the Neural Processing Unit (NPU), replacing the Deep Learning Processing Unit (DPU) architecture. This release is available as a Beta version, targeting Versal AI Edge Series Adaptive SoCs. The production release is scheduled for Q1 2026. For support, please contact your local AMD sales representative or post your question to the Vitis AI and AI Community Forums.


Vitis AI (Product) Overview#

AMD Vitis™ AI is an IDE (Integrated Development Environment) that you can leverage to accelerate AI Inference on AMD’s Adaptive SoCs and FPGAs. The IDE provides optimized IP (Intellectual Property), supporting tools, libraries, models, reference designs, and tutorials that aid you throughout the development process. It is designed with high efficiency and ease of use in mind, unleashing the full potential of AI acceleration.

Vitis AI Integrated Development Environment Block Diagram

Key Components of Vitis AI#

The Vitis AI solution consists of three primary components:

  1. Neural Processing Unit (NPU) IP: A purpose-built AI Inference IP that leverages a combination of Programmable Logic and the AI Engine Array to accelerate the deployment of neural networks.

  2. Model Compilation Tools: A set of tools to quantize, compile, and optimize ML models for NPU IP.

  3. Model Deployment APIs: A collection of setup scripts, examples, and reference designs to integrate and execute ML inference models on the NPU IP from a software application.

NPU IP#

AMD uses the term NPU IP to identify the “soft” accelerators that facilitate deep-learning inference. The NPU IP uses a combination of AI Engines (AIE) and Programmable Logic (PL) to implement the inference accelerator.

Vitis AI provides the NPU IP and supporting tools to deploy both standard and custom neural networks on AMD’s adaptable targets.

The Vitis AI NPU IP operates as a general-purpose AI inference accelerator. Multiple NN models can be loaded and run concurrently on a single NPU. Multiple NPU IP instances can also be instantiated per device. The NPU IP can be scaled in size to accommodate your requirements.

The Vitis AI NPU IP architecture is called a “Matrix of (Heterogeneous) Processing Engines.” Although it might bear some visual resemblance to a systolic array at first glance, the similarity is only superficial. The NPU IP operates as a micro-coded processor with its own Instruction Set Architecture (ISA), and each NPU IP architecture has its own instruction set.

The Vitis AI Compiler, in collaboration with the NPU IP software stack, generates snapshots tailored to the deployment of each network. A snapshot contains the quantized model and the instructions for its execution by the NPU IP on the target platform.

Note: One advantage of this architecture is that changing the neural network requires neither loading a new bitstream nor building a new hardware platform. This is an important differentiator from conventional dataflow accelerator architectures, which are purpose-built for a single network.

Model Compilation Toolset#
  • Vitis AI Quantizer

The Vitis AI Quantizer, integrated as a component of either TensorFlow or PyTorch, converts 32-bit floating-point weights and activations to narrower data types such as INT8, reducing computational complexity with minimal loss of accuracy (typically around 1%). Executing this fixed-point model requires less memory bandwidth and therefore provides higher throughput and better power efficiency than the 32-bit floating-point model.
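For illustration, the following is a minimal sketch of post-training quantization using the PyTorch quantizer API (`pytorch_nndct.apis.torch_quantizer`) from earlier Vitis AI releases; the exact entry points in the 5.1 NPU toolchain may differ, and `MyModel` and `calib_loader` are hypothetical placeholders.

```python
# Post-training quantization sketch using the PyTorch quantizer API from
# earlier Vitis AI releases (pytorch_nndct); entry points in the 5.1 NPU
# toolchain may differ. MyModel and calib_loader are placeholders.
import torch
from pytorch_nndct.apis import torch_quantizer

float_model = MyModel().eval()                 # placeholder float model
dummy_input = torch.randn(1, 3, 224, 224)      # shape must match the model

# Calibration pass: collect activation statistics on representative data.
quantizer = torch_quantizer("calib", float_model, (dummy_input,))
quant_model = quantizer.quant_model
with torch.no_grad():
    for images, _ in calib_loader:             # placeholder DataLoader
        quant_model(images)
quantizer.export_quant_config()

# Test pass: evaluate the quantized model's accuracy before compilation.
quantizer = torch_quantizer("test", float_model, (dummy_input,))
quant_model = quantizer.quant_model
# ... run validation on quant_model; retrain/fine-tune if accuracy drops ...
```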

  • Vitis AI Compiler

The Vitis AI Compiler maps the quantized model to a highly efficient instruction set and dataflow model. The compiler performs multiple optimizations; for example, batch normalization operations are fused with the preceding convolution when the convolution operator precedes the batch normalization operator. Because the NPU IP supports multiple dimensions of parallelism, efficient instruction scheduling is the key to exploiting the inherent parallelism and data-reuse potential in the graph.
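To make the batch-normalization fusion concrete, the following NumPy sketch shows the general folding identity that such compilers exploit; it illustrates the technique itself, not the Vitis AI Compiler’s internal implementation.

```python
import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta
    into new conv weights/bias so BN disappears at inference time.

    w: conv weights, shape (out_ch, in_ch, kh, kw)
    b: conv bias, shape (out_ch,)
    gamma, beta, mean, var: per-channel BN parameters, shape (out_ch,)
    """
    scale = gamma / np.sqrt(var + eps)            # per-output-channel scale
    w_folded = w * scale[:, None, None, None]     # scale each output filter
    b_folded = (b - mean) * scale + beta          # fold BN shift into bias
    return w_folded, b_folded
```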

Model Deployment APIs#

Vitis AI Runtime (VART) is a set of API functions that support the integration of the NPU IP into software applications. VART is built on top of the legacy Xilinx Runtime (XRT) and provides a unified high-level runtime for embedded targets. Key features include the following (a usage sketch follows the list):

  • Asynchronous submission of jobs to the NPU IP.

  • Asynchronous collection of jobs from the NPU IP.

  • C++ and Python API implementations.

  • Support for multi-threaded execution.
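As a rough illustration of these features, here is a minimal Python sketch using the VART entry points (`xir.Graph.deserialize`, `vart.Runner.create_runner`, `execute_async`, `wait`) from earlier Vitis AI releases; the snapshot-based flow in Vitis AI 5.1 may expose different entry points, and `model.xmodel` is a placeholder file name.

```python
# Async job submission/collection sketch using the VART Python API from
# earlier Vitis AI releases; the Vitis AI 5.1 snapshot flow may differ.
import numpy as np
import vart
import xir

graph = xir.Graph.deserialize("model.xmodel")   # placeholder file name
# Pick the first subgraph mapped to the accelerator (legacy attribute
# name is "DPU"; the NPU flow may label subgraphs differently).
subgraphs = [
    s for s in graph.get_root_subgraph().toposort_child_subgraph()
    if s.has_attr("device") and s.get_attr("device") == "DPU"
]
runner = vart.Runner.create_runner(subgraphs[0], "run")

# Allocate host buffers matching the runner's tensor shapes; the dtype
# depends on the compiled model (int8 assumed here).
in_t = runner.get_input_tensors()[0]
out_t = runner.get_output_tensors()[0]
in_buf = np.zeros(tuple(in_t.dims), dtype=np.int8)
out_buf = np.zeros(tuple(out_t.dims), dtype=np.int8)

# Submit the job asynchronously, then collect the result.
job_id = runner.execute_async([in_buf], [out_buf])
runner.wait(job_id)
```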

Salient Features of Vitis AI#

  • AIE/PL Programmability

  • Low Latency / Real-time AI Inference

  • Low Power Consumption

  • Deep Learning Frameworks: PyTorch, TensorFlow

  • Broad CNN Model Coverage

  • Data Types: INT8, BF16

  • C++ and Python APIs for easier integration

Workflow and Components#

This section provides an overview of how developers can deploy models on AMD embedded platforms.

Development flow with Vitis AI: 100 Ft View

The figure outlines the process for deploying a machine learning (ML) model with Vitis AI on embedded platforms. The first step is setting up the development environment with the necessary AMD hardware and software. After training the ML model, verify its performance on CPU/GPU platforms by running inference on an x86 host and confirming accuracy. If the initial accuracy is not satisfactory, the model might need to be retrained or fine-tuned with the Vitis AI tools. After the accuracy is validated, proceed with deployment by choosing embedded execution for integrated systems.
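As a simple illustration of this x86 validation step, the sketch below computes the top-1 accuracy of a trained PyTorch model on a validation set; `model` and `val_loader` are hypothetical placeholders.

```python
import torch

def top1_accuracy(model, val_loader, device="cpu"):
    """Measure top-1 accuracy of the float model on an x86 host
    before moving on to quantization and compilation."""
    model.eval().to(device)
    correct = total = 0
    with torch.no_grad():
        for images, labels in val_loader:
            preds = model(images.to(device)).argmax(dim=1)
            correct += (preds == labels.to(device)).sum().item()
            total += labels.size(0)
    return correct / total
```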

The embedded execution process involves three steps:

  1. Model compilation

  2. Design

  3. Embedded execution

Embedded Execution Workflow: High-level

The process starts with model compilation. In this step, your trained model is compiled on an x86 host machine using the NPU compiler software. This step also includes assessing the accuracy of the model. If the post-quantization accuracy is not satisfactory, the model is fine-tuned with the software APIs. Once the accuracy is satisfactory, the compiler software compiles the model, generating a snapshot file and CPU sub-graphs. These sub-graphs are not accelerated on the FPGA by default. The snapshot file packages the compiled model and instructions for the runtime software into a single file. Refer to the following figure for the model compilation flow.

Model Compilation – Model Compilation Flow
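For a feel of the compilation step, here is a sketch that invokes the legacy `vai_c_xir` compiler driver from earlier (DPU-based) Vitis AI releases via Python; the 5.1 NPU compiler produces snapshots and its command line may differ, and all file paths below are placeholders.

```python
# Sketch of invoking the legacy Vitis AI compiler driver (vai_c_xir) from
# earlier DPU-based releases; the 5.1 NPU compiler that emits snapshots
# may use a different command line. All paths below are placeholders.
import subprocess

subprocess.run(
    [
        "vai_c_xir",
        "-x", "quantized_model.xmodel",   # quantizer output (placeholder)
        "-a", "arch.json",                # target architecture description
        "-o", "compiled/",                # output directory
        "-n", "my_network",               # name for the compiled model
    ],
    check=True,
)
```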

The second step is design, where the NPU IP is integrated into a full-chip FPGA design along with other IPs/kernels, such as pre-processing and post-processing. Tools like Vitis or Vivado are used for compilation and binary generation, and the resulting binary is flashed onto an SD card.

There are two options for design:

  • Vitis IDE

  • Vivado IDE

You can build a full-chip FPGA design using Vitis IDE, integrating it with VSS (Vitis Sub-Systems) and Vitis kernels for custom IPs. If you have RTL IP, it can be kernelized in Vitis and integrated into the design. Once the integrated design is ready, you can work on compilation, linking, packaging, and generating the binary using the Vitis IDE. Refer to the following figure for the Vitis IDE design flow.

Design – Vitis IDE Flow

In the Vivado IDE flow, Vitis is used to integrate the VSS and your Vitis kernel IPs and link them together. The output of linking is then exported to Vivado, where you can integrate your custom RTL IPs, build the complete design, and generate the binary. Refer to the following figure for the Vivado IDE design flow.

Design – Vivado IDE Flow

Note

  1. The VSS (Vitis Sub-System) is a combination of the AIE configuration for ML inference and a kernelized netlist of the PL logic used for ML inference.

  2. The model preparation and design steps are independent, and you can work on them in parallel. Both steps must be complete before proceeding to the final step, embedded execution.

The final step is embedded execution. This includes preparing the board, copying the input videos/images, snapshot file, and sub-graphs (generated in the first step) to the SD card, and using application software with the Vitis AI runtime APIs to execute the model/snapshot and generate inference results on the target. The following figure consolidates the steps of the embedded execution workflow.

Embedded Execution Workflow: Consolidated
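To round out the workflow, here is a hedged sketch of a target-side application loop that preprocesses an image and submits it to a runner created as in the VART sketch earlier; the preprocessing parameters, input layout, and int8 cast are placeholders that depend on the compiled model.

```python
# Target-side inference loop sketch. The preprocessing, input layout, and
# int8 cast are placeholders; a real application must apply the compiled
# model's expected resolution, layout, and quantization scaling.
import cv2
import numpy as np

def run_image(runner, image_path, size=(224, 224)):
    img = cv2.imread(image_path)               # HWC, BGR, uint8
    img = cv2.resize(img, size)
    in_buf = np.expand_dims(img, axis=0).astype(np.int8)  # placeholder cast

    out_t = runner.get_output_tensors()[0]
    out_buf = np.zeros(tuple(out_t.dims), dtype=np.int8)

    job_id = runner.execute_async([in_buf], [out_buf])    # submit job
    runner.wait(job_id)                                   # collect result
    return out_buf
```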