VART ML Architecture Overview#

VART-ML is a high-performance C++ runtime interface for ML inference on AMD hardware. It supports fully offloaded models—the compiled graph runs entirely on the NPU—and models compiled with the CPU partition feature, which enables heterogeneous NPU/CPU execution using compiler-supported CPU operators. For models that require runtime CPU fallback or broad ONNX operator coverage not handled by CPU partition compilation, use ONNX Runtime with the Vitis AI Execution Provider instead.

VART-ML is designed to:

Execute fully offloaded models on the NPU, and CPU-partitioned models across NPU and CPU subgraphs as defined at compile time.
Expose explicit control of tensor metadata and buffer ownership.
Support both CPU-view and hardware-view tensor flows.
Enable zero-copy execution when hardware-visible memory is used.

Note

For a runtime-selection comparison between VART-ML and ONNX Runtime + Vitis AI EP, see Introduction: VART X and VART ML.

Layered Architecture#

Layer	Role
Application layer	Owns pipeline control, creates runners, allocates/wraps buffers, and invokes sync/async execution.
VART-ML API layer	Public abstractions: `NpuTensor`, `Runner`, and `RunnerFactory`.
Runner implementation layer	Backend implementation built on the Vitis AI compiler and runtime stack. Handles model loading, metadata, execution, and tensor allocation helpers.
Runtime/platform layer	Model scheduling, NPU execution, device management, and hardware-backed buffer allocation.

Core Abstractions#

VART-ML centers around three abstractions:

NpuTensor: Represents tensor metadata and wraps user-allocated or runner-allocated buffers. Supports explicit buffer synchronization (sync_buffer) and DMA-BUF file descriptor export (export_buffer) for inter-process or inter-device sharing.
Runner: Loads models, exposes metadata/quantization data, executes inference, and provides tensor allocation helpers including sub-tensor views (allocate_sub_tensor) for efficient batch memory management.
RunnerFactory: Creates runner instances for supported backends.

Execution Architecture#

Typical integration workflow:

Create a Runner via RunnerFactory for a compiled model.
Query tensor metadata in CPU or HW view.
Allocate or wrap buffers and construct NpuTensor objects.
Execute inference using synchronous execute or asynchronous execute_async APIs.
Consume outputs in CPU or HW view depending on postprocessing path.

Synchronous flow:

execute blocks until completion.

Asynchronous flows:

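execute_async followed by wait: the caller blocks on the returned job handle until the run completes.
execute_async with a callback: the runner notifies the application through the callback when the run completes.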
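The three flows above can be sketched with a small mock. This is a minimal illustration built on `std::async`, not the VART-ML API: `MockRunner`, `Tensor`, `Callback`, the `StatusCode` values, and all signatures are hypothetical stand-ins chosen to mirror the method names in the text.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <map>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; NOT the VART-ML types or signatures.
enum class StatusCode { SUCCESS, FAILURE };
using Tensor = std::vector<float>;
using Callback = std::function<void(StatusCode)>;

class MockRunner {
public:
    // Synchronous flow: returns only after the "inference" has finished.
    StatusCode execute(const Tensor& in, Tensor& out) noexcept {
        try {
            out = in;
            for (float& v : out) v *= 2.0f;  // placeholder for real inference
            return StatusCode::SUCCESS;
        } catch (...) {
            return StatusCode::FAILURE;
        }
    }

    // Asynchronous flow: starts the job and immediately returns a job handle.
    // An optional callback runs on the worker thread when the job completes.
    // The caller must keep `in` and `out` alive until then.
    int execute_async(const Tensor& in, Tensor& out, Callback cb = nullptr) {
        std::lock_guard<std::mutex> lock(mutex_);
        const int job = next_job_++;
        jobs_[job] = std::async(std::launch::async, [this, &in, &out, cb] {
            const StatusCode status = execute(in, out);
            if (cb) cb(status);
            return status;
        });
        return job;
    }

    // Blocks until the job behind the handle has completed.
    StatusCode wait(int job) {
        std::future<StatusCode> pending;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            auto it = jobs_.find(job);
            if (it == jobs_.end()) return StatusCode::FAILURE;  // unknown handle
            pending = std::move(it->second);
            jobs_.erase(it);
        }
        return pending.get();
    }

private:
    std::mutex mutex_;
    int next_job_ = 0;
    std::map<int, std::future<StatusCode>> jobs_;
};
```

In this mock, a handle can be waited on only once; whether the real `wait` behaves the same way is not stated here and should be checked against the API reference.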

Tensor and Memory Architecture#

VART-ML supports two tensor views:

CPU TensorType: Tensor metadata as defined by the ONNX model (standard shapes, data types, and layouts). Use CPU tensors for simplicity – the Runner converts between CPU and HW formats internally, at the cost of a data copy.
HW TensorType: AMD NPU-native tensor metadata. Shape, data type, and memory layout might differ from the CPU view. For example, a model with CPU format NCHW, FP32, [1,3,224,224] might have HW format HCWNC4, BF16, [224,1,224,1,4]. Use HW tensors for zero-copy performance – data goes to the NPU without conversion.

Input and output tensor types can be configured independently (for example, HW input with CPU output).
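The shape example above can be reproduced with a little arithmetic. The sketch below assumes that HCWNC4 orders dimensions as [H, ceil(C/4), W, N, 4], that is, channels are padded up to a multiple of 4 and split into blocks. That reading matches the [224,1,224,1,4] example but is an inference, not a documented definition; the function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed reading of HCWNC4: [H, ceil(C/4), W, N, 4].
// Inferred from the example in the text, not taken from the VART-ML headers.
std::array<std::int64_t, 5> nchw_to_hcwnc4_shape(const std::array<std::int64_t, 4>& nchw) {
    const std::int64_t n = nchw[0], c = nchw[1], h = nchw[2], w = nchw[3];
    assert(n > 0 && c > 0 && h > 0 && w > 0);
    const std::int64_t block = 4;                      // inner channel block size
    return {h, (c + block - 1) / block, w, n, block};  // round channels up to whole blocks
}

// Total buffer size in bytes for a shape and an element width.
template <std::size_t N>
std::size_t byte_size(const std::array<std::int64_t, N>& shape, std::size_t elem_bytes) {
    std::size_t count = 1;
    for (std::int64_t d : shape) count *= static_cast<std::size_t>(d);
    return count * elem_bytes;
}
```

Under this assumption, the CPU view [1,3,224,224] in FP32 (4 bytes per element) occupies 602,112 bytes, while the HW view [224,1,224,1,4] in BF16 (2 bytes per element) occupies 401,408 bytes. The two buffers differ in size as well as layout, so the sizes reported by the tensor metadata should be used rather than derived from the ONNX shape.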

Each NpuTensor has a memory type that determines where the buffer lives and whether zero-copy is possible:

XRT_BO – XRT Buffer Object, device-accessible CMA memory. Use when the Runner allocates memory (allocate_npu_tensor) or when the application already manages XRT BOs.
DMA_FD – DMA file descriptor. Recommended for application-allocated zero-copy buffers; portable across Linux subsystems (V4L2, dma_heap, ISP, video decoder).
USER_POINTER_CMA – User-provided pointer to physically contiguous (CMA) memory. Use when you have a CMA buffer from another allocator.
USER_POINTER_NON_CMA – Standard host memory (new, malloc). Use for standard workflows without hardware awareness. Not physically contiguous; the Runner copies data internally.

Zero-copy behavior:

Supported when TensorType = HW and memory type is XRT_BO, DMA_FD, or USER_POINTER_CMA.
Invalid: TensorType = HW with USER_POINTER_NON_CMA – this combination throws std::runtime_error at NpuTensor construction because non-CMA memory is not accessible by the NPU.
Not zero-copy: TensorType = CPU with any memory type – the Runner performs format conversion internally, which involves a data copy.

The same execution APIs support both zero-copy and non-zero-copy paths; tensor view and memory choice determine behavior.
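The rules above reduce to a small decision function. The sketch below encodes them as stated in this section; the enum and function names (`MemoryType`, `DataPath`, `classify`, `require_valid`) are hypothetical and may not match the identifiers in the public header.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical enums mirroring the names used in the text.
enum class TensorType { CPU, HW };
enum class MemoryType { XRT_BO, DMA_FD, USER_POINTER_CMA, USER_POINTER_NON_CMA };
enum class DataPath { ZeroCopy, Copy, Invalid };

// Encodes the zero-copy rules listed above.
DataPath classify(TensorType tensor, MemoryType memory) {
    if (tensor == TensorType::CPU) {
        return DataPath::Copy;     // format conversion always copies
    }
    if (memory == MemoryType::USER_POINTER_NON_CMA) {
        return DataPath::Invalid;  // the NPU cannot access non-CMA memory
    }
    return DataPath::ZeroCopy;     // HW view on device-visible memory
}

// Mirrors the documented behavior of rejecting the invalid combination early.
void require_valid(TensorType tensor, MemoryType memory) {
    if (classify(tensor, memory) == DataPath::Invalid) {
        throw std::runtime_error("HW tensor requires XRT_BO, DMA_FD, or USER_POINTER_CMA memory");
    }
}
```

A check like this can run in application code before any tensor is constructed, so a misconfigured pipeline fails with a clear message at setup time.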

NpuTensor Ownership#

User-constructed tensors: NpuTensor does not take ownership of the buffer. The caller must keep it valid for the tensor’s lifetime.
Runner-allocated tensors (via allocate_npu_tensor): The buffer is owned by the NpuTensor and freed automatically when it goes out of scope (RAII).
Sub-tensors (via allocate_sub_tensor): Share the parent tensor’s buffer via reference counting. The underlying memory is released only after both the parent and all derived sub-tensors are destroyed. Only one level of nesting is supported.
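The lifetime rules for runner-allocated tensors and sub-tensors behave like shared ownership of one buffer. The sketch below models that with `std::shared_ptr`; it is an analogy under that assumption, not the VART-ML implementation, and `TensorView`, `allocate`, `sub`, and `watch` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical model of the ownership rules; not the VART-ML implementation.
using Buffer = std::vector<unsigned char>;

class TensorView {
public:
    // Models a runner-allocated tensor: the view owns its buffer (RAII).
    static TensorView allocate(std::size_t size) {
        return TensorView(std::make_shared<Buffer>(size), 0, size, false);
    }

    // Models a sub-tensor: shares the parent's buffer by reference counting.
    TensorView sub(std::size_t offset, std::size_t size) const {
        if (is_sub_) throw std::logic_error("only one level of nesting is supported");
        if (offset > size_ || size > size_ - offset) {
            throw std::out_of_range("sub-tensor exceeds parent bounds");
        }
        return TensorView(buffer_, offset_ + offset, size, true);
    }

    unsigned char* data() const { return buffer_->data() + offset_; }
    std::size_t size() const { return size_; }

    // Lets a caller observe when the underlying memory is released.
    std::weak_ptr<Buffer> watch() const { return buffer_; }

private:
    TensorView(std::shared_ptr<Buffer> buffer, std::size_t offset, std::size_t size, bool is_sub)
        : buffer_(std::move(buffer)), offset_(offset), size_(size), is_sub_(is_sub) {}

    std::shared_ptr<Buffer> buffer_;
    std::size_t offset_;
    std::size_t size_;
    bool is_sub_;
};
```

In this model a sub-view stays valid after its parent is destroyed, and the memory is freed only when the last view goes away. User-constructed tensors are the opposite case: nothing in the tensor keeps the buffer alive, so the application has to.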

Error Handling#

VART-ML uses a split error handling strategy:

Construction and queries (create_runner, allocate_npu_tensor, get_tensor_info_by_name): Throw std::runtime_error or std::invalid_argument.
Execution (execute, execute_async, wait): Return StatusCode, marked noexcept.
Accessors (get_buffer, get_virtual_address): Return sentinel values (nullptr, 0) on failure.
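On the calling side, the split strategy means three different checks: a `try`/`catch` around setup, a status comparison after execution, and a sentinel test on accessor results. The sketch below shows that pattern against hypothetical stand-ins (`DemoRunner`, `DemoTensor`, and the `StatusCode` values are assumptions, not the real classes or enumerators).

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins; the real StatusCode enumerators are not listed in this section.
enum class StatusCode { SUCCESS, INVALID_INPUT };

class DemoRunner {
public:
    // Construction path: reports problems by throwing.
    explicit DemoRunner(const std::string& model_path) {
        if (model_path.empty()) throw std::invalid_argument("model path is empty");
    }

    // Execution path: never throws, reports problems through the return value.
    StatusCode execute(const float* in, float* out, std::size_t n) noexcept {
        if (in == nullptr || out == nullptr) return StatusCode::INVALID_INPUT;
        for (std::size_t i = 0; i < n; ++i) out[i] = in[i];  // placeholder for inference
        return StatusCode::SUCCESS;
    }
};

struct DemoTensor {
    float* buffer = nullptr;
    // Accessor path: returns a sentinel (nullptr) instead of throwing.
    float* get_buffer() const noexcept { return buffer; }
};

// Caller-side pattern covering all three error channels.
bool run_once(const std::string& model_path, const DemoTensor& in, const DemoTensor& out,
              std::size_t n) {
    try {
        DemoRunner runner(model_path);                                         // may throw
        if (in.get_buffer() == nullptr || out.get_buffer() == nullptr) return false;  // sentinel
        return runner.execute(in.get_buffer(), out.get_buffer(), n) == StatusCode::SUCCESS;
    } catch (const std::exception&) {
        return false;
    }
}
```

Keeping exceptions out of the execution path lets the per-frame loop stay free of `try` blocks, while setup errors still carry a descriptive message.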

Thread Safety#

A single Runner instance can be shared across threads via std::shared_ptr.
execute() and execute_async() can be called concurrently from multiple threads.
wait() calls are thread-safe.
NpuTensor copies sharing the same buffer can be used from different threads.
Async callbacks are invoked from internal worker threads. Ensure thread safety when accessing shared resources in callbacks.
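Because callbacks run on internal worker threads, any state they touch needs its own synchronization. The sketch below shows one way to do that with a mutex-guarded result log; plain `std::thread` objects stand in for the runner's worker threads, and every name in it is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Shared state written from callbacks; the mutex makes concurrent calls safe.
class ResultLog {
public:
    void record(int job_id) {
        std::lock_guard<std::mutex> lock(mutex_);
        ids_.push_back(job_id);
    }
    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return ids_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<int> ids_;
};

// Stand-in for the runner's worker threads: each job invokes the callback
// from its own thread, so callbacks may run concurrently.
void run_jobs_with_callbacks(int n_jobs, const std::function<void(int)>& on_done) {
    std::vector<std::thread> workers;
    for (int job = 0; job < n_jobs; ++job) {
        workers.emplace_back([job, &on_done] { on_done(job); });
    }
    for (std::thread& worker : workers) worker.join();
}
```

A callback such as `[&log](int id) { log.record(id); }` is then safe to pass in, whereas pushing into an unguarded `std::vector` from the same callback would be a data race.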

RunnerType::VAIML (Versal AI Edge Series Gen 2)#

The following sections are specific to RunnerType::VAIML.

Model loading modes#

RunnerType::VAIML accepts the compiled model in one of two artifact forms:

Directory-based: Load from a compiled model directory containing a vaiml_par_0 partition subdirectory with compiled model artifacts (graph definitions, metadata, and NPU binaries produced by the Vitis AI compiler).
.rai memory-mapped: Load a .rai file (single-file FlatBuffer archive of the compiled model) through memory mapping.

Configuration Options#

Runner creation accepts key-value options (std::unordered_map<std::string, std::any>) passed to RunnerFactory::create_runner(). All options are optional; defaults are listed in the following tables.

Tensor Configuration:

Key	Data Type	Description	Default
`input_tensor_type`	String	Sets input tensor type: `"HW"` (hardware-native) or `"CPU"` (ONNX-compatible).	`"HW"`
`output_tensor_type`	String	Sets output tensor type: `"HW"` (hardware-native) or `"CPU"` (ONNX-compatible).	`"HW"`
`skip_in_bo_sync`	Boolean	Skips the runner’s internal input buffer sync for HW tensor types. When `true`, the application must sync input tensors itself before inference. For runner-allocated tensors (`allocate_npu_tensor` / `allocate_sub_tensor`), call `NpuTensor::sync_buffer`; for wrapped application buffers, sync is the application’s responsibility.	`false`
`skip_out_bo_sync`	Boolean	Skips the runner’s internal output buffer sync for HW tensor types. When `true`, the application must sync output tensors itself before reading them. For runner-allocated tensors (`allocate_npu_tensor` / `allocate_sub_tensor`), call `NpuTensor::sync_buffer`; for wrapped application buffers, sync is the application’s responsibility.	`false`

NPU Resource Configuration:

Key	Data Type	Description	Default
`cma_index`	Integer	CMA memory bank index on which XRT BOs should be allocated by `Runner::allocate_npu_tensor`	`0`
`aie_columns_sharing`	Boolean	Sets access mode of the AI Engine columns; `true` for shared, `false` for exclusive access mode	`true`
`start_column`	Unsigned Integer	Starting column index where the model is loaded	Chosen based on column availability

Async Execution Configuration:

Key	Data Type	Description	Default
`async_threadpool_depth`	Unsigned Integer	Number of threads in the thread pool for asynchronous execution	`10`
`max_concurrent_runs`	Unsigned Integer	Maximum number of concurrent asynchronous runs allowed	`2`
`callback_order`	String	Order in which callbacks are invoked for asynchronous runs. Accepted values: `"submission"`, `"completion"`	`"submission"`
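The difference between the two `callback_order` values can be illustrated with a small simulation. The sketch assumes that `"submission"` means a callback is held back until the callbacks of all earlier-submitted jobs have fired, and that `"completion"` means callbacks fire as jobs finish. This interpretation is an assumption based on the option names; `callback_sequence` is a hypothetical helper, not part of the API.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Jobs are numbered 0..n-1 in submission order. `completion_order` lists the
// job ids in the order they finish. Returns the order in which callbacks fire
// under the assumed semantics of each mode.
std::vector<int> callback_sequence(const std::vector<int>& completion_order,
                                   const std::string& mode) {
    if (mode == "completion") return completion_order;  // fire as jobs finish
    if (mode != "submission") throw std::invalid_argument("unknown callback_order: " + mode);

    std::vector<int> fired;
    std::set<int> finished;  // finished jobs whose callbacks are still held back
    int next = 0;            // next job id allowed to fire
    for (int id : completion_order) {
        finished.insert(id);
        while (finished.count(next) != 0) {  // release every callback now in order
            fired.push_back(next);
            finished.erase(next);
            ++next;
        }
    }
    return fired;
}
```

With jobs finishing in the order 2, 0, 1, 3, this model fires callbacks as 0, 1, 2, 3 in `"submission"` mode and as 2, 0, 1, 3 in `"completion"` mode. If the real runner behaves this way, `"completion"` reduces latency for fast jobs at the cost of out-of-order results.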

General:

Key	Data Type	Description	Default
`log_level`	String	Logging verbosity. Accepted values: `"ERROR"`, `"WARNING"`, `"INFO"`, `"DEBUG"`	`"INFO"`
`debug`	Boolean	Enables or disables compiler debug messages	`false`
`config_json`	String	Path to the Vitis AI configuration file (`vitis_ai_config.json`), same configuration file used during model compilation	Null
`cache_path`	String	Extraction path for .rai file loading. Only used with .rai model files	Defaults to `vaiml_cache_<model>_<pid>_<tid>` in the current directory
`ai_analyzer_profiling`	Boolean	Enables/disables AI Analyzer profiling logs	`false`

Note

ai_analyzer_profiling adds memory and performance overhead. Use it only for short development and analysis runs—not for long-running sessions or production deployment.
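An options map of the documented shape can be built as follows. The keys and string values come from the tables above; the exact C++ types the runner expects inside each `std::any` (for example `std::string` versus `const char*`, or `unsigned` versus a fixed-width integer) are not stated here, so the types below are assumptions. `Options`, `make_cpu_view_options`, and `option_or` are hypothetical helpers. Requires C++17.

```cpp
#include <any>
#include <cassert>
#include <stdexcept>
#include <string>
#include <unordered_map>

using Options = std::unordered_map<std::string, std::any>;

// Builds a map that overrides a few documented defaults.
// Value types are assumptions; std::any_cast needs an exact type match, so a
// bare string literal (stored as const char*) would NOT read back as std::string.
Options make_cpu_view_options() {
    Options options;
    options["input_tensor_type"] = std::string("CPU");
    options["output_tensor_type"] = std::string("CPU");
    options["max_concurrent_runs"] = 4u;
    options["callback_order"] = std::string("completion");
    options["log_level"] = std::string("WARNING");
    return options;
}

// Hypothetical helper: reads a key, falling back to the documented default.
template <typename T>
T option_or(const Options& options, const std::string& key, T fallback) {
    const auto it = options.find(key);
    if (it == options.end()) return fallback;
    if (const T* value = std::any_cast<T>(&it->second)) return *value;
    throw std::invalid_argument("option '" + key + "' holds an unexpected type");
}
```

The map returned by `make_cpu_view_options()` is the kind of object the text describes passing to `RunnerFactory::create_runner()`; the remaining arguments of that call are not shown in this section, so they are left out here.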

Supported Layouts and Data Types#

The RunnerType::VAIML backend currently supports the following memory layouts. The MemoryLayout enum in the public header defines the full set across all backends.

NHW
NHWC
NCHW
HCWNC4
HCWNC8
HCWNC16
GENERIC

When the memory layout is GENERIC, the NpuTensorInfo::memory_layout_order vector specifies the dimension permutation order relative to the CPU tensor format.
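Applying a permutation order to a CPU shape can be sketched as below. The sketch assumes that entry i of the order names the CPU-format axis that ends up at position i of the hardware layout; the direction of the mapping and the element type of `memory_layout_order` are assumptions, and `permute_shape` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Assumed convention: hw_shape[i] = cpu_shape[order[i]].
// Rejects orders that are not a valid permutation of the CPU axes.
std::vector<std::int64_t> permute_shape(const std::vector<std::int64_t>& cpu_shape,
                                        const std::vector<std::size_t>& order) {
    if (order.size() != cpu_shape.size()) {
        throw std::invalid_argument("order and shape must have the same rank");
    }
    std::vector<bool> seen(order.size(), false);
    std::vector<std::int64_t> hw_shape;
    hw_shape.reserve(order.size());
    for (std::size_t axis : order) {
        if (axis >= cpu_shape.size() || seen[axis]) {
            throw std::invalid_argument("order is not a permutation of the axes");
        }
        seen[axis] = true;
        hw_shape.push_back(cpu_shape[axis]);
    }
    return hw_shape;
}
```

Under this convention, an NCHW shape of [1,3,224,224] with order {0,2,3,1} yields [1,224,224,3], which corresponds to NHWC.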

The RunnerType::VAIML backend currently supports the following data types. The DataType enum in the public header defines the full set across all backends.

BOOLEAN
INT8
UINT8
INT16
UINT16
BF16
FP16
INT32
UINT32
FLOAT32
INT64
UINT64

Note

For detailed memory layout semantics and CPU/NPU transformations, see Tensor Format Conversions.

Notes and Recommendations#

To maximize performance:

Prefer zero-copy flow using HW tensors with device-visible memory.
Reuse vart::NpuTensor instances across runs to avoid repeated per-call construction overhead; see the NpuTensor Caching section in VART Application Development.
