AMD Vitis™ AI EP Configuration File#
Provide a JSON configuration file using the config_file provider option when creating the ONNX Runtime inference session. The following example shows a complete configuration with commonly used options:
{
  "passes": [
    {
      "name": "init",
      "plugin": "vaip-pass_init"
    },
    {
      "name": "vaiml_partition",
      "plugin": "vaip-pass_vaiml_partition",
      "vaiml_config": {
        "device": "ve2-xc2ve3858",
        "optimize_level": 2,
        "logging_level": "info",
        "threshold_gops_percent": 20
      }
    }
  ],
  "target": "VAIML",
  "targets": [
    {
      "name": "VAIML",
      "pass": [
        "init",
        "vaiml_partition"
      ]
    }
  ]
}
The device field within vaiml_config is mandatory; all other vaiml_config options have defaults and can be omitted. See the individual option descriptions in the following sections for types, supported values, and defaults.
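As a minimal sketch of how this configuration might be wired into an application, the following Python builds the same structure as a dictionary, writes it to disk, and passes the file path through the config_file provider option. The helper names (build_vaiml_config, write_config, create_session) and the file name are hypothetical, and the provider name VitisAIExecutionProvider is an assumption not stated in this section; verify it against your installation.

```python
import json
from pathlib import Path


def build_vaiml_config(device, **vaiml_options):
    """Return the configuration dictionary for the given device.

    Only `device` is mandatory; extra keyword arguments are copied
    into the vaiml_config section unchanged.
    """
    vaiml_config = {"device": device}
    vaiml_config.update(vaiml_options)
    return {
        "passes": [
            {"name": "init", "plugin": "vaip-pass_init"},
            {
                "name": "vaiml_partition",
                "plugin": "vaip-pass_vaiml_partition",
                "vaiml_config": vaiml_config,
            },
        ],
        "target": "VAIML",
        "targets": [{"name": "VAIML", "pass": ["init", "vaiml_partition"]}],
    }


def write_config(config, path):
    """Serialize the configuration to a JSON file and return its path."""
    path = Path(path)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return path


def create_session(model_path, config_path):
    """Create an ONNX Runtime session that reads the configuration file.

    onnxruntime is imported lazily so the helpers above work without it.
    The provider name below is an assumption; check your installation.
    """
    import onnxruntime as ort

    return ort.InferenceSession(
        str(model_path),
        providers=["VitisAIExecutionProvider"],
        provider_options=[{"config_file": str(config_path)}],
    )
```

A typical call sequence would be write_config(build_vaiml_config("ve2-xc2ve3858"), "vaip_config.json") followed by create_session with your model path and the returned configuration path.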
Configuration Options#
The following options can be specified in the vaiml_config section of the configuration file.
device#
Controls the target device for compilation.
Type: String
Required: Yes
Supported values: See the Supported Devices for Compilation section
Example: "device": "ve2-xc2ve3558"
optimize_level#
Controls the compiler’s optimization level to balance performance, memory usage, and compile-time stability.
Type: Integer
Supported values: 1, 2, 3
Default: 2
"vaiml_config": {
"optimize_level": 2
}
Optimization Levels#
| Level | Description | Use Cases |
|---|---|---|
| 1 | Prioritizes stability with predictable memory management (fixed or ping-pong buffering). Maximum functional coverage with minimal compile-time risk. | Early development and debugging, very large models, maximum stability requirements |
| 2 | Enables advanced optimizations including kernel chaining and efficient L2 memory allocation. Falls back to DDR (external memory) if the model overflows L2 memory (NPU Mem Tiles). Reduces latency and minimizes DDR traffic. | Production builds, models benefiting from kernel chaining optimizations |
| 3 | Instructs the compiler to apply more aggressive latency optimizations beyond what is achieved through tensor parallelism size. | When a model fails to meet its latency targets even after tuning |
Important: Level 3 is an Early Access (EA) option. It might not be fully validated or production-ready; use it only when the standard optimization paths have been exhausted.
Note: Kernel chaining combines multiple operations to reduce memory transfers. L2 memory refers to on-chip NPU memory tiles. DDR is external memory accessed when on-chip memory is insufficient.
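The supported values and the Early Access status of level 3 can be checked on the application side before a configuration file is written. The following is a sketch of such a pre-flight check, not part of the Vitis AI EP; the function name resolve_optimize_level is hypothetical.

```python
import warnings

SUPPORTED_OPTIMIZE_LEVELS = (1, 2, 3)
DEFAULT_OPTIMIZE_LEVEL = 2  # documented default


def resolve_optimize_level(vaiml_config):
    """Return the effective optimize_level after validating it."""
    level = vaiml_config.get("optimize_level", DEFAULT_OPTIMIZE_LEVEL)
    # bool is a subclass of int in Python, so reject it explicitly.
    if (
        isinstance(level, bool)
        or not isinstance(level, int)
        or level not in SUPPORTED_OPTIMIZE_LEVELS
    ):
        raise ValueError(
            f"optimize_level must be one of {SUPPORTED_OPTIMIZE_LEVELS}, got {level!r}"
        )
    if level == 3:
        # Level 3 is documented as Early Access, so flag it to the caller.
        warnings.warn(
            "optimize_level 3 is an Early Access option and might not be "
            "fully validated",
            UserWarning,
        )
    return level
```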
Parallelism Configuration#
Data parallelism and tensor parallelism are strategies for distributing workload across the device. These can be configured independently based on your performance requirements.
dp_size#
Controls data parallelism, which instantiates the entire model multiple times across the device. With dp_size=4, four independent model instances process different inference requests simultaneously.
Type: Integer
Supported values: 1-6 (for ve2-xc2ve3558), 1-9 (for ve2-xc2ve3858)
Default: 1
Use data parallelism when:
You need to maximize throughput for concurrent requests
Your application handles multiple simultaneous inference requests (for example, processing multiple camera streams in video analytics)
The model fits comfortably within a single processing unit’s memory
"vaiml_config": {
"dp_size": 4
}
tp_size#
Controls tensor parallelism, which partitions a single inference request across multiple processing units. With tp_size=4, the computation for one request is divided into four parallel execution streams, reducing the time required to complete that request.
Type: Integer
Supported values: 0-6 (for ve2-xc2ve3558), 0-9 (for ve2-xc2ve3858)
Default: 0 (When set to 0, the compiler automatically selects an appropriate tp_size value based on the target device characteristics. For the ve2-xc2ve3858 device, the compiler resolves tp_size to 6.)
Use tensor parallelism when:
Minimizing per-request latency is critical
The model’s memory requirements exceed the capacity of a single processing unit
You process one inference request at a time or have low concurrency
"vaiml_config": {
"tp_size": 4
}
For more details on configuring data and tensor parallelism, refer to the Data Parallelism and Tensor Parallelism section.
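The per-device ranges listed for dp_size and tp_size can be encoded in a small lookup so that out-of-range values are caught before compilation. This is a sketch under the assumption that only the two devices named above are in use; the function name check_parallelism is hypothetical, and the compiler may enforce additional constraints not covered here.

```python
# Inclusive (min, max) ranges copied from the option descriptions above.
DP_SIZE_RANGE = {"ve2-xc2ve3558": (1, 6), "ve2-xc2ve3858": (1, 9)}
TP_SIZE_RANGE = {"ve2-xc2ve3558": (0, 6), "ve2-xc2ve3858": (0, 9)}


def check_parallelism(vaiml_config):
    """Validate dp_size and tp_size against the ranges for the device.

    Returns the effective values; a tp_size of 0 is reported as "auto"
    because the compiler then selects the value itself.
    """
    device = vaiml_config["device"]
    if device not in DP_SIZE_RANGE:
        raise ValueError(f"no parallelism ranges known for device {device!r}")

    dp_size = vaiml_config.get("dp_size", 1)  # documented default
    tp_size = vaiml_config.get("tp_size", 0)  # documented default (auto)

    checks = (
        ("dp_size", dp_size, DP_SIZE_RANGE[device]),
        ("tp_size", tp_size, TP_SIZE_RANGE[device]),
    )
    for name, value, (low, high) in checks:
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an integer, got {value!r}")
        if not low <= value <= high:
            raise ValueError(
                f"{name}={value} is outside {low}-{high} for {device}"
            )

    return {"dp_size": dp_size, "tp_size": "auto" if tp_size == 0 else tp_size}
```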
preferred_data_storage#
Controls whether intermediate data is stored in vectorized or unvectorized format. Convolution-heavy models (CNNs) perform better with vectorized data. GEMM-heavy models (Transformers) perform better with unvectorized data. The auto mode selects the optimal layout automatically.
Type: String
Supported values: "vectorized", "unvectorized", "auto"
Default: "auto"
"vaiml_config": {
"preferred_data_storage": "unvectorized"
}
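The guidance above (vectorized for convolution-heavy models, unvectorized for GEMM-heavy models, auto otherwise) can be captured in a small helper. This is only a sketch: the function name and the model family labels "cnn" and "transformer" are hypothetical and not values the compiler understands.

```python
def pick_data_storage(model_family=None):
    """Map a coarse model family label to a preferred_data_storage value.

    Unknown or missing labels fall back to "auto", which lets the
    compiler choose the layout.
    """
    layouts = {
        "cnn": "vectorized",          # convolution-heavy models
        "transformer": "unvectorized",  # GEMM-heavy models
    }
    key = (model_family or "").lower()
    return layouts.get(key, "auto")


# Build the vaiml_config fragment for a Transformer model.
fragment = {"preferred_data_storage": pick_data_storage("transformer")}
```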
threshold_gops_percent#
Routes each operator to the NPU or the CPU according to a GOPS (giga operations per second) threshold, given as a percentage. Operators above the threshold execute on the NPU; those below it execute on the CPU.
Type: Integer (percentage)
Supported values: 0-100
Default: 20
"vaiml_config": {
"threshold_gops_percent": 30
}
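When tuning the split between NPU and CPU, it can help to prepare several configurations that differ only in threshold_gops_percent and compare the results. The sketch below generates such variants from a base vaiml_config; the function name threshold_variants is hypothetical, and the choice of thresholds to try is up to you.

```python
import copy


def threshold_variants(vaiml_config, thresholds):
    """Return one copy of vaiml_config per threshold value.

    Each threshold must be an integer in the documented 0-100 range.
    The input dictionary is left unmodified.
    """
    variants = []
    for threshold in thresholds:
        if (
            isinstance(threshold, bool)
            or not isinstance(threshold, int)
            or not 0 <= threshold <= 100
        ):
            raise ValueError(
                f"threshold_gops_percent must be an integer in 0-100, got {threshold!r}"
            )
        variant = copy.deepcopy(vaiml_config)
        variant["threshold_gops_percent"] = threshold
        variants.append(variant)
    return variants
```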
logging_level#
Controls the verbosity of compiler logging output.
Type: String
Supported values: "info", "warning", "error"
Default: "error"
| Level | Description |
|---|---|
| info | Details about significant events or actions, including comparative information between options |
| warning | Recoverable issues and differences between options |
| error | Critical failures that prevent program continuation (limited details) |
keep_outputs#
Specifies whether to retain intermediate compilation files for debugging.
Type: Boolean
Supported values: true, false
Default: false
| Value | Description |
|---|---|
| true | The Vitis AI compiler preserves both the <cache-dir>/<cache-key>/<model>.rai file and the complete vaiml directory structure. |
| false | Only the <cache-dir>/<cache-key>/<model>.rai file is retained. |
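To see what a compilation left behind, you can build the expected .rai path from its three components and list the cache entry. This is a sketch only: the function names are hypothetical, the cache directory, cache key, and model name must be supplied by you, and the listing makes no assumption about where the vaiml directory is placed.

```python
from pathlib import Path


def rai_path(cache_dir, cache_key, model_name):
    """Return the expected <cache-dir>/<cache-key>/<model>.rai path."""
    return Path(cache_dir) / cache_key / f"{model_name}.rai"


def list_cache_entry(cache_dir, cache_key):
    """Return the sorted names found directly under <cache-dir>/<cache-key>.

    An empty list is returned when the cache entry does not exist yet.
    """
    entry = Path(cache_dir) / cache_key
    if not entry.is_dir():
        return []
    return sorted(child.name for child in entry.iterdir())
```

With keep_outputs set to false, the listing is expected to show only the .rai file; with true, additional compiler output should appear.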
ai_analyzer_enhanced_profiling#
Specifies whether to enable enhanced profiling in AI Analyzer during compilation. When enabled, additional registers are set to allow detailed performance data collection during runtime.
Type: Text
Supported values: control_instrumentation
Default: no enhanced profiling
profiling_runtime_config#
Provides a JSON field with additional configuration options for enhanced profiling at runtime. This field is optional and applies only when ai_analyzer_enhanced_profiling is enabled. The available options depend on the profiling features you enable; currently the only one is control_instrumentation.
control_instrumentation:
Type: Text
Supported values: peak_read_bandwidth, peak_write_bandwidth
Default: peak_read_bandwidth
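The two profiling options might be combined as in the following sketch. It assumes that both keys belong in vaiml_config like the other options on this page and that profiling_runtime_config is a nested object keyed by control_instrumentation; neither layout is spelled out above, so confirm it against a working configuration. The function name with_enhanced_profiling is hypothetical.

```python
import json

SUPPORTED_METRICS = ("peak_read_bandwidth", "peak_write_bandwidth")


def with_enhanced_profiling(vaiml_config, metric="peak_read_bandwidth"):
    """Return a copy of vaiml_config with enhanced profiling enabled.

    The nesting of profiling_runtime_config is an assumption made for
    illustration; the input dictionary is not modified.
    """
    if metric not in SUPPORTED_METRICS:
        raise ValueError(f"metric must be one of {SUPPORTED_METRICS}, got {metric!r}")
    updated = dict(vaiml_config)
    updated["ai_analyzer_enhanced_profiling"] = "control_instrumentation"
    updated["profiling_runtime_config"] = {"control_instrumentation": metric}
    return updated


# Print the resulting vaiml_config fragment as JSON.
print(json.dumps(with_enhanced_profiling({"device": "ve2-xc2ve3858"}), indent=2))
```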