AMD Vitis™ AI EP Configuration File#

Provide a JSON configuration file using the config_file provider option when creating the ONNX Runtime inference session. The following example shows a complete configuration with commonly used options:

{
  "passes": [
   {
     "name": "init",
     "plugin": "vaip-pass_init"
   },
   {
    "name": "vaiml_partition",
    "plugin": "vaip-pass_vaiml_partition",
    "vaiml_config":
    {
      "device": "ve2-xc2ve3858",
      "optimize_level": 2,
      "logging_level": "info",
      "threshold_gops_percent": 20
    }
   }
  ],
  "target": "VAIML",
  "targets": [
    {
        "name": "VAIML",
        "pass": [
            "init",
            "vaiml_partition"
        ]
    }
  ]
}

Within vaiml_config, only the device field is mandatory; all other options have defaults and can be omitted. See the individual option descriptions in the following sections for types, supported values, and defaults.
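As a minimal sketch of how such a file might be passed to ONNX Runtime from Python, the helpers below write a configuration to disk and build the provider arguments. The function names and the file name vitisai_config.json are illustrative assumptions, not part of the Vitis AI EP. The sketch also assumes the EP is registered under the provider name VitisAIExecutionProvider and that the installed onnxruntime build includes it.

```python
import json
from pathlib import Path

# Minimal configuration: only "device" is set in vaiml_config,
# so every other option falls back to its default.
CONFIG = {
    "passes": [
        {"name": "init", "plugin": "vaip-pass_init"},
        {
            "name": "vaiml_partition",
            "plugin": "vaip-pass_vaiml_partition",
            "vaiml_config": {"device": "ve2-xc2ve3858"},
        },
    ],
    "target": "VAIML",
    "targets": [{"name": "VAIML", "pass": ["init", "vaiml_partition"]}],
}


def write_config(path, config=CONFIG):
    """Serialize the configuration to a JSON file and return its path."""
    path = Path(path)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return path


def provider_args(config_path):
    """Build the providers / provider_options pair for InferenceSession."""
    providers = ["VitisAIExecutionProvider"]
    provider_options = [{"config_file": str(config_path)}]
    return providers, provider_options


def create_session(model_path, config_path):
    """Create the inference session (needs onnxruntime with the Vitis AI EP)."""
    import onnxruntime as ort  # lazy import: the helpers above work without it

    providers, provider_options = provider_args(config_path)
    return ort.InferenceSession(
        str(model_path), providers=providers, provider_options=provider_options
    )
```

A typical call sequence would be write_config("vitisai_config.json") followed by create_session("model.onnx", "vitisai_config.json"), where model.onnx is a placeholder for your own model.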

Configuration Options#

The following options can be specified in the vaiml_config section of the configuration file.
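The defaults documented for the individual options in this section can be summarized in a short Python sketch. The helper name effective_vaiml_config is hypothetical and exists only for illustration. The compiler applies its own defaults, so a helper like this is useful mainly for inspecting which values a partial configuration would resolve to.

```python
# Defaults as listed in the option descriptions in this section.
# Options without a documented concrete default value are left out.
VAIML_DEFAULTS = {
    "optimize_level": 2,
    "dp_size": 1,
    "tp_size": 0,  # 0 lets the compiler pick a value for the target device
    "preferred_data_storage": "auto",
    "threshold_gops_percent": 20,
    "logging_level": "error",
    "keep_outputs": False,
}


def effective_vaiml_config(user_config):
    """Overlay user-supplied options on the documented defaults."""
    if "device" not in user_config:
        # "device" is the only mandatory entry in vaiml_config.
        raise ValueError("vaiml_config requires a 'device' entry")
    return {**VAIML_DEFAULTS, **user_config}


if __name__ == "__main__":
    resolved = effective_vaiml_config(
        {"device": "ve2-xc2ve3858", "logging_level": "info"}
    )
    for key, value in resolved.items():
        print(f"{key}: {value}")
```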

device#

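Specifies the target device for compilation. This option is mandatory.

  • Type: String

  • Devices referenced in this document: ve2-xc2ve3558, ve2-xc2ve3858

Example: "device": "ve2-xc2ve3558"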

optimize_level#

Controls the compiler’s optimization level to balance performance, memory usage, and compile-time stability.

  • Type: Integer

  • Supported values: 1, 2, 3

  • Default: 2

"vaiml_config": {
  "optimize_level": 2
}

Optimization Levels#

| Level | Description | Use Cases |
|-------|-------------|-----------|
| 1 | Prioritizes stability with predictable memory management (fixed or ping-pong buffering). Maximum functional coverage with minimal compile-time risk. | Early development and debugging, very large models, maximum stability requirements |
| 2 | Enables advanced optimizations including kernel chaining and efficient L2 memory allocation. Falls back to DDR (external memory) if model overflows L2 memory (NPU Mem Tiles). Reduces latency and minimizes DDR traffic. | Production builds, models benefiting from kernel chaining optimizations |
| 3 | Instructs the compiler to apply more aggressive latency optimizations beyond what is achieved through tensor parallelism size (tp_size) alone. O3 can be stacked on top of any tp_size setting. | When a model fails to meet its latency targets even after tuning tp_size. It is intended as a next-step option for squeezing out additional performance in latency-sensitive workloads. |

Important: optimize_level 3 is an Early Access (EA) option. It might not be fully validated or production-ready, so use it only when the standard optimization paths have been exhausted.

Note: Kernel chaining combines multiple operations to reduce memory transfers. L2 memory refers to on-chip NPU memory tiles. DDR is external memory accessed when on-chip memory is insufficient.

Parallelism Configuration#

Data parallelism and tensor parallelism are strategies for distributing workload across the device. These can be configured independently based on your performance requirements.

dp_size#

Controls data parallelism, which instantiates the entire model multiple times across the device. With dp_size=4, four independent model instances process different inference requests simultaneously.

  • Type: Integer

  • Supported values: 1-6 (for ve2-xc2ve3558), 1-9 (for ve2-xc2ve3858)

  • Default: 1

Use data parallelism when:

  • You need to maximize throughput for concurrent requests

  • Your application handles multiple simultaneous inference requests (for example, processing multiple camera streams in video analytics)

  • Model size fits comfortably within a single processing unit’s memory

"vaiml_config": {
  "dp_size": 4
}

tp_size#

Controls tensor parallelism, which partitions a single inference request across multiple processing units. With tp_size=4, the computation for one request is divided into four parallel execution streams, reducing the time required to complete that request.

  • Type: Integer

  • Supported values: 0-6 (for ve2-xc2ve3558), 0-9 (for ve2-xc2ve3858)

  • Default: 0 (When set to 0, the compiler automatically selects an appropriate tp_size value based on the target device characteristics. For the ve2-xc2ve3858 device, the compiler resolves tp_size to 6.)

Use tensor parallelism when:

  • Minimizing per-request latency is critical

  • The model’s memory requirements exceed the capacity of a single processing unit

  • You process one inference request at a time or have low concurrency

"vaiml_config": {
  "tp_size": 4
}

For more details on configuring data and tensor parallelism, refer to the Data Parallelism and Tensor Parallelism section.
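The supported ranges above can be checked before compilation with a small Python sketch. The function name check_parallelism is hypothetical. The bounds are the ones listed for dp_size and tp_size in this section, and the sketch does not model any interaction between the two settings, which this section does not specify.

```python
# Upper bounds per device, taken from the dp_size / tp_size descriptions.
MAX_PARALLELISM = {
    "ve2-xc2ve3558": 6,
    "ve2-xc2ve3858": 9,
}


def check_parallelism(device, dp_size=1, tp_size=0):
    """Validate dp_size and tp_size against the documented per-device ranges.

    Returns a vaiml_config fragment on success and raises ValueError otherwise.
    """
    if device not in MAX_PARALLELISM:
        raise ValueError(f"no documented parallelism range for device {device!r}")
    upper = MAX_PARALLELISM[device]
    if not isinstance(dp_size, int) or not 1 <= dp_size <= upper:
        raise ValueError(f"dp_size must be an integer in 1-{upper} for {device}")
    # tp_size additionally accepts 0, which means "let the compiler choose".
    if not isinstance(tp_size, int) or not 0 <= tp_size <= upper:
        raise ValueError(f"tp_size must be an integer in 0-{upper} for {device}")
    return {"device": device, "dp_size": dp_size, "tp_size": tp_size}


if __name__ == "__main__":
    print(check_parallelism("ve2-xc2ve3858", dp_size=4))
    print(check_parallelism("ve2-xc2ve3558", tp_size=4))
```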

preferred_data_storage#

Controls whether intermediate data is stored in vectorized or unvectorized format. Convolution-heavy models (CNNs) perform better with vectorized data. GEMM-heavy models (Transformers) perform better with unvectorized data. The auto mode selects the optimal layout automatically.

  • Type: String

  • Supported values: "vectorized", "unvectorized", "auto"

  • Default: "auto"

"vaiml_config": {
  "preferred_data_storage": "unvectorized"
}

threshold_gops_percent#

Directs operators to NPU or CPU based on their GOPS (Giga Operations Per Second) performance threshold. Operators above the threshold execute on the NPU; those below execute on the CPU.

  • Type: Integer (percentage)

  • Supported values: 0-100

  • Default: 20

"vaiml_config": {
  "threshold_gops_percent": 30
}

logging_level#

Controls the verbosity of compiler logging output.

  • Type: String

  • Supported values: "info", "warning", "error"

  • Default: "error"

| Level | Description |
|-------|-------------|
| info | Details about significant events or actions, including comparative information between options |
| warning | Recoverable issues and differences between options |
| error | Critical failures that prevent program continuation (limited details) |

keep_outputs#

Specifies whether to retain intermediate compilation files for debugging.

  • Type: Boolean

  • Supported values: true, false

  • Default: false

| Value | Description |
|-------|-------------|
| true | The Vitis AI compiler preserves both the <cache-dir>/<cache-key>/<model>.rai file and the complete vaiml directory structure. |
| false | Only the <cache-dir>/<cache-key>/<model>.rai file is retained. |
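To see what a compilation actually left behind, the cache layout described above can be inspected with a short Python sketch. Both helper names are hypothetical. The sketch assumes that the model placeholder corresponds to a model name you supply, and that the cache directory and cache key are known from your session setup.

```python
from pathlib import Path


def expected_rai_path(cache_dir, cache_key, model_name):
    """Build <cache-dir>/<cache-key>/<model>.rai from its three parts."""
    return Path(cache_dir) / cache_key / f"{model_name}.rai"


def list_retained_files(cache_dir, cache_key):
    """Return every file kept under <cache-dir>/<cache-key>, as sorted relative paths."""
    root = Path(cache_dir) / cache_key
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
```

With keep_outputs set to false, list_retained_files would be expected to report only the .rai file; with true, it would also list files from the preserved vaiml directory structure.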

ai_analyzer_enhanced_profiling#

Specifies whether to enable enhanced profiling in AI Analyzer during compilation. When enabled, additional registers are set to allow detailed performance data collection during runtime.

  • Type: String

  • Supported values: control_instrumentation

  • Default: no enhanced profiling

profiling_runtime_config#

Provides a JSON field with additional configuration options for enhanced profiling during runtime. This field is optional and applies only when ai_analyzer_enhanced_profiling is enabled. The available options depend on the profiling features you enable; currently, only control_instrumentation is supported.

control_instrumentation:

  • Type: String

  • Supported values: peak_read_bandwidth, peak_write_bandwidth

  • Default: peak_read_bandwidth