Multi-Tenancy: Spatial and Temporal Sharing#

Overview#

This guide covers multi-tenancy on Versal AI Edge Series Gen 2: running several independently compiled models concurrently on the same device by assigning NPU column ranges at runtime. It explains spatial sharing and temporal sharing, how compile-time dp_size and tp_size determine each model’s column footprint on the NPU, and how to configure placement through VART Runner options when you create a runner for each model.

For compile-time data parallelism and tensor parallelism (what dp_size and tp_size mean at compilation, and how to set them in vitisai_config.json), see Data Parallelism and Tensor Parallelism.

For step-by-step compilation and inference walkthroughs, see ../cpp_examples/vart_multi_tenancy_guide and the vart_multi_tenancy reference application. The reference application is a C++ sample that loads multiple model caches from a JSON file and runs them with spatial, temporal, or combined column placement.

Note

Complete Run Your First Inference and Docker Setup before this chapter. Review Data Parallelism and Tensor Parallelism for dp_size, tp_size, and processing-unit concepts used below.

How `dp_size` and `tp_size` Map to NPU Columns#

At compile time, the AMD Vitis™ AI compiler places your model on one or more NPU compute units in the AI Engine array. Processing units are the compute elements that execute model operations; each processing unit corresponds to one NPU compute unit. On ve2-xc2ve3558 and ve2-xc2ve3858 devices, each NPU compute unit uses a 4×4 block of 16 AI Engine tiles (NPU_compute_unit_size = 16 in the compiler guide). See Data Parallelism and Tensor Parallelism for tile-level resource formulas.

At runtime, multi-tenancy placement is expressed in NPU columns, not tiles. The device exposes a fixed number of NPU columns (24 on ve2-xc2ve3558, 36 on ve2-xc2ve3858, numbered from 0). One NPU compute unit = four columns.

The total number of columns one compiled model occupies is:

occupied_columns = dp_size × tp_size × 4

This matches compile-time tile usage:

AI Engine tiles used = dp_size × tp_size × NPU_compute_unit_size

where dp_size and tp_size are set in vitisai_config.json.
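
As a quick check of the two formulas above, the following C++ sketch computes both values. The helper names (`occupied_columns`, `tiles_used`) and the constants are illustrative and are not part of VART or the compiler. The factors 4 and 16 are the per-unit values this guide gives for ve2-xc2ve3558 and ve2-xc2ve3858.

```cpp
#include <cassert>
#include <cstdint>

// Per-unit values stated in this guide for ve2-xc2ve3558 and ve2-xc2ve3858.
constexpr std::uint32_t kColumnsPerUnit = 4;   // one NPU compute unit = four columns
constexpr std::uint32_t kTilesPerUnit = 16;    // NPU_compute_unit_size

// Hypothetical helper: NPU columns occupied by one compiled model.
inline std::uint32_t occupied_columns(std::uint32_t dp_size, std::uint32_t tp_size) {
    assert(dp_size > 0 && tp_size > 0);
    return dp_size * tp_size * kColumnsPerUnit;
}

// Hypothetical helper: AI Engine tiles used by the same model.
inline std::uint32_t tiles_used(std::uint32_t dp_size, std::uint32_t tp_size) {
    assert(dp_size > 0 && tp_size > 0);
    return dp_size * tp_size * kTilesPerUnit;
}
```

For example, `occupied_columns(2, 2)` evaluates to 16 columns and `tiles_used(2, 2)` to 64 tiles, which is four columns of 16 tiles each per unit times four units.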

`tp_size` (tensor parallelism)#

tp_size is the number of NPU compute units that cooperate on one inference request. The compiler shards that request across those units to reduce per-request latency (or to fit a large model across more than one unit). Those tp_size units are laid out in adjacent 4-column blocks within each data-parallel replica.

Example: dp_size=1, tp_size=4 uses four units in a row (16 columns total) for a single replica:

dp_size=1, tp_size=4, start_column=0

+----------+----------+----------+----------+
| TP unit 0| TP unit 1| TP unit 2| TP unit 3|
| cols 0-3 | cols 4-7 | cols 8-11| cols12-15|
+----------+----------+----------+----------+
                one inference

`dp_size` (data parallelism)#

dp_size is the number of independent replicas of the model on the device. Each replica has its own tp_size units and can process a separate inference request at the same time, which increases throughput.

Example: dp_size=4, tp_size=1 places four replicas side by side (16 columns). Each replica uses one unit (four columns):

dp_size=4, tp_size=1, start_column=0

+----------+ +----------+ +----------+ +----------+
| Repl. 0  | | Repl. 1  | | Repl. 2  | | Repl. 3  |
| cols 0-3 | | cols 4-7 | | cols 8-11| | cols12-15|
+----------+ +----------+ +----------+ +----------+
   req A        req B        req C        req D

Combined `dp_size > 1` and `tp_size > 1`#

When both are greater than 1, the overlay allocates dp_size × tp_size compute units in one contiguous column range. Each of the dp_size replicas has tp_size units for sharding a single request.

Example: dp_size=2, tp_size=2 → four units, 16 columns:

dp_size=2, tp_size=2, start_column=0

Replica 0 (dp index 0)            Replica 1 (dp index 1)
+----------+----------+           +----------+----------+
| TP unit 0| TP unit 1|           | TP unit 0| TP unit 1|
| cols 0-3 | cols 4-7 |           | cols 8-11| cols12-15|
+----------+----------+           +----------+----------+
     one inference                     one inference

When you set start_column at runtime, it is the index of the first column of this entire block. The model occupies columns start_column through start_column + occupied_columns − 1.
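
The column arithmetic in the diagrams above can be sketched as follows. This is a minimal illustration, not a VART API: `ColumnRange`, `model_columns`, and `tp_unit_columns` are hypothetical names, and the replica-major ordering (all units of replica 0, then all units of replica 1) is an assumption read off the diagrams in this section.

```cpp
#include <cassert>
#include <cstdint>

// Inclusive range of NPU column indices.
struct ColumnRange {
    std::uint32_t first;
    std::uint32_t last;
};

constexpr std::uint32_t kColumnsPerUnit = 4;  // one NPU compute unit = four columns

// Columns occupied by the whole model block, starting at start_column.
inline ColumnRange model_columns(std::uint32_t start_column,
                                 std::uint32_t dp_size, std::uint32_t tp_size) {
    assert(dp_size > 0 && tp_size > 0);
    const std::uint32_t occupied = dp_size * tp_size * kColumnsPerUnit;
    return {start_column, start_column + occupied - 1};
}

// Columns of one TP unit inside one replica.
// Assumption: units are laid out replica by replica, as drawn in the diagrams.
inline ColumnRange tp_unit_columns(std::uint32_t start_column,
                                   std::uint32_t dp_size, std::uint32_t tp_size,
                                   std::uint32_t replica, std::uint32_t tp_unit) {
    assert(replica < dp_size && tp_unit < tp_size);
    const std::uint32_t unit_index = replica * tp_size + tp_unit;
    const std::uint32_t first = start_column + unit_index * kColumnsPerUnit;
    return {first, first + kColumnsPerUnit - 1};
}
```

With `dp_size=2`, `tp_size=2`, and `start_column=0`, TP unit 0 of replica 1 maps to columns 8 through 11, matching the diagram.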

Common compile-time configurations#

See Data Parallelism and Tensor Parallelism for valid dp_size and tp_size ranges per device and for AI Engine tile calculations.

Configuration	`tp_size`	`dp_size`	NPU columns consumed
Data Parallelism 1 / Tensor Parallelism 1	1	1	1 × 1 × 4 = 4
Data Parallelism 1 / Tensor Parallelism 4	4	1	4 × 1 × 4 = 16
Data Parallelism 4 / Tensor Parallelism 1	1	4	1 × 4 × 4 = 16
Data Parallelism 2 / Tensor Parallelism 2	2	2	2 × 2 × 4 = 16

The three 16-column rows use the same NPU width but optimize for different goals (latency, throughput, or a balance). Choose dp_size and tp_size at compile time; use start_column and aie_columns_sharing at runtime to place and share those columns across models.

Important

Models that share columns through temporal sharing must be compiled with the exact same dp_size and tp_size. When placing multiple models spatially, ensure the combined column ranges fit on your device (24 columns on ve2-xc2ve3558, 36 on ve2-xc2ve3858).

Runtime Configuration#

Multi-tenancy placement is configured per VART runner when each model is loaded. Call vart::RunnerFactory::create_runner() with RunnerType::VAIML once per model cache, and pass an options map that can include start_column and aie_columns_sharing. See VART ML Architecture Overview for the full list of VART runner options.

Create one runner per model. The following subsections document each option, the placement modes that combine them, and application responsibilities when multiple runners are active.

`start_column`#

Use start_column to specify the first NPU column where this model’s compiled overlay is loaded. The overlay spans occupied_columns = dp_size × tp_size × 4 consecutive columns starting at that index. How to assign start_column across multiple runners is defined in Placement modes.

Aspect	Detail
API	Pass `options["start_column"]` as `uint32_t` to `vart::RunnerFactory::create_runner(RunnerType::VAIML, model_cache_path, options)`
Column range	Columns `[start_column, start_column + occupied_columns − 1]`, where `occupied_columns = dp_size × tp_size × 4` from the compiled model
Valid range	`0` through `(device_columns − occupied_columns)`. On a 36-column device, a model with a 4-column footprint can use `start_column` up to `32` (columns 32–35)
Default	If `start_column` is omitted from the options map, VART selects a starting column based on available columns when the model is loaded
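
The valid range in the table reduces to one subtraction. The sketch below assumes a hypothetical helper named `max_start_column`; it is not a VART function, and the device column counts must come from your own knowledge of the target (24 or 36 in this guide).

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper: largest start_column that keeps the whole overlay
// on the device, or std::nullopt when the model cannot fit at all.
inline std::optional<std::uint32_t> max_start_column(std::uint32_t device_columns,
                                                     std::uint32_t occupied_columns) {
    if (occupied_columns == 0 || occupied_columns > device_columns) {
        return std::nullopt;
    }
    return device_columns - occupied_columns;
}
```

On a 36-column device a 4-column model yields 32, as in the table row above; a 16-column model on a 24-column device yields 8.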

`aie_columns_sharing`#

Use aie_columns_sharing to select shared (temporal) or exclusive (spatial) access to the column block defined by start_column. See Placement modes for how to combine this option with start_column across multiple runners.

Aspect	Detail
API	Pass `options["aie_columns_sharing"]` as `bool` to `create_runner()`
`true` (shared)	Temporal sharing mode (see Spatial and Temporal Sharing above)
`false` (exclusive)	Spatial exclusive mode (see Spatial and Temporal Sharing above)
Default	`true` when the key is omitted (VART runner default)

Placement modes#

Combine start_column and aie_columns_sharing across the runners in your application:

Mode	`start_column`	`aie_columns_sharing`
Temporal sharing	Same value for every runner in the group	`true` for every runner in the group
Spatial sharing (exclusive)	Distinct, non-overlapping value per runner	`false` for each runner (recommended for exclusive reservation)
Combined spatial + temporal	Same value within each temporal group; distinct values across spatial groups	`true` within temporal groups; `false` for spatial-only runners
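
One way to keep these combinations consistent is to build every options map through a single function. The sketch below is illustrative only: `RunnerOptions`, `make_placement_options`, and `is_shared` are hypothetical names, not VART APIs. The key names, the value types (`uint32_t` and `bool`), and the default of `true` are taken from the option tables above.

```cpp
#include <any>
#include <cstdint>
#include <string>
#include <unordered_map>

using RunnerOptions = std::unordered_map<std::string, std::any>;

// Hypothetical helper: placement options for one runner.
// Runners in a temporal group pass the same start_column with shared = true;
// exclusive spatial runners pass distinct start_column values with shared = false.
inline RunnerOptions make_placement_options(std::uint32_t start_column, bool shared) {
    RunnerOptions options;
    options["start_column"] = start_column;   // stored as uint32_t
    options["aie_columns_sharing"] = shared;  // stored as bool
    return options;
}

// Reads the sharing flag back, applying the documented default (true)
// when the key is omitted.
inline bool is_shared(const RunnerOptions& options) {
    const auto it = options.find("aie_columns_sharing");
    return it == options.end() ? true : std::any_cast<bool>(it->second);
}
```

The resulting map is what you would pass as the options argument of `create_runner()`. For a temporal group of two runners, both would use `make_placement_options(0, true)`.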

Application responsibilities

VART does not validate overlapping or incompatible placements across runners. Before calling create_runner() for each model, your application must ensure:

Spatial layouts: start_column ranges do not overlap between concurrent exclusive runners, and the sum of all occupied ranges fits the device (24 columns on ve2-xc2ve3558, 36 on ve2-xc2ve3858).
Temporal layouts: every runner in a group uses the same start_column, aie_columns_sharing: true, and the same compile-time dp_size and tp_size.
Mixed settings on the same columns: do not assign the same start_column to one runner with aie_columns_sharing: false and another with true. The vart_multi_tenancy sample detects this case and aborts with a diagnostic before creating runners.

Combined spatial and temporal deployments: assign disjoint column zones for spatial parallelism (each zone uses exclusive mode and its own start_column), and use temporal sharing within a zone by giving multiple runners the same start_column and aie_columns_sharing: true. Each temporal group still requires matching dp_size and tp_size.
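
Because VART does not validate placements across runners, an application can run its own check before creating them. The following is a minimal sketch of such a check under stated assumptions: `Placement` and `check_layout` are hypothetical names, and the rule that any overlap other than an exact temporal group is rejected is a conservative reading of the responsibilities listed above, not documented VART behavior.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical description of one runner's intended placement.
struct Placement {
    std::uint32_t start_column;
    std::uint32_t dp_size;
    std::uint32_t tp_size;
    bool shared;  // value intended for aie_columns_sharing
};

inline std::uint32_t occupied_columns(const Placement& p) {
    return p.dp_size * p.tp_size * 4;  // four columns per NPU compute unit
}

// Returns an empty string when the layout passes, otherwise the first problem found.
inline std::string check_layout(const std::vector<Placement>& runners,
                                std::uint32_t device_columns) {
    for (std::size_t i = 0; i < runners.size(); ++i) {
        const Placement& a = runners[i];
        const std::uint32_t a_end = a.start_column + occupied_columns(a);  // one past last
        if (a_end > device_columns) {
            return "runner " + std::to_string(i) + " does not fit on the device";
        }
        for (std::size_t j = i + 1; j < runners.size(); ++j) {
            const Placement& b = runners[j];
            const std::uint32_t b_end = b.start_column + occupied_columns(b);
            const bool overlap = a.start_column < b_end && b.start_column < a_end;
            if (!overlap) continue;
            // Overlap is accepted only for a temporal group: both shared,
            // same start_column, same dp_size and tp_size.
            const bool temporal_group = a.shared && b.shared &&
                                        a.start_column == b.start_column &&
                                        a.dp_size == b.dp_size &&
                                        a.tp_size == b.tp_size;
            if (!temporal_group) {
                return "runners " + std::to_string(i) + " and " + std::to_string(j) +
                       " overlap but are not a valid temporal group";
            }
        }
    }
    return "";
}
```

Call `check_layout` with the device column count before the first `create_runner()` call and stop if the returned string is not empty. This mirrors the kind of pre-flight check the vart_multi_tenancy sample is described as performing for mixed sharing settings.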

Example — spatial placement of three models on a 36-column device#

Three models run in parallel on disjoint column ranges. Each uses spatial exclusive placement (aie_columns_sharing: false):

Model	`dp_size` / `tp_size`	`start_column`	Columns occupied
Model A (latency-oriented)	1 / 4	0	0–15 (16 columns)
Model B (balanced)	2 / 2	16	16–31 (16 columns)
Model C (minimum footprint)	1 / 1	32	32–35 (4 columns)
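
The `start_column` values in this table follow from packing the models left to right. The sketch below shows that arithmetic; `pack_left_to_right` is a hypothetical helper, and left-to-right packing is only one possible layout policy, not something VART requires.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper: assigns consecutive, non-overlapping start columns to
// models with the given column widths. Returns std::nullopt if they do not
// all fit on the device.
inline std::optional<std::vector<std::uint32_t>> pack_left_to_right(
        const std::vector<std::uint32_t>& widths, std::uint32_t device_columns) {
    std::vector<std::uint32_t> starts;
    std::uint32_t next = 0;  // first free column; never exceeds device_columns
    for (const std::uint32_t width : widths) {
        if (width > device_columns - next) {
            return std::nullopt;
        }
        starts.push_back(next);
        next += width;
    }
    return starts;
}
```

For widths 16, 16, and 4 on a 36-column device this yields start columns 0, 16, and 32, which uses all 36 columns. The same three models do not fit on a 24-column device.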

Example — C++ spatial exclusive runners#

// Model A: dp_size=1, tp_size=4, exclusive use of columns 0-15.
std::unordered_map<std::string, std::any> options_a = {
    {"start_column", static_cast<uint32_t>(0)},
    {"aie_columns_sharing", false},
    {"input_tensor_type", std::string("HW")},
    {"output_tensor_type", std::string("HW")},
};
auto runner_a = vart::RunnerFactory::create_runner(
    vart::RunnerType::VAIML, "my_cache/ResNet50_INT8_dp1tp4", options_a);

// Model B: dp_size=2, tp_size=2, exclusive use of columns 16-31.
std::unordered_map<std::string, std::any> options_b = {
    {"start_column", static_cast<uint32_t>(16)},
    {"aie_columns_sharing", false},
    {"input_tensor_type", std::string("HW")},
    {"output_tensor_type", std::string("HW")},
};
auto runner_b = vart::RunnerFactory::create_runner(
    vart::RunnerType::VAIML, "my_cache/ResNet50_INT8_dp2tp2", options_b);

// Model C follows the same pattern with start_column 32.

Reference: `vart_multi_tenancy` Sample Application#

The vart_multi_tenancy sample is a reference application that reads a JSON file, creates one VART runner per array entry, forwards start_column and aie_columns_sharing to create_runner() using the same semantics documented above, and adds fields for reading IFM files and writing OFM files.

JSON configuration (sample only)#

The configuration file is a JSON array; each object describes one model instance:

Field	Description
`model_cache_path`	Path to the compiled `.rai` model cache directory. Required.
`start_column`	Optional. When set, passed to the VART runner as `start_column`. Same semantics as the Runtime Configuration section.
`aie_columns_sharing`	Optional. `true` or `false` (JSON boolean), or `"true"` / `"1"` / `"false"` / `"0"` as strings. Default `true` in the sample when omitted.
`ifm_node_file_map`	Sample only: maps ONNX input node names to IFM binary file paths on the target.
`ofm_dir`	Sample only: directory where OFM output files are written. Defaults to the current working directory when omitted.
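
The string forms accepted for `aie_columns_sharing` can be normalized with a small function like the one below. This is a sketch of one possible approach, not the sample's actual parsing code: `parse_sharing_flag` is a hypothetical name, and rejecting every spelling other than the four listed in the table is an assumption.

```cpp
#include <optional>
#include <string>

// Hypothetical helper: maps the documented string spellings to a bool.
// Returns std::nullopt for any other text so the caller can report an error.
inline std::optional<bool> parse_sharing_flag(const std::string& text) {
    if (text == "true" || text == "1") {
        return true;
    }
    if (text == "false" || text == "0") {
        return false;
    }
    return std::nullopt;
}
```

A caller would apply the sample's default of `true` only when the field is absent, and treat an unrecognized string as a configuration error.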

Example JSON for two spatial exclusive models:

[
  {
    "model_cache_path": "my_cache/model_a/model_a.rai",
    "start_column": 0,
    "aie_columns_sharing": false,
    "ifm_node_file_map": { "input": "/path/to/ifm.bin" },
    "ofm_dir": "./outputs/model_a"
  },
  {
    "model_cache_path": "my_cache/model_b/model_b.rai",
    "start_column": 16,
    "aie_columns_sharing": false,
    "ifm_node_file_map": { "images": "/path/to/ifm.bin" },
    "ofm_dir": "./outputs/model_b"
  }
]

Run on the target:

vart_multi_tenancy --config json_configs/two_model_spatial.json

The sample prints an overlay column assignment summary, validates IFM node names and file sizes, then runs inference (one thread per model). Pre-built configs on the board include temporal_config.json, spatial_config.json, and temporal_spatial_config.json. See the sample README for build instructions and additional examples.

Sample command-line options#

Option	Default	Description
`--config` / `-c`	(required)	Path to the JSON configuration file
`--runs` / `-r`	`1`	Number of inference iterations per model using the same IFM data
`--dry-run` / `-d`	off	Fill IFMs with random bytes; skip reading IFM files and writing OFM files
`--log-level` / `-l`	`2`	Log verbosity (0 = errors only, 1 = +warnings, 2 = +info)

The VART-ML vart_ml_test utility also supports --start-column and --aie-columns-sharing for single-model testing of the same runner options without the multi-model JSON wrapper.

Verifying NPU Column Usage#

Use xrt-smi examine --device 0 --report aie-partitions on the target board to confirm column assignment. Run the command while inference is running; partitions are released when inference completes.

Temporal sharing: multiple hardware contexts under one partition index, same column list.
Spatial sharing (exclusive): separate partition indices with disjoint column lists.

See Spatial and Temporal Sharing and Placement modes for how those patterns map to aie_columns_sharing settings.
