Multi-Instance NPU Support#
This section covers how to instantiate multiple NPU IPs in a single design. The default reference design supports executing multiple models from a single application context, and it can enable the NPU IP with either a subset or all of the available AIE columns. That flexibility allowed resource savings for lighter workloads, but any unused AIE columns remained idle and could not be leveraged for parallel execution. The multi-instance design removes this limitation: the available AIE resources can be partitioned into multiple sub-groups (for example, a few AIE columns per NPU IP), enabling multiple NPU IPs within the same design and the execution of multiple models on multiple NPU IPs.
The multi-instance NPU supports:
Concurrent execution of multiple models/snapshots on separate NPU IPs.
Better hardware utilization by distributing workloads across different AIE sub-groups.
Scalable performance, enabling developers to run diverse models simultaneously with dedicated compute resources.
It unlocks true parallelism and maximizes efficiency, making it easier to deploy multi-model workloads on the same platform.
The following two NPU IPs are used for the multi-instance NPU example.
VE2802_NPU_IP_O00_A128_M3 (Offset 0, AIE Columns 16)
VE2802_NPU_IP_O16_A080_M3 (Offset 16, AIE Columns 10)
Note
There is a limitation on IP choices for multi-instance NPU support. Because of the limited resources on both the AIE (GMIO) and PL (LUT/flip-flop/NMU) sides, and the available DDR bandwidth, a high number of IPs is not practical. Therefore, only two NPU IPs are used for the multi-instance NPU example implementation.
The number of columns in each IP is also restricted by the GMIO availability on the AIE array and the routing capability of Vitis. Review the implications of splitting AIEs between multiple IPs, and the related physical silicon I/O limitations, carefully before attempting an implementation.
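As a worked illustration of the IP naming convention above, the name fields encode the column offset (O…) and AIE count (A…). Assuming 8 AIE tiles per column, which is consistent with the two documented IPs (128 AIEs → 16 columns, 80 AIEs → 10 columns), the values can be derived in the shell:

```shell
# Illustration only: derive column offset and column count from the IP name
# fields, assuming 8 AIE tiles per column (matches A128 -> 16, A080 -> 10).
ip=VE2802_NPU_IP_O16_A080_M3
offset=${ip#*_O}; offset=${offset%%_*}   # "16"
aies=${ip#*_A};   aies=${aies%%_*}       # "080"
cols=$(( 10#$aies / 8 ))                 # 80 AIEs / 8 per column = 10
echo "offset=$(( 10#$offset )) columns=$cols"
```

The `10#` prefix forces base-10 arithmetic so the leading zero in `080` is not parsed as octal.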
Integrate Multiple NPU IPs in Reference Design#
The following changes are required to integrate multiple NPU IPs in the reference design.
Note
These changes are already in place in the reference design for your convenience. The following list is for reference only.
The NPU instance-specific FPGA bitstream files (fpga_info_{timestamp}.txt), device tree files (*.dtsi), and configuration files need to be copied into the target build folder. Refer to the code snippet below that performs this step.
In Vitis-AI/examples/reference_design/vek280/vek280_platform/sw/create_petalinux.sh:

Lines 84-85:

[ -z "$NPU_IP2" ] || cp -fv ../npu_versal_two_npus.dtsi project-spec/meta-user/recipes-bsp/device-tree/files/npu_versal.dtsi | tee -a $ABS_PATH/cmd.log
[ -n "$NPU_IP2" ] || cp -fv ../npu_versal_one_npu.dtsi project-spec/meta-user/recipes-bsp/device-tree/files/npu_versal.dtsi | tee -a $ABS_PATH/cmd.log

Line 123:

[ -z "$NPU_IP2" ] || cp ../../fpga_info_${VAISW_SNAPSHOT_TIMESTAMP2#0x}.txt project-spec/meta-user/recipes-core/vartml-sw/vartml-sw/

While linking and packaging with the Vitis toolchain (v++), ensure that NPU IP-specific files such as the VSS and libadf are included. Use unique names/identifiers for each NPU IP instance-specific VSS, libadf, snapshot, fpga_info file, etc., to avoid conflicts during the linking and packaging stages of the board design. Refer to the code snippet below that performs this step.
In Vitis-AI/examples/reference_design/vek280/vitis-prj/Makefile:

Line 174:

cd $(ABS_PATH)/link; v++ $(XCXX_COMMON_OPTS) --platform $(PLATFORM) -l $(NPUVSSFILE_O0) $(NPUVSSFILE_O16) $(ABS_PATH)/kernels/image_processing/image_processing.xo $(YOLOV5_TAIL_PL) $(YOLOX_TAIL_PL) -o ${PROJECT_NAME}_link.xsa --config $(ABS_PATH)/link/system.cfg ;cd $(ABS_PATH)

Line 190:

cd $(ABS_PATH)/package; v++ -p --debug --save-temp --platform $(PLATFORM) $(LIBADF_O0) $(LIBADF_O16) --target hw --package.out_dir hw_outputs_unpatched --package.kernel_image=$(SW_PLATFORM)/Image --package.rootfs=$(SW_PLATFORM)/rootfs.ext4 --package.image_format=ext4 --package.ext4_fat32_size=2 $(ABS_PATH)/link/${PROJECT_NAME}_link.xsa --package.sd_file=$(SW_PLATFORM)/${PROJECT_NAME}_pfm/xrt/image/boot.scr --package.sd_file=$(ABS_PATH)/kernels/image_processing/image_processing.cfg --package.sd_file=$(ABS_PATH)/version.txt $(YOLOX_CONST_FILES) $(RESNET50_SNAPSHOT) $(RESNET101_SNAPSHOT) $(RESNET50_SNAPSHOT2) $(RESNET101_SNAPSHOT2) -o x_plus_ml.xclbin

Before linking and packaging, ensure that the NPU IP instance-specific PL NMU placements are merged into a single Tcl file, and that all the IP configurations are merged into a single consolidated system.cfg file that is passed to the linker command above. Refer to the code snippet below that performs this step.
In examples/reference_design/vek280/vitis_prj/kernels/npu_vss/Makefile:

Lines 37-39:

python3 config_merger.py
[ -z "$${NPU_IP2}" ] || python3 nmu_merger.py
[ -n "$${NPU_IP2}" ] || cp $(NMU_CSTR)/$(NPU_IP)/link/place_pl_nmu.tcl ../../link/
After reviewing the changes required for multiple NPU IP instances in the reference design, refer to the following section to build the reference design.
Build Reference Design with Multiple NPU IPs#
This section covers the steps required to build the Reference Design with multiple NPU IPs.
Ensure that you have downloaded the PetaLinux BSP v2025.1.
Navigate to Vitis-AI source code:
$ cd <path_to_Vitis-AI>
Source with two NPU IPs:
$ source npu_ip/settings.sh VE2802_NPU_IP_O00_A128_M3 VE2802_NPU_IP_O16_A080_M3
Source the Vitis tool:
$ source <path_to_Vitis>/2025.1.1/Vitis/settings64.sh
Source the Petalinux tool:
$ source <path_to_petalinux-v2025.1>/tool/petalinux-v2025.1-final/settings.sh
Build the reference design with two NPU IPs:
$ make -C examples/reference_design/vek280 all BSP_PATH=<path_to_petalinux-v2025.1>/bsp/xsct/xilinx-vek280-xsct-v2025.1-final.bsp
During the reference design build (SD card compilation), the ResNet50 and ResNet101 models are downloaded from TensorFlow Hub and a snapshot is generated for each target NPU IP. Downloading models from TensorFlow Hub may not work properly depending on tfhub availability. The model download and snapshot generation can be skipped by passing the option SKIP_SNAPSHOT=1 during the SD card generation, as shown in the command below.

make all SKIP_SNAPSHOT=1 BSP_PATH=<BSP file Path>
The SD card image VE2802_NPU_IP_O00_A128_M3__O16_A080_M3_sd_card.img is generated after running the above steps, and it contains two NPU IPs: VE2802_NPU_IP_O00_A128_M3 (128 AIEs, that is, 16 AIE columns, at column offset 0) and VE2802_NPU_IP_O16_A080_M3 (80 AIEs, that is, 10 AIE columns, at column offset 16). Refer to the Set Up/Flash SD Card section in Software Installations for steps to flash the SD card.
Compile Multiple Models For Multiple NPU IPs#
This section guides you on compiling multiple models to run on multiple NPU IPs.
From the application, the execution of multiple models can be scheduled:
Using multiple threads (for instance, one thread per model); in that case, the blocking execute call is recommended.
Using a single thread that issues outstanding execute_async calls and the associated wait calls.
If the models target the same NPU IP, the inferences are serialized: each inference starts once the NPU hardware becomes available. If the models target different NPU IPs, the inferences on the different NPU IPs run at the same time, asynchronously.
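The parallelism described above can be sketched at the process level with plain shell job control. This is an illustration only, with a placeholder command standing in for a real inference launch, not the VART API itself: one run per NPU IP is started in the background, and both proceed concurrently, whereas two runs targeting the same IP would be serialized by the hardware.

```shell
# Sketch with a placeholder command (run_model is a stand-in, not a real tool):
# launch one inference per NPU IP in the background, then wait for both.
run_model() { echo "running $1 on $2"; }

run_model model_B IP_0 & pid0=$!
run_model model_C IP_1 & pid1=$!
wait "$pid0" "$pid1"
echo "both done"
```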
Theoretical Example#
Let’s assume we have 2 IPs (named IP_0 and IP_1) and we have 3 models (named model_A, model_B, model_C).
Refer to the following commands to compile the models for the different NPU IPs.
source npu_ip/settings.sh IP_0 # enable compilation for IP_0
VAISW_SNAPSHOT_DIRECTORY=model_A.$NPU_IP model_a.py # will generate a snapshot for model_A and will dump it on model_A.IP_0
VAISW_SNAPSHOT_DIRECTORY=model_B.$NPU_IP model_b.py # will generate a snapshot for model_B and will dump it on model_B.IP_0
source npu_ip/settings.sh IP_1 # enable compilation for IP_1
VAISW_SNAPSHOT_DIRECTORY=model_A.$NPU_IP model_a.py # will generate a snapshot for model_A and will dump it on model_A.IP_1
VAISW_SNAPSHOT_DIRECTORY=model_C.$NPU_IP model_c.py # will generate a snapshot for model_C and will dump it on model_C.IP_1
Loading the snapshot of model_A.IP_0 will create a handle to run model_A on IP_0.
Therefore, the following will be possible by a single application:
Running model A either on IP_0 or IP_1 (depending on which handle is used)
Running model B on IP_0
Running model C on IP_1
The target IP is embedded in the generated snapshot, so the embedded SW stack does not need to be told on which target IP a snapshot has to be executed.
Note
A snapshot is necessarily bound to a specific IP: different IPs use different resources and AIE addresses, so it is not possible to generate a single snapshot targeting different IPs.
Therefore, multiple snapshots must be generated, one per IP, even if the same model is used on the different IPs.
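The one-snapshot-per-IP requirement from the note above can be scripted as a simple loop. In this sketch, the two stub functions stand in for the real `source npu_ip/settings.sh <IP>` step and the `model_a.py` script from the theoretical example, so only the shape of the loop is shown:

```shell
# Sketch only: set_ip and build_model_a are stubs standing in for
# "source npu_ip/settings.sh <IP>" and "model_a.py" respectively.
set_ip()        { NPU_IP=$1; }
build_model_a() { echo "snapshot -> $VAISW_SNAPSHOT_DIRECTORY"; }

# Generate one snapshot per IP for the same model.
for ip in IP_0 IP_1; do
    set_ip "$ip"
    VAISW_SNAPSHOT_DIRECTORY=model_A.$NPU_IP build_model_a
done
```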
Practical Example#
As the VE2802_NPU_IP_O00_A128_M3 and VE2802_NPU_IP_O16_A080_M3 IPs are used in the multi-instance NPU example, refer to the following steps to compile the models for these IPs.
Navigate to Vitis-AI source
cd <path_to_Vitis-AI>
Build snapshot for VE2802_NPU_IP_O00_A128_M3
source npu_ip/settings.sh VE2802_NPU_IP_O00_A128_M3
./docker/run.bash --acceptLicense -- /bin/bash -c "source npu_ip/settings.sh && cd examples/python_examples/batcher && VAISW_SNAPSHOT_DIRECTORY=$VAISW_HOME/SNAP.$NPU_IP/resnet50.TF ./run_classification.sh -f tensorflow2 -n resnet50 --batchSize 4"
Build snapshot for VE2802_NPU_IP_O16_A080_M3
source npu_ip/settings.sh VE2802_NPU_IP_O16_A080_M3
./docker/run.bash --acceptLicense -- /bin/bash -c "source npu_ip/settings.sh && cd examples/python_examples/batcher && VAISW_SNAPSHOT_DIRECTORY=$VAISW_HOME/SNAP.$NPU_IP/resnet50.TF ./run_classification.sh -f tensorflow2 -n resnet50 --batchSize 4"
The snapshots of the ResNet50 model are now generated for the respective NPU IPs. Refer to the following section to execute the snapshots on multiple NPU IPs.
Execute Multiple Models with Multiple NPU IPs#
Use the following steps to execute the snapshots (of the ResNet50 model) on multiple NPU IPs (VE2802_NPU_IP_O00_A128_M3 and VE2802_NPU_IP_O16_A080_M3) using the VART Runner application.
Ensure that VE2802_NPU_IP_O00_A128_M3__O16_A080_M3_sd_card.img is flashed on SD Card.
Ensure that the VEK280 board is up and running.
Copy the generated snapshots (of the ResNet50 model) to the target board.
scp <snapshot_generated_for_VE2802_NPU_IP_O00_A128_M3> root@<vek280_board_ip>:/root
scp <snapshot_generated_for_VE2802_NPU_IP_O16_A080_M3> root@<vek280_board_ip>:/root
Copy the ‘imagenet’ folder, which is generated while executing the sample model.
scp -r imagenet/ root@<vek280_board_ip>:/root
Set up the Vitis AI tools environment on the board
$ source /etc/vai.sh
Run the following command to execute the models on both IPs.
cd /root

# sample command
# vart_ml_demo --imgPath /root/imagenet/ILSVRC2012_img_val/ --snapshot <snapshot_generated_for_VE2802_NPU_IP_O00_A128_M3>+<snapshot_generated_for_VE2802_NPU_IP_O16_A080_M3> --labels /etc/vai/labels/labels --goldFile /root/imagenet/ILSVRC_2012_val_GroundTruth_10p.txt --nbImages 1

# actual command
vart_ml_demo --imgPath /root/imagenet/ILSVRC2012_img_val/ --snapshot SNAP.VE2802_NPU_IP_O00_A128_M3/resnet50.TF/+SNAP.VE2802_NPU_IP_O16_A080_M3/resnet50.TF/ --labels /etc/vai/labels/labels --goldFile /root/imagenet/ILSVRC_2012_val_GroundTruth_10p.txt --nbImages 1
The previous command generates the following output on the console.
root@xilinx-vek280-xsct-20251:~# vart_ml_demo --imgPath /root/imagenet/ILSVRC2012_img_val/ --snapshot SNAP.VE2802_NPU_IP_O00_A128_M3/resnet50.TF/+SNAP.VE2802_NPU_IP_O16_A080_M3/resnet50.TF/ --labels /etc/vai/labels/labels --goldFile /root/imagenet/ILSVRC_2012_val_GroundTruth_10p.txt --nbImages 1
XAIEFAL: INFO: Resource group Avail is created.
XAIEFAL: INFO: Resource group Static is created.
XAIEFAL: INFO: Resource group Generic is created.
[VART] Allocated config area in DDR: Addr = [ 0x880000000, 0x50000000000, 0x60000000000 ] Size = [ 0xe7b721, 0xad33e1, 0xe7b721]
[VART] Allocated tmp area in DDR: Addr = [ 0x880e7d000, 0x50000ad5000, 0x60000e7d000 ] Size = [ 0x62801, 0x31401, 0x31401]
[VART] Found snapshot for IP VE2802_NPU_IP_O00_A128_M3 matching running device VE2802_NPU_IP_O00_A128_M3
[VART] Parsing snapshot SNAP.VE2802_NPU_IP_O00_A128_M3/resnet50.TF//
[========================= 100% =========================]
NPU only mode set. Skipping node resnet50_CPU.
Warning: VART ML is already connected using XRT. Ignoring this call.
[VART] Allocated config area in DDR: Addr = [ 0x880fa9000, 0x50000b6c000, 0x60000f14000 ] Size = [ 0xe7b721, 0xad1221, 0xe7b721]
[VART] Allocated tmp area in DDR: Addr = [ 0x881e26000, 0x5000163f000, 0x60001d91000 ] Size = [ 0x62801, 0x31401, 0x31401]
[VART] Found snapshot for IP VE2802_NPU_IP_O16_A080_M3 matching running device VE2802_NPU_IP_O00_A128_M3
[VART] Parsing snapshot SNAP.VE2802_NPU_IP_O16_A080_M3/resnet50.TF//
[========================= 100% =========================]
NPU only mode set. Skipping node resnet50_CPU.
resnet50 Image 0 (0:0) ILSVRC2012_val_00000001.JPEG
resnet50 GOLD - n03982430 pool table, billiard table, snooker table - 1.000000
resnet50 PRED - n03982430 pool table, billiard table, snooker table - 1.00
resnet50 PRED - n03942813 ping-pong ball - 0.00
resnet50 PRED - n04336792 stretcher - 0.00
resnet50 PRED - n03376595 folding chair - 0.00
resnet50 PRED - n03179701 desk - 0.00
resnet50
resnet50 Image 0 (0:0) ILSVRC2012_val_00000001.JPEG
resnet50 GOLD - n03982430 pool table, billiard table, snooker table - 1.000000
resnet50 PRED - n03982430 pool table, billiard table, snooker table - 1.00
resnet50 PRED - n03942813 ping-pong ball - 0.00
resnet50 PRED - n04336792 stretcher - 0.00
resnet50 PRED - n03376595 folding chair - 0.00
resnet50 PRED - n03179701 desk - 0.00
resnet50
============================================================
Accuracy Summary:
[AMD] [resnet50 TEST top1] 100.00% passed.
[AMD] [resnet50 TEST top5] 100.00% passed.
[AMD] [resnet50 ALL TESTS] 100.00% passed.
[AMD] [resnet50 TEST top1] 100.00% passed.
[AMD] [resnet50 TEST top5] 100.00% passed.
[AMD] [resnet50 ALL TESTS] 100.00% passed.
[AMD] 291.33 imgs/s (1 images)
root@xilinx-vek280-xsct-20251:~#
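The '+'-separated --snapshot argument used in the vart_ml_demo command above can also be assembled in the shell when the list of snapshots varies. This is only a convenience sketch, not part of the tool:

```shell
# Join snapshot directories with '+' to form the --snapshot argument value
# (the separator format used by the vart_ml_demo command above).
snapshots="SNAP.VE2802_NPU_IP_O00_A128_M3/resnet50.TF/ SNAP.VE2802_NPU_IP_O16_A080_M3/resnet50.TF/"
arg=""
for s in $snapshots; do
    arg="${arg:+$arg+}$s"   # insert '+' only between entries
done
echo "$arg"
```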
You can also execute the snapshots on multiple NPUs by using X+ML application. Refer to the Multi-instance Support with x_plus_ml_app in VART X APIs Application Developer Guide for more details.