Docker Samples and Demos#
This section describes how to generate snapshots for widely used models by using the run_classification.sh application, provided in the Vitis AI repository at $VITIS_AI_REPO/examples/python_examples/batcher. It also describes how to generate snapshots for the demo models included in the Docker image.
Run the following steps to check the list of models included as an example in the run_classification.sh application:
1. Navigate to the Vitis-AI directory:

   ```
   $ cd $VITIS_AI_REPO
   ```

2. Launch Docker:

   ```
   $ ./docker/run.bash
   ```

3. Navigate to the batcher folder:

   ```
   $ cd examples/python_examples/batcher
   ```

4. Check the list of supported frameworks:

   ```
   $ ./run_classification.sh -f list
   ```
This command displays the following output on the console.
List of supported frameworks: onnxRuntime, pytorch, tensorflow, tensorflow2
The following table displays the framework versions that are tested on Docker.
| Package     | Tested up to | Docker version |
|-------------|--------------|----------------|
| tensorflow  | 2.16.1       | 2.9.0          |
| onnx        | 1.16.1       | 1.12.1         |
| onnxruntime | 1.18.0       | 1.12.0         |
| torch       | 2.3.1        | 1.12.1         |
Check the list of supported models for the PyTorch framework:
$ ./run_classification.sh -f pytorch -n list
After running the previous command, the following output is displayed on the console:
List of supported networks for the framework pytorch: alexnet densenet121 densenet161 densenet169 densenet201 googlenet_no_lrn inceptionv3 mnasnet0_5 mnasnet0_75 mnasnet1_0 mnasnet1_3 mobilenet_v2 resnet101 resnet152 resnet18 resnet34 resnet50 resnext101_32x8d resnext50_32x4d shufflenet_v2_x0_5 shufflenet_v2_x1_0 shufflenet_v2_x1_5 shufflenet_v2_x2_0 squeezenet squeezenet1_1 vgg11 vgg11_bn vgg13 vgg13_bn vgg16 vgg16_bn vgg19 vgg19_bn wide_resnet101_2 wide_resnet50_2
Similarly, you can check the list of supported models for TensorFlow (1 and 2) and ONNX:
```
$ ./run_classification.sh -f tensorflow -n list
$ ./run_classification.sh -f tensorflow2 -n list
$ ./run_classification.sh -f onnxRuntime -n list
```
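The per-framework queries can also be driven from a single loop. The sketch below is a dry run that only prints the commands; remove the leading `echo` to actually execute them from the `examples/python_examples/batcher` directory inside the Docker container.

```shell
# Dry-run sketch: print the model-list query for each supported framework.
# Remove the leading "echo" to actually run the queries from the
# examples/python_examples/batcher directory inside the Docker container.
for fw in onnxRuntime pytorch tensorflow tensorflow2; do
    echo "./run_classification.sh -f $fw -n list"
done
```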
Generate Snapshot For ResNet50#
After reviewing the models supported by the run_classification.sh script, follow these steps inside the Docker container to generate a snapshot for the ResNet50 model as an example:
1. Navigate to the Vitis-AI directory:

   ```
   $ cd $VITIS_AI_REPO
   ```

2. Enable the NPU software stack:

   ```
   $ source npu_ip/settings.sh
   ```

3. Navigate to the batcher folder:

   ```
   $ cd examples/python_examples/batcher
   ```

4. Run the following command to generate a snapshot for ResNet50:

   ```
   $ VAISW_SNAPSHOT_DIRECTORY=snapshot.resnet50.tf2.b19.1007 ./run_classification.sh -f tensorflow2 -n resnet50 -b 19
   ```
This command generates the snapshot in $VITIS_AI_REPO/examples/python_examples/batcher/. The console output ends with a summary similar to:

```
[VAISW]
[VAISW] 10 batches of 19 samples (the first batch is not used to compute the detailed times)
[VAISW] 1 input per batch (19x224x224x3)
[VAISW] 1 output per batch (19x1001)
[VAISW] 2 total subgraphs:
[VAISW]   1 VAISW (FPGA) subgraph: 99.99% of total MACs (79.10 G)
[VAISW]     precision: FX8
[VAISW]   1 Framework (CPU) subgraph
[VAISW] [INFO]: snapshot directory dumped in snapshot.resnet50.tf2.b19.1007
[VAISW] [INFO]: snapshot dumped for VE2802_NPU_IP_O00_A304_M3
[VAISW] 190 samples
[VAISW] from 10/07/2025 15:19:29 to 10/07/2025 15:23:00
```
After the snapshot is generated successfully, the terminal displays the message snapshot dumped for VE2802_NPU_IP_O00_A304_M3. This message indicates that you must use the SD card image created for this specific NPU IP to verify the snapshot. It is essential to build the reference design solution using the same NPU IP; otherwise, a snapshot generated for one version of the NPU IP might fail when executed on an SD card image intended for a different NPU IP.

Note
The previous command takes a few minutes to generate a snapshot.
The details displayed in the previous output, such as timings, might differ between runs.
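To pick the matching SD card image, you can recover the NPU IP name from the snapshot-generation messages. The sketch below assumes a hypothetical file `snapshot_gen.log` that holds a saved copy of the `[VAISW]` console output; here the file is created inline so the example is self-contained.

```shell
# Sketch: recover the NPU IP name a snapshot was generated for from a saved
# copy of the console output, so the matching SD card image can be selected.
# "snapshot_gen.log" is a hypothetical capture of the [VAISW] messages.
printf '%s\n' \
  '[VAISW] [INFO]: snapshot directory dumped in snapshot.resnet50.tf2.b19.1007' \
  '[VAISW] [INFO]: snapshot dumped for VE2802_NPU_IP_O00_A304_M3' > snapshot_gen.log
sed -n 's/.*snapshot dumped for //p' snapshot_gen.log
# prints: VE2802_NPU_IP_O00_A304_M3
```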
Copy the snapshot from the host machine to the target board. Ensure that the board is up and running:
```
# Use the IP address of the VEK280 board
$ scp -r $VITIS_AI_REPO/examples/python_examples/batcher/snapshot.resnet50.tf2.b19.1007 root@<vek280_board_ip>:/root
```
After transferring the snapshot to the target board, you can deploy it using the NPU runner applications. Refer to Execute Sample Model for more details.
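The copy step can be parameterized when you work with several boards or snapshots. This dry-run sketch only prints the assembled command; `BOARD_IP` is a placeholder you must replace with your VEK280 board's address, and removing the leading `echo` performs the copy.

```shell
# Dry-run sketch: assemble the copy command for a given board IP.
# BOARD_IP is a placeholder; replace it with your VEK280 board's address
# and remove the leading "echo" to perform the copy.
BOARD_IP="192.168.0.10"
SNAPSHOT="$VITIS_AI_REPO/examples/python_examples/batcher/snapshot.resnet50.tf2.b19.1007"
echo "scp -r $SNAPSHOT root@$BOARD_IP:/root"
```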
Generate Snapshot for SSD_ResNet34#
Inside the Docker container, run the following steps to generate the snapshot for SSD-ResNet34 with a batch size of one. The final command generates a snapshot named snapshot.ssd_resnet34.1007 in the current directory.
1. Navigate to the Vitis-AI directory:

   ```
   $ cd $VITIS_AI_REPO
   ```

2. Enable the NPU software stack:

   ```
   $ source npu_ip/settings.sh
   ```

3. Navigate to the ssdResnet34 folder:

   ```
   $ cd examples/python_examples/ssdResnet34
   ```

4. Generate a snapshot for SSD-ResNet34:

   ```
   $ VAISW_SNAPSHOT_DIRECTORY=snapshot.ssd_resnet34.1007 make
   ```
Note
You can generate a snapshot with a different batch size by executing the following command, replacing $batchsize with the desired batch size number. The example below uses batchsize=4. Refer to the Quantization Options section for more details.

```
# Example command with $batchsize
# VAISW_SNAPSHOT_DIRECTORY=snapshot.ssdresnet34 python3 demo_tf2.py ../../samples/samples/ssd/images $batchsize 10
$ VAISW_SNAPSHOT_DIRECTORY=snapshot.ssd_resnet34.b4.1007 python3 demo_tf2.py ../../samples/samples/ssd/images 4 10
```
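If you need snapshots for several batch sizes, the command can be generated in a loop. The sketch below is a dry run that follows the snapshot-naming scheme used in this section and only prints the commands; remove the leading `echo` to run them from the `ssdResnet34` directory inside the Docker container.

```shell
# Dry-run sketch: print the snapshot-generation command for several batch
# sizes, using the snapshot.ssd_resnet34.b<N>.1007 naming scheme from above.
# Remove the leading "echo" to run from examples/python_examples/ssdResnet34.
for batchsize in 1 2 4 8; do
    echo "VAISW_SNAPSHOT_DIRECTORY=snapshot.ssd_resnet34.b${batchsize}.1007 python3 demo_tf2.py ../../samples/samples/ssd/images ${batchsize} 10"
done
```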
The following text displays the last few lines of the output for SSD-ResNet34 snapshot generation.
```
[VAISW]
[VAISW] 7 batches of 1 sample (the first batch is not used to compute the detailed times)
[VAISW] 1 input per batch (1x1200x1200x3)
[VAISW] 2 outputs per batch (1x81x15130, 1x4x15130)
[VAISW] 2 total subgraphs:
[VAISW]   1 VAISW (FPGA) subgraph: 99.99% of total MACs (218.38 G)
[VAISW]     precision: FX8
[VAISW]   1 Framework (CPU) subgraph
[VAISW] [INFO]: snapshot directory dumped in snapshot.ssd_resnet34.1007
[VAISW] [INFO]: snapshot dumped for VE2802_NPU_IP_O00_A304_M3
[VAISW] 7 samples
[VAISW] from 10/07/2025 15:29:05 to 10/07/2025 15:34:17
```
As indicated by the message on the terminal, use the VE2802_NPU_IP_O00_A304_M3 SD card image to deploy the snapshot of the SSD_ResNet34 model.

Copy the snapshot from the host machine to the target board. Ensure that the board is up and running:
```
# Use the IP address of the VEK280 board
$ scp -r $VITIS_AI_REPO/examples/python_examples/ssdResnet34/snapshot.ssd_resnet34.1007 root@<vek280_board_ip>:/root
```
After copying the snapshot to the target board, you can deploy it using the NPU runner Python application, as explained in Execute Sample Model.
Generate Snapshot for YOLOX#
Several demo models are provided in the /home/demo/ directory inside the Docker container.
The following steps show how to generate a snapshot for the YOLOX-m model:
1. Navigate to the Vitis-AI directory:

   ```
   $ cd $VITIS_AI_REPO
   ```

2. Enable the NPU software stack:

   ```
   $ source npu_ip/settings.sh
   ```

3. Navigate to the YOLOX folder:

   ```
   $ cd /home/demo/YOLOX
   ```

4. Generate a snapshot for YOLOX:

   ```
   $ VAISW_SNAPSHOT_DIRECTORY=snapshot.yolox.1007 VAISW_QUANTIZATION_NBIMAGES=1 ./run assets/dog.jpg m --save_result
   ```
The following text displays the last few lines of the output for YOLOX-m snapshot generation.
```
[VAISW]
[VAISW] The statistic summary can not be displayed, more than 1 inference must be run but 1 inference has been executed.
[VAISW] [INFO]: snapshot directory dumped in snapshot.yolox.1007
[VAISW] [INFO]: snapshot dumped for VE2802_NPU_IP_O00_A304_M3
```
Note
You can control the number of images for quantization tuning as shown in the following command. Refer to the Quantization Options section for more details.
$ VAISW_SNAPSHOT_DIRECTORY=snapshot.yolox.b4.1007 VAISW_QUANTIZATION_NBIMAGES=4 ./run assets/ m --save_result
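To compare several quantization settings, the command can be emitted from a loop. This dry-run sketch only prints the commands; remove the leading `echo` to run them from `/home/demo/YOLOX` inside the Docker container.

```shell
# Dry-run sketch: print the YOLOX snapshot-generation command for several
# quantization image counts. Remove the leading "echo" to run the commands
# from /home/demo/YOLOX inside the Docker container.
for n in 1 2 4; do
    echo "VAISW_SNAPSHOT_DIRECTORY=snapshot.yolox.b${n}.1007 VAISW_QUANTIZATION_NBIMAGES=${n} ./run assets/ m --save_result"
done
```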
As indicated by the message on the terminal, use the VE2802_NPU_IP_O00_A304_M3 SD card image to deploy the snapshot of the YOLOX-m model.

Copy the snapshot from the host machine to the target board. Ensure that the board is up and running:
```
# Use the IP address of the VEK280 board
$ scp -r <path_snapshot_dir>/snapshot.yolox.1007 root@<vek280_board_ip>:/root
```
After copying the snapshot to the target board, you can deploy it using the NPU runner Python application, as explained in Execute Sample Model.
Note
The YOLOX model is downloaded from the official release page at github.com/Megvii-BaseDetection/YOLOX/releases/download.
Generate Snapshot for YOLOv5 with UINT8 Option#
The NPU software stack accepts the input buffer in UINT8 format, which avoids the quantization operation and improves execution performance on the board. The following steps explain how to compile and deploy the YOLOv5 model in UINT8 mode.
Note
The YOLOv5 model is provided in the /home/demo/ directory inside the Docker container.
On the Linux host machine, run the following commands to generate a snapshot for the YOLOv5 model with the UINT8 option.
```
$ cd $VITIS_AI_REPO
$ source npu_ip/settings.sh
$ ./docker/run.bash --acceptLicense -- /bin/bash -c "source npu_ip/settings.sh && source npu_ip/uint8.env && cd /home/demo/yolov5 && VAISW_SNAPSHOT_DIRECTORY=$PWD/SNAP.$NPU_IP/yolov5.b1.uint8 VAISW_USE_UINT_INPUT=1 VAISW_QUANTIZATION_NBIMAGES=1 ./run data/images/bus.jpg --out_file /dev/null --ext pt"
```
The command generates the yolov5.b1.uint8 snapshot in the SNAP.VE2802_NPU_IP_O00_A304_M3 folder with UINT8 mode. The mode is enabled by using the VAISW_USE_UINT_INPUT=1 option in the npu_ip/uint8.env file.

Ensure that the VEK280 board is up and running.
Copy the generated snapshot (yolov5.b1.uint8) from the Linux host machine to /home/root/ on the target board.

Copy the yolov5 directory (from /home/demo/ in the Docker container) to /home/root/ on the target board.

On the VEK280 target board, run the following commands to execute the YOLOv5 model with the UINT8 option:
```
$ source /etc/vai.sh
$ cd /root/yolov5
$ VAISW_SNAPSHOT_DIRECTORY=/root/yolov5.b1.uint8/ VAISW_USE_UINT_INPUT=1 ./run /root/yolov5/data/images/bus.jpg --out_file /dev/null --ext pt
```
The following is the output of executing the command:
```
root@xilinx-vek280-xsct-20251:~/yolov5# VAISW_SNAPSHOT_DIRECTORY=/root/yolov5.b1.uint8/ VAISW_USE_UINT_INPUT=1 ./run /root/yolov5/data/images/bus.jpg --out_file /dev/null --ext pt
Error in cpuinfo: prctl(PR_SVE_GET_VL) failed
detect: weights=['weights/yolov5s.pt'], source=/root/yolov5/data/images/bus.jpg, data=data/coco128.yaml, imgsz=640x640, conf_thres=0.25, iou_thres=0.45, max_det=1000, device=, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs/detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=False, batchSize=1, out_file=/dev/null, loop=False, keepClasses=None
YOLOv5 ? v6.1-277-gfdc9d919 Python-3.12.9 torch-2.5.0 CPU
Fusing layers...
YOLOv5s_v6 summary: 213 layers, 7225885 parameters, 0 gradients
640x640 4 persons, 1 bus, Done. (2.537s)
640x640 4 persons, 1 bus, Done. (2.537s)
No more input, encoding capture...
root@xilinx-vek280-xsct-20251:~/yolov5#
```
In the above results, the message “prctl(PR_SVE_GET_VL) failed” can be ignored.
If the previous command runs without errors, you can skip this step. If you encounter errors, re-run the command after installing the following Python packages:
```
# Install the following Python packages if there are errors executing the YOLOv5 model with UINT8 mode.
$ python3 -m pip install matplotlib==3.7.2 numpy==1.26.4 onnx==1.17.0 onnxruntime==1.18.1 opencv-python==4.10.0.84 pandas==2.0.3 pycocotools==2.0.8 pyyaml scikit-learn==1.3.0 scipy==1.15.2 seaborn==0.13.2 tensorflow==2.19.0 torch==2.5.0 torchvision==0.20.0 tqdm==4.67.1
```
Note
It takes a few minutes to install the Python packages.
The YOLOv5 model is downloaded from the official release page at github.com/ultralytics/yolov5/releases/download.
Accelerate YOLO Tails on AIE#
The YOLO tail graphs can be fully accelerated on the AIE, resulting in no CPU sub-graph. The tail (the part after the last convolution) is accelerated inside the AIE for YOLOv5, YOLOv7 and YOLOX models.
The tails of YOLO-like models are automatically accelerated on the AIE when the following conditions are met:

- The precision of the 'tail' part is not INT8 (that is, BF16 or MIXED precision is used). The tail operations require a much higher precision range than the other operations; with INT8 precision, the tail computation on the AIE would be wrong, so the compilation software stack maps those operations onto a CPU sub-graph instead.
- The tail contains only supported accelerated layers. For example, softMax layers are not accelerated on the AIE; YOLOv8 has a softMax in its tail and therefore cannot be fully accelerated on the AIE.
Refer to the following steps to generate a snapshot for the YOLOv5 model and execute it on the board.
Step 1: Generate Snapshot for YOLOv5#
On the Linux host machine, navigate to the Vitis-AI directory:

```
$ cd <path_to_Vitis-AI_folder>
```
Run the following command to set up the Vitis AI software environment:
$ source npu_ip/settings.sh VE2802_NPU_IP_O00_A304_M3
Generate the snapshot for YOLOv5 using the following command:
$ ./docker/run.bash --acceptLicense -- /bin/bash -c "source npu_ip/settings.sh && cd /home/demo/yolov5 && VAISW_FE_PRECISION=MIXED VAISW_FE_VIEWDTYPEOUTPUT=AUTO VAISW_SNAPSHOT_DIRECTORY=$PWD/SNAP.$NPU_IP/yolo5.MP.FP32 VAISW_QUANTIZATION_NBIMAGES=1 ./run data/images/bus.jpg --out_file /dev/null --ext pt"
This step generates the yolo5.MP.FP32 snapshot in the SNAP.VE2802_NPU_IP_O00_A304_M3 folder for the YOLOv5 model.

Similarly, you can generate the snapshot for YOLOX using the following command:
$ ./docker/run.bash --acceptLicense -- /bin/bash -c "source npu_ip/settings.sh && cd /home/demo/YOLOX && VAISW_FE_PRECISION=MIXED VAISW_FE_VIEWDTYPEOUTPUT=AUTO VAISW_SNAPSHOT_DIRECTORY=$PWD/SNAP.$NPU_IP/YOLOX.MP.FP32 VAISW_QUANTIZATION_NBIMAGES=1 ./run assets/dog.jpg m --save_result"
This step generates the YOLOX.MP.FP32 snapshot in the SNAP.VE2802_NPU_IP_O00_A304_M3 folder for the YOLOX-m model.
Step 2: Execute YOLOv5 on Board#
Flash the SD card with the V5.1_VE2802_NPU_IP_O00_A304_M3_sd_card.img.gz image. Refer to the Set Up/Flash SD Card section for flashing instructions.
Ensure that the target board (VEK280) is set up and running. Refer to the Target Board Setup section for instructions to set up the board.
Copy yolo5.MP.FP32 to the target board.

Set up the Vitis AI tools environment on the board:

```
$ source /etc/vai.sh
```

Run the vart_ml_runner.py application to execute the YOLOv5 snapshot on the board:

```
$ vart_ml_runner.py --snapshot yolo5.MP.FP32/ --in_zero_copy --out_zero_copy
```
The previous command runs the model with random input and verifies that the snapshot executes on the target board, producing the following logs on the console:
```
root@xilinx-vek280-xsct-20251:~# vart_ml_runner.py --snapshot yolo5.MP.FP32/ --in_zero_copy --out_zero_copy
XAIEFAL: INFO: Resource group Avail is created.
XAIEFAL: INFO: Resource group Static is created.
XAIEFAL: INFO: Resource group Generic is created.
[VART] Allocated config area in DDR: Addr = [ 0x880000000, 0x50000000000, 0x60000000000 ] Size = [ 0x98e211, 0x8383d1, 0x8c8f91]
[VART] Allocated tmp area in DDR: Addr = [ 0x880990000, 0x50080000000, 0x60080000000 ] Size = [ 0xaca801, 0, 0]
[VART] Found snapshot for IP VE2802_NPU_IP_O00_A304_M3 matching running device VE2802_NPU_IP_O00_A304_M3
[VART] Parsing snapshot yolo5.MP.FP32//
[========================= 100% =========================]
[VART]
[VART] Statistics (in ms), 1 sample, batch number 0:
[VART]   wrp_network
Inference took 4 ms
[VART]
[VART] Statistics (in ms), 1 sample, batch number 1:
[VART]   wrp_network : Total 3.99 | AIE 3.51 | CPU sum 0.17
Inference took 4 ms
[VART]
[VART] Statistics (in ms), 1 sample, batch number 2:
[VART]   wrp_network : Total 3.97 | AIE 3.49 | CPU sum 0.17
Inference took 4 ms
.
.
[VART]
[VART] Statistics (in ms), 1 sample, batch number 9:
[VART]   wrp_network : Total 3.94 | AIE 3.50 | CPU sum 0.18
Inference took 4 ms
OK: no error found
[VART]
[VART] board XIL_VEK280_REVB3 (AIE: 304 = 38x8)
[VART] 10 inferences of batch size 1 (the first inference is not used to compute the detailed times)
[VART] 1 input layer. Tensor shape: 1x3x640x640 (INT8)
[VART] 1 output layer. Tensor shape: 1x25200x85 (FLOAT32)
[VART] 1 total subgraph:
[VART]   1 VART (AIE) subgraph
[VART]   0 Framework (CPU) subgraph
[VART] 10 samples
[VART]
[VART] "wrp_network" run summary:
[VART] detailed times in ms
[VART] +-----------------------------------+------------+------------+------------+------------+
[VART] | Performance Summary               |   ms/batch |   ms/batch |   ms/batch |   sample/s |
[VART] |                                   |        min |        max |     median |     median |
[VART] +-----------------------------------+------------+------------+------------+------------+
[VART] | Whole Graph total                 |       3.94 |       4.00 |       3.98 |     251.45 |
[VART] |   VART total ( 1 sub-graph)       |       3.65 |       3.69 |       3.66 |     272.85 |
[VART] |     AI acceleration (*)           |       3.49 |       3.52 |       3.50 |     285.71 |
[VART] |     CPU processing                |       0.16 |       0.18 |       0.16 |            |
[VART] |   Others                          |            |            |       0.16 |            |
[VART] | Others                            |            |            |       0.31 |            |
[VART] +-----------------------------------+------------+------------+------------+------------+
[VART] (min and max are measured individually, only the median sums are meaningful).
[VART] (*) AI Acceleration time includes the transfer to/from the external memories.
root@xilinx-vek280-xsct-20251:~#
```
As shown in the performance summary table, YOLOv5 is fully accelerated on the AIE when using the VE2802_NPU_IP_O00_A304_M3 IP.
Note
The performance summary is visible when VAISW_RUNSESSION_SUMMARY=all is exported.
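The export can be combined with the runner invocation. This dry-run sketch sets the switch and only prints the runner command; remove the leading `echo` to launch the runner on the VEK280 board.

```shell
# Dry-run sketch: export the summary switch, then print the runner command.
# Remove the leading "echo" to launch the runner on the VEK280 board.
export VAISW_RUNSESSION_SUMMARY=all
echo "vart_ml_runner.py --snapshot yolo5.MP.FP32/ --in_zero_copy --out_zero_copy"
```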