PL Stream-Based YOLO Tails#

Starting from Vitis-AI 6.1, the release includes examples of YOLO tails implemented on the Programmable Logic (PL). The following tails are available:

  • NPU_TAIL_YOLO_V5

  • NPU_TAIL_YOLO_V7

  • NPU_TAIL_YOLO_V8

  • NPU_TAIL_YOLO_X

Description#

The user has the following options for YOLO acceleration.

1. Run only the AI part on AIE

In this mode, the AI portion of the model runs on AIE, and the end of the graph (for example, coordinate correction) is handled manually by the user application. This can be implemented using ONNX Runtime or custom user code.
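A minimal sketch of this flow, under heavy assumptions: decode_tail.py stands for hypothetical user code (for example, ONNX Runtime or NumPy based) that applies coordinate correction and non-maximum suppression to the raw head tensors; it is not part of the Vitis-AI release:

# Illustrative only: run the AI-only snapshot on AIE, then decode on the host
$ vart_ml_runner.py --snapshot <ai_only_snapshot>
$ python3 decode_tail.py <raw_head_outputs>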

2. Run the entire model on AIE (AI + tail)

This mode has been available since Vitis-AI 5.1.

To enable this mode, set the precision to MIXED or BF16, because the end of the graph requires extended precision:

VAISW_FE_PRECISION=MIXED

In this configuration, the entire model is accelerated on AIE. However, AIE engines are not optimized for processing the tail of the graph, so the efficiency of this portion is relatively low.
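For instance, assuming the YOLOX demo inside the Vitis-AI docker container that is used later in this document (the exact launch command depends on your application):

$ cd /home/demo/YOLOX
$ VAISW_FE_PRECISION=MIXED ./run assets/dog.jpg m --save_result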

3. Run the AI part on AIE and the tail on PL (stream-based)

Vitis-AI 6.1 introduces a new PL tail architecture in which data is streamed directly from AIE through PL computation logic before reaching DDR. This architecture significantly reduces DDR bandwidth usage and tail latency.

In this mode:

  • AIE computes AI operations.

  • PL logic handles non-AI operations.

This configuration delivers the highest overall performance.

To enable this mode:

  • Set the precision to INT8.

  • Compile the tail of the model and include it as a PL kernel (see the condensed sketch after the note below).

Note

The PL tail must be explicitly compiled and integrated into the system design.
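Both steps are described in detail in the sections that follow. As a quick orientation, a condensed sketch of the build step for the YOLOX tail on the VEK280:

$ source npu_ip/settings.sh VE2802_NPU_IP_O00_A304_M3 NPU_TAIL_YOLO_X
$ make -C examples/reference_design/vek280 all BSP_PATH=<path_to_vek_bsp>.bsp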

Architecture Overview#

Figure: PL stream-based YOLO tail architecture

Goals#

The goal of this example is to provide users with a reference RTL design that demonstrates stream-based processing (YOLO tails). You can use this design as a starting point to implement custom PL processing and connect it to AIE execution.

In Vitis-AI, the PL stream is implemented through a dedicated platform because it could not be integrated as a standard Vitis kernel.

The example targets:

  • The VEK280 board

  • A platform with three DDRs connected to the AIE

The provided build scripts generate a system with:

  • One NPU IP

  • One TAIL IP

Building a TAIL IP with multiple NPU IPs is possible but not implemented in the provided scripts.

Build Design with YOLOX Tail PL#

To enable a PL tail IP during SD card preparation, add the tail IP to the npu_ip settings.

For example, the following commands build an SD card with the full configuration of VE2802 NPU IP and the YOLOX tail:

$ source <path-to-installed-Petalinux-v2025.2>/settings.sh
$ source <path-to-installed-Vitis-v2025.2>/settings64.sh
$ export PATH=$PATH:/usr/sbin
$ export XILINXD_LICENSE_FILE=<path_to_npu_ip_license_file>
$ cd <path_to_Vitis-AI_source_code>/Vitis-AI
$ source npu_ip/settings.sh VE2802_NPU_IP_O00_A304_M3 NPU_TAIL_YOLO_X
$ make -C examples/reference_design/vek280 all BSP_PATH=<path_to_vek_bsp>.bsp

The above commands generate the SD card image VE2802_NPU_IP_O00_A304_M3__YOLO_X_sd_card.img in Vitis-AI/examples/reference_design/vek280/output/.
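One common way to write the generated image to an SD card on a Linux host is with dd; the device name below is a placeholder, and the Installation section describes the supported procedure:

# Replace /dev/sdX with your SD card device; dd overwrites the entire device
$ sudo dd if=examples/reference_design/vek280/output/VE2802_NPU_IP_O00_A304_M3__YOLO_X_sd_card.img \
    of=/dev/sdX bs=4M status=progress conv=fsync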

Note

All tails have been verified on the full configuration VE2802 IP, and the YOLOv5 tail has been verified on the VE2302 IP. Because PL stream compilation involves RTL synthesis, placement, and routing, this phase is more susceptible to timing violations depending on available resources, clock constraints, and device grade. Standard Vivado debugging techniques (directive selection, frequency tuning) may be required.

Snapshot Generation with PL Stream Architecture#

To generate a snapshot compatible with the PL stream architecture, use the following option:

VAISW_IRIZ_TRANSFORMFORCEENABLE=SkipDepthFriendlyLeaf

Once a snapshot is generated, use the script Vitis-AI/bin/pl_stream_config.bash to create a PL configuration file.

Example:

$ cd <path_to_Vitis-AI_source_code>/Vitis-AI
$ source npu_ip/settings.sh VE2802_NPU_IP_O00_A304_M3 NPU_TAIL_YOLO_X
$ ./docker/run.bash --acceptLicense -- /bin/bash -c \
  "source npu_ip/settings.sh && \
  cd /home/demo/YOLOX && \
  VAISW_IRIZ_TRANSFORMFORCEENABLE=SkipDepthFriendlyLeaf \
  VAISW_QUANTIZATION_NBIMAGES=1 \
  VAISW_SNAPSHOT_DIRECTORY=$PWD/YOLOX.b1.pl \
  ./run assets/dog.jpg m --save_result"

# Install the jq tool (skip this step if it is already installed)
$ sudo apt install -y jq

$ bin/pl_stream_config.bash YOLOX.b1.pl > YOLOX.b1.pl/pl_config.json

The above commands generate the YOLOX.b1.pl snapshot, which can then be copied to the VEK280 board for execution.
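Because the jq tool is installed in the flow above, you can also sanity-check the generated PL configuration before copying it to the board (the exact fields are design-specific):

$ jq . YOLOX.b1.pl/pl_config.json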

Running Snapshot on the Board#

The generated PL configuration file can be passed to the embedded software stack using the VAISW_PL_JSON_PATH environment variable.
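For example, it can be exported once for the whole session (the run example below passes it inline on the command line instead):

$ export VAISW_PL_JSON_PATH=YOLOX.b1.pl/pl_config.json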

On the target board:

  • Ensure that you have completed the SD Card and target board setups. Refer to Installation for more information.

  • Insert the SD card into the VEK280 board and power the board on.

  • Log in with the username root and password root.

  • Copy the YOLOX.b1.pl snapshot to the target board. Because the snapshot is a directory, use the recursive option:

    $ scp -r <path_to_YOLOX.b1.pl_folder>/YOLOX.b1.pl root@vek280_board_ip:/root/
    
  • Set up the Vitis AI tools environment on the board.

    $ source /etc/vai.sh
    
  • Export the following environment variable so that the embedded software stack computes performance statistics.

    $ export VAISW_RUNSESSION_SUMMARY=all
    
  • Run the YOLOX tail on the CPU.

    $ cd /root
    $ vart_ml_runner.py --snapshot YOLOX.b1.pl --in_zero_copy --out_zero_copy
    

    The above command executes the YOLOX tail on the CPU and prints the performance summary shown below to the console.

    [VART]
    [VART]           board VE2802 (AIE: 304 = 38x8)
    [VART]           10 inferences of batch size 1 (the first inference is not used to compute the detailed times)
    [VART]           1 input layer. Tensor shape: 1x640x640x4 (INT8)
    [VART]           1 output layer. Tensor shape: 1x8400x85 (FLOAT32)
    [VART]           2 total subgraphs:
    [VART]                   1 VART (AIE) subgraph
    [VART]                   1 Framework (CPU) subgraph
    [VART]           10 samples
    [VART]
    [VART] "wrp_network" run summary:
    [VART]           detailed times in ms
    [VART] +--------------------------------+------------+------------+------------+------------+
    [VART] | Performance Summary            |  ms/batch  |  ms/batch  |  ms/batch  |   sample/s |
    [VART] |                                |    min     |    max     |   median   |   median   |
    [VART] +--------------------------------+------------+------------+------------+------------+
    [VART] | Whole Graph total              |      19.05 |      21.42 |      19.23 |      52.01 |
    [VART] |   VART total (   1 sub-graph)  |       4.19 |       4.23 |       4.20 |     238.04 |
    [VART] |     AI acceleration (*)        |       2.78 |       2.79 |       2.78 |     359.71 |
    [VART] |     CPU processing             |       1.41 |       1.44 |       1.42 |            |
    [VART] |       Output copy (phys->user) |       1.15 |       1.18 |       1.16 |            |
    [VART] |       Others                   |            |            |       0.26 |            |
    [VART] |   OnnxRT CPU (   1 sub-graph)  |      14.71 |      17.06 |      14.90 |            |
    [VART] |   Others                       |            |            |       0.13 |            |
    [VART] +--------------------------------+------------+------------+------------+------------+
    
  • Run the YOLOX tail on the PL.

    $ VAISW_PL_JSON_PATH=YOLOX.b1.pl/pl_config.json VAISW_XRT_DISABLE=true vart_ml_runner.py --snapshot YOLOX.b1.pl --in_zero_copy --out_zero_copy
    

    The above command executes the YOLOX tail on the PL and prints the performance summary shown below to the console.

    [VART]
    [VART]           board VE2802 (AIE: 304 = 38x8)
    [VART]           10 inferences of batch size 1 (the first inference is not used to compute the detailed times)
    [VART]           1 input layer. Tensor shape: 1x640x640x4 (INT8)
    [VART]           1 output layer. Tensor shape: 1x8400x85 (FLOAT32)
    [VART]           1 total subgraph:
    [VART]                   1 VART (AIE) subgraph
    [VART]                   0 Framework (CPU) subgraph
    [VART]           10 samples
    [VART]
    [VART] "wrp_network" run summary:
    [VART]           detailed times in ms
    [VART] +--------------------------------+------------+------------+------------+------------+
    [VART] | Performance Summary            |  ms/batch  |  ms/batch  |  ms/batch  |   sample/s |
    [VART] |                                |    min     |    max     |   median   |   median   |
    [VART] +--------------------------------+------------+------------+------------+------------+
    [VART] | Whole Graph total              |       2.96 |       2.97 |       2.97 |     337.15 |
    [VART] |   VART total (   1 sub-graph)  |       2.88 |       2.89 |       2.88 |     347.34 |
    [VART] |     AI acceleration (*)        |       2.83 |       2.84 |       2.83 |     353.11 |
    [VART] |     CPU processing             |       0.05 |       0.05 |       0.05 |            |
    [VART] |       Others                   |            |            |       0.05 |            |
    [VART] |   Others                       |            |            |       0.09 |            |
    [VART] +--------------------------------+------------+------------+------------+------------+
    

Limitations#

  • Because the PL tail is not integrated as a Vitis kernel, XRT must be disabled during execution:

    VAISW_XRT_DISABLE=true
    
  • There is no simulation model for PL execution. As a result, accuracy cannot be evaluated by running the PL stream path in a CPU-based simulation (see the workaround sketch below).
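    A practical workaround, based on the board runs above, is to execute the same snapshot without VAISW_PL_JSON_PATH so that the tail falls back to the CPU subgraph, and to use that run's outputs as an accuracy reference:

    # Tail executes as an OnnxRT CPU subgraph; compare its outputs with the PL run
    $ vart_ml_runner.py --snapshot YOLOX.b1.pl --in_zero_copy --out_zero_copy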