Asynchronous Inference Execution

Overview

This guide covers asynchronous multi-frame inference on Versal AI Edge Series Gen 2 using the VART-ML vart::Runner APIs. It describes how synchronous and asynchronous execution differ and when to use each, then walks through the vart_infer_async reference application — a C++ sample that demonstrates pipelined async inference.

Note

Complete Run Your First Inference before this chapter. Review synchronous Runner::execute usage in VART Application Development before proceeding.

Synchronous and Asynchronous Inference

vart::Runner supports both synchronous and asynchronous inference execution (see VART ML APIs).

An NPU compute unit is a 4×4 block of 16 AI Engine tiles on Versal AI Edge Series Gen 2. When the AMD Vitis™ AI compiler compiles a model, each partition is placed on one or more of these blocks.

Synchronous inference: Runner::execute runs one batch on the NPU compute units and blocks the calling thread until that batch completes. This is the simplest path for single-frame debug or single-shot inference.

Asynchronous inference: Runner::execute_async submits a batch and returns immediately with a JobHandle (see VART ML APIs). The host can stage the next batch or perform other work while the NPU compute units run. Completion is reported by:

  • Polling — call Runner::wait(job_handle, timeout) on the submitted handle. A return value of StatusCode::JOB_PENDING means the job is still running; poll again or use a longer timeout.

  • Callback — pass an ExecuteAsyncCallback (see VART Application Development) to the execute_async overload; the runtime invokes it on an internal worker thread when that job finishes.

For new applications, start with polling (wait). Use callbacks when you want the runtime to notify your code on a worker thread instead of polling handles.
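The polling pattern can be sketched without the board. The block below is a minimal illustration, not the VART-ML API: MockRunner, its constructor argument, the OK status value, and poll_until_done are hypothetical stand-ins, and the real Runner::execute_async and Runner::wait signatures may differ. Only the JOB_PENDING name and the "poll again while pending" behavior come from the text above.

```cpp
#include <cstdint>
#include <map>

// Hypothetical stand-ins for illustration only. The real StatusCode, JobHandle,
// and Runner types come from VART-ML and may have different signatures.
enum class StatusCode { OK, JOB_PENDING };
using JobHandle = std::uint32_t;

class MockRunner {
public:
    // Each submitted job reports JOB_PENDING for `polls_needed` wait() calls.
    explicit MockRunner(int polls_needed) : polls_needed_(polls_needed) {}

    // Returns immediately with a handle, like an asynchronous submission.
    JobHandle execute_async() {
        JobHandle handle = next_handle_++;
        remaining_[handle] = polls_needed_;
        return handle;
    }

    // Reports JOB_PENDING until the simulated job has finished.
    StatusCode wait(JobHandle handle, int /*timeout_ms*/) {
        int& left = remaining_.at(handle);
        if (left > 0) {
            --left;
            return StatusCode::JOB_PENDING;
        }
        return StatusCode::OK;
    }

private:
    int polls_needed_;
    JobHandle next_handle_ = 0;
    std::map<JobHandle, int> remaining_;
};

// Polls one handle until it is no longer pending.
// Returns the number of wait() calls made, or -1 if max_polls was exhausted.
int poll_until_done(MockRunner& runner, JobHandle handle, int timeout_ms, int max_polls) {
    for (int polls = 1; polls <= max_polls; ++polls) {
        if (runner.wait(handle, timeout_ms) != StatusCode::JOB_PENDING) {
            return polls;
        }
        // The host is free to stage the next batch here before polling again.
    }
    return -1;
}
```

Bounding the number of polls keeps a stuck job from hanging the caller; whether to use many short timeouts or one long timeout depends on how much other work the host thread has.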

Several jobs can be in flight at once, each with its own input and output tensor buffers. While the NPU compute units run one batch, the CPU can queue the next; as each job finishes, the CPU collects its result without waiting for the other jobs to complete. Keeping multiple jobs in flight this way keeps the NPU compute units busy. When the runner reports StatusCode::RESOURCE_UNAVAILABLE, all internal execution slots are busy; retry the submission after a job completes.
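The retry rule for a full runner can be sketched as follows. This is a mock under stated assumptions: SlotLimitedRunner, wait_oldest, and submit_with_retry are hypothetical names, and returning the status directly from execute_async is an assumption about how the condition is surfaced, not a statement of the real API. Only the RESOURCE_UNAVAILABLE name and the "retry after a job completes" behavior come from the text.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical stand-in; the real StatusCode enum lives in VART-ML.
enum class StatusCode { OK, RESOURCE_UNAVAILABLE };

// Mock runner with a fixed number of execution slots.
class SlotLimitedRunner {
public:
    explicit SlotLimitedRunner(std::size_t capacity) : capacity_(capacity) {}

    // Assumption: the "slots full" condition is returned from the submit call.
    StatusCode execute_async(int& job_out) {
        if (in_flight_.size() >= capacity_) {
            return StatusCode::RESOURCE_UNAVAILABLE;
        }
        job_out = next_id_++;
        in_flight_.push_back(job_out);
        return StatusCode::OK;
    }

    // Simulates waiting for the oldest job, which frees its slot.
    void wait_oldest() {
        if (!in_flight_.empty()) {
            in_flight_.pop_front();
        }
    }

    std::size_t capacity() const { return capacity_; }
    std::size_t in_flight_count() const { return in_flight_.size(); }

private:
    std::size_t capacity_;
    std::deque<int> in_flight_;  // job ids, oldest first
    int next_id_ = 0;
};

// Submits one job; when all slots are busy, completes the oldest job and retries.
// Returns the number of retries, or -1 if the runner has no slots at all.
int submit_with_retry(SlotLimitedRunner& runner, int& job_out) {
    if (runner.capacity() == 0) {
        return -1;  // retrying could never succeed
    }
    int retries = 0;
    while (runner.execute_async(job_out) == StatusCode::RESOURCE_UNAVAILABLE) {
        runner.wait_oldest();
        ++retries;
    }
    return retries;
}
```

Waiting on the oldest job before retrying avoids a busy loop: the retry happens only after a slot has actually been freed.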

Figure (timeline comparing synchronous execute with asynchronous execute_async and wait): Host and NPU activity over time. With execute, the CPU blocks during each run on the NPU compute units. With execute_async, the host can submit the next batch while earlier batches still run on the NPU compute units.

Note

Use execute_async when you need to submit the next batch before the previous one completes—for example, when inputs arrive from a stream or queue. Use execute when you need the simplest single-batch path or are debugging one frame at a time.

Demonstration Example: vart_infer_async

The vart_infer_async sample demonstrates async pipelined inference. Key constraints:

  • One batch loaded: Reads a single IFM batch from --input-binary, or fills random data with --dry-run.

  • Repeated submissions: Issues --num-iteration execute_async calls on that same in-memory batch — reading new frames from disk on each submission is left to the application.

  • Polling only: Uses the execute_async + Runner::wait path. For the callback-based overload and code examples, see VART Application Development.

See the sample source and sample README for build steps and the full command-line reference.

Note

The sample supports models with only one input tensor.

First-run checklist

Use this sequence for your first end-to-end run:

  1. Prerequisites — Finish Run Your First Inference (board programmed, runtime environment working).

  2. Build — Cross-compile vart_infer_async on the host; see Build Applications in Reference Applications and the sample README.

  3. Deploy — Copy the binary and a VAIML-compiled model (.rai file or cache directory) to the board. Set LD_LIBRARY_PATH as described in the sample README.

  4. Run — Start with --dry-run (below) or supply --input-binary once you have a raw IFM file.

  5. Verify — Exit code 0 and a Batches completed: <count> / <count> line in the log, where <count> matches --num-iteration.

Quick start (dry-run)

Run these commands on the board after you deploy the binary and model (checklist step 3).

If the board is set up and a demo model is installed (as in the quick start guide), verify async inference with no IFM file:

vart_infer_async --model-path /etc/vai/models/resnet50_int8/resnet50_int8.rai --dry-run -n 20

-n 20 means 20 repeated async submissions on the same in-memory batch—not 20 different frames from a file.

Expect exit code 0 and a line similar to Batches completed: 20 / 20.
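When the run is scripted, the summary check can be automated. The helper below is a sketch that assumes the summary line contains the text "Batches completed: <done> / <total>" as shown above; the function name all_batches_completed is hypothetical, and the exact log format should be confirmed against the sample README.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Returns true when the line reports `expected` completed batches out of `expected`.
// Any prefix before the marker text is ignored.
bool all_batches_completed(const std::string& log_line, int expected) {
    const std::string marker = "Batches completed:";
    const std::size_t pos = log_line.find(marker);
    if (pos == std::string::npos) {
        return false;  // not a summary line
    }
    int done = 0;
    int total = 0;
    const char* rest = log_line.c_str() + pos + marker.size();
    if (std::sscanf(rest, " %d / %d", &done, &total) != 2) {
        return false;  // marker found, but the counts are malformed
    }
    return done == expected && total == expected;
}
```

For the command above, the expected value is 20, matching -n 20.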

Sample output

Dry-run or normal async run — exit code 0 and a summary line:

Batches completed: 20 / 20

With --input-binary (and without --dry-run or --benchmark), the sample writes OFM files (output_f*_*.bin) for the last completed submission. See the sample README for naming and layout.

With --benchmark, the log prints wall-clock timing for both paths after the async pass, for example:

[INFO] async infer pass: 123.456 ms (12.346 ms/frame)
[INFO] sync infer pass: 234.567 ms (23.457 ms/frame)

Lower total time on the async line indicates overlap between submission and NPU execution. Exact values depend on model, batch size, and board load.
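The per-frame figures and the async gain follow from simple arithmetic on the two totals. The sketch below uses hypothetical names (PassTiming, ms_per_frame, async_speedup) and is not part of the sample. The example figures above are consistent with 10 submissions: 123.456 ms / 10 is about 12.346 ms/frame, and 234.567 / 123.456 is a sync-to-async ratio of about 1.9.

```cpp
// Total wall-clock time of one pass and the number of submissions it covered.
struct PassTiming {
    double total_ms;
    int frames;
};

// Average time per submission; returns 0.0 for an empty pass.
double ms_per_frame(const PassTiming& t) {
    return t.frames > 0 ? t.total_ms / t.frames : 0.0;
}

// Ratio of sync time to async time; values above 1.0 mean the async pass was faster.
double async_speedup(const PassTiming& async_pass, const PassTiming& sync_pass) {
    return async_pass.total_ms > 0.0 ? sync_pass.total_ms / async_pass.total_ms : 0.0;
}
```

A ratio close to 1.0 suggests little overlap was achieved, for example when host-side submission cost is small relative to NPU execution time.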

Run with an input binary

After a successful dry-run, run with a compiled model and a raw IFM file:

vart_infer_async \
  --model-path /path/to/resnet50_int8.rai \
  --input-binary /path/to/ifm.bin

--input-binary supplies a raw IFM (input feature map) binary — a headerless byte stream in the compiled model’s HW tensor layout (TensorType::HW, see VART Application Development).

Creating an IFM binary: Export preprocessed input from your application pipeline in that HW layout and save it as a raw .bin file. For row size, batch layout, and example assets, see the sample README. For your first run, use --dry-run instead of preparing a file.
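Writing a headerless file means dumping the tensor bytes verbatim, with no length prefix or metadata. The sketch below shows only that file-writing step. The function names, the int8 element type, and the buffer size used in the usage note are assumptions for illustration; the actual byte count and ordering must match the compiled model's HW layout as described in the sample README.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Writes the buffer verbatim: no header, no padding, no length prefix.
// Assumption: `data` is already preprocessed and ordered in the model's HW layout.
bool write_raw_ifm(const std::string& path, const std::vector<std::int8_t>& data) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) {
        return false;  // could not open the file for writing
    }
    out.write(reinterpret_cast<const char*>(data.data()),
              static_cast<std::streamsize>(data.size()));
    return static_cast<bool>(out);
}

// Returns the file size in bytes, or -1 if the file cannot be opened.
// Useful as a sanity check against the size the model expects.
long long file_size_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) {
        return -1;
    }
    return static_cast<long long>(in.tellg());
}
```

As a usage example, writing a hypothetical 224 x 224 x 3 int8 buffer should produce a file of exactly 150528 bytes; any difference from the expected size indicates a header, padding, or element-type mismatch.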

Command-line options

| Option | Required | Default | Description |
| --- | --- | --- | --- |
| --model-path | Yes | | Path to a VAIML .rai file or compiled model cache directory. |
| --input-binary | Yes* | | Path to a raw IFM binary in the compiled model’s HW layout (see Run with an input binary above). Not used with --dry-run. |
| -n, --num-iteration | No | 10 | Number of execute_async submissions that replay the same loaded batch. |
| -d, --dry-run | No | | Random IFM fill; skips file I/O and OFM writes. |
| --benchmark | No | | Runs an additional synchronous execute pass for comparison; skips OFM file writes. |
| -h, --help | No | | Print help and exit. |

* --input-binary is not required when --dry-run is set.
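The interaction between these options can be summarized in a few lines. This is a sketch of the rules stated in the table, not the sample's actual argument handling; the Options struct, validate, and writes_ofm_files are hypothetical names.

```cpp
#include <string>

// Parsed command-line state; defaults follow the table above.
struct Options {
    std::string model_path;
    std::string input_binary;
    int num_iteration = 10;
    bool dry_run = false;
    bool benchmark = false;
};

// Returns an empty string when the combination is valid, else an error message.
std::string validate(const Options& o) {
    if (o.model_path.empty()) {
        return "--model-path is required";
    }
    if (!o.dry_run && o.input_binary.empty()) {
        return "--input-binary is required unless --dry-run is set";
    }
    return "";
}

// OFM files are written only for a real-input run without --benchmark.
bool writes_ofm_files(const Options& o) {
    return !o.dry_run && !o.benchmark;
}
```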

Benchmark async vs sync execution

vart_infer_async \
  --model-path /path/to/model.rai \
  --input-binary /path/to/ifm.bin \
  --benchmark -n 100

Run on the board. --benchmark adds a synchronous execute pass on the same loaded batch. Compare the [INFO] async infer pass and [INFO] sync infer pass lines in the log — see Sample output for the expected format and how to interpret the results.

What the sample does

At a high level, the application uses these VART-ML APIs:

  1. Runner and tensors — Creates a vart::Runner from the compiled model (via vart::RunnerFactory, see VART ML APIs) and preallocates HW vart::NpuTensor (see VART Application Development) input/output buffers for each concurrent job slot.

  2. Load input once — Reads one IFM (input feature map) batch from --input-binary (or fills random data with --dry-run) and copies it into the input tensors.

  3. Submit and wait — Queues each inference with Runner::execute_async, which returns a JobHandle immediately. The application then calls Runner::wait on each handle to collect results. By default, two jobs stay in flight (kNumConcurrentJobs in source; not configurable from the CLI): while the NPU runs one batch, the CPU can submit the next.

  4. Replay the same input: --num-iteration sets how many execute_async calls run on that same loaded batch. The sample does not read new input from disk on each call.

  5. Optional sync comparison — With --benchmark, the sample also calls Runner::execute on the same data.

The pipeline has three phases:

  1. Prime — Submit Runner::execute_async until all job slots are full (default: 2).

  2. Steady state — For each remaining submission, call Runner::wait to complete the oldest job, then Runner::execute_async on the freed slot.

  3. Drain — Call Runner::wait until every in-flight job completes (no new submissions).

Populate input data in all buffer slots
                |
                v
[1. Prime]  Runner::execute_async (fill all job slots)
                |
                v
+------- [2. Steady state] ------------------+
|                |                           |
|                v                           |
|        Runner::wait (oldest job)           |
|                |                           |
|                v                           |
|   Runner::execute_async (if submissions    |
|    remain for --num-iteration)             |
|                |                           |
+----------------+---------------------------+
                |  (all submissions issued)
                v
[3. Drain]  Runner::wait (until queue empty)
                |
                v
              done
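The three phases can be traced with a mock runner that records the order of calls. This is a sketch of the control flow only, not the sample's source: TraceRunner, InFlight, and run_pipeline are hypothetical names, the mock's execute_async and wait take simplified arguments, and real code would also pass the slot's input and output tensors and check status codes.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical mock that logs each call instead of touching hardware.
struct TraceRunner {
    std::vector<std::string> log;
    int next_job = 0;

    int execute_async(std::size_t slot) {
        int job = next_job++;
        log.push_back("submit job " + std::to_string(job) + " slot " + std::to_string(slot));
        return job;
    }

    void wait(int job) { log.push_back("wait job " + std::to_string(job)); }
};

// One in-flight job and the buffer slot it occupies.
struct InFlight {
    int job;
    std::size_t slot;
};

// Runs num_iteration submissions over num_slots buffer sets.
// Returns the number of completed jobs.
int run_pipeline(TraceRunner& runner, int num_iteration, std::size_t num_slots) {
    if (num_slots == 0) {
        return 0;  // nothing can be submitted without a buffer slot
    }
    std::deque<InFlight> queue;  // oldest job at the front
    int submitted = 0;
    int completed = 0;

    // 1. Prime: fill every job slot.
    while (submitted < num_iteration && queue.size() < num_slots) {
        std::size_t slot = queue.size();
        queue.push_back({runner.execute_async(slot), slot});
        ++submitted;
    }

    // 2. Steady state: complete the oldest job, then reuse its slot.
    while (submitted < num_iteration) {
        InFlight oldest = queue.front();
        queue.pop_front();
        runner.wait(oldest.job);
        ++completed;
        queue.push_back({runner.execute_async(oldest.slot), oldest.slot});
        ++submitted;
    }

    // 3. Drain: no new submissions, wait for everything still in flight.
    while (!queue.empty()) {
        runner.wait(queue.front().job);
        queue.pop_front();
        ++completed;
    }
    return completed;
}
```

With 5 submissions and 2 slots, the recorded order is: submit 0, submit 1, wait 0, submit 2, wait 1, submit 3, wait 2, submit 4, wait 3, wait 4. Reusing the slot of the job that just completed ensures a buffer set is never overwritten while its job is still running.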