Runtime Processes and APIs#

This section explains the inference operations and how the VART APIs (the ML and X APIs) work, enabling you to adapt them to the specific needs of your use case.

Embedded CPU (APU) Operations#

The embedded CPU (APU) does not handle inference computations. However, it performs the following tasks:

  • Initializing the neural network configuration (weights and layer settings) in DDR, in the NPU IP configuration registers, and in the AIE configuration registers.

  • Adapting the input and output buffers to the NPU’s internal buffer format:

    - The format is determined by the shape of the buffers.
    - No adaptation is required if the native buffer format is supported by the NPU.

  • Performing quantization or dequantization operations on the input or output buffers (see the sketch after this list):

    - The NPU operates on INT8/BF16 buffers; therefore, float32 buffers from the application must be converted to INT8/BF16.
    - No conversion is required if the application’s native buffers are already in INT8/BF16.

  • Starting and polling the inference:

    - If the graph is not fully accelerated due to unsupported operations, it is split into FPGA sub-graphs and CPU sub-graphs. The CPU sub-graphs execute on the CPU to complete the original graph.
    - During the preparation phase, the NPU reports the number of sub-graphs that fall back to the CPU.
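
The following is a minimal sketch of the quantization and dequantization step described above, assuming a simple affine INT8 mapping; the scale and zero_point parameters are hypothetical and would normally come from the compiled model’s metadata.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantize a float32 buffer to INT8 with an affine mapping:
//   q = round(x / scale) + zero_point, clamped to [-128, 127].
// scale and zero_point are hypothetical values; in practice they come
// from the quantized model.
std::vector<int8_t> quantize(const std::vector<float>& input,
                             float scale, int32_t zero_point) {
  std::vector<int8_t> output(input.size());
  for (size_t i = 0; i < input.size(); ++i) {
    int32_t q = static_cast<int32_t>(std::lround(input[i] / scale)) + zero_point;
    output[i] = static_cast<int8_t>(std::clamp(q, -128, 127));
  }
  return output;
}

// Dequantize INT8 results back to float32 for the application:
//   x = (q - zero_point) * scale.
std::vector<float> dequantize(const std::vector<int8_t>& input,
                              float scale, int32_t zero_point) {
  std::vector<float> output(input.size());
  for (size_t i = 0; i < input.size(); ++i) {
    output[i] = (static_cast<int32_t>(input[i]) - zero_point) * scale;
  }
  return output;
}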

VART APIs#

The Vitis AI Runtime (VART) provides a set of API functions that integrate the NPU into software applications and offer a unified high-level runtime for embedded targets. Key features include:

  • Asynchronous submission of jobs to the NPU.

  • Asynchronous retrieval of job results from the NPU.

  • Implementations in both C++ and Python.

  • Support for multi-threading and multi-process execution (planned for future releases).

Following is a pseudocode example that illustrates how to use the VART APIs for inference:

// Create a runner; this reads the model/snapshot
auto runner = vart::Runner::create_runner(<path_to_snapshot>);
// Submit an inference job; execute_async returns a (job id, status) pair
auto job_id = runner->execute_async(inputsPtr, outputsPtr);
// Block until the inference job completes (a timeout of -1 waits indefinitely)
runner->wait(job_id.first, -1);

In this example, the application creates a runner instance that reads the model or snapshot. The runner exposes the execute_async API, which submits an inference job, accepting input and output pointers as arguments. The input pointer carries details such as the model’s input buffer and the batch size, while the output pointer identifies the buffer that receives the inference results. Because execute_async returns a job identifier rather than blocking, the wait API is then used to block until the inference job completes; a timeout of -1 waits indefinitely.
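
To make the buffer handling more concrete, the following sketch expands the pseudocode into a fuller flow: it queries the runner for its input and output tensors and wraps application memory in tensor buffers before submitting the job. get_input_tensors, get_output_tensors, execute_async, and wait are part of the VART Runner interface; prepare_tensor_buffers is a hypothetical helper (the Vitis AI samples define similar utilities), and the header names and snapshot path are assumptions.

#include <memory>
#include <utility>
#include <vector>

#include <vart/runner.hpp>        // assumed header providing vart::Runner
#include <xir/tensor/tensor.hpp>  // assumed header providing xir::Tensor

// Hypothetical helper: allocates host memory matching each tensor’s
// shape and wraps it in a vart::TensorBuffer.
std::vector<vart::TensorBuffer*> prepare_tensor_buffers(
    const std::vector<const xir::Tensor*>& tensors);

int main() {
  // Create the runner from the compiled snapshot (path is a placeholder).
  auto runner = vart::Runner::create_runner("<path_to_snapshot>");

  // Query the tensors the model expects and produces.
  auto input_tensors = runner->get_input_tensors();
  auto output_tensors = runner->get_output_tensors();

  // Wrap application-owned buffers for the runner to consume and fill.
  auto inputs = prepare_tensor_buffers(input_tensors);
  auto outputs = prepare_tensor_buffers(output_tensors);

  // Submit the job; execute_async() returns a (job id, status) pair.
  auto job_id = runner->execute_async(inputs, outputs);

  // Block until this job finishes; a timeout of -1 waits indefinitely.
  runner->wait(job_id.first, -1);
  return 0;
}

Because execute_async returns immediately, an application can submit several frames back-to-back and wait on each job identifier in turn, overlapping host-side preparation with NPU execution.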

The VART namespace includes the ML and X APIs. The VART ML APIs execute inference, while the VART X APIs run pre-processing and post-processing functions. Pre-processing covers tasks such as color space conversion, resizing, and normalization of the input frame data before inference; post-processing interprets the model’s output data to produce more understandable predictions, such as class labels or bounding boxes.
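
For illustration, the following sketch shows typical pre-processing and post-processing written directly against OpenCV and the C++ standard library rather than the VART X APIs themselves; the input shape, mean, and scale values are assumptions standing in for the deployed model’s actual configuration.

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

#include <opencv2/imgproc.hpp>

// Pre-processing: BGR -> RGB conversion, resize to the model input
// shape, and per-channel normalization into INT8. The shape, mean,
// and scale constants below are assumed values.
std::vector<int8_t> preprocess(const cv::Mat& bgr_frame) {
  const int kWidth = 224, kHeight = 224;
  const float kMean[3] = {123.68f, 116.78f, 103.94f};
  const float kScale = 0.5f;

  cv::Mat rgb, resized;
  cv::cvtColor(bgr_frame, rgb, cv::COLOR_BGR2RGB);      // color space conversion
  cv::resize(rgb, resized, cv::Size(kWidth, kHeight));  // resizing

  std::vector<int8_t> out(kWidth * kHeight * 3);
  for (int y = 0; y < kHeight; ++y) {
    const uint8_t* row = resized.ptr<uint8_t>(y);
    for (int x = 0; x < kWidth * 3; ++x) {
      out[y * kWidth * 3 + x] =
          static_cast<int8_t>((row[x] - kMean[x % 3]) * kScale);  // normalization
    }
  }
  return out;
}

// Post-processing: interpret classification logits by returning the
// index of the highest-scoring class.
int top1_class(const std::vector<float>& logits) {
  return static_cast<int>(std::distance(
      logits.begin(), std::max_element(logits.begin(), logits.end())));
}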

The upcoming sections provide more information about the VART ML and X APIs.