VART-ML
0.3.0
Abstract base class for executing model inference operations.
#include <vart_runner_factory.hpp>
Public Types
| using | ExecuteAsyncCallback = std::function< void(const JobHandle &)> |
| Type alias for the callback function used in asynchronous execution operations. |
Public Member Functions
| virtual | ~Runner ()=default |
| Destroys the Runner object. |
| virtual const std::vector< NpuTensorInfo > & | get_tensors_info (TensorDirection direction, TensorType type) const =0 |
| Unified API to retrieve tensor information based on direction and tensor type (CPU/HW). |
| virtual const NpuTensorInfo & | get_tensor_info_by_name (const std::string &tensor_name, TensorType type) const =0 |
| Unified API to retrieve tensor information by name and tensor type (CPU/HW). |
| virtual const QuantParameters & | get_quant_parameters (const std::string &tensor_name) const =0 |
| Retrieves the quantization parameters for a specific tensor. |
| virtual size_t | get_num_input_tensors () const =0 |
| Returns the number of input tensors. |
| virtual size_t | get_num_output_tensors () const =0 |
| Returns the number of output tensors. |
| virtual size_t | get_batch_size () const =0 |
| Returns the device batch size. |
| virtual StatusCode | execute (const std::vector< std::vector< NpuTensor >> &inputs, std::vector< std::vector< NpuTensor >> &outputs) noexcept=0 |
| Executes the main computation using the provided input tensors and produces output tensors. |
| virtual JobHandle | execute_async (const std::vector< std::vector< NpuTensor >> &inputs, std::vector< std::vector< NpuTensor >> &outputs) noexcept=0 |
| Executes the job asynchronously with the given input tensors. |
| virtual StatusCode | wait (const JobHandle &job_handle, std::chrono::milliseconds timeout) noexcept=0 |
| Waits for the completion of an asynchronous job. |
| virtual JobHandle | execute_async (const std::vector< std::vector< NpuTensor >> &inputs, std::vector< std::vector< NpuTensor >> &outputs, ExecuteAsyncCallback cb) noexcept=0 |
| Executes the operation asynchronously with the given input tensors. |
| virtual NpuTensor | allocate_npu_tensor (const NpuTensorInfo &info) const =0 |
| Allocates memory for an NPU tensor. |
| virtual NpuTensor | allocate_sub_tensor (const NpuTensor &parent, const NpuTensorInfo &info, size_t offset) const =0 |
| Creates a sub-tensor from a parent tensor with the specified metadata and offset. |
Protected Member Functions
| Runner (const std::string &model_path, const std::unordered_map< std::string, std::any > &options={}) |
| Constructs a Runner object with the specified model path and options. |
Abstract base class for executing model inference operations.
The Runner class defines a unified interface for running synchronous and asynchronous inference tasks on machine learning models. It provides methods for retrieving tensor metadata, executing computations, and managing asynchronous job execution.
Key Features:
Runner instances are thread-safe and designed to be shared across threads via std::shared_ptr<Runner>. Use RunnerFactory::create_runner() to obtain an instance.
Type alias for the callback function used in asynchronous execution operations.
This callback function is invoked when an asynchronous operation completes. The callback receives a const reference to the JobHandle containing the completion status and job identifier.
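As a rough illustration of how a callback of this shape might be built, the sketch below records which jobs succeeded and which failed. The `sketch::JobHandle` struct, its `job_id` and `status` fields, and the convention that `0` means success are all assumptions made for the example; the real JobHandle layout is not shown on this page. The thread that invokes the callback is not stated here either, so the sketch locks defensively.

```cpp
#include <cstdint>
#include <functional>
#include <mutex>
#include <vector>

namespace sketch {

// Hypothetical stand-in for the real JobHandle; field names are assumed.
struct JobHandle {
    std::uint64_t job_id;
    int status;  // assumption for this sketch: 0 means success
};

// Same shape as the ExecuteAsyncCallback alias documented above.
using ExecuteAsyncCallback = std::function<void(const JobHandle&)>;

// Records completed jobs; must outlive every job that holds its callback.
class CompletionLog {
public:
    ExecuteAsyncCallback make_callback() {
        return [this](const JobHandle& job) {
            std::lock_guard<std::mutex> lock(mutex_);
            (job.status == 0 ? succeeded_ : failed_).push_back(job.job_id);
        };
    }

    std::vector<std::uint64_t> succeeded() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return succeeded_;
    }

    std::vector<std::uint64_t> failed() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return failed_;
    }

private:
    mutable std::mutex mutex_;
    std::vector<std::uint64_t> succeeded_;
    std::vector<std::uint64_t> failed_;
};

}  // namespace sketch
```

Capturing `this` keeps the callback cheap to copy, at the cost of requiring the log object to stay alive until all pending jobs have completed.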
|
inline explicit protected |
Constructs a Runner object with the specified model path and options.
| model_path | The file path to the model to be used by the Runner. |
| options | Optional configuration parameters for the Runner, provided as a map of string keys to values of any type. |
| std::runtime_error | if Runner initialization fails. |
|
virtual default |
Destroys the Runner object.
|
pure virtual |
Allocates memory for an NPU tensor.
This API allocates contiguous tensor memory backed by an XRT Buffer Object (MemoryType::XRT_BO). The user should copy input data into the allocated buffer before passing it to execute/execute_async.
| info | The metadata associated with the tensor. |
| std::runtime_error | if tensor allocation fails. |
| std::invalid_argument | if the provided NpuTensorInfo is invalid. |
|
pure virtual |
Creates a sub-tensor from a parent tensor with the specified metadata and offset.
The returned sub-tensor is a fully usable NpuTensor and can be passed to execute/execute_async like any other NpuTensor.
| parent | The parent tensor from which the sub-tensor will be created. |
| info | The metadata for the sub-tensor. |
| offset | The offset in bytes from the start of the parent tensor's buffer. |
| std::runtime_error | if sub-tensor creation fails. |
| std::invalid_argument | if the provided arguments are invalid. |
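Because the offset is expressed in bytes from the start of the parent buffer, a caller may want to check that the sub-tensor region fits before requesting it. The helper below is a minimal sketch of that arithmetic. The name `fits_in_parent` is hypothetical, and how the byte sizes are obtained from NpuTensor or NpuTensorInfo is not shown on this page, so the sizes are passed in as plain numbers.

```cpp
#include <cstddef>

namespace sketch {

// Returns true when the byte range [offset, offset + sub_bytes) lies inside
// a parent buffer of parent_bytes. The comparison is arranged so that
// offset + sub_bytes is never computed and therefore cannot overflow.
constexpr bool fits_in_parent(std::size_t parent_bytes,
                              std::size_t offset,
                              std::size_t sub_bytes) {
    return offset <= parent_bytes && sub_bytes <= parent_bytes - offset;
}

}  // namespace sketch
```

For example, a 512-byte sub-tensor fits at offset 512 of a 1024-byte parent, but not at offset 513.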
allocate_npu_tensor. info must not exceed the parent tensor's buffer bounds.
Memory Management:
|
pure virtual noexcept |
Executes the main computation using the provided input tensors and produces output tensors.
This method is responsible for performing the actual inference or computation using the specified input tensors and generating the corresponding output tensors.
| inputs | A constant reference to a vector of input NpuTensor objects, vector dimensions: [batch][tensors]. The outer vector size must be between 1 and get_batch_size() (inclusive). Each inner vector must contain one NpuTensor per model input. |
| outputs | A reference to a vector of NpuTensor objects where the outputs will be stored, vector dimensions: [batch][tensors]. The outer vector size must be between 1 and get_batch_size() (inclusive). Each inner vector must contain one NpuTensor per model output. |
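The [batch][tensors] layout rules above can be checked before calling execute. The following is a sketch only: `has_valid_batch_shape` is a hypothetical helper, written as a template so it does not depend on the real NpuTensor type. In practice the two size arguments would presumably come from get_batch_size() and get_num_input_tensors() or get_num_output_tensors().

```cpp
#include <cstddef>
#include <vector>

namespace sketch {

// Checks the [batch][tensors] layout: the outer vector holds between 1 and
// batch_size samples, and every inner vector holds exactly
// tensors_per_sample entries.
template <typename Tensor>
bool has_valid_batch_shape(const std::vector<std::vector<Tensor>>& batch,
                           std::size_t batch_size,
                           std::size_t tensors_per_sample) {
    if (batch.empty() || batch.size() > batch_size) {
        return false;
    }
    for (const auto& sample : batch) {
        if (sample.size() != tensors_per_sample) {
            return false;
        }
    }
    return true;
}

}  // namespace sketch
```

Running the same check on both `inputs` and `outputs` catches an empty batch, an oversized batch, or a ragged inner vector before anything is submitted to the device.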
|
pure virtual noexcept |
Executes the job asynchronously with the given input tensors.
This method initiates an asynchronous operation using the provided input tensors, and stores the results in the output tensors. The function returns a handle to the asynchronous job, allowing the caller to track or manage its execution.
| inputs | A constant reference to a vector of input tensors required for the job, vector dimensions: [batch][tensors]. The outer vector size must be between 1 and get_batch_size() (inclusive). Each inner vector must contain one NpuTensor per model input. |
| outputs | A reference to a vector where the output tensors will be stored upon completion, vector dimensions: [batch][tensors]. The outer vector size must be between 1 and get_batch_size() (inclusive). Each inner vector must contain one NpuTensor per model output. |
|
pure virtual noexcept |
Executes the operation asynchronously with the given input tensors.
This method starts the asynchronous execution of the operation using the provided input tensors. The results will be stored in the output tensors, and the specified callback will be invoked upon completion.
| inputs | A constant reference to a vector of input tensors to be processed, vector dimensions: [batch][tensors]. The outer vector size must be between 1 and get_batch_size() (inclusive). Each inner vector must contain one NpuTensor per model input. |
| outputs | A reference to a vector where the output tensors will be stored, vector dimensions: [batch][tensors]. The outer vector size must be between 1 and get_batch_size() (inclusive). Each inner vector must contain one NpuTensor per model output. |
| cb | A callback function to be called when the asynchronous execution is complete. |
|
pure virtual |
Returns the device batch size.
The device batch size is the number of input/output sets the NPU processes in a single inference call as a batch. This value determines the maximum outer vector size accepted by execute() and execute_async().
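Since the batch size caps the outer vector size, a workload with more samples than that has to be split across several calls. The sketch below shows one way to compute the index ranges; `split_into_batches` is a hypothetical helper, and the batch size is passed in as a plain number rather than read from a Runner.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

namespace sketch {

// Returns half-open [begin, end) index ranges that cover total_samples,
// each range holding at most batch_size samples. The last range may be
// shorter than batch_size.
inline std::vector<std::pair<std::size_t, std::size_t>>
split_into_batches(std::size_t total_samples, std::size_t batch_size) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    if (batch_size == 0) {
        return ranges;  // no valid batching exists for a zero batch size
    }
    for (std::size_t begin = 0; begin < total_samples; begin += batch_size) {
        ranges.emplace_back(begin, std::min(begin + batch_size, total_samples));
    }
    return ranges;
}

}  // namespace sketch
```

With 10 samples and a batch size of 4, this yields the ranges [0, 4), [4, 8), and [8, 10), so the final call carries a partial batch of 2.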
|
pure virtual |
Returns the number of input tensors.
This method retrieves the number of input tensors required by the model or operation.
|
pure virtual |
Returns the number of output tensors.
This method retrieves the number of output tensors produced by the model or operation.
|
pure virtual |
Retrieves the quantization parameters for a specific tensor.
This method retrieves the quantization parameters for a tensor identified by its name.
| tensor_name | The name of the tensor for which to retrieve quantization parameters. |
| std::runtime_error | if quantization parameters are not found for the tensor. |
|
pure virtual |
Unified API to retrieve tensor information by name and tensor type (CPU/HW).
This method retrieves tensor information for a specific tensor identified by name, with the ability to specify whether to retrieve CPU or HW tensor information.
| tensor_name | The name of the tensor for which to retrieve information. |
| type | Specifies whether to retrieve CPU or HW tensor information. |
| std::runtime_error | if the tensor name is not found. |
|
pure virtual |
Unified API to retrieve tensor information based on direction and tensor type (CPU/HW).
This method retrieves tensor information based on the specified direction (input/output) and tensor type (CPU/HW).
| direction | Specifies whether to retrieve input or output tensor information. |
| type | Specifies whether to retrieve CPU or HW tensor information. |
|
pure virtual noexcept |
Waits for the completion of an asynchronous job.
This method blocks until the specified job, previously submitted with execute_async, completes or the timeout expires. It can also be used to check a job's status without blocking.
| job_handle | A constant reference to the handle of the job to wait for. |
| timeout | The maximum time to wait for job completion. A zero timeout checks the job's completion status and returns immediately. A positive timeout returns as soon as the job completes or the specified time has elapsed, whichever comes first. |
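The timeout semantics can be illustrated with a small stand-in that does not touch the real API. In the sketch below, `FakeJob`, `WaitOutcome`, and `wait_for_job` are all hypothetical names; the actual StatusCode values returned by wait are not listed on this page, and the real implementation is unlikely to poll in this way.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

namespace sketch {

// Hypothetical stand-in for an in-flight job.
struct FakeJob {
    std::atomic<bool> done{false};
};

// Hypothetical result type; the real wait() returns a StatusCode.
enum class WaitOutcome { Completed, TimedOut };

// Mirrors the documented behaviour: a zero timeout checks once and returns
// immediately, while a positive timeout returns on completion or expiry.
inline WaitOutcome wait_for_job(const FakeJob& job,
                                std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        if (job.done.load()) {
            return WaitOutcome::Completed;
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            return WaitOutcome::TimedOut;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}

}  // namespace sketch
```

Checking completion before checking the deadline means a job that has already finished is reported as completed even with a zero timeout.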