Configuring NoC Connectivity for Model Deployments#

Overview#

Vitis AI inference deployments on Versal AI Edge Series Gen2 devices require a carefully trained Network-on-Chip (NoC) configuration to meet the bandwidth and latency demands of AIE workloads. However, because the actual AI model artifact is loaded dynamically at runtime, the Vitis v++ linker cannot derive NoC configuration from the model itself at build time. The gmio_train utility bridges this gap by generating a lightweight, parameterized Neural Processing Unit (NPU) artifact – referred to as a dummy graph – that exposes the Global Memory IO (GMIO) interface placement and per-channel bandwidth requirements that the NoC compiler needs during the link stage.

This section describes how to use gmio_train to produce and customize this dummy NPU artifact, how to compile and link it into a Xilinx Shell Archive (XSA) using aiecompiler and v++, and how to align the trained NoC Quality of Service (QoS) profile with the real data-flow requirements of the target system.

Introduction#

The Vitis v++ linker relies on an AIE-side artifact to expose GMIO endpoints and per-channel bandwidth requests before it can invoke the NoC compiler and derive a valid NoC configuration. In a typical development flow, the real compiled AI model would serve this purpose. However, because Vitis AI compiled models are loaded and configured dynamically at runtime – and may change between runs – tying NoC training to any specific model artifact introduces fragility and inflexibility into the build process.

gmio_train solves this problem by decoupling NoC configuration from any particular model. It generates a stand-in ADF graph that carries only the GMIO placement and bandwidth information required for NoC training, without encoding any model-specific compute logic. This dummy graph is used exclusively at build time to guide the NoC compiler; at runtime, it is transparently replaced by the actual compiled AI model artifact running on the AIE Array.

By default, gmio_train produces a uniform layout in which every column on the AIE Array exposes two input and two output GMIO interfaces, each configured with a default bandwidth of 500 MB/s. This default configuration is well-suited for platform bring-up and early-stage development. For production deployments, the generated graph can be customized to reflect the actual column range and per-interface bandwidth profile of the target model, ensuring that the trained NoC configuration accurately mirrors real-world traffic patterns.

The sections that follow describe the default invocation, customization options, compilation steps, and recommended practices for both single-model and multi-model system configurations.

Why Use gmio_train?#

Configuring the NoC in a Versal device requires the v++ linker to have access to an AI Engine (AIE)-side artifact at build time – one that exposes the GMIO endpoints and per-channel bandwidth requests that the NoC compiler uses to derive a valid NoC configuration. In practice, however, the actual compiled AI model artifact is not always available at link time, and even when it is, it may change between runs or be loaded dynamically at runtime. This creates a fundamental tension between the static requirements of the build system and the dynamic nature of AI model deployment.

gmio_train resolves this by generating a dummy Neural Processing Unit (NPU) artifact – a lightweight stand-in Adaptive Data Flow (ADF) graph – that carries only the GMIO placement and bandwidth information required for NoC training. It does not encode any model-specific compute logic and is never executed at runtime. Its sole purpose is to give the NoC compiler a well-defined, realistic set of Quality of Service (QoS) parameters to train against during the v++ link stage.

The following points summarize the key reasons to use gmio_train:

The v++ link stage requires an AIE-side artifact. The NoC compiler cannot derive a NoC configuration without an artifact that exposes GMIO endpoints and per-channel bandwidth requests. Without one, NoC training has nothing to work against, and the link stage either fails or produces an undertrained NoC configuration that may not meet the bandwidth and latency requirements of the target workload.
Vitis AI models are loaded and configured dynamically at runtime. The real Vitis AI model artifact may change from one run to another, or may not be available at all during the platform build stage. Tying NoC training to a specific model artifact would require rebuilding the platform every time the model changes. gmio_train decouples NoC configuration from any specific model by producing, at build time, a parameterized stand-in graph that carries only the GMIO placement and bandwidth information needed for NoC training.
It eliminates boilerplate and reduces authoring effort. Although a dummy ADF graph can be written by hand, doing so requires familiarity with the ADF graph API and involves significant boilerplate code. gmio_train automates this process and produces a parameterized template that can be adapted to the target column range and bandwidth profile with minimal effort. Users who prefer full control over the graph definition remain free to author their own dummy graph; in that case, the gmio_train output still serves as a well-structured starting point.

Note

The dummy NPU graph produced by gmio_train is used exclusively at build time to guide NoC configuration. It is not executed at runtime and does not affect the behavior of the actual AI model artifact that runs on the AIE Array.

Note

Before running any of the commands shown in this document (gmio_train, aiecompiler, v++, …), make sure the Vitis environment is set up as described in Build the Reference Design.

The default invocation is:

gmio_train -s --part xc2ve3858-ssva2112-2MP-e-S -o training-libadf.a

This default command generates a design in which every column on the AI Engine Array exposes two input and two output GMIO interfaces, and where the default read bandwidth and write bandwidth are both set to 500 MB/s.

Using a User-Defined Dummy NPU Artifact#

Although gmio_train is the recommended approach for generating the dummy Neural Processing Unit (NPU) artifact, it is not the only option. Users who require precise control over Global Memory IO (GMIO) interface placement, bandwidth assignments, or graph topology are free to author their own dummy Adaptive Data Flow (ADF) graph and pass it directly to the v++ linker in place of the gmio_train-generated artifact.

A user-defined dummy graph must satisfy the same structural requirements as the gmio_train-generated graph: it must expose the GMIO endpoints and per-channel bandwidth requests that the NoC compiler needs in order to derive a valid NoC configuration at link time. Beyond that constraint, the graph definition is entirely under the user’s control.

When authoring a custom dummy graph, consider the following:

GMIO placement must reflect the intended model layout. The GMIO interface locations defined in the dummy graph directly influence the NoC configuration that the compiler derives. If the placement does not reflect the column range and interface locations of the actual AI model, the resulting NoC configuration may be suboptimal or incompatible with the runtime workload.
Bandwidth values must be non-zero. As described in the Bandwidth Value Constraints section, bandwidth values of 0 are not supported in either the GMIO create() calls or the Vitis NoC Quality of Service (QoS) connectivity options. All per-interface bandwidth values must be set to a positive integer. For paths that are unused or minimally active, a small non-zero value in the range of 1 to 5 MB/s is recommended to keep the configuration valid while minimizing impact on NoC resource allocation.
gmio_train can serve as a starting point. Even when a fully custom graph is required, gmio_train can accelerate development by generating a well-structured, parameterized template that is straightforward to modify. Starting from the gmio_train-generated graph.h and adapting it to the target layout is generally faster and less error-prone than writing a dummy graph from scratch, particularly for users who are less familiar with the ADF graph API.

Note

Regardless of whether the dummy NPU artifact is produced by gmio_train or authored manually, it is used exclusively at build time to guide NoC configuration during the v++ link stage. It is not executed at runtime and does not affect the behavior of the actual compiled AI model artifact that runs on the AIE Array.

Customizing the Generated Graph#

Before customizing the dummy Neural Processing Unit (NPU) graph generated by gmio_train, the user should have a working understanding of the target model’s deployment characteristics. Specifically, the following information should be known or estimated before making any modifications to the generated graph.h:

Which columns on the AIE Array the actual AI model will occupy
The total number of columns that will be used
The location of each input and output interface within those columns
The bandwidth requirement of each interface – at minimum, the relative magnitudes of the input and output bandwidths across columns should be known, even if exact values are not yet available

With this information in hand, the graph generated by gmio_train can be modified to accurately reflect the target model’s column range and traffic profile. The following subsections describe how to configure the column range and per-interface bandwidth values.

Configuring the Column Range#

By default, gmio_train generates a design in which every column on the AIE Array is included. To restrict the generated graph to the columns that the actual AI model will occupy, use the --start-col and --num-col options to specify the starting column index and the total number of columns, respectively.

For example, the following command generates a design that starts at column 0 and occupies 8 columns:

gmio_train -s --part xc2ve3858-ssva2112-2MP-e-S \
           --ns instance_0 --start-col 0 --num-col 8 \
           -o training-libadf.a

The generated tmp/aie/graph.h will reflect this column range and produce Global Memory IO (GMIO) interface entries only for the specified columns. The default graph structure produced by this command is shown below:

class Shimlock : public adf::graph
{
public:
  adf::kernel gmioKernSet[NMU_COL_COUNT * STACK_DEPTH];
  adf::input_gmio gmioIn[NMU_COL_COUNT * STACK_DEPTH];
  adf::output_gmio gmioOut[NMU_COL_COUNT * STACK_DEPTH];
  Shimlock()
  {
    for(int c = 0; c < NMU_COL_COUNT; c++) {
      for(int i = 0; i < STACK_DEPTH; i++) {
        gmioIn[STACK_DEPTH * c + i] = adf::input_gmio::create(
            "c" + std::to_string(nmuCols.at(c)) + "r" + std::to_string(i),
            256, 500);
        gmioOut[STACK_DEPTH * c + i] = adf::output_gmio::create(
            "c" + std::to_string(nmuCols.at(c)) + "w" + std::to_string(i),
            256, 500);
        gmioKernSet[STACK_DEPTH * c + i] = adf::kernel::create(loop);
        adf::location<adf::kernel>(gmioKernSet[STACK_DEPTH * c + i]) =
            adf::tile(nmuCols.at(c), i);
        adf::location<adf::GMIO>(gmioIn[STACK_DEPTH * c + i]) =
            adf::shim(nmuCols.at(c));
        adf::location<adf::GMIO>(gmioOut[STACK_DEPTH * c + i]) =
            adf::shim(nmuCols.at(c));
        adf::source(gmioKernSet[STACK_DEPTH * c + i]) = "./loop.cpp";
        adf::connect(gmioIn[STACK_DEPTH * c + i].out[0],
                     gmioKernSet[STACK_DEPTH * c + i].in[0]);
        adf::connect(gmioKernSet[STACK_DEPTH * c + i].out[0],
                     gmioOut[STACK_DEPTH * c + i].in[0]);
        adf::runtime<adf::ratio>(gmioKernSet[STACK_DEPTH * c + i]) = 1.0;
      }
    }
  }
};

In this default structure, all GMIO interfaces are assigned a uniform bandwidth of 500 MB/s for both input and output. This is appropriate for bring-up and early-stage development, but should be replaced with per-interface values that reflect the actual traffic profile of the target model before moving to production.
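
As a rough check on what the default layout asks of the NoC, the sketch below totals the bandwidth requested in one direction. The helper name aggregateRequestMBps is hypothetical and is not part of gmio_train. The 36-column figure assumes the ve2-xc2ve3858 device, and two interfaces per column per direction follows the default layout described above.

```cpp
#include <cstdint>

// Hypothetical helper (not part of gmio_train): total bandwidth, in MB/s,
// requested in one direction (read or write) by a uniform GMIO layout.
constexpr std::int64_t aggregateRequestMBps(int columns,
                                            int interfacesPerColumn,
                                            int perInterfaceMBps)
{
  return static_cast<std::int64_t>(columns) * interfacesPerColumn *
         perInterfaceMBps;
}

// Assumed 36-column device, 2 interfaces per column, 500 MB/s each.
static_assert(aggregateRequestMBps(36, 2, 500) == 36000,
              "36 columns x 2 interfaces x 500 MB/s");

// The 8-column graph from the example command above.
static_assert(aggregateRequestMBps(8, 2, 500) == 8000,
              "8 columns x 2 interfaces x 500 MB/s");
```

Under these assumptions, the all-column default requests 36,000 MB/s in each direction, while the 8-column graph requests 8,000 MB/s. This is one reason to restrict the column range before tuning individual values.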

Configuring Per-Interface Bandwidth Values#

To assign per-interface bandwidth values, declare two integer arrays – one for input bandwidths and one for output bandwidths – and reference them in the adf::input_gmio::create() and adf::output_gmio::create() calls, replacing the uniform 500 default. The following example illustrates this pattern:

  class Shimlock : public adf::graph
  {
  public:
    adf::kernel gmioKernSet[NMU_COL_COUNT * STACK_DEPTH];
    adf::input_gmio gmioIn[NMU_COL_COUNT * STACK_DEPTH];
    adf::output_gmio gmioOut[NMU_COL_COUNT * STACK_DEPTH];
    int bwIn[NMU_COL_COUNT*STACK_DEPTH]  = {READ_0,  READ_1,  READ_2,  ...};
    int bwOut[NMU_COL_COUNT*STACK_DEPTH] = {WRITE_0, WRITE_1, WRITE_2, ...};
    Shimlock()
    {
      for(int c = 0; c < NMU_COL_COUNT; c++) {
        for(int i = 0; i < STACK_DEPTH; i++) {
          gmioIn[STACK_DEPTH * c + i] = adf::input_gmio::create(
              "c" + std::to_string(nmuCols.at(c)) + "r" + std::to_string(i),
              256, bwIn[c*STACK_DEPTH + i]);
          gmioOut[STACK_DEPTH * c + i] = adf::output_gmio::create(
              "c" + std::to_string(nmuCols.at(c)) + "w" + std::to_string(i),
              256, bwOut[c*STACK_DEPTH + i]);
          gmioKernSet[STACK_DEPTH * c + i] = adf::kernel::create(loop);
          adf::location<adf::kernel>(gmioKernSet[STACK_DEPTH * c + i]) =
              adf::tile(nmuCols.at(c), i);
          adf::location<adf::GMIO>(gmioIn[STACK_DEPTH * c + i]) =
              adf::shim(nmuCols.at(c));
          adf::location<adf::GMIO>(gmioOut[STACK_DEPTH * c + i]) =
              adf::shim(nmuCols.at(c));
          adf::source(gmioKernSet[STACK_DEPTH * c + i]) = "./loop.cpp";
          adf::connect(gmioIn[STACK_DEPTH * c + i].out[0],
                       gmioKernSet[STACK_DEPTH * c + i].in[0]);
          adf::connect(gmioKernSet[STACK_DEPTH * c + i].out[0],
                       gmioOut[STACK_DEPTH * c + i].in[0]);
          adf::runtime<adf::ratio>(gmioKernSet[STACK_DEPTH * c + i]) = 1.0;
        }
      }
    }
  };

Note

Replace READ_0, READ_1, READ_2, and WRITE_0, WRITE_1, WRITE_2 with the actual per-interface bandwidth values, in MB/s, that reflect the traffic profile of the target model. At minimum, the relative magnitudes of the input and output bandwidths across columns should be preserved, even if exact values are not yet available.
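
When only per-column estimates are available, the flat bwIn and bwOut arrays can be filled by repeating each column's value for every interface in that column. The following is a minimal standalone sketch of that index mapping. The sizes kCols and kDepth and the helper expandPerColumn are illustrative assumptions; the generated graph.h defines its own NMU_COL_COUNT and STACK_DEPTH.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Illustrative sizes only; graph.h supplies NMU_COL_COUNT and STACK_DEPTH.
constexpr std::size_t kCols = 4;   // assumed number of columns
constexpr std::size_t kDepth = 2;  // assumed interfaces per column

// Hypothetical helper: copy one bandwidth value (MB/s) per column into every
// interface slot of that column, using the same flat index as the graph
// constructor (STACK_DEPTH * c + i).
inline std::array<int, kCols * kDepth>
expandPerColumn(const std::array<int, kCols>& perColumnMBps)
{
  std::array<int, kCols * kDepth> flat{};
  for (std::size_t c = 0; c < kCols; ++c) {
    assert(perColumnMBps[c] > 0);  // a bandwidth of 0 is not supported
    for (std::size_t i = 0; i < kDepth; ++i) {
      flat[kDepth * c + i] = perColumnMBps[c];
    }
  }
  return flat;
}
```

For example, per-column inputs of 800, 400, 200, and 5 MB/s expand to eight entries in which slots 0 and 1 hold 800, slots 2 and 3 hold 400, and so on.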

Bandwidth Value Constraints#

When assigning per-interface bandwidth values, the following constraints must be observed:

Bandwidth values of 0 are not supported.

A value of 0 is not valid in either the GMIO create() calls or the Vitis NoC Quality of Service (QoS) connectivity options. This behavior differs from Vivado NoC configuration, where a value of 0 can be used to indicate unused paths. In the Vitis flow, attempting to set:

read_bw  = 0
write_bw = 0

can result in linker errors similar to the following:

ERROR: [CFGEN 83-2253] Malformed --connectivity.noc.read_bw switch argument

Use small non-zero values for unused or minimally active paths.

For interfaces that are unused or carry minimal traffic, assign a small positive bandwidth value in the range of 1 to 5 MB/s. This keeps the configuration valid while minimizing the impact on NoC resource allocation.
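
A zero entry is easier to catch before the graph is compiled than at link time. The sketch below shows one way to check a bandwidth list and raise low entries to a floor. Both helpers are hypothetical names introduced for illustration and are not part of gmio_train or the ADF API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: index of the first entry that is 0 or negative,
// or -1 when every entry is a valid positive bandwidth.
inline int firstInvalidIndex(const std::vector<int>& bandwidthMBps)
{
  for (std::size_t i = 0; i < bandwidthMBps.size(); ++i) {
    if (bandwidthMBps[i] <= 0) {
      return static_cast<int>(i);
    }
  }
  return -1;
}

// Hypothetical helper: raise every entry below floorMBps to floorMBps.
// The floor is restricted to the recommended 1 to 5 MB/s range.
inline void raiseToFloor(std::vector<int>& bandwidthMBps, int floorMBps)
{
  assert(floorMBps >= 1 && floorMBps <= 5);
  for (int& value : bandwidthMBps) {
    if (value < floorMBps) {
      value = floorMBps;
    }
  }
}
```

Running the check over the values intended for bwIn and bwOut, and applying the floor only to paths known to be idle, keeps the larger entries untouched.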

Preserve relative bandwidth magnitudes across columns.

Even when exact per-interface bandwidth values are not known, the relative magnitudes of the input and output bandwidths across columns should be preserved as accurately as possible. The NoC compiler uses these values to derive a Quality of Service (QoS) profile that reflects the real data-flow requirements of the target model. A configuration in which all interfaces are assigned the same uniform bandwidth may result in a suboptimal NoC configuration that does not meet the latency or throughput requirements of the target workload.
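
When only relative traffic levels are known, one option is to scale them against a chosen peak value. The sketch below assumes integer weights and a user-chosen peak in MB/s. The helper scaleWeights is hypothetical, and the linear scaling is an illustrative choice, not a rule imposed by the NoC compiler.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: map relative traffic weights to MB/s values.
// The heaviest interface receives peakMBps, the others are scaled linearly,
// and any result below 1 is raised to 1 because 0 is not supported.
inline std::vector<int> scaleWeights(const std::vector<int>& weights,
                                     int peakMBps)
{
  assert(!weights.empty() && peakMBps > 0);
  const int maxWeight = *std::max_element(weights.begin(), weights.end());
  std::vector<int> result;
  result.reserve(weights.size());
  for (int w : weights) {
    int scaled = 0;
    if (maxWeight > 0) {
      scaled = static_cast<int>(static_cast<long long>(w) * peakMBps /
                                maxWeight);
    }
    result.push_back(std::max(scaled, 1));
  }
  return result;
}
```

With weights of 4, 2, 1, and 0 and a peak of 500 MB/s, the helper yields 500, 250, 125, and 1, which preserves the ordering while keeping every entry valid.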

Note

The bandwidth values assigned in the dummy graph are used exclusively at build time to guide NoC training during the v++ link stage. They do not directly control the runtime behavior of the actual AI model artifact. However, they do influence the NoC configuration that is baked into the platform, which in turn affects the bandwidth and latency characteristics available to the runtime workload. For this reason, it is important to assign bandwidth values that are as representative of the real model’s traffic profile as possible.

Compiling the Customized Graph#

After editing tmp/aie/graph.h to reflect the target column range and per-interface bandwidth values, the customized graph must be recompiled to produce an updated training-libadf.a archive. The files generated by gmio_train can be reused directly for this purpose, without regenerating the full gmio_train output.

To recompile the customized graph, run aiecompiler with the configuration file and include path generated by gmio_train:

aiecompiler --config tmp/aie/Work/aie_hw.cfg --include=tmp/aie

This command produces an updated training-libadf.a archive that reflects the customized GMIO placement and bandwidth values. Once compiled, pass the archive to the v++ linker to produce the linked Xilinx Shell Archive (XSA):

v++ -l training-libadf.a ... -o <design>_link.xsa

The v++ linker will invoke the NoC compiler using the GMIO placement and bandwidth information encoded in training-libadf.a to derive a NoC configuration that reflects the target model’s column range and traffic profile.

Note

Before running aiecompiler or v++, ensure that the Vitis environment is set up as described in Build the Reference Design. Attempting to run either tool without the correct environment configuration will result in errors or an incomplete build.

Recommended Practice#

The following practices are recommended when configuring the dummy Neural Processing Unit (NPU) graph for NoC training. Adhering to these guidelines will help ensure that the trained NoC configuration is both valid and representative of the real model’s data-flow requirements.

Use small non-zero values for unused or minimally active paths. Bandwidth values of 0 are not supported in the Vitis NoC Quality of Service (QoS) connectivity options. For Global Memory IO (GMIO) interfaces that are unused or carry minimal traffic, assign a small positive bandwidth value in the range of 1 to 5 MB/s. This keeps the NoC configuration valid while minimizing the impact on NoC resource allocation for those paths.
Reflect the actual column range of the target model. Use the --start-col and --num-col options to restrict the generated graph to the columns that the actual AI model will occupy. Including columns that the model does not use causes the NoC compiler to allocate resources for paths that will never carry traffic at runtime, potentially degrading the NoC configuration quality for the paths that matter.
Preserve relative bandwidth magnitudes across columns. Even when exact per-interface bandwidth values are not yet known, the relative magnitudes of the input and output bandwidths across columns should be preserved as accurately as possible. The NoC compiler uses these values to derive a QoS profile that reflects the real data-flow requirements of the target model. A uniform bandwidth assignment across all interfaces may result in a suboptimal NoC configuration that does not meet the latency or throughput requirements of the target workload.
Recompile after every graph modification. Any change to tmp/aie/graph.h – whether to the column range, interface placement, or bandwidth values – must be followed by a recompile using aiecompiler and a re-link using v++ so that the updated NoC configuration is reflected in the output Xilinx Shell Archive (XSA). Changes to graph.h that are not recompiled and re-linked have no effect on the trained NoC configuration.
Use gmio_train defaults during early bring-up. When per-model column knowledge is not yet available – for example, during early platform bring-up or initial integration testing – the default gmio_train invocation with uniform 500 MB/s bandwidth across all columns is a suitable and safe starting point. Refine the configuration once the actual model’s column usage and traffic profile have been characterized.

Performance-Oriented Configuration#

The default gmio_train graph, which assigns a uniform bandwidth of 500 MB/s to every column on the AIE Array, is designed for platform bring-up and generic NoC training. It is not intended for performance tuning. Once the actual machine learning (ML) workload has been characterized and its column usage and traffic profile are known, the dummy graph should be regenerated and customized to reflect the real model’s data-flow requirements.

The following steps describe the recommended approach for performance-oriented NoC configuration:

Step 1: Match the column range to the actual model.

Use the --start-col and --num-col options to restrict the generated graph to the columns that the model actually occupies. Columns that the model does not use should be excluded from the dummy graph to avoid allocating NoC resources for unused paths.

For example, if the model occupies columns 0 through 7:

gmio_train -s --part xc2ve3858-ssva2112-2MP-e-S \
           --ns instance_0 --start-col 0 --num-col 8 \
           -o training-libadf.a

Step 2: Replace uniform bandwidth values with per-column estimates.

Replace the uniform 500 MB/s default with per-column READ_x and WRITE_x values that approximate the model’s traffic pattern. At minimum, the relative magnitudes of the input and output bandwidths across columns should be preserved, even if exact values are not yet available. Refer to the Configuring Per-Interface Bandwidth Values section for implementation details.

Step 3: Recompile and re-link.

After customizing tmp/aie/graph.h, recompile the graph using aiecompiler and re-link using v++ to produce an updated XSA that reflects the performance-oriented NoC configuration:

aiecompiler --config tmp/aie/Work/aie_hw.cfg --include=tmp/aie
v++ -l training-libadf.a ... -o <design>_link.xsa

Step 4: Complement with v++ connectivity options.

For finer control over NoC behavior, supplement the dummy graph with v++ --connectivity options such as sp=, noc.read_bw=, and noc.write_bw= to control how kernels, AIE interfaces, and memory resources are interconnected and assigned QoS parameters. Refer to the Further NoC Control Beyond gmio_train section for details.

Note

When per-model column knowledge is not yet available, the plain gmio_train defaults remain a suitable starting point. Performance-oriented configuration should be deferred until the actual model’s column usage and traffic profile have been characterized through profiling or simulation.

Per-Instance NoC Customization#

In deployments where multiple model instances run concurrently on the AIE Array, the dummy NoC training graph must reflect the full multi-instance floorplan to ensure that the trained NoC configuration meets the bandwidth and latency requirements of all active instances. Multiple model instances can be deployed on the AIE Array through two distinct mechanisms, each operating at a different stage of the development flow:

Mechanism	When	Option	Use Case
Data parallelism	Compile time	`dp_size`	Maximize throughput for concurrent requests by replicating the model multiple times across the device at build time
Multi-tenancy	Runtime	`start_column` / `aie_columns_sharing`	Dynamically place multiple model instances across AIE Array columns at runtime using per-runner placement options

Regardless of which mechanism is used, the NoC training graph should be configured to reflect the column range and per-interface bandwidth profile of each model instance. The following guidance applies to both mechanisms.

Generate one dummy graph per model instance.

Invoke gmio_train separately for each model instance, using the --start-col and --num-col options to restrict each graph to the columns that the instance will occupy, and the --ns option to assign a unique namespace to each graph to avoid symbol conflicts during linking.

The total number of columns occupied by each compiled model instance is determined by the following relationship:

\[\text{occupied\_columns} = dp\_size \times tp\_size \times 4\]
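
The relationship can be evaluated directly. The sketch below is a small worked example. The function name occupiedColumns and the specific dp_size and tp_size pairs are illustrative assumptions chosen only to show the arithmetic.

```cpp
// The constant factor of 4 is taken from the relationship above.
constexpr int kColumnsPerUnit = 4;

// Hypothetical helper: columns occupied by one compiled model instance.
constexpr int occupiedColumns(int dpSize, int tpSize)
{
  return dpSize * tpSize * kColumnsPerUnit;
}

// Illustrative combinations, evaluated at compile time.
static_assert(occupiedColumns(1, 1) == 4, "1 x 1 x 4");
static_assert(occupiedColumns(1, 2) == 8, "1 x 2 x 4");
static_assert(occupiedColumns(2, 2) == 16, "2 x 2 x 4");
```

The result is the value to pass as --num-col for that instance.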

For example, for a deployment where each instance occupies 8 columns, generate one dummy graph per instance in different working directories as follows:

gmio_train -s --part xc2ve3858-ssva2112-2MP-e-S \
           --ns instance_0 --start-col 0  --num-col 8 \
           -o training-libadf-instance-0.a

gmio_train -s --part xc2ve3858-ssva2112-2MP-e-S \
           --ns instance_1 --start-col 8  --num-col 8 \
           -o training-libadf-instance-1.a

gmio_train -s --part xc2ve3858-ssva2112-2MP-e-S \
           --ns instance_2 --start-col 16 --num-col 8 \
           -o training-libadf-instance-2.a

gmio_train -s --part xc2ve3858-ssva2112-2MP-e-S \
           --ns instance_3 --start-col 24 --num-col 8 \
           -o training-libadf-instance-3.a
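
For equally sized instances, the start column of each instance is its index multiplied by the column count. The sketch below builds the command strings shown above from that rule. The helper gmioTrainCommands is hypothetical; it only formats text using the options that appear in this section and does not run anything.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: build one gmio_train command line per instance,
// assuming all instances occupy the same number of adjacent columns.
inline std::vector<std::string> gmioTrainCommands(const std::string& part,
                                                  int instances,
                                                  int colsPerInstance)
{
  assert(instances > 0 && colsPerInstance > 0);
  std::vector<std::string> commands;
  for (int n = 0; n < instances; ++n) {
    const std::string id = std::to_string(n);
    commands.push_back(
        "gmio_train -s --part " + part +
        " --ns instance_" + id +
        " --start-col " + std::to_string(n * colsPerInstance) +
        " --num-col " + std::to_string(colsPerInstance) +
        " -o training-libadf-instance-" + id + ".a");
  }
  return commands;
}
```

Calling gmioTrainCommands("xc2ve3858-ssva2112-2MP-e-S", 4, 8) yields the four commands above with start columns 0, 8, 16, and 24. Each command still has to be run in its own working directory.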

Assign per-instance bandwidth values.

For each generated graph.h, replace the uniform bandwidth defaults with READ_x and WRITE_x values that reflect the traffic profile of the corresponding model instance. This ensures that the NoC compiler trains against a realistic, per-instance Quality of Service (QoS) profile that mirrors the final multi-instance floorplan.

Compile each instance into its own archive.

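Compile each customized graph.h into its own libadf.a archive by running aiecompiler once in each instance's working directory. The command is identical in every directory, so the four invocations below correspond to the four instances: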

aiecompiler --config tmp/aie/Work/aie_hw.cfg \
            --include=tmp/aie
aiecompiler --config tmp/aie/Work/aie_hw.cfg \
            --include=tmp/aie
aiecompiler --config tmp/aie/Work/aie_hw.cfg \
            --include=tmp/aie
aiecompiler --config tmp/aie/Work/aie_hw.cfg \
            --include=tmp/aie

Link all instance archives together with v++.

Pass all per-instance archives to the v++ linker in a single link invocation. The linker will combine the Global Memory IO (GMIO) placement and bandwidth information from all archives and invoke the NoC compiler to derive a unified NoC configuration that reflects the full multi-instance floorplan:

v++ -l training-libadf-instance-0.a \
    training-libadf-instance-1.a \
    training-libadf-instance-2.a \
    training-libadf-instance-3.a \
    ... -o <design>_link.xsa

Note

When using per-instance NoC customization, observe the following constraints regardless of whether compile-time or runtime placement is used:

Column ranges must not overlap. The --start-col ranges assigned to each model instance must be non-overlapping. On a 24-column device (ve2-xc2ve3558), the sum of all occupied column ranges must not exceed 24 columns. On a 36-column device (ve2-xc2ve3858), the sum must not exceed 36 columns.
Column ranges must represent disjoint partitions. The column ranges assigned across all model instances must together form a set of disjoint partitions – that is, every occupied column must belong to exactly one instance, with no gaps or overlaps between instance boundaries. Partial or fragmented column assignments that leave unassigned columns between instances are not supported.
One instance must include column 0. At least one model instance must be assigned a column range that begins at column 0 (--start-col 0). The NoC compiler requires that the column space is anchored at column 0; a configuration in which no instance starts at column 0 is invalid and will produce an incorrect or incomplete NoC configuration.
Namespaces must be unique. The --ns option must be used to assign a unique namespace to each generated graph to avoid symbol conflicts during the v++ link stage.
Bandwidth values must be non-zero. For unused or minimally active paths, assign a small positive value in the range of 1 to 5 MB/s. Bandwidth values of 0 are not supported in the Vitis NoC QoS connectivity options.
Runtime placement must be consistent. When using multi-tenancy runtime placement, VART does not validate overlapping or incompatible placements across runners. The application is responsible for ensuring that spatial layouts do not overlap and that temporal sharing groups use matching dp_size and tp_size values. Refer to the VART Multi-Tenancy documentation for full details on runner placement options and application responsibilities.
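
The column-range constraints above can be checked before any graph is generated. The following is a minimal sketch, assuming each instance is described by a start column and a column count. ColumnRange and isValidFloorplan are hypothetical names, and the check covers only the column rules; it does not cover namespaces, bandwidth values, or runtime placement.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical description of one model instance's column range.
struct ColumnRange {
  int start;  // value passed to --start-col
  int count;  // value passed to --num-col
};

// Hypothetical checker: true when the ranges start at column 0, follow one
// another with no gaps or overlaps, and fit within the device column count.
inline bool isValidFloorplan(std::vector<ColumnRange> ranges,
                             int deviceColumns)
{
  assert(deviceColumns > 0);
  if (ranges.empty()) {
    return false;
  }
  std::sort(ranges.begin(), ranges.end(),
            [](const ColumnRange& a, const ColumnRange& b) {
              return a.start < b.start;
            });
  int next = 0;  // the first range must begin at column 0
  for (const ColumnRange& r : ranges) {
    if (r.count <= 0) {
      return false;
    }
    if (r.start != next) {
      return false;  // start < next is an overlap, start > next is a gap
    }
    next = r.start + r.count;
  }
  return next <= deviceColumns;
}
```

The four 8-column instances from the example pass this check for a 36-column device and fail for a 24-column device, because they span 32 columns in total.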

Further NoC Control Beyond gmio_train#

The dummy graph produced by gmio_train describes only the AIE-side GMIO interfaces and their associated bandwidth requests. While this information is sufficient for the NoC compiler to derive a baseline NoC configuration, the actual NoC behavior in the final system is shaped by additional Vitis and Vivado mechanisms that can be combined with gmio_train to achieve finer control over NoC resource allocation and Quality of Service (QoS) parameters.

The following mechanisms are available for extending NoC control beyond what gmio_train provides:

Vitis Linker Connectivity Options#

The v++ linker exposes a set of --connectivity options that allow users to control how kernels, AIE interfaces, and memory resources are interconnected and assigned QoS parameters. These options complement the GMIO placement and bandwidth information provided by the gmio_train dummy graph and can be used to fine-tune the NoC configuration without modifying the graph itself.

The most commonly used connectivity options for NoC control are:

Option	Description
`sp=`	Specifies the memory resource to which a kernel port or AIE interface is connected. Used to control the data path between compute resources and memory.
`noc.read_bw=`	Specifies the read bandwidth, in MB/s, for a given NoC path. Overrides the bandwidth value derived from the dummy graph for that path.
`noc.write_bw=`	Specifies the write bandwidth, in MB/s, for a given NoC path. Overrides the bandwidth value derived from the dummy graph for that path.

For full details on all available connectivity options, refer to:

UG1702 - Vitis Reference Guide (Connectivity Options): https://docs.amd.com/r/en-US/ug1702-vitis-accelerated-reference/connectivity-Options

Post-Link Customization in Vivado#

For advanced use cases where the Vitis linker connectivity options do not provide sufficient control, the linked design can be exported to Vivado and the NoC settings tuned directly in the Vivado environment. The customized design can then be carried back into the Vitis flow for final integration and deployment.

Post-link customization in Vivado is recommended in the following scenarios:

The NoC configuration derived by the v++ linker does not meet the bandwidth or latency requirements of the target workload, and the required adjustments cannot be expressed through v++ connectivity options alone.
The target system has complex NoC topology requirements that are more naturally expressed in the Vivado NoC configuration environment than through the v++ connectivity option syntax.
Fine-grained control over individual NoC path parameters – such as traffic class, QoS priority, or arbitration settings – is required beyond what the Vitis flow exposes.

Note

Post-link customization in Vivado is an advanced workflow and requires familiarity with both the Vivado NoC configuration environment and the Vitis platform development flow. Users who are new to Versal NoC configuration are encouraged to exhaust the gmio_train and v++ connectivity options before resorting to post-link Vivado customization.

For a comprehensive description of the Versal NoC architecture, configuration parameters, and QoS tuning, refer to:

PG313 - Versal Adaptive SoC Programmable Network on Chip and Integrated Memory Controller LogiCORE IP Product Guide: https://docs.amd.com/r/en-US/pg313-network-on-chip
