Communication And Runtime

Communication operations span multiple NPUs in a parallel group. They express inter-NPU data exchange and collective reduction using a ParallelGroup handle.

Operations

Operation Description Collective Type IR Spelling C++ Spelling
TBROADCAST Broadcast data from root NPU to all ranks One-to-all pto.tbroadcast BROADCAST
TGET Get data from a remote NPU Point-to-point pto.tget GET
TGET_ASYNC Asynchronous variant of TGET Point-to-point pto.tget_async GET_ASYNC
TNOTIFY Notify other ranks of an event Synchronization pto.tnotify NOTIFY
TPUT Put data to a remote NPU Point-to-point pto.tput PUT
TPUT_ASYNC Asynchronous variant of TPUT Point-to-point pto.tput_async PUT_ASYNC
TREDUCE Collective reduction across all ranks All-to-one pto.treduce REDUCE
TSCATTER Scatter data from root to all ranks One-to-all pto.tscatter SCATTER
TGATHER Gather data from all ranks to root All-to-one pto.tgather GATHER
TTEST Test if a notification has been received Synchronization pto.ttest TEST
TWAIT Wait for a notification Synchronization pto.twait WAIT

Mechanism

Communication operations use a ParallelGroup handle (!pto.group<N>) to identify the set of participating NPUs. The group defines:

  • Size: Number of ranks N in the parallel group
  • Root: The designated NPU for broadcast/scatter operations (typically rank 0)
  • Tensors: Per-rank destination/source buffers

Data Flow

All collective communication operations share a common data flow pattern:

Local GM ──► UB (staging tile) ──► Inter-NPU interconnect ──► UB ──► Local GM

A staging tile in UB is always required as an intermediate buffer. For large tensors that exceed the UB tile capacity, the operation automatically performs 2D sliding — chunking along rows and columns to fit each chunk into the tile, iterating over all outer dimensions.

Broadcast

All non-root NPUs receive data from the root:

\[ \mathrm{dst}^{(k)} = \mathrm{src}^{(\text{root})} \quad \forall k \in [0, N) \]

Only the root calls pto.tbroadcast. Non-root ranks must ensure their destination buffers are allocated and writable for the duration of the operation.

Reduce

All ranks contribute data to a reduction operation, with the result delivered to the root:

\[ \mathrm{result}^{(\text{root})} = \bigoplus_{k=0}^{N-1} \mathrm{src}^{(k)} \]

where \(\bigoplus\) is the reduction operator (sum, max, min, etc.).

Scatter/Gather

Scatter distributes slices of the root's data to each rank. Gather collects per-rank data back to the root.

Point-to-Point (TGET/TPUT)

Point-to-point operations transfer data between two specific NPUs without involving the entire group:

  • TGET (pto.tget): Read remote GM → local GM. Data flows from the source NPU to the current NPU.
  • TPUT (pto.tput): Write local GM → remote GM. Data flows from the current NPU to the destination NPU.

Both use a staging tile in UB as the intermediate buffer. For TGET, the data path is: remote GM → staging tile → local GM. For TPUT, the data path is: local GM → staging tile → remote GM.

ParallelGroup Handle

// Define a parallel group of 8 NPUs
%tensors = "pto.make_group"(%addrs0, %addrs1, ..., %addrs7)
    : (!pto.memref<f32, 16x16>, ..., !pto.memref<f32, 16x16>) -> !pto.group<8>

In C++, the ParallelGroup<GTensor> template manages the group handle. See the per-op pages for C++ usage examples.

Large Tile Support

When the GlobalTensor exceeds the UB tile capacity in rows and/or columns, transfers are automatically chunked via 2D sliding:

  • If ValidRow is static, GetShape(DIM_3) must be divisible by ValidRow
  • If ValidCol is static, GetShape(DIM_4) must be divisible by ValidCol
  • To handle non-divisible cases, use tiles with DYNAMIC valid row/column

Constraints

Constraints

  • All participating NPUs must call the collective operation with matching ParallelGroup handles
  • Non-root ranks must not call broadcast/scatter operations
  • Root rank is identified by parallelGroup.GetRootIdx()
  • Destination/source tensors are assumed to have the same shape and strides across ranks
  • The staging tile must be pre-allocated in UB at non-overlapping offsets for ping-pong variants

Cases That Are Not Allowed

Cases That Are Not Allowed

  • Calling collective operations with mismatched ParallelGroup handles across ranks
  • Calling broadcast/scatter on non-root ranks (undefined behavior)
  • Using uninitialized or improperly sized destination buffers
  • Using overlapping UB offsets for ping/pong staging tiles

See Also