Communication And Runtime¶
Communication operations span multiple NPUs in a parallel group. They express inter-NPU data exchange and collective reduction using a ParallelGroup handle.
Operations¶
| Operation | Description | Collective Type | IR Spelling | C++ Spelling |
|---|---|---|---|---|
| TBROADCAST | Broadcast data from root NPU to all ranks | One-to-all | pto.tbroadcast | BROADCAST |
| TGET | Get data from a remote NPU | Point-to-point | pto.tget | GET |
| TGET_ASYNC | Asynchronous variant of TGET | Point-to-point | pto.tget_async | GET_ASYNC |
| TNOTIFY | Notify other ranks of an event | Synchronization | pto.tnotify | NOTIFY |
| TPUT | Put data to a remote NPU | Point-to-point | pto.tput | PUT |
| TPUT_ASYNC | Asynchronous variant of TPUT | Point-to-point | pto.tput_async | PUT_ASYNC |
| TREDUCE | Collective reduction across all ranks | All-to-one | pto.treduce | REDUCE |
| TSCATTER | Scatter data from root to all ranks | One-to-all | pto.tscatter | SCATTER |
| TGATHER | Gather data from all ranks to root | All-to-one | pto.tgather | GATHER |
| TTEST | Test if a notification has been received | Synchronization | pto.ttest | TEST |
| TWAIT | Wait for a notification | Synchronization | pto.twait | WAIT |
Mechanism¶
Communication operations use a ParallelGroup handle (!pto.group<N>) to identify the set of participating NPUs. The group defines:
- Size: Number of ranks N in the parallel group
- Root: The designated NPU for broadcast/scatter operations (typically rank 0)
- Tensors: Per-rank destination/source buffers
Data Flow¶
All collective communication operations share a common data flow pattern, staging data between global memory (GM) and the on-chip Unified Buffer (UB):
Local GM ──► UB (staging tile) ──► Inter-NPU interconnect ──► UB ──► Local GM
A staging tile in UB is always required as an intermediate buffer. For large tensors that exceed the UB tile capacity, the operation automatically performs 2D sliding — chunking along rows and columns to fit each chunk into the tile, iterating over all outer dimensions.
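The 2D sliding described above can be pictured with a small host-side model. This is only a sketch of the iteration order, not the hardware implementation: the function name `sliding_chunks` and the convention that the last two dimensions are rows and columns are assumptions made for illustration.

```python
from itertools import product


def sliding_chunks(shape, tile_rows, tile_cols):
    """Yield (outer_index, row_start, row_count, col_start, col_count).

    Hypothetical model: the last two entries of `shape` are treated as rows
    and columns, and every leading entry is an outer dimension.
    """
    *outer, rows, cols = shape
    for outer_idx in product(*(range(d) for d in outer)):
        for r0 in range(0, rows, tile_rows):
            for c0 in range(0, cols, tile_cols):
                # The clamp only matters when the shape is not a multiple of
                # the tile; with divisible shapes every chunk is a full tile.
                yield (outer_idx,
                       r0, min(tile_rows, rows - r0),
                       c0, min(tile_cols, cols - c0))


# A 2 x 32 x 48 tensor moved through a 16 x 16 staging tile.
chunks = list(sliding_chunks((2, 32, 48), tile_rows=16, tile_cols=16))
covered = sum(rc * cc for _, _, rc, _, cc in chunks)
```

Each yielded chunk fits in one staging tile, and together the chunks cover every element exactly once.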
Broadcast¶
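All non-root NPUs receive data from the root. Writing \(\mathrm{src}_{\mathrm{root}}\) for the root's source buffer and \(\mathrm{dst}_r\) for the destination buffer of rank \(r\):
\[ \mathrm{dst}_r = \mathrm{src}_{\mathrm{root}} \quad \text{for all } r \neq \mathrm{root} \]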
Only the root calls pto.tbroadcast. Non-root ranks must ensure their destination buffers are allocated and writable for the duration of the operation.
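As a rough illustration of the broadcast semantics, the following host-side model treats each rank's buffer as a Python list. It does not use the PTO API; `broadcast` and the buffer layout are hypothetical names chosen for this sketch.

```python
def broadcast(buffers, root=0):
    """Copy the root's buffer into every other rank's buffer, in place.

    `buffers[r]` stands in for rank r's pre-allocated destination buffer
    (hypothetical host-side model, not the pto.tbroadcast implementation).
    """
    src = buffers[root]
    for rank, dst in enumerate(buffers):
        if rank == root:
            continue
        if len(dst) != len(src):
            raise ValueError(f"rank {rank}: destination size mismatch")
        dst[:] = src  # overwrite contents, keep the rank's own buffer object
    return buffers


# Four ranks; rank 0 is the root and holds the payload.
buffers = [[1.0, 2.0, 3.0]] + [[0.0] * 3 for _ in range(3)]
broadcast(buffers, root=0)
```

The size check mirrors the requirement that non-root destination buffers be allocated and correctly sized before the root issues the operation.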
Reduce¶
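All ranks contribute data to a reduction operation, with the result delivered to the root:
\[ \mathrm{dst}_{\mathrm{root}} = \bigoplus_{r=0}^{N-1} \mathrm{src}_r \]
where \(\mathrm{src}_r\) is the buffer contributed by rank \(r\), \(N\) is the group size, and \(\bigoplus\) is the reduction operator (sum, max, min, etc.), applied element-wise.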
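A minimal host-side sketch of the reduction semantics follows, assuming the operator is applied element-wise and only the root receives a result. The name `reduce_to_root` is hypothetical and unrelated to the PTO C++ API.

```python
from functools import reduce as fold
import operator


def reduce_to_root(src, op, root=0):
    """Combine per-rank buffers element-wise and deliver the result to root.

    `src[r]` models the contribution of rank r (hypothetical layout).
    Non-root entries of the returned list stay None: they receive nothing.
    """
    if len({len(buf) for buf in src}) != 1:
        raise ValueError("all ranks must contribute buffers of equal length")
    dst = [None] * len(src)
    dst[root] = [fold(op, column) for column in zip(*src)]
    return dst


src = [[1, 5], [4, 2], [3, 9]]          # three ranks, two elements each
sum_result = reduce_to_root(src, operator.add)
max_result = reduce_to_root(src, max)
```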
Scatter/Gather¶
Scatter distributes slices of the root's data to each rank. Gather collects per-rank data back to the root.
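The two operations are inverses of each other, which the following sketch makes concrete. It assumes equal-sized contiguous slices assigned in rank order; that layout, and the names `scatter` and `gather`, are illustrative assumptions rather than documented behavior.

```python
def scatter(root_data, n_ranks):
    """Split the root's data into n_ranks equal contiguous slices."""
    if len(root_data) % n_ranks != 0:
        raise ValueError("data length must be divisible by the group size")
    size = len(root_data) // n_ranks
    return [root_data[r * size:(r + 1) * size] for r in range(n_ranks)]


def gather(per_rank):
    """Concatenate per-rank slices back into one buffer, in rank order."""
    return [x for chunk in per_rank for x in chunk]


root_data = list(range(8))
slices = scatter(root_data, n_ranks=4)   # rank r receives slices[r]
round_trip = gather(slices)              # root reassembles the original
```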
Point-to-Point (TGET/TPUT)¶
Point-to-point operations transfer data between two specific NPUs without involving the entire group:
- TGET (pto.tget): Read remote GM → local GM. Data flows from the source NPU to the current NPU.
- TPUT (pto.tput): Write local GM → remote GM. Data flows from the current NPU to the destination NPU.
Both use a staging tile in UB as the intermediate buffer. For TGET, the data path is: remote GM → staging tile → local GM. For TPUT, the data path is: local GM → staging tile → remote GM.
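The two data paths differ only in direction, as this host-side sketch shows. Python lists stand in for GM regions and a fixed-size list for the staging tile; `staged_copy`, `tget`, `tput`, and the `tile_elems` parameter are hypothetical names for illustration only.

```python
def staged_copy(src, dst, tile_elems):
    """Copy src into dst through a fixed-size staging buffer.

    Returns the number of tile-sized hops that were needed.
    """
    if len(src) != len(dst):
        raise ValueError("source and destination must have the same size")
    staging = [0] * tile_elems
    hops = 0
    for start in range(0, len(src), tile_elems):
        count = min(tile_elems, len(src) - start)
        staging[:count] = src[start:start + count]   # source GM -> staging tile
        dst[start:start + count] = staging[:count]   # staging tile -> destination GM
        hops += 1
    return hops


def tget(local_gm, remote_gm, tile_elems):
    """Model of TGET: remote GM -> staging tile -> local GM."""
    return staged_copy(remote_gm, local_gm, tile_elems)


def tput(local_gm, remote_gm, tile_elems):
    """Model of TPUT: local GM -> staging tile -> remote GM."""
    return staged_copy(local_gm, remote_gm, tile_elems)


remote = list(range(10))
local = [0] * 10
get_hops = tget(local, remote, tile_elems=4)

outgoing = [7] * 10
peer = [0] * 10
put_hops = tput(outgoing, peer, tile_elems=4)
```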
ParallelGroup Handle¶
// Define a parallel group of 8 NPUs
%tensors = "pto.make_group"(%addrs0, %addrs1, ..., %addrs7)
: (!pto.memref<f32, 16x16>, ..., !pto.memref<f32, 16x16>) -> !pto.group<8>
In C++, the ParallelGroup<GTensor> template manages the group handle. See the per-op pages for C++ usage examples.
Large Tile Support¶
When the GlobalTensor exceeds the UB tile capacity in rows and/or columns, transfers are automatically chunked via 2D sliding:
- If ValidRow is static, GetShape(DIM_3) must be divisible by ValidRow
- If ValidCol is static, GetShape(DIM_4) must be divisible by ValidCol
- To handle non-divisible cases, use tiles with DYNAMIC valid row/column
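The divisibility rules above can be checked ahead of time on the host. The sketch below is an assumption-laden model: the `DYNAMIC` sentinel value and the function `check_tile_divisibility` are invented for illustration and are not part of the PTO API.

```python
DYNAMIC = -1  # hypothetical sentinel standing in for a DYNAMIC valid row/column


def check_tile_divisibility(shape_dim3, shape_dim4, valid_row, valid_col):
    """Return a list of violated rules (an empty list means the shape is accepted)."""
    problems = []
    if valid_row != DYNAMIC and shape_dim3 % valid_row != 0:
        problems.append(
            f"GetShape(DIM_3)={shape_dim3} is not divisible by ValidRow={valid_row}")
    if valid_col != DYNAMIC and shape_dim4 % valid_col != 0:
        problems.append(
            f"GetShape(DIM_4)={shape_dim4} is not divisible by ValidCol={valid_col}")
    return problems


ok = check_tile_divisibility(64, 48, valid_row=16, valid_col=16)
bad = check_tile_divisibility(40, 48, valid_row=16, valid_col=16)
dyn = check_tile_divisibility(40, 48, valid_row=DYNAMIC, valid_col=16)
```

Here 40 rows cannot be split into static 16-row tiles, so the second call reports one problem, while switching the row extent to dynamic clears it.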
Constraints¶
- All participating NPUs must call the collective operation with matching ParallelGroup handles
- Non-root ranks must not call broadcast/scatter operations
- Root rank is identified by parallelGroup.GetRootIdx()
- Destination/source tensors are assumed to have the same shape and strides across ranks
- The staging tile must be pre-allocated in UB at non-overlapping offsets for ping-pong variants
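The non-overlap requirement for ping-pong staging tiles amounts to a standard interval test. The helper below is a sketch under the assumption that both tiles have the same size in bytes and are described by their starting UB offsets; `tiles_overlap` is a hypothetical name.

```python
def tiles_overlap(ping_offset, pong_offset, tile_bytes):
    """Return True if the byte ranges of the two staging tiles intersect.

    Each tile is modeled as the half-open range [offset, offset + tile_bytes).
    """
    return (ping_offset < pong_offset + tile_bytes
            and pong_offset < ping_offset + tile_bytes)


back_to_back = tiles_overlap(0, 4096, tile_bytes=4096)   # adjacent, no overlap
colliding = tiles_overlap(0, 2048, tile_bytes=4096)      # second tile starts inside the first
```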
Cases That Are Not Allowed¶
- Calling collective operations with mismatched ParallelGroup handles across ranks
- Calling broadcast/scatter on non-root ranks (undefined behavior)
- Using uninitialized or improperly sized destination buffers
- Using overlapping UB offsets for ping/pong staging tiles
See Also¶
- Communication instruction set — Instruction set overview
- Communication ISA — Instruction set description
- Ordering and Synchronization — PTO synchronization model