Communication Instruction Set¶

The Communication ISA covers inter-NPU collective operations and point-to-point exchange. These instructions span multiple NPUs in a parallel group and require a ParallelGroup handle. They are not available on the CPU simulator.

Instruction Overview¶

Operation	PTO Name	Description
Collective broadcast	`pto.tbroadcast`	Broadcast data from root NPU to all ranks in parallel group
Point-to-point get	`pto.tget` / `pto.tget_async`	Get data from a remote NPU; async variant does not block
Point-to-point put	`pto.tput` / `pto.tput_async`	Put data to a remote NPU; async variant does not block
Collective gather	`pto.tgather`	Gather data from all ranks to root NPU
Collective scatter	`pto.tscatter`	Scatter data from root NPU to all ranks
Collective reduction	`pto.treduce`	Collective reduction across all ranks (sum, max, min, etc.)
Notification	`pto.tnotify`	Notify other ranks of an event
Notification test	`pto.ttest`	Test if a notification has been received
Notification wait	`pto.twait`	Wait for a notification

Instruction Classes¶

Collective Broadcast / Scatter / Gather¶

These operations replicate or distribute data across all ranks in a ParallelGroup. All participating NPUs must call the operation with matching group handles.

Operation	PTO Name	Direction	Root Role
Broadcast	`pto.tbroadcast`	1 → N	Source of broadcast data
Scatter	`pto.tscatter`	1 → N	Source of scattered slices
Gather	`pto.tgather`	N → 1	Destination of gathered data

Point-to-Point Exchange¶

Point-to-point operations transfer data between exactly two ranks. The async variants return immediately without waiting for the transfer to complete.

Operation	PTO Name	Blocking?
Get from remote	`pto.tget`	Yes (blocks until data arrives)
Get from remote	`pto.tget_async`	No (returns event token)
Put to remote	`pto.tput`	Yes (blocks until transfer completes)
Put to remote	`pto.tput_async`	No (returns event token)

Collective Reduction¶

pto.treduce performs a collective reduction operation (sum, max, min, prod, etc.) across all ranks in a ParallelGroup, with the result available to all participating ranks.

Runtime Synchronization¶

Notification primitives allow non-blocking signaling between ranks without data transfer:

Operation	PTO Name	Description
Notify	`pto.tnotify`	Send a notification to specified ranks
Test	`pto.ttest`	Non-blocking test for received notification
Wait	`pto.twait`	Blocking wait for notification

Inputs¶

Communication instructions consume:

ParallelGroup handles (!pto.group<N>)
Tile operands for the data to be transferred
Scalar parameters (rank indices, event IDs, reduction operator)
Staging tiles for receive-side buffering

Expected Outputs¶

Modified tiles on receive-side ranks
RecordEvent tokens from async operations
Synchronization state changes

Side Effects¶

	Class	Side Effects
	Collective broadcast/scatter/gather	Network/interconnect traffic; ordering across ranks
	Point-to-point	Network traffic between two specific ranks
	Collective reduction	Network traffic + arithmetic on interconnect hardware
	Runtime sync	Lightweight notification state in interconnect

Shared Constraints¶

Communication ops require all participating NPUs to call the operation with matching ParallelGroup handles simultaneously.
Non-root ranks for broadcast/scatter must have destination tiles allocated and writable.
CPU simulator does not support communication ops.
Async operations require subsequent pto.twait before the transferred data is safe to use.

Constraints¶

Constraints

All participating NPUs must call collective operations with matching ParallelGroup handles simultaneously.
Non-root ranks for broadcast/scatter must have destination tiles allocated and writable.
Async operations require subsequent pto.twait before the transferred data is safe to use.
CPU simulator: These instructions are not available. Programs using them on CPU produce a runtime error.

Cases That Are Not Allowed¶

Cases That Are Not Allowed

Calling collective operations with mismatched ParallelGroup handles across ranks.
Using uninitialized or improperly sized destination tiles for receive-side operations.
Accessing data from an async operation before pto.twait returns.
Using these instructions on the CPU simulator profile.

Syntax¶

Assembly Form (PTO-AS)¶

pto.tbroadcast %group, %src : (!pto.group<8>, !pto.tile<f32, 16, 16>)
pto.treduce %group, %src, %dst : (!pto.group<8>, !pto.tile<f32, 16, 16>, !pto.tile<f32, 16, 16>) {op = "sum"}

SSA Form (AS Level 1)¶

%result = pto.tbroadcast %group, %src
    : (!pto.group<8>, !pto.tile<f32, 16, 16>) -> !pto.tile<f32, 16, 16>

C++ Intrinsic¶

Declared in include/pto/comm/pto_comm_inst.hpp:

#include <pto/comm/pto_comm_inst.hpp>
using namespace pto::comm;

// Broadcast across a parallel group
template <typename GroupType, typename GlobalData, typename TileData>
PTO_INST RecordEvent BROADCAST(GroupType& group, GlobalData& src, TileData& stagingTile);

// Collective reduction
template <typename GroupType, typename GlobalData, typename TileData, ReduceOp Op>
PTO_INST RecordEvent REDUCE(GroupType& group, GlobalData& src, TileData& stagingTile);

See the Communication ISA reference for the full per-op reference.