Producer-Consumer Ordering¶
Producer-consumer ordering is the most useful way to explain PTO visibility rules. A program is legal when each consumer sees the writes or state changes its producer is required to make visible, using the synchronization and movement rules of the active instruction set.
Producer-Consumer State Machine¶
Every data movement or compute operation participates in a producer-consumer chain. The state machine for each operation is:
┌─────────────────┐
│ IDLE │ Operation not yet issued
└────────┬────────┘
│ issue
▼
┌─────────────────┐
│ IN_PROGRESS │ Operation executing (may be on different pipeline)
└────────┬────────┘
│ completion (produces event)
▼
┌─────────────────┐
│ COMPLETE │ Result visible to consumers who have
└────────┬────────┘ established the ordering edge
│ (consumed by next operation)
▼
[Consumer]
An operation is consumed by a subsequent operation when the consumer either:
- Passes the producer's
RecordEventas aWaitEventsargument. - Issues a
wait_flagfor the same event that the producer issued.
Tile Instructions Ordering¶
For pto.t* programs, the common pattern is:
┌─────────────────────────────────────────────────────┐
│ TLOAD(tile, gtensor) │
│ (produces tile state) │
└─────────────────┬───────────────────────────────────┘
│ RecordEvent or implicit TSYNC
▼
┌─────────────────────────────────────────────────────┐
│ Tile Compute (TADD, TMATMUL, etc.) │
│ (consumes tile state; produces tile state) │
└─────────────────┬───────────────────────────────────┘
│ RecordEvent or explicit TSYNC
▼
┌─────────────────────────────────────────────────────┐
│ TSTORE(gtensor, tile) │
│ (consumes tile state; produces GM write) │
└─────────────────────────────────────────────────────┘
RecordEvent Chaining¶
The RecordEvent return value of each tile operation can be passed to the next operation as a WaitEvents... argument:
RecordEvent e0 = TLOAD(a, ga); // e0: TLOAD has completed
RecordEvent e1 = TLOAD(b, gb); // e1: TLOAD has completed
TMATMUL(c, a, b, e0, e1); // waits for e0 and e1 before starting
RecordEvent e2 = TMATMUL(...);
TSTORE(gc, c, e2); // waits for e2 before starting
When an operation has multiple WaitEvents... arguments, it waits for ALL of them before beginning execution.
TSYNC¶
TSYNC provides a lightweight tile-buffer-scoped barrier when fine-grained event chaining is not needed:
TLOAD(a, ga);
TLOAD(b, gb);
TSYNC(); // ensures both loads are complete before compute
TADD(c, a, b);
TSYNC(); // ensures compute is complete before store
TSTORE(gc, c);
TSYNC is equivalent to chaining all prior RecordEvent values for the same tile buffer.
Vector Instructions Ordering¶
For pto.v* programs, the ordering chain involves explicit DMA synchronization:
┌──────────────────────────────────────────────────────────┐
│ copy_gm_to_ubuf(%ub, %gm, ...) │
│ (DMA: GM → UB) │
└────────────────┬─────────────────────────────────────────┘
│ set_flag(PIPE_MTE2, PIPE_V, EVENT_ID0)
▼
┌──────────────────────────────────────────────────────────┐
│ wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID0) │
│ (UB data now visible to Vector pipeline) │
└────────────────┬─────────────────────────────────────────┘
│ (implicit on vlds)
▼
┌──────────────────────────────────────────────────────────┐
│ vlds %vreg, %ub[...] {dist = "NORM"} │
│ (UB → Vector Register) │
└────────────────┬─────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ vadd %result, %vreg, %vreg │
│ (Vector Compute on Vector Register) │
└────────────────┬─────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ vsts %result, %ub[...] │
│ (Vector Register → UB) │
└────────────────┬─────────────────────────────────────────┘
│ set_flag(PIPE_V, PIPE_MTE3, EVENT_ID1)
▼
┌──────────────────────────────────────────────────────────┐
│ wait_flag(PIPE_V, PIPE_MTE3, EVENT_ID1) │
│ (Vector result now staged for DMA) │
└────────────────┬─────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ copy_ubuf_to_gm(%gm, %ub, ...) │
│ (DMA: UB → GM) │
└──────────────────────────────────────────────────────────┘
Vector Instructions vs Tile Instructions Ordering¶
| Aspect | Tile Instructions | Vector Instructions |
|---|---|---|
| Synchronization mechanism | RecordEvent, TSYNC |
set_flag/wait_flag on pipe pairs |
| Data path | GM ↔ Tile Buffer (via MTE2/MTE3) | GM ↔ UB ↔ Vector Register (via DMA + vlds/vsts) |
| Visibility model | Producer-consumer chain via events | DMA signal → wait → vlds → compute → vsts → DMA signal |
| Implicit ordering | Within same tile buffer | None — explicit flag required between DMA and compute |
| Store path | Tile → MTE3 → GM | Vector Register → vsts → UB → MTE3 → GM |
Cross-Instruction Set Handoff¶
When a tile-instruction set result is consumed by a vector-instruction set operation (or vice versa), the handoff must go through UB:
Tile Instructions Vector Instructions
│ ▲
│ TLOAD/TSTORE handles GM ↔ Tile Buffer │
│ │
└──── TSTORE → UB → copy_ubuf_to_gm ──────────┘
(via copy_gm_to_ubuf on vector side)
The cross-instruction set handoff goes through GM or through an explicit UB double-buffering pattern:
// Tile instructions produce result in tile c
TSTORE(gc, c);
// Vector instructions consume from gc
copy_gm_to_ubuf(%ub, %gm_out, ...);
set_flag(PIPE_MTE2, PIPE_V, ID);
wait_flag(PIPE_MTE2, PIPE_V, ID);
%v = pto.vlds %ub[...] {dist = "NORM"};
Constraints¶
Constraints
- A consumer may only rely on visibility after the required producer-consumer edge is established.
- The exact synchronization mechanism may vary by instruction set or target profile.
- Instruction Set docs and per-op pages must state the relevant ordering expectations explicitly.
- An operation's
RecordEventreturn value is only valid for chaining to operations that execute AFTER the current operation in program order.
Cases That Are Not Allowed¶
Cases That Are Not Allowed
- Describing a consumer as legal without saying how producer visibility is established.
- Assuming a target's convenient scheduling behavior is the architecture contract.
- Leaving cross-instruction set handoff rules implicit.
- Issuing
vldsbeforecopy_gm_to_ubufcompletes without an interveningwait_flag. - Issuing
copy_ubuf_to_gmbeforevstscompletes without an interveningwait_flag. - Passing a
RecordEventfrom a later operation to an earlier operation (wrong direction) — this is illegal and produces a verification error.