Ordering And Synchronization

PTO does not assume that all execution resources are implicitly serialized. The machine model makes ordering visible where data or state moves across instruction sets, pipelines, or shared resources. The synchronization primitives, event model, and producer-consumer ordering contracts are described below.

Synchronization Primitives

PTO defines four categories of synchronization primitives, one per instruction set:

Tile Instructions Primitives

Primitive	Syntax	Description
`TSYNC`	`pto.tsync %events...` or `pto.tsync<Op>`	Wait on explicit `RecordEvent` tokens, or insert a pipeline barrier for a single op class
`set_flag`	`pto.set_flag["SRC_PIPE", "DST_PIPE", "EVENT_ID"]`	Signal an event from one pipeline to another
`wait_flag`	`pto.wait_flag["SRC_PIPE", "DST_PIPE", "EVENT_ID"]`	Wait for a previously-signaled event

TSYNC is the primary synchronization primitive of the tile instruction set. The event-wait form TSYNC(events...) establishes a happens-before edge on each RecordEvent token, ensuring that all prior tile operations that produced those events have completed. The barrier form TSYNC<Op>() inserts a pipeline barrier for all operations of class Op.

Note:

pipe_barrier (pto.pipe_barrier) is a scalar and control instructions primitive, not a tile instructions primitive. It appears in the Scalar Pipeline Sync instruction set.

Vector Instructions Primitives

Primitive	Syntax	Description
`set_flag` / `wait_flag`	`pto.set_flag[...]` / `pto.wait_flag[...]`	Event-based handoff between DMA and vector compute pipelines
`mem_bar`	`pto.mem_bar`	Memory fence; ordering boundary for GM↔UB traffic

In the vector instruction set, the DMA engine (MTE2) issues set_flag(PIPE_MTE2, PIPE_V, ID) to signal the vector pipeline that data is ready. The vector pipeline issues wait_flag(PIPE_MTE2, PIPE_V, ID) before consuming the data.

DMA Primitives

Primitive	Syntax	Description
`copy_gm_to_ubuf`	`pto.copy_gm_to_ubuf ...`	DMA: GM → UB
`copy_ubuf_to_gm`	`pto.copy_ubuf_to_gm ...`	DMA: UB → GM
`copy_ubuf_to_ubuf`	`pto.copy_ubuf_to_ubuf ...`	DMA: UB → UB (double-buffering)

DMA operations do not implicitly synchronize with the compute pipeline. Explicit set_flag/wait_flag pairs (or equivalent RecordEvent chaining) are required wherever a DMA transfer and a compute operation share data.

Communication Instructions Primitives

Primitive	Description
`TBROADCAST`	Broadcast data to all participating blocks
`TGET` / `TPUT`	Point-to-point communication between blocks
`TWAIT` / `TTEST`	Barrier synchronization across blocks
`TNOTIFY` / `TREDUCE`	Notification and reduction operations

Event Model

PTO uses an event-based synchronization model. Events carry ordering information between pipelines.

Event Lifecycle

Producer                                  Consumer
  │                                         │
  │  issue DMA / compute                    │
  │  ▼                                      │
  │  set_flag(SRC_PIPE, DST_PIPE, EVENT_ID)│
  │  (produces the event)                   │
  │                                         │
  │                              wait_flag(SRC_PIPE, DST_PIPE, EVENT_ID)
  │                              (consumes the event)
  │                                         │
  │  data/result available                  │
  ▼                                         ▼

An event is identified by a triple (src_pipe, dst_pipe, event_id):

Field	Values	Meaning
`src_pipe`	`PIPE_MTE1`, `PIPE_MTE2`, `PIPE_MTE3`, `PIPE_V`, `PIPE_M`	Source pipeline that produces the event
`dst_pipe`	`PIPE_MTE1`, `PIPE_MTE2`, `PIPE_MTE3`, `PIPE_V`, `PIPE_M`	Destination pipeline that consumes the event
`event_id`	0–15 (profile-specific)	Event slot identifier

Events are fire-and-forget in the ISA contract: producing a flag makes it available to all subsequent waiters on the same (src_pipe, dst_pipe, event_id) triple.
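
The triple-keyed, fire-and-forget behavior described above can be illustrated with a small host-side model. This is only a sketch: `EventBoard`, `wait_satisfied`, and `kMaxEventId` are hypothetical names invented for illustration, not PTO APIs, and the model does not capture blocking or hardware timing. It shows only that a flag is keyed by the full (src_pipe, dst_pipe, event_id) triple and remains available to later waiters.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <tuple>

// Pipelines listed in the event-field table above.
enum class Pipe { MTE1, MTE2, MTE3, V, M };

// Hypothetical host-side model of the event flags; not a PTO API.
class EventBoard {
public:
    // Record that the triple (src, dst, id) has been signaled.
    void set_flag(Pipe src, Pipe dst, int id) {
        check_id(id);
        signaled_.emplace(src, dst, id);
    }

    // True if a wait on (src, dst, id) would already be satisfied.
    // Querying does not clear the flag, so later waiters on the
    // same triple also see it.
    bool wait_satisfied(Pipe src, Pipe dst, int id) const {
        check_id(id);
        return signaled_.count(std::make_tuple(src, dst, id)) > 0;
    }

private:
    // Assumed upper bound; the text calls the 0-15 range profile-specific.
    static constexpr int kMaxEventId = 15;

    static void check_id(int id) {
        if (id < 0 || id > kMaxEventId) {
            throw std::out_of_range("event_id outside the assumed 0-15 range");
        }
    }

    std::set<std::tuple<Pipe, Pipe, int>> signaled_;
};
```

In this model, a flag set on (PIPE_MTE2, PIPE_V, 0) satisfies any number of later waits on that triple, but never a wait on (PIPE_V, PIPE_MTE3, 0).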

Events and RecordEvent

The C++ intrinsics for tile operations (e.g., TLOAD, TSTORE, TMATMUL) return a RecordEvent value. This event can be passed as a WaitEvents... argument to subsequent operations, establishing a happens-before edge:

RecordEvent e0 = TLOAD(a, ga);     // produces event
RecordEvent e1 = TLOAD(b, gb);     // produces event
TMATMUL(c, a, b, e0, e1);          // waits for both e0 and e1 before executing

The RecordEvent return value is the ISA-visible mechanism for chaining dependencies in the tile instruction set. Chaining is equivalent to inserting explicit set_flag/wait_flag pairs, but is expressed at a higher level.
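
To make the happens-before edge concrete, the chaining pattern can be modeled with logical timestamps. This is a minimal sketch under stated assumptions: `SimEvent` and `issue` are hypothetical stand-ins (not the real RecordEvent type or tile intrinsics), the latencies are made up, and independent operations are assumed to run fully in parallel.

```cpp
#include <algorithm>
#include <cassert>
#include <initializer_list>

// Hypothetical stand-in for RecordEvent: the logical time at which the
// producing operation completes. Not the real PTO type.
struct SimEvent {
    int done_at;
};

// Issue an operation with the given latency. It starts only after every
// event in `waits` has completed, which models the happens-before edge.
SimEvent issue(int latency, std::initializer_list<SimEvent> waits = {}) {
    int start = 0;
    for (const SimEvent& e : waits) {
        start = std::max(start, e.done_at);
    }
    return SimEvent{start + latency};
}

// Mirrors the TLOAD/TLOAD/TMATMUL snippet above with made-up latencies.
int matmul_done_at() {
    SimEvent e0 = issue(4);            // stands in for TLOAD(a, ga)
    SimEvent e1 = issue(6);            // stands in for TLOAD(b, gb)
    SimEvent m = issue(10, {e0, e1});  // stands in for TMATMUL(c, a, b, e0, e1)
    return m.done_at;
}
```

With these assumed latencies the matmul cannot start before logical time 6, when the slower of the two loads completes, so it finishes at time 16.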

Pipeline Dependency Graph

The AI Core contains multiple execution units that operate concurrently. The following diagram shows the dependency relationships:

               ┌──────────────────────────────────────────────┐
GM ──► MTE2 ──►│ Unified Buffer / Tile Buffer                 │──► MTE3 ──► GM
               │                                              │
               │  ┌───────────────┐     ┌──────────────────┐  │
GM ──► MTE1 ──►│  │ Tile Register │────►│ Vector Pipeline  │  │
               │  │ File          │     │ (pto.v* ops)     │  │
               │  │ Vec / Mat /   │     └────────┬─────────┘  │
               │  │ Acc locations │              │            │
               │  └───────┬───────┘              │            │
               │          │                      ▼            │
               │          │             ┌──────────────────┐  │
               │          └────────────►│ Matrix Pipeline  │  │
               │                        │ (pto.tmatmul*)   │  │
               │                        └────────┬─────────┘  │
               └─────────────────────────────────┼────────────┘
                                                 │
              Scalar Unit: control flow, address generation, system queries

Dependency Types

Producer	Consumer	Synchronization Required
MTE2 (DMA GM→UB)	Vector pipeline (vlds)	`set_flag(PIPE_MTE2, PIPE_V, ID)` → `wait_flag`
Vector pipeline	MTE3 (store)	`set_flag(PIPE_V, PIPE_MTE3, ID)` → `wait_flag`
TLOAD	Tile compute	`RecordEvent` chaining or `TSYNC`
Tile compute	TSTORE	`RecordEvent` chaining or `TSYNC`
TLOAD	TMATMUL	`RecordEvent` chaining or `set_flag`/`wait_flag`
Tile compute (Mat)	Tile compute (Vec)	`set_flag`/`wait_flag` or `TSYNC`
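
The table above can also be held as data, for example in a lint or documentation-check tool. The sketch below is a plain transcription of the rows; `DependencyRule`, `dependency_rules`, and `required_sync` are hypothetical helper names, and the producer and consumer strings are shortened labels chosen for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One row of the dependency table above.
struct DependencyRule {
    std::string producer;
    std::string consumer;
    std::string sync;
};

// Transcription of the table; the lookup helper itself is hypothetical.
const std::vector<DependencyRule>& dependency_rules() {
    static const std::vector<DependencyRule> rules = {
        {"MTE2", "Vector pipeline", "set_flag(PIPE_MTE2, PIPE_V, ID) -> wait_flag"},
        {"Vector pipeline", "MTE3", "set_flag(PIPE_V, PIPE_MTE3, ID) -> wait_flag"},
        {"TLOAD", "Tile compute", "RecordEvent chaining or TSYNC"},
        {"Tile compute", "TSTORE", "RecordEvent chaining or TSYNC"},
        {"TLOAD", "TMATMUL", "RecordEvent chaining or set_flag/wait_flag"},
        {"Tile compute (Mat)", "Tile compute (Vec)", "set_flag/wait_flag or TSYNC"},
    };
    return rules;
}

// Returns the documented synchronization for a producer/consumer pair,
// or an empty string if the pair is not listed in the table.
std::string required_sync(const std::string& producer,
                          const std::string& consumer) {
    for (const DependencyRule& r : dependency_rules()) {
        if (r.producer == producer && r.consumer == consumer) {
            return r.sync;
        }
    }
    return "";
}
```

An empty result means only that the pair is absent from the table, not that the pair is safe without synchronization.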

Ordering Rules

Tile Instructions Ordering

Tile-instruction set operations are ordered by program order within a single tile buffer, and by event ordering across tile buffers. The following rules apply:

Tile-local order: Within a single tile buffer, operations execute in program order. TSYNC establishes a barrier within that tile's ordering stream.
Event ordering: A set_flag/wait_flag pair establishes a happens-before edge between the producer pipeline and the consumer pipeline.
RecordEvent chaining: When an operation's WaitEvents... arguments include events from prior operations, those prior operations must complete before the current operation begins.

Vector Instructions Ordering

Vector-instruction set ordering follows these rules:

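DMA ordering: copy_gm_to_ubuf must complete before any vlds that consumes the copied data. Completion is signaled by set_flag(PIPE_MTE2, PIPE_V, ID) and observed by the matching wait_flag.
Compute ordering: Vector operations within a SimdVecScopeOp execute in program order.
Store ordering: vsts must complete before copy_ubuf_to_gm begins copying the data back to GM. Completion is signaled by set_flag(PIPE_V, PIPE_MTE3, ID) and observed by the matching wait_flag.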
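
The DMA and store ordering rules can be checked mechanically on a simplified trace. The sketch below is an illustration, not a PTO tool: `Op` and `first_hazard` are hypothetical names, the trace is a single program-order sequence over one buffer, and event IDs are ignored. It flags a vlds or copy_ubuf_to_gm that is not preceded by the required wait_flag, and a wait_flag that has no earlier matching set_flag.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical op kinds for a single-buffer trace; names mirror the text.
enum class Op {
    CopyGmToUbuf, SetFlagMte2ToV, WaitFlagMte2ToV, Vlds,
    Vsts, SetFlagVToMte3, WaitFlagVToMte3, CopyUbufToGm
};

// Returns the index of the first op that violates the ordering rules,
// or -1 if the trace is consistent with them.
int first_hazard(const std::vector<Op>& trace) {
    bool load_unsynced = false;   // copy_gm_to_ubuf issued, no wait_flag yet
    bool load_signaled = false;   // set_flag(PIPE_MTE2, PIPE_V, ID) seen
    bool store_unsynced = false;  // vsts issued, no wait_flag yet
    bool store_signaled = false;  // set_flag(PIPE_V, PIPE_MTE3, ID) seen

    for (std::size_t i = 0; i < trace.size(); ++i) {
        const int index = static_cast<int>(i);
        switch (trace[i]) {
        case Op::CopyGmToUbuf:
            load_unsynced = true;
            load_signaled = false;  // a new transfer needs a new signal
            break;
        case Op::SetFlagMte2ToV:
            load_signaled = true;
            break;
        case Op::WaitFlagMte2ToV:
            if (!load_signaled) return index;  // wait with no prior set_flag
            load_unsynced = false;
            break;
        case Op::Vlds:
            if (load_unsynced) return index;   // read before DMA is observed
            break;
        case Op::Vsts:
            store_unsynced = true;
            store_signaled = false;
            break;
        case Op::SetFlagVToMte3:
            store_signaled = true;
            break;
        case Op::WaitFlagVToMte3:
            if (!store_signaled) return index;
            store_unsynced = false;
            break;
        case Op::CopyUbufToGm:
            if (store_unsynced) return index;  // copy-out before vsts is observed
            break;
        }
    }
    return -1;
}
```

For example, the trace copy_gm_to_ubuf, vlds is reported at index 1, while the fully synchronized load, compute, store sequence returns -1.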

GM Visibility

Data written to GM by TSTORE or copy_ubuf_to_gm is guaranteed visible to subsequent GM reads by other blocks only after:

All prior store operations on that block have completed (program order).
Any required mem_bar or pipe_barrier has been issued.
The operation has been synchronized with the host runtime (event completion).

Constraints

Synchronization is required wherever the architecture does not already guarantee ordering.
A target may add stronger internal ordering, but the manual must not rely on undocumented strength.
Vector-pipe synchronization rules must be documented separately from tile-instruction set synchronization rules when the mechanisms differ.
Events are fire-and-forget; the ISA does not provide a "test-and-clear" event flag.
TSYNC is tile-buffer-scoped; it does not synchronize across tile buffers.

Cases That Are Not Allowed

Writing the manual as if synchronization were optional when the architecture requires it.
Assuming vector-pipe hazards are covered by tile-instruction set rules without saying so.
Documenting target-specific barriers as architecture-wide unless the PTO instruction set guarantees them.
Issuing vlds before copy_gm_to_ubuf completes without an intervening wait_flag.
Issuing copy_ubuf_to_gm before vsts completes without an intervening wait_flag.