Consistency Baseline¶

PTO's memory model is built around explicit movement and explicit ordering. The baseline guarantee is intentionally narrower than "everything is globally ordered." PTO requires the program or the selected instruction set to express when data becomes visible across stages, instruction sets, and blocks.

Memory Spaces¶

PTO defines three architecturally distinct memory spaces:

Space	Address Qualifier	Scope	Visibility
Global Memory (GM)	`__gm__`	All AI Cores	Shared
Unified Buffer (UB)	`!pto.ptr<T, ub>`	Single AI Core	Core-local
Tile Buffer	`!pto.tile_buf<...>`	Single AI Core	Core-local, pipeline-specific

These are NOT interchangeable. Data must be explicitly moved between them, and each space has different visibility semantics.

Ordering Levels¶

PTO defines three levels of ordering guarantee:

Ordering Level	Description	Scope	How to Establish
Program Order	Operations within a single tile buffer or vector register file execute in program order	Single core	Implicit within a buffer
Event Order	Ordering between operations on different pipelines or buffers	Within a block	`set_flag`/`wait_flag` or `RecordEvent` chaining
Barrier Order	Ordering across multiple blocks	Grid-wide	`TBARRIER` / collective ops

Program Order¶

Within a single tile buffer or a single vector register, operations are ordered by program order:

TLOAD(a, ga);  // 1. Load a
TLOAD(b, gb);  // 2. Load b (ordered after 1, same buffer)
TADD(c, a, b); // 3. Compute (ordered after 1 and 2, same buffer)

No explicit synchronization is needed between operations on the same tile buffer.

Event Order¶

When data moves between different buffers or different pipelines, explicit event ordering is required:

RecordEvent e0 = TLOAD(a, ga);     // produces event
RecordEvent e1 = TLOAD(b, gb);     // produces event
TMATMUL(c, a, b, e0, e1);          // waits for e0 and e1 before starting

The RecordEvent return value is a handle to the ordering guarantee. Passing it as a WaitEvents argument to a subsequent operation establishes a happens-before edge.

Barrier Order¶

When multiple blocks must synchronize (e.g., after a collective operation), a grid-wide barrier is required:

pto.tbroadcast %tensor, %src : !pto.tile<f32,16,16> -> ()
pto.twait // block until all blocks have received the broadcast

What PTO Does NOT Guarantee Automatically¶

PTO does not automatically guarantee:

Guarantee NOT Given	Reason
That every cross-pipeline write is immediately visible to every consumer	Requires explicit `set_flag`/`wait_flag`
That vector register, tile buffer, and GM traffic share one implicit fence model	Each space has distinct visibility rules
That target-specific stronger ordering is portable	Stronger ordering on A5 does not apply to CPU/A2/A3
That UB writes are visible to GM reads without explicit `TSTORE` or `copy_ubuf_to_gm`	UB→GM requires explicit data movement
That GM writes are visible to UB reads without explicit `TLOAD` or `copy_gm_to_ubuf`	GM→UB requires explicit data movement

GM Visibility¶

Data written to GM by TSTORE or copy_ubuf_to_gm is guaranteed visible to subsequent GM reads by other blocks only after:

All prior store operations on that block have completed (program order within block).
Any required mem_bar, pipe_barrier, or collective synchronization has been issued.
The host runtime has confirmed completion (via event or runtime call).

The exact moment of GM visibility across blocks is hardware-specific: the ISA guarantees that the ordering contract is satisfied, but the exact timing of cross-block visibility depends on the target profile: - A2/A3 and A5: Visibility is guaranteed after the block-level synchronization (e.g., wait_flag/pipe_barrier or collective op) completes; the hardware enforces visibility within a bounded time window after the sync operation. - CPU simulator: Visibility is immediate within the same process address space, since there is no physical DMA or device memory hierarchy.

UB Visibility¶

UB is core-local: only the AI Core that owns the UB can read or write it. UB data is NOT visible to other cores.

UB visibility within a core follows program order plus event order. The following are guaranteed by the ISA:

UB reads within a core see all prior UB writes within that core (program order).
UB reads see data from copy_gm_to_ubuf only after the corresponding wait_flag returns.

Undefined, Unspecified, and Implementation-Defined¶

PTO uses these terms precisely:

Term	Meaning	Example
Undefined	Behavior is intentionally unspecified; any outcome is permitted	Reading a tile's out-of-valid-region element
Unspecified	The ISA does not define the behavior; implementations may choose	Exact cycle count of an operation
Implementation-Defined	The behavior is defined by the implementation but documented	FTZ behavior for denormals on A5

CPU, A2/A3, and A5 may differ in implementation detail and support subsets, but the baseline manual must still say clearly which ordering facts are:

Category	Portability
Program order within a tile buffer	Portable across all profiles
Event order via `set_flag`/`wait_flag`	Portable across all profiles (same semantics, different pipe spaces)
`RecordEvent` chaining	Portable across all profiles
Barrier order via collective ops	Not available on CPU; available on A2/A3 and A5
UB→GM implicit visibility	Not portable; requires explicit `TSTORE`/`copy_ubuf_to_gm`
A5-specific stronger ordering	A5-specific, not portable to CPU/A2/A3

Cases That Are Not Allowed¶

Cases That Are Not Allowed

Documenting implementation detail as though it were the portable memory model.
Hiding visibility requirements inside vague words like "usually ordered."
Mixing memory-model guarantees with scheduling heuristics.
Claiming that data is visible across blocks without an explicit synchronization operation.
Assuming that "the hardware does it automatically" without specifying which operation provides the guarantee.