Vector Instruction Set: Pipeline Sync¶
The pto.v* synchronization instruction sets inside PTO ISA are defined below. The operation forms describe the vector-pipe contract and the current A5-oriented target-profile details that backends must preserve when lowering legal PTO programs.
Category: Synchronization primitives for coordinating pipeline execution
Pipelines: MTE2 (GM→UB), PIPE_V (Vector), MTE3 (UB→GM)
The PTO ISA vector instruction model targets the A5's Decoupled Access-Execute architecture. The MTE and Vector pipelines run asynchronously, so explicit synchronization is required to prevent data hazards.
Intra-Core Pipeline Sync¶
These ops coordinate data flow between pipelines within a single vector core.
pto.set_flag¶
- syntax: pto.set_flag["SRC_PIPE", "DST_PIPE", "EVENT_ID"]
- semantics: Signal event from source pipe to destination pipe.
set_flag(src_pipe, dst_pipe, event_id);
Example: After MTE2 completes GM→UB transfer, signal Vector pipe:
pto.set_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]
pto.wait_flag¶
- syntax: pto.wait_flag["SRC_PIPE", "DST_PIPE", "EVENT_ID"]
- semantics: Block destination pipe until source pipe signals event.
wait_flag(src_pipe, dst_pipe, event_id);
Example: Vector pipe waits for MTE2 data to arrive:
pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]
pto.pipe_barrier¶
- syntax: pto.pipe_barrier "PIPE_*"
- semantics: Drain all pending ops in the specified pipe. All previously issued operations on that pipe complete before any subsequent operation begins.
pipe_barrier(pipe);
Pipe identifiers: PIPE_MTE2, PIPE_V, PIPE_MTE3
Example: Two back-to-back copy_ubuf_to_gm calls writing to the same GM address. Without a barrier, MTE3 may reorder them and the final GM value is non-deterministic:
// Both stores target the same GM address — order matters!
pto.copy_ubuf_to_gm %ub_partial_0, %gm_result, ...
// Without pipe_barrier, MTE3 could execute the second copy before the first
// completes, producing a non-deterministic result at %gm_result.
pto.pipe_barrier "PIPE_MTE3"
// After barrier: first copy is guaranteed complete. Second copy overwrites deterministically.
pto.copy_ubuf_to_gm %ub_partial_1, %gm_result, ...
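The effect of the barrier on the final GM value can be sketched with a toy Python model. It assumes, as a simplification of the text above, that stores issued to one pipe with no barrier between them may complete in any order. The names `BARRIER` and `possible_final_values` are hypothetical and are not part of PTO.

```python
BARRIER = object()  # hypothetical marker standing in for pto.pipe_barrier


def possible_final_values(stream):
    """Return the values that may end up at one shared GM address.

    `stream` lists store values in issue order, with BARRIER markers between
    them. Toy assumption: stores not separated by a barrier may complete in
    any order, so any store in the final barrier-delimited segment may be
    the last writer.
    """
    last_segment, current = [], []
    for op in stream:
        if op is BARRIER:
            if current:
                last_segment, current = current, []
        else:
            current.append(op)
    return set(current or last_segment)


# Without a barrier either store may land last; with one, only the second can.
print(possible_final_values(["partial_0", "partial_1"]))
print(possible_final_values(["partial_0", BARRIER, "partial_1"]))
```

The model ignores timing entirely; it only captures which stores are allowed to be the last writer.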
pto.get_buf¶
- syntax: pto.get_buf "PIPE_*", %buf_id, %mode : i64, i64
- semantics: Acquire buffer slot for inter-pipeline double-buffering coordination.
get_buf(pipe, buf_id, mode);
pto.rls_buf¶
- syntax: pto.rls_buf "PIPE_*", %buf_id, %mode : i64, i64
- semantics: Release buffer slot to allow the other pipeline to proceed.
rls_buf(pipe, buf_id, mode);
pto.mem_bar¶
- syntax: pto.mem_bar "BARRIER_TYPE"
- semantics: Intra-vector-pipe memory fence within __VEC_SCOPE__. Required when UB addresses alias between vector load/store operations.
mem_bar(barrier_type);
Barrier types:
| Type | Semantics |
|---|---|
| VV_ALL | All prior vector instructions complete before subsequent vector instructions |
| VST_VLD | All prior vector stores visible before subsequent loads |
| VLD_VST | All prior vector loads complete before subsequent stores |
Example: Ensure stores are visible before loads to same UB region:
pto.vsts %v0, %ub[%c0] : !pto.vreg<64xf32>, !pto.ptr<f32, ub>
pto.mem_bar "VST_VLD"
%v1 = pto.vlds %ub[%c0] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
Intra-Core Sync Patterns & Examples¶
Example 1: set_flag / wait_flag (Explicit Events)¶
Each cross-pipeline data dependency requires an explicit signal/wait pair. The programmer must manually insert set_flag after the producer and wait_flag before the consumer.
// ─── Stage 1: MTE2 loads data from GM into UB ───
pto.copy_gm_to_ubuf %gm_ptr, %ub_ptr, ...
// MTE2 signals: "UB data is ready for Vector pipe"
pto.set_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]
// ─── Stage 2: Vector pipe consumes UB data ───
// Vector waits until MTE2's signal arrives
pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]
scf.for %dummy = %c0 to %c1 step %c1 {
%v = pto.vlds %ub_ptr[%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
%mask = pto.pset_b32 "PAT_ALL" : !pto.mask<G>
%abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
pto.vsts %abs, %ub_out[%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask<b32>
} {llvm.loop.aivector_scope}
// Vector signals: "UB output is ready for MTE3"
pto.set_flag["PIPE_V", "PIPE_MTE3", "EVENT_ID0"]
// ─── Stage 3: MTE3 stores result from UB back to GM ───
// MTE3 waits until Vector's signal arrives
pto.wait_flag["PIPE_V", "PIPE_MTE3", "EVENT_ID0"]
pto.copy_ubuf_to_gm %ub_out, %gm_out, ...
Key property: Every cross-pipeline edge is an explicit (set_flag, wait_flag) pair. Simple for straight-line code, but gets verbose in loops (see Example 3).
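The signal/wait pairing in Example 1 can be mimicked in Python by treating each (source pipe, destination pipe, event ID) triple as a counting semaphore. This is only a behavioural sketch: the `EventFlags` class, the thread-per-pipe model and the list-based buffers are assumptions made for illustration, not a description of the hardware.

```python
import threading


class EventFlags:
    """Toy model: one counting semaphore per (src, dst, event_id) triple."""

    def __init__(self):
        self._flags = {}
        self._lock = threading.Lock()

    def _sem(self, key):
        with self._lock:
            return self._flags.setdefault(key, threading.Semaphore(0))

    def set_flag(self, src, dst, event_id):
        self._sem((src, dst, event_id)).release()

    def wait_flag(self, src, dst, event_id):
        self._sem((src, dst, event_id)).acquire()


def run_pipeline(data):
    flags = EventFlags()
    ub_in, ub_out, gm_out = [], [], []

    def mte2():
        ub_in.extend(data)                     # stands in for copy_gm_to_ubuf
        flags.set_flag("MTE2", "V", 0)

    def vector():
        flags.wait_flag("MTE2", "V", 0)
        ub_out.extend(abs(x) for x in ub_in)   # stands in for vlds / vabs / vsts
        flags.set_flag("V", "MTE3", 0)

    def mte3():
        flags.wait_flag("V", "MTE3", 0)
        gm_out.extend(ub_out)                  # stands in for copy_ubuf_to_gm

    # Consumers are started first to show that ordering comes from the flags.
    threads = [threading.Thread(target=f) for f in (mte3, vector, mte2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return gm_out


result = run_pipeline([-1.0, 2.0, -3.5])
print(result)
```

Removing either `wait_flag` call would let a consumer thread read an empty buffer, which is the data hazard the explicit pair prevents.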
Example 2: get_buf / rls_buf (Resource-Based)¶
Instead of naming events, each pipeline declares when it acquires (get_buf) and releases (rls_buf) a shared UB buffer. Cross-pipeline RAW/WAR dependencies are resolved implicitly by program order — if MTE2 releases buf_A and Vector later acquires buf_A, the hardware ensures the acquire cannot proceed until the release completes.
// ─── Stage 1: MTE2 loads data into UB ───
// MTE2 acquires ub_ptr — blocks if Vector hasn't released it from a prior iteration
pto.get_buf "PIPE_MTE2", %bufid_ub_ptr, %mode : i64, i64
pto.copy_gm_to_ubuf %gm_ptr, %ub_ptr, ...
// MTE2 done writing ub_ptr — release it so Vector can consume
pto.rls_buf "PIPE_MTE2", %bufid_ub_ptr, %mode : i64, i64
// ─── Stage 2: Vector computation ───
// Vector acquires ub_ptr (input) — blocks until MTE2 releases it (RAW: MTE2 write → V read)
pto.get_buf "PIPE_V", %bufid_ub_ptr, %mode : i64, i64
// Vector acquires ub_out (output) — blocks until MTE3 releases it from a prior iteration (WAR: MTE3 read → V write)
pto.get_buf "PIPE_V", %bufid_ub_out, %mode : i64, i64
scf.for %dummy = %c0 to %c1 step %c1 {
%mask = pto.pset_b32 "PAT_ALL" : !pto.mask<G>
%v = pto.vlds %ub_ptr[%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
%abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
pto.vsts %abs, %ub_out[%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask<b32>
} {llvm.loop.aivector_scope}
// Vector done reading ub_ptr — release so MTE2 can reuse it in next iteration
pto.rls_buf "PIPE_V", %bufid_ub_ptr, %mode : i64, i64
// Vector done writing ub_out — release so MTE3 can consume
pto.rls_buf "PIPE_V", %bufid_ub_out, %mode : i64, i64
// ─── Stage 3: MTE3 stores result to GM ───
// MTE3 acquires ub_out — blocks until Vector releases it (RAW: V write → MTE3 read)
pto.get_buf "PIPE_MTE3", %bufid_ub_out, %mode : i64, i64
pto.copy_ubuf_to_gm %ub_out, %gm_out, ...
// MTE3 done reading ub_out — release so Vector can reuse it in next iteration
pto.rls_buf "PIPE_MTE3", %bufid_ub_out, %mode : i64, i64
Key property: No event IDs needed. Dependencies are implicit from program order of get_buf/rls_buf on the same buffer ID. This becomes much more convenient in multi-iteration loops (see Example 3).
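One way to see how dependencies fall out of program order is to derive them mechanically from a listing of get_buf/rls_buf calls. The sketch below assumes a simplified rule: each get_buf waits for the most recent earlier rls_buf on the same buffer ID issued by a different pipe. The helper names `implicit_deps` and `stage` are hypothetical, and the real hardware protocol may track more state than this.

```python
def implicit_deps(program):
    """Map each get_buf index to the rls_buf index it must wait for (or None).

    `program` is a list of (op, pipe, buf_id) tuples in program order.
    """
    last_release = {}  # buf_id -> (index, pipe) of the most recent rls_buf
    deps = {}
    for idx, (op, pipe, buf) in enumerate(program):
        if op == "get_buf":
            prev = last_release.get(buf)
            deps[idx] = prev[0] if prev and prev[1] != pipe else None
        elif op == "rls_buf":
            last_release[buf] = (idx, pipe)
    return deps


def stage(pipe, *bufs):
    """One pipeline stage: acquire all buffers, then release them."""
    return [("get_buf", pipe, b) for b in bufs] + [("rls_buf", pipe, b) for b in bufs]


# Example 2, unrolled for two iterations over the same two buffers.
one_iter = stage("MTE2", "in") + stage("V", "in", "out") + stage("MTE3", "out")
deps = implicit_deps(one_iter * 2)
print(deps)
```

In the second iteration, MTE2's get_buf (index 8) depends on Vector's release of the input buffer (index 4), and Vector's get_buf on the output buffer (index 11) depends on MTE3's release (index 7). These are the WAR edges that no event ID had to name.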
Example 3: Ping/Pong Double-Buffering Loop¶
Double-buffering overlaps DMA and compute by using two UB buffers alternately. All three stages (MTE2, Vector, MTE3) appear in the same iteration — the hardware pipelines them across iterations because different iterations operate on different buffers (buf[i%2]).
Event ID scheme (set_flag / wait_flag)¶
With 2 ping/pong buffers and 2 pipeline pairs (MTE2↔V, V↔MTE3), set_flag/wait_flag needs 8 event IDs = 2 pipe-pairs × 2 buffers × (forward + reverse):
MTE2 ↔ Vector (input buffers):
| Event ID | Direction | Purpose |
|---|---|---|
| EVT_IN_FWD_0 | MTE2 → V | RAW: buf_in[0] data ready |
| EVT_IN_FWD_1 | MTE2 → V | RAW: buf_in[1] data ready |
| EVT_IN_REV_0 | V → MTE2 | WAR: Vector done reading buf_in[0] |
| EVT_IN_REV_1 | V → MTE2 | WAR: Vector done reading buf_in[1] |
Vector ↔ MTE3 (output buffers):
| Event ID | Direction | Purpose |
|---|---|---|
| EVT_OUT_FWD_0 | V → MTE3 | RAW: buf_out[0] result ready |
| EVT_OUT_FWD_1 | V → MTE3 | RAW: buf_out[1] result ready |
| EVT_OUT_REV_0 | MTE3 → V | WAR: MTE3 done reading buf_out[0] |
| EVT_OUT_REV_1 | MTE3 → V | WAR: MTE3 done reading buf_out[1] |
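The event ID count follows directly from the scheme in the two tables: one forward and one reverse event per buffer per pipe-pair. A short Python sketch enumerates them. The function name `event_ids` is hypothetical, and the `EVT_*` names are the symbolic labels used in this example rather than hardware identifiers.

```python
from itertools import product


def event_ids(pipe_pairs, num_buffers):
    """Enumerate one forward (RAW) and one reverse (WAR) event per buffer per pipe-pair."""
    return [
        f"EVT_{pair}_{direction}_{buf}"
        for pair, direction, buf in product(pipe_pairs, ("FWD", "REV"), range(num_buffers))
    ]


ids = event_ids(("IN", "OUT"), num_buffers=2)
print(len(ids))
print(ids)
```

With the same two pipe-pairs, a third buffer raises the count to 12, which shows how the ID budget grows with buffering depth.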
3a. set_flag / wait_flag version¶
// ═══ Pre-loop: prime ALL reverse-dependency signals ═══
// Both input and output buffers start unused. We must pre-send
// reverse-dep signals so the first iteration's wait_flags don't deadlock.
pto.set_flag["PIPE_V", "PIPE_MTE2", "EVT_IN_REV_0"] // ◀ PRIME: buf_in[0] "free"
pto.set_flag["PIPE_V", "PIPE_MTE2", "EVT_IN_REV_1"] // ◀ PRIME: buf_in[1] "free"
pto.set_flag["PIPE_MTE3", "PIPE_V", "EVT_OUT_REV_0"] // ◀ PRIME: buf_out[0] "free"
pto.set_flag["PIPE_MTE3", "PIPE_V", "EVT_OUT_REV_1"] // ◀ PRIME: buf_out[1] "free"
scf.for %i = %c0 to %N step %c1 {
// ── All 3 stages in same iteration, indexed by i%2 ──
// %pp = i % 2 (ping/pong selector for buffer & event IDs)
// ── MTE2: load tile[i] into buf_in[i%2] ──
// WAR: wait until Vector has released buf_in[i%2] from iteration i-2
pto.wait_flag["PIPE_V", "PIPE_MTE2", "EVT_IN_REV_{pp}"]
pto.copy_gm_to_ubuf %gm_ptr[%i], %ub_in[%pp], ...
// RAW: signal Vector that buf_in[i%2] data is ready
pto.set_flag["PIPE_MTE2", "PIPE_V", "EVT_IN_FWD_{pp}"]
// ── Vector: compute buf_in[i%2] → buf_out[i%2] ──
// RAW: wait for MTE2 to finish loading buf_in[i%2]
pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVT_IN_FWD_{pp}"]
// WAR: wait for MTE3 to finish reading buf_out[i%2] from iteration i-2
pto.wait_flag["PIPE_MTE3", "PIPE_V", "EVT_OUT_REV_{pp}"]
scf.for %dummy = %c0 to %c1 step %c1 {
%v = pto.vlds %ub_in[%pp][%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
%mask = pto.pset_b32 "PAT_ALL" : !pto.mask<G>
%abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
pto.vsts %abs, %ub_out[%pp][%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask<b32>
} {llvm.loop.aivector_scope}
// WAR: tell MTE2 "done reading buf_in[i%2]"
pto.set_flag["PIPE_V", "PIPE_MTE2", "EVT_IN_REV_{pp}"]
// RAW: tell MTE3 "buf_out[i%2] result ready"
pto.set_flag["PIPE_V", "PIPE_MTE3", "EVT_OUT_FWD_{pp}"]
// ── MTE3: store result from buf_out[i%2] to GM ──
// RAW: wait for Vector to finish writing buf_out[i%2]
pto.wait_flag["PIPE_V", "PIPE_MTE3", "EVT_OUT_FWD_{pp}"]
pto.copy_ubuf_to_gm %ub_out[%pp], %gm_out[%i], ...
// WAR: tell Vector "done reading buf_out[i%2]"
pto.set_flag["PIPE_MTE3", "PIPE_V", "EVT_OUT_REV_{pp}"]
}
// ═══ Post-loop: drain — match every pre-loop prime with a wait ═══
// Every set_flag must be paired with a wait_flag. The reverse-dep
// set_flags from the last two iterations have no matching wait_flag
// inside the loop (there is no iteration i+2). Drain them here.
pto.wait_flag["PIPE_V", "PIPE_MTE2", "EVT_IN_REV_{(N-1)%2}"] // ◀ DRAIN
pto.wait_flag["PIPE_V", "PIPE_MTE2", "EVT_IN_REV_{(N-2)%2}"] // ◀ DRAIN
pto.wait_flag["PIPE_MTE3", "PIPE_V", "EVT_OUT_REV_{(N-1)%2}"] // ◀ DRAIN
pto.wait_flag["PIPE_MTE3", "PIPE_V", "EVT_OUT_REV_{(N-2)%2}"] // ◀ DRAIN
What set_flag/wait_flag requires outside the loop:
- Before the loop (4 × set_flag): Prime every reverse-dependency event ID — one per buffer per pipe-pair. Without this, the first iteration's wait_flag for reverse deps would deadlock (no signal was ever sent).
- After the loop (4 × wait_flag): Drain the matching reverse-dep signals from the last iterations. Every set_flag must be paired with a wait_flag — the last loop iterations produce signals that no subsequent iteration consumes, so they must be drained explicitly.
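The priming and draining requirements can be checked with a small Python simulation of the flag traffic in 3a. It assumes that each set_flag executes on its source pipe, each wait_flag on its destination pipe, every pipe runs its own stream in order, and a flag is a simple counter. The helpers `build_streams` and `run` are hypothetical, and DMA and compute ops are omitted because only the flags matter here.

```python
from collections import Counter, deque


def build_streams(n, prime=True, drain=True):
    """Per-pipe flag instructions for n iterations of the ping/pong loop."""
    streams = {"MTE2": [], "V": [], "MTE3": []}
    if prime:
        for pp in (0, 1):
            streams["V"].append(("set", f"IN_REV_{pp}"))
            streams["MTE3"].append(("set", f"OUT_REV_{pp}"))
    for i in range(n):
        pp = i % 2
        streams["MTE2"] += [("wait", f"IN_REV_{pp}"), ("set", f"IN_FWD_{pp}")]
        streams["V"] += [("wait", f"IN_FWD_{pp}"), ("wait", f"OUT_REV_{pp}"),
                         ("set", f"IN_REV_{pp}"), ("set", f"OUT_FWD_{pp}")]
        streams["MTE3"] += [("wait", f"OUT_FWD_{pp}"), ("set", f"OUT_REV_{pp}")]
    if drain:
        for pp in (0, 1):
            streams["MTE2"].append(("wait", f"IN_REV_{pp}"))
            streams["V"].append(("wait", f"OUT_REV_{pp}"))
    return streams


def run(streams):
    """Run every pipe in order; a wait stalls its pipe until the flag count is positive.

    Returns (deadlocked, leftover_flags).
    """
    flags = Counter()
    queues = {pipe: deque(ops) for pipe, ops in streams.items()}
    progressed = True
    while progressed:
        progressed = False
        for q in queues.values():
            while q:
                op, event = q[0]
                if op == "wait":
                    if flags[event] == 0:
                        break  # this pipe stalls; give the others a turn
                    flags[event] -= 1
                else:
                    flags[event] += 1
                q.popleft()
                progressed = True
    deadlocked = any(queues.values())
    return deadlocked, +flags  # unary plus drops zero counts


print(run(build_streams(4)))               # primed and drained
print(run(build_streams(4, prime=False)))  # no priming
print(run(build_streams(4, drain=False)))  # no draining
```

Under these assumptions the primed and drained program finishes with no flags left set, the unprimed one stalls on the first reverse-dependency wait, and the undrained one leaves one pending signal on each of the four reverse events.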
3b. get_buf / rls_buf version¶
Same ping/pong double-buffering, but no pre-loop priming or post-loop draining needed. Buffer acquire/release semantics handle everything.
scf.for %i = %c0 to %N step %c1 {
// %pp = i % 2 (ping/pong selector)
// ── MTE2: load tile[i] into buf[i%2] ──
// Acquires buf[i%2] — on first iteration, buffer is free so proceeds immediately.
// On later iterations, blocks until Vector releases buf[i%2] (WAR: automatic).
pto.get_buf "PIPE_MTE2", %bufid_buf[%pp], %mode : i64, i64
pto.copy_gm_to_ubuf %gm_ptr[%i], %ub_buf[%pp], ...
pto.rls_buf "PIPE_MTE2", %bufid_buf[%pp], %mode : i64, i64
// ── Vector: compute on buf[i%2] ──
// Acquires buf[i%2]; blocks until MTE2 releases it (RAW: automatic)
pto.get_buf "PIPE_V", %bufid_buf[%pp], %mode : i64, i64
// Acquires out[i%2]; blocks until MTE3 releases it from iteration i-2 (WAR: automatic)
pto.get_buf "PIPE_V", %bufid_out[%pp], %mode : i64, i64
scf.for %dummy = %c0 to %c1 step %c1 {
%v = pto.vlds %ub_buf[%pp][%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
%mask = pto.pset_b32 "PAT_ALL" : !pto.mask<G>
%abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
pto.vsts %abs, %ub_out[%pp][%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask<b32>
} {llvm.loop.aivector_scope}
// Release buf[i%2]; MTE2 can reuse it in iteration i+2 (WAR resolved)
pto.rls_buf "PIPE_V", %bufid_buf[%pp], %mode : i64, i64
// Release out[i%2] so MTE3 can consume it
pto.rls_buf "PIPE_V", %bufid_out[%pp], %mode : i64, i64
// ── MTE3: store result ──
// Acquires out[i%2]; blocks until Vector releases it (RAW: automatic)
pto.get_buf "PIPE_MTE3", %bufid_out[%pp], %mode : i64, i64
pto.copy_ubuf_to_gm %ub_out[%pp], %gm_out[%i], ...
pto.rls_buf "PIPE_MTE3", %bufid_out[%pp], %mode : i64, i64
}
// No post-loop drain needed — last rls_buf completes the pipeline.
No priming, no draining, no event IDs. The acquire/release protocol on buffer IDs indexed by i%2 implicitly resolves all cross-pipeline dependencies:
- RAW (MTE2→V): Vector's get_buf blocks until MTE2's rls_buf on buf[i%2]
- WAR (V→MTE2): MTE2's get_buf in iteration i+2 blocks until Vector's rls_buf in iteration i (same buffer)
- First iteration: Buffer is initially free, so get_buf proceeds without blocking — no priming needed
Comparison Summary¶
| Aspect | set_flag / wait_flag | get_buf / rls_buf |
|---|---|---|
| Dependency model | Explicit event signals | Implicit via buffer acquire/release |
| IDs per pipe-pair | 4 = 2 buffers × 2 (fwd + rev); 8 across the two pipe-pairs in Example 3 | 1 fwd + 1 rev per buffer (shared global pool) |
| Total HW IDs | 8 per pipe-pair, grows with buffers | 32 global across all pipes |
| Reverse (WAR) deps | Extra set_flag/wait_flag pair per buffer | Handled automatically |
| Pre-loop setup | set_flag to prime each reverse dep | None |
| Post-loop teardown | wait_flag to drain all primed signals | None |
| Straight-line code | Simple, clear | Slightly more verbose (bracket each stage) |
| Ping/pong loops | 8 event IDs + 4 prime + 4 drain | Same pattern, no overhead |
| Best used for | Simple pipelines, fine-grained control | Double/multi-buffering, complex loops |
Inter-Core Sync¶
Note:
Inter-core sync is only needed for mixed Cube+Vector tasks where Cube produces data that Vector consumes (or vice versa). Vec-only tasks can ignore this section entirely.
These ops coordinate execution across the Cube block and Vector subblocks within a cluster. Each core cluster pairs 1 Cube block with 2 Vector subblocks (1:2 ratio); each has its own SU (Sequencer Unit) running an independent instruction stream.
Core Cluster (1:2 ratio)
┌─────────────────────────────────────────────┐
│  ┌──────────────┐    ┌──────────────┐       │
│  │  AIC (Cube)  │    │  AIV0 (Vec)  │       │
│  │  ┌────────┐  │    │  ┌────────┐  │       │
│  │  │   SU   │──┼────┼──│   SU   │  │       │
│  │  └────────┘  │    │  └────────┘  │       │
│  │  CUBE pipe   │    │  MTE2/V/MTE3 │       │
│  │  L0C buffer  │    │  UB (256KB)  │       │
│  └──────────────┘    └──────────────┘       │
│                      ┌──────────────┐       │
│                      │  AIV1 (Vec)  │       │
│                      │  ┌────────┐  │       │
│                      │  │   SU   │  │       │
│                      │  └────────┘  │       │
│                      │  MTE2/V/MTE3 │       │
│                      │  UB (256KB)  │       │
│                      └──────────────┘       │
└─────────────────────────────────────────────┘
Platform Comparison¶
| Aspect | A2A3 (Ascend 910) | A5 |
|---|---|---|
| Signal op | set_cross_core (mode2) | set_intra_block |
| Wait op | wait_flag_dev | wait_intra_core |
| Wait behavior | SU-level blocking (entire core stalls) | Per-pipeline (only named pipe stalls) |
| Semaphore pool | 16 IDs per cluster, 4-bit counter | 16 IDs, but 32-ID address space (see below) |
| C→V | Broadcast: one set reaches both AIV0+AIV1 | 1:1: separate set per subblock required |
| V→C | Reduce: Cube waits for both subblocks in one wait | 1:1: Cube needs separate wait per subblock |
A2A3: set_cross_core / wait_flag_dev¶
// mode2 broadcast/reduce semantics for 1:2 cluster
set_cross_core(pipe, semaphore_id); // pipe: VEC/MTE2/CUBE/FIX
wait_flag_dev(semaphore_id); // SU-level blocking
C→V Broadcast (one set reaches both):
AIC ──set_cross_core──┬──> AIV0 sema++
└──> AIV1 sema++
V→C Reduce (one wait for both):
AIV0 ──set_cross_core──┐
├──> AIC wait_flag_dev (blocks until BOTH)
AIV1 ──set_cross_core──┘
pto.set_cross_core¶
- syntax: pto.set_cross_core %core_id, %event_id : i64, i64
- semantics: Signal event to another core. Uses mode2 for 1:2 cluster on A2A3.
pto.wait_flag_dev¶
- syntax: pto.wait_flag_dev %core_id, %event_id : i64, i64
- semantics: Wait for event from another core. SU-level blocking: the entire core stalls.
A5: set_intra_block / wait_intra_core¶
set_intra_block(trigger_pipe, semaphore_id);
wait_intra_core(wait_pipe, semaphore_id); // only named pipe stalls
A5 semaphore address space: The hardware has 16 physical semaphore IDs but exposes a 32-ID address space to support 1:1 signaling to each subblock:
| ID Range | Target |
|---|---|
| 0–15 | AIV0 (subblock 0) |
| 16–31 (+16 offset) | AIV1 (subblock 1) |
This means C→V requires separate set_intra_block calls per subblock (no broadcast), and V→C requires separate wait_intra_core calls per subblock (no hardware reduce).
C→V on A5 (1:1, no broadcast; need two sets):
AIC ──set_intra_block(pipe, sema_id)─────> AIV0
AIC ──set_intra_block(pipe, sema_id+16)──> AIV1
V→C on A5 (1:1, no reduce; need two waits):
AIV0 ──set_intra_block──> AIC wait_intra_core(pipe, sema_id)
AIV1 ──set_intra_block──> AIC wait_intra_core(pipe, sema_id+16) // extra wait
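The mapping from a physical semaphore ID to the 32-ID address space can be written as a small helper. This sketch assumes AIV1's addresses start at 16, as the ID-range table indicates. The names `intra_block_address` and `cube_to_vector_sets` are hypothetical and do not correspond to any PTO op.

```python
PHYSICAL_IDS = 16  # physical semaphore IDs, per the address-space description


def intra_block_address(subblock, sema_id):
    """Map (subblock, physical semaphore ID) to the 32-ID address space."""
    if subblock not in (0, 1):
        raise ValueError("subblock must be 0 (AIV0) or 1 (AIV1)")
    if not 0 <= sema_id < PHYSICAL_IDS:
        raise ValueError("sema_id must be in 0..15")
    return subblock * PHYSICAL_IDS + sema_id


def cube_to_vector_sets(sema_id):
    """Addresses the Cube must signal to reach both subblocks (no broadcast)."""
    return [intra_block_address(sb, sema_id) for sb in (0, 1)]


print(cube_to_vector_sets(3))
```

The same pair of addresses applies in the V→C direction, where the Cube issues one wait_intra_core per address instead of a single reducing wait.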
pto.set_intra_block¶
- syntax: pto.set_intra_block %block_id, %event_id : i64, i64
- semantics: Signal event within a block (A5). Specifies trigger pipe. 1:1 per subblock.
pto.wait_intra_core¶
- syntax: pto.wait_intra_core %block_id, %event_id : i64, i64
- semantics: Wait for event within block (A5). Specifies which pipeline should wait: only that pipe stalls, while the SU and other pipes continue.
Wait Granularity Comparison¶
A2A3 wait_flag_dev (SU-level stall):
SU ──┬── PIPE_MTE2 ───╳ ALL STALLED
├── PIPE_V ───╳ ALL STALLED
└── PIPE_MTE3 ───╳ ALL STALLED
A5 wait_intra_core "PIPE_MTE2" (per-pipe stall):
SU ──┬── PIPE_MTE2 ───╳ STALLED (waiting for Cube)
├── PIPE_V ─── ✓ RUNNING
└── PIPE_MTE3 ─── ✓ RUNNING