Vector Instruction Set: Pipeline Sync¶

The pto.v* synchronization instruction sets inside PTO ISA are defined below. The operation forms describe the vector-pipe contract and the current A5-oriented target-profile details that backends must preserve when lowering legal PTO programs.

Category: Synchronization primitives for coordinating pipeline execution Pipelines: MTE2 (GM→UB), PIPE_V (Vector), MTE3 (UB→GM)

The PTO ISA vector instructions model operates on the A5's Decoupled Access-Execute architecture. The MTE and Vector pipelines run asynchronously, requiring explicit synchronization to prevent data hazards.

Intra-Core Pipeline Sync¶

These ops coordinate data flow between pipelines within a single vector core.

`pto.set_flag`¶

syntax: pto.set_flag["SRC_PIPE", "DST_PIPE", "EVENT_ID"]
semantics: Signal event from source pipe to destination pipe.

set_flag(src_pipe, dst_pipe, event_id);

Example: After MTE2 completes GM→UB transfer, signal Vector pipe:

pto.set_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]

`pto.wait_flag`¶

syntax: pto.wait_flag["SRC_PIPE", "DST_PIPE", "EVENT_ID"]
semantics: Block destination pipe until source pipe signals event.

wait_flag(src_pipe, dst_pipe, event_id);

Example: Vector pipe waits for MTE2 data to arrive:

pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]

`pto.pipe_barrier`¶

syntax: pto.pipe_barrier "PIPE_*"
semantics: Drain all pending ops in the specified pipe. All previously issued operations on that pipe complete before any subsequent operation begins.

pipe_barrier(pipe);

Pipe identifiers: PIPE_MTE2, PIPE_V, PIPE_MTE3

Example: Two back-to-back copy_ubuf_to_gm calls writing to the same GM address. Without a barrier, MTE3 may reorder them and the final GM value is non-deterministic:

// Both stores target the same GM address — order matters!
pto.copy_ubuf_to_gm %ub_partial_0, %gm_result, ...
// Without pipe_barrier, MTE3 could execute the second copy before the first
// completes, producing a non-deterministic result at %gm_result.
pto.pipe_barrier "PIPE_MTE3"
// After barrier: first copy is guaranteed complete. Second copy overwrites deterministically.
pto.copy_ubuf_to_gm %ub_partial_1, %gm_result, ...

`pto.get_buf`¶

syntax: pto.get_buf "PIPE_*", %buf_id, %mode : i64, i64
semantics: Acquire buffer slot for inter-pipeline double-buffering coordination.

get_buf(pipe, buf_id, mode);

`pto.rls_buf`¶

syntax: pto.rls_buf "PIPE_*", %buf_id, %mode : i64, i64
semantics: Release buffer slot to allow other pipeline to proceed.

rls_buf(pipe, buf_id, mode);

`pto.mem_bar`¶

syntax: pto.mem_bar "BARRIER_TYPE"
semantics: Intra-vector-pipe memory fence within __VEC_SCOPE__. Required when UB addresses alias between vector load/store operations.

mem_bar(barrier_type);

Barrier types:

Type	Semantics
`VV_ALL`	All prior vector instructions complete before subsequent
`VST_VLD`	All prior vector stores visible before subsequent loads
`VLD_VST`	All prior vector loads complete before subsequent stores

Example: Ensure stores are visible before loads to same UB region:

pto.vsts %v0, %ub[%c0] : !pto.vreg<64xf32>, !pto.ptr<f32, ub>
pto.mem_bar "VST_VLD"
%v1 = pto.vlds %ub[%c0] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>

Intra-Core Sync Patterns & Examples¶

Example 1: `set_flag` / `wait_flag` (Explicit Events)¶

Each cross-pipeline data dependency requires an explicit signal/wait pair. The programmer must manually insert set_flag after the producer and wait_flag before the consumer.

// ─── Stage 1: MTE2 loads data from GM into UB ───
pto.copy_gm_to_ubuf %gm_ptr, %ub_ptr, ...

// MTE2 signals: "UB data is ready for Vector pipe"
pto.set_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]

// ─── Stage 2: Vector pipe consumes UB data ───
// Vector waits until MTE2's signal arrives
pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]

scf.for %dummy = %c0 to %c1 step %c1 {
  %v   = pto.vlds %ub_ptr[%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
  %mask = pto.pset_b32 "PAT_ALL" : !pto.mask<G>
  %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
  pto.vsts %abs, %ub_out[%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask<b32>
} {llvm.loop.aivector_scope}

// Vector signals: "UB output is ready for MTE3"
pto.set_flag["PIPE_V", "PIPE_MTE3", "EVENT_ID0"]

// ─── Stage 3: MTE3 stores result from UB back to GM ───
// MTE3 waits until Vector's signal arrives
pto.wait_flag["PIPE_V", "PIPE_MTE3", "EVENT_ID0"]

pto.copy_ubuf_to_gm %ub_out, %gm_out, ...

Key property: Every cross-pipeline edge is an explicit (set_flag, wait_flag) pair. Simple for straight-line code, but gets verbose in loops (see Example 3).

Example 2: `get_buf` / `rls_buf` (Resource-Based)¶

Instead of naming events, each pipeline declares when it acquires (get_buf) and releases (rls_buf) a shared UB buffer. Cross-pipeline RAW/WAR dependencies are resolved implicitly by program order — if MTE2 releases buf_A and Vector later acquires buf_A, the hardware ensures the acquire cannot proceed until the release completes.

// ─── Stage 1: MTE2 loads data into UB ───
// MTE2 acquires ub_ptr — blocks if Vector hasn't released it from a prior iteration
pto.get_buf "PIPE_MTE2", %bufid_ub_ptr, %mode : i64, i64
pto.copy_gm_to_ubuf %gm_ptr, %ub_ptr, ...
// MTE2 done writing ub_ptr — release it so Vector can consume
pto.rls_buf "PIPE_MTE2", %bufid_ub_ptr, %mode : i64, i64

// ─── Stage 2: Vector computation ───
// Vector acquires ub_ptr (input) — blocks until MTE2 releases it (RAW: MTE2 write → V read)
pto.get_buf "PIPE_V", %bufid_ub_ptr, %mode : i64, i64
// Vector acquires ub_out (output) — blocks until MTE3 releases it from a prior iteration (WAR: MTE3 read → V write)
pto.get_buf "PIPE_V", %bufid_ub_out, %mode : i64, i64

scf.for %dummy = %c0 to %c1 step %c1 {
  %mask = pto.pset_b32 "PAT_ALL" : !pto.mask<G>
  %v   = pto.vlds %ub_ptr[%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
  %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
  pto.vsts %abs, %ub_out[%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask<b32>
} {llvm.loop.aivector_scope}

// Vector done reading ub_ptr — release so MTE2 can reuse it in next iteration
pto.rls_buf "PIPE_V", %bufid_ub_ptr, %mode : i64, i64
// Vector done writing ub_out — release so MTE3 can consume
pto.rls_buf "PIPE_V", %bufid_ub_out, %mode : i64, i64

// ─── Stage 3: MTE3 stores result to GM ───
// MTE3 acquires ub_out — blocks until Vector releases it (RAW: V write → MTE3 read)
pto.get_buf "PIPE_MTE3", %bufid_ub_out, %mode : i64, i64
pto.copy_ubuf_to_gm %ub_out, %gm_out, ...
// MTE3 done reading ub_out — release so Vector can reuse it in next iteration
pto.rls_buf "PIPE_MTE3", %bufid_ub_out, %mode : i64, i64

Key property: No event IDs needed. Dependencies are implicit from program order of get_buf/rls_buf on the same buffer ID. This becomes much more convenient in multi-iteration loops (see Example 3).

Example 3: Ping/Pong Double-Buffering Loop¶

Double-buffering overlaps DMA and compute by using two UB buffers alternately. All three stages (MTE2, Vector, MTE3) appear in the same iteration — the hardware pipelines them across iterations because different iterations operate on different buffers (buf[i%2]).

Event ID scheme (`set_flag` / `wait_flag`)¶

With 2 ping/pong buffers and 2 pipeline pairs (MTE2↔V, V↔MTE3), set_flag/wait_flag needs 8 event IDs = 2 pipe-pairs × 2 buffers × (forward + reverse):

MTE2 ↔ Vector (input buffers):

Event ID	Direction	Purpose
`EVT_IN_FWD_0`	MTE2 → V	RAW: buf_in[0] data ready
`EVT_IN_FWD_1`	MTE2 → V	RAW: buf_in[1] data ready
`EVT_IN_REV_0`	V → MTE2	WAR: Vector done reading buf_in[0]
`EVT_IN_REV_1`	V → MTE2	WAR: Vector done reading buf_in[1]

Vector ↔ MTE3 (output buffers):

Event ID	Direction	Purpose
`EVT_OUT_FWD_0`	V → MTE3	RAW: buf_out[0] result ready
`EVT_OUT_FWD_1`	V → MTE3	RAW: buf_out[1] result ready
`EVT_OUT_REV_0`	MTE3 → V	WAR: MTE3 done reading buf_out[0]
`EVT_OUT_REV_1`	MTE3 → V	WAR: MTE3 done reading buf_out[1]

3a. `set_flag` / `wait_flag` version¶

// ═══ Pre-loop: prime ALL reverse-dependency signals ═══
// Both input and output buffers start unused. We must pre-send
// reverse-dep signals so the first iteration's wait_flags don't deadlock.
pto.set_flag["PIPE_V",    "PIPE_MTE2", "EVT_IN_REV_0"]   // ◀ PRIME: buf_in[0] "free"
pto.set_flag["PIPE_V",    "PIPE_MTE2", "EVT_IN_REV_1"]   // ◀ PRIME: buf_in[1] "free"
pto.set_flag["PIPE_MTE3", "PIPE_V",    "EVT_OUT_REV_0"]  // ◀ PRIME: buf_out[0] "free"
pto.set_flag["PIPE_MTE3", "PIPE_V",    "EVT_OUT_REV_1"]  // ◀ PRIME: buf_out[1] "free"

scf.for %i = %c0 to %N step %c1 {
  // ── All 3 stages in same iteration, indexed by i%2 ──
  // %pp = i % 2  (ping/pong selector for buffer & event IDs)

  // ── MTE2: load tile[i] into buf_in[i%2] ──
  // WAR: wait until Vector has released buf_in[i%2] from iteration i-2
  pto.wait_flag["PIPE_V", "PIPE_MTE2", "EVT_IN_REV_{pp}"]
  pto.copy_gm_to_ubuf %gm_ptr[%i], %ub_in[%pp], ...
  // RAW: signal Vector that buf_in[i%2] data is ready
  pto.set_flag["PIPE_MTE2", "PIPE_V", "EVT_IN_FWD_{pp}"]

  // ── Vector: compute buf_in[i%2] → buf_out[i%2] ──
  // RAW: wait for MTE2 to finish loading buf_in[i%2]
  pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVT_IN_FWD_{pp}"]
  // WAR: wait for MTE3 to finish reading buf_out[i%2] from iteration i-2
  pto.wait_flag["PIPE_MTE3", "PIPE_V", "EVT_OUT_REV_{pp}"]
  scf.for %dummy = %c0 to %c1 step %c1 {
    %v   = pto.vlds %ub_in[%pp][%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
    %mask = pto.pset_b32 "PAT_ALL" : !pto.mask<G>
    %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
    pto.vsts %abs, %ub_out[%pp][%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask<b32>
  } {llvm.loop.aivector_scope}
  // WAR: tell MTE2 "done reading buf_in[i%2]"
  pto.set_flag["PIPE_V", "PIPE_MTE2", "EVT_IN_REV_{pp}"]
  // RAW: tell MTE3 "buf_out[i%2] result ready"
  pto.set_flag["PIPE_V", "PIPE_MTE3", "EVT_OUT_FWD_{pp}"]

  // ── MTE3: store result from buf_out[i%2] to GM ──
  // RAW: wait for Vector to finish writing buf_out[i%2]
  pto.wait_flag["PIPE_V", "PIPE_MTE3", "EVT_OUT_FWD_{pp}"]
  pto.copy_ubuf_to_gm %ub_out[%pp], %gm_out[%i], ...
  // WAR: tell Vector "done reading buf_out[i%2]"
  pto.set_flag["PIPE_MTE3", "PIPE_V", "EVT_OUT_REV_{pp}"]
}

// ═══ Post-loop: drain — match every pre-loop prime with a wait ═══
// Each priming set_flag must be paired. The last loop iteration's
// set_flags are consumed by wait_flags that will never fire inside the
// loop (there is no iteration i+2). Drain them here.
pto.wait_flag["PIPE_V",    "PIPE_MTE2", "EVT_IN_REV_{(N-1)%2}"]  // ◀ DRAIN
pto.wait_flag["PIPE_V",    "PIPE_MTE2", "EVT_IN_REV_{(N-2)%2}"]  // ◀ DRAIN
pto.wait_flag["PIPE_MTE3", "PIPE_V",    "EVT_OUT_REV_{(N-1)%2}"] // ◀ DRAIN
pto.wait_flag["PIPE_MTE3", "PIPE_V",    "EVT_OUT_REV_{(N-2)%2}"] // ◀ DRAIN

What set_flag/wait_flag requires outside the loop: - Before the loop (4 × set_flag): Prime every reverse-dependency event ID — one per buffer per pipe-pair. Without this, the first iteration's wait_flag for reverse deps would deadlock (no signal was ever sent). - After the loop (4 × wait_flag): Drain the matching reverse-dep signals from the last iterations. Every set_flag must be paired with a wait_flag — the last loop iterations produce signals that no subsequent iteration consumes, so they must be drained explicitly.

3b. `get_buf` / `rls_buf` version¶

Same ping/pong double-buffering, but no pre-loop priming or post-loop draining needed. Buffer acquire/release semantics handle everything.

scf.for %i = %c0 to %N step %c1 {
  // %pp = i % 2  (ping/pong selector)

  // ── MTE2: load tile[i] into buf[i%2] ──
  // Acquires buf[i%2] — on first iteration, buffer is free so proceeds immediately.
  // On later iterations, blocks until Vector releases buf[i%2] (WAR: automatic).
  pto.get_buf %bufid_buf[%pp], "PIPE_MTE2"
  pto.copy_gm_to_ubuf %gm_ptr[%i], %ub_buf[%pp], ...
  pto.rls_buf %bufid_buf[%pp], "PIPE_MTE2"

  // ── Vector: compute on buf[i%2] ──
  // Acquires buf[i%2] — blocks until MTE2 releases it (RAW: automatic)
  pto.get_buf %bufid_buf[%pp], "PIPE_V"
  pto.get_buf %bufid_out[%pp], "PIPE_V"
  scf.for %dummy = %c0 to %c1 step %c1 {
    %v   = pto.vlds %ub_buf[%pp][%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
    %mask = pto.pset_b32 "PAT_ALL" : !pto.mask<G>
    %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
    pto.vsts %abs, %ub_out[%pp][%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask<b32>
  } {llvm.loop.aivector_scope}
  // Release buf[i%2] — MTE2 can reuse in iteration i+2 (WAR resolved)
  pto.rls_buf %bufid_buf[%pp], "PIPE_V"
  pto.rls_buf %bufid_out[%pp], "PIPE_V"

  // ── MTE3: store result ──
  // Acquires out[i%2] — blocks until Vector releases it (RAW: automatic)
  pto.get_buf %bufid_out[%pp], "PIPE_MTE3"
  pto.copy_ubuf_to_gm %ub_out[%pp], %gm_out[%i], ...
  pto.rls_buf %bufid_out[%pp], "PIPE_MTE3"
}
// No post-loop drain needed — last rls_buf completes the pipeline.

No priming, no draining, no event IDs. The acquire/release protocol on buffer IDs indexed by i%2 implicitly resolves all cross-pipeline dependencies: - RAW (MTE2→V): Vector's get_buf blocks until MTE2's rls_buf on buf[i%2] - WAR (V→MTE2): MTE2's get_buf in iteration i+2 blocks until Vector's rls_buf in iteration i (same buffer) - First iteration: Buffer is initially free, so get_buf proceeds without blocking — no priming needed

Comparison Summary¶

Aspect	`set_flag` / `wait_flag`	`get_buf` / `rls_buf`
Dependency model	Explicit event signals	Implicit via buffer acquire/release
IDs per pipe-pair	8 = 2 buffers × 2 dirs × 2 (fwd+rev)	1 fwd + 1 rev per buffer (shared global pool)
Total HW IDs	8 per pipe-pair, grows with buffers	32 global across all pipes
Reverse (WAR) deps	Extra `set_flag`/`wait_flag` pair per buffer	Handled automatically
Pre-loop setup	`set_flag` to prime each reverse dep	None
Post-loop teardown	`wait_flag` to drain all primed signals	None
Straight-line code	Simple, clear	Slightly more verbose (bracket each stage)
Ping/pong loops	8 event IDs + 4 prime + 4 drain	Same pattern, no overhead
Best used for	Simple pipelines, fine-grained control	Double/multi-buffering, complex loops

Inter-Core Sync¶

Note:

Inter-core sync is only needed for mixed Cube+Vector tasks where Cube produces data that Vector consumes (or vice versa). Vec-only tasks can ignore this section entirely.

These ops coordinate execution across the Cube block and Vector subblocks within a cluster. Each core cluster consists of 1 Cube block : 2 Vector subblocks, each with its own SU (Sequencer Unit) running independent instruction streams.

Core Cluster (1:2 ratio)
┌─────────────────────────────────────────────┐
│  ┌──────────────┐    ┌──────────────┐       │
│  │  AIC (Cube)  │    │  AIV0 (Vec)  │       │
│  │  ┌────────┐  │    │  ┌────────┐  │       │
│  │  │   SU   │──┼────┼──│   SU   │  │       │
│  │  └────────┘  │    │  └────────┘  │       │
│  │  CUBE pipe   │    │  MTE2/V/MTE3 │       │
│  │  L0C buffer  │    │  UB (256KB)  │       │
│  └──────────────┘    └──────────────┘       │
│                      ┌──────────────┐       │
│                      │  AIV1 (Vec)  │       │
│                      │  ┌────────┐  │       │
│                      │  │   SU   │  │       │
│                      │  └────────┘  │       │
│                      │  MTE2/V/MTE3 │       │
│                      │  UB (256KB)  │       │
│                      └──────────────┘       │
└─────────────────────────────────────────────┘

Platform Comparison¶

Aspect	A2A3 (Ascend 910)	A5 (A5)
Signal op	`set_cross_core` (mode2)	`set_intra_block`
Wait op	`wait_flag_dev`	`wait_intra_core`
Wait behavior	SU-level blocking (entire core stalls)	Per-pipeline (only named pipe stalls)
Semaphore pool	16 IDs per cluster, 4-bit counter	16 IDs, but 32-ID address space (see below)
C→V	Broadcast: one `set` reaches both AIV0+AIV1	1:1: separate `set` per subblock required
V→C	Reduce: Cube waits for both subblocks in one `wait`	1:1: Cube needs separate `wait` per subblock

A2A3: `set_cross_core` / `wait_flag_dev`¶

// mode2 broadcast/reduce semantics for 1:2 cluster
set_cross_core(pipe, semaphore_id);   // pipe: VEC/MTE2/CUBE/FIX
wait_flag_dev(semaphore_id);          // SU-level blocking

C→V Broadcast (one set reaches both):
    AIC ──set_cross_core──┬──> AIV0 sema++
                          └──> AIV1 sema++

V→C Reduce (one wait for both):
    AIV0 ──set_cross_core──┐
                           ├──> AIC wait_flag_dev (blocks until BOTH)
    AIV1 ──set_cross_core──┘

`pto.set_cross_core`¶

syntax: pto.set_cross_core %core_id, %event_id : i64, i64
semantics: Signal event to another core. Uses mode2 for 1:2 cluster on A2A3.

`pto.wait_flag_dev`¶

syntax: pto.wait_flag_dev %core_id, %event_id : i64, i64
semantics: Wait for event from another core. SU-level blocking — entire core stalls.

A5: `set_intra_block` / `wait_intra_core`¶

set_intra_block(trigger_pipe, semaphore_id);
wait_intra_core(wait_pipe, semaphore_id);   // only named pipe stalls

A5 semaphore address space: The hardware has 16 physical semaphore IDs but exposes a 32-ID address space to support 1:1 signaling to each subblock:

ID Range	Target
0–15	AIV0 (subblock 0)
16–31 (+15 offset)	AIV1 (subblock 1)

This means C→V requires separate set_intra_block calls per subblock (no broadcast), and V→C requires separate wait_intra_core calls per subblock (no hardware reduce).

C→V on A5 (1:1, no broadcast — need two sets):
    AIC ──set_intra_block(pipe, sema_id)────> AIV0
    AIC ──set_intra_block(pipe, sema_id+15)──> AIV1

V→C on A5 (1:1, no reduce — need two waits):
    AIV0 ──set_intra_block──> AIC wait_intra_core(pipe, sema_id)
    AIV1 ──set_intra_block──> AIC wait_intra_core(pipe, sema_id+15)  // extra wait

`pto.set_intra_block`¶

syntax: pto.set_intra_block %block_id, %event_id : i64, i64
semantics: Signal event within a block (A5). Specifies trigger pipe. 1:1 per subblock.

`pto.wait_intra_core`¶

syntax: pto.wait_intra_core %block_id, %event_id : i64, i64
semantics: Wait for event within block (A5). Specifies which pipeline should wait — only that pipe stalls, SU and other pipes continue.

Wait Granularity Comparison¶

A2A3 wait_flag_dev (SU-level stall):
    SU ──┬── PIPE_MTE2 ───╳ ALL STALLED
         ├── PIPE_V    ───╳ ALL STALLED
         └── PIPE_MTE3 ───╳ ALL STALLED

A5 wait_intra_core "PIPE_MTE2" (per-pipe stall):
    SU ──┬── PIPE_MTE2 ───╳ STALLED (waiting for Cube)
         ├── PIPE_V    ─── ✓ RUNNING
         └── PIPE_MTE3 ─── ✓ RUNNING

Vector Instruction Set: Pipeline Sync¶

Intra-Core Pipeline Sync¶

pto.set_flag¶

pto.wait_flag¶

pto.pipe_barrier¶

pto.get_buf¶

pto.rls_buf¶

pto.mem_bar¶

Intra-Core Sync Patterns & Examples¶

Example 1: set_flag / wait_flag (Explicit Events)¶

Example 2: get_buf / rls_buf (Resource-Based)¶