Vector Instruction Set: Pipeline Sync

The pto.v* synchronization instruction sets inside PTO ISA are defined below. The operation forms describe the vector-pipe contract and the current A5-oriented target-profile details that backends must preserve when lowering legal PTO programs.

Category: Synchronization primitives for coordinating pipeline execution Pipelines: MTE2 (GM→UB), PIPE_V (Vector), MTE3 (UB→GM)

The PTO ISA vector instructions model operates on the A5's Decoupled Access-Execute architecture. The MTE and Vector pipelines run asynchronously, requiring explicit synchronization to prevent data hazards.


Intra-Core Pipeline Sync

These ops coordinate data flow between pipelines within a single vector core.

pto.set_flag

  • syntax: pto.set_flag["SRC_PIPE", "DST_PIPE", "EVENT_ID"]
  • semantics: Signal event from source pipe to destination pipe.
set_flag(src_pipe, dst_pipe, event_id);

Example: After MTE2 completes GM→UB transfer, signal Vector pipe:

pto.set_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]


pto.wait_flag

  • syntax: pto.wait_flag["SRC_PIPE", "DST_PIPE", "EVENT_ID"]
  • semantics: Block destination pipe until source pipe signals event.
wait_flag(src_pipe, dst_pipe, event_id);

Example: Vector pipe waits for MTE2 data to arrive:

pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]


pto.pipe_barrier

  • syntax: pto.pipe_barrier "PIPE_*"
  • semantics: Drain all pending ops in the specified pipe. All previously issued operations on that pipe complete before any subsequent operation begins.
pipe_barrier(pipe);

Pipe identifiers: PIPE_MTE2, PIPE_V, PIPE_MTE3

Example: Two back-to-back copy_ubuf_to_gm calls writing to the same GM address. Without a barrier, MTE3 may reorder them and the final GM value is non-deterministic:

// Both stores target the same GM address — order matters!
pto.copy_ubuf_to_gm %ub_partial_0, %gm_result, ...
// Without pipe_barrier, MTE3 could execute the second copy before the first
// completes, producing a non-deterministic result at %gm_result.
pto.pipe_barrier "PIPE_MTE3"
// After barrier: first copy is guaranteed complete. Second copy overwrites deterministically.
pto.copy_ubuf_to_gm %ub_partial_1, %gm_result, ...

pto.get_buf

  • syntax: pto.get_buf "PIPE_*", %buf_id, %mode : i64, i64
  • semantics: Acquire buffer slot for inter-pipeline double-buffering coordination.
get_buf(pipe, buf_id, mode);

pto.rls_buf

  • syntax: pto.rls_buf "PIPE_*", %buf_id, %mode : i64, i64
  • semantics: Release buffer slot to allow other pipeline to proceed.
rls_buf(pipe, buf_id, mode);

pto.mem_bar

  • syntax: pto.mem_bar "BARRIER_TYPE"
  • semantics: Intra-vector-pipe memory fence within __VEC_SCOPE__. Required when UB addresses alias between vector load/store operations.
mem_bar(barrier_type);

Barrier types:

Type Semantics
VV_ALL All prior vector instructions complete before subsequent
VST_VLD All prior vector stores visible before subsequent loads
VLD_VST All prior vector loads complete before subsequent stores

Example: Ensure stores are visible before loads to same UB region:

pto.vsts %v0, %ub[%c0] : !pto.vreg<64xf32>, !pto.ptr<f32, ub>
pto.mem_bar "VST_VLD"
%v1 = pto.vlds %ub[%c0] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>


Intra-Core Sync Patterns & Examples

Example 1: set_flag / wait_flag (Explicit Events)

Each cross-pipeline data dependency requires an explicit signal/wait pair. The programmer must manually insert set_flag after the producer and wait_flag before the consumer.

// ─── Stage 1: MTE2 loads data from GM into UB ───
pto.copy_gm_to_ubuf %gm_ptr, %ub_ptr, ...

// MTE2 signals: "UB data is ready for Vector pipe"
pto.set_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]

// ─── Stage 2: Vector pipe consumes UB data ───
// Vector waits until MTE2's signal arrives
pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]

scf.for %dummy = %c0 to %c1 step %c1 {
  %v   = pto.vlds %ub_ptr[%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
  %mask = pto.pset_b32 "PAT_ALL" : !pto.mask<G>
  %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
  pto.vsts %abs, %ub_out[%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask<b32>
} {llvm.loop.aivector_scope}

// Vector signals: "UB output is ready for MTE3"
pto.set_flag["PIPE_V", "PIPE_MTE3", "EVENT_ID0"]

// ─── Stage 3: MTE3 stores result from UB back to GM ───
// MTE3 waits until Vector's signal arrives
pto.wait_flag["PIPE_V", "PIPE_MTE3", "EVENT_ID0"]

pto.copy_ubuf_to_gm %ub_out, %gm_out, ...

Key property: Every cross-pipeline edge is an explicit (set_flag, wait_flag) pair. Simple for straight-line code, but gets verbose in loops (see Example 3).


Example 2: get_buf / rls_buf (Resource-Based)

Instead of naming events, each pipeline declares when it acquires (get_buf) and releases (rls_buf) a shared UB buffer. Cross-pipeline RAW/WAR dependencies are resolved implicitly by program order — if MTE2 releases buf_A and Vector later acquires buf_A, the hardware ensures the acquire cannot proceed until the release completes.

// ─── Stage 1: MTE2 loads data into UB ───
// MTE2 acquires ub_ptr — blocks if Vector hasn't released it from a prior iteration
pto.get_buf "PIPE_MTE2", %bufid_ub_ptr, %mode : i64, i64
pto.copy_gm_to_ubuf %gm_ptr, %ub_ptr, ...
// MTE2 done writing ub_ptr — release it so Vector can consume
pto.rls_buf "PIPE_MTE2", %bufid_ub_ptr, %mode : i64, i64

// ─── Stage 2: Vector computation ───
// Vector acquires ub_ptr (input) — blocks until MTE2 releases it (RAW: MTE2 write → V read)
pto.get_buf "PIPE_V", %bufid_ub_ptr, %mode : i64, i64
// Vector acquires ub_out (output) — blocks until MTE3 releases it from a prior iteration (WAR: MTE3 read → V write)
pto.get_buf "PIPE_V", %bufid_ub_out, %mode : i64, i64

scf.for %dummy = %c0 to %c1 step %c1 {
  %mask = pto.pset_b32 "PAT_ALL" : !pto.mask<G>
  %v   = pto.vlds %ub_ptr[%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
  %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
  pto.vsts %abs, %ub_out[%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask<b32>
} {llvm.loop.aivector_scope}

// Vector done reading ub_ptr — release so MTE2 can reuse it in next iteration
pto.rls_buf "PIPE_V", %bufid_ub_ptr, %mode : i64, i64
// Vector done writing ub_out — release so MTE3 can consume
pto.rls_buf "PIPE_V", %bufid_ub_out, %mode : i64, i64

// ─── Stage 3: MTE3 stores result to GM ───
// MTE3 acquires ub_out — blocks until Vector releases it (RAW: V write → MTE3 read)
pto.get_buf "PIPE_MTE3", %bufid_ub_out, %mode : i64, i64
pto.copy_ubuf_to_gm %ub_out, %gm_out, ...
// MTE3 done reading ub_out — release so Vector can reuse it in next iteration
pto.rls_buf "PIPE_MTE3", %bufid_ub_out, %mode : i64, i64

Key property: No event IDs needed. Dependencies are implicit from program order of get_buf/rls_buf on the same buffer ID. This becomes much more convenient in multi-iteration loops (see Example 3).


Example 3: Ping/Pong Double-Buffering Loop

Double-buffering overlaps DMA and compute by using two UB buffers alternately. All three stages (MTE2, Vector, MTE3) appear in the same iteration — the hardware pipelines them across iterations because different iterations operate on different buffers (buf[i%2]).

Event ID scheme (set_flag / wait_flag)

With 2 ping/pong buffers and 2 pipeline pairs (MTE2↔V, V↔MTE3), set_flag/wait_flag needs 8 event IDs = 2 pipe-pairs × 2 buffers × (forward + reverse):

MTE2 ↔ Vector (input buffers):

Event ID Direction Purpose
EVT_IN_FWD_0 MTE2 → V RAW: buf_in[0] data ready
EVT_IN_FWD_1 MTE2 → V RAW: buf_in[1] data ready
EVT_IN_REV_0 V → MTE2 WAR: Vector done reading buf_in[0]
EVT_IN_REV_1 V → MTE2 WAR: Vector done reading buf_in[1]

Vector ↔ MTE3 (output buffers):

Event ID Direction Purpose
EVT_OUT_FWD_0 V → MTE3 RAW: buf_out[0] result ready
EVT_OUT_FWD_1 V → MTE3 RAW: buf_out[1] result ready
EVT_OUT_REV_0 MTE3 → V WAR: MTE3 done reading buf_out[0]
EVT_OUT_REV_1 MTE3 → V WAR: MTE3 done reading buf_out[1]

3a. set_flag / wait_flag version

// ═══ Pre-loop: prime ALL reverse-dependency signals ═══
// Both input and output buffers start unused. We must pre-send
// reverse-dep signals so the first iteration's wait_flags don't deadlock.
pto.set_flag["PIPE_V",    "PIPE_MTE2", "EVT_IN_REV_0"]   // ◀ PRIME: buf_in[0] "free"
pto.set_flag["PIPE_V",    "PIPE_MTE2", "EVT_IN_REV_1"]   // ◀ PRIME: buf_in[1] "free"
pto.set_flag["PIPE_MTE3", "PIPE_V",    "EVT_OUT_REV_0"]  // ◀ PRIME: buf_out[0] "free"
pto.set_flag["PIPE_MTE3", "PIPE_V",    "EVT_OUT_REV_1"]  // ◀ PRIME: buf_out[1] "free"

scf.for %i = %c0 to %N step %c1 {
  // ── All 3 stages in same iteration, indexed by i%2 ──
  // %pp = i % 2  (ping/pong selector for buffer & event IDs)

  // ── MTE2: load tile[i] into buf_in[i%2] ──
  // WAR: wait until Vector has released buf_in[i%2] from iteration i-2
  pto.wait_flag["PIPE_V", "PIPE_MTE2", "EVT_IN_REV_{pp}"]
  pto.copy_gm_to_ubuf %gm_ptr[%i], %ub_in[%pp], ...
  // RAW: signal Vector that buf_in[i%2] data is ready
  pto.set_flag["PIPE_MTE2", "PIPE_V", "EVT_IN_FWD_{pp}"]

  // ── Vector: compute buf_in[i%2] → buf_out[i%2] ──
  // RAW: wait for MTE2 to finish loading buf_in[i%2]
  pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVT_IN_FWD_{pp}"]
  // WAR: wait for MTE3 to finish reading buf_out[i%2] from iteration i-2
  pto.wait_flag["PIPE_MTE3", "PIPE_V", "EVT_OUT_REV_{pp}"]
  scf.for %dummy = %c0 to %c1 step %c1 {
    %v   = pto.vlds %ub_in[%pp][%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
    %mask = pto.pset_b32 "PAT_ALL" : !pto.mask<G>
    %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
    pto.vsts %abs, %ub_out[%pp][%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask<b32>
  } {llvm.loop.aivector_scope}
  // WAR: tell MTE2 "done reading buf_in[i%2]"
  pto.set_flag["PIPE_V", "PIPE_MTE2", "EVT_IN_REV_{pp}"]
  // RAW: tell MTE3 "buf_out[i%2] result ready"
  pto.set_flag["PIPE_V", "PIPE_MTE3", "EVT_OUT_FWD_{pp}"]

  // ── MTE3: store result from buf_out[i%2] to GM ──
  // RAW: wait for Vector to finish writing buf_out[i%2]
  pto.wait_flag["PIPE_V", "PIPE_MTE3", "EVT_OUT_FWD_{pp}"]
  pto.copy_ubuf_to_gm %ub_out[%pp], %gm_out[%i], ...
  // WAR: tell Vector "done reading buf_out[i%2]"
  pto.set_flag["PIPE_MTE3", "PIPE_V", "EVT_OUT_REV_{pp}"]
}

// ═══ Post-loop: drain — match every pre-loop prime with a wait ═══
// Each priming set_flag must be paired. The last loop iteration's
// set_flags are consumed by wait_flags that will never fire inside the
// loop (there is no iteration i+2). Drain them here.
pto.wait_flag["PIPE_V",    "PIPE_MTE2", "EVT_IN_REV_{(N-1)%2}"]  // ◀ DRAIN
pto.wait_flag["PIPE_V",    "PIPE_MTE2", "EVT_IN_REV_{(N-2)%2}"]  // ◀ DRAIN
pto.wait_flag["PIPE_MTE3", "PIPE_V",    "EVT_OUT_REV_{(N-1)%2}"] // ◀ DRAIN
pto.wait_flag["PIPE_MTE3", "PIPE_V",    "EVT_OUT_REV_{(N-2)%2}"] // ◀ DRAIN

What set_flag/wait_flag requires outside the loop: - Before the loop (4 × set_flag): Prime every reverse-dependency event ID — one per buffer per pipe-pair. Without this, the first iteration's wait_flag for reverse deps would deadlock (no signal was ever sent). - After the loop (4 × wait_flag): Drain the matching reverse-dep signals from the last iterations. Every set_flag must be paired with a wait_flag — the last loop iterations produce signals that no subsequent iteration consumes, so they must be drained explicitly.

3b. get_buf / rls_buf version

Same ping/pong double-buffering, but no pre-loop priming or post-loop draining needed. Buffer acquire/release semantics handle everything.

scf.for %i = %c0 to %N step %c1 {
  // %pp = i % 2  (ping/pong selector)

  // ── MTE2: load tile[i] into buf[i%2] ──
  // Acquires buf[i%2] — on first iteration, buffer is free so proceeds immediately.
  // On later iterations, blocks until Vector releases buf[i%2] (WAR: automatic).
  pto.get_buf %bufid_buf[%pp], "PIPE_MTE2"
  pto.copy_gm_to_ubuf %gm_ptr[%i], %ub_buf[%pp], ...
  pto.rls_buf %bufid_buf[%pp], "PIPE_MTE2"

  // ── Vector: compute on buf[i%2] ──
  // Acquires buf[i%2] — blocks until MTE2 releases it (RAW: automatic)
  pto.get_buf %bufid_buf[%pp], "PIPE_V"
  pto.get_buf %bufid_out[%pp], "PIPE_V"
  scf.for %dummy = %c0 to %c1 step %c1 {
    %v   = pto.vlds %ub_buf[%pp][%lane] : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
    %mask = pto.pset_b32 "PAT_ALL" : !pto.mask<G>
    %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
    pto.vsts %abs, %ub_out[%pp][%lane], %mask : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask<b32>
  } {llvm.loop.aivector_scope}
  // Release buf[i%2] — MTE2 can reuse in iteration i+2 (WAR resolved)
  pto.rls_buf %bufid_buf[%pp], "PIPE_V"
  pto.rls_buf %bufid_out[%pp], "PIPE_V"

  // ── MTE3: store result ──
  // Acquires out[i%2] — blocks until Vector releases it (RAW: automatic)
  pto.get_buf %bufid_out[%pp], "PIPE_MTE3"
  pto.copy_ubuf_to_gm %ub_out[%pp], %gm_out[%i], ...
  pto.rls_buf %bufid_out[%pp], "PIPE_MTE3"
}
// No post-loop drain needed — last rls_buf completes the pipeline.

No priming, no draining, no event IDs. The acquire/release protocol on buffer IDs indexed by i%2 implicitly resolves all cross-pipeline dependencies: - RAW (MTE2→V): Vector's get_buf blocks until MTE2's rls_buf on buf[i%2] - WAR (V→MTE2): MTE2's get_buf in iteration i+2 blocks until Vector's rls_buf in iteration i (same buffer) - First iteration: Buffer is initially free, so get_buf proceeds without blocking — no priming needed


Comparison Summary

Aspect set_flag / wait_flag get_buf / rls_buf
Dependency model Explicit event signals Implicit via buffer acquire/release
IDs per pipe-pair 8 = 2 buffers × 2 dirs × 2 (fwd+rev) 1 fwd + 1 rev per buffer (shared global pool)
Total HW IDs 8 per pipe-pair, grows with buffers 32 global across all pipes
Reverse (WAR) deps Extra set_flag/wait_flag pair per buffer Handled automatically
Pre-loop setup set_flag to prime each reverse dep None
Post-loop teardown wait_flag to drain all primed signals None
Straight-line code Simple, clear Slightly more verbose (bracket each stage)
Ping/pong loops 8 event IDs + 4 prime + 4 drain Same pattern, no overhead
Best used for Simple pipelines, fine-grained control Double/multi-buffering, complex loops

Inter-Core Sync

Note:

Inter-core sync is only needed for mixed Cube+Vector tasks where Cube produces data that Vector consumes (or vice versa). Vec-only tasks can ignore this section entirely.

These ops coordinate execution across the Cube block and Vector subblocks within a cluster. Each core cluster consists of 1 Cube block : 2 Vector subblocks, each with its own SU (Sequencer Unit) running independent instruction streams.

Core Cluster (1:2 ratio)
┌─────────────────────────────────────────────┐
│  ┌──────────────┐    ┌──────────────┐       │
│  │  AIC (Cube)  │    │  AIV0 (Vec)  │       │
│  │  ┌────────┐  │    │  ┌────────┐  │       │
│  │  │   SU   │──┼────┼──│   SU   │  │       │
│  │  └────────┘  │    │  └────────┘  │       │
│  │  CUBE pipe   │    │  MTE2/V/MTE3 │       │
│  │  L0C buffer  │    │  UB (256KB)  │       │
│  └──────────────┘    └──────────────┘       │
│                      ┌──────────────┐       │
│                      │  AIV1 (Vec)  │       │
│                      │  ┌────────┐  │       │
│                      │  │   SU   │  │       │
│                      │  └────────┘  │       │
│                      │  MTE2/V/MTE3 │       │
│                      │  UB (256KB)  │       │
│                      └──────────────┘       │
└─────────────────────────────────────────────┘

Platform Comparison

Aspect A2A3 (Ascend 910) A5 (A5)
Signal op set_cross_core (mode2) set_intra_block
Wait op wait_flag_dev wait_intra_core
Wait behavior SU-level blocking (entire core stalls) Per-pipeline (only named pipe stalls)
Semaphore pool 16 IDs per cluster, 4-bit counter 16 IDs, but 32-ID address space (see below)
C→V Broadcast: one set reaches both AIV0+AIV1 1:1: separate set per subblock required
V→C Reduce: Cube waits for both subblocks in one wait 1:1: Cube needs separate wait per subblock

A2A3: set_cross_core / wait_flag_dev

// mode2 broadcast/reduce semantics for 1:2 cluster
set_cross_core(pipe, semaphore_id);   // pipe: VEC/MTE2/CUBE/FIX
wait_flag_dev(semaphore_id);          // SU-level blocking
C→V Broadcast (one set reaches both):
    AIC ──set_cross_core──┬──> AIV0 sema++
                          └──> AIV1 sema++

V→C Reduce (one wait for both):
    AIV0 ──set_cross_core──┐
                           ├──> AIC wait_flag_dev (blocks until BOTH)
    AIV1 ──set_cross_core──┘

pto.set_cross_core

  • syntax: pto.set_cross_core %core_id, %event_id : i64, i64
  • semantics: Signal event to another core. Uses mode2 for 1:2 cluster on A2A3.

pto.wait_flag_dev

  • syntax: pto.wait_flag_dev %core_id, %event_id : i64, i64
  • semantics: Wait for event from another core. SU-level blocking — entire core stalls.

A5: set_intra_block / wait_intra_core

set_intra_block(trigger_pipe, semaphore_id);
wait_intra_core(wait_pipe, semaphore_id);   // only named pipe stalls

A5 semaphore address space: The hardware has 16 physical semaphore IDs but exposes a 32-ID address space to support 1:1 signaling to each subblock:

ID Range Target
0–15 AIV0 (subblock 0)
16–31 (+15 offset) AIV1 (subblock 1)

This means C→V requires separate set_intra_block calls per subblock (no broadcast), and V→C requires separate wait_intra_core calls per subblock (no hardware reduce).

C→V on A5 (1:1, no broadcast — need two sets):
    AIC ──set_intra_block(pipe, sema_id)────> AIV0
    AIC ──set_intra_block(pipe, sema_id+15)──> AIV1

V→C on A5 (1:1, no reduce — need two waits):
    AIV0 ──set_intra_block──> AIC wait_intra_core(pipe, sema_id)
    AIV1 ──set_intra_block──> AIC wait_intra_core(pipe, sema_id+15)  // extra wait

pto.set_intra_block

  • syntax: pto.set_intra_block %block_id, %event_id : i64, i64
  • semantics: Signal event within a block (A5). Specifies trigger pipe. 1:1 per subblock.

pto.wait_intra_core

  • syntax: pto.wait_intra_core %block_id, %event_id : i64, i64
  • semantics: Wait for event within block (A5). Specifies which pipeline should wait — only that pipe stalls, SU and other pipes continue.

Wait Granularity Comparison

A2A3 wait_flag_dev (SU-level stall):
    SU ──┬── PIPE_MTE2 ───╳ ALL STALLED
         ├── PIPE_V    ───╳ ALL STALLED
         └── PIPE_MTE3 ───╳ ALL STALLED

A5 wait_intra_core "PIPE_MTE2" (per-pipe stall):
    SU ──┬── PIPE_MTE2 ───╳ STALLED (waiting for Cube)
         ├── PIPE_V    ─── ✓ RUNNING
         └── PIPE_MTE3 ─── ✓ RUNNING