Vector Instruction Set: Vector Load/Store¶
UB-to-vector-register data movement inside PTO ISA is defined here. The detailed forms below describe how pto.v* kernels move payloads between vector-visible UB storage and vector registers without crossing back into the tile instructions.
Category: UB ↔ Vector Register data movement Pipeline: PIPE_V (Vector Core)
Vector loads move data from the vector tile buffer (implemented on current hardware by the Unified Buffer / UB) to vector registers (vreg). Vector stores move data from vreg back to that same vector tile buffer. All vector compute operates only on vreg.
Execution Model: DMA–Vector Pipeline¶
Vector load/store operations bridge the DMA (PIPE_MTE2/PIPE_MTE3) and Vector (PIPE_V) pipelines. The complete kernel structure is:
GM → MTE2 → vector tile buffer (hardware UB) → VLDS → vreg → PTO.v* compute → VSTS → vector tile buffer → MTE3 → GM
Typical kernel structure:
module attributes {pto.target_arch = "a5"} {
func.func @kernel_2d(%arg0: !pto.ptr, %arg1: !pto.ptr) {
%false = arith.constant false
// Phase 1: MTE2 DMA (GM → vector tile buffer / hardware UB)
pto.get_buf "PIPE_MTE2", 0, 0
pto.set_loop_size_outtoub %c1_i64, %c1_i64 : i64, i64
pto.copy_gm_to_ubuf %arg0, %ub_in, %c0_i64, %c32_i64, %c128_i64,
%c0_i64, %c0_i64, %false, %c0_i64, %c128_i64, %c128_i64
: !pto.ptr, !pto.ptr, i64, i64, i64, i64, i64, i1, i64, i64, i64
pto.rls_buf "PIPE_MTE2", 0, 0
// Phase 2: Producer–Consumer synchronization (MTE2 → V)
pto.set_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]
pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"]
// Phase 3: Vector compute (V pipe)
pto.get_buf "PIPE_V", 0, 0
pto.vecscope {
%_:1 = scf.for %offset = %c0 to %c1024 step %c64
iter_args(%remaining = %c1024_i32) -> (i32) {
%mask, %next = pto.plt_b32 %remaining : i32 -> !pto.mask<G>, i32
%vec = pto.vlds %ub_in[%offset] : !pto.ptr -> !pto.vreg<64xf32>
%out = pto.vabs %vec, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
pto.vsts %out, %ub_out[%offset], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask<b32>
scf.yield %next : i32
}
}
pto.rls_buf "PIPE_V", 0, 0
// Phase 4: Consumer synchronization (V → MTE3)
pto.set_flag["PIPE_V", "PIPE_MTE3", "EVENT_ID0"]
pto.wait_flag["PIPE_V", "PIPE_MTE3", "EVENT_ID0"]
// Phase 5: MTE3 DMA (vector tile buffer / hardware UB → GM)
pto.get_buf "PIPE_MTE3", 0, 0
pto.set_loop_size_ubtoout %c1_i64, %c1_i64 : i64, i64
pto.copy_ubuf_to_gm %ub_out, %arg1, %c0_i64, %c32_i64, %c128_i64,
%c0_i64, %c128_i64, %c128_i64
: !pto.ptr, !pto.ptr, i64, i64, i64, i64, i64, i64
pto.rls_buf "PIPE_MTE3", 0, 0
pto.barrier #pto.pipe
return
}
}
Pipeline Synchronization¶
| Operation | Semantics |
|---|---|
pto.set_flag["PIPE_MTE2", "PIPE_V", id] |
Signal: MTE2 done → V can read UB |
pto.wait_flag["PIPE_MTE2", "PIPE_V", id] |
Wait: V must wait for MTE2 signal |
pto.set_flag["PIPE_V", "PIPE_MTE3", id] |
Signal: V done → MTE3 can write UB |
pto.wait_flag["PIPE_V", "PIPE_MTE3", id] |
Wait: MTE3 must wait for V signal |
pto.get_buf "PIPE_*", slot, 0 |
Acquire: reserve buffer slot for pipeline |
pto.rls_buf "PIPE_*", slot, 0 |
Release: free buffer slot |
Buffer slot model: get_buf / rls_buf resolve RAW/WAR dependencies automatically — no explicit event IDs or loop peeling required.
A2/A3 Bandwidth Model¶
| Transfer Path | Bandwidth | Constant | Formula |
|---|---|---|---|
| GM → UB (Vec tile) | 128 B/cycle | A2A3_BW_GM_VEC |
ceil(Nbytes / 128) |
| GM → UB (Mat tile) | 256 B/cycle | A2A3_BW_GM_MAT |
ceil(Nbytes / 256) |
| UB → UB (Vec tile) | 128 B/cycle | A2A3_BW_VEC_VEC |
ceil(Nbytes / 128) |
Example: Loading a 4096-byte tile via MTE2:
cycles = ceil(4096 / 128) = 32 cycles
Common Operand Model¶
%source/%destis the base address operand in SSA form. The base pointer MUST address the vector tile buffer (implemented on current hardware by the Unified Buffer / UB).%offsetis the displacement operand in SSA form. The exact encoding is instruction-specific, but the effective address and any post-update behavior MUST match the selected instruction form.%maskis the predicate operand for predicated memory instruction sets. For memory instruction sets, inactive lanes or inactive blocks MUST NOT issue memory requests unless the instruction explicitly documents a different behavior.%resultis the destination vector register value in SSA form.!pto.alignis the SSA carrier for alignment-buffer state used by unaligned load/store instruction sets. The PTO ISA vector instructions representation makes that state explicit rather than implicit.
Contiguous Loads¶
pto.vlds¶
- syntax:
%result = pto.vlds %source[%offset] {dist = "DIST"} : !pto.ptr<T, ub> -> !pto.vreg<NxT> - semantics: Vector load with distribution mode.
- inputs:
%sourceis the UB base address,%offsetis the load displacement, andDISTselects the distribution mode. - outputs:
%resultis the loaded vector register value. - constraints and limitations:
The effective address MUST satisfy the alignment rule of the selected
distribution mode.
NORMreads one full vector footprint. Broadcast, upsample, downsample, unpack, split-channel, and deinterleave modes change how memory bytes are mapped into destination lanes, but they do not change the fact that the source is UB memory.
Distribution modes:
| Mode | Description | C Semantics |
|---|---|---|
NORM |
Contiguous 256B load | dst[i] = UB[base + i * sizeof(T)] |
BRC_B8/B16/B32 |
Broadcast single element | dst[i] = UB[base] for all i |
US_B8/B16 |
Upsample (duplicate each element) | dst[2*i] = dst[2*i+1] = UB[base + i] |
DS_B8/B16 |
Downsample (every 2nd element) | dst[i] = UB[base + 2*i] |
UNPK_B8/B16/B32 |
Unpack (zero-extend to wider type) | dst_i32[i] = (uint32_t)UB_i16[base + 2*i] |
SPLT4CHN_B8 |
Split 4-channel (RGBA → R plane) | Extract every 4th byte |
SPLT2CHN_B8/B16 |
Split 2-channel | Extract every 2nd element |
DINTLV_B32 |
Deinterleave 32-bit | Even elements only |
BLK |
Block load | Blocked access pattern |
Example — Contiguous load:
%v = pto.vlds %ub[%offset] {dist = "NORM"} : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
Example — Broadcast scalar to all lanes:
%v = pto.vlds %ub[%c0] {dist = "BRC_B32"} : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
pto.vldas¶
- syntax:
%result = pto.vldas %source : !pto.ptr<T, ub> -> !pto.align - semantics: Prime alignment buffer for subsequent unaligned load.
- inputs:
%sourceis the UB address whose surrounding aligned block seeds the load alignment state. - outputs:
%resultis the initialized load-alignment state. - constraints and limitations:
This op is the required leading operation for a
pto.vldusstream using the same alignment state. The source address itself need not be 32-byte aligned; hardware truncates it to the aligned block boundary for the priming load.
pto.vldus¶
- syntax:
%result, %align_out, %base_out = pto.vldus %source, %align : !pto.ptr<T, ub>, !pto.align -> !pto.vreg<NxT>, !pto.align, !pto.ptr<T, ub> - semantics: Unaligned load using primed align state.
- inputs:
%sourceis the current UB address and%alignis the incoming load alignment state primed bypto.vldasor a priorpto.vldus. - outputs:
%resultis the assembled vector value,%align_outis the updated alignment state, and%base_outis the post-update base pointer state exposed in SSA form. - constraints and limitations:
A matching
pto.vldasMUST appear before the first dependentpto.vldusstream in the same vector loop. Both the alignment state and the base address advance across the stream, and the PTO ISA vector instructions representation exposes those updates as SSA results.
Unaligned load pattern:
%align = pto.vldas %ub : !pto.ptr<f32, ub> -> !pto.align
%vec, %align2, %ub2 = pto.vldus %ub, %align : !pto.ptr<f32, ub>, !pto.align -> !pto.vreg<64xf32>, !pto.align, !pto.ptr<f32, ub>
Dual Loads (Deinterleave)¶
pto.vldsx2¶
- syntax:
%low, %high = pto.vldsx2 %source[%offset], "DIST" : !pto.ptr<T, ub>, index -> !pto.vreg<NxT>, !pto.vreg<NxT> - semantics: Dual load with deinterleave (AoS → SoA conversion).
- inputs:
%sourceis the UB base pointer,%offsetis the displacement, andDISTselects a dual-load/deinterleave layout. - outputs:
%lowand%highare the two destination vectors. - constraints and limitations: This instruction set is only legal for interleave/deinterleave style distributions. The two outputs form an ordered pair, and that pairing MUST be preserved.
Distribution modes: DINTLV_B8, DINTLV_B16, DINTLV_B32, BDINTLV
// DINTLV_B32: deinterleave 32-bit elements
for (int i = 0; i < 64; i++) {
low[i] = UB[base + 8*i]; // even elements
high[i] = UB[base + 8*i + 4]; // odd elements
}
Example — Load interleaved XY pairs into separate X/Y vectors:
%x, %y = pto.vldsx2 %ub[%offset], "DINTLV_B32" : !pto.ptr<f32, ub>, index -> !pto.vreg<64xf32>, !pto.vreg<64xf32>
Strided Loads¶
pto.vsld¶
- syntax:
%result = pto.vsld %source[%offset], "STRIDE" : !pto.ptr<T, ub> -> !pto.vreg<NxT> - semantics: Strided load with fixed stride pattern.
- inputs:
%sourceis the UB base pointer and%offsetis the displacement encoded with the selected fixed stride mode. - outputs:
%resultis the loaded vector. - constraints and limitations: The selected stride token determines which sub-elements are read from each source block.
Stride modes: STRIDE_S3_B16, STRIDE_S4_B64, STRIDE_S8_B32, STRIDE_S2_B64
pto.vsldb¶
- syntax:
%result = pto.vsldb %source, %offset, %mask : !pto.ptr<T, ub>, i32, !pto.mask<G> -> !pto.vreg<NxT> - semantics: Block-strided load for 2D tile access.
- inputs:
%sourceis the UB base pointer,%offsetis the packed stride/control word, and%maskcontrols which blocks participate. - outputs:
%resultis the loaded vector. - constraints and limitations:
%offsetis not a plain byte displacement; it encodes the block stride and repeat pattern. If a block is masked off, the corresponding destination block is zeroed and MUST NOT raise an address overflow exception for that block.
Gather (Indexed) Loads¶
pto.vgather2¶
- syntax:
%result = pto.vgather2 %source, %offsets, %active_lanes : !pto.ptr<T, ub>, !pto.vreg<NxI>, index -> !pto.vreg<NxT> - semantics: Indexed gather from UB.
- inputs:
%sourceis the UB base pointer,%offsetsprovides per-lane element offsets, and%active_lanesbounds how many lanes participate. - outputs:
%resultis the gathered vector. - constraints and limitations:
Only the first
%active_lanesindices participate. The index element width and interpretation MUST match the selected gather form, and each effective address must satisfy that form's alignment rules.
for (int i = 0; i < active_lanes; i++)
dst[i] = UB[base + offsets[i] * sizeof(T)];
pto.vgatherb¶
- syntax:
%result = pto.vgatherb %source, %offsets, %active_lanes : !pto.ptr<T, ub>, !pto.vreg<NxI>, index -> !pto.vreg<NxT> - semantics: Byte-granularity indexed gather from UB.
- inputs:
%sourceis the UB base pointer,%offsetscontains per-block byte offsets, and%active_lanesbounds the number of active gathered blocks. - outputs:
%resultis the gathered vector. - constraints and limitations:
This is a block gather, not a byte-per-lane gather.
%sourceMUST be 32-byte aligned, each participating offset MUST describe a 32-byte-aligned block, and inactive blocks are zero-filled.
for (int i = 0; i < active_lanes; i++)
dst[i] = UB[base + offsets[i]]; // byte-addressed
pto.vgather2_bc¶
- syntax:
%result = pto.vgather2_bc %source, %offsets, %mask : !pto.ptr<T, ub>, !pto.vreg<NxI>, !pto.mask<G> -> !pto.vreg<NxT> - semantics: Gather with broadcast, conditioned by mask.
- inputs:
%sourceis the UB base pointer,%offsetscontains gather indices, and%maskgates which lanes participate. - outputs:
%resultis the gathered vector. - constraints and limitations: This is a backward-compatible instruction set. Masked-off lanes do not participate in address coalescing and do not trigger address overflow exceptions; their destination lanes are zero-filled.
Contiguous Stores¶
pto.vsts¶
- syntax:
pto.vsts %value, %dest[%offset], %mask {dist = "DIST"} : !pto.vreg<NxT>, !pto.ptr<T, ub>, !pto.mask<G> - semantics: Vector store with distribution mode.
- inputs:
%valueis the source vector,%destis the UB base pointer,%offsetis the displacement,%maskselects the active lanes or sub-elements, andDISTselects the store distribution. - outputs: This op has no SSA result; it writes to UB memory.
- constraints and limitations: The effective destination address MUST satisfy the alignment rule of the selected store mode. Narrowing/packing modes may only preserve a subset of the source bits. Merge-channel modes reinterpret the source vector as channel planes and interleave them on store.
Distribution modes:
| Mode | Description | C Semantics |
|---|---|---|
NORM_B8/B16/B32 |
Contiguous store | UB[base + i] = src[i] |
PK_B16/B32 |
Pack/narrowing store | UB_i16[base + 2*i] = truncate_16(src_i32[i]) |
MRG4CHN_B8 |
Merge 4 channels (R,G,B,A → RGBA) | Interleave 4 planes |
MRG2CHN_B8/B16 |
Merge 2 channels | Interleave 2 planes |
Example — Contiguous store:
pto.vsts %v, %ub[%offset], %mask {dist = "NORM_B32"} : !pto.vreg<64xf32>, !pto.ptr<f32, ub>, !pto.mask<b32>
Dual Stores (Interleave)¶
pto.vstsx2¶
- syntax:
pto.vstsx2 %low, %high, %dest[%offset], "DIST", %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.ptr<T, ub>, index, !pto.mask<G> - semantics: Dual interleaved store (SoA → AoS conversion).
- inputs:
%lowand%highare the two source vectors,%destis the UB base pointer,%offsetis the displacement,DISTselects the interleave layout, and%maskgates the participating elements. - outputs: This op has no SSA result; it writes an interleaved stream to UB.
- constraints and limitations: This instruction set is only legal for interleave distributions. The two source vectors form an ordered pair, and the interleave semantics of that pair MUST be preserved.
Distribution modes: INTLV_B8, INTLV_B16, INTLV_B32
// INTLV_B32:
for (int i = 0; i < 64; i++) {
UB[base + 8*i] = low[i];
UB[base + 8*i + 4] = high[i];
}
Strided Stores¶
pto.vsst¶
- syntax:
pto.vsst %value, %dest[%offset], "STRIDE" : !pto.vreg<NxT>, !pto.ptr<T, ub> - semantics: Strided store with fixed stride pattern.
- inputs:
%valueis the source vector,%destis the UB base pointer, and%offset/STRIDEselect the fixed strided layout. - outputs: This op writes UB memory and returns no SSA value.
- constraints and limitations: The stride token, not the vector lane number alone, determines which destination elements are written.
pto.vsstb¶
- syntax:
pto.vsstb %value, %dest, %offset, %mask : !pto.vreg<NxT>, !pto.ptr<T, ub>, i32, !pto.mask<G> - semantics: Block-strided store for 2D tile access.
- inputs:
%valueis the source vector,%destis the UB base pointer,%offsetis the packed stride/control word, and%maskcontrols block participation. - outputs: This op writes UB memory and returns no SSA value.
- constraints and limitations:
%offsetis a control word, not a plain byte displacement. It selects the block-strided store pattern.
Scatter (Indexed) Stores¶
pto.vscatter¶
- syntax:
pto.vscatter %value, %dest, %offsets, %active_lanes : !pto.vreg<NxT>, !pto.ptr<T, ub>, !pto.vreg<NxI>, index - semantics: Indexed scatter to UB.
- inputs:
%valueis the source vector,%destis the UB base pointer,%offsetsprovides per-lane or per-block indices, and%active_lanesbounds the active requests. - outputs: This op writes UB memory and returns no SSA value.
- constraints and limitations:
Only
b8,b16, andb32element sizes are supported. The index vector must use a supported integer element type and layout for this instruction set. Each computed address MUST be element-aligned. If two or more indices alias, only one write is guaranteed and the winning lane is undefined behavior on A2/A3 (scatter aliases are illegal and must not occur); on A5 the lane with the numerically lowest index wins the write.
for (int i = 0; i < active_lanes; i++)
UB[base + offsets[i] * sizeof(T)] = src[i];
Alignment State Stores¶
pto.vsta¶
- syntax:
pto.vsta %value, %dest[%offset] : !pto.align, !pto.ptr<T, ub>, index - semantics: Flush alignment state to memory.
- inputs:
%valueis the pending store-alignment state,%destis the UB base pointer, and%offsetis the flush displacement. - outputs: This op writes buffered tail bytes to UB and returns no SSA value.
- constraints and limitations: The flush address MUST match the post-updated address expected by the preceding unaligned-store stream. After the flush, the corresponding store alignment state is consumed.
pto.vstu¶
There are two syntactic forms for pto.vstu:
Form A (Offset-state form):
- syntax: %align_out, %base_out = pto.vstu %align_in, %base_in, %value, %dest, %mode : !pto.align, !pto.ptr<T, ub>, !pto.vreg<NxT>, !pto.ptr<T, ub>, index -> !pto.align, !pto.ptr<T, ub>
- semantics: Unaligned store with explicit threaded alignment/base state.
- inputs: %align_in is the incoming store-alignment state, %base_in is the current stream base, %value is the vector to store, %dest is the UB base pointer, and %mode selects the post-update behavior.
- outputs: %align_out is the updated buffered-tail state and %base_out is the post-update base pointer state.
- constraints and limitations: This op models a stateful unaligned-store sequence in SSA form. A final pto.vsta / pto.vstas / pto.vstar is still required to flush the trailing buffered bytes.
Form B (Index-state form, explicit MODE):
- syntax: %align_out, %offset_out = pto.vstu %align_in, %offset_in, %value, %base, "MODE" : !pto.align, index, !pto.vreg<NxT>, !pto.ptr<T, ub> -> !pto.align, index
- semantics: Unaligned store with align + offset state update.
- inputs: %align_in is the incoming store-alignment state, %offset_in is the current logical byte/element displacement, %value is the vector being stored, and %base is the UB base pointer.
- outputs: %align_out is the updated alignment/tail state and %offset_out is the next offset after applying the selected post-update rule.
- constraints and limitations: The alignment state MUST be threaded in program order. A terminating flush form such as pto.vstar/pto.vstas is still required to commit the buffered tail bytes.
- Mode tokens: POST_UPDATE, NO_POST_UPDATE
pto.vstus¶
There are two syntactic forms:
Form A (Offset-state form):
- syntax: %align_out, %base_out = pto.vstus %align_in, %base_in, %value, %dest, %offset : !pto.align, !pto.ptr<T, ub>, !pto.vreg<NxT>, !pto.ptr<T, ub>, i32 -> !pto.align, !pto.ptr<T, ub>
- semantics: Scalar-offset unaligned store with threaded state.
- inputs: Same roles as pto.vstu Form A, but %offset is provided explicitly as the scalar displacement.
- outputs: Updated alignment state and base state.
- constraints and limitations: The same final flush requirement and state-threading constraints as pto.vstu apply.
Form B (Index-state form, explicit MODE):
- syntax: %align_out, %base_out = pto.vstus %align_in, %offset, %value, %base, "MODE" : !pto.align, i32, !pto.vreg<NxT>, !pto.ptr<T, ub> -> !pto.align, !pto.ptr<T, ub>
- semantics: Unaligned store with scalar offset and state update.
- inputs: %align_in is the incoming store-alignment state, %offset is the scalar displacement, %value is the vector being stored, and %base is the UB base pointer.
- outputs: %align_out is the updated buffered-tail state and %base_out is the next base pointer when the lowering chooses a post-update form.
- constraints and limitations: The scalar offset width and update mode MUST match the selected form, and a later flush op is still required.
pto.vstur¶
There are two syntactic forms:
Form A (Simple form, no MODE):
- syntax: %align_out = pto.vstur %align_in, %value, %dest : !pto.align, !pto.vreg<NxT>, !pto.ptr<T, ub> -> !pto.align
- semantics: Register-update unaligned store form.
- inputs: %align_in is the incoming store-alignment state, %value is the vector to store, and %dest is the UB base pointer.
- outputs: %align_out is the updated buffered-tail state.
- constraints and limitations: This op updates only the residual alignment state. A matching flush op is still required to emit the trailing bytes.
Form B (With MODE, explicit state):
- syntax: %align_out = pto.vstur %align_in, %value, %base, "MODE" : !pto.align, !pto.vreg<NxT>, !pto.ptr<T, ub> -> !pto.align
- semantics: Unaligned store with residual flush and state update.
- inputs: %align_in is the incoming store-alignment state, %value is the vector to store, and %base is the UB base pointer.
- outputs: %align_out is the updated residual state after the current partial store.
- constraints and limitations: This form exposes only the evolving state; it does not by itself guarantee that all buffered bytes have reached memory. A compatible final flush is still required unless the surrounding sequence is known to be complete.
pto.vstas¶
- syntax:
pto.vstas %align, %dest, %offset : !pto.align, !pto.ptr<T, ub>, i32 - semantics: Scalar-register-offset form of alignment-state flush. Emits any remaining buffered bytes from a prior unaligned-store sequence.
- inputs:
%valueis the pending store-alignment state,%destis the UB base pointer, and%offsetis the scalar-register style displacement. - outputs: This op writes buffered tail bytes to UB and returns no SSA value.
- constraints and limitations: MUST be paired with a compatible prior state-producing store sequence. The
%offsetis counted in bytes.
pto.vstar¶
- syntax:
pto.vstar %value, %dest : !pto.align, !pto.ptr<T, ub> - semantics: Flush remaining alignment state.
- inputs:
%valueis the pending alignment/buffer state that still needs to be emitted, and%destis the UB destination base pointer. - outputs: No SSA result. The effect is a memory-side flush that writes the remaining buffered bytes to memory.
- constraints and limitations: This op terminates an unaligned-store sequence. It MUST be paired with a compatible prior state-producing store sequence so that the pending tail state is well-defined.