Vector Instruction Set: SFU And DSA Instructions
Special-function, fused, and domain-specific pto.v* instruction sets are defined here. These forms are narrower than generic arithmetic and therefore carry explicit target-profile restrictions.
Category: Domain-specific accelerator and special function unit operations
Pipeline: PIPE_V (Vector Core) / SFU
Fused operations, special functions, and UB-to-UB operations that leverage hardware acceleration.
Common Operand Model
- %input, %lhs, %rhs, %acc, and %alpha are source SSA values whose roles are called out per instruction.
- %mask is the predicate operand Pg when present.
- %result is the destination SSA value.
- This instruction-set page mixes three different backend shapes: pure vreg -> vreg ops, conversion/fusion ops, and UB-to-UB helpers. Each instruction section calls out which storage model it uses.
Fused Activation Ops (vreg→vreg)
pto.vlrelu
- syntax:
%result = pto.vlrelu %input, %alpha, %mask : !pto.vreg<NxT>, T, !pto.mask<G> -> !pto.vreg<NxT>
- A5 types: f16, f32
- semantics: Leaky ReLU with scalar alpha.
for (int i = 0; i < N; i++)
    dst[i] = (src[i] >= 0) ? src[i] : alpha * src[i];
- inputs: %input is the activation vector, %alpha is the scalar slope, and %mask selects active lanes.
- outputs: %result is the leaky-ReLU vector.
- constraints and limitations: Only f16 and f32 forms are currently documented for pto.vlrelu.
pto.vprelu
- syntax:
%result = pto.vprelu %input, %alpha : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>
- A5 types: f16, f32
- semantics: Parametric ReLU with per-element alpha vector.
for (int i = 0; i < N; i++)
    dst[i] = (src[i] >= 0) ? src[i] : alpha[i] * src[i];
- inputs: %input is the activation vector and %alpha is the per-element slope vector.
- outputs: %result is the parametric-ReLU vector.
- constraints and limitations: Floating-point element types only on the current A5 instruction set.
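The two activation forms above differ only in where the slope comes from. A minimal scalar sketch in C, assuming f32 lanes and ignoring %mask (as the semantics pseudocode does), shows that pto.vlrelu matches pto.vprelu when the alpha vector holds the same value in every lane. The function names ref_lrelu and ref_prelu are hypothetical and are not part of the PTO toolchain.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference for the pto.vlrelu semantics: one slope shared by every lane. */
void ref_lrelu(const float *src, float alpha, float *dst, size_t n) {
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = (src[i] >= 0.0f) ? src[i] : alpha * src[i];
}

/* Scalar reference for the pto.vprelu semantics: one slope per lane. */
void ref_prelu(const float *src, const float *alpha, float *dst, size_t n) {
    assert(src != NULL && alpha != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = (src[i] >= 0.0f) ? src[i] : alpha[i] * src[i];
}
```

Calling ref_prelu with an alpha array filled with 0.25f gives the same lanes as ref_lrelu with alpha = 0.25f. How inactive lanes behave under %mask is not stated on this page, so the sketch leaves it out.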
pto.vexpdif
- syntax:
%result = pto.vexpdif %input, %max : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>
- A5 types: f16, f32
- semantics: Fused exp(x - max) for numerically stable softmax.
for (int i = 0; i < N; i++)
    dst[i] = expf(src[i] - max[i]);
Use case: Softmax numerator computation with numerical stability.
- inputs: %input is the source vector and %max is the broadcast subtraction term.
- outputs: %result is the fused exp(input - max) vector.
- constraints and limitations: Floating-point element types only.
Fused Compute+Convert Ops
pto.vaddrelu
- syntax:
%result = pto.vaddrelu %lhs, %rhs : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>
- A5 types: f16, f32
- semantics: Fused add + ReLU.
for (int i = 0; i < N; i++)
    dst[i] = max(src0[i] + src1[i], 0);
- inputs: %lhs and %rhs are the two addends.
- outputs: %result is the fused add-then-ReLU result.
- constraints and limitations: Floating-point element types only on the current documented instruction set.
pto.vsubrelu
- syntax:
%result = pto.vsubrelu %lhs, %rhs : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>
- A5 types: f16, f32
- semantics: Fused sub + ReLU.
for (int i = 0; i < N; i++)
    dst[i] = max(src0[i] - src1[i], 0);
- inputs: %lhs is the minuend and %rhs is the subtrahend.
- outputs: %result is the fused sub-then-ReLU result.
- constraints and limitations: Floating-point element types only on the current documented instruction set.
pto.vaxpy
- syntax:
%result = pto.vaxpy %src0, %src1, %alpha : !pto.vreg<NxT>, !pto.vreg<NxT>, T -> !pto.vreg<NxT>
- A5 types: f16, f32
- semantics: AXPY (scalar-vector multiply-add).
for (int i = 0; i < N; i++)
    dst[i] = alpha * src0[i] + src1[i];
- inputs: %src0 is the scaled vector, %src1 is the addend vector, and %alpha is the scalar multiplier.
- outputs: %result is the fused AXPY result.
- constraints and limitations: Floating-point element types only on the current documented instruction set.
pto.vaddreluconv
- syntax:
%result = pto.vaddreluconv %lhs, %rhs : !pto.vreg<NxT0>, !pto.vreg<NxT0> -> !pto.vreg<MxT1>
- semantics: Fused add + ReLU + type conversion (HW fusion).
// f32→f16 variant:
for (int i = 0; i < 64; i++)
    dst_f16[i] = f32_to_f16(max(src0_f32[i] + src1_f32[i], 0));
// f16→i8 variant:
for (int i = 0; i < 128; i++)
    dst_i8[i] = f16_to_i8(max(src0_f16[i] + src1_f16[i], 0));
- inputs: %lhs and %rhs are the source vectors.
- outputs: %result is the fused add/ReLU/convert result.
- constraints and limitations: Only backend-supported source/destination type pairs are legal. Rounding, saturation, and packing rules follow the semantics of this fused operation, not an arbitrary sequence of standalone ops.
pto.vmulconv
- syntax:
%result = pto.vmulconv %lhs, %rhs : !pto.vreg<NxT0>, !pto.vreg<NxT0> -> !pto.vreg<MxT1>
- semantics: Fused mul + type conversion (HW fusion).
// f16→i8 variant:
for (int i = 0; i < 128; i++)
    dst_i8[i] = f16_to_i8(src0_f16[i] * src1_f16[i]);
- inputs: %lhs and %rhs are the source vectors.
- outputs: %result is the fused mul/convert result.
- constraints and limitations: Only backend-supported source/destination type pairs are legal.
Extended Arithmetic
pto.vmull
- syntax:
%low, %high = pto.vmull %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>, !pto.vreg<NxT>
- A5 types: i32/u32 (native 32×32→64 widening multiply)
- semantics: Widening multiply with high/low results.
for (int i = 0; i < 64; i++) {
    int64_t r = (int64_t)src0_i32[i] * (int64_t)src1_i32[i];
    dst_lo[i] = (int32_t)(r & 0xFFFFFFFF);
    dst_hi[i] = (int32_t)(r >> 32);
}
- inputs: %lhs and %rhs are the source vectors and %mask selects active lanes.
- outputs: %low and %high expose the low and high 32-bit halves of the widened product.
- constraints and limitations: The currently documented A5 form is the native widening 32x32->64 integer multiply.
pto.vmula
- syntax:
%result = pto.vmula %acc, %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>
- semantics: Multiply-accumulate.
for (int i = 0; i < N; i++)
    if (mask[i])
        dst[i] = acc[i] + lhs[i] * rhs[i];
- inputs: %acc is the accumulator input, %lhs and %rhs are the multiplicands, and %mask selects active lanes.
- outputs: %result is the multiply-accumulate result.
- constraints and limitations: pto.vmula is a fused multiply-accumulate operation and is not always interchangeable with separate vmul plus vadd.
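One way a fused multiply-accumulate can differ from separate multiply and add steps is rounding. The C sketch below assumes, without confirmation from this page, that the fused form rounds once at the end, and it approximates that by carrying the exact product in a double. Both function names are hypothetical, and the model covers a single f32 lane with the mask omitted.

```c
#include <assert.h>

/* Separate multiply then add: the product is rounded to f32 before the add. */
float ref_mul_then_add(float acc, float lhs, float rhs) {
    assert(lhs == lhs && rhs == rhs);          /* sketch does not model NaN inputs */
    volatile float prod = lhs * rhs;           /* volatile forces the f32 rounding */
    volatile float sum = acc + prod;
    return sum;
}

/* Fused model (assumption): keep the exact product, round at the end.
   The product of two f32 values is exactly representable in a double. */
float ref_mula_fused(float acc, float lhs, float rhs) {
    assert(lhs == lhs && rhs == rhs);          /* sketch does not model NaN inputs */
    double exact_prod = (double)lhs * (double)rhs;
    return (float)((double)acc + exact_prod);
}
```

With lhs = rhs = 1 + 2^-12 and acc = -(1 + 2^-11), the separate form returns 0 because the 2^-24 term of the product is rounded away, while the fused model returns 2^-24. Whether the hardware rounds this way is an assumption of the sketch.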
Index Generation
pto.vci
- syntax:
%result = pto.vci %index {order = "ORDER"} : integer -> !pto.vreg<NxT>
- semantics: Generate lane index vector.
// base_index is the value of %index
for (int i = 0; i < N; i++)
    dst[i] = base_index + i;
Use case: Generate indices for gather/scatter, argsort, etc.
- inputs: %index is the scalar seed/base index.
- outputs: %result is the generated index vector.
- constraints and limitations: This page documents the arithmetic/indexing use of pto.vci; the conversion page also records the same opcode for completeness.
UB-to-UB Operations
pto.vtranspose
- syntax:
pto.vtranspose %dest, %src, %config : !pto.ptr<T, ub>, !pto.ptr<T, ub>, i64
- semantics: UB-to-UB transpose operation (not vreg-to-vreg).
Note: This operates on UB memory directly, not on vector registers.
- inputs: %dest and %src are UB pointers and %config is the ISA control/config word.
- outputs: This op writes UB memory and returns no SSA value.
- constraints and limitations: This is not a vreg -> vreg op even though it lives in the pto.v* namespace. Its correctness depends on the control word and UB layout contract.
Sorting Operations
pto.vsort32
- syntax:
pto.vsort32 %dest, %src, %config : !pto.ptr<T, ub>, !pto.ptr<T, ub>, i64
- semantics: Sort 32 elements in UB.
- inputs: %dest and %src are UB pointers and %config is the ISA control/config word.
- outputs: This op writes UB memory and returns no SSA value.
- constraints and limitations: This is a UB-to-UB accelerator helper, not a pure vector-register op.
pto.vbitsort
- syntax:
pto.vbitsort %dest, %src, %indices, %repeat_times : !pto.ptr<T, ub>, !pto.ptr<T, ub>, !pto.ptr<T, ub>, index
- semantics: Sort 32 region proposals by score (descending) and materialize sorted proposal records into %dest.
- inputs: %dest is the UB destination buffer, %src is the UB score buffer, %indices is the UB index buffer, and %repeat_times controls how many adjacent groups of 32 elements to process.
- outputs: This op writes UB memory and returns no SSA value. Each output record is 8 bytes: upper 4 bytes = index, lower 4 bytes = score.
- constraints and limitations: Scores are sorted in descending order. Equal-score ties are stable. All pointers MUST be UB-backed. A5-specific (VBS32 hardware unit).
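The record layout and tie rule above can be sketched as a host-side reference in C. This is a minimal model of a single group of 32 (so %repeat_times = 1), assuming f32 scores and u32 indices and treating each 8-byte record as a uint64_t with the index in the upper 32 bits; the byte order in UB memory and the handling of NaN scores are not covered. The names pack_proposal and ref_bitsort32 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record image: index in the upper 32 bits, raw f32 score bits below. */
uint64_t pack_proposal(uint32_t index, float score) {
    uint32_t bits;
    memcpy(&bits, &score, sizeof bits);     /* reinterpret the float bit pattern */
    return ((uint64_t)index << 32) | bits;
}

/* Sort one group of 32 scores in descending order; equal scores keep input order. */
void ref_bitsort32(const float *scores, const uint32_t *indices, uint64_t *dest) {
    assert(scores != NULL && indices != NULL && dest != NULL);
    int order[32];
    for (int i = 0; i < 32; i++) {
        int j = i;
        /* Insertion sort: shift only strictly smaller scores, so ties stay stable. */
        while (j > 0 && scores[order[j - 1]] < scores[i]) {
            order[j] = order[j - 1];
            j--;
        }
        order[j] = i;
    }
    for (int i = 0; i < 32; i++)
        dest[i] = pack_proposal(indices[order[i]], scores[order[i]]);
}
```

A reference like this can be used to check device output record by record: the index is recovered with a right shift by 32, and the score bits with a mask of the low 32 bits.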
pto.vmrgsort
- syntax:
pto.vmrgsort4 %dest, %src0, %src1, %src2, %src3, %count, %config : !pto.ptr<T, ub>, !pto.ptr<T, ub>, !pto.ptr<T, ub>, !pto.ptr<T, ub>, !pto.ptr<T, ub>, i64, i64
- semantics: Merge-sort 4 pre-sorted input vectors.
- inputs: %dest is the UB destination, %src0..%src3 are the four pre-sorted UB inputs, %count is the number of valid elements, and %config is the operation control word.
- outputs: This op writes UB memory and returns no SSA value.
- constraints and limitations: Inputs MUST already be sorted according to the sort order encoded by %config. This section's heading uses the shorter mnemonic pto.vmrgsort, while the syntax line and the current implementation summary use pto.vmrgsort4.
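The four-way merge can be illustrated with a small host-side reference in C. The sketch assumes descending order, f32 elements, and a separate length per input run; the real op takes a single %count and encodes the order in %config, neither of which is detailed on this page, so those choices are illustrative only. The name ref_merge4_desc is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Merge four descending-sorted runs into dest; returns the element count written. */
size_t ref_merge4_desc(const float *src[4], const size_t len[4], float *dest) {
    assert(src != NULL && len != NULL && dest != NULL);
    size_t pos[4] = {0, 0, 0, 0};
    size_t out = 0;
    for (;;) {
        int best = -1;
        for (int k = 0; k < 4; k++) {
            if (pos[k] >= len[k])
                continue;                    /* run k is exhausted */
            /* Strict '>' lets the lowest-numbered run win ties. */
            if (best < 0 || src[k][pos[k]] > src[best][pos[best]])
                best = k;
        }
        if (best < 0)
            break;                           /* all four runs consumed */
        dest[out++] = src[best][pos[best]++];
    }
    return out;
}
```

If an input run is not already sorted, the output of this model is not sorted either, which mirrors the precondition stated in the constraints.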
Current Implementation Instruction Set Summary
- pto.vmull %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>, !pto.vreg<NxT>
- pto.vmula %acc, %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>
- pto.vci %index {order = "ORDER"} : integer -> !pto.vreg<NxT>
- pto.vbitsort %dest, %src, %indices, %repeat_times : !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, index
- pto.vmrgsort4 %dest, %src0, %src1, %src2, %src3, %count, %config : !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, i64, i64
Typical Usage
// Softmax numerator with fused pto.vexpdif
%max_broadcast = pto.vlds %ub_max[%c0] {dist = "BRC_B32"} : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
%exp_stable = pto.vexpdif %logits, %max_broadcast : !pto.vreg<64xf32>, !pto.vreg<64xf32> -> !pto.vreg<64xf32>
// Leaky ReLU activation
%activated = pto.vlrelu %linear_out, %alpha_scalar, %mask : !pto.vreg<64xf32>, f32, !pto.mask<b32> -> !pto.vreg<64xf32>
// Fused residual add + ReLU
%residual = pto.vaddrelu %conv_out, %skip_connection : !pto.vreg<64xf32>, !pto.vreg<64xf32> -> !pto.vreg<64xf32>
// Generate indices for argsort
%indices = pto.vci %c0 {order = "ASC"} : i32 -> !pto.vreg<64xi32>