Vector Instruction Set: SFU And DSA Instructions

Special-function, fused, and domain-specific pto.v* instruction sets are defined here. These forms are narrower than generic arithmetic and therefore carry explicit target-profile restrictions.

Category: Domain-specific accelerator and special function unit operations
Pipeline: PIPE_V (Vector Core) / SFU

This page covers fused operations, special functions, and UB-to-UB operations that use hardware acceleration.

Common Operand Model

  • %input, %lhs, %rhs, %acc, and %alpha are source SSA values whose roles are called out per instruction.
  • %mask is the predicate operand Pg when present.
  • %result is the destination SSA value.
  • This instruction-set page mixes three different backend shapes: pure vreg -> vreg ops, conversion/fusion ops, and UB-to-UB helpers. Each instruction section calls out which storage model it uses.

Fused Activation Ops (vreg→vreg)

pto.vlrelu

  • syntax: %result = pto.vlrelu %input, %alpha, %mask : !pto.vreg<NxT>, T, !pto.mask<G> -> !pto.vreg<NxT>
  • A5 types: f16, f32
  • semantics: Leaky ReLU with scalar alpha.
for (int i = 0; i < N; i++)
    dst[i] = (src[i] >= 0) ? src[i] : alpha * src[i];
  • inputs: %input is the activation vector, %alpha is the scalar slope, and %mask selects active lanes.
  • outputs: %result is the leaky-ReLU vector.
  • constraints and limitations: Only f16 and f32 forms are currently documented for pto.vlrelu.

pto.vprelu

  • syntax: %result = pto.vprelu %input, %alpha : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>
  • A5 types: f16, f32
  • semantics: Parametric ReLU with per-element alpha vector.
for (int i = 0; i < N; i++)
    dst[i] = (src[i] >= 0) ? src[i] : alpha[i] * src[i];
  • inputs: %input is the activation vector and %alpha is the per-element slope vector.
  • outputs: %result is the parametric-ReLU vector.
  • constraints and limitations: Floating-point element types only on the current A5 instruction set.
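The two activation forms above differ only in where the slope comes from. A minimal scalar sketch in C, using float as a stand-in for the f16/f32 lanes; the function names ref_vlrelu and ref_vprelu are hypothetical, and %mask handling for pto.vlrelu is left out because this page does not specify what inactive lanes hold:

```c
#include <stddef.h>

/* Scalar reference for pto.vlrelu: one slope shared by every lane.
   ASSUMPTION: all lanes are active; %mask is not modelled here. */
void ref_vlrelu(float *dst, const float *src, float alpha, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = (src[i] >= 0.0f) ? src[i] : alpha * src[i];
}

/* Scalar reference for pto.vprelu: one slope per lane. */
void ref_vprelu(float *dst, const float *src, const float *alpha, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = (src[i] >= 0.0f) ? src[i] : alpha[i] * src[i];
}
```

Calling ref_vprelu with an alpha array filled with a single value gives the same result as ref_vlrelu with that value as the scalar slope.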

pto.vexpdif

  • syntax: %result = pto.vexpdif %input, %max : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>
  • A5 types: f16, f32
  • semantics: Fused exp(x - max) for numerically stable softmax.
for (int i = 0; i < N; i++)
    dst[i] = expf(src[i] - max[i]);

Use case: Softmax numerator computation with numerical stability.

  • inputs: %input is the source vector and %max is the broadcasted subtraction term.
  • outputs: %result is the fused exp(input - max) vector.
  • constraints and limitations: Floating-point element types only.

Fused Compute+Convert Ops

pto.vaddrelu

  • syntax: %result = pto.vaddrelu %lhs, %rhs : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>
  • A5 types: f16, f32
  • semantics: Fused add + ReLU.
for (int i = 0; i < N; i++)
    dst[i] = max(src0[i] + src1[i], 0);
  • inputs: %lhs and %rhs are the two addends.
  • outputs: %result is the fused add-then-ReLU result.
  • constraints and limitations: Floating-point element types only on the current documented instruction set.

pto.vsubrelu

  • syntax: %result = pto.vsubrelu %lhs, %rhs : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>
  • A5 types: f16, f32
  • semantics: Fused sub + ReLU.
for (int i = 0; i < N; i++)
    dst[i] = max(src0[i] - src1[i], 0);
  • inputs: %lhs is the minuend and %rhs is the subtrahend.
  • outputs: %result is the fused sub-then-ReLU result.
  • constraints and limitations: Floating-point element types only on the current documented instruction set.

pto.vaxpy

  • syntax: %result = pto.vaxpy %src0, %src1, %alpha : !pto.vreg<NxT>, !pto.vreg<NxT>, T -> !pto.vreg<NxT>
  • A5 types: f16, f32
  • semantics: AXPY — scalar-vector multiply-add.
for (int i = 0; i < N; i++)
    dst[i] = alpha * src0[i] + src1[i];
  • inputs: %src0 is the scaled vector, %src1 is the addend vector, and %alpha is the scalar multiplier.
  • outputs: %result is the fused AXPY result.
  • constraints and limitations: Floating-point element types only on the current documented instruction set.
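The three fused arithmetic forms above (pto.vaddrelu, pto.vsubrelu, pto.vaxpy) can be checked against a scalar sketch such as the following. The ref_* names are hypothetical, float stands in for the f16/f32 lanes, and NaN handling is not modelled because the page does not describe it.

```c
#include <stddef.h>

/* ReLU on one lane. */
static float relu(float x) { return x > 0.0f ? x : 0.0f; }

/* dst[i] = max(a[i] + b[i], 0) */
void ref_vaddrelu(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = relu(a[i] + b[i]);
}

/* dst[i] = max(a[i] - b[i], 0) */
void ref_vsubrelu(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = relu(a[i] - b[i]);
}

/* dst[i] = alpha * src0[i] + src1[i] */
void ref_vaxpy(float *dst, const float *src0, const float *src1,
               float alpha, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = alpha * src0[i] + src1[i];
}
```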

pto.vaddreluconv

  • syntax: %result = pto.vaddreluconv %lhs, %rhs : !pto.vreg<NxT0>, !pto.vreg<NxT0> -> !pto.vreg<MxT1>
  • semantics: Fused add + ReLU + type conversion (HW fusion).
// f32→f16 variant:
for (int i = 0; i < 64; i++)
    dst_f16[i] = f32_to_f16(max(src0_f32[i] + src1_f32[i], 0));

// f16→i8 variant:
for (int i = 0; i < 128; i++)
    dst_i8[i] = f16_to_i8(max(src0_f16[i] + src1_f16[i], 0));
  • inputs: %lhs and %rhs are the source vectors.
  • outputs: %result is the fused add/ReLU/convert result.
  • constraints and limitations: Only backend-supported source/destination type pairs are legal. Rounding, saturation, and packing rules follow the semantics of this fused operation, not an arbitrary sequence of standalone ops.
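The loop bounds 64 and 128 in the pseudocode above match the number of source lanes. The sketch below makes that arithmetic explicit. It assumes a 256-byte vector register, which is inferred from the 64xf32 and 128xf16 shapes on this page rather than stated by it; both helper names are hypothetical.

```c
#include <stddef.h>

/* ASSUMPTION: 256-byte vector register, inferred from the 64xf32 and
   128xf16 shapes used on this page. */
enum { VREG_BYTES = 256 };

/* Number of lanes in a full register for elements of elem_bytes bytes. */
size_t lanes_per_vreg(size_t elem_bytes) {
    return VREG_BYTES / elem_bytes;
}

/* Bytes of payload produced when n_src lanes are converted to elements
   of dst_elem_bytes bytes each. */
size_t converted_bytes(size_t n_src, size_t dst_elem_bytes) {
    return n_src * dst_elem_bytes;
}
```

Under that assumption, a narrowing conversion of one full source register (64 f32 lanes to f16, or 128 f16 lanes to i8) produces 128 bytes, half of a destination register. How those bytes are placed within !pto.vreg<MxT1> is part of the packing rules this page leaves to the fused operation.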

pto.vmulconv

  • syntax: %result = pto.vmulconv %lhs, %rhs : !pto.vreg<NxT0>, !pto.vreg<NxT0> -> !pto.vreg<MxT1>
  • semantics: Fused mul + type conversion (HW fusion).
// f16→i8 variant:
for (int i = 0; i < 128; i++)
    dst_i8[i] = f16_to_i8(src0_f16[i] * src1_f16[i]);
  • inputs: %lhs and %rhs are the source vectors.
  • outputs: %result is the fused mul/convert result.
  • constraints and limitations: Only backend-supported source/destination type pairs are legal.

Extended Arithmetic

pto.vmull

  • syntax: %low, %high = pto.vmull %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>, !pto.vreg<NxT>
  • A5 types: i32/u32 (native 32×32→64 widening multiply)
  • semantics: Widening multiply with high/low results.
for (int i = 0; i < 64; i++) {
    int64_t r = (int64_t)src0_i32[i] * (int64_t)src1_i32[i];
    dst_lo[i] = (int32_t)(r & 0xFFFFFFFF);
    dst_hi[i] = (int32_t)(r >> 32);
}
  • inputs: %lhs and %rhs are the source vectors and %mask selects active lanes.
  • outputs: %low and %high expose the widened-product low/high parts.
  • constraints and limitations: The only form currently documented for A5 is the native widening 32x32->64 integer multiply.
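A scalar sketch of the signed widening multiply, plus a helper that rebuilds the 64-bit product from the two halves. The names ref_vmull_i32 and join_i64 are hypothetical, %mask is not modelled, and the u32 form would use uint64_t throughout instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Signed 32x32->64 widening multiply, split into low and high words. */
void ref_vmull_i32(int32_t *lo, int32_t *hi,
                   const int32_t *a, const int32_t *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int64_t r = (int64_t)a[i] * (int64_t)b[i];
        uint64_t bits = (uint64_t)r;   /* unsigned view, so the shift is portable */
        lo[i] = (int32_t)(uint32_t)(bits & 0xFFFFFFFFu);
        hi[i] = (int32_t)(uint32_t)(bits >> 32);
    }
}

/* Rebuild the 64-bit product from its two halves. */
int64_t join_i64(int32_t lo, int32_t hi) {
    return (int64_t)(((uint64_t)(uint32_t)hi << 32) | (uint64_t)(uint32_t)lo);
}
```

Going through uint64_t avoids right-shifting a negative int64_t, whose result is implementation-defined in C.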

pto.vmula

  • syntax: %result = pto.vmula %acc, %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>
  • semantics: Multiply-accumulate.
for (int i = 0; i < N; i++)
    if (mask[i])
        dst[i] = acc[i] + lhs[i] * rhs[i];
  • inputs: %acc is the accumulator input, %lhs and %rhs are the multiplicands, and %mask selects active lanes.
  • outputs: %result is the multiply-accumulate result.
  • constraints and limitations: pto.vmula is a fused multiply-accumulate operation and is not always interchangeable with separate vmul plus vadd.
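One reason a fused multiply-accumulate may not match separate vmul plus vadd is rounding: the unfused sequence rounds the product before adding. The sketch below illustrates this in C under the assumption that the fused form rounds once; the page does not state the hardware's rounding behaviour, and all function names are hypothetical.

```c
#include <stddef.h>

/* Unfused: the product is rounded to float before the add. */
float mul_then_add(float acc, float a, float b) {
    volatile float prod = a * b;   /* volatile blocks compiler FMA contraction */
    return acc + prod;
}

/* Stand-in for a fused form: a float*float product is exact in double,
   so only the add and the final narrowing round. This is close to, but
   not always identical to, a true single-rounding FMA. */
float fused_mul_add(float acc, float a, float b) {
    return (float)((double)acc + (double)a * (double)b);
}

/* Masked multiply-accumulate mirroring the pseudocode above: lanes
   with mask[i] == 0 are simply not written. */
void ref_vmula(float *dst, const float *acc, const float *lhs,
               const float *rhs, const unsigned char *mask, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (mask[i])
            dst[i] = fused_mul_add(acc[i], lhs[i], rhs[i]);
}
```

With a = 1 + 2^-12 and acc = -(1 + 2^-11), the exact product a*a is 1 + 2^-11 + 2^-24. Rounding it to float first drops the 2^-24 term, so mul_then_add returns 0 while fused_mul_add returns 2^-24.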

Index Generation

pto.vci

  • syntax: %result = pto.vci %index {order = "ORDER"} : integer -> !pto.vreg<NxT>
  • semantics: Generate lane index vector.
for (int i = 0; i < N; i++)
    dst[i] = base_index + i;

Use case: Generate indices for gather/scatter, argsort, etc.

  • inputs: %index is the scalar seed (base_index in the pseudocode above). The order attribute selects the index ordering, for example "ASC".
  • outputs: %result is the generated index vector.
  • constraints and limitations: This page documents the arithmetic/indexing use of pto.vci; the conversion page also records the same opcode for completeness.

UB-to-UB Operations

pto.vtranspose

  • syntax: pto.vtranspose %dest, %src, %config : !pto.ptr<T, ub>, !pto.ptr<T, ub>, i64
  • semantics: UB-to-UB transpose operation (not vreg-to-vreg).

Note: This operates on UB memory directly, not on vector registers.

  • inputs: %dest and %src are UB pointers and %config is the ISA control/config word.
  • outputs: This op writes UB memory and returns no SSA value.
  • constraints and limitations: This is not a vreg -> vreg op even though it lives in the pto.v* namespace. Its correctness depends on the control word and UB layout contract.

Sorting Operations

pto.vsort32

  • syntax: pto.vsort32 %dest, %src, %config : !pto.ptr<T, ub>, !pto.ptr<T, ub>, i64
  • semantics: Sort 32 elements in UB.
  • inputs: %dest and %src are UB pointers and %config is the ISA control/config word.
  • outputs: This op writes UB memory and returns no SSA value.
  • constraints and limitations: This is a UB-to-UB accelerator helper, not a pure vector-register op.

pto.vbitsort

  • syntax: pto.vbitsort %dest, %src, %indices, %repeat_times : !pto.ptr<T, ub>, !pto.ptr<T, ub>, !pto.ptr<T, ub>, index
  • semantics: Sort 32 region proposals by score (descending) and materialize sorted proposal records into %dest.
  • inputs: %dest is the UB destination buffer, %src is the UB score buffer, %indices is the UB index buffer, and %repeat_times controls how many adjacent groups of 32 elements to process.
  • outputs: This op writes UB memory and returns no SSA value. Each output record is 8 bytes: upper 4 bytes = index, lower 4 bytes = score.
  • constraints and limitations: Scores are sorted in descending order. Equal-score ties are stable. All pointers MUST be UB-backed. A5-specific (VBS32 hardware unit).
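The record layout and ordering rules above can be modelled with a scalar sketch. It assumes f32 scores and u32 indices (the page leaves T open), reads "upper" and "lower" as the high and low 32 bits of a 64-bit value, and assumes each group of 32 is sorted independently; all names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

enum { GROUP = 32 };

/* Pack one record: upper 4 bytes = index, lower 4 bytes = raw f32 score bits. */
uint64_t pack_record(uint32_t index, float score) {
    uint32_t bits;
    memcpy(&bits, &score, sizeof bits);
    return ((uint64_t)index << 32) | bits;
}

/* Sort one group of 32 by descending score; equal scores keep input order. */
void ref_vbitsort_group(uint64_t *dest, const float *src,
                        const uint32_t *indices) {
    int order[GROUP];
    for (int i = 0; i < GROUP; i++) {
        int j = i;
        /* Insertion sort: shift strictly smaller scores right so that
           equal scores stay ahead of the new element (stable). */
        while (j > 0 && src[order[j - 1]] < src[i]) {
            order[j] = order[j - 1];
            j--;
        }
        order[j] = i;
    }
    for (int i = 0; i < GROUP; i++)
        dest[i] = pack_record(indices[order[i]], src[order[i]]);
}

/* Process repeat_times adjacent groups of 32 elements. */
void ref_vbitsort(uint64_t *dest, const float *src,
                  const uint32_t *indices, int repeat_times) {
    for (int g = 0; g < repeat_times; g++)
        ref_vbitsort_group(dest + g * GROUP, src + g * GROUP,
                           indices + g * GROUP);
}
```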

pto.vmrgsort

  • syntax: pto.vmrgsort4 %dest, %src0, %src1, %src2, %src3, %count, %config : !pto.ptr<T, ub>, !pto.ptr<T, ub>, !pto.ptr<T, ub>, !pto.ptr<T, ub>, !pto.ptr<T, ub>, i64, i64
  • semantics: Merge-sort 4 pre-sorted input vectors.
  • inputs: %dest is the UB destination, %src0..%src3 are the four pre-sorted UB inputs, %count is the number of valid elements, and %config is the operation control word.
  • outputs: This op writes UB memory and returns no SSA value.
  • constraints and limitations: Inputs MUST already be sorted according to the sort order encoded by %config. This section's heading uses the shorter mnemonic pto.vmrgsort, while the syntax line and the current implementation summary use pto.vmrgsort4.
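A scalar sketch of a four-way merge over pre-sorted inputs. It assumes descending order and plain f32 elements, and it takes one length per input list; how %count and %config encode lengths and order on the hardware is not described on this page, so those parameters and the name ref_merge4_desc are hypothetical.

```c
#include <stddef.h>

/* Merge four lists that are each already sorted in descending order.
   Returns the number of elements written to dest. On equal values the
   lower-numbered list wins, which keeps the merge stable. */
size_t ref_merge4_desc(float *dest, const float *src[4], const size_t len[4]) {
    size_t pos[4] = {0, 0, 0, 0};
    size_t out = 0;
    for (;;) {
        int best = -1;
        for (int k = 0; k < 4; k++) {
            if (pos[k] == len[k])
                continue;                      /* list k is exhausted */
            if (best < 0 || src[k][pos[k]] > src[best][pos[best]])
                best = k;
        }
        if (best < 0)
            break;                             /* all four lists consumed */
        dest[out++] = src[best][pos[best]++];
    }
    return out;
}
```

If an input is not already sorted, the output is not sorted either, which is why the constraint above is a MUST.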

Current Implementation Instruction Set Summary

  • pto.vmull %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>, !pto.vreg<NxT>
  • pto.vmula %acc, %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>
  • pto.vci %index {order = "ORDER"} : integer -> !pto.vreg<NxT>
  • pto.vbitsort %dest, %src, %indices, %repeat_times : !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, index
  • pto.vmrgsort4 %dest, %src0, %src1, %src2, %src3, %count, %config : !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, i64, i64

Typical Usage

// Softmax numerator with fused pto.vexpdif
%max_broadcast = pto.vlds %ub_max[%c0] {dist = "BRC_B32"} : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
%exp_stable = pto.vexpdif %logits, %max_broadcast : !pto.vreg<64xf32>, !pto.vreg<64xf32> -> !pto.vreg<64xf32>

// Leaky ReLU activation
%activated = pto.vlrelu %linear_out, %alpha_scalar, %mask : !pto.vreg<64xf32>, f32, !pto.mask<b32> -> !pto.vreg<64xf32>

// Fused residual add + ReLU
%residual = pto.vaddrelu %conv_out, %skip_connection : !pto.vreg<64xf32>, !pto.vreg<64xf32> -> !pto.vreg<64xf32>

// Generate indices for argsort
%indices = pto.vci %c0 {order = "ASC"} : i32 -> !pto.vreg<64xi32>