Vector Instruction Set: SFU And DSA Instructions¶

Special-function, fused, and domain-specific pto.v* instruction sets are defined here. These forms are narrower than generic arithmetic and therefore carry explicit target-profile restrictions.

Category: Domain-specific accelerator and special function unit operations Pipeline: PIPE_V (Vector Core) / SFU

Fused operations, special functions, and UB-to-UB operations that leverage hardware acceleration.

Common Operand Model¶

%input, %lhs, %rhs, %acc, and %alpha are source SSA values whose roles are called out per instruction.
%mask is the predicate operand Pg when present.
%result is the destination SSA value.
This instruction-set page mixes three different backend shapes: pure vreg -> vreg ops, conversion/fusion ops, and UB-to-UB helpers. Each instruction section calls out which storage model it uses.

Fused Activation Ops (vreg→vreg)¶

`pto.vlrelu`¶

syntax: %result = pto.vlrelu %input, %alpha, %mask : !pto.vreg<NxT>, T, !pto.mask<G> -> !pto.vreg<NxT>
A5 types: f16, f32
semantics: Leaky ReLU with scalar alpha.

for (int i = 0; i < N; i++)
    dst[i] = (src[i] >= 0) ? src[i] : alpha * src[i];

inputs: %input is the activation vector, %alpha is the scalar slope, and %mask selects active lanes.
outputs: %result is the leaky-ReLU vector.
constraints and limitations: Only f16 and f32 forms are currently documented for pto.vlrelu.

`pto.vprelu`¶

syntax: %result = pto.vprelu %input, %alpha : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>
A5 types: f16, f32
semantics: Parametric ReLU with per-element alpha vector.

for (int i = 0; i < N; i++)
    dst[i] = (src[i] >= 0) ? src[i] : alpha[i] * src[i];

inputs: %input is the activation vector and %alpha is the per-element slope vector.
outputs: %result is the parametric-ReLU vector.
constraints and limitations: Floating-point element types only on the current A5 instruction set.

`pto.vexpdif`¶

syntax: %result = pto.vexpdif %input, %max : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>
A5 types: f16, f32
semantics: Fused exp(x - max) for numerically stable softmax.

for (int i = 0; i < N; i++)
    dst[i] = expf(src[i] - max[i]);

Use case: Softmax numerator computation with numerical stability.

inputs: %input is the source vector and %max is the broadcasted subtraction term.
outputs: %result is the fused exp(input - max) vector.
constraints and limitations: Floating-point element types only.

Fused Compute+Convert Ops¶

`pto.vaddrelu`¶

syntax: %result = pto.vaddrelu %lhs, %rhs : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>
A5 types: f16, f32
semantics: Fused add + ReLU.

for (int i = 0; i < N; i++)
    dst[i] = max(src0[i] + src1[i], 0);

inputs: %lhs and %rhs are the two addends.
outputs: %result is the fused add-then-ReLU result.
constraints and limitations: Floating-point element types only on the current documented instruction set.

`pto.vsubrelu`¶

syntax: %result = pto.vsubrelu %lhs, %rhs : !pto.vreg<NxT>, !pto.vreg<NxT> -> !pto.vreg<NxT>
A5 types: f16, f32
semantics: Fused sub + ReLU.

for (int i = 0; i < N; i++)
    dst[i] = max(src0[i] - src1[i], 0);

inputs: %lhs is the minuend and %rhs is the subtrahend.
outputs: %result is the fused sub-then-ReLU result.
constraints and limitations: Floating-point element types only on the current documented instruction set.

`pto.vaxpy`¶

syntax: %result = pto.vaxpy %src0, %src1, %alpha : !pto.vreg<NxT>, !pto.vreg<NxT>, T -> !pto.vreg<NxT>
A5 types: f16, f32
semantics: AXPY — scalar-vector multiply-add.

for (int i = 0; i < N; i++)
    dst[i] = alpha * src0[i] + src1[i];

inputs: %src0 is the scaled vector, %src1 is the addend vector, and %alpha is the scalar multiplier.
outputs: %result is the fused AXPY result.
constraints and limitations: Floating-point element types only on the current documented instruction set.

`pto.vaddreluconv`¶

syntax: %result = pto.vaddreluconv %lhs, %rhs : !pto.vreg<NxT0>, !pto.vreg<NxT0> -> !pto.vreg<MxT1>
semantics: Fused add + ReLU + type conversion (HW fusion).

// f32→f16 variant:
for (int i = 0; i < 64; i++)
    dst_f16[i] = f32_to_f16(max(src0_f32[i] + src1_f32[i], 0));

// f16→i8 variant:
for (int i = 0; i < 128; i++)
    dst_i8[i] = f16_to_i8(max(src0_f16[i] + src1_f16[i], 0));

inputs: %lhs and %rhs are the source vectors.
outputs: %result is the fused add/ReLU/convert result.
constraints and limitations: Only backend-supported source/destination type pairs are legal. Rounding, saturation, and packing rules follow the semantics of this fused operation, not an arbitrary sequence of standalone ops.

`pto.vmulconv`¶

syntax: %result = pto.vmulconv %lhs, %rhs : !pto.vreg<NxT0>, !pto.vreg<NxT0> -> !pto.vreg<MxT1>
semantics: Fused mul + type conversion (HW fusion).

// f16→i8 variant:
for (int i = 0; i < 128; i++)
    dst_i8[i] = f16_to_i8(src0_f16[i] * src1_f16[i]);

inputs: %lhs and %rhs are the source vectors.
outputs: %result is the fused mul/convert result.
constraints and limitations: Only backend-supported source/destination type pairs are legal.

Extended Arithmetic¶

`pto.vmull`¶

syntax: %low, %high = pto.vmull %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>, !pto.vreg<NxT>
A5 types: i32/u32 (native 32×32→64 widening multiply)
semantics: Widening multiply with high/low results.

for (int i = 0; i < 64; i++) {
    int64_t r = (int64_t)src0_i32[i] * (int64_t)src1_i32[i];
    dst_lo[i] = (int32_t)(r & 0xFFFFFFFF);
    dst_hi[i] = (int32_t)(r >> 32);
}

inputs: %lhs and %rhs are the source vectors and %mask selects active lanes.
outputs: %low and %high expose the widened-product low/high parts.
constraints and limitations: The current documented A5 form is the native widening 32x32->64 integer multiply instruction set.

`pto.vmula`¶

syntax: %result = pto.vmula %acc, %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>
semantics: Multiply-accumulate.

for (int i = 0; i < N; i++)
    if (mask[i])
        dst[i] = acc[i] + lhs[i] * rhs[i];

inputs: %acc is the accumulator input, %lhs and %rhs are the multiplicands, and %mask selects active lanes.
outputs: %result is the multiply-accumulate result.
constraints and limitations: pto.vmula is a fused multiply-accumulate operation and is not always interchangeable with separate vmul plus vadd.

Index Generation¶

`pto.vci`¶

syntax: %result = pto.vci %index {order = "ORDER"} : integer -> !pto.vreg<NxT>
semantics: Generate lane index vector.

for (int i = 0; i < N; i++)
    dst[i] = base_index + i;

Use case: Generate indices for gather/scatter, argsort, etc.

inputs: %index is the scalar seed/base index.
outputs: %result is the generated index vector.
constraints and limitations: The arithmetic/indexing use of the instruction set; the conversion page also records the same opcode for completeness.

UB-to-UB Operations¶

`pto.vtranspose`¶

syntax: pto.vtranspose %dest, %src, %config : !pto.ptr<T, ub>, !pto.ptr<T, ub>, i64
semantics: UB-to-UB transpose operation (not vreg-to-vreg).

Note: This operates on UB memory directly, not on vector registers.

inputs: %dest and %src are UB pointers and %config is the ISA control/config word.
outputs: This op writes UB memory and returns no SSA value.
constraints and limitations: This is not a vreg -> vreg op even though it lives in the pto.v* namespace. Its correctness depends on the control word and UB layout contract.

Sorting Operations¶

`pto.vsort32`¶

syntax: pto.vsort32 %dest, %src, %config : !pto.ptr<T, ub>, !pto.ptr<T, ub>, i64
semantics: Sort 32 elements in UB.
inputs: %dest and %src are UB pointers and %config is the ISA control/config word.
outputs: This op writes UB memory and returns no SSA value.
constraints and limitations: This is a UB-to-UB accelerator helper, not a pure vector-register op.

`pto.vbitsort`¶

syntax: pto.vbitsort %dest, %src, %indices, %repeat_times : !pto.ptr<T, ub>, !pto.ptr<T, ub>, !pto.ptr<T, ub>, index
semantics: Sort 32 region proposals by score (descending) and materialize sorted proposal records into %dest.
inputs: %dest is the UB destination buffer, %src is the UB score buffer, %indices is the UB index buffer, and %repeat_times controls how many adjacent groups of 32 elements to process.
outputs: This op writes UB memory and returns no SSA value. Each output record is 8 bytes: upper 4 bytes = index, lower 4 bytes = score.
constraints and limitations: Scores are sorted in descending order. Equal-score ties are stable. All pointers MUST be UB-backed. A5-specific (VBS32 hardware unit).

`pto.vmrgsort`¶

syntax: pto.vmrgsort4 %dest, %src0, %src1, %src2, %src3, %count, %config : !pto.ptr<T, ub>, !pto.ptr<T, ub> x4, i64, i64
semantics: Merge-sort 4 pre-sorted input vectors.
inputs: %dest is the UB destination, %src0..%src3 are the four pre-sorted UB inputs, %count is the number of valid elements, and %config is the operation control word.
outputs: This op writes UB memory and returns no SSA value.
constraints and limitations: Inputs MUST already be sorted according to the sort order encoded by %config. The discussion below uses the shorter mnemonic pto.vmrgsort, while the current implementation summary still refers to pto.vmrgsort4.

Current Implementation Instruction Set Summary¶

pto.vmull %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>, !pto.vreg<NxT>
pto.vmula %acc, %lhs, %rhs, %mask : !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>
pto.vci %index {order = "ORDER"} : integer -> !pto.vreg<NxT>
pto.vbitsort %dest, %src, %indices, %repeat_times : !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, index
pto.vmrgsort4 %dest, %src0, %src1, %src2, %src3, %count, %config : !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, i64, i64

Typical Usage¶

// Softmax with fused expdiff
%max_broadcast = pto.vlds %ub_max[%c0] {dist = "BRC_B32"} : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
%exp_stable = pto.vexpdif %logits, %max_broadcast : !pto.vreg<64xf32>, !pto.vreg<64xf32> -> !pto.vreg<64xf32>

// Leaky ReLU activation
%activated = pto.vlrelu %linear_out, %alpha_scalar, %mask : !pto.vreg<64xf32>, f32, !pto.mask<b32> -> !pto.vreg<64xf32>

// Fused residual add + ReLU
%residual = pto.vaddrelu %conv_out, %skip_connection : !pto.vreg<64xf32>, !pto.vreg<64xf32> -> !pto.vreg<64xf32>

// Generate indices for argsort
%indices = pto.vci %c0 {order = "ASC"} : i32 -> !pto.vreg<64xi32>

Vector Instruction Set: SFU And DSA Instructions¶

Common Operand Model¶

Fused Activation Ops (vreg→vreg)¶

pto.vlrelu¶

pto.vprelu¶

pto.vexpdif¶

Fused Compute+Convert Ops¶

pto.vaddrelu¶

pto.vsubrelu¶

pto.vaxpy¶

pto.vaddreluconv¶

pto.vmulconv¶