Vector Instruction Set: Reduction Instructions¶

pto.v* reduction instruction sets are defined here. Lane grouping, result placement, and inactive-lane rules are part of the visible vector contract and are not left to backend folklore.

Category: Vector reduction operations Pipeline: PIPE_V (Vector Core)

Operations that reduce a vector to a scalar or per-group result.

Common Operand Model¶

%input is the source vector register value.
%mask is the predicate operand Pg; inactive lanes do not participate.
%result is the destination vector register value.
Reduction results are written into the low-significance portion of the destination vector and the remaining destination bits are zero-filled.

Execution Model: vecscope¶

Reduction operations execute inside a pto.vecscope { ... } region. Cross-lane reductions (vcadd/vcmax/vcmin) are issued to PIPE_V and perform tree-structured reduction in a single instruction. VLane-group reductions (vcgadd/vcgmax/vccgmin) operate within each 32-byte VLane independently.

Typical pattern for row-wise sum (Softmax denominator):

pto.vecscope {
  %active = pto.pset_b32 "PAT_ALL" : !pto.mask<G>
  scf.for %row = %c0 to %row_count step %c1 {
    %vec = pto.vlds %ub_q[%row] : !pto.ptr -> !pto.vreg<64xf32>
    %row_sum_raw = pto.vcadd %vec, %active : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
    // row_sum_raw[0] contains the sum
    pto.vsts %row_sum_raw, %ub_sum[%row], %one_mask {dist = "1PT"} : ...
  }
}

Cross-lane reduction mechanism: - vcadd: Tree reduction — pairs of lanes are added recursively until lane 0 holds the total - vcmax/vcmin: Tree reduction with value+index packing in lane 0 - A PIPE_V barrier (pto.barrier #pto.pipe) is needed after group reductions when chaining with subsequent vector ops

A5 Latency and Throughput (Ascend910_9599)¶

All values are popped→retire cycle counts on the cycle-accurate simulator.

Latency Summary Table¶

PTO op	A5 RV (CA)	f32	f16	i32	i16
`pto.vcadd`	`RV_VCADD`	19	21	19	17
`pto.vcmax` / `vcmin`	`RV_VCMIN`	19	21	19	17
`pto.vcpadd`	`RV_VCPADD`	19	21	—	—
`pto.vcgadd`	`RV_VCGADD`	19	21	19	17
`pto.vcgmax` / `vcgmin`	`RV_VCGMAX`	19	21	19	17

A2/A3 Latency and Throughput¶

Metric	Constant	Value (cycles)	Applies To
Startup latency	`A2A3_STARTUP_REDUCE`	13	all reduction ops
Completion: FP group reduce (f16)	`A2A3_COMPL_FP_CGOP`	21	`vcgadd`/`vcgmax`/`vcgmin` (f16)
Completion: FP reduce (f32)	`A2A3_COMPL_FP_BINOP`	19	`vcadd`/`vcmax`/`vcmin` (f32)
Completion: INT reduce (i16)	`A2A3_COMPL_INT_BINOP`	17	all INT16 reductions
Completion: INT reduce (i32/f32)	`A2A3_COMPL_FP_BINOP`	19	all INT32/FP32 reductions
Per-repeat throughput	`A2A3_RPT_1`	1	INT16 group reductions
Per-repeat throughput	`A2A3_RPT_2`	2	INT32/FP32/FP16 reductions
Pipeline interval	`A2A3_INTERVAL`	18	all vector ops

Cycle model (A2/A3): total_cycles = startup + completion + repeats × per_repeat + (repeats - 1) × interval

Full Vector Reductions¶

`pto.vcadd`¶

syntax: %result = pto.vcadd %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>
A5 types: i16-i64, f16, f32
semantics: Sum all elements. Result in lane 0, others zeroed.

T sum = 0;
for (int i = 0; i < N; i++)
    sum += src[i];
dst[0] = sum;
for (int i = 1; i < N; i++)
    dst[i] = 0;

inputs: %input is the source vector and %mask selects participating lanes.
outputs: %result contains the reduction result in its low element(s).
constraints and limitations: Some narrow integer forms may widen the internal accumulation or result placement. If all predicate bits are zero, the result is zero.

`pto.vcmax`¶

syntax: %result = pto.vcmax %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>
A5 types: i16-i32, f16, f32
semantics: Find max element with argmax. Result value + index in lane 0.

T mx = -INF; int idx = 0;
for (int i = 0; i < N; i++)
    if (src[i] > mx) { mx = src[i]; idx = i; }
dst_val[0] = mx;
dst_idx[0] = idx;

inputs: %input is the source vector and %mask selects participating lanes.
outputs: %result carries the reduction result in the low destination positions.
constraints and limitations: This instruction set computes both the extremum and location information, but the exact packing of that information into the destination vector depends on the chosen form. If all predicate bits are zero, the result follows the zero-filled convention.

`pto.vcmin`¶

syntax: %result = pto.vcmin %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>
A5 types: i16-i32, f16, f32
semantics: Find min element with argmin. Result value + index in lane 0.

T mn = INF; int idx = 0;
for (int i = 0; i < N; i++)
    if (src[i] < mn) { mn = src[i]; idx = i; }
dst_val[0] = mn;
dst_idx[0] = idx;

inputs: %input is the source vector and %mask selects participating lanes.
outputs: %result carries the reduction result in the low destination positions.
constraints and limitations: As with pto.vcmax, the exact value/index packing depends on the chosen form and MUST be preserved consistently.

Per-VLane (Group) Reductions¶

The vector register is organized as 8 VLanes of 32 bytes each. Group reductions operate within each VLane independently.

vreg layout (f32 example, 64 elements total):
VLane 0: [0..7]   VLane 1: [8..15]  VLane 2: [16..23] VLane 3: [24..31]
VLane 4: [32..39] VLane 5: [40..47] VLane 6: [48..55] VLane 7: [56..63]

`pto.vcgadd`¶

syntax: %result = pto.vcgadd %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>
A5 types: i16-i32, f16, f32
semantics: Sum within each VLane. 8 results at indices 0, 8, 16, 24, 32, 40, 48, 56 (for f32).

int K = N / 8;  // elements per VLane
for (int g = 0; g < 8; g++) {
    T sum = 0;
    for (int i = 0; i < K; i++)
        sum += src[g*K + i];
    dst[g*K] = sum;
    for (int i = 1; i < K; i++)
        dst[g*K + i] = 0;
}
// For f32: results at dst[0], dst[8], dst[16], dst[24], dst[32], dst[40], dst[48], dst[56]

inputs: %input is the source vector and %mask selects participating lanes.
outputs: %result contains one sum per 32-byte VLane group, written contiguously into the low slot of each group.
constraints and limitations: This is a per-32-byte VLane-group reduction. Inactive lanes are treated as zero.

`pto.vcgmax`¶

syntax: %result = pto.vcgmax %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>
A5 types: i16-i32, f16, f32
semantics: Max within each VLane.

int K = N / 8;
for (int g = 0; g < 8; g++) {
    T mx = -INF;
    for (int i = 0; i < K; i++)
        if (src[g*K + i] > mx) mx = src[g*K + i];
    dst[g*K] = mx;
    for (int i = 1; i < K; i++)
        dst[g*K + i] = 0;
}

inputs: %input is the source vector and %mask selects participating lanes.
outputs: %result contains one maximum per 32-byte VLane group.
constraints and limitations: Grouping is by hardware 32-byte VLane, not by arbitrary software subvector.

`pto.vcgmin`¶

syntax: %result = pto.vcgmin %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>
A5 types: i16-i32, f16, f32
semantics: Min within each VLane.

int K = N / 8;
for (int g = 0; g < 8; g++) {
    T mn = INF;
    for (int i = 0; i < K; i++)
        if (src[g*K + i] < mn) mn = src[g*K + i];
    dst[g*K] = mn;
    for (int i = 1; i < K; i++)
        dst[g*K + i] = 0;
}

inputs: %input is the source vector and %mask selects participating lanes.
outputs: %result contains one minimum per 32-byte VLane group.
constraints and limitations: Grouping is by hardware 32-byte VLane, not by arbitrary software subvector.

Prefix Operations¶

`pto.vcpadd`¶

syntax: %result = pto.vcpadd %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>
A5 types: f16, f32
semantics: Inclusive prefix sum (scan).

dst[0] = src[0];
for (int i = 1; i < N; i++)
    dst[i] = dst[i-1] + src[i];

Example:

// input:  [1, 2, 3, 4, 5, ...]
// output: [1, 3, 6, 10, 15, ...]

inputs: %input is the source vector and %mask selects participating lanes.
outputs: %result is the inclusive prefix-sum vector.
constraints and limitations: Only floating-point element types are documented on the current A5 instruction set here.

Typical Usage¶

// Softmax: find max for numerical stability
%max_vec = pto.vcmax %logits, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
// max is in lane 0, broadcast it
%max_broadcast = pto.vlds %ub_tmp[%c0] {dist = "BRC_B32"} : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>

// Row-wise sum using vcgadd (for 8-row tile)
%row_sums = pto.vcgadd %tile, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
// Results at indices 0, 8, 16, 24, 32, 40, 48, 56

// Full vector sum for normalization
%total = pto.vcadd %values, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
// total[0] contains the sum

// Prefix sum for cumulative distribution
%cdf = pto.vcpadd %pdf, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>