Vector Instruction Set: Reduction Instructions
This page defines the pto.v* reduction instructions. Lane grouping, result placement, and inactive-lane rules are part of the visible vector contract and are not left to backend folklore.
Category: Vector reduction operations
Pipeline: PIPE_V (Vector Core)
Operations that reduce a vector to a scalar or per-group result.
Common Operand Model
- %input is the source vector register value.
- %mask is the predicate operand Pg; inactive lanes do not participate.
- %result is the destination vector register value.
- Reduction results are written into the low-significance portion of the destination vector and the remaining destination bits are zero-filled.
Execution Model: vecscope
Reduction operations execute inside a pto.vecscope { ... } region. Cross-lane reductions (vcadd/vcmax/vcmin) are issued to PIPE_V and perform a tree-structured reduction in a single instruction. VLane-group reductions (vcgadd/vcgmax/vcgmin) operate within each 32-byte VLane independently.
Typical pattern for row-wise sum (Softmax denominator):
pto.vecscope {
  %active = pto.pset_b32 "PAT_ALL" : !pto.mask<b32>
  scf.for %row = %c0 to %row_count step %c1 {
    %vec = pto.vlds %ub_q[%row] : !pto.ptr -> !pto.vreg<64xf32>
    %row_sum_raw = pto.vcadd %vec, %active : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
    // row_sum_raw[0] contains the sum
    // %one_mask is assumed to be defined outside this snippet
    pto.vsts %row_sum_raw, %ub_sum[%row], %one_mask {dist = "1PT"} : ...
  }
}
Cross-lane reduction mechanism:
- vcadd: Tree reduction — pairs of lanes are added recursively until lane 0 holds the total
- vcmax/vcmin: Tree reduction with value+index packing in lane 0
- A PIPE_V barrier (pto.barrier #pto.pipe) is needed after group reductions when chaining with subsequent vector ops
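The pairwise scheme described for vcadd can be modeled in a few lines of Python. This is only an illustrative sketch: the function name `tree_reduce_add` is hypothetical, a power-of-two lane count is assumed, and the pairing order actually used by the hardware is not specified on this page.

```python
def tree_reduce_add(lanes):
    """Pairwise (tree) sum over a power-of-two number of lanes.

    Illustrative model only; the pairing order is an assumption,
    not a statement about the hardware.
    """
    work = list(lanes)
    n = len(work)
    assert n > 0 and n & (n - 1) == 0, "lane count must be a power of two"
    stride = 1
    while stride < n:
        # Each surviving lane absorbs its partner `stride` positions away.
        for i in range(0, n, 2 * stride):
            work[i] += work[i + stride]
        stride *= 2
    return work[0]  # lane 0 holds the total


total = tree_reduce_add(range(64))  # 64 lanes, as in a 64xf32 vreg
```

For 64 lanes the loop runs six halving steps, which is the sense in which the reduction is tree-structured rather than a 64-step serial sum.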
A5 Latency and Throughput (Ascend910_9599)
All values are popped→retire cycle counts on the cycle-accurate simulator.
Latency Summary Table
| PTO op | A5 RV (CA) | f32 | f16 | i32 | i16 |
|---|---|---|---|---|---|
| pto.vcadd | RV_VCADD | 19 | 21 | 19 | 17 |
| pto.vcmax / vcmin | RV_VCMIN | 19 | 21 | 19 | 17 |
| pto.vcpadd | RV_VCPADD | 19 | 21 | n/a | n/a |
| pto.vcgadd | RV_VCGADD | 19 | 21 | 19 | 17 |
| pto.vcgmax / vcgmin | RV_VCGMAX | 19 | 21 | 19 | 17 |
A2/A3 Latency and Throughput
| Metric | Constant | Value (cycles) | Applies To |
|---|---|---|---|
| Startup latency | A2A3_STARTUP_REDUCE | 13 | all reduction ops |
| Completion: FP group reduce (f16) | A2A3_COMPL_FP_CGOP | 21 | vcgadd/vcgmax/vcgmin (f16) |
| Completion: FP reduce (f32) | A2A3_COMPL_FP_BINOP | 19 | vcadd/vcmax/vcmin (f32) |
| Completion: INT reduce (i16) | A2A3_COMPL_INT_BINOP | 17 | all INT16 reductions |
| Completion: INT reduce (i32/f32) | A2A3_COMPL_FP_BINOP | 19 | all INT32/FP32 reductions |
| Per-repeat throughput | A2A3_RPT_1 | 1 | INT16 group reductions |
| Per-repeat throughput | A2A3_RPT_2 | 2 | INT32/FP32/FP16 reductions |
| Pipeline interval | A2A3_INTERVAL | 18 | all vector ops |
Cycle model (A2/A3): total_cycles = startup + completion + repeats × per_repeat + (repeats - 1) × interval
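As a worked example of the cycle model, the sketch below evaluates the formula for an f32 vcadd. The constant values are copied from the A2/A3 table; the function name `total_cycles` and the choice of 4 repeats are illustrative assumptions.

```python
# Constants copied from the A2/A3 table (cycles).
A2A3_STARTUP_REDUCE = 13
A2A3_COMPL_FP_BINOP = 19
A2A3_RPT_2 = 2
A2A3_INTERVAL = 18


def total_cycles(startup, completion, repeats, per_repeat, interval):
    """Evaluate the A2/A3 cycle model for `repeats` repeats (hypothetical helper)."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    return startup + completion + repeats * per_repeat + (repeats - 1) * interval


# f32 vcadd with 4 repeats: 13 + 19 + 4*2 + 3*18
cycles = total_cycles(
    A2A3_STARTUP_REDUCE, A2A3_COMPL_FP_BINOP, 4, A2A3_RPT_2, A2A3_INTERVAL
)
print(cycles)  # 94
```

With a single repeat the interval term vanishes, so the same call with `repeats=1` gives 13 + 19 + 2 = 34 cycles.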
Full Vector Reductions
pto.vcadd
- syntax: %result = pto.vcadd %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>
- A5 types: i16-i64, f16, f32
- semantics: Sum all elements. Result in lane 0, others zeroed.
T sum = 0;
for (int i = 0; i < N; i++)
    sum += src[i];
dst[0] = sum;
for (int i = 1; i < N; i++)
    dst[i] = 0;
- inputs: %input is the source vector and %mask selects participating lanes.
- outputs: %result contains the reduction result in its low element(s).
- constraints and limitations: Some narrow integer forms may widen the internal accumulation or result placement. If all predicate bits are zero, the result is zero.
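To make the role of the predicate concrete, here is a small Python reference model of vcadd. It is a sketch under stated assumptions: the mask is modeled as one boolean per element, the name `vcadd_model` is hypothetical, and the integer widening mentioned in the constraints is not modeled.

```python
def vcadd_model(src, mask):
    """Masked full-vector sum: result in lane 0, all other lanes zero.

    Assumes one boolean predicate bit per element (an illustrative simplification).
    """
    if len(src) != len(mask):
        raise ValueError("src and mask must have the same lane count")
    # Inactive lanes do not participate in the sum.
    total = sum(x for x, active in zip(src, mask) if active)
    return [total] + [0] * (len(src) - 1)


src = [1.0] * 64
mask = [i < 10 for i in range(64)]  # only the first 10 lanes are active
dst = vcadd_model(src, mask)        # dst[0] == 10.0, every other lane is 0
```

An all-false mask yields an all-zero vector, which matches the stated rule that the result is zero when all predicate bits are zero.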
pto.vcmax
- syntax: %result = pto.vcmax %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>
- A5 types: i16-i32, f16, f32
- semantics: Find max element with argmax. Result value + index in lane 0.
T mx = -INF; int idx = 0;
for (int i = 0; i < N; i++)
    if (src[i] > mx) { mx = src[i]; idx = i; }
dst_val[0] = mx;
dst_idx[0] = idx;
- inputs: %input is the source vector and %mask selects participating lanes.
- outputs: %result carries the reduction result in the low destination positions.
- constraints and limitations: This instruction computes both the extremum and its location, but the exact packing of that information into the destination vector depends on the chosen form. If all predicate bits are zero, the result follows the zero-filled convention.
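The value-plus-index behavior can be sketched in Python as follows. The helper name `vcmax_model` is hypothetical, the mask is assumed to be one boolean per element, and the result is returned as a plain tuple because the destination packing depends on the chosen form. Tie-breaking follows the strict comparison in the pseudocode above (first occurrence wins); whether the hardware does the same is an assumption.

```python
def vcmax_model(src, mask):
    """Return (max_value, index) over active lanes, or (0, 0) if none are active."""
    best_val, best_idx = None, 0
    for i, (x, active) in enumerate(zip(src, mask)):
        # Strict '>' keeps the first occurrence when values tie.
        if active and (best_val is None or x > best_val):
            best_val, best_idx = x, i
    if best_val is None:
        return 0, 0  # zero-filled convention for an all-inactive predicate
    return best_val, best_idx


val, idx = vcmax_model([3.0, 7.0, 7.0, -1.0], [True, True, True, True])
```

Here `val` is 7.0 and `idx` is 1. Masking out lane 1 moves the reported index to lane 2, and the same structure with `<` in place of `>` models vcmin.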
pto.vcmin
- syntax: %result = pto.vcmin %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>
- A5 types: i16-i32, f16, f32
- semantics: Find min element with argmin. Result value + index in lane 0.
T mn = INF; int idx = 0;
for (int i = 0; i < N; i++)
    if (src[i] < mn) { mn = src[i]; idx = i; }
dst_val[0] = mn;
dst_idx[0] = idx;
- inputs: %input is the source vector and %mask selects participating lanes.
- outputs: %result carries the reduction result in the low destination positions.
- constraints and limitations: As with pto.vcmax, the exact value/index packing depends on the chosen form and MUST be preserved consistently.
Per-VLane (Group) Reductions
The vector register is organized as 8 VLanes of 32 bytes each. Group reductions operate within each VLane independently.
vreg layout (f32 example, 64 elements total):
VLane 0: [0..7] VLane 1: [8..15] VLane 2: [16..23] VLane 3: [24..31]
VLane 4: [32..39] VLane 5: [40..47] VLane 6: [48..55] VLane 7: [56..63]
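The same layout can be derived for other element widths from the 8 x 32-byte organization. The helper below is a hypothetical sketch (`vlane_ranges` is not a PTO API) that computes which element indices fall into each VLane.

```python
VLANE_BYTES = 32  # bytes per VLane, as stated above
NUM_VLANES = 8    # VLanes per vector register, as stated above


def vlane_ranges(elem_bytes):
    """Return the (first, last) element index covered by each VLane."""
    per_lane = VLANE_BYTES // elem_bytes
    return [(g * per_lane, g * per_lane + per_lane - 1) for g in range(NUM_VLANES)]


f32_ranges = vlane_ranges(4)  # [(0, 7), (8, 15), ..., (56, 63)]
f16_ranges = vlane_ranges(2)  # [(0, 15), (16, 31), ..., (112, 127)]
```

For f16 each VLane therefore holds 16 elements, so by the same arithmetic group results would start at indices 0, 16, 32, and so on up to 112.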
pto.vcgadd
- syntax: %result = pto.vcgadd %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>
- A5 types: i16-i32, f16, f32
- semantics: Sum within each VLane. 8 results at indices 0, 8, 16, 24, 32, 40, 48, 56 (for f32).
int K = N / 8; // elements per VLane
for (int g = 0; g < 8; g++) {
    T sum = 0;
    for (int i = 0; i < K; i++)
        sum += src[g*K + i];
    dst[g*K] = sum;
    for (int i = 1; i < K; i++)
        dst[g*K + i] = 0;
}
// For f32: results at dst[0], dst[8], dst[16], dst[24], dst[32], dst[40], dst[48], dst[56]
- inputs: %input is the source vector and %mask selects participating lanes.
- outputs: %result contains one sum per 32-byte VLane group, written into the low slot (first element) of each group.
- constraints and limitations: This is a per-32-byte VLane-group reduction. Inactive lanes are treated as zero.
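A Python reference model of the group sum, including the rule that inactive lanes are treated as zero, might look like the sketch below. The name `vcgadd_model` and the per-element boolean mask are assumptions made for illustration.

```python
def vcgadd_model(src, mask, num_groups=8):
    """Masked sum within each group; each result lands in the group's first slot."""
    n = len(src)
    if n % num_groups != 0 or len(mask) != n:
        raise ValueError("lane count must be a multiple of num_groups and match the mask")
    k = n // num_groups  # elements per VLane
    dst = [0] * n        # all non-result slots stay zero
    for g in range(num_groups):
        base = g * k
        # Inactive lanes contribute nothing, i.e. they are treated as zero.
        dst[base] = sum(src[base + i] for i in range(k) if mask[base + i])
    return dst


src = list(range(64))  # 64 f32-sized lanes holding 0..63
dst = vcgadd_model(src, [True] * 64)
result_slots = [i for i, v in enumerate(dst) if v != 0]
```

With this input every group sum is nonzero, so `result_slots` comes out as 0, 8, 16, 24, 32, 40, 48, 56, matching the f32 placement described above.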
pto.vcgmax
- syntax: %result = pto.vcgmax %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>
- A5 types: i16-i32, f16, f32
- semantics: Max within each VLane.
int K = N / 8;
for (int g = 0; g < 8; g++) {
    T mx = -INF;
    for (int i = 0; i < K; i++)
        if (src[g*K + i] > mx) mx = src[g*K + i];
    dst[g*K] = mx;
    for (int i = 1; i < K; i++)
        dst[g*K + i] = 0;
}
- inputs: %input is the source vector and %mask selects participating lanes.
- outputs: %result contains one maximum per 32-byte VLane group.
- constraints and limitations: Grouping is by hardware 32-byte VLane, not by arbitrary software subvector.
pto.vcgmin
- syntax: %result = pto.vcgmin %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>
- A5 types: i16-i32, f16, f32
- semantics: Min within each VLane.
int K = N / 8;
for (int g = 0; g < 8; g++) {
    T mn = INF;
    for (int i = 0; i < K; i++)
        if (src[g*K + i] < mn) mn = src[g*K + i];
    dst[g*K] = mn;
    for (int i = 1; i < K; i++)
        dst[g*K + i] = 0;
}
- inputs: %input is the source vector and %mask selects participating lanes.
- outputs: %result contains one minimum per 32-byte VLane group.
- constraints and limitations: Grouping is by hardware 32-byte VLane, not by arbitrary software subvector.
Prefix Operations
pto.vcpadd
- syntax: %result = pto.vcpadd %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT>
- A5 types: f16, f32
- semantics: Inclusive prefix sum (scan).
dst[0] = src[0];
for (int i = 1; i < N; i++)
    dst[i] = dst[i-1] + src[i];
Example:
// input:  [1, 2, 3, 4, 5, ...]
// output: [1, 3, 6, 10, 15, ...]
- inputs: %input is the source vector and %mask selects participating lanes.
- outputs: %result is the inclusive prefix-sum vector.
- constraints and limitations: Only floating-point element types are documented for A5 here.
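The scan semantics can be checked against a short Python model. This sketch assumes all lanes are active, since the effect of inactive lanes on the running sum is not specified here. The name `vcpadd_model` is hypothetical, and the sequential summation order may round differently from the hardware for floating-point inputs.

```python
from itertools import accumulate


def vcpadd_model(src):
    """Inclusive prefix sum: dst[i] = src[0] + ... + src[i] (all lanes active)."""
    return list(accumulate(src))


cdf = vcpadd_model([1, 2, 3, 4, 5])  # [1, 3, 6, 10, 15]
```

The last element of an inclusive scan equals the full-vector sum, so for the same input it agrees with the lane-0 result of vcadd.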
Typical Usage
// Softmax: find max for numerical stability
%max_vec = pto.vcmax %logits, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
// max is in lane 0; once it has been stored to %ub_tmp (store not shown), reload it as a broadcast
%max_broadcast = pto.vlds %ub_tmp[%c0] {dist = "BRC_B32"} : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>
// Row-wise sum using vcgadd (for 8-row tile)
%row_sums = pto.vcgadd %tile, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
// Results at indices 0, 8, 16, 24, 32, 40, 48, 56
// Full vector sum for normalization
%total = pto.vcadd %values, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
// total[0] contains the sum
// Prefix sum for cumulative distribution
%cdf = pto.vcpadd %pdf, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
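To show how the max and sum reductions fit together in a softmax row, the following host-side Python sketch mirrors the data flow. It is not PTO code: `softmax_row` is a hypothetical helper, and the built-in `max` and `sum` merely stand in for the roles of pto.vcmax and pto.vcadd.

```python
import math


def softmax_row(logits):
    """Numerically stable softmax built from a max reduction and a sum reduction."""
    row_max = max(logits)                           # role of pto.vcmax (value only)
    exps = [math.exp(x - row_max) for x in logits]  # shifted so the largest exponent is 0
    denom = sum(exps)                               # role of pto.vcadd
    return [e / denom for e in exps]


probs = softmax_row([1.0, 2.0, 3.0, 1000.0])
```

Subtracting the row maximum first is what keeps `math.exp` from overflowing on the 1000.0 entry, which is why the max reduction precedes the sum in the usage pattern above.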