Vector Instruction Set: Unary Vector Instructions¶
Single-input pto.v* compute instruction sets are defined here. Unless a form states otherwise, the vector-register shape, active-lane mask semantics, and target-profile restrictions below define the portable contract.
Category: Single-input vector operations Pipeline: PIPE_V (Vector Core)
Element-wise operations that take one vector input and produce one vector output.
Common Operand Model¶
%inputis the source vector register value.%maskis the predicate operand. For this instruction set, inactive lanes follow the predication behavior of the selected instruction form: zeroing forms zero-fill inactive lanes, while merging forms preserve the destination value.%resultis the destination vector register value. Unless stated otherwise,%resulthas the same lane count and element type as%input.
Execution Model: vecscope¶
Unary vector operations execute inside a pto.vecscope { ... } region, which establishes the Vector Core's execution context. The pto.vecscope region is implicitly scoped to PIPE_V; all vector instructions inside it are issued to the Vector pipeline.
Typical loop structure:
pto.vecscope {
%remaining_init = arith.constant 1024 : i32
%_:1 = scf.for %offset = %c0 to %total step %c64
iter_args(%remaining = %remaining_init) -> (i32) {
%mask, %next_remaining = pto.plt_b32 %remaining : i32 -> !pto.mask<G>, i32
%vec = pto.vlds %ub_in[%offset] : !pto.ptr -> !pto.vreg<64xf32>
%out = pto.vabs %vec, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
pto.vsts %out, %ub_out[%offset], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask<b32>
scf.yield %next_remaining : i32
}
}
Predicate generation: %mask = pto.pset_b32 "PAT_ALL" creates a full-active mask; pto.plt_b32 %remaining generates a tail mask based on the number of remaining elements.
A5 Latency and Throughput (Ascend910_9599)¶
All values are popped→retire cycle counts on the cycle-accurate simulator. Float16 types use
aclFloat16tracing.
Latency Summary Table¶
| PTO op | A5 RV (CA) | f32 | f16 | bf16 | i32 | i16 | i8 |
|---|---|---|---|---|---|---|---|
pto.vabs |
RV_VABS_FP |
5 | 5 | — | 5 | 5 | 5 |
pto.vneg |
RV_VMULS |
8 | 8 | — | 8 | 8 | 8 |
pto.vexp |
RV_VEXP |
16 | 21 | — | — | — | — |
pto.vln |
RV_VLN |
18 | 23 | — | — | — | — |
pto.vsqrt |
RV_VSQRT |
17 | 22 | — | — | — | — |
pto.vrelu |
RV_VRELU |
5 | 5 | — | — | — | — |
pto.vrec |
RV_VREC |
(see note) | (see note) | — | — | — | — |
pto.vrsqrt |
RV_VRSQRT |
(see note) | (see note) | — | — | — | — |
pto.vnot |
RV_VNOT |
— | — | — | 5 | 5 | 5 |
pto.vbcnt |
— | — | — | — | (per-lane) | (per-lane) | (per-lane) |
pto.vcls |
— | — | — | — | (per-lane) | (per-lane) | (per-lane) |
pto.vmov |
RV_VLD (proxy) |
9 | 9 | — | 9 | 9 | 9 |
Note on reciprocals:
vrec and vrsqrt are synthesized from vdiv and vsqrt respectively; their latency matches the corresponding divide instruction throughput.
A2/A3 Latency and Throughput¶
| Metric | Constant | Value (cycles) | Applies To |
|---|---|---|---|
| Startup latency (reduce/transcendental) | A2A3_STARTUP_REDUCE |
13 | vexp, vsqrt, vln |
| Startup latency (binary/arith) | A2A3_STARTUP_BINARY |
14 | vabs, vneg, vadd, vmul |
| Completion: FP binary ops | A2A3_COMPL_FP_BINOP |
19 | vabs, vneg, vadd (f32), vsub (f32) |
| Completion: INT binary ops | A2A3_COMPL_INT_BINOP |
17 | vabs/vadd/vsub (int16/i32) |
| Completion: FP transcendental | A2A3_COMPL_FP32_EXP |
26 | vexp (f32) |
| Completion: FP transcendental | A2A3_COMPL_FP16_EXP |
28 | vexp (f16) |
| Completion: FP transcendental | A2A3_COMPL_FP32_SQRT |
27 | vsqrt (f32) |
| Completion: FP transcendental | A2A3_COMPL_FP16_SQRT |
29 | vsqrt (f16) |
| Per-repeat throughput | A2A3_RPT_1 |
1 | scalar/unary/simple ops |
| Per-repeat throughput | A2A3_RPT_2 |
2 | binary ops (vadd, vmul) |
| Per-repeat throughput | A2A3_RPT_4 |
4 | transcendental ops (vexp, vsqrt f16) |
| Pipeline interval | A2A3_INTERVAL |
18 | all vector ops |
| Pipeline interval (vmov/vcopy) | A2A3_INTERVAL_VCOPY |
13 | vmov |
Cycle model (A2/A3):
total_cycles = startup + completion + repeats × per_repeat + (repeats - 1) × interval
Arithmetic¶
pto.vabs¶
- syntax:
%result = pto.vabs %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT> - A5 RV:
RV_VABS_FP; Latency: 5 (f32/f16), 5 (i32/i16/i8) - A2/A3 throughput: 1 cycle/repeat; interval: 18 cycles
for (int i = 0; i < N; i++)
dst[i] = (src[i] < 0) ? -src[i] : src[i];
- inputs:
%inputsupplies the source lanes and%maskselects which lanes participate. - outputs:
%resultreceives the lane-wise absolute values. - constraints and limitations: Source and result types MUST match. Integer overflow on the most-negative signed value follows the target-defined behavior.
pto.vneg¶
- syntax:
%result = pto.vneg %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT> - A5 RV:
RV_VMULS(uses scalar-multiply hardware); Latency: 8 (f32/f16), 8 (i32/i16/i8) - A2/A3 throughput: 1 cycle/repeat; interval: 18 cycles
for (int i = 0; i < N; i++)
dst[i] = -src[i];
- inputs:
%inputis the source vector and%maskselects active lanes. - outputs:
%resultis the lane-wise arithmetic negation. - constraints and limitations: Source and result types MUST match.
Transcendental¶
pto.vexp¶
- syntax:
%result = pto.vexp %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT> - A5 RV:
RV_VEXP; Latency: 16 (f32), 21 (f16) - A2/A3 throughput: 2 cycles/repeat (f32), 4 cycles/repeat (f16); interval: 18 cycles
for (int i = 0; i < N; i++)
dst[i] = expf(src[i]);
- inputs:
%inputis the source vector and%maskselects active lanes. - outputs:
%resultholdsexp(input[i])per active lane. - constraints and limitations: Only floating-point element types are legal.
- Performance note: f32 is significantly faster than f16 on A5 (16 vs 21 cycles). For f16, prefer
vexpdif(fused exp-diff) for numerical stability in softmax.
pto.vln¶
- syntax:
%result = pto.vln %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT> - A5 RV:
RV_VLN; Latency: 18 (f32), 23 (f16) - A2/A3 throughput: 2 cycles/repeat (f32), 4 cycles/repeat (f16); interval: 18 cycles
for (int i = 0; i < N; i++)
dst[i] = logf(src[i]);
- inputs:
%inputis the source vector and%maskselects active lanes. - outputs:
%resultholds the natural logarithm per active lane. - constraints and limitations: Only floating-point element types are legal. For real-number semantics, active inputs SHOULD be strictly positive; non-positive inputs follow the target's exception/NaN rules.
pto.vsqrt¶
- syntax:
%result = pto.vsqrt %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT> - A5 RV:
RV_VSQRT; Latency: 17 (f32), 22 (f16) - A2/A3 throughput: 2 cycles/repeat (f32), 4 cycles/repeat (f16); interval: 18 cycles
for (int i = 0; i < N; i++)
dst[i] = sqrtf(src[i]);
- inputs:
%inputis the source vector and%maskselects active lanes. - outputs:
%resultholds the square root per active lane. - constraints and limitations: Only floating-point element types are legal. Negative active inputs follow the target's exception/NaN rules.
- Performance note:
vrsqrt(reciprocal square root) uses the same hardware asvsqrtand costs equivalent cycles.
pto.vrsqrt¶
- syntax:
%result = pto.vrsqrt %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT> - Latency: equivalent to
vsqrt; usesRV_VRSQRThardware path - A2/A3 throughput: 2 cycles/repeat (f32), 4 cycles/repeat (f16); interval: 18 cycles
for (int i = 0; i < N; i++)
dst[i] = 1.0f / sqrtf(src[i]);
- inputs:
%inputis the source vector and%maskselects active lanes. - outputs:
%resultholds reciprocal-square-root values per active lane. - constraints and limitations: Only floating-point element types are legal. Active inputs containing
+0or-0follow the target's divide-style exceptional behavior.
pto.vrec¶
- syntax:
%result = pto.vrec %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT> - Latency: synthesized via
vdiv; throughput matchesvdiv - A2/A3 throughput: 1 cycle/repeat; interval: 18 cycles
for (int i = 0; i < N; i++)
dst[i] = 1.0f / src[i];
- inputs:
%inputis the source vector and%maskselects active lanes. - outputs:
%resultholds the reciprocal per active lane. - constraints and limitations: Only floating-point element types are legal. Active inputs containing
+0or-0follow the target's divide-style exceptional behavior.
Activation¶
pto.vrelu¶
- syntax:
%result = pto.vrelu %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT> - A5 RV:
RV_VRELU; Latency: 5 (f32/f16) - A2/A3 throughput: 1 cycle/repeat; interval: 18 cycles
for (int i = 0; i < N; i++)
dst[i] = (src[i] > 0) ? src[i] : 0;
- inputs:
%inputis the source vector and%maskselects active lanes. - outputs:
%resultholdsmax(input[i], 0)per active lane. - constraints and limitations: Only floating-point element types are legal on the current A5 instruction set described here.
- Performance note:
vreluis the lowest-latency unary operation (5 cycles). Usevlrelufor leaky-ReLU (adds one scalar multiply).
Bitwise¶
pto.vnot¶
- syntax:
%result = pto.vnot %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT> - A5 RV:
RV_VNOT; Latency: 5 (integer types only) - A2/A3 throughput: 1 cycle/repeat; interval: 18 cycles
for (int i = 0; i < N; i++)
dst[i] = ~src[i];
- inputs:
%inputis the source vector and%maskselects active lanes. - outputs:
%resultholds the lane-wise bitwise inversion. - constraints and limitations: Integer element types only.
pto.vbcnt¶
- syntax:
%result = pto.vbcnt %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT> - semantics: Population count — counts the number of set bits in each lane's element.
for (int i = 0; i < N; i++)
dst[i] = __builtin_popcount(src[i]);
- inputs:
%inputis the source vector and%maskselects active lanes. - outputs:
%resultholds the population count for each active lane. - constraints and limitations: Integer element types only. The count is over the source element width, not over the full vector register.
pto.vcls¶
- syntax:
%result = pto.vcls %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT> - semantics: Count leading sign bits — for a signed integer, counts how many bits from the MSB are equal to the sign bit.
for (int i = 0; i < N; i++)
dst[i] = count_leading_sign_bits(src[i]);
- inputs:
%inputis the source vector and%maskselects active lanes. - outputs:
%resultholds the leading-sign-bit count per active lane. - constraints and limitations: Integer element types only. This operation is sign-aware, so signed interpretation matters.
Movement¶
pto.vmov¶
- syntax:
%result = pto.vmov %input, %mask : !pto.vreg<NxT>, !pto.mask<G> -> !pto.vreg<NxT> - A5 RV:
RV_VLD(proxy); Latency: 9 (f32/f16), 9 (integer) - A2/A3 throughput: 1 cycle/repeat; interval: 13 cycles (
A2A3_INTERVAL_VCOPY)
for (int i = 0; i < N; i++)
dst[i] = src[i];
- inputs:
%inputis the source vector and%maskselects active lanes. - outputs:
%resultis a copy of the source vector. - constraints and limitations: Predicated
pto.vmovbehaves like a masked copy, while the unpredicated form behaves like a full-register copy.
Typical Usage¶
// Softmax numerator: exp(x - max) using vexp
%sub = pto.vsub %x, %max_broadcast, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
%exp = pto.vexp %sub, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
// Reciprocal for division
%sum_rcp = pto.vrec %sum, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
// ReLU activation (lowest latency unary on A5: 5 cycles)
%activated = pto.vrelu %linear_out, %mask : !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>