Matrix And Matrix-Vector Instruction Set¶
This family covers the cube-pipeline instructions that evaluate matrix products on tile buffers. The basic forms produce a new accumulator tile, the _acc forms continue accumulation on an existing accumulator tile, the _bias forms inject a bias tile, and the *_mx forms add explicit scale tiles for block-scale MX formats.
These instructions are not generic vector-tile operations. Their legality depends on dedicated matrix roles such as Left, Right, Acc, Bias, ScaleLeft, and ScaleRight, plus the target profile's layout and datatype rules.
Operations¶
| Operation | Purpose | C++ intrinsic | Notes |
|---|---|---|---|
| pto.tmatmul | Matrix multiply producing a fresh accumulator tile | TMATMUL(C, A, B) | New result tile |
| pto.tmatmul_acc | Matrix multiply that continues accumulation | TMATMUL_ACC(C, A, B) | K-loop body form |
| pto.tmatmul_bias | Matrix multiply with column bias | TMATMUL_BIAS(C, A, B, bias) | Bias tile is one row |
| pto.tmatmul_mx | Matrix multiply in MX block-scale format | TMATMUL_MX(C, A, AScale, B, BScale) | A5 only |
| pto.tgemv | Matrix-vector multiply producing a fresh accumulator tile | TGEMV(C, A, B) | m = 1 GEMV shape |
| pto.tgemv_acc | GEMV that continues accumulation | TGEMV_ACC(C, A, B) | Accumulating form |
| pto.tgemv_bias | GEMV with bias add | TGEMV_BIAS(C, A, B, bias) | Bias tile is one row |
| pto.tgemv_mx | GEMV in MX block-scale format | TGEMV_MX(C, A, AScale, B, BScale) | A5 only |
Why This Family Exists¶
PTO keeps matrix-product instructions separate from ordinary tile arithmetic because the cube path has different operand roles, different legality checks, and different target constraints from the vector path. A reader needs one place that answers:
- which tile roles are legal,
- how accumulation differs from fresh output generation,
- how bias is injected,
- and which profile-specific layout rules apply on A2A3 versus A5.
Mechanism¶
TMATMUL¶
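For M = a.GetValidRow(), K = a.GetValidCol(), and N = b.GetValidCol():

$$ C_{i,j} = \sum_{k=0}^{K-1} A_{i,k} \, B_{k,j}, \qquad 0 \le i < M,\ 0 \le j < N $$

pto.tmatmul treats the destination accumulator as a newly produced output tile, so its previous contents do not contribute to the result.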
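As a rough host-side illustration of the fresh-output behavior, the following sketch computes the product for the M, K, and N defined above. The function name `ref_tmatmul` and the row-major `std::vector<float>` buffers are assumptions made for this example; they are not the PTO API and say nothing about the real tile layouts.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference model, not the PTO API: row-major host buffers
// stand in for the Left (A), Right (B), and Acc (C) tiles.
std::vector<float> ref_tmatmul(const std::vector<float>& a,
                               const std::vector<float>& b,
                               std::size_t M, std::size_t K, std::size_t N) {
    assert(a.size() == M * K && b.size() == K * N);
    // Fresh output: C starts from zero, so no earlier accumulator
    // contents can leak into the result.
    std::vector<float> c(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t k = 0; k < K; ++k)
            for (std::size_t j = 0; j < N; ++j)
                c[i * N + j] += a[i * K + k] * b[k * N + j];
    return c;
}
```

Calling `ref_tmatmul(a, b, M, K, N)` returns a new M x N buffer each time, which mirrors the "newly produced output tile" wording.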
TMATMUL_ACC¶
This form exists for split-K and blocked GEMM loops where a partial accumulator must be carried across iterations.
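A split-K loop can be pictured with the host-side sketch below. It is only a model of the data flow: `ref_split_k` and the chunk size parameter `k_chunk` are hypothetical names, and the buffers are plain row-major arrays rather than tiles. The first chunk plays the role of the fresh form, and every later chunk plays the role of the accumulating form.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a split-K GEMM: K is walked in chunks and the
// partial accumulator c is carried from one chunk to the next.
std::vector<float> ref_split_k(const std::vector<float>& a,
                               const std::vector<float>& b,
                               std::size_t M, std::size_t K, std::size_t N,
                               std::size_t k_chunk) {
    assert(k_chunk > 0 && a.size() == M * K && b.size() == K * N);
    std::vector<float> c(M * N, 0.0f);  // zero start models the fresh form
    for (std::size_t k0 = 0; k0 < K; k0 += k_chunk) {
        const std::size_t k1 = std::min(K, k0 + k_chunk);
        // Each pass adds one K-slice on top of the carried accumulator.
        for (std::size_t i = 0; i < M; ++i)
            for (std::size_t k = k0; k < k1; ++k)
                for (std::size_t j = 0; j < N; ++j)
                    c[i * N + j] += a[i * K + k] * b[k * N + j];
    }
    return c;
}
```

With exact inputs, the chunked result matches the single-pass product; with floating-point inputs the summation order can change the rounding.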
TMATMUL_BIAS¶
Bias is a one-row tile and is broadcast by output column.
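The column broadcast can be sketched on the host as follows. `ref_tmatmul_bias` is a hypothetical name, the buffers are assumed to be row-major, and the bias is modeled as N values, one per output column.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference model: C[i][j] = bias[j] + sum_k A[i][k] * B[k][j].
std::vector<float> ref_tmatmul_bias(const std::vector<float>& a,
                                    const std::vector<float>& b,
                                    const std::vector<float>& bias,
                                    std::size_t M, std::size_t K, std::size_t N) {
    assert(a.size() == M * K && b.size() == K * N && bias.size() == N);
    std::vector<float> c(M * N);
    for (std::size_t i = 0; i < M; ++i) {
        // The single bias row is replicated into every output row.
        for (std::size_t j = 0; j < N; ++j) c[i * N + j] = bias[j];
        for (std::size_t k = 0; k < K; ++k)
            for (std::size_t j = 0; j < N; ++j)
                c[i * N + j] += a[i * K + k] * b[k * N + j];
    }
    return c;
}
```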
TGEMV¶
GEMV is the m = 1 specialization of the same cube contract. PTO still exposes it as a separate instruction family because it has its own operand spelling and its own common usage pattern.
MX Variants¶
*_mx uses block-scale MX formats such as MXFP4 and MXFP8. Those forms require:
- one left operand tile in Left,
- one right operand tile in Right,
- one left scale tile in ScaleLeft,
- one right scale tile in ScaleRight,
- and an accumulator/output tile in Acc.
MX is not "one extra scale tensor". It is a paired scale-tile contract on both sides of the product.
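To make the paired-scale idea concrete, here is a loose numerical sketch. Everything about it is an assumption for illustration: the name `ref_matmul_mx`, the use of `float` for both elements and scales, the block size passed as a parameter, and the choice to store one left scale per (row, K-block) and one right scale per (K-block, column). The source does not specify the real scale-tile layout or encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: each K-block of the product is weighted by a left
// scale and a right scale before it is added to the accumulator.
std::vector<float> ref_matmul_mx(const std::vector<float>& a,
                                 const std::vector<float>& a_scale,
                                 const std::vector<float>& b,
                                 const std::vector<float>& b_scale,
                                 std::size_t M, std::size_t K, std::size_t N,
                                 std::size_t block) {
    assert(block > 0 && K % block == 0);
    const std::size_t KB = K / block;  // number of scale blocks along K
    assert(a.size() == M * K && b.size() == K * N);
    assert(a_scale.size() == M * KB && b_scale.size() == KB * N);
    std::vector<float> c(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t kb = 0; kb < KB; ++kb) {
                float partial = 0.0f;  // unscaled dot product of one block
                for (std::size_t k = kb * block; k < (kb + 1) * block; ++k)
                    partial += a[i * K + k] * b[k * N + j];
                // Both sides contribute a scale for this block.
                c[i * N + j] += a_scale[i * KB + kb] * b_scale[kb * N + j] * partial;
            }
    return c;
}
```

The point of the sketch is that dropping either scale buffer changes the result, which is why the MX forms take both ScaleLeft and ScaleRight.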
Tile Roles And Buffer Mapping¶
The architectural tile roles are abstractions over target tile buffers:
- Left is the left matrix operand tile and corresponds to the L0A-backed operand path.
- Right is the right matrix operand tile and corresponds to the L0B-backed operand path.
- Acc is the accumulator/output tile.
- Bias is the one-row bias tile used by *_bias.
- ScaleLeft and ScaleRight are the scale tiles used by MX block-scale variants.
Programs should not assume one portable physical layout for Right. A2A3 and A5 both use the Right role, but the legal right-tile layout details differ by target profile.
Target Profiles¶
A2A3 in this manual means the Ascend 910B and Ascend 910C class targets. A5 means the Ascend 950 PR and Ascend 950 DT class targets.
| Capability | CPU simulator | A2A3 | A5 |
|---|---|---|---|
| TMATMUL, TMATMUL_ACC, TMATMUL_BIAS | Yes | Yes | Yes |
| TGEMV, TGEMV_ACC, TGEMV_BIAS | Yes | Yes | Yes |
| int8 cube path | No | Yes | Yes |
| fp16 / bf16 / fp32 cube path | Yes | Yes | Yes |
| fp8 cube path | No | No | Yes |
| MX block-scale path | No | No | Yes |
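The capability table above can be encoded as a small lookup, for example to gate test cases by target. The enum and function names below are hypothetical and exist only for this sketch; the yes/no values are copied from the table.

```cpp
#include <cassert>

// Hypothetical names; the truth values mirror the capability table.
enum class Target { CpuSim, A2A3, A5 };
enum class Capability { Matmul, Gemv, Int8Cube, FloatCube, Fp8Cube, MxBlockScale };

constexpr bool supports(Target t, Capability c) {
    switch (c) {
        case Capability::Matmul:
        case Capability::Gemv:
        case Capability::FloatCube:     // fp16 / bf16 / fp32 cube path
            return true;
        case Capability::Int8Cube:      // not available on the CPU simulator
            return t != Target::CpuSim;
        case Capability::Fp8Cube:
        case Capability::MxBlockScale:  // A5 only
            return t == Target::A5;
    }
    return false;
}

// Debug-build guard a caller might place before issuing an operation.
inline void require_capability(Target t, Capability c) {
    assert(supports(t, c) && "capability not available on this target");
}
```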
Common Legality¶
- Shapes must satisfy (M, K) x (K, N) -> (M, N) for matmul.
- GEMV uses the same contract with m = 1.
- Left, Right, Acc, Bias, and MX scale-tile roles must match the operation being issued.
- Valid-region values outside the legal output domain are not repaired implicitly.
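The shape part of these rules can be written as a pair of predicates. This is a minimal sketch with hypothetical names (`TileShape`, `matmul_shapes_ok`, `gemv_shapes_ok`); it covers only the shape contract, not roles, layouts, or datatypes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical valid-region shape of a tile: rows x cols.
struct TileShape {
    std::size_t rows;
    std::size_t cols;
};

// (M, K) x (K, N) -> (M, N)
constexpr bool matmul_shapes_ok(TileShape a, TileShape b, TileShape c) {
    return a.cols == b.rows && c.rows == a.rows && c.cols == b.cols;
}

// GEMV is the same contract restricted to m = 1.
constexpr bool gemv_shapes_ok(TileShape a, TileShape b, TileShape c) {
    return a.rows == 1 && matmul_shapes_ok(a, b, c);
}
```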
A2A3 Notes¶
- The base cube path supports the repository's documented triples such as (int32, int8, int8) and (float, half, half).
- Dynamic m, k, and n are constrained to [1, 4095].
- The backend checks the Left/Right/Acc role combination explicitly.
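The dynamic-dimension bound can be checked up front with a small helper. The names `kA2A3MaxDynamicDim` and `a2a3_dynamic_dims_ok` are hypothetical; only the [1, 4095] range comes from the notes above.

```cpp
#include <cassert>
#include <cstddef>

// Upper bound for dynamic m, k, and n on A2A3, per the notes above.
constexpr std::size_t kA2A3MaxDynamicDim = 4095;

constexpr bool in_a2a3_range(std::size_t v) {
    return v >= 1 && v <= kA2A3MaxDynamicDim;
}

// True when all three dynamic dimensions fall inside [1, 4095].
constexpr bool a2a3_dynamic_dims_ok(std::size_t m, std::size_t k, std::size_t n) {
    return in_a2a3_range(m) && in_a2a3_range(k) && in_a2a3_range(n);
}
```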
A5 Notes¶
- The base cube path accepts int32 accumulators for int8 input pairs and float accumulators for fp16, bf16, fp32, and selected fp8 pairs.
- The Right role has A5-specific layout/fractal constraints; do not copy an A2A3 right-tile layout assumption onto A5.
- MX variants are A5-only and require both ScaleLeft and ScaleRight.
Performance And Throughput¶
The repository currently exposes an A2A3 cost-model formula for the shared mad/mmad cube instruction used by TMATMUL, TMATMUL_ACC, TMATMUL_BIAS, TGEMV, TGEMV_ACC, and TGEMV_BIAS.
For A2A3:
- startup cost: 14 cycles,
- repeat count: ceil(M/16) * ceil(N/16) * ceil(K / baseK),
- baseK = 32 / sizeof(left_element_type),
- steady-state cost per repeat: 1 cycle for int8 and fp16 buckets, 2 cycles for fp32 buckets.
So the published A2A3 model is:
cycles = 14 + repeat_count * repeat_cost
Examples backed by tests/costmodel/st/testcase/tmatmul/tmatmul_kernel.cpp include:
- half 40x50 * 50x60: 62 cycles,
- int8 6x7 * 7x8: 15 cycles,
- float 120x110 * 110x50: 910 cycles.
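The three figures above can be reproduced from the A2A3 formula with a few lines of arithmetic. The sketch below is not the repository's cost-model code; `CubeType`, `ceil_div`, and `a2a3_mad_cycles` are hypothetical names, and only the constants (14, 16, 32, and the per-bucket repeat cost) come from the text. bf16 is left out because the text does not state its bucket.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical element buckets named in the A2A3 model.
enum class CubeType { Int8, Fp16, Fp32 };

constexpr std::size_t ceil_div(std::size_t a, std::size_t b) {
    return (a + b - 1) / b;
}

constexpr std::size_t elem_bytes(CubeType t) {
    return t == CubeType::Int8 ? 1 : (t == CubeType::Fp16 ? 2 : 4);
}

// 1 cycle per repeat for int8 and fp16 buckets, 2 cycles for fp32.
constexpr std::size_t repeat_cost(CubeType t) {
    return t == CubeType::Fp32 ? 2 : 1;
}

// cycles = 14 + ceil(M/16) * ceil(N/16) * ceil(K / (32 / sizeof(elem))) * cost
constexpr std::size_t a2a3_mad_cycles(std::size_t M, std::size_t K,
                                      std::size_t N, CubeType left_type) {
    return 14 + ceil_div(M, 16) * ceil_div(N, 16) *
                    ceil_div(K, 32 / elem_bytes(left_type)) *
                    repeat_cost(left_type);
}
```

For the half case, the repeat count is 3 * 4 * 4 = 48, giving 14 + 48 = 62 cycles; for the float case it is 8 * 4 * 14 = 448 repeats at 2 cycles each, giving 14 + 896 = 910 cycles.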
The current repository does not publish an equivalent A5 latency or throughput table for this family. A5 legality is specified, but the public cost-model headers do not list A5 cycle figures.