Matrix And Matrix-Vector Instruction Set

This family covers the cube-pipeline instructions that evaluate matrix products on tile buffers. The basic forms produce a new accumulator tile, the *_acc forms continue accumulation on an existing accumulator tile, the *_bias forms inject a bias tile, and the *_mx forms add explicit scale tiles for block-scale MX formats.

These instructions are not generic vector-tile operations. Their legality depends on dedicated matrix roles such as Left, Right, Acc, Bias, ScaleLeft, and ScaleRight, plus the target profile's layout and datatype rules.

Operations

| Operation | Purpose | C++ intrinsic | Notes |
|---|---|---|---|
| pto.tmatmul | Matrix multiply producing a fresh accumulator tile | TMATMUL(C, A, B) | New result tile |
| pto.tmatmul_acc | Matrix multiply that continues accumulation | TMATMUL_ACC(C, A, B) | K-loop body form |
| pto.tmatmul_bias | Matrix multiply with column bias | TMATMUL_BIAS(C, A, B, bias) | Bias tile is one row |
| pto.tmatmul_mx | Matrix multiply in MX block-scale format | TMATMUL_MX(C, A, AScale, B, BScale) | A5 only |
| pto.tgemv | Matrix-vector multiply producing a fresh accumulator tile | TGEMV(C, A, B) | M = 1 GEMV shape |
| pto.tgemv_acc | GEMV that continues accumulation | TGEMV_ACC(C, A, B) | Accumulating form |
| pto.tgemv_bias | GEMV with bias add | TGEMV_BIAS(C, A, B, bias) | Bias tile is one row |
| pto.tgemv_mx | GEMV in MX block-scale format | TGEMV_MX(C, A, AScale, B, BScale) | A5 only |

Why This Family Exists

PTO keeps matrix-product instructions separate from ordinary tile arithmetic because the cube path has different operand roles, different legality checks, and different target constraints from the vector path. A reader needs one place that answers:

  • which tile roles are legal,
  • how accumulation differs from fresh output generation,
  • how bias is injected,
  • and which profile-specific layout rules apply on A2A3 versus A5.

Mechanism

TMATMUL

For M = a.GetValidRow(), K = a.GetValidCol(), and N = b.GetValidCol():

\[ \mathrm{C}_{i,j} = \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} \]

pto.tmatmul treats the destination accumulator as a newly produced output tile.
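
The fresh-output semantics can be written as a host-side reference model. This is a minimal sketch, not the PTO implementation: `RefTile` and `ref_tmatmul` are hypothetical names, the tile is modelled as a dense row-major float buffer covering only its valid region, and accumulation order and rounding on real hardware may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a tile: only the valid region is stored, row-major.
struct RefTile {
    std::size_t validRow;
    std::size_t validCol;
    std::vector<float> data;

    RefTile(std::size_t r, std::size_t c) : validRow(r), validCol(c), data(r * c, 0.0f) {}
    float& at(std::size_t i, std::size_t j) { return data[i * validCol + j]; }
    float at(std::size_t i, std::size_t j) const { return data[i * validCol + j]; }
};

// Reference model of the fresh-output form: the result is built from zero,
// so no prior accumulator contents can leak into it.
RefTile ref_tmatmul(const RefTile& a, const RefTile& b) {
    assert(a.validCol == b.validRow);  // (M, K) x (K, N)
    const std::size_t M = a.validRow, K = a.validCol, N = b.validCol;
    RefTile c(M, N);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                sum += a.at(i, k) * b.at(k, j);
            }
            c.at(i, j) = sum;
        }
    }
    return c;
}
```

A model like this is mainly useful as an oracle when comparing simulator or device output for small shapes.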

TMATMUL_ACC

\[ \mathrm{C1}_{i,j} = \mathrm{C0}_{i,j} + \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} \]

Here C0 is the accumulator tile's content on entry and C1 is its content after the instruction. This form exists for split-K and blocked GEMM loops where a partial accumulator must be carried across iterations.
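
A split-K loop typically issues the fresh-output form for the first K-slice and the accumulating form for every later slice. The sketch below models that pattern on the host; `Mat`, `ref_issue`, and `ref_split_k` are hypothetical names, and slicing by an index range stands in for loading separate Left and Right tiles per iteration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense row-major matrix used as a host-side model.
struct Mat {
    std::size_t rows, cols;
    std::vector<float> v;
    Mat(std::size_t r, std::size_t c) : rows(r), cols(c), v(r * c, 0.0f) {}
    float& at(std::size_t i, std::size_t j) { return v[i * cols + j]; }
    float at(std::size_t i, std::size_t j) const { return v[i * cols + j]; }
};

// Models one cube issue over the K-slice [k0, k1).
// accumulate == false: fresh-output form, prior contents of c are ignored.
// accumulate == true:  accumulating form, C1 = C0 + partial product.
void ref_issue(Mat& c, const Mat& a, const Mat& b,
               std::size_t k0, std::size_t k1, bool accumulate) {
    assert(a.cols == b.rows && c.rows == a.rows && c.cols == b.cols);
    assert(k0 <= k1 && k1 <= a.cols);
    for (std::size_t i = 0; i < c.rows; ++i) {
        for (std::size_t j = 0; j < c.cols; ++j) {
            float sum = accumulate ? c.at(i, j) : 0.0f;
            for (std::size_t k = k0; k < k1; ++k) {
                sum += a.at(i, k) * b.at(k, j);
            }
            c.at(i, j) = sum;
        }
    }
}

// Split-K driver: the first slice starts a new accumulator, later slices continue it.
void ref_split_k(Mat& c, const Mat& a, const Mat& b, std::size_t kTile) {
    assert(kTile > 0);
    for (std::size_t k0 = 0; k0 < a.cols; k0 += kTile) {
        const std::size_t k1 = std::min(k0 + kTile, a.cols);
        ref_issue(c, a, b, k0, k1, /*accumulate=*/k0 != 0);
    }
}
```

Because the first slice does not read the accumulator, the destination does not need to be cleared before the loop.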

TMATMUL_BIAS

\[ \mathrm{C}_{i,j} = \sum_{k=0}^{K-1} \mathrm{A}_{i,k} \cdot \mathrm{B}_{k,j} + \mathrm{Bias}_{0,j} \]

Bias is a one-row tile and is broadcast by output column.
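
The column broadcast can be sketched as follows. `Matrix` and `ref_tmatmul_bias` are hypothetical host-side names, and the sketch assumes the bias row has exactly N valid columns; the source states only that the bias tile has one row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Reference model: every output row receives the same bias row,
// so biasRow[j] is added to every element in output column j.
Matrix ref_tmatmul_bias(const Matrix& a, const Matrix& b, const std::vector<float>& biasRow) {
    const std::size_t M = a.size();
    const std::size_t K = b.size();
    const std::size_t N = biasRow.size();  // assumption: bias width equals N
    Matrix c(M, std::vector<float>(N, 0.0f));
    for (std::size_t i = 0; i < M; ++i) {
        assert(a[i].size() == K);
        for (std::size_t j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                assert(b[k].size() == N);
                sum += a[i][k] * b[k][j];
            }
            c[i][j] = sum + biasRow[j];
        }
    }
    return c;
}
```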

TGEMV

GEMV is the M = 1 specialization of the same cube contract. PTO still exposes it as a separate instruction family because it has its own operand spelling and its own common usage pattern.
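
With a single left-operand row, the product reduces to a row vector times a matrix. A minimal host-side sketch, using the hypothetical name `ref_tgemv` and plain vectors in place of tiles:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference model of the single-row case: x has length K, b is K x N,
// and the result is one output row of length N.
std::vector<float> ref_tgemv(const std::vector<float>& x,
                             const std::vector<std::vector<float>>& b) {
    const std::size_t K = x.size();
    assert(b.size() == K);
    const std::size_t N = (K == 0) ? 0 : b[0].size();
    std::vector<float> y(N, 0.0f);
    for (std::size_t k = 0; k < K; ++k) {
        assert(b[k].size() == N);
        for (std::size_t j = 0; j < N; ++j) {
            y[j] += x[k] * b[k][j];
        }
    }
    return y;
}
```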

MX Variants

The *_mx forms use block-scale MX formats such as MXFP4 and MXFP8. They require:

  • one left operand tile in Left,
  • one right operand tile in Right,
  • one left scale tile in ScaleLeft,
  • one right scale tile in ScaleRight,
  • and an accumulator/output tile in Acc.

MX is not "one extra scale tensor". It is a paired scale-tile contract on both sides of the product.
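
To illustrate why both scale tiles are needed, the sketch below computes one output element of a block-scaled dot product. It rests on assumptions that are not confirmed by this manual: blocks of 32 consecutive elements along K share one scale, as in the OCP Microscaling formats, and scales are modelled as plain floats rather than packed exponent values. `ref_mx_dot` and `kMxBlock` are hypothetical names, and the actual PTO scale-tile layout may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed block size along K (taken from the OCP MX formats, not from this manual).
constexpr std::size_t kMxBlock = 32;

// One output element: each K-block contributes its raw dot product
// multiplied by BOTH the left scale and the right scale for that block.
float ref_mx_dot(const std::vector<float>& left, const std::vector<float>& leftScale,
                 const std::vector<float>& right, const std::vector<float>& rightScale) {
    const std::size_t K = left.size();
    const std::size_t blocks = (K + kMxBlock - 1) / kMxBlock;
    assert(right.size() == K);
    assert(leftScale.size() == blocks && rightScale.size() == blocks);
    float acc = 0.0f;
    for (std::size_t blk = 0; blk < blocks; ++blk) {
        const std::size_t k0 = blk * kMxBlock;
        const std::size_t k1 = std::min(k0 + kMxBlock, K);
        float partial = 0.0f;
        for (std::size_t k = k0; k < k1; ++k) {
            partial += left[k] * right[k];
        }
        acc += leftScale[blk] * rightScale[blk] * partial;
    }
    return acc;
}
```

Dropping either scale vector changes the result per block, which is why a single shared scale tensor cannot express the same contract.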

Tile Roles And Buffer Mapping

The architectural tile roles are abstractions over target tile buffers:

  • Left is the left matrix operand tile and corresponds to the L0A-backed operand path.
  • Right is the right matrix operand tile and corresponds to the L0B-backed operand path.
  • Acc is the accumulator/output tile.
  • Bias is the one-row bias tile used by *_bias.
  • ScaleLeft and ScaleRight are the scale tiles used by MX block-scale variants.

Programs should not assume one portable physical layout for Right. A2A3 and A5 both use the Right role, but the legal right-tile layout details differ by target profile.

Target Profiles

A2A3 in this manual means the Ascend 910B and Ascend 910C class targets. A5 means the Ascend 950 PR and Ascend 950 DT class targets.

| Capability | CPU simulator | A2A3 | A5 |
|---|---|---|---|
| TMATMUL, TMATMUL_ACC, TMATMUL_BIAS | Yes | Yes | Yes |
| TGEMV, TGEMV_ACC, TGEMV_BIAS | Yes | Yes | Yes |
| int8 cube path | No | Yes | Yes |
| fp16 / bf16 / fp32 cube path | Yes | Yes | Yes |
| fp8 cube path | No | No | Yes |
| MX block-scale path | No | No | Yes |

Common Legality

  • Shapes must satisfy (M, K) x (K, N) -> (M, N) for matmul.
  • GEMV uses the same contract with M = 1.
  • Left, Right, Acc, Bias, and MX scale-tile roles must match the operation being issued.
  • Valid-region values outside the legal output domain are not repaired implicitly.
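
The shape rules above can be expressed as small predicates. This is an illustrative sketch only: `Shape` and the function names are hypothetical, and the checks cover shape agreement alone, not tile roles, layouts, or datatypes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical (rows, cols) description of a tile's valid region.
struct Shape {
    std::size_t rows;
    std::size_t cols;
};

// (M, K) x (K, N) -> (M, N)
bool matmul_shapes_legal(Shape left, Shape right, Shape acc) {
    return left.cols == right.rows && acc.rows == left.rows && acc.cols == right.cols;
}

// GEMV is the same contract restricted to a single left-operand row.
bool gemv_shapes_legal(Shape left, Shape right, Shape acc) {
    return left.rows == 1 && matmul_shapes_legal(left, right, acc);
}

// Returns the (M, N) accumulator shape implied by the operands.
Shape acc_shape_for(Shape left, Shape right) {
    assert(left.cols == right.rows);  // inner dimension K must agree
    return Shape{left.rows, right.cols};
}
```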

A2A3 Notes

  • The base cube path supports the repository's documented triples such as (int32, int8, int8) and (float, half, half).
  • Dynamic m, k, and n are constrained to [1, 4095].
  • The backend checks the Left/Right/Acc role combination explicitly.

A5 Notes

  • The base cube path accepts int32 accumulators for int8 input pairs and float accumulators for fp16, bf16, fp32, and selected fp8 pairs.
  • The Right role has A5-specific layout/fractal constraints; do not copy an A2A3 right-tile layout assumption onto A5.
  • MX variants are A5-only and require both ScaleLeft and ScaleRight.

Performance And Throughput

The repository currently exposes an A2A3 cost-model formula for the shared mad/mmad cube instruction used by TMATMUL, TMATMUL_ACC, TMATMUL_BIAS, TGEMV, TGEMV_ACC, and TGEMV_BIAS.

For A2A3:

  • startup cost: 14 cycles,
  • repeat count: ceil(M/16) * ceil(N/16) * ceil(K/baseK),
  • baseK = 32 / sizeof(left_element_type),
  • steady-state cost per repeat: 1 cycle for int8 and fp16 buckets, 2 cycles for fp32 buckets.

So the published A2A3 model is:

cycles = 14 + repeat_count * repeat_cost

Examples backed by tests/costmodel/st/testcase/tmatmul/tmatmul_kernel.cpp include:

  • half 40x50 * 50x60: 62 cycles,
  • int8 6x7 * 7x8: 15 cycles,
  • float 120x110 * 110x50: 910 cycles.
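
The A2A3 formula is simple enough to evaluate by hand or in a few lines of code. The sketch below is a restatement of the published formula, not the repository's cost-model code; `Bucket`, `ceil_div`, and `a2a3_cube_cycles` are hypothetical names, and only the three buckets named above are modelled.

```cpp
#include <cassert>
#include <cstdint>

// Element buckets for which a per-repeat cost is stated above.
enum class Bucket { Int8, Fp16, Fp32 };

// Ceiling division for positive integers.
std::uint64_t ceil_div(std::uint64_t x, std::uint64_t y) {
    return (x + y - 1) / y;
}

// cycles = 14 + repeat_count * repeat_cost, with the K step of each repeat
// equal to 32 / sizeof(left_element_type).
std::uint64_t a2a3_cube_cycles(std::uint64_t M, std::uint64_t K, std::uint64_t N, Bucket bucket) {
    assert(M > 0 && K > 0 && N > 0);
    const std::uint64_t elemBytes = (bucket == Bucket::Int8) ? 1 : (bucket == Bucket::Fp16) ? 2 : 4;
    const std::uint64_t kStep = 32 / elemBytes;
    const std::uint64_t repeatCount = ceil_div(M, 16) * ceil_div(N, 16) * ceil_div(K, kStep);
    const std::uint64_t repeatCost = (bucket == Bucket::Fp32) ? 2 : 1;
    return 14 + repeatCount * repeatCost;
}
```

For the half 40x50 * 50x60 case, the K step is 16, so the repeat count is 3 * 4 * 4 = 48 and the total is 14 + 48 = 62 cycles. The int8 and float cases work out the same way to 14 + 1 = 15 and 14 + 448 * 2 = 910.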

The current repository does not publish an equivalent A5 latency or throughput table for this family. A5 legality is specified, but the public cost-model headers do not list A5 cycle figures.

See Also