Cube Micro-Instruction Reference¶

This section documents the PTO Cube micro-instruction surface: the matrix-multiply (MAD) and cube-side data-movement ops that program the cube core (AIC) and its dedicated buffer hierarchy (L1 / L0A / L0B / L0C / BT).

Scope and audience

Tile-level matrix ops such as pto.tmatmul (covered under Tile ISA Matrix & Matrix-Vector) hide most of these primitives behind a tile-shaped interface. The cube micro-instructions documented here are the lower-level surface that compiler back-ends and hand-tuned cube kernels target directly. They make NZ fractal layout, L1/L0 buffer hierarchy, and FIXPIPE writeback explicit.

Architectural Background¶

Page	Purpose
NZ Fractal Layout	The fractal NZ format used by L1, L0A, L0B, and L0C. Defines the `(k1, m1, m0, k0)` re-indexing and per-buffer layout variants.
Buffer Hierarchy	The L1 / L0A / L0B / L0C / BT memory hierarchy: address spaces, sizes, and data-flow contracts.
FIXPIPE Model	The FIXPIPE writeback path: how L0C results are converted back to ND and routed to UB or GM.

Matrix Multiply (MAD) Ops¶

The MAD family computes dst = lhs @ rhs on tiles staged into the cube's L0A / L0B / L0C buffers. All variants share the same (M, N, K) shape parameters and a common set of optional clauses (unit_flag, disable_gemv, sat/nosat, tf32_mode, n_dir).

Op	Semantics
pto.mad	Zero-init: `dst = lhs @ rhs`
pto.mad_acc	Accumulate: `dst = dst + lhs @ rhs`
pto.mad_bias	Bias-init: `dst = lhs @ rhs + bias[n]`
pto.mad_mx	Zero-init MX (microscaled) matmul
pto.mad_mx_acc	Accumulating MX matmul
pto.mad_mx_bias	Bias-init MX matmul

Cube Data Movement Ops¶

These ops move tiles between GM, L1, L0A/L0B, and L0C using grouped nburst(...) / loop(...) clauses analogous to the scalar DMA Copy surface.

GM → L1¶

pto.mte_gm_l1 — Direct GM→L1 load (no layout transform)
pto.mte_gm_l1_frac — GM→L1 with ND→NZ fractal repack

L1 ↔ UB¶

pto.mte_l1_ub — L1→UB transfer (cube-to-vector data path)
pto.mte_ub_l1 — UB→L1 transfer (vector-to-cube data path; lives in the scalar DMA section)

L1 → L0A / L0B (cube operand load)¶

pto.mte_l1_l0a — Stage L1 NZ tile into L0A (left operand)
pto.mte_l1_l0b — Stage L1 NZ tile into L0B (right operand, K-innermost transpose)
pto.mte_l1_l0a_mx — Load MX scale payload for L0A
pto.mte_l1_l0b_mx — Load MX scale payload for L0B

L1 → BT (bias)¶

pto.mte_l1_bt — Stage bias vector into BT for pto.mad_bias / pto.mad_mx_bias
pto.mte_l1_fb — Stage FIXPIPE-relevant payload (e.g., dequant params)

L0C writeback (FIXPIPE)¶

pto.mte_l0c_l1 — FIXPIPE: L0C → L1
pto.mte_l0c_gm — FIXPIPE: L0C → GM
pto.mte_l0c_ub — FIXPIPE: L0C → UB

Full Cube Pipeline¶

GM (ND)          L1/cbuf (NZ)            L0A/B (NZ)          L0C (NZ)    GM (ND)

A[M,K] --mte_gm_l1_frac/mte_gm_l1--> K1 M1 M0 K0 --mte_l1_l0a-->  K1 M1 M0 K0 -+
                                                             +-MAD-> N1 M1 M0 N0 --> C[M,N]
B[K,N] --mte_gm_l1_frac/mte_gm_l1--> K1 N1 K0 N0 --mte_l1_l0b--> K1 N1 N0 K0 -+
                               ^
                    transpose as part of mte_l1_l0b when requested
                    NOT at GM->L1

Tile ISA: Matrix and Matrix-Vector — Tile-level matrix ops
Scalar DMA Copy — UB-side DMA grouped transfers
Pipeline Synchronization — Cube/Vector synchronization primitives