Cube Micro-Instruction Reference

This section documents the PTO Cube micro-instruction surface: the matrix-multiply (MAD) and cube-side data-movement ops that program the cube core (AIC) and its dedicated buffer hierarchy (L1 / L0A / L0B / L0C / BT).

Scope and audience

Tile-level matrix ops such as pto.tmatmul (covered under Tile ISA Matrix & Matrix-Vector) hide most of these primitives behind a tile-shaped interface. The cube micro-instructions documented here are the lower-level surface that compiler back-ends and hand-tuned cube kernels target directly. They make the NZ fractal layout, the L1/L0 buffer hierarchy, and the FIXPIPE writeback path explicit.

Architectural Background

Page                Purpose
NZ Fractal Layout   The fractal NZ format used by L1, L0A, L0B, and L0C. Defines the (k1, m1, m0, k0) re-indexing and per-buffer layout variants.
Buffer Hierarchy    The L1 / L0A / L0B / L0C / BT memory hierarchy: address spaces, sizes, and data-flow contracts.
FIXPIPE Model       The FIXPIPE writeback path: how L0C results are converted back to ND and routed to UB or GM.
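
To make the (k1, m1, m0, k0) re-indexing concrete, the following is a minimal Python sketch of how a row-major (ND) matrix A[M,K] could be linearized in that index order. It is an illustration, not the normative layout definition: the function names nz_offset and nd_to_nz are hypothetical, the fractal sizes m0 and k0 are free parameters (the real values depend on the buffer and element type and are not stated here), and padding of shapes that are not multiples of the fractal size is not modeled.

```python
def nz_offset(m, k, M, m0, k0):
    """Linear element offset of logical element (m, k) when A[M, K] is stored
    in (k1, m1, m0, k0) order. Assumes M is a multiple of m0 (no padding)."""
    if M % m0 != 0:
        raise ValueError("M must be a multiple of m0 in this sketch")
    m1_count = M // m0            # number of fractal rows
    m1, mi = divmod(m, m0)        # fractal row index, row inside the fractal
    k1, ki = divmod(k, k0)        # fractal column index, column inside the fractal
    return ((k1 * m1_count + m1) * m0 + mi) * k0 + ki


def nd_to_nz(a, m0, k0):
    """Re-pack a row-major nested list a[M][K] into a flat NZ-ordered list."""
    M, K = len(a), len(a[0])
    if K % k0 != 0:
        raise ValueError("K must be a multiple of k0 in this sketch")
    out = [None] * (M * K)
    for m in range(M):
        for k in range(K):
            out[nz_offset(m, k, M, m0, k0)] = a[m][k]
    return out


# Toy 4x4 matrix with 2x2 fractals (sizes chosen only to keep the example small).
a = [[m * 4 + k for k in range(4)] for m in range(4)]
nz = nd_to_nz(a, m0=2, k0=2)
```

With these toy sizes, each 2x2 fractal is stored contiguously in row-major order, and fractals are visited down a column of fractals (m1) before moving to the next fractal column (k1).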

Matrix Multiply (MAD) Ops

The MAD family computes dst = lhs @ rhs, optionally accumulating into dst or adding a per-column bias, on tiles staged into the cube's L0A / L0B / L0C buffers. All variants share the same (M, N, K) shape parameters and a common set of optional clauses (unit_flag, disable_gemv, sat/nosat, tf32_mode, n_dir).

Op                Semantics
pto.mad           Zero-init: dst = lhs @ rhs
pto.mad_acc       Accumulate: dst = dst + lhs @ rhs
pto.mad_bias      Bias-init: dst = lhs @ rhs + bias[n]
pto.mad_mx        Zero-init MX (microscaled) matmul
pto.mad_mx_acc    Accumulating MX matmul
pto.mad_mx_bias   Bias-init MX matmul
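
The three non-MX semantics above can be sketched as a small Python reference model. This models only the arithmetic on logical (ND) matrices; it ignores the NZ layout, the L0 buffers, data types, saturation, and every optional clause, and it does not attempt the MX (microscaled) variants. The Python function names mirror the op names for readability and are not an API, and split_k is a hypothetical helper showing one plausible use of the zero-init and accumulate forms together.

```python
def matmul(lhs, rhs):
    """Plain reference product of lhs[M][K] and rhs[K][N]."""
    if len(lhs[0]) != len(rhs):
        raise ValueError("inner dimensions must match")
    return [[sum(l * r for l, r in zip(row, col)) for col in zip(*rhs)]
            for row in lhs]


def mad(lhs, rhs):
    """Zero-init: dst = lhs @ rhs."""
    return matmul(lhs, rhs)


def mad_acc(dst, lhs, rhs):
    """Accumulate: dst = dst + lhs @ rhs."""
    prod = matmul(lhs, rhs)
    return [[d + p for d, p in zip(drow, prow)] for drow, prow in zip(dst, prod)]


def mad_bias(lhs, rhs, bias):
    """Bias-init: dst = lhs @ rhs + bias[n], bias broadcast along M."""
    prod = matmul(lhs, rhs)
    return [[p + bias[n] for n, p in enumerate(prow)] for prow in prod]


def split_k(a, b, k_split):
    """Split the K dimension in two: zero-init on the first chunk,
    then accumulate the second chunk into the same result."""
    a_lo = [row[:k_split] for row in a]
    a_hi = [row[k_split:] for row in a]
    return mad_acc(mad(a_lo, b[:k_split]), a_hi, b[k_split:])


lhs = [[1, 2], [3, 4]]
rhs = [[5, 6], [7, 8]]
full = mad(lhs, rhs)
chunked = split_k(lhs, rhs, 1)
biased = mad_bias(lhs, rhs, [10, 20])
```

In this model, a zero-init step followed by an accumulate step over a split K dimension gives the same result as a single full-K product, which is the usual reason an accumulate form exists alongside a zero-init form.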

Cube Data Movement Ops

These ops move tiles between GM, L1, L0A/L0B, and L0C using grouped nburst(...) / loop(...) clauses analogous to the scalar DMA Copy surface.

GM → L1

L1 ↔ UB

  • pto.mte_l1_ub — L1→UB transfer (cube-to-vector data path)
  • pto.mte_ub_l1 — UB→L1 transfer (vector-to-cube data path; lives in the scalar DMA section)

L1 → L0A / L0B (cube operand load)

L1 → BT (bias)

  • pto.mte_l1_bt — Stage bias vector into BT for pto.mad_bias / pto.mad_mx_bias
  • pto.mte_l1_fb — Stage FIXPIPE-relevant payload (e.g., dequant params)

L0C writeback (FIXPIPE)

Full Cube Pipeline

GM (ND)                              L1/cbuf (NZ)                L0A/B (NZ)              L0C (NZ)        GM (ND)

A[M,K] --mte_gm_l1_frac/mte_gm_l1--> K1 M1 M0 K0 --mte_l1_l0a--> K1 M1 M0 K0 --+
                                                                               +--MAD--> N1 M1 M0 N0 --> C[M,N]
B[K,N] --mte_gm_l1_frac/mte_gm_l1--> K1 N1 K0 N0 --mte_l1_l0b--> K1 N1 N0 K0 --+
                                                        ^
                                     transpose is part of mte_l1_l0b when requested,
                                     NOT part of GM->L1
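
The diagram above can be traced end to end with a small Python model that uses nested lists to stand in for the buffers. This is a sketch under stated assumptions, not a description of the hardware: all function names are hypothetical, the fractal sizes m0, k0, and n0 are toy values, shapes are assumed to be exact multiples of the fractal sizes, and the final step stands in for whatever NZ-to-ND conversion the FIXPIPE path performs. Only the index orders (K1 M1 M0 K0, K1 N1 K0 N0, K1 N1 N0 K0, N1 M1 M0 N0) are taken from the diagram.

```python
def pack_a(a, m0, k0):
    """ND A[M][K] -> nested list indexed [k1][m1][m0][k0] (L1 and L0A order)."""
    M, K = len(a), len(a[0])
    return [[[[a[m1 * m0 + mi][k1 * k0 + ki] for ki in range(k0)]
              for mi in range(m0)]
             for m1 in range(M // m0)]
            for k1 in range(K // k0)]


def pack_b(b, k0, n0):
    """ND B[K][N] -> nested list indexed [k1][n1][k0][n0] (L1 order)."""
    K, N = len(b), len(b[0])
    return [[[[b[k1 * k0 + ki][n1 * n0 + ni] for ni in range(n0)]
              for ki in range(k0)]
             for n1 in range(N // n0)]
            for k1 in range(K // k0)]


def transpose_inner(b_l1):
    """[k1][n1][k0][n0] -> [k1][n1][n0][k0]: the per-fractal transpose that the
    diagram places on the L1 -> L0B step."""
    return [[[list(col) for col in zip(*frac)] for frac in row] for row in b_l1]


def mad_nz(a_l0a, b_l0b):
    """Multiply operands in L0A/L0B order; result indexed [n1][m1][m0][n0]."""
    K1, M1 = len(a_l0a), len(a_l0a[0])
    m0, k0 = len(a_l0a[0][0]), len(a_l0a[0][0][0])
    N1, n0 = len(b_l0b[0]), len(b_l0b[0][0])
    return [[[[sum(a_l0a[k1][m1][mi][ki] * b_l0b[k1][n1][ni][ki]
                   for k1 in range(K1) for ki in range(k0))
               for ni in range(n0)]
              for mi in range(m0)]
             for m1 in range(M1)]
            for n1 in range(N1)]


def unpack_c(c):
    """[n1][m1][m0][n0] -> ND C[M][N]."""
    N1, M1, m0, n0 = len(c), len(c[0]), len(c[0][0]), len(c[0][0][0])
    return [[c[n // n0][m // m0][m % m0][n % n0] for n in range(N1 * n0)]
            for m in range(M1 * m0)]


def cube_matmul(a, b, m0=2, k0=2, n0=2):
    """Chain the stages in the same order as the diagram."""
    a_l0a = pack_a(a, m0, k0)                      # GM -> L1 -> L0A
    b_l0b = transpose_inner(pack_b(b, k0, n0))     # GM -> L1, then L1 -> L0B
    return unpack_c(mad_nz(a_l0a, b_l0b))          # MAD, then L0C -> ND


a = [[1, 2, 3, 4], [5, 6, 7, 8]]        # M=2, K=4
b = [[1, 0], [0, 1], [2, 0], [0, 2]]    # K=4, N=2
c = cube_matmul(a, b)
```

The model keeps B untransposed while it sits in the L1 order and applies the inner transpose only on the way to L0B, which mirrors the note in the diagram.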