NZ Fractal Layout
The cube's internal buffers (L1 / cbuf, L0A, L0B, L0C) all use a fractal NZ layout rather than row-major ND. Understanding NZ layout is essential when authoring cube data-movement ops or reasoning about MAD operand organization.
Definition
Given the hardware constant C0 = 32 bytes, for an element type with byte width E = sizeof(T):
- Inner tile width: K0 = N0 = C0 / E (for example, K0 = 16 for f16 and bf16; K0 = 8 for f32)
- Inner tile height: M0 = 16
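The tile arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the `ELEM_BYTES` table and the `inner_tile` helper are illustrative names, not part of any toolchain API.

```python
C0_BYTES = 32   # hardware constant C0 from the text
M0 = 16         # inner tile height from the text

# Element byte widths E = sizeof(T); this table is illustrative only.
ELEM_BYTES = {"f16": 2, "bf16": 2, "f32": 4}

def inner_tile(dtype: str) -> tuple[int, int, int]:
    """Return (M0, K0, bytes per M0 x K0 fractal) for an element type."""
    e = ELEM_BYTES[dtype]
    k0 = C0_BYTES // e          # K0 = N0 = C0 / E
    return M0, k0, M0 * k0 * e  # one fractal is M0 rows of C0 bytes each

for name in ELEM_BYTES:
    print(name, inner_tile(name))
```

Because each fractal row is always C0 = 32 bytes wide, the fractal size works out to 16 x 32 = 512 bytes regardless of element type.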
NZ re-indexing for a logical [M, K] tensor:
NZ index: (k1, m1, m0, k0)
where k1 = k / K0, k0 = k % K0 (integer division and remainder)
      m1 = m / M0, m0 = m % M0
Physical layout: K1 x M1 x M0 x K0 (last dimension contiguous), where K1 and M1 are the outer block counts along K and M
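As a sketch of the re-indexing, the function below maps a logical (m, k) coordinate to a linear element offset in the K1 x M1 x M0 x K0 buffer. It assumes M and K are multiples of the tile extents (padding is not covered here) and defaults to the f16 tile size; `nz_offset` is a hypothetical helper name.

```python
def nz_offset(m, k, M, K, M0=16, K0=16):
    """Linear element offset of logical (m, k) in a K1 x M1 x M0 x K0 buffer."""
    # Assumption of this sketch: shapes are tile-aligned.
    assert M % M0 == 0 and K % K0 == 0
    M1 = M // M0
    k1, k0 = divmod(k, K0)
    m1, m0 = divmod(m, M0)
    # Axes from slowest to fastest: k1, m1, m0, k0.
    return ((k1 * M1 + m1) * M0 + m0) * K0 + k0

# For a [32, 32] f16 tensor:
print(nz_offset(0, 1, 32, 32))   # next k inside a tile row   -> 1
print(nz_offset(1, 0, 32, 32))   # next row inside a fractal  -> 16
print(nz_offset(16, 0, 32, 32))  # next M1 block              -> 256
print(nz_offset(0, 16, 32, 32))  # next K1 block              -> 512
```

The mapping is a bijection on tile-aligned shapes: every logical element lands on exactly one offset in the range 0 .. M*K - 1.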
The same outer / inner factorization applies to [K, N] tensors, with N0 taking the place of K0 as the contiguous inner-width axis (K1 x N1 x K0 x N0 in L1).
Per-Buffer NZ Layouts
| Buffer | Logical shape | Physical NZ layout | Notes |
|---|---|---|---|
| L1 (cbuf), Tensor A | [M, K] | K1 M1 M0 K0 | Row-major A staged into NZ layout |
| L1 (cbuf), Tensor B | [K, N] | K1 N1 K0 N0 | Row-major B staged into NZ layout |
| L0A (left operand) | - | K1 M1 M0 K0 | FRACTAL_NZ on A5 / FRACTAL_ZZ on A3: same NZ order as L1 cbuf |
| L0B (right operand) | - | K1 N1 N0 K0 | FRACTAL_ZN: row-major outer, col-major inner (K0 innermost) |
| L0C (accumulator) | [M, N] | N1 M1 M0 N0 | Output of MMAD (FRACTAL_NZ: col-major outer, row-major inner) |
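The axis orders in the table can be turned into a small table-driven offset calculator. The sketch below assumes f16 tile extents and tile-aligned shapes; the buffer keys (`L1_A`, `L1_B`, and so on) and the `offset` helper are illustrative names rather than real identifiers.

```python
# Inner tile extents for f16 (M0 = 16, K0 = N0 = 32 / 2 = 16).
INNER = {"M0": 16, "K0": 16, "N0": 16}

# Axis order per buffer, copied from the table (slowest to fastest).
LAYOUTS = {
    "L1_A": ("K1", "M1", "M0", "K0"),
    "L1_B": ("K1", "N1", "K0", "N0"),
    "L0A":  ("K1", "M1", "M0", "K0"),
    "L0B":  ("K1", "N1", "N0", "K0"),
    "L0C":  ("N1", "M1", "M0", "N0"),
}

def offset(buffer, logical, shape):
    """Element offset of a logical coordinate in the named buffer.

    logical and shape are dicts keyed by the two axes the buffer uses,
    for example {"K": 1, "N": 0} and {"K": 32, "N": 32}.
    """
    idx, ext = {}, {}
    for axis, value in logical.items():
        inner = INNER[axis + "0"]
        assert shape[axis] % inner == 0, "sketch assumes tile-aligned shapes"
        idx[axis + "1"], idx[axis + "0"] = divmod(value, inner)
        ext[axis + "1"], ext[axis + "0"] = shape[axis] // inner, inner
    off = 0
    for name in LAYOUTS[buffer]:
        off = off * ext[name] + idx[name]
    return off

shape = {"K": 32, "N": 32}
print(offset("L1_B", {"K": 1, "N": 0}, shape))  # stepping k in L1  -> 16
print(offset("L0B",  {"K": 1, "N": 0}, shape))  # stepping k in L0B -> 1
```

Stepping one element along K moves 16 elements in the L1 copy of B but only one element in L0B, which is the K0-innermost property the table describes.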
Why K-Innermost on L0B?
The cube reduction axis is K. L0B requires K innermost (K1 N1 N0 K0) so the cube hardware reads all K0 elements per cycle without striding.
The inner-box transpose is performed as part of the pto.mte_l1_l0b structured right-load movement itself; no separate user-visible pass is required. Each 512B fractal Z-block is permuted as it moves from L1 to L0B.
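The effect of the per-fractal permutation can be modelled in plain Python. This is only a sketch of the data reordering (K1 N1 K0 N0 to K1 N1 N0 K0) for f16 tiles, not a description of how the hardware implements the move; `transpose_fractal` and `l1_to_l0b` are hypothetical helper names.

```python
K0 = N0 = 16  # f16 tile extents (assumption of this sketch)

def transpose_fractal(block):
    """Permute one flat K0 x N0 fractal (n0 fastest) into N0 x K0 (k0 fastest)."""
    assert len(block) == K0 * N0
    return [block[k0 * N0 + n0] for n0 in range(N0) for k0 in range(K0)]

def l1_to_l0b(b_l1):
    """Apply the per-fractal permutation to a whole flat K1 N1 K0 N0 buffer.

    The outer K1 x N1 order is unchanged, so each fractal can be
    permuted independently and written back at the same position.
    """
    size = K0 * N0
    assert len(b_l1) % size == 0
    out = []
    for start in range(0, len(b_l1), size):
        out.extend(transpose_fractal(b_l1[start:start + size]))
    return out

b_l0b = l1_to_l0b(list(range(2 * K0 * N0)))
print(b_l0b[:4])  # first K0 run of the first fractal starts [0, 16, 32, 48]
```

After the permutation, consecutive elements of L0B walk along k0, so the values that were one full N0 row apart in L1 become adjacent.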
Data Flow: GM → L1 → L0A/B → L0C
+------------------------------------------------------------------------------+
|               GEMM Data Layout: GM -> L1 (NZ) -> L0A/B -> L0C                |
+------------------------------------------------------------------------------+

STEP 1 - Global Memory (ND, row-major)
--------------------------------------
Tensor A [M, K]                        Tensor B [K, N]
(K is the contiguous axis)             (N is the contiguous axis)
Physical: A[m*K + k]                   Physical: B[k*N + n]

STEP 2 - GM -> L1 (cbuf): ND-to-NZ fractal repack
-------------------------------------------------
A in L1: K1 x M1 x M0 x K0             B in L1: K1 x N1 x K0 x N0
For each outer block (k1, m1):         For each outer block (k1, n1):
  inner is M0 rows x K0 cols             inner is K0 rows x N0 cols
  (16x16 elems contiguous for f16)       (16x16 elems contiguous for f16)
Physical: A_nz[k1][m1][m0][k0]         Physical: B_nz[k1][n1][k0][n0]

STEP 3 - L1 -> L0A / L0B
------------------------
L0A: cbuf K1 M1 M0 K0 --mte_l1_l0a--> L0A K1 M1 M0 K0 (FRACTAL_NZ on A5)
L0B: cbuf K1 N1 K0 N0 --mte_l1_l0b--> L0B K1 N1 N0 K0 (FRACTAL_ZN, K0 innermost)

STEP 4 - MAD: L0A x L0B -> L0C
------------------------------
dst[m, n] = sum over k in 0..K-1 of lhs[m, k] * rhs[k, n]
L0C layout: N1 M1 M0 N0

STEP 5 - L0C writeback (FIXPIPE)
--------------------------------
FIXPIPE MTE ops (mte_l0c_l1 / mte_l0c_gm / mte_l0c_ub) convert the L0C NZ
result to the requested destination layout (typically ND) and memory space.
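The five steps can be traced end to end in a pure-Python model. This is a sketch that only reproduces the layouts and the arithmetic: it assumes f16 tile extents (all inner extents 16), tile-aligned shapes and integer data, and every function name in it is hypothetical. It does not model the MTE or FIXPIPE ops themselves.

```python
import random

T = 16  # f16: M0 = K0 = N0 = 16 (assumption of this sketch)

def off_a(m, k, M1):     # K1 M1 M0 K0 (A in L1 and in L0A)
    return (((k // T) * M1 + m // T) * T + m % T) * T + k % T

def off_b_l1(k, n, N1):  # K1 N1 K0 N0 (B in L1)
    return (((k // T) * N1 + n // T) * T + k % T) * T + n % T

def off_b_l0(k, n, N1):  # K1 N1 N0 K0 (B in L0B)
    return (((k // T) * N1 + n // T) * T + n % T) * T + k % T

def off_c(m, n, M1):     # N1 M1 M0 N0 (L0C)
    return (((n // T) * M1 + m // T) * T + m % T) * T + n % T

def gemm_via_fractals(a, b, M, K, N):
    M1, N1 = M // T, N // T
    # Step 2: GM (ND) -> L1 (NZ) repack of both operands.
    a_l1 = [0] * (M * K)
    for m in range(M):
        for k in range(K):
            a_l1[off_a(m, k, M1)] = a[m * K + k]
    b_l1 = [0] * (K * N)
    for k in range(K):
        for n in range(N):
            b_l1[off_b_l1(k, n, N1)] = b[k * N + n]
    # Step 3: L1 -> L0A keeps the order; L1 -> L0B transposes each fractal.
    l0a = list(a_l1)
    l0b = [0] * (K * N)
    for base in range(0, K * N, T * T):
        for k0 in range(T):
            for n0 in range(T):
                l0b[base + n0 * T + k0] = b_l1[base + k0 * T + n0]
    # Step 4: MAD, accumulating over k into the L0C layout.
    l0c = [0] * (M * N)
    for m in range(M):
        for n in range(N):
            acc = 0
            for k in range(K):
                acc += l0a[off_a(m, k, M1)] * l0b[off_b_l0(k, n, N1)]
            l0c[off_c(m, n, M1)] = acc
    # Step 5: L0C (NZ) -> row-major ND writeback.
    return [l0c[off_c(m, n, M1)] for m in range(M) for n in range(N)]

def gemm_reference(a, b, M, K, N):
    return [sum(a[m * K + k] * b[k * N + n] for k in range(K))
            for m in range(M) for n in range(N)]

random.seed(0)
M, K, N = 32, 48, 16
a = [random.randint(-3, 3) for _ in range(M * K)]
b = [random.randint(-3, 3) for _ in range(K * N)]
assert gemm_via_fractals(a, b, M, K, N) == gemm_reference(a, b, M, K, N)
```

The final assertion compares the result against a plain row-major GEMM, which confirms that the layout changes in steps 2, 3 and 5 are pure permutations that leave the mathematical result unchanged.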
Authoring Guidance
When the source GEMM operand is already in a transposed logical layout, express that at the structured load level (pto.mte_l1_l0a / pto.mte_l1_l0b) instead of relying on a later reinterpretation of the same bytes. Operating on a reinterpreted NZ buffer with the wrong outer / inner factorization is a verifier error and a common source of correctness bugs.
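To illustrate why reinterpretation is risky, the sketch below packs a [32, 32] tensor as K1 M1 M0 K0 and then reads the same buffer back assuming the outer axes are swapped (M1 K1 M0 K0). It models only the data mismatch, not the verifier; the helper names and the swapped-order scenario are assumptions chosen for illustration, with f16 tile extents.

```python
T = 16  # f16: M0 = K0 = 16 (assumption of this sketch)

def pack_a(a, M, K):
    """Row-major [M, K] -> flat K1 M1 M0 K0."""
    M1 = M // T
    out = [0] * (M * K)
    for m in range(M):
        for k in range(K):
            out[(((k // T) * M1 + m // T) * T + m % T) * T + k % T] = a[m * K + k]
    return out

def read(buf, m, k, M, K, k_outer_first):
    """Read logical (m, k), assuming K1 M1 M0 K0 (True) or M1 K1 M0 K0 (False)."""
    M1, K1 = M // T, K // T
    if k_outer_first:
        outer = (k // T) * M1 + m // T   # the order the buffer was packed in
    else:
        outer = (m // T) * K1 + k // T   # wrong outer factorization
    return buf[(outer * T + m % T) * T + k % T]

M, K = 32, 32
a = list(range(M * K))  # distinct values make every misread visible
nz = pack_a(a, M, K)
right = [read(nz, m, k, M, K, True) for m in range(M) for k in range(K)]
wrong = [read(nz, m, k, M, K, False) for m in range(M) for k in range(K)]
mismatches = sum(x != y for x, y in zip(a, wrong))
print(right == a, mismatches)  # True 512
```

Only the two off-diagonal outer blocks are misread here, so half of the 1024 elements come back wrong while the rest look correct, which is the kind of partial corruption that makes such bugs hard to spot.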