Data Format Reference

The physical data format defines how tiles, vectors, and scalars are represented in memory and in hardware registers. It covers memory spaces, element packing, address alignment, VLane architecture, and the relationship between the PTO logical view and the underlying storage.

Memory Spaces

PTO distinguishes global memory from local tile-buffer storage. The ISA-level concept is the tile buffer; the hardware names behind that concept depend on TileType.

| Memory Space | Location | Access Unit | Bandwidth | Access Pattern |
|---|---|---|---|---|
| GM (Global Memory) | Off-chip device DRAM | Byte-granular | Low | Global backing store |
| Local tile buffers | On-chip local storage | Role-specific | High | Direct tile/vector access through the selected tile role |

The hardware mapping of local tile buffers is:

| PTO tile-buffer role | Hardware-local buffer |
|---|---|
| TileType::Vec | Unified Buffer (UB) |
| TileType::Left | L0A |
| TileType::Right | L0B |
| TileType::Acc | L0C |
| TileType::ScaleLeft | L0A scale buffer |
| TileType::ScaleRight | L0B scale buffer |

Where the current manual says "Tile Register File", read it as the ISA abstraction over these local tile buffers, not as a second user-visible storage class separate from the buffers themselves.

Tile Buffer Format

A tile occupies a contiguous region in one local tile buffer. Its logical shape (Rows, Cols) is independent of its physical storage format.

In-Memory Format

In a local tile buffer, elements are stored in their BLayout order — either RowMajor or ColMajor. Each element occupies sizeof(DType) bytes. For TileType::Vec, that local tile buffer is the hardware Unified Buffer.

For BLayout = RowMajor, shape (R, C):

\[ \text{addr}(r, c) = (r \times C + c) \times \mathrm{sizeof(DType)} \]

For BLayout = ColMajor, shape (R, C):

\[ \text{addr}(r, c) = (c \times R + r) \times \mathrm{sizeof(DType)} \]
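The two formulas above can be checked with a small sketch. The enum `BLayoutKind` and the function `ElementByteOffset` are hypothetical names introduced only for this illustration; they are not part of the PTO headers. The result is a byte offset from the start of the tile, not an absolute hardware address.

```cpp
#include <cstddef>

// Hypothetical stand-in for the BLayout choice described above.
enum class BLayoutKind { RowMajor, ColMajor };

// Byte offset of element (r, c) from the start of a tile with logical
// shape (rows, cols), following the two formulas above.
constexpr std::size_t ElementByteOffset(BLayoutKind layout, std::size_t rows,
                                        std::size_t cols, std::size_t r,
                                        std::size_t c, std::size_t elemSize) {
    return (layout == BLayoutKind::RowMajor ? r * cols + c : c * rows + r) * elemSize;
}

// 16 x 16 float tile, element (2, 3):
//   RowMajor: (2 * 16 + 3) * 4 = 140
//   ColMajor: (3 * 16 + 2) * 4 = 200
static_assert(ElementByteOffset(BLayoutKind::RowMajor, 16, 16, 2, 3, sizeof(float)) == 140, "");
static_assert(ElementByteOffset(BLayoutKind::ColMajor, 16, 16, 2, 3, sizeof(float)) == 200, "");
```

Because the function is `constexpr`, the worked values are verified at compile time.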

Tile-Register View

The tile-register view is the ISA abstraction presented to authors. It names typed local tile buffers and hides the fact that different TileType values are backed by different hardware-local buffers. Tile data is moved in and out via explicit TLOAD/TSTORE/TMOV*-family operations rather than by scalar byte addressing.

Address Alignment

| Access Type | Required Alignment |
|---|---|
| GM read/write | Element-size aligned (2 bytes for f16/i16, 4 bytes for f32) |
| Vector tile buffer DMA transfer | 32-byte block aligned (DMA engine unit) |
| Local tile-buffer access | Element-size aligned, plus any role-specific backend constraints |

The DMA engine operates on 32-byte blocks (BLOCK_BYTE_SIZE = 32). Misaligned GM addresses produce target-specific behavior:

  • A2/A3 and A5: The DMA engine performs best with naturally aligned addresses; unaligned addresses may cause DMA errors or performance degradation. Software should align GM addresses to at least 32-byte boundaries.
  • CPU simulator: Unaligned addresses are accepted and handled by the host CPU's memory access instructions.
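As a minimal sketch of the 32-byte recommendation, the helpers below test an address against the DMA block size and round it up to the next block boundary. The names `kDmaBlockBytes`, `IsBlockAligned`, and `AlignUpToBlock` are hypothetical and exist only for this example; only the value 32 comes from BLOCK_BYTE_SIZE.

```cpp
#include <cstdint>

// Mirrors BLOCK_BYTE_SIZE = 32; the name is local to this sketch.
constexpr std::uint64_t kDmaBlockBytes = 32;

// True when addr sits on a 32-byte block boundary.
constexpr bool IsBlockAligned(std::uint64_t addr) {
    return addr % kDmaBlockBytes == 0;
}

// Rounds addr up to the next 32-byte boundary (no change if already aligned).
constexpr std::uint64_t AlignUpToBlock(std::uint64_t addr) {
    return (addr + kDmaBlockBytes - 1) / kDmaBlockBytes * kDmaBlockBytes;
}

static_assert(IsBlockAligned(0x1000), "");
static_assert(!IsBlockAligned(0x1004), "");
static_assert(AlignUpToBlock(0x1004) == 0x1020, "");
```

A check of this kind can guard GM base addresses on the host side before a transfer is issued.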

Element Type Encoding

Standard Types

| Type | C++ Type | SSA Name | Size (bytes) | Register Width |
|---|---|---|---|---|
| IEEE FP16 | half | f16 | 2 | 128 lanes |
| Brain Float16 (bfloat16) | bfloat16_t | bf16 | 2 | 128 lanes |
| IEEE FP32 | float | f32 | 4 | 64 lanes |
| Signed int8 | int8_t | i8 | 1 | 256 lanes |
| Unsigned int8 | uint8_t | u8 | 1 | 256 lanes |
| Signed int16 | int16_t | i16 | 2 | 128 lanes |
| Unsigned int16 | uint16_t | u16 | 2 | 128 lanes |
| Signed int32 | int32_t | i32 | 4 | 64 lanes |
| Unsigned int32 | uint32_t | u32 | 4 | 64 lanes |

A5-Only Types

| Type | C++ Type | SSA Name | Size (bytes) | Notes |
|---|---|---|---|---|
| FP8 E4M3 | float8_e4m3_t | f8e4m3 | 1 | 256 lanes |
| FP8 E5M2 | float8_e5m2_t | f8e5m2 | 1 | 256 lanes |
| HI Float8 | hifloat8_t | hifloat8 | 1 | 256 lanes |
| Float4 E1M2x2 | float4_e1m2x2_t | float4_e1m2x2 | 1 | 256 lanes (packed 2×2) |
| Float4 E2M1x2 | float4_e2m1x2_t | float4_e2m1x2 | 1 | 256 lanes (packed 2×2) |

Vector Register Format (VLane Architecture)

On A5 (Ascend 950 PR / DT), the vector register is organized as 8 VLanes of 32 bytes each. A VLane is the atomic unit for group reduction operations. This organization is architecturally visible in PTO.

vreg (256 bytes total):
┌─────────┬─────────┬─────────┬─────┬─────────┬─────────┐
│ VLane 0 │ VLane 1 │ VLane 2 │ ... │ VLane 6 │ VLane 7 │
│   32B   │   32B   │   32B   │     │   32B   │   32B   │
└─────────┴─────────┴─────────┴─────┴─────────┴─────────┘

Vector registers hold N elements of type DType packed contiguously with no padding. The register width is always 256 bytes (2048 bits):

| Element Type | Lane Count N | Bytes/Lane | Total |
|---|---|---|---|
| f32 | 64 | 4 | 256 B |
| f16 / bf16 / i16 / u16 | 128 | 2 | 256 B |
| i8 / u8 / FP8 / HI-FP8 | 256 | 1 | 256 B |
| float4_* (packed) | 256 (effective) | 1 | 256 B |
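The lane counts follow directly from dividing the 256-byte register width by the element size. The sketch below reproduces that arithmetic for the standard types; `kVRegBytes`, `kVLaneCount`, `ElementsPerVReg`, and `ElementsPerVLane` are hypothetical names used only here, and the packed float4 types are left out because one stored byte does not correspond to one plain C++ scalar.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kVRegBytes = 256;  // register width stated above
constexpr std::size_t kVLaneCount = 8;   // VLanes per register on A5
constexpr std::size_t kVLaneBytes = kVRegBytes / kVLaneCount;  // 32 bytes

// Number of elements of type T that fit in one vector register.
template <typename T>
constexpr std::size_t ElementsPerVReg() { return kVRegBytes / sizeof(T); }

// Number of elements of type T that fit in one 32-byte VLane.
template <typename T>
constexpr std::size_t ElementsPerVLane() { return kVLaneBytes / sizeof(T); }

static_assert(ElementsPerVReg<float>() == 64, "f32: 64 lanes");
static_assert(ElementsPerVReg<std::int16_t>() == 128, "16-bit types: 128 lanes");
static_assert(ElementsPerVReg<std::int8_t>() == 256, "8-bit types: 256 lanes");
static_assert(ElementsPerVLane<float>() == 8, "f32: 8 elements per VLane");
```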

Group Reduction and VLanes

Group reduction operations (vcgadd, vcgmax, vcgmin) reduce within each VLane independently. The reduction produces one result per VLane (one value per 32-byte lane), which is then broadcast or stored:

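// Reference semantics of a per-VLane group reduction (addition shown).
// T is the element type; N is the element count of one 256-byte register
// (for example, 64 for f32 or 128 for f16).
template <typename T, int N>
void GroupAddReference(const T (&src)[N], T (&dst)[N]) {
    constexpr int kVLanes = 8;
    constexpr int K = N / kVLanes;  // elements per VLane (e.g., 8 for f32, 16 for f16)
    for (int g = 0; g < kVLanes; g++) {
        T sum = 0;
        for (int i = 0; i < K; i++)
            sum += src[g * K + i];
        dst[g * K] = sum;           // write result to first position of each VLane
        for (int i = 1; i < K; i++)
            dst[g * K + i] = 0;     // zero-fill remaining positions
    }
}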

This is architecturally visible: the result is not a single scalar but one value per VLane.
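The same per-VLane pattern applies to the max and min forms. The sketch below is a host-side reference model only, generalized over the combining operation; `GroupReduceReference`, `GroupMaxReference`, and `MakeRamp` are hypothetical names, and the zero-fill of the non-leading positions simply mirrors the reference loop shown in this section.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

constexpr std::size_t kVLanes = 8;  // VLanes per vector register on A5

// Reduces each VLane with `op`, writes the result to the first element of
// that VLane, and leaves the remaining positions at zero.
template <typename T, std::size_t N, typename Op>
std::array<T, N> GroupReduceReference(const std::array<T, N>& src, Op op) {
    static_assert(N % kVLanes == 0, "register must split evenly into VLanes");
    constexpr std::size_t K = N / kVLanes;  // elements per VLane
    std::array<T, N> dst{};                 // value-initialized: all zeros
    for (std::size_t g = 0; g < kVLanes; ++g) {
        T acc = src[g * K];
        for (std::size_t i = 1; i < K; ++i)
            acc = op(acc, src[g * K + i]);
        dst[g * K] = acc;
    }
    return dst;
}

// Max form of the group reduction.
template <typename T, std::size_t N>
std::array<T, N> GroupMaxReference(const std::array<T, N>& src) {
    return GroupReduceReference(src, [](T a, T b) { return std::max(a, b); });
}

// Builds a 64-element f32 register image holding 0, 1, ..., 63.
inline std::array<float, 64> MakeRamp() {
    std::array<float, 64> v{};
    for (std::size_t i = 0; i < v.size(); ++i) v[i] = static_cast<float>(i);
    return v;
}
```

For the ramp input, VLane 0 holds 0 through 7 and VLane 1 holds 8 through 15, so the max form yields 7 and 15 at positions 0 and 8, and a sum form yields 28 and 92 at the same positions.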

Pad Value Encoding

The Pad parameter in Tile<DType, ..., Pad> specifies the value of out-of-valid-region elements. It is declared in include/pto/common/constants.hpp.

Standard Pad Values

| Pad Value | Meaning | float Encoding | half/bf16 Encoding | i8/u8 Encoding |
|---|---|---|---|---|
| Zero | Initialize to zero | 0x00000000 | 0x0000 | 0x00 |
| Null | Undefined; must not be read | 0x00000000 | 0x0000 | 0x00 |
| Min | Fill with type minimum | 0xff800000 (−Inf) | 0xfc00 | 0xff |
| Max | Fill with type maximum | 0x7f800000 (+Inf) | 0x7c00 | 0x7f |
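The float column can be sanity-checked by reinterpreting the bit patterns as IEEE-754 binary32 values: the Min and Max patterns decode to negative and positive infinity, and the Zero pattern decodes to 0.0f. The helpers `FloatFromBits` and `CheckPadEncodings` are hypothetical names for this sketch, which assumes a host where `float` is a 32-bit IEEE-754 type.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Reinterprets a 32-bit pattern as an IEEE-754 binary32 value.
inline float FloatFromBits(std::uint32_t bits) {
    float f;
    static_assert(sizeof(f) == sizeof(bits), "float must be 32 bits");
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Returns true when the float encodings listed for Zero, Min, and Max
// decode to 0.0f, -infinity, and +infinity respectively.
inline bool CheckPadEncodings() {
    const float inf = std::numeric_limits<float>::infinity();
    return FloatFromBits(0x00000000u) == 0.0f   // Zero
        && FloatFromBits(0xff800000u) == -inf   // Min
        && FloatFromBits(0x7f800000u) == inf;   // Max
}
```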

Custom Pad Values (A5)

The PadValueCustom(value) helper allows compile-time-specified float patterns as pad values. This is useful for operations that need a specific fill value (e.g., -1.0f for softmax):

// Custom pad value: all out-of-valid-region elements become -1.0f
using TilePadNeg1 = Tile<TileType::Vec, float, 16, 16, RowMajor, NoneBox, None, PadValueCustom(-1.0f)>;

Custom pad values encode the float bit pattern in the upper bits of the 64-bit PadValue enum. They are processed by PadValueMap and applied via GetPadValue() at load time.
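To make the "float bit pattern in the upper bits" idea concrete, the sketch below packs a float into the upper 32 bits of a 64-bit value and recovers it again. This is an assumed layout for illustration, not the actual PadValue encoding: `kCustomTag`, `EncodeCustomPad`, and `DecodeCustomPad` are hypothetical, the low-bit marker value is invented, and the real helper works at compile time whereas this version runs at run time.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical marker in the low bits that would distinguish custom values
// from the standard enumerators; the real constant and layout may differ.
constexpr std::uint64_t kCustomTag = 0x1;

// Packs the IEEE-754 bit pattern of `value` into bits 63..32 (assumed layout).
inline std::uint64_t EncodeCustomPad(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    return (static_cast<std::uint64_t>(bits) << 32) | kCustomTag;
}

// Recovers the float from bits 63..32 of an encoded value.
inline float DecodeCustomPad(std::uint64_t encoded) {
    const std::uint32_t bits = static_cast<std::uint32_t>(encoded >> 32);
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}
```

Under this assumed layout, -1.0f (bit pattern 0xBF800000) encodes to 0xBF80000000000001 and decodes back to -1.0f.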

MX Block-Scale Formats

MX block-scale matmul forms use extra scale tiles in addition to the left and right payload tiles. In the current codebase:

  • TileLeft corresponds to L0A
  • TileRight corresponds to L0B
  • TileLeftScale corresponds to the L0A-side scale buffer
  • TileRightScale corresponds to the L0B-side scale buffer

The A5 TMATMUL_MX / TGEMV_MX code paths explicitly require both scale tiles, and the supported combinations include MX FP4 and MX FP8 families. These are block-scale formats, not plain elementwise FP formats.

Fractal Layout Encoding

The TileLayoutCustom enum in include/pto/common/constants.hpp encodes the concrete layout used at runtime:

| TileLayoutCustom | BLayout | SLayout | Fractal | Block Size | Typical Use |
|---|---|---|---|---|---|
| ND | RowMajor | NoneBox | | | Standard tile; most ops |
| DN | ColMajor | NoneBox | | | Fortran-order tile |
| NZ | ColMajor | RowMajor | NZ | 512 B | Left/L0A-side matmul operand on A5 |
| ZN | RowMajor | ColMajor | ZN | 512 B | Symmetric NZ variant |
| ZZ | RowMajor | RowMajor | ZZ | 512 B | CUBE-specific pattern |

Three constants give the fractal block dimensions used in address generation: BLOCK_BYTE_SIZE = 32, FRACTAL_NZ_ROW = 16, and CUBE_BLOCK_SIZE = 512. One fractal block is therefore 16 rows of 32 bytes each, or 512 bytes.
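The relationship between these constants, and one plausible reading of the NZ row in the table, can be sketched as follows. The offset function assumes that fractal blocks are ordered column-major across the tile (BLayout = ColMajor) and that elements are row-major inside each block (SLayout = RowMajor), with the row count a multiple of 16 and the row byte width a multiple of 32. `NzByteOffset` and the `k`-prefixed constants are hypothetical names, and the actual address generation in the codebase may differ.

```cpp
#include <cstddef>

constexpr std::size_t kBlockByteSize = 32;   // mirrors BLOCK_BYTE_SIZE
constexpr std::size_t kFractalNzRow = 16;    // mirrors FRACTAL_NZ_ROW
constexpr std::size_t kCubeBlockSize = 512;  // mirrors CUBE_BLOCK_SIZE

// One fractal block is 16 rows of 32 bytes.
static_assert(kFractalNzRow * kBlockByteSize == kCubeBlockSize, "16 x 32 B = 512 B");

// Assumed NZ addressing: blocks column-major across the tile, elements
// row-major within a block. `rows` must be a multiple of 16.
constexpr std::size_t NzByteOffset(std::size_t rows, std::size_t r,
                                   std::size_t c, std::size_t elemSize) {
    const std::size_t colsPerBlock = kBlockByteSize / elemSize;  // 16 for f16, 8 for f32
    const std::size_t blockRows = rows / kFractalNzRow;
    const std::size_t blockIndex = (c / colsPerBlock) * blockRows + (r / kFractalNzRow);
    const std::size_t inBlock = (r % kFractalNzRow) * colsPerBlock + (c % colsPerBlock);
    return blockIndex * kCubeBlockSize + inBlock * elemSize;
}
```

For a 32 × 32 f16 tile under these assumptions, element (16, 0) starts the second block at byte 512, and element (0, 16) starts the third block at byte 1024.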

Constants Reference

| Constant | Value | Units | Use |
|---|---|---|---|
| BLOCK_BYTE_SIZE | 32 | bytes | DMA block transfer unit |
| FIXP_BURST_UNIT_LEN | 64 | half-words | DMA burst length |
| FRACTAL_NZ_ROW | 16 | elements | Fractal row dimension for NZ/ZN |
| CUBE_BLOCK_SIZE | 512 | bytes | CUBE fractal block |
| C0_SIZE_BYTE | 32 | bytes | Cube C0 dimension (in bytes) |
| MX_COL_LEN | 2 | elements | MX block-scale column block |
| MX_ROW_LEN | 16 | elements | MX block-scale row block |
| MX_BLOCK_SIZE | 32 | elements | MX block-scale block |
| TMP_UB_SIZE | 8 × 1024 | bytes | Temporary UB buffer size |
| TMP_UB_OFFSET | 184 × 1024 | bytes | Temporary UB offset |
| MASK_LEN | 64 | bits | Predicate mask width |
| BLOCK_LEN | 16 | elements | Standard block length |
| VLane_COUNT | 8 | lanes | VLanes per vector register (A5) |

See Also