pto.tmov¶

pto.tmov is part of the Layout And Rearrangement instruction set.

Summary¶

TMOV copies or transforms tile data between tiles. It is the workhorse for tile-to-tile data movement, accumulator-to-vector conversion, and fix-pipe quantization paths.

Two instruction families are documented here; the Variants section lists the individual C++ overloads within them:

Variant	Suffix	Description	Typical Use
Standard move	(none)	Direct tile-to-tile copy or conversion	Vec→Vec, Mat→Left/Right, Acc→Mat
Fix-pipe move	`_fp`	Move through fix-pipe quantization path	Acc→Vec/int8_t with scaling

Mechanism¶

Conceptually, TMOV copies or transforms elements from src into dst over the valid region. The exact transformation depends on the selected mode and variant.

Standard move (pure copy case):

\[ \mathrm{dst}_{i,j} = \mathrm{src}_{i,j} \]

Fix-pipe variant (TMOV_FP): routes the data through the hardware fix-pipe quantization pipeline, applying the conversion configured by the fp sideband tile:

\[ \mathrm{dst}_{i,j} = \mathrm{Convert}\!\left(\mathrm{src}_{i,j};\ \mathrm{fp}\right) \]
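The two formulas above can be modeled on the host as a small reference implementation. This is only a sketch: `RefTile`, `refMove`, and `refMoveFp` are hypothetical names that are not part of the PTO headers, and `Convert` is assumed here to mean "multiply by a per-column scale, round to nearest, saturate to int8_t". The real conversion is defined by the hardware and the fp tile contents. The sketch also assumes the source tile's valid region drives the loop. It requires C++17.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side tile: row-major storage plus a valid region.
// This is NOT the PTO Tile type; it only models the math above.
template <typename T>
struct RefTile {
    std::size_t rows, cols;            // allocated shape
    std::size_t validRows, validCols;  // region the move operates on
    std::vector<T> data;               // rows * cols elements, row-major

    RefTile(std::size_t r, std::size_t c, std::size_t vr, std::size_t vc)
        : rows(r), cols(c), validRows(vr), validCols(vc), data(r * c, T{}) {
        assert(vr <= r && vc <= c);
    }
    T &at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    const T &at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Standard move, pure copy case: dst[i][j] = src[i][j] over the valid region.
template <typename T>
void refMove(RefTile<T> &dst, const RefTile<T> &src) {
    assert(dst.rows == src.rows && dst.cols == src.cols);
    for (std::size_t i = 0; i < src.validRows; ++i)
        for (std::size_t j = 0; j < src.validCols; ++j)
            dst.at(i, j) = src.at(i, j);
}

// Fix-pipe move: dst[i][j] = Convert(src[i][j]; fp).
// ASSUMPTION: Convert = scale per column, round to nearest, saturate to int8_t.
inline void refMoveFp(RefTile<std::int8_t> &dst, const RefTile<float> &src,
                      const std::vector<float> &scalePerCol) {
    assert(dst.rows == src.rows && dst.cols == src.cols);
    assert(scalePerCol.size() >= src.validCols);
    for (std::size_t i = 0; i < src.validRows; ++i)
        for (std::size_t j = 0; j < src.validCols; ++j) {
            float scaled = std::nearbyint(src.at(i, j) * scalePerCol[j]);
            scaled = std::clamp(scaled, -128.0f, 127.0f);
            dst.at(i, j) = static_cast<std::int8_t>(scaled);
        }
}
```

Elements outside the valid region are left untouched in `dst`, which is how this model interprets "over the valid region".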

Variants¶

Variant 1: Standard Move¶

TMOV(dst, src) is a plain tile-to-tile copy; this overload takes no reluMode parameter, so no ReLU is applied.

Variant 2: ReLU Move¶

TMOV<..., reluMode>(dst, src) — copy with ReLU pre-processing.
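As a rough illustration of what "ReLU pre-processing" means for each element, the sketch below applies the mode before the element is written. `RefReluMode`, `refReluPre`, and `refReluMove` are hypothetical stand-ins, and the sketch assumes that NormalRelu replaces negative inputs with zero; this page does not spell out the exact behavior. It requires C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the ReluPreMode enum named in this document.
enum class RefReluMode { NoRelu, NormalRelu };

// ASSUMPTION: NormalRelu replaces negative inputs with zero; NoRelu passes
// every element through unchanged.
template <RefReluMode Mode>
float refReluPre(float x) {
    if constexpr (Mode == RefReluMode::NormalRelu) {
        return x < 0.0f ? 0.0f : x;
    } else {
        return x;
    }
}

// Element-wise move with ReLU pre-processing over a flat buffer.
template <RefReluMode Mode>
void refReluMove(std::vector<float> &dst, const std::vector<float> &src) {
    assert(dst.size() == src.size());
    for (std::size_t k = 0; k < src.size(); ++k) {
        dst[k] = refReluPre<Mode>(src[k]);
    }
}
```

Passing the mode as a template argument mirrors the way the documented overloads select it at compile time.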

Variant 3: Accumulator-to-Vector¶

TMOV<..., mode, reluMode>(dst, src) — converts accumulator tile to vector tile with optional ReLU. The mode parameter selects the splitting strategy for multi-core configurations.

Variant 4: Vector-Quant Move¶

TMOV<..., FpTileData, mode, reluMode>(dst, src, fp) — converts accumulator to vector through the fix-pipe quantization path. The fp tile carries scale factors.

Variant 5: Scalar-Quant Move¶

TMOV<..., reluMode>(dst, src, preQuantScalar) — converts accumulator with a scalar quantization parameter.

Variant 6: Fix-Pipe Move (`TMOV_FP`)¶

TMOV_FP(dst, src, fp) — explicit fix-pipe move. Same semantics as the vector-quant move but named explicitly for the assembly spelling.

AccToVecMode Reference¶

The AccToVecMode parameter controls how accumulator tiles are split and transferred to vector tiles, especially in multi-core (dual-mode) configurations:

Mode	Meaning	Used When
`SingleModeVec0`	Transfer to vector 0 only	1-Cube, 1-Vec configurations
`SingleModeVec1`	Transfer to vector 1 only	Single-mode targeting Vec1
`DualModeSplitM`	Split accumulator rows evenly across two vectors	1-Cube, 2-Vec with row-wise split
`DualModeSplitN`	Split accumulator columns across two vectors	1-Cube, 2-Vec with column-wise split
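The table can be read as a mapping from (mode, vector index) to a sub-rectangle of the accumulator. The sketch below makes that mapping explicit. `RefAccToVecMode`, `RefRegion`, and `refRegionFor` are hypothetical names, and the sketch assumes that dual modes cut at the midpoint with vector 0 receiving the first half, which the table implies but does not state.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical mirror of the AccToVecMode values listed above.
enum class RefAccToVecMode { SingleModeVec0, SingleModeVec1, DualModeSplitM, DualModeSplitN };

// Sub-rectangle of the accumulator, in elements.
struct RefRegion {
    std::size_t row0, col0, rows, cols;
};

// Region of a rows x cols accumulator delivered to vector `vecId` (0 or 1).
// ASSUMPTION: dual modes cut at the midpoint and vector 0 gets the first half.
inline RefRegion refRegionFor(RefAccToVecMode mode, std::size_t vecId,
                              std::size_t rows, std::size_t cols) {
    assert(vecId < 2);
    const RefRegion whole{0, 0, rows, cols};
    const RefRegion none{0, 0, 0, 0};
    switch (mode) {
    case RefAccToVecMode::SingleModeVec0:
        return vecId == 0 ? whole : none;
    case RefAccToVecMode::SingleModeVec1:
        return vecId == 1 ? whole : none;
    case RefAccToVecMode::DualModeSplitM:
        assert(rows % 2 == 0);  // even row split
        return RefRegion{vecId * (rows / 2), 0, rows / 2, cols};
    case RefAccToVecMode::DualModeSplitN:
        assert(cols % 2 == 0);  // even column split
        return RefRegion{0, vecId * (cols / 2), rows, cols / 2};
    }
    return none;  // not reached for valid enum values
}
```

For a 64 x 256 accumulator, `DualModeSplitN` under this model gives vector 1 the region starting at column 128 with 128 columns.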

Supported Tile-Type Pairs¶

A2/A3¶

Source Type	Destination Type	Notes
`TileType::Mat`	`TileType::Left/Right/Bias/Scaling`	MX block-format extraction
`TileType::Vec`	`TileType::Vec`	Direct copy
`TileType::Acc`	`TileType::Mat`	Accumulator-to-matrix conversion

Mat→Bias restrictions:

- Supported dtype pairs: int32_t → int32_t, float → float, half → float
- Source row must be 1
- SrcTileData::Cols * sizeof(SrcType) must be aligned to 64 bytes

Mat→Scaling restrictions:

- Destination dtype must be uint64_t
- Source row must be 1
- SrcTileData::Cols * sizeof(SrcType) must be aligned to 128 bytes
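The A2/A3 row-count and alignment rules for Mat→Bias and Mat→Scaling are simple arithmetic, so they can be checked at compile time. The helpers below are a hypothetical sketch (`refRowAligned`, `refMatToBiasOk`, and `refMatToScalingOk` are not PTO APIs). The supported dtype-pair list for Mat→Bias is deliberately left out because `half` is not a standard C++ type. It requires C++17.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// True when a row of `cols` elements of type T spans a multiple of alignBytes.
template <typename T>
constexpr bool refRowAligned(std::size_t cols, std::size_t alignBytes) {
    return (cols * sizeof(T)) % alignBytes == 0;
}

// Mat -> Bias on A2/A3: exactly one source row, row size aligned to 64 bytes.
template <typename SrcType>
constexpr bool refMatToBiasOk(std::size_t srcRows, std::size_t srcCols) {
    return srcRows == 1 && refRowAligned<SrcType>(srcCols, 64);
}

// Mat -> Scaling on A2/A3: destination dtype uint64_t, exactly one source
// row, row size aligned to 128 bytes.
template <typename SrcType, typename DstType>
constexpr bool refMatToScalingOk(std::size_t srcRows, std::size_t srcCols) {
    return std::is_same_v<DstType, std::uint64_t> && srcRows == 1 &&
           refRowAligned<SrcType>(srcCols, 128);
}

// 64 floats = 256 bytes, a multiple of 64.
static_assert(refMatToBiasOk<float>(1, 64), "1 x 64 float row is accepted");
// 8 floats = 32 bytes, not a multiple of 64.
static_assert(!refMatToBiasOk<float>(1, 8), "1 x 8 float row is rejected");
```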

A5¶

In addition to A2/A3 pairs:

Source Type	Destination Type	Notes
`TileType::Mat`	`TileType::Left/Right/Bias/Scaling/ScaleLeft/ScaleRight`	Extended MX formats
`TileType::Vec`	`TileType::Mat`	Vector-to-matrix conversion
`TileType::Acc`	`TileType::Vec`	Accumulator-to-vector with mode selection
`TileType::Acc`	`TileType::Mat`	Accumulator-to-matrix

A5 Mat→Bias:

- Supported dtype pairs: int32_t → int32_t, float → float, half → float, bfloat16_t → float
- DstTileData::Cols * sizeof(DstType) must be aligned to 64 bytes
- Bias-table footprint ≤ 4096 bytes

A5 Mat→Scaling:

- DstTileData::Cols * sizeof(DstType) must be aligned to 128 bytes
- Fix-pipe-buffer footprint ≤ 4096 bytes

A5 Acc→Vec:

- mode selects SingleModeVec0, SingleModeVec1, DualModeSplitM, DualModeSplitN
- Dual-mode requires QuantMode_t::NoQuant
- Dual-mode does not support the nz2dn path
- For 32-bit destination types (float/int32_t), when using DualModeSplitN the ValidCol before the split must be a multiple of 32
- dstStride * sizeof(dstType) must be a multiple of 32 bytes
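The A5 Acc→Vec rules combine into a single predicate. The sketch below is a hypothetical host-side checker (`RefAccToVecConfig` and `refAccToVecOk` are illustrative names, not PTO APIs). It assumes the 32-byte stride rule applies in every mode, which the list above does not make explicit.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical mirror of the AccToVecMode values.
enum class RefAccToVecMode { SingleModeVec0, SingleModeVec1, DualModeSplitM, DualModeSplitN };

// Hypothetical description of one Acc -> Vec move on A5.
struct RefAccToVecConfig {
    RefAccToVecMode mode;
    bool noQuant;                     // true when QuantMode_t::NoQuant is used
    bool nz2dn;                       // true when the nz2dn path is requested
    std::size_t dstElemBytes;         // sizeof(dstType)
    std::size_t validColBeforeSplit;  // ValidCol of the accumulator
    std::size_t dstStrideElems;       // dstStride, in elements
};

inline bool refAccToVecOk(const RefAccToVecConfig &cfg) {
    assert(cfg.dstElemBytes > 0);
    const bool dual = cfg.mode == RefAccToVecMode::DualModeSplitM ||
                      cfg.mode == RefAccToVecMode::DualModeSplitN;
    if (dual && !cfg.noQuant) return false;  // dual-mode requires NoQuant
    if (dual && cfg.nz2dn) return false;     // dual-mode has no nz2dn path
    // 32-bit destinations under DualModeSplitN need ValidCol % 32 == 0.
    if (cfg.mode == RefAccToVecMode::DualModeSplitN && cfg.dstElemBytes == 4 &&
        cfg.validColBeforeSplit % 32 != 0)
        return false;
    // ASSUMPTION: the stride rule is applied regardless of mode.
    return (cfg.dstStrideElems * cfg.dstElemBytes) % 32 == 0;
}
```

A checker like this could run in a unit test or a code generator before a kernel is instantiated, so that violations surface as a readable failure.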

Syntax¶

PTO Assembly Form¶

Standard move:

%dst = tmov.s2d %src : !pto.tile<...> -> !pto.tile<...>

The PTO AS design recommends splitting TMOV into a small set of instructions:

%left  = tmov.m2l %mat  : !pto.tile<...> -> !pto.tile<...>
%right = tmov.m2r %mat  : !pto.tile<...> -> !pto.tile<...>
%bias  = tmov.m2b %mat  : !pto.tile<...> -> !pto.tile<...>
%scale = tmov.m2s %mat  : !pto.tile<...> -> !pto.tile<...>
%vec   = tmov.a2v %acc  : !pto.tile<...> -> !pto.tile<...>
%v1    = tmov.v2v %v0   : !pto.tile<...> -> !pto.tile<...>

Fix-pipe move:

%dst = tmov.fp %src, %fp : !pto.tile<...>, !pto.tile<...> -> !pto.tile<...>

AS Level 1 (SSA)¶

// Standard
%dst = pto.tmov.s2d %src  : !pto.tile<...> -> !pto.tile<...>

// Fix-pipe
%dst = pto.tmov.fp %src, %fp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>

AS Level 2 (DPS)¶

// Standard
pto.tmov ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)

// Fix-pipe
pto.tmov.fp ins(%src, %fp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)

C++ Intrinsic¶

#include <pto/pto-inst.hpp>
using namespace pto;

// Variant 1: Plain move
template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, WaitEvents &... events);

// Variant 2: ReLU move
template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode, typename... WaitEvents>
PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, WaitEvents &... events);

// Variant 3: Accumulator-to-vector with mode
template <typename DstTileData, typename SrcTileData, AccToVecMode mode,
          ReluPreMode reluMode = ReluPreMode::NoRelu, typename... WaitEvents>
PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, WaitEvents &... events);

// Variant 4: Vector-quant (fix-pipe) move
template <typename DstTileData, typename SrcTileData, typename FpTileData,
          AccToVecMode mode, ReluPreMode reluMode = ReluPreMode::NoRelu,
          typename... WaitEvents>
PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, FpTileData &fp, WaitEvents &... events);

// Variant 5: Scalar-quant move
template <typename DstTileData, typename SrcTileData,
          ReluPreMode reluMode = ReluPreMode::NoRelu, typename... WaitEvents>
PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, uint64_t preQuantScalar, WaitEvents &... events);

// Variant 5b: Scalar-quant with mode
template <typename DstTileData, typename SrcTileData, AccToVecMode mode,
          ReluPreMode reluMode = ReluPreMode::NoRelu, typename... WaitEvents>
PTO_INST RecordEvent TMOV(DstTileData &dst, SrcTileData &src, uint64_t preQuantScalar, WaitEvents &... events);

// Variant 6: Explicit fix-pipe move
template <typename DstTileData, typename SrcTileData, typename FpTileData,
          ReluPreMode reluMode = ReluPreMode::NoRelu, typename... WaitEvents>
PTO_INST RecordEvent TMOV_FP(DstTileData &dst, SrcTileData &src, FpTileData &fp, WaitEvents &... events);

Constraints¶

Shape: SrcTileData::Rows == DstTileData::Rows and SrcTileData::Cols == DstTileData::Cols
reluMode: ReluPreMode::{NoRelu, NormalRelu}
mode: AccToVecMode::{SingleModeVec0, SingleModeVec1, DualModeSplitM, DualModeSplitN}
FpTileData::Loc: Must be TileType::Scaling on both A2/A3 and A5 (verified by static_assert)
Vec→Vec: Shape must match exactly
Mat→Left/Right/Bias/Scaling: Compile-time restricted by tile type
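The shape constraint is the kind of rule a `static_assert` can enforce. The sketch below shows one way such a check might look. `RefTileDesc`, `refSameShape`, and `refCheckedMove` are hypothetical, and the PTO headers may implement their checks differently.

```cpp
#include <cstddef>

// Hypothetical tile descriptor exposing the Rows/Cols members that the
// shape constraint refers to.
template <typename T, std::size_t R, std::size_t C>
struct RefTileDesc {
    using DType = T;
    static constexpr std::size_t Rows = R;
    static constexpr std::size_t Cols = C;
};

// True when source and destination have identical static shapes.
template <typename DstTileData, typename SrcTileData>
constexpr bool refSameShape() {
    return SrcTileData::Rows == DstTileData::Rows &&
           SrcTileData::Cols == DstTileData::Cols;
}

// Wrapper that rejects mismatched shapes at compile time.
template <typename DstTileData, typename SrcTileData>
void refCheckedMove(DstTileData &, SrcTileData &) {
    static_assert(refSameShape<DstTileData, SrcTileData>(),
                  "move requires matching Rows and Cols");
    // A real implementation would issue the move here.
}
```

Calling `refCheckedMove` with a 64 x 64 destination and a 64 x 128 source fails to compile, which surfaces the mismatch before any kernel runs.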

Common Patterns¶

Pattern 1: Vec-to-Vec Tile Copy¶

// Copy one vector tile to another (e.g., for double-buffering)
using TileT = Tile<TileType::Vec, float, 64, 128>;  // example shape, chosen for illustration
void tileCopy(TileT& dst, TileT& src) {
    TMOV(dst, src);  // Straight copy
}

Pattern 2: MX Block Extraction (GEMM Setup)¶

// Extract Left and Right block tiles from a matrix in NC1HWO layout
using MatT = Tile<TileType::Mat, float, 64, 64, BLayout::RowMajor, 64, 64, SLayout::ColMajor>;
using LeftT = TileLeft<float, 64, 64>;
using RightT = TileRight<float, 64, 64>;

MatT mat;
LeftT left;
RightT right;
TASSIGN(mat, 0x1000);

TMOV(left, mat);   // Mat → Left
TMOV(right, mat);  // Mat → Right

Pattern 3: Accumulator-to-Vector Conversion (Single Mode)¶

// Convert accumulator to vector in single-mode (1 Cube, 1 Vec)
using AccT = TileAcc<float, 64, 128>;
using VecT = Tile<TileType::Vec, float, 64, 128>;

AccT acc;
VecT vec;
TASSIGN(acc, 0x1000);

TMOV<VecT, AccT, AccToVecMode::SingleModeVec0>(vec, acc);

Pattern 4: Dual-Mode Accumulator-to-Vector (GEMM with 2 Vectors)¶

// Accumulator split across two vector cores (1 Cube, 2 Vec)
using AccT = TileAcc<float, 64, 256>;
using VecT = Tile<TileType::Vec, float, 64, 128>;  // Half-width per vector

AccT acc;
VecT vec0, vec1;
TMOV<VecT, AccT, AccToVecMode::DualModeSplitN>(vec0, acc);  // Columns 0-127
TMOV<VecT, AccT, AccToVecMode::DualModeSplitN>(vec1, acc);  // Columns 128-255

Pattern 5: Fix-Pipe Quantized Move (Production Inference)¶

// Move accumulator through fix-pipe: float32 → int8_t with per-channel scaling
using AccT = TileAcc<float, 32, 32>;
using VecT = Tile<TileType::Vec, int8_t, 32, 32>;
using FpT = Tile<TileType::Scaling, uint64_t, 1, 32>;

AccT acc;
VecT vec;
FpT fp(32);  // 32 scale factors (one per output channel)
TASSIGN(acc, 0x1000);
TASSIGN(fp, 0x2000);

TMOV_FP(vec, acc, fp);  // Quantize through fix-pipe

Pattern 6: Bias Tile Extraction¶

// Extract a bias vector from a single-row matrix tile (Mat→Bias requires row = 1)
using MatT = Tile<TileType::Mat, float, 1, 64, BLayout::RowMajor, 1, 64, SLayout::ColMajor>;
using BiasT = TileBias<float, 64>;

MatT mat;
BiasT bias;
TASSIGN(mat, 0x3000);

TMOV(bias, mat);  // Mat → Bias (row must be 1, width aligned to 64 bytes)