pto.mgather¶
Tile Operation Diagram¶
Introduction¶
MGATHER reads data from a GM GlobalTensor into a UB destination tile through a UB index tile. The operating mode is selected explicitly through the Coalesce template parameter:
Coalesce::Row(default) — gather full rows fromtable[idx[r], :]intodst[r, :]. The index tile is 1-D ([1, R]row-major; on A5 also[R, 1]column-major).R = 1is allowed.Coalesce::Elem— element-wise gather from a linearizedtableintodst[R, C]throughidx[R, C]. The index tile must have the same valid shape as the destination. The degenerate(1, 1)case is allowed.
Out-of-bounds handling is selected through the GatherOOB template parameter. MGATHER has no atomic or conflict policy: every destination slot has exactly one defined source index, so collisions cannot occur.
The destination may also be an L1 / cube TileType::Mat tile in NZ layout (with the index supplied as a GM tensor). This GM → L1 path — for both Coalesce::Row and Coalesce::Elem, on A2/A3 and A5 — is documented in the GM → L1 Gather section below; the GM → UB behaviour described here is unchanged.
Per-target dispatch summary:
- CPU Simulator — pure C++ reference. Templates on the same
Coalesce/GatherOOBparameters as A5, with the sameCoalesce::Rowdefault, so a non-templatedMGATHER(dst, src, idx)meansCoalesce::Rowon sim exactly as on hardware. Row mode readstable[idx[r], :]intodst[r, :](table row strideShape[4],tableRows = Shape[3]); Elem mode walksvalidRow * validColand readstable[idx[i, j]].GatherOOBis modeled (Clamp/Wrapremap the index,Zerowrites a zero element on out-of-range;Undefinedreads unchecked). Row iteration usespto::cpu::parallel_for_rows, which by default runs sequentially becausePTO_CPU_MAX_THREADSdefaults to1u. - A2/A3 VEC-CORE — single-threaded scalar / MTE2 walk driven from the scalar pipe. Row mode issues one wide
copy_gm_to_ubuf_align_b*DMA per row throughtablePtr + safeIdx * tableRowStride(wheretableRowStride = table.GetStride(DIM_3)andtableRows = ∏ Shape[0..3]); Elem mode performs a scalar GM→UB copy per element (DMA bursts cannot satisfy per-element UB-destination 32-byte alignment). Supports ND-GM with ND-UB and NZ-GM with NZ-UB tile pairs. - A5 SIMT — SIMT launch through
cce::async_invokewith up todim3{32, 32}(1024 threads). Row mode uses warp-parallel lane reads that the SIMT hardware coalesces into 128 B GM bursts when consecutive; Elem mode maps one lane to one element with per-lane scalar GM loads. The Row kernel computessrcRow = table + safeIdx * validCols, so the GM table is treated as packed ND with row stride =validCols—MGatherCheckenforcesGlobalTable::staticShape[4] == TileDst::ValidColat compile time, andtableRows = Shape[3]. NZ block-stride layouts are not implemented on A5. The(1, 1)Elem case bypasses the SIMT launch and runsMGatherScalarImplon the AIV vector core.
Math Interpretation¶
Row Coalesce (Coalesce::Row)¶
Destination dst[R, C], index idx[1, R] (or idx[R, 1] on A5), table table[TableRows, C]:
Element Coalesce (Coalesce::Elem)¶
Destination dst[R, C], index idx[R, C] (same valid shape as dst), flat table of length TableSize:
TableSize = Shape[0] * Shape[1] * Shape[2] * Shape[3] * Shape[4] of the source GlobalTensor (5-D, any combination of static and dynamic dims). For NZ tables (A2/A3) the scalar idx is decomposed into (logicalRow = idx / nLogicalCols, logicalCol = idx % nLogicalCols) and then translated to the NZ block-stride GM offset.
Out-of-Bounds Behaviour¶
enum class GatherOOB : uint8_t {
Undefined = 0, // No bounds check; caller guarantees valid indices
Clamp = 1, // Clamp index to capacity - 1
Wrap = 2, // Index modulo capacity
Zero = 3 // Return zero for OOB; in-bounds indices are loaded normally
};
capacity is TableRows in Row mode and the full flat table length in Elem mode.
Undefined: caller guaranteesidx < capacity; no remap is applied.Clamp:idx = min(idx, capacity - 1)before access.Wrap:idx = idx % capacitybefore access.Zero: out-of-bounds destinations receivestatic_cast<T>(0). In Row mode the OOB row is filled withT(0)(A2/A3 ND fills inline on the scalar pipe; A2/A3 NZ pre-zeros the whole tile once before the DMA loop; A5 SIMT does the substitution inline per lane). In Elem mode the OOB lane writesT(0)inline through the same store. All dtypes are supported under everyGatherOOBvalue.- CPU simulator: models
GatherOOBthe same way as A5 —Clamp/Wrapremap the (uint32-cast) index,ZerowritesT(0)for out-of-range, and onlyUndefined(the default) readstable.data()[idx]unchecked (target-defined).
Assembly Syntax¶
PTO-AS form: see PTO-AS Specification.
Synchronous form:
%dst = mgather %mem, %idx : !pto.memref<...>, !pto.tile<...> -> !pto.tile<...>
Row coalesce:
mgather.row %dst, %table, %idx : (!pto.tile<RxCxT>, !pto.memref<...>, !pto.tile<1xRxi32>)
Element coalesce:
mgather.elem %dst, %table, %idx : (!pto.tile<RxCxT>, !pto.memref<...>, !pto.tile<RxCxi32>)
OOB-aware variants append the mode suffix (mgather.row.clamp, mgather.elem.zero, etc.).
AS Level 1 (SSA)¶
%dst = pto.mgather %mem, %idx : (!pto.partition_tensor_view<MxNxdtype>, pto.tile<...>)
-> !pto.tile<loc, dtype, rows, cols, blayout, slayout, fractal, pad>
AS Level 2 (DPS)¶
pto.mgather ins(%mem, %idx : !pto.partition_tensor_view<MxNxdtype>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
C++ Intrinsic¶
Declared in include/pto/common/pto_instr.hpp (the shared dispatcher) and the per-target implementation headers (include/pto/cpu/MGatherScatter.hpp, include/pto/npu/a2a3/MGather.hpp, include/pto/npu/a5/MGather.hpp).
CPU Reference Form¶
template <Coalesce Mode = Coalesce::Row,
GatherOOB Oob = GatherOOB::Undefined,
typename TileDst, typename GlobalData, typename TileInd, typename... WaitEvents>
PTO_INST RecordEvent MGATHER(TileDst &dst, GlobalData &src, TileInd &indexes, WaitEvents &... events);
The CPU form mirrors the A5 signature and defaults, so a non-templated MGATHER(dst, src, idx) resolves to Coalesce::Row on both sim and hardware and one kernel source validates against one golden. Row mode reads src[idx[r], :] into dst[r, :] with tableRows = Shape[3] and row stride Shape[4]; Elem mode walks validRow × validCol and reads dst[i, j] = src[idx[i, j]]. GatherOOB is modeled the same way as hardware: Clamp/Wrap remap the (uint32-cast) index, Zero writes a zero element for out-of-range, and Undefined reads unchecked.
A2/A3 Form¶
template <Coalesce CMode = Coalesce::Row,
GatherOOB Oob = GatherOOB::Undefined,
typename TileDst, typename GlobalTable, typename TileIdx,
typename... WaitEvents>
PTO_INST RecordEvent MGATHER(TileDst& dst, GlobalTable& table, TileIdx& idx,
WaitEvents&... events);
The kernel iterates over TileDst::ValidRow * TileDst::ValidCol logical positions. Physical UB strides come from each tile's RowStride (which equals padded Cols for BLayout::RowMajor). Dispatched as a regular AIV function call — no async-launch or cross-core orchestration.
A5 Form¶
template <Coalesce CMode = Coalesce::Row,
GatherOOB Mode = GatherOOB::Undefined,
typename TileDst, typename GlobalData, typename TileInd,
typename... WaitEvents>
PTO_INST RecordEvent MGATHER(TileDst& dst, GlobalData& table, TileInd& idx,
WaitEvents&... events);
The implementation launches simt_mgather_row_kernel or simt_mgather_elem_kernel through cce::async_invoke<…>(cce::dim3{32, kLaunchWarps}, …). UB addressing inside the SIMT kernel goes through tile_offset_2d<TileX>(r, c), which makes the kernel layout-agnostic for UB tiles. The degenerate Elem (1, 1) case bypasses the SIMT launch and runs MGatherScalarImpl on the AIV vector core.
Parameters (NPU forms)¶
dst— UB destination tile (TileType::Vec); shape[R, C].table— source GMGlobalTensor.GlobalTensor::DTypemust be__gm__ Tmatching the destination element type.idx— UB index tile (TileType::Vec).CMode—Coalescevalue (RoworElem). First template parameter, so the operating mode is always explicit at the call site.Oob/Mode—GatherOOBvalue for out-of-bounds handling.
Enums¶
enum class Coalesce : uint8_t {
Row = 0, // dst[r, :] = table[idx[r], :] (1-D index of length R)
Elem = 1 // dst[i, j] = table[idx[i, j]] (idx shape == dst shape)
};
enum class GatherOOB : uint8_t {
Undefined = 0,
Clamp = 1,
Wrap = 2,
Zero = 3
};
Constraints¶
The constraints below are split into target-specific sections so each backend lists exactly what its implementation enforces. Symbols that are common across targets (T = TileDst::DType, TIdx = TileIdx::DType) are reused.
Tile Constraints (CPU)¶
Supported data types:
dst/srcelement type must be one of:int8_t,uint8_t,int16_t,uint16_t,int32_t,uint32_t,half,bfloat16_t,float.- On AICore targets (CPU simulator compiled with
__CCE_AICORE__),float8_e4m3_tandfloat8_e5m2_tare also supported. indexeselement type must beint32_toruint32_t.
Tile and memory types:
dstmust be a vector tile (TileType::Vec).indexesmust be a vector tile (TileType::Vec).dstandindexesmust use row-major layout (BLayout::RowMajor + SLayout::NoneBox).srcmust be aGlobalTensorin GM memory.srcmust useLayout::ND.
Shape constraints:
dst.Rows == indexes.Rows(the index tile and destination tile share the row count).indexesmust be shaped as[N, 1]for row-indexed gather or[N, M]for element-indexed gather.dstrow width must be 32-byte aligned, that is,dst.Cols * sizeof(TileDst::DType)must be a multiple of 32.srcstatic shape must satisfyShape<1, 1, 1, TableRows, RowWidth>.
Index interpretation:
- Index interpretation follows the
Coalescemode, with the sameCoalesce::Rowdefault as hardware.Rowtreatsidx[r]as a logical row index into a[TableRows, Shape[4]]table;Elemtreatsidx[i, j]as a linear element index intosrc.data(). - The CPU simulator models
GatherOOB:Clamp/Wrapremap the index,ZerowritesT(0)for out-of-range, andUndefined(the default) readssrc.data()[idx]unchecked (target-defined). Indices are cast touint32_tfirst, so negative indices resolve as large values.
Header-level enforcement. The CPU header (include/pto/cpu/MGatherScatter.hpp) static-asserts only the minimum closure rules: std::is_integral_v<TileInd::DType> for the index dtype and sizeof(TileDst::DType) == sizeof(GlobalData::DType) for byte-wise compatibility. The dtype / shape / layout constraints above are the PTO ABI contract — callers are expected to honor them so the same kernel source compiles and runs unmodified against the A2/A3 and A5 backends.
Tile Constraints (A2/A3)¶
Supported data types:
dst/srcelement type must be one of:int8_t,uint8_t,int16_t,uint16_t,int32_t,uint32_t,half,bfloat16_t,float. Nofloat8_e4m3_t/float8_e5m2_t/hifloat8_ton the A2/A3 vec-core.indexeselement type must beint32_toruint32_t.
Tile and memory types:
dstmust be a vector tile (TileDst::Loc == TileType::Vec).indexesmust be a vector tile (TileIdx::Loc == TileType::Vec).- The index tile is always
BLayout::RowMajor + SLayout::NoneBox(ND), regardless of the table layout. srcmust be aGlobalTensorin GM memory;GlobalTable::DType == __gm__ T.- The destination tile's bulk + sub layout must be paired with the table layout exactly:
GlobalTable::layout == Layout::ND⇒TileDstisBLayout::RowMajor + SLayout::NoneBox.GlobalTable::layout == Layout::NZ⇒TileDstisBLayout::ColMajor + SLayout::RowMajor + SFractalSize == TileConfig::fractalABSize(= 512 B).
Shape constraints:
- Padded
TileDst::Cols * sizeof(T)must be 32-byte aligned in both layouts (the same DMA-burst rule thatTLOAD/TSTOREenforce).ValidRow/ValidColare not constrained by this rule. - For
Coalesce::Row:TileIdx::ValidRow == 1andTileIdx::ValidCol == TileDst::ValidRow. - For
Coalesce::Elem:TileIdx::ValidRow == TileDst::ValidRowandTileIdx::ValidCol == TileDst::ValidCol. - Both modes require
TileDst::ValidRow >= 1andTileDst::ValidCol >= 1. - NZ tables additionally require
GlobalTable::staticShape[3] == FRACTAL_NZ_ROW(= 16),GlobalTable::staticShape[4] == C0_SIZE_BYTE / sizeof(T)(= 32 B / element width),TileDst::Cols % (C0_SIZE_BYTE / sizeof(T)) == 0, andTileDst::Rows % FRACTAL_NZ_ROW == 0.
Index interpretation:
- For
Coalesce::Row, each index is treated as a logical row index into the[TableRows, RowWidth]ND table (or as a logical row acrossgShape2 * FRACTAL_NZ_ROWfor NZ). - For
Coalesce::Elem, each index is treated as a linear element index into the flat 5-D table (tableSize = ∏ shape[0..4]). For NZ the kernel splits the linear index into(logicalRow = idx / nLogicalCols, logicalCol = idx % nLogicalCols)and applies the NZ block-stride translation throughMGatherNZGmOffset. - OOB handling follows the
GatherOOBtemplate parameter (Undefined/Clamp/Wrap/Zero); see the Out-of-Bounds subsection above.
Tile Constraints (A5)¶
Supported data types:
dst/srcelement type must be one of:int8_t,uint8_t,int16_t,uint16_t,int32_t,uint32_t,half,bfloat16_t,float. On__CCE_AICORE__builds the list also includeshifloat8_t,float8_e4m3_t,float8_e5m2_t.indexeselement type must beint32_toruint32_t.
Tile and memory types:
dstmust be a vector tile (TileDst::Loc == TileType::Vec).indexesmust be a vector tile (TileIdx::Loc == TileType::Vec).- The SIMT kernel is layout-agnostic for UB tiles: every UB read/write goes through
tile_offset_2d<TileX>(r, c), soTileDstmay beBLayout::RowMajororBLayout::ColMajor(withSLayout::NoneBox). srcmust be aGlobalTensorin GM memory;GlobalTable::DType == __gm__ T.- GM table layout:
Layout::NDonly. A5 SIMT kernels address GM as a flat row-major buffer with row stride hard-wired tovalidCols(srcRow = table + safeIdx * validCols);MGatherCheckenforcesGlobalTable::staticShape[4] == TileDst::ValidColso the table cannot have any inter-row padding. The kernel does not honor any otherGlobalTensor::layoutvalue —MGatherCheckdoes not static-assert the GMlayoutfield, but usingLayout::NZ(or anything other than packed ND with row width =validCols) produces wrong output because the kernel never performs block-stride translation. Callers that need NZ data must pre-stage into ND.
Shape constraints:
- Padded
TileDst::Cols * sizeof(T)(RowMajor) orTileDst::Rows * sizeof(T)(ColMajor) must be 32-byte aligned — this is an upstreamTLOAD/TSTORErequirement, not aMGATHERconstraint per se.ValidRow/ValidColare not constrained by this rule. - For
Coalesce::Row: the index tile's valid shape is[1, R](BLayout::RowMajor) or[R, 1](BLayout::ColMajor). Both forms produce a linear R-element layout in UB and the kernel readsidx[row]directly. The choice ofBLayoutfor the index tile is independent of the GM table layout —MGatherCheckdoes not constrain the GMlayoutfield for the table. - For
Coalesce::Elem:TileIdx::ValidRow == TileDst::ValidRowandTileIdx::ValidCol == TileDst::ValidCol. TheBLayoutofTileIdxis independent ofTileDst's — the kernel walks both through per-tiletile_offset_2d. - Both modes require
TileDst::ValidRow >= 1andTileDst::ValidCol >= 1. The degenerate(1, 1)shape in Elem mode bypasses the SIMT launch and runsMGatherScalarImplon the AIV vector core.
Index interpretation:
- For
Coalesce::Row, each index is treated as a logical row index into the[TableRows, validCols]ND table; the kernel computessrcRow = table + safeIdx * validCols. - For
Coalesce::Elem, each index is treated as a linear element index into the flat 5-D table (tableSize = ∏ shape[0..4]); the kernel readstable[safeIdx]. - OOB handling follows the
GatherOOBtemplate parameter (Undefined/Clamp/Wrap/Zero); see the Out-of-Bounds subsection above.
Dynamic Runtime Shapes (A2/A3 and A5)¶
MGATHER accepts both compile-time and runtime-dynamic shapes:
Tile<…, RowMask, ColMask>withRowMask == -1and/orColMask == -1stores the runtime valid extents in the tile object; the implementation reads them throughdst.GetValidRow()/dst.GetValidCol().Shape<S0, S1, S2, S3, S4>/Stride<…>with one or more-1entries are constructed with the runtime sizes; the implementation reads them throughtable.GetShape(GlobalTensorDim::DIM_*)and folds them intotableRows(Row mode) ortableSize = ∏ shape[0..4](Elem mode).
Static-asserts in MGatherCheck are gated on if constexpr (DIM > 0), so they fire only for compile-time-known dimensions. Padded Tile::Rows / Cols are always compile-time — they govern the UB DMA-burst alignment and the SIMT lane addressing.
Example (mirrors case_elem2d_dyn_user_float_1x9_in_1x16_3x10):
constexpr auto kPadCols = 16;
using DstTileT = Tile<TileType::Vec, float, 1, kPadCols, BLayout::RowMajor, -1, -1>;
using IdxTileT = Tile<TileType::Vec, int32_t, 1, kPadCols, BLayout::RowMajor, -1, -1>;
using TableShape = Shape<1, 1, 1, -1, -1>;
using TableStride = Stride<1, 1, 1, -1, -1>;
int64_t validCols = 9, d3 = 3, d4 = 10, srcStride3 = 10;
TableShape tableShape(d3, d4);
TableStride tableStride(srcStride3, (int64_t)1);
GlobalTensor<float, TableShape, TableStride> tableGM(srcGm, tableShape, tableStride);
DstTileT dstTile(1, validCols);
IdxTileT idxTile(1, validCols);
TASSIGN(dstTile, dstUbOffsetBytes);
TASSIGN(idxTile, idxUbOffsetBytes);
MGATHER<Coalesce::Elem, GatherOOB::Undefined>(dstTile, tableGM, idxTile);
At dispatch the implementation resolves validRows = 1, validCols = 9, and tableSize = 1·1·1·3·10 = 30. The padded UB Tile::Cols = 16 is purely a TLOAD burst-alignment artifact — the gather loop walks only the valid 9 elements.
Mode Resolution¶
Mode is explicit on A2/A3 and A5, never auto-detected. Static asserts in MGatherCheck validate the supplied tile shapes against the chosen Coalesce value:
A2/A3:
Coalesce::Row : Idx.ValidRow == 1 && Idx.ValidCol == Dst.ValidRow
Coalesce::Elem : Idx.ValidRow == Dst.ValidRow && Idx.ValidCol == Dst.ValidCol
A5:
Coalesce::Row : (Idx.ValidRow == 1 && Idx.ValidCol == Dst.ValidRow) ||
(Idx.ValidRow == Dst.ValidRow && Idx.ValidCol == 1)
Coalesce::Elem : (Idx.ValidRow == Dst.ValidRow) && (Idx.ValidCol == Dst.ValidCol)
The CPU reference path enforces the same A5 Row rule ([1, R] or [R, 1] index valid shape) when valid shapes are statically known; Elem walks element-wise over the destination valid region.
Layout Support¶
UB addressing is computed from each tile's Rows / Cols plus (on A2/A3 NZ) an optional fractal block-col stride; GM addressing is driven from GlobalTensor::GetStride(DIM_*). The matrix below summarises what each backend accepts.
| Tile / Tensor | CPU | A2/A3 | A5 |
|---|---|---|---|
TileDst (UB) — ND |
BLayout::RowMajor + SLayout::NoneBox only |
BLayout::RowMajor + SLayout::NoneBox |
BLayout::RowMajor or ColMajor, SLayout::NoneBox (Elem mode walks both via tile_offset_2d) |
TileDst (UB) — NZ |
not supported | BLayout::ColMajor + SLayout::RowMajor + SFractalSize == 512 (paired with NZ GM table) |
not supported |
TileIdx (UB) — Row |
[1, R] or [R, 1] row-major (same valid-shape rule as A5) |
[1, R] BLayout::RowMajor + SLayout::NoneBox (always ND, regardless of table layout) |
[1, R] RowMajor or [R, 1] ColMajor (independent of GM table layout) |
TileIdx (UB) — Elem |
BLayout::RowMajor + SLayout::NoneBox |
[R, C] BLayout::RowMajor + SLayout::NoneBox |
any BLayout, independent of TileDst (kernel reads through tile_offset_2d<TileIdx>) |
GlobalTable (GM) — ND |
Layout::ND only |
Layout::ND (linear contiguous addressing); 5-D Shape<…, R, C>; kernel uses tableRowStride = GetStride(DIM_3), so stride-padded ND tables are supported |
Layout::ND only; Row mode hard-wires tableRowStride = validCols and requires Shape[4] == validCols |
GlobalTable (GM) — NZ |
not supported | Layout::NZ; 5-D Shape<B, BCols, BRows, 16, C0> with staticShape[3] == 16 and staticShape[4] == 32 / sizeof(T) (B may be > 1; the kernel loops for i in 0..gShape0) |
not supported |
NZ Layout (A2/A3)¶
NZ paths exist only on A2/A3. The A5 SIMT kernel addresses GM as a flat ND buffer and has no NZ block-stride translation — there is no "inherent SIMT NZ support" to be enabled. Callers that need NZ data on A5 must transpose / repack into ND first.
When GlobalTable::layout == Layout::NZ and TileDst is the matching BLayout::ColMajor + SLayout::RowMajor + SFractalSize = 512 tile, MGATHER (A2/A3) runs the dedicated NZ paths (MGatherRowNzImpl, MGatherElemNzImpl).
- Constants.
kC0 = C0_SIZE_BYTE / sizeof(T) = 32 / sizeof(T);kFRow = FRACTAL_NZ_ROW = 16. Each fractal block iskFRow × kC0elements (= 32 B × 16 = 512 B). - Logical shape. Logical rows =
gShape2 * kFRow. Logical cols =gShape0 * gShape1 * kC0. Row-modemgather_remapclamps / wraps against the logical row count; Elem-mode against the total element count(gShape2 * kFRow) * (gShape0 * gShape1 * kC0). - Row mode. For each logical row
r, the kernel mapsidx[r]to(srcBlockRow, srcRowInBlock)andrto(dstBlockRow, dstRowInBlock), then issues one multi-burst MTE2 transfer per outer batch (nBurst = gShape1,lenBurst = kC0 * sizeof(T) = 32 B,gmGap = (gStride1 - kC0) * sizeof(T),ubGap = TileDst::Rows - 1blocks). WhenOob == GatherOOB::Zerothe kernel pre-fills the whole tile withT(0)once before the DMA loop and simply skips DMAs for OOB rows. - Elem mode. For each
(r, c)the kernel mapsidxto(logicalRow, logicalCol) = (idx / nLogicalCols, idx % nLogicalCols), then to NZ physical offsets throughMGatherNZGmOffset. The destination offset is(c / kC0) * (TileDst::Rows * kC0) + r * kC0 + (c % kC0). The walk order is block-col → row → col-in-block so consecutive writes always target consecutive 32 B UB blocks. Out-of-bounds lanes writeT(0)inline whenOob == GatherOOB::Zero. - Stride vs. valid-shape.
MGather*NzImplreads strides from theGlobalTensorruntime (GetStride(DIM_*)), so packed (gStride1 == gShape2 * gShape3 * gShape4) and stride-padded NZ tensors both work without caller-side adjustment.
Why Elem Mode Uses Scalar GM Reads (A2/A3)¶
copy_gm_to_ubuf_align_b8/b16/b32 requires the UB destination address to be 32-byte aligned, and the destination must be a whole number of 32-byte burst chunks. A per-element MTE2 burst of lenBurst = sizeof(T) to dstPtr + r * RowStride + c does not satisfy that rule whenever (c * sizeof(T)) % 32 != 0, which covers almost every elem-mode lane. On the simulator the runtime accepts the misaligned burst; on real A2/A3 hardware the transfer silently drops, leaving the destination lane at its initial value (typically zero). Row mode does not hit this problem because each row write starts at r * RowStride, and RowStride * sizeof(T) is always a multiple of 32 bytes.
A2/A3 Elem mode therefore uses scalar GM→UB copies, which have element-level addressing granularity and place no alignment requirement on the destination. Atomic semantics are not needed for gather (no destination is written from multiple sources), and OOB::Zero collapses into a direct scalar zero-write. The scalar GM↔UB path is the same (1, 1) fallback already validated on hardware, extended to all elem shapes.
A5 does not face this constraint: the SIMT lane issues per-thread loads instead of a UB-aligned DMA burst, so Elem mode on A5 is naturally per-element with no extra alignment workaround.
Aligned vs Unaligned Tile Shapes¶
The kernel does not care whether the tile's logical shape is "aligned" — it walks all ValidRow * ValidCol positions. The 32-byte alignment of the contiguous dim is enforced upstream by the Tile system because every TLOAD / TSTORE issues 32 B GM↔UB bursts.
- Row mode (A2/A3 ND). per-row DMA
lenBurst = validCol * sizeof(T);Tile::RowStride * sizeof(T)is forced 32-byte aligned by theTilesystem, so subsequent rows always start on a 32 B burst boundary. - Row mode (A2/A3 NZ). one multi-burst transfer per logical row × outer-batch;
lenBurst = kC0 * sizeof(T) = 32 Bis fixed by the fractal layout, so per-row alignment is automatic.validRowdoes not have to be a multiple ofkFRow. - Elem mode (A2/A3). one scalar GM→UB copy per element. The scalar pipe has element-level addressing granularity, so any
(ValidRow, ValidCol)inside the padded tile works. - Row + Elem mode (A5). the SIMT kernel walks through
tile_offset_2d<TileX>(r, c)so any(ValidRow, ValidCol)with1 ≤ Valid ≤ Paddedis accepted, including the degenerate(1, 1)which routes toMGatherScalarImpl.
Callers handle "unaligned valid region" by padding the tile up to the nearest 32-byte alignment (for example valid [3, 3] int32 → tile [3, 8]), and either zero-initializing the padding (TASSIGN-then-clear) or only inspecting the valid region post-gather.
Minimum Tile Shape¶
The minimum padded inner dim is set by the upstream TLOAD / TSTORE burst-alignment rule; MGATHER itself accepts any (ValidRow, ValidCol) >= (1, 1).
A2/A3 — row-major tile only (NZ paths use fractal [16, kC0] blocks instead):
T |
Min Cols (BLayout::RowMajor) |
|---|---|
int8 / uint8 |
32 |
int16 / uint16 / half / bfloat16 |
16 |
int32 / uint32 / float |
8 |
A5 — row-major or column-major tile (the contiguous dim must satisfy the 32 B rule):
T |
Min Cols (RowMajor) |
Min Rows (ColMajor) |
|---|---|---|
int8 / uint8 / float8_e4m3 / float8_e5m2 / hifloat8 |
32 | 32 |
int16 / uint16 / half / bfloat16 |
16 | 16 |
int32 / uint32 / float |
8 | 8 |
CPU — same row-major min rule as A2/A3 (any Cols * sizeof(T) that the test's TLOAD path requires).
Pipe / Synchronisation Model¶
The two NPU targets use very different mechanisms; the CPU reference path has no pipes to synchronise.
A2/A3 — explicit pipe handshakes¶
The implementation centralises every pipe handshake the kernel needs. Callers do not need to insert any extra barriers beyond the standard TLOAD post-load set_flag(PIPE_MTE2, PIPE_V) / wait_flag(PIPE_MTE2, PIPE_V) pair that brings the index tile into a clean state on the vector pipe before MGATHER. The kernel never uses pipe_barrier(PIPE_ALL) — every wait is a specific producer→consumer pair.
| Phase | Pipe transition | What it guards |
|---|---|---|
| Pre-amble (Row ND + Row NZ) | V→S, MTE3→S flag chain |
Make the index tile visible to scalar reads (V→S transitively waits for MTE2 through the caller's TLOAD post-load MTE2→V flag) and flush any pending MTE3 writes that might overlap UB before the scalar loop starts. |
| Pre-amble (Elem ND + Elem NZ) | V→S, MTE3→S, MTE2→S flag chain |
Same as Row, with an additional MTE2→S flag that flushes any in-flight MTE2 burst before the scalar loop reads idxPtr[r * IdxRowStride + c]. |
| Body (Row, ND) | copy_gm_to_ubuf_align_b* per row |
One DMA per row, lenBurst = validCol * sizeof(T); trip count = validRow. |
| Body (Row, NZ) | copy_gm_to_ubuf_align_b* multi-burst per logical row × batch |
nBurst = gShape1, lenBurst = C0 * sizeof(T) = 32 B, gmGap = (gStride1 - C0) * sizeof(T), ubGap = Tile::Rows - 1 blocks; trip count = validRow * gShape0. |
| Body (Elem, ND and NZ) | Scalar dstUb[r, c] = tableGm[gmOff] per element |
Per-element scalar GM→UB copy; trip count = validRow * validCol. NZ walks block-col-major to keep each 32 B UB block written contiguously in time. OOB::Zero lanes write T(0) inline. |
| Row post-amble (ND + NZ) | S→MTE2, MTE2→V, MTE2→MTE3, S→V, S→MTE3 flag chain (each pair is a set_flag + wait_flag) |
First close the scalar→MTE2 race so any in-flight DMA observes the loop's final scalar state; then drain the MTE2 DMAs before V or MTE3 consumers touch the destination tile; finally release the scalar pipe to V and MTE3 for downstream operators (e.g. a follow-up vector op or TSTORE). |
| Elem post-amble (ND + NZ) | S→V, S→MTE2, S→MTE3 flag chain |
Make the scalar UB writes visible to V (for downstream vector ops), MTE2 (for follow-up gathers / loads), and MTE3 (bridges to the caller's TSTORE). |
A5 — SIMT launch with V↔S handshake¶
The A5 implementation hides almost the entire pipe model behind cce::async_invoke, which establishes the warp-scheduler context and orchestrates per-lane GM and UB accesses. The only explicit kernel-side handshake is in the scalar fallback path (MGatherScalarImpl, used for Elem (1, 1)):
| Phase | Pipe transition | What it guards |
|---|---|---|
| Pre-amble (scalar fallback) | set_flag(PIPE_V, PIPE_S) / wait_flag(PIPE_V, PIPE_S) |
Make the index tile visible to the scalar pipe before the single-element gather. |
| Body (SIMT Row / Elem) | cce::async_invoke<simt_mgather_*_kernel>(dim3{32, kLaunchWarps}, …) |
The SIMT launch handles all per-lane GM loads and UB stores internally; no caller-visible flags. |
| Post-amble (scalar fallback) | set_flag(PIPE_S, PIPE_V) / wait_flag(PIPE_S, PIPE_V) |
Release the scalar UB write to downstream vector ops. |
Caller responsibility on A5. The same TLOAD post-load set_flag(PIPE_MTE2, PIPE_V) / wait_flag(PIPE_MTE2, PIPE_V) pair is required before MGATHER so the index tile is observable on the vector pipe by the time the SIMT launch begins. Downstream consumers (TSTORE, vector ops on the destination) only need their normal V↔MTE3 / V↔V handshake — the SIMT launch is treated as a vector-pipe producer.
UB Memory Budget¶
The unified buffer is shared between user tiles, the runtime reserved region, and the Data Cache. Budget rules differ per target.
CPU simulator¶
No UB — the implementation runs in host memory. Tile sizes are bounded by the test harness only.
A2/A3¶
The AIV vector core has the standard CANN 192 KB UB layout. MGATHER does not allocate any UB scratch from inside the kernel — the only UB consumers are the caller-allocated destination tile (R * C * sizeof(T), padded up to the 32-byte burst alignment) and index tile (R * C * sizeof(TIdx), same padding rule). A2/A3 has no dynUBufSize knob; the working set must fit in the caller's static UB budget.
A5¶
A5 SIMT kernels run on the AIV vector core. All user tiles must fit inside the AIV's 256 KB Unified Buffer alongside two fixed runtime reservations: an 8 KB reserved region (AscendC / TBE bookkeeping) and the Data Cache (32 KB minimum, sized at launch time). The UB layout is:
+---------------------------+
| Static memory | Compile-time tile allocations
+---------------------------+
| Dynamic memory | Sized at launch through dynUBufSize
+---------------------------+
| Reserved (8 KB) | Fixed compiler / AscendC reservation
+---------------------------+
| Data Cache (>= 32 KB) | Min 32 KB; grows when dynUBufSize is small
+---------------------------+
The configurable maximum is therefore:
max dynUBufSize = 256 KB - 8 KB (reserved) - 32 KB (min DCache) - static_memory
= 216 KB - static_memory
When tiles are placed manually with TASSIGN (as in the A5 ST suite), the compiler sees static_memory ≈ 0 and the full 216 KB is available as dynUBufSize. MGATHER itself does not allocate any UB scratch — every read flows GM → register → UB. The only UB consumers are the destination and index tiles.
Default Per-Call Budget (No dynUBufSize)¶
When the kernel is launched without an explicit dynUBufSize (<<<numBlocks, nullptr, stream>>>), the runtime keeps the default DCache size and reserves only a small default dynamic region. In practice the safe dst + idx working set is ≤ 128 KB; beyond that, on-board execution may silently corrupt or zero out the result while still passing the CPU simulator (which does not model these reservations).
Extending Per-Call UB Beyond 128 KB¶
Callers that need a single-shot dst + idx footprint larger than 128 KB must declare the dynamic-UB request explicitly through the second argument of the kernel launch:
kernel_name<<<numBlocks, dynUBufSize, stream>>>(args...);
dynUBufSize is the byte size of the dynamic-UB region the kernel will use. The bisheng/CCE compiler routes such launches through __cce_rtKernelLaunchWithFlagV2, setting rtTaskCfgInfo_t::localMemorySize = dynUBufSize. The runtime then shrinks the DCache toward its 32 KB minimum and hands the remaining space back to the kernel.
Key points:
- The simulator does not enforce this. Passing
nullptr(or0) still runs to completion in sim regardless of the actual UB footprint. Always setdynUBufSizeexplicitly when the workload exceeds 128 KB so the binary stays correct on real hardware. - Exceeding the ceiling is silent. The compiler does not error and the simulator does not flag it. On-board, the first overflow byte corrupts the reserved region or DCache and the kernel returns undefined output.
- Size to actual usage. For Elem coalesce with
R × C × sizeof(T)destination andR × C × sizeof(int32_t)index, the working set isR * C * (sizeof(T) + 4). Round up to a comfortable margin when passingdynUBufSize.
Example launch wrapper for an extended-UB gather:
// Round dst + idx up to a safe dynUBufSize (here T = float, idx = int32_t, so
// per-element footprint = 4 + 4 = 8 B; pass R * C * 8 with some headroom).
constexpr uint32_t kDynUbBytes = (uint32_t)(R * C * (sizeof(T) + sizeof(int32_t)));
runMGATHER_kernel<<<numBlocks, kDynUbBytes, stream>>>(out, table, indices);
The current MGATHER ST suite does not include extended-UB cases — every existing case fits in the default budget — but workloads that exceed 128 KB should follow this pattern (see MSCATTER.md for worked examples that push the destination + index footprint up to 216 KB).
Runtime Dispatch Requirement (A5)¶
MGATHER on A5 uses cce::async_invoke<simt_mgather_*_kernel>(cce::dim3{32, kLaunchWarps}, …) internally to fan a per-warp / per-lane workload out across up to 1024 threads. async_invoke consumes runtime state (TID registers, warp / lane configuration, vector-pipe scheduling) that the launch path must install before the kernel function is entered. The standard CANN launch (rtKernelLaunchWithHandleV2, used by the <<<numBlocks, dynUBufSize, stream>>> syntax) installs this state correctly.
A runtime variant that dispatches kernels as a direct C function-pointer call is fine for SPMD ops (TLOAD, TSTORE, TADD, …) but skips the SIMT context init, so the first async_invoke inside MGATHER has no warp scheduler to dispatch into and hangs. Use the standard launch syntax for any SIMT kernel.
Performance Considerations¶
A2/A3¶
- Row vs. Elem. Row coalesce achieves the best aggregate bandwidth — one wide DMA per logical row (ND) or one multi-burst DMA per logical row × batch (NZ). Elem coalesce issues one scalar GM read + UB write per active lane: no DMA-engine pipelining, throughput bound by scalar GM access latency. Prefer Row whenever the indexing structure permits.
- Sequential scalar loop (Elem). A2/A3 dispatches
MGATHERas a single-thread sequential walk of thevalidRow * validCollanes. Trip counts ofValidCol ≤ 32 / sizeof(T)rows are the sweet spot; large flat tiles are bound by scalar GM read latency. The block-col-major walk used for NZ keeps consecutive writes spatially-local in UB. - Why not per-element MTE2 in Elem mode. See "Why Elem mode uses scalar GM reads" above — the UB-destination 32 B alignment rule rules out per-element DMA bursts.
- DMA cost (Row). ND: each row is one
copy_gm_to_ubuf_align_b*call withnBurst = 1,lenBurst = validCol * sizeof(T). NZ: each (logical row, batch) pair is one call withnBurst = gShape1,lenBurst = C0 * sizeof(T) = 32 B,gmGap = (gStride1 - C0) * sizeof(T),ubGap = Tile::Rows - 1blocks. Back-pressure is bounded byMAX_OUTSTANDING_MTE2; no row chunking. - OOB cost.
Undefinedis free;Clamp/Wrapadd a single arithmetic remap per lane;ZerowritesT(0)inline (Elem) or pre-zeros the tile and skips DMAs for OOB rows (Row). - Single-pass dispatch.
MGATHERis a regular AIV function call from the kernel (no async-launch or cross-core orchestration). Concurrency comes from DMA engine pipelining behind the scalar issue loop, not from multiple worker threads.
A5¶
- Shape-adaptive launch. The SIMT grid is sized as
dim3{32, kLaunchWarps}from the resolvedvalidRows/validCols. Small tiles do not pay the cost of 1024 idle threads.- Row.
kRowWarps = min(validRows, 32)warps own rows;kWarpsPerRow = min(32 / kRowWarps, ceil(validCols / 32))cooperate on each row's column chunks. When consecutive lanes in a warp read consecutivecolvalues from the samesrcRow, the SIMT hardware coalesces them into a 128 B GM burst. - Elem.
kLaunchWarps = min(ceil(validRows*validCols / 32), 32). Threads withtid >= totalElemsskip the loop body. FortotalElems > 1024the strided loop walkslaunchThreadsat a time.
- Row.
- OOB policy cost.
Undefined: zero overhead.Clamp/Wrap: a single arithmetic remap per lane.Zero: one extra compare-and-select per lane to substitutestatic_cast<T>(0)for OOB lanes. - No thread divergence for mode / OOB. All decisions are
if constexpr. Thegather_remaplookup compiles to a small data-dependent transform with no control-flow split. - Unrolled inner loops. Inner column loop in Row coalesce carries
#pragma unroll(4); outer per-row and elem flat loops are#pragma unroll(1)to keep code size bounded. - Row vs. Elem bandwidth. Row coalesce achieves the best aggregate bandwidth (per-warp 32 lanes coalesce a single 128 B GM burst when
idx[r]is the same across the warp). Elem coalesce performs one scalar GM load per lane — non-coalesced for random indices; UB writes remain coalesced because consecutivetidmap to consecutive UB offsets. - Register pressure. Kernels carry
LAUNCH_BOUND(1024)(32 regs/thread) and use ≤ 12 live registers in the hot path. No spills are produced.
Examples¶
Auto¶
#include <pto/pto-inst.hpp>
using namespace pto;
void example_auto() {
using DstT = Tile<TileType::Vec, float, 16, 16>;
using IdxT = Tile<TileType::Vec, int32_t, 16, 16>;
DstT dst;
IdxT idx;
// src is a GlobalTensor in GM
MGATHER(dst, src, idx);
}
Manual¶
#include <pto/pto-inst.hpp>
using namespace pto;
void example_manual() {
using DstT = Tile<TileType::Vec, float, 16, 16>;
using IdxT = Tile<TileType::Vec, int32_t, 16, 16>;
DstT dst;
IdxT idx;
TASSIGN(dst, 0x1000);
TASSIGN(idx, 0x2000);
MGATHER(dst, src, idx);
}
Row Coalesce — Embedding Lookup (A2/A3 or A5)¶
#include <pto/pto-inst.hpp>
using namespace pto;
template <typename T, int R, int C, int TableRows>
AICORE void example_embedding_lookup(__gm__ T* tablePtr, __gm__ int32_t* idxPtr, __gm__ T* outPtr)
{
using DstTile = Tile<TileType::Vec, T, R, C, BLayout::RowMajor, R, C>;
using IdxTile = Tile<TileType::Vec, int32_t, 1, R, BLayout::RowMajor, 1, R>;
using TableShape = Shape<1, 1, 1, TableRows, C>;
using TableStride = Stride<1, 1, 1, C, 1>;
using TableTensor = GlobalTensor<T, TableShape, TableStride>;
using IdxShape = Shape<1, 1, 1, 1, R>;
using IdxStride = Stride<1, 1, 1, R, 1>;
using IdxTensor = GlobalTensor<int32_t, IdxShape, IdxStride>;
TableTensor tableGM(tablePtr);
IdxTensor idxGM(idxPtr);
DstTile dst; TASSIGN(dst, 0x0000);
IdxTile idx; TASSIGN(idx, 0x1000);
TLOAD(idx, idxGM);
set_flag(PIPE_MTE2, PIPE_V, EVENT_ID0);
wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID0);
MGATHER<Coalesce::Row, GatherOOB::Clamp>(dst, tableGM, idx);
}
Element Coalesce — 2-D Random Access¶
AICORE void example_elem_2d(__gm__ float* tablePtr, __gm__ int32_t* idxPtr)
{
constexpr int R = 8, C = 32, TableSize = 256;
using DstTile = Tile<TileType::Vec, float, R, C, BLayout::RowMajor, R, C>;
using IdxTile = Tile<TileType::Vec, int32_t, R, C, BLayout::RowMajor, R, C>;
using TableShape = Shape<1, 1, 1, 1, TableSize>;
using TableStride = Stride<1, 1, 1, TableSize, 1>;
using TableTensor = GlobalTensor<float, TableShape, TableStride>;
TableTensor tableGM(tablePtr);
DstTile dst; TASSIGN(dst, 0x0000);
IdxTile idx; TASSIGN(idx, 0x0800);
MGATHER<Coalesce::Elem, GatherOOB::Wrap>(dst, tableGM, idx);
}
Row Coalesce — [R, 1] ColMajor Index (A5 only)¶
AICORE void example_row_colidx(__gm__ half* tablePtr, __gm__ int32_t* idxPtr)
{
constexpr int R = 8, C = 64, TableRows = 64;
using DstTile = Tile<TileType::Vec, half, R, C, BLayout::RowMajor, R, C>;
using IdxTile = Tile<TileType::Vec, int32_t, R, 1, BLayout::ColMajor, R, 1>;
using TableShape = Shape<1, 1, 1, TableRows, C>;
using TableStride = Stride<1, 1, 1, C, 1>;
using TableTensor = GlobalTensor<half, TableShape, TableStride>;
using IdxShape = Shape<1, 1, 1, R, 1>;
using IdxStride = Stride<1, 1, 1, 1, 1>;
using IdxTensor = GlobalTensor<int32_t, IdxShape, IdxStride, Layout::DN>;
TableTensor tableGM(tablePtr);
IdxTensor idxGM(idxPtr);
DstTile dst; TASSIGN(dst, 0x0000);
IdxTile idx; TASSIGN(idx, 0x1000);
TLOAD(idx, idxGM);
set_flag(PIPE_MTE2, PIPE_V, EVENT_ID0);
wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID0);
MGATHER<Coalesce::Row, GatherOOB::Undefined>(dst, tableGM, idx);
}
Element Coalesce — (1, 1) Degenerate Case¶
AICORE void example_scalar(__gm__ float* tablePtr, __gm__ int32_t* idxPtr)
{
constexpr int TableSize = 32;
using DstTile = Tile<TileType::Vec, float, 1, 8, BLayout::RowMajor, 1, 1>;
using IdxTile = Tile<TileType::Vec, int32_t, 1, 8, BLayout::RowMajor, 1, 1>;
using TableShape = Shape<1, 1, 1, 1, TableSize>;
using TableStride = Stride<1, 1, 1, TableSize, 1>;
using TableTensor = GlobalTensor<float, TableShape, TableStride>;
TableTensor tableGM(tablePtr);
DstTile dst; TASSIGN(dst, 0x0000);
IdxTile idx; TASSIGN(idx, 0x0080);
MGATHER<Coalesce::Elem>(dst, tableGM, idx);
}
ASM Form Examples¶
Auto Mode¶
# Auto mode: compiler/runtime-managed placement and scheduling.
%dst = pto.mgather %mem, %idx : (!pto.partition_tensor_view<MxNxdtype>, pto.tile<...>)
Manual Mode¶
# Manual mode: resources must be bound explicitly before issuing the instruction.
# Optional for tile operands:
# pto.tassign %arg0, @tile(0x1000)
# pto.tassign %arg1, @tile(0x2000)
%dst = pto.mgather %mem, %idx : (!pto.partition_tensor_view<MxNxdtype>, pto.tile<...>)
PTO Assembly Form¶
%dst = mgather %mem, %idx : !pto.memref<...>, !pto.tile<...> -> !pto.tile<...>
# AS Level 2 (DPS)
pto.mgather ins(%mem, %idx : !pto.partition_tensor_view<MxNxdtype>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
GM → L1 Gather (TileType::Mat destination)¶
In addition to the GM → UB gather described above, MGATHER supports gathering an
indexed selection directly into an L1 / cube TileType::Mat tile in NZ fractal layout.
This is the form a matmul consumes (the A/NZ operand produced by TLOAD(MatTile_NZ,
GlobalTensor_ND)), so a gathered table can be fed straight into TEXTRACT / TMOV →
TMATMUL without a UB round-trip. Both Coalesce::Row and Coalesce::Elem are supported,
on A2/A3 and A5, for all element dtypes and all four GatherOOB policies.
The GM → L1 path is selected automatically when TileDst::Loc == TileType::Mat; the GM → UB
behaviour is unchanged for TileType::Vec destinations.
Index source — GM¶
For GM → L1 the scalar that reads the index and issues the L1 DMA runs on the cube core
(__DAV_CUBE__). On A5 the AIC and AIV are separate cores and the AIC cannot read AIV's UB,
so a UB index tile is not portable. The GM → L1 variant therefore takes the index as a GM
GlobalTensor (int32_t / uint32_t), read by the same core that issues the L1 transfer.
This works identically on A2/A3 and A5.
API¶
Row mode reuses the 3-argument form; the dispatcher routes to the GM → L1 path because the
destination is a Mat tile:
template <Coalesce CMode = Coalesce::Row, GatherOOB Oob = GatherOOB::Undefined,
typename MatTileDst, typename GlobalTable, typename GlobalIdx>
PTO_INST RecordEvent MGATHER(MatTileDst& dst, GlobalTable& table, GlobalIdx& idx);
Elem mode needs a contiguous GM scratch workspace to stage the discrete elements into NZ layout before the bulk GM → L1 copy, so it takes a fourth operand:
template <Coalesce CMode = Coalesce::Elem, GatherOOB Oob = GatherOOB::Undefined,
typename MatTileDst, typename GlobalTable, typename GlobalIdx, typename GlobalScratch>
PTO_INST RecordEvent MGATHER(MatTileDst& dst, GlobalTable& table, GlobalIdx& idx,
GlobalScratch& scratch);
scratch is a GM GlobalTensor<T, …> of at least TileDst::Rows * TileDst::Cols elements.
Algorithm¶
- Row mode. For each logical row
r, the scalar readsidx[r]from GM, remaps it per theGatherOOBpolicy, and issues onecopy_gm_to_cbuf_multi_nd2nzper row straight from the ND table rowtable[safeIdx, :]into the NZ slotdstBase + r * kC0of the L1 tile. The ND → NZ conversion happens in-flight, exactly like the per-row slice ofTLoadGm2L1Nd2nz(A2/A3) /TLoadCubeND2NZ(A5). No GM scratch is needed. - Elem mode. The scalar gathers each
table[idx[r, c]]into the GMscratchbuffer at the NZ offset(c / kC0) * (Rows * kC0) + r * kC0 + (c % kC0)(the buffer is pre-zeroed so OOB /Zerolanes stay zero). After the scalar fill, the buffer is flushed to DDR (dcciper cache line +dsb(DSB_DDR)) so the MTE2 engine observes the scalar writes, then a single contiguouscopy_gm_to_cbuf(pto_copy_gm_to_cbuf_align_v2on A5) moves the whole NZ buffer GM → L1.
kC0 = C0_SIZE_BYTE / sizeof(T) = 32 / sizeof(T). OOB::Zero in Row mode pre-zeros the whole
L1 tile once via pto_create_cbuf_matrix and skips the DMA for OOB rows.
Constraints (MGatherCheckGm2L1, A2/A3 and A5)¶
- Dtypes. A2/A3:
int8/uint8/int16/uint16/int32/uint32/half/bfloat16/float. A5 additionally allowshifloat8/float8_e4m3/float8_e5m2. Row mode further requiressizeof(T) <= 4(the nd2nz engine handles b8/b16/b32). Elem mode supports every listed dtype (byte-wise staging). - Index.
idxmust be a GMGlobalTensorofint32_toruint32_t. - Table.
GlobalTable::DType == __gm__ TandGlobalTable::layout == Layout::ND. - Scratch (Elem only).
GlobalScratch::DType == __gm__ T. - Destination.
TileDst::Loc == TileType::Mat, NZ form (!isRowMajor && SFractal == SLayout::RowMajor && SFractalSize == TileConfig::fractalABSize= 512 B),TileDst::Cols % (C0_SIZE_BYTE / sizeof(T)) == 0, andTileDst::Rows % FRACTAL_NZ_ROW (16) == 0. - Shape-coupled checks are gated on
if constexpr (DIM > 0)/ValidRow|ValidCol > 0, so both static and runtime-dynamic shapes are accepted (Rows/Colsare always compile-time and govern NZ addressing).tableRows = ∏ Shape[0..3],tableRowStride = GetStride(DIM_3)for Row;tableSize = ∏ Shape[0..4]for Elem.
Per-arch realisation¶
- A2/A3 (vec-core, unified AI core). The scalar address computation, the
nd2nz/copy_gm_to_cbuf, and the optionalpto_create_cbuf_matrixpre-zero all run on the same core. Internal handshakes use specific producer→consumer pairs only — neverpipe_barrier(PIPE_ALL). In Row mode theOOB::Zeropre-zero (pto_create_cbuf_matrix) and thend2nzgather are both MTE2 instructions on a single in-order DMA queue, so the WAW on L1 is ordered with no extra flag; Elem mode usesdcci/dsb+S→MTE2before the bulkcopy_gm_to_cbuf. The 11-argumentpto_copy_gm_to_cbuf_multi_nd2nzform is used for Row. - A5 (separate AIC + AIV). The gather (nd2nz /
copy_gm_to_cbuf_align_v2) runs on the AIC cube core;set_mte2_nz_paraconfigures the NZ destination strides once before the Row loop. The L1 tile is a cube-only resource, so reading it back to GM (e.g. for verification or a vector consumer) goes AIC → UB (copy_cbuf_to_ubuf) → AIV → GM with an intra-block handshake, exactly liketload_mix.copy_cbuf_to_gm/copy_ubuf_to_gmare not available on the AIC cube target. Elem mode additionally offers an opt-in SIMT executor (GatherExec::Simt) that runs the gather on the AIV vector core — see A5 only — SIMT executor below.
Example — Row gather into an L1 NZ tile¶
template <typename T, int R, int C, int TableRows>
AICORE void example_gm2l1_row(__gm__ T* tablePtr, __gm__ int32_t* idxPtr)
{
using TableShape = Shape<1, 1, 1, TableRows, C>;
using TableStride = Stride<1, 1, 1, C, 1>;
using IdxShape = Shape<1, 1, 1, 1, R>;
using IdxStride = Stride<1, 1, 1, R, 1>;
GlobalTensor<T, TableShape, TableStride, Layout::ND> tableGM(tablePtr);
GlobalTensor<int32_t, IdxShape, IdxStride, Layout::ND> idxGM(idxPtr);
using DstTile = Tile<TileType::Mat, T, R, C, BLayout::ColMajor, R, C, SLayout::RowMajor, 512>;
DstTile dst; TASSIGN(dst, 0x0);
MGATHER<Coalesce::Row, GatherOOB::Clamp>(dst, tableGM, idxGM); // GM (ND) -> L1 (NZ)
}
Example — Elem gather into an L1 NZ tile (with GM scratch)¶
template <typename T, int R, int C, int TableSize>
AICORE void example_gm2l1_elem(__gm__ T* tablePtr, __gm__ int32_t* idxPtr, __gm__ T* scratchPtr)
{
using TableShape = Shape<1, 1, 1, 1, TableSize>;
using TableStride = Stride<1, 1, 1, TableSize, 1>;
using IdxShape = Shape<1, 1, 1, R, C>;
using IdxStride = Stride<1, 1, 1, C, 1>;
using ScratchShape = Shape<1, 1, 1, 1, R * C>;
using ScratchStride= Stride<1, 1, 1, R * C, 1>;
GlobalTensor<T, TableShape, TableStride, Layout::ND> tableGM(tablePtr);
GlobalTensor<int32_t, IdxShape, IdxStride, Layout::ND> idxGM(idxPtr);
GlobalTensor<T, ScratchShape, ScratchStride, Layout::ND> scratchGM(scratchPtr);
using DstTile = Tile<TileType::Mat, T, R, C, BLayout::ColMajor, R, C, SLayout::RowMajor, 512>;
DstTile dst; TASSIGN(dst, 0x0);
MGATHER<Coalesce::Elem, GatherOOB::Zero>(dst, tableGM, idxGM, scratchGM); // GM -> GM scratch (NZ) -> L1
}
A5 only — SIMT executor for Elem GM → L1 (GatherExec::Simt)¶
On A5 the Elem GM → L1 path has two executors, selected by a third template parameter
GatherExec (defined alongside Coalesce); the existing scalar path is unchanged and remains
the default for the 4-operand form:
enum class GatherExec : uint8_t { Scalar = 0, Simt = 1 };
GatherExec::Scalar(the default for the 4-operand form) — the cube core walks the indices with scalar loads. Best for small / sparse tiles.GatherExec::Simt— the AIV vector core collects the discrete elements with a SIMT kernel (simt_mgather_l1_elem_kernel), which parallelizes the gather across warps. This is the A5-only "GM → GM" stage; A2/A3 has no SIMT engine and therefore noSimtexecutor.
The two executors share the same 4-operand signature; only the third template parameter selects the SIMT path. There is no UB operand: the SIMT engine moves data GM → D-cache → registers → GM, so the D-cache is implicit hardware plumbing that neither the caller nor the kernel manages. At the API level the SIMT executor is a pure GM → GM gather:
template <Coalesce CMode, GatherOOB Oob, GatherExec Exec, typename MatTileDst,
typename GlobalTable, typename GlobalIdx, typename GlobalScratch>
PTO_INST RecordEvent MGATHER(MatTileDst& dst, GlobalTable& table, GlobalIdx& idx,
GlobalScratch& scratch);
Algorithm. One AIV subcore (get_subblockid() == 0) launches the SIMT grid over the full
padded Rows × Cols NZ tile. Each thread maps its linear NZ offset back to (r, c)
(blockCol = off / (Rows * kC0), r = (off % (Rows * kC0)) / kC0,
c = blockCol * kC0 + (off % kC0)), gathers table[remap(idx[r, c])] for in-bounds lanes and
writes 0 for padding / Zero-policy lanes — so padding and OOB are handled in one pass with no
separate pre-zero. Each thread stores its result directly to GM scratch at its linear NZ
offset; because consecutive lanes hold consecutive offsets, the stores coalesce into contiguous
GM bursts through the D-cache. After the grid retires, the AIV flushes the scratch range
(dcci per cache line + dsb(DSB_DDR)) so the cube core's DMA observes the writes, then hands
off to the AIC with a single vec → cube intra-block flag (producer set_intra_block(PIPE_S, id),
consumer wait_intra_block(PIPE_MTE2, id), hardware auto-maps the subcore offset). The AIC then
issues the contiguous copy_gm_to_cbuf scratch → L1. No pipe_barrier(PIPE_ALL) is needed in this case.
Coalescing / mapping. The grid is sized dim3{32, kLaunchWarps} with
kLaunchWarps = min(ceil(Rows * Cols / 32), 32), so small tiles do not pay for idle warps. The
linear-NZ-offset mapping makes the scratch stores fully coalesced (lane i writes offset
base + i) and the index reads contiguous within each kC0 block; the only unavoidable
random traffic is table[remap(idx)], which is the intrinsic cost of a gather and is serviced
through the D-cache (warm lines are reused across lanes that hit the same row).
Constraints. Identical to the scalar Elem path (index / table / scratch / destination); the
SIMT executor adds no new operand-level constraints. All Elem dtypes and all GatherOOB policies
are supported.
template <typename T, int R, int C, int TableSize>
AICORE void example_gm2l1_elem_simt(__gm__ T* tablePtr, __gm__ int32_t* idxPtr, __gm__ T* scratchPtr)
{
using TableShape = Shape<1, 1, 1, 1, TableSize>;
using TableStride = Stride<1, 1, 1, TableSize, 1>;
using IdxShape = Shape<1, 1, 1, R, C>;
using IdxStride = Stride<1, 1, 1, C, 1>;
using ScratchShape = Shape<1, 1, 1, 1, R * C>;
using ScratchStride= Stride<1, 1, 1, R * C, 1>;
GlobalTensor<T, TableShape, TableStride, Layout::ND> tableGM(tablePtr);
GlobalTensor<int32_t, IdxShape, IdxStride, Layout::ND> idxGM(idxPtr);
GlobalTensor<T, ScratchShape, ScratchStride, Layout::ND> scratchGM(scratchPtr);
using DstTile = Tile<TileType::Mat, T, R, C, BLayout::ColMajor, R, C, SLayout::RowMajor, 512>;
DstTile dst; TASSIGN(dst, 0x0);
// AIV SIMT gather (GM -> D-cache -> regs -> GM scratch, NZ) -> AIC copy_gm_to_cbuf -> L1.
// No UB operand: the D-cache is implicit; the API deals only GM -> GM gather.
MGATHER<Coalesce::Elem, GatherOOB::Zero, GatherExec::Simt>(dst, tableGM, idxGM, scratchGM);
}