MSCATTER¶
Tile Operation Diagram¶
Introduction¶
Scatter-store elements from a tile into global memory using per-element indices.
Math Interpretation¶
For each element (i, j) in the source valid region:
\[ \mathrm{mem}[\mathrm{idx}_{i,j}] = \mathrm{src}_{i,j} \]
If multiple elements map to the same destination location, the final value is implementation-defined (CPU simulator: last writer wins in row-major iteration order).
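The rule above can be modelled on the host with a few lines of C++. This is a minimal sketch of the non-atomic semantics as described for the CPU simulator, not the PTO implementation: the function name `scatter_reference`, the use of `std::vector` for the tiles and for global memory, and the bounds assertion are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference model of a non-atomic scatter (illustration only).
// `mem` stands in for the global-memory destination; `src` and `idx` stand in
// for the row-major source and index tiles with a rows x cols valid region.
template <typename T>
void scatter_reference(std::vector<T> &mem, const std::vector<T> &src,
                       const std::vector<std::uint32_t> &idx,
                       std::size_t rows, std::size_t cols) {
    assert(src.size() == rows * cols && idx.size() == rows * cols);
    // Row-major iteration: (0,0), (0,1), ..., (rows-1, cols-1).
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            const std::size_t k = i * cols + j;
            // Added safety check; this model refuses out-of-range indices.
            assert(idx[k] < mem.size());
            // A later (i, j) that maps to the same slot overwrites the
            // earlier value, so the last writer wins.
            mem[idx[k]] = src[k];
        }
    }
}
```

With a 2x2 source `{10, 20, 30, 40}` and indices `{2, 0, 2, 3}`, elements (0,0) and (1,0) both target slot 2, and the model leaves the value from (1,0) there. Other targets may resolve such collisions differently, so portable code should avoid duplicate indices in non-atomic mode.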
Assembly Syntax¶
PTO-AS form: see PTO-AS Specification.
Synchronous form:
mscatter %src, %mem, %idx : !pto.memref<...>, !pto.tile<...>, !pto.tile<...>
AS Level 1 (SSA)¶
pto.mscatter %src, %idx, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
AS Level 2 (DPS)¶
pto.mscatter ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
C++ Intrinsic¶
Declared in include/pto/common/pto_instr.hpp:
template <typename GlobalData, typename TileSrc, typename TileInd, typename... WaitEvents>
PTO_INST RecordEvent MSCATTER(GlobalData &dst, TileSrc &src, TileInd &indexes, WaitEvents &... events);
Constraints¶
- Supported data types:
  - src/dst element type must be one of: int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, bfloat16_t, float.
  - On AICore targets, float8_e4m3_t and float8_e5m2_t are also supported.
  - indexes element type must be int32_t or uint32_t.
- Tile and memory types:
  - src must be a vector tile (TileType::Vec).
  - indexes must be a vector tile (TileType::Vec).
  - src and indexes must use row-major layout.
  - dst must be a GlobalTensor in GM memory.
  - dst must use ND layout.
- Atomic operation constraints:
  - Non-atomic scatter is supported for all supported element types.
  - Add atomic mode requires int32_t, uint32_t, float, or half.
  - Max/Min atomic mode requires int32_t or float.
- Shape constraints:
  - src.Rows == indexes.Rows.
  - indexes must be shaped as [N, 1] for row-indexed scatter or [N, M] for element-indexed scatter.
  - src row width must be 32-byte aligned, that is, src.Cols * sizeof(DType) must be a multiple of 32.
  - dst static shape must satisfy Shape<1, 1, 1, TableRows, RowWidth>.
- Index interpretation:
  - Index interpretation is target-defined. The CPU simulator treats indices as linear element indices into dst.data().
  - The CPU simulator does not enforce bounds checks on indexes.
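Several of the constraints above can be checked on the host before issuing the instruction. The following is a rough sketch under stated assumptions, not part of the PTO headers: every name in it (`row_width_is_aligned`, `index_shape_is_valid`, `indices_in_bounds`, `AtomicMode`, `scatter_with_mode`) is hypothetical. It also assumes that M in the [N, M] index shape equals src.Cols, and that the atomic modes combine each source element with the value already stored at the destination. The document does not spell out either point.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-width rule: src.Cols * sizeof(DType) must be a multiple of 32 bytes.
template <typename T>
constexpr bool row_width_is_aligned(std::size_t cols) {
    return (cols * sizeof(T)) % 32 == 0;
}

// Index-shape rule: indexes has as many rows as src and is either [N, 1]
// (row-indexed) or [N, M] (element-indexed). Assumption: M == src.Cols.
constexpr bool index_shape_is_valid(std::size_t src_rows, std::size_t src_cols,
                                    std::size_t idx_rows, std::size_t idx_cols) {
    return idx_rows == src_rows && (idx_cols == 1 || idx_cols == src_cols);
}

// Caller-side bounds check, useful because the CPU simulator performs none.
inline bool indices_in_bounds(const std::vector<std::uint32_t> &idx,
                              std::size_t dst_elems) {
    return std::all_of(idx.begin(), idx.end(),
                       [dst_elems](std::uint32_t v) { return v < dst_elems; });
}

// Hypothetical enum naming the modes listed under the atomic constraints.
enum class AtomicMode { None, Add, Max, Min };

// Sequential model of element-indexed scatter with an optional combine step.
// Assumption: atomic modes fold src into the current destination value.
template <typename T>
void scatter_with_mode(std::vector<T> &mem, const std::vector<T> &src,
                       const std::vector<std::uint32_t> &idx, AtomicMode mode) {
    assert(src.size() == idx.size());
    assert(indices_in_bounds(idx, mem.size()));
    for (std::size_t k = 0; k < src.size(); ++k) {
        T &slot = mem[idx[k]];
        switch (mode) {
            case AtomicMode::None: slot = src[k]; break;
            case AtomicMode::Add:  slot = slot + src[k]; break;
            case AtomicMode::Max:  slot = std::max(slot, src[k]); break;
            case AtomicMode::Min:  slot = std::min(slot, src[k]); break;
        }
    }
}
```

For example, `row_width_is_aligned<float>(8)` holds because 8 * 4 = 32 bytes, while a 6-column float row (24 bytes) fails the check. Under the assumed combine semantics, the result of Add, Max, and Min does not depend on the order in which colliding elements are applied, unlike the non-atomic mode.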
Examples¶
See related examples in docs/isa/ and docs/coding/tutorials/.
ASM Form Examples¶
Auto Mode¶
# Auto mode: compiler/runtime-managed placement and scheduling.
pto.mscatter %src, %idx, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
Manual Mode¶
# Manual mode: bind resources explicitly before issuing the instruction.
# Optional for tile operands:
# pto.tassign %arg0, @tile(0x1000)
# pto.tassign %arg1, @tile(0x2000)
pto.mscatter %src, %idx, %mem : (!pto.tile<...>, !pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
PTO Assembly Form¶
mscatter %src, %mem, %idx : !pto.memref<...>, !pto.tile<...>, !pto.tile<...>
# AS Level 2 (DPS)
pto.mscatter ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)