Memory And Data Movement Instruction Set¶

Memory operations transfer data between global memory (GM) and tile buffers. These are the only tile operations that cross between tile-visible state and GM-visible state.

Operations¶

Operation	Description	Direction	C++ Intrinsic
pto.tload	Load from GM into tile	GM → local tile buffer	`TLOAD(dst, gtensor)`
pto.tprefetch	Prefetch from GM into tile (non-blocking)	GM → local tile buffer	`TPREFETCH(dst, gtensor)`
pto.tprefetch_async	Asynchronously prefetch GM into L2 via SDMA CMO	GM → L2 cache	`TPREFETCH_ASYNC(gtensor, ctx)`
pto.tstore	Store from tile to GM	local tile buffer → GM	`TSTORE(gtensor, src)`
pto.tstore_fp	Store through the fix-pipe path	Tile → local tile buffer → GM	`TSTORE_FP(gtensor, src, fp)`
pto.mgather	Gather scattered elements from GM	GM → local tile buffer	`MGATHER(dst, gtensor, indices)`
pto.mscatter	Scatter tile elements to GM	local tile buffer → GM	`MSCATTER(gtensor, indices, src)`

Mechanism¶

Contiguous Transfer (TLOAD, TSTORE)¶

Data is transferred in a rectangular region determined by the tile's valid region:

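TLOAD:  dst[i,j] = src[ r0 + i, c0 + j ]   (i ∈ [0, dst.Rv), j ∈ [0, dst.Cv))
TSTORE: dst[ r0 + i, c0 + j ] = src[i,j]   (i ∈ [0, src.Rv), j ∈ [0, src.Cv))

Here Rv and Cv are the tile's valid row and column counts, and (r0, c0) is the region's starting offset in the GM tensor. Transfer size: GetValidRow() × GetValidCol() elements of the tile operand (dst for TLOAD, src for TSTORE).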
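The rectangular copy can be modelled on the host with plain loops. The following is a minimal sketch, not the PTO implementation: `HostTensor`, `HostTile`, `refLoad`, and `refStore` are hypothetical names, GM is assumed to be row-major, and (r0, c0) is assumed to be the region's starting offset in the tensor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-ins; these are NOT PTO types.
struct HostTensor {                    // models a row-major GM tensor
    std::size_t rows, cols;
    std::vector<float> data;           // rows * cols elements
    float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
};

struct HostTile {                      // models a tile buffer with a valid region
    std::size_t rows, cols;            // physical tile shape
    std::size_t validRow, validCol;    // Rv and Cv
    std::vector<float> data;           // rows * cols elements
    float& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
};

// TLOAD model: dst[i,j] = src[r0+i, c0+j] over dst's valid region.
// Returns the number of elements transferred.
std::size_t refLoad(HostTile& dst, HostTensor& src, std::size_t r0, std::size_t c0) {
    assert(r0 + dst.validRow <= src.rows && c0 + dst.validCol <= src.cols);
    for (std::size_t i = 0; i < dst.validRow; ++i)
        for (std::size_t j = 0; j < dst.validCol; ++j)
            dst.at(i, j) = src.at(r0 + i, c0 + j);
    return dst.validRow * dst.validCol;
}

// TSTORE model: dst[r0+i, c0+j] = src[i,j] over src's valid region.
// Returns the number of elements transferred.
std::size_t refStore(HostTensor& dst, HostTile& src, std::size_t r0, std::size_t c0) {
    assert(r0 + src.validRow <= dst.rows && c0 + src.validCol <= dst.cols);
    for (std::size_t i = 0; i < src.validRow; ++i)
        for (std::size_t j = 0; j < src.validCol; ++j)
            dst.at(r0 + i, c0 + j) = src.at(i, j);
    return src.validRow * src.validCol;
}
```

In this model, elements of the tile outside the valid region are never read or written, which is why the transfer size depends on the valid region rather than on the physical tile shape.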

Prefetch (TPREFETCH)¶

TPREFETCH initiates a non-blocking DMA transfer from GM to the tile buffer. It does not stall the pipeline. A subsequent operation that reads the tile buffer must wait for the transfer to complete via TSYNC or set_flag/wait_flag.

TPREFETCH_ASYNC uses SDMA CMO to warm the L2 cache for a flat contiguous GlobalTensor region. It returns a comm::AsyncEvent; consumers that require the prefetched data must wait on that event before issuing the dependent load.
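The wait-before-read rule can be illustrated with a small single-threaded host model. This is only a sketch under stated assumptions: `PendingTransfer`, `issuePrefetch`, and `sumAfterWait` are hypothetical names, and the model deliberately defers the copy until `wait()` so that a missing wait is easy to observe. Real hardware completes the transfer asynchronously at an unspecified time.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical model of a non-blocking transfer: the data is only guaranteed
// to be in the destination once wait() has returned.
class PendingTransfer {
public:
    explicit PendingTransfer(std::function<void()> copy) : copy_(std::move(copy)) {}

    bool done() const { return done_; }

    // Stands in for TSYNC, wait_flag, or waiting on the returned event.
    void wait() {
        if (!done_) {
            copy_();
            done_ = true;
        }
    }

private:
    std::function<void()> copy_;
    bool done_ = false;
};

// Issues a "prefetch" of src into dst and returns immediately without copying.
// Both vectors must outlive the returned PendingTransfer.
PendingTransfer issuePrefetch(std::vector<int>& dst, const std::vector<int>& src) {
    return PendingTransfer([&dst, &src] { dst = src; });
}

// A consumer that follows the ordering rule: wait first, then read the tile.
int sumAfterWait(PendingTransfer& xfer, const std::vector<int>& tile) {
    xfer.wait();
    int sum = 0;
    for (int v : tile) sum += v;
    return sum;
}
```

Reading `tile` before calling `wait()` in this model returns the old contents, which mirrors the hazard that explicit synchronization is meant to prevent.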

Gather/Scatter (MGATHER, MSCATTER)¶

An index tile specifies which GM elements to transfer:

\[ \text{MGATHER:}\quad \mathrm{dst}_i = \mathrm{src}_{\mathrm{index}_i} \]

\[ \text{MSCATTER:}\quad \mathrm{dst}_{\mathrm{index}_i} = \mathrm{src}_i \]
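As a host-side illustration of indexed transfer, the sketch below models GM as a flat array and the index tile as a list of element offsets. The names `refGather` and `refScatter` are hypothetical, and flat element indexing is an assumption; the source does not specify how indices map to GM addresses or what happens when scatter indices repeat (in this model the last write wins).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Gather model: dst[i] = gm[index[i]].
std::vector<float> refGather(const std::vector<float>& gm,
                             const std::vector<std::size_t>& index) {
    std::vector<float> dst(index.size());
    for (std::size_t i = 0; i < index.size(); ++i) {
        assert(index[i] < gm.size());   // indices must stay inside GM
        dst[i] = gm[index[i]];
    }
    return dst;
}

// Scatter model: gm[index[i]] = src[i].
// With duplicate indices, the last write wins in this model only.
void refScatter(std::vector<float>& gm,
                const std::vector<std::size_t>& index,
                const std::vector<float>& src) {
    assert(index.size() == src.size()); // one index per source element
    for (std::size_t i = 0; i < index.size(); ++i) {
        assert(index[i] < gm.size());
        gm[index[i]] = src[i];
    }
}
```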

Fix-Pipe Variants (TSTORE_FP)¶

TSTORE_FP is a fix-pipe variant, not a “floating-point” variant. The _fp suffix names the backend path that programs fix-pipe state before storing.

Layout Compatibility¶

TileType	ND→ND	DN→DN	NZ→NZ	ND→NZ	DN→ZN	Notes
`TileType::Vec`	Yes	Yes	Yes	No	No
`TileType::Mat`	Yes	Yes	Yes	Yes	Yes
`TileType::Acc`	Yes	No	Yes	No	No	Atomic store only

Additional constraints on A5:

- TileType::Vec with ND→NZ or DN→ZN: requires GlobalData::staticShape[0..2] == 1 and TileData::SFractalSize == 512.
- TileType::Vec with int64_t/uint64_t: only ND→ND or DN→DN supported.
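The base table can be encoded as a small lookup, which is convenient for validating a transfer before issuing it. This is a sketch only: `TileKind`, `LayoutPair`, and `layoutSupported` are hypothetical names that mirror the table rows and columns, and the A5-specific additional constraints are intentionally not modelled.

```cpp
#include <cassert>

// Hypothetical mirrors of the table's rows and columns; not PTO enums.
enum class TileKind { Vec, Mat, Acc };
enum class LayoutPair { ND_ND, DN_DN, NZ_NZ, ND_NZ, DN_ZN };

// Returns the Yes/No entry of the base layout compatibility table.
// A5-specific extra conditions are NOT checked here.
inline bool layoutSupported(TileKind tile, LayoutPair pair) {
    switch (tile) {
        case TileKind::Vec:
            return pair == LayoutPair::ND_ND || pair == LayoutPair::DN_DN ||
                   pair == LayoutPair::NZ_NZ;
        case TileKind::Mat:
            return true;  // every listed layout pair is supported
        case TileKind::Acc:
            return pair == LayoutPair::ND_ND || pair == LayoutPair::NZ_NZ;
    }
    return false;  // unreachable for valid enum values
}
```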

Type Support by Target Profile¶

Element Type	CPU Simulator	A2/A3	A5
f32 (float)	Yes	Yes	Yes
f16 (half)	Yes	Yes	Yes
bf16 (bfloat16_t)	Yes	Yes	Yes
i8 / u8	Yes	Yes	Yes
i16 / u16	Yes	Yes	Yes
i32 / u32	Yes	Yes	Yes
i64 / u64	Yes	Yes	Yes
f8e4m3 / f8e5m2	No	No	Yes
hifloat8_t / float4_e*	No	No	Yes

Ordering¶

Memory operations are subject to PTO's producer-consumer ordering rules. Programs MUST use explicit synchronization (TSYNC, set_flag/wait_flag) to ensure data is ready before use.

See Producer Consumer Ordering for the full ordering model.

Constraints¶

Source and destination element types MUST have the same size: sizeof(tile.dtype) == sizeof(gtensor.dtype).
Transfer size is determined by the destination tile's valid region for TLOAD, or by the source tile's valid region for TSTORE.
Layout compatibility between GM layout and tile layout is profile-dependent (see the layout compatibility table above).
Gather/scatter index tiles MUST have a shape compatible with the data tile (dst for MGATHER, src for MSCATTER).
TSTORE with TileType::Acc supports AtomicType: AtomicNone, AtomicAdd, AtomicMax, AtomicMin (A5 only).
TSTORE_FP is only legal for TileType::Acc on A2/A3 and A5, and uses the fix-pipe sideband state carried by the auxiliary fp tile argument.
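The element-size rule is a compile-time property, so it can be checked with `static_assert`. The helper below is a hypothetical sketch (`sameElementSize` is not a PTO function); it shows that the rule compares sizes, not types, so a `float` tile and a 32-bit integer tensor pass while a `float` tile and a 16-bit tensor do not.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when the tile and GM element types have equal size.
template <typename TileElem, typename GmElem>
constexpr bool sameElementSize() {
    return sizeof(TileElem) == sizeof(GmElem);
}

// The rule is about size only: distinct 4-byte types are accepted...
static_assert(sameElementSize<float, std::int32_t>(),
              "4-byte element types are size-compatible");
// ...while a 4-byte and a 2-byte type are rejected.
static_assert(!sameElementSize<float, std::int16_t>(),
              "4-byte and 2-byte element types are not size-compatible");
```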

Cases That Are Not Allowed¶

Transferring to or from an uninitialized tile register.
Using a GlobalTensor with strides incompatible with the transfer pattern.
Accessing GM addresses outside the tensor's declared shape.
Using TSTORE_FP with a non-Acc tile type.
Using atomic store variants on the CPU simulator.
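The out-of-bounds case can be caught on the host before a transfer is issued. The following is a minimal sketch for a 2-D region; `Shape2D` and `regionInBounds` are hypothetical names, and the declared shape is assumed to be expressed in elements.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of a tensor's declared 2-D shape, in elements.
struct Shape2D {
    std::size_t rows;
    std::size_t cols;
};

// True when the rectangle [r0, r0 + validRow) x [c0, c0 + validCol) lies
// entirely inside the declared shape. The comparisons are arranged so that
// no unsigned addition can overflow.
inline bool regionInBounds(Shape2D shape, std::size_t r0, std::size_t c0,
                           std::size_t validRow, std::size_t validCol) {
    return r0 <= shape.rows && validRow <= shape.rows - r0 &&
           c0 <= shape.cols && validCol <= shape.cols - c0;
}
```

A caller could evaluate this check with the tile's valid row and column counts before issuing a load or store, and reject the transfer when it returns false.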

C++ Intrinsic¶

#include <pto/pto-inst.hpp>
using namespace pto;

// Basic load
template <typename TileData, typename GlobalData, typename... WaitEvents>
PTO_INST RecordEvent TLOAD(TileData& dst, GlobalData& src, WaitEvents&... events);

// Store (atomicType selects an optional atomic mode; defaults to AtomicNone)
template <typename TileData, typename GlobalData,
          AtomicType atomicType = AtomicType::AtomicNone, typename... WaitEvents>
PTO_INST RecordEvent TSTORE(GlobalData& dst, TileData& src, WaitEvents&... events);

// Fix-pipe store (quantized; A2/A3 and A5, TileType::Acc only)
template <typename TileData, typename GlobalData, typename FpTileData,
          AtomicType atomicType = AtomicType::AtomicNone, typename... WaitEvents>
PTO_INST RecordEvent TSTORE_FP(GlobalData& dst, TileData& src, FpTileData& fp,
                               WaitEvents&... events);

// Prefetch
template <typename TileData, typename GlobalData>
PTO_INST RecordEvent TPREFETCH(TileData& dst, GlobalData& src);

template <typename GlobalData, typename... WaitEvents>
PTO_INST comm::AsyncEvent TPREFETCH_ASYNC(GlobalData& src, PrefetchAsyncContext& ctx, WaitEvents&... events);

// Gather/Scatter
template <typename TileData, typename GlobalData, typename IndexData>
PTO_INST RecordEvent MGATHER(TileData& dst, GlobalData& src, IndexData& indices);

template <typename TileData, typename GlobalData, typename IndexData>
PTO_INST RecordEvent MSCATTER(GlobalData& dst, IndexData& indices, TileData& src);