Memory And Data Movement Instruction Set¶
Memory operations transfer data between global memory (GM) and tile buffers. These are the only tile operations that cross between tile-visible state and GM-visible state.
Operations¶
| Operation | Description | Direction | C++ Intrinsic |
|---|---|---|---|
| pto.tload | Load from GM into tile | GM → local tile buffer | TLOAD(dst, gtensor) |
| pto.tprefetch | Prefetch from GM into tile (non-blocking) | GM → local tile buffer | TPREFETCH(dst, gtensor) |
| pto.tprefetch_async | Asynchronously prefetch GM into L2 via SDMA CMO | GM → L2 cache | TPREFETCH_ASYNC(gtensor, ctx) |
| pto.tstore | Store from tile to GM | local tile buffer → GM | TSTORE(gtensor, src) |
| pto.tstore_fp | Store through the fix-pipe path | Tile → local tile buffer → GM | TSTORE_FP(gtensor, src, fp) |
| pto.mgather | Gather scattered elements from GM | GM → local tile buffer | MGATHER(dst, gtensor, indices) |
| pto.mscatter | Scatter tile elements to GM | local tile buffer → GM | MSCATTER(gtensor, indices, src) |
Mechanism¶
Contiguous Transfer (TLOAD, TSTORE)¶
Data is transferred in a rectangular region determined by the tile's valid region:
TLOAD: dst[i,j] = src[ r0 + i, c0 + j ] (i ∈ [0, dst.Rv), j ∈ [0, dst.Cv))
TSTORE: dst[ r0 + i, c0 + j ] = src[i,j]
Transfer size: dst.GetValidRow() × dst.GetValidCol() elements.
Prefetch (TPREFETCH)¶
TPREFETCH initiates a non-blocking DMA transfer from GM to the tile buffer. It does not stall the pipeline. A subsequent operation that reads the tile buffer must wait for the transfer to complete via TSYNC or set_flag/wait_flag.
TPREFETCH_ASYNC uses SDMA CMO to warm the L2 cache for a flat contiguous GlobalTensor region. It returns a comm::AsyncEvent; consumers that require the prefetched data must wait on that event before issuing the dependent load.
Gather/Scatter (MGATHER, MSCATTER)¶
An index tile specifies which GM elements to transfer:
Fix-Pipe Variants (TSTORE_FP)¶
TSTORE_FP is a fix-pipe variant, not a “floating-point” variant. The _fp suffix names the backend path that programs fix-pipe state before storing.
Layout Compatibility¶
| TileType | ND→ND | DN→DN | NZ→NZ | ND→NZ | DN→ZN | Notes |
|---|---|---|---|---|---|---|
TileType::Vec |
Yes | Yes | Yes | No | No | |
TileType::Mat |
Yes | Yes | Yes | Yes | Yes | |
TileType::Acc |
Yes | No | Yes | No | No | Atomic store only |
Additional constraints on A5:
- TileType::Vec with ND→NZ or DN→ZN: requires GlobalData::staticShape[0..2] == 1 and TileData::SFractalSize == 512.
- TileType::Vec with int64_t/uint64_t: only ND→ND or DN→DN supported.
Type Support by Target Profile¶
| Element Type | CPU Simulator | A2/A3 | A5 |
|---|---|---|---|
| f32 (float) | Yes | Yes | Yes |
| f16 (half) | Yes | Yes | Yes |
| bf16 (bfloat16_t) | Yes | Yes | Yes |
| i8 / u8 | Yes | Yes | Yes |
| i16 / u16 | Yes | Yes | Yes |
| i32 / u32 | Yes | Yes | Yes |
| i64 / u64 | Yes | Yes | Yes |
| f8e4m3 / f8e5m2 | No | No | Yes |
| hifloat8_t / float4_e* | No | No | Yes |
Ordering¶
Memory operations are subject to PTO's producer-consumer ordering rules. Programs MUST use explicit synchronization (TSYNC, set_flag/wait_flag) to ensure data is ready before use.
See Producer Consumer Ordering for the full ordering model.
Constraints¶
Constraints
- Source and destination element types MUST have the same size:
sizeof(tile.dtype) == sizeof(gtensor.dtype). - Transfer size is determined by the destination tile's valid region for
TLOAD, or source tile's valid region forTSTORE. - Layout compatibility between GM layout and tile layout is profile-dependent (see layout compatibility table above).
- Gather/scatter index tiles must have compatible shapes.
TSTOREwithTileType::AccsupportsAtomicType:AtomicNone,AtomicAdd,AtomicMax,AtomicMin(A5 only).TSTORE_FPis only legal forTileType::Accon A2A3 and A5 and uses the fix-pipe sideband state carried by the auxiliaryfptile argument.
Cases That Are Not Allowed¶
Cases That Are Not Allowed
- Transferring to or from an uninitialized tile register.
- Using a GlobalTensor with strides incompatible with the transfer pattern.
- Accessing GM addresses outside the tensor's declared shape.
- Using
TSTORE_FPwith a non-Acc tile type. - Using atomic store variants on CPU simulator.
C++ Intrinsic¶
#include <pto/pto-inst.hpp>
using namespace pto;
// Basic load
template <typename TileData, typename GlobalData, typename... WaitEvents>
PTO_INST RecordEvent TLOAD(TileData& dst, GlobalData& src, WaitEvents&... events);
// Atomic store
template <typename TileData, typename GlobalData,
AtomicType atomicType = AtomicType::AtomicNone, typename... WaitEvents>
PTO_INST RecordEvent TSTORE(GlobalData& dst, TileData& src, WaitEvents&... events);
// FP store (quantized, A2/A3+)
template <typename TileData, typename GlobalData, typename FpTileData,
AtomicType atomicType = AtomicType::AtomicNone, typename... WaitEvents>
PTO_INST RecordEvent TSTORE_FP(GlobalData& dst, TileData& src, FpTileData& fp,
WaitEvents&... events);
// Prefetch
template <typename TileData, typename GlobalData>
PTO_INST RecordEvent TPREFETCH(TileData& dst, GlobalData& src);
template <typename GlobalData, typename... WaitEvents>
PTO_INST comm::AsyncEvent TPREFETCH_ASYNC(GlobalData& src, PrefetchAsyncContext& ctx, WaitEvents&... events);
// Gather/Scatter
template <typename TileData, typename GlobalData, typename IndexData>
PTO_INST RecordEvent MGATHER(TileData& dst, GlobalData& src, IndexData& indices);
template <typename TileData, typename GlobalData, typename IndexData>
PTO_INST RecordEvent MSCATTER(GlobalData& dst, IndexData& indices, TileData& src);
See Also¶
- Memory model — GM ordering and consistency
- Producer consumer ordering — Sync rules
- Tile instruction set — Instruction set overview
- Tile instruction set — Instruction Set description