pto.tload¶
pto.tload is part of the Memory And Data Movement instruction set.
Summary¶
Load data from global memory into a tile. The transfer is rectangular, spanning dst.GetValidRow() by dst.GetValidCol() elements.
Mechanism¶
pto.tload initiates a DMA transfer from the source GlobalTensor to the destination tile buffer. The transfer reads a rectangular region from the GlobalTensor and writes it into the tile's on-chip storage.
Let R = dst.GetValidRow() and C = dst.GetValidCol(). The transfer size is R × C elements. The element mapping depends on the GlobalTensor layout; conceptually, in a 2D view:
$$ \mathrm{dst}_{i,j} = \mathrm{src}_{r_0 + i,\; c_0 + j}, \quad 0 \le i < R,\; 0 \le j < C $$
where (r_0, c_0) is the base offset within the GlobalTensor. The exact address computation also depends on the GlobalTensor stride.
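To make the mapping concrete, the following host-side reference model copies an R × C rectangle out of a flat row-major buffer, taking tile element (i, j) from source position (r_0 + i, c_0 + j). It is only a sketch: the function name `reference_tload_nd` and its parameters are hypothetical, it assumes a row-major (ND) source with a single row stride, and it says nothing about how the hardware DMA performs the transfer.

```cpp
#include <cstddef>
#include <vector>

// Host-side reference model of the element mapping, NOT the PTO implementation.
// Assumption: the source is row-major (ND) and consecutive rows are
// `src_row_stride` elements apart.
template <typename T>
std::vector<T> reference_tload_nd(const std::vector<T>& src,
                                  std::size_t src_row_stride,
                                  std::size_t r0, std::size_t c0,
                                  std::size_t valid_rows, std::size_t valid_cols) {
    std::vector<T> dst(valid_rows * valid_cols);
    for (std::size_t i = 0; i < valid_rows; ++i) {
        for (std::size_t j = 0; j < valid_cols; ++j) {
            // dst[i, j] = src[r0 + i, c0 + j]
            dst[i * valid_cols + j] = src[(r0 + i) * src_row_stride + (c0 + j)];
        }
    }
    return dst;
}
```

A model like this can serve as a golden reference when comparing the contents of a loaded tile against the expected data in a test.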
The operation is asynchronous and returns a RecordEvent token. Wait on that token with TSYNC, or synchronize with set_flag/wait_flag, before reading the tile data.
Syntax¶
PTO Assembly Form¶
%t0 = tload %sv[%c0, %c0] : (!pto.memref<...>, index, index) -> !pto.tile<...>
AS Level 1 (SSA)¶
%dst = pto.tload %mem : !pto.partition_tensor_view<MxNxdtype> ->
    !pto.tile<loc, dtype, rows, cols, blayout, slayout, fractal, pad>
AS Level 2 (DPS)¶
pto.tload ins(%mem : !pto.partition_tensor_view<MxNxdtype>)
          outs(%dst : !pto.tile_buf<...>)
C++ Intrinsic¶
Declared in include/pto/common/pto_instr.hpp:
template <typename TileData, typename GlobalData, typename... WaitEvents>
PTO_INST RecordEvent TLOAD(TileData &dst, GlobalData &src, WaitEvents &... events);
Inputs¶
| Operand | Description |
|---|---|
| dst | Destination tile. The transfer shape is dst.GetValidRow() × dst.GetValidCol(). |
| src | Source GlobalTensor. Must be addressable from the local NPU. |
| events... | Optional RecordEvent tokens to wait on before issuing the operation. |
Expected Outputs¶
| Result | Type | Description |
|---|---|---|
| RecordEvent | RecordEvent | Token signaling completion of the load. Must be waited on before the tile data is consumed. |
After the load completes, dst contains the loaded data with element layout determined by the tile layout and GlobalTensor stride.
Side Effects¶
Reads from global memory and writes to the tile buffer. Does not implicitly fence unrelated tile traffic.
Constraints¶
- Valid region: The transfer size is dst.GetValidRow() × dst.GetValidCol().
- Element size match: sizeof(tile.dtype) == sizeof(gtensor.dtype).
- Layout compatibility: Source (GlobalTensor) layout and destination (tile) layout must be a supported combination. See the layout compatibility table in Memory And Data Movement.
- Shape positivity: src.GetShape(dim) > 0 and dst.GetValidRow() > 0 and dst.GetValidCol() > 0 at runtime.
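The element-size and positivity constraints above can be sketched as a host-side pre-check. Everything here is illustrative: `LoadDesc` and `check_tload_constraints` are hypothetical names rather than PTO types, the five-dimensional source shape is an assumption, and layout compatibility is deliberately left out.

```cpp
#include <cstddef>
#include <string>

// Hypothetical host-side description of one load. These fields are
// illustrative stand-ins and are not PTO types.
struct LoadDesc {
    std::size_t tile_elem_bytes;     // sizeof(tile.dtype)
    std::size_t gtensor_elem_bytes;  // sizeof(gtensor.dtype)
    long long src_shape[5];          // src.GetShape(dim), assuming 5 dimensions
    long long valid_rows;            // dst.GetValidRow()
    long long valid_cols;            // dst.GetValidCol()
};

// Returns an empty string when the checks pass, otherwise the name of the
// first violated constraint. Layout compatibility is not checked here.
inline std::string check_tload_constraints(const LoadDesc& d) {
    if (d.tile_elem_bytes != d.gtensor_elem_bytes) return "element size match";
    for (long long extent : d.src_shape) {
        if (extent <= 0) return "shape positivity";
    }
    if (d.valid_rows <= 0 || d.valid_cols <= 0) return "shape positivity";
    return "";
}
```

Returning the constraint name rather than a bare boolean keeps failures easy to report in host-side tests.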
Layout Compatibility¶
| TileType | ND→ND | DN→DN | NZ→NZ | ND→NZ | DN→ZN |
|---|---|---|---|---|---|
| TileType::Vec | Yes | Yes | Yes | No | No |
| TileType::Mat | Yes | Yes | Yes | Yes | Yes |
| TileType::Acc | Yes | No | Yes | No | No |
Additional constraints (A5):
- Mat with ND→NZ or DN→ZN: requires GlobalData::staticShape[0..2] == 1 and TileData::SFractalSize == 512.
- Vec with int64_t/uint64_t: only ND→ND or DN→DN supported.
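The compatibility table can be encoded as a small compile-time lookup, which is one way to reject unsupported combinations early in host-side tooling. The enums `TileKind` and `Conversion` and the function `tload_layout_supported` are hypothetical names that mirror the table's rows and columns; they are not part of the PTO headers, and the additional constraints listed after the table are not modelled.

```cpp
// Hypothetical enums mirroring the rows and columns of the table above.
// The real PTO headers define their own types.
enum class TileKind { Vec, Mat, Acc };
enum class Conversion { ND_to_ND, DN_to_DN, NZ_to_NZ, ND_to_NZ, DN_to_ZN };

// Returns true when the table marks the (tile type, conversion) pair "Yes".
constexpr bool tload_layout_supported(TileKind kind, Conversion conv) {
    switch (kind) {
        case TileKind::Vec:
            // Matching layouts only.
            return conv == Conversion::ND_to_ND ||
                   conv == Conversion::DN_to_DN ||
                   conv == Conversion::NZ_to_NZ;
        case TileKind::Mat:
            // Every column of the table is "Yes".
            return true;
        case TileKind::Acc:
            return conv == Conversion::ND_to_ND ||
                   conv == Conversion::NZ_to_NZ;
    }
    return false;
}

// The lookup is usable at compile time.
static_assert(tload_layout_supported(TileKind::Mat, Conversion::ND_to_NZ),
              "Mat supports ND→NZ per the table");
```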
Exceptions¶
- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend instruction set.
- Programs must not rely on behavior outside the documented legal domain of this operation.
Target-Profile Restrictions¶
A2/A3:
- TileData::DType must be one of: int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float.
- Destination tile location must be TileType::Vec or TileType::Mat.
- sizeof(TileData::DType) == sizeof(GlobalData::DType).
- Vec loads: layouts must match (ND→ND, DN→DN, NZ→NZ).
- Mat loads: supports all combinations including ND→NZ and DN→ZN.
- For ND→NZ or DN→ZN: GlobalData::staticShape[0..2] == 1 and TileData::SFractalSize == 512.
- int64_t/uint64_t: only ND→ND or DN→DN.
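As a rough illustration, the A2/A3 rules above can be combined into one predicate. This is a sketch under stated assumptions: the namespace `a2a3_sketch` and every name in it are hypothetical, `staticShape[0..2] == 1` is read as "indices 0, 1, and 2 all equal 1", and the data-type whitelist is not modelled.

```cpp
#include <cstddef>

namespace a2a3_sketch {

// Hypothetical stand-ins for the tile location and layout conversion.
enum class Loc { Vec, Mat, Acc };
enum class Conv { ND_ND, DN_DN, NZ_NZ, ND_NZ, DN_ZN };

struct Load {
    Loc loc;
    Conv conv;
    std::size_t tile_elem_bytes;    // sizeof(TileData::DType)
    std::size_t global_elem_bytes;  // sizeof(GlobalData::DType)
    bool is_64bit_int;              // true for int64_t or uint64_t
    int static_shape[5];            // stands in for GlobalData::staticShape
    int fractal_size;               // stands in for TileData::SFractalSize
};

// Returns true when the load satisfies the A2/A3 bullets (dtype list excluded).
inline bool legal(const Load& l) {
    if (l.loc != Loc::Vec && l.loc != Loc::Mat) return false;
    if (l.tile_elem_bytes != l.global_elem_bytes) return false;

    const bool converting = l.conv == Conv::ND_NZ || l.conv == Conv::DN_ZN;
    if (converting) {
        if (l.loc != Loc::Mat) return false;  // Vec layouts must match
        for (int dim = 0; dim < 3; ++dim) {
            if (l.static_shape[dim] != 1) return false;
        }
        if (l.fractal_size != 512) return false;
    }
    // 64-bit integers: only ND→ND or DN→DN.
    if (l.is_64bit_int && l.conv != Conv::ND_ND && l.conv != Conv::DN_DN) {
        return false;
    }
    return true;
}

}  // namespace a2a3_sketch
```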
A5:
- sizeof(TileData::DType) must be 1, 2, 4, or 8 bytes, and must match sizeof(GlobalData::DType).
- Vec loads: row-major ND→ND, col-major DN→DN, or row-major NZ→NZ only.
- Mat loads: constrained by TLoadCubeCheck (specific ND/DN/NZ conversions and L1-size limits).
- Mat loads also handle MX-format loads: MX_A_ZZ/MX_A_ND/MX_A_DN to ZZ for scalarA, and MX_B_NN/MX_B_ND/MX_B_DN to NN for scalarB.
- For MX_A_ZZ/MX_B_NN: GlobalData::staticShape[3] == 16 and GlobalData::staticShape[4] == 2.
- For MX_A_ND/MX_A_DN/MX_B_ND/MX_B_DN: GlobalData::staticShape[0] == 1 and GlobalData::staticShape[1] == 1 and GlobalData::staticShape[4] == 2.
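The two MX static-shape rules above can be written as a single predicate over a five-element shape. The enum `MxLayout` and the function `mx_static_shape_ok` are hypothetical names chosen for this sketch; only the layout names and the shape conditions come from the list above, and TLoadCubeCheck itself is not reproduced.

```cpp
#include <array>

// Hypothetical enum naming the MX source layouts listed above.
enum class MxLayout { MX_A_ZZ, MX_A_ND, MX_A_DN, MX_B_NN, MX_B_ND, MX_B_DN };

// Checks only the static-shape conditions for MX-format loads on A5.
// `s` stands in for GlobalData::staticShape.
inline bool mx_static_shape_ok(MxLayout layout, const std::array<int, 5>& s) {
    switch (layout) {
        case MxLayout::MX_A_ZZ:
        case MxLayout::MX_B_NN:
            return s[3] == 16 && s[4] == 2;
        case MxLayout::MX_A_ND:
        case MxLayout::MX_A_DN:
        case MxLayout::MX_B_ND:
        case MxLayout::MX_B_DN:
            return s[0] == 1 && s[1] == 1 && s[4] == 2;
    }
    return false;
}
```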
Examples¶
#include <pto/pto-inst.hpp>
using namespace pto;

template <typename T>
void example(__gm__ T* in) {
    // 16x16 vector tile and a matching 16x16 row-major (ND) global tensor.
    using TileT = Tile<TileType::Vec, T, 16, 16>;
    using GShape = Shape<1, 1, 1, 16, 16>;
    using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
    using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;

    GTensor gin(in);
    TileT t;
    // TLOAD is asynchronous: wait on the returned event before reading t.
    RecordEvent e = TLOAD(t, gin);
    TSYNC(e);
}
See Also¶
- Instruction set overview: Memory And Data Movement
- Next op in instruction set: pto.tprefetch