pto.tload¶

pto.tload is part of the Memory And Data Movement instruction set.

Summary¶

Load data from global memory into a tile. The transfer is rectangular, spanning dst.GetValidRow() by dst.GetValidCol() elements.

Mechanism¶

pto.tload initiates a DMA transfer from the source GlobalTensor to the destination tile buffer. The transfer reads a rectangular region from the GlobalTensor and writes it into the tile's on-chip storage.

Let R = dst.GetValidRow() and C = dst.GetValidCol(). The transfer size is R × C elements. For 0 ≤ i < R and 0 ≤ j < C, the logical element mapping is:

\[ \mathrm{dst}_{i,j} = \mathrm{src}_{r_0 + i,\; c_0 + j} \]

where (r_0, c_0) is the base offset within the GlobalTensor. The exact address computation also depends on the GlobalTensor layout and stride.
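The mapping above can be sketched as a host-side reference model. This is plain C++, not the PTO API: the function name `reference_tload`, the flat row-major source buffer, and the `src_row_stride` parameter are assumptions made for illustration only, and the real address computation depends on the GlobalTensor layout and stride.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference model of the tload element mapping (not a PTO API).
// src is assumed to be a row-major buffer with src_row_stride elements per row;
// (r0, c0) is the base offset and R x C is the tile's valid region.
template <typename T>
std::vector<T> reference_tload(const std::vector<T>& src,
                               std::size_t src_row_stride,
                               std::size_t r0, std::size_t c0,
                               std::size_t R, std::size_t C) {
  // The requested rectangle must lie inside the source buffer.
  assert(c0 + C <= src_row_stride);
  assert((r0 + R) * src_row_stride <= src.size());

  std::vector<T> dst(R * C);
  for (std::size_t i = 0; i < R; ++i) {
    for (std::size_t j = 0; j < C; ++j) {
      // dst[i][j] = src[r0 + i][c0 + j]
      dst[i * C + j] = src[(r0 + i) * src_row_stride + (c0 + j)];
    }
  }
  return dst;
}
```

A model like this can serve as a golden reference when comparing tile contents copied back to the host after a load.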

The operation is asynchronous. It returns a RecordEvent token; wait on that token with TSYNC, or synchronize with set_flag/wait_flag, before reading the tile data.

Syntax¶

PTO Assembly Form¶

%t0 = tload %sv[%c0, %c0] : (!pto.memref<...>, index, index) -> !pto.tile<...>

AS Level 1 (SSA)¶

%dst = pto.tload %mem : !pto.partition_tensor_view<MxNxdtype> ->
!pto.tile<loc, dtype, rows, cols, blayout, slayout, fractal, pad>

AS Level 2 (DPS)¶

pto.tload ins(%mem : !pto.partition_tensor_view<MxNxdtype>)
          outs(%dst : !pto.tile_buf<...>)

C++ Intrinsic¶

Declared in include/pto/common/pto_instr.hpp:

template <typename TileData, typename GlobalData, typename... WaitEvents>
PTO_INST RecordEvent TLOAD(TileData &dst, GlobalData &src, WaitEvents &... events);

Inputs¶

Operand	Description
`dst`	Destination tile. The transfer shape is `dst.GetValidRow()` × `dst.GetValidCol()`.
`src`	Source GlobalTensor. Must be addressable from the local NPU.
`events...`	Optional `RecordEvent` tokens to wait on before issuing the operation.

Expected Outputs¶

Result	Type	Description
`RecordEvent`	`RecordEvent`	Token signaling completion of the load. Must be waited on before the tile data is consumed.

After the load completes, dst contains the loaded data with element layout determined by the tile layout and GlobalTensor stride.

Side Effects¶

Reads from global memory and writes to the tile buffer. Does not implicitly fence unrelated tile traffic.

Constraints¶

Valid region: The transfer size is dst.GetValidRow() × dst.GetValidCol().
Element size match: sizeof(tile.dtype) == sizeof(gtensor.dtype).
Layout compatibility: Source (GlobalTensor) layout and destination (tile) layout must be a supported combination. See the layout compatibility table in Memory And Data Movement.
Shape positivity: src.GetShape(dim) > 0 and dst.GetValidRow() > 0 and dst.GetValidCol() > 0 at runtime.
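As a rough illustration, the element-size and shape-positivity constraints above could be pre-checked on the host before a load is issued. The helper `tload_basic_constraints_ok` and its five-dimensional shape array are hypothetical and not part of PTO. The sketch does not cover layout compatibility.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical pre-check (not a PTO API) for two documented constraints:
//   sizeof(tile.dtype) == sizeof(gtensor.dtype), and
//   every source extent and both valid-region extents are positive.
// The 5-D shape is an assumption modeled on Shape<1, 1, 1, 16, 16>.
template <typename TileElem, typename GlobalElem>
bool tload_basic_constraints_ok(const std::array<std::int64_t, 5>& src_shape,
                                std::int64_t valid_rows,
                                std::int64_t valid_cols) {
  if (sizeof(TileElem) != sizeof(GlobalElem)) {
    return false;  // element size mismatch
  }
  for (std::int64_t extent : src_shape) {
    if (extent <= 0) {
      return false;  // non-positive source dimension
    }
  }
  return valid_rows > 0 && valid_cols > 0;
}
```

The size check compares only `sizeof`, as the constraint states, so `float` paired with `int32_t` passes while `float` paired with `int16_t` does not.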

Layout Compatibility¶

TileType	ND→ND	DN→DN	NZ→NZ	ND→NZ	DN→ZN
`TileType::Vec`	Yes	Yes	Yes	No	No
`TileType::Mat`	Yes	Yes	Yes	Yes	Yes
`TileType::Acc`	Yes	No	Yes	No	No

Additional constraints (A5):

- Vec with ND→NZ or DN→ZN: requires GlobalData::staticShape[0..2] == 1 and TileData::SFractalSize == 512.
- Vec with int64_t/uint64_t: only ND→ND or DN→DN supported.
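The base compatibility table can be encoded as a small lookup, which is handy for host-side validation or tests. The enums `TileKind` and `Conversion` and the function `tload_layout_supported` are hypothetical stand-ins, not PTO types. The sketch covers only the Yes/No table, not the additional constraints.

```cpp
#include <cassert>

// Hypothetical stand-ins for the tile type and the source->destination layout pair.
enum class TileKind { Vec, Mat, Acc };
enum class Conversion { ND_ND, DN_DN, NZ_NZ, ND_NZ, DN_ZN };

// Mirrors the Layout Compatibility table row by row.
constexpr bool tload_layout_supported(TileKind kind, Conversion conv) {
  switch (kind) {
    case TileKind::Mat:
      return true;  // every listed combination is supported
    case TileKind::Vec:
      return conv == Conversion::ND_ND || conv == Conversion::DN_DN ||
             conv == Conversion::NZ_NZ;
    case TileKind::Acc:
      return conv == Conversion::ND_ND || conv == Conversion::NZ_NZ;
  }
  return false;
}

// Compile-time spot checks against the table.
static_assert(tload_layout_supported(TileKind::Mat, Conversion::ND_NZ), "Mat ND->NZ");
static_assert(!tload_layout_supported(TileKind::Acc, Conversion::DN_DN), "Acc DN->DN");
```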

Exceptions¶

Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend instruction set.
Programs must not rely on behavior outside the documented legal domain of this operation.

Target-Profile Restrictions¶

A2/A3:

- TileData::DType must be one of: int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float.
- Destination tile location must be TileType::Vec or TileType::Mat.
- sizeof(TileData::DType) == sizeof(GlobalData::DType).
- Vec loads: layouts must match (ND→ND, DN→DN, NZ→NZ).
- Mat loads: supports all combinations including ND→NZ and DN→ZN.
- For ND→NZ or DN→ZN: GlobalData::staticShape[0..2] == 1 and TileData::SFractalSize == 512.
- int64_t/uint64_t: only ND→ND or DN→DN.
- TileData::Rows range: 1 <= Rows <= 4095.
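Two of the A2/A3 restrictions, the Rows range and the 64-bit integer layout rule, lend themselves to compile-time checks. The following is a minimal sketch under assumed names: `A2A3Conv`, `a2a3_rows_ok`, and `a2a3_conv_ok_for_dtype` are hypothetical and do not appear in the PTO headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for the source->destination layout pair.
enum class A2A3Conv { ND_ND, DN_DN, NZ_NZ, ND_NZ, DN_ZN };

// A2/A3: 1 <= Rows <= 4095.
constexpr bool a2a3_rows_ok(std::size_t rows) {
  return rows >= 1 && rows <= 4095;
}

// A2/A3: int64_t/uint64_t support only ND->ND or DN->DN.
// This sketch imposes no layout restriction on other element types.
template <typename T>
constexpr bool a2a3_conv_ok_for_dtype(A2A3Conv conv) {
  constexpr bool is_64bit_int = std::is_same<T, std::int64_t>::value ||
                                std::is_same<T, std::uint64_t>::value;
  return !is_64bit_int || conv == A2A3Conv::ND_ND || conv == A2A3Conv::DN_DN;
}

static_assert(a2a3_rows_ok(16), "16 rows is within range");
static_assert(!a2a3_rows_ok(4096), "4096 rows exceeds the limit");
```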

A5:

- sizeof(TileData::DType) must be 1, 2, 4, or 8 bytes, and must match sizeof(GlobalData::DType).
- Vec loads: row-major ND→ND, col-major DN→DN, or row-major NZ→NZ only.
- Mat loads: constrained by TLoadCubeCheck (specific ND/DN/NZ conversions and L1-size limits).
- Mat loads also handle mx format loads, including MX_A_ZZ/MX_A_ND/MX_A_DN to ZZ for scalarA and MX_B_NN/MX_B_ND/MX_B_DN to NN for scalarB.
- For MX_A_ZZ/MX_B_NN: GlobalData::staticShape[3] == 16 and GlobalData::staticShape[4] == 2.
- For MX_A_ND/MX_A_DN/MX_B_ND/MX_B_DN: GlobalData::staticShape[0] == 1 and GlobalData::staticShape[1] == 1 and GlobalData::staticShape[4] == 2.
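The static-shape rules for the A5 mx formats can be summarized as a small predicate. This is only a sketch: the enum `MxFormat` and the function `a5_mx_shape_ok` are hypothetical names, and the five-element array stands in for GlobalData::staticShape.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical enum naming the mx source formats listed for A5.
enum class MxFormat { MX_A_ZZ, MX_A_ND, MX_A_DN, MX_B_NN, MX_B_ND, MX_B_DN };

// Hypothetical check of the documented staticShape rules (not a PTO API).
inline bool a5_mx_shape_ok(MxFormat fmt, const std::array<std::int64_t, 5>& s) {
  switch (fmt) {
    case MxFormat::MX_A_ZZ:
    case MxFormat::MX_B_NN:
      // staticShape[3] == 16 and staticShape[4] == 2
      return s[3] == 16 && s[4] == 2;
    default:
      // MX_A_ND, MX_A_DN, MX_B_ND, MX_B_DN:
      // staticShape[0] == 1, staticShape[1] == 1, staticShape[4] == 2
      return s[0] == 1 && s[1] == 1 && s[4] == 2;
  }
}
```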

Examples¶

#include <pto/pto-inst.hpp>
using namespace pto;

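template <typename T>
void example(__gm__ T* in) {
  // Destination: a 16 x 16 vector tile with element type T.
  using TileT = Tile<TileType::Vec, T, 16, 16>;
  // Source: a GlobalTensor in ND layout whose last two dimensions are 16 x 16.
  using GShape = Shape<1, 1, 1, 16, 16>;
  using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
  using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;

  GTensor gin(in);  // wrap the global-memory pointer
  TileT t;
  // Issue the asynchronous load, then wait on its event before using t.
  RecordEvent e = TLOAD(t, gin);
  TSYNC(e);
}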