GlobalTensor And Data Movement¶

PTO does not hide movement between global memory and local execution state. GlobalTensor is the architecture-visible GM-facing object, and movement operations define when data enters or leaves local tile buffers or vector registers. PTO treats the vector tile buffer and the hardware Unified Buffer as the same architectural destination for TileType::Vec; this page avoids describing them as two separate user-visible concepts.

GlobalTensor¶

GlobalTensor Template Signature¶

GlobalTensor<DType, Shape, Stride, Layout>

Parameter	Type	Description
`DType`	C++ type	Element type matching the target tile
`Shape`	`Shape<ND>()`	N-dimensional shape: `Shape<B, H, W, R, C>`
`Stride`	`Stride<ND>()`	Per-dimension strides in elements
`Layout`	enum	Memory layout: `ND` (row-major), `DN` (col-major), `NZ` (row-major fractal)

GlobalTensor represents a view of __gm__ (off-chip device) memory. It is not the storage itself — it is a descriptor that pairs a pointer with shape and stride metadata.

GlobalTensor vs PartitionTensorView¶

Two GM-facing types appear in PTO programs; the table also lists the standard MLIR memref they lower to:

Type	Description	Usage
`GlobalTensor`	C++ API type; wraps a `__gm__ T*` with shape/stride	C++ kernel code
`!pto.partition_tensor_view<MxNxdtype>`	SSA/IR type; GM partition descriptor	PTO-AS and MLIR IR
`!pto.memref<dtype, Nd>`	MLIR standard memref	Lowered form

The partition_tensor_view describes a sub-partition of GM visible to a specific block or sub-block. Its shape is always 5D: (B, H, W, R, C) — batch, height, width, tile rows, tile columns.

Supported Layouts¶

Layout	Stride Pattern	Description
`ND` (default)	`stride[R] = C, stride[W] = R*C, stride[H] = W*R*C, ...`	Row-major, C-contiguous
`DN`	`stride[C] = B, stride[R] = B*C, stride[W] = B*C*R, ...`	Column-major, Fortran-contiguous
`NZ`	Row-major fractal stride	Used with fractal tile layouts
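The row-major (`ND`) stride rule can be sketched on the host as follows. This is a minimal illustration, not PTO code: `Dims5`, `nd_strides`, and `element_offset` are hypothetical names, and the shape order (B, H, W, R, C) is taken from this page. Each stride is the product of all extents to its right, so the last dimension is contiguous.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical host-side helpers; not part of the PTO API.
// Dimension order follows this page: (B, H, W, R, C).
using Dims5 = std::array<std::size_t, 5>;

// Row-major (ND) strides in elements: the innermost dimension has
// stride 1 and each outer stride is the product of all inner extents.
inline Dims5 nd_strides(const Dims5& shape) {
    Dims5 stride{};
    std::size_t running = 1;
    for (std::size_t d = shape.size(); d-- > 0;) {
        assert(shape[d] > 0);  // empty dimensions are not modeled here
        stride[d] = running;
        running *= shape[d];
    }
    return stride;
}

// Element offset of a 5D index relative to the base pointer.
inline std::size_t element_offset(const Dims5& index, const Dims5& stride) {
    std::size_t off = 0;
    for (std::size_t d = 0; d < index.size(); ++d) {
        off += index[d] * stride[d];
    }
    return off;
}
```

For a shape of (2, 3, 4, 16, 16), this yields strides of (3072, 1024, 256, 16, 1) elements.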

Tile Instructions Data Path¶

The tile instructions (pto.t*) move data between GM and tile buffers through MTE2/MTE3. For TileType::Vec, the destination tile buffer is the hardware Unified Buffer; for Left, Right, and Acc, the destination tile buffers map to L0A, L0B, and L0C respectively.

GM
  │
  │  copy via DMA engine
  ▼
Local Tile Buffer (`Vec` uses hardware UB; `Left`/`Right`/`Acc` use L0A/L0B/L0C)
  │
  │  tile compute reads and writes the selected tile buffer role directly
  ▼
Tile Compute
  │
  │  copy via DMA engine
  ▼
GM

TLOAD¶

TLOAD moves data from a GlobalTensor into the destination tile buffer:

dst[i, j] = src[ r0 + i, c0 + j ]

Where r0 and c0 are the base offsets derived from the GlobalTensor shape/stride and the tile's declared valid region (Rv, Cv), and i ∈ [0, dst.GetValidRow()), j ∈ [0, dst.GetValidCol()).

Transfer size: TLOAD transfers exactly dst.GetValidRow() × dst.GetValidCol() elements.

Constraints:

- Source dtype size MUST equal destination dtype size.
- Layout compatibility MUST be satisfied:
    - TileType::Vec: ND→ND, DN→DN, NZ→NZ
    - TileType::Mat: ND→ND, DN→DN, NZ→NZ, ND→NZ, DN→ZN
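The TLOAD index rule above can be checked against a small host-side reference model. The sketch below is an assumption-laden illustration, not the PTO implementation: `GmView2D`, `HostTile`, and `tload_reference` are hypothetical stand-ins, the view is restricted to two dimensions, and layout conversion is not modeled. It copies only the valid region and leaves the rest of the tile untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-ins; these are not the PTO types.
struct GmView2D {
    const float* base;       // start of the GM allocation
    std::size_t row_stride;  // elements between consecutive rows
    std::size_t r0, c0;      // base offsets of the tile within the view
};

struct HostTile {
    std::size_t rows, cols;              // declared tile shape
    std::size_t valid_rows, valid_cols;  // valid region (Rv, Cv)
    std::vector<float> data;             // rows * cols elements, row-major
};

// Reference semantics: dst[i, j] = src[r0 + i, c0 + j] over the valid
// region. Returns the number of elements transferred.
inline std::size_t tload_reference(HostTile& dst, const GmView2D& src) {
    assert(dst.valid_rows <= dst.rows && dst.valid_cols <= dst.cols);
    assert(dst.data.size() == dst.rows * dst.cols);
    for (std::size_t i = 0; i < dst.valid_rows; ++i) {
        for (std::size_t j = 0; j < dst.valid_cols; ++j) {
            dst.data[i * dst.cols + j] =
                src.base[(src.r0 + i) * src.row_stride + (src.c0 + j)];
        }
    }
    return dst.valid_rows * dst.valid_cols;
}
```

With a 4 × 4 tile whose valid region is 2 × 3, the model transfers 6 elements, matching the transfer-size rule.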

TSTORE¶

TSTORE moves data from the source tile buffer to a GlobalTensor:

dst[ r0 + i, c0 + j ] = src[i, j]

Where r0 and c0 are the base offsets into the destination GlobalTensor, i ∈ [0, src.GetValidRow()), and j ∈ [0, src.GetValidCol()).

Transfer size: TSTORE transfers exactly src.GetValidRow() × src.GetValidCol() elements.

Atomic Store Variants¶

TSTORE supports atomic store modes via the AtomicType attribute:

AtomicType	Behavior
`AtomicNone`	Normal store (overwrite)
`AtomicAdd`	Atomic add to GM location
`AtomicMax`	Atomic max
`AtomicMin`	Atomic min
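The per-element result of each store mode can be sketched with a sequential host model. Everything here is hypothetical: `AtomicMode`, `combine`, and `tstore_reference` are illustrative names, and the model only shows what value lands in GM. It says nothing about how the hardware orders concurrent writers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side mirror of the AtomicType values in the table.
enum class AtomicMode { None, Add, Max, Min };

// Combine one stored element with the value already present in GM.
inline float combine(AtomicMode mode, float gm_value, float tile_value) {
    switch (mode) {
        case AtomicMode::Add: return gm_value + tile_value;
        case AtomicMode::Max: return std::max(gm_value, tile_value);
        case AtomicMode::Min: return std::min(gm_value, tile_value);
        case AtomicMode::None: break;
    }
    return tile_value;  // plain overwrite
}

// Reference semantics: dst[r0 + i, c0 + j] is updated from src[i, j]
// over the source tile's valid region, using the selected mode.
inline void tstore_reference(std::vector<float>& gm, std::size_t gm_cols,
                             std::size_t r0, std::size_t c0,
                             const std::vector<float>& tile,
                             std::size_t tile_cols,
                             std::size_t valid_rows, std::size_t valid_cols,
                             AtomicMode mode) {
    assert(valid_cols <= tile_cols);
    for (std::size_t i = 0; i < valid_rows; ++i) {
        for (std::size_t j = 0; j < valid_cols; ++j) {
            float& slot = gm[(r0 + i) * gm_cols + (c0 + j)];
            slot = combine(mode, slot, tile[i * tile_cols + j]);
        }
    }
}
```

In this model `AtomicMode::None` reproduces the plain TSTORE rule, while the other modes fold the tile value into whatever GM already holds.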

Vector Instructions Data Path¶

The vector instructions (pto.v*) require an explicit GM↔vector-tile-buffer DMA step before vector loads and after vector stores:

GM
  │
  │  copy_gm_to_ubuf (DMA, MTE2)
  ▼
Vector Tile Buffer (hardware UB, 256 KB on-chip)
  │
  │  vlds / vsld / vgather2 (vector load, from the vector tile buffer to vreg)
  ▼
Vector Registers  ──►  Vector Compute  ──►  Vector Registers
                                                │
                                                │  vsts / vsst / vscatter (vector store)
                                                ▼
                                   Vector Tile Buffer ──► GM  (copy_ubuf_to_gm, MTE3)

DMA Copy Operations¶

The following scalar/control operations configure and execute DMA into, out of, and within the vector tile buffer (hardware UB):

Operation	Direction	Description
`copy_gm_to_ubuf`	GM → vector tile buffer	Move data from GM into the vector tile buffer (hardware UB)
`copy_ubuf_to_gm`	vector tile buffer → GM	Move data from the vector tile buffer back to GM
`copy_ubuf_to_ubuf`	vector tile buffer → vector tile buffer	Copy within the vector tile buffer space (e.g., double-buffering)

These operations belong to the pto.* control instruction set. They do NOT synchronize implicitly: a set_flag/wait_flag sequence or an explicit TSYNC is required before subsequent vector compute consumes the data.

Vector Load/Store (pto.v*)¶

After DMA staging, vlds/vsld bring data from UB into vector registers, and vsts/vsst write data from vector registers back to UB:

Operation	Path	Description
`vlds`	vector tile buffer → vreg	Standard vector load with distribution mode
`vsts`	vreg → vector tile buffer	Standard vector store
`vgather2`	vector tile buffer → vreg	Strided/gather load from the vector tile buffer
`vscatter`	vreg → vector tile buffer	Strided/scatter store to the vector tile buffer

Distribution modes (for vlds):

Mode	Meaning
`NORM`	Contiguous 256-byte load
`BRC_B8/B16/B32`	Broadcast: all lanes read the same address
`US_B8/B16`	Upsample: duplicate every Nth element
`DS_B8/B16`	Downsample: keep every Nth element
`UNPK_B8/B16/B32`	Unpack: zero-extend to wider type
`DINTLV_B32`	Deinterleave: extract even/odd lanes
`SPLT2CHN_B8/B16`	Split 2-channel
`SPLT4CHN_B8`	Split 4-channel (RGBA→R)
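A host-side model can make the lane-to-address mapping of a few modes concrete. This sketch reads the table literally and rests on several assumptions: a register holds 256 bytes (so 64 lanes of a 32-bit type or 128 lanes of a 16-bit type), the upsample/downsample factor `n` is a free parameter rather than a documented constant, and `VReg`, `load_norm`, `load_brc`, `load_us`, and `load_ds` are hypothetical names. The hardware definition of each mode takes precedence over this model.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side model of a 256-byte vector register.
template <typename T>
using VReg = std::array<T, 256 / sizeof(T)>;

// NORM: lane k reads element k (one contiguous 256-byte span).
template <typename T>
VReg<T> load_norm(const T* ub) {
    VReg<T> v{};
    for (std::size_t k = 0; k < v.size(); ++k) v[k] = ub[k];
    return v;
}

// BRC: every lane reads the same address.
template <typename T>
VReg<T> load_brc(const T* ub) {
    VReg<T> v{};
    v.fill(ub[0]);
    return v;
}

// US with an assumed factor n: each source element fills n adjacent lanes.
template <typename T>
VReg<T> load_us(const T* ub, std::size_t n) {
    assert(n > 0);
    VReg<T> v{};
    for (std::size_t k = 0; k < v.size(); ++k) v[k] = ub[k / n];
    return v;
}

// DS with an assumed factor n: lane k reads source element k * n.
template <typename T>
VReg<T> load_ds(const T* ub, std::size_t n) {
    assert(n > 0);
    VReg<T> v{};
    for (std::size_t k = 0; k < v.size(); ++k) v[k] = ub[k * n];
    return v;
}
```

Note that under these assumptions a downsampling load touches `n` times as many source bytes as a `NORM` load, so the staged region in the vector tile buffer must be correspondingly larger.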

MTE Pipeline¶

The DMA engine uses three sub-units that operate in a pipeline:

MTE	Direction	Role in Tile Instructions	Role in Vector Instructions
`MTE1`	GM → vector tile buffer	Optional: explicit prefetch	Pre-stage data before vector load
`MTE2`	GM → local tile buffer	Load staging into the selected local tile buffer (via `TLOAD`)	DMA copy: GM→vector tile buffer (via `copy_gm_to_ubuf`)
`MTE3`	local tile buffer → GM	Store from the selected local tile buffer (via `TSTORE`)	DMA copy: vector tile buffer → GM (via `copy_ubuf_to_gm`)

MTE1, MTE2, and MTE3 can operate in parallel with the Vector Pipeline and Matrix Multiply Unit when proper set_flag/wait_flag synchronization is used.

Constraints¶

Movement legality depends on source instruction set, destination instruction set, layout, and target profile.
Movement ops do not erase valid-region rules; they carry or define them.
Vector-instruction set loads and stores obey their own buffer/register rules and are NOT interchangeable with tile movement.
DMA copy operations require explicit synchronization before their data is consumed by vector compute.
TLOAD/TSTORE carry valid-region information implicitly; the transfer size is determined by the destination/source tile's valid region.

Cases That Are Not Allowed¶

Documenting data movement as though it were implicit when the ISA requires an explicit move.
Assuming vector-buffer traffic and tile-buffer traffic share the same legality contract.
Silently relying on target-specific movement shortcuts as if they were architecture-wide.
Issuing a vlds before the corresponding copy_gm_to_ubuf has completed without an intervening set_flag/wait_flag.

Examples¶

Tile Instructions: Elementwise Add¶

#include <pto/pto-inst.hpp>
using namespace pto;

// ga and gb are the GM inputs; gc is the GM output.
void vec_add(GlobalTensor<float>& gc, const GlobalTensor<float>& ga,
             const GlobalTensor<float>& gb) {
    Tile<float, 16, 16> a, b, c;
    TLOAD(a, ga);           // GM → local tile buffer A
    TLOAD(b, gb);           // GM → local tile buffer B
    TADD(c, a, b);          // c = a + b, iterated over c's valid region
    TSTORE(gc, c);          // Local tile buffer C → GM
}

Vector Instructions: Fine-Grained Vector Load/Store¶

// 1. DMA copy from GM to UB staging area
copy_gm_to_ubuf(%ub, %gm_ptr, %sid, %n_burst, %len_burst, %stride_dst, %stride_src);

// 2. Signal Vector pipe that data is ready
set_flag(PIPE_MTE2, PIPE_V, EVENT_ID0);

// 3. Wait for data, then vector load
wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID0);
%vreg = pto.vlds %ub[%offset] {dist = "NORM"} : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>;

// 4. Vector compute
%result = pto.vadd %vreg, %vreg : !pto.vreg<64xf32> -> !pto.vreg<64xf32>;

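// 5. Vector store
pto.vsts %result, %ub_out[%offset] : !pto.vreg<64xf32>, !pto.ptr<f32, ub> -> ();

// 6. Order the vector store before the outbound DMA (V → MTE3)
set_flag(PIPE_V, PIPE_MTE3, EVENT_ID0);
wait_flag(PIPE_V, PIPE_MTE3, EVENT_ID0);

// 7. DMA copy from UB back to GM
copy_ubuf_to_gm(%ub_out, %gm_out, %sid, %n_burst, %len_burst, %reserved, %stride_dst, %stride_src);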