GlobalTensor And Data Movement

PTO does not hide movement between global memory and local execution state. GlobalTensor is the architecture-visible GM-facing object, and movement operations define when data enters or leaves local tile buffers or vector registers. PTO treats the vector tile buffer and the hardware Unified Buffer as the same architectural destination for TileType::Vec; this page avoids describing them as two separate user-visible concepts.

GlobalTensor

GlobalTensor Template Signature

GlobalTensor<DType, Shape, Stride, Layout>
Parameter Type Description
DType C++ type Element type matching the target tile
Shape Shape<ND>() N-dimensional shape: Shape<B, H, W, R, C>
Stride Stride<ND>() Per-dimension strides in elements
Layout enum Memory layout: ND (row-major), DN (col-major), NZ (row-major fractal)

GlobalTensor represents a view of __gm__ (off-chip device) memory. It is not the storage itself — it is a descriptor that pairs a pointer with shape and stride metadata.

GlobalTensor vs PartitionTensorView

Two GM-facing types appear in PTO programs:

Type Description Usage
GlobalTensor C++ API type; wraps a __gm__ T* with shape/stride C++ kernel code
!pto.partition_tensor_view<MxNxdtype> SSA/IR type; GM partition descriptor PTO-AS and MLIR IR
!pto.memref<dtype, Nd> MLIR standard memref Lowered form

The partition_tensor_view describes a sub-partition of GM visible to a specific block or sub-block. Its shape is always 5D: (B, H, W, R, C) — batch, height, width, tile rows, tile columns.

Supported Layouts

Layout Stride Pattern Description
ND (default) stride[R] = C, stride[W] = R*C, stride[H] = W*H*C, ... Row-major, C-contiguous
DN stride[C] = B, stride[R] = B*C, stride[W] = B*C*R, ... Column-major, Fortran-contiguous
NZ Row-major fractal stride Used with fractal tile layouts

Tile Instructions Data Path

The tile instructions (pto.t*) move data between GM and tile buffers through MTE2/MTE3. For TileType::Vec, the destination tile buffer is the hardware Unified Buffer; for Left, Right, and Acc, the destination tile buffers map to L0A, L0B, and L0C respectively.

GM
  │
  │  copy via DMA engine
  ▼
Local Tile Buffer (`Vec` uses hardware UB; `Left`/`Right`/`Acc` use L0A/L0B/L0C)
  │
  │  tile compute reads and writes the selected tile buffer role directly
  ▼
Tile Compute
  │
  │  copy via DMA engine
  ▼
GM

TLOAD

TLOAD moves data from a GlobalTensor into the destination tile buffer:

dst[i, j] = src[ r0 + i, c0 + j ]

Where r0 and c0 are the base offsets derived from the GlobalTensor shape/stride and the tile's declared valid region (Rv, Cv).

Transfer size: TLOAD transfers exactly dst.GetValidRow() × dst.GetValidCol() elements.

Constraints: - Source dtype size MUST equal destination dtype size. - Layout compatibility MUST be satisfied: - TileType::Vec: ND→ND, DN→DN, NZ→NZ - TileType::Mat: ND→ND, DN→DN, NZ→NZ, ND→NZ, DN→ZN

TSTORE

TSTORE moves data from the source tile buffer to a GlobalTensor:

dst[ r0 + i, c0 + j ] = src[i, j]

Where i ∈ [0, src.GetValidRow()), j ∈ [0, src.GetValidCol()).

Transfer size: TSTORE transfers exactly src.GetValidRow() × src.GetValidCol() elements.

Atomic Store Variants

TSTORE supports atomic store modes via the AtomicType attribute:

AtomicType Behavior
AtomicNone Normal store (overwrite)
AtomicAdd Atomic add to GM location
AtomicMax Atomic max
AtomicMin Atomic min

Vector Instructions Data Path

The vector instructions (pto.v*) require an explicit GM↔vector-tile-buffer DMA step before vector loads and after vector stores:

GM
  │
  │  copy_ubuf_to_gm / copy_gm_to_ubuf (DMA, MTE2/MTE3)
  ▼
Vector Tile Buffer (hardware UB, 256 KB on-chip)
  │
  │  vlds / vsld / vgather2 (vector load, from the vector tile buffer to vreg)
  ▼
Vector Registers  ──►  Vector Compute  ──►  Vector Registers
                                                │
                                                │  vsts / vsst / vscatter (vector store)
                                                ▼
                                   Vector Tile Buffer ──► GM

DMA Copy Operations

The following scalar/control operations configure and execute GM↔UB DMA:

Operation Direction Description
copy_gm_to_ubuf GM → vector tile buffer Move data from GM into the vector tile buffer (hardware UB)
copy_ubuf_to_gm vector tile buffer → GM Move data from the vector tile buffer back to GM
copy_ubuf_to_ubuf vector tile buffer → vector tile buffer Copy within the vector tile buffer space (e.g., double-buffering)

These are pto.* control-instruction set operations. They do NOT implicitly synchronize — a set_flag/wait_flag sequence or explicit TSYNC is required before the data is consumed by subsequent vector compute.

Vector Load/Store (pto.v*)

After DMA staging, vlds/vsld bring data from UB into vector registers, and vsts/vsst write data from vector registers back to UB:

Operation Path Description
vlds vector tile buffer → vreg Standard vector load with distribution mode
vsld vreg → vector tile buffer Standard vector store
vgather2 vector tile buffer → vreg Strided/gather load from the vector tile buffer
vscatter vreg → vector tile buffer Strided/scatter store to the vector tile buffer

Distribution modes (for vlds):

Mode Meaning
NORM Contiguous 256-byte load
BRC_B8/B16/B32 Broadcast: all lanes read the same address
US_B8/B16 Upsample: duplicate every Nth element
DS_B8/B16 Downsample: keep every Nth element
UNPK_B8/B16/B32 Unpack: zero-extend to wider type
DINTLV_B32 Deinterleave: extract even/odd lanes
SPLT2CHN_B8/B16 Split 2-channel
SPLT4CHN_B8 Split 4-channel (RGBA→R)

MTE Pipeline

The DMA engine uses three sub-units that operate in a pipeline:

MTE Direction Role in Tile Instructions Role in Vector Instructions
MTE1 GM → vector tile buffer Optional: explicit prefetch Pre-stage data before vector load
MTE2 GM → local tile buffer Load staging into the selected local tile buffer (via TLOAD) DMA copy: GM→vector tile buffer (via copy_gm_to_ubuf)
MTE3 local tile buffer → GM Store from the selected local tile buffer (via TSTORE) DMA copy: vector tile buffer → GM (via copy_ubuf_to_gm)

MTE1, MTE2, and MTE3 can operate in parallel with the Vector Pipeline and Matrix Multiply Unit when proper set_flag/wait_flag synchronization is used.

Constraints

Constraints

  • Movement legality depends on source instruction set, destination instruction set, layout, and target profile.
  • Movement ops do not erase valid-region rules; they carry or define them.
  • Vector-instruction set loads and stores obey their own buffer/register rules and are NOT interchangeable with tile movement.
  • DMA copy operations require explicit synchronization before their data is consumed by vector compute.
  • TLOAD/TSTORE carry valid-region information implicitly; the transfer size is determined by the destination/source tile's valid region.

Cases That Are Not Allowed

Cases That Are Not Allowed

  • Documenting data movement as though it were implicit when the ISA requires an explicit move.
  • Assuming vector-buffer traffic and tile-buffer traffic share the same legality contract.
  • Silently relying on target-specific movement shortcuts as if they were architecture-wide.
  • Issuing a vlds before the corresponding copy_gm_to_ubuf has completed without an intervening set_flag/wait_flag.

Examples

Tile Instructions: Elementwise Add

#include <pto/pto-inst.hpp>
using namespace pto;

void vec_add(Tile<float, 16, 16>& c, const GlobalTensor<float>& ga,
             const GlobalTensor<float>& gb) {
    Tile<float, 16, 16> a, b;
    TLOAD(a, ga);           // GM → local tile buffer A
    TLOAD(b, gb);           // GM → local tile buffer B
    TADD(c, a, b);          // c = a + b, iterated over c's valid region
    TSTORE(gc, c);          // Local tile buffer C → GM
}

Vector Instructions: Fine-Grained Vector Load/Store

// 1. DMA copy from GM to UB staging area
copy_gm_to_ubuf(%ub_ptr, %gm_ptr, %sid, %n_burst, %len_burst, %stride_dst, %stride_src);

// 2. Signal Vector pipe that data is ready
set_flag(PIPE_MTE2, PIPE_V, EVENT_ID0);

// 3. Wait for data, then vector load
wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID0);
%vreg = pto.vlds %ub[%offset] {dist = "NORM"} : !pto.ptr<f32, ub> -> !pto.vreg<64xf32>;

// 4. Vector compute
%result = pto.vadd %vreg, %vreg : !pto.vreg<64xf32> -> !pto.vreg<64xf32>;

// 5. Vector store
pto.vsts %result, %ub_out[%offset] : !pto.vreg<64xf32>, !pto.ptr<f32, ub> -> ();

// 6. DMA copy from UB back to GM
copy_ubuf_to_gm(%ub_out, %gm_out, %sid, %n_burst, %len_burst, %reserved, %stride_dst, %stride_src);

See Also