pto.tpush¶
pto.tpush is part of the System Scheduling instruction set.
Summary¶
TPUSH moves a tile from a producer pipeline (Cube or Vector) into a ring FIFO for consumption by a paired pipeline. It is the producer half of the TPipe/TMPipe producer-consumer protocol.
What TPUSH Is Not¶
TPUSH is not a scalar stack push or a generic FIFO enqueue. It is a structured tile-movement protocol for Cube-Vector tile passing. It is not available on the CPU simulator.
Architecture: The TPipe Abstraction¶
A TPipe<FlagID, DirType, SlotSize, SlotNum, LocalSlotNum> is a compile-time configured ring FIFO that connects a producer and a consumer. Each TPipe owns:
- A RingFIFO that manages GM slots and consumer-local buffers
- A Producer struct that implements the push protocol
- A Consumer struct that implements the pop protocol (for paired use)
The producer and consumer are independent halves of the same logical FIFO — they do not share the same struct instance, but they share the same GM slot buffer address and flag ID namespace.
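The ring-FIFO idea can be made concrete with a small host-side model of slot addressing. This is a minimal sketch, not the library's implementation: the struct name `RingFifoModel`, the push counter, and the modulo wrap-around rule are illustrative assumptions. Only `SlotSize` and `SlotNum` correspond to the TPipe template parameters described above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical host-side model of ring-FIFO slot addressing.
// Only SlotSize and SlotNum mirror TPipe template parameters; the rest is assumed.
template <std::size_t SlotSize, std::size_t SlotNum>
struct RingFifoModel {
    static_assert(SlotNum > 0, "a FIFO needs at least one slot");

    std::uint64_t pushCount = 0;  // total number of tiles pushed so far

    // Byte offset (relative to the FIFO base) of the slot the next push would write.
    std::size_t nextSlotOffset() const {
        return static_cast<std::size_t>(pushCount % SlotNum) * SlotSize;
    }

    // Move on to the next slot; the offset wraps around after SlotNum pushes.
    void advance() { ++pushCount; }

    // Total bytes occupied by all slots.
    static constexpr std::size_t totalBytes() { return SlotSize * SlotNum; }
};
```

With `SlotSize = 16384` and `SlotNum = 4`, successive pushes would land at offsets 0, 16384, 32768, 49152 and then wrap back to 0, under the assumed wrap-around rule.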
Direction Types¶
The DirType template parameter selects the communication direction:
| DirType Constant | Meaning | Producer Side | Consumer Side |
|---|---|---|---|
| DIR_C2V (1) | Cube → Vector | TileType::Acc | TileType::Vec |
| DIR_V2C (2) | Vector → Cube | TileType::Vec | TileType::Mat |
| DIR_BOTH (3) | Cube ↔ Vector (bidirectional) | Acc + Vec | Vec + Mat |
| DIR_V2C_CTRL (4) | Vector → Cube (control signal) | TileType::Vec | scalar control |
| DIR_C2V_GM (5, A5 only) | Cube → Vector via GM | TileType::Acc | TileType::Vec |
| DIR_V2C_GM (6, A5 only) | Vector → Cube via GM | TileType::Vec | TileType::Mat |
| DIR_BOTH_GM (7, A5 only) | Bidirectional via GM | Acc + Vec | Vec + Mat |
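The direction table can be read as a compatibility rule between a direction and the producer tile location. The sketch below encodes that rule in standalone C++. The names `Dir`, `Loc`, `producerAllowed` and `isA5Only` are hypothetical stand-ins for illustration; only the numeric values and the Acc/Vec/Mat pairings are taken from the table.

```cpp
#include <cstdint>

// Illustrative mirror of the direction table; these are not the library's types.
enum class Dir : std::uint8_t {
    C2V = 1, V2C = 2, BOTH = 3, V2C_CTRL = 4,
    C2V_GM = 5, V2C_GM = 6, BOTH_GM = 7
};

enum class Loc : std::uint8_t { Acc, Vec, Mat };

// True if a producer tile located at `loc` may push in direction `d`.
constexpr bool producerAllowed(Dir d, Loc loc) {
    return (d == Dir::C2V || d == Dir::C2V_GM) ? loc == Loc::Acc
         : (d == Dir::V2C || d == Dir::V2C_GM || d == Dir::V2C_CTRL) ? loc == Loc::Vec
         : (loc == Loc::Acc || loc == Loc::Vec);  // BOTH and BOTH_GM
}

// True if the table marks the direction as A5 only.
constexpr bool isA5Only(Dir d) {
    return d == Dir::C2V_GM || d == Dir::V2C_GM || d == Dir::BOTH_GM;
}

static_assert(producerAllowed(Dir::C2V, Loc::Acc), "Cube pushes accumulator tiles");
static_assert(!producerAllowed(Dir::V2C, Loc::Mat), "Mat is a consumer-side location");
```

Expressing the rule as `constexpr` functions lets a mismatched direction and tile location be rejected at compile time, which fits the compile-time configuration style of TPipe.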
FIFO Storage Paths¶
Depending on the direction and backend, data flows through different storage paths:
| Path | Description | Used By |
|---|---|---|
| GM FIFO | Tile data written to global memory slots; consumer loads via TLOAD | *_GM directions |
| Local UB FIFO | Tile data written to consumer's local UB buffer directly; no GM traffic | DIR_C2V, DIR_BOTH on A2/A3 |
| MAT FIFO | Vector writes tile into Cube's local MAT buffer via TINSERT | DIR_V2C, DIR_BOTH on A2/A3 |
| CTRL FIFO | Scalar control signal (32-bit) written to shared control buffer | DIR_V2C_CTRL |
On A5, each path is selected by an explicit FIFO type: VEC_FIFO, MAT_FIFO, GM_FIFO, or CTRL_FIFO.
Three-Phase Protocol¶
Every TPUSH call executes three phases in order:
1. ALLOCATE ──► 2. PUSH ──► 3. RECORD
Phase 1: Allocate (wait for free slot)¶
Producer ──wait_flag_dev(FlagID+1)──► Consumer
The producer waits until the consumer has freed a slot. This prevents overwriting data the consumer has not yet consumed.
- On A2/A3 C2V: wait_flag_dev(FlagID + 1) via PIPE_MTE2
- On A2/A3 V2C: wait_flag_dev(FlagID + 1) via PIPE_MTE2
- On A5 C2V: wait_intra_block(PIPE_FIX, FlagID + 1) (and +16 for subblock 1)
- On A5 V2C: wait_intra_block(PIPE_MTE3, FlagID + 1)
- On A5 V2C_CTRL: wait_intra_block(PIPE_S, FlagID + 1)
Skipped if isAllocate = false (controlled via pipe.prod.setAllocateStatus(false)).
Phase 2: Push (write data to FIFO)¶
The actual data transfer depends on the FIFO type and direction:
C2V (Acc → Vec):
- A2/A3: TSTORE_IMPL writes accumulator tile to GM slot buffer
- A5: Either direct TMOV into consumer's local UB buffer (C2V_CONSUMER_BUF) or via GM, depending on is_c2v_ub vs is_c2v_gm
V2C (Vec → Mat):
- A2/A3: TSTORE_IMPL writes vector tile to GM slot buffer; sub-tile offsets computed via get_subblockid() when splitting
- A5: Either TINSERT into consumer's local MAT buffer or via GM
V2C_CTRL: Writes a single scalar control signal (32-bit) from the vector tile's first element to the control slot buffer.
Phase 3: Record (signal data-ready to consumer)¶
Producer ──set_flag/ffts_cross_core_sync(FlagID)──► Consumer
The producer signals that data is ready. The consumer can now safely wait for and consume it.
- On A2/A3 C2V: ffts_cross_core_sync(PIPE_FIX, ...)
- On A2/A3 V2C: ffts_cross_core_sync(PIPE_MTE3, ...)
- On A5 C2V: set_intra_block(PIPE_FIX, FlagID) (+16 for subblock 1)
- On A5 V2C: set_intra_block(PIPE_MTE3, FlagID)
- On A5 V2C_CTRL: set_intra_block(PIPE_S, FlagID)
Skipped if isRecord = false.
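The three phases can be illustrated with a single-threaded host model in which counters stand in for the hardware flags. This is a sketch under stated assumptions: `PipeModel`, `tryPush` and `tryPop` are hypothetical names, tiles are reduced to `int` values, and a `false` return marks the point where the real producer or consumer would block on a flag. It does not model the isAllocate or isRecord switches.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-threaded model of the allocate / push / record phases.
class PipeModel {
public:
    explicit PipeModel(std::size_t slotNum)
        : slots_(slotNum, 0), freeSlots_(slotNum) {
        assert(slotNum > 0);  // a FIFO needs at least one slot
    }

    // Producer side: returns false where the real producer would wait.
    bool tryPush(int tile) {
        if (freeSlots_ == 0) return false;     // Phase 1: allocate (no free slot yet)
        --freeSlots_;
        slots_[head_] = tile;                  // Phase 2: push (write into the slot)
        head_ = (head_ + 1) % slots_.size();
        ++readySlots_;                         // Phase 3: record (signal data-ready)
        return true;
    }

    // Consumer side: returns false where the real consumer would wait.
    bool tryPop(int& tile) {
        if (readySlots_ == 0) return false;    // nothing recorded yet
        --readySlots_;
        tile = slots_[tail_];
        tail_ = (tail_ + 1) % slots_.size();
        ++freeSlots_;                          // frees the slot for a later allocate
        return true;
    }

private:
    std::vector<int> slots_;
    std::size_t freeSlots_;
    std::size_t readySlots_ = 0;
    std::size_t head_ = 0;
    std::size_t tail_ = 0;
};
```

In this model a third push into a two-slot pipe fails until the consumer pops once, which is the back-pressure the allocate phase provides on hardware.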
Tile Split Modes¶
When producer and consumer operate on different tile shapes (e.g., producer is 128×256, consumer is 64×256), TileSplitAxis controls how the tile is decomposed:
| Split Mode | Meaning | Offset Computation |
|---|---|---|
| TILE_NO_SPLIT | Single writer, no decomposition | offset = 0 |
| TILE_UP_DOWN | Split along rows; each subblock writes a row block | offset = subblock_id × ProdM × ProdN × sizeof(T) |
| TILE_LEFT_RIGHT | Split along columns; each subblock writes a column block | offset = subblock_id × ProdN × sizeof(T) |
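The offset formulas in the table can be written as one small function. The sketch below is a standalone illustration: `SplitAxis` and `splitOffsetBytes` are hypothetical names, and the tile shape and element size are passed as plain arguments instead of being read from tile types.

```cpp
#include <cstddef>

// Hypothetical mirror of the split modes listed in the table.
enum class SplitAxis { NoSplit, UpDown, LeftRight };

// Byte offset at which subblock `subblockId` starts writing.
// prodM x prodN is the producer tile shape in elements; elemSize is sizeof(T).
constexpr std::size_t splitOffsetBytes(SplitAxis axis, std::size_t subblockId,
                                       std::size_t prodM, std::size_t prodN,
                                       std::size_t elemSize) {
    return axis == SplitAxis::UpDown    ? subblockId * prodM * prodN * elemSize
         : axis == SplitAxis::LeftRight ? subblockId * prodN * elemSize
                                        : 0;  // NoSplit: single writer at offset 0
}
```

For example, with a 64×256 producer tile of 2-byte elements, subblock 1 starts at 64 × 256 × 2 = 32768 bytes under the row split, and at 256 × 2 = 512 bytes under the column split.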
TMPipe: Multi-Pipe Variant¶
TMPipe<FlagID, FiFoType, FiFoDepth, FiFoSyncT, TileDataProd, TileDataCons, EN_UNIT_FLAG, LocalFiFoDepth, VCRatio> is the multi-pipe version. Key differences:
- FiFoType: Selects FIFO implementation (GM_FIFO, VEC_FIFO, MAT_FIFO, CTRL_FIFO)
- FiFoDepth: Configurable FIFO depth (2–8 on A2/A3; up to 16 on A5)
- LocalFiFoDepth: Local UB buffer depth for GM-path consumers
- VCRatio: V2C1_VECS (1 Cube, 2 Vec cores) or V2C2_VECS (1 Cube, 1 Vec core)
- EN_UNIT_FLAG: Enables per-slot synchronization for fine-grained flow control
Syntax¶
IR Level 1 (SSA)¶
%event = pto.tpush %tile, %pipe : (!pto.tile<f32, 64, 64>, !pto.tpipe<...>) -> !pto.record_event
IR Level 2 (DPS)¶
pto.tpush ins(%tile : !pto.tile_buf<f32, 64, 64>) pipe(%pipe : !pto.tpipe<...>)
C++ Intrinsic¶
#include <pto/common/fifo.hpp>
using namespace pto;
// Define a C2V pipe: Acc producer → Vec consumer
// FlagID=0, DirType=DIR_C2V, SlotSize=16384 bytes, SlotNum=4, LocalSlotNum=2
using MyPipe = TPipe<0, Direction::DIR_C2V, 16384, 4, 2>;
// Allocate the pipe with GM slot buffer and consumer-local buffers
MyPipe pipe(/* GM slot buffer */ reinterpret_cast<__gm__ void*>(0x100000),
/* C2V consumer UB buf */ 0x8000,
/* V2C consumer buf */ 0x9000);
void producer_side(AccTile& accTile) {
TPUSH_IMPL(pipe, accTile);
}
// Or with an explicit split mode (the function name is illustrative):
void producer_side_split(AccTile& accTile) {
TPUSH_IMPL<MyPipe, AccTile, TileSplitAxis::TILE_UP_DOWN>(pipe, accTile);
}
TMPipe Usage¶
// Define a V2C multi-pipe: Vec producer → Mat consumer via GM FIFO
using MyMultiPipe = TMPipe<
4, // FlagID
FIFOType::GM_FIFO, // GM FIFO
4, // FiFoDepth
1, // FiFoSyncT (sync every 1)
VecTile, // Producer tile type
MatTile, // Consumer tile type
false, // EN_UNIT_FLAG
2, // LocalFiFoDepth
VecCubeRatio::V2C1_VECS // 2 Vec cores per Cube
>;
MyMultiPipe multiPipe(/* GM FIFO base */ reinterpret_cast<__gm__ float*>(0x200000),
/* local FIFO base */ 0xA000);
void producer_vec(VecTile& vTile) {
TPUSH_IMPL(vTile, multiPipe);
}
Constraints¶
- TileProd::Loc must be TileType::Acc or TileType::Vec.
- DirType must be compatible with TileProd::Loc and TileDataCons::Loc.
- SlotNum × SlotSize must not exceed the available GM region for the FIFO.
- FlagID range: 0–7 per pipe type on A2/A3; 0–15 on A5 with intra-block flags.
- When isAllocate = false, the producer skips the allocation wait; the caller must ensure the slot is free.
- When isRecord = false, the producer skips the ready signal; the caller must ensure the consumer waits externally.
- Pairing: each TPUSH should have a corresponding TPOP on the consumer side; skipping allocation or record without an equivalent external guarantee breaks the protocol.
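Two of the numeric constraints lend themselves to a quick configuration check. The sketch below is illustrative only: `Target`, `flagIdValid` and `fifoFits` are hypothetical helpers, and the region size is whatever the caller has reserved for the FIFO in GM.

```cpp
#include <cstddef>

// Hypothetical target selector for the checks below.
enum class Target { A2A3, A5 };

// FlagID range per the constraints above: 0-7 on A2/A3, 0-15 on A5.
constexpr bool flagIdValid(Target t, unsigned flagId) {
    return t == Target::A5 ? flagId <= 15 : flagId <= 7;
}

// SlotNum x SlotSize must fit in the GM region reserved for the FIFO.
// Written with a division so the check cannot overflow; a zero-slot FIFO is rejected.
constexpr bool fifoFits(std::size_t slotNum, std::size_t slotSize,
                        std::size_t regionBytes) {
    return slotNum != 0 && slotSize <= regionBytes / slotNum;
}
```

A configuration with 4 slots of 16384 bytes needs a region of at least 65536 bytes, so `fifoFits(4, 16384, 65536)` holds while a region one byte smaller does not.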
Target-Profile Restrictions¶
- CPU simulator: Not available. TPUSH and TPOP require the NPU inter-core synchronization infrastructure.
- A2/A3: Supports DIR_C2V, DIR_V2C, DIR_BOTH, DIR_V2C_CTRL. FIFO paths: GM and local UB/MAT. Does not support DIR_*_GM variants.
- A5: Supports all direction types including DIR_C2V_GM, DIR_V2C_GM, DIR_BOTH_GM. FIFO paths: GM, VEC_FIFO, MAT_FIFO, CTRL_FIFO. Intra-block synchronization uses set_intra_block / wait_intra_block instead of cross-core ffts_*.
Common Patterns¶
Pattern 1: Acc → Vec Tile Passing (GEMM Post-Processing)¶
// Producer (Cube): accumulator result → Vector consumer
using AccTile = Tile<TileType::Acc, float, 64, 64>;
using VecTile = Tile<TileType::Vec, float, 64, 64>;
using Acc2VecPipe = TPipe<0, Direction::DIR_C2V, 16384, 4, 2>;
Acc2VecPipe pipe(/* GM buffer */ reinterpret_cast<__gm__ void*>(0x100000), /* C2V UB */ 0x8000, /* V2C buf */ 0x9000);
void cube_kernel(AccTile& acc) {
// Acc contains result of TMATMUL
TPUSH_IMPL(pipe, acc); // Signal Vec that accumulator is ready
}
// Consumer (Vector): receives accumulator tile
void vec_kernel(VecTile& vec) {
TPOP_IMPL(pipe, vec); // Wait for and receive accumulator
// Apply activation, quantization, etc.
}
Pattern 2: Vec → Mat Tile Passing with Row Split¶
// Producer: 128×256 vector tile; Consumer: 256×256 matrix tile
using ProdTile = Tile<TileType::Vec, half, 128, 256>;
using ConsTile = Tile<TileType::Mat, half, 256, 256>;
using Vec2MatPipe = TMPipe<
2, FIFOType::MAT_FIFO, 4, 1,
ProdTile, ConsTile,
false, 2, VecCubeRatio::V2C1_VECS
>;
Vec2MatPipe pipe(/* MAT FIFO base */ 0x300000);
void vec_producer(ProdTile& vec) {
TPUSH_IMPL(vec, pipe); // Each of the two Vec cores inserts its 128×256 block into the 256×256 Mat tile
}
Pattern 3: Sparse Sync (Skipping Allocation Wait)¶
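// After the initial startup phase, slots are guaranteed free at every period boundary,
// so the allocation wait is enabled only once every `period` iterations.
// `iteration` and `period` are caller-maintained counters (illustrative names).
pipe.prod.setAllocateStatus(iteration % period == 0);
TPUSH_IMPL(pipe, tile);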
See Also¶
- TPOP — Consumer half of the FIFO protocol
- TPipe and TMPipe source code reference
- System Scheduling