pto.tpush

pto.tpush is part of the System Scheduling instruction set.

Summary

TPUSH moves a tile from a producer pipeline (Cube or Vector) into a ring FIFO for consumption by a paired pipeline. It is the producer half of the TPipe/TMPipe producer-consumer protocol.

What TPUSH Is Not

TPUSH is not a scalar stack push or a generic FIFO enqueue. It is a structured tile-movement protocol for Cube-Vector tile passing. It is not available on the CPU simulator.

Architecture: The TPipe Abstraction

A TPipe<FlagID, DirType, SlotSize, SlotNum, LocalSlotNum> is a compile-time configured ring FIFO that connects a producer and a consumer. Each TPipe owns:

  • A RingFIFO that manages GM slots and consumer-local buffers
  • A Producer struct that implements the push protocol
  • A Consumer struct that implements the pop protocol (for paired use)

The producer and consumer are independent halves of the same logical FIFO: they do not share a struct instance, but they do share the same GM slot buffer address and flag ID namespace.
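
The ring behaviour can be pictured with a small host-side model. This is a sketch, not the PTO RingFIFO: the names RingModel, slot_offset, and next_offset are hypothetical, and the plain modulo wrap-around is an assumption. Only the SlotSize and SlotNum parameters come from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side model of ring-slot addressing; not the PTO RingFIFO.
template <std::size_t SlotSize, std::size_t SlotNum>
struct RingModel {
    static_assert(SlotNum > 0, "a ring needs at least one slot");

    std::uint64_t pushed = 0;  // total number of tiles pushed so far

    // Byte offset, relative to the GM slot buffer base, of the slot that
    // push number `n` would use (assumed: simple modulo wrap-around).
    static constexpr std::size_t slot_offset(std::uint64_t n) {
        return static_cast<std::size_t>(n % SlotNum) * SlotSize;
    }

    // Offset for the next push; advances the ring position afterwards.
    std::size_t next_offset() { return slot_offset(pushed++); }
};
```

With SlotSize = 16384 and SlotNum = 4, successive pushes in this model land at offsets 0, 16384, 32768, 49152, and then wrap back to 0.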

Direction Types

The DirType template parameter selects the communication direction:

DirType Constant         | Meaning                          | Producer Side | Consumer Side
-------------------------+----------------------------------+---------------+---------------
DIR_C2V (1)              | Cube → Vector                    | TileType::Acc | TileType::Vec
DIR_V2C (2)              | Vector → Cube                    | TileType::Vec | TileType::Mat
DIR_BOTH (3)             | Cube ↔ Vector (bidirectional)    | Acc + Vec     | Vec + Mat
DIR_V2C_CTRL (4)         | Vector → Cube (control signal)   | TileType::Vec | scalar control
DIR_C2V_GM (5, A5 only)  | Cube → Vector via GM             | TileType::Acc | TileType::Vec
DIR_V2C_GM (6, A5 only)  | Vector → Cube via GM             | TileType::Vec | TileType::Mat
DIR_BOTH_GM (7, A5 only) | Bidirectional via GM             | Acc + Vec     | Vec + Mat
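
The direction table can be restated as a pair of compile-time predicates. The sketch below is a stand-alone mirror of the table, not the PTO headers: the Dir and Loc enums and the functions producer_ok and a5_only are hypothetical names introduced for illustration, and only the numeric values and tile locations are taken from the table.

```cpp
#include <cassert>

// Hypothetical mirror of the direction table; numeric values follow the table.
enum class Dir { C2V = 1, V2C = 2, Both = 3, V2C_Ctrl = 4,
                 C2V_GM = 5, V2C_GM = 6, Both_GM = 7 };
enum class Loc { Acc, Vec, Mat };

// True if a producer tile located at `loc` matches direction `d`.
constexpr bool producer_ok(Dir d, Loc loc) {
    switch (d) {
        case Dir::C2V:
        case Dir::C2V_GM:   return loc == Loc::Acc;
        case Dir::V2C:
        case Dir::V2C_GM:
        case Dir::V2C_Ctrl: return loc == Loc::Vec;
        case Dir::Both:
        case Dir::Both_GM:  return loc == Loc::Acc || loc == Loc::Vec;
    }
    return false;
}

// True for the directions the table marks as "A5 only".
constexpr bool a5_only(Dir d) {
    return d == Dir::C2V_GM || d == Dir::V2C_GM || d == Dir::Both_GM;
}
```

Because both functions are constexpr, a check of this kind could sit in a static_assert next to a pipe definition so that a mismatched producer tile is rejected at compile time.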

FIFO Storage Paths

Depending on the direction and backend, data flows through different storage paths:

Path          | Description                                                                  | Used By
--------------+------------------------------------------------------------------------------+----------------------------
GM FIFO       | Tile data written to global memory slots; consumer loads via TLOAD           | *_GM directions
Local UB FIFO | Tile data written to consumer's local UB buffer directly; no GM traffic      | DIR_C2V, DIR_BOTH on A2/A3
MAT FIFO      | Vector writes tile into Cube's local MAT buffer via TINSERT                  | DIR_V2C, DIR_BOTH on A2/A3
CTRL FIFO     | Scalar control signal (32-bit) written to shared control buffer              | DIR_V2C_CTRL

On A5, each path is an explicitly typed FIFO: VEC_FIFO, MAT_FIFO, GM_FIFO, or CTRL_FIFO.

Three-Phase Protocol

Every TPUSH call executes three phases in order:

1. ALLOCATE  ──►  2. PUSH  ──►  3. RECORD

Phase 1: Allocate (wait for free slot)

Producer ──wait_flag_dev(FlagID+1)──► Consumer

The producer waits until the consumer has freed a slot. This prevents overwriting data the consumer has not yet consumed.

  • On A2/A3 C2V: wait_flag_dev(FlagID + 1) via PIPE_MTE2
  • On A2/A3 V2C: wait_flag_dev(FlagID + 1) via PIPE_MTE2
  • On A5 C2V: wait_intra_block(PIPE_FIX, FlagID + 1) (and +16 for subblock 1)
  • On A5 V2C: wait_intra_block(PIPE_MTE3, FlagID + 1)
  • On A5 V2C_CTRL: wait_intra_block(PIPE_S, FlagID + 1)

This phase is skipped when isAllocate = false, which the caller sets via pipe.prod.setAllocateStatus(false).

Phase 2: Push (write data to FIFO)

The actual data transfer depends on the FIFO type and direction:

C2V (Acc → Vec):

  • A2/A3: TSTORE_IMPL writes the accumulator tile to the GM slot buffer.
  • A5: Either a direct TMOV into the consumer's local UB buffer (C2V_CONSUMER_BUF) or a transfer via GM, depending on is_c2v_ub vs is_c2v_gm.

V2C (Vec → Mat):

  • A2/A3: TSTORE_IMPL writes the vector tile to the GM slot buffer; sub-tile offsets are computed via get_subblockid() when splitting.
  • A5: Either TINSERT into the consumer's local MAT buffer or a transfer via GM.

V2C_CTRL: Writes a single scalar control signal (32-bit) from the vector tile's first element to the control slot buffer.

Phase 3: Record (signal data-ready to consumer)

Producer ──set_flag/ffts_cross_core_sync(FlagID)──► Consumer

The producer signals that data is ready. The consumer can now safely wait for and consume it.

  • On A2/A3 C2V: ffts_cross_core_sync(PIPE_FIX, ...)
  • On A2/A3 V2C: ffts_cross_core_sync(PIPE_MTE3, ...)
  • On A5 C2V: set_intra_block(PIPE_FIX, FlagID) (+16 for subblock 1)
  • On A5 V2C: set_intra_block(PIPE_MTE3, FlagID)
  • On A5 V2C_CTRL: set_intra_block(PIPE_S, FlagID)

This phase is skipped when isRecord = false.
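
The three phases can be traced on the host with a single-threaded model. This is only a sketch of the handshake logic under stated assumptions: HandshakeModel, try_push, and try_pop are hypothetical names, two counters stand in for the hardware flags (FlagID + 1 for free slots, FlagID for ready slots), and where real hardware would block, the model returns false instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-threaded model of the allocate/push/record handshake.
struct HandshakeModel {
    std::size_t free_slots;       // slots released by the consumer ("FlagID + 1")
    std::size_t ready_slots = 0;  // slots filled by the producer   ("FlagID")
    std::vector<int> fifo;        // slot payloads; one int stands in for a tile
    std::size_t head = 0;         // next slot the consumer reads
    std::size_t tail = 0;         // next slot the producer writes

    explicit HandshakeModel(std::size_t slot_num)
        : free_slots(slot_num), fifo(slot_num, 0) {}

    // Producer half. Returns false where hardware would wait in phase 1.
    bool try_push(int tile) {
        if (free_slots == 0) return false;   // 1. ALLOCATE: no free slot yet
        --free_slots;
        fifo[tail] = tile;                   // 2. PUSH: write the payload
        tail = (tail + 1) % fifo.size();
        ++ready_slots;                       // 3. RECORD: signal data-ready
        return true;
    }

    // Consumer half. Takes one ready tile and releases its slot.
    bool try_pop(int& tile) {
        if (ready_slots == 0) return false;  // nothing has been recorded yet
        --ready_slots;
        tile = fifo[head];
        head = (head + 1) % fifo.size();
        ++free_slots;                        // slot is free for the producer again
        return true;
    }
};
```

In this model a two-slot FIFO accepts two pushes, refuses a third until one pop has released a slot, and hands tiles back in push order.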

Tile Split Modes

When producer and consumer operate on different tile shapes (e.g., producer is 128×256, consumer is 64×256), TileSplitAxis controls how the tile is decomposed:

Split Mode      | Meaning                                                     | Offset Computation
----------------+-------------------------------------------------------------+---------------------------------------------------
TILE_NO_SPLIT   | Single writer, no decomposition                             | offset = 0
TILE_UP_DOWN    | Split along rows; each subblock writes a row block          | offset = subblock_id × ProdM × ProdN × sizeof(T)
TILE_LEFT_RIGHT | Split along columns; each subblock writes a column block    | offset = subblock_id × ProdN × sizeof(T)
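
The offset formulas can be checked with a short stand-alone function. The sketch below mirrors the table only: the SplitAxis enum and split_offset function are hypothetical names, not the PTO definitions, and the element type T is whatever the caller supplies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the split modes in the table; not the PTO enum itself.
enum class SplitAxis { NoSplit, UpDown, LeftRight };

// Byte offset at which subblock `subblock_id` writes, following the table:
//   NoSplit   -> 0
//   UpDown    -> subblock_id * ProdM * ProdN * sizeof(T)
//   LeftRight -> subblock_id * ProdN * sizeof(T)
template <typename T, std::size_t ProdM, std::size_t ProdN>
constexpr std::size_t split_offset(SplitAxis axis, std::size_t subblock_id) {
    switch (axis) {
        case SplitAxis::UpDown:    return subblock_id * ProdM * ProdN * sizeof(T);
        case SplitAxis::LeftRight: return subblock_id * ProdN * sizeof(T);
        case SplitAxis::NoSplit:   break;
    }
    return 0;
}
```

For a 128×256 producer tile with a 2-byte element type, subblock 1 writes at byte 65536 under the row split (1 × 128 × 256 × 2) and at byte 512 under the column split (1 × 256 × 2).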

TMPipe: Multi-Pipe Variant

TMPipe<FlagID, FiFoType, FiFoDepth, FiFoSyncT, TileDataProd, TileDataCons, EN_UNIT_FLAG, LocalFiFoDepth, VCRatio> is the multi-pipe version. Key differences:

  • FiFoType: Selects FIFO implementation (GM_FIFO, VEC_FIFO, MAT_FIFO, CTRL_FIFO)
  • FiFoDepth: Configurable FIFO depth (2–8 on A2/A3; up to 16 on A5)
  • LocalFiFoDepth: Local UB buffer depth for GM-path consumers
  • VCRatio: V2C1_VECS (1 Cube, 2 Vec cores) or V2C2_VECS (1 Cube, 1 Vec core)
  • EN_UNIT_FLAG: Enables per-slot synchronization for fine-grained flow control

Syntax

IR Level 1 (SSA)

%event = pto.tpush %tile, %pipe : (!pto.tile<f32, 64, 64>, !pto.tpipe<...>) -> !pto.record_event

IR Level 2 (DPS)

pto.tpush ins(%tile : !pto.tile_buf<f32, 64, 64>) pipe(%pipe : !pto.tpipe<...>)

C++ Intrinsic

#include <pto/common/fifo.hpp>

using namespace pto;

// Define a C2V pipe: Acc producer → Vec consumer
// FlagID=0, DirType=DIR_C2V, SlotSize=16384 bytes, SlotNum=4, LocalSlotNum=2
using MyPipe = TPipe<0, Direction::DIR_C2V, 16384, 4, 2>;

// Allocate the pipe with GM slot buffer and consumer-local buffers
MyPipe pipe(/* GM slot buffer */ reinterpret_cast<__gm__ void*>(0x100000),
            /* C2V consumer UB buf */ 0x8000,
            /* V2C consumer buf */   0x9000);

void producer_side(AccTile& accTile) {
    TPUSH_IMPL(pipe, accTile);
}

// Or, inside the producer, with an explicit split mode:
TPUSH_IMPL<MyPipe, AccTile, TileSplitAxis::TILE_UP_DOWN>(pipe, accTile);

TMPipe Usage

// Define a V2C multi-pipe: Vec producer → Mat consumer via GM FIFO
using MyMultiPipe = TMPipe<
    4,                    // FlagID
    FIFOType::GM_FIFO,    // GM FIFO
    4,                    // FiFoDepth
    1,                    // FiFoSyncT (sync every 1)
    VecTile,              // Producer tile type
    MatTile,              // Consumer tile type
    false,                // EN_UNIT_FLAG
    2,                    // LocalFiFoDepth
    VecCubeRatio::V2C1_VECS  // 2 Vec cores per Cube
>;

MyMultiPipe multiPipe(/* GM FIFO base */ reinterpret_cast<__gm__ float*>(0x200000),
                     /* local FIFO base */ 0xA000);

void producer_vec(VecTile& vTile) {
    TPUSH_IMPL(vTile, multiPipe);
}

Constraints

  • TileProd::Loc must be TileType::Acc or TileType::Vec.
  • DirType must be compatible with TileProd::Loc and TileDataCons::Loc.
  • SlotNum × SlotSize must not exceed the available GM region for the FIFO.
  • FlagID range: 0–7 per pipe type on A2/A3; 0–15 on A5 with intra-block flags.
  • When isAllocate = false, the producer skips the allocation wait; the caller must ensure the slot is free.
  • When isRecord = false, the producer skips the ready signal; the caller must ensure the consumer waits externally.
  • Pairing: each TPUSH should have a corresponding TPOP on the consumer side; skipping allocation or record without an equivalent external guarantee breaks the protocol.
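
Two of these constraints are simple enough to check ahead of time. The following is a minimal sketch, assuming the FlagID ranges listed above; the Target enum and the functions flag_id_ok and fifo_fits are hypothetical helpers, not part of the PTO API, and the size of the available GM region is something the caller must supply.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical target tag; the FlagID ranges are taken from the list above.
enum class Target { A2A3, A5 };

// FlagID must lie in 0..7 on A2/A3 and in 0..15 on A5.
constexpr bool flag_id_ok(Target t, int flag_id) {
    return flag_id >= 0 && flag_id <= (t == Target::A5 ? 15 : 7);
}

// True if slot_num slots of slot_size bytes fit in a GM region of gm_bytes.
constexpr bool fifo_fits(std::size_t slot_size, std::size_t slot_num,
                         std::size_t gm_bytes) {
    // Divide instead of multiplying so the check itself cannot overflow.
    return slot_num == 0 || slot_size <= gm_bytes / slot_num;
}
```

Four slots of 16384 bytes need exactly 65536 bytes, so in this sketch fifo_fits(16384, 4, 65536) holds while fifo_fits(16384, 4, 65535) does not.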

Target-Profile Restrictions

  • CPU simulator: Not available. TPUSH and TPOP require the NPU inter-core synchronization infrastructure.
  • A2/A3: Supports DIR_C2V, DIR_V2C, DIR_BOTH, DIR_V2C_CTRL. FIFO paths: GM and local UB/MAT. Does not support DIR_*_GM variants.
  • A5: Supports all direction types including DIR_C2V_GM, DIR_V2C_GM, DIR_BOTH_GM. FIFO paths: GM, VEC_FIFO, MAT_FIFO, CTRL_FIFO. Intra-block synchronization uses set_intra_block/wait_intra_block instead of cross-core ffts_*.

Common Patterns

Pattern 1: Acc → Vec Tile Passing (GEMM Post-Processing)

// Producer (Cube): accumulator result → Vector consumer
using AccTile = Tile<TileType::Acc, float, 64, 64>;
using VecTile = Tile<TileType::Vec, float, 64, 64>;
using Acc2VecPipe = TPipe<0, Direction::DIR_C2V, 16384, 4, 2>;

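// The GM buffer address is cast to a __gm__ pointer, as in the C++ Intrinsic example.
Acc2VecPipe pipe(/* GM buffer */ reinterpret_cast<__gm__ void*>(0x100000),
                 /* C2V UB */    0x8000,
                 /* V2C buf */   0x9000);

void cube_kernel(AccTile& acc) {
    // acc contains the result of TMATMUL
    TPUSH_IMPL(pipe, acc);  // Push acc to the Vec consumer and signal data-ready
}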

// Consumer (Vector): receives accumulator tile
void vec_kernel(VecTile& vec) {
    TPOP_IMPL(pipe, vec);   // Wait for and receive accumulator
    // Apply activation, quantization, etc.
}

Pattern 2: Vec → Mat Tile Passing with Row Split

// Producer: 128×256 vector tile; Consumer: 256×256 matrix tile
using ProdTile = Tile<TileType::Vec, half, 128, 256>;
using ConsTile = Tile<TileType::Mat, half, 256, 256>;
using Vec2MatPipe = TMPipe<
    2, FIFOType::MAT_FIFO, 4, 1,
    ProdTile, ConsTile,
    false, 2, VecCubeRatio::V2C1_VECS
>;

Vec2MatPipe pipe(/* MAT FIFO base */ 0x300000);

void vec_producer(ProdTile& vec) {
    TPUSH_IMPL(vec, pipe);  // Each Vec core inserts its 128×256 block into the 256×256 mat tile
}

Pattern 3: Sparse Sync (Skipping Allocation Wait)

// After the initial startup phase, slots are guaranteed free at every period boundary.
// Only allocate every N iterations.
// `iteration` and `period` are counters maintained by the caller.
pipe.prod.setAllocateStatus(iteration % period == 0);
TPUSH_IMPL(pipe, tile);
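
To see how much synchronization the sparse pattern saves, the allocation waits can be counted in a host-side loop. This is a sketch only: allocation_waits is a hypothetical helper, and the flag it computes merely stands in for the value that would be passed to setAllocateStatus before each push.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: counts how many of `iterations` pushes would perform
// the allocation wait if it is enabled only when iteration % period == 0.
constexpr std::size_t allocation_waits(std::size_t iterations, std::size_t period) {
    std::size_t waits = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        // Value that would be handed to setAllocateStatus for this push.
        const bool is_allocate = (period != 0) && (i % period == 0);
        if (is_allocate) ++waits;
    }
    return waits;
}
```

With a period of 4, eight pushes perform the wait twice (iterations 0 and 4) instead of eight times.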

See Also