Enhanced TPUSH/TPOP ISA Design for Intra-Cluster Function Group Data Communication

Overview

This document specifies an enhanced ISA design for TPUSH and TPOP instructions to support intra-cluster data communication across InCore kernels within a function group.

Cluster Architecture

Each cluster contains 1 Cube core and 2 buddy Vector cores that share a hardware flag-based synchronization mechanism:

┌─────────────────────── Cluster ───────────────────────┐
│                                                       │
│  ┌──────────┐    flags (8 per dir)    ┌──────────┐   │
│  │  Vector 0 │◄══════════════════════►│          │   │
│  └──────────┘   SET/WAIT V→C, C→V    │   Cube   │   │
│                                       │          │   │
│  ┌──────────┐    flags (8 per dir)    │          │   │
│  │  Vector 1 │◄══════════════════════►│          │   │
│  └──────────┘   SET/WAIT V→C, C→V    └──────────┘   │
│                                                       │
└───────────────────────────────────────────────────────┘
  • A Vector core can SET a flag that the Cube core WAITs on, and vice versa.
  • There are 8 flags per direction per peer (Vector→Cube: 8 flags, Cube→Vector: 8 flags), for a total of 16 flags per Vector-Cube pair.
  • With 2 buddy Vector cores, the cluster has 32 cross-core flags in total (2 peers × 2 directions × 8 flags).

Ring Buffer Data Channel

Data is moved between producer and consumer kernels through a multi-slot ring buffer with flow control. Each slot holds one fixed-size Tile. The ring buffer location depends on the platform:

| Platform | Ring Buffer Location | Description |
|----------|----------------------|-------------|
| A3 | Global Memory (GM) | Ring buffer resides in off-chip DDR/HBM, accessible by all cores in the cluster |
| A5 | Consumer's on-chip SRAM | Ring buffer resides in the consumer core's local memory: Unified Buffer (UB) if the consumer is a Vector core, or L1 Buffer if the consumer is a Cube core |

A3 Platform:                              A5 Platform:

Producer          GM            Consumer  Producer                    Consumer
┌──────┐   ┌────────────┐   ┌──────┐    ┌──────┐                 ┌──────────────┐
│      │──▶│ slot[0..N-1]│──▶│      │    │      │───────────────▶│ UB / L1      │
│ Cube │   │ (off-chip)  │   │ Vec  │    │ Cube │  DMA to local  │ slot[0..N-1] │
│ /Vec │   └────────────┘   │ /Cube│    │ /Vec │                 │ (on-chip)    │
└──────┘                     └──────┘    └──────┘                 └──────────────┘

The A5 placement in consumer-local SRAM eliminates the round-trip to GM, enabling lower-latency data handoff. The consumer can directly operate on tile data in its local buffer without an explicit TLOAD from GM.

The enhanced design extends TPUSH/TPOP to serve as the primary data communication mechanism between InCore kernels co-scheduled on these cores within the same cluster, enabling:

  • Producer kernel → Ring Buffer → Consumer kernel tile-level data flow between Cube and buddy Vector cores
  • Cross-core synchronization via the hardware flag mechanism (SET/WAIT, 8 flags per direction)
  • Multi-slot ring buffer for pipelined execution (SLOT_NUM = 8 for unidirectional, 4 for bidirectional)
  • Platform-adaptive buffer placement: GM (A3) or consumer-local SRAM (A5)

Motivation: Intra-Cluster Function Group Communication

When the ExpandMixedKernel pass decomposes a mixed InCore function into multiple co-scheduled kernels (e.g., a data-movement kernel on Vector cores and a compute kernel on Cube cores), these kernels need an efficient, synchronized data communication channel within the same cluster.

TPUSH/TPOP with ring buffer flow control provides exactly this capability — the enhanced design formalizes how the compiler should emit TPUSH/TPOP pairs to connect the expanded kernel group.

Enhanced Design: Tag-Based Dual-Channel FIFO Protocol

The enhanced TPUSH/TPOP design uses a multi-slot ring buffer with tag-based dual-channel flow control for moving fixed-size Tile data between producer and consumer kernels.

Producer / Consumer Roles

The terms producer and consumer are conceptual roles, not bound to a specific core type:

  • A Cube core can be a producer (e.g., matmul output → Vector for post-processing), or a consumer (e.g., receiving preprocessed data from Vector).
  • A Vector core can be a producer (e.g., data loading / preprocessing → Cube), or a consumer (e.g., receiving matmul results from Cube).
  • In some applications, both cores are simultaneously producer and consumer in opposite directions, forming a bidirectional data flow.

Ring Buffer Structure

Each ring buffer is a unidirectional channel from one producer to one consumer. The number of slots is a compile-time constant parameter SLOT_NUM, specified during kernel initialization:

| Communication Pattern | SLOT_NUM | Flags Used | Description |
|-----------------------|----------|------------|-------------|
| Unidirectional (one direction only) | 8 | 8 flags for P2C + C2P | All 8 flags per direction dedicated to a single ring buffer |
| Bidirectional (both directions simultaneously) | 4 per direction | 4 flags for each of the 2 ring buffers | The 8 available flags are split equally between the two directions |

Unidirectional (SLOT_NUM=8):

    Cube (producer)  ──────▶  Vector (consumer)
    Ring Buffer: slot[0..7], using flags 0..7

Bidirectional (SLOT_NUM=4 per direction):

    Cube  ──── Ring Buffer A (slot[0..3], flags 0..3) ────▶  Vector
    Cube  ◀──── Ring Buffer B (slot[0..3], flags 4..7) ────  Vector
Ring Buffer  —  SLOT_NUM fixed-size Tile slots, indexed by tag

    ┌──────────────────────────────────────────────────────┐
    │  slot[0]    slot[1]    ...    slot[SLOT_NUM-1]       │   A3: Global Memory
    └──────────────────────────────────────────────────────┘   A5: Consumer's UB or L1

Signal Channels (mapped to hardware cross-core flags):
    P2C  —  Producer → Consumer  (data ready signal, indexed by tag)
    C2P  —  Consumer → Producer  (space free signal, indexed by tag)

Each ring buffer slot holds exactly one Tile and is identified by a tag (0 .. SLOT_NUM-1). The two signal channels P2C and C2P carry per-tag notifications:

  • SET P2C: tag. The producer signals "data in slot[tag] is ready".
  • SET C2P: tag. The consumer signals "slot[tag] is free for reuse".
  • WAIT P2C: tag. The consumer blocks until slot[tag] is ready.
  • WAIT C2P: tag. The producer blocks until slot[tag] is free.
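
The per-tag handshake can be modelled in software. The following is a minimal single-threaded sketch, not the hardware interface: FlagChannel, set, try_wait, and handshake_once are hypothetical names, and the model assumes that a SET increments a per-tag counter which a successful WAIT consumes. The text does not specify whether the real flags are binary or counting.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical software model of one signal channel (P2C or C2P).
// Assumption: SET increments a per-tag counter and a successful WAIT
// consumes one count. Real hardware flag semantics may differ.
struct FlagChannel {
    static constexpr std::uint8_t kMaxFlags = 8;  // 8 flags per direction
    std::array<std::uint32_t, kMaxFlags> pending{};

    void set(std::uint8_t tag) {
        assert(tag < kMaxFlags);
        ++pending[tag];
    }

    // Non-blocking stand-in for WAIT: returns false where hardware would block.
    bool try_wait(std::uint8_t tag) {
        assert(tag < kMaxFlags);
        if (pending[tag] == 0) return false;
        --pending[tag];
        return true;
    }
};

// One full handshake for a single slot, in protocol order.
inline bool handshake_once(FlagChannel& p2c, FlagChannel& c2p, std::uint8_t tag) {
    c2p.set(tag);                          // consumer: slot[tag] is free
    if (!c2p.try_wait(tag)) return false;  // producer: WAIT C2P before writing
    p2c.set(tag);                          // producer: data in slot[tag] is ready
    return p2c.try_wait(tag);              // consumer: WAIT P2C before reading
}
```

After one handshake both channels are back to zero pending signals for that tag, which is why the consumer side has to signal "free" again before the producer can reuse the slot.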

API Definition

Platform Constant

enum PlatformID : uint8_t {
    PLATFORM_A2A3 = 0,   // A2/A3 platform: ring buffer in Global Memory
    PLATFORM_A5   = 1,   // A5 platform: ring buffer in consumer's on-chip SRAM
};

PLATFORM_ID is a compile-time constant generated by the compiler and embedded into the kernel binary. It is used by the initialization APIs and the TPUSH / TPOP instruction families to select the appropriate backing-memory behavior:

| PLATFORM_ID | Ring Buffer Location | TPUSH Behavior | TPOP Behavior |
|-------------|----------------------|----------------|---------------|
| PLATFORM_A2A3 | GM (orchestration-allocated GM_SLOT_BUFFER) | DMA tile → GM slot | DMA GM slot → local tile |
| PLATFORM_A5 | Consumer's on-chip SRAM (UB or L1) | Materialize tile directly into the consumer-local slot selected by the typed FIFO | Zero-copy or consumer-local slot access selected by the typed FIFO |

Direction Constants

enum Direction : uint8_t {
    DIR_C2V = 0b01,   // Cube → Vector: Cube is producer, Vector is consumer
    DIR_V2C = 0b10,   // Vector → Cube: Vector is producer, Cube is consumer
};

A kernel uses DIR_C2V or DIR_V2C to specify the data flow direction. For bidirectional communication, both directions are active simultaneously (DIR_C2V | DIR_V2C).

DIR_MASK

A bitmask indicating which directions are active for this kernel:

| DIR_MASK | Value | Meaning | SLOT_NUM per direction |
|----------|-------|---------|------------------------|
| DIR_C2V | 0b01 | Unidirectional: Cube → Vector only | 8 |
| DIR_V2C | 0b10 | Unidirectional: Vector → Cube only | 8 |
| DIR_C2V \| DIR_V2C | 0b11 | Bidirectional: both directions | 4 |
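
As a small illustration of the table, the mask-to-slot-count rule can be written as a pure function. This is a sketch only: the bit values are the ones listed in the DIR_MASK table, while slot_num_for, is_valid_dir_mask, and checked_slot_num are hypothetical helper names that do not appear in the design.

```cpp
#include <cassert>
#include <cstdint>

// Bit values as listed in the DIR_MASK table.
constexpr std::uint8_t DIR_C2V = 0b01;  // Cube -> Vector
constexpr std::uint8_t DIR_V2C = 0b10;  // Vector -> Cube

// Hypothetical helper: slots per direction for a given mask.
constexpr std::uint8_t slot_num_for(std::uint8_t dir_mask) {
    return static_cast<std::uint8_t>(dir_mask == (DIR_C2V | DIR_V2C) ? 4 : 8);
}

// Hypothetical helper: at least one known direction and no unknown bits.
constexpr bool is_valid_dir_mask(std::uint8_t dir_mask) {
    return dir_mask != 0 && (dir_mask & ~(DIR_C2V | DIR_V2C)) == 0;
}

// Variant that rejects malformed masks in debug builds.
inline std::uint8_t checked_slot_num(std::uint8_t dir_mask) {
    assert(is_valid_dir_mask(dir_mask));
    return slot_num_for(dir_mask);
}

static_assert(slot_num_for(DIR_C2V) == 8, "unidirectional uses 8 slots");
static_assert(slot_num_for(DIR_C2V | DIR_V2C) == 4, "bidirectional uses 4 slots");
```

Testing a direction with `dir_mask & DIR_C2V` only works when each direction occupies its own non-zero bit, which is why the sketch uses the table's bit values.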

GM_SLOT_BUFFER and CONSUMER_BUFFER_BASE / CONSUMER_BUFFER_SIZE

The ring buffer backing memory differs between A2A3 and A5:

| Platform | Ring Buffer Source | Mechanism |
|----------|--------------------|-----------|
| A2A3 | GM_SLOT_BUFFER: orchestration-allocated GM buffer, passed as INOUT argument | Same as before |
| A5 | CONSUMER_BUFFER_BASE / CONSUMER_BUFFER_SIZE: compiler-generated constant symbols per InCore function | See "Consumer SRAM Address Problem" section below |

A2A3: GM_SLOT_BUFFER is allocated in GM by the orchestration and passed to both InCore functions as INOUT.

A5: The ring buffer lives in the consumer's local SRAM. Its location is specified by CONSUMER_BUFFER_BASE and CONSUMER_BUFFER_SIZE, which are constant symbols attached to each InCore function (see the detailed design in the "Cross-Core Address Problem on A5" section below). The resolved CONSUMER_BUFFER_BASE values are passed as explicit arguments (C2V_CONSUMER_BUF, V2C_CONSUMER_BUF) to the initialization functions, avoiding special compiler requirements for implicit constant lookups.

Orchestration function (A2A3):
    gm_slot_buf = gm_alloc(2 * SLOT_NUM * SLOT_SIZE)    // bidirectional

    for ...:
        cube_kernel(  ..., GM_SLOT_BUFFER=gm_slot_buf, ...)   // INOUT
        vector_kernel(..., GM_SLOT_BUFFER=gm_slot_buf, ...)   // INOUT

Orchestration function (A5):
    // CONSUMER_BUFFER_BASE values are resolved by compiler and passed explicitly

    for ...:
        cube_kernel(  ..., GM_SLOT_BUFFER=nullptr, ...)
        vector_kernel(..., GM_SLOT_BUFFER=nullptr, ...)

aic_initialize_pipe(DIR_MASK, SLOT_SIZE, GM_SLOT_BUFFER, C2V_CONSUMER_BUF, V2C_CONSUMER_BUF)

Called on the Cube (AIC) core at kernel startup. Initializes the ring buffer pipe(s) for the specified direction(s).

| Parameter | Type | Description |
|-----------|------|-------------|
| DIR_MASK | uint8_t | Bitmask of active directions (DIR_C2V, DIR_V2C, or both) |
| SLOT_SIZE | uint32_t | Size of each ring buffer slot in bytes (= Tile size) |
| GM_SLOT_BUFFER | __gm__ void* | GM buffer allocated by orchestration (INOUT). Active on A2A3; nullptr on A5 |
| C2V_CONSUMER_BUF | uint32_t | Consumer's SRAM base address for C2V direction (Vector's UB). 0 on A2A3; explicit on A5 |
| V2C_CONSUMER_BUF | uint32_t | Consumer's SRAM base address for V2C direction (Cube's own L1). 0 on A2A3; explicit on A5 |

Description: Binds the ring buffer pipe(s) to the appropriate backing memory based on PLATFORM_ID, computes SLOT_NUM from DIR_MASK (8 if unidirectional, 4 if bidirectional), and initializes internal state. On A5, the ring buffer base addresses are passed as explicit arguments (C2V_CONSUMER_BUF, V2C_CONSUMER_BUF) — no implicit constant symbol lookup is required. For each direction where the Cube is the consumer (DIR_V2C), it signals all slots as free to the Vector producer.

  • On PLATFORM_A2A3: uses GM_SLOT_BUFFER in GM for all directions. C2V_CONSUMER_BUF and V2C_CONSUMER_BUF are ignored.
  • On PLATFORM_A5:
      • C2V (Cube is producer): uses C2V_CONSUMER_BUF, the Vector's UB address, passed explicitly.
      • V2C (Cube is consumer): uses V2C_CONSUMER_BUF, the Cube's own L1 address, passed explicitly.

Pseudocode:

function aic_initialize_pipe(DIR_MASK, SLOT_SIZE, GM_SLOT_BUFFER, C2V_CONSUMER_BUF, V2C_CONSUMER_BUF):
    if DIR_MASK == (DIR_C2V | DIR_V2C):
        SLOT_NUM = 4
    else:
        SLOT_NUM = 8

    if DIR_MASK & DIR_C2V:
        // Cube is PRODUCER in C2V direction
        if PLATFORM_ID == PLATFORM_A2A3:
            c2v_ring_buf = GM_SLOT_BUFFER                         // GM buffer
        else:  // PLATFORM_A5
            c2v_ring_buf = C2V_CONSUMER_BUF                       // Vector's UB (explicit argument)
        c2v_target_tag = 0

    if DIR_MASK & DIR_V2C:
        // Cube is CONSUMER in V2C direction
        if PLATFORM_ID == PLATFORM_A2A3:
            buf_offset = (DIR_MASK & DIR_C2V) ? SLOT_NUM * SLOT_SIZE : 0
            v2c_ring_buf = GM_SLOT_BUFFER + buf_offset            // GM buffer
        else:  // PLATFORM_A5
            v2c_ring_buf = V2C_CONSUMER_BUF                       // Cube's own L1 (explicit argument)
        v2c_target_tag = 0
        // Signal all slots as free to Vector producer
        for (i = 0; i < SLOT_NUM; i++):
            SET flag_V2C_free: i          // "slot[i] is free, Vector may write"

aiv_initialize_pipe(DIR_MASK, SLOT_SIZE, GM_SLOT_BUFFER, C2V_CONSUMER_BUF, V2C_CONSUMER_BUF)

Called on a Vector (AIV) core at kernel startup. Initializes the ring buffer pipe(s) for the specified direction(s).

| Parameter | Type | Description |
|-----------|------|-------------|
| DIR_MASK | uint8_t | Bitmask of active directions (DIR_C2V, DIR_V2C, or both) |
| SLOT_SIZE | uint32_t | Size of each ring buffer slot in bytes (= Tile size) |
| GM_SLOT_BUFFER | __gm__ void* | GM buffer allocated by orchestration (INOUT). Active on A2A3; nullptr on A5 |
| C2V_CONSUMER_BUF | uint32_t | Consumer's SRAM base address for C2V direction (Vector's own UB). 0 on A2A3; explicit on A5 |
| V2C_CONSUMER_BUF | uint32_t | Consumer's SRAM base address for V2C direction (Cube's L1). 0 on A2A3; explicit on A5 |

Description: Binds the ring buffer pipe(s) to the appropriate backing memory based on PLATFORM_ID, computes SLOT_NUM, and initializes internal state. On A5, the ring buffer base addresses are passed as explicit arguments (C2V_CONSUMER_BUF, V2C_CONSUMER_BUF) — no implicit constant symbol lookup is required. For each direction where the Vector is the consumer (DIR_C2V), it signals all slots as free to the Cube producer.

  • On PLATFORM_A2A3: uses GM_SLOT_BUFFER in GM for all directions. C2V_CONSUMER_BUF and V2C_CONSUMER_BUF are ignored.
  • On PLATFORM_A5:
      • C2V (Vector is consumer): uses C2V_CONSUMER_BUF, the Vector's own UB address, passed explicitly.
      • V2C (Vector is producer): uses V2C_CONSUMER_BUF, the Cube's L1 address, passed explicitly.

Pseudocode:

function aiv_initialize_pipe(DIR_MASK, SLOT_SIZE, GM_SLOT_BUFFER, C2V_CONSUMER_BUF, V2C_CONSUMER_BUF):
    if DIR_MASK == (DIR_C2V | DIR_V2C):
        SLOT_NUM = 4
    else:
        SLOT_NUM = 8

    if DIR_MASK & DIR_C2V:
        // Vector is CONSUMER in C2V direction
        if PLATFORM_ID == PLATFORM_A2A3:
            c2v_ring_buf = GM_SLOT_BUFFER                         // GM buffer
        else:  // PLATFORM_A5
            c2v_ring_buf = C2V_CONSUMER_BUF                       // Vector's own UB (explicit argument)
        c2v_target_tag = 0
        // Signal all slots as free to Cube producer
        for (i = 0; i < SLOT_NUM; i++):
            SET flag_C2V_free: i          // "slot[i] is free, Cube may write"

    if DIR_MASK & DIR_V2C:
        // Vector is PRODUCER in V2C direction
        if PLATFORM_ID == PLATFORM_A2A3:
            buf_offset = (DIR_MASK & DIR_C2V) ? SLOT_NUM * SLOT_SIZE : 0
            v2c_ring_buf = GM_SLOT_BUFFER + buf_offset            // GM buffer
        else:  // PLATFORM_A5
            v2c_ring_buf = V2C_CONSUMER_BUF                       // Cube's L1 (explicit argument)
        v2c_target_tag = 0
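
The two initialization routines are mirror images: both bind the same ring buffer addresses, and only the consumer of each direction pre-signals its slots as free. The sketch below folds them into one function to make that symmetry explicit. It is an illustrative C++17 model under stated assumptions, not the actual API: initialize_pipe, PipeState, InitResult, Core, and Platform are hypothetical names, addresses are plain integers, and the SET operations are recorded in a vector instead of driving hardware flags.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint8_t DIR_C2V = 0b01;
constexpr std::uint8_t DIR_V2C = 0b10;
enum class Platform { A2A3, A5 };
enum class Core { Cube, Vector };

// Hypothetical per-direction state kept by one core.
struct PipeState {
    bool active = false;
    std::uint64_t ring_buf = 0;               // GM address on A2A3, consumer SRAM address on A5
    std::uint8_t target_tag = 0;
    std::vector<std::uint8_t> pre_set_free;   // tags this core signalled as free
};

struct InitResult {
    std::uint8_t slot_num = 0;
    PipeState c2v, v2c;
};

InitResult initialize_pipe(Core core, Platform platform, std::uint8_t dir_mask,
                           std::uint32_t slot_size, std::uint64_t gm_slot_buffer,
                           std::uint32_t c2v_consumer_buf, std::uint32_t v2c_consumer_buf) {
    assert(dir_mask != 0 && (dir_mask & ~(DIR_C2V | DIR_V2C)) == 0);
    InitResult r;
    r.slot_num = (dir_mask == (DIR_C2V | DIR_V2C)) ? 4 : 8;

    // Only the consumer of a direction pre-signals its slots as free.
    auto pre_signal = [&](PipeState& p) {
        for (std::uint8_t i = 0; i < r.slot_num; ++i) p.pre_set_free.push_back(i);
    };

    if (dir_mask & DIR_C2V) {
        r.c2v.active = true;
        r.c2v.ring_buf = (platform == Platform::A2A3) ? gm_slot_buffer : c2v_consumer_buf;
        if (core == Core::Vector) pre_signal(r.c2v);  // Vector consumes C2V
    }
    if (dir_mask & DIR_V2C) {
        r.v2c.active = true;
        // On A2A3 the V2C ring follows the C2V ring when both are active.
        const std::uint64_t offset =
            (dir_mask & DIR_C2V) ? std::uint64_t{r.slot_num} * slot_size : 0;
        r.v2c.ring_buf = (platform == Platform::A2A3) ? gm_slot_buffer + offset : v2c_consumer_buf;
        if (core == Core::Cube) pre_signal(r.v2c);    // Cube consumes V2C
    }
    return r;
}
```

Calling the function once per core with identical arguments should give both sides the same ring_buf for each direction, with the pre-signalled tags appearing only on the consuming side.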

Buffer Layout and Cross-Core Address Problem on A5

A2A3: Straightforward GM Layout

On A2A3, the ring buffer resides in GM. The orchestration allocates a single GM_SLOT_BUFFER and passes it to both InCore functions. Both cores access the same physical GM addresses.

GM_SLOT_BUFFER (total size = 2 * SLOT_NUM * SLOT_SIZE for bidirectional):

    ┌─────────────────────────────┬─────────────────────────────┐
    │  C2V ring buffer            │  V2C ring buffer            │
    │  slot[0] .. slot[SLOT_NUM-1]│  slot[0] .. slot[SLOT_NUM-1]│
    │  offset: 0                  │  offset: SLOT_NUM*SLOT_SIZE │
    └─────────────────────────────┴─────────────────────────────┘
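
The layout arithmetic can be checked with a few lines of code. This is a sketch under the SLOT_NUM rule stated earlier (8 slots unidirectional, 4 per direction bidirectional); gm_buffer_bytes and gm_slot_offset are hypothetical helper names, and offsets are relative to the start of GM_SLOT_BUFFER.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint8_t DIR_C2V = 0b01;
constexpr std::uint8_t DIR_V2C = 0b10;

constexpr std::uint32_t slot_num_for(std::uint8_t dir_mask) {
    return (dir_mask == (DIR_C2V | DIR_V2C)) ? 4u : 8u;
}

// Total bytes of GM_SLOT_BUFFER: one ring per active direction.
constexpr std::uint64_t gm_buffer_bytes(std::uint8_t dir_mask, std::uint32_t slot_size) {
    const std::uint64_t rings = (dir_mask == (DIR_C2V | DIR_V2C)) ? 2 : 1;
    return rings * slot_num_for(dir_mask) * slot_size;
}

// Byte offset of slot[tag] for direction `dir` inside GM_SLOT_BUFFER.
inline std::uint64_t gm_slot_offset(std::uint8_t dir_mask, std::uint8_t dir,
                                    std::uint32_t tag, std::uint32_t slot_size) {
    const std::uint32_t slot_num = slot_num_for(dir_mask);
    assert((dir == DIR_C2V || dir == DIR_V2C) && (dir_mask & dir) && tag < slot_num);
    // The V2C ring follows the C2V ring only when both directions are active.
    const std::uint64_t ring_base =
        (dir == DIR_V2C && (dir_mask & DIR_C2V)) ? std::uint64_t{slot_num} * slot_size : 0;
    return ring_base + std::uint64_t{tag} * slot_size;
}
```

One consequence of the slot-count rule is that the GM footprint comes to 8 * SLOT_SIZE bytes in both modes: one ring of 8 slots, or two rings of 4 slots each.
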

A5: Consumer SRAM Address Problem

On A5, the ring buffer for each direction resides in the consumer's on-chip SRAM (UB or L1). This creates a fundamental problem:

  1. The ring buffer is a local memory region in the consumer's InCore function, allocated by the compiler in the consumer's local address space (UB or L1).
  2. The producer needs to know this address to DMA data into it — but it lives in another core's address space.
  3. In standard C/C++ semantics, a local symbol's address from one function cannot be referenced by another function. This violates symbol locality.

A5 Problem: C2V direction (Cube produces → Vector consumes)

    Cube InCore function:                    Vector InCore function:
    ┌─────────────────────┐                 ┌─────────────────────┐
    │  TPUSH(prod, ...)   │   ??? how to    │  consumer_buf =     │
    │  DMA to Vector's UB │ ──────────────▶ │  UB[BASE..BASE+SIZE]│
    │  at what address?   │   get address?  │  // local segment   │
    └─────────────────────┘                 └─────────────────────┘

Solution: CONSUMER_BUFFER_BASE / CONSUMER_BUFFER_SIZE Constant Symbols

The solution defines two constant symbols that are attached to each InCore function that participates in TPUSH/TPOP communication:

const uint32_t CONSUMER_BUFFER_BASE;   // base address of the consumer's ring buffer in its local SRAM
const uint32_t CONSUMER_BUFFER_SIZE;   // total size in bytes (= SLOT_NUM * SLOT_SIZE)

These symbols represent a reserved memory segment in the consumer InCore function's local SRAM (UB for Vector, L1 for Cube). The key properties are:

  1. Per-function constants: Each InCore function that acts as a consumer in any TPUSH/TPOP direction has its own CONSUMER_BUFFER_BASE and CONSUMER_BUFFER_SIZE. Each InCore function that acts as a producer also receives the peer consumer's CONSUMER_BUFFER_BASE and CONSUMER_BUFFER_SIZE so it knows the DMA target.

  2. Value origin:
       • Auto-generated kernels (auto_incore / ExpandMixedKernel): The values are generated by the ExpandMixedKernel pass, which has visibility into both producer and consumer functions' memory layouts and can assign a non-overlapping SRAM region for the ring buffer.
       • Manually written kernels: The programmer specifies CONSUMER_BUFFER_BASE and CONSUMER_BUFFER_SIZE as explicit constant declarations in the InCore function. The values must be chosen so that they do not conflict with other SRAM usage.

  3. Address allocator reservation: The downstream memory address allocator (e.g., AllocateMemoryAddr pass) must treat the segment [CONSUMER_BUFFER_BASE, CONSUMER_BUFFER_BASE + CONSUMER_BUFFER_SIZE) as occupied / in-use in the consumer function's SRAM. It must not allocate any other symbols (tiles, temporaries, etc.) into this region. This ensures the ring buffer and the function's normal tile allocations do not overlap.

  4. Cross-function visibility: The CONSUMER_BUFFER_BASE value of a consumer function is visible to its paired producer function as a compile-time constant. The compiler ensures this by:
       • Generating both functions in the same compilation unit (natural for ExpandMixedKernel).
       • Emitting the consumer's CONSUMER_BUFFER_BASE as a constant in the producer's initialization code.

Compiler pipeline (A5):

    ExpandMixedKernel pass:
    ┌──────────────────────────────────────────────────────────┐
    │  1. Identify TPUSH/TPOP communication pairs and directions │
    │  2. For each consumer function:                          │
    │     - Choose CONSUMER_BUFFER_BASE in consumer's SRAM     │
    │     - Set CONSUMER_BUFFER_SIZE = SLOT_NUM * SLOT_SIZE    │
    │     - Attach as constant symbols to the consumer func    │
    │  3. For each producer function:                          │
    │     - Import the consumer's CONSUMER_BUFFER_BASE value   │
    │     - Attach as constant symbol for DMA target address   │
    └──────────────────────────────────────────────────────────┘
                              │
                              ▼
    AllocateMemoryAddr pass:
    ┌──────────────────────────────────────────────────────────┐
    │  For each InCore function with CONSUMER_BUFFER_BASE:     │
    │  - Mark [BASE, BASE+SIZE) as reserved in SRAM layout     │
    │  - Allocate all other tiles/temporaries OUTSIDE this     │
    │    region                                                │
    └──────────────────────────────────────────────────────────┘
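
The reservation rule for the address allocator can be illustrated with a toy allocator. The sketch below is a minimal C++17 model and not the AllocateMemoryAddr pass itself: Segment, overlaps, and SramAllocator are hypothetical names, and a simple bump allocator stands in for whatever placement strategy the real pass uses.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Half-open address range [base, base + size) in one core's local SRAM.
struct Segment {
    std::uint32_t base;
    std::uint32_t size;
    std::uint32_t end() const { return base + size; }
};

inline bool overlaps(const Segment& a, const Segment& b) {
    return a.base < b.end() && b.base < a.end();
}

// Hypothetical bump allocator that never places a symbol inside the
// reserved [CONSUMER_BUFFER_BASE, CONSUMER_BUFFER_BASE + CONSUMER_BUFFER_SIZE) segment.
class SramAllocator {
public:
    SramAllocator(std::uint32_t capacity, Segment reserved)
        : capacity_(capacity), reserved_(reserved) {
        assert(reserved.end() <= capacity);
    }

    std::optional<Segment> allocate(std::uint32_t size) {
        Segment s{cursor_, size};
        if (overlaps(s, reserved_)) s.base = reserved_.end();  // jump past the ring buffer
        if (s.end() > capacity_) return std::nullopt;          // out of SRAM
        cursor_ = s.end();
        return s;
    }

private:
    std::uint32_t capacity_;
    Segment reserved_;
    std::uint32_t cursor_ = 0;
};
```

A bump allocator wastes the gap in front of the reserved segment when a request does not fit there. A production allocator would presumably reuse that space, but the non-overlap invariant is the same.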

Example — C2V unidirectional on A5:

Vector InCore function (consumer):
    CONSUMER_BUFFER_BASE = 0x1000            // compiler-assigned UB address
    CONSUMER_BUFFER_SIZE = 8 * TILE_SIZE     // 8 slots × tile size

    // UB layout after AllocateMemoryAddr:
    //   [0x0000 .. 0x0FFF]  — normal tiles / temporaries
    //   [0x1000 .. 0x1000 + 8*TILE_SIZE)  — RESERVED: ring buffer (CONSUMER_BUFFER segment)
    //   [above  .. UB_END]  — normal tiles / temporaries

Cube InCore function (producer):
    CONSUMER_BUFFER_BASE = 0x1000            // same value, imported from consumer
    CONSUMER_BUFFER_SIZE = 8 * TILE_SIZE     // same value

    // Cube uses CONSUMER_BUFFER_BASE as the DMA target base address
    // for its TPUSH operations (writes to Vector's UB at this address)

Bidirectional case: Each direction has a different consumer. The Cube function has its own CONSUMER_BUFFER_BASE/SIZE for V2C (ring buffer in Cube's L1), and the Vector function has its own for C2V (ring buffer in Vector's UB). Each function also imports the peer's CONSUMER_BUFFER_BASE for the direction where it acts as producer.

Bidirectional A5:

    Cube InCore function:
        // V2C: Cube is consumer → own L1 segment
        V2C_CONSUMER_BUFFER_BASE = 0x2000    // Cube's L1
        V2C_CONSUMER_BUFFER_SIZE = 4 * TILE_SIZE

        // C2V: Cube is producer → needs Vector's UB address
        C2V_CONSUMER_BUFFER_BASE = 0x1000    // imported from Vector's constant

    Vector InCore function:
        // C2V: Vector is consumer → own UB segment
        C2V_CONSUMER_BUFFER_BASE = 0x1000    // Vector's UB
        C2V_CONSUMER_BUFFER_SIZE = 4 * TILE_SIZE

        // V2C: Vector is producer → needs Cube's L1 address
        V2C_CONSUMER_BUFFER_BASE = 0x2000    // imported from Cube's constant

Buffer Layout Summary

A2A3 (ring buffer in GM, GM_SLOT_BUFFER active):

    GM_SLOT_BUFFER:
    ┌─────────────────────────────┬─────────────────────────────┐
    │  C2V ring buffer            │  V2C ring buffer            │
    │  slot[0] .. slot[SLOT_NUM-1]│  slot[0] .. slot[SLOT_NUM-1]│
    └─────────────────────────────┴─────────────────────────────┘

A5 (ring buffer in consumer SRAM, CONSUMER_BUFFER_BASE/SIZE):

    Vector UB (for C2V, Vector is consumer):
    ┌──────────┬──────────────────────────────┬───────────┐
    │ normal   │  CONSUMER_BUFFER segment     │ normal    │
    │ tiles    │  [BASE .. BASE+SIZE)         │ tiles     │
    │          │  slot[0] .. slot[SLOT_NUM-1] │           │
    └──────────┴──────────────────────────────┴───────────┘
    ◄─── allocator avoids this region ───►

    Cube L1 (for V2C, Cube is consumer):
    ┌──────────┬──────────────────────────────┬───────────┐
    │ normal   │  CONSUMER_BUFFER segment     │ normal    │
    │ tiles    │  [BASE .. BASE+SIZE)         │ tiles     │
    │          │  slot[0] .. slot[SLOT_NUM-1] │           │
    └──────────┴──────────────────────────────┴───────────┘

Data Transfer Instructions (generic families: TPUSH, TPOP, TFREE)

This design aligns with the existing public PTO instruction surface in pto_instr.hpp: TPUSH(prod, tile, fifo) and TPOP(cons, tile, fifo) are generic instruction families whose behavior is specialized by typed pipe state and DataFIFO. This note uses TFREE(cons) as the normative name for the explicit release step in the split consumer protocol; the current implementation helper is named TPOPDONE(cons).

Direction is not encoded in the opcode spelling. It is inferred from:

  • The producer vs consumer role (prod or cons)
  • The tile type / tile location participating in the transfer
  • The FIFO kind (GM_FIFO, VEC_FIFO, MAT_FIFO)
  • The configured pipe state, which already determines the peer relationship and flag assignment

In practice:

  • On A2A3, both directions typically use GM_FIFO; the typed pipe state determines whether the channel is C2V or V2C.
  • On A5, VEC_FIFO means the consumer-local ring buffer is in a Vector core UB (Cube → Vector), and MAT_FIFO means the consumer-local ring buffer is in a Cube-local matrix/L1 buffer (Vector → Cube).

The TPOP and TFREE instructions form a split consumer protocol: TPOP acquires a slot (wait-ready + load / bind data) and TFREE releases it (signal-free + advance tag). This split is essential because the consumer may continue reading from the slot buffer after TPOP returns. If the slot were released immediately, the producer could overwrite the data before the consumer finishes using it.

| Instruction family | Form | Executed On | Role | Description |
|--------------------|------|-------------|------|-------------|
| TPUSH | TPUSH(prod, TILE, fifo) | Producer core | Producer | Wait for slot free, materialize tile into the configured ring-buffer slot, signal ready, advance producer tag |
| TPOP | TPOP(cons, TILE, fifo) | Consumer core | Consumer | Wait for slot ready, acquire the next slot, load or bind tile data, keep slot held |
| TFREE | TFREE(cons) | Consumer core | Consumer | Release the currently held slot, signal free, advance consumer tag |

Peer selection is part of the configured pipe state and flag mapping. It is not a per-instruction operand.

TPUSH(prod, TILE, fifo)

Executed on the producer core. Pushes a tile into the ring buffer selected by prod and fifo.

| Parameter | Type | Description |
|-----------|------|-------------|
| prod | PipeProd& | Producer-side pipe state; identifies the communication channel and peer |
| TILE | Tile& | Source tile data to push |
| fifo | DataFIFO& | Backing FIFO descriptor (GM_FIFO, VEC_FIFO, or MAT_FIFO) |

Pseudocode:

function TPUSH(prod, TILE, fifo):
    pipe = prod.pipe_state

    // 1) Wait for slot to be free
    WAIT pipe.flag_free: pipe.target_tag

    // 2) Materialize tile data into the current ring-buffer slot
    dst_addr = pipe.ring_buf_base + pipe.target_tag * SLOT_SIZE
    if fifo.kind == GM_FIFO:
        DMA_or_store(src=TILE.data, dst=dst_addr, size=SLOT_SIZE)
    else if fifo.kind == VEC_FIFO:
        DMA_or_store(src=TILE.data, dst=dst_addr, size=SLOT_SIZE)    // destination is consumer Vector-local SRAM
    else:  // MAT_FIFO
        DMA_or_store(src=TILE.data, dst=dst_addr, size=SLOT_SIZE)    // destination is consumer Cube-local SRAM
    WAIT transfer_complete

    // 3) Signal consumer: data in slot is ready
    SET pipe.flag_ready: pipe.target_tag

    // 4) Advance to next slot
    pipe.target_tag = (pipe.target_tag + 1) % SLOT_NUM

The exact move path is determined by the existing typed implementation:

  • GM_FIFO uses GM-backed staging
  • VEC_FIFO writes directly into the consumer Vector-local slot
  • MAT_FIFO writes directly into the consumer Cube-local slot

TPOP(cons, TILE, fifo)

Executed on the consumer core. Acquires the next slot from the ring buffer selected by cons and fifo. The slot remains held until the consumer explicitly calls TFREE(cons).

| Parameter | Type | Description |
|-----------|------|-------------|
| cons | PipeCon& | Consumer-side pipe state; identifies the communication channel and current held slot |
| TILE | Tile& | Destination tile / view used to consume the slot data |
| fifo | DataFIFO& | Backing FIFO descriptor (GM_FIFO, VEC_FIFO, or MAT_FIFO) |

Pseudocode:

function TPOP(cons, TILE, fifo):
    pipe = cons.pipe_state

    // 1) Wait for data to be ready
    WAIT pipe.flag_ready: pipe.target_tag

    // 2) Acquire the current ring-buffer slot
    src_addr = pipe.ring_buf_base + pipe.target_tag * SLOT_SIZE
    if fifo.kind == GM_FIFO:
        DMA_or_load(src=src_addr, dst=TILE.data, size=SLOT_SIZE)
        WAIT transfer_complete
    else:
        TILE.data = src_addr    // consumer-local slot: zero-copy / direct local binding

    // NOTE: Slot is NOT released here.
    // The consumer must call TFREE(cons) after it has finished using TILE's data.

This document uses the split protocol as the normative behavior even though existing implementations may also support auto-free behavior via consumer configuration.

TFREE(cons)

Executed on the consumer core. Releases the currently held slot back to the producer after the consumer has finished reading the data obtained from the preceding TPOP(cons, TILE, fifo).

| Parameter | Type | Description |
|-----------|------|-------------|
| cons | PipeCon& | Consumer-side pipe state; identifies which held slot is being released |

Pseudocode:

function TFREE(cons):
    pipe = cons.pipe_state

    // 1) Signal producer: slot is free for reuse
    SET pipe.flag_free: pipe.target_tag

    // 2) Advance to next slot
    pipe.target_tag = (pipe.target_tag + 1) % SLOT_NUM
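
Taken together, the three pseudocode routines describe a small state machine that can be exercised without hardware. The sketch below is a single-threaded model under stated assumptions: PipeModel, try_push, try_pop, and tfree are hypothetical names, the try_ functions return a failure value where the hardware WAIT would block, and two boolean arrays stand in for the flag_free and flag_ready channels. It also assumes at most one held slot at a time, which the pseudocode implies because the consumer tag advances only in TFREE.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical single-threaded model of one unidirectional pipe.
template <std::size_t SlotNum>
class PipeModel {
public:
    using Tile = std::vector<std::uint8_t>;

    PipeModel() { free_.fill(true); }  // initialization: all slots pre-signalled free

    bool try_push(const Tile& tile) {              // models TPUSH
        if (!free_[prod_tag_]) return false;       // WAIT flag_free would block
        free_[prod_tag_] = false;
        slots_[prod_tag_] = tile;                  // materialize tile into the slot
        ready_[prod_tag_] = true;                  // SET flag_ready
        prod_tag_ = (prod_tag_ + 1) % SlotNum;     // producer advances in TPUSH
        return true;
    }

    const Tile* try_pop() {                        // models TPOP
        assert(!holding_ && "release the previous slot with tfree() first");
        if (!ready_[cons_tag_]) return nullptr;    // WAIT flag_ready would block
        ready_[cons_tag_] = false;
        holding_ = true;                           // slot stays held, tag not advanced
        return &slots_[cons_tag_];
    }

    void tfree() {                                 // models TFREE
        assert(holding_);
        holding_ = false;
        free_[cons_tag_] = true;                   // SET flag_free
        cons_tag_ = (cons_tag_ + 1) % SlotNum;     // consumer advances in TFREE
    }

private:
    std::array<Tile, SlotNum> slots_{};
    std::array<bool, SlotNum> free_{};
    std::array<bool, SlotNum> ready_{};
    std::size_t prod_tag_ = 0, cons_tag_ = 0;
    bool holding_ = false;
};
```

With PipeModel<4>, four pushes succeed and a fifth is refused until the consumer has both popped and freed slot 0. A pop alone does not unblock the producer, which is the point of the split protocol.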

Flag Assignment

The 8 hardware flags per direction per peer are mapped as follows:

Unidirectional (DIR_C2V only, SLOT_NUM=8):

    flag_ready[C2V, peer] : flags 0..7   (Cube SETs, Vector WAITs)
    flag_free [C2V, peer] : flags 0..7   (Vector SETs, Cube WAITs)

Bidirectional (DIR_C2V | DIR_V2C, SLOT_NUM=4):

    flag_ready[C2V, peer] : flags 0..3   (Cube SETs, Vector WAITs)
    flag_free [C2V, peer] : flags 0..3   (Vector SETs, Cube WAITs)
    flag_ready[V2C, peer] : flags 4..7   (Vector SETs, Cube WAITs)
    flag_free [V2C, peer] : flags 4..7   (Cube SETs, Vector WAITs)
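
The mapping from a (direction, tag) pair to a hardware flag index follows directly from the two listings. The helper below is a sketch: flag_index is a hypothetical name, and the V2C-only case is assumed to use flags 0..7 in the same way as the C2V-only case, which the listing above does not show explicitly.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint8_t DIR_C2V = 0b01;
constexpr std::uint8_t DIR_V2C = 0b10;

// Hypothetical helper: flag index used by both flag_ready and flag_free
// for slot `tag` of direction `dir` under the given DIR_MASK.
inline std::uint8_t flag_index(std::uint8_t dir_mask, std::uint8_t dir, std::uint8_t tag) {
    const bool bidir = dir_mask == (DIR_C2V | DIR_V2C);
    const std::uint8_t slot_num = bidir ? 4 : 8;
    assert((dir == DIR_C2V || dir == DIR_V2C) && (dir_mask & dir) && tag < slot_num);
    // Bidirectional: C2V keeps flags 0..3, V2C is shifted to flags 4..7.
    const std::uint8_t base = (bidir && dir == DIR_V2C) ? 4 : 0;
    return static_cast<std::uint8_t>(base + tag);
}
```

The ready and free signals of one ring can share an index because they travel in opposite hardware directions, each of which has its own bank of 8 flags.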

Timing Diagram: Unidirectional C2V (SLOT_NUM=4)

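The split TPOP / TFREE protocol allows the consumer to hold a slot while computing on the data. TFREE may happen any time before the ring buffer wraps back to the same slot, but because the consumer tag advances only in TFREE, it must also precede the consumer's next TPOP on the same pipe. The diagram shows four slots to stay compact; a unidirectional pipe as specified above uses SLOT_NUM = 8.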

          iter 0              iter 1              iter 2              iter 3              iter 4
tag:        0                   1                   2                   3                   0

AIC (Cube, producer):
          ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
          │TPUSH         │  │TPUSH         │  │TPUSH         │  │TPUSH         │  │TPUSH         │
          │ WAIT f:0     │  │ WAIT f:1     │  │ WAIT f:2     │  │ WAIT f:3     │  │ WAIT f:0     │
          │ MTE → 0      │  │ MTE → 1      │  │ MTE → 2      │  │ MTE → 3      │  │ MTE → 0      │
          │ SET r:0      │  │ SET r:1      │  │ SET r:2      │  │ SET r:3      │  │ SET r:0      │
          └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘
                 │                  │                  │                  │                  │
                 ▼ ready            ▼ ready            ▼ ready            ▼ ready            ▼ ready

AIV (Vector, consumer):
                 ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
                 │TPOP          │  │TPOP          │  │TPOP          │  │TPOP          │
                 │ WAIT r:0     │  │ WAIT r:1     │  │ WAIT r:2     │  │ WAIT r:3     │
                 │ load [0]     │  │ load [1]     │  │ load [2]     │  │ load [3]     │
                 └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘
                        │                  │                  │                  │
                    (consumer             (consumer          (consumer          (consumer
                     uses data)            uses data)         uses data)         uses data)
                        │                  │                  │                  │
                 ┌──────┴───────┐  ┌──────┴───────┐  ┌──────┴───────┐  ┌──────┴───────┐
                 │TFREE         │  │TFREE         │  │TFREE         │  │TFREE         │
                 │ SET f:0      │  │ SET f:1      │  │ SET f:2      │  │ SET f:3      │
                 │ tag++ → 1    │  │ tag++ → 2    │  │ tag++ → 3    │  │ tag++ → 0    │
                 └──────┬───────┘  └──────────────┘  └──────────────┘  └──────────────┘
                        │
                        ▼ free (slot 0 now available for iter 4's TPUSH)

Legend: r = flag_ready, f = flag_free
        The configured producer / consumer pipe state determines the communicating peer.
        aiv_initialize_pipe pre-SETs f:0..3 so AIC does not block initially

Timing Diagram: Bidirectional (SLOT_NUM=4)

          iter 0              iter 1              iter 2              iter 3              iter 4

AIC (Cube):
  C2V:    TPUSH               TPUSH               TPUSH               TPUSH               TPUSH
          tag=0               tag=1               tag=2               tag=3               tag=0
  V2C:         TPOP                TPOP                TPOP                TPOP
               tag=0               tag=1               tag=2               tag=3
               (use data)          (use data)          (use data)          (use data)
               TFREE               TFREE               TFREE               TFREE

AIV (Vector):
  V2C:    TPUSH               TPUSH               TPUSH               TPUSH               TPUSH
          tag=0               tag=1               tag=2               tag=3               tag=0
  C2V:         TPOP                TPOP                TPOP                TPOP
               tag=0               tag=1               tag=2               tag=3
               (use data)          (use data)          (use data)          (use data)
               TFREE               TFREE               TFREE               TFREE

Flag usage (per AIV peer):
  flags 0..3 : C2V direction (ready + free)
  flags 4..7 : V2C direction (ready + free)
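
The flag split above can be sketched as a small lookup. This is an illustrative reading, not a statement of the hardware encoding: it assumes that the ready and free signals of a slot share one flag index and travel in opposite hardware directions (ready is SET by the producer, free by the consumer). The names FLAG_BASE and signal are hypothetical.

```python
SLOT_NUM = 4  # matches the SLOT_NUM=4 timing diagrams

# First flag index of each pipe direction, per the "Flag usage" listing.
FLAG_BASE = {"C2V": 0, "V2C": SLOT_NUM}


def signal(pipe: str, kind: str, slot: int) -> tuple[str, int]:
    """Return (core that SETs the flag, flag index) for one slot signal.

    Assumption: 'ready' is SET by the producer and 'free' by the consumer,
    both on the same flag index but in opposite hardware directions.
    """
    if pipe not in FLAG_BASE or kind not in ("ready", "free"):
        raise ValueError(f"unknown pipe/kind: {pipe}/{kind}")
    if not 0 <= slot < SLOT_NUM:
        raise ValueError(f"slot {slot} out of range")
    producer, consumer = ("AIC", "AIV") if pipe == "C2V" else ("AIV", "AIC")
    setter = producer if kind == "ready" else consumer
    return setter, FLAG_BASE[pipe] + slot
```

Under this reading, a bidirectional pipe with SLOT_NUM=4 uses all 8 flag indices in each hardware direction of one Vector-Cube pair.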

Key Properties

  1. No deadlock: The consumer side (aiv_initialize_pipe or aic_initialize_pipe) pre-signals all SLOT_NUM slots as free before the main loop begins, so the producer can fill up to SLOT_NUM slots before blocking.
  2. Backpressure: If the producer is faster than the consumer, TPUSH blocks at WAIT flag_free when all slots are occupied; if the consumer is faster, TPOP blocks at WAIT flag_ready when no data is ready.
  3. In-order delivery: Both sides advance target_tag in strict round-robin order (tag + 1) % SLOT_NUM, guaranteeing FIFO semantics. The producer advances in TPUSH; the consumer advances in TFREE (not in TPOP).
  4. Decoupled data movement: TPUSH uses the data-movement path selected by the typed FIFO and tile combination, and it waits for that transfer to complete before it SETs flag_ready, so the consumer never observes a partially written slot.
  5. Buddy core selection: Peer selection lives in the configured pipe state and flag assignment, enabling independent pipes to different Vector peers from the same Cube core without adding an instruction operand.
  6. Direction inferred from typed pipe/fifo state: Each instruction family is generic. The actual C2V vs V2C behavior is inferred from producer/consumer role, tile type/location, and FIFO kind rather than opcode suffixes or runtime direction operands.
  7. Split consumer protocol (TPOP / TFREE): TPOP only acquires the slot (wait-ready + load); TFREE releases it (signal-free + advance). This prevents the producer from overwriting a slot while the consumer is still reading from it. The compiler or programmer must ensure every TPOP is paired with a corresponding TFREE before the same slot is needed again (i.e., before wrapping around the ring buffer by SLOT_NUM iterations).
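
The properties above can be exercised with a small single-threaded model of one pipe. This is a sketch under stated assumptions, not the hardware behavior: the class name RingPipe is hypothetical, a blocking WAIT is modeled by returning False or None instead of stalling, and tiles are plain Python values.

```python
class RingPipe:
    """Toy model of one TPUSH/TPOP/TFREE pipe with SLOT_NUM slots."""

    def __init__(self, slot_num: int = 4):
        self.slot_num = slot_num
        self.slots = [None] * slot_num
        self.flag_ready = [False] * slot_num
        # The consumer-side initialize_pipe pre-signals every slot as free.
        self.flag_free = [True] * slot_num
        self.prod_tag = 0
        self.cons_tag = 0
        self.held = False  # True between TPOP and TFREE

    def tpush(self, tile) -> bool:
        """Return False where the real TPUSH would block at WAIT flag_free."""
        tag = self.prod_tag
        if not self.flag_free[tag]:
            return False
        self.flag_free[tag] = False          # WAIT consumes the flag
        self.slots[tag] = tile               # data movement completes first
        self.flag_ready[tag] = True          # then signal the consumer
        self.prod_tag = (tag + 1) % self.slot_num
        return True

    def tpop(self):
        """Return None where the real TPOP would block at WAIT flag_ready."""
        if self.held:
            raise RuntimeError("TPOP issued before the previous TFREE")
        tag = self.cons_tag
        if not self.flag_ready[tag]:
            return None
        self.flag_ready[tag] = False
        self.held = True                     # slot stays owned by the consumer
        return self.slots[tag]

    def tfree(self) -> None:
        """Release the held slot and advance the consumer tag."""
        if not self.held:
            raise RuntimeError("TFREE without a preceding TPOP")
        self.flag_free[self.cons_tag] = True
        self.cons_tag = (self.cons_tag + 1) % self.slot_num
        self.held = False
```

In this model the producer can push SLOT_NUM tiles before it stalls, and a popped slot only becomes writable again after TFREE, which mirrors properties 1, 2, 3, and 7.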

API Summary

| API | Called On | Role | Direction | Description |
|-----|-----------|------|-----------|-------------|
| aic_initialize_pipe(DIR_MASK, SLOT_SIZE, GM_SLOT_BUFFER, C2V_CONSUMER_BUF, V2C_CONSUMER_BUF) | Cube (AIC) | Setup | n/a | Bind ring buffer, init tags, pre-signal free slots for V2C |
| aiv_initialize_pipe(DIR_MASK, SLOT_SIZE, GM_SLOT_BUFFER, C2V_CONSUMER_BUF, V2C_CONSUMER_BUF) | Vector (AIV) | Setup | n/a | Bind ring buffer, init tags, pre-signal free slots for C2V |
| TPUSH(prod, TILE, fifo) | Producer core | Producer | Inferred from prod / fifo | Wait free → materialize tile into configured ring-buffer slot → signal ready |
| TPOP(cons, TILE, fifo) | Consumer core | Consumer | Inferred from cons / fifo | Wait ready → acquire / load tile (DMA or zero-copy). Slot remains held |
| TFREE(cons) | Consumer core | Consumer | Inferred from cons | Signal producer: slot free → advance tag. Must follow TPOP |

CONSUMER_BUFFER_BASE / CONSUMER_BUFFER_SIZE — Constant Symbols per InCore Function

| Symbol | Type | Scope | Description |
|--------|------|-------|-------------|
| {DIR}_CONSUMER_BUFFER_BASE | uint32_t | Per InCore function, per direction | Base address of the ring buffer in the consumer's local SRAM |
| CONSUMER_BUFFER_SIZE | uint32_t | Per InCore function | Total reserved size (SLOT_NUM * SLOT_SIZE) |

These are constant symbols embedded in each InCore function's symbol table, used for two purposes:

  1. Address allocator reservation: The AllocateMemoryAddr pass reads these symbols and marks the corresponding SRAM region as occupied, preventing other allocations from overlapping.
  2. Explicit argument to initialization: The resolved CONSUMER_BUFFER_BASE values are passed as explicit arguments (C2V_CONSUMER_BUF, V2C_CONSUMER_BUF) to aic_initialize_pipe / aiv_initialize_pipe. This avoids any special compiler mechanism for implicit constant lookups inside the init function.

Each function that participates in TPUSH/TPOP communication has:

  • As consumer (owns the buffer): {DIR}_CONSUMER_BUFFER_BASE is the base address of the reserved segment in its own SRAM (UB or L1). The value is passed to its own init function and also to the paired producer's init function.
  • As producer (writes into the consumer's buffer, which is the DMA target): receives the consumer's {DIR}_CONSUMER_BUFFER_BASE value via pl.import_peer_buffer, and passes it as an explicit argument to its init function.
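
As a worked example of how the two constants relate to individual slots, the sketch below computes the reserved size and a slot address. It assumes that slots are laid out contiguously from the base in tag order; the helper names and the sample values (base 0x1000, SLOT_SIZE 1024) are illustrative only.

```python
def consumer_buffer_size(slot_num: int, slot_size: int) -> int:
    """CONSUMER_BUFFER_SIZE = SLOT_NUM * SLOT_SIZE."""
    return slot_num * slot_size


def slot_address(base: int, tag: int, slot_num: int, slot_size: int) -> int:
    """Address of the slot selected by `tag`.

    Assumption: slots are packed back to back starting at `base`.
    """
    if not 0 <= tag < slot_num:
        raise ValueError(f"tag {tag} out of range for {slot_num} slots")
    return base + tag * slot_size


# Illustrative values: 4 slots of 1024 bytes reserved at 0x1000.
size = consumer_buffer_size(4, 1024)
last_slot = slot_address(0x1000, 3, 4, 1024)
```

With these values the reserved region is [0x1000, 0x2000), and both the consumer and the producer derive the same slot address from the same base, which is why the base is passed to both init functions.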

Value generation:

| Kernel Origin | How values are set |
|---------------|--------------------|
| auto_incore / ExpandMixedKernel | The pass generates both constants when splitting the mixed InCore function. It assigns a non-overlapping SRAM region for the ring buffer. |
| Manually written | The programmer declares CONSUMER_BUFFER_BASE and CONSUMER_BUFFER_SIZE as explicit constants. Values must be chosen to avoid conflict with other SRAM usage. |

Address allocator contract: The AllocateMemoryAddr pass (or equivalent downstream allocator) must:

  1. Read {DIR}_CONSUMER_BUFFER_BASE and CONSUMER_BUFFER_SIZE from the function's symbol table.
  2. Mark [BASE, BASE + SIZE) as reserved / occupied in the SRAM layout.
  3. Allocate all other symbols (tiles, temporaries, spills) outside this region.

This ensures the ring buffer segment and normal compute allocations never overlap.
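
The three contract steps can be illustrated with a toy bump allocator that skips reserved ranges. This is a sketch, not the AllocateMemoryAddr implementation: the function name allocate, the 32-byte alignment, and the symbol sizes are assumptions made for the example.

```python
def allocate(sizes, sram_size, reserved, align=32):
    """Place symbols in order, never overlapping a reserved [base, base+size).

    sizes:    dict of symbol name -> size in bytes
    reserved: list of (base, size) ring-buffer regions
    align:    hypothetical alignment requirement
    """
    placed = {}
    cursor = 0
    for name, size in sizes.items():
        while True:
            start = -(-cursor // align) * align   # round cursor up to alignment
            end = start + size
            clash = next(
                ((b, s) for b, s in reserved if start < b + s and b < end),
                None,
            )
            if clash is None:
                break
            cursor = clash[0] + clash[1]          # jump past the reserved region
        if end > sram_size:
            raise MemoryError(f"{name} does not fit in SRAM")
        placed[name] = start
        cursor = end
    return placed


layout = allocate(
    {"tile_a": 0x800, "tile_b": 0x1000, "tmp": 0x100},
    sram_size=0x8000,
    reserved=[(0x1000, 4096)],
)
```

Here tile_b would straddle the reserved region if placed right after tile_a, so the sketch moves it to the first address past BASE + SIZE.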

DSL Grammar: pl.reserve_buffer — Reserved Address Space Declaration

The compiler must provide a DSL-level mechanism for InCore kernel programs to declare reserved address space for the SLOT_BUFFER. This is necessary because:

  1. The address allocator must know which SRAM regions are off-limits before it runs.
  2. For manually written InCore kernels, the programmer needs an explicit way to express "this region of my local SRAM is reserved for TPUSH/TPOP ring buffer."
  3. For compiler-generated kernels (auto_incore / ExpandMixedKernel), the pass emits the same declaration into the generated IR, so the rest of the pipeline treats it uniformly.

Proposed Syntax

pypto DSL (Python frontend):

@pl.incore
def my_vector_kernel(...):
    # Declare a reserved buffer region in this function's local SRAM.
    # The allocator will not place any other symbols in [base, base + size).
    # 'base' can be:
    #   - pl.AUTO: compiler picks the address (typical for auto_incore)
    #   - an integer literal: programmer specifies exact address (manual kernels)
    pipe_buf = pl.reserve_buffer(
        name="c2v_slot_buffer",
        size=SLOT_NUM * SLOT_SIZE,       # total bytes to reserve
        base=pl.AUTO,                    # or e.g. 0x1000 for manual kernels
    )

    # pipe_buf.base is a compile-time constant (resolved by allocator if AUTO)
    # pipe_buf.size is the declared size
    # Pass pipe_buf.base explicitly to the initialization function:
    aiv_initialize_pipe(DIR_C2V, SLOT_SIZE, gm_slot_buffer,
                        c2v_consumer_buf=pipe_buf.base,
                        v2c_consumer_buf=0)

    for ...:
        tile = TPOP(pipe_buf)    # conceptual frontend sugar; lowers to TPOP(cons, tile, fifo)
        # ... compute on tile ...
        TFREE(pipe_buf)          # conceptual frontend sugar; lowers to TFREE(cons) after the last read of tile

pypto DSL (producer side):

@pl.incore
def my_cube_kernel(...):
    # Producer imports the consumer's reserved buffer address.
    # 'peer_func' identifies the paired consumer InCore function.
    # The compiler resolves peer_buf.base to the consumer's CONSUMER_BUFFER_BASE value.
    peer_buf = pl.import_peer_buffer(
        name="c2v_slot_buffer",
        peer_func=my_vector_kernel,       # reference to paired consumer function
    )

    # Pass peer_buf.base explicitly to the initialization function:
    aic_initialize_pipe(DIR_C2V, SLOT_SIZE, gm_slot_buffer,
                        c2v_consumer_buf=peer_buf.base,
                        v2c_consumer_buf=0)

    for ...:
        TPUSH(tile, peer_buf)    # conceptual frontend sugar; lowers to TPUSH(prod, tile, fifo)

IR Representation

At the IR level, pl.reserve_buffer lowers to a ReserveBuffer node attached to the InCore function:

// IR after lowering (conceptual):
func @my_vector_kernel(...) {
    %pipe_buf = reserve_buffer {
        name = "c2v_slot_buffer",
        size = 4096,                    // SLOT_NUM * SLOT_SIZE
        base = auto,                    // or literal 0x1000
        memory_space = "UB"             // inferred from core type (UB for Vector, L1 for Cube)
    }
    ...
}

And pl.import_peer_buffer lowers to an ImportPeerBuffer node:

func @my_cube_kernel(...) {
    %peer_buf = import_peer_buffer {
        name = "c2v_slot_buffer",
        peer_func = @my_vector_kernel
    }
    ...
}

Allocator Handling

The AllocateMemoryAddr pass processes ReserveBuffer nodes as follows:

| base value | Allocator behavior |
|------------|--------------------|
| auto | Allocator picks an address that does not conflict with other allocations. Writes the chosen address back into %pipe_buf.base. |
| literal (e.g. 0x1000) | Allocator marks [0x1000, 0x1000 + size) as occupied. Fails with an error if the region overlaps with prior allocations. |

After the allocator runs, %pipe_buf.base is a resolved compile-time constant in both the consumer and producer functions. The ImportPeerBuffer node resolves to the same literal value as the paired ReserveBuffer node.
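
A minimal sketch of this resolution step follows, assuming the nodes are modeled as plain Python dataclasses and that base=auto is represented by None. The class fields, the resolve function, and the candidate-address strategy are hypothetical; the real pass may choose addresses differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReserveBuffer:
    func: str
    name: str
    size: int
    base: Optional[int] = None      # None models base=auto


@dataclass
class ImportPeerBuffer:
    func: str
    name: str
    peer_func: str
    base: Optional[int] = None      # filled in from the paired ReserveBuffer


def resolve(reserves, imports, occupied, sram_size):
    """Resolve every base; `occupied` maps func -> [(start, size)] prior allocations."""
    def overlaps(func, start, size):
        return any(start < s + n and s < start + size
                   for s, n in occupied.get(func, []))

    table = {}
    for r in reserves:
        if r.base is None:
            # auto: try address 0 and the end of each prior allocation
            candidates = sorted({0, *(s + n for s, n in occupied.get(r.func, []))})
            r.base = next((c for c in candidates
                           if not overlaps(r.func, c, r.size)
                           and c + r.size <= sram_size), None)
            if r.base is None:
                raise MemoryError(f"no room for {r.name} in {r.func}")
        elif overlaps(r.func, r.base, r.size) or r.base + r.size > sram_size:
            raise ValueError(f"{r.name} overlaps prior allocations or exceeds SRAM")
        occupied.setdefault(r.func, []).append((r.base, r.size))
        table[(r.func, r.name)] = r.base

    for i in imports:
        key = (i.peer_func, i.name)
        if key not in table:
            raise LookupError(f"no ReserveBuffer {i.name} in {i.peer_func}")
        i.base = table[key]         # same literal as the consumer's base
    return table
```

After resolve returns, the producer's import carries the same integer as the consumer's reservation, which is the value both sides pass to their init functions.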

ExpandMixedKernel Auto-Generation

When ExpandMixedKernel splits a mixed InCore function, it automatically emits ReserveBuffer and ImportPeerBuffer nodes:

ExpandMixedKernel pass:

    Input: mixed InCore function with TPUSH/TPOP ops

    Output:
    ┌───────────────────────────────────┐
    │ Consumer function (e.g. Vector):  │
    │   %buf = reserve_buffer {         │
    │     name = "c2v_slot_buffer",     │
    │     size = SLOT_NUM * SLOT_SIZE,  │
    │     base = auto,                  │  ← allocator will resolve
    │     memory_space = "UB"           │
    │   }                               │
    │   ...TPOP uses %buf...            │
    └───────────────────────────────────┘

    ┌───────────────────────────────────┐
    │ Producer function (e.g. Cube):    │
    │   %peer = import_peer_buffer {    │
    │     name = "c2v_slot_buffer",     │
    │     peer_func = @consumer_func    │
    │   }                               │
    │   ...TPUSH uses %peer...          │
    └───────────────────────────────────┘

The programmer never writes reserve_buffer or import_peer_buffer when using auto_incore — the compiler generates them. These constructs are only explicitly written in manually authored InCore kernels.
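
The node emission can be sketched as a function from the directions used by the mixed kernel to per-sub-function node lists. This is illustrative only: the dictionary encoding, the function name emit_buffer_nodes, and the "v2c_slot_buffer" name (by analogy with "c2v_slot_buffer") are assumptions.

```python
def emit_buffer_nodes(directions, slot_num, slot_size):
    """Return {'cube': [...], 'vector': [...]} with the nodes each side receives."""
    # direction -> (consumer side, producer side, consumer memory space)
    roles = {
        "C2V": ("vector", "cube", "UB"),   # Vector consumes, buffer lives in UB
        "V2C": ("cube", "vector", "L1"),   # Cube consumes, buffer lives in L1
    }
    nodes = {"cube": [], "vector": []}
    for direction in directions:
        consumer, producer, space = roles[direction]
        name = f"{direction.lower()}_slot_buffer"
        nodes[consumer].append({
            "op": "reserve_buffer",
            "name": name,
            "size": slot_num * slot_size,
            "base": "auto",                # the allocator resolves this later
            "memory_space": space,
        })
        nodes[producer].append({
            "op": "import_peer_buffer",
            "name": name,
            "peer_func": consumer,
        })
    return nodes
```

For a bidirectional kernel each sub-function ends up with one reserve_buffer (for the direction it consumes) and one import_peer_buffer (for the direction it produces).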

Summary of Grammar Elements

| DSL Construct | Purpose | Who writes it |
|---------------|---------|---------------|
| pl.reserve_buffer(name, size, base) | Declare a reserved SRAM region in the current InCore function for ring buffer use | Compiler (auto_incore) or programmer (manual kernel) |
| pl.import_peer_buffer(name, peer_func) | Import the resolved base address of a peer function's reserved buffer | Compiler (auto_incore) or programmer (manual kernel) |
| pl.AUTO | Sentinel value requesting compiler to auto-assign the base address | Used in base= parameter |

These constructs form the contract between the InCore kernel program and the address allocator: "this region of my SRAM is spoken for — do not allocate into it."

Compiler Toolchain Implications

The CONSUMER_BUFFER_BASE / CONSUMER_BUFFER_SIZE design and the reserve_buffer / import_peer_buffer grammar require the pypto compiler (or downstream toolchain) to support the following:

  1. DSL frontend (new constructs): The pypto Python frontend must support pl.reserve_buffer(...) and pl.import_peer_buffer(...). These lower to ReserveBuffer and ImportPeerBuffer IR nodes respectively.

  2. ExpandMixedKernel pass (auto-generation): When splitting a mixed InCore function into Cube and Vector sub-functions, the pass must:

     • Identify TPUSH / TPOP operations and their inferred directions.
     • Emit ReserveBuffer nodes in consumer functions with base=auto.
     • Emit ImportPeerBuffer nodes in producer functions referencing the consumer.
     • Set CONSUMER_BUFFER_SIZE = SLOT_NUM * SLOT_SIZE.
     • Insert TFREE calls in consumer kernels at the point where the consumer has finished reading the data from the popped slot. The pass must analyze the data dependency to determine the earliest safe point for TFREE, typically after the last read of the tile variable produced by the corresponding TPOP.

  3. AllocateMemoryAddr pass (reservation and resolution):

     • For ReserveBuffer with base=auto: pick a non-conflicting address, write it back as a resolved constant.
     • For ReserveBuffer with explicit base: validate no overlap, mark as reserved.
     • For ImportPeerBuffer: resolve to the same literal as the paired ReserveBuffer in the peer function.
     • All other tile/temporary allocations must avoid [BASE, BASE + SIZE).

  4. Cross-function constant propagation: The resolved ReserveBuffer.base value must be propagated to all ImportPeerBuffer nodes that reference it. Since both functions exist in the same compilation unit (generated by ExpandMixedKernel or co-compiled manual kernels), this is a straightforward symbol resolution step.

  5. Validation:

     • The declared size must not exceed available SRAM.
     • Every ImportPeerBuffer must have a matching ReserveBuffer in the referenced peer function.
     • On A2A3, ReserveBuffer / ImportPeerBuffer nodes are not generated (ring buffer is in GM via GM_SLOT_BUFFER). If present, the compiler may emit a warning or ignore them.
     • Every TPOP must be paired with a corresponding TFREE in the same kernel. The compiler should verify that no execution path consumes a TPOP without a matching TFREE before the ring buffer wraps.

  6. Platform-conditional code generation: The compiler emits different initialization and data transfer code paths based on PLATFORM_ID. On A2A3, the GM_SLOT_BUFFER argument path is used; on A5, the CONSUMER_BUFFER_BASE constant path is used.
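
The TFREE insertion and the TPOP/TFREE pairing check described in this list can be sketched on a straight-line op sequence. This is a simplified illustration, not the pass itself: ops are hypothetical (opcode, tile) tuples for a single pipe, control flow is ignored, and the check assumes at most one slot is held at a time because the consumer tag advances only in TFREE (the requirement stated above is the looser "before the ring buffer wraps").

```python
def insert_tfree(ops):
    """Insert ('tfree', tile) right after the last op that touches each popped tile."""
    popped = {tile for op, tile in ops if op == "tpop"}
    # Index of the last op (tpop or use) that references each popped tile.
    last_touch = {tile: i for i, (op, tile) in enumerate(ops) if tile in popped}
    out = []
    for i, (op, tile) in enumerate(ops):
        out.append((op, tile))
        if last_touch.get(tile) == i:
            out.append(("tfree", tile))
    return out


def check_pairing(ops):
    """Raise ValueError unless TPOP and TFREE strictly alternate on this pipe."""
    popped = {tile for op, tile in ops if op == "tpop"}
    held = None
    for op, tile in ops:
        if op == "tpop":
            if held is not None:
                raise ValueError(f"TPOP of {tile} while {held} is still held")
            held = tile
        elif op == "tfree":
            if held != tile:
                raise ValueError(f"TFREE of {tile} without a matching TPOP")
            held = None
        elif op == "use" and tile in popped and held != tile:
            raise ValueError(f"{tile} is read outside its TPOP/TFREE window")
    if held is not None:
        raise ValueError(f"{held} is never released with TFREE")


kernel = [("tpop", "a"), ("use", "a"), ("use", "a"), ("tpop", "b"), ("use", "b")]
with_tfree = insert_tfree(kernel)
check_pairing(with_tfree)   # passes: each tile is freed before the next TPOP
```

If a tile is still read after the next TPOP, insert_tfree places its TFREE too late and check_pairing reports the violation, which is the kind of case the data-dependency analysis has to reject or reschedule.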

(To be expanded with future instructions.)