Appendix: Cluster ID Mapping and Core Architecture Assumptions¶
Overview¶
This appendix describes the cluster ID (CVID) mapping assumptions for A5 and A2A3 platforms, which underpins the TPUSH/TPOP ring buffer communication design.
Recommended Approach: Logical Block ID as Cluster ID¶
When block_dim <= number of cores, the simplest and recommended approach is to use the logical block ID directly as the cluster ID:
// Recommended: logical block_idx as cluster_id
int cluster_id = get_block_idx();
// GM_SLOT_BUFFER access
my_gm_slot_buffer = GM_SLOT_BUFFER_BASE + cluster_id * PER_CLUSTER_SLOT_BUFFER_SIZE;
Why This Works¶
-
Hardware allocates block_idx: The FFTS block scheduler assigns
block_idxvalues when launching tasks. This is a hardware-provided logical identifier. -
1:1 mapping: When
block_dim <= num_cores, each logical block maps to exactly one physical cluster — no over-subscription. -
No GM communication needed: Both Cube and Vector cores can use
get_block_idx()directly without runtime negotiation. -
No working buffer reservation: The 12.5KB
cv_comm_bufregion for CVID exchange is not required.
Kernel Identification¶
Kernels identify their cluster membership using hardware-provided IDs:
// On Cube (AIC)
int my_cluster = get_block_idx();
// On Vector (AIV)
int my_cluster = get_block_idx();
int my_aiv_idx = get_subblockid(); // 0 or 1
Platform Architecture Comparison¶
| Aspect | A5 | A2A3 |
|---|---|---|
| Architecture | Tightly coupled | Decoupled |
| Cluster Binding | Hardware-fixed 1:2 mapping | Task scheduler-bound |
| Sync Mechanism | SET intra-block | SET cross-core via FFTS |
| Local Datapath | L0C↔UB, UB↔L1 direct | Via GM staging |
Cross-Core Synchronization Mechanism¶
FFTS Semaphore IDs¶
Each cluster has 16 semaphore IDs available for cross-core synchronization via set_cross_core and wait_flag_dev:
Cluster Semaphore Resources:
+---------------------------------------------------------------+
| 16 Semaphore IDs per Cluster (ID 0-15) |
| |
| Each ID has a 4-bit semaphore value (0-15) |
| Can control 0-15 FIFO slots per semaphore |
| |
+---------------------------------------------------------------+
TPUSH/TPOP Semaphore Allocation¶
TPUSH/TPOP uses 4 semaphore IDs for bidirectional Cube-Vector communication:
| ID | Direction | Purpose |
|---|---|---|
| 0 | C→V | Cube signals data ready for Vector |
| 1 | C→V | Vector signals slot free to Cube |
| 2 | V→C | Vector signals data ready for Cube |
| 3 | V→C | Cube signals slot free to Vector |
Semaphore Operations¶
// Producer signals data ready (increment semaphore by 1)
// pipe: VEC, MTE, CUBE, or FIX (avoids SU barrier)
// Uses mode2 for 1:2 cluster configuration
set_cross_core(pipe, semaphore_id);
// Consumer waits for data (decrement semaphore, blocks if 0)
wait_flag_dev(semaphore_id);
Constraints: - Increment is always 1 (not configurable) - Must specify pipe (VEC/MTE/CUBE/FIX) to avoid SU barrier stalls - Uses mode2 for 1:2 cluster configuration
Mode2 Semantics (1:2 Configuration)¶
Under the 1:2 cluster configuration, set_cross_core and wait_flag_dev have special broadcast/reduce semantics:
| Direction | Operation | Semantics |
|---|---|---|
| C→V | set_cross_core |
Broadcast: Block sets semaphore for both subblocks (AIV0 + AIV1) |
| C→V | wait_flag_dev |
Each Vector core waits independently |
| V→C | set_cross_core |
Each Vector core sets its own semaphore |
| V→C | wait_flag_dev |
Reduce: Cube waits for both Vector subblocks to set |
C→V Broadcast (set_cross_core from Cube):
AIC ──set──┬──> AIV0 semaphore++
└──> AIV1 semaphore++
V→C Reduce (wait_flag_dev on Cube):
AIV0 ──set──┐
├──> AIC waits for BOTH
AIV1 ──set──┘
This ensures correct synchronization for the 1:2 Cube-Vector cluster topology without requiring separate signaling to each Vector core.
4-bit Semaphore Range¶
Each semaphore ID has a 4-bit counter (values 0-15), which limits the maximum number of outstanding FIFO slots:
Semaphore value range: 0-15
- Value 0: No slots available (consumer blocks on wait_flag_dev)
- Value 1-15: N slots available
- Maximum outstanding slots: 15 per direction
This matches the ring buffer design where each direction can have up to 8 slots (well within the 15-slot semaphore limit).
Cluster Binding Flow¶
Hardware Block/Subblock Allocation¶
The block_idx and subblock_id are allocated by hardware (FFTS block scheduler), not by software. When FFTS launches a mixed kernel, it creates logical clusters with a 1:2 block-subblock relationship:
FFTS Mixed Kernel Kickstart:
+---------------------------------------------------------------------+
| FFTS Block Scheduler (Hardware) |
| |
| Allocates: block_idx, subblock_id per core |
| Creates: 1 block + 2 subblocks (1:2 ratio) per cluster |
| |
| +-------------------+ +-------------------+ |
| | Cluster 0 | | Cluster 1 | ... |
| | block_idx=0 | | block_idx=1 | |
| | AIC (block) | | AIC (block) | |
| | AIV0 (subblk 0) | | AIV0 (subblk 0) | |
| | AIV1 (subblk 1) | | AIV1 (subblk 1) | |
| +-------------------+ +-------------------+ |
| |
+---------------------------------------------------------------------+
AICPU Handshake for Core Mapping¶
When AICPU needs to launch a runtime on a cluster, it must obtain the core-to-block/subblock mapping via handshake with the hardware scheduler, rather than allocating these IDs itself:
AICPU Runtime Launch:
+---------------------------------------------------------------------+
| AICPU |
| |
| 1. Request cluster allocation from HW scheduler |
| 2. Receive mapping: physical_core_id <-> (block_idx, subblock_id) |
| 3. Initialize ffts_addr for cross-core synchronization |
| 4. Launch runtime on assigned cores with consistent block_idx |
| |
+---------------------------------------------------------------------+
|
| Handshake
v
+---------------------------------------------------------------------+
| FFTS / HW Scheduler |
| |
| Provides: block_idx, subblock_id assignments for physical cores |
| Ensures: same 1:2 cluster structure as kernel launches |
| |
+---------------------------------------------------------------------+
This ensures TPUSH/TPOP ring buffer operations work correctly whether launched via FFTS directly or through AICPU runtime.
A3 ffts_addr Initialization¶
On A3, the ffts_addr must be initialized during the AICPU handshake process to enable cross-core synchronization via set_cross_core and wait_flag_dev:
- ffts_addr: Base address for FFTS semaphore registers
- Initialization timing: Must be done before any cross-core sync operations
- Scope: Per-cluster, shared by all cores (AIC + AIV0 + AIV1) in the cluster
This initialization is part of the AICPU handshake (step 3 above) and ensures the semaphore IDs (0-15) are correctly mapped to hardware registers for the assigned cluster.
A3 FFTS Scheduler and Logical Cluster Setup¶
Current TPUSH/TPOP Implementation on A3¶
The A3 TPUSH/TPOP implementation relies on FFTS cross-core synchronization features. During mixed kernel kickstart, the FFTS hardware establishes a logical cluster through the block scheduler.
Logical-to-Physical Core Mapping¶
The FFTS hardware builds the logical-to-physical core mapping at task launch time:
-
Block ID → Physical Cube Core: The block scheduler assigns a physical AIC core to each logical block.
-
Subblock ID → Physical Vector Core: Each subblock (0, 1) maps to a physical AIV core that becomes a buddy of the assigned Cube.
-
Intra-cluster sync resolution: The FFTS hardware resolves all intra-cluster synchronization and communication paths based on this mapping.
Appendix A: Generic Core ID-Based CVID Computation¶
Note: This appendix documents the generic implementation used for
block_dim > num_coresSIMD mode. Not recommended for PyPTO with MPMD AICPU runtime — use the logical block_idx approach described in the main document instead.
Constants Reference¶
| Constant | A5 Value | A2A3 Value | Description |
|---|---|---|---|
CORE_PER_DIE |
18 | 25 | Clusters per die |
AIV_RATIO |
2 | 2 | Vector cores per Cube |
AIC_AIV_PER_DIE |
54 | 75 | Total cores per die (AIC + AIV) |
SEMAPHORE_IDS |
16 | 16 | Semaphore IDs per cluster |
TPUSH_TPOP_SEMA_IDS |
4 | 4 | IDs used for CV bidirectional comm |
SEMA_BITS |
4 | 4 | Bits per semaphore (0-15 slots) |
CV_MAX_CORES |
36 | 25 | Max clusters supported |
Related Documents¶
This section documents the generic implementation that computes cluster ID from physical core ID. This is provided as a fallback for scenarios where block_dim > num_cores (SIMD over-subscription) or when direct block_idx mapping is not available.
A5: Direct Core ID Computation¶
On A5, each die contains 18 core clusters with a fixed 1:2 architecture. The cluster ID is computed directly from the physical core ID:
// A5 TSYNC_CVID implementation (generic)
#ifdef __DAV_CUBE__
int die_id = get_coreid() / AIC_AIV_PER_DIE; // AIC_AIV_PER_DIE = 54
comm_slot = die_id * CORE_PER_DIE + get_coreid() % AIC_AIV_PER_DIE;
#elif defined(__DAV_VEC__)
int die_id = get_coreid() / AIC_AIV_PER_DIE;
comm_slot = die_id * CORE_PER_DIE +
(((get_coreid() % AIC_AIV_PER_DIE) - CORE_PER_DIE - get_subblockid()) / AIV_RATIO);
#endif
Key properties: - No runtime communication needed - Deterministic mapping: core ID → cluster ID is a pure function - Hardware-enforced 1:2 relationship: L0C↔UB and UB↔L1 local datapaths exist within each cluster
A2A3: Core ID via GM Exchange (Generic)¶
On A2A3, when using the generic implementation, the cluster ID is communicated through GM:
// A2A3 TSYNC_CVID implementation (generic)
#ifdef __DAV_CUBE__
// Cube core writes its core ID to GM slot
comm_slot = static_cast<int>(get_coreid() & 0x7f);
comm_slot %= CV_MAX_CORES;
// Write to GM slot and flush cache
__gm__ volatile uint32_t *comm_slot_ptr = reinterpret_cast<__gm__ volatile uint32_t *>(
cv_comm_buf + static_cast<std::size_t>(block_idx) * CV_COMM_SLOT_BYTES);
comm_slot_ptr[0] = static_cast<uint32_t>(comm_slot);
dcci(comm_slot_ptr, SINGLE_CACHE_LINE);
dsb(DSB_DDR);
// Signal Vector cores via FFTS
ffts_cross_core_sync(PIPE_MTE2, _getFFTSMsg(CV_CORE_SYNC, CV_COMM_CTRL));
#elif defined(__DAV_VEC__)
// Vector core waits for Cube's signal, then reads cluster ID from GM
__gm__ volatile uint32_t *comm_slot_ptr = reinterpret_cast<__gm__ volatile uint32_t *>(
cv_comm_buf + static_cast<std::size_t>(block_idx) * CV_COMM_SLOT_BYTES);
dcci(comm_slot_ptr, SINGLE_CACHE_LINE);
wait_flag_dev(CV_COMM_CTRL);
comm_slot = static_cast<int>(comm_slot_ptr[0]);
#endif
Appendix B: A2A3 Working Buffer Reservation (Generic Implementation)¶
When using the generic core ID-based CVID computation (Appendix A), A2A3 requires a reserved region at the bottom of the working buffer for cv_comm_buf slots. This is needed for scenarios where block_dim > num_cores (SIMD over-subscription).
Reserved Space Calculation¶
CV_COMM_SLOT_BYTES = 512 bytes (per block, 512B aligned)
CV_MAX_CORES = 25 (max block_dim)
Reserved space = CV_COMM_SLOT_BYTES * CV_MAX_CORES
= 512 * 25
= 12,800 bytes
= 12.5 KB (round up to 16KB for alignment)
Note: This reservation is not required when using the recommended block_idx as cluster ID approach (when block_dim <= num_cores).
Memory Layout (Generic Implementation)¶
A2A3 Working Buffer (GM) - Generic Implementation Only:
+------------------------------------------------------------------+
| Bottom 12.5KB: Reserved for cv_comm_buf (CVID negotiation) |
| |
| +------------+------------+------------+-----+------------+ |
| | block_idx=0| block_idx=1| block_idx=2| ... | block_idx=24| |
| | 512B | 512B | 512B | | 512B | |
| +------------+------------+------------+-----+------------+ |
| |
+------------------------------------------------------------------+
| Remaining space: Available for GM_SLOT_BUFFER, task data, etc. |
+------------------------------------------------------------------+
A5: No Working Buffer Reservation Needed¶
On A5, CVID is computed directly from get_coreid() without any GM communication. No working buffer reservation is required regardless of block_dim.
Constants (Generic Implementation Only)¶
| Constant | A5 Value | A2A3 Value | Description |
|---|---|---|---|
CV_COMM_SLOT_BYTES |
512 | 512 | Bytes per block's comm slot |
CV_COMM_RESERVED |
0 | 12.5KB | Working buffer reservation |