Execution Agents And Target Profiles
PTO uses an architecture-visible three-level execution hierarchy: host, device, and core. This structure is not a direct hardware block diagram — it is an abstraction that makes explicit where work is prepared, dispatched, and executed, and where target profiles may differ in capability.
Execution Hierarchy
┌─────────────────────────────────────────────────────────┐
│ HOST │
│ CPU: prepares kernel arguments, submits graphs, │
│ manages runtime orchestration and memory allocation │
└───────────────────────┬─────────────────────────────────┘
│ RPC / AOE / custom transport
▼
┌─────────────────────────────────────────────────────────┐
│ DEVICE │
│ Scheduler: dispatches legal PTO work to cores in │
│ dependence order, manages device-level memory (GM) │
└───────────────────────┬─────────────────────────────────┘
│ Block dispatch
▼
┌─────────────────────────────────────────────────────────┐
│ BLOCK / AI CORE (one per physical core) │
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Scalar Unit │ │
│ │ - Control flow, address calculation │ │
│ │ - System query: GetBlockIdx, GetSubBlockIdx, ...│ │
│ ├────────────────────────────────────────────────────┤ │
│ │ Local Tile Buffers │ │
│ │ ┌──────────┬──────────┬──────────┬──────────┐ │ │
│ │ │ Vec / UB │ L0A │ L0B │ L0C │ │ │
│ │ │ vector │ left │ right │ acc │ │ │
│ │ └────┬─────┴────┬─────┴────┬─────┴──────────┘ │ │
│ │ Tile registers are the ISA abstraction over │ │
│ │ these role-specific tile buffers. │ │
│ ├───────┼──────────┼──────────┼───────────────────┤ │
│ │ ┌────▼────┐ ┌───▼───┐ ┌────▼────┐ │ │
│ │ │ Vector │ │Matrix │ │ DMA │ │ │
│ │ │Pipeline │ │ M / │ │ Engine │ │ │
│ │ │ (V) │ │ CUBE │ │MTE1/2/3 │ │ │
│ │ └────┬────┘ └───┬───┘ └─────────┘ │ │
│ └───────┼──────────┼────────────────────────────────┘ │
└──────────┼──────────┼────────────────────────────────────┘
│ │
▼ ▼
GM (off-chip device memory, shared by all blocks)
Host
The host (typically a CPU or the host portion of a heterogeneous SoC):
- Prepares kernel arguments and memory descriptors
- Submits PTO programs to the device scheduler
- Manages graph-level or runtime orchestration (stream queuing, event tracking)
- Owns host-side memory used for argument staging
The host does NOT execute PTO instructions directly. It prepares and submits.
Device
The device is the architecture-visible scheduling layer. A backend may implement it differently, but it is responsible for:
- Dispatching legal PTO work units to AI Core blocks
- Maintaining device-level memory (GM) and coherency with host memory
- Enforcing dependence order across blocks when required
- Managing device-side memory allocation
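The dependence-order responsibility can be illustrated with a small model. The sketch below is not a PTO API: the work-unit names and the `dispatch_order` helper are hypothetical, and Python's standard `graphlib` module is used only to show that any legal dispatch order places a work unit after everything it depends on.

```python
from graphlib import TopologicalSorter


def dispatch_order(deps):
    """Return one legal dispatch order for a dependence graph.

    deps maps each work unit to the set of units it depends on.
    """
    return list(TopologicalSorter(deps).static_order())


# Hypothetical work units: two loads feed a matmul, which feeds a store.
deps = {
    "load_a": set(),
    "load_b": set(),
    "matmul": {"load_a", "load_b"},
    "store_c": {"matmul"},
}

order = dispatch_order(deps)
position = {unit: i for i, unit in enumerate(order)}

# Every unit appears after all of its dependences.
for unit, preds in deps.items():
    for pred in preds:
        assert position[pred] < position[unit]
```

The two loads have no mutual dependence, so a scheduler is free to issue them in either order or concurrently; only the edges in the graph constrain the result.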
Core (AI Core)
The core (one physical AI Core / NPU) is where PTO instructions execute. It contains:
| Component | Description | PTO Visibility |
|---|---|---|
| Scalar Unit | Control flow, address calculation, system queries | GetBlockIdx(), GetBlockNum(), GetSubBlockIdx() |
| Vector tile buffer (hardware UB) | 256 KB on-chip SRAM used by TileType::Vec and by the vector micro-instruction path | !pto.ptr<T, ub> |
| Local tile buffers | Role-specific local storage: Left→L0A, Right→L0B, Acc→L0C, scale tiles on the corresponding side buffers | !pto.tile_buf<...> |
| Vector Pipeline (V) | Executes pto.v* vector micro-instructions on vector registers | !pto.vreg<NxT> |
| Matrix Multiply Unit (M/CUBE) | Executes pto.tmatmul and pto.tgemv | Via TileType::Mat, TileType::Left, TileType::Right, TileType::Acc |
| DMA Engine (MTE1/MTE2/MTE3) | Moves data between GM and UB; coordinates with pipelines | copy_gm_to_ubuf, copy_ubuf_to_gm, TLOAD, TSTORE |
Vector Register Architecture (VLane)
On A5 (Ascend 950 PR / DT), the vector register is organized as 8 VLanes of 32 bytes each. A VLane is the atomic unit for group reduction operations.
vreg (256 bytes total):
┌─────────┬─────────┬─────────┬─────┬─────────┬─────────┐
│ VLane 0 │ VLane 1 │ VLane 2 │ ... │ VLane 6 │ VLane 7 │
│ 32B │ 32B │ 32B │ │ 32B │ 32B │
└─────────┴─────────┴─────────┴─────┴─────────┴─────────┘
Elements per VLane by data type:
| Data Type | Elements/VLane | Total Elements/vreg |
|---|---|---|
| i8 / u8 | 32 | 256 |
| i16 / u16 / f16 / bf16 | 16 | 128 |
| i32 / u32 / f32 | 8 | 64 |
| i64 / u64 | 4 | 32 |
The VLane concept is architecturally visible: group reduction operations (vcgadd, vcgmax, vcgmin) reduce within each VLane independently, producing one result per VLane.
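The table values follow directly from the 32-byte VLane size, and the per-VLane reduction can be modeled in a few lines. This is an illustrative sketch only: `elements_per_vlane` and `group_add` are hypothetical helpers, and the model captures just the "one result per VLane" behavior stated above, not the result placement, masking, or typing rules of the real vcgadd instruction.

```python
VREG_BYTES = 256                         # vector register size on A5
VLANE_BYTES = 32                         # size of one VLane
NUM_VLANES = VREG_BYTES // VLANE_BYTES   # 8 VLanes per register


def elements_per_vlane(elem_bytes):
    """Number of elements of the given width that fit in one VLane."""
    return VLANE_BYTES // elem_bytes


def group_add(vreg, elem_bytes):
    """Model a group add: sum each VLane independently."""
    n = elements_per_vlane(elem_bytes)
    assert len(vreg) == n * NUM_VLANES
    return [sum(vreg[i * n:(i + 1) * n]) for i in range(NUM_VLANES)]


# f32 example: 64 elements in the register, 8 per VLane.
vreg_f32 = list(range(64))
sums = group_add(vreg_f32, elem_bytes=4)
```

Here `sums` has eight entries, one per VLane; the first is the sum of elements 0 through 7 and the last is the sum of elements 56 through 63.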
MTE Pipeline Detail
The DMA engine uses three sub-units that operate concurrently with compute pipelines:
| MTE | Direction | Role in Tile Instructions | Role in Vector Instructions |
|---|---|---|---|
| MTE1 | GM → vector tile buffer | Optional: explicit prefetch | Pre-stage data before vector load |
| MTE2 | GM → local tile buffer | Load staging into the selected local tile buffer (via TLOAD) | DMA copy: GM → vector tile buffer (via copy_gm_to_ubuf) |
| MTE3 | local tile buffer → GM | Store from the selected local tile buffer (via TSTORE) | DMA copy: vector tile buffer → GM (via copy_ubuf_to_gm) |
MTE1, MTE2, and MTE3 can operate in parallel with the Vector Pipeline and Matrix Multiply Unit when proper set_flag/wait_flag synchronization is used.
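The overlap of data movement and compute can be pictured as a ping-pong buffer scheme guarded by flags. The sketch below is a host-side analogy, not PTO code: Python threads stand in for the DMA engine and the vector pipeline, semaphores stand in for set_flag/wait_flag (whose real arguments are not described in this section), and `run_pipeline` is a hypothetical name.

```python
import threading

NUM_BUFFERS = 2  # ping-pong buffers in the modeled local storage


def run_pipeline(tiles):
    """Load tiles and reduce them concurrently using two buffers."""
    buffers = [None] * NUM_BUFFERS
    # filled[i]: loader -> compute flag; free[i]: compute -> loader flag.
    filled = [threading.Semaphore(0) for _ in range(NUM_BUFFERS)]
    free = [threading.Semaphore(1) for _ in range(NUM_BUFFERS)]
    results = []

    def loader():  # stands in for the DMA engine
        for i, tile in enumerate(tiles):
            slot = i % NUM_BUFFERS
            free[slot].acquire()       # wait: buffer is no longer in use
            buffers[slot] = list(tile)
            filled[slot].release()     # signal: data is ready

    def compute():  # stands in for the vector pipeline
        for i in range(len(tiles)):
            slot = i % NUM_BUFFERS
            filled[slot].acquire()     # wait: data is ready
            results.append(sum(buffers[slot]))
            free[slot].release()       # signal: buffer can be reused

    workers = [threading.Thread(target=loader),
               threading.Thread(target=compute)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


tiles = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
pipeline_results = run_pipeline(tiles)
```

While the compute side works on one buffer, the loader may already be filling the other. Without the two flags per buffer, the loader could overwrite data that is still being read, which is the hazard the synchronization exists to prevent.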
System Query Operations
The following operations query the position of the current block within the grid:
| Operation | Return | Description |
|---|---|---|
| GetBlockIdx(dim) | i32 | 0-based index of current block along dimension dim |
| GetSubBlockIdx(dim) | i32 | 0-based index of current sub-block within its parent block |
| GetBlockNum(dim) | i32 | Total number of blocks along dimension dim |
| GetSubBlockNum(dim) | i32 | Total number of sub-blocks within the parent block |
These are the only operations that depend on the grid topology. All other tile/vector/scalar operations are block-local.
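A typical use of these queries is to decide which part of a global tensor the current block owns. The sketch below shows one common partitioning scheme in plain Python; it is an assumption for illustration, not something PTO prescribes. `block_slice` is a hypothetical helper, and its `block_idx` and `block_num` parameters stand in for the values a kernel would obtain from GetBlockIdx(dim) and GetBlockNum(dim).

```python
def block_slice(total_rows, block_idx, block_num):
    """Return the [start, stop) row range owned by one block.

    Rows are split as evenly as possible; the first
    (total_rows % block_num) blocks receive one extra row.
    """
    base, extra = divmod(total_rows, block_num)
    start = block_idx * base + min(block_idx, extra)
    stop = start + base + (1 if block_idx < extra else 0)
    return start, stop


# 10 rows split across 4 blocks.
slices = [block_slice(10, idx, 4) for idx in range(4)]
```

The resulting ranges are contiguous and disjoint, so each block can work on its slice without coordinating with the others, which matches the block-local nature of all other operations.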
Target Profiles
PTO ISA is instantiated by target profiles that narrow the ISA to the capabilities of a specific backend. A profile does NOT introduce new ISA semantics — it only documents which subsets are available and may add implementation-defined variation points.
Three target profiles are currently defined:
CPU Simulator
The CPU simulator (also called the reference simulator) executes PTO programs on the host CPU. Its goals are correctness and debuggability, not performance.
- All pto.t* tile instructions are emulated in software
- All pto.v* vector instructions are emulated with scalar loops
- Matmul operations use a reference GEMM implementation
- Fractal layouts are simulated with strided memory access
- UB is allocated from heap memory
- The UB size is configurable via build flags
A2A3 Profile
The A2A3 profile targets Ascend 910B and Ascend 910C. These targets support:
- Full pto.t* tile instructions on hardware
- pto.v* vector instructions emulated through a tile-vector bridge (SimdTileToMemrefOp, SimdVecScopeOp)
- Hardware matmul via the Matrix Multiply Unit (CUBE)
- Fractal layout support on hardware, but with software fallback paths
- Vector tile buffer (hardware UB): 256 KB per AI Core
- Vector width: N=64 (f32), N=128 (f16/bf16), N=256 (i8)
- Support for textract compact modes (ND2NZ, NZ2ND, ND, ND2NZ2)
A5 Profile
The A5 profile targets Ascend 950 PR and Ascend 950 DT. These targets support:
- Full pto.t* tile instructions on hardware
- Full native pto.v* vector instructions on the vector pipeline
- Hardware matmul with MX format support (int8 input → int32 accumulator)
- Full fractal layout support (NZ, ZN, FR, RN) on hardware
- Vector tile buffer (hardware UB): 256 KB per AI Core
- MX block-scale formats with explicit TileLeftScale and TileRightScale
- FP8 support: float8_e4m3_t (E4M3) and float8_e5m3fn (E5M2)
- Native vector unaligned store (vstu/vstus) and alignment state threading
- Block-scoped collective communication primitives (TBROADCAST, TGET, TPUT, etc.)
- 8 VLanes per vector register (group reduction atomic unit)
Target Profile Comparison
| Feature | CPU Simulator | A2A3 Profile | A5 Profile |
|---|---|---|---|
| Tile instructions (pto.t*) | Full (emulated) | Full (hardware) | Full (hardware) |
| Vector instructions (pto.v*) | Emulated (scalar loops) | Emulated (tile-vector bridge) | Full native |
| Matmul (TMATMUL) | Software fallback | Hardware CUBE | Hardware CUBE |
| MX format (int8→int32 acc) | Not applicable | Not applicable | Supported |
| Fractal layouts (NZ/ZN/FR/RN) | Simulated | Simulated | Full hardware |
| Vector tile buffer size | Configurable | 256 KB/core | 256 KB/core |
| Vector width (f32 / f16,bf16 / i8) | N=64 / N=128 / N=256 | N=64 / N=128 / N=256 | N=64 / N=128 / N=256 |
| FP8 types (e4m3 / e5m2) | Not supported | Not supported | Supported |
| Vector unaligned store (vstu) | Not supported | Not supported | Supported |
| Vector alignment state (vstu/vstas) | Not supported | Not supported | Supported |
| hifloat8_t, float4_e* types | Not supported | Not supported | Supported |
| Block-scoped collective comm | Not supported | Supported | Supported |
| Atomic store variants | Not supported | Supported | Supported |
| vselr, vselrv2 (pair select) | Not supported | Not supported | Supported |
| TEXTRACT compact modes | Simulated | Supported | Supported |
| VLane group reduction | Not applicable | Not applicable | Supported |
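A comparison table like this lends itself to a simple portability check: given the features a program relies on, which profiles can run it? The sketch below encodes a few rows of the table by hand. The feature keys, profile keys, and `supporting_profiles` helper are hypothetical names chosen for illustration, and "Not applicable" entries are treated as unsupported.

```python
PROFILES = ("cpu_sim", "a2a3", "a5")

# Subset of the comparison table; True means "Supported".
SUPPORT = {
    "mx_format":       {"cpu_sim": False, "a2a3": False, "a5": True},
    "fp8":             {"cpu_sim": False, "a2a3": False, "a5": True},
    "vstu":            {"cpu_sim": False, "a2a3": False, "a5": True},
    "collective_comm": {"cpu_sim": False, "a2a3": True,  "a5": True},
    "atomic_store":    {"cpu_sim": False, "a2a3": True,  "a5": True},
}


def supporting_profiles(features):
    """Return the profiles that support every feature in the list."""
    return [p for p in PROFILES
            if all(SUPPORT[f][p] for f in features)]


fp8_targets = supporting_profiles(["fp8"])
comm_targets = supporting_profiles(["collective_comm"])
baseline_targets = supporting_profiles([])
```

A program that uses none of the listed features passes the check on all three profiles, while one that uses FP8 is limited to A5.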
Constraints
- Architecture-visible dependence order MUST survive target scheduling
- Target profiles may narrow support, but MUST NOT redefine legal PTO semantics
- Machine-model documentation MUST state clearly which facts are portable and which are profile-specific
- Programs that depend on profile-specific features (e.g., MX format, FP8, unaligned vector store) are NOT portable across profiles
Cases That Are Not Allowed
- Documenting A5-only features as general PTO guarantees
- Assuming the CPU simulator's emulation behavior matches hardware performance or cycle-accurate timing
- Treating a profile restriction as a contradiction of the ISA (profiles only narrow, never contradict)