Execution Agents And Target Profiles¶

PTO uses an architecture-visible three-level execution hierarchy: host, device, and core. This structure is not a direct hardware block diagram — it is an abstraction that makes explicit where work is prepared, dispatched, and executed, and where target profiles may differ in capability.

Execution Hierarchy¶

┌─────────────────────────────────────────────────────────┐
│                        HOST                              │
│  CPU: prepares kernel arguments, submits graphs,         │
│  manages runtime orchestration and memory allocation       │
└───────────────────────┬─────────────────────────────────┘
                        │ RPC / AOE / custom transport
                        ▼
┌─────────────────────────────────────────────────────────┐
│                       DEVICE                            │
│  Scheduler: dispatches legal PTO work to cores in       │
│  dependence order, manages device-level memory (GM)     │
└───────────────────────┬─────────────────────────────────┘
                        │ Block dispatch
                        ▼
┌─────────────────────────────────────────────────────────┐
│          BLOCK / AI CORE (one per physical core)       │
│                                                         │
│  ┌────────────────────────────────────────────────────┐ │
│  │  Scalar Unit                                       │ │
│  │  - Control flow, address calculation               │ │
│  │  - System query: GetBlockIdx, GetSubBlockIdx, ...│ │
│  ├────────────────────────────────────────────────────┤ │
│  │  Local Tile Buffers                                │ │
│  │  ┌──────────┬──────────┬──────────┬──────────┐   │ │
│  │  │ Vec / UB │  L0A     │  L0B     │   L0C    │   │ │
│  │  │ vector   │  left    │  right   │   acc    │   │ │
│  │  └────┬─────┴────┬─────┴────┬─────┴──────────┘   │ │
│  │  Tile registers are the ISA abstraction over      │ │
│  │  these role-specific tile buffers.                │ │
│  ├───────┼──────────┼──────────┼───────────────────┤ │
│  │  ┌────▼────┐ ┌───▼───┐ ┌────▼────┐              │ │
│  │  │ Vector  │ │Matrix │ │  DMA    │              │ │
│  │  │Pipeline │ │  M /  │ │ Engine  │              │ │
│  │  │   (V)   │ │ CUBE  │ │MTE1/2/3 │              │ │
│  │  └────┬────┘ └───┬───┘ └─────────┘              │ │
│  └───────┼──────────┼────────────────────────────────┘ │
└──────────┼──────────┼────────────────────────────────────┘
           │          │
           ▼          ▼
        GM (off-chip device memory, shared by all blocks)

Host¶

The host (typically a CPU or the host portion of a heterogeneous SoC):

Prepares kernel arguments and memory descriptors
Submits PTO programs to the device scheduler
Manages graph-level or runtime orchestration (stream queuing, event tracking)
Owns host-side memory used for argument staging

The host does NOT execute PTO instructions directly. It prepares and submits.

Device¶

The device is the architecture-visible scheduling layer. A backend may implement it differently, but it is responsible for:

Dispatching legal PTO work units to AI Core blocks
Maintaining device-level memory (GM) and coherency with host memory
Enforcing dependence order across blocks when required
Managing device-side memory allocation

Core (AI Core)¶

The core (one physical AI Core / NPU) is where PTO instructions execute. It contains:

Component	Description	PTO Visibility
Scalar Unit	Control flow, address calculation, system queries	`GetBlockIdx()`, `GetBlockNum()`, `GetSubBlockIdx()`
Vector tile buffer (hardware UB)	256 KB on-chip SRAM used by `TileType::Vec` and by the vector micro-instruction path	`!pto.ptr<T, ub>`
Local tile buffers	Role-specific local storage: `Left`→L0A, `Right`→L0B, `Acc`→L0C, scale tiles on the corresponding side buffers	`!pto.tile_buf<...>`
Vector Pipeline (V)	Executes `pto.v*` vector micro-instructions on vector registers	`!pto.vreg<NxT>`
Matrix Multiply Unit (M/CUBE)	Executes `pto.tmatmul` and `pto.tgemv`	Via `TileType::Mat`, `TileType::Left`, `TileType::Right`, `TileType::Acc`
DMA Engine (MTE1/MTE2/MTE3)	Moves data between GM and UB; coordinates with pipelines	`copy_gm_to_ubuf`, `copy_ubuf_to_gm`, `TLOAD`, `TSTORE`

Vector Register Architecture (VLane)¶

On A5 (Ascend 950 PR / DT), the vector register is organized as 8 VLanes of 32 bytes each. A VLane is the atomic unit for group reduction operations.

vreg (256 bytes total):
┌─────────┬─────────┬─────────┬─────┬─────────┬─────────┐
│ VLane 0 │ VLane 1 │ VLane 2 │ ... │ VLane 6 │ VLane 7 │
│   32B   │   32B   │   32B   │     │   32B   │   32B   │
└─────────┴─────────┴─────────┴─────┴─────────┴─────────┘

Elements per VLane by data type:

Data Type	Elements/VLane	Total Elements/vreg
i8 / u8	32	256
i16 / u16 / f16 / bf16	16	128
i32 / u32 / f32	8	64
i64 / u64	4	32

The VLane concept is architecturally visible: group reduction operations (vcgadd, vcgmax, vcgmin) reduce within each VLane independently, producing one result per VLane.

MTE Pipeline Detail¶

The DMA engine uses three sub-units that operate concurrently with compute pipelines:

MTE	Direction	Role in Tile Instructions	Role in Vector Instructions
`MTE1`	GM → vector tile buffer	Optional: explicit prefetch	Pre-stage data before vector load
`MTE2`	GM → local tile buffer	Load staging into the selected local tile buffer (via `TLOAD`)	DMA copy: GM→vector tile buffer (via `copy_gm_to_ubuf`)
`MTE3`	local tile buffer → GM	Store from the selected local tile buffer (via `TSTORE`)	DMA copy: vector tile buffer → GM (via `copy_ubuf_to_gm`)

MTE1, MTE2, and MTE3 can operate in parallel with the Vector Pipeline and Matrix Multiply Unit when proper set_flag/wait_flag synchronization is used.

System Query Operations¶

The following operations query the position of the current block within the grid:

Operation	Return	Description
`GetBlockIdx(dim)`	`i32`	0-based index of current block along dimension `dim`
`GetSubBlockIdx(dim)`	`i32`	0-based index of current sub-block within its parent block
`GetBlockNum(dim)`	`i32`	Total number of blocks along dimension `dim`
`GetSubBlockNum(dim)`	`i32`	Total number of sub-blocks within the parent block

These are the only operations that depend on the grid topology. All other tile/vector/scalar operations are block-local.

Target Profiles¶

PTO ISA is instantiated by target profiles that narrow the ISA to the capabilities of a specific backend. A profile does NOT introduce new ISA semantics — it only documents which subsets are available and may add implementation-defined variation points.

Three target profiles are currently defined:

CPU Simulator¶

The CPU simulator (also called the reference simulator) executes PTO programs on the host CPU. Its goals are correctness and debuggability, not performance.

All pto.t* tile instructions operations are emulated in software
All pto.v* vector instructions operations are emulated with scalar loops
Matmul operations use a reference GEMM implementation
Fractal layouts are simulated with strided memory access
UB is allocated from heap memory
The UB size is configurable via build flags

A2A3 Profile¶

The A2A3 profile targets Ascend 910B and Ascend 910C. These targets support:

Full pto.t* tile instructions on hardware
pto.v* vector instructions emulated through a tile-vector bridge (SimdTileToMemrefOp, SimdVecScopeOp)
Hardware matmul via the Matrix Multiply Unit (CUBE)
Fractal layout support on hardware, but with software fallback paths
Vector tile buffer (hardware UB): 256 KB per AI Core
Vector width: N=64 (f32), N=128 (f16/bf16), N=256 (i8)
Support for textract compact modes (ND2NZ, NZ2ND, ND, ND2NZ2)

A5 Profile¶

The A5 profile targets Ascend 950 PR and Ascend 950 DT. These targets support:

Full pto.t* tile instructions on hardware
Full native pto.v* vector instructions on the vector pipeline
Hardware matmul with MX format support (int8 input → int32 accumulator)
Full fractal layout support (NZ, ZN, FR, RN) on hardware
Vector tile buffer (hardware UB): 256 KB per AI Core
MX block-scale formats with explicit TileLeftScale and TileRightScale
FP8 support: float8_e4m3_t (E4M3) and float8_e5m3fn (E5M2)
Native vector unaligned store (vstu / vstus) and alignment state threading
Block-scoped collective communication primitives (TBROADCAST, TGET, TPUT, etc.)
8 VLanes per vector register (group reduction atomic unit)

Target Profile Comparison¶

Feature	CPU Simulator	A2A3 Profile	A5 Profile
Tile instructions (`pto.t*`)	Full (emulated)	Full (hardware)	Full (hardware)
Vector instructions (`pto.v*`)	Emulated (scalar loops)	Emulated (tile-vector bridge)	Full native
Matmul (`TMATMUL`)	Software fallback	Hardware CUBE	Hardware CUBE
MX format (int8→int32 acc)	Not applicable	Not applicable	Supported
Fractal layouts (NZ/ZN/FR/RN)	Simulated	Simulated	Full hardware
Vector tile buffer size	Configurable	256 KB/core	256 KB/core
Vector width (f32 / f16,bf16 / i8)	N=64 / N=128 / N=256	N=64 / N=128 / N=256	N=64 / N=128 / N=256
FP8 types (e4m3 / e5m2)	Not supported	Not supported	Supported
Vector unaligned store (`vstu`)	Not supported	Not supported	Supported
Vector alignment state (`vstu`/`vstas`)	Not supported	Not supported	Supported
`hifloat8_t`, `float4_e*` types	Not supported	Not supported	Supported
Block-scoped collective comm	Not supported	Supported	Supported
Atomic store variants	Not supported	Supported	Supported
`vselr`, `vselrv2` (pair select)	Not supported	Not supported	Supported
TEXTRACT compact modes	Simulated	Supported	Supported
VLane group reduction	Not applicable	Not applicable	Supported

Constraints¶

Constraints

Architecture-visible dependence order MUST survive target scheduling
Target profiles may narrow support, but MUST NOT redefine legal PTO semantics
Machine-model documentation MUST state clearly which facts are portable and which are profile-specific
Programs that depend on profile-specific features (e.g., MX format, FP8, unaligned vector store) are NOT portable across profiles

Cases That Are Not Allowed¶

Cases That Are Not Allowed

Documenting A5-only features as general PTO guarantees
Assuming the CPU simulator's emulation behavior matches hardware performance or cycle-accurate timing
Treating a profile restriction as a contradiction of the ISA (profiles only narrow, never contradict)