Parallel Tile Operation ISA¶
Overview¶
PTO ISA (Parallel Tile Operation Instruction Set Architecture) defines a machine-independent ISA for Huawei Ascend NPU software. PTO ISA provides a stable low-level programming contract above generation-specific hardware instruction sets, serving as the assembly-language layer of the PTO software stack.
PTO ISA is not the native binary ISA of any single Ascend implementation. It defines the architecture-visible meaning of legal PTO programs and the instruction vocabulary shared by frontends, code generators, verifiers, simulators, and target backends.
Why Tile-First¶
Most Ascend kernels are authored in terms of tiles — bounded multi-dimensional array fragments with layout and valid-region metadata — not anonymous lanes or opaque buffers. A generic SIMD or SIMT model can describe the hardware eventually, but it pushes the important questions into backend-specific folklore:
- Shape and layout legality
- Which elements are meaningful (valid regions)
- When two tiles may alias
- Where synchronization must appear
PTO lifts these questions into the ISA so programs, verifiers, and backends share one testable, portable contract.
See Goals Of PTO for product goals and Tiles And Valid Regions for how tiles work in programs.
Two Compilation Flows¶
PTO programs can be compiled to hardware through two supported paths. Both paths share the same PTO instruction semantics; they differ in how the final binary is produced.
Flow A: High-Level Compile (ptoas → C++ → bisheng → binary)¶
High-level frontends (TileLang, PyPTO, custom DSLs) emit PTO programs as .pto text files. The ptoas tool parses, validates, and lowers these to C++ code that calls the pto-isa C++ library. A backend compiler (bisheng) then compiles this C++ to the target binary.
High-level Frontend
(TileLang, PyPTO, C/C++, ...)
│
▼
.pto file
(PTO program text)
│
▼
ptoas
(PTO assembler & optimizer)
┌─────────────────────────────────────┐
│ Parse, validate, optimize │
│ Lower PTO instructions to C++ calls │
│ Insert synchronization (auto-sync) │
└─────────────────────────────────────┘
│
▼
C++ kernel code
(calls pto-isa C++ intrinsics)
│
▼
bisheng (or backend C++ compiler)
┌─────────────────────────────────────┐
│ Compile to target binary │
│ Target: A2A3 (Ascend 910B / 910C) │
│ Target: A5 (Ascend 950 PR / DT) │
│ Target: CPU simulator │
└─────────────────────────────────────┘
│
▼
Binary
Who uses this flow: Compiler developers, library authors, high-level framework integrators. The .pto text format is portable and can be cached/distributed as bytecode.
Flow B: Direct Assemble (ptoas → binary)¶
PTO programs can also be assembled directly to binary via ptoas with an appropriate backend target. This bypasses the C++ intermediate step.
High-level Frontend
(TileLang, PyPTO, C/C++, ...)
│
▼
.pto file
(PTO program text)
│
▼
ptoas --target=a3|a5|cpu
┌─────────────────────────────────────┐
│ Parse, validate, lower to binary │
│ Directly emit target instructions │
└─────────────────────────────────────┘
│
▼
Binary
Who uses this flow: Performance engineers who need direct control over the final instruction stream, or toolchains that embed ptoas as a pure assembler without a full C++ toolchain.
Which Flow to Use¶
| Criterion | Flow A (ptoas → C++ → bisheng) | Flow B (ptoas → binary) |
|---|---|---|
| Debugging | Full C++ debugging available | Binary only |
| Portability | C++ code is source portable | Binary is target-specific |
| Integration | Easy with existing C++ codebases | Requires custom binary packaging |
| Performance | Depends on C++ compiler | Direct, predictable instruction stream |
| Typical user | Library authors, compiler devs | Kernel engineers, performance tuners |
A Minimal Example¶
The smallest end-to-end PTO program loads two tiles from global memory, adds them element-wise, and stores the result:
#include <pto/pto-inst.hpp>
using namespace pto;
// gc is the destination view in global memory; ga and gb are the sources.
void vec_add(GlobalTensor<float>& gc, const GlobalTensor<float>& ga,
             const GlobalTensor<float>& gb) {
    Tile<float, 16, 16> a, b, c;
    TLOAD(a, ga);    // Load the first operand from global memory
    TLOAD(b, gb);    // Load the second operand from global memory
    TADD(c, a, b);   // Element-wise addition: c = a + b
    TSTORE(gc, c);   // Store the result tile to global memory
}
Even this fragment depends on valid regions, dtype and layout rules, and explicit data movement — ideas the manual unpacks in the programming model, machine model, and per-instruction reference.
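To make the valid-region dependency concrete, the following is a minimal host-side sketch in plain C++. It does not use the pto-isa library: `ModelTile` and `model_tadd` are hypothetical names introduced here, and leaving elements outside the valid region untouched is an assumption of this model, not a statement of what the ISA guarantees.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical host-side model of a tile; this is not the pto-isa Tile type.
template <typename T, std::size_t Rows, std::size_t Cols>
struct ModelTile {
    std::array<T, Rows * Cols> data{};  // row-major storage for the declared shape
    std::size_t valid_rows = Rows;      // Rv
    std::size_t valid_cols = Cols;      // Cv

    T& at(std::size_t r, std::size_t c) { return data[r * Cols + c]; }
    const T& at(std::size_t r, std::size_t c) const { return data[r * Cols + c]; }
};

// Element-wise add restricted to the valid region shared by both sources.
// Model assumption: elements of dst outside the valid region are not written.
template <typename T, std::size_t Rows, std::size_t Cols>
void model_tadd(ModelTile<T, Rows, Cols>& dst,
                const ModelTile<T, Rows, Cols>& a,
                const ModelTile<T, Rows, Cols>& b) {
    assert(a.valid_rows == b.valid_rows && a.valid_cols == b.valid_cols);
    dst.valid_rows = a.valid_rows;
    dst.valid_cols = a.valid_cols;
    for (std::size_t r = 0; r < dst.valid_rows; ++r) {
        for (std::size_t c = 0; c < dst.valid_cols; ++c) {
            dst.at(r, c) = a.at(r, c) + b.at(r, c);
        }
    }
}
```

With a 4x4 declared shape and a valid region of (2, 3), only six elements of the destination are written in this model; the remaining ten keep their previous contents.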
Key Terms¶
| Term | Definition |
|---|---|
| PTO | The programming and instruction model built around tiles, explicit data movement, explicit synchronization, and machine-visible execution structure |
| PTO ISA | The instruction set architecture defined by this manual |
| PTO-AS | The textual assembly syntax for PTO ISA (e.g., tadd %dst, %src0, %src1) |
| ptoas | The assembler and optimizer tool that parses .pto files and lowers them to C++ or directly to binary |
| PTOBC | The bytecode representation used to package PTO programs for transport, caching, and distribution |
| Tile | A bounded multi-dimensional array fragment with shape, layout, and valid-region metadata that is architecturally visible |
| Valid Region | The subset of a tile's declared shape that contains meaningful data, expressed as (Rv, Cv) — valid rows and valid columns |
| Global Memory (GM) | Off-chip device memory (__gm__ address space) shared by all blocks and accessible via GlobalTensor views |
| Vector Tile Buffer | The local tile buffer used for TileType::Vec. On current hardware this is implemented by the Unified Buffer (UB), but PTO treats it as one tile-buffer concept rather than two separate architectural objects. |
| Tile Buffer | On-chip storage for one tile, chosen by TileType: Vec uses the vector tile buffer (hardware UB), Left maps to L0A, Right maps to L0B, Acc maps to L0C, and scale tiles map to the corresponding left/right scale buffers. |
| Location Intent | The declared role of a tile operand: Left (L0A-backed left matmul operand), Right (L0B-backed right matmul operand), Acc (accumulator/output), Vec (vector tile buffer), ScaleLeft, and ScaleRight |
| Block Layout (BLayout) | The in-memory storage order of a tile: RowMajor (row-major, C-contiguous) or ColMajor (column-major, Fortran-contiguous) |
| Stripe Layout (SLayout) | The layout of sub-elements within a tile: NoneBox (uniform rectangular), RowMajor (fractal/strided), ColMajor (fractal/strided) |
| Fractal Layout | A strided layout encoding non-uniform strides for 2D tiles: NZ (row-major fractal), ZN (col-major fractal), FR (row-fractal), RN (row-N-fractal) |
| TileType | Classification of tile buffer role: Vec (vector pipe), Mat (matrix/CUBE pipe), Acc (accumulator), Scalar (scalar tile), Left/Right (matmul operands) |
| MTE | DMA engine sub-unit: MTE1 (GM→UB), MTE2 (GM→UB for loads), MTE3 (tile→GM for stores) |
| Target Profile | A concrete instantiation of PTO ISA for a specific backend: CPU (reference simulator), A2A3 (Ascend 910B / Ascend 910C), A5 (Ascend 950 PR / Ascend 950 DT) |
| Instruction Set | One of the four ISA instruction sets: pto.t* (tile instructions), pto.v* (vector micro-instruction set), pto.* (scalar and control instructions), collective ops (communication instructions) |
| pto.t* | The tile compute instruction set (pto.tadd, pto.tmul, etc.) that operates on tile buffers |
| pto.v* | The low-level vector micro-instruction set that operates on vector registers after an explicit GM→UB→vector data flow |
| Element Type | The dtype of a tile's elements: floating-point (f16, bf16, f32, f8e4m3, f8e5m2), integer (i8–i64, u8–u64), or specialized (hifloat8_t, float4_e*) |
| Auto Mode | Execution mode where the compiler/runtime automatically inserts TASSIGN, TSYNC, and data-movement operations |
| Manual Mode | Execution mode where the author explicitly binds tile resources with TASSIGN and manages synchronization explicitly |
| pto.tget / TGET | Inter-NPU remote read: reads data from a remote NPU's GM to local GM. Both spellings (pto.tget in IR, TGET in C++) refer to the same operation. |
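As a small illustration of the Block Layout entry above, the sketch below computes the linear element offset of a (row, col) coordinate under the two storage orders. `ModelBLayout` and `linear_offset` are hypothetical names for this example only; the sketch covers plain row-major and column-major addressing and does not model stripe or fractal layouts.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical enum mirroring the two BLayout names from the table above.
enum class ModelBLayout { RowMajor, ColMajor };

// Linear element offset of (row, col) inside a rows x cols tile.
inline std::size_t linear_offset(ModelBLayout layout,
                                 std::size_t rows, std::size_t cols,
                                 std::size_t row, std::size_t col) {
    assert(row < rows && col < cols);
    if (layout == ModelBLayout::RowMajor) {
        return row * cols + col;  // consecutive columns are adjacent in memory
    }
    return col * rows + row;      // consecutive rows are adjacent in memory
}
```

For a 16x16 tile, element (1, 0) sits at offset 16 in row-major order and at offset 1 in column-major order.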
Position In The Software Stack¶
PTO ISA sits between source-level frontends and target-specific lowering. Frontends and code generators target PTO ISA; target backends lower PTO ISA to CPU simulation or to supported Ascend NPU targets.
Source Languages
(C/C++, Python, TileLang, PyPTO, code generators)
│
▼
PTO instructions (.pto text)
│
├──► ptoas ──► C++ ──► bisheng ──► binary (Flow A)
│
└──► ptoas ──────────────────► binary (Flow B)
Targets: CPU simulation / A2A3 (Ascend 910B / 910C) / A5 (Ascend 950 PR / 950 DT) / future Ascend NPUs
This structure gives the software stack one versioned instruction language even when native hardware instruction sets and low-level programming rules change across generations.
Hierarchical Abstractions¶
PTO ISA uses hierarchical abstractions rather than one flat opcode space. The ISA is organized into four instruction sets:
PTO ISA
├── Tile Instructions (pto.t*) Primary tile-oriented compute instruction set
│ ├── Sync and Config Resource binding, event setup, tile-local config
│ ├── Elementwise Tile-Tile Lane-wise binary and unary operations
│ ├── Tile-Scalar and Immediate Tile combined with scalar or immediate
│ ├── Reduce and Expand Row/column reductions and expansions
│ ├── Memory and Data Movement GM↔tile transfer, gather/scatter
│ ├── Matrix and Matrix-Vector GEMV, matmul, and variants
│ ├── Layout and Rearrangement Reshape, transpose, extract, insert
│ └── Irregular and Complex Sort, quantize, print, and others
│
├── Vector Instructions (pto.v*) Micro-instruction set for vector pipe
│ ├── Vector Load/Store Predicate-based vector memory access
│ ├── Unary Vector Instructions abs, neg, exp, sqrt, rec, relu, not, etc.
│ ├── Binary Vector Instructions add, sub, mul, div, max, min, shl, shr, etc.
│ ├── Vector-Scalar Instructions Vector combined with scalar operands
│ ├── Conversion Ops Type conversion between numeric types
│ ├── Reduction Instructions Cross-lane reductions (cadd, cmax, etc.)
│ ├── Compare and Select Comparison and conditional selection
│ ├── Data Rearrangement Interleave, slide, shift, permute, pack
│ └── SFU and DSA Instructions Special function units and DSA ops
│
├── Scalar and Control Instruction Set (pto.*) State setup and control shell
│ ├── Pipeline Sync Event and barrier synchronization
│ ├── DMA Copy GM↔vector-tile-buffer transfer configuration
│ ├── Predicate Load/Store Mask-based scalar memory access
│ ├── Predicate Generation pset, pge, plt, pand, por, pxor, pnot, etc.
│ ├── Control and Configuration tile-prefixed mode/config ops such as sethf32mode, settf32mode, and setfmatrix
│ └── Shared Arithmetic/SCF Scalar arithmetic and structured control flow
│
├── Communication Instructions (pto.*) Collective and runtime operations
│ └── Collective Communication TBROADCAST, TGET, TPUT, TREDUCE, etc.
│
└── System Scheduling Instructions (pto.t*) TPipe/TMPipe and lifetime protocol
    └── Scheduling Runtime Control TPUSH, TPOP, TFREE
Tile instructions are the primary programming instruction set. Vector instructions exist for fine-grained vector-pipe control. Scalar and control instructions set up the execution shell around tile payload regions. Communication instructions handle inter-rank communication, while system scheduling instructions handle TPipe/TMPipe flow and resource lifetime.
Machine Model¶
PTO programs run on a hierarchical execution structure:
Grid (whole kernel invocation)
└── Block (AI Core / NPU)
├── Host Interface
├── Scalar Unit (control flow, address calculation)
├── Local Tile Buffers (typed by TileType)
│ ├── Vec buffer = hardware UB
│ ├── Left buffer = L0A
│ ├── Right buffer = L0B
│ └── Acc buffer = L0C
├── Tile Registers ISA abstraction over those local tile buffers
├── DMA Engine
│ ├── MTE1: GM ──► UB (GM→UB, prefetch)
│ ├── MTE2: GM ──► UB (GM→UB, load staging)
│ └── MTE3: UB ──► GM (UB→GM, store)
└── Vector Pipeline (V) (unary/binary/reduce on vector regs)
Execution Hierarchy¶
| Level | Description | PTO Visibility |
|---|---|---|
| Grid | Entire kernel invocation across all participating AI Cores | GetBlockNum(), GetBlockIdx() |
| Block | Single AI Core with local UB, tile regs, and compute units | GetSubBlockNum(), GetSubBlockIdx() |
| Tile Buffer | Per-core on-chip storage for one tile (typed by TileType) | !pto.tile_buf<...> |
| Vector Register | Per-lane on-chip storage for vector compute (N lanes) | !pto.vreg<NxT> |
| Vector Tile Buffer (hardware UB) | On-chip buffer used by TileType::Vec and by the vector micro-instruction path | !pto.ptr<T, ub> |
| Global Memory (GM) | Off-chip device memory shared by all AI Cores | __gm__ T*, !pto.partition_tensor_view<...> |
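A common use of the Grid and Block levels is to split a tensor's rows across blocks. The sketch below shows one way to do that on the host side. `RowRange` and `rows_for_block` are hypothetical helpers, and the `block_num` and `block_idx` arguments stand in for the values a kernel would obtain from GetBlockNum() and GetBlockIdx(); how those calls behave on a given target is defined elsewhere in the manual.

```cpp
#include <cassert>
#include <cstddef>

// Half-open row range [begin, end) assigned to one block.
struct RowRange {
    std::size_t begin;
    std::size_t end;
};

// Split total_rows as evenly as possible across block_num blocks.
// The first (total_rows % block_num) blocks each take one extra row.
inline RowRange rows_for_block(std::size_t total_rows,
                               std::size_t block_num,
                               std::size_t block_idx) {
    assert(block_num > 0 && block_idx < block_num);
    const std::size_t base = total_rows / block_num;
    const std::size_t extra = total_rows % block_num;
    const std::size_t begin =
        block_idx * base + (block_idx < extra ? block_idx : extra);
    const std::size_t len = base + (block_idx < extra ? 1 : 0);
    return {begin, begin + len};
}
```

With 10 rows and 4 blocks, the ranges are [0, 3), [3, 6), [6, 8), and [8, 10), so every row is covered exactly once.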
Target Profiles¶
PTO ISA is instantiated by concrete target profiles that narrow the ISA to the capabilities of a specific backend. Profiles do NOT introduce new ISA semantics; they only restrict which subsets are available.
| Feature | CPU Simulator | A2A3 Profile | A5 Profile |
|---|---|---|---|
| Tile instructions (pto.t*) | Full | Full | Full |
| Vector instructions (pto.v*) | Emulated | Emulated | Full |
| Matmul / CUBE ops | Software fallback | Hardware | Hardware |
| MX format (int8→acc int32) | Not applicable | Not applicable | Supported |
| Fractal layout (NZ/ZN/FR/RN) | Simulated | Simulated | Full |
| Vector tile buffer size | Configurable | 256 KB/core | 256 KB/core |
| Vector width (f32 / f16,bf16 / i8) | N=64 / N=128 / N=256 | N=64 / N=128 / N=256 | N=64 / N=128 / N=256 |
| FP8 types (e4m3 / e5m2) | Not supported | Not supported | Supported |
| Vector unaligned store (vstu) | Not supported | Not supported | Supported |
| Block-scoped collective comm | Not supported | Supported | Supported |
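The vector-width row is consistent with a fixed register size: 64 lanes of 4 bytes, 128 lanes of 2 bytes, and 256 lanes of 1 byte all come to 256 bytes. The sketch below encodes that arithmetic. The 256-byte figure is an inference from the table, not a value stated in this section, and `kAssumedVregBytes`, `lanes_per_vreg`, and `vreg_iterations` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: a 256-byte vector register, inferred from the N values in the table.
constexpr std::size_t kAssumedVregBytes = 256;

// Lanes per vector register for an element type of sizeof(T) bytes.
template <typename T>
constexpr std::size_t lanes_per_vreg() {
    return kAssumedVregBytes / sizeof(T);
}

static_assert(lanes_per_vreg<float>() == 64, "f32: N = 64");
static_assert(lanes_per_vreg<std::uint16_t>() == 128, "16-bit stand-in for f16/bf16: N = 128");
static_assert(lanes_per_vreg<std::int8_t>() == 256, "i8: N = 256");

// Number of vector-register-sized steps needed to cover n elements.
inline std::size_t vreg_iterations(std::size_t n, std::size_t lanes) {
    assert(lanes > 0);
    return (n + lanes - 1) / lanes;  // ceiling division
}
```

Under this assumption, a 16x16 f32 tile (256 elements) takes four full vector steps, while 100 f32 elements take two steps with the second one only partly filled.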
Instruction Syntax Overview¶
PTO instructions use a consistent textual syntax. Three forms are commonly shown:
Assembly Form (PTO-AS)¶
The human-readable assembly spelling — the preferred form for documentation and portable pseudocode:
# Scalar operand suffix: immediate added to each tile element
tadds %dst, %src, 0x3F800000 : !pto.tile<f32, 16, 16>
# Saturating carry variant
taddc %dst, %src0, %src1 : !pto.tile<f32, 16, 16>
# Tile with explicit memory operand: load from GlobalTensor view
tload %tile, %gtensor[%r, %c] : (!pto.tile<f32,16,16>, !pto.memref<f32,1x16x16x16>) -> !pto.tile<f32,16,16>
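If the immediate in the tadds example is read as a raw f32 bit pattern, 0x3F800000 is the IEEE 754 binary32 encoding of 1.0. Whether PTO-AS interprets immediates this way is an assumption here; the operand encoding is defined by the assembly reference. The helper `f32_from_bits` below is a hypothetical name used only to show the conversion.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as an IEEE 754 binary32 value.
inline float f32_from_bits(std::uint32_t bits) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch expects a 32-bit float");
    float value;
    std::memcpy(&value, &bits, sizeof value);  // well-defined type punning
    return value;
}
```

Under that reading, the example adds 1.0 to every element of the source tile.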
SSA Form (AS Level 1)¶
MLIR-style SSA form with explicit types and a named result:
// Tile compute: element-wise addition
%dst = pto.tadd %src0, %src1 : (!pto.tile<f32, 16, 16>, !pto.tile<f32, 16, 16>) -> !pto.tile<f32, 16, 16>
// Tile load: from GlobalTensor partition view
%tile = pto.tload %mem : !pto.partition_tensor_view<1x1x1x16x16xf32> -> !pto.tile_buf<loc=vec, f32, 16, 16, RowMajor, NoneBox, None, Zero>
// Scalar tile comparison
%cmp = pto.tcmps %src, 0 : !pto.tile<f32, 16, 16>, i32 -> !pto.tile<predicate, 16, 16>
DPS Form (AS Level 2)¶
Functional-style form with explicit ins(...) and outs(...) blocks — closest to the C++ intrinsic instruction set:
// Tile compute (DPS)
pto.tadd ins(%src0, %src1 : !pto.tile_buf<f32, 16, 16>, !pto.tile_buf<f32, 16, 16>)
outs(%dst : !pto.tile_buf<f32, 16, 16>)
// Tile load (DPS)
pto.tload ins(%mem : !pto.partition_tensor_view<1x1x1x16x16xf32>)
outs(%tile : !pto.tile_buf<loc=vec, f32, 16, 16, RowMajor, NoneBox, None, Zero>)
// Tile store (DPS)
pto.tstore ins(%tile : !pto.tile_buf<f32, 16, 16>)
outs(%mem : !pto.partition_tensor_view<1x1x1x16x16xf32>)
See Assembly Spelling And Operands for the full syntax specification.
Tile Instructions And Vector Instructions¶
PTO distinguishes two complementary data-flow paths from GM to computed result. Both are architecturally visible; neither is a backend-only detail.
Tile Instructions (pto.t*)¶
Tile instructions operate on tile buffers directly. The complete data path is:
GM ──(MTE2)──► UB ──(implicit)──► Tile Buffer ──(Tile Compute)──► Tile Buffer ──(MTE3)──► GM
│ ▲
└──(vlds/vsts on vector instructions before/after tile instructions)─────────┘
- TLOAD copies data from GM into a tile buffer (via MTE2 → UB → tile)
- Tile compute (TADD, TMATMUL, etc.) operates directly on tile buffers
- TSTORE copies data from a tile buffer to GM through the corresponding local store path
- Valid regions, layout, and tile type are explicit at every step
Vector Instructions (pto.v*)¶
Vector instructions operate on vector registers after an explicit UB staging step. The data path is:
GM ──(copy_gm_to_ubuf)──► UB ──(vlds)──► Vector Register ──(Vector Compute)──► Vector Register ──(vsts)──► UB ──(copy_ubuf_to_gm)──► GM
- copy_gm_to_ubuf/copy_ubuf_to_gm: DMA engine moves data between GM and UB
- vlds/vsld/vgather2: Vector load brings data from UB into vector registers
- Vector compute (vadd, vmul, etc.): operates on vector registers with predicate masking
- vsts/vsst/vscatter: Vector store writes data from vector registers back to UB
- An explicit sync or set_flag/wait_flag sequence establishes producer-consumer ordering between DMA and vector compute
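The load, compute, and store steps above can be sketched as a host-side model in plain C++. This is not the pto.v* instruction set: `model_vlds`, `model_vadd`, and `model_vsts` are hypothetical stand-ins, the 8-lane width is chosen for readability rather than taken from any profile, and the treatment of inactive lanes (zero in the register, skipped on store) is an assumption of the model. DMA staging and synchronization are left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kLanes = 8;  // illustrative lane count, not a real profile width

using VReg = std::array<float, kLanes>;
using Mask = std::array<bool, kLanes>;

// Models a vector load: copy kLanes elements from the staging buffer into a register.
inline VReg model_vlds(const std::vector<float>& ub, std::size_t offset) {
    assert(offset + kLanes <= ub.size());
    VReg r{};
    for (std::size_t i = 0; i < kLanes; ++i) r[i] = ub[offset + i];
    return r;
}

// Models a predicated add. Model assumption: inactive lanes produce 0.0f.
inline VReg model_vadd(const VReg& a, const VReg& b, const Mask& mask) {
    VReg r{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        if (mask[i]) r[i] = a[i] + b[i];
    }
    return r;
}

// Models a predicated store: only active lanes are written back to the buffer.
inline void model_vsts(std::vector<float>& ub, std::size_t offset,
                       const VReg& r, const Mask& mask) {
    assert(offset + kLanes <= ub.size());
    for (std::size_t i = 0; i < kLanes; ++i) {
        if (mask[i]) ub[offset + i] = r[i];
    }
}
```

With a mask that enables only the first five lanes, the store updates five elements of the output region and leaves the other three as they were, which is the per-lane control that the tile path does not expose.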
When To Use Which Instruction Set¶
| Criteria | Tile Instructions (pto.t*) | Vector Instructions (pto.v*) |
|---|---|---|
| Typical use | Dense tensor algebra, matmul, elementwise | Fine-grained vector-pipe control, per-lane masking |
| Data movement | TLOAD/TSTORE (implicit tile↔UB) | copy_gm_to_ubuf / copy_ubuf_to_gm + vlds/vsts |
| Synchronization | TSYNC, set_flag/wait_flag | set_flag/wait_flag on vector pipe, mem_bar |
| Layout control | Via tile layout parameters | Via distribution mode (NORM, BRC, DS, etc.) |
| Predicate support | No per-lane masking | Yes — %mask : !pto.mask<G> on every vector op |
| Target portability | All profiles | A5 hardware; emulated on CPU/A2/A3 |
Audience: Who Reads This Manual¶
This manual serves two primary audiences with different needs:
Compiler Backend Developers¶
You are building or maintaining a compiler that targets PTO ISA. You need to understand:
- The complete instruction inventory and its legality rules
- How PTO-AS maps to your backend's native instructions
- Target profile restrictions (which ops are available on A2/A3 vs A5)
- Layout constraints (which tile layouts are legal for which operations)
- Synchronization contracts (when to insert set_flag/wait_flag pairs)
- The two compilation flows and when to use each
Kernel Writers¶
You are writing PTO programs directly, either in C++ (using pto-isa intrinsics) or in .pto text (using ptoas). You need to understand:
- Tile and valid region semantics (what data is meaningful)
- The tile instructions programming model (TLOAD, TSTORE, TADD, TMATMUL, etc.)
- GlobalTensor and memory layout (how data maps from GM to tiles)
- Auto vs. Manual mode (when the compiler helps vs. when you control everything)
- The synchronization model (TSYNC, set_flag/wait_flag, RecordEvent)
- Collective communication (pto.tbroadcast, pto.tget, pto.tput) for multi-NPU kernels
Scope Of This Manual¶
This manual defines:
- The architecture-visible meaning of PTO instructions
- The programming model, machine model, and memory model of PTO ISA
- The distinction between tile, vector, scalar/control, and communication instructions
- The boundary between core ISA guarantees and target-profile restrictions
This manual is written for:
- Library and kernel authors
- Compiler and code generator developers
- Backend and runtime implementers
- Performance engineers
- Architecture and conformance test authors
See Also¶
- Document structure — Full chapter map
- Goals Of PTO — Design objectives
- Scope And Boundaries — ISA scope and boundaries
- Current PTO ISA Scope — Current scope
- Tiles And Valid Regions — Tile semantics
- Auto Vs Manual — Execution modes
- Format of instruction descriptions — How individual opcode pages are structured