PTO Micro-Instruction: BlockDim and Runtime Query Operations¶

This page documents the PTO micro-instruction runtime query operations that expose block-level execution coordinates to scalar code. These ops are part of the PTO micro-instruction surface (A5 Ascend 950 profile) and are distinct from the Tile-level ISA.

Overview¶

These ops expose the current kernel instance's execution coordinates to scalar code. They are the PTO-level equivalent of runtime queries such as GetBlockIdx() and GetBlockNum() in GPU kernel programming models.

Use them when the same kernel body is launched across multiple blocks or subblocks and each execution instance must figure out which slice of the global workload it owns.

Mechanism¶

The block-dimension query operations are pure scalar producers. They do not move data or synchronize pipelines; instead they expose launch-time execution coordinates so surrounding scalar arithmetic and pointer formation can derive the local GM or UB window owned by the current block or subblock.

BlockDim Query Operations¶

Common Pattern¶

A common pattern is: - Split the full input/output tensor into block_num disjoint block-sized regions - Let each block compute its own starting offset from block_idx - Within one block, further tile the local region and drive the tile loop with ordinary scalar arith / scf ops

For example, if a tensor is split evenly across 8 blocks and each block handles block_length = 2048 elements, then block b owns the global range [b * block_length, (b + 1) * block_length). The per-block GM base pointer can be formed by adding block_idx * block_length elements to the original base pointer.

Execution Model¶

At the PTO micro-instruction level, these runtime-query ops are pure scalar producers. They do not perform data movement, do not allocate memory, and do not by themselves create tiling or double buffering. Instead, they provide the scalar values used by surrounding address computation and structured control flow.

`pto.get_block_idx`¶

Syntax: %block = pto.get_block_idx

Result: i64

Semantics: Return the current block ID in the range [0, pto.get_block_num()).

block = block_idx();

Inputs¶

None.

Expected Outputs¶

Result	Type	Description
`%block`	`i64`	Current block ID in range `[0, block_num)`

Constraints¶

The returned value is in the range [0, get_block_num()).
These ops are valid only within a kernel launch context that defines block dimensions.

Examples¶

// Get current block index
%block = pto.get_block_idx

`pto.get_subblock_idx`¶

Syntax: %subblock = pto.get_subblock_idx

Result: i64

Semantics: Return the current subblock ID in the range [0, pto.get_subblock_num()).

subblock = subblock_idx();

Inputs¶

None.

Expected Outputs¶

Result	Type	Description
`%subblock`	`i64`	Current subblock ID in range `[0, subblock_num)`

Constraints¶

The returned value is in the range [0, get_subblock_num()).

Examples¶

// Get current subblock index
%subblock = pto.get_subblock_idx

`pto.get_block_num`¶

Syntax: %block_num = pto.get_block_num

Result: i64

Semantics: Return the total number of launched blocks visible to the current kernel instance.

block_num = block_num();

Inputs¶

None.

Expected Outputs¶

Result	Type	Description
`%block_num`	`i64`	Total number of launched blocks

Constraints¶

The returned value is a positive integer representing the total block count.

Examples¶

// Get total number of blocks
%block_num = pto.get_block_num

`pto.get_subblock_num`¶

Syntax: %subblock_num = pto.get_subblock_num

Result: i64

Semantics: Return the total number of visible subblocks for the current execution instance.

subblock_num = subblock_num();

Inputs¶

None.

Expected Outputs¶

Result	Type	Description
`%subblock_num`	`i64`	Total number of visible subblocks

Constraints¶

The returned value is a positive integer representing the total subblock count per block.

Typical Usage: Block-Level Data Partitioning¶

// Get block-level coordinates
%block = pto.get_block_idx
%block_num = pto.get_block_num

// Compute per-block parameters
%block_len = arith.constant 2048 : index
%block_len_i64 = arith.index_cast %block_len : index to i64

// Compute block offset
%base = arith.index_cast %block : i64 to index
%offset = arith.muli %base, %block_len : index

// Adjust GM base pointers for this block
%block_in = pto.addptr %gm_in, %offset : !pto.ptr<f32, gm> -> !pto.ptr<f32, gm>
%block_out = pto.addptr %gm_out, %offset : !pto.ptr<f32, gm> -> !pto.ptr<f32, gm>

In this pattern, all blocks execute the same kernel body, but each block sees a different %block value and therefore computes a different GM window.

Grid Design Considerations¶

When designing the block grid:

Grid Dimension	Use Case
`block_num`	Parallelism across disjoint data regions
`subblock_num`	Hierarchical tiling within each block

Pointer arithmetic: Pointer Operations — pto.addptr, pto.castptr
Scalar memory access: Pointer Operations — pto.load_scalar, pto.store_scalar
Scalar arithmetic: Shared Scalar Arithmetic — arith.constant, arith.index_cast, arith.muli

PTO Micro-Instruction: BlockDim and Runtime Query Operations¶

Overview¶

Mechanism¶

BlockDim Query Operations¶

Common Pattern¶

Execution Model¶

pto.get_block_idx¶

Inputs¶

Expected Outputs¶

Constraints¶

Examples¶

pto.get_subblock_idx¶

Inputs¶

Expected Outputs¶

Constraints¶

Examples¶

pto.get_block_num¶

Inputs¶

Expected Outputs¶

Constraints¶

Examples¶

pto.get_subblock_num¶

Inputs¶

Expected Outputs¶

Constraints¶

Typical Usage: Block-Level Data Partitioning¶

Grid Design Considerations¶

Related Operations¶

`pto.get_block_idx`¶

`pto.get_subblock_idx`¶

`pto.get_block_num`¶

`pto.get_subblock_num`¶