Elementwise Tile-Tile Instruction Set

Elementwise tile-tile operations perform lane-wise binary and unary operations over tile valid regions. These are the most commonly used tile compute operations in PTO programs.

Operations

Operation   Description                           Category  C++ Intrinsic
pto.tadd    Elementwise addition                  Binary    TADD(dst, src0, src1)
pto.tabs    Elementwise absolute value            Unary     TABS(dst, src)
pto.tand    Elementwise bitwise AND               Binary    TAND(dst, src0, src1)
pto.tor     Elementwise bitwise OR                Binary    TOR(dst, src0, src1)
pto.tsub    Elementwise subtraction               Binary    TSUB(dst, src0, src1)
pto.tmul    Elementwise multiplication            Binary    TMUL(dst, src0, src1)
pto.tmin    Elementwise minimum                   Binary    TMIN(dst, src0, src1)
pto.tmax    Elementwise maximum                   Binary    TMAX(dst, src0, src1)
pto.tcmp    Elementwise comparison                Binary    TCMP(dst, src0, src1, cmp)
pto.tdiv    Elementwise division                  Binary    TDIV(dst, src0, src1)
pto.tshl    Elementwise shift left                Binary    TSHL(dst, src0, src1)
pto.tshr    Elementwise shift right               Binary    TSHR(dst, src0, src1)
pto.txor    Elementwise bitwise XOR               Binary    TXOR(dst, src0, src1)
pto.tlog    Elementwise natural logarithm         Unary     TLOG(dst, src)
pto.trecip  Elementwise reciprocal                Unary     TRECIP(dst, src)
pto.tprelu  Elementwise parameterized ReLU        Binary    TPRELU(dst, src0, src1)
pto.taddc   Three-input fused addition            Ternary   TADDC(dst, src0, src1, src2)
pto.tsubc   Three-input fused subtract/add        Ternary   TSUBC(dst, src0, src1, src2)
pto.tcvt    Elementwise type conversion           Unary     TCVT(dst, src)
pto.tsel    Elementwise conditional selection     Ternary   TSEL(dst, src0, src1, cmp)
pto.trsqrt  Elementwise reciprocal square root    Unary     TRSQRT(dst, src)
pto.tsqrt   Elementwise square root               Unary     TSQRT(dst, src)
pto.texp    Elementwise exponential               Unary     TEXP(dst, src)
pto.tpow    Elementwise power with tile exponent  Binary    TPOW(dst, base, exp, tmp)
pto.tnot    Elementwise bitwise NOT               Unary     TNOT(dst, src)
pto.trelu   Elementwise ReLU                      Unary     TRELU(dst, src)
pto.tneg    Elementwise negation                  Unary     TNEG(dst, src)
pto.trem    Elementwise remainder                 Binary    TREM(dst, src0, src1)
pto.tfmod   Elementwise floating-point modulo     Binary    TFMOD(dst, src0, src1)

Mechanism

Binary operations combine two source tiles lane-by-lane. Unary operations transform one source tile lane-by-lane. The iteration domain is the destination tile's valid region.

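For each lane (r, c) in the destination's valid region, a binary operation computes:

\[ \mathrm{dst}_{r,c} = f(\mathrm{src0}_{r,c}, \mathrm{src1}_{r,c}) \]

and a unary operation computes:

\[ \mathrm{dst}_{r,c} = f(\mathrm{src}_{r,c}) \]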

For ternary selection (TSEL):

\[ \mathrm{dst}_{r,c} = (\mathrm{cmp}_{r,c} \neq 0) \; ?\; \mathrm{src0}_{r,c} \;:\; \mathrm{src1}_{r,c} \]
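The lane-wise rules above can be modeled on the host in plain C++. This is a minimal reference sketch, not the PTO implementation: RefTile, refBinary, and refSelect are hypothetical names, tiles are assumed to be dense row-major buffers, and the valid region is assumed to be a rectangle anchored at (0, 0).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side tile model (not a PTO type).
template <typename T>
struct RefTile {
    std::size_t rows, cols;            // physical shape
    std::size_t validRows, validCols;  // valid region, anchored at (0, 0)
    std::vector<T> data;               // rows * cols elements, row-major

    RefTile(std::size_t r, std::size_t c, std::size_t vr, std::size_t vc, T fill = T{})
        : rows(r), cols(c), validRows(vr), validCols(vc), data(r * c, fill) {}

    T& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    const T& at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// dst(r, c) = f(src0(r, c), src1(r, c)) for every lane in dst's valid region.
template <typename T, typename F>
void refBinary(RefTile<T>& dst, const RefTile<T>& src0, const RefTile<T>& src1, F f) {
    // All tiles are required to share one physical shape.
    assert(src0.rows == dst.rows && src0.cols == dst.cols);
    assert(src1.rows == dst.rows && src1.cols == dst.cols);
    for (std::size_t r = 0; r < dst.validRows; ++r)
        for (std::size_t c = 0; c < dst.validCols; ++c)
            dst.at(r, c) = f(src0.at(r, c), src1.at(r, c));
}

// dst(r, c) = (cmp(r, c) != 0) ? src0(r, c) : src1(r, c), as in the TSEL formula.
template <typename T, typename M>
void refSelect(RefTile<T>& dst, const RefTile<T>& src0, const RefTile<T>& src1,
               const RefTile<M>& cmp) {
    for (std::size_t r = 0; r < dst.validRows; ++r)
        for (std::size_t c = 0; c < dst.validCols; ++c)
            dst.at(r, c) = (cmp.at(r, c) != 0) ? src0.at(r, c) : src1.at(r, c);
}
```

Lanes outside the destination's valid region are never written by either helper, which mirrors the stated iteration domain.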

Valid Region Compatibility

All elementwise tile-tile operations iterate over the destination tile's valid region. For each lane (r, c) in the destination's valid region:

  • The corresponding lane (r, c) from each source tile is read, regardless of whether that lane is within the source tile's own valid region
  • Source tiles whose valid region does not cover (r, c) read all-one-bits (0xFF) for those out-of-region lanes on both A2/A3 and A5
  • Programs MUST NOT rely on any particular value being read from an out-of-region source lane unless the operation explicitly documents the behavior
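One way to avoid depending on out-of-region reads is to check, before issuing an operation, that every source's valid region covers the destination's. The sketch below assumes valid regions are rectangles anchored at (0, 0); ValidRegion, covers, and allSourcesCover are hypothetical helper names, not PTO APIs.

```cpp
#include <cstddef>

// Hypothetical description of a tile's valid region (rows x cols from (0, 0)).
struct ValidRegion {
    std::size_t rows;
    std::size_t cols;
};

// True when every lane of dst's valid region is also a valid lane of src.
constexpr bool covers(ValidRegion src, ValidRegion dst) {
    return src.rows >= dst.rows && src.cols >= dst.cols;
}

// True when a binary operation would read only in-region source lanes.
constexpr bool allSourcesCover(ValidRegion dst, ValidRegion src0, ValidRegion src1) {
    return covers(src0, dst) && covers(src1, dst);
}
```

When the check fails, the operation still iterates over the destination's valid region, so the caller has to shrink that region or initialize the source lanes explicitly.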

_c Variants

Within the current canonical per-op pages and intrinsic signatures, the _c suffix in this instruction family does not denote a generic saturating-arithmetic convention:

  • TADDC is a three-input fused add: src0 + src1 + src2
  • TSUBC is a three-input fused subtract/add: src0 - src1 + src2

Readers MUST NOT infer saturating semantics from the suffix alone; always treat the individual per-op page as the source of truth.
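As a per-lane illustration of the two formulas, the following scalar sketch shows what each lane computes. The names refAddc and refSubc are hypothetical, and the sketch uses ordinary C++ arithmetic: it does not model hardware rounding, any single-rounding behavior implied by "fused", or overflow handling.

```cpp
// Per-lane reference for TADDC: src0 + src1 + src2 (no saturation implied).
template <typename T>
constexpr T refAddc(T src0, T src1, T src2) {
    return src0 + src1 + src2;
}

// Per-lane reference for TSUBC: src0 - src1 + src2 (no saturation implied).
template <typename T>
constexpr T refSubc(T src0, T src1, T src2) {
    return src0 - src1 + src2;
}
```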

Type Support by Target Profile

Element Type        CPU Simulator   A2/A3   A5
f32 (float)         Yes             Yes     Yes
f16 (half)          Yes             Yes     Yes
bf16 (bfloat16_t)   Yes             Yes     Yes
i8 / u8             Yes             Yes     Yes
i16 / u16           Yes             Yes     Yes
i32 / u32           Yes             Yes     Yes
i64 / u64           Yes             Yes     Yes
f8e4m3 / f8e5m2     No              No      Yes

Constraints

  • Tile layout, shape, and valid-region state affect legality.
  • Type support varies by target profile (see per-op pages for exact restrictions).
  • Comparison operations (TCMP) produce a predicate tile; arithmetic operations produce a numeric tile.
  • Conversion operations (TCVT) may change element type between source and destination; dtype sizes may differ.
  • All source and destination tiles MUST have the same physical shape (Rows, Cols).
  • Shift operations (TSHL, TSHR) interpret the second operand as an unsigned shift count; shift count MUST be < element bit-width.
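The shift-count constraint can be checked on the host before data reaches a tile. The sketch below is an assumption-labeled reference, not PTO code: isLegalShiftCount and refShl are hypothetical names, and refShl only models the rule that the second operand is read as an unsigned count smaller than the element bit-width.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>
#include <type_traits>

// True when `count` is a legal shift count for element type T,
// i.e. strictly less than the element bit-width.
template <typename T>
constexpr bool isLegalShiftCount(std::uint64_t count) {
    static_assert(std::is_integral<T>::value, "shifts apply to integer element types");
    return count < sizeof(T) * CHAR_BIT;
}

// Per-lane reference for a left shift; the count is reinterpreted as unsigned.
template <typename T>
T refShl(T value, T count) {
    using U = typename std::make_unsigned<T>::type;
    const U n = static_cast<U>(count);
    assert(isLegalShiftCount<T>(n));  // counts >= bit-width are not allowed
    return static_cast<T>(static_cast<U>(value) << n);
}
```

A negative signed count becomes a large unsigned value under this interpretation and therefore fails the check.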

Cases That Are Not Allowed

  • MUST NOT assume implicit broadcasting, reshaping, or valid-region repair.
  • MUST NOT rely on a defined value from a source tile lane outside its valid region.
  • MUST NOT infer a generic saturating-arithmetic meaning from the _c suffix alone.
  • MUST NOT use a shift count >= element bit-width.

C++ Intrinsic

#include <pto/pto-inst.hpp>
using namespace pto;

// Binary elementwise
template <typename TileDst, typename TileSrc0, typename TileSrc1>
PTO_INST RecordEvent TADD(TileDst& dst, TileSrc0& src0, TileSrc1& src1);

template <typename TileDst, typename TileSrc0, typename TileSrc1>
PTO_INST RecordEvent TMUL(TileDst& dst, TileSrc0& src0, TileSrc1& src1);

template <typename TileData, typename TileData0, typename TileData1, typename TileData2>
PTO_INST RecordEvent TADDC(TileData& dst, TileData0& src0, TileData1& src1, TileData2& src2);

// Unary elementwise
template <typename TileDst, typename TileSrc>
PTO_INST RecordEvent TABS(TileDst& dst, TileSrc& src);

template <typename TileDst, typename TileSrc>
PTO_INST RecordEvent TEXP(TileDst& dst, TileSrc& src);

// Type conversion
template <typename TileDst, typename TileSrc>
PTO_INST RecordEvent TCVT(TileDst& dst, TileSrc& src);

// Comparison (produces predicate tile)
template <typename TileDst, typename TileSrc0, typename TileSrc1>
PTO_INST RecordEvent TCMP(TileDst& dst, TileSrc0& src0, TileSrc1& src1, CompareMode cmp);

Throughput and Latency (A2/A3)

Tile elementwise operations use the Vector Core (PIPE_V) via the CCE instruction set. The performance model is defined in include/pto/costmodel/a2a3/.

Cycle Model Formula

total_cycles = startup + completion + repeats × per_repeat + (repeats - 1) × interval

Where repeats is computed from tile layout and valid region shape.
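The formula translates directly into code. In this sketch CycleParams and totalCycles are hypothetical names rather than symbols from include/pto/costmodel/a2a3/, and returning 0 for zero repeats is an assumption, since the formula itself is only meaningful for repeats ≥ 1.

```cpp
#include <cstdint>

// Hypothetical bundle of the four per-instruction constants used by the formula.
struct CycleParams {
    std::uint64_t startup;
    std::uint64_t completion;
    std::uint64_t perRepeat;
    std::uint64_t interval;
};

// total_cycles = startup + completion + repeats * per_repeat + (repeats - 1) * interval
constexpr std::uint64_t totalCycles(const CycleParams& p, std::uint64_t repeats) {
    return repeats == 0
               ? 0  // assumption: nothing issued, nothing charged
               : p.startup + p.completion + repeats * p.perRepeat +
                     (repeats - 1) * p.interval;
}
```

For example, with the f32 vadd values listed in this section (startup 14, completion 19, per-repeat 2, interval 18), 8 repeats cost 14 + 19 + 16 + 126 = 175 cycles under this formula.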

CCE Instruction Parameters

Metric                         Constant              Value (cycles)  Applies To
Startup latency                A2A3_STARTUP_BINARY   14              all arithmetic binary ops (vadd, vmul, vsub)
Startup latency                A2A3_STARTUP_REDUCE   13              transcendental/unary ops (vexp, vsqrt, vabs)
Completion: FP32               A2A3_COMPL_FP_BINOP   19              vadd, vsub (f32), vcadd, vcmax
Completion: INT binary         A2A3_COMPL_INT_BINOP  17              vadd, vsub (int16)
Completion: INT mul            A2A3_COMPL_INT_MUL    18              vmul (int)
Completion: FP transcendental  A2A3_COMPL_FP32_EXP   26              vexp (f32)
Completion: FP transcendental  A2A3_COMPL_FP32_SQRT  27              vsqrt (f32)
Per-repeat throughput          A2A3_RPT_1            1               unary/scalar ops
Per-repeat throughput          A2A3_RPT_2            2               binary ops (vadd, vmul)
Per-repeat throughput          A2A3_RPT_4            4               transcendental ops (f16 exp/sqrt)
Pipeline interval              A2A3_INTERVAL         18              all vector ops
Pipeline interval (copy)       A2A3_INTERVAL_VCOPY   13              vmov, copy_ubuf_to_ubuf

Instruction Repeat Calculation

The TBinOp.hpp / TBinSOp.hpp / TUnaryOp.hpp headers compute repeats from tile geometry:

Continuous (fast) path (source stride == destination stride == 1):

repeats = validRow × validCol / elementsPerRepeat

General path: handles arbitrary stride combinations, including small-shape optimization (Bin1LNormModeSmall) where one repeat covers an entire row.
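The continuous-path repeat count can be sketched as follows. continuousRepeats is a hypothetical name, elementsPerRepeat is supplied by the caller rather than taken from the headers, and rounding up when the element count is not an exact multiple is an assumption: the formula above is stated as a plain quotient.

```cpp
#include <cassert>
#include <cstddef>

// repeats = validRow * validCol / elementsPerRepeat, rounded up so that a
// trailing partial repeat is still counted (assumption, see lead-in).
inline std::size_t continuousRepeats(std::size_t validRow, std::size_t validCol,
                                     std::size_t elementsPerRepeat) {
    assert(elementsPerRepeat > 0);
    const std::size_t elements = validRow * validCol;
    return (elements + elementsPerRepeat - 1) / elementsPerRepeat;
}
```

With an illustrative value of 64 elements per repeat, a 64×128 valid region gives 8192 / 64 = 128 repeats.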

Layout and Shape Impact

Tile layout (RowMajor, ColMajor, Zigzag, etc.) affects stride alignment and determines which optimization path is taken:

Layout          Stride Pattern                      Optimization
RowMajor        src0/1: (1, cols), dst: (1, cols)   Continuous fast path when col-aligned
ColMajor        src0/1: (rows, 1), dst: (rows, 1)   General path
Mixed layouts   Mixed stride patterns               General path only

Shape-sensitive special cases (FP32, hardcoded at compile time):

Valid Shape        Instruction Sequence
64×128 (TROWSUM)   vcgadd*128 → PIPE_V → vadd*8 → PIPE_V → vcgadd*8 → PIPE_V
32×256 (TROWSUM)   vcgadd*128 → PIPE_V → vadd*8 → PIPE_V → vadd*4 → PIPE_V → vcgadd*4 → PIPE_V
16×512 (TROWSUM)   vcgadd*128 → PIPE_V → vcgadd*16 → PIPE_V → vcgadd*2 → PIPE_V
8×1024 (TROWSUM)   vcgadd*128 → PIPE_V → vcgadd*16 → PIPE_V → vadd*8 → PIPE_V → vcgadd*8 → PIPE_V

Bandwidth Model for Tile Movements

Transfer Path             Bandwidth (B/cycle)   Constant
GM → Vec Buffer (TLOAD)   128                   A2A3_BW_GM_VEC
Vec → Vec (TMOV)          128                   A2A3_BW_VEC_VEC
GM → Mat (TLOAD Mat)      256                   A2A3_BW_GM_MAT
Mat → L0A (TMOV Left)     256                   A2A3_BW_MAT_LEFT
Mat → L0B (TMOV Right)    128                   A2A3_BW_MAT_RIGHT
Mat → Mat (TEXTRACT)      32                    A2A3_BW_MAT_MAT

Transfer cost: ceil(bufferSize / bandwidth) cycles.
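The transfer cost is a ceiling division, which integer arithmetic can express without floating point. transferCycles is a hypothetical helper name; the sketch assumes the buffer size is in bytes and the bandwidth is in bytes per cycle, matching the units of the table.

```cpp
#include <cstdint>

// ceil(bufferSizeBytes / bandwidthBytesPerCycle) using integer arithmetic.
// The bandwidth must be non-zero.
constexpr std::uint64_t transferCycles(std::uint64_t bufferSizeBytes,
                                       std::uint64_t bandwidthBytesPerCycle) {
    return (bufferSizeBytes + bandwidthBytesPerCycle - 1) / bandwidthBytesPerCycle;
}
```

For example, a 64×128 f32 tile occupies 32768 bytes, so a GM → Vec Buffer load at 128 B/cycle is modeled as 256 cycles.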

Accuracy and Testing

The cost model is validated against cycle-accurate profiling with ≥99% accuracy (error < 1%):

  • Tests in tests/costmodel/tadd_kernel.cpp etc.
  • Run via tests/run_costmodel.py --testcase <name>
  • Build with the -D__COSTMODEL preprocessor flag


See Also