Elementwise Tile-Tile Instruction Set¶
Elementwise tile-tile operations perform lane-wise binary and unary operations over tile valid regions. These are the most commonly used tile compute operations in PTO programs.
Operations¶
| Operation | Description | Category | C++ Intrinsic |
|---|---|---|---|
| pto.tadd | Elementwise addition | Binary | TADD(dst, src0, src1) |
| pto.tabs | Elementwise absolute value | Unary | TABS(dst, src) |
| pto.tand | Elementwise bitwise AND | Binary | TAND(dst, src0, src1) |
| pto.tor | Elementwise bitwise OR | Binary | TOR(dst, src0, src1) |
| pto.tsub | Elementwise subtraction | Binary | TSUB(dst, src0, src1) |
| pto.tmul | Elementwise multiplication | Binary | TMUL(dst, src0, src1) |
| pto.tmin | Elementwise minimum | Binary | TMIN(dst, src0, src1) |
| pto.tmax | Elementwise maximum | Binary | TMAX(dst, src0, src1) |
| pto.tcmp | Elementwise comparison | Binary | TCMP(dst, src0, src1, cmp) |
| pto.tdiv | Elementwise division | Binary | TDIV(dst, src0, src1) |
| pto.tshl | Elementwise shift left | Binary | TSHL(dst, src0, src1) |
| pto.tshr | Elementwise shift right | Binary | TSHR(dst, src0, src1) |
| pto.txor | Elementwise bitwise XOR | Binary | TXOR(dst, src0, src1) |
| pto.tlog | Elementwise natural logarithm | Unary | TLOG(dst, src) |
| pto.trecip | Elementwise reciprocal | Unary | TRECIP(dst, src) |
| pto.tprelu | Elementwise parameterized ReLU | Binary | TPRELU(dst, src0, src1) |
| pto.taddc | Three-input fused addition | Ternary-like Binary | TADDC(dst, src0, src1, src2) |
| pto.tsubc | Three-input fused subtract/add | Ternary-like Binary | TSUBC(dst, src0, src1, src2) |
| pto.tcvt | Elementwise type conversion | Unary | TCVT(dst, src) |
| pto.tsel | Elementwise conditional selection | Ternary | TSEL(dst, src0, src1, cmp) |
| pto.trsqrt | Elementwise reciprocal square root | Unary | TRSQRT(dst, src) |
| pto.tsqrt | Elementwise square root | Unary | TSQRT(dst, src) |
| pto.texp | Elementwise exponential | Unary | TEXP(dst, src) |
| pto.tpow | Elementwise power with tile exponent | Binary | TPOW(dst, base, exp, tmp) |
| pto.tnot | Elementwise bitwise NOT | Unary | TNOT(dst, src) |
| pto.trelu | Elementwise ReLU | Unary | TRELU(dst, src) |
| pto.tneg | Elementwise negation | Unary | TNEG(dst, src) |
| pto.trem | Elementwise remainder | Binary | TREM(dst, src0, src1) |
| pto.tfmod | Elementwise floating-point modulo | Binary | TFMOD(dst, src0, src1) |
Mechanism¶
Binary operations combine two source tiles lane-by-lane. Unary operations transform one source tile lane-by-lane. The iteration domain is the destination tile's valid region.
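For each lane (r, c) in the destination's valid region, a binary operation computes dst[r, c] = f(src0[r, c], src1[r, c]) and a unary operation computes dst[r, c] = f(src[r, c]), where f is the operation's per-lane function.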
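For ternary selection (TSEL), each destination lane takes either src0[r, c] or src1[r, c], chosen by the corresponding lane of the selection operand cmp.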
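The lane-wise iteration described above can be sketched as a host-side reference model. This is a minimal illustration, not the PTO implementation: `RefTile`, `refBinary`, `refUnary`, and `refSelect` are hypothetical names, the valid region is assumed to be a rectangle anchored at lane (0, 0), and the selection polarity (nonzero predicate picks `src0`) is an assumption that should be checked against the TSEL per-op page.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference tile; this is NOT the PTO Tile type.
// ASSUMPTION: the valid region is a rectangle anchored at lane (0, 0).
struct RefTile {
    std::size_t rows, cols;            // physical shape
    std::size_t validRows, validCols;  // valid-region extent
    std::vector<float> data;           // row-major storage, rows * cols lanes

    RefTile(std::size_t r, std::size_t c, std::size_t vr, std::size_t vc)
        : rows(r), cols(c), validRows(vr), validCols(vc), data(r * c, 0.0f) {
        assert(vr <= r && vc <= c);
    }
    float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Binary: dst[r, c] = f(src0[r, c], src1[r, c]) over dst's valid region.
template <typename F>
void refBinary(RefTile& dst, const RefTile& src0, const RefTile& src1, F f) {
    assert(dst.rows == src0.rows && dst.cols == src0.cols);
    assert(dst.rows == src1.rows && dst.cols == src1.cols);
    for (std::size_t r = 0; r < dst.validRows; ++r)
        for (std::size_t c = 0; c < dst.validCols; ++c)
            dst.at(r, c) = f(src0.at(r, c), src1.at(r, c));
}

// Unary: dst[r, c] = f(src[r, c]) over dst's valid region.
template <typename F>
void refUnary(RefTile& dst, const RefTile& src, F f) {
    assert(dst.rows == src.rows && dst.cols == src.cols);
    for (std::size_t r = 0; r < dst.validRows; ++r)
        for (std::size_t c = 0; c < dst.validCols; ++c)
            dst.at(r, c) = f(src.at(r, c));
}

// Selection. ASSUMPTION: a nonzero predicate lane picks src0, otherwise src1.
void refSelect(RefTile& dst, const RefTile& src0, const RefTile& src1,
               const RefTile& pred) {
    assert(dst.rows == src0.rows && dst.cols == src0.cols);
    assert(dst.rows == src1.rows && dst.cols == src1.cols);
    assert(dst.rows == pred.rows && dst.cols == pred.cols);
    for (std::size_t r = 0; r < dst.validRows; ++r)
        for (std::size_t c = 0; c < dst.validCols; ++c)
            dst.at(r, c) = (pred.at(r, c) != 0.0f) ? src0.at(r, c) : src1.at(r, c);
}
```

Lanes outside the destination's valid region are never written in this sketch, which mirrors the statement that the iteration domain is the destination tile's valid region.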
Valid Region Compatibility¶
All elementwise tile-tile operations iterate over the destination tile's valid region. For each lane (r, c) in the destination's valid region:
- The corresponding lane (r, c) from each source tile is read, regardless of whether that lane is within the source tile's own valid region
- Source tiles whose valid region does not cover (r, c) read all-one bits (0xFF) on both A2/A3 and A5 for those out-of-region lanes
- Programs MUST NOT rely on any particular value being read from an out-of-region source lane unless the operation explicitly documents the behavior
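One way to avoid depending on out-of-region reads is to check, before issuing an operation, that every source's valid region covers the destination's. The helper below is a hedged sketch: `ValidRegion` and `coversValidRegion` are hypothetical names that are not part of the PTO headers, and the valid region is assumed to be a rectangle anchored at lane (0, 0).

```cpp
#include <cstddef>

// Hypothetical description of a valid region.
// ASSUMPTION: the region is a rectangle anchored at lane (0, 0).
struct ValidRegion {
    std::size_t rows;
    std::size_t cols;
};

// Returns true when every lane in dst's valid region also lies inside
// src's valid region, so no out-of-region source lane would be read.
constexpr bool coversValidRegion(ValidRegion src, ValidRegion dst) {
    return src.rows >= dst.rows && src.cols >= dst.cols;
}
```

A caller could assert `coversValidRegion(src, dst)` for each source in debug builds and treat a failure as a program error.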
_c Variants¶
Within the current canonical per-op pages and intrinsic signatures, the _c suffix in this instruction family does not denote a generic saturating-arithmetic convention:
- TADDC is a three-input fused add: src0 + src1 + src2
- TSUBC is a three-input fused subtract/add: src0 - src1 + src2
Readers MUST NOT infer saturating semantics from the suffix alone; always treat the individual per-op page as the source of truth.
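As a small illustration of the two formulas, the per-lane arithmetic can be written as plain functions with no clamping step. The names `taddcLane` and `tsubcLane` are hypothetical. The sketch only restates the formulas given here and does not model rounding, evaluation order, or overflow behavior, which the per-op pages define.

```cpp
// Per-lane reference arithmetic for the three-input forms.
// Plain addition and subtraction; nothing here saturates.
template <typename T>
constexpr T taddcLane(T src0, T src1, T src2) {
    return src0 + src1 + src2;  // TADDC: src0 + src1 + src2
}

template <typename T>
constexpr T tsubcLane(T src0, T src1, T src2) {
    return src0 - src1 + src2;  // TSUBC: src0 - src1 + src2
}
```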
Type Support by Target Profile¶
| Element Type | CPU Simulator | A2/A3 | A5 |
|---|---|---|---|
| f32 (float) | Yes | Yes | Yes |
| f16 (half) | Yes | Yes | Yes |
| bf16 (bfloat16_t) | Yes | Yes | Yes |
| i8 / u8 | Yes | Yes | Yes |
| i16 / u16 | Yes | Yes | Yes |
| i32 / u32 | Yes | Yes | Yes |
| i64 / u64 | Yes | Yes | Yes |
| f8e4m3 / f8e5m2 | No | No | Yes |
Constraints¶
- Tile layout, shape, and valid-region state affect legality.
- Type support varies by target profile (see per-op pages for exact restrictions).
- Comparison operations (TCMP) produce a predicate tile; arithmetic operations produce a numeric tile.
- Conversion operations (TCVT) may change element type between source and destination; dtype sizes may differ.
- All source and destination tiles MUST have the same physical shape (Rows, Cols).
- Shift operations (TSHL, TSHR) interpret the second operand as an unsigned shift count; shift count MUST be < element bit-width.
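The shift-count rule can be checked on the host before a tile of counts is handed to TSHL or TSHR. This is a minimal sketch: `isLegalShiftCount` is a hypothetical helper, not a PTO API, and it assumes the element bit-width equals `sizeof(T) * CHAR_BIT` for the host type standing in for the tile element type.

```cpp
#include <climits>
#include <cstdint>

// True when `count` is a legal shift count for element type T,
// i.e. strictly less than the element bit-width.
// ASSUMPTION: T is a host integer type with the same width as the tile element.
template <typename T>
constexpr bool isLegalShiftCount(std::uint64_t count) {
    return count < sizeof(T) * CHAR_BIT;
}
```

For example, `isLegalShiftCount<std::uint8_t>(8)` is false because an 8-bit element only admits counts 0 through 7.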
Cases That Are Not Allowed¶
- MUST NOT assume implicit broadcasting, reshaping, or valid-region repair.
- MUST NOT rely on a defined value from a source tile lane outside its valid region.
- MUST NOT infer a generic saturating-arithmetic meaning from the _c suffix alone.
- MUST NOT use a shift count >= element bit-width.
C++ Intrinsic¶
#include <pto/pto-inst.hpp>
using namespace pto;
// Binary elementwise
template <typename TileDst, typename TileSrc0, typename TileSrc1>
PTO_INST RecordEvent TADD(TileDst& dst, TileSrc0& src0, TileSrc1& src1);
template <typename TileDst, typename TileSrc0, typename TileSrc1>
PTO_INST RecordEvent TMUL(TileDst& dst, TileSrc0& src0, TileSrc1& src1);
template <typename TileData, typename TileData0, typename TileData1, typename TileData2>
PTO_INST RecordEvent TADDC(TileData& dst, TileData0& src0, TileData1& src1, TileData2& src2);
// Unary elementwise
template <typename TileDst, typename TileSrc>
PTO_INST RecordEvent TABS(TileDst& dst, TileSrc& src);
template <typename TileDst, typename TileSrc>
PTO_INST RecordEvent TEXP(TileDst& dst, TileSrc& src);
// Type conversion
template <typename TileDst, typename TileSrc>
PTO_INST RecordEvent TCVT(TileDst& dst, TileSrc& src);
// Comparison (produces predicate tile)
template <typename TileDst, typename TileSrc0, typename TileSrc1>
PTO_INST RecordEvent TCMP(TileDst& dst, TileSrc0& src0, TileSrc1& src1, CompareMode cmp);
Throughput and Latency (A2/A3)¶
Tile elementwise operations use the Vector Core (PIPE_V) via the CCE instruction set. The performance model is defined in include/pto/costmodel/a2a3/.
Cycle Model Formula¶
total_cycles = startup + completion + repeats × per_repeat + (repeats - 1) × interval
Where repeats is computed from tile layout and valid region shape.
CCE Instruction Parameters¶
| Metric | Constant | Value (cycles) | Applies To |
|---|---|---|---|
| Startup latency | A2A3_STARTUP_BINARY | 14 | all arithmetic binary ops (vadd, vmul, vsub) |
| Startup latency | A2A3_STARTUP_REDUCE | 13 | transcendental/unary ops (vexp, vsqrt, vabs) |
| Completion: FP32 | A2A3_COMPL_FP_BINOP | 19 | vadd, vsub (f32), vcadd, vcmax |
| Completion: INT binary | A2A3_COMPL_INT_BINOP | 17 | vadd, vsub (int16) |
| Completion: INT mul | A2A3_COMPL_INT_MUL | 18 | vmul (int) |
| Completion: FP transcendental | A2A3_COMPL_FP32_EXP | 26 | vexp (f32) |
| Completion: FP transcendental | A2A3_COMPL_FP32_SQRT | 27 | vsqrt (f32) |
| Per-repeat throughput | A2A3_RPT_1 | 1 | unary/scalar ops |
| Per-repeat throughput | A2A3_RPT_2 | 2 | binary ops (vadd, vmul) |
| Per-repeat throughput | A2A3_RPT_4 | 4 | transcendental ops (f16 exp/sqrt) |
| Pipeline interval | A2A3_INTERVAL | 18 | all vector ops |
| Pipeline interval (copy) | A2A3_INTERVAL_VCOPY | 13 | vmov, copy_ubuf_to_ubuf |
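To make the cycle model concrete, the formula can be combined with the table values for an f32 vadd (startup 14, completion 19, per-repeat 2, interval 18). The sketch below is illustrative only: `VecOpCost`, `totalCycles`, and `kF32Vadd` are hypothetical names rather than symbols from include/pto/costmodel/a2a3/, and returning zero for a non-positive repeat count is an assumption, since the formula does not define that case.

```cpp
#include <cstdint>

// Hypothetical bundle of the four cost-model parameters for one instruction.
struct VecOpCost {
    std::int64_t startup;     // startup latency, cycles
    std::int64_t completion;  // completion latency, cycles
    std::int64_t perRepeat;   // per-repeat throughput, cycles
    std::int64_t interval;    // pipeline interval between repeats, cycles
};

// total_cycles = startup + completion + repeats * per_repeat
//              + (repeats - 1) * interval
// ASSUMPTION: a non-positive repeat count costs zero cycles.
inline std::int64_t totalCycles(const VecOpCost& c, std::int64_t repeats) {
    if (repeats <= 0) {
        return 0;
    }
    return c.startup + c.completion + repeats * c.perRepeat +
           (repeats - 1) * c.interval;
}

// Values copied from the parameter table for an f32 vadd.
const VecOpCost kF32Vadd{14, 19, 2, 18};
```

With these values a single repeat costs 14 + 19 + 2 = 35 cycles, and four repeats cost 14 + 19 + 8 + 54 = 95 cycles; the interval term dominates as the repeat count grows.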
Instruction Repeat Calculation¶
The TBinOp.hpp / TBinSOp.hpp / TUnaryOp.hpp headers compute repeats from tile geometry:
Continuous (fast) path (source stride == destination stride == 1):
repeats = validRow × validCol / elementsPerRepeat
General path: handles arbitrary stride combinations, including small-shape optimization (Bin1LNormModeSmall) where one repeat covers an entire row.
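The continuous-path repeat count can be sketched as a small helper. `continuousRepeats` is a hypothetical name and does not reproduce the logic in TBinOp.hpp. The formula above is written as a plain division; rounding up when the lane count is not a multiple of `elementsPerRepeat` is an assumption made here so that a partial final repeat is still counted.

```cpp
#include <cstddef>

// Repeat count for the continuous (fast) path.
// ASSUMPTION: a partial final repeat counts as one full repeat (ceiling division).
// elementsPerRepeat must be nonzero.
constexpr std::size_t continuousRepeats(std::size_t validRow,
                                        std::size_t validCol,
                                        std::size_t elementsPerRepeat) {
    return (validRow * validCol + elementsPerRepeat - 1) / elementsPerRepeat;
}
```

For instance, with an illustrative 64 elements per repeat, a 16 x 64 valid region needs 16 repeats.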
Layout and Shape Impact¶
Tile layout (RowMajor, ColMajor, Zigzag, etc.) affects stride alignment and determines which optimization path is taken:
| Layout | Stride Pattern | Optimization |
|---|---|---|
| RowMajor | src0/1: (1, cols), dst: (1, cols) | Continuous fast path when col-aligned |
| ColMajor | src0/1: (rows, 1), dst: (rows, 1) | General path |
| Mixed layouts | Mixed stride patterns | General path only |
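The layout table can be read as a simple decision rule. The sketch below encodes it with hypothetical `Layout`, `Path`, and `selectPath` names; it is not the selection logic used by the headers. It assumes that an all-RowMajor operation that is not column-aligned falls back to the general path, which the table implies but does not state.

```cpp
// Hypothetical enums; the real PTO layout type is not modeled here.
enum class Layout { RowMajor, ColMajor, Zigzag };
enum class Path { ContinuousFast, General };

// Picks the optimization path from the operand layouts.
// ASSUMPTION: RowMajor operands that are not col-aligned use the general path.
constexpr Path selectPath(Layout src0, Layout src1, Layout dst, bool colAligned) {
    return (src0 == Layout::RowMajor && src1 == Layout::RowMajor &&
            dst == Layout::RowMajor && colAligned)
               ? Path::ContinuousFast
               : Path::General;
}
```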
Shape-sensitive special cases (FP32, hardcoded at compile time):
| Valid Shape | Instruction Sequence |
|---|---|
| 64×128 (TROWSUM) | vcgadd128 → PIPE_V → vadd8 → PIPE_V → vcgadd*8 → PIPE_V |
| 32×256 (TROWSUM) | vcgadd128 → PIPE_V → vadd8 → PIPE_V → vadd4 → PIPE_V → vcgadd4 → PIPE_V |
| 16×512 (TROWSUM) | vcgadd128 → PIPE_V → vcgadd16 → PIPE_V → vcgadd*2 → PIPE_V |
| 8×1024 (TROWSUM) | vcgadd128 → PIPE_V → vcgadd16 → PIPE_V → vadd8 → PIPE_V → vcgadd8 → PIPE_V |
Bandwidth Model for Tile Movements¶
| Transfer Path | Bandwidth (B/cycle) | Constant |
|---|---|---|
| GM → Vec Buffer (TLOAD) | 128 | A2A3_BW_GM_VEC |
| Vec → Vec (TMOV) | 128 | A2A3_BW_VEC_VEC |
| GM → Mat (TLOAD Mat) | 256 | A2A3_BW_GM_MAT |
| Mat → L0A (TMOV Left) | 256 | A2A3_BW_MAT_LEFT |
| Mat → L0B (TMOV Right) | 128 | A2A3_BW_MAT_RIGHT |
| Mat → Mat (TEXTRACT) | 32 | A2A3_BW_MAT_MAT |
Transfer cost: ceil(bufferSize / bandwidth) cycles.
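The transfer-cost rule is an integer ceiling division, sketched below. `transferCycles`, `kBwGmVec`, and `kBwMatMat` are hypothetical names; the two bandwidth values are copied from the table rows for A2A3_BW_GM_VEC and A2A3_BW_MAT_MAT.

```cpp
#include <cstdint>

// ceil(bufferBytes / bytesPerCycle) using integer arithmetic.
// bytesPerCycle must be nonzero.
constexpr std::uint64_t transferCycles(std::uint64_t bufferBytes,
                                       std::uint64_t bytesPerCycle) {
    return (bufferBytes + bytesPerCycle - 1) / bytesPerCycle;
}

// Bandwidths in bytes per cycle, copied from the table above.
constexpr std::uint64_t kBwGmVec = 128;  // GM -> Vec Buffer (TLOAD)
constexpr std::uint64_t kBwMatMat = 32;  // Mat -> Mat (TEXTRACT)
```

A 16 x 64 f32 tile occupies 4096 bytes, so a GM to Vec load is modeled as 4096 / 128 = 32 cycles, while the same buffer over the Mat to Mat path is modeled as 128 cycles.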
Accuracy and Testing¶
The cost model is validated against cycle-accurate profiling with ≥99% accuracy (error < 1%):
- Tests in tests/costmodel/tadd_kernel.cpp etc.
- Run via tests/run_costmodel.py --testcase <name>
- Build with -D__COSTMODEL preprocessor flag
See Also¶
- Tile instruction set — Instruction set overview
- Tile instruction set — Instruction Set description