pto.tadd

pto.tadd is part of the Elementwise Tile-Tile instruction set.

Summary

Lane-wise addition of two source tiles into a destination tile. The iteration domain is the destination tile's valid region.

Mechanism

For each element (i, j) in the destination tile's valid region:

\[ \mathrm{dst}_{i,j} = \mathrm{src0}_{i,j} + \mathrm{src1}_{i,j} \]

Only the destination tile's valid region defines the iteration domain. Source tiles are read lane-by-lane at the same (i, j) coordinates. Where a source tile's valid region does not cover (i, j), that lane reads all-one-bits on both A2/A3 and A5: both platforms fill out-of-region lanes with the 0xFF byte pattern in the vector register file before the ALU reads them.
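The iteration rule above can be sketched as a host-side reference model. This is only an illustration: RefTile and ref_tadd are hypothetical names, not part of the PTO headers, and the sketch assumes that valid regions are anchored at (0, 0) and that both sources cover the destination's valid region, so the out-of-region fill is not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for a tile (NOT pto::Tile):
// row-major storage plus a valid region assumed to start at (0, 0).
struct RefTile {
    std::size_t rows, cols;          // allocated shape
    std::size_t validRow, validCol;  // extent of the valid region
    std::vector<float> data;         // rows * cols elements, row-major

    RefTile(std::size_t r, std::size_t c, std::size_t vr, std::size_t vc)
        : rows(r), cols(c), validRow(vr), validCol(vc), data(r * c, 0.0f) {
        assert(vr <= r && vc <= c);
    }
    float& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    float at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Reference TADD: only dst's valid region is iterated.
// Lanes of dst outside that region are left untouched by this model;
// on real targets their contents must not be relied upon.
void ref_tadd(RefTile& dst, const RefTile& src0, const RefTile& src1) {
    // Assumption of this sketch: both sources cover dst's valid region.
    assert(src0.validRow >= dst.validRow && src0.validCol >= dst.validCol);
    assert(src1.validRow >= dst.validRow && src1.validCol >= dst.validCol);
    for (std::size_t i = 0; i < dst.validRow; ++i) {
        for (std::size_t j = 0; j < dst.validCol; ++j) {
            dst.at(i, j) = src0.at(i, j) + src1.at(i, j);
        }
    }
}
```

With a 4×4 destination whose valid region is 2×3, the model writes six lanes and leaves the remaining ten as they were.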

Syntax

Assembly Form (PTO-AS)

%dst = tadd %src0, %src1 : !pto.tile<...>

AS Level 1 — SSA Form

PTO-AS at Level 1 uses SSA-style result binding:

%dst = pto.tadd %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>

AS Level 2 — DPS Form

PTO-AS at Level 2 uses destination-passing style (DPS), in which the output buffer is bound as an explicit operand:

pto.tadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>)
          outs(%dst : !pto.tile_buf<...>)

The ins(...) clause names operands in the input position; the outs(...) clause names the output. The tile buffer type !pto.tile_buf<...> is the in-memory storage form used at Level 2.

Micro-Operation Mapping

The pto.tadd SSA operation maps to the following micro-operation sequence on the Tile Register File (TRF):

TRF_READ(src0, i, j)  →  A
TRF_READ(src1, i, j)  →  B
A + B                  →  C
TRF_WRITE(dst, i, j, C)

The micro-operation level is not exposed to the ISA author; it is the responsibility of the backend to schedule these steps subject to pipeline constraints.

C++ Intrinsic

Declared in include/pto/common/pto_instr.hpp:

template <typename TileDataDst, typename TileDataSrc0, typename TileDataSrc1, typename... WaitEvents>
PTO_INST RecordEvent TADD(TileDataDst& dst, TileDataSrc0& src0, TileDataSrc1& src1, WaitEvents&... events);

Inputs

Operand       | Role                     | Description
%src0         | Left tile                | First source tile; read at (i, j) for each (i, j) in dst's valid region
%src1         | Right tile               | Second source tile; read at (i, j) for each (i, j) in dst's valid region
WaitEvents... | Optional synchronisation | RecordEvent tokens to wait on before issuing the operation

Both source tiles and the destination tile share the same element type. Layout and shape constraints are stated under Constraints.

Expected Outputs

Result | Type           | Description
%dst   | !pto.tile<...> | Destination tile; every (i, j) in its valid region contains src0[i,j] + src1[i,j] after the operation

Side Effects

None beyond producing the destination tile. Does not implicitly fence unrelated tile traffic.

Constraints

  • Type match: All three tiles (src0, src1, dst) MUST have identical element types.
  • Layout: Both source tiles and the destination tile MUST have compatible layouts. See the TileType–Layout compatibility table in Tiles and Valid Regions.
  • Valid region: The iteration domain is dst.GetValidRow() × dst.GetValidCol(). On both A2/A3 and A5, a source tile with a smaller valid region reads all-one-bits (0xFF bytes) for lanes outside its valid region.
  • TileType: The destination tile's TileType determines which pipelines execute the operation. See Tiles and Valid Regions for TileType constraints.
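The type and layout rules above lend themselves to a compile-time check. The following is a minimal sketch under stated assumptions: TileDesc, Layout, and tadd_operands_ok are hypothetical names invented for illustration and do not correspond to the PTO verifier or headers, and "compatible layouts" is approximated here as "identical layouts", which may be stricter than the actual compatibility table.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical layout tag; the real layout set is larger.
enum class Layout { RowMajor, ColMajor };

// Hypothetical compile-time tile descriptor (NOT pto::Tile).
template <typename T, std::size_t Rows, std::size_t Cols,
          Layout L = Layout::RowMajor>
struct TileDesc {
    using DType = T;
    static constexpr std::size_t rows = Rows;
    static constexpr std::size_t cols = Cols;
    static constexpr Layout layout = L;
};

// True when dst, src0 and src1 share one element type and one layout.
// Assumption: identical layouts stand in for "compatible layouts".
template <typename Dst, typename Src0, typename Src1>
constexpr bool tadd_operands_ok() {
    return std::is_same<typename Dst::DType, typename Src0::DType>::value &&
           std::is_same<typename Dst::DType, typename Src1::DType>::value &&
           Dst::layout == Src0::layout && Dst::layout == Src1::layout;
}
```

A caller could guard an instantiation with static_assert(tadd_operands_ok<D, S0, S1>(), "...") so that a mismatched element type fails at compile time rather than in the backend.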

Exceptions

  • Verifier rejects type mismatches between source and destination tiles.
  • Backend rejects unsupported element types, layouts, or shapes for the selected target profile.
  • Programs that read values from destination lanes outside dst's declared valid region observe undefined behavior.

Target-Profile Restrictions

Element type / property | CPU Simulator | A2/A3         | A5
f32                     | Simulated     | Supported     | Supported
f16                     | Simulated     | Supported     | Supported
bf16                    | Simulated     | Supported     | Supported
i32                     | Simulated     | Supported     | Supported
i16                     | Simulated     | Supported     | Supported
i8 / u8                 | Simulated     | No            | Supported
i64 / u64               | Simulated     | No            | No
f8e4m3 / f8e5m2         | Simulated     | No            | Supported
Layout                  | Any           | RowMajor only | RowMajor only

A2/A3 requires isRowMajor == true for all operands. A5 has the same RowMajor requirement but supports more element types.
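The table above can be encoded as a small lookup, which is convenient when a build script or test needs to decide whether a TADD configuration is expected to be accepted. This is a sketch only: Profile, Elem, and tadd_supports are hypothetical names, and the function mirrors the table rather than any actual backend check.

```cpp
// Hypothetical enums mirroring the columns and rows of the table.
enum class Profile { CpuSim, A2A3, A5 };
enum class Elem { F32, F16, BF16, I32, I16, I8, U8, I64, U64, F8E4M3, F8E5M2 };

// Element types listed as "Supported" on both A2/A3 and A5.
constexpr bool is_common_type(Elem e) {
    return e == Elem::F32 || e == Elem::F16 || e == Elem::BF16 ||
           e == Elem::I32 || e == Elem::I16;
}

// Element types listed as "Supported" on A5 only.
constexpr bool is_a5_only_type(Elem e) {
    return e == Elem::I8 || e == Elem::U8 ||
           e == Elem::F8E4M3 || e == Elem::F8E5M2;
}

// Returns true when the table lists the combination as Simulated/Supported.
// i64 / u64 match neither helper, so they are rejected on hardware profiles.
constexpr bool tadd_supports(Profile p, Elem e, bool isRowMajor) {
    return p == Profile::CpuSim
               ? true  // simulator: every listed type, any layout
               : isRowMajor &&
                     (is_common_type(e) ||
                      (p == Profile::A5 && is_a5_only_type(e)));
}
```

For example, tadd_supports(Profile::A2A3, Elem::I8, true) is false while the same query for Profile::A5 is true, matching the i8 / u8 row.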

Performance

A2/A3 Throughput

Tile elementwise operations are compiled to CCE vector instructions. The TBinOp.hpp performance model computes cycles as follows:

Metric                | Value                            | Constant
Startup latency       | 14 cycles                        | A2A3_STARTUP_BINARY
Completion latency    | 19 cycles (FP) / 17 cycles (INT) | A2A3_COMPL_FP_BINOP / A2A3_COMPL_INT_BINOP
Per-repeat throughput | 2 cycles                         | A2A3_RPT_2
Pipeline interval     | 18 cycles                        | A2A3_INTERVAL
Cycle model           | 14 + C + 2R + (R - 1) × 18       | C = completion latency, R = repeats

Repeat calculation: R = validRow × validCol / 8 (assuming 8 elements per repeat, RowMajor layout).
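The cycle model and repeat calculation can be written out as a small function for quick estimates. This is a sketch under the assumptions already stated in the text (8 elements per repeat, RowMajor layout); the constant values are copied from the table, tadd_cycles_a2a3 is a hypothetical helper rather than code from TBinOp.hpp, and the sketch only handles element counts that are a whole number of repeats.

```cpp
#include <cassert>
#include <cstdint>

// Values copied from the performance table; this is not the TBinOp.hpp source.
constexpr std::uint64_t A2A3_STARTUP_BINARY = 14;
constexpr std::uint64_t A2A3_COMPL_FP_BINOP = 19;
constexpr std::uint64_t A2A3_COMPL_INT_BINOP = 17;
constexpr std::uint64_t A2A3_RPT_2 = 2;
constexpr std::uint64_t A2A3_INTERVAL = 18;
constexpr std::uint64_t ELEMS_PER_REPEAT = 8;  // assumption stated in the text

// Estimated cycles = startup + C + 2R + (R - 1) * interval.
std::uint64_t tadd_cycles_a2a3(std::uint64_t validRow, std::uint64_t validCol,
                               bool isFloat) {
    const std::uint64_t elems = validRow * validCol;
    // Sketch limitation: only whole repeats are modelled.
    assert(elems > 0 && elems % ELEMS_PER_REPEAT == 0);
    const std::uint64_t R = elems / ELEMS_PER_REPEAT;
    const std::uint64_t C = isFloat ? A2A3_COMPL_FP_BINOP : A2A3_COMPL_INT_BINOP;
    return A2A3_STARTUP_BINARY + C + A2A3_RPT_2 * R + (R - 1) * A2A3_INTERVAL;
}
```

Calling tadd_cycles_a2a3(16, 64, true) gives 2575, and the integer variant of the same shape gives 2573 because only the completion latency differs.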

Shape-Dependent Optimizations

The performance model applies different instruction sequences based on valid region geometry:

Shape                  | Optimization Path    | Instruction Sequence
Small tiles (few rows) | Bin1LNormModeSmall   | 1 repeat covers entire row
Col-aligned (RowMajor) | Continuous fast path | vadd(repeats, 1, 1, 1, 8, 8, 8)
Misaligned             | General path         | stride-dependent repeats

Layout Impact on Throughput

Layout   | Stride Pattern                  | Cost Impact
RowMajor | src: (1, cols), dst: (1, cols)  | Best: continuous fast path available
ColMajor | src: (rows, 1), dst: (rows, 1)  | General path: higher repeat count
Zigzag   | non-linear strides              | General path only

Example Throughput Estimate

For TADD on a 16×64 FP32 tile with RowMajor layout:

validRow = 16, validCol = 64, layout = RowMajor
repeats = 16 × 64 / 8 = 128
total = 14 + 19 + 256 + (128-1) × 18 = 14 + 19 + 256 + 2286 = 2575 cycles

Examples

C++ — Auto Mode

#include <pto/pto-inst.hpp>
using namespace pto;

void add_tiles(Tile<Vec, float, 16, 16>& dst,
               Tile<Vec, float, 16, 16>& src0,
               Tile<Vec, float, 16, 16>& src1) {
    // Compiler inserts TASSIGN and TSYNC automatically in Auto mode.
    TADD(dst, src0, src1);
}

C++ — Manual Mode

#include <pto/pto-inst.hpp>
using namespace pto;

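// ga, gb and gc are global-memory tensors supplied by the caller;
// their types are left generic here.
template <typename GlobalA, typename GlobalB, typename GlobalC>
void add_tiles_manual(Tile<Vec, float, 16, 16>& dst,
                      Tile<Vec, float, 16, 16>& src0,
                      Tile<Vec, float, 16, 16>& src1,
                      GlobalA& ga, GlobalB& gb, GlobalC& gc) {
    // Bind each tile to an explicit buffer address.
    TASSIGN(src0, 0x1000);
    TASSIGN(src1, 0x2000);
    TASSIGN(dst,  0x3000);
    // Issue both loads, then wait on their events before adding.
    RecordEvent e0 = TLOAD(src0, ga);
    RecordEvent e1 = TLOAD(src1, gb);
    TSYNC(e0, e1);
    TADD(dst, src0, src1);
    // Synchronise before storing the result.
    TSYNC();
    TSTORE(gc, dst);
}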

MLIR — SSA Form

%result = pto.tadd %src0, %src1 : (!pto.tile<f32, 16, 16>, !pto.tile<f32, 16, 16>) -> !pto.tile<f32, 16, 16>

MLIR — DPS Form

pto.tadd ins(%src0, %src1 : !pto.tile_buf<f32, 16, 16>, !pto.tile_buf<f32, 16, 16>)
          outs(%result : !pto.tile_buf<f32, 16, 16>)