Reduce And Expand Instruction Set

Reduce operations collapse a 2D tile along one axis into a 1D result (or a tile with reduced extent along that axis). Expand operations broadcast a 1D tile along one axis to produce a 2D tile.

Operations

Reduce (Row)

Operation Description C++ Intrinsic
pto.trowsum Sum reduction along rows TROWSUM(dst, src, tmp)
pto.trowprod Product reduction along rows TROWPROD(dst, src, tmp)
pto.trowmax Maximum reduction along rows TROWMAX(dst, src, tmp)
pto.trowmin Minimum reduction along rows TROWMIN(dst, src, tmp)
pto.trowargmax Index of maximum along rows TROWARGMAX(dst, src, tmp)
pto.trowargmin Index of minimum along rows TROWARGMIN(dst, src, tmp)

Reduce (Column)

Operation Description C++ Intrinsic
pto.tcolsum Sum reduction along columns TCOLSUM(dst, src)
pto.tcolprod Product reduction along columns TCOLPROD(dst, src)
pto.tcolmax Maximum reduction along columns TCOLMAX(dst, src)
pto.tcolmin Minimum reduction along columns TCOLMIN(dst, src)
pto.tcolargmax Index, or value and index, of maximum along columns TCOLARGMAX(dst, src, tmp) / TCOLARGMAX(dstVal, dstIdx, src, tmp)
pto.tcolargmin Index, or value and index, of minimum along columns TCOLARGMIN(dst, src, tmp) / TCOLARGMIN(dstVal, dstIdx, src, tmp)

Expand (Row)

Operation Description C++ Intrinsic
pto.trowexpand Expand row scalar to full tile TROWEXPAND(dst, src)
pto.trowexpandadd Expand row and add TROWEXPANDADD(dst, src0, src1)
pto.trowexpandsub Expand row and subtract TROWEXPSUB(dst, src0, src1)
pto.trowexpandmul Expand row and multiply TROWEXPMUL(dst, src0, src1)
pto.trowexpanddiv Expand row and divide TROWEXPDIV(dst, src0, src1)
pto.trowexpandmax Expand row and max TROWEXPANDMAX(dst, src0, src1)
pto.trowexpandmin Expand row and min TROWEXPANDMIN(dst, src0, src1)
pto.trowexpandexpdif Expand with exponential difference TROWEXPDIF(dst, src0, src1)

Expand (Column)

Operation Description C++ Intrinsic
pto.tcolexpand Expand column scalar to full tile TCOLEXPAND(dst, src)
pto.tcolexpandadd Expand column and add TCOLEXPANDADD(dst, src0, src1)
pto.tcolexpandsub Expand column and subtract TCOLEXPSUB(dst, src0, src1)
pto.tcolexpandmul Expand column and multiply TCOLEXPMUL(dst, src0, src1)
pto.tcolexpanddiv Expand column and divide TCOLEXPDIV(dst, src0, src1)
pto.tcolexpandmax Expand column and max TCOLEXPANDMAX(dst, src0, src1)
pto.tcolexpandmin Expand column and min TCOLEXPANDMIN(dst, src0, src1)
pto.tcolexpandexpdif Expand with exponential difference TCOLEXPDIF(dst, src0, src1)

Mechanism

Reduce

For each row r, reduce along the column axis:

\[ \mathrm{dst}_r = \bigoplus_{c=0}^{C-1} \mathrm{src}_{r,c} \]

For each column c, reduce along the row axis:

\[ \mathrm{dst}_c = \bigoplus_{r=0}^{R-1} \mathrm{src}_{r,c} \]

where \(\bigoplus\) is the reduction operator (sum, max, min, prod).

Expand

Expand takes a 1D tile of shape (R) or (C) and broadcasts it to a 2D tile of shape (R, C):

\[ \mathrm{dst}_{r,c} = \mathrm{src}_r \quad \text{(row expand)} \]
\[ \mathrm{dst}_{r,c} = \mathrm{src}_c \quad \text{(column expand)} \]

Expand variants combine the broadcast with an elementwise operation using a second source tile:

\[ \mathrm{dst}_{r,c} = \mathrm{src0}_{r,c} \;\oplus\; \mathrm{src1}_r \quad \text{(row expand with op)} \]

Output Shape

Operation Input Shape Output Shape
Row reduce (R, C) (R, 1)
Column reduce (R, C) (1, C)
Row expand (R, 1) (R, C)
Column expand (1, C) (R, C)

Type Support by Target Profile

Element Type CPU Simulator A2/A3 A5
f32 (float) Yes Yes Yes
f16 (half) Yes Yes Yes
bf16 (bfloat16_t) Yes Yes Yes
i8 / u8 Yes Yes Yes
i16 / u16 Yes Yes Yes
i32 / u32 Yes Yes Yes
i64 / u64 Yes Yes Yes

Constraints

Constraints

  • The source tile's valid region determines the reduction domain.
  • Arg variants (*_argmax, *_argmin) produce an integer index tile, not a numeric value tile.
  • The destination tile for reduce operations has extent 1 along the reduced axis.
  • Expand variants require a second source tile with shape (R) or (C) matching the expand axis.
  • Exp-diff variants compute: dst = exp(src0 - src1) — used for softmax-style reductions.

Cases That Are Not Allowed

Cases That Are Not Allowed

  • MUST NOT reduce along an axis with zero extent.
  • MUST NOT use arg variants with non-numeric element types.
  • MUST NOT use expand variants with mismatched expand-axis lengths.

C++ Intrinsic

#include <pto/pto-inst.hpp>
using namespace pto;

// Row reduce (requires temporary tile)
template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
PTO_INST RecordEvent TROWSUM(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);

template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
PTO_INST RecordEvent TROWPROD(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);

template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
PTO_INST RecordEvent TROWMAX(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);

template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
PTO_INST RecordEvent TROWMIN(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);

template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
PTO_INST RecordEvent TROWARGMAX(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);

template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
PTO_INST RecordEvent TROWARGMIN(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);

// Column reduce
template <typename TileDst, typename TileSrc, typename... WaitEvents>
PTO_INST RecordEvent TCOLSUM(TileDst& dst, TileSrc& src, WaitEvents&... events);

template <typename TileDst, typename TileSrc, typename... WaitEvents>
PTO_INST RecordEvent TCOLPROD(TileDst& dst, TileSrc& src, WaitEvents&... events);

template <typename TileDst, typename TileSrc, typename... WaitEvents>
PTO_INST RecordEvent TCOLMAX(TileDst& dst, TileSrc& src, WaitEvents&... events);

template <typename TileDst, typename TileSrc, typename... WaitEvents>
PTO_INST RecordEvent TCOLMIN(TileDst& dst, TileSrc& src, WaitEvents&... events);

template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
PTO_INST RecordEvent TCOLARGMAX(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);

template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
PTO_INST RecordEvent TCOLARGMIN(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);

// Row expand
template <typename TileDst, typename TileSrc, typename... WaitEvents>
PTO_INST RecordEvent TROWEXPAND(TileDst& dst, TileSrc& src, WaitEvents&... events);

template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TROWEXPANDADD(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);

template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TROWEXPSUB(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);

template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TROWEXPMUL(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);

template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TROWEXPDIV(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);

template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TROWEXPANDMAX(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);

template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TROWEXPANDMIN(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);

template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TROWEXPDIF(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);

// Column expand
template <typename TileDst, typename TileSrc, typename... WaitEvents>
PTO_INST RecordEvent TCOLEXPAND(TileDst& dst, TileSrc& src, WaitEvents&... events);

template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TCOLEXPANDADD(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);

template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TCOLEXPSUB(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);

template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TCOLEXPMUL(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);

template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TCOLEXPDIV(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);

template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TCOLEXPANDMAX(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);

template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TCOLEXPANDMIN(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);

template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TCOLEXPDIF(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);

Throughput and Latency (A2/A3)

Reduce and expand operations are compiled to CCE vector instructions. The performance model is defined in include/pto/costmodel/a2a3/.

Row Reduction Throughput and Latency (TROWSUM / TROWMAX / TROWMIN)

Row reductions compile to sequences of vcgadd/vcmax/vcmin followed by vadd/vmax/vmin and a final vcadd/vcmax/vcmin with a PIPE_V barrier.

Cycle model:

total = startup + sum(completion_i) + sum(repeats_i × per_repeat_i) + sum((repeats_i - 1) × interval)

Key parameters:

Metric Value Constant
Startup 13 A2A3_STARTUP_REDUCE
Completion (FP32) 19 A2A3_COMPL_FP_BINOP
Completion (INT32) 19 A2A3_COMPL_FP_BINOP
Completion (INT16) 17 A2A3_COMPL_INT_BINOP
Per-repeat (FP32/INT) 2 A2A3_RPT_2
Pipeline interval 18 A2A3_INTERVAL

Special shape optimizations (FP32, hardcoded compile-time branches in TRowReduceOp.hpp):

Valid Shape Instruction Sequence
64×128 vcgadd128 → vadd8 → vcgadd*8 → PIPE_V
32×256 vcgadd128 → vadd8 → vadd4 → vcgadd4 → PIPE_V
16×512 vcgadd128 → vcgadd16 → vcgadd*2 → PIPE_V
8×1024 vcgadd128 → vcgadd16 → vadd8 → vcgadd8 → PIPE_V

General shape algorithm (non-FP32 or non-special shapes): 1. Fill tmp tile: copy_ubuf_to_ubuf (if validCol >= 2 × elementsPerRpt) 2. Loop-fill tmp: vadd/vmax/vmin per row 3. Handle tail mask if needed 4. Merge tmp: vadd/vmax/vmin per row 5. Final reduction: vcadd/vcmax/vcmin + PIPE_V

Column Reduction Throughput and Latency (TCOLSUM / TCOLMAX)

Binary path (validRow >= 2): Each iteration processes 2 rows using mask(0, elementsPerLine) + vadd/vmax/vmin with repeats = 1, blockStride = 1, repeatStride = 8. Iterates cnt/2 times.

Sequential path: Each row is added to dst one at a time (SequentialSum).

Row Expand Throughput and Latency (TROWEXPAND)

Broadcasts a row to all rows of a tile.

Broadcast path (preferred): Uses vbrcb instruction, repeats = ceil(Numel / 8).

General path: vector_dup(BLOCK_MAX_PER_REPEAT) per row.

Throughput and Latency Testing

Tests are in tests/costmodel/trowsum_kernel.cpp, tcolsum_kernel.cpp, etc. Validation: error < 1% vs cycle-accurate profiling. Run via tests/run_costmodel.py --testcase <name>.


See Also