Reduce And Expand Instruction Set¶
Reduce operations collapse a 2D tile along one axis into a 1D result (or a tile with reduced extent along that axis). Expand operations broadcast a 1D tile along one axis to produce a 2D tile.
Operations¶
Reduce (Row)¶
| Operation | Description | C++ Intrinsic |
|---|---|---|
| pto.trowsum | Sum reduction along rows | TROWSUM(dst, src, tmp) |
| pto.trowprod | Product reduction along rows | TROWPROD(dst, src, tmp) |
| pto.trowmax | Maximum reduction along rows | TROWMAX(dst, src, tmp) |
| pto.trowmin | Minimum reduction along rows | TROWMIN(dst, src, tmp) |
| pto.trowargmax | Index of maximum along rows | TROWARGMAX(dst, src, tmp) |
| pto.trowargmin | Index of minimum along rows | TROWARGMIN(dst, src, tmp) |
Reduce (Column)¶
| Operation | Description | C++ Intrinsic |
|---|---|---|
| pto.tcolsum | Sum reduction along columns | TCOLSUM(dst, src) |
| pto.tcolprod | Product reduction along columns | TCOLPROD(dst, src) |
| pto.tcolmax | Maximum reduction along columns | TCOLMAX(dst, src) |
| pto.tcolmin | Minimum reduction along columns | TCOLMIN(dst, src) |
| pto.tcolargmax | Index, or value and index, of maximum along columns | TCOLARGMAX(dst, src, tmp) / TCOLARGMAX(dstVal, dstIdx, src, tmp) |
| pto.tcolargmin | Index, or value and index, of minimum along columns | TCOLARGMIN(dst, src, tmp) / TCOLARGMIN(dstVal, dstIdx, src, tmp) |
Expand (Row)¶
| Operation | Description | C++ Intrinsic |
|---|---|---|
| pto.trowexpand | Expand row scalar to full tile | TROWEXPAND(dst, src) |
| pto.trowexpandadd | Expand row and add | TROWEXPANDADD(dst, src0, src1) |
| pto.trowexpandsub | Expand row and subtract | TROWEXPSUB(dst, src0, src1) |
| pto.trowexpandmul | Expand row and multiply | TROWEXPMUL(dst, src0, src1) |
| pto.trowexpanddiv | Expand row and divide | TROWEXPDIV(dst, src0, src1) |
| pto.trowexpandmax | Expand row and max | TROWEXPANDMAX(dst, src0, src1) |
| pto.trowexpandmin | Expand row and min | TROWEXPANDMIN(dst, src0, src1) |
| pto.trowexpandexpdif | Expand with exponential difference | TROWEXPDIF(dst, src0, src1) |
Expand (Column)¶
| Operation | Description | C++ Intrinsic |
|---|---|---|
| pto.tcolexpand | Expand column scalar to full tile | TCOLEXPAND(dst, src) |
| pto.tcolexpandadd | Expand column and add | TCOLEXPANDADD(dst, src0, src1) |
| pto.tcolexpandsub | Expand column and subtract | TCOLEXPSUB(dst, src0, src1) |
| pto.tcolexpandmul | Expand column and multiply | TCOLEXPMUL(dst, src0, src1) |
| pto.tcolexpanddiv | Expand column and divide | TCOLEXPDIV(dst, src0, src1) |
| pto.tcolexpandmax | Expand column and max | TCOLEXPANDMAX(dst, src0, src1) |
| pto.tcolexpandmin | Expand column and min | TCOLEXPANDMIN(dst, src0, src1) |
| pto.tcolexpandexpdif | Expand with exponential difference | TCOLEXPDIF(dst, src0, src1) |
Mechanism¶
Reduce¶
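For each row r, reduce along the column axis:

\[ \mathrm{dst}_{r,0} = \bigoplus_{c=0}^{C-1} \mathrm{src}_{r,c} \]

For each column c, reduce along the row axis:

\[ \mathrm{dst}_{0,c} = \bigoplus_{r=0}^{R-1} \mathrm{src}_{r,c} \]

where \(\bigoplus\) is the reduction operator (sum, max, min, prod).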
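The reduce semantics can be checked against a scalar host-side reference. The sketch below is illustrative only: `ref_row_reduce` and `ref_col_reduce` are hypothetical helper names, not part of the PTO API, and the tile is assumed to be a dense row-major buffer with no padding.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference row reduce over a row-major (rows x cols) buffer.
// Returns one value per row, i.e. the (R, 1) result.
// Hypothetical helper for illustration; not a PTO intrinsic.
template <typename Op>
std::vector<float> ref_row_reduce(const std::vector<float>& src,
                                  std::size_t rows, std::size_t cols, Op op) {
    assert(rows > 0 && cols > 0);  // zero-extent axes are not allowed
    assert(src.size() == rows * cols);
    std::vector<float> dst(rows);
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = src[r * cols];  // seed with the first element of the row
        for (std::size_t c = 1; c < cols; ++c) {
            acc = op(acc, src[r * cols + c]);
        }
        dst[r] = acc;
    }
    return dst;
}

// Reference column reduce: one value per column, i.e. the (1, C) result.
template <typename Op>
std::vector<float> ref_col_reduce(const std::vector<float>& src,
                                  std::size_t rows, std::size_t cols, Op op) {
    assert(rows > 0 && cols > 0);
    assert(src.size() == rows * cols);
    std::vector<float> dst(cols);
    for (std::size_t c = 0; c < cols; ++c) {
        float acc = src[c];  // seed with the first element of the column
        for (std::size_t r = 1; r < rows; ++r) {
            acc = op(acc, src[r * cols + c]);
        }
        dst[c] = acc;
    }
    return dst;
}
```

For a 2×3 tile with rows {1, 2, 3} and {4, 5, 6}, a sum row reduce yields {6, 15} and a max column reduce yields {4, 5, 6}.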
Expand¶
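Expand takes a 1D tile of shape (R) or (C) and broadcasts it to a 2D tile of shape (R, C):

\[ \mathrm{dst}_{r,c} = \mathrm{src}_{r} \quad \text{(row expand)}, \qquad \mathrm{dst}_{r,c} = \mathrm{src}_{c} \quad \text{(column expand)} \]

Expand variants combine the broadcast with an elementwise operation \(\odot\) using a second source tile:

\[ \mathrm{dst}_{r,c} = \mathrm{src0}_{r,c} \odot \mathrm{src1}_{r} \quad \text{(row)}, \qquad \mathrm{dst}_{r,c} = \mathrm{src0}_{r,c} \odot \mathrm{src1}_{c} \quad \text{(column)} \]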
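A scalar sketch of the expand semantics follows. The helper names are hypothetical and not part of the PTO API. The sketch assumes that `src0` is the full (R, C) tile and `src1` is the 1D tile being broadcast, which matches the constraint that the second source has shape (R) or (C) but should be confirmed against the per-instruction pages.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row expand: dst[r][c] = src[r]; src holds one value per row.
std::vector<float> ref_row_expand(const std::vector<float>& src, std::size_t cols) {
    assert(!src.empty() && cols > 0);
    std::vector<float> dst(src.size() * cols);
    for (std::size_t r = 0; r < src.size(); ++r) {
        for (std::size_t c = 0; c < cols; ++c) dst[r * cols + c] = src[r];
    }
    return dst;
}

// Column expand: dst[r][c] = src[c]; src holds one value per column.
std::vector<float> ref_col_expand(const std::vector<float>& src, std::size_t rows) {
    assert(!src.empty() && rows > 0);
    const std::size_t cols = src.size();
    std::vector<float> dst(rows * cols);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) dst[r * cols + c] = src[c];
    }
    return dst;
}

// Row expand fused with an elementwise op: dst[r][c] = op(src0[r][c], src1[r]).
// Assumption: src0 is the 2D tile and src1 is the per-row 1D tile.
template <typename Op>
std::vector<float> ref_row_expand_op(const std::vector<float>& src0,
                                     std::size_t rows, std::size_t cols,
                                     const std::vector<float>& src1, Op op) {
    assert(src0.size() == rows * cols);
    assert(src1.size() == rows);  // expand-axis lengths must match
    std::vector<float> dst(rows * cols);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            dst[r * cols + c] = op(src0[r * cols + c], src1[r]);
        }
    }
    return dst;
}
```

With a 2×3 tile holding rows {1, 2, 3} and {4, 5, 6}, subtracting the per-row values {1, 4} gives {0, 1, 2} in both rows.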
Output Shape¶
| Operation | Input Shape | Output Shape |
|---|---|---|
| Row reduce | (R, C) | (R, 1) |
| Column reduce | (R, C) | (1, C) |
| Row expand | (R, 1) | (R, C) |
| Column expand | (1, C) | (R, C) |
Type Support by Target Profile¶
| Element Type | CPU Simulator | A2/A3 | A5 |
|---|---|---|---|
| f32 (float) | Yes | Yes | Yes |
| f16 (half) | Yes | Yes | Yes |
| bf16 (bfloat16_t) | Yes | Yes | Yes |
| i8 / u8 | Yes | Yes | Yes |
| i16 / u16 | Yes | Yes | Yes |
| i32 / u32 | Yes | Yes | Yes |
| i64 / u64 | Yes | Yes | Yes |
Constraints¶
- The source tile's valid region determines the reduction domain.
- Arg variants (*_argmax, *_argmin) produce an integer index tile, not a numeric value tile.
- The destination tile for reduce operations has extent 1 along the reduced axis.
- Expand variants require a second source tile with shape (R) or (C) matching the expand axis.
- Exp-diff variants compute dst = exp(src0 - src1), which is used for softmax-style reductions.
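To show why the exp-diff variant suits softmax-style reductions, here is a scalar reference for a row-wise softmax. It is a sketch of the arithmetic only: `ref_row_softmax` is a hypothetical name, and the comments that map each step to a PTO operation describe the role that operation would play, not a verified instruction sequence.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable row-wise softmax over a row-major (rows x cols) buffer.
// Hypothetical host-side reference; not a PTO intrinsic.
std::vector<float> ref_row_softmax(const std::vector<float>& src,
                                   std::size_t rows, std::size_t cols) {
    assert(rows > 0 && cols > 0);
    assert(src.size() == rows * cols);
    std::vector<float> dst(src.size());
    for (std::size_t r = 0; r < rows; ++r) {
        const float* row = src.data() + r * cols;
        float* out = dst.data() + r * cols;

        // Step 1: per-row maximum (the role of a row max reduce).
        const float row_max = *std::max_element(row, row + cols);

        // Step 2: exp(src0 - src1) with the row maximum broadcast along the
        // row (the role of the exp-diff expand variant).
        for (std::size_t c = 0; c < cols; ++c) out[c] = std::exp(row[c] - row_max);

        // Step 3: per-row sum of the exponentials (the role of a row sum reduce).
        float row_sum = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) row_sum += out[c];

        // Step 4: divide by the broadcast row sum (the role of expand-and-divide).
        for (std::size_t c = 0; c < cols; ++c) out[c] /= row_sum;
    }
    return dst;
}
```

Subtracting the row maximum before exponentiating keeps every exponent at or below zero, so large inputs such as 1000 do not overflow.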
Cases That Are Not Allowed¶
- MUST NOT reduce along an axis with zero extent.
- MUST NOT use arg variants with non-numeric element types.
- MUST NOT use expand variants with mismatched expand-axis lengths.
C++ Intrinsic¶
#include <pto/pto-inst.hpp>
using namespace pto;
// Row reduce (requires temporary tile)
template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
PTO_INST RecordEvent TROWSUM(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);
template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
PTO_INST RecordEvent TROWPROD(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);
template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
PTO_INST RecordEvent TROWMAX(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);
template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
PTO_INST RecordEvent TROWMIN(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);
template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
PTO_INST RecordEvent TROWARGMAX(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);
template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
PTO_INST RecordEvent TROWARGMIN(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);
// Column reduce
template <typename TileDst, typename TileSrc, typename... WaitEvents>
PTO_INST RecordEvent TCOLSUM(TileDst& dst, TileSrc& src, WaitEvents&... events);
template <typename TileDst, typename TileSrc, typename... WaitEvents>
PTO_INST RecordEvent TCOLPROD(TileDst& dst, TileSrc& src, WaitEvents&... events);
template <typename TileDst, typename TileSrc, typename... WaitEvents>
PTO_INST RecordEvent TCOLMAX(TileDst& dst, TileSrc& src, WaitEvents&... events);
template <typename TileDst, typename TileSrc, typename... WaitEvents>
PTO_INST RecordEvent TCOLMIN(TileDst& dst, TileSrc& src, WaitEvents&... events);
template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
PTO_INST RecordEvent TCOLARGMAX(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);
template <typename TileDst, typename TileSrc, typename TileTmp, typename... WaitEvents>
PTO_INST RecordEvent TCOLARGMIN(TileDst& dst, TileSrc& src, TileTmp& tmp, WaitEvents&... events);
// Row expand
template <typename TileDst, typename TileSrc, typename... WaitEvents>
PTO_INST RecordEvent TROWEXPAND(TileDst& dst, TileSrc& src, WaitEvents&... events);
template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TROWEXPANDADD(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TROWEXPSUB(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TROWEXPMUL(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TROWEXPDIV(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TROWEXPANDMAX(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TROWEXPANDMIN(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TROWEXPDIF(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
// Column expand
template <typename TileDst, typename TileSrc, typename... WaitEvents>
PTO_INST RecordEvent TCOLEXPAND(TileDst& dst, TileSrc& src, WaitEvents&... events);
template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TCOLEXPANDADD(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TCOLEXPSUB(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TCOLEXPMUL(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TCOLEXPDIV(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TCOLEXPANDMAX(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TCOLEXPANDMIN(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
template <typename TileDst, typename TileSrc0, typename TileSrc1, typename... WaitEvents>
PTO_INST RecordEvent TCOLEXPDIF(TileDst& dst, TileSrc0& src0, TileSrc1& src1, WaitEvents&... events);
Throughput and Latency (A2/A3)¶
Reduce and expand operations are compiled to CCE vector instructions. The performance model is defined in include/pto/costmodel/a2a3/.
Row Reduction Throughput and Latency (TROWSUM / TROWMAX / TROWMIN)¶
Row reductions compile to sequences of vcgadd/vcmax/vcmin followed by vadd/vmax/vmin and a final vcadd/vcmax/vcmin with a PIPE_V barrier.
Cycle model:
total = startup + sum(completion_i) + sum(repeats_i × per_repeat_i) + sum((repeats_i - 1) × interval)
Key parameters:
| Metric | Value | Constant |
|---|---|---|
| Startup | 13 | A2A3_STARTUP_REDUCE |
| Completion (FP32) | 19 | A2A3_COMPL_FP_BINOP |
| Completion (INT32) | 19 | A2A3_COMPL_FP_BINOP |
| Completion (INT16) | 17 | A2A3_COMPL_INT_BINOP |
| Per-repeat (FP32/INT) | 2 | A2A3_RPT_2 |
| Pipeline interval | 18 | A2A3_INTERVAL |
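The cycle model can be evaluated directly from the formula and the constants in the table. The sketch below is an assumption-laden illustration: the `Stage` struct and `estimate_cycles` function are hypothetical, the constants are copied from the table rather than included from the cost model headers, and which instruction stages the real model enumerates for a given shape is not shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Values copied from the key-parameter table (names mirror the listed constants).
constexpr std::uint64_t kStartupReduce = 13;  // A2A3_STARTUP_REDUCE
constexpr std::uint64_t kComplFpBinop = 19;   // A2A3_COMPL_FP_BINOP
constexpr std::uint64_t kComplIntBinop = 17;  // A2A3_COMPL_INT_BINOP
constexpr std::uint64_t kRpt2 = 2;            // A2A3_RPT_2
constexpr std::uint64_t kInterval = 18;       // A2A3_INTERVAL

// One vector instruction in the emitted sequence (hypothetical struct).
struct Stage {
    std::uint64_t repeats;     // repeat count of the instruction
    std::uint64_t completion;  // completion latency for its data type
    std::uint64_t per_repeat;  // cycles per repeat
};

// total = startup + sum(completion_i) + sum(repeats_i * per_repeat_i)
//       + sum((repeats_i - 1) * interval)
std::uint64_t estimate_cycles(const std::vector<Stage>& stages) {
    std::uint64_t total = kStartupReduce;
    for (const Stage& s : stages) {
        assert(s.repeats >= 1);  // a stage with zero repeats is not modeled here
        total += s.completion;
        total += s.repeats * s.per_repeat;
        total += (s.repeats - 1) * kInterval;
    }
    return total;
}
```

For example, a single FP32 stage with 8 repeats evaluates to 13 + 19 + 8 × 2 + 7 × 18 = 174 cycles under this formula.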
Special shape optimizations (FP32, hardcoded compile-time branches in TRowReduceOp.hpp):
| Valid Shape | Instruction Sequence |
|---|---|
| 64×128 | vcgadd×128 → vadd×8 → vcgadd×8 → PIPE_V |
| 32×256 | vcgadd×128 → vadd×8 → vadd×4 → vcgadd×4 → PIPE_V |
| 16×512 | vcgadd×128 → vcgadd×16 → vcgadd×2 → PIPE_V |
| 8×1024 | vcgadd×128 → vcgadd×16 → vadd×8 → vcgadd×8 → PIPE_V |
General shape algorithm (non-FP32 or non-special shapes):
1. Fill tmp tile: copy_ubuf_to_ubuf (if validCol >= 2 × elementsPerRpt)
2. Loop-fill tmp: vadd/vmax/vmin per row
3. Handle tail mask if needed
4. Merge tmp: vadd/vmax/vmin per row
5. Final reduction: vcadd/vcmax/vcmin + PIPE_V
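The general idea behind these steps, accumulating a row lane by lane into a temporary and then reducing the temporary once, can be emulated in scalar code. The sketch below only loosely mirrors the listed steps for the sum case: `chunked_row_sum` and the `lanes` parameter (standing in for elementsPerRpt) are hypothetical, and the `copy_ubuf_to_ubuf` fill condition and separate merge step are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar emulation of a chunked row sum.
// `lanes` plays the role of elementsPerRpt (assumption for illustration).
float chunked_row_sum(const std::vector<float>& row, std::size_t lanes) {
    assert(!row.empty() && lanes > 0);
    const std::size_t valid_col = row.size();
    const std::size_t full_chunks = valid_col / lanes;
    const std::size_t tail = valid_col % lanes;

    // Temporary accumulator, one slot per lane; 0 is the identity for sum.
    // A max or min reduction would need a different initial value.
    std::vector<float> tmp(lanes, 0.0f);

    // Lane-wise accumulation of full chunks (the role of vadd into tmp).
    for (std::size_t k = 0; k < full_chunks; ++k) {
        for (std::size_t i = 0; i < lanes; ++i) tmp[i] += row[k * lanes + i];
    }

    // Tail chunk: only the first `tail` lanes are active (the role of the tail mask).
    for (std::size_t i = 0; i < tail; ++i) tmp[i] += row[full_chunks * lanes + i];

    // Final horizontal reduction of the accumulator (the role of vcadd).
    float acc = 0.0f;
    for (std::size_t i = 0; i < lanes; ++i) acc += tmp[i];
    return acc;
}
```

Summing the values 1 through 10 with 4 lanes processes two full chunks and a two-element tail, and returns 55.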
Column Reduction Throughput and Latency (TCOLSUM / TCOLMAX)¶
Binary path (validRow >= 2): Each iteration processes 2 rows using mask(0, elementsPerLine) + vadd/vmax/vmin with repeats = 1, blockStride = 1, repeatStride = 8. Iterates cnt/2 times.
Sequential path: Each row is added to dst one at a time (SequentialSum).
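The difference between the two paths can be illustrated with a scalar model. This is one plausible reading of the description, not the actual kernel: both function names are hypothetical, and the pairwise version folds the upper rows onto the lower rows until one row remains, which may differ from how the hardware path pairs rows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential path: add each row into the accumulator one at a time.
std::vector<float> col_sum_sequential(const std::vector<float>& src,
                                      std::size_t rows, std::size_t cols) {
    assert(rows > 0 && cols > 0 && src.size() == rows * cols);
    std::vector<float> dst(src.begin(), src.begin() + cols);  // start from row 0
    for (std::size_t r = 1; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) dst[c] += src[r * cols + c];
    }
    return dst;
}

// Pairwise path: each pass combines rows two at a time, roughly halving the
// number of live rows. Takes the buffer by value because it is overwritten.
std::vector<float> col_sum_pairwise(std::vector<float> work,
                                    std::size_t rows, std::size_t cols) {
    assert(rows > 0 && cols > 0 && work.size() == rows * cols);
    std::size_t cnt = rows;
    while (cnt > 1) {
        const std::size_t half = cnt / 2;  // number of row pairs in this pass
        for (std::size_t i = 0; i < half; ++i) {
            const std::size_t partner = cnt - half + i;  // row folded onto row i
            for (std::size_t c = 0; c < cols; ++c) {
                work[i * cols + c] += work[partner * cols + c];
            }
        }
        cnt -= half;  // an unpaired middle row is carried into the next pass
    }
    return std::vector<float>(work.begin(), work.begin() + cols);
}
```

Both functions return the same column sums for exactly representable inputs. With floating-point data the results can differ in the last bits because the order of additions differs.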
Row Expand Throughput and Latency (TROWEXPAND)¶
Broadcasts each row's scalar value across all columns of that row, turning an (R, 1) source into an (R, C) tile.
Broadcast path (preferred): Uses vbrcb instruction, repeats = ceil(Numel / 8).
General path: vector_dup(BLOCK_MAX_PER_REPEAT) per row.
Throughput and Latency Testing¶
Tests are in tests/costmodel/trowsum_kernel.cpp, tcolsum_kernel.cpp, etc. Validation: error < 1% vs cycle-accurate profiling. Run via tests/run_costmodel.py --testcase <name>.
See Also¶
- Tile instruction set — Instruction set overview
- Tile instruction set — Instruction Set description