TCOLARGMAX¶
Tile Operation Diagram¶
Introduction¶
Get the row index of the maximum element for each column.
Math Interpretation¶
Let R = src.GetValidRow() and C = src.GetValidCol(). For 0 <= j < C:
$$ \mathrm{dst}_{0,j} = \underset{0 \le i < R}{\operatorname{argmax}} \; \mathrm{src}_{i,j} $$
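The per-column reduction described here can be illustrated with a scalar reference. This is only a sketch for checking results on the host, not the NPU implementation: the function name `col_argmax_ref`, the flat row-major buffer, and the `stride` parameter are hypothetical, and resolving ties to the smallest row index is an assumption that this page does not confirm.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference for a column-wise argmax over a row-major buffer.
// `src` holds at least `rows * stride` elements; only the first `cols`
// entries of each row are treated as valid (the R x C valid region).
// Assumption: ties resolve to the smallest row index.
template <typename T>
std::vector<std::uint32_t> col_argmax_ref(const std::vector<T>& src,
                                          std::size_t rows,
                                          std::size_t cols,
                                          std::size_t stride) {
    assert(rows > 0 && cols > 0 && cols <= stride);
    assert(src.size() >= rows * stride);
    std::vector<std::uint32_t> dst(cols, 0);
    for (std::size_t j = 0; j < cols; ++j) {
        T best = src[j];  // row 0 is the initial candidate
        for (std::size_t i = 1; i < rows; ++i) {
            const T v = src[i * stride + j];
            if (v > best) {  // strict '>' keeps the earliest row on ties
                best = v;
                dst[j] = static_cast<std::uint32_t>(i);
            }
        }
    }
    return dst;
}
```

Calling `col_argmax_ref(data, R, C, stride)` returns one row index per valid column, which matches the 1 x C shape required of dst.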
Assembly Syntax¶
PTO-AS form: see docs/grammar/PTO-AS.md.
Synchronous form:
%dst = tcolargmax %src : !pto.tile<...> -> !pto.tile<...>
Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit tmp operand.
IR Level 1 (SSA)¶
%dst = pto.tcolargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
IR Level 2 (DPS)¶
pto.tcolargmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
C++ Intrinsic¶
Declared in include/pto/common/pto_instr.hpp:
template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
PTO_INST RecordEvent TCOLARGMAX(TileDataOut& dst, TileDataIn& src, TileDataTmp& tmp, WaitEvents&... events);
Constraints¶
Implementation checks (NPU):
- A2A3:
  - Tile location: dst and src must be TileType::Vec.
  - Tile layout of src: ND fractal (isRowMajor and SLayout::NoneBox).
  - Tile layout of dst: ND fractal (isRowMajor and SLayout::NoneBox).
  - Source data types: half, float, uint16_t, uint32_t.
  - Destination data types: uint32_t or int32_t.
  - tmp data type must be consistent with src data type.
  - Compile-time check: src.ValidCol must be 1 or -1 (dynamic).
  - Runtime valid checks:
    - srcValidCol != 0 and srcValidRow != 0.
    - dstValidRow == 1.
    - srcValidCol == dstValidCol.
- A5:
  - Tile location: dst and src must be TileType::Vec.
  - Tile layout of src: ND fractal (isRowMajor and SLayout::NoneBox).
  - Tile layout of dst: ND fractal (isRowMajor and SLayout::NoneBox).
  - Source data types: int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, float.
  - Destination data types: uint32_t or int32_t.
  - Compile-time check: src.ValidCol must be 1 or -1 (dynamic).
  - Runtime valid checks:
    - srcValidCol != 0 and srcValidRow != 0.
    - dstValidRow == 1.
    - srcValidCol == dstValidCol.
  - The tmp temporary tile is not used; it is kept only for compatibility.
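The runtime valid checks shared by both targets can be mirrored on the host before issuing the instruction. The following is a minimal sketch under stated assumptions: `ValidShape` and `tcolargmax_shapes_ok` are hypothetical names introduced for illustration and are not part of the PTO headers.

```cpp
#include <cstdint>

// Hypothetical plain struct standing in for a tile's runtime valid shape.
struct ValidShape {
    std::int32_t validRow;
    std::int32_t validCol;
};

// Returns true when the runtime valid-shape checks listed above hold.
inline bool tcolargmax_shapes_ok(const ValidShape& src, const ValidShape& dst) {
    if (src.validCol == 0 || src.validRow == 0) return false;  // source must be non-empty
    if (dst.validRow != 1) return false;                       // one output row of indices
    return src.validCol == dst.validCol;                       // one index per source column
}
```

For the shapes used in the examples on this page, `tcolargmax_shapes_ok({16, 255}, {1, 255})` passes, while a dst with a different valid column count is rejected.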
About temporary tile tmp for A2A3¶
- tmp is always used in the A2A3 implementation as scratch space for intermediate results (current index, argmax index, and current max elements).
- The tmp tile's data type must be the same as src's data type.
- The tmp tile is organized into three regions within a single row:
  - Region 0 ([0, tmpGapEles)): current row index counter (incremented per row).
  - Region 1 ([tmpGapEles, 2 * tmpGapEles)): current maximum elements for comparison.
  - Region 2 ([2 * tmpGapEles, 3 * tmpGapEles)): argmax index result (before final conversion to dst).
- tmpGapEles is determined as follows:
  - When srcValidCol >= elemPerRpt: tmpGapEles = elemPerRpt.
  - When srcValidCol < elemPerRpt: tmpGapEles = ceil(srcValidCol / elemPerBlock) * elemPerBlock.
- Simply set the tmp tile size the same as src when src is small, or calculate the required stride based on src's validCol using the following formula:
repeats = ceil(validCol / elementPerRepeat)
stride = ceil(repeats * 2 / elementPerBlock) * elementPerBlock + ceil(repeats / elementPerBlock) * elementPerBlock
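The sizing rules above can be written out as small host-side helpers. This is a sketch only: the helper names (`ceil_div`, `tmp_gap_eles`, `tmp_regions`, `tmp_stride`) and the `TmpRegions` struct are hypothetical, and the per-repeat and per-block element counts are passed in as parameters because this page does not state their values. The figures 64 and 8 in the usage note assume a 256-byte repeat and a 32-byte block with a 4-byte element type.

```cpp
#include <cassert>
#include <cstddef>

// Integer ceiling division; the divisor must be positive.
inline std::size_t ceil_div(std::size_t a, std::size_t b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

// tmpGapEles: a full repeat when the valid width covers one,
// otherwise the valid width rounded up to a whole block.
inline std::size_t tmp_gap_eles(std::size_t srcValidCol,
                                std::size_t elemPerRpt,
                                std::size_t elemPerBlock) {
    if (srcValidCol >= elemPerRpt) return elemPerRpt;
    return ceil_div(srcValidCol, elemPerBlock) * elemPerBlock;
}

// Element offsets of the three scratch regions inside tmp's single row.
struct TmpRegions {
    std::size_t indexBegin;   // Region 0: current row index counter
    std::size_t maxBegin;     // Region 1: current maximum elements
    std::size_t argmaxBegin;  // Region 2: argmax index result
    std::size_t end;          // one past the last scratch element
};

inline TmpRegions tmp_regions(std::size_t gap) {
    return TmpRegions{0, gap, 2 * gap, 3 * gap};
}

// Stride formula from the text, evaluated with integer arithmetic.
inline std::size_t tmp_stride(std::size_t validCol,
                              std::size_t elemPerRpt,
                              std::size_t elemPerBlock) {
    const std::size_t repeats = ceil_div(validCol, elemPerRpt);
    return ceil_div(repeats * 2, elemPerBlock) * elemPerBlock +
           ceil_div(repeats, elemPerBlock) * elemPerBlock;
}
```

Under the assumed values (64 elements per repeat, 8 per block), a valid width of 20 gives a gap of 24 elements and regions ending at element 72, and a valid width of 255 gives a stride of 16.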
About temporary tile tmp for A5¶
The tmp temporary tile is not used in the A5 implementation. A5 uses vector register-based computation (__VEC_SCOPE__) and does not require scratch tile storage. tmp is retained in the C++ intrinsic signature solely for API compatibility with A2A3.
Examples¶
Auto¶
#include <pto/pto-inst.hpp>
using namespace pto;
void example_auto() {
    using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
    using DstT = Tile<TileType::Vec, uint32_t, 1, 256, BLayout::RowMajor, -1, -1>;
    using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;
    SrcT src(16, 255);
    DstT dst(1, 255);
    TmpT tmp(1, 32);
    TCOLARGMAX(dst, src, tmp);
}
Manual¶
#include <pto/pto-inst.hpp>
using namespace pto;
void example_manual() {
    using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
    using DstT = Tile<TileType::Vec, uint32_t, 1, 256, BLayout::RowMajor, -1, -1>;
    using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;
    SrcT src(16, 255);
    DstT dst(1, 255);
    TmpT tmp(1, 32);
    TASSIGN(src, 0x0);
    TASSIGN(dst, 0x1000);
    TASSIGN(tmp, 0x2000);
    TCOLARGMAX(dst, src, tmp);
}
ASM Form Examples¶
Auto Mode¶
# Auto mode: compiler/runtime-managed placement and scheduling.
%dst = pto.tcolargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
Manual Mode¶
# Manual mode: bind resources explicitly before issuing the instruction.
# Optional for tile operands:
# pto.tassign %arg0, @tile(0x1000)
# pto.tassign %arg1, @tile(0x2000)
%dst = pto.tcolargmax %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
PTO Assembly Form¶
%dst = tcolargmax %src : !pto.tile<...> -> !pto.tile<...>
# IR Level 2 (DPS)
pto.tcolargmax ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)