TCOLARGMIN¶
Tile Operation Diagram¶
Introduction¶
Get the row index of the minimum element for each column.
Math Interpretation¶
Let R = src.GetValidRow() and C = src.GetValidCol(). For 0 <= j < C:
\[ \mathrm{dst}_{0,j} = \underset{0 \le i < R}{\operatorname{argmin}} \; \mathrm{src}_{i,j} \]
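The formula above can be read as a plain scalar loop. The following is a minimal host-side reference sketch, not the NPU implementation: the helper name `col_argmin_ref` and the flat row-major `std::vector` storage are illustrative assumptions, and ties are assumed to resolve to the smallest row index, which the text above does not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference for dst[0][j] = argmin_i src[i][j], with src stored row-major.
// Assumption: ties resolve to the smallest row index (strict '<' comparison).
// col_argmin_ref is a hypothetical helper, not part of the PTO API.
std::vector<std::uint32_t> col_argmin_ref(const std::vector<float>& src,
                                          std::size_t rows, std::size_t cols) {
    assert(rows > 0 && cols > 0 && src.size() == rows * cols);
    std::vector<std::uint32_t> dst(cols, 0);
    for (std::size_t j = 0; j < cols; ++j) {
        float best = src[j];  // row 0 seeds the running minimum for column j
        for (std::size_t i = 1; i < rows; ++i) {
            const float v = src[i * cols + j];
            if (v < best) {
                best = v;
                dst[j] = static_cast<std::uint32_t>(i);
            }
        }
    }
    return dst;
}
```

For the 2 x 3 input with rows {3, 1, 2} and {0, 5, 2}, this sketch returns {1, 0, 0}.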
Assembly Syntax¶
PTO-AS form: see docs/grammar/PTO-AS.md.
Synchronous form:
%dst = tcolargmin %src : !pto.tile<...> -> !pto.tile<...>
Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit tmp operand.
IR Level 1 (SSA)¶
%dst = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
IR Level 2 (DPS)¶
pto.tcolargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
C++ Intrinsic¶
Declared in include/pto/common/pto_instr.hpp:
template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
PTO_INST RecordEvent TCOLARGMIN(TileDataOut& dst, TileDataIn& src, TileDataTmp& tmp, WaitEvents&... events);
Constraints¶
Implementation checks (NPU):
- A2A3:
  - Tile location: dst and src must be TileType::Vec.
  - Tile layout of src: ND fractal (isRowMajor and SLayout::NoneBox).
  - Tile layout of dst: ND fractal (isRowMajor and SLayout::NoneBox).
  - Source data types: half, float, uint16_t, uint32_t.
  - Destination data types: uint32_t or int32_t.
  - tmp data type must be consistent with src data type.
  - Compile-time check: src.ValidCol must be 1 or -1 (dynamic).
  - Runtime valid checks:
    - srcValidCol != 0 and srcValidRow != 0.
    - dstValidRow == 1.
    - srcValidCol == dstValidCol.
- A5:
  - Tile location: dst and src must be TileType::Vec.
  - Tile layout of src: ND fractal (isRowMajor and SLayout::NoneBox).
  - Tile layout of dst: ND fractal (isRowMajor and SLayout::NoneBox).
  - Source data types: int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, float.
  - Destination data types: uint32_t or int32_t.
  - Compile-time check: src.ValidCol must be 1 or -1 (dynamic).
  - Runtime valid checks:
    - srcValidCol != 0 and srcValidRow != 0.
    - dstValidRow == 1.
    - srcValidCol == dstValidCol.
  - The tmp tile is not used; it is accepted only for compatibility.
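The runtime valid checks, which are the same for A2A3 and A5, can be sketched as a single predicate. This is only an illustration: `tcolargmin_runtime_valid` is a hypothetical helper that is not part of the PTO API, and passing the valid shapes as plain integers is an assumption.

```cpp
// Host-side sketch of the runtime valid checks listed above.
// Hypothetical helper, not part of the PTO API.
constexpr bool tcolargmin_runtime_valid(int srcValidRow, int srcValidCol,
                                        int dstValidRow, int dstValidCol) {
    return srcValidCol != 0 && srcValidRow != 0  // source must be non-empty
        && dstValidRow == 1                      // a single output row of indices
        && srcValidCol == dstValidCol;           // one index per source column
}

// A 16 x 255 source with a 1 x 255 destination passes.
static_assert(tcolargmin_runtime_valid(16, 255, 1, 255), "expected valid shapes");
// A destination with two valid rows is rejected.
static_assert(!tcolargmin_runtime_valid(16, 255, 2, 255), "dst must have one valid row");
```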
About temporary tile tmp for A2A3¶
- tmp is always used in the A2A3 implementation as scratch space for intermediate results (current index, argmin index, and current min elements).
- The tmp tile's data type must be the same as src's data type.
- The tmp tile is organized into three regions within a single row:
  - Region 0 ([0, tmpGapEles)): current row index counter (incremented per row).
  - Region 1 ([tmpGapEles, 2 * tmpGapEles)): current minimum elements for comparison.
  - Region 2 ([2 * tmpGapEles, 3 * tmpGapEles)): argmin index result (before final conversion to dst).
- tmpGapEles is determined as follows:
  - When srcValidCol >= elemPerRpt: tmpGapEles = elemPerRpt.
  - When srcValidCol < elemPerRpt: tmpGapEles = ceil(srcValidCol / elemPerBlock) * elemPerBlock.
- Simply set the tmp tile size to the same as src when src is small, or calculate the required stride based on src's validCol using the following formula:
repeats = ceil(validCol / elementPerRepeat)
stride = ceil(repeats * 2 / elementPerBlock) * elementPerBlock + ceil(repeats / elementPerBlock) * elementPerBlock
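The sizing rules in this section can be evaluated with a few integer helpers. The sketch below transcribes the tmpGapEles rule and the stride formula literally. The function names are hypothetical, and elemPerRpt and elemPerBlock are passed in as parameters because their values are not stated here. The values 64 and 8 used in the checks are assumptions for float, based on a 256-byte repeat and a 32-byte block.

```cpp
#include <cstddef>

// Integer ceil(a / b) for b > 0.
constexpr std::size_t ceil_div(std::size_t a, std::size_t b) {
    return (a + b - 1) / b;
}

// tmpGapEles: one full repeat when the source is wide enough, otherwise
// srcValidCol rounded up to a whole number of blocks.
constexpr std::size_t tmp_gap_eles(std::size_t srcValidCol,
                                   std::size_t elemPerRpt,
                                   std::size_t elemPerBlock) {
    return srcValidCol >= elemPerRpt
               ? elemPerRpt
               : ceil_div(srcValidCol, elemPerBlock) * elemPerBlock;
}

// Start offset (in elements) of region 0, 1 or 2 inside the tmp row.
constexpr std::size_t tmp_region_offset(std::size_t region, std::size_t gapEles) {
    return region * gapEles;
}

// The stride formula above, with repeats = ceil(validCol / elemPerRpt).
constexpr std::size_t tmp_stride(std::size_t validCol,
                                 std::size_t elemPerRpt,
                                 std::size_t elemPerBlock) {
    return ceil_div(ceil_div(validCol, elemPerRpt) * 2, elemPerBlock) * elemPerBlock
         + ceil_div(ceil_div(validCol, elemPerRpt), elemPerBlock) * elemPerBlock;
}

// Assumed float values: elemPerRpt = 64, elemPerBlock = 8.
static_assert(tmp_gap_eles(255, 64, 8) == 64, "wide source uses one full repeat");
static_assert(tmp_gap_eles(20, 64, 8) == 24, "narrow source rounds up to whole blocks");
static_assert(tmp_region_offset(2, 64) == 128, "region 2 starts at 2 * tmpGapEles");
static_assert(tmp_stride(255, 64, 8) == 16, "repeats = 4 gives 8 + 8");
```

Under those assumed values, a source with 255 valid columns gives tmpGapEles = 64 and a stride of 16 elements.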
About temporary tile tmp for A5¶
- The tmp tile is not used in the A5 implementation. The A5 uses vector register-based computation (__VEC_SCOPE__) and does not require scratch tile storage.
- tmp is retained in the C++ intrinsic signature solely for API compatibility with A2A3.
Examples¶
Auto¶
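#include <pto/pto-inst.hpp>

using namespace pto;

void example_auto() {
    // Valid rows/cols are dynamic (-1) and are supplied to the constructors below.
    using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
    using DstT = Tile<TileType::Vec, uint32_t, 1, 256, BLayout::RowMajor, -1, -1>;
    using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;

    SrcT src(16, 255);  // 16 valid rows, 255 valid columns
    DstT dst(1, 255);   // one valid row, same valid column count as src
    TmpT tmp(1, 32);    // scratch tile, same data type as src

    TCOLARGMIN(dst, src, tmp);
}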
Manual¶
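#include <pto/pto-inst.hpp>

using namespace pto;

void example_manual() {
    // Valid rows/cols are dynamic (-1) and are supplied to the constructors below.
    using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
    using DstT = Tile<TileType::Vec, uint32_t, 1, 256, BLayout::RowMajor, -1, -1>;
    using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;

    SrcT src(16, 255);  // 16 valid rows, 255 valid columns
    DstT dst(1, 255);   // one valid row, same valid column count as src
    TmpT tmp(1, 32);    // scratch tile, same data type as src

    // Manual mode: bind each tile to an explicit address before issuing the instruction.
    TASSIGN(src, 0x0);
    TASSIGN(dst, 0x1000);
    TASSIGN(tmp, 0x2000);

    TCOLARGMIN(dst, src, tmp);
}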
ASM Form Examples¶
Auto Mode¶
# Auto mode: compiler/runtime-managed placement and scheduling.
%dst = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
Manual Mode¶
# Manual mode: bind resources explicitly before issuing the instruction.
# Optional for tile operands:
# pto.tassign %arg0, @tile(0x1000)
# pto.tassign %arg1, @tile(0x2000)
%dst = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
PTO Assembly Form¶
%dst = tcolargmin %src : !pto.tile<...> -> !pto.tile<...>
# IR Level 2 (DPS)
pto.tcolargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)