pto.tgather

pto.tgather is part of the Irregular And Complex instruction set.

Summary

Gather/select elements using either an index tile or a compile-time mask pattern.

Mechanism

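pto.tgather builds the destination tile by selecting elements of a source tile, driven either by a runtime index tile or by a compile-time mask pattern. The C++ intrinsic additionally exposes a comparison-based form that returns matching indices and per-row match counts. It is a tile instruction, and its architecture-visible behavior cannot be reduced to a plain elementwise compute pattern.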

Index-based gather (conceptual):

Let R = dst.GetValidRow() and C = dst.GetValidCol(). For 0 <= i < R and 0 <= j < C:

\[ \mathrm{dst}_{i,j} = \mathrm{src0}\!\left[\mathrm{indices}_{i,j}\right] \]

Indices are not bounds-checked by explicit runtime assertions. On A2/A3 and A5, out-of-range indices produce undefined results (no explicit masking); on the CPU simulator, out-of-range indices wrap modulo the source extent.
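The index-based formula can be read as a small host-side reference model. The sketch below is illustrative only: gather_by_index is a hypothetical helper, not a PTO function, and it assumes that each index is a flat offset into a row-major source buffer. It covers in-range indices only and does not model any target's out-of-range behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side reference model; not part of the PTO API.
// Assumption: each index is a flat offset into the row-major source buffer.
std::vector<float> gather_by_index(const std::vector<float> &src0,
                                   const std::vector<int32_t> &indices,
                                   std::size_t rows, std::size_t cols) {
  assert(indices.size() == rows * cols);
  std::vector<float> dst(rows * cols);
  for (std::size_t i = 0; i < rows; ++i) {
    for (std::size_t j = 0; j < cols; ++j) {
      const int32_t idx = indices[i * cols + j];
      // This model only covers in-range indices.
      assert(idx >= 0 && static_cast<std::size_t>(idx) < src0.size());
      dst[i * cols + j] = src0[static_cast<std::size_t>(idx)];
    }
  }
  return dst;
}
```

For example, with src0 = {10, 20, 30, 40} and a 2x2 index tile {3, 0, 2, 2}, the model returns {40, 10, 30, 30}.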

Mask-pattern gather is a selection controlled by pto::MaskPattern: the mask selects elements from the source in a pattern-defined order. The same mask semantics apply on A2/A3, A5, and the CPU simulator.
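As a rough illustration of pattern-driven selection, the sketch below packs the selected elements of one row in their original order. It is a hypothetical host-side model, not PTO code. It assumes that a mask pattern behaves like a repeating boolean sequence over the row; how a named pattern such as P0101 maps onto that sequence is not specified here and is left to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model; not part of the PTO API.
// Assumption: a mask pattern acts as a repeating boolean sequence over the
// elements of one source row, and selected elements are packed in order.
std::vector<float> gather_by_pattern(const std::vector<float> &src_row,
                                     const std::vector<bool> &pattern) {
  assert(!pattern.empty());
  std::vector<float> dst;
  for (std::size_t j = 0; j < src_row.size(); ++j) {
    if (pattern[j % pattern.size()]) {
      dst.push_back(src_row[j]);  // keep this element, preserving order
    }
  }
  return dst;
}
```

Under these assumptions, a two-element pattern {true, false} applied to an 8-element row keeps the elements at positions 0, 2, 4, and 6.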

Syntax

Textual spelling is defined by the PTO ISA syntax-and-operands pages.

Index-based gather:

%dst = tgather %src0, %indices : !pto.tile<...> -> !pto.tile<...>

Mask-pattern gather:

%dst = tgather %src {maskPattern = #pto.mask_pattern<P0101>} : !pto.tile<...> -> !pto.tile<...>

AS Level 1 (SSA)

%dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
%dst = pto.tgather %src {maskPattern = #pto.mask_pattern<P0101>} : !pto.tile<...> -> !pto.tile<...>

AS Level 2 (DPS)

pto.tgather ins(%src, %indices : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
pto.tgather ins(%src, {maskPattern = #pto.mask_pattern<P0101>} : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)

C++ Intrinsic

Declared in include/pto/common/pto_instr.hpp:

template <typename TileDataD, typename TileDataS0, typename TileDataS1, typename TileDataTmp, typename... WaitEvents>
PTO_INST RecordEvent TGATHER(TileDataD &dst, TileDataS0 &src0, TileDataS1 &src1, TileDataTmp &tmp, WaitEvents &... events);

template <typename DstTileData, typename SrcTileData, MaskPattern maskPattern, auto gatherType = GatherAxis::GATHER_ROW, typename... WaitEvents>
PTO_INST RecordEvent TGATHER(DstTileData &dst, SrcTileData &src, WaitEvents &... events);

template <typename TileDataD, typename TileDataS, typename TileDataS1, typename TileDataC, typename TileDataTmp, CmpMode cmpMode, typename... WaitEvents>
PTO_INST RecordEvent TGATHER(TileDataD &dst, TileDataS &src0, TileDataS1 &k_value, TileDataC &cdst, TileDataTmp &tmp, uint32_t offset, WaitEvents &... events);

Inputs

  • src0 is the source tile.
  • indices (index-based gather): index tile providing gather indices.
  • tmp (index-based and comparison-based gather): temporary tile. The C++ API requires it on A2/A3; on A5 it is accepted and ignored.
  • maskPattern (mask-pattern gather): compile-time mask pattern.
  • gatherType (mask-pattern gather): compile-time gather axis; defaults to GatherAxis::GATHER_ROW.
  • k_value (comparison-based gather): per-row threshold tile.
  • cdst (comparison-based gather): per-row match-count output tile.
  • offset (comparison-based gather): starting index value for gathered indices.
  • dst names the destination tile. The operation iterates over dst's valid region.

Expected Outputs

dst holds gathered elements from src0 at positions specified by indices or maskPattern. For comparison-based gather, dst holds the matching indices and cdst holds the match count per row.
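The comparison-based outputs can be pictured with a host-side model. The sketch below is an assumption-laden illustration, not the PTO implementation: CmpModeModel, CompareGatherResult, and gather_by_compare are hypothetical names, the threshold is taken directly as a float, a gathered index is assumed to be offset plus the column position, and matches are assumed to be packed in ascending column order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for CmpMode; only the two documented modes.
enum class CmpModeModel { GT, EQ };

// Hypothetical result type: dst holds matching indices, cdst the counts.
struct CompareGatherResult {
  std::vector<std::vector<uint32_t>> dst;
  std::vector<uint32_t> cdst;
};

// Assumptions: index = offset + column; matches packed in column order.
CompareGatherResult gather_by_compare(
    const std::vector<std::vector<float>> &src0,
    const std::vector<float> &k_value, CmpModeModel mode, uint32_t offset) {
  assert(src0.size() == k_value.size());  // one threshold per row
  CompareGatherResult out;
  out.dst.resize(src0.size());
  out.cdst.assign(src0.size(), 0u);
  for (std::size_t r = 0; r < src0.size(); ++r) {
    for (std::size_t j = 0; j < src0[r].size(); ++j) {
      const bool hit = (mode == CmpModeModel::GT)
                           ? (src0[r][j] > k_value[r])
                           : (src0[r][j] == k_value[r]);
      if (hit) {
        out.dst[r].push_back(offset + static_cast<uint32_t>(j));
        ++out.cdst[r];
      }
    }
  }
  return out;
}
```

With rows {1, 5, 3} and {2, 2, 9}, thresholds {2, 2}, GT mode, and offset 100, this model yields indices {101, 102} with count 2 for the first row and {102} with count 1 for the second.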

Side Effects

No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.

Constraints

  • Bounds / validity:

    • Index bounds are not validated by explicit runtime assertions; on A2/A3 and A5, out-of-range indices produce undefined results; on the CPU simulator, out-of-range indices are clamped to the valid range.
  • Temporary tile:

    • Index-based gather (A2/A3): The C++ API requires an explicit tmp tile. TileDataTmp::DType must be the same type as TileDataS1::DType (int32_t or uint32_t). src1.GetValidRow() == TileDataTmp::Rows and src1.GetValidCol() == TileDataTmp::Cols. The tmp tile holds intermediate vmuls results used by vgather for b16 source types; for b32 source types, the result is written directly to dst but the API still requires tmp.
    • Index-based gather (A5): The tmp tile is accepted and ignored. A5 hardware handles index-based gather without a scratch buffer.
    • Comparison-based gather (A2/A3): The C++ API requires an explicit tmp tile that serves as a combined scratch buffer for three internal regions:
      1. cmpsTmp (comparison result bitmap): offset 0, stored as uint8_t, size = TileDataTmp::Rows × TileDataTmp::Cols bytes.
      2. indexTmp (index array): offset = TileDataTmp::Rows × TileDataTmp::Cols × sizeof(uint8_t), stored as TileDataD::DType, size = TileDataS::Rows × TileDataS::Cols × sizeof(TileDataD::DType) bytes.
      3. cvtTmp (converted k-value array): offset = TileDataTmp::Rows × TileDataTmp::Cols × sizeof(uint8_t) + TileDataS::Rows × TileDataS::Cols × sizeof(TileDataD::DType), stored as TileDataS::DType, size = TileDataS::Rows × sizeof(TileDataS::DType) bytes.

      The minimum tmp size in bytes must satisfy:

      $$ \text{tmpSize} \ge \text{Rows}_\text{tmp} \times \text{Cols}_\text{tmp} + \text{Rows}_\text{src} \times \text{Cols}_\text{src} \times \text{sizeof}(\text{DType}_\text{dst}) + \text{Rows}_\text{src} \times \text{sizeof}(\text{DType}_\text{src}) $$
    • Comparison-based gather (A5): The tmp tile is accepted and ignored. A5 hardware handles comparison-based gather without a scratch buffer.
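The A2/A3 scratch layout for comparison-based gather can be checked with a few lines of arithmetic. The helpers below are hypothetical (they are not part of the PTO headers) and simply restate the three region sizes listed above as constexpr functions, taking tile shapes and element sizes as plain integers.

```cpp
#include <cstddef>

// Hypothetical helpers; names are illustrative, not part of the PTO API.
// Region 1 (cmpsTmp) starts at offset 0 and uses one byte per tmp element.
constexpr std::size_t index_tmp_offset(std::size_t tmp_rows,
                                       std::size_t tmp_cols) {
  return tmp_rows * tmp_cols;  // sizeof(uint8_t) == 1
}

// Region 3 (cvtTmp) starts after cmpsTmp and indexTmp.
constexpr std::size_t cvt_tmp_offset(std::size_t tmp_rows,
                                     std::size_t tmp_cols,
                                     std::size_t src_rows,
                                     std::size_t src_cols,
                                     std::size_t dst_elem_bytes) {
  return index_tmp_offset(tmp_rows, tmp_cols) +
         src_rows * src_cols * dst_elem_bytes;
}

// Minimum tmp size in bytes: end of the cvtTmp region.
constexpr std::size_t min_tmp_bytes(std::size_t tmp_rows, std::size_t tmp_cols,
                                    std::size_t src_rows, std::size_t src_cols,
                                    std::size_t dst_elem_bytes,
                                    std::size_t src_elem_bytes) {
  return cvt_tmp_offset(tmp_rows, tmp_cols, src_rows, src_cols,
                        dst_elem_bytes) +
         src_rows * src_elem_bytes;
}
```

As a worked case, a 16 × 16 tmp tile, a 16 × 16 float source, and an int32_t destination give 256 + 1024 + 64 = 1344 bytes, with indexTmp at byte offset 256 and cvtTmp at byte offset 1280.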

Exceptions

  • Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend instruction set.
  • Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.

Target-Profile Restrictions

  • Index-based gather: implementation checks (A2/A3):

    • sizeof(DstTileData::DType) must be 2 or 4 bytes (b16/b32).
    • sizeof(Src1TileData::DType) must be 4 bytes (b32: int32_t, uint32_t).
    • DstTileData::DType must be the same type as Src0TileData::DType.
    • TmpTileData::DType must be the same type as Src1TileData::DType.
    • src1.GetValidCol() == TmpTileData::Cols and src1.GetValidRow() == TmpTileData::Rows.
    • dst.GetValidCol() == DstTileData::Cols (continuous dst storage).
  • Index-based gather: implementation checks (A5):

    • DstTileData::DType must be int16_t, uint16_t, int32_t, uint32_t, half, or float.
    • Src1TileData::DType must be int16_t, uint16_t, int32_t, or uint32_t.
    • DstTileData::DType must be the same type as Src0TileData::DType.
    • src1.GetValidCol() == Src1TileData::Cols and dst.GetValidCol() == DstTileData::Cols.
  • Mask-pattern gather: implementation checks (A2/A3):

    • Source element size must be 2 or 4 bytes.
    • SrcTileData::DType/DstTileData::DType must be int16_t or uint16_t or int32_t or uint32_t or half or bfloat16_t or float.
    • dst and src must both be TileType::Vec and row-major.
    • sizeof(dst element) == sizeof(src element) and dst.GetValidCol() == DstTileData::Cols (continuous dst storage).
  • Mask-pattern gather: implementation checks (A5):

    • Source element size must be 1 or 2 or 4 bytes.
    • dst and src must both be TileType::Vec and row-major.
    • SrcTileData::DType/DstTileData::DType must be int8_t or uint8_t or int16_t or uint16_t or int32_t or uint32_t or half or bfloat16_t or float or float8_e4m3_t or float8_e5m2_t or hifloat8_t.
    • Supported dtypes are restricted to a target-defined set (checked via static_assert in the implementation). In addition, sizeof(dst element) == sizeof(src element) and dst.GetValidCol() == DstTileData::Cols (continuous dst storage).
  • Comparison-based gather: implementation checks (A2/A3):

    • TileDataD::DType must be int32_t or uint32_t.
    • TileDataS::DType must be float, half, or int32_t (EQ mode only).
    • TileDataS1::DType must be int32_t or uint32_t.
    • cmpMode must be CmpMode::GT or CmpMode::EQ.
  • Comparison-based gather: implementation checks (A5):

    • TileDataD::DType must be int32_t or uint32_t.
    • TileDataS::DType must be int16_t, uint16_t, int32_t, uint32_t, half, or float.
    • TileDataS1::DType must be uint16_t or uint32_t.
    • cmpMode must be CmpMode::GT or CmpMode::EQ.
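The type-level checks listed above lend themselves to compile-time predicates. The sketch below restates the A2/A3 index-based element-type rules as a trait; IndexGatherTypesOkA2A3 is a hypothetical name, it takes element types rather than tile types, and it is not claimed to match how the PTO implementation phrases its own static_assert checks. Shape conditions such as dst.GetValidCol() == DstTileData::Cols are runtime properties and are not modeled.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical compile-time predicate; not part of the PTO headers.
// Encodes the A2/A3 index-based gather element-type rules:
//   - destination element is 2 or 4 bytes wide
//   - index element is int32_t or uint32_t
//   - destination and source element types match
//   - tmp element type matches the index element type
template <typename DstT, typename Src0T, typename IdxT, typename TmpT>
struct IndexGatherTypesOkA2A3 {
  static constexpr bool value =
      (sizeof(DstT) == 2 || sizeof(DstT) == 4) &&
      (std::is_same<IdxT, int32_t>::value ||
       std::is_same<IdxT, uint32_t>::value) &&
      std::is_same<DstT, Src0T>::value &&
      std::is_same<TmpT, IdxT>::value;
};
```

A caller could place static_assert(IndexGatherTypesOkA2A3<float, float, int32_t, int32_t>::value, "...") next to a tile declaration to catch an unsupported combination, such as a 16-bit index type, before the intrinsic is instantiated.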

Examples

Auto

#include <pto/pto-inst.hpp>

using namespace pto;

void example_auto() {
  using SrcT = Tile<TileType::Vec, float, 16, 16>;
  using IdxT = Tile<TileType::Vec, int32_t, 16, 16>;
  using DstT = Tile<TileType::Vec, float, 16, 16>;
  SrcT src0;
  IdxT idx;
  IdxT tmp;  // scratch tile: same dtype and shape as the index tile
  DstT dst;
  TGATHER(dst, src0, idx, tmp);
}

Manual

#include <pto/pto-inst.hpp>

using namespace pto;

void example_manual() {
  using SrcT = Tile<TileType::Vec, float, 16, 16>;
  using DstT = Tile<TileType::Vec, float, 1, 16>;
  SrcT src;
  DstT dst;
  TASSIGN(src, 0x1000);
  TASSIGN(dst, 0x2000);
  TGATHER<DstT, SrcT, MaskPattern::P0101>(dst, src);
}

Auto Mode

# Auto mode: compiler/runtime-managed placement and scheduling.
%dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>

Manual Mode

# Manual mode: bind resources explicitly before issuing the instruction.
# Optional for tile operands:
# pto.tassign %arg0, @tile(0x1000)
# pto.tassign %arg1, @tile(0x2000)
%dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>

PTO Assembly Form

%dst = pto.tgather %src, %indices : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
# AS Level 2 (DPS)
pto.tgather ins(%src, %indices : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)