pto.tsort32¶
pto.tsort32 is part of the Irregular And Complex instruction set.
Tile Operation Diagram¶
Introduction¶
Sort each 32-element block of src together with the corresponding indices from idx, and write the sorted value-index pairs into dst. The underlying SFU instruction is VBS32 (vbitsort), which sorts one or more independent 32-element lists in a single call.
Hardware: VBS32 (vbitsort)¶
VBS32 runs on the SFU (not the vector pipeline). One invocation sorts repeat consecutive 32-element blocks, each consisting of 32 values + 32 indices packed as value-index pairs:
void vbitsort(__ubuf__ T *dst, // sorted value-index pairs out
__ubuf__ T *src0, // 32 values per block × repeat
__ubuf__ uint32_t *src1, // 32 indices per block × repeat
uint8_t repeat); // number of 32-element blocks (1..255)
repeat(capREPEAT_MAX = 255) is packed intoconfig[63:56].- Blocks are contiguous, strided by 32 elements: block
breadssrc0[b*32 : b*32+32]andsrc1[b*32 : b*32+32], writesdst[b*32*coef : ...]wherecoef= 2 (float) or 4 (half) — the value-index pair expansion factor. - Sort order: descending by value; ties broken by smaller index first.
Math Interpretation¶
For each row r, src is processed in independent 32-element blocks. Let block b cover columns 32b … 32b+31, and n_b = min(32, C - 32b) be its valid count.
Sort the pairs by value (descending); output the reordered sequence:
where π is the sort permutation for that block.
Notes:
- idx is an input tile (indices permuted with values), not an output.
- dst stores sorted value-index pairs, not just sorted values.
C++ Intrinsic¶
Declared in include/pto/common/pto_instr.hpp:
// 3-arg: src must be 32-aligned (validCol % 32 == 0)
template <typename DstTileData, typename SrcTileData, typename IdxTileData>
PTO_INST RecordEvent TSORT32(DstTileData &dst, SrcTileData &src, IdxTileData &idx);
// 4-arg: supports non-32-aligned tails (validCol % 32 != 0) via tmp padding
template <typename DstTileData, typename SrcTileData, typename IdxTileData, typename TmpTileData>
PTO_INST RecordEvent TSORT32(DstTileData &dst, SrcTileData &src, IdxTileData &idx, TmpTileData &tmp);
Tile Sizes & Data Types¶
For src of shape \(R \times C\) (valid region), block size 32:
| Tile | dtype | Size (elements) | Notes |
|---|---|---|---|
src |
half or float (\(T\)) |
\(R \times C\) | values to sort |
idx |
uint32_t |
\(R \times C\) (or \(1 \times C\) broadcast) | indices permuted with values |
dst |
\(T\) | \(R \times (2C)\) float, \(R \times (4C)\) half | sorted value-index pairs (expansion below) |
tmp (4-arg only) |
\(T\) | \(\ge \mathrm{ceil}_{32}(C)\) per row | tail-padding scratch; ≤ \(32 \times 255\) elements |
dst expansion factor (typeCoef): each input element becomes a value-index pair.
| dtype | dst cols per src col |
pair layout | bytes/pair |
|---|---|---|---|
float |
×2 | [value_f32, index_u32] |
8 |
half |
×4 | half-pair packing | 8 |
Constraints¶
| Constraint | Reason |
|---|---|
dst/src dtype = half or float (must match); idx = uint32_t |
VBS32 type dispatch |
All tiles TileType::Vec, BLayout::RowMajor |
SFU addressing |
validCol % 32 == 0 (3-arg) |
each block exactly 32 elements |
validCol arbitrary (4-arg) |
tail block padded to 32 with \(-\infty\) via tmp |
repeat = validCol/32 (3-arg) or ceil(validCol/32) (4-arg) |
VBS32 repeat count, ≤ 255 per call; larger validCol splits into multiple vbitsort calls |
tmp (4-arg) ≥ ceil32(C) elements/row |
holds the padded copy of the tail block |
No WaitEvents&... / no internal TSYNC |
synchronize explicitly if needed |
4-arg tail handling¶
When validCol % 32 != 0, the last partial block is copied to tmp, padded with \(-\infty\) (-(0.0/0.0)) up to 32 elements via vdup, then sorted from tmp (padding values land at the bottom of the descending order). If validCol > 32 × 255, the row is chunked into REPEAT_MAX-sized groups, each sorted separately.
Assembly Syntax¶
AS Level 1 (SSA)¶
%dst = pto.tsort32 %src, %idx : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
AS Level 2 (DPS)¶
pto.tsort32 ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
Examples¶
#include <pto/pto-inst.hpp>
using namespace pto;
// 32-aligned: single block per row
using SrcT = Tile<TileType::Vec, float, 1, 32>;
using IdxT = Tile<TileType::Vec, uint32_t, 1, 32>;
using DstT = Tile<TileType::Vec, float, 1, 64>; // 2× src cols (float)
SrcT src; IdxT idx; DstT dst;
TSORT32(dst, src, idx);
// Non-32-aligned tail: 4-arg with tmp
using SrcT2 = Tile<TileType::Vec, half, 1, 100>;
using IdxT2 = Tile<TileType::Vec, uint32_t, 1, 100>;
using DstT2 = Tile<TileType::Vec, half, 1, 400>; // 4× src cols (half)
using TmpT = Tile<TileType::Vec, half, 1, 128>; // ≥ ceil32(100)=128
TSORT32(dst2, src2, idx2, tmp);
ASM Form Examples¶
Auto Mode¶
%dst = pto.tsort32 %src, %idx : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
Manual Mode¶
# pto.tassign %arg0, @tile(0x1000)
# pto.tassign %arg1, @tile(0x2000)
# pto.tassign %arg2, @tile(0x3000)
%dst = pto.tsort32 %src, %idx : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
PTO Assembly Form¶
%dst = tsort32 %src, %idx : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
# AS Level 2 (DPS)
pto.tsort32 ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
Related Ops / Instruction Set Links¶
- Instruction set overview: Irregular And Complex
- Previous op in instruction set: pto.tmrgsort
- Next op in instruction set: pto.tgather