pto.tcvt¶
pto.tcvt is part of the Elementwise Tile Tile instruction set.
Summary¶
Elementwise type conversion with a specified rounding mode and optional saturation mode.
Mechanism¶
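For each element (i, j) in the valid region:

$$ \mathrm{dst}_{i,j} = \mathrm{cast}_{\mathrm{rmode},\,\mathrm{satmode}}\left(\mathrm{src}_{i,j}\right) $$

where rmode is the rounding policy and satmode (if provided) controls saturation behavior.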
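The per-element rule above can be sketched as a host-side reference loop. This is only an illustrative model, not the PTO implementation: the function name `ref_tcvt_rint`, the row-major `std::vector` layout, and the explicit stride parameters are assumptions made for the example. It fixes the rounding policy to round-to-nearest-even and shows that only elements inside the valid region are written.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side reference model, not part of the PTO API.
// Converts the validRows x validCols region of a row-major float buffer to
// int32_t, rounding to nearest with ties to even. Strides are in elements.
inline void ref_tcvt_rint(const std::vector<float>& src, std::size_t srcStride,
                          std::vector<std::int32_t>& dst, std::size_t dstStride,
                          std::size_t validRows, std::size_t validCols) {
    assert(validCols <= srcStride && validCols <= dstStride);
    for (std::size_t i = 0; i < validRows; ++i) {
        for (std::size_t j = 0; j < validCols; ++j) {
            // std::nearbyint follows the current floating-point rounding mode,
            // which is round-to-nearest-even unless the program changed it.
            dst[i * dstStride + j] =
                static_cast<std::int32_t>(std::nearbyint(src[i * srcStride + j]));
        }
    }
}
```

Elements of the destination buffer that lie outside the valid region keep whatever value they had before the call in this model.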
Rounding Modes¶
| Mode | Behavior |
|---|---|
| `RoundMode::CAST_RINT` | Round to nearest, ties to even |
| `RoundMode::CAST_ROUND` | Round to nearest, ties away from zero |
| `RoundMode::CAST_FLOOR` | Round toward -∞ |
| `RoundMode::CAST_CEIL` | Round toward +∞ |
| `RoundMode::CAST_TRUNC` | Round toward zero |
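As a rough illustration of the five policies, each one can be modelled with a standard C++ math function. The enum `RefRoundMode` and the function `ref_round` are hypothetical stand-ins introduced for this sketch; the real `RoundMode` enum lives in the PTO headers, and hardware results for a given type pair may differ in edge cases.

```cpp
#include <cassert>
#include <cmath>

// Illustrative stand-in for the RoundMode values in the table above.
enum class RefRoundMode { Rint, Round, Floor, Ceil, Trunc };

// Rounds x to an integral value according to the selected policy.
inline double ref_round(double x, RefRoundMode mode) {
    switch (mode) {
        case RefRoundMode::Rint:  return std::nearbyint(x);  // ties to even (default FP environment)
        case RefRoundMode::Round: return std::round(x);      // ties away from zero
        case RefRoundMode::Floor: return std::floor(x);      // toward -infinity
        case RefRoundMode::Ceil:  return std::ceil(x);       // toward +infinity
        case RefRoundMode::Trunc: return std::trunc(x);      // toward zero
    }
    return x;  // not reached for valid enum values
}

// Spot checks on tie and negative inputs, where the policies disagree.
inline void ref_round_examples() {
    assert(ref_round(2.5, RefRoundMode::Rint) == 2.0);
    assert(ref_round(2.5, RefRoundMode::Round) == 3.0);
    assert(ref_round(-2.5, RefRoundMode::Floor) == -3.0);
    assert(ref_round(-2.5, RefRoundMode::Ceil) == -2.0);
    assert(ref_round(-2.7, RefRoundMode::Trunc) == -2.0);
}
```

The policies only differ on inputs that are not already integral, so ties such as 2.5 and negative fractions are the most useful test inputs.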
Saturation Modes¶
When SaturationMode is provided, saturation behavior is explicitly controlled:
| Mode | Behavior |
|---|---|
| `SaturationMode::ON` | Saturation enabled |
| `SaturationMode::OFF` | Saturation disabled |
When SaturationMode is omitted, the implementation chooses the default behavior for the selected target/type path. Some conversion paths also expose a tmp-tile overload used for explicit scratch storage.
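The difference between the two modes is easiest to see on an out-of-range input. The sketch below contrasts a clamping `float -> int16` conversion with one that first converts to `int32_t` and then keeps the low 16 bits. Both function names are hypothetical, and the wrap-around model is an assumption chosen for illustration; actual non-saturating results are backend-specific.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Saturating model: truncate toward zero, then clamp to the int16_t range.
inline std::int16_t ref_f32_to_i16_sat(float x) {
    assert(!std::isnan(x));  // NaN handling is outside this sketch
    const float t = std::trunc(x);
    if (t >= 32767.0f) return std::numeric_limits<std::int16_t>::max();
    if (t <= -32768.0f) return std::numeric_limits<std::int16_t>::min();
    return static_cast<std::int16_t>(t);
}

// Non-saturating model (assumed for illustration): convert to int32_t, then
// keep only the low 16 bits, so out-of-range values wrap around.
inline std::int16_t ref_f32_to_i16_wrap(float x) {
    assert(std::isfinite(x) && std::fabs(x) < 2147483648.0f);
    const std::int32_t wide = static_cast<std::int32_t>(x);  // truncates toward zero
    const std::uint32_t low = static_cast<std::uint32_t>(wide) & 0xFFFFu;
    // Reinterpret the low 16 bits as a two's-complement value.
    const std::int32_t value = low >= 0x8000u ? static_cast<std::int32_t>(low) - 0x10000
                                              : static_cast<std::int32_t>(low);
    return static_cast<std::int16_t>(value);
}
```

In this model an input of `40000.0f` yields `32767` with saturation and `-25536` without it, while in-range inputs give the same result in both modes.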
Syntax¶
Textual spelling is defined by the PTO ISA syntax-and-operands pages.
Synchronous form:
%dst = tcvt %src {rmode = #pto.round_mode<CAST_RINT>} : !pto.tile<...> -> !pto.tile<...>
AS Level 1 (SSA)¶
%dst = pto.tcvt %src {rmode = #pto.round_mode<CAST_RINT>} : !pto.tile<...> -> !pto.tile<...>
AS Level 2 (DPS)¶
pto.tcvt ins(%src {rmode = #pto.round_mode<CAST_RINT>}: !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
C++ Intrinsic¶
Declared in include/pto/common/pto_instr.hpp and include/pto/common/constants.hpp:
template <typename TileDataD, typename TileDataS, typename TmpTileData, typename... WaitEvents>
PTO_INST RecordEvent TCVT(TileDataD &dst, TileDataS &src, TmpTileData &tmp, RoundMode mode,
                          SaturationMode satMode, WaitEvents &... events);
template <typename TileDataD, typename TileDataS, typename TmpTileData, typename... WaitEvents>
PTO_INST RecordEvent TCVT(TileDataD &dst, TileDataS &src, TmpTileData &tmp, RoundMode mode,
                          WaitEvents &... events);
template <typename TileDataD, typename TileDataS, typename... WaitEvents>
PTO_INST RecordEvent TCVT(TileDataD &dst, TileDataS &src, RoundMode mode,
                          SaturationMode satMode, WaitEvents &... events);
template <typename TileDataD, typename TileDataS, typename... WaitEvents>
PTO_INST RecordEvent TCVT(TileDataD &dst, TileDataS &src, RoundMode mode, WaitEvents &... events);
The tmp-tile overloads exist for conversion paths that need explicit scratch storage.
Inputs¶
| Operand | Role | Description |
|---|---|---|
| `%src` | Source tile | Source tile; read at (i, j) for each (i, j) in dst valid region |
| `%dst` | Destination tile | Destination tile receiving the converted values |
| `mode` | Rounding mode | One of CAST_RINT, CAST_ROUND, CAST_FLOOR, CAST_CEIL, CAST_TRUNC |
| `satMode` | Saturation mode (optional) | ON or OFF |
| `tmp` | Temporary tile (optional) | Scratch tile for conversion paths that require explicit temporary storage |
| `WaitEvents...` | Optional synchronisation | RecordEvent tokens to wait on before issuing the operation |
Expected Outputs¶
| Result | Type | Description |
|---|---|---|
| `%dst` | `!pto.tile<...>` | Destination tile; all (i, j) in its valid region contain the converted element values after the operation |
Side Effects¶
No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
Constraints¶
- `src` and `dst` MUST have compatible shapes (declared shape and valid region).
- The source/destination type pair MUST be supported by the selected target profile.
- The rounding mode MUST be supported for the given type pair.
- When a conversion path requires explicit scratch storage, callers MUST use one of the tmp-tile overloads.
- Disabling saturation may change overflow behavior for some backend/type paths, especially low-precision integer conversions.
- Temporary tile:
    - The C++ API provides overloads with an explicit `tmp` tile. On A2/A3 this tmp tile is consumed by PyTorch-compatible non-saturating narrowing paths when `SaturationMode::OFF` is used for `float -> int16`, `half -> int16`, or `half -> int8`. Other conversions do not require tmp space.
    - The implementation casts `tmp` to `int32_t *`; size the tile by bytes, independent of the declared `TmpTileData::DType`.
    - The formulas below give the minimum allocation size rounded to the 32-byte vector block granularity used by the implementation. If `C = 0`, no tmp-backed conversion is issued and the required tmp size is `0`.
    - Common parameters:
        - `R = dst.GetValidRow()`.
        - `C = dst.GetValidCol()`.
        - `SS = TileDataS::RowStride`, in source elements.
        - `REPEAT_MAX = 255`, `REPEAT_BYTE = 256`, `BLOCK_BYTE_SIZE = 32`.
    - `float -> int16`, non-saturating (`SaturationMode::OFF`):
        - The temporary result is an `int32_t` tile produced by the first `float -> int32` conversion step.
        - `float` source rows are 32-byte aligned by the tile constraints, so `SS / 8` is the source repeat stride in 32-byte blocks.
        - For the aligned main region, one call processes one row and up to `REPEAT_MAX` repeats, with `64` elements per repeat:

            $$ \text{tmpHeadBytes} = 4 \times 64 \times \min\left(\left\lfloor\frac{C}{64}\right\rfloor, 255\right) $$

        - For the tail region, one call processes up to `REPEAT_MAX` rows using the source row stride. The extent is computed in 32-byte blocks because the vector repeat stride is block-based:

            $$ \text{tmpTailBytes} = \begin{cases} 32 \times \left((\min(R, 255) - 1) \times \frac{SS}{8} + \left\lceil\frac{C \bmod 64}{8}\right\rceil\right), & C \bmod 64 > 0 \\ 0, & C \bmod 64 = 0 \end{cases} $$

        - Minimum required tmp size for this path:

            $$ \text{tmpFloatToInt16Bytes} = \max(\text{tmpHeadBytes}, \text{tmpTailBytes}) $$

        - A compact full-repeat upper bound for the main region is `REPEAT_MAX * REPEAT_BYTE = 65280` bytes, but tail sizing can be larger when `SS` is large because tail rows are written with source-row stride.
    - `half -> int16`, non-saturating (`SaturationMode::OFF`):
        - The implementation processes each row in sub-chunks of at most `64` elements and reuses the same temp buffer for every sub-chunk. For `C > 0`, let:

            $$ H = \min(C, 64) $$

        - Minimum required tmp size for this path:

            $$ \text{tmpHalfToInt16Bytes} = 32 \times \left\lceil\frac{H}{8}\right\rceil $$

        - A shape-independent upper bound for any non-empty tile is `256` bytes.
    - `half -> int8`, non-saturating (`SaturationMode::OFF`):
        - The implementation also processes sub-chunks of at most `64` elements and reuses the same 256-byte temp region. The first step can write up to `64` `int32_t` values into bytes `[0, 255]`; after the `int32 -> int16` narrow, bytes `[0, 127]` hold the `int16_t` values and bytes `[128, 255]` are reused as scratch.
        - `tempMaskBuf = tempAndBuf + 64` advances by `64 * sizeof(int16_t) = 128` bytes, so it points at the upper half of the same 256-byte temp region. It does not require an additional 256-byte allocation.
        - Minimum required tmp size for this path:

            $$ \text{tmpHalfToInt8Bytes} = \max\left(32 \times \left\lceil\frac{H}{8}\right\rceil,\ 128 + 32 \times \left\lceil\frac{H}{16}\right\rceil\right) $$

        - A shape-independent upper bound for any non-empty tile is `256` bytes.
    - Overall minimum for all tmp-backed TCVT conversions:
        - Since `tmpHalfToInt8Bytes >= tmpHalfToInt16Bytes`, the minimum tmp size that fits all tmp-backed TCVT conversion paths for the same shape is:

            $$ \text{tmpSizeAllBytes} = \max(\text{tmpFloatToInt16Bytes},\ \text{tmpHalfToInt8Bytes}) $$

        - If the tile is non-empty and the compact shape-independent bound of `256` bytes for the half paths is acceptable, a sufficient (possibly larger) size is:

            $$ \text{tmpSizeAllBytes} = \max(\text{tmpFloatToInt16Bytes},\ 256) $$

    - The no-`tmp` overload remains valid for conversions that do not need the PyTorch-compatible tmp-backed path, or when native saturation behavior is sufficient.
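The sizing formulas above can be evaluated at compile time. The sketch below transcribes them directly; the namespace `tmp_sizing` and all function names are hypothetical helpers, not part of the PTO headers. Treating `R = 0` as an empty tile is an added assumption (the text only states the `C = 0` case).

```cpp
#include <algorithm>
#include <cstddef>

namespace tmp_sizing {  // hypothetical helper namespace, not part of PTO

constexpr std::size_t kRepeatMax = 255;       // REPEAT_MAX
constexpr std::size_t kBlockBytes = 32;       // BLOCK_BYTE_SIZE
constexpr std::size_t kElemsPerRepeat = 64;   // elements per repeat / sub-chunk

constexpr std::size_t ceil_div(std::size_t a, std::size_t b) { return (a + b - 1) / b; }

// float -> int16, SaturationMode::OFF.
// R, C: valid rows/cols of dst; SS: source row stride in elements.
constexpr std::size_t float_to_int16_bytes(std::size_t R, std::size_t C, std::size_t SS) {
    if (R == 0 || C == 0) return 0;  // R == 0 guard is an assumption; avoids unsigned underflow
    const std::size_t head = 4 * kElemsPerRepeat * std::min(C / kElemsPerRepeat, kRepeatMax);
    const std::size_t rem = C % kElemsPerRepeat;
    const std::size_t tail =
        rem == 0 ? 0
                 : kBlockBytes * ((std::min(R, kRepeatMax) - 1) * (SS / 8) + ceil_div(rem, 8));
    return std::max(head, tail);
}

// half -> int16, SaturationMode::OFF.
constexpr std::size_t half_to_int16_bytes(std::size_t C) {
    if (C == 0) return 0;
    const std::size_t H = std::min(C, kElemsPerRepeat);
    return kBlockBytes * ceil_div(H, 8);
}

// half -> int8, SaturationMode::OFF.
constexpr std::size_t half_to_int8_bytes(std::size_t C) {
    if (C == 0) return 0;
    const std::size_t H = std::min(C, kElemsPerRepeat);
    return std::max(kBlockBytes * ceil_div(H, 8), 128 + kBlockBytes * ceil_div(H, 16));
}

// Minimum size that fits every tmp-backed path for the same shape.
constexpr std::size_t all_paths_bytes(std::size_t R, std::size_t C, std::size_t SS) {
    return std::max(float_to_int16_bytes(R, C, SS), half_to_int8_bytes(C));
}

// Worked example: a 16 x 16 tile with a source row stride of 16 elements.
static_assert(float_to_int16_bytes(16, 16, 16) == 1024, "tail-dominated case");
static_assert(half_to_int8_bytes(16) == 160, "128-byte offset plus two 32-byte blocks");

}  // namespace tmp_sizing
```

Assuming the 16 x 16 `float` source tile has a row stride of 16 elements, the `float -> int16` path needs 1024 bytes, which is exactly the footprint of a 16 x 16 `int32_t` tmp tile (16 * 16 * 4 bytes).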
Cases That Are Not Allowed¶
- MUST NOT use a type pair not supported by the target profile.
- MUST NOT use a rounding mode not supported for the given type pair.
- MUST NOT assume that disabling saturation still clamps overflow to the destination range.
Target-Profile Restrictions¶
pto.tcvt preserves PTO-visible semantics across CPU simulation, A2/A3-class targets, and A5-class targets, but the exact set of supported type pairs, scratch requirements, and saturation behavior is backend-specific.
In this checkout, the fp16 → int8 non-saturating path is explicitly implemented through helper logic that may require temporary storage and row-aware sub-chunking.
Supported Conversions¶
| Source Type | A2A3 Destinations | A5 Destinations | Difference |
|---|---|---|---|
| FP32 | FP16, FP32 (round-only), BF16, I16, I32, I64 | FP32, FP16, BF16, I16, I32, I64, FP8_E4M3, FP8_E5M2, H8 | A5 adds FP8/H8 targets |
| FP16 | FP32, I32, I16, I8, U8, S4 (int4b_t) | FP32, I32, I16, I8, U8, H8 | A2A3 has S4 path; A5 has H8 path |
| BF16 | FP32, I32 | FP32, I32, FP16, FP4_E1M2X2, FP4_E2M1X2 | A5 adds FP16/FP4 targets |
| I16 | FP16, FP32 | U8, FP16, FP32, U32, I32 | A5 adds U8/U32/I32 targets |
| I32 | FP32, I16, I64, FP16 (deq path) | FP32, I16, U16, I64, U8 | A2A3 supports I32 -> FP16 (half, deq); A5 does not |
| I64 | FP32, I32 | FP32, I32 | Same |
| U8 | FP16 | FP16, U16 | A5 adds U16 target |
| I8 | FP16 | FP16, I16, I32 | A5 adds I16/I32 targets |
| S4 (int4b_t) | FP16 | N/A | A2A3-only |
| U32 | N/A | U8, U16, I16 | A5-only source type |
| FP8_E4M3 | N/A | FP32 | A5-only source type |
| FP8_E5M2 | N/A | FP32 | A5-only source type |
| H8 | N/A | FP32 | A5-only source type |
| FP4_E1M2X2 | N/A | BF16 | A5-only source type |
| FP4_E2M1X2 | N/A | BF16 | A5-only source type |
Notes:
- A2A3 supports I32 -> FP16 through the half dequantization path; A5 does not support I32 -> FP16.
- A5 does not support FP16 -> FP8_E4M3 or FP16 -> FP8_E5M2.
Examples¶
Auto¶
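#include <pto/pto-inst.hpp>
using namespace pto;

void example_auto() {
    // 16 x 16 vector tiles: float source, half destination.
    using SrcT = Tile<TileType::Vec, float, 16, 16>;
    using DstT = Tile<TileType::Vec, half, 16, 16>;
    SrcT src;
    DstT dst;
    // Round to nearest, ties to even; saturation left at the implementation default.
    TCVT(dst, src, RoundMode::CAST_RINT);
}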
Explicit Saturation / Scratch¶
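// dst and src are tiles declared by the caller; tmp supplies scratch storage
// for conversion paths that consume a tmp tile.
using TmpT = Tile<TileType::Vec, int32_t, 16, 16>;
TmpT tmp;
TCVT(dst, src, tmp, RoundMode::CAST_TRUNC, SaturationMode::OFF);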
PTO Assembly Form¶
%dst = tcvt %src {rmode = #pto.round_mode<CAST_RINT>} : !pto.tile<...> -> !pto.tile<...>
pto.tcvt ins(%src {rmode = #pto.round_mode<CAST_RINT>}: !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
Related Ops / Instruction Set Links¶
- Instruction set overview: Elementwise Tile Tile
- Previous op in instruction set: pto.tsubc
- Next op in instruction set: pto.tsel