pto.ttrans¶
pto.ttrans is part of the Layout And Rearrangement instruction set.
Summary¶
Transposes a tile using a temporary tile whose allocation and use depend on the target. On A2/A3 the transpose is performed in place, with the scratchpad as staging. On A5 the operation requires an explicit tmp tile passed via the C++ API, because the A5 DMA engine cannot perform a true in-place transpose and needs a scratch buffer of the same shape as the source. On the CPU simulator the tmp tile is not used but must still be provided.
Mechanism¶
pto.ttrans belongs to the tile instructions and carries architecture-visible behavior that is not reducible to a plain elementwise compute pattern. The role of the temporary tile is target-dependent, as described in the Summary.
For a 2D tile, over the effective transpose domain:
$$ \mathrm{dst}_{i,j} = \mathrm{src}_{j,i} $$
Exact shape/layout and the transpose domain depend on the target (see Constraints).
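The element mapping dst[i, j] = src[j, i] can be illustrated with a host-side reference transpose. This is a minimal sketch on a plain row-major buffer, not the PTO implementation: the function name reference_transpose and the use of std::vector are assumptions made for illustration, and no tmp tile or target-specific layout is modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, not part of the PTO API.
// src is a row-major (rows x cols) matrix; the result is row-major (cols x rows).
template <typename T>
std::vector<T> reference_transpose(const std::vector<T> &src, std::size_t rows, std::size_t cols) {
    assert(src.size() == rows * cols);  // the buffer must match the stated shape
    std::vector<T> dst(rows * cols);
    for (std::size_t i = 0; i < cols; ++i) {        // i: row index in dst
        for (std::size_t j = 0; j < rows; ++j) {    // j: column index in dst
            dst[i * rows + j] = src[j * cols + i];  // dst[i, j] = src[j, i]
        }
    }
    return dst;
}
```

A reference of this kind can be useful for checking the output of TTRANS on small tiles, provided the comparison is restricted to the transpose domain that the target actually uses.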
Syntax¶
Textual spelling is defined by the PTO ISA syntax-and-operands pages.
Synchronous form:
%dst = ttrans %src : !pto.tile<...> -> !pto.tile<...>
Lowering may introduce internal scratch tiles; the C++ intrinsic requires an explicit tmp operand.
AS Level 1 (SSA)¶
%dst = pto.ttrans %src : !pto.tile<...> -> !pto.tile<...>
AS Level 2 (DPS)¶
pto.ttrans ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
C++ Intrinsic¶
Declared in include/pto/common/pto_instr.hpp:
template <typename TileDataDst, typename TileDataSrc, typename TileDataTmp, typename... WaitEvents>
PTO_INST RecordEvent TTRANS(TileDataDst &dst, TileDataSrc &src, TileDataTmp &tmp, WaitEvents &... events);
Inputs¶
- src is the source tile.
- tmp is a temporary tile used during transpose (may not be used by all implementations).
- dst names the destination tile. The operation iterates over dst's valid region.
Expected Outputs¶
dst holds the transposed version of src: dst[i,j] = src[j,i].
Side Effects¶
No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.
Constraints¶
- Temporary tile:
  - The C++ API requires tmp, but some implementations may not use it.
- ConvTile:
  - Transpose of ConvTile for TileType::Vec is supported. Element size must be 1, 2, or 4 bytes. Supported element types are uint32_t, int32_t, float, uint16_t, int16_t, half, bfloat16_t, uint8_t, and int8_t.
  - Format transformation from NCHW to NC1HWC0 is supported, where C1 == (C + C0 - 1)/C0 and HW satisfies the alignment constraint H*W*sizeof(T) % 32 == 0. C0 is c0_size, which satisfies C0 * sizeof(T) == 32. C0 can also be 4.
  - Format transformation from NC1HWC0 to FRACTAL_Z is supported, where N1 == (N + N0 - 1)/N0. N0 should be 16.
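The ConvTile shape arithmetic above can be sketched as a few compile-time helpers. All names here (default_c0, ceil_div, c1_for, n1_for, hw_aligned) are hypothetical and not part of the PTO API. The HW alignment check assumes the constraint means that H*W*sizeof(T) is a multiple of 32 bytes, which is an interpretation rather than something this page confirms.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for illustration; not part of the PTO API.

// Default C0 such that C0 * sizeof(T) == 32 (the page notes C0 can also be 4).
template <typename T>
constexpr std::size_t default_c0() { return 32 / sizeof(T); }

// Rounds a / b up to the next integer.
constexpr std::size_t ceil_div(std::size_t a, std::size_t b) { return (a + b - 1) / b; }

// NCHW -> NC1HWC0: C1 == (C + C0 - 1) / C0.
constexpr std::size_t c1_for(std::size_t c, std::size_t c0) { return ceil_div(c, c0); }

// NC1HWC0 -> FRACTAL_Z: N1 == (N + N0 - 1) / N0, with N0 == 16.
constexpr std::size_t n1_for(std::size_t n) { return ceil_div(n, 16); }

// ASSUMPTION: the HW alignment constraint means H*W*sizeof(T) is a multiple of 32 bytes.
template <typename T>
constexpr bool hw_aligned(std::size_t h, std::size_t w) {
    return (h * w * sizeof(T)) % 32 == 0;
}
```

For example, with float elements the default C0 is 8, so a tensor with C = 20 channels needs C1 = 3 channel groups, the last of which is only partially filled.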
Exceptions¶
- Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend instruction set.
- Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.
Target-Profile Restrictions¶
- Implementation checks (A2A3):
  - sizeof(TileDataSrc::DType) == sizeof(TileDataDst::DType).
  - Source layout must be row-major (TileDataSrc::isRowMajor).
  - Element size must be 1, 2, or 4 bytes.
  - Supported element types are restricted per element width:
    - 4 bytes: uint32_t, int32_t, float
    - 2 bytes: uint16_t, int16_t, half, bfloat16_t
    - 1 byte: uint8_t, int8_t
  - The transpose size is taken from src.GetValidRow() / src.GetValidCol().
- Implementation checks (A5):
  - sizeof(TileDataSrc::DType) == sizeof(TileDataDst::DType).
  - 32-byte alignment constraints are enforced on the major dimension of both input and output (row-major checks Cols * sizeof(T) % 32 == 0, col-major checks Rows * sizeof(T) % 32 == 0).
  - Supported element types are restricted per element width:
    - 4 bytes: uint32_t, int32_t, float
    - 2 bytes: uint16_t, int16_t, half, bfloat16_t
    - 1 byte: uint8_t, int8_t
  - The implementation operates over the static tile shape (TileDataSrc::Rows/Cols) and does not consult GetValidRow/GetValidCol.
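The width and alignment checks listed above can be approximated on the host before choosing tile shapes. The sketch below is an assumption-laden illustration, not the checks the backends actually run: the helper names are hypothetical, and only element width is tested (half and bfloat16_t are not standard C++ types, so the per-type whitelist is not modeled).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical pre-checks for illustration; not part of the PTO API.

// Both targets: source and destination elements must have the same width.
template <typename Src, typename Dst>
constexpr bool same_width() { return sizeof(Src) == sizeof(Dst); }

// Both targets: element size must be 1, 2, or 4 bytes.
template <typename T>
constexpr bool element_size_ok() {
    return sizeof(T) == 1 || sizeof(T) == 2 || sizeof(T) == 4;
}

// A5, row-major tile: Cols * sizeof(T) must be a multiple of 32 bytes.
template <typename T>
constexpr bool row_major_aligned(std::size_t cols) { return (cols * sizeof(T)) % 32 == 0; }

// A5, col-major tile: Rows * sizeof(T) must be a multiple of 32 bytes.
template <typename T>
constexpr bool col_major_aligned(std::size_t rows) { return (rows * sizeof(T)) % 32 == 0; }
```

Under these rules a 16 x 16 float tile passes the row-major check (64 bytes per row), whereas a 16 x 16 uint8_t tile does not (16 bytes per row) and would need at least 32 columns.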
Examples¶
Auto¶
#include <pto/pto-inst.hpp>
using namespace pto;
void example_auto() {
    using SrcT = Tile<TileType::Vec, float, 16, 16>;
    using DstT = Tile<TileType::Vec, float, 16, 16>;
    using TmpT = Tile<TileType::Vec, float, 16, 16>;
    SrcT src;
    DstT dst;
    TmpT tmp;  // required by the C++ API even if the backend does not use it
    TTRANS(dst, src, tmp);
}
Manual¶
#include <pto/pto-inst.hpp>
using namespace pto;
void example_manual() {
    using SrcT = Tile<TileType::Vec, float, 16, 16>;
    using DstT = Tile<TileType::Vec, float, 16, 16>;
    using TmpT = Tile<TileType::Vec, float, 16, 16>;
    SrcT src;
    DstT dst;
    TmpT tmp;
    // Manual mode: bind each tile to an explicit address before issuing the instruction.
    TASSIGN(src, 0x1000);
    TASSIGN(dst, 0x2000);
    TASSIGN(tmp, 0x3000);
    TTRANS(dst, src, tmp);
}
Auto Mode¶
# Auto mode: compiler/runtime-managed placement and scheduling.
%dst = pto.ttrans %src : !pto.tile<...> -> !pto.tile<...>
Manual Mode¶
# Manual mode: bind resources explicitly before issuing the instruction.
# Optional for tile operands:
# pto.tassign %arg0, @tile(0x1000)
# pto.tassign %arg1, @tile(0x2000)
%dst = pto.ttrans %src : !pto.tile<...> -> !pto.tile<...>
PTO Assembly Form¶
%dst = ttrans %src : !pto.tile<...> -> !pto.tile<...>
# AS Level 2 (DPS)
pto.ttrans ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)
Related Ops / Instruction Set Links¶
- Instruction set overview: Layout And Rearrangement
- Previous op in instruction set: pto.treshape