pto.tgemv_acc

pto.tgemv_acc is part of the Matrix And Matrix Vector instruction set.

Summary

GEMV with explicit accumulator input/output tiles.

Mechanism

Tile-based GEMV with explicit accumulator input tile (cInMatrix) and output tile (cOutMatrix). It operates on tile payloads rather than scalar control state, and its legality is constrained by tile shape, layout, valid-region, and target-profile support.

Let:

  • M = 1
  • K = bMatrix.GetValidRow()
  • N = bMatrix.GetValidCol()

For 0 <= j < N (the input accumulator tile is read and the result is written to the output accumulator tile):

\[ \mathrm{C}^{\mathrm{out}}_{0,j} = \mathrm{C}^{\mathrm{in}}_{0,j} + \sum_{k=0}^{K-1} \mathrm{A}_{0,k} \cdot \mathrm{B}_{k,j} \]
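The accumulation rule above can be modelled on the host with a plain scalar loop. The following is a minimal sketch for checking results, not the PTO implementation: gemv_acc_ref is a hypothetical helper, tiles are modelled as flat row-major std::vector buffers, and tile layout, fractal format, and valid-region handling are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference model (not part of the PTO API).
//   a:    1 x K vector
//   b:    K x N matrix, stored row-major
//   c_in: 1 x N input accumulator
// Returns the 1 x N output accumulator:
//   c_out[j] = c_in[j] + sum over k of a[k] * b[k][j]
template <typename AccT, typename InT>
std::vector<AccT> gemv_acc_ref(const std::vector<AccT> &c_in,
                               const std::vector<InT> &a,
                               const std::vector<InT> &b,
                               std::size_t K, std::size_t N) {
  assert(a.size() == K);
  assert(b.size() == K * N);
  assert(c_in.size() == N);
  std::vector<AccT> c_out(c_in);  // start from the input accumulator
  for (std::size_t j = 0; j < N; ++j) {
    for (std::size_t k = 0; k < K; ++k) {
      // Widen each operand to the accumulator type before multiplying.
      c_out[j] += static_cast<AccT>(a[k]) * static_cast<AccT>(b[k * N + j]);
    }
  }
  return c_out;
}
```

Comparing a device result against such a model is only meaningful up to the rounding behavior of the target, since the summation order here is an assumption.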

Accumulator behavior and datatype promotion are target-specific:

  • A2/A3: accumulation uses the accumulator tile's native datatype (int32_t or float). int8 products are accumulated in 32 bits, and floating-point accumulation uses IEEE round-to-nearest-even.
  • A5: accumulation is always in the accumulator tile's native type, and floating-point accumulation follows the accumulator's native rounding mode.
  • CPU simulator: follows A5 semantics.
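To illustrate why int8 operands are accumulated in 32 bits, the sketch below computes one output column on the host. It is an assumption-laden illustration, not target code: dot_acc_i32 is a hypothetical helper, and the operands are plain std::vector buffers rather than tiles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the PTO API): one output column of an
// int8 GEMV, accumulated in int32_t as in the (int32_t, int8_t, int8_t) case.
inline std::int32_t dot_acc_i32(std::int32_t c_in,
                                const std::vector<std::int8_t> &a,
                                const std::vector<std::int8_t> &b_col) {
  assert(a.size() == b_col.size());
  std::int32_t acc = c_in;
  for (std::size_t k = 0; k < a.size(); ++k) {
    // Each product fits in 16 bits; the running sum needs the wider type.
    acc += static_cast<std::int32_t>(a[k]) * static_cast<std::int32_t>(b_col[k]);
  }
  return acc;
}
```

With K = 4095 and every operand equal to 127, the sum is 4095 * 16129 = 66,048,255, which is far outside the int8 range but well within int32_t.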

Syntax

Textual spelling is defined by the PTO ISA syntax-and-operands pages.

Synchronous form:

%acc1 = tgemv.acc %acc0, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>

AS Level 1 (SSA)

%c_out = pto.tgemv.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>

AS Level 2 (DPS)

pto.tgemv.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)

C++ Intrinsic

Declared in include/pto/common/pto_instr.hpp:

template <typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
PTO_INST RecordEvent TGEMV_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);

template <AccPhase Phase, typename TileRes, typename TileLeft, typename TileRight, typename... WaitEvents>
PTO_INST RecordEvent TGEMV_ACC(TileRes &cOutMatrix, TileRes &cInMatrix, TileLeft &aMatrix, TileRight &bMatrix, WaitEvents &... events);

Inputs

  • cIn (cInMatrix in the C++ intrinsic, %c_in in the assembly forms) is the input accumulator tile.
  • a (aMatrix, %a) is the left operand tile; its location must be Left.
  • b (bMatrix, %b) is the right operand tile; its location must be Right.
  • dst (cOutMatrix, %c_out) names the output accumulator tile. The operation iterates over dst's valid region.

Expected Outputs

dst holds the accumulated matrix-vector product: dst[0,j] = cIn[0,j] + sum over k of a[0,k] * b[k,j].

Side Effects

No architectural side effects beyond producing the destination tile. Does not implicitly fence unrelated traffic.

Constraints

Common shape and location constraints

  • Static shape constraints:

    • TileLeft::Rows == TileRes::Rows
    • TileLeft::Cols == TileRight::Rows
    • TileRight::Cols == TileRes::Cols
  • Tile locations:

    • TileLeft::Loc == Left
    • TileRight::Loc == Right
    • TileRes::Loc == Acc
  • Runtime valid-size constraints:

    • m must be 1
    • k and n (taken from bMatrix.GetValidRow() and bMatrix.GetValidCol()) must be in [1, 4095]
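The shape, location, and valid-size constraints listed above can be expressed as simple predicates. The sketch below is illustrative only: Loc, TileDesc, static_constraints_ok, and runtime_constraints_ok are hypothetical stand-ins and do not reproduce the actual PTO tile types or the checks performed by any backend.

```cpp
#include <cstddef>

// Hypothetical tile locations, mirroring the Left / Right / Acc names above.
enum class Loc { Left, Right, Acc };

// Hypothetical compile-time tile descriptor (a stand-in, not the PTO Tile type).
template <Loc L, std::size_t R, std::size_t C>
struct TileDesc {
  static constexpr Loc loc = L;
  static constexpr std::size_t Rows = R;
  static constexpr std::size_t Cols = C;
};

// Static shape and location constraints from the list above.
template <typename Res, typename Left, typename Right>
constexpr bool static_constraints_ok() {
  return Left::Rows == Res::Rows && Left::Cols == Right::Rows &&
         Right::Cols == Res::Cols && Left::loc == Loc::Left &&
         Right::loc == Loc::Right && Res::loc == Loc::Acc;
}

// Runtime valid-size constraints: m must be 1, k and n must be in [1, 4095].
constexpr bool runtime_constraints_ok(std::size_t m, std::size_t k,
                                      std::size_t n) {
  return m == 1 && k >= 1 && k <= 4095 && n >= 1 && n <= 4095;
}
```

For the 1 x 16, 16 x 16, and 1 x 16 shapes used in the examples on this page, both predicates hold.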

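Datatype constraints

Supported (CType, AType, BType) combinations are target-specific and are listed under Target-Profile Restrictions.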

Exceptions

  • Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend instruction set.
  • Programs must not rely on behavior outside the documented legal domain of this operation, even if one backend currently accepts it.

Target-Profile Restrictions

  • Implementation checks (A2A3):

    • Supported (CType, AType, BType) triples:
      • (int32_t, int8_t, int8_t)
      • (float, half, half)
      • (float, float, float)
      • (float, bfloat16_t, bfloat16_t)
  • Implementation checks (A5):

    • Accumulator type must be int32_t or float.
    • If int32_t: AType == int8_t and BType == int8_t.
    • If float: supports half, bfloat16_t, float, and selected fp8 pairs (target-defined).
    • Fractal/layout constraints are enforced:
      • Left: Loc == Left, !isRowMajor, SFractal == RowMajor
      • Right: Loc == Right, isRowMajor, SFractal == ColMajor
      • Acc: Loc == Acc, !isRowMajor, SFractal == RowMajor
    • No separate explicit m/k/n runtime assertions are enforced in the underlying A5 matmul implementation beyond the GEMV contract described above.
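As a compact restatement of the A2A3 datatype list above, the sketch below encodes the four supported triples as a predicate. DType and a2a3_triple_ok are hypothetical names introduced for illustration; the enum merely stands in for the real element types (int8_t, int32_t, half, bfloat16_t, float), and the A5 fp8 pairs are left out because they are target-defined.

```cpp
// Hypothetical datatype tags standing in for int8_t, int32_t, half,
// bfloat16_t, and float (not PTO types).
enum class DType { Int8, Int32, Half, BFloat16, Float };

// Returns true if (c, a, b) is one of the A2A3 triples listed above:
//   (int32_t, int8_t, int8_t), (float, half, half),
//   (float, float, float), (float, bfloat16_t, bfloat16_t).
constexpr bool a2a3_triple_ok(DType c, DType a, DType b) {
  return (c == DType::Int32 && a == DType::Int8 && b == DType::Int8) ||
         (c == DType::Float && a == b &&
          (a == DType::Half || a == DType::Float || a == DType::BFloat16));
}
```

Mixed operand types such as (float, half, float) are rejected by this predicate because the list only contains triples whose two operand types match.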

Examples

Auto

#include <pto/pto-inst.hpp>

using namespace pto;

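void example_auto() {
  using A = TileLeft<half, 1, 16>;    // left operand: 1 x 16 (M x K)
  using B = TileRight<half, 16, 16>;  // right operand: 16 x 16 (K x N)
  using C = TileAcc<float, 1, 16>;    // accumulator: 1 x 16 (M x N)
  A a;
  B b;
  C c0, c1;  // c0: input accumulator, c1: output accumulator
  TGEMV_ACC(c1, c0, a, b);  // c1 = c0 + a * b
}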

Manual

#include <pto/pto-inst.hpp>

using namespace pto;

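void example_manual() {
  using A = TileLeft<half, 1, 16>;    // left operand: 1 x 16 (M x K)
  using B = TileRight<half, 16, 16>;  // right operand: 16 x 16 (K x N)
  using C = TileAcc<float, 1, 16>;    // accumulator: 1 x 16 (M x N)
  A a;
  B b;
  C c0, c1;  // c0: input accumulator, c1: output accumulator
  // Manual mode: bind each tile explicitly before issuing the instruction.
  TASSIGN(a, 0x1000);
  TASSIGN(b, 0x2000);
  TASSIGN(c0, 0x3000);
  TASSIGN(c1, 0x4000);
  TGEMV_ACC(c1, c0, a, b);  // c1 = c0 + a * b
}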

Auto Mode

# Auto mode: compiler/runtime-managed placement and scheduling.
%c_out = pto.tgemv.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>

Manual Mode

# Manual mode: bind resources explicitly before issuing the instruction.
# Optional for tile operands:
# pto.tassign %arg0, @tile(0x1000)
# pto.tassign %arg1, @tile(0x2000)
%c_out = pto.tgemv.acc %c_in, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>

PTO Assembly Form

%acc1 = tgemv.acc %acc0, %a, %b : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
# AS Level 2 (DPS)
pto.tgemv.acc ins(%c_in, %a, %b : !pto.tile_buf<...>, !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%c_out : !pto.tile_buf<...>)