pto.tprefetch_async¶

Introduction¶

Prefetch data from Global Memory (GM/HBM) into the NPU's L2 Cache via SDMA CMO (Cache Maintenance Operation, opcode=6). Unlike TPREFETCH, which moves data from GM into UB, TPREFETCH_ASYNC keeps data in L2 cache only, consuming no UB space for the data itself. This allows subsequent TLOAD operations to hit L2 cache instead of going to GM.

TPREFETCH_ASYNC is logically a memory-access/cache-hint instruction. It uses the SDMA CMO path internally, but the public API lives in namespace pto, next to pto::TPREFETCH.

Data Flow¶

GM / HBM  ──(SDMA CMO prefetch)──>  L2 Cache
                                       │
                                       └── subsequent TLOAD hits L2 (fast)

C++ Intrinsic¶

Declared in include/pto/common/pto_instr.hpp, namespace pto:

namespace pto {

template <typename GlobalData, typename... WaitEvents>
PTO_INST comm::AsyncEvent TPREFETCH_ASYNC(GlobalData &src, PrefetchAsyncContext &ctx,
                                          WaitEvents &... events);

} // namespace pto

PrefetchAsyncContext contains the SDMA workspace pointer prepared host-side via SdmaWorkspaceManager::Init:

pto::PrefetchAsyncContext ctx(workspace);
auto evt = pto::TPREFETCH_ASYNC(srcGlobal, ctx);
evt.Wait(ctx.session);

The instruction builds and stores an SDMA session inside PrefetchAsyncContext with default parameters (channelGroupIdx = get_block_idx(), syncId = 0, queue_num = 1). Use the returned comm::AsyncEvent with ctx.session to wait for completion when a dependent consumer must observe the prefetch.

Parameters¶

Parameter	Type	Description
`src`	`GlobalData&`	Source GlobalTensor region to prefetch into L2
`ctx`	`PrefetchAsyncContext&`	Compute-side prefetch context containing the SDMA workspace base and internally built `comm::AsyncSession`
`events...`	`WaitEvents&...`	Optional wait events for synchronization

Return Value¶

comm::AsyncEvent - handle for asynchronous completion tracking. Call evt.Wait(ctx.session) before a dependent TLOAD when the consumer must observe prefetched data.

Constraints¶

Source data must be in Global Memory (GM/HBM address space).
For the GlobalTensor overload, the tensor must be flat contiguous (packed 1D layout).
The SDMA workspace must be initialized by host code before the kernel launch and passed into the kernel.
PrefetchAsyncContext owns a 256-byte UB scratch tile and an AsyncSession used while constructing and waiting on SDMA metadata.
SDMA CMO operates at cache-line granularity; non-aligned prefetch ranges are rounded by hardware.
On CPU simulation backend, this instruction is a no-op (returns an empty AsyncEvent).

Comparison with TPREFETCH¶

Dimension	`pto::TPREFETCH`	`pto::TPREFETCH_ASYNC`
Data flow	GM → UB	GM → L2 Cache
Hardware path	MTE (`copy_gm_to_ubuf`)	SDMA CMO (opcode=6)
UB consumption	Yes (requires dst Tile)	No (only 256B scratch for SQE construction)
Synchronization	Synchronous (pipeline barrier)	Asynchronous (`AsyncEvent`)
Use case	Small data preload to UB	Large data L2 warm-up

Examples¶

Basic Usage¶

#include <pto/pto-inst.hpp>

using namespace pto;

__global__ AICORE void my_kernel(__gm__ half *src, __gm__ half *dst,
                                 __gm__ uint8_t *workspace)
{
    using GShape = Shape<1, 1, 1, 1, 16384>;
    using GStride = Stride<1, 1, 1, 1, 1>;
    GlobalTensor<half, GShape, GStride> srcGlobal(src);

    PrefetchAsyncContext ctx(workspace);
    auto evt = TPREFETCH_ASYNC(srcGlobal, ctx);
    evt.Wait(ctx.session);

    using TileData = Tile<TileType::Vec, half, 128, 128, BLayout::RowMajor>;
    TileData tile;
    TASSIGN(tile, 0x100);
    TLOAD(tile, srcGlobal);
}