pto.tput_async¶

Introduction¶

pto.tput_async is an asynchronous remote write primitive. It starts a transfer from local GM to remote GM and returns an AsyncEvent immediately.

Data flow:

srcGlobalData (local GM) -> DMA engine -> dstGlobalData (remote GM)

Template Parameter¶

engine:
- DmaEngine::SDMA (default)
- DmaEngine::URMA (Ascend950, NPU_ARCH 3510 only)

Important (SDMA path) pto.tput_async with DmaEngine::SDMA currently supports only flat contiguous logical 1D tensors. Non-1D or non-contiguous layouts are not supported by the current SDMA async implementation.

C++ Intrinsic¶

Declared in include/pto/comm/pto_comm_inst.hpp.

template <DmaEngine engine = DmaEngine::SDMA,
          typename GlobalDstData, typename GlobalSrcData, typename... WaitEvents>
PTO_INST AsyncEvent PUT_ASYNC(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
                             const AsyncSession &session, WaitEvents &... events);

AsyncSession is an engine-agnostic session object. Build once with BuildAsyncSession<engine>(), then pass to all async calls and event waits. The template engine parameter selects the DMA backend at compile time, making the code forward-compatible with future engines (CCU, etc.).

AsyncSession Construction¶

Use BuildAsyncSession from include/pto/comm/async_common/async_event_impl.hpp. There are two overloads — one for SDMA and one for URMA — with different parameter lists.

SDMA Construction (default)¶

template <DmaEngine engine = DmaEngine::SDMA, typename ScratchTile>
PTO_INTERNAL bool BuildAsyncSession(ScratchTile &scratchTile,
                                   __gm__ uint8_t *workspace,
                                   AsyncSession &session,
                                   uint32_t syncId = 0,
                                   const sdma::SdmaBaseConfig &baseConfig = {sdma::kDefaultSdmaBlockBytes, 0, 1},
                                   uint32_t channelGroupIdx = sdma::kAutoChannelGroupIdx);

Parameter	Default	Description
`scratchTile`	—	UB scratch tile for SDMA control metadata (see scratchTile Role).
`workspace`	—	GM pointer allocated by host-side `SdmaWorkspaceManager`.
`session`	—	Output `AsyncSession` object.
`syncId`	`0`	MTE3/MTE2 pipe sync event id (0-7). Override if kernel uses other pipe barriers on the same id.
`baseConfig`	`{kDefaultSdmaBlockBytes, 0, 1}`	`{block_bytes, comm_block_offset, queue_num}`. Suitable for most single-queue transfers.
`channelGroupIdx`	`kAutoChannelGroupIdx`	SDMA channel group index. Default uses `get_block_idx()` internally, mapping to current AI core. Override for multi-block or custom channel mapping scenarios.

URMA Construction (NPU_ARCH 3510 only)¶

URMA (User-level RDMA Memory Access) is a hardware-accelerated RDMA transport available on Ascend950 (NPU_ARCH 3510). URMA requires CANN Toolkit >= 9.1.0.

#ifdef PTO_URMA_SUPPORTED
template <DmaEngine engine>
PTO_INTERNAL bool BuildAsyncSession(__gm__ uint8_t *workspace,
                                   uint32_t destRankId,
                                   AsyncSession &session);
#endif

Parameter	Description
`workspace`	GM pointer allocated by host-side `UrmaWorkspaceManager`.
`destRankId`	Remote PE rank id that this session communicates with. For `pto.tput_async` this is the destination rank.
`session`	Output `AsyncSession` object.

URMA does not require scratchTile — polling uses ld_dev/st_dev hardware intrinsics directly.

Constraints¶

Constraints

GlobalSrcData::RawDType == GlobalDstData::RawDType
GlobalSrcData::layout == GlobalDstData::layout
Both SDMA and URMA paths require source tensor to be flat contiguous logical 1D only
SDMA workspace must be a valid GM pointer allocated by host-side SdmaWorkspaceManager
URMA workspace must be a valid GM pointer allocated by host-side UrmaWorkspaceManager
URMA is only available on NPU_ARCH 3510 (Ascend950)
URMA requires CANN Toolkit >= 9.1.0
The symmetric data buffer passed to UrmaWorkspaceManager::Init() must be backed by huge-page memory (allocate with ACL_MEM_MALLOC_HUGE_ONLY). The underlying MR registration requires huge-page backing; ACL_MEM_MALLOC_HUGE_FIRST may silently fall back to 4KB pages for small allocations, causing registration to fail

If the 1D contiguous requirement is not met, current implementation returns an invalid async event (handle == 0).

scratchTile Role¶

scratchTile is not the payload staging buffer for user data. It is converted to TmpBuffer and used as temporary UB workspace for:

writing/reading SDMA control words (flag, sq_tail, channel_info)
polling event completion flags
committing queue tail during completion

Data payload moves between GM buffers directly; scratchTile only supports control and synchronization metadata.

scratchTile Type and Size Constraints¶

must be a pto::Tile type
must be UB/Vec tile (ScratchTile::Loc == TileType::Vec)
available bytes must be at least sizeof(uint64_t) (8 bytes)

Recommended: Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE> (256B).

Completion Semantics (Quiet Semantics)¶

The completion mechanism differs by engine, but user-facing quiet semantics are identical:

SDMA: pto.tput_async only submits data transfer SQEs. The flag SQE is deferred to Wait, which polls the flag for completion.
URMA: pto.tput_async submits an RDMA WRITE WQE and rings the doorbell immediately. Wait polls the Completion Queue (CQ) until all expected CQEs have been consumed.
event.Wait(session) — blocks until all async operations issued since the last Wait are complete

This means after multiple pto.tput_async calls, a single Wait on the last returned AsyncEvent drains all pending operations (similar to shmem's quiet semantics).

After wait succeeds, all issued writes to dstGlobalData are complete.

Example¶

Single Transfer¶

#include <pto/comm/pto_comm_inst.hpp>
#include <pto/common/pto_tile.hpp>

using namespace pto;

template <typename T>
__global__ AICORE void SimplePut(__gm__ T *remoteDst, __gm__ T *localSrc,
                                __gm__ uint8_t *sdmaWorkspace)
{
    using ShapeDyn = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
    using GT = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;

    ShapeDyn shape(1, 1, 1, 1, 1024);
    StrideDyn stride(1024, 1024, 1024, 1024, 1);
    GT dstG(remoteDst, shape, stride);
    GT srcG(localSrc, shape, stride);

    ScratchTile scratchTile;
    TASSIGN(scratchTile, 0x0);

    comm::AsyncSession session;
    if (!comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, sdmaWorkspace, session)) {
        return;
    }

    auto event = comm::PUT_ASYNC<comm::DmaEngine::SDMA>(dstG, srcG, session);
    (void)event.Wait(session);
}

Batch Transfer (Quiet Semantics)¶

template <typename T>
__global__ AICORE void BatchPut(__gm__ T *remoteDstBase, __gm__ T *localSrc,
                                __gm__ uint8_t *sdmaWorkspace, int nranks)
{
    using ShapeDyn = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
    using GT = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;

    ShapeDyn shape(1, 1, 1, 1, 1024);
    StrideDyn stride(1024, 1024, 1024, 1024, 1);
    GT srcG(localSrc, shape, stride);

    ScratchTile scratchTile;
    TASSIGN(scratchTile, 0x0);

    comm::AsyncSession session;
    if (!comm::BuildAsyncSession(scratchTile, sdmaWorkspace, session)) {
        return;
    }

    comm::AsyncEvent lastEvent;
    for (int rank = 0; rank < nranks; ++rank) {
        GT dstG(remoteDstBase + rank * 1024, shape, stride);
        lastEvent = comm::PUT_ASYNC(dstG, srcG, session);
    }
    (void)lastEvent.Wait(session);  // single Wait drains all pending ops
}

URMA Example (NPU_ARCH 3510)¶

#include <pto/comm/pto_comm_inst.hpp>
#include <pto/common/pto_tile.hpp>

using namespace pto;

template <typename T>
__global__ AICORE void SimplePutUrma(__gm__ T *remoteDst, __gm__ T *localSrc,
                                      __gm__ uint8_t *urmaWorkspace, uint32_t destRankId)
{
    using ShapeDyn = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
    using GT = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;

    ShapeDyn shape(1, 1, 1, 1, 1024);
    StrideDyn stride(1024, 1024, 1024, 1024, 1);
    GT dstG(remoteDst, shape, stride);
    GT srcG(localSrc, shape, stride);

    comm::AsyncSession session;
    if (!comm::BuildAsyncSession<comm::DmaEngine::URMA>(urmaWorkspace, destRankId, session)) {
        return;
    }

    auto event = comm::PUT_ASYNC<comm::DmaEngine::URMA>(dstG, srcG, session);
    (void)event.Wait(session);
}