pto.vmrgsort¶
pto.vmrgsort is part of the SFU And DSA Instructions instruction set.
Summary¶
Merge-sort 4 pre-sorted input vectors.
Mechanism¶
pto.vmrgsort performs a 4-way merge sort. It takes 4 pre-sorted input vectors from UB memory, merges them into a single sorted output vector, and writes the result back to UB.
Key properties:
- Inputs MUST be pre-sorted in ascending (or descending, per %config) order
- Exactly 4 input segments are merged
- The operation is stable: equal elements retain their relative order from the original inputs
- Sort order is controlled by the %config control word
Control word (%config):
- Encodes sort order direction (ASC/DESC)
- May encode element width and comparison mode
- See the target-profile specification for the full bit-field layout
Syntax¶
PTO Assembly Form¶
vmrgsort4 %dest, %src0, %src1, %src2, %src3, %count, %config : !pto.ptr<T, ub>, !pto.ptr<T, ub>, !pto.ptr<T, ub>, !pto.ptr<T, ub>, !pto.ptr<T, ub>, i64, i64
AS Level 1 (SSA)¶
pto.vmrgsort4 %dest, %src0, %src1, %src2, %src3, %count, %config
: !pto.ptr<f32, ub>, !pto.ptr<f32, ub>, !pto.ptr<f32, ub>, !pto.ptr<f32, ub>, !pto.ptr<f32, ub>, i64, i64
Inputs¶
| Operand | Type | Description |
|---|---|---|
%dest |
!pto.ptr<T, ub> |
UB destination pointer for merged output |
%src0 |
!pto.ptr<T, ub> |
First pre-sorted input segment |
%src1 |
!pto.ptr<T, ub> |
Second pre-sorted input segment |
%src2 |
!pto.ptr<T, ub> |
Third pre-sorted input segment |
%src3 |
!pto.ptr<T, ub> |
Fourth pre-sorted input segment |
%count |
i64 |
Number of valid elements per input segment |
%config |
i64 |
Sort order and comparison mode control word |
Expected Outputs¶
This op writes UB memory directly and returns no SSA value. The sorted result is written to %dest.
Side Effects¶
This operation has no architectural side effect beyond producing its SSA results. It does not implicitly reserve buffers, signal events, or establish memory fences.
Constraints¶
Constraints
- Pre-sorted inputs: ALL four input segments MUST be pre-sorted in the order specified by
%config. Feeding unsorted data produces undefined output. - Same sort order: All four input segments MUST use the same sort order and comparison mode as encoded in
%config. - Same element type: All inputs and the destination MUST use the same element type
T. - UB address space: All pointers MUST have address space
ub. - Single active predicate: Loading a new predicate does not implicitly save a prior predicate. Programs that need to preserve predicate state MUST save it first.
Exceptions¶
Exceptions
- Illegal if any input pointer is not a UB-space pointer.
- Illegal if the effective address (base + areg * 8) is not 64-bit aligned.
- Illegal if the
distattribute value is not in the allowed set for this form.
Target-Profile Restrictions¶
Target-Profile Restrictions
- A5 is the most detailed concrete profile in the current manual; CPU simulation and A2/A3-class targets may support narrower subsets or emulate the behavior while preserving the visible PTO contract.
- The
configword layout and supported element types are profile-specific.
Examples¶
Merge 4 sorted segments¶
// Pre-condition: four sorted arrays in UB
// %sorted_a, %sorted_b, %sorted_c, %sorted_d each contain sorted data
// Merge all four into one sorted output
pto.vmrgsort4 %dest,
%sorted_a, %sorted_b, %sorted_c, %sorted_d,
%count, %config
: !pto.ptr<f32, ub>, !pto.ptr<f32, ub>, !pto.ptr<f32, ub>,
!pto.ptr<f32, ub>, !pto.ptr<f32, ub>, i64, i64
Sort preparation pipeline¶
// Step 1: Sort each segment independently (e.g., using vsort32)
pto.vsort32 %sorted_a, %unsorted_a, %config : !pto.ptr<f32, ub>, !pto.ptr<f32, ub>, i64
pto.vsort32 %sorted_b, %unsorted_b, %config : !pto.ptr<f32, ub>, !pto.ptr<f32, ub>, i64
pto.vsort32 %sorted_c, %unsorted_c, %config : !pto.ptr<f32, ub>, !pto.ptr<f32, ub>, i64
pto.vsort32 %sorted_d, %unsorted_d, %config : !pto.ptr<f32, ub>, !pto.ptr<f32, ub>, i64
// Step 2: Merge the 4 sorted segments
pto.vmrgsort4 %dest,
%sorted_a, %sorted_b, %sorted_c, %sorted_d,
%count, %config
: !pto.ptr<f32, ub>, !pto.ptr<f32, ub>, !pto.ptr<f32, ub>,
!pto.ptr<f32, ub>, !pto.ptr<f32, ub>, i64, i64
Performance¶
A5 Latency¶
SFU operations have higher latency than standard arithmetic ops. Consult the target profile's performance model for cycle-accurate estimates.
A2/A3 Throughput¶
| Metric | Value | Constant | |
|---|---|---|---|
| Startup latency | 14 | A2A3_STARTUP_BINARY |
|
| Completion latency | 26 | A2A3_COMPL_FP32_EXP |
|
| Per-repeat throughput | 2 | A2A3_RPT_2 |
|
| Pipeline interval | 18 | A2A3_INTERVAL |
Detailed Notes¶
pto.vmrgsort is a UB-to-UB accelerator operation. Unlike pure vreg -> vreg ops that operate on vector registers, this instruction moves data directly between UB locations with hardware-accelerated merge logic.
The 4-way merge pattern is common in:
- Parallel sort: Partition data into 4 segments, sort each with vsort32, then merge
- K-way merge: Merge results from independent sorted streams
- Top-K selection: Find the K smallest/largest elements across multiple sources
Related Ops / Instruction Set Links¶
- Instruction set overview: SFU And DSA Instructions
- Previous op in instruction set: pto.vsort32