Predicate Load Store
The predicate load/store instruction set moves predicate-register state (!pto.mask<G>) between UB-visible storage and the architectural predicate register file. Predicates are the lane-masking mechanism that pto.v* vector operations consume.
Mechanism
Predicate state lives in the predicate register file. The pld*/pst* operations transfer predicate bits to or from UB memory locations, so that predicates can persist across kernel boundaries or be shared with scalar address calculations.
Data Flow
UB location (64-bit aligned) ──(plds/pld/pldi)──► Predicate Register File
Predicate Register File ──(psts/pst/psti/pstu)──► UB location
Predicate Width
| Element Type | Vector Width N | Predicate Width |
|---|---|---|
| f32 | 64 | 64 bits |
| f16 / bf16 | 128 | 128 bits (2 × 64-bit transfers) |
| i8 / u8 | 256 | 256 bits (4 × 64-bit transfers) |
A single predicate load/store operation covers the full predicate width for the element type in use. Partial predicate loads are not supported.
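The rows of the table follow one pattern: the predicate holds one bit per lane, and the lane count is the vector size divided by the element size. The following is a minimal Python sketch of that arithmetic. It assumes a 256-byte vector register, which is inferred from the table (64 × 4 B = 128 × 2 B = 256 × 1 B) rather than stated on this page, and the helper name predicate_layout is hypothetical.

```python
# Sketch only: VECTOR_BYTES is an assumption inferred from the table above
# (64 x 4 B = 128 x 2 B = 256 x 1 B = 256 B); this page does not state it.
VECTOR_BYTES = 256
TRANSFER_BITS = 64  # predicate words are moved in 64-bit units

ELEMENT_BYTES = {"f32": 4, "f16": 2, "bf16": 2, "i8": 1, "u8": 1}


def predicate_layout(element_type: str) -> tuple[int, int, int]:
    """Return (lanes, predicate_bits, number_of_64bit_transfers)."""
    lanes = VECTOR_BYTES // ELEMENT_BYTES[element_type]
    predicate_bits = lanes  # one predicate bit per lane
    transfers = predicate_bits // TRANSFER_BITS
    return lanes, predicate_bits, transfers


if __name__ == "__main__":
    for name in ELEMENT_BYTES:
        print(name, predicate_layout(name))
```

Because partial predicate transfers are not supported, the transfer count returned here is the number of 64-bit words a single load or store covers for that element type, not something the program can choose.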
Alignment Requirements
| Operation | Alignment Requirement | Consequence of Violation |
|---|---|---|
| plds / psts | 64-bit (8 bytes) at UB address | Illegal if address not 8-byte aligned |
| pld / pst (areg offset) | 64-bit; offset must be register-aligned | Illegal if address or offset violates alignment |
| pldi / psti (immediate offset) | 64-bit; offset must be compile-time constant | Illegal if immediate violates alignment |
| pstu (stream form) | None; tracks alignment state internally | On A2/A3 and A5: alignment base is initialized to zero on first use; on CPU simulator: alignment base starts as undefined until the first store updates it. Subsequent stores always update %align_out with the new alignment base. |
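A front end or test harness could reject misaligned accesses before they reach the hardware. The sketch below is one conservative way to do that in Python, under two assumptions that this page does not confirm: addresses and offsets are byte values, and requiring both the base and the offset to be 8-byte aligned is an acceptable (possibly stricter than necessary) reading of the table. The function name is hypothetical.

```python
ALIGN_BYTES = 8  # 64-bit alignment for plds/psts, pld/pst, pldi/psti


def is_legal_aligned_access(base: int, offset: int = 0) -> bool:
    """Conservative check for the non-stream predicate load/store forms.

    ASSUMPTION: base and offset are byte quantities. Requiring both to be
    8-byte aligned guarantees that the effective address (base + offset)
    is 8-byte aligned as well.
    """
    if base < 0:
        raise ValueError("UB base address must be non-negative")
    return base % ALIGN_BYTES == 0 and offset % ALIGN_BYTES == 0


if __name__ == "__main__":
    print(is_legal_aligned_access(0x1000))      # aligned base, no offset
    print(is_legal_aligned_access(0x1000, 16))  # aligned base and offset
    print(is_legal_aligned_access(0x1000, 4))   # offset breaks alignment
```

The check deliberately does not model pstu, whose alignment is tracked through its alignment-state operand rather than validated per address.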
Distribution Modes
Distribution modes (dist attribute) control how predicate bits are packed into UB storage. All load/store forms accept a dist attribute:
| Mode | Description | Load Behavior | Store Behavior |
|---|---|---|---|
| NORM | Normal packing | Read 64-bit predicate word directly | Write 64-bit predicate word directly |
| PK | Packed (store only) | Not applicable | Pack two 32-bit predicate segments into one 64-bit word |
| US | Unsigned streaming | UB bits as-is | UB bits as-is |
| DS | Signed streaming | UB bits as-is, sign-extend | UB bits as-is |
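To make the PK row concrete, the following Python sketch packs two 32-bit segments into one 64-bit word. The bit order is an assumption made for illustration only: this page does not say which segment lands in the low half, so the layout below (first segment in bits 0..31, second in bits 32..63) should be checked against the pto.psts per-op page before relying on it. The function name pack_pk is hypothetical.

```python
MASK32 = 0xFFFF_FFFF


def pack_pk(first_segment: int, second_segment: int) -> int:
    """Pack two 32-bit predicate segments into one 64-bit word.

    ASSUMPTION (not confirmed by the spec text): first_segment occupies
    bits 0..31 and second_segment occupies bits 32..63.
    """
    for segment in (first_segment, second_segment):
        if not 0 <= segment <= MASK32:
            raise ValueError("each segment must fit in 32 bits")
    return (second_segment << 32) | first_segment


if __name__ == "__main__":
    word = pack_pk(0x0000_FFFF, 0x0000_0001)
    print(f"{word:#018x}")
```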
Shared Constraints
All predicate load/store operations MUST satisfy:
- UB address space: The pointer operand MUST have type !pto.ptr<T, ub>. Predicates cannot be transferred directly to/from GM.
- Alignment: The effective UB address (base + offset) MUST be 64-bit aligned. The stream form (pstu) relaxes this but imposes its own ordering requirements.
- Predicate width match: The transfer covers the full predicate width for the active element type. Partial transfers are not permitted.
- Event ordering: When used in a producer-consumer chain with DMA, the program MUST use set_flag/wait_flag to order the predicate transfer before or after the dependent operation.
- Single active predicate: At any point in program order, at most one predicate register is architecturally active. Concurrent predicate transfers that would overwrite an in-flight predicate are illegal.
Stream Form (pstu)
pto.pstu is the high-throughput stream variant of predicate store. It differs from psts in the following ways:
| Aspect | psts | pstu |
|---|---|---|
| Alignment | 64-bit required | None required |
| Write atomicity | Single predicate word is atomic | Writes may be batched; individual 64-bit words are not guaranteed atomic |
| Alignment state | Not updated | Updates %align_out with new alignment base |
| Use case | Exact predicate save/restore | Streaming predicate writes with internal buffering |
Programs that require exact predicate state restoration (e.g., saving and restoring a mask for later reuse) MUST use psts. Programs that stream predicates as part of a larger pipeline SHOULD use pstu.
Predicate Lifecycle
A typical predicate load/store lifecycle:
// Kernel entry: load the saved predicate (b32 granularity, matching the f32 vectors below)
%mask = pto.plds %ub_saved : !pto.ptr<i64, ub> -> !pto.mask<b32>
// Use the predicate for vector computation
%result = pto.vsel %v_true, %v_false, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask<b32> -> !pto.vreg<64xf32>
// Kernel exit: save the predicate for the next kernel
pto.psts %mask, %ub_saved : !pto.mask<b32>, !pto.ptr<i64, ub>
Target-Profile Restrictions
| Feature | CPU Simulator | A2/A3 | A5 |
|---|---|---|---|
| plds / psts | Simulated | Supported | Supported |
| pld / pst (areg) | Simulated | Supported | Supported |
| pldi / psti (immediate) | Simulated | Supported | Supported |
| pstu stream form | Not supported | Supported | Supported |
| PK distribution mode | Not supported | Supported | Supported |
| Alignment relaxation (pstu) | Not applicable | Supported | Supported |
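A build script or test harness may want to consult this matrix before emitting a predicate operation for a given target. The Python sketch below transcribes the table into a lookup. The feature keys, target keys, and function name are hypothetical labels chosen for this example, not identifiers defined by PTO.

```python
# Transcription of the table above; all keys are illustrative labels.
SUPPORT = {
    "plds/psts": {"cpu_sim": "simulated", "a2a3": "supported", "a5": "supported"},
    "pld/pst": {"cpu_sim": "simulated", "a2a3": "supported", "a5": "supported"},
    "pldi/psti": {"cpu_sim": "simulated", "a2a3": "supported", "a5": "supported"},
    "pstu": {"cpu_sim": "not supported", "a2a3": "supported", "a5": "supported"},
    "dist=PK": {"cpu_sim": "not supported", "a2a3": "supported", "a5": "supported"},
    "pstu alignment relaxation": {
        "cpu_sim": "not applicable",
        "a2a3": "supported",
        "a5": "supported",
    },
}


def is_usable(feature: str, target: str) -> bool:
    """True when the feature is either supported or simulated on the target."""
    return SUPPORT[feature][target] in ("supported", "simulated")


if __name__ == "__main__":
    for feature in SUPPORT:
        usable_on = [t for t in ("cpu_sim", "a2a3", "a5") if is_usable(feature, t)]
        print(f"{feature}: {', '.join(usable_on)}")
```

Treating "simulated" as usable is a design choice for this sketch; a harness that needs hardware-exact behavior could instead accept only "supported".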
Per-Op Pages
- pto.plds — Contiguous predicate load
- pto.pld — Predicate load with areg offset
- pto.pldi — Predicate load with immediate offset
- pto.psts — Contiguous predicate store
- pto.pst — Predicate store with areg offset
- pto.psti — Predicate store with immediate offset
- pto.pstu — Predicate unaligned stream store