'pcf' Dialect
A dialect designed to model parallel control flow.
The pcf dialect models parallelized control flow using structured operations akin to dialects like scf. It offers a set of core loop-like constructs alongside the glue necessary to represent splitting and joining parallel work.
In contrast with scf, whose scope is purely focused on representing common control flow, the pcf dialect includes types, interfaces, and operations that represent dataflow across parallel workers. This comprises two key conceptual types:
- Scoped memory. This is a reference to memory that carries information about its allocation scope as well as how to synchronize it. This allows for fencing at fine granularities (e.g. allocation).
- Tokens. Types capable of managing synchronization of resources between threads. This could be anything ranging from fences + (named) barriers to producer/consumer queues implemented with ringbuffers.
PCF ops + types are designed to be lowered in three phases, starting from structural ops on scoped memory infused with synchronization tokens. Prior to each phase a different level of scheduling is implied.
- Tokens tied to resources are split and lowered to separate ops. Before this the compiler can perform coarse grain scheduling around resources according to their tied synchronization.
- Generic scoped memory is converted to memref. Since all tokens have been resolved by this point, this is just a matter of propagating layout and memory space.
- Wrapping structured ops are lowered to control flow (scf and/or cf).
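As a rough sketch of the second phase (the scope attribute, SSA names, and result layout here are illustrative placeholders, not actual pipeline output):

```mlir
// Phase 2 input: access through generic scoped memory. All tokens were
// resolved in phase 1, so only layout and memory space remain to be decided.
%t = pcf.read_slice %ref [0, 0] [4, 8] [1, 1]
    : !pcf.sref<4x8xf32, #foo.scope> to tensor<4x8xf32>
// Phase 2 output (conceptually): the sref is rewritten as a memref whose
// concrete layout and memory space were propagated by the conversion.
```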
This diagram illustrates where the dialect fits into executable lowering pipelines for typical GPUs. For CPUs and other accelerators, the same flow is intended to work modulo different levels of physical parallelism instead of thread/subgroup/lane.
v----------+----------v
| Executable Input |
| (Linalg on tensors) |
+----------v----------+
|
TileAndDistribute |
to workgroups |
|
v----------+----------v
| PCF and/or SCF |
+----------v----------+
|
SCF(.forall)ToPCF |
|
v---------+---------v
+--------|-------+ +-------|--------+
| Tile Op1 to | | Tile OpN to |
| Subgroups/ | | Subgroups/ |
| Threads/ |...| Threads/ |
| Lanes | | Lanes |
| pcf.concurrent | | pcf.concurrent |
+--------|-------+ +-------|--------+
+---------v---------+
|
Vectorize | // WriteOps vectorize
|
Bufferize | // PCF tensor -> pcf.sref
| // becomes memref -> sref
|
ResolveTokens |
SRefToMemRef | // pcf.sref -> memref
LowerPCF | // pcf -> scf/cf
|
v-----------+-----------v
| SCF+GPU+vector+memref |
+-----------------------+
Operations
Alloc ops
pcf.alloc (PCF::AllocOp)
Shaped ref allocation operation
Syntax:
operation ::= `pcf.alloc` `(`$dynamicSizes`)` attr-dict `:` type($result)
Allocates a pcf.sref with the given element type and shape. Dynamic
dimensions in the result type must have corresponding dynamic size
operands. The allocation scope is determined by the scope attribute of
the result type.
Example:
%sref = pcf.alloc() : !pcf.sref<4x8xf32, #foo.scope>
%sref_dyn = pcf.alloc(%d0, %d1) : !pcf.sref<?x?xf32, #foo.scope>
Operands:
| Operand | Description |
|---|---|
| dynamicSizes | variadic of index |
Results:
| Result | Description |
|---|---|
| result | A shaped reference to a buffer. |
Parallel execution ops
pcf.br.cond_return (PCF::BranchCondReturnOp)
Branch operation with conditional return
Syntax:
operation ::= `pcf.br.cond_return` $condition $dest (`(` $dest_operands^ `:` type($dest_operands) `)`)? attr-dict
The pcf.br.cond_return operation either branches to a given block or returns
from the parent op, depending on the condition.
Example:
pcf.<scoped op> #foo.scope {
^bb0(%0: !foo.type):
%1 = ... %0 : !foo.type
pcf.br.cond_return %cond ^bb0(%0: !foo.type)
}
Traits: AlwaysSpeculatableImplTrait, HasParent<IREE::PCF::GenericOp>, Terminator
Interfaces: BranchOpInterface, ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Operands:
| Operand | Description |
|---|---|
| condition | 1-bit signless integer |
| dest_operands | variadic of any type |
Successors:
| Successor | Description |
|---|---|
| dest | any successor |
pcf.generic (PCF::GenericOp)
Execute a set of workers in parallel on a region.
Syntax:
operation ::= `pcf.generic` (`sync` $sync_on_return^)?
`scope` `(` $scope `)`
(`initialize` $initializer^)?
custom<ParallelExecutionBody>($inits,
type($inits),
$dynamic_sizes,
type($results),
$is_tied,
$region,
$num_leading_args,
"true")
custom<InferNumIndexArgs>(ref($region), ref($num_leading_args), $num_index_args)
prop-dict attr-dict
Executes a region across a set of workers at a specified scope. When
control flow reaches this op, nproc workers of the specified scope are
spawned and begin executing the region. The scope is given by an attribute
implementing the ScopeAttrInterface interface and is responsible for the semantics
of all pcf primitives at the same scope. Further details about scopes are
included in the docs for the interface.
The optional initialize region is executed once when control flow first
reaches the op. Values yielded from the initializer become block arguments
available to the execute region. This is useful for setting up per-op
state that persists across all worker invocations.
Results are produced by snapshotting the value of each result's tied sref once all workers have returned. Results can either be:
1. Tied to initial values (tensor or memref) - the init value provides the initial contents and the result captures the final state.
2. Allocated by the op itself - dynamic sizes must be provided for untied results with dynamic dimensions.
Basic example with tied results:
%0 = ... : tensor<4x8xf32>
%1 = pcf.generic scope(#foo.scope)
execute(%ref = %0)[%id: index, %num_workers: index]
: (!pcf.sref<4x8xf32, #foo.scope>) -> (tensor<4x8xf32>) {
// Each worker can read/write %ref.
pcf.return
}
Example with initializer:
%result = pcf.generic scope(#foo.scope)
initialize {
%scratch = pcf.alloc() : !pcf.sref<16xf32, #foo.scope>
pcf.yield %scratch : !pcf.sref<16xf32, #foo.scope>
} -> (%scratch_arg: !pcf.sref<16xf32, #foo.scope>)
execute(%ref = %init)[%id: index, %num_workers: index]
: (!pcf.sref<4x8xf32, #foo.scope>) -> (tensor<4x8xf32>) {
// %scratch_arg is available here, initialized once.
pcf.return
}
Example with untied (allocated) results:
%d0, %d1 = ... : index
%result = pcf.generic scope(#foo.scope)
execute[%id: index, %num_workers: index]
: () -> (tensor<?x?xf32>{%d0, %d1}) {
// Result sref is allocated by the op, not tied to any init.
pcf.return
}
Traits: AttrSizedOperandSegments, AutomaticAllocationScope, RecursiveMemoryEffects
Interfaces: OpAsmOpInterface
Attributes:
| Attribute | MLIR Type | Description |
|---|---|---|
| scope | ::mlir::iree_compiler::IREE::PCF::ScopeAttrInterface | Defines parallel execution scope. |
Operands:
| Operand | Description |
|---|---|
| inits | variadic of ranked tensor or memref of any type |
| dynamic_sizes | variadic of index |
Results:
| Result | Description |
|---|---|
| results | variadic of ranked tensor or memref of any type |
pcf.loop (PCF::LoopOp)
Execute a set of workers in parallel on a region.
Syntax:
operation ::= `pcf.loop` (`sync` $sync_on_return^)?
`scope` `(` $scope `)`
`count` `(` $count `)`
custom<ParallelExecutionBody>($inits,
type($inits),
$dynamic_sizes,
type($results),
$is_tied,
$region)
prop-dict attr-dict
Executes a region for each point in the iteration space defined by the
count operands. Unlike pcf.generic which spawns workers equal to the
native parallelism of the scope, pcf.loop explicitly specifies the
iteration count and maps iterations to workers according to the scope's
scheduling policy.
When control flow reaches this op, the scope determines how to distribute
the iterations across available workers. The scope is given by an attribute
implementing the ScopeAttrInterface interface. Further details about scopes are
included in the docs for the interface.
The execute region receives one index block argument per count operand, representing the current iteration's coordinates in the iteration space.
Results are produced by snapshotting the value of each result's tied sref once all iterations have completed. Results can either be:
1. Tied to initial values (tensor or memref) - the init value provides the initial contents and the result captures the final state.
2. Allocated by the op itself - dynamic sizes must be provided for untied results with dynamic dimensions.
Basic example with 1D iteration:
%n = ... : index
%0 = ... : tensor<4x8xf32>
%1 = pcf.loop scope(#foo.scope) count(%n)
execute(%ref = %0)[%id: index]
: (!pcf.sref<4x8xf32, #foo.scope>) -> (tensor<4x8xf32>) {
// %id ranges from 0 to %n-1.
pcf.return
}
Example with multi-dimensional iteration:
%m, %n = ... : index
%result = pcf.loop scope(#foo.scope) count(%m, %n)
execute(%ref = %init)[%i: index, %j: index]
: (!pcf.sref<?x?xf32, #foo.scope>) -> (tensor<?x?xf32>) {
// %i ranges from 0 to %m-1, %j ranges from 0 to %n-1.
pcf.return
}
Traits: AttrSizedOperandSegments, AutomaticAllocationScope, RecursiveMemoryEffects, SingleBlockImplicitTerminator<mlir::iree_compiler::IREE::PCF::ReturnOp>, SingleBlock
Interfaces: OpAsmOpInterface, RegionBranchOpInterface
Attributes:
| Attribute | MLIR Type | Description |
|---|---|---|
| scope | ::mlir::iree_compiler::IREE::PCF::ScopeAttrInterface | Defines parallel execution scope. |
Operands:
| Operand | Description |
|---|---|
| count | variadic of index |
| inits | variadic of ranked tensor or memref of any type |
| dynamic_sizes | variadic of index |
Results:
| Result | Description |
|---|---|
| results | variadic of ranked tensor or memref of any type |
Read ops
pcf.get_memref (PCF::GetMemrefOp)
Extract a memref view from a slice of a pcf.sref.
Syntax:
operation ::= `pcf.get_memref` $source ``
custom<DynamicIndexList>($offsets, $static_offsets)
custom<DynamicIndexList>($sizes, $static_sizes)
custom<DynamicIndexList>($strides, $static_strides)
attr-dict `:` type($source) `to` type($result)
The pcf.get_memref operation extracts a memref view from a slice of a
sref, breaking the synchronization guarantees of the source.
The returned memref must have a maximally dynamic layout (all strides and offset dynamic) and no memory space. Layout and memory space information is determined by the ConvertSRefToMemRef analysis pass.
The operation supports the following arguments:
* source: the sref from which to extract a view.
* offsets: shaped-rank number of offsets into the source from which
the slice begins.
* sizes: shaped-rank number of sizes which specify the sizes of the result
memref type.
* strides: shaped-rank number of strides that specify subsampling in each
dimension.
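A minimal sketch of the syntax above (the scope attribute, SSA names, and result layout are placeholders):

```mlir
// Extract a 2x4 view of a 4x8 sref starting at row %i, with unit strides.
// Note the maximally dynamic strided layout required on the result type.
%view = pcf.get_memref %sref [%i, 0] [2, 4] [1, 1]
    : !pcf.sref<4x8xf32, #foo.scope> to memref<2x4xf32, strided<[?, ?], offset: ?>>
```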
Traits: AttrSizedOperandSegments
Interfaces: OffsetSizeAndStrideOpInterface
Attributes:
| Attribute | MLIR Type | Description |
|---|---|---|
| static_offsets | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
| static_sizes | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
| static_strides | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
Operands:
| Operand | Description |
|---|---|
| source | A shaped reference to a buffer. |
| offsets | variadic of index |
| sizes | variadic of index |
| strides | variadic of index |
Results:
| Result | Description |
|---|---|
| result | memref of any type values |
pcf.read_slice (PCF::ReadSliceOp)
Read a tensor or vector from a pcf.sref based on the provided slice
parameters.
Syntax:
operation ::= `pcf.read_slice` $source ``
custom<DynamicIndexList>($offsets, $static_offsets)
custom<DynamicIndexList>($sizes, $static_sizes)
custom<DynamicIndexList>($strides, $static_strides)
attr-dict `:` type($source) `to` type($result)
Read a slice from a pcf.sref. If this is reading a vector, the sizes
may be smaller than the return vector type. In this case out of bounds
elements have undefined value.
The pcf.read_slice operation supports the following arguments:
* source: the sref from which the slice is read.
* offsets: shaped-rank number of offsets into the source from which
  the slice is read.
* sizes: shaped-rank number of sizes which specify the sizes of the result
  type.
* strides: shaped-rank number of strides that specify subsampling in each
  dimension.
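A minimal sketch of the syntax above (the scope attribute and SSA names are placeholders):

```mlir
// Read a 4x4 tensor slice starting at column %j, with unit strides.
%t = pcf.read_slice %sref [0, %j] [4, 4] [1, 1]
    : !pcf.sref<4x8xf32, #foo.scope> to tensor<4x4xf32>
```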
Traits: AttrSizedOperandSegments
Interfaces: OffsetSizeAndStrideOpInterface
Attributes:
| Attribute | MLIR Type | Description |
|---|---|---|
| static_offsets | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
| static_sizes | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
| static_strides | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
Operands:
| Operand | Description |
|---|---|
| source | A shaped reference to a buffer. |
| offsets | variadic of index |
| sizes | variadic of index |
| strides | variadic of index |
Results:
| Result | Description |
|---|---|
| result | ranked tensor of any type values or vector of any type values |
pcf.return (PCF::ReturnOp)
Returns from a thread.
Syntax:
operation ::= `pcf.return` attr-dict
Returns control flow to the parent without fencing memory. If the parent carries an implicit fence, one may still occur after the parent has finished.
Traits: AlwaysSpeculatableImplTrait, HasParent<IREE::PCF::GenericOp, IREE::PCF::LoopOp>, Terminator
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Write ops
pcf.write_slice (PCF::WriteSliceOp)
Submit a write of a tensor, vector, or memref to a slice of a pcf.sref.
Syntax:
operation ::= `pcf.write_slice` $source `into` $dest ``
custom<DynamicIndexList>($offsets, $static_offsets)
custom<DynamicIndexList>($sizes, $static_sizes)
custom<DynamicIndexList>($strides, $static_strides)
attr-dict `:` type($source) `into` type($dest)
The pcf.write_slice operation supports the following arguments:
- source: the shaped value that is written.
- dest: the sref into which the source is written.
- offsets: shaped-rank number of offsets into the dest into which
  the slice is inserted.
- sizes: shaped-rank number of sizes which specify the sizes of the source
  tensor type.
- strides: shaped-rank number of strides that specify subsampling in each
  dimension.
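A minimal sketch of the syntax above (the scope attribute and SSA names are placeholders):

```mlir
// Write a 4x4 tensor into the destination sref at column offset %j,
// with unit strides.
pcf.write_slice %src into %sref [0, %j] [4, 4] [1, 1]
    : tensor<4x4xf32> into !pcf.sref<4x8xf32, #foo.scope>
```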
Traits: AttrSizedOperandSegments
Interfaces: OffsetSizeAndStrideOpInterface
Attributes:
| Attribute | MLIR Type | Description |
|---|---|---|
| static_offsets | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
| static_sizes | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
| static_strides | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
Operands:
| Operand | Description |
|---|---|
| source | ranked tensor, vector, or memref of any type |
| dest | A shaped reference to a buffer. |
| offsets | variadic of index |
| sizes | variadic of index |
| strides | variadic of index |
pcf.yield (PCF::YieldOp)
Yields results from a region.
Syntax:
operation ::= `pcf.yield` attr-dict
$operands `:` type($operands)
Yielded values are copied by value.
Traits: AlwaysSpeculatableImplTrait, HasParent<IREE::PCF::GenericOp>, ReturnLike, Terminator
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface
Effects: MemoryEffects::Effect{}
Operands:
| Operand | Description |
|---|---|
| operands | variadic of any type |
Attributes
SequentialAttr
Attribute representing sequential execution
Syntax: #pcf.sequential
Scope that reuses the current process as the sole executor of a parallel region.
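A hedged sketch of how this scope might be used (whether the sequential scope composes with these exact ops and types this way is an assumption; SSA names are placeholders):

```mlir
// All pcf primitives at this scope execute on the current process as the
// sole worker, so the iteration space is traversed sequentially.
%0 = pcf.loop scope(#pcf.sequential) count(%n)
    execute(%ref = %init)[%i: index]
      : (!pcf.sref<4x8xf32, #pcf.sequential>) -> (tensor<4x8xf32>) {
  pcf.return
}
```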
SyncOnReturnAttr
Synchronize when returning from the worker of the same scope
Syntax: #pcf.sync_on_return
Attribute indicating that the shaped ref this attribute is tied to is only fenced when the parent of the same scope returns. This is akin to memory order acquire on scope entry and __syncthreads followed by a memory order release fence on scope exit.
TestScopeAttr
Test scope attribute used for testing.
Syntax: #pcf.test_scope
Scope that fails on all interface uses. For use in testing where the scope is not relevant.
Types
ShapedRefType
A shaped reference to a buffer.
A reference to a buffer with unspecified layout and physical storage.
Carries the shape and element type of the referenced region. Elements can
be accessed by index, though no assumptions about the physical relation
between two coordinates can be made. Elements at different coordinates must
not internally alias. For example, if foo is a pcf.sref<2xi32>, foo[0]
and foo[1] must refer to distinct storage.
template <size_t rank, typename eltype, typename alloc_scope_ty,
          typename token_ty>
class ShapedRef {
  // Access is pointwise within the coordinate space implied by the shape.
  // Element type determines the minimum access bitwidth.
  eltype *getElementPtr(int a, ...); // |rank| operands.
  size_t shape[rank];
  // Scope this referenced memory was allocated at. Defines memory space.
  alloc_scope_ty alloc_scope;
  // Class defining synchronization for this reference.
  token_ty sync_scope;
};
When the sync_scope is of type #pcf.sync_on_return, then a special
printer kicks in, i.e. the following two types are equivalent:
!pcf.sref<?xi32, #pcf.test_scope, #pcf.sync_on_return>
!pcf.sref<?xi32, sync(#pcf.test_scope)>
Parameters:
| Parameter | C++ type | Description |
|---|---|---|
| shape | ::llvm::ArrayRef<int64_t> | |
| elementType | Type | |
| scope | PCF::ScopeAttrInterface | |
| sync_scope | Attribute | |