'iree_vector_ext' Dialect
IREE Vector Extensions.
A dialect designed for experimenting with vector operations beyond what is currently available in the Vector Dialect.
Operations
iree_vector_ext.to_layout (VectorExt::ToLayoutOp)
Layout conversion operator
Syntax:
```
operation ::= `iree_vector_ext.to_layout` $input `to` `layout` `(` $layout `)` attr-dict `:` type($input)
```
The layout conversion operator takes a shaped value and a layout and transforms the value to have that layout.
If the "shared_memory_conversion" attribute is set, then this layout change has to be materialized through shared memory.
Traits: AlwaysSpeculatableImplTrait
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes:

Attribute | MLIR Type | Description |
---|---|---|
layout | ::mlir::iree_compiler::IREE::VectorExt::VectorLayoutInterface | VectorLayoutInterface instance |
shared_memory_conversion | ::mlir::UnitAttr | unit attribute |
mma_kind | ::mlir::Attribute | any attribute |
Operands:

Operand | Description |
---|---|
input | shaped of any type values |
Results:

Result | Description |
---|---|
output | shaped of any type values |
iree_vector_ext.to_simd (VectorExt::ToSIMDOp)
SIMT to SIMD conversion operation
Syntax:
```
operation ::= `iree_vector_ext.to_simd` $input attr-dict `:` type($input) `->` type($output)
```
This is a temporary operation useful for source/target materializations when doing type conversions between distributed (SIMT) and undistributed (SIMD) vectors.
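A hedged sketch: assuming a `vector<16x16xf16>` distributed over 64 threads so that each thread holds a `vector<1x4xf16>` fragment (shapes are illustrative, not mandated by the op):

```mlir
func.func @to_simd_example(%frag: vector<1x4xf16>) -> vector<16x16xf16> {
  // Reassemble the undistributed (SIMD) view from each thread's fragment.
  %simd = iree_vector_ext.to_simd %frag : vector<1x4xf16> -> vector<16x16xf16>
  return %simd : vector<16x16xf16>
}
```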
Traits: AlwaysSpeculatableImplTrait, SameOperandsAndResultElementType
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Operands:

Operand | Description |
---|---|
input | vector of any type values |
Results:

Result | Description |
---|---|
output | vector of any type values |
iree_vector_ext.to_simt (VectorExt::ToSIMTOp)
SIMD to SIMT conversion operation
Syntax:
```
operation ::= `iree_vector_ext.to_simt` $input attr-dict `:` type($input) `->` type($output)
```
This is a temporary operation useful for source/target materializations when doing type conversions between distributed (SIMT) and undistributed (SIMD) vectors.
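The inverse direction, under the same illustrative shapes as the to_simd sketch above:

```mlir
func.func @to_simt_example(%full: vector<16x16xf16>) -> vector<1x4xf16> {
  // Extract the per-thread (SIMT) fragment of the undistributed vector.
  %simt = iree_vector_ext.to_simt %full : vector<16x16xf16> -> vector<1x4xf16>
  return %simt : vector<1x4xf16>
}
```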
Traits: AlwaysSpeculatableImplTrait, SameOperandsAndResultElementType
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Operands:

Operand | Description |
---|---|
input | vector of any type values |
Results:

Result | Description |
---|---|
output | vector of any type values |
Attributes
NestedLayoutAttr
A layout representing a mapping from GPU thread hierarchy to a shape
Syntax:
```
#iree_vector_ext.nested_layout<
  ::llvm::ArrayRef<int64_t>,   # subgroupTile
  ::llvm::ArrayRef<int64_t>,   # batchTile
  ::llvm::ArrayRef<int64_t>,   # outerTile
  ::llvm::ArrayRef<int64_t>,   # threadTile
  ::llvm::ArrayRef<int64_t>,   # elementTile
  ::llvm::ArrayRef<int64_t>,   # subgroupStrides
  ::llvm::ArrayRef<int64_t>    # threadStrides
>
```
This layout explicitly defines how a shape is mapped to a compute hierarchy. We consider the following levels of hierarchy, inspired by GPUs:
- Subgroups per Workgroup
- Threads per Subgroup
- Elements per Thread
Conceptually, each higher level of hierarchy can be viewed as multiple tiles of the lower level of hierarchy; each lower level of hierarchy is nested in the higher level of hierarchy. The last level represents the final elements in memory.
The conceptual mapping is leveraged during compilation for tiling and distributing to hardware for parallel computation. Concretely, the mapping is done on each dimension of the original vector shape. For example, for vector shape 16x16x16, we have 3 dimensions, so at each level of the hierarchy we would have 3 tile sizes. Similarly for vector shape 32x32, we would have 2-D tile sizes per compute hierarchy level.
We now describe each level of tiling. Each level of tiling represents a count of tiles over the next level (rather than a list of tile sizes).
- Subgroups per Workgroup
This level of tiling is also known as "subgroup/warp distribution". It represents how subgroups are distributed in a workgroup.
The subgroups are placed contiguously, with their shape and ordering determined by:

- subgroup_tile: Sizes of this level of tiling
- subgroup_order: Ordering of dimensions, from outermost to innermost

For example, subgroup_tile=[4, 2], subgroup_order=[1, 0] will arrange the subgroups in the order:

```
0 4
1 5
2 6
3 7
```
The total number of subgroups used (computed by multiplying each dim in subgroup_tile) should be a multiple of the number of subgroups on the hardware. If the total number of subgroups used exceeds the number of subgroups of the hardware, then the subgroup used (say x) is x mod num_subgroups:

```
num_subgroups = 4

0 4               0 0
1 5    x mod 4    1 1
2 6  --------->   2 2
3 7               3 3
```
- Threads per Subgroup:

Threads in a subgroup are distributed in three levels.

The first level, batches, is a way to represent instruction unrolling. For example, an intrinsic which can only take a 4x4 shape at a time uses batches to unroll a 16x16 shape to the native intrinsic shape.
Batches can be thought of as loops around the original layout:

```
for b_0 in range(batch_0):
  for b_1 in range(batch_1):
    ...
```

batch_tile represents the range of each loop.
The second level, outers, is a way to represent thread layout duplication required by a particular intrinsic. For example, some AMDGPU matrix multiplication variants require threads to be distributed like:

```
0 1 2 3 4
5 6 7 8 9
---------    --> Thread layout of shape 2x5, duplicated 2 times
0 1 2 3 4        to get a layout of shape 4x5:
5 6 7 8 9        outer_tile = [2, 1]
                 thread_tile = [2, 5]
```

outer_tile represents the number of outers in a batch.
Finally, threads are distributed within a single outer. The thread distribution is represented by:

- thread_tile: Sizes of this level of tiling
- thread_order: Ordering of dimensions, from outermost to innermost

Example of thread distribution over an 8x4 shape:
```
{
  batch_tile  = [2, 1]
  outer_tile  = [2, 2]
  thread_tile = [2, 2]

  thread_order = [1, 0]
}
```

Distributed tile:

```
{
  [0 2]|[0 2]    0,1,2,3 --> thread ids
  [1 3]|[1 3]
  -----------    [x z] --> a single outer tile
  [0 2]|[0 2]    [y w]
  [1 3]|[1 3]
}{
  [0 2]|[0 2]
  [1 3]|[1 3]    { ... } --> a single batch tile
  -----------
  [0 2]|[0 2]
  [1 3]|[1 3]
}
```

So, the thread distribution looks like:

```
[0 2 0 2]
[1 3 1 3]
[0 2 0 2]
[1 3 1 3]
[0 2 0 2]
[1 3 1 3]
[0 2 0 2]
[1 3 1 3]
```
The total number of threads used (computed by multiplying each dim in thread_tile) should be a multiple of the subgroup size of the hardware. If the total number of threads used exceeds the subgroup size of the hardware, then the thread used (say tid) is tid mod subgroup_size:

```
subgroup_size = 4

0 4                0 0
1 5   tid mod 4    1 1
2 6  --------->    2 2
3 7                3 3
```
- Elements per Thread

The final level of tiling, representing the minimum shape of vector that is treated as an atom. element_tile represents the native size of the vector.
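Putting the levels together, a hedged sketch of one plausible layout for distributing a `vector<32x16xf16>` across 2 subgroups of 64 threads each; all numbers are illustrative. Per dimension, the product subgroup_tile * batch_tile * outer_tile * thread_tile * element_tile must equal the vector shape:

```mlir
#dist = #iree_vector_ext.nested_layout<
  subgroup_tile = [2, 1],    // 2x1 grid of subgroups
  batch_tile = [1, 1],       // no instruction unrolling
  outer_tile = [1, 1],       // no thread-layout duplication
  thread_tile = [16, 4],     // 16x4 = 64 threads per subgroup
  element_tile = [1, 4],     // each thread owns a 1x4 atom
  subgroup_strides = [1, 0], // subgroup id advances along dim 0
  thread_strides = [4, 1]    // tid = 4*i + j within the subgroup
>
// Per-dimension check: 2*1*1*16*1 = 32 and 1*1*1*4*4 = 16.
```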
Parameters:

Parameter | C++ type | Description |
---|---|---|
subgroupTile | ::llvm::ArrayRef<int64_t> | subgroup_tile |
batchTile | ::llvm::ArrayRef<int64_t> | batch_tile |
outerTile | ::llvm::ArrayRef<int64_t> | outer_tile |
threadTile | ::llvm::ArrayRef<int64_t> | thread_tile |
elementTile | ::llvm::ArrayRef<int64_t> | element_tile |
subgroupStrides | ::llvm::ArrayRef<int64_t> | subgroup_strides |
threadStrides | ::llvm::ArrayRef<int64_t> | thread_strides |