'iree_vector_ext' Dialect

IREE Vector Extensions.

A dialect designed for experimenting with vector operations beyond what is currently available in the Vector Dialect.

Operations

iree_vector_ext.to_layout (VectorExt::ToLayoutOp)

Layout conversion operator

Syntax:

operation ::= `iree_vector_ext.to_layout` $input `to` `layout` `(` $layout `)` attr-dict `:` type($input)

The layout conversion operator takes a shaped value and a layout and transforms the value to have that layout.

If the "shared_memory_conversion" attribute is set, then this layout change has to be materialized through shared memory.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| layout | ::mlir::iree_compiler::IREE::VectorExt::VectorLayoutInterface | VectorLayoutInterface instance |
| shared_memory_conversion | ::mlir::UnitAttr | unit attribute |
| mma_kind | ::mlir::Attribute | any attribute |

Operands:

| Operand | Description |
| --- | --- |
| input | shaped of any type values |

Results:

| Result | Description |
| --- | --- |
| output | shaped of any type values |

iree_vector_ext.to_simd (VectorExt::ToSIMDOp)

SIMT to SIMD conversion operation

Syntax:

operation ::= `iree_vector_ext.to_simd` $input attr-dict `:` type($input) `->` type($output)

This is a temporary operation, useful for source/target materializations when performing type conversions between distributed (SIMT) and undistributed (SIMD) vectors.
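
For example (an illustrative sketch; the shapes are hypothetical and assume a vector<64xf16> distributed as vector<4xf16> per thread):

%simd = iree_vector_ext.to_simd %simt : vector<4xf16> -> vector<64xf16>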

Traits: AlwaysSpeculatableImplTrait, SameOperandsAndResultElementType

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Operands:

| Operand | Description |
| --- | --- |
| input | vector of any type values |

Results:

| Result | Description |
| --- | --- |
| output | vector of any type values |

iree_vector_ext.to_simt (VectorExt::ToSIMTOp)

SIMD to SIMT conversion operation

Syntax:

operation ::= `iree_vector_ext.to_simt` $input attr-dict `:` type($input) `->` type($output)

This is a temporary operation, useful for source/target materializations when performing type conversions between distributed (SIMT) and undistributed (SIMD) vectors.
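
For example (an illustrative sketch, mirroring the to_simd example above with the same hypothetical shapes):

%simt = iree_vector_ext.to_simt %simd : vector<64xf16> -> vector<4xf16>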

Traits: AlwaysSpeculatableImplTrait, SameOperandsAndResultElementType

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Operands:

| Operand | Description |
| --- | --- |
| input | vector of any type values |

Results:

| Result | Description |
| --- | --- |
| output | vector of any type values |

Attributes

NestedLayoutAttr

A layout representing a mapping from GPU thread hierarchy to a shape

Syntax:

#iree_vector_ext.nested_layout<
  ::llvm::ArrayRef<int64_t>,   # subgroupTile
  ::llvm::ArrayRef<int64_t>,   # batchTile
  ::llvm::ArrayRef<int64_t>,   # outerTile
  ::llvm::ArrayRef<int64_t>,   # threadTile
  ::llvm::ArrayRef<int64_t>,   # elementTile
  ::llvm::ArrayRef<int64_t>,   # subgroupStrides
  ::llvm::ArrayRef<int64_t>   # threadStrides
>

This layout explicitly defines how a shape is mapped to a compute hierarchy. We consider the following levels of hierarchy, inspired by GPUs:

  1. Subgroups per Workgroup
  2. Threads per Subgroup
  3. Elements per Thread

Conceptually, each higher level of the hierarchy can be viewed as multiple tiles of the next lower level; equivalently, each lower level is nested inside the level above it. The last level represents the final elements in memory.

The conceptual mapping is leveraged during compilation for tiling and distributing to hardware for parallel computation. Concretely, the mapping is done on each dimension of the original vector shape. For example, for vector shape 16x16x16, we have 3 dimensions, so at each level of the hierarchy we would have 3 tile sizes. Similarly for vector shape 32x32, we would have 2-D tile sizes per compute hierarchy level.

We now describe each level of tiling. Each level of tiling represents a count of tiles over the next level (rather than a list of tile sizes).
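
Since each level counts tiles of the next level, the tile sizes compose multiplicatively along every dimension: for each dimension d of the undistributed vector shape,

shape[d] = subgroup_tile[d] * batch_tile[d] * outer_tile[d] * thread_tile[d] * element_tile[d]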

  1. Subgroups per Workgroup

This level of tiling is also known as "subgroup/warp distribution". It represents how subgroups are distributed in a workgroup.

The subgroups are placed contiguously, with their shape and ordering determined by:

  • subgroup_tile: Sizes of this level of tiling
  • subgroup_order: Ordering of dimensions, from outermost to innermost

For example, subgroup_tile=[4, 2], subgroup_order=[1, 0] will arrange the subgroups in the order:

0 4
1 5
2 6
3 7

The total number of subgroups used (computed by multiplying each dim in subgroup_tile) should be a multiple of the number of subgroups on the hardware. If the total number of subgroups used exceeds the number of subgroups on the hardware, then the subgroup actually used (say x) is x mod num_subgroups:

num_subgroups = 4

0 4            0 0
1 5  x mod 4   1 1
2 6  ------->  2 2
3 7            3 3

  2. Threads per Subgroup

Threads in a subgroup are distributed in three levels.

The first level, batches, is a way to represent instruction unrolling. For example, an intrinsic which can only take a 4x4 shape at a time uses batches to unroll a 16x16 shape to the native intrinsic shape.

Batches can be thought of as loops around the original layout:

for b_0 in range(batch_0):
  for b_1 in range(batch_1):
    ...

batch_tile represents the range of each loop.

The second level, outers, is a way to represent thread layout duplication required by a particular intrinsic. For example, some AMDGPU matrix multiplication variants require threads to be distributed like:

0 1 2 3 4
5 6 7 8 9
---------    --> Thread Layout of shape 2x5, duplicated 2 times
0 1 2 3 4        to get a layout of shape 4x5:
5 6 7 8 9        outer_tile=[2, 1], thread_tile=[2, 5]

outer_tile represents the number of outers in a batch.

Finally, threads are distributed in a single outer. The thread distribution is represented by:

  • thread_tile: Sizes of this level of tiling
  • thread_order: Ordering of dimensions, from outermost to innermost

Example of thread distribution over an 8x4 shape:

{
  batch_tile = [2, 1]
  outer_tile = [2, 2]
  thread_tile = [2, 2]
  thread_order = [1, 0]
}

Distributed tile:

{
  [0 2]|[0 2]      0,1,2,3 --> thread ids
  [1 3]|[1 3]
  -----------      [x z]
  [0 2]|[0 2]      [y w]   --> a single outer tile
  [1 3]|[1 3]
}{
  [0 2]|[0 2]      { ... } --> a single batch tile
  [1 3]|[1 3]
  -----------
  [0 2]|[0 2]
  [1 3]|[1 3]
}

So, the thread distribution looks like:

[0 2 0 2]
[1 3 1 3]
[0 2 0 2]
[1 3 1 3]
[0 2 0 2]
[1 3 1 3]
[0 2 0 2]
[1 3 1 3]

The total number of threads used (computed by multiplying each dim in thread_tile) should be a multiple of the subgroup size of the hardware. If the total number of threads used exceeds the subgroup size of the hardware, then the thread actually used (say tid) is tid mod subgroup_size:

subgroup_size = 4

0 1                 0 1
2 3  tid mod 4      2 3
4 5  -------->      0 1
6 7                 2 3

  3. Elements per Thread

The final level of tiling represents the minimum shape of vector that is treated as an atom.

element_tile represents the native size of the vector.
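
As an illustrative sketch (the snake_case parameter names follow the Parameters table below; the concrete sizes are only an example, distributing a 64x64 vector over 2x2 subgroups of 64 threads each):

#iree_vector_ext.nested_layout<
  subgroup_tile = [2, 2],
  batch_tile = [2, 2],
  outer_tile = [1, 1],
  thread_tile = [16, 4],
  element_tile = [1, 4],
  subgroup_strides = [2, 1],
  thread_strides = [4, 1]
>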

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| subgroupTile | ::llvm::ArrayRef<int64_t> | subgroup_tile |
| batchTile | ::llvm::ArrayRef<int64_t> | batch_tile |
| outerTile | ::llvm::ArrayRef<int64_t> | outer_tile |
| threadTile | ::llvm::ArrayRef<int64_t> | thread_tile |
| elementTile | ::llvm::ArrayRef<int64_t> | element_tile |
| subgroupStrides | ::llvm::ArrayRef<int64_t> | subgroup_strides |
| threadStrides | ::llvm::ArrayRef<int64_t> | thread_strides |