'iree_vector_ext' Dialect
IREE Vector Extensions.
A dialect designed for experimenting with vector operations beyond what is currently available in the Vector Dialect.
Operations
iree_vector_ext.to_layout (VectorExt::ToLayoutOp)
Layout conversion operator
Syntax:
operation ::= `iree_vector_ext.to_layout` $input `to` `layout` `(` $layout `)` attr-dict `:` type($input)
The layout conversion operator takes a shaped value and a layout and transforms the value to have that layout.
If the "shared_memory_conversion" attribute is set, then this layout change has to be materialized through shared memory.
Traits: AlwaysSpeculatableImplTrait
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
layout | ::mlir::iree_compiler::IREE::VectorExt::VectorLayoutInterface | VectorLayoutInterface instance |
shared_memory_conversion | ::mlir::UnitAttr | unit attribute |
mma_kind | ::mlir::Attribute | any attribute |
Operands:
Operand | Description |
---|---|
input | shaped of any type values |
Results:
Result | Description |
---|---|
output | shaped of any type values |
iree_vector_ext.to_simd (VectorExt::ToSIMDOp)
SIMT to SIMD conversion operation
Syntax:
operation ::= `iree_vector_ext.to_simd` $input attr-dict `:` type($input) `->` type($output)
This is a temporary operation, useful for source/target materializations when doing type conversions between distributed and undistributed vectors.
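For instance (the shapes here are illustrative), a 16x16 vector distributed as a 1x4 fragment per thread could be converted back to its SIMD form with:

%simd = iree_vector_ext.to_simd %simt : vector<1x4xf16> -> vector<16x16xf16>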
Traits: AlwaysSpeculatableImplTrait, SameOperandsAndResultElementType
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Operands:
Operand | Description |
---|---|
input | vector of any type values |
Results:
Result | Description |
---|---|
output | vector of any type values |
iree_vector_ext.to_simt (VectorExt::ToSIMTOp)
SIMD to SIMT conversion operation
Syntax:
operation ::= `iree_vector_ext.to_simt` $input attr-dict `:` type($input) `->` type($output)
This is a temporary operation, useful for source/target materializations when doing type conversions between distributed and undistributed vectors.
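For instance (shapes again illustrative), the inverse of the to_simd example above:

%simt = iree_vector_ext.to_simt %simd : vector<16x16xf16> -> vector<1x4xf16>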
Traits: AlwaysSpeculatableImplTrait, SameOperandsAndResultElementType
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Operands:
Operand | Description |
---|---|
input | vector of any type values |
Results:
Result | Description |
---|---|
output | vector of any type values |
iree_vector_ext.transfer_gather (VectorExt::TransferGatherOp)
Gathers a supervector from memory into an SSA vector value.
The iree_vector_ext.transfer_gather operation provides a structured abstraction for gathers, by preserving the iteration space mapping between the result vector and the memory dimensions being indexed.
The operation is a generalization of the vector.transfer_read op, where the slice from which the read is performed is not guaranteed to be contiguous; instead, how the slice is gathered is defined explicitly in the operation.
The operation can be thought of as:
1. A contiguous slice gathered from the source as described by the operation
2. A vector.transfer_read on the contiguous slice
The operation defines permutation_map, padding, mask, and in_bounds in the same way as vector.transfer_read does, but on the inferred contiguous slice.
The other parameters of the operation define how the contiguous slice is gathered from the source. indices defines a base offset into the source, and the indexed array defines, for each dimension, whether that dimension is gathered or contiguous. For example, for the following gather:
slice[i, j, k] = source[i + i_offset][j][indices[i][j][k]]
The operation would represent this as:
indices = %i_offset, 0, 0
indexed = [False, False, True]
For every gathered dimension, the operation defines how it is gathered: it expects a vector of indices in index_vecs to act as the source of indices for that dimension, and an AffineMap in index_maps describing how this source of indices is itself indexed. For example, for the following gather:
slice[i, j, k] = source[i][indices0[i] + offset][indices1[j, k]]
The indexing would be described by:
indices = 0, %offset, 0
indexed = [False, True, True]
index_vecs = %index_vec1, %index_vec2
index_maps = [
affine_map<(i, j, k) -> (i)>,
affine_map<(i, j, k) -> (j, k)>
]
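Written in the assembly style of the full example below (a sketch: the source and result shapes, the %c0 zero-index constant, and the value names are all illustrative), this gather might look like:

%v = iree_vector_ext.transfer_gather %source[%c0, %offset, %c0](%index_vec1, %index_vec2) {
  gather_dims = [1, 2],
  index_maps = [
    affine_map<(i, j, k) -> (i)>,
    affine_map<(i, j, k) -> (j, k)>
  ],
  in_bounds = [true, true, true],
  permutation_map = affine_map<(i, j, k) -> (i, j, k)>
} : memref<?x?x?xf32> -> vector<4x8x16xf32>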
With these additional parameters, the operation can define a supervector read from a non-contiguous slice. For example:
source: memref<8192x8x16xf32>
indices0 : vector<2xindex>
indices1 : vector<4x8xindex>
slice[i, j, k] = source[indices0[k]][j][indices1[i, j]]
vector = read(slice) : memref<8192x8x16xf32> -> vector<16x8x2xf32>
Can be represented by:
%vector = iree_vector_ext.transfer_gather %source[0, 0, 0](%indices0, %indices1) {
gather_dims = [0, 2],
index_maps = [
affine_map<(i, j, k) -> (k)>,
affine_map<(i, j, k) -> (i, j)>
],
in_bounds = [true, true, true],
permutation_map = affine_map<(i, j, k) -> (k, j, i)>
} : memref<8192x8x16xf32> -> vector<16x8x2xf32>
The crucial structure of the operation relies on the index vectors and the result vector both being indexed over the dimensions of the memory. This mapping can be exploited to simplify gathered dimensions into contiguous dimensions.
Traits: AttrSizedOperandSegments
Interfaces: ConditionallySpeculatable, MemoryEffectOpInterface, VectorTransferOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
indexed | ::mlir::ArrayAttr | 1-bit boolean array attribute |
indexed_maps | ::mlir::ArrayAttr | AffineMap array attribute |
permutation_map | ::mlir::AffineMapAttr | AffineMap attribute |
in_bounds | ::mlir::ArrayAttr | 1-bit boolean array attribute |
Operands:
Operand | Description |
---|---|
source | shaped of any type values |
indices | variadic of index |
index_vecs | variadic of vector of index values |
padding | any type |
mask | vector of 1-bit signless integer values |
Results:
Result | Description |
---|---|
vector | vector of any type values |
Attributes
NestedLayoutAttr
A layout representing a mapping from GPU thread hierarchy to a shape
Syntax:
#iree_vector_ext.nested_layout<
::llvm::ArrayRef<int64_t>, # subgroupTile
::llvm::ArrayRef<int64_t>, # batchTile
::llvm::ArrayRef<int64_t>, # outerTile
::llvm::ArrayRef<int64_t>, # threadTile
::llvm::ArrayRef<int64_t>, # elementTile
::llvm::ArrayRef<int64_t>, # subgroupStrides
::llvm::ArrayRef<int64_t> # threadStrides
>
This layout explicitly defines how the shape of the associated vector is mapped to a compute hierarchy. We consider the following levels of hierarchy, inspired by GPUs:
- Subgroups per workgroup
- Threads per subgroup
- Elements per thread
Note that the elements per thread are themselves conceptually viewed as three dimensions, i.e. elements per thread = batch x outer x element. However, the final order of sub-dimensions does not exactly follow that hierarchy. For example, a one-dimensional vector vector<n x f16> is viewed as a five-dimensional vector<subgroup x batch x outer x thread x element>. For a two-dimensional vector, each of the above sub-dimensions is doubled, i.e. vector<n1 x n2 x f16> is viewed as a vector<subgroup1 x subgroup2 x batch1 x batch2 x ... x element1 x element2>.
Note that the subgroup and thread sub-dimensions of the vector do not refer directly to the subgroup_id and thread_id of the GPU context. Let us define them as virtual_subgroup_id and virtual_thread_id; they hold the following definition:
virtual_subgroup_id[i] = (subgroup_id / subgroup_stride[i]) % subgroup_tile_size[i]
virtual_thread_id[i] = (thread_id / thread_stride[i]) % thread_tile_size[i]
The inverse mapping would be:
subgroup_id = sum_i(subgroup_stride[i] * virtual_subgroup_id[i]) % mul_i(subgroup_tile_size[i])
thread_id = sum_i(thread_stride[i] * virtual_thread_id[i]) % mul_i(thread_tile_size[i])
for i = [0 : rank(undistributed_vector)]
NOTE: a stride of zero means that dimension is not distributed at that level of the hierarchy.
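As a concrete check of these formulas, take subgroup_tile = [4, 2] and subgroup_strides = [1, 4] (the same configuration used in the subgroup distribution example below). Then subgroup_id 5 maps to:

virtual_subgroup_id[0] = (5 / 1) % 4 = 1
virtual_subgroup_id[1] = (5 / 4) % 2 = 1

and the inverse mapping recovers subgroup_id = 1*1 + 4*1 = 5.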
We now describe each level of tiling. Each level of tiling represents a count of tiles over the next level (rather than a list of tile sizes).
Subgroups per Workgroup
This level of tiling is also known as "subgroup/warp distribution". It represents how the vector is distributed into subgroups.
For example, distributing vector<4x2xf16> with subgroup_tile = [4, 2], subgroup_strides = [1, 4] will arrange the subgroups in the order:
virtual_subgroup_ids:
[0][0], [0][1], [1][0], [1][1], [2][0], [2][1], [3][0], [3][1]
subgroup_ids:
0, 4, 1, 5, 2, 6, 3, 7
The subgroups are placed contiguously, with their shape and ordering determined by:
- subgroup_tile: sizes of this level of tiling
- subgroup_strides: strides of this level of tiling (0 if not distributed)
Tiling levels must not overlap.
The total number of subgroups used (computed by multiplying each dim in subgroup_tile) should be a multiple of the number of subgroups on the hardware. If the total number of subgroups used exceeds the number available on the hardware, then the subgroup used (say x) is x mod num_subgroups:
num_subgroups = 4
0, 4, 1, 5, 2, 6, 3, 7
| mod 4
V
0, 0, 1, 1, 2, 2, 3, 3
Threads per Subgroup
This level of tiling is also known as "thread distribution" within a subgroup. The logic is quite similar to subgroup distribution, using thread_tile and thread_strides.
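For example, with thread_tile = [16, 4] and thread_strides = [1, 16] (the configuration used in the full example below), thread_id 35 maps to:

virtual_thread_id[0] = (35 / 1) % 16 = 3
virtual_thread_id[1] = (35 / 16) % 4 = 2

so thread 35 owns the slice at virtual thread coordinates [3, 2].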
Element distribution on a thread
After the vector is distributed to each thread of a subgroup, the per-thread fragment is viewed as [batch] x [outer] x [element], where each sub-dimension group has rank equal to the rank of the undistributed vector.
The first level, batches, is a way to represent instruction unrolling. For example, an intrinsic that can only take a 4x4 shape at a time uses batches to unroll a 16x16 shape to the native intrinsic shape.
The second level, outers, is a way to represent thread layout duplication required by a particular intrinsic. For example, some AMDGPU matrix multiplication variants require the thread layout to be duplicated: with outer_tile = [2, 1] and thread_tile = [2, 5], the thread layout of shape 2x5 is duplicated 2 times to get a layout of shape 4x5:
outer = 0,0 :
[0 1 2 3 4]
[5 6 7 8 9]
outer = 1,0 :
[0 1 2 3 4]
[5 6 7 8 9]
outer_tile represents the number of outers in a batch.
The final level of tiling represents the minimum shape of vector that is treated as an atom. element_tile represents the native size of this vector.
A full example
Vector to be distributed: vector<64x64xf16>
NestedLayout : <
subgroup_tile = [2, 1],
batch_tile = [2, 4],
outer_tile = [1, 1],
thread_tile = [16, 4],
element_tile = [1, 4],
subgroup_strides = [1, 0],
thread_strides = [1, 16]
>
This is conceptually viewed as a vector<[2x1]x[2x4]x[1x1]x[16x4]x[1x4]>, where the first group of sub-dimensions represents the distribution into subgroups. With subgroup_strides = [1, 0], each subgroup gets a vector as follows:
subgroup0 : vector<[2x4]x[1x1]x[16x4]x[1x4]>
from vector<[2x1]x[2x4]x[1x1]x[16x4]x[1x4]>[0,:,:,:,:,:,:,:,:,:]
subgroup1 : vector<[2x4]x[1x1]x[16x4]x[1x4]>
from vector<[2x1]x[2x4]x[1x1]x[16x4]x[1x4]>[1,:,:,:,:,:,:,:,:,:]
subgroup2 : vector<[2x4]x[1x1]x[16x4]x[1x4]>
from vector<[2x1]x[2x4]x[1x1]x[16x4]x[1x4]>[0,:,:,:,:,:,:,:,:,:]
subgroup3 : vector<[2x4]x[1x1]x[16x4]x[1x4]>
from vector<[2x1]x[2x4]x[1x1]x[16x4]x[1x4]>[1,:,:,:,:,:,:,:,:,:]
Then each vector<[2x4]x[1x1]x[16x4]x[1x4]> is distributed to the threads in a subgroup using thread_strides = [1, 16].
recall: thread_id = sum_i(thread_stride[i] * virtual_thread_id[i]) % mul_i(thread_tile_size[i])
thread0 : vector<[2x4]x[1x1]x[1x4]>
from vector<[2x4]x[1x1]x[16x4]x[1x4]>[:,:,:,:,0,0,:,:]
thread1 : vector<[2x4]x[1x1]x[1x4]>
from vector<[2x4]x[1x1]x[16x4]x[1x4]>[:,:,:,:,1,0,:,:]
...
...
thread16 : vector<[2x4]x[1x1]x[1x4]>
from vector<[2x4]x[1x1]x[16x4]x[1x4]>[:,:,:,:,0,1,:,:]
Finally, each thread is left with a distributed vector of conceptual view vector<[2x4]x[1x1]x[1x4]>, whose actual shape is vector<2x16xf16>.
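Attaching this layout to a value is done with iree_vector_ext.to_layout. A sketch, assuming a value %v of the undistributed type:

#nested = #iree_vector_ext.nested_layout<
  subgroup_tile = [2, 1],
  batch_tile = [2, 4],
  outer_tile = [1, 1],
  thread_tile = [16, 4],
  element_tile = [1, 4],
  subgroup_strides = [1, 0],
  thread_strides = [1, 16]
>
%0 = iree_vector_ext.to_layout %v to layout(#nested) : vector<64x64xf16>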
Parameters:
Parameter | C++ type | Description |
---|---|---|
subgroupTile | ::llvm::ArrayRef<int64_t> | subgroup_tile |
batchTile | ::llvm::ArrayRef<int64_t> | batch_tile |
outerTile | ::llvm::ArrayRef<int64_t> | outer_tile |
threadTile | ::llvm::ArrayRef<int64_t> | thread_tile |
elementTile | ::llvm::ArrayRef<int64_t> | element_tile |
subgroupStrides | ::llvm::ArrayRef<int64_t> | subgroup_strides |
threadStrides | ::llvm::ArrayRef<int64_t> | thread_strides |