'iree_vector_ext' Dialect

IREE Vector Extensions.

A dialect designed for experimenting with vector operations beyond what is currently available in the Vector Dialect.

Operations

`iree_vector_ext.to_layout` (VectorExt::ToLayoutOp)

Layout conversion operator.

Syntax:

operation ::= `iree_vector_ext.to_layout` $input `to` `layout` `(` $layout `)` attr-dict `:` type($input)

The layout conversion operator takes a shaped value and a layout and transforms the value to have that layout.

If the "shared_memory_conversion" attribute is set, then this layout change has to be materialized through shared memory.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes:

Attribute	MLIR Type	Description
`layout`	::mlir::iree_compiler::IREE::VectorExt::VectorLayoutInterface	VectorLayoutInterface instance
`shared_memory_conversion`	::mlir::UnitAttr	unit attribute
`mma_kind`	::mlir::Attribute	any attribute

Operands:

Operand	Description
`input`	shaped of any type values

Results:

Result	Description
`output`	shaped of any type values

`iree_vector_ext.to_simd` (VectorExt::ToSIMDOp)

SIMT to SIMD conversion operation.

Syntax:

operation ::= `iree_vector_ext.to_simd` $input attr-dict `:` type($input) `->` type($output)

This is a temporary operation used for source/target materializations during type conversions between distributed and non-distributed vectors.

Traits: AlwaysSpeculatableImplTrait, SameOperandsAndResultElementType

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Operands:

Operand	Description
`input`	vector of any type values

Results:

Result	Description
`output`	vector of any type values

`iree_vector_ext.to_simt` (VectorExt::ToSIMTOp)

SIMD to SIMT conversion operation.

Syntax:

operation ::= `iree_vector_ext.to_simt` $input attr-dict `:` type($input) `->` type($output)

This is a temporary operation used for source/target materializations during type conversions between distributed and non-distributed vectors.

Traits: AlwaysSpeculatableImplTrait, SameOperandsAndResultElementType

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Operands:

Operand	Description
`input`	vector of any type values

Results:

Result	Description
`output`	vector of any type values

`iree_vector_ext.transfer_gather` (VectorExt::TransferGatherOp)

Gathers a supervector from memory into an SSA vector value.

The iree_vector_ext.transfer_gather operation provides a structured abstraction for gathers, by preserving the iteration space mapping between the result vector and the memory dimensions being indexed.

The operation is a generalization of the vector.transfer_read op in which the slice being read is not guaranteed to be contiguous. Instead, the operation defines explicitly how the slice is gathered.

The operation can be thought of as:

1. A contiguous slice gathered from the base as described by the operation
2. A vector.transfer_read on the contiguous slice

The operation defines permutation_map, padding, mask, and in_bounds in the same way as vector.transfer_read, but on the inferred contiguous slice.

The other parameters of the operation define how the contiguous slice is gathered from the source. indices defines a base to offset the source by, and indexed defines, for each dimension, whether that dimension is gathered or contiguous. For example, for the following gather:

slice[i, j, k] = base[i + i_offset][j][indices[i][j][k]]

The operation would represent this as:

indices = %i_offset, 0, 0
indexed = [False, False, True]
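To make the semantics of this first gather concrete, here is a minimal Python sketch that models the slice with nested lists. It is only an illustration of the formula above: the function name `gather_slice`, the toy `base` contents, and the sample `index_vec` values are all assumptions made for this example and are not part of the dialect.

```python
def gather_slice(base, index_vec, i_offset, shape):
    """Model of slice[i][j][k] = base[i + i_offset][j][index_vec[i][j][k]].

    Dimensions 0 and 1 are contiguous (indexed = False); only dimension 2
    is gathered through index_vec (indexed = True).
    """
    ni, nj, nk = shape
    return [
        [
            [base[i + i_offset][j][index_vec[i][j][k]] for k in range(nk)]
            for j in range(nj)
        ]
        for i in range(ni)
    ]


# Toy 3x2x4 "memory" where each value encodes its own coordinates.
base = [[[100 * a + 10 * b + c for c in range(4)] for b in range(2)]
        for a in range(3)]
# One gathered index per element of the 2x2x2 slice (hypothetical values).
index_vec = [[[3, 0], [1, 2]], [[2, 2], [0, 3]]]

result = gather_slice(base, index_vec, i_offset=1, shape=(2, 2, 2))
```

Because each stored value encodes its coordinates, `result[0][0][0]` is `103`, which corresponds to `base[1][0][3]`: the first dimension is shifted by `i_offset` and the last dimension comes from `index_vec`.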

For every dimension that is gathered, the operation defines how it is gathered. For each gathered dimension, the operation expects a vector of indices in index_vecs to act as a source of indices for that dimension and an AffineMap in index_maps describing how this source of indices is indexed. For example, for the following gather:

slice[i, j, k] = base[i][indices0[i] + offset][indices1[j, k]]

The indexing would be described by:

indices      = 0, %offset, 0
indexed      = [False, True, True]
index_vecs   = %index_vec1, %index_vec2
index_maps = [
  affine_map<(i, j, k) -> (i)>,
  affine_map<(i, j, k) -> (j, k)>
]
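The way indices, indexed, index_vecs, and index_maps combine can be sketched in plain Python. This is a hedged model of the description above, not the op's implementation: the helper names `lookup` and `transfer_gather_model` are hypothetical, affine maps are modeled as Python lambdas, and the sketch assumes that a gathered dimension adds the looked-up index to its base offset while a contiguous dimension adds the iteration index.

```python
def lookup(nested, position):
    """Index a nested list with a sequence of indices."""
    for p in position:
        nested = nested[p]
    return nested


def transfer_gather_model(base, indices, indexed, index_vecs, index_maps, shape):
    """Model a 3-D gather described by indices/indexed/index_vecs/index_maps."""
    ni, nj, nk = shape
    out = [[[None] * nk for _ in range(nj)] for _ in range(ni)]
    for i in range(ni):
        for j in range(nj):
            for k in range(nk):
                point = (i, j, k)
                # Gathered dimensions consume index vectors in order.
                sources = iter(zip(index_vecs, index_maps))
                position = []
                for dim in range(3):
                    if indexed[dim]:
                        vec, index_map = next(sources)
                        position.append(indices[dim] + lookup(vec, index_map(*point)))
                    else:
                        position.append(indices[dim] + point[dim])
                out[i][j][k] = lookup(base, position)
    return out


# Toy 2x4x5 "memory" where each value encodes its own coordinates.
base = [[[100 * a + 10 * b + c for c in range(5)] for b in range(4)]
        for a in range(2)]
offset = 2
indices0 = [1, 0]            # indexed by (i)
indices1 = [[4, 0], [2, 3]]  # indexed by (j, k)

result = transfer_gather_model(
    base,
    indices=(0, offset, 0),
    indexed=[False, True, True],
    index_vecs=[indices0, indices1],
    index_maps=[lambda i, j, k: (i,), lambda i, j, k: (j, k)],
    shape=(2, 2, 2),
)
```

Here `result[i][j][k]` equals `base[i][indices0[i] + offset][indices1[j][k]]`, matching the gather formula in the example.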

With these additional parameters, the operation can define a supervector read from a non-contiguous slice. For example:

base: memref<8192x8x16xf32>
indices0 : vector<2xindex>
indices1 : vector<4x8xindex>

slice[i, j, k] = base[indices0[k]][j][indices1[i, j]]
vector = read(slice) : memref<8192x8x16xf32> -> vector<16x8x2xf32>

Can be represented by:

%vector = vector.transfer_gather %base[0, 0, 0](%indices0, %indices1) {
  gather_dims = [0, 2],
  index_maps = [
    affine_map<(i, j, k) -> (k)>,
    affine_map<(i, j, k) -> (i, j)>
  ],
  in_bounds = [true, true, true],
  permutation_map = affine_map<(i, j, k) -> (k, j, i)>
} : memref<8192x8x16xf32> -> vector<16x8x2xf32>

The crucial structure of the operation relies on the index_vec and the result vector's indexing being defined based on the dimensions of the memory. This mapping can be exploited to simplify gathered dimensions to contiguous dimensions.

Traits: AttrSizedOperandSegments

Interfaces: ConditionallySpeculatable, MaskableOpInterface, MemoryEffectOpInterface, VectorTransferOpInterface

Attributes:

Attribute	MLIR Type	Description
`indexed`	::mlir::ArrayAttr	1-bit boolean array attribute
`indexed_maps`	::mlir::ArrayAttr	AffineMap array attribute
`permutation_map`	::mlir::AffineMapAttr	AffineMap attribute
`in_bounds`	::mlir::ArrayAttr	1-bit boolean array attribute

Operands:

Operand	Description
`base`	shaped of any type values
`indices`	variadic of index
`index_vecs`	variadic of vector of index values
`padding`	any type
`mask`	vector of 1-bit signless integer values

Results:

Result	Description
`vector`	vector of any type values

Attributes

NestedLayoutAttr

A layout representing a mapping from GPU thread hierarchy to a shape.

Syntax:

#iree_vector_ext.nested_layout<
  ::llvm::ArrayRef<int64_t>,   # subgroupTile
  ::llvm::ArrayRef<int64_t>,   # batchTile
  ::llvm::ArrayRef<int64_t>,   # outerTile
  ::llvm::ArrayRef<int64_t>,   # threadTile
  ::llvm::ArrayRef<int64_t>,   # elementTile
  ::llvm::ArrayRef<int64_t>,   # subgroupStrides
  ::llvm::ArrayRef<int64_t>   # threadStrides
>

This layout explicitly defines how the shape of the associated vector is mapped to a compute hierarchy. We consider the following levels of hierarchy, inspired by GPUs:

- Subgroups per workgroup
- Threads per subgroup
- Elements per thread

Note that the elements in a thread are also conceptually viewed as three dimensions, i.e. elements per thread = batch x outer x element. However, the final order of sub-dimensions does not exactly follow that hierarchy. For example, a one-dimensional vector, say vector<n x f16>, is viewed as a five-dimensional vector<subgroup x batch x outer x thread x element>. For a two-dimensional vector, each of these sub-dimensions is doubled, i.e. vector<n1 x n2 x f16> is viewed as vector<subgroup1 x subgroup2 x batch1 x batch2 x ... x element1 x element2>.

Now, when the vector is indexed, the 'subgroup' and 'thread' indices do not refer directly to the subgroup_id and thread_id in the GPU context. Let us define them as virtual_subgroup_id and virtual_thread_id, which satisfy the following definitions:

virtual_subgroup_id[i] = (subgroup_id / subgroup_stride[i]) % subgroup_tile_size[i]
virtual_thread_id[i]   = (thread_id   / thread_stride[i]) % thread_tile_size[i]

The inverse mapping is:

subgroup_id = sum_i(subgroup_stride[i] * virtual_subgroup_id[i]) % mul_i(subgroup_tile_size[i])
thread_id = sum_i(thread_stride[i] * virtual_thread_id[i]) % mul_i(thread_tile_size[i])
    for i = [0 : rank(undistributed_vector)]

NOTE: A stride of zero means that the dimension is not distributed at that level of the hierarchy.
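The forward and inverse mappings can be checked with a short Python sketch. The function names `virtual_ids` and `hardware_id` are hypothetical, and the handling of a zero stride (returning virtual id 0 instead of dividing by zero) is an assumption of this sketch rather than something the formulas state.

```python
from math import prod


def virtual_ids(hw_id, strides, tile):
    """virtual_id[i] = (hw_id / stride[i]) % tile[i] for each dimension i.

    Assumption: a zero stride (dimension not distributed) maps to virtual id 0.
    """
    return [
        (hw_id // stride) % size if stride != 0 else 0
        for stride, size in zip(strides, tile)
    ]


def hardware_id(virtual, strides, tile):
    """hw_id = sum_i(stride[i] * virtual_id[i]) % mul_i(tile[i])."""
    return sum(s * v for s, v in zip(strides, virtual)) % prod(tile)


thread_tile = [16, 4]
thread_strides = [1, 16]

# Thread 16 lands on virtual thread id [0, 1] and maps back to 16.
v16 = virtual_ids(16, thread_strides, thread_tile)
back = hardware_id(v16, thread_strides, thread_tile)

# The two mappings are inverses of each other for all 64 threads.
round_trip_ok = all(
    hardware_id(virtual_ids(t, thread_strides, thread_tile),
                thread_strides, thread_tile) == t
    for t in range(prod(thread_tile))
)
```

The same two functions apply to subgroups by passing subgroup_tile and subgroup_strides instead of the thread parameters.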

We now describe each level of tiling. Each level of tiling represents a count of tiles over the next level (rather than a list of tile sizes).

Subgroups per Workgroup

This level of tiling is also known as "subgroup/warp distribution". It represents how the vector is distributed into subgroups.

For example, distributing vector<4x2xf16> with subgroup_tile=[4, 2] and subgroup_strides=[1, 4] arranges the subgroups in the following order:

virtual_subgroups_ids:
[0][0] , [0][1] , [1][0], [1][1], [2][0], [2][1], [3][0], [3][1]
subgroups_ids:
0, 4, 1, 5, 2, 6, 3, 7

The subgroups are placed contiguously with their shape and ordering determined by:

- subgroup_tile: Sizes of this level of tiling
- subgroup_strides: Stride of this level of tiling. 0 if not distributed.

Tiling levels must not overlap.

The total number of subgroups used (computed by multiplying each dim in subgroup_tile) should be a multiple of the number of subgroups in the hardware. If the total number of subgroups used exceeds the number of subgroups in the hardware, then the subgroup used for a subgroup id x is x mod num_subgroups:

num_subgroups = 4

0, 4, 1, 5, 2, 6, 3, 7
| mod 4
V
0, 0, 1, 1, 2, 2, 3, 3
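The ordering and the wrap-around shown above can be reproduced with a few lines of Python. The function name `subgroup_order` is hypothetical; the sketch simply enumerates virtual subgroup ids with the last dimension varying fastest and applies the stride formula, optionally followed by the modulo.

```python
from itertools import product


def subgroup_order(tile, strides, num_subgroups=None):
    """List subgroup ids for every virtual subgroup id, in row-major order."""
    ids = []
    for virtual in product(*(range(size) for size in tile)):
        subgroup_id = sum(s * v for s, v in zip(strides, virtual))
        if num_subgroups is not None:
            # Wrap onto the subgroups that actually exist in hardware.
            subgroup_id %= num_subgroups
        ids.append(subgroup_id)
    return ids


order = subgroup_order([4, 2], [1, 4])
wrapped = subgroup_order([4, 2], [1, 4], num_subgroups=4)
```

With subgroup_tile=[4, 2] and subgroup_strides=[1, 4], `order` is `[0, 4, 1, 5, 2, 6, 3, 7]`, and `wrapped` is `[0, 0, 1, 1, 2, 2, 3, 3]` when only 4 subgroups are available.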

Threads per Subgroup:

This level of tiling is also known as "thread distribution" within a subgroup. The logic is quite similar to subgroup distribution, using the tile sizes and the 'thread_strides'.

Element distribution on a thread

After the vector is distributed per thread on a subgroup, it is viewed as [batch] x [outer] x [element], where each group of sub-dimensions has as many dimensions as the rank of the original undistributed vector.

The first level, batches, is a way to represent instruction unrolling. For example, an intrinsic that can only take a 4x4 shape at a time uses batches to unroll a 16x16 shape to the native intrinsic shape.
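As a rough illustration of the unrolling arithmetic, the sketch below computes how many batches are needed per dimension. The function name `batch_counts` is hypothetical, and the sketch assumes the shape divides evenly by the native intrinsic shape and that no other tiling level contributes to the dimension.

```python
def batch_counts(shape, intrinsic_shape):
    """Number of intrinsic-sized tiles needed along each dimension."""
    counts = []
    for dim, native in zip(shape, intrinsic_shape):
        if dim % native != 0:
            raise ValueError(f"{dim} is not a multiple of native size {native}")
        counts.append(dim // native)
    return counts


# A 16x16 shape unrolled onto a 4x4 intrinsic needs a 4x4 grid of batches.
batches = batch_counts([16, 16], [4, 4])
```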

The second level, outers, is a way to represent the thread layout duplication required by a particular intrinsic. For example, some AMDGPU matrix multiplication variants require threads to be distributed like this:

E.g.: with outer_tile=[2, 1] and thread_tile=[2, 5], the thread layout of shape 2x5 is duplicated 2 times to get a layout of shape 4x5:

outer = 0,0 :
[0 1 2 3 4]
[5 6 7 8 9]

outer = 1,0 :
[0 1 2 3 4]
[5 6 7 8 9]

outer_tile represents the number of outers in a batch.
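The duplicated layout can be generated with a small Python sketch. The function name `thread_layout` is hypothetical, and the row-major thread_strides=[5, 1] is an assumption inferred from the thread numbering in the figure above, since the example does not state the strides.

```python
def thread_layout(outer_tile, thread_tile, thread_strides):
    """Grid of thread ids for a 2-D [outer] x [thread] layout."""
    rows = outer_tile[0] * thread_tile[0]
    cols = outer_tile[1] * thread_tile[1]
    grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Position inside the thread tile; the outer index only selects
            # which copy of the thread tile we are in.
            thread_row = r % thread_tile[0]
            thread_col = c % thread_tile[1]
            row.append(thread_row * thread_strides[0]
                       + thread_col * thread_strides[1])
        grid.append(row)
    return grid


layout = thread_layout(outer_tile=[2, 1], thread_tile=[2, 5],
                       thread_strides=[5, 1])
```

The result has 4 rows of 5 thread ids, where rows 2 and 3 repeat rows 0 and 1, so every thread owns one position in each outer copy.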

The final level of tiling, elements, represents the minimum shape of vector that is treated as an atom.

element_tile represents the native size of the vector.

A full example

Vector to be distributed: vector<64x64xf16>

NestedLayout : <
    subgroup_tile = [2, 1],
    batch_tile = [2, 4],
    outer_tile = [1, 1],
    thread_tile = [16, 4],
    element_tile = [1, 4],
    subgroup_strides = [1, 0],
    thread_strides = [1, 16]
>

This is conceptually viewed as a vector<[2x1]x[2x4]x[1x1]x[16x4]x[1x4]>, where the first group of sub-dimensions represents the distribution into subgroups. With subgroup_strides = [1, 0], each subgroup gets a vector as follows:

subgroup0 : vector<[2x4]x[1x1]x[16x4]x[1x4]>
from vector<[2x1]x[2x4]x[1x1]x[16x4]x[1x4]>[0,:,:,:,:,:,:,:,:,:]
subgroup1 : vector<[2x4]x[1x1]x[16x4]x[1x4]>
from vector<[2x1]x[2x4]x[1x1]x[16x4]x[1x4]>[1,:,:,:,:,:,:,:,:,:]
subgroup2 : vector<[2x4]x[1x1]x[16x4]x[1x4]>
from vector<[2x1]x[2x4]x[1x1]x[16x4]x[1x4]>[0,:,:,:,:,:,:,:,:,:]
subgroup3 : vector<[2x4]x[1x1]x[16x4]x[1x4]>
from vector<[2x1]x[2x4]x[1x1]x[16x4]x[1x4]>[1,:,:,:,:,:,:,:,:,:]

Then each vector<[2x4]x[1x1]x[16x4]x[1x4]> is distributed to the threads in a subgroup using thread_strides = [1, 16]:

recall: thread_id = sum_i(thread_stride[i] * virtual_thread_id[i]) % mul_i(thread_tile_size[i])

thread0 : vector<[2x4]x[1x1]x[1x4]>
from vector<[2x4]x[1x1]x[16x4]x[1x4]>[:,:,:,:,0,0,:,:]
thread1 : vector<[2x4]x[1x1]x[1x4]>
from vector<[2x4]x[1x1]x[16x4]x[1x4]>[:,:,:,:,1,0,:,:]
...
...
thread16 : vector<[2x4]x[1x1]x[1x4]>
from vector<[2x4]x[1x1]x[16x4]x[1x4]>[:,:,:,:,0,1,:,:]

Finally, we are left with a distributed vector whose conceptual view is vector<[2x4]x[1x1]x[1x4]> and whose actual shape is vector<2x16>.
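The shapes in this example can be verified with a short Python sketch. The dictionary representation of the layout and the function names `undistributed_shape` and `distributed_shape` are assumptions made for illustration. The sketch multiplies all five tile levels per dimension to recover the original shape, and only the per-thread levels (batch, outer, element) to get the distributed shape.

```python
layout = {
    "subgroup_tile": [2, 1],
    "batch_tile": [2, 4],
    "outer_tile": [1, 1],
    "thread_tile": [16, 4],
    "element_tile": [1, 4],
}


def undistributed_shape(layout):
    """Per-dimension product of every tiling level."""
    levels = ["subgroup_tile", "batch_tile", "outer_tile",
              "thread_tile", "element_tile"]
    rank = len(layout["subgroup_tile"])
    shape = [1] * rank
    for level in levels:
        for dim in range(rank):
            shape[dim] *= layout[level][dim]
    return shape


def distributed_shape(layout):
    """Per-dimension product of the levels each thread keeps."""
    rank = len(layout["batch_tile"])
    return [
        layout["batch_tile"][d] * layout["outer_tile"][d] * layout["element_tile"][d]
        for d in range(rank)
    ]


full = undistributed_shape(layout)
per_thread = distributed_shape(layout)
```

For the layout above, `full` is `[64, 64]`, matching vector<64x64xf16>, and `per_thread` is `[2, 16]`, matching the final vector<2x16>.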

Parameters:

Parameter	C++ type	Description
subgroupTile	`::llvm::ArrayRef<int64_t>`	subgroup_tile
batchTile	`::llvm::ArrayRef<int64_t>`	batch_tile
outerTile	`::llvm::ArrayRef<int64_t>`	outer_tile
threadTile	`::llvm::ArrayRef<int64_t>`	thread_tile
elementTile	`::llvm::ArrayRef<int64_t>`	element_tile
subgroupStrides	`::llvm::ArrayRef<int64_t>`	subgroup_strides
threadStrides	`::llvm::ArrayRef<int64_t>`	thread_strides
