IREE Lowering Configs
Overview
The lowering config is an attribute used to lower operations within a dispatch from the tensor level down to the vector level. It is determined by:
- The type of computation being performed (e.g., matmul, reduction, convolution)
- Hardware attributes (e.g., subgroup size, memory bandwidth, compute units)
- Optional tuner refinements for performance optimization
IREE provides multiple variants of lowering configs depending on the desired backend and type of computation.
LLVMGPU Vector Distribute Pipeline
Reduction
This configuration adopts the broader reduction strategy used in memory-bound kernels, drawing inspiration from the high-level approach described in Harris's Optimizing Parallel Reduction in CUDA.
Relevant lowering config attributes
- `workgroup` tile sizes
- `thread` tile sizes
- `partial_reduction` tile sizes
- `lane_basis` (thread distribution within a subgroup)
- `subgroup_basis` (subgroup distribution within a workgroup)
- `expand_dims` reassociation list
Summary
| Attribute | Key Semantic |
|---|---|
| `workgroup` | Workgroup tile size along each dimension |
| `thread` | Thread tile size along each dimension (e.g., load width per thread) |
| `partial_reduction` | Tile size of the reduction dimension(s) processed by the workgroup |
| `lane_basis` | Distribution of threads within a subgroup onto the iteration space |
| `subgroup_basis` | Distribution of subgroups within a workgroup onto the iteration space |
| `expand_dims` | Split reduction dimensions to enable finer-grain accumulation |
Tile sizes
Tile sizes are expressed as arrays of integers, one per dimension of the iteration space. A zero indicates that the tiling level does not apply to that dimension.
The three relevant tiling levels for this pipeline are: workgroup, thread and partial reduction.
Workgroup- and thread-level tilings directly describe the tile sizes at their respective levels.
Example:
workgroup = [16, 0]
Dimension 0: Each workgroup produces 16 output elements in d0.
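To make the tiling arithmetic concrete, here is a minimal Python sketch of how a `workgroup` tile size translates into a workgroup count; the output size of 4096 along d0 is a made-up placeholder, not taken from a real dispatch.

```python
import math

# Hypothetical output size along d0; the tile size comes from the example above.
workgroup_tile_d0 = 16
output_size_d0 = 4096

# Each workgroup produces 16 output elements of d0, so the dispatch needs
# ceil(4096 / 16) = 256 workgroups along that dimension.
num_workgroups_d0 = math.ceil(output_size_d0 / workgroup_tile_d0)
print(num_workgroups_d0)  # 256
```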
Partial reduction tiling is slightly less straightforward and is described as follows:
partial_reduction tile sizes
Applies to: Reduction dimensions only.
Tiling strategy: The reduction dimension r is tiled such that
r -> r_outer, r_partial, where we create a serial loop over r_outer with
step size equal to r_partial. Within each iteration, threads maintain
r_partial partial accumulators across the reduction dimension. At the end,
partial results are merged.
Semantics:
- `partial_reduction[d] = 0`: Dimension `d` is not a reduction dimension.
- `partial_reduction[d] = S`: Tile the reduction dimension into chunks of size `S`.
Number of iterations:
iterations = ceil(reduction_size / partial_reduction[d])
Special case: If reduction_size / partial_reduction[d] = 1, there is only
one iteration and the outer loop can be elided.
Example:
partial_reduction = [0, 512]
Dimension 0: Not a reduction dimension.
Dimension 1: Process the reduction in tiles of 512 elements.
For a reduction of size 16384:
- Loop iterations: 16384 / 512 = 32.
- Each iteration: threads within the subgroup process 512 elements along
dimension 1.
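The loop count above can be reproduced with a small Python sketch; the parallel extent of d0 is a made-up placeholder, while the reduction size matches the example.

```python
import math

partial_reduction = [0, 512]   # from the example above
iteration_space = [16, 16384]  # d0 extent is a placeholder; d1 is the reduction

for d, tile in enumerate(partial_reduction):
    if tile == 0:
        continue  # dimension d is not tiled at the partial_reduction level
    iterations = math.ceil(iteration_space[d] / tile)
    print(f"d{d}: {iterations} iterations of {tile} elements")
# Prints: d1: 32 iterations of 512 elements
```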
Tip: The total number of elements each thread processes per iteration along a reduction dimension `d` is: `partial_reduction[d] * thread[d]`
Basis attributes
Basis attributes describe how a particular resource is distributed within the iteration space.
There are two basis attributes:
- Lane basis -- describes how threads within a subgroup are distributed within the specified iteration space
- Subgroup basis -- describes how subgroups within a workgroup are distributed within the specified iteration space
Format: [[counts], [mapping]]
- `counts`: Array of thread counts per basis dimension, i.e., the shape of the conceptual grid of resources that is mapped onto the iteration space via `mapping`.
- `mapping`: Permutation array mapping basis coordinates to iteration dimensions.
The counts Array
Definition: Number of threads/subgroups along each basis axis.
Constraint: The product of all counts equals the subgroup size
(for lane_basis) or number of subgroups (for subgroup_basis).
Example:
lane_basis = [[16, 4], [1, 0]]
counts = [16, 4]
For a subgroup of 64 threads:
* 16 * 4 = 64
* This forms a conceptual 16x4 grid of threads in basis space.
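As a quick sanity check, the constraint can be expressed directly in Python; the subgroup size of 64 is the one assumed in the example above.

```python
import math

lane_basis_counts = [16, 4]
subgroup_size = 64  # as assumed in the example above

# The product of the counts must equal the subgroup size.
assert math.prod(lane_basis_counts) == subgroup_size  # 16 * 4 == 64
```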
The mapping Array
Definition: A permutation that maps basis coordinates to iteration space dimensions.
Semantics:
mapping[j] = i means: iteration_dim[i] <- basis_digit d_j
Example:
mapping = [1, 0]
This swaps/transposes the coordinates:
- Basis digit d_0 maps to iteration dimension 1.
- Basis digit d_1 maps to iteration dimension 0.
Computing thread position based on lane_basis (Step by step)
Given a thread ID x, compute its position in the iteration space:
Step 1: Delinearize x using counts.
Let the counts be B_0, B_1, ..., B_(n-1), and let N = B_0 * B_1 * ... * B_(n-1).
P_i = product of B_k for k = i to n-1 (so P_n = 1 and P_0 = N)
The basis digits (coordinates) are:
d_i = floor((x mod P_i) / P_(i+1)) for i = 0..n-1, where each digit satisfies 0 <= d_i < B_i
Step 2: Apply the mapping to get iteration-space coordinates.
iteration_dim[mapping[i]] = d_i for i = 0..n-1
Concrete Example: Thread 42 with [[16, 4], [1, 0]]
Step 1: Delinearize(42, [16, 4])
Basis counts: [16, 4]
Products:
P_2 = 1
P_1 = 4
P_0 = 64
Digits:
d_0 = floor((42 mod 64) / 4) = floor(42 / 4) = 10
d_1 = floor((42 mod 4) / 1) = floor(2 / 1) = 2
Basis digits: [d_0, d_1] = [10, 2]
Step 2: Apply mapping [1, 0]
mapping[0] = 1 -> iteration_dim[1] = d_0 = 10
mapping[1] = 0 -> iteration_dim[0] = d_1 = 2
Coordinates: [dim0 = 2, dim1 = 10]
Result: Thread 42 works at position [dim0 = 2, dim1 = 10] in the
iteration space.
Visual interpretation:
Threads form a 16x4 grid in basis space:
col0 col1 col2 col3
row0: T0 T1 T2 T3
row1: T4 T5 T6 T7
...
row10: T40 T41 T42 T43 <- Thread 42 at (row = 10, col = 2)
...
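The two steps can be captured in a short Python sketch; `thread_to_iteration_coords` is a hypothetical helper written for illustration, not an IREE API, and it reproduces the thread-42 result derived above.

```python
def thread_to_iteration_coords(thread_id, counts, mapping):
    # Step 1: delinearize thread_id into basis digits d_0..d_(n-1),
    # row-major over `counts` (the last count is the least significant).
    digits = []
    remaining = thread_id
    for count in reversed(counts):
        digits.append(remaining % count)
        remaining //= count
    digits.reverse()  # digits[i] satisfies 0 <= digits[i] < counts[i]

    # Step 2: scatter the digits: iteration_dim[mapping[i]] = d_i.
    coords = [0] * len(mapping)
    for i, dim in enumerate(mapping):
        coords[dim] = digits[i]
    return coords

# Thread 42 with lane_basis = [[16, 4], [1, 0]]:
print(thread_to_iteration_coords(42, [16, 4], [1, 0]))  # [2, 10]
```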
Subgroups distribute work identically to how lane basis distributes lanes. If there is more than one subgroup, results require workgroup-level synchronization.
Dimension Expansion (expand_dims)
Applies to: Reduction dimensions only.
Purpose: Expand (split) a reduction dimension into multiple dimensions in
the iteration space so threads can accumulate at a finer granularity across
the reduction loop. Without expand_dims, each thread typically keeps a full
vector accumulator across the entire reduction (e.g., vector<8xf16>) and
reduces it at the end; with expand_dims, the reduction is split so each
thread can reduce per inner chunk (e.g., vector<1xf16>), reducing register
pressure while preserving the same logical result.
Semantics: The attribute follows the same reassociation model as
tensor.expand_shape, with two parameters:
- `reassociations`: Maps original iterator dimensions to expanded dimensions. For example, `[[0], [1], [2, 3]]` keeps dimensions 0 and 1 unchanged and splits dimension 2 into dimensions 2 and 3.
- `output_shape`: Sizes of the expanded dimensions. Use `?` to indicate a dynamic size, which is inferred from the original dimension and the other static factors in the same reassociation group (at most one `?` per group).
Applicability: Expansion is only performed when it is statically valid (e.g., the original size is known and divisible by the static factors). Otherwise, it is ignored.
Example:
#iree_gpu.expand_dims<[[0], [1], [2, 3]], output_shape = [?, ?, ?, 8]>
This keeps d0 and d1 unchanged and splits d2 into d2 and d3, where
d3 = 8 and d2 = extent(d2) / 8.
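The net effect on a reduction can be illustrated with a NumPy sketch: splitting the trailing reduction dimension by a factor of 8 and reducing in two stages gives the same logical result as reducing the original dimension, while each inner reduction only touches 8 elements. The tensor shape here is made up for illustration.

```python
import numpy as np

# Illustrative shapes only; the reduction extent is divisible by 8.
x = np.random.rand(4, 16, 16384).astype(np.float32)

direct = x.sum(axis=2)                      # reduce the original d2
expanded = x.reshape(4, 16, 16384 // 8, 8)  # d2 -> (d2', d3) with d3 = 8
staged = expanded.sum(axis=3).sum(axis=2)   # reduce d3 per chunk, then d2'

# Same logical result, but the inner reduction works on chunks of 8,
# which is what lets a thread keep a small per-chunk accumulator.
assert np.allclose(direct, staged, rtol=1e-3)
```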
Example
Iteration space: [d0=parallel(4), d1=parallel(6656), d2=reduction(16384)]
Configuration:
#iree_gpu.expand_dims<[[0], [1], [2, 3]], output_shape = [?, ?, ?, 8]>
lane_basis = [[1, 1, 64, 1], [0, 1, 2, 3]]
partial_reduction = [0, 0, 64, 0]
subgroup_basis = [[1, 1, 1, 1], [0, 1, 2, 3]]
thread = [0, 0, 1, 8]
workgroup = [4, 1, 0, 0]
Analysis:
Expand Dims
#iree_gpu.expand_dims<[[0], [1], [2, 3]], output_shape = [?, ?, ?, 8]>:
The original iteration space has three dimensions. The expand_dims attribute splits
the reduction dimension d2 into two dimensions (d2 and d3), transforming the iteration
space from 3D to 4D:
Original: [d0=parallel(4), d1=parallel(6656), d2=reduction(16384)]
Expanded: [d0=parallel(4), d1=parallel(6656), d2=reduction(2048), d3=reduction(8)]
The reassociation [[0], [1], [2, 3]] maps original dimensions to expanded
dimensions: d0 -> d0, d1 -> d1, and d2 -> (d2, d3). With
output_shape = [?, ?, ?, 8], d3 is fixed at 8, and d2 is inferred:
16384 / 8 = 2048.
Lane basis [[1, 1, 64, 1], [0, 1, 2, 3]]:
In the expanded space, 64 threads are distributed along d2. With identity
mapping [0, 1, 2, 3], the 64 threads cover 64 consecutive positions along
the expanded d2 dimension.
Partial reduction [0, 0, 64, 0]:
Tiles the expanded d2 dimension into chunks of 64. With d2=2048, this creates 2048 / 64 = 32 outer loop iterations. Dimension d3 has tile size 0, meaning it is fully processed within each iteration.
In terms of the original iteration space: each outer loop iteration processes 64 * 8 = 512 elements of the original d2, giving 16384 / 512 = 32 iterations.
Thread [0, 0, 1, 8]:
Each thread processes 1 element along expanded d2 and 8 elements along d3. This
means each thread maintains a vector<8> partial accumulator. With 64 threads
distributed along d2, the subgroup collectively processes 64 * 8 = 512 elements
of the original reduction dimension per iteration.
Workgroup [4, 1, 0, 0]:
The workgroup produces a 4x1 output tile (d0 x d1). The reduction
dimensions (d2, d3) have tile size 0, indicating they are handled entirely
within the workgroup via the partial reduction loop and thread distribution.
Subgroup basis [[1, 1, 1, 1], [0, 1, 2, 3]]:
With counts 1x1x1x1 = 1, there is a single 64-thread subgroup per workgroup.
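Putting the pieces together, here is a minimal Python sketch, assuming the subgroup size of 64 used throughout the analysis, that derives the headline numbers (reduction loop trip count, per-thread accumulator width, and elements of the original reduction covered per iteration) from the configuration above.

```python
subgroup_size = 64                       # assumed, as in the analysis above
expanded = {"d0": 4, "d1": 6656, "d2": 2048, "d3": 8}

partial_reduction = [0, 0, 64, 0]
thread = [0, 0, 1, 8]

# Serial reduction loop: expanded d2 tiled into chunks of 64.
outer_iterations = expanded["d2"] // partial_reduction[2]       # 32

# Per-thread accumulator: 1 element of d2 x 8 elements of d3 -> vector<8>.
per_thread_elems = thread[2] * thread[3]                        # 8

# Elements of the original reduction covered by the subgroup per iteration.
per_iteration_elems = subgroup_size * per_thread_elems          # 512

# 32 iterations x 512 elements recovers the original reduction size of 16384.
assert outer_iterations * per_iteration_elems == expanded["d2"] * expanded["d3"]
print(outer_iterations, per_thread_elems, per_iteration_elems)  # 32 8 512
```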