IREE Lowering Configs

Overview

The lowering config is an attribute used to lower operations within a dispatch from the tensor level down to the vector level. It is determined by:

  1. The type of computation being performed (e.g., matmul, reduction, convolution)
  2. Hardware attributes (e.g., subgroup size, memory bandwidth, compute units)
  3. Optional tuner refinements for performance optimization

IREE provides multiple variants of lowering configs depending on the desired backend and type of computation.


LLVMGPU Vector Distribute Pipeline

Reduction

This configuration adopts the broader reduction strategy used in memory-bound kernels, drawing inspiration from the high-level approach described in Harris's Optimizing Parallel Reduction in CUDA.

Relevant lowering config attributes

  • workgroup tile sizes
  • thread tile sizes
  • partial_reduction tile sizes
  • lane_basis (thread distribution within a subgroup)
  • subgroup_basis (subgroup distribution within a workgroup)
  • expand_dims reassociation list

Summary

Attribute Key       Semantic
workgroup           Workgroup tile size along each dimension
thread              Thread tile size along each dimension (e.g., load width per thread)
partial_reduction   Tile size of the reduction dimension(s) processed by the workgroup
lane_basis          Distribution of threads within a subgroup onto the iteration space
subgroup_basis      Distribution of subgroups within a workgroup onto the iteration space
expand_dims         Split reduction dimensions to enable finer-grain accumulation

Tile sizes

Tile sizes are expressed as arrays of integers, one per dimension of the iteration space. A zero indicates that the tiling level does not apply to that dimension.

The three relevant tiling levels for this pipeline are workgroup, thread, and partial reduction.

Workgroup- and thread-level tilings directly describe the tile sizes at their respective levels.

Example:

workgroup = [16, 0]

Dimension 0: Each workgroup produces 16 output elements along d0.
Dimension 1: Not tiled at the workgroup level (tile size 0); in this pipeline, this is typically the reduction dimension.
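
A minimal Python sketch of this interpretation (the problem shape below is made up for illustration): a zero tile size means the dimension is not tiled at this level, and along a tiled parallel dimension the number of workgroups is the size divided by the tile, rounded up.

# Illustrative only, not IREE code; problem_shape is hypothetical.
import math

problem_shape = [64, 16384]   # [d0 = parallel, d1 = reduction]
workgroup     = [16, 0]

workgroups_per_dim = [
    math.ceil(size / tile) if tile != 0 else 1
    for size, tile in zip(problem_shape, workgroup)
]
print(workgroups_per_dim)     # [4, 1]: 4 workgroups along d0, d1 untouched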

Partial reduction tiling is slightly less straightforward and is described as follows:

partial_reduction tile sizes

Applies to: Reduction dimensions only.

Tiling strategy: The reduction dimension r is tiled such that r -> r_outer, r_partial, where we create a serial loop over r_outer with step size equal to r_partial. Within each iteration, threads maintain r_partial partial accumulators across the reduction dimension. At the end, partial results are merged.

Semantics:

  • partial_reduction[d] = 0: Dimension d is not a reduction dimension.
  • partial_reduction[d] = S: Tile the reduction dimension into chunks of size S.

Number of iterations:

iterations = ceil(reduction_size / partial_reduction[d])

Special case: If ceil(reduction_size / partial_reduction[d]) = 1, there is only one iteration and the outer loop can be elided.

Example:

partial_reduction = [0, 512]

Dimension 0: Not a reduction dimension.
Dimension 1: Process the reduction in tiles of 512 elements.

For a reduction of size 16384:
- Loop iterations: 16384 / 512 = 32.
- Each iteration: threads within the subgroup process 512 elements along
  dimension 1.

Tip: The total number of elements each thread processes per iteration along a reduction dimension d is: partial_reduction[d] * thread[d]
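
The loop structure can be sketched in plain Python from the workgroup's collective point of view (illustrative only, not the code IREE generates): the workgroup keeps partial_reduction[d] partial accumulators, which in the real pipeline are distributed across its threads, and merges them after the serial loop.

reduction_size = 16384
tile = 512                                   # partial_reduction tile size
data = [float(i) for i in range(reduction_size)]

partial = [0.0] * tile                       # r_partial partial accumulators
for start in range(0, reduction_size, tile): # 16384 / 512 = 32 outer iterations
    for j in range(tile):                    # each thread handles a slice of these
        partial[j] += data[start + j]

result = sum(partial)                        # merge partial results at the end
assert result == sum(data)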


Basis attributes

Basis attributes describe how a particular resource is distributed within the iteration space.

There are two basis attributes:

  • Lane basis -- describes how threads within a subgroup are distributed within the specified iteration space
  • Subgroup basis -- describes how subgroups within a workgroup are distributed within the specified iteration space

Format: [[counts], [mapping]]

  • counts: Array of thread/subgroup counts per basis dimension, i.e., the shape of the conceptual grid of resources that is mapped onto the iteration space via mapping.
  • mapping: Permutation array mapping basis coordinates to iteration dimensions.

The counts Array

Definition: Number of threads/subgroups along each basis axis.

Constraint: The product of all counts equals the subgroup size (for lane_basis) or number of subgroups (for subgroup_basis).

Example:

lane_basis = [[16, 4], [1, 0]]
counts = [16, 4]

For a subgroup of 64 threads:

  • 16 * 4 = 64
  • This forms a conceptual 16x4 grid of threads in basis space.

The mapping Array

Definition: A permutation that maps basis coordinates to iteration space dimensions.

Semantics:

mapping[j] = i  means:  iteration_dim[i] <- basis_digit d_j

Example:

mapping = [1, 0]

This swaps/transposes the coordinates:

  • Basis digit d0 maps to iteration dimension 1.

  • Basis digit d1 maps to iteration dimension 0.

Computing thread position based on lane_basis (Step by step)

Given a thread ID x, compute its position in the iteration space:

Step 1: Delinearize x using counts.

Let the counts be B0, B1, ..., Bn-1, and let N = product of Bi.

P_i = B_i * B_(i+1) * ... * B_(n-1)    for i = 0..n-1, with P_n = 1

The basis digits (coordinates) are:

d_i = floor((x mod P_i) / P_(i+1))     for i = 0..n-1, where each digit satisfies 0 <= d_i < B_i

Step 2: Apply the mapping to get iteration-space coordinates.

iteration_dim[mapping[i]] = d_i   for i = 0..n-1

Concrete Example: Thread 42 with [[16, 4], [1, 0]]

Step 1: Delinearize(42, [16, 4])

Basis counts: [16, 4]

Products:
  P_2 = 1
  P_1 = 4
  P_0 = 64

Digits:
  d_0 = floor((42 mod 64) / 4) = floor(42 / 4) = 10
  d_1 = floor((42 mod 4)  / 1) = floor(2  / 1) = 2

Basis digits: [d_0, d_1] = [10, 2]

Step 2: Apply mapping [1, 0]

mapping[0] = 1  ->  iteration_dim[1] = d_0 = 10
mapping[1] = 0  ->  iteration_dim[0] = d_1 = 2

Coordinates: [dim0 = 2, dim1 = 10]

Result: Thread 42 works at position [d0 = 2, d1 = 10] in the iteration space.

Visual interpretation:

Threads form a 16x4 grid in basis space:
       col0 col1 col2 col3
row0:   T0   T1   T2   T3
row1:   T4   T5   T6   T7
...
row10:  T40  T41  T42  T43  <- Thread 42 at (row = 10, col = 2)
...
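
The two steps above can be written as a short Python sketch (not IREE code): delinearize the lane ID with the basis counts, then place the resulting digits with the mapping.

def lane_to_iteration_coords(lane_id, counts, mapping):
    # Step 1: delinearize lane_id into basis digits d_0 .. d_(n-1).
    digits = []
    suffix = 1                              # P_(i+1), built up from the right
    for b in reversed(counts):
        digits.append((lane_id // suffix) % b)
        suffix *= b
    digits.reverse()
    # Step 2: iteration_dim[mapping[i]] = d_i.
    coords = [0] * len(counts)
    for i, dim in enumerate(mapping):
        coords[dim] = digits[i]
    return coords

# Thread 42 with lane_basis = [[16, 4], [1, 0]]:
print(lane_to_iteration_coords(42, [16, 4], [1, 0]))   # [2, 10] -> d0 = 2, d1 = 10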

The subgroup_basis distributes subgroups over the iteration space the same way lane_basis distributes lanes within a subgroup. If there is more than one subgroup, combining their partial results requires workgroup-level synchronization.


Dimension Expansion (expand_dims)

Applies to: Reduction dimensions only.

Purpose: Expand (split) a reduction dimension into multiple dimensions in the iteration space so threads can accumulate at a finer granularity across the reduction loop. Without expand_dims, each thread typically keeps a full vector accumulator across the entire reduction (e.g., vector<8xf16>) and reduces it at the end; with expand_dims, the reduction is split so each thread can reduce per inner chunk (e.g., vector<1xf16>), reducing register pressure while preserving the same logical result.

Semantics: The attribute follows the same reassociation model as tensor.expand_shape, with two parameters:

  • reassociations: Maps original iterator dimensions to expanded dimensions. For example, [[0], [1], [2, 3]] keeps dimensions 0 and 1 unchanged and splits dimension 2 into dimensions 2 and 3.
  • output_shape: Sizes of the expanded dimensions. Use ? to indicate a dynamic size, which is inferred from the original dimension and the other static factors in the same reassociation group (at most one ? per group).

Applicability: Expansion is only performed when it is statically valid (e.g., the original size is known and divisible by the static factors). Otherwise, it is ignored.

Example:

#iree_gpu.expand_dims<[[0], [1], [2, 3]], output_shape = [?, ?, ?, 8]>

This keeps d0 and d1 unchanged and splits d2 into d2 and d3, where d3 = 8 and d2 = extent(d2) / 8.
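
The size inference can be sketched in plain Python (the helper below is hypothetical, for illustration only): each '?' in output_shape is filled in from the original size and the static factors in its reassociation group.

def expand_sizes(orig_sizes, reassociations, output_shape):
    expanded = list(output_shape)
    for group, orig in zip(reassociations, orig_sizes):
        static = 1
        dynamic_pos = None
        for pos in group:
            if expanded[pos] == '?':
                dynamic_pos = pos            # at most one '?' per group
            else:
                static *= expanded[pos]
        if dynamic_pos is not None:
            assert orig % static == 0        # expansion must be statically valid
            expanded[dynamic_pos] = orig // static
    return expanded

# Splitting d2 of a [4, 6656, 16384] space with [[0], [1], [2, 3]] and [?, ?, ?, 8]:
print(expand_sizes([4, 6656, 16384], [[0], [1], [2, 3]], ['?', '?', '?', 8]))
# -> [4, 6656, 2048, 8]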


Example

Iteration space: [d0=parallel(4), d1=parallel(6656), d2=reduction(16384)]

Configuration:

#iree_gpu.expand_dims<[[0], [1], [2, 3]], output_shape = [?, ?, ?, 8]>
lane_basis = [[1, 1, 64, 1], [0, 1, 2, 3]]
partial_reduction = [0, 0, 64, 0]
subgroup_basis = [[1, 1, 1, 1], [0, 1, 2, 3]]
thread = [0, 0, 1, 8]
workgroup = [4, 1, 0, 0]

Analysis:

Expand Dims #iree_gpu.expand_dims<[[0], [1], [2, 3]], output_shape = [?, ?, ?, 8]>:

The original iteration space has three dimensions. The expand_dims attribute splits the reduction dimension d2 into two dimensions (d2 and d3), transforming the iteration space from 3D to 4D:

Original:  [d0=parallel(4), d1=parallel(6656), d2=reduction(16384)]
Expanded:  [d0=parallel(4), d1=parallel(6656), d2=reduction(2048), d3=reduction(8)]

The reassociation [[0], [1], [2, 3]] maps original dimensions to expanded dimensions: d0 -> d0, d1 -> d1, and d2 -> (d2, d3). With output_shape = [?, ?, ?, 8], d3 is fixed at 8, and d2 is inferred: 16384 / 8 = 2048.

Lane basis [[1, 1, 64, 1], [0, 1, 2, 3]]:

In the expanded space, 64 threads are distributed along d2. With identity mapping [0, 1, 2, 3], the 64 threads cover 64 consecutive positions along the expanded d2 dimension.

Partial reduction [0, 0, 64, 0]:

Tiles the expanded d2 dimension into chunks of 64. With d2=2048, this creates 2048 / 64 = 32 outer loop iterations. Dimension d3 has tile size 0, meaning it is fully processed within each iteration.

In terms of the original iteration space: each outer loop iteration processes 64 * 8 = 512 elements of the original d2, giving 16384 / 512 = 32 iterations.

Thread [0, 0, 1, 8]:

Each thread processes 1 element along expanded d2 and 8 elements along d3. This means each thread maintains a vector<8> partial accumulator. With 64 threads distributed along d2, the subgroup collectively processes 64 * 8 = 512 elements of the original reduction dimension per iteration.

Workgroup [4, 1, 0, 0]:

The workgroup produces a 4x1 output tile (d0 x d1). The reduction dimensions (d2, d3) have tile size 0, indicating they are handled entirely within the workgroup via the partial reduction loop and thread distribution.

Subgroup basis [[1, 1, 1, 1], [0, 1, 2, 3]]:

With counts 1x1x1x1 = 1, there is a single 64-thread subgroup per workgroup.
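
A quick arithmetic check of the numbers derived in this analysis, written as a plain Python sketch:

subgroup_size  = 64
orig_reduction = 16384
d3             = 8                          # inner factor from expand_dims
d2             = orig_reduction // d3       # 2048
pr_d2          = 64                         # partial_reduction tile along expanded d2
thread_d3      = 8                          # per-thread tile along d3

outer_iterations = d2 // pr_d2                      # 32
elems_per_iter   = pr_d2 * thread_d3                # 512 original elements per iteration
elems_per_thread = elems_per_iter // subgroup_size  # 8, i.e. one vector<8> accumulator

print(d2, outer_iterations, elems_per_iter, elems_per_thread)   # 2048 32 512 8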