# Tuning
Tuning is an important step in extracting performance from your hardware. During compilation of a workload, IREE makes decisions about how to set certain parameters that define how the workload runs on the targeted hardware. For example, when targeting a GPU, there may be multiple viable options for thread count or tile size for a given input graph. By default, these parameters are chosen in a way that performs well for generic workloads, but it is often possible to select values that squeeze out extra performance for a specific workload.

This process of iterating over the space of possible parameter values (knobs) to improve some chosen performance metric is called tuning.
```mermaid
---
title: Generic Tuning workflow
---
graph LR;
  accTitle: Generic Tuning workflow
  accDescr {
    A generic tuning workflow consists of compiling a model, benchmarking the
    performance with the current choice of parameters, then changing the
    parameters before beginning the next iteration of this loop.
  }
  A[Compile]-->B;
  B[Benchmark]-->C;
  C[Set knobs]-->A;
```
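Concretely, one iteration of this loop might look like the following minimal shell sketch. The file names, target, and benchmark arguments here are illustrative assumptions, not a prescribed workflow; the tuning spec mechanism used to set knob values is covered in the sections below.

```shell
# Compile with the current candidate knob values (expressed as a tuning spec).
iree-compile model.mlir \
  --iree-hal-target-device=hip --iree-hip-target=gfx942 \
  --iree-codegen-tuning-spec-path=candidate_spec.mlir \
  -o model.vmfb

# Benchmark the compiled module; the measurement guides the next knob choice.
iree-benchmark-module \
  --module=model.vmfb \
  --device=hip \
  --function=matmul_reduce_32_1024_2048 \
  --input=32x1024xf16 \
  --input=1024x2048xf16
```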
## SHARK Tuner
### Overview
While tuning can be done manually, the SHARK Tuner tool can automatically search through possible knob values for individual dispatches to improve overall program performance. Dispatches are blocks of code created as part of IREE's compilation flow by splitting the input program into blocks that can be executed concurrently and atomically. For more information on dispatches, see the sections below.
!!! info

    For more information about SHARK Tuner, see its source in the
    [shark-ai GitHub repository](https://github.com/nod-ai/shark-ai) and the Model Tuner example.
In our experience, using the SHARK Tuner can provide a meaningful speedup in model execution.
!!! tip

    SHARK Tuner achieved a ~10% improvement on the SDXL (Stable Diffusion XL) model with the MI300X GPU.
## What is a dispatch?
To obtain a deeper understanding of what it means to tune dispatches, let's first build some intuition for what a dispatch is.
!!! info

    Reminder: A dispatch is a block of code that can be executed concurrently and atomically, created by splitting the original input graph.
Let's walk through an example.
```mlir
#map = affine_map<(d0, d1) -> (d0, d1)>
#map1 = affine_map<(d0, d1) -> (d0)>
func.func @matmul_reduce_32_1024_2048(%lhs: tensor<32x1024xf16>, %rhs: tensor<1024x2048xf16>) -> tensor<32xf32> {
  %c0_f32 = arith.constant 0.0 : f32
  // Perform a matmul.
  %mm_acc = tensor.empty() : tensor<32x2048xf32>
  %mm_fill = linalg.fill ins(%c0_f32 : f32) outs(%mm_acc : tensor<32x2048xf32>) -> tensor<32x2048xf32>
  %mm = linalg.matmul ins(%lhs, %rhs : tensor<32x1024xf16>, tensor<1024x2048xf16>) outs(%mm_fill : tensor<32x2048xf32>) -> tensor<32x2048xf32>
  // Sum over the last dimension.
  %red_acc = tensor.empty() : tensor<32xf32>
  %red_fill = linalg.fill ins(%c0_f32 : f32) outs(%red_acc : tensor<32xf32>) -> tensor<32xf32>
  %red = linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["parallel", "reduction"]} ins(%mm : tensor<32x2048xf32>) outs(%red_fill : tensor<32xf32>) {
  ^bb0(%in: f32, %out: f32):
    %7 = arith.addf %in, %out : f32
    linalg.yield %7 : f32
  } -> tensor<32xf32>
  return %red : tensor<32xf32>
}
```
The above IR performs a MatMul and then a sum over the last dimension.
```mermaid
flowchart TD
  accTitle: Illustration of example IR
  accDescr {
    An example of IR representing a simple model consisting of a matmul with
    shape `32x1024x2048` followed by a reduction sum over the last dimension.
  }
  A[/Input 1/]-->|tensor<32x1024xf16>|C;
  B[/Input 2/]-->|tensor<1024x2048xf16>|C;
  C(MatMul)-->|tensor<32x2048xf32>|E;
  E(Sum Reduction)-->|tensor<32xf32>|F[/Output/];
```
While compiling this graph with IREE, the flag `--iree-hal-dump-executable-files-to=<some directory>` can be used to observe the created dispatches.
```shell
iree-compile --iree-hal-target-device=hip --iree-hip-target=gfx942 \
  --mlir-print-ir-after=iree-codegen-materialize-user-configs \
  --iree-hal-dump-executable-files-to=<some directory>
```
```mlir
hal.executable public @matmul_reduce_32_1024_2048_dispatch_0 {
  hal.executable.variant public @rocm_hsaco_fb target(<...>) {
    module {
      func.func @matmul_reduce_32_1024_2048_dispatch_0_matmul_32x2048x1024_f16xf16xf32() {
        %cst = arith.constant 0.000000e+00 : f32
        ...
        %3 = flow.dispatch.tensor.load %0, offsets = [0, 0], sizes = [32, 1024], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<32x1024xf16>> -> tensor<32x1024xf16>
        %4 = flow.dispatch.tensor.load %1, offsets = [0, 0], sizes = [1024, 2048], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<1024x2048xf16>> -> tensor<1024x2048xf16>
        %5 = tensor.empty() : tensor<32x2048xf32>
        %6 = linalg.fill ins(%cst : f32) outs(%5 : tensor<32x2048xf32>) -> tensor<32x2048xf32>
        %7 = linalg.matmul {lowering_config = #iree_gpu.lowering_config<{mma_kind = #iree_gpu.mma_layout<MFMA_F32_16x16x16_F16>, promote_operands = [0, 1], reduction = [0, 0, 128], subgroup_m_count = 1 : i64, subgroup_n_count = 4 : i64, workgroup = [16, 128, 0]}>} ins(%3, %4 : tensor<32x1024xf16>, tensor<1024x2048xf16>) outs(%6 : tensor<32x2048xf32>) -> tensor<32x2048xf32>
        flow.dispatch.tensor.store %7, %2, offsets = [0, 0], sizes = [32, 2048], strides = [1, 1] : tensor<32x2048xf32> -> !flow.dispatch.tensor<writeonly:tensor<32x2048xf32>>
        return
      }
    }
  }
}
hal.executable public @matmul_reduce_32_1024_2048_dispatch_1 {
  hal.executable.variant public @rocm_hsaco_fb target(<...>) {
    module {
      func.func @matmul_reduce_32_1024_2048_dispatch_1_generic_32x2048_f32() {
        %cst = arith.constant 0.000000e+00 : f32
        ...
        %2 = flow.dispatch.tensor.load %0, offsets = [0, 0], sizes = [32, 2048], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<32x2048xf32>> -> tensor<32x2048xf32>
        %3 = tensor.empty() : tensor<32xf32>
        %4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<32xf32>) -> tensor<32xf32>
        %5 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>], iterator_types = ["parallel", "reduction"]} ins(%2 : tensor<32x2048xf32>) outs(%4 : tensor<32xf32>) {
        ^bb0(%in: f32, %out: f32):
          %6 = arith.addf %in, %out : f32
          linalg.yield %6 : f32
        } -> tensor<32xf32>
        flow.dispatch.tensor.store %5, %1, offsets = [0], sizes = [32], strides = [1] : tensor<32xf32> -> !flow.dispatch.tensor<writeonly:tensor<32xf32>>
        return
      }
    }
  }
}
```
Illustrated graphically, the following dispatches were created.
```mermaid
flowchart TD
  accTitle: Example of dispatch creation
  accDescr {
    In the earlier example, the matmul and the sum reduction are split into
    individual dispatches, with the output of the matmul dispatch being tied
    to the input of the sum reduction dispatch.
  }
  A[/Input 1/]-->|tensor<32x1024xf16>|C;
  B[/Input 2/]-->|tensor<1024x2048xf16>|C;
  subgraph Dispatch 0
  C(MatMul);
  end
  C-->|tensor<32x2048xf32>|D
  subgraph Dispatch 1
  D(Sum Reduction);
  end
  D-->|tensor<32xf32>|F[/Output/];
```
In the above example, each `hal.executable` represents a dispatch.

To briefly explain how these dispatches come to be: when creating dispatches, the compiler first identifies "root" operations, which are often operations that perform some kind of reduction. In our example there are two such root ops, the MatMul and the sum reduction. For each root op, the compiler then finds surrounding operations that can be merged into the same dispatch (not applicable in our example). Finally, each of these groups of ops with a single root is split off into an individual function, creating a dispatch. There are many nuances, such as how operations are chosen for a particular dispatch, that are beyond the scope of this document, but hopefully this provides a useful starting point for understanding how a graph is broken down into dispatches.
## Knobs in Dispatches
Depending on the hardware being targeted, a dispatch will expose different knobs. Focusing on GPUs, some common knobs are subgroup tile sizes and workgroup thread counts. For a given input graph, the knobs associated with a particular dispatch can be seen by adding the following flags when compiling:

```shell
--iree-hal-dump-executable-benchmarks-to=<directory>
--iree-config-add-tuner-attributes
```

These flags dump standalone `hal.executable` benchmarks for each dispatch. Within these benchmark files, we can find an attribute associated with the root op of the dispatch that exposes its tunable parameters.
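For example, a complete invocation combining these flags might look like the following; the input file and dump directory names are illustrative:

```shell
iree-compile model.mlir \
  --iree-hal-target-device=hip --iree-hip-target=gfx942 \
  --iree-hal-dump-executable-benchmarks-to=benchmarks/ \
  --iree-config-add-tuner-attributes \
  -o model.vmfb
```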
Shown below is the aforementioned attribute for the MatMul dispatch from our earlier example.
```mlir
linalg.matmul {lowering_config = #iree_gpu.lowering_config<{
      mma_kind = #iree_gpu.mma_layout<MFMA_F32_16x16x16_F16>,
      promote_operands = [0, 1],
      reduction = [0, 0, 128],
      subgroup_m_count = 1 : i64,
      subgroup_n_count = 4 : i64,
      workgroup = [16, 128, 0]}>,
    root_op}
  ins(%3, %4 : tensor<32x1024xf16>, tensor<1024x2048xf16>)
  outs(%6 : tensor<32x2048xf32>) -> tensor<32x2048xf32>
```
Specifically, observe the `lowering_config` attribute, which lists tunable parameters such as the choice of MMA layout or the subgroup counts along various dimensions. These parameters affect how the model executes on hardware, for example by influencing memory locality and bank conflicts.
## Setting knobs and tuning specs
Knob values can be set or changed in several ways. One way is to use the flags below to dump the IR after these knobs have been set, manually edit the values, and then resume compilation:

```shell
--compile-to=executable-configurations
--compile-from=executable-configurations
```
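As a sketch, that manual flow could look like this (file names are illustrative):

```shell
# Stop compilation after knob values have been chosen and dump the IR.
iree-compile model.mlir \
  --iree-hal-target-device=hip --iree-hip-target=gfx942 \
  --compile-to=executable-configurations -o configured.mlir

# Hand-edit the lowering_config values in configured.mlir, then resume
# compilation from that phase.
iree-compile configured.mlir \
  --iree-hal-target-device=hip --iree-hip-target=gfx942 \
  --compile-from=executable-configurations -o model.vmfb
```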
A more elegant solution is to use the Transform dialect in MLIR. This dialect provides ops that can be added to the IR to transform it during compilation. For more information, see this [overview of the Transform dialect](https://mlir.llvm.org/docs/Dialects/Transform/). For the purposes of tuning, we use the Transform dialect to create MLIR files, which we call "specs", that describe how dispatches and their relevant knobs should be changed.
### Usage in IREE
The use of tuning specs in `iree-compile` is controlled with the following flags:

- `--iree-codegen-enable-default-tuning-specs` -- enables or disables the default tuning specs shipped with the compiler.
- `--iree-codegen-tuning-spec-path` -- loads a user-specified tuning spec.
- `--iree-codegen-dump-tuning-specs-to` -- dumps final tuning specs to a directory or standard output.
Note that both default and user-provided specs can be enabled at the same time. The compiler will link them together and invoke the user-provided spec before attempting the default one.
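For example, the following invocation enables the default specs and also loads a user spec; the file names are illustrative:

```shell
iree-compile model.mlir \
  --iree-hal-target-device=hip --iree-hip-target=gfx942 \
  --iree-codegen-tuning-spec-path=my_spec.mlir \
  --iree-codegen-enable-default-tuning-specs \
  -o model.vmfb
```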
### Anatomy of a tuning spec
#### Example
This is an example of a tuning spec that may be applied to the MatMul dispatch in our earlier example.
```mlir
module attributes {iree_codegen.tuning_spec_with_default_entrypoint, transform.with_named_sequence} {
  transform.named_sequence @apply_op_config(%arg0: !transform.any_op {transform.readonly}, %arg1: !transform.any_param {transform.readonly}) {
    transform.annotate %arg0 "compilation_info" = %arg1 : !transform.any_op, !transform.any_param
    transform.yield
  }

  transform.named_sequence @match_matmul_reduce_32_1024_2048_dispatch_0_matmul_32x2048x1024_f16xf16xf32(%arg0: !transform.any_op {transform.readonly}) -> (!transform.any_op, !transform.any_param) {
    %inputs, %outputs = transform.iree.match.cast_compatible_dag_from_root %arg0 {
    ^bb0(%arg1: tensor<32x1024xf16>, %arg2: tensor<1024x2048xf16>, %arg3: tensor<32x2048xf32>):
      %1 = linalg.matmul ins(%arg1, %arg2 : tensor<32x1024xf16>, tensor<1024x2048xf16>) outs(%arg3 : tensor<32x2048xf32>) -> tensor<32x2048xf32>
    } : (!transform.any_op) -> (!transform.any_value, !transform.any_value)
    %0 = transform.param.constant #iree_codegen.compilation_info<lowering_config = #iree_gpu.lowering_config<{mma_kind = #iree_gpu.mma_layout<MFMA_F32_16x16x16_F16>, promote_operands = [0, 1], reduction = [0, 0, 128], subgroup_m_count = 2 : i64, subgroup_n_count = 2 : i64, workgroup = [32, 128, 0]}>, translation_info = <pipeline = LLVMGPUVectorDistribute workgroup_size = [128, 2, 1] subgroup_size = 64, {gpu_pipeline_options = #iree_gpu.pipeline_options<prefetch_shared_memory = true>, llvm_func_attrs = {"amdgpu-waves-per-eu" = "2"}}>> -> !transform.any_param
    transform.yield %arg0, %0 : !transform.any_op, !transform.any_param
  }

  transform.named_sequence @__kernel_config(%arg0: !transform.any_op {transform.consumed}) -> !transform.any_op attributes {iree_codegen.tuning_spec_entrypoint} {
    %updated_root = transform.foreach_match in %arg0
        @match_matmul_reduce_32_1024_2048_dispatch_0_matmul_32x2048x1024_f16xf16xf32 -> @apply_op_config : (!transform.any_op) -> !transform.any_op
    transform.yield %updated_root : !transform.any_op
  }
}
```
#### Explanation
Tuning specs are transform dialect libraries that conform to the following format:

- All tuning spec entry points (named sequence ops) are marked with the `iree_codegen.tuning_spec_entrypoint` attribute. They have a single argument of type `!transform.any_op` and return a single value of type `!transform.any_op`.
- All entry points in the final tuning spec must either read (`transform.readonly`) or consume (`transform.consumed`) the argument.
- The `iree_codegen.tuning_spec_with_default_entrypoint` attribute ensures that the tuning spec includes a named sequence op named `__kernel_config`, which must contain exactly one `foreach_match` op. That `foreach_match` op must have exactly one argument and one result of type `!transform.any_op`.
The tuning spec above attempts to match `linalg.matmul` ops with the shape `32x1024x2048`, `f16` operand element types, and an `f32` result element type.
If the match succeeds, the tuning spec applies the `compilation_info` attribute that will drive code generation. This attribute is considered a compiler implementation detail; in general, each codegen pipeline has its own requirements as to what is considered a valid compilation info and how to interpret it.
Tuning specs are executed by the Materialize User Configs pass (`iree-codegen-materialize-user-configs`).
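One way to check that a spec took effect is to print the IR after that pass, reusing the flag from the earlier example (file names are illustrative; the IR is printed to stderr):

```shell
iree-compile model.mlir \
  --iree-hal-target-device=hip --iree-hip-target=gfx942 \
  --iree-codegen-tuning-spec-path=my_spec.mlir \
  --mlir-print-ir-after=iree-codegen-materialize-user-configs \
  -o model.vmfb 2> after_materialize.mlir
```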