Benchmarking
IREE uses benchmarks to inspect performance at varying levels of granularity. Benchmarking is implemented using the Google Benchmark library. To understand performance details and guide optimization, please refer to the IREE profiling documentation.
Module Benchmarks
iree-benchmark-module is a program accepting (almost) the same inputs as
iree-run-module that will benchmark the invocation of a single entry function.
It measures timing for the whole process of invoking a function through the VM,
including allocating and freeing output buffers. This is a high-level benchmark
of an entire invocation flow. It provides a big picture view, but depends on
many different variables, like an integration test. For finer-grained
measurements more akin to unit tests, see Executable Benchmarks.
To use iree-benchmark-module, generate an IREE module for the target backend:
$ bazel run //tools:iree-compile -- \
--iree-hal-target-backends=vmvx \
$PWD/samples/models/simple_abs.mlir \
-o /tmp/module.fb
and then benchmark an exported function in that module:
$ bazel run //tools:iree-benchmark-module -- \
--module=/tmp/module.fb \
--device=local-task \
--function=abs \
--input=f32=-2
You'll see output like:
Run on (12 X 4500 MHz CPU s)
CPU Caches:
L1 Data 32K (x6)
L1 Instruction 32K (x6)
L2 Unified 1024K (x6)
L3 Unified 8448K (x1)
Load Average: 2.21, 1.93, 3.34
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may
be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------
BM_RunModule/process_time/real_time 0.22 ms 0.23 ms 3356
Notice that there are a few warnings in there (you may not see all of these).
The benchmark library helpfully warns about some common issues that will affect
benchmark timing. When trying to obtain real benchmark numbers, you should
generally use an optimized build (-c opt in Bazel) and disable CPU scaling.
bazel build -c opt //tools:iree-benchmark-module
Also consider what else is running on the machine: other programs compete for CPU and memory and add noise to the measurements. Bazel itself runs a server even when it's not being actively invoked that can be quite a memory hog, so we'll instead invoke the binary directly. Use your favorite process manager (e.g. htop or pkill on Linux) to kill heavy-weight programs such as Chrome and Bazel.
Now we'll actually invoke the binary:
$ ./bazel-bin/tools/iree-benchmark-module \
--module=/tmp/module.fb \
--device=local-task \
--function=abs \
--input=f32=-2
Run on (12 X 4500 MHz CPU s)
CPU Caches:
L1 Data 32K (x6)
L1 Instruction 32K (x6)
L2 Unified 1024K (x6)
L3 Unified 8448K (x1)
Load Average: 1.49, 3.42, 3.49
------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------
BM_RunModule/process_time/real_time 0.011 ms 0.014 ms 61654
Remember to restore CPU scaling when you're done.
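The effect of the optimized build can be put in numbers by dividing the two reported times. A minimal sketch, where speedup is a helper name made up for this example and the inputs are the Time values copied by hand from the two runs above:

```shell
# Print how many times faster the second timing is than the first.
# Both arguments must be in the same unit (here: ms).
speedup() {
  awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1fx\n", before / after }'
}

# Debug build under `bazel run` vs. the optimized binary invoked directly.
speedup 0.22 0.011
```

For these two runs the helper prints 20.0x. The ratio only compares a single sample of each configuration, so treat it as a rough indication rather than a measurement.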
Executable Benchmarks
We also benchmark the performance of individual parts of the IREE system in
isolation. IREE breaks a model down to dispatch functions. To benchmark all the
dispatch functions, generate an IREE module with the
--iree-flow-export-benchmark-funcs flag set:
$ build/tools/iree-compile \
--iree-input-type=stablehlo \
--iree-flow-export-benchmark-funcs \
--iree-hal-target-backends=vmvx \
tests/e2e/stablehlo_models/fullyconnected.mlir \
-o /tmp/fullyconnected.vmfb
and then benchmark all exported dispatch functions (and all exported functions) in that module:
$ build/tools/iree-benchmark-module \
--module=/tmp/fullyconnected.vmfb \
--device=local-task
If no --function is specified, iree-benchmark-module will register a
benchmark for each exported function that takes no inputs.
You will see output like:
Run on (72 X 3700 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x36)
L1 Instruction 32 KiB (x36)
L2 Unified 1024 KiB (x36)
L3 Unified 25344 KiB (x2)
Load Average: 4.39, 5.72, 6.76
---------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------------------
BM_main_ex_dispatch_0_benchmark/process_time/real_time 0.030 ms 0.037 ms 34065
BM_main_ex_dispatch_1_benchmark/process_time/real_time 0.034 ms 0.042 ms 20567
BM_main_ex_dispatch_2_benchmark/process_time/real_time 0.043 ms 0.051 ms 18576
BM_main_ex_dispatch_3_benchmark/process_time/real_time 0.029 ms 0.036 ms 21345
BM_main_ex_dispatch_4_benchmark/process_time/real_time 0.042 ms 0.051 ms 15880
BM_main_ex_dispatch_5_benchmark/process_time/real_time 0.030 ms 0.037 ms 17854
BM_main_ex_dispatch_6_benchmark/process_time/real_time 0.043 ms 0.052 ms 14919
BM_main_benchmark/process_time/real_time 0.099 ms 0.107 ms 5892
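To see at a glance which dispatch dominates, the result rows can be sorted by their Time column. This is only a sketch: slowest_first is a name invented here, and it assumes every result row starts with BM_ and reports its time in the same unit, as in the output above.

```shell
# Read benchmark output on stdin and print "<time> <unit> <name>" for each
# result row, slowest first. Assumes rows start with "BM_" and share one unit.
slowest_first() {
  awk '/^BM_/ { print $2, $3, $1 }' | sort -k1,1nr
}

# Possible usage (paths as in the commands above):
#   build/tools/iree-benchmark-module \
#     --module=/tmp/fullyconnected.vmfb \
#     --device=local-task | slowest_first
```

Header and context lines are dropped because they do not match the BM_ prefix, so the function can be fed the benchmark output unmodified.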
Bytecode Module Benchmarks
Normally, the IREE VM is expected to be integrated into applications and to
drive model execution, so its performance is of crucial importance. We strive to
introduce as little overhead as possible and have several benchmark binaries
dedicated to evaluating the VM's performance. These benchmark binaries are
named *_benchmark in the iree/vm/ directory. Like the benchmarks above, they
use the Google Benchmark library.
CPU Configuration
When benchmarking, it's important to consider the configuration of your CPUs. Most notably, CPU scaling can give variable results, so you'll usually want to disable it. This can get pretty complex, but the most basic thing to do is to run all CPUs at maximum frequency. The other thing to consider is what CPU(s) your program is running on. Both of these get more complicated on mobile and in multithreaded workloads.
Linux
Google Benchmark provides some instructions. Note that the library will print "CPU scaling is enabled" warnings for any configuration that doesn't have the scaling governor set to performance. Similarly, the CPU frequency it reports is the maximum frequency of cpu0, not the frequency of the processor it's actually running on. More advanced configurations, such as pinning to one core at a fixed frequency, should therefore ignore these messages.
Turn off CPU scaling before benchmarking:
sudo cpupower frequency-set --governor performance
Restore CPU scaling after benchmarking:
sudo cpupower frequency-set --governor powersave
To learn more about different scaling governor settings, see
https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt. To restrict
which CPUs you run on, use the taskset command, which takes a hexadecimal mask.
To only run on the lowest-numbered CPU you can run:
taskset 1 sleep 20 &
You can confirm that the process is running on the given CPU:
ps -o psr $!
Note that $! indicates the process ID of the most recently started background
command, so this shorthand only works if you haven't started another background
command since the sleep. For more info on taskset, see
https://linux.die.net/man/1/taskset.
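Because the mask is a bit field, the mask for a single CPU is 1 shifted left by that CPU's index, written in hexadecimal. A small sketch of that arithmetic (cpu_mask is a hypothetical helper, not part of taskset):

```shell
# Print the hexadecimal taskset mask that selects only CPU number $1.
cpu_mask() {
  printf '%x\n' "$((1 << $1))"
}

cpu_mask 0   # lowest-numbered CPU, prints 1
cpu_mask 7   # eighth CPU, prints 80
```

The result can be passed straight to taskset, for example taskset "$(cpu_mask 7)" sleep 20 &. Masks for several CPUs are the bitwise OR of the individual masks.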
Android
Read and understand the Linux instructions first.
Android doesn't give us quite as nice tooling, but the principle is basically the same. One important difference is that thermal throttling is a much bigger concern on mobile. Without a cooling plate, high clock speeds will likely overheat the device and engage thermal throttling, which overrides whatever clock speeds you have set in order to keep the device from overheating. Therefore the naive approach above (running all CPUs at maximum frequency) is likely not a good idea.
You will likely need to be root (use su or adb root). The commands will
depend on your exact phone and number of cores. First play around and make sure
you understand what everything means. Note that each CPU has its own files which
are used to control its behavior, but changes to a single CPU will sometimes
affect others (see /sys/devices/system/cpu/cpu0/cpufreq/affected_cpus).
Some useful files:
/proc/cpuinfo
/sys/devices/system/cpu/possible
/sys/devices/system/cpu/present
/sys/devices/system/cpu/cpu0/online
/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/affected_cpus
/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
See the clock speed of each CPU (current, minimum, and maximum):
$ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \
paste \
"/sys/devices/system/cpu/cpu${i?}/cpufreq/cpuinfo_cur_freq" \
"/sys/devices/system/cpu/cpu${i?}/cpufreq/cpuinfo_min_freq" \
"/sys/devices/system/cpu/cpu${i?}/cpufreq/cpuinfo_max_freq"; \
done
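The loop above assumes that the present file holds a single range such as 0-7. The kernel's CPU list format also allows a lone index or comma-separated ranges (for example 0-3,5), which the tr and seq combination would mishandle. A sketch of a more tolerant expansion, with expand_cpu_list being a name made up for this example:

```shell
# Expand a kernel CPU list such as "0-3,5" into one index per line.
expand_cpu_list() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS=- read -r first last; do
    # A lone index has no upper bound, so fall back to the index itself.
    seq "$first" "${last:-$first}"
  done
}

# Possible usage in the loops of this section:
#   for i in $(expand_cpu_list "$(cat /sys/devices/system/cpu/present)"); do
#     cat "/sys/devices/system/cpu/cpu${i?}/cpufreq/scaling_governor"
#   done
```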
Before changing things, make sure to check the current scaling governor settings first so you can put them back when you're done.
$ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \
cat "/sys/devices/system/cpu/cpu${i?}/cpufreq/scaling_governor"; \
done
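Rather than noting the governors by hand, they can be written to a file and replayed later. The following is a sketch under assumptions: save_governors, restore_governors, and the CPU_SYSFS_ROOT override are names invented for this example, and writing to the real sysfs files needs root.

```shell
# Root of the per-CPU sysfs tree. Overridable so the functions can be tried
# against a scratch directory; CPU_SYSFS_ROOT is a name made up for this sketch.
CPU_SYSFS_ROOT="${CPU_SYSFS_ROOT:-/sys/devices/system/cpu}"

# Write one "<governor file> <governor>" line per CPU to the file named by $1.
save_governors() {
  for f in "$CPU_SYSFS_ROOT"/cpu[0-9]*/cpufreq/scaling_governor; do
    if [ -r "$f" ]; then
      printf '%s %s\n' "$f" "$(cat "$f")"
    fi
  done > "$1"
}

# Write each saved governor back to the file it was read from.
restore_governors() {
  while read -r f gov; do
    printf '%s\n' "$gov" > "$f"
  done < "$1"
}
```

A session would call save_governors with a scratch file before changing anything and restore_governors with the same file afterwards. Saving per CPU keeps the original values even when different cores use different governors.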
Single-Core Example
Here's an example to run IREE in a single-threaded context on CPU 7 at its lowest clock speed.
First we'll take control of the clock speed by setting the governor to "userspace".
$ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \
echo userspace > \
"/sys/devices/system/cpu/cpu${i?}/cpufreq/scaling_governor"; \
done
We can now set individual clock speeds. We'll pin cpu7 to its minimum frequency. We choose the minimum instead of the maximum here to mitigate thermal throttling concerns:
$ cat /sys/devices/system/cpu/cpu7/cpufreq/cpuinfo_min_freq > \
/sys/devices/system/cpu/cpu7/cpufreq/scaling_setspeed
We can confirm the frequencies of all the CPUs by running the same command
above. Now to run a command specifically on cpu7, use taskset 80
(hexadecimal 80 is binary 10000000, so only bit 7 is set):
taskset 80 sleep 20 &
ps -o psr $!
Remember to clean up when you're done! Here we'll set the scaling governor back to schedutil because that's what it was before on the particular device this was tested on, but that may not exist on all devices.
$ for i in `cat /sys/devices/system/cpu/present | tr '-' ' ' | xargs seq`; do \
echo schedutil > \
"/sys/devices/system/cpu/cpu${i?}/cpufreq/scaling_governor"; \
done