# GPU debugging playbook
This page aims to provide general approaches and practical tips for debugging GPU compiler/runtime correctness/performance issues in IREE.
GPUs fundamentally have similar architectures and software stacks. We target GPUs from various vendors using different GPU APIs, but they share quite a lot of common infrastructure in IREE, so the approaches and tips here should be widely applicable.
Tips that are specific to a particular kind of problem, component, or GPU are prefixed with a tag to make that clear. The tags and the categories they represent are:

| Tag | Category |
|---|---|
| [correctness] | Correctness |
| [performance] | Performance |
| [hip] | AMD HIP/ROCm |
| [metal] | Apple Metal |
| [directx] | Microsoft DirectX |
| [cuda] | NVIDIA CUDA |
| [vulkan] | Vulkan |
## General methodology
The difficulty in debugging typically lies in isolating the problematic component and pinpointing the culprit. Once that is done, the solution usually follows naturally.
There are many components in the IREE stack; hierarchically we can categorize them into either the compiler or runtime bucket:
- For the compiler, there are multiple layers from top to bottom--frontend input importers, IREE flow/stream compilation, IREE host/device compilation, and the GPU backend in LLVM proper or the GPU driver compiler for SPIR-V.
- For the runtime, we have fewer layers--IREE HAL drivers and the GPU driver.
Any of the above components/layers can have bugs. It's important to reduce the potential surface area to make the problem more tractable.
Once we have a more isolated case, the general methodology to pinpoint the exact culprit is to
- collect and inspect the symptoms,
- form a hypothesis and run experiments to prove/refute it, and
- iterate.
### ... with shortcuts
The above procedure is meant for facing a large problem with no clue, for example, when bringing up a new model end-to-end via IREE.
Most of the time, though, we can leverage existing facilities to avoid going down
the full top-down hierarchical debugging procedure.
For example, for a regression on an existing model, CI or `git bisect`
might directly tell us the culprit commit.
### ... using tools
For issues with strong signals like crashing, it's also easier to pinpoint the exact culprit with dedicated tools--we can leverage various sanitizers or debuggers.
## Isolating the problematic component
If we are facing a large problem without a clear clue, we need to isolate the problematic compiler or runtime layer first, typically by comparing with a working solution:
[correctness/performance]
Sanitize the environment first. Asking these questions and making sure the environment is proper can sometimes save you hours of debugging:
- Did you recently update the GPU SDK or driver?
- Are others able to reproduce the issue?
    - If not, what SDK/driver versions are they using?
- Is your machine drawing enough power when benchmarking?
- Is your machine connected with a monitor (e.g., for Vulkan)?
- How long since you last rebooted your machine? 👻
- (Windows only) Did you change the `TdrDelay` value to something more lenient (e.g., 600 seconds)? `TdrDelay` (`HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers`) is the timeout for a response to a preempt request from the GPU scheduler. A large enough computation can outlive the default value of 2 seconds, leading to a GPU reset.
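For the `TdrDelay` tweak, a minimal sketch (run from an elevated Command Prompt; the 600-second value is just an example) is:

```shell
REM Set TdrDelay to 600 seconds; a reboot is needed for the change to take effect.
reg add "HKLM\System\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 600 /f
```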
[correctness/performance]
We have multiple GPU targets/drivers in IREE--LLVMGPU/CUDA, LLVMGPU/HIP, SPIR-V/Vulkan, SPIR-V/Metal.
For the same GPU, we typically have two paths to target, e.g., CUDA/HIP or Vulkan for NVIDIA/AMD GPUs, Metal or Vulkan for Apple GPUs.
If one path is correct/performant, we can diff against it to try to isolate the problem--the common/shared compiler/runtime code is likely okay; what differs between the paths is likely problematic.
[correctness/performance] [vulkan]
Vulkan supports different GPUs. Similarly, if one GPU gives correct/performant results, we can diff against it to find clues.
Even more compiler/runtime code is shared here; what's problematic is likely different capabilities triggering different CodeGen pipelines and thus revealing bugs in a particular CodeGen pipeline, or driver issues from a particular vendor.
[correctness]
If the CPU is working properly, we can use the same dispatch region formation and diff against the CPU dispatches one by one to isolate the problem. See this issue as an example.
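Before going dispatch by dispatch, a quick end-to-end sanity check can be to compile the same program for both CPU and GPU and diff the printed results. A sketch (module, function, input, and target flags are placeholders and may vary across IREE versions):

```shell
# Compile the same program for the llvm-cpu backend and for the GPU target.
iree-compile model.mlir --iree-hal-target-device=local \
  --iree-hal-local-target-device-backends=llvm-cpu -o model_cpu.vmfb
iree-compile model.mlir --iree-hal-target-device=hip \
  --iree-hip-target=<target> -o model_gpu.vmfb

# Run both with identical inputs and diff the printed outputs.
iree-run-module --device=local-task --module=model_cpu.vmfb \
  --function=main --input=2x2xf32=1.0 > cpu.txt
iree-run-module --device=hip --module=model_gpu.vmfb \
  --function=main --input=2x2xf32=1.0 > gpu.txt
diff cpu.txt gpu.txt
```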
[correctness]
Passing `--iree-flow-trace-dispatch-tensors` and/or `--iree-flow-break-dispatch=` to
`iree-compile` is quite helpful for inspecting the output after all/each
dispatch(es).
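For example, a sketch of dumping every dispatch's inputs/outputs at runtime (target and input flags are placeholders):

```shell
# Compile with tracing of all dispatch tensors enabled.
iree-compile model.mlir \
  --iree-hal-target-device=hip --iree-hip-target=<target> \
  --iree-flow-trace-dispatch-tensors \
  -o traced.vmfb

# Running the traced module prints each dispatch's operands and results.
iree-run-module --device=hip --module=traced.vmfb --function=main --input=2x2xf32=1.0
```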
[correctness]
`iree-reduce` is a great tool for reducing and isolating issues programmatically.
See more details here.
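The exact invocation is covered in the linked documentation; the key ingredient is an "interestingness" script that exits with 0 only when the bug still reproduces. A minimal sketch, assuming a runtime crash as the failure mode (paths and flags are placeholders):

```shell
#!/bin/bash
# interesting.sh <candidate.mlir>: exit 0 iff the candidate still reproduces the bug.
set -e
iree-compile "$1" --iree-hal-target-device=hip --iree-hip-target=<target> \
  -o /tmp/candidate.vmfb
# The failure mode here is a runtime crash; adapt the check to your bug
# (e.g., grep the printed outputs for a known-bad value instead).
! iree-run-module --device=hip --module=/tmp/candidate.vmfb --function=main \
  --input=2x2xf32=1.0
```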
## Pinpointing compiler issues
Once we have identified that the problem is due to a compiler issue, we can investigate by comparing different paths and inputs:
[correctness]
For the same dispatch, we may have different CodeGen pipelines, e.g., for matmul we can have a simple SIMT pipeline or pipelines using tensor/matrix cores. We can try to switch between different pipelines to isolate the problem.
[correctness]
Assuming we have a small repro, we can also try to see if there are "patterns" in the wrong result (e.g., this issue). Or mutate the input to see if the failure has some "consistency".
[correctness/performance]
`--mlir-print-ir-*` and `--debug*` flags to `iree-opt` are our best friends.
Sometimes it just takes eyeballing the IR between stages to find clues.
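For example, a sketch of dumping the IR after every pass to a file (the same flags also work with `iree-compile`; target flags are placeholders):

```shell
# Print the IR after every pass, at module scope so each dump is self-contained,
# with large constants elided; MLIR writes the dumps to stderr.
iree-compile model.mlir \
  --iree-hal-target-device=hip --iree-hip-target=<target> \
  --mlir-print-ir-after-all --mlir-print-ir-module-scope \
  --mlir-elide-elementsattrs-if-larger=8 \
  -o /dev/null 2> ir-dump.txt
```

Note that `--debug`/`--debug-only=` additionally require a compiler build with assertions enabled.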
[performance]
For identifying performance issues, we typically need to use:
- Tracy profiling to get coarse-grained command-buffer timing and understand which kernels are the most time-consuming (see the capture sketch after this list). Typical big performance issues include, but are not limited to, going down an incorrect CodeGen pipeline, missing tiling/vectorization, having an improper tiling/vectorization configuration, and so on. If the coarse-grained information is not enough, then we need to use
- vendor-specific tools to understand kernel-internal counters and identify the bottleneck.
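For the Tracy route, a minimal capture flow with the headless capture tool (a sketch, assuming a tracing-enabled runtime build; module/function/input are placeholders) might be:

```shell
# In one terminal: wait for a client connection and write the profile to a file.
iree-tracy-capture -o trace.tracy

# In another terminal: keep the instrumented program alive until the capture
# tool has connected, then run the workload.
TRACY_NO_EXIT=1 iree-run-module --device=hip --module=model.vmfb \
  --function=main --input=2x2xf32=1.0
```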
[correctness]
Some targets support the `gpu.printf` operation for printing out values from
within GPU code, and many of the targets that don't could support it with
some work in IREE or upstream MLIR.
[correctness]
If you suspect an issue in an LLVM backend, check the LLVM debugging playbook for general recommendations.
[hip] An occasional source of failures has been disagreement about the
code object version. Ensure that both the `amdhsa_code_object_version` module metadata
and `__oclc_ABI_version` are set and agree.
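A quick way to check is to grep the dumped LLVM IR (the `dump/` path below assumes `--iree-hal-dump-executable-files-to=dump` was passed, as in the substitution recipe later on this page):

```shell
# Both settings should be present and their values should agree.
grep -E "amdhsa_code_object_version|__oclc_ABI_version" dump/*.ll
```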
## Pinpointing runtime issues
On the other hand, if we suspect it's a runtime issue, here are some useful approaches and tips:
[correctness/performance]
Tracy profiling is a great way to view how the application runs dynamically. It can help to show problematic GPU API call sequences and performance bottlenecks.
- It requires adding `-DIREE_ENABLE_RUNTIME_TRACING=ON` during CMake configuration, or using the `IREE_PY_RUNTIME=tracy` environment variable when invoking the IREE runtime installed via Python packages.
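For example, a sketch of adding the flag to an otherwise normal CMake configuration, or selecting the instrumented Python runtime (paths and the script name are illustrative):

```shell
# Configure and build the runtime with Tracy instrumentation compiled in.
cmake -B ../iree-build/ -S . -DIREE_ENABLE_RUNTIME_TRACING=ON
cmake --build ../iree-build/

# With the Python packages installed, pick the Tracy-instrumented runtime at run time.
IREE_PY_RUNTIME=tracy python my_model_script.py
```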
[correctness]
GPU validation can sometimes give us hints:
- [metal] Enable validation via `export METAL_DEVICE_WRAPPER_TYPE=1`.
- [vulkan] Use `--vulkan_validation_layers=true` to `iree-run-module`, or
- [vulkan] Force enable via environment variables to the Vulkan loader: `export VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_standard_validation` (may additionally need `export VK_LAYER_PATH=$VULKAN_SDK/etc/vulkan/explicit_layer.d` and `export LD_LIBRARY_PATH=$VULKAN_SDK/lib` if the Vulkan SDK is not installed to a system path).
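For instance, a Vulkan run with validation layers enabled could look like this sketch (module/function/input are placeholders):

```shell
iree-run-module --device=vulkan --module=model.vmfb --function=main \
  --input=2x2xf32=1.0 --vulkan_validation_layers=true
```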
[correctness]
Turning on verbose output can give us more information:
- When compiling the IREE runtime, add `-DCMAKE_C_FLAGS=-DIREE_VM_EXECUTION_TRACING_FORCE_ENABLE=1` in the CMake configuration to enable VM op tracing.
- [vulkan] Use `--vulkan_debug_verbosity=4` to `iree-run-module`.
- [vulkan] Print all Vulkan API calls with detailed arguments: `export VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_api_dump` (may additionally need `export VK_LAYER_PATH=$VULKAN_SDK/etc/vulkan/explicit_layer.d` and `export LD_LIBRARY_PATH=$VULKAN_SDK/lib` if the Vulkan SDK is not installed to a system path).
[correctness]
Try different "debugging modes" provided by HAL drivers:
- [cuda] Switch `--cuda_use_streams=` between `true` and `false` to `iree-run-module` to see whether the issue comes from the stream or graph command buffer implementation.
- [cuda] Switch `--cuda_async_allocations=false` to `iree-run-module` to see if the issue comes from async allocation.
- [metal] Use `--metal_serial_command_dispatch=true`, `--metal_command_buffer_retain_resources=true`, or `--metal_resource_hazard_tracking=true` to `iree-run-module` to see if any of them "fixes" the issue. This can help isolate the potential problem.
- [vulkan] Use `--vulkan_robust_buffer_access=true` to `iree-run-module`, especially when seeing nondeterministic/corrupted buffer contents and suspecting buffer allocation/indexing issues.
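As an example, comparing the two CUDA command buffer implementations is a single-flag change (a sketch; module/function/input are placeholders):

```shell
# Graph-based command buffers.
iree-run-module --device=cuda --module=model.vmfb --function=main \
  --input=2x2xf32=1.0 --cuda_use_streams=false

# Stream-based command buffers.
iree-run-module --device=cuda --module=model.vmfb --function=main \
  --input=2x2xf32=1.0 --cuda_use_streams=true
```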
## Binary substitution for ROCm
[hip] The AMD ROCm target supports binary substitution on HSA code objects
(`.hsaco` files).
These files are, under the hood, ELF shared libraries containing kernel code.
If you have manually produced a binary you want to test, such as by manually
running `llc` with different optimization flags, you can turn the `.o` into
a `.hsaco` with:

```shell
ld.lld -o [filename].hsaco -shared [filename].o
```
In full, if you have a dispatch in `dispatch.mlir` and want to recompile it
while potentially making modifications, the process is:
```shell
# A PATH edit is not strictly required. It is used here to point out that the
# LLVM binaries used should be built from the same LLVM sources IREE uses.
export PATH="[build-directory]/llvm-project/bin:[build-directory]/tools/bin:$PATH"

iree-compile dispatch.mlir \
  --iree-hal-target-device=hip \
  --iree-hip-target=<target> \
  -o original.vmfb \
  --iree-hal-dump-executable-files-to=dump

# The opt flags are in dump/[...].optimized.ll.
opt -S -o - [opt flags] <dump/[...].linked.ll >altered.opt.ll

# The llc flags are in dump/[...].rocmasm.
# To produce an assembly file:
llc [llc flags] altered.opt.ll -o altered.rocmasm
# To produce an object file:
llc [llc flags] altered.opt.ll -o altered.o --filetype=obj

# Linking to an HSACO.
ld.lld -o altered.hsaco -shared altered.o

# Re-compile with substitution. [dispatch_name] is the name of the
# `hal.executable` op symbol, not the variant within it. This can
# be found by looking at the relevant configured_*.mlir file in dump/, for
# example.
iree-compile dispatch.mlir \
  --iree-hal-target-device=hip \
  --iree-hip-target=<target> \
  -o altered.vmfb \
  --iree-hal-substitute-executable-object=[dispatch_name]=altered.hsaco
```
If successful, `iree-compile` will print a message stating:

```
NOTE: hal.executable `[executable name]` substituted with object file at `altered.hsaco`
```

During each of these steps, you can change the flags or manually edit the `.ll`
(or even `.rocmasm`) files to attempt to get potentially-different behavior.
!!! note

    The binary substitution process could be used to replace a dispatch with a completely foreign implementation, such as one written in C, so long as the function names and argument handling schemes agree. If you do this, please document the steps here.