
Device profiling

IREE device profiling captures HAL-native records from the devices that execute your workload. A .ireeprof file can contain queue submissions, command buffer metadata, dispatch timings, host execution spans, memory lifecycle events, periodic device metrics, executable metadata, and backend-specific counter or trace artifacts.

Use device profiling when you need to answer questions below the VM invocation level:

  • Which HAL queue operations did this invocation issue?
  • Which dispatches, transfers, or allocations dominate device time?
  • Which executable export name should be optimized?
  • Did a benchmark replay spend time in useful HAL work, setup, copies, or host execution?
  • What hardware counters or executable trace artifacts did a backend collect for a selected dispatch?

Device profiling complements benchmarking and Tracy. Benchmarks tell you how long a workload takes. Tracy shows process-wide runtime behavior and CPU/GPU API interactions. Device profiling records the HAL/device work in a structured format that can be queried with iree-profile or exported for other tooling.

Capture with IREE tools

Tools that create HAL devices accept the --device_profiling_* flags. For example, capture queue and device queue events while running a module:

iree-run-module \
  --device=amdgpu \
  --module=/tmp/model.vmfb \
  --function=main \
  --input=@/tmp/inputs.txt \
  --device_profiling_mode=queue-events,device-queue-events \
  --device_profiling_output=/tmp/model.ireeprof

The same flags work with iree-benchmark-module:

iree-benchmark-module \
  --device=amdgpu \
  --module=/tmp/model.vmfb \
  --function=main \
  --benchmark_min_time=20x \
  --device_profiling_mode=queue-events,device-queue-events \
  --device_profiling_output=/tmp/model.ireeprof

When profiling a benchmark, prefer a fixed iteration count such as --benchmark_min_time=20x while first validating a workflow. A fixed count makes captures easier to compare across runs and avoids accidentally collecting a very large profile from a short microbenchmark.

The benchmark replay tool also accepts the same flags. That is the usual way to capture a profile from an already-recorded HAL workload:

iree-benchmark-replay \
  --device=amdgpu \
  --benchmark_min_time=50x \
  --device_profiling_mode=queue-events,device-queue-events \
  --device_profiling_output=/tmp/model-replay.ireeprof \
  /tmp/model.ireereplay

iree-benchmark-replay flushes profile data outside the timed benchmark iteration. The .ireeprof still describes the replayed HAL work, but profile serialization is not charged to the benchmark timing.

Profiling modes

--device_profiling_mode takes a comma-separated list of HAL profiling data families. The selected HAL driver decides which families it supports and must fail loudly when an unsupported family is requested.

The available modes and the records each one requests:

  • queue-events: Host-timestamped queue operation records, dependency strategy, operation counts, and transfer byte totals.
  • host-execution: Host execution spans, such as local dispatch bodies or host-side command buffer replay.
  • device-queue-events: Device-timestamped queue operation spans.
  • dispatch-events: Device-timestamped dispatch execution records.
  • memory-events: HAL memory allocation, reservation, pool, and buffer lifecycle records.
  • device-metrics: Periodic device metrics such as clocks, temperature, power, memory occupancy, utilization, or bandwidth, when supported.
  • counters: Explicitly selected hardware or software counter samples; requires one or more --device_profiling_counter= flags.
  • executable-metadata: Executable, code object, and export metadata needed for offline analysis.
  • executable-traces: Heavy executable trace artifacts, such as AMDGPU ATT/SQTT traces, for selected operations.

Use the narrowest mode set that captures the data you need. Some modes are cheap metadata streams; others can insert device packets, allocate large trace buffers, or perturb the workload enough that the resulting timing should not be treated as ordinary benchmark data.
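The comma-separated mode list maps onto a set of data families, and unsupported requests must be rejected loudly. A minimal Python sketch of that parsing contract (the mode names come from the table above; the parser itself is an illustration, not IREE's implementation):

```python
# Hypothetical sketch of validating a --device_profiling_mode value.
# Mode names come from the table above; the parser is illustrative only.
KNOWN_MODES = {
    "queue-events", "host-execution", "device-queue-events",
    "dispatch-events", "memory-events", "device-metrics",
    "counters", "executable-metadata", "executable-traces",
}

def parse_profiling_modes(value: str) -> set[str]:
    """Split a comma-separated mode list, failing loudly on unknown names."""
    modes = {m.strip() for m in value.split(",") if m.strip()}
    unknown = modes - KNOWN_MODES
    if unknown:
        raise ValueError(f"unsupported profiling mode(s): {sorted(unknown)}")
    return modes

print(sorted(parse_profiling_modes("queue-events,device-queue-events")))
# → ['device-queue-events', 'queue-events']
```

Failing on unknown names, rather than silently ignoring them, matches the "fail loudly" requirement: a typo in a mode name should abort the capture instead of producing a profile that is quietly missing a data family.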

Useful capture flags

Common filters select which operations emit expensive artifacts while leaving cheap metadata available for decoding:

--device_profiling_filter_export='*matmul*'
--device_profiling_filter_command_buffer=3
--device_profiling_filter_command_index=12
--device_profiling_filter_physical_device=0
--device_profiling_filter_queue=1

For counter capture, select the counters mode and pass backend-specific counter names:

iree-benchmark-module \
  --device=amdgpu \
  --module=/tmp/model.vmfb \
  --function=main \
  --benchmark_min_time=20x \
  --device_profiling_mode=counters \
  --device_profiling_counter=SQ_WAVES \
  --device_profiling_counter=SQ_BUSY_CYCLES \
  --device_profiling_filter_export='*matmul*' \
  --device_profiling_output=/tmp/model-counters.ireeprof

Long-running workloads can request periodic flushes:

--device_profiling_flush_interval_ms=1000

Use periodic flushing only when the backend documents that in-flight snapshots are safe for the selected data families. A flush may be a no-op for producers that do not buffer completed records.

For a quick aggregate report without writing a .ireeprof, use:

--print_device_statistics

This starts the backend's lightweight statistics mode and prints aggregate device statistics at shutdown. It cannot be combined with --device_profiling_output.

External profiler/tool capture flags are separate from HAL-native .ireeprof output:

--device_capture_tool=renderdoc
--device_capture_file=/tmp/frame.rdc
--device_capture_label=warmup-frame

These flags control provider-specific artifacts such as RenderDoc captures or Metal GPU traces. Success does not imply that a .ireeprof bundle was written.

Capture from C

Embedding applications can capture the same profile bundles directly through the HAL API. Create a sink, begin profiling on the devices you want to observe, run the workload, and end profiling.

#include "iree/hal/api.h"
#include "iree/hal/utils/profile_file.h"
#include "iree/io/file_handle.h"

iree_io_file_handle_t* file_handle = NULL;
IREE_RETURN_IF_ERROR(iree_io_file_handle_create(
    IREE_IO_FILE_MODE_WRITE | IREE_IO_FILE_MODE_SEQUENTIAL_SCAN |
        IREE_IO_FILE_MODE_SHARE_READ,
    IREE_SV("/tmp/model.ireeprof"), /*initial_size=*/0, host_allocator,
    &file_handle));

iree_hal_profile_sink_t* sink = NULL;
iree_status_t status =
    iree_hal_profile_file_sink_create(file_handle, host_allocator, &sink);
iree_io_file_handle_release(file_handle);
IREE_RETURN_IF_ERROR(status);

iree_hal_device_profiling_options_t options = {0};
options.data_families =
    IREE_HAL_DEVICE_PROFILING_DATA_QUEUE_EVENTS |
    IREE_HAL_DEVICE_PROFILING_DATA_DEVICE_QUEUE_EVENTS;
options.sink = sink;

status = iree_hal_device_profiling_begin(device, &options);
if (iree_status_is_ok(status)) {
  /* Run the workload while profiling is active. */
  status = iree_hal_device_profiling_end(device);
}
iree_hal_profile_sink_release(sink);
IREE_RETURN_IF_ERROR(status);

For multi-device applications, begin profiling on each HAL device that participates in the workload and end profiling after all captured work has completed. Unless the selected backend documents dynamic toggling support, externally serialize iree_hal_device_profiling_begin, iree_hal_device_profiling_flush, and iree_hal_device_profiling_end with queue submission and command buffer recording on the same device. In practice, start profiling before issuing the workload and end it after the device is idle or after the application-visible synchronization point you care about.

Inspect with iree-profile

iree-profile is the command line reader for .ireeprof bundles:

iree-profile summary /tmp/model.ireeprof

Example summary output:

IREE HAL profile summary
records: file=8 session_begin=1 chunks=6 session_end=1 non_ok_session_end=0 unknown=0
chunks: devices=1 queues=1 executables=1 executable_exports=1 command_buffers=0
event_records: queue_events=3 host_execution_events=4 host_execution_duration_ns=68574
devices:
  device[0]: device_records=1 queues=1/1
    dispatches=0 valid=0 invalid=0

Use projection commands to enter the profile from the object you care about:

iree-profile explain /tmp/model.ireeprof
iree-profile dispatch /tmp/model.ireeprof
iree-profile queue --format=jsonl /tmp/model.ireeprof
iree-profile command --id=3 --format=jsonl --dispatch_events \
  /tmp/model.ireeprof
iree-profile memory --format=jsonl /tmp/model.ireeprof
iree-profile executable --filter='*matmul*' /tmp/model.ireeprof

The statistics command is useful for compact scripts:

iree-profile statistics /tmp/model.ireeprof

Example statistics output:

IREE HAL device statistics:
  aggregate_rows=10
  dispatch_export_total=0 ns
  host_execution_queue_total=68.574 us
  host_queue p=0 q=0 alloca  count=1 total=24.027 us avg=24.027 us operations=1 payload=8B
  host_execute abs_dispatch_0_elementwise_2_f32  count=1 total=681 ns avg=681 ns

For automation, prefer JSONL projections:

iree-profile dispatch --format=jsonl /tmp/model.ireeprof | \
  jq 'select(.type=="dispatch_group") | {key,avg_ns,count}'

iree-profile queue --format=jsonl /tmp/model.ireeprof | \
  jq 'select(.type=="queue_event" or .type=="queue_submission")'

iree-profile export --format=ireeperf-jsonl \
  --output=/tmp/model.ireeperf.jsonl /tmp/model.ireeprof

Report JSONL rows are keyed by type and are intended for command-local drilldown. export --format=ireeperf-jsonl emits a schema-versioned interchange stream keyed by record_type; use that for long-lived downstream adapters and telemetry imports.
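As a concrete example of consuming report JSONL in a script, here is a hedged Python sketch that ranks dispatch_group rows by total device time. The type, key, avg_ns, and count field names mirror the jq filter above; the sample rows themselves are made up for illustration:

```python
import json

# Sample JSONL rows. Field names (type/key/avg_ns/count) follow the jq
# filter shown above; the values are invented for this example.
sample_jsonl = """\
{"type": "dispatch_group", "key": "main_dispatch_0", "avg_ns": 52000, "count": 10}
{"type": "queue_event", "queue": 0}
{"type": "dispatch_group", "key": "main_dispatch_1", "avg_ns": 4000, "count": 100}
"""

def top_dispatches(lines: str) -> list[dict]:
    """Rank dispatch_group rows by total device time (avg_ns * count)."""
    rows = [json.loads(line) for line in lines.splitlines() if line]
    groups = [r for r in rows if r.get("type") == "dispatch_group"]
    return sorted(groups, key=lambda g: g["avg_ns"] * g["count"], reverse=True)

for g in top_dispatches(sample_jsonl):
    print(g["key"], g["avg_ns"] * g["count"])
# → main_dispatch_0 520000
# → main_dispatch_1 400000
```

Ranking by avg_ns * count rather than avg_ns alone surfaces cheap dispatches that dominate through sheer repetition, which is usually what you want when deciding what to optimize.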

Render external timelines

iree-profile-render converts the durable ireeperf-jsonl interchange stream into formats used by external viewers. Today the primary renderer emits native Perfetto TrackEvent .pftrace files:

iree-profile export --format=ireeperf-jsonl \
  --output=/tmp/model.ireeperf.jsonl /tmp/model.ireeprof

uvx --with perfetto --with protobuf python "$(command -v iree-profile-render)" \
  --format=perfetto /tmp/model.ireeperf.jsonl -o /tmp/model.pftrace

The renderer is a Python tool with optional format-specific dependencies. The command itself is shipped with IREE, but the Perfetto Python packages are not a runtime dependency. Use the uvx --with ... form above for a one-shot render, or install perfetto and protobuf into the active Python environment and run iree-profile-render directly.

For pipelines, stream the exporter into the renderer:

iree-profile export --format=ireeperf-jsonl --output=- /tmp/model.ireeprof | \
  uvx --with perfetto --with protobuf python "$(command -v iree-profile-render)" \
    --format=perfetto - -o /tmp/model.pftrace

Open the resulting .pftrace in the Perfetto UI to inspect host queue events, device queue spans, dispatch lanes, host execution spans, memory events, device metrics, counters, and relationship flows when those records were captured.

Compose with replay

Device profiling and device replay are intentionally separate:

  • .ireereplay says what HAL work to run.
  • .ireeprof says what happened while a run executed.

A common workflow is:

iree-run-module \
  --device=amdgpu \
  --module=/tmp/model.vmfb \
  --function=main \
  --input=@/tmp/inputs.txt \
  --device_replay_output=/tmp/model.ireereplay

iree-benchmark-replay \
  --device=amdgpu \
  --benchmark_min_time=50x \
  --device_profiling_mode=queue-events,device-queue-events \
  --device_profiling_output=/tmp/model-replay.ireeprof \
  /tmp/model.ireereplay

iree-profile explain /tmp/model-replay.ireeprof

This captures the application once, then profiles deterministic replayed HAL work as many times as needed while iterating on drivers, devices, or executable substitution.

Appendix: ATT and executable traces

When built with AMDGPU profiling support, iree-profile att can decode AMDGPU ATT/SQTT trace artifacts embedded in a .ireeprof bundle:

iree-benchmark-module \
  --device=amdgpu \
  --module=/tmp/model.vmfb \
  --function=main \
  --benchmark_min_time=20x \
  --device_profiling_mode=executable-traces,dispatch-events,executable-metadata \
  --device_profiling_filter_export='*matmul*' \
  --device_profiling_output=/tmp/model-att.ireeprof

iree-profile att \
  --rocm_library_path=/opt/rocm/lib \
  --filter='*matmul*' \
  /tmp/model-att.ireeprof

Executable traces are heavy and should be captured with a narrow filter. If --rocm_library_path is omitted, the tool falls back to IREE_HAL_AMDGPU_LIBAQLPROFILE_PATH, IREE_HAL_AMDGPU_LIBHSA_PATH, and then the system dynamic library search path.
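The fallback order above can be sketched as a small resolver. A hypothetical Python illustration (the environment variable names come from the text; the resolver function itself is not IREE code):

```python
import os

def resolve_rocm_library_path(flag_value=None, env=os.environ):
    """Pick the ROCm library search path using the documented fallback order:
    the explicit --rocm_library_path flag, then the IREE_HAL_AMDGPU_*
    environment variables, then the system dynamic library search path
    (represented here as None)."""
    if flag_value:
        return flag_value
    for var in ("IREE_HAL_AMDGPU_LIBAQLPROFILE_PATH",
                "IREE_HAL_AMDGPU_LIBHSA_PATH"):
        if env.get(var):
            return env[var]
    return None  # fall through to the system search path

print(resolve_rocm_library_path("/opt/rocm/lib", env={}))  # → /opt/rocm/lib
```

Passing the environment explicitly keeps the fallback order easy to test and makes it obvious which variable won when several are set.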

Appendix: Agent-oriented help

The profile tool can print a compact Markdown playbook for humans or agents:

iree-profile --agents_md

Use this when building scripts around iree-profile or when you need the current command list, JSONL row families, and cross-reference recipes from the exact binary in your build tree.