Device replaylink
IREE device replay records the HAL work issued by an application into a
.ireereplay file. The replay can then be executed, benchmarked, profiled, or
dumped without running the original application or VM invocation.
Replay is a HAL-level reproducer. It is not a module-level recording of VM inputs and outputs, and it is not a profiler by itself. It records enough of the HAL resource graph and operation stream to reproduce useful device work:
- the retained
iree_hal_device_group_ttopology; - devices, allocators, executable caches, executables, command buffers, semaphores, buffers, and files;
- host-visible byte payloads written through HAL map, unmap, update, and file read operations;
- executable payloads and executable metadata;
- command buffer operations, direct queue operations, and synchronization edges;
- external file references for large stable inputs such as parameter archives.
Replay and device profiling compose cleanly:
.ireereplay says what HAL work to run, and .ireeprof says what happened
while a run executed. A common performance loop is "capture once, replay many
times, profile the replay."
Capture with IREE toolslink
iree-run-module and iree-benchmark-module can wrap the resolved HAL device
group and write a replay stream:
iree-run-module \
--device=amdgpu \
--module=/tmp/model.vmfb \
--function=main \
--input=@/tmp/inputs.txt \
--parameters=model=/models/llama.irpa \
--device_replay_output=/tmp/model.ireereplay \
--device_replay_file_policy=reference
The replay recorder is shared by all devices in the selected device group, so multi-device host calls are emitted into one ordered stream. Device-visible ordering is still the recorded semaphore, event, command buffer, and barrier graph; replay does not infer FIFO ordering from host record order.
Capture from iree-benchmark-module the same way:
iree-benchmark-module \
--device=amdgpu \
--module=/tmp/model.vmfb \
--function=main \
--benchmark_min_time=20x \
--device_replay_output=/tmp/model.ireereplay \
--device_replay_file_policy=reference
The capture includes the HAL work issued by the benchmark run. The recorder is closed after the tool's HAL work completes, so the file header contains the final logical length when the process exits successfully.
When capture is enabled, iree-run-module records standard replay scopes named
init, execute, and deinit. These are metadata markers in the replay
stream. They let replay tools benchmark or query individual phases while still
preserving the complete capture around them.
File policieslink
Large models often use external parameter files. The replay recorder must avoid turning every capture into a copy of a 40 GB or 1 TB parameter archive, while still making it clear what storage the replay depends on.
--device_replay_file_policy= controls imported fd-backed HAL files:
| Policy | Behavior | Use when |
|---|---|---|
reference |
Record the external path and validation metadata. Do not copy file bytes. | The referenced file will be preserved beside the capture. This is the default and the right policy for large .irpa files. |
capture-ranges |
Embed only byte ranges read through HAL queue_read operations. Replay substitutes those reads with queue updates. |
You need a hermetic correctness replay and do not need to benchmark storage-backed reads. |
capture-all |
Embed every byte of each fd-backed file. | Files are small, or the external files cannot be preserved. This can make captures enormous. |
fail |
Reject fd-backed files. Host-allocation-backed files are still embedded inline. | Tests must prove they do not depend on external files. |
--device_replay_file_validation= controls validation for referenced files:
| Validation | Behavior | Cost |
|---|---|---|
identity |
Record cheap platform identity metadata such as file length, device, inode, and modification time. | Default. Does not scan file contents. |
digest |
Record and validate a content digest. | Opt-in only. Reads every byte during capture and replay. |
Use digest only when referenced files will be copied or staged to a different
filesystem and platform identity cannot be preserved. For very large parameter
files, digest validation can dominate capture and replay setup time.
Capture from Clink
Applications that already work through the HAL should wrap their retained
iree_hal_device_group_t. The wrapped group preserves topology order and
contains replacement devices that record operations before forwarding to the
real devices.
#include "iree/hal/replay/recorder.h"
#include "iree/io/file_handle.h"
iree_io_file_handle_t* file_handle = NULL;
IREE_RETURN_IF_ERROR(iree_io_file_handle_create(
IREE_IO_FILE_MODE_WRITE | IREE_IO_FILE_MODE_SEQUENTIAL_SCAN |
IREE_IO_FILE_MODE_SHARE_READ,
IREE_SV("/tmp/model.ireereplay"), /*initial_size=*/0, host_allocator,
&file_handle));
iree_hal_replay_recorder_options_t options =
iree_hal_replay_recorder_options_default();
options.external_file_policy =
IREE_HAL_REPLAY_RECORDER_EXTERNAL_FILE_POLICY_REFERENCE;
options.external_file_validation =
IREE_HAL_REPLAY_RECORDER_EXTERNAL_FILE_VALIDATION_IDENTITY;
iree_hal_replay_recorder_t* recorder = NULL;
iree_status_t status = iree_hal_replay_recorder_create(
file_handle, &options, host_allocator, &recorder);
iree_io_file_handle_release(file_handle);
IREE_RETURN_IF_ERROR(status);
iree_hal_device_group_t* replay_group = NULL;
status = iree_hal_replay_wrap_device_group(recorder, base_group,
host_allocator, &replay_group);
if (iree_status_is_ok(status)) {
status =
iree_hal_replay_recorder_scope_begin(recorder, IREE_SV("prefill"));
if (iree_status_is_ok(status)) {
/* Use replay_group for the application phase being captured. */
status = iree_status_join(
status,
iree_hal_replay_recorder_scope_end(recorder, IREE_SV("prefill")));
}
iree_hal_device_group_release(replay_group);
}
status = iree_status_join(status, iree_hal_replay_recorder_close(recorder));
iree_hal_replay_recorder_release(recorder);
IREE_RETURN_IF_ERROR(status);
Close the recorder after all HAL work and host-visible writes have reached their HAL boundaries. Closing writes the final file length and reports any terminal recorder failure instead of silently producing a partial replay.
Run a replaylink
Use iree-run-replay to execute a capture once:
iree-run-replay --device=amdgpu /tmp/model.ireereplay
The target device group must match the captured topology closely enough for the recorded HAL operations. A mismatch in device count or unsupported operation is a hard failure. That is intentional: silently skipping unsupported HAL work would produce a misleading reproducer.
If the capture references files from a different mount root, remap the prefix before replay opens them:
iree-run-replay \
--device=amdgpu \
--replay_file_remap=/mnt/capture=/mnt/replay \
/tmp/model.ireereplay
The remapped file must still satisfy the recorded validation metadata.
Benchmark and profile a replaylink
Use iree-benchmark-replay to measure deterministic replayed HAL work:
iree-benchmark-replay \
--device=amdgpu \
--benchmark_min_time=50x \
/tmp/model.ireereplay
Example benchmark output:
-------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------
BM_replay/process_time/real_time 0.138 ms 0.138 ms 3 items_per_second=7.23988k/s
If the capture contains replay scope markers, use --replay_scope= to report
only a selected phase. The complete replay still executes every iteration, but
the benchmark row uses manual timing accumulated between matching scope
begin/end markers:
iree-benchmark-replay \
--device=amdgpu \
--benchmark_min_time=50x \
--replay_scope=execute \
/tmp/model.ireereplay
Example scoped benchmark output:
-----------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------
BM_replay/scope:execute/process_time/manual_time ... ms ... ms ... items_per_second=...
Add profiling flags to collect a .ireeprof bundle from the benchmarked replay
iterations:
iree-benchmark-replay \
--device=amdgpu \
--benchmark_min_time=50x \
--device_profiling_mode=queue-events,device-queue-events \
--device_profiling_output=/tmp/model-replay.ireeprof \
/tmp/model.ireereplay
iree-profile explain /tmp/model-replay.ireeprof
iree-benchmark-replay performs profile flushes outside the timed region. The
profiling output describes the useful replay work without charging profile
serialization to each benchmark iteration.
Dump and query a replaylink
Use text mode for a human-readable summary and record stream:
iree-dump-replay --format=text /tmp/model.ireereplay
Example output:
IREE HAL replay v1.0
file_length: 9107
header_length: 24
summary:
hermetic: yes
environment_referenced: no
strict_replay_supported: yes
records: total=42 objects=12 operations=29 unsupported=0
scopes: begin=1 end=1
files: total=0 external=0 inline=0 ranges=0 unknown=0
records:
@408 #4 operation dev=1 obj=1 rel=3 thread=0 status=OK object=device(1) op=device.create_executable_cache(9)
@920 #9 operation dev=1 obj=3 rel=4 thread=0 status=OK object=executable_cache(6) op=executable_cache.prepare_executable(402)
@5115 #14 operation dev=1 obj=5 rel=0 thread=0 status=OK object=command_buffer(5) op=command_buffer.dispatch(313)
Use JSONL for scripts and agent workflows:
iree-dump-replay --format=jsonl /tmp/model.ireereplay | \
jq 'select(.kind=="operation" and .operation=="device.queue_execute")'
Query named replay scopes:
iree-dump-replay --format=jsonl /tmp/model.ireereplay | \
jq 'select(.payload_type=="replay_scope") |
{operation, name: .payload.name}'
Example JSONL row:
{"kind":"operation","sequence_ordinal":27,"object_type":"device","operation":"device.queue_execute","payload":{"command_buffer_id":5,"wait_semaphores":[{"semaphore_id":7,"value":1}],"signal_semaphores":[{"semaphore_id":9,"value":1}]}}
The dumper reports blob payloads as byte ranges in the original replay file. That keeps large captures queryable and lets generated projections refer to the capture without emitting a second giant sidecar.
Substitute executableslink
Replay can replace captured executable payloads at execution time. This is useful when iterating on generated kernels while preserving the same captured HAL workload, inputs, synchronization, and benchmark harness.
First find executable ids:
iree-dump-replay --format=jsonl /tmp/model.ireereplay | \
jq 'select(.operation=="executable_cache.prepare_executable") |
{executable_id:.related_object_id, format:.payload.format_range}'
Then substitute a replacement:
iree-run-replay \
--device=amdgpu \
--replay_executable_substitution=4=/tmp/new-kernel.hsaco \
/tmp/model.ireereplay
If the target executable cache needs an explicit format, include it in the selector:
iree-benchmark-replay \
--device=amdgpu \
--benchmark_min_time=50x \
--replay_executable_substitution=4@amdgcn-amd-amdhsa--gfx1100=/tmp/new-kernel.hsaco \
/tmp/model.ireereplay
For captures that should use one replacement for every executable, use the
all selector:
iree-run-replay \
--device=amdgpu \
--replay_executable_substitution=all=/tmp/new-kernel.bin \
/tmp/model.ireereplay
Substitution is strict. Replay validates available executable metadata, export counts, reflected ABI shape, constants, bindings, and workgroup information before dispatching a replacement. An ABI mismatch fails before the replacement can silently benchmark a different program.
Current fidelity contractslink
Replay should fail loudly when it cannot reproduce the captured HAL work. These failures are part of the contract:
- Missing or identity-mismatched external files mean replay might point at the
wrong parameter archive. Fix the path with
--replay_file_remapor restore the referenced file. - Persistent host write maps without an observable flush or unmap boundary are rejected because replay cannot see the final byte contents.
- Host calls, channels, collectives, allocator import/export, and opaque external handles are visible in dumps and fail in strict execution until they have replay semantics.
- Imported or exported external buffers are not replayed as best-effort snapshots because the application can mutate them outside observable HAL map, flush, or update operations.
- Target topology matters. Select a device group whose device count and capabilities match the captured workload.
These constraints keep replay useful for correctness and performance work: a successful replay should mean the HAL stream was actually reproduced, not that unsupported operations were skipped.
Workflow: large parameter fileslink
For a normal model serving workflow that uses a large .irpa parameter file,
capture by reference:
iree-run-module \
--device=amdgpu \
--module=/tmp/model.vmfb \
--function=main \
--parameters=model=/data/weights/model.irpa \
--input=@/tmp/prompt.txt \
--device_replay_output=/tmp/model.ireereplay \
--device_replay_file_policy=reference \
--device_replay_file_validation=identity
Move or copy the replay and keep the parameter file available. If the replay host uses a different mount root:
iree-benchmark-replay \
--device=amdgpu \
--benchmark_min_time=20x \
--replay_file_remap=/data/weights=/mnt/replay/weights \
/tmp/model.ireereplay
If platform identity cannot be preserved across staging, recapture with
--device_replay_file_validation=digest. Do not enable digest validation for
terabyte-scale files unless the full scan is acceptable.
Workflow: hermetic correctness capturelink
For a small test or a bug report where external files should not be required, use range capture:
iree-run-module \
--device=local-sync \
--module=/tmp/model.vmfb \
--function=main \
--parameters=model=/tmp/fixture.irpa \
--input=@/tmp/inputs.txt \
--device_replay_output=/tmp/model-hermetic.ireereplay \
--device_replay_file_policy=capture-ranges
Replay substitutes captured file reads with queue updates. That preserves the bytes consumed by the HAL stream, but it is not a storage benchmark for the original file read path.
Appendix: Programmatic replay executionlink
Embedding applications can execute a replay directly with
iree_hal_replay_execute_file:
#include "iree/hal/replay/execute.h"
#include "iree/io/file_contents.h"
iree_io_file_contents_t* replay_contents = NULL;
iree_status_t status = iree_io_file_contents_map(
IREE_SV("/tmp/model.ireereplay"), IREE_IO_FILE_ACCESS_READ,
host_allocator, &replay_contents);
iree_hal_replay_file_path_remap_t remaps[] = {
{IREE_SV("/mnt/capture"), IREE_SV("/mnt/replay")},
};
iree_hal_replay_execute_options_t options =
iree_hal_replay_execute_options_default();
options.file_path_remap_count = IREE_ARRAYSIZE(remaps);
options.file_path_remaps = remaps;
if (iree_status_is_ok(status)) {
status = iree_hal_replay_execute_file(replay_contents->const_buffer,
device_group, &options,
host_allocator);
}
iree_io_file_contents_free(replay_contents);
IREE_RETURN_IF_ERROR(status);
Executable substitution is exposed as a callback on
iree_hal_replay_execute_options_t, allowing callers to decide per captured
executable:
static iree_status_t substitute_executable(
void* user_data,
const iree_hal_replay_executable_substitution_request_t* request,
iree_hal_replay_executable_substitution_t* out_substitution) {
const replacement_library_t* library = (const replacement_library_t*)user_data;
if (request->executable_id != library->target_executable_id) {
return iree_ok_status();
}
out_substitution->substitute = true;
out_substitution->source = library->path;
out_substitution->executable_format = library->format;
out_substitution->executable_data = library->data;
return iree_ok_status();
}
options.executable_substitution_callback.fn = substitute_executable;
options.executable_substitution_callback.user_data = &replacement_library;
Replacement data only needs to remain valid for the prepare call made by replay.
Appendix: Agent-oriented helplink
Replay tools can print Markdown guidance from the exact binary in your build
tree. iree-run-replay --agents_md owns the shared replay playbook intended
for direct inclusion in an AGENTS.md file; the other tools print focused notes
for their capture, benchmark, or dump-specific behavior:
iree-run-module --agents_md
iree-benchmark-module --agents_md
iree-run-replay --agents_md
iree-benchmark-replay --agents_md
iree-dump-replay --agents_md
Use these when integrating replay into scripts, CI reproducers, or agent workflows that need the current flag list and diagnostics without reading the source tree.