Common/GPU

-iree-codegen-expand-gpu-ops

Expands high-level GPU ops, such as clustered gpu.subgroup_reduce.
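
For illustration, a clustered reduction and one possible expansion into lane shuffles (a minimal sketch assuming f32 addition and a subgroup width of 32; the exact sequence the pass emits may differ):

    // Before: reduce within clusters of 4 adjacent lanes.
    %sum = gpu.subgroup_reduce add %val cluster(size = 4) : (f32) -> f32

    // After (sketch): a log2(4) = 2 step butterfly of xor shuffles.
    %c1 = arith.constant 1 : i32
    %c2 = arith.constant 2 : i32
    %w = arith.constant 32 : i32
    %s0, %p0 = gpu.shuffle xor %val, %c1, %w : f32
    %r0 = arith.addf %val, %s0 : f32
    %s1, %p1 = gpu.shuffle xor %r0, %c2, %w : f32
    %red = arith.addf %r0, %s1 : f32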

-iree-codegen-gpu-alloc-private-memory-for-dps-ops

Pass to add private memory allocations prior to DPS interface ops.

Creates a bufferization.alloc_tensor in the private memory space for every DPS op with unused results that cannot be removed. If such an unused result originates from a load from global memory, bufferization would otherwise allocate it in the global memory space and fail; making the allocation earlier, in private space, avoids the failed bufferization.
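
For illustration, such an allocation might look like the following (a minimal sketch; the tensor shape and the use of the upstream #gpu.address_space<private> attribute are assumptions):

    // Unused DPS result tied to a private-memory allocation instead of a
    // tensor loaded from global memory.
    %dest = bufferization.alloc_tensor() {memory_space = #gpu.address_space<private>} : tensor<4x8xf32>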

-iree-codegen-gpu-apply-tiling-level

Pass to tile tensor ops based on tiling configs

Options

-tiling-level      : Tiling level to apply. Supported levels are 'reduction' and 'thread'
-allow-zero-slices : Allow pad fusion to generate zero-size slices
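
As a hypothetical illustration, the per-level tile sizes typically come from a lowering configuration attribute attached to the tiled op; the attribute keys and sizes below are assumptions, not a normative format:

    %mm = linalg.matmul {lowering_config = #iree_gpu.lowering_config<{
              reduction = [0, 0, 8], thread = [4, 4, 0]}>}
          ins(%a, %b : tensor<64x64xf32>, tensor<64x64xf32>)
          outs(%c : tensor<64x64xf32>) -> tensor<64x64xf32>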

-iree-codegen-gpu-bubble-resource-casts

Bubbles iree_gpu.buffer_resource_cast ops upwards.

-iree-codegen-gpu-check-resource-usage

Checks GPU-specific resource usage constraints, such as shared memory limits

-iree-codegen-gpu-combine-layout-transformation

Combines layout transformation operations into a single map_scatter operation.

Starting from iree_codegen.store_to_memref ops, iteratively combines producer layout/indexing transformation ops (linalg.transpose, tensor.collapse_shape, etc.) into a single iree_linalg_ext.map_scatter operation. For tensor.pad ops, the writing of pad values is distributed to workgroups and threads, and the padding values are then written directly to the output buffer of the store_to_memref op.
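
A sketch of the kind of producer chain the pass folds away (shapes are illustrative, and the resulting map_scatter op is elided):

    // Both ops only reindex data; the pass combines them into a single
    // iree_linalg_ext.map_scatter feeding the store_to_memref op.
    %t = linalg.transpose ins(%src : tensor<4x8xf32>)
             outs(%t_init : tensor<8x4xf32>) permutation = [1, 0]
    %flat = tensor.collapse_shape %t [[0, 1]] : tensor<8x4xf32> into tensor<32xf32>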

-iree-codegen-gpu-combine-value-barriers

Combines iree_gpu.value_barrier ops

-iree-codegen-gpu-create-fast-slow-path

Create separate fast and slow paths to handle padding

-iree-codegen-gpu-decompose-horizontally-fused-gemms

Decomposes a horizontally fused GEMM back into its constituent GEMMs

-iree-codegen-gpu-distribute

Pass to distribute scf.forall ops using upstream patterns.

-iree-codegen-gpu-distribute-copy-using-forall

Pass to distribute copies to threads.

-iree-codegen-gpu-distribute-forall

Pass to distribute scf.forall ops.

-iree-codegen-gpu-distribute-scf-for

Distribute tiled loop nests to invocations

Options

-use-block-dims : Use gpu.block_dim ops to query distribution sizes.
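
Conceptually, distribution turns a serial loop into a thread-cyclic one; with use-block-dims the stride is queried from gpu.block_dim (a hand-written sketch with assumed shapes, not the pass's literal output):

    // Before: scf.for %i = %c0 to %c128 step %c1 { ... }
    // After: each invocation handles every block_dim.x-th iteration.
    %c128 = arith.constant 128 : index
    %tid = gpu.thread_id x
    %dim = gpu.block_dim x
    scf.for %i = %tid to %c128 step %dim {
      %v = memref.load %in[%i] : memref<128xf32>
      memref.store %v, %out[%i] : memref<128xf32>
    }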

-iree-codegen-gpu-distribute-shared-memory-copy

Pass to distribute shared memory copies to threads.

-iree-codegen-gpu-fuse-and-hoist-parallel-loops

Greedily fuses and hoists parallel loops.

-iree-codegen-gpu-generalize-named-ops

Convert named Linalg ops to linalg.generic ops
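
For example, a linalg.matmul generalizes to a linalg.generic with explicit indexing maps and iterator types:

    #mapA = affine_map<(d0, d1, d2) -> (d0, d2)>
    #mapB = affine_map<(d0, d1, d2) -> (d2, d1)>
    #mapC = affine_map<(d0, d1, d2) -> (d0, d1)>
    %0 = linalg.generic
        {indexing_maps = [#mapA, #mapB, #mapC],
         iterator_types = ["parallel", "parallel", "reduction"]}
        ins(%A, %B : tensor<4x8xf32>, tensor<8x16xf32>)
        outs(%C : tensor<4x16xf32>) {
    ^bb0(%a: f32, %b: f32, %acc: f32):
      // Scalar body: acc += a * b.
      %p = arith.mulf %a, %b : f32
      %s = arith.addf %acc, %p : f32
      linalg.yield %s : f32
    } -> tensor<4x16xf32>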

-iree-codegen-gpu-greedily-distribute-to-threads

Greedily distributes all remaining tilable ops to threads

-iree-codegen-gpu-infer-memory-space

Pass to infer and set the memory space for all alloc_tensor ops.

-iree-codegen-gpu-lower-to-ukernels

Lower suitable ops to previously-selected microkernels

-iree-codegen-gpu-multi-buffering

Pass to do multi buffering.

Options

-num-buffers : Number of buffers to use.
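
Illustrative effect with num-buffers=2 (a sketch; the buffer shape and workgroup address space are assumptions):

    // Before: one shared-memory buffer reused by every loop iteration,
    // forcing synchronization between iterations.
    %buf = memref.alloc() : memref<64xf32, #gpu.address_space<workgroup>>
    // After: two buffers, alternated across iterations (via memref.subview
    // on the leading dimension) so a write for iteration i+1 can overlap
    // the reads of iteration i.
    %bufs = memref.alloc() : memref<2x64xf32, #gpu.address_space<workgroup>>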

-iree-codegen-gpu-pack-to-intrinsics

Packs matmul-like operations and converts them to iree_gpu.multi_mma

-iree-codegen-gpu-pad-operands

Pass to pad operands of ops with padding configuration provided.

-iree-codegen-gpu-pipelining

Pass to do software pipelining.

Options

-epilogue-peeling    : Use an unpeeled epilogue when false, a peeled epilogue otherwise.
-pipeline-depth      : Number of stages 
-schedule-index      : Allows picking different schedule for the pipelining transformation.
-transform-file-name : Optional filename containing a transform dialect specification to apply. If left empty, the IR is assumed to contain one top-level transform dialect operation somewhere in the module.

-iree-codegen-gpu-promote-matmul-operands

Pass to insert copies of matmul operands that use a different thread configuration

-iree-codegen-gpu-reduce-bank-conflicts

Pass to try to reduce the number of bank conflicts by padding memref.alloc ops.

Options

-padding-bits : Padding size (in bits) to introduce between rows.
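
Illustrative effect with padding-bits=128, i.e. four f32 elements of padding per row (a sketch; shapes and address space are assumptions):

    // Before: with 32 four-byte banks, every element of a column lands in
    // the same bank, so column-wise accesses serialize.
    %a = memref.alloc() : memref<32x32xf32, #gpu.address_space<workgroup>>
    // After: consecutive rows start in different banks; accesses go through
    // a view with the original 32x32 shape.
    %b = memref.alloc() : memref<32x36xf32, #gpu.address_space<workgroup>>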

-iree-codegen-gpu-reuse-shared-memory-allocs

Pass to reuse shared memory allocations with no overlapping liveness.

-iree-codegen-gpu-tensor-alloc

Pass to create allocations for some tensor values to use GPU shared memory

-iree-codegen-gpu-tensor-tile

Pass to tile tensor (linalg) ops within a GPU workgroup

Options

-distribute-to-subgroup : Distribute the workloads to subgroups if true, otherwise distribute to threads.

-iree-codegen-gpu-tensor-tile-to-serial-loops

Pass to tile reduction dimensions for certain GPU ops

Options

-coalesce-loops : Collapse the generated loops into a single loop

-iree-codegen-gpu-tile

Tile Linalg ops with tensor semantics to invocations

-iree-codegen-gpu-tile-reduction

Pass to tile linalg reduction dimensions.

-iree-codegen-gpu-vector-alloc

Pass to create allocations for contraction inputs to copy to GPU shared memory

-iree-codegen-gpu-verify-distribution

Pass to verify writes before resolving distributed contexts.

-iree-codegen-reorder-workgroups

Reorder workgroup ids for better cache reuse

Options

-strategy : Workgroup reordering strategy, one of: '' (none), 'transpose'

-iree-codegen-vector-reduction-to-gpu

Convert vector reduction to GPU ops.
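
A sketch of the lowering direction (one butterfly step shown; a full subgroup reduction repeats the shuffle-and-combine step log2(subgroup width) times, and %partial stands for this lane's assumed partial result):

    // Before: the whole reduction held by a single thread.
    %r = vector.reduction <add>, %v : vector<4xf32> into f32
    // After: per-lane partials combined across the subgroup.
    %off = arith.constant 16 : i32
    %w = arith.constant 32 : i32
    %other, %valid = gpu.shuffle xor %partial, %off, %w : f32
    %acc = arith.addf %partial, %other : f32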