Common/GPU
-iree-codegen-expand-gpu-ops
Expands high-level GPU ops, such as clustered gpu.subgroup_reduce.
-iree-codegen-gpu-alloc-private-memory-for-dps-ops
Pass to add private memory allocations prior to DPS interface ops.
Creates a bufferization.alloc_tensor in private space for all DPS ops with unused results that can't be removed. If such an unused result originates from a load from global memory, it triggers an allocation in global memory space during bufferization, which fails. The allocation must therefore be made earlier to avoid failing bufferization.
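As a rough illustration (one plausible reading of the pass; the op shapes, value names, and attribute syntax below are assumptions, not captured pass output), an unused DPS result whose init comes from global memory gets a fresh private-space init:

    // Before: %r#1 is never used, and %unused_init originates from a load
    // from global memory, so bufferization would allocate the result in
    // global memory space and fail.
    %r:2 = linalg.generic ...
        outs(%used_init, %unused_init : tensor<64xf32>, tensor<64xf32>)
        -> (tensor<64xf32>, tensor<64xf32>)

    // After: the init of the unused result is replaced by a private-space
    // allocation created by this pass.
    %priv = bufferization.alloc_tensor()
        {memory_space = #gpu.address_space<private>} : tensor<64xf32>
    %r:2 = linalg.generic ...
        outs(%used_init, %priv : tensor<64xf32>, tensor<64xf32>)
        -> (tensor<64xf32>, tensor<64xf32>)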
-iree-codegen-gpu-apply-tiling-level
Pass to tile tensor ops based on tiling configs
Options
-tiling-level : Tiling level to tile. Supported levels are 'reduction' and 'thread'
-allow-zero-slices : Allow pad fusion to generate zero size slices
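For example, the tiling level and slice behavior can be set with the standard MLIR textual pass-pipeline syntax (the func.func anchoring and the input.mlir file name are assumptions for illustration):

    iree-opt \
      --pass-pipeline="builtin.module(func.func(iree-codegen-gpu-apply-tiling-level{tiling-level=reduction allow-zero-slices=true}))" \
      input.mlir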
-iree-codegen-gpu-bubble-resource-casts
Bubbles iree_gpu.buffer_resource_cast ops upwards.
-iree-codegen-gpu-check-resource-usage
Checks GPU-specific resource usage constraints such as shared memory limits
-iree-codegen-gpu-combine-layout-transformation
Combines layout transformation operations into a single map_scatter operation.
Starting from iree_codegen.store_to_memref ops, iteratively combine producer layout/indexing transformation ops (linalg.transpose, tensor.collapse_shape, etc.) into a single iree_linalg_ext.map_scatter operation. For tensor.pad ops, the writing of pad values is distributed to workgroups and threads, and then the padding values are written directly to the output buffer of the store_to_memref op.
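Conceptually (a hand-written sketch; operand details and the exact store_to_memref/map_scatter syntax are elided rather than taken from real pass output), a producer chain such as

    %t = linalg.transpose ins(%src : tensor<32x16xf32>)
                          outs(%empty : tensor<16x32xf32>) permutation = [1, 0]
    %c = tensor.collapse_shape %t [[0, 1]] : tensor<16x32xf32> into tensor<512xf32>
    iree_codegen.store_to_memref %c, %dest ...

is rewritten so that the transpose and collapse are folded into the index mapping of a single iree_linalg_ext.map_scatter that writes %src directly into %dest.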
-iree-codegen-gpu-combine-value-barriers
Combines iree_gpu.value_barrier ops
-iree-codegen-gpu-create-fast-slow-path
Create separate fast and slow paths to handle padding
-iree-codegen-gpu-decompose-horizontally-fused-gemms
Decomposes a horizontally fused GEMM back into its constituent GEMMs
-iree-codegen-gpu-distribute
Pass to distribute scf.forall ops using upstream patterns.
-iree-codegen-gpu-distribute-copy-using-forall
Pass to distribute copies to threads.
-iree-codegen-gpu-distribute-forall
Pass to distribute scf.forall ops.
-iree-codegen-gpu-distribute-scf-for
Distribute tiled loop nests to invocations
Options
-use-block-dims : Use gpu.block_dim ops to query distribution sizes.
-iree-codegen-gpu-distribute-shared-memory-copy
Pass to distribute shared memory copies to threads.
-iree-codegen-gpu-fuse-and-hoist-parallel-loops
Greedily fuses and hoists parallel loops.
-iree-codegen-gpu-generalize-named-ops
Convert named Linalg ops to linalg.generic ops
-iree-codegen-gpu-greedily-distribute-to-threads
Greedily distributes all remaining tilable ops to threads
-iree-codegen-gpu-infer-memory-space
Pass to infer and set the memory space for all alloc_tensor ops.
-iree-codegen-gpu-lower-to-ukernels
Lower suitable ops to previously-selected microkernels
-iree-codegen-gpu-multi-buffering
Pass to do multi-buffering.
Options
-num-buffers : Number of buffers to use.
-iree-codegen-gpu-pack-to-intrinsics
Packs matmul-like operations and converts them to iree_gpu.multi_mma
-iree-codegen-gpu-pad-operands
Pass to pad operands of ops for which a padding configuration is provided.
-iree-codegen-gpu-pipelining
Pass to do software pipelining.
Options
-epilogue-peeling : Try to use an un-peeled epilogue when false, a peeled epilogue otherwise.
-pipeline-depth : Number of stages
-schedule-index : Allows picking different schedule for the pipelining transformation.
-transform-file-name : Optional filename containing a transform dialect specification to apply. If left empty, the IR is assumed to contain one top-level transform dialect operation somewhere in the module.
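Multiple options combine inside one braces group, separated by spaces; a sketch of an invocation (pipeline nesting and file name are assumptions):

    iree-opt \
      --pass-pipeline="builtin.module(func.func(iree-codegen-gpu-pipelining{pipeline-depth=3 schedule-index=0 epilogue-peeling=true}))" \
      input.mlir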
-iree-codegen-gpu-promote-matmul-operands
Pass to insert copies with a different thread configuration on matmul operands
-iree-codegen-gpu-reduce-bank-conflicts
Pass to try to reduce the number of bank conflicts by padding memref.alloc ops.
Options
-padding-bits : Padding size (in bits) to introduce between rows.
-iree-codegen-gpu-reuse-shared-memory-allocs
Pass to reuse shared memory allocations with no overlapping liveness.
-iree-codegen-gpu-tensor-alloc
Pass to create allocations for some tensor values to use GPU shared memory
-iree-codegen-gpu-tensor-tile
Pass to tile tensor (linalg) ops within a GPU workgroup
Options
-distribute-to-subgroup : Distribute the workloads to subgroups if true, otherwise distribute to threads.
-iree-codegen-gpu-tensor-tile-to-serial-loops
Pass to tile reduction dimensions for certain GPU ops
Options
-coalesce-loops : Collapse the generated loops into a single loop
-iree-codegen-gpu-tile
Tile Linalg ops with tensor semantics to invocations
-iree-codegen-gpu-tile-reduction
Pass to tile linalg reduction dimensions.
-iree-codegen-gpu-vector-alloc
Pass to create allocations for contraction inputs to copy to GPU shared memory
-iree-codegen-gpu-verify-distribution
Pass to verify writes before resolving distributed contexts.
-iree-codegen-reorder-workgroups
Reorder workgroup ids for better cache reuse
Options
-strategy : Workgroup reordering strategy, one of: '' (none), 'transpose'
-iree-codegen-vector-reduction-to-gpu
Convert vector reduction to GPU ops.