Common/GPU
-iree-codegen-expand-gpu-ops
Expands high-level GPU ops, such as clustered gpu.subgroup_reduce.
-iree-codegen-gpu-alloc-private-memory-for-dps-ops
Pass to add private memory allocations prior to DPS interface ops.
Creates a bufferization.alloc_tensor in private space for all DPS ops with unused results that can't be removed. If such an unused result originates from a load from global memory, it triggers an allocation in global memory space during bufferization, which fails. The allocation must therefore be made earlier to avoid failing bufferization.
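As a rough illustration (one plausible reading of the pass; the op shapes, value names, and attribute syntax below are assumptions, not captured pass output), an unused DPS result whose init comes from global memory gets a fresh private-space init:

    // Before: %r#1 is never used, and %unused_init originates from a load
    // from global memory, so bufferization would allocate the result in
    // global memory space and fail.
    %r:2 = linalg.generic ...
        outs(%used_init, %unused_init : tensor<64xf32>, tensor<64xf32>)
        -> (tensor<64xf32>, tensor<64xf32>)

    // After: the init of the unused result is replaced by a private-space
    // allocation created by this pass.
    %priv = bufferization.alloc_tensor()
        {memory_space = #gpu.address_space<private>} : tensor<64xf32>
    %r:2 = linalg.generic ...
        outs(%used_init, %priv : tensor<64xf32>, tensor<64xf32>)
        -> (tensor<64xf32>, tensor<64xf32>)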
-iree-codegen-gpu-apply-tiling-level
Pass to tile tensor ops based on tiling configs
Options
-tiling-level : Tiling level to tile. Supported levels are 'reduction' and 'thread'
-allow-zero-slices : Allow pad fusion to generate zero size slices
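For example, the tiling level and slice behavior can be set with the standard MLIR textual pass-pipeline syntax (the func.func anchoring and the input.mlir file name are assumptions for illustration):

    iree-opt \
      --pass-pipeline="builtin.module(func.func(iree-codegen-gpu-apply-tiling-level{tiling-level=reduction allow-zero-slices=true}))" \
      input.mlir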
-iree-codegen-gpu-bubble-resource-casts
Bubbles iree_gpu.buffer_resource_cast ops upwards.
-iree-codegen-gpu-check-resource-usage
Checks GPU-specific resource usage constraints such as shared memory limits
-iree-codegen-gpu-combine-layout-transformation
Combines layout transformation operations into a single map_scatter operation.
Starting from iree_codegen.store_to_memref ops, iteratively combine producer layout/indexing transformation ops (linalg.transpose, tensor.collapse_shape, etc.) into a single iree_linalg_ext.map_scatter operation. For tensor.pad ops, the writing of pad values is distributed to workgroups and threads, and then the padding values are written directly to the output buffer of the store_to_memref op.
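Conceptually (a hand-written sketch; operand details and the exact store_to_memref/map_scatter syntax are elided rather than taken from real pass output), a producer chain such as

    %t = linalg.transpose ins(%src : tensor<32x16xf32>)
                          outs(%empty : tensor<16x32xf32>) permutation = [1, 0]
    %c = tensor.collapse_shape %t [[0, 1]] : tensor<16x32xf32> into tensor<512xf32>
    iree_codegen.store_to_memref %c, %dest ...

is rewritten so that the transpose and collapse are folded into the index mapping of a single iree_linalg_ext.map_scatter that writes %src directly into %dest.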
-iree-codegen-gpu-combine-value-barriers
Combines iree_gpu.value_barrier ops
-iree-codegen-gpu-create-fast-slow-path
Create separate fast and slow paths to handle padding
-iree-codegen-gpu-decompose-horizontally-fused-gemms
Decomposes a horizontally fused GEMM back into its constituent GEMMs
-iree-codegen-gpu-distribute
Pass to distribute scf.forall ops using upstream patterns.
-iree-codegen-gpu-distribute-copy-using-forall
Pass to distribute copies to threads.
-iree-codegen-gpu-distribute-forall
Pass to distribute scf.forall ops.
-iree-codegen-gpu-distribute-scf-for
Distribute tiled loop nests to invocations
Options
-use-block-dims : Use gpu.block_dim ops to query distribution sizes.
-iree-codegen-gpu-distribute-shared-memory-copy
Pass to distribute shared memory copies to threads.
-iree-codegen-gpu-fuse-and-hoist-parallel-loops
Greedily fuses and hoists parallel loops.
-iree-codegen-gpu-generalize-named-ops
Convert named Linalg ops to linalg.generic ops
-iree-codegen-gpu-greedily-distribute-to-threads
Greedily distributes all remaining tilable ops to threads
-iree-codegen-gpu-infer-memory-space
Pass to infer and set the memory space for all alloc_tensor ops.
-iree-codegen-gpu-lower-to-ukernels
Lower suitable ops to previously-selected microkernels
-iree-codegen-gpu-multi-buffering
Pass to do multi-buffering.
Options
-num-buffers : Number of buffers to use.
-iree-codegen-gpu-pack-to-intrinsics
Packs matmul-like operations and converts them to iree_gpu.multi_mma
-iree-codegen-gpu-pad-operands
Pass to pad operands of ops for which a padding configuration is provided.
-iree-codegen-gpu-pipelining
Pass to do software pipelining.
Options
-epilogue-peeling : Try to use an un-peeled epilogue when false, a peeled epilogue otherwise.
-pipeline-depth : Number of stages
-schedule-index : Allows picking different schedule for the pipelining transformation.
-transform-file-name : Optional filename containing a transform dialect specification to apply. If left empty, the IR is assumed to contain one top-level transform dialect operation somewhere in the module.
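Multiple options combine inside one braces group, separated by spaces; a sketch of an invocation (pipeline nesting and file name are assumptions):

    iree-opt \
      --pass-pipeline="builtin.module(func.func(iree-codegen-gpu-pipelining{pipeline-depth=3 schedule-index=0 epilogue-peeling=true}))" \
      input.mlir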
-iree-codegen-gpu-promote-matmul-operands
Pass to insert copies with a different thread configuration on matmul operands
-iree-codegen-gpu-reduce-bank-conflicts
Pass to try to reduce the number of bank conflicts by padding memref.alloc ops.
Options
-padding-bits : Padding size (in bits) to introduce between rows.
-iree-codegen-gpu-reuse-shared-memory-allocs
Pass to reuse shared memory allocations with no overlapping liveness.
-iree-codegen-gpu-tensor-alloc
Pass to create allocations for some tensor values to use GPU shared memory
-iree-codegen-gpu-tensor-tile
Pass to tile tensor (linalg) ops within a GPU workgroup
Options
-distribute-to-subgroup : Distribute the workloads to subgroups if true, otherwise distribute to threads.
-iree-codegen-gpu-tensor-tile-to-serial-loops
Pass to tile reduction dimensions for certain GPU ops
Options
-coalesce-loops : Collapse the generated loops into a single loop
-iree-codegen-gpu-tile
Tile Linalg ops with tensor semantics to invocations
-iree-codegen-gpu-tile-reduction
Pass to tile linalg reduction dimensions.
-iree-codegen-gpu-vector-alloc
Pass to create allocations for contraction inputs to copy to GPU shared memory
-iree-codegen-gpu-verify-distribution
Pass to verify writes before resolving distributed contexts.
-iree-codegen-reorder-workgroups
Reorder workgroup ids for better cache reuse
Options
-strategy : Workgroup reordering strategy, one of: '' (none), 'transpose'
-iree-codegen-vector-reduction-to-gpu
Convert vector reduction to GPU ops.