Common/GPU
-iree-codegen-expand-gpu-ops
Expands high-level GPU ops, such as clustered gpu.subgroup_reduce.
-iree-codegen-gpu-alloc-private-memory-for-dps-ops
Pass to add private memory allocations prior to DPS interface ops.
Creates a bufferization.alloc_tensor in private memory space for every DPS op
with an unused result that cannot be removed. If such a result originates
from a load from global memory, bufferization would otherwise place the
allocation in the global memory space and fail. Making the allocation earlier,
in private space, avoids the failed bufferization.
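A minimal sketch of the kind of allocation this pass introduces (the ops, shapes, and attribute spelling below are illustrative and may differ between IREE/MLIR versions):

```mlir
// The fill result is unused but the op cannot be erased. Instead of letting
// the init operand trace back to a load from global memory (which would
// bufferize to a global-memory allocation and fail), a fresh private-memory
// tensor is materialized to serve as the init.
%init = bufferization.alloc_tensor() {memory_space = #gpu.address_space<private>}
    : tensor<16xf32>
%unused = linalg.fill ins(%cst : f32) outs(%init : tensor<16xf32>) -> tensor<16xf32>
```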
-iree-codegen-gpu-apply-padding-level
Pass to pad based on tiling configs
Options
-tiling-level : Tiling level to tile. Supported levels are 'reduction' and 'thread'
-iree-codegen-gpu-apply-tiling-level
Pass to tile tensor ops based on tiling configs
Options
-tiling-level : Tiling level to tile. Supported levels are 'reduction' and 'thread'
-allow-zero-slices : Allow pad fusion to generate zero size slices
-normalize-loops : Enable normalization for scf loops
-fuse-consumers : Enable fusing consumers into scf.forall during tiling
-iree-codegen-gpu-bubble-resource-casts
Bubbles iree_gpu.buffer_resource_cast ops upwards.
-iree-codegen-gpu-check-resource-usage
Checks GPU-specific resource usage constraints, such as shared memory limits
-iree-codegen-gpu-combine-layout-transformation
Combines layout transformation operations into a single map_scatter operation.
Starting from iree_codegen.store_to_buffer ops, the pass iteratively combines producer layout/indexing transformation ops (linalg.transpose, tensor.collapse_shape, etc.) into a single iree_linalg_ext.map_scatter operation. For tensor.pad ops, the writing of pad values is distributed to workgroups and threads, and the padding values are then written directly to the output buffer of the store_to_buffer op.
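As an illustration, a producer chain of the kind this pass folds away (the ops and shapes below are made up for the example):

```mlir
// Layout-transforming producers ahead of the store to the output buffer.
%empty = tensor.empty() : tensor<8x4xf32>
%t = linalg.transpose ins(%in : tensor<4x8xf32>)
       outs(%empty : tensor<8x4xf32>) permutation = [1, 0]
%flat = tensor.collapse_shape %t [[0, 1]] : tensor<8x4xf32> into tensor<32xf32>
// %flat is what the iree_codegen.store_to_buffer op writes to its buffer.
// The pass rewrites such chains into a single iree_linalg_ext.map_scatter
// that scatters %in into the buffer with the combined index mapping.
```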
-iree-codegen-gpu-combine-value-barriers
Combines iree_gpu.value_barrier ops
-iree-codegen-gpu-convert-to-coalesced-dma
Convert operations to coalesced DMA operations.
-iree-codegen-gpu-create-fast-slow-path
Create separate fast and slow paths to handle padding
-iree-codegen-gpu-decompose-horizontally-fused-gemms
Decomposes a horizontally fused GEMM back into its constituent GEMMs
-iree-codegen-gpu-distribute
Pass to distribute scf.forall ops using upstream patterns.
-iree-codegen-gpu-distribute-copy-using-forall
Pass to distribute copies to threads.
-iree-codegen-gpu-distribute-forall
Pass to distribute scf.forall ops.
-iree-codegen-gpu-distribute-scf-for
Distribute tiled loop nests to invocations
Options
-use-block-dims : Use gpu.block_dim ops to query distribution sizes.
-iree-codegen-gpu-distribute-shared-memory-copy
Pass to distribute shared memory copies to threads.
-iree-codegen-gpu-fuse-and-hoist-parallel-loops
Greedily fuses and hoists parallel loops.
-iree-codegen-gpu-generalize-named-ops
Convert named Linalg ops to linalg.generic ops
Converts a subset of named Linalg ops to linalg.generic ops. The rule of thumb is that named ops whose operand indexing maps are projections are in the subset. For example, convolutions and pooling ops are not generalized by this pass, but matmuls are.
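For example, a linalg.matmul (whose operand indexing maps are projections of (d0, d1, d2)) is rewritten into its generic form, roughly as below; the exact printed form depends on the MLIR version:

```mlir
// Named form.
%0 = linalg.matmul ins(%a, %b : tensor<128x64xf32>, tensor<64x256xf32>)
                   outs(%c : tensor<128x256xf32>) -> tensor<128x256xf32>

// Generalized form produced by the pass.
%1 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>,
                     affine_map<(d0, d1, d2) -> (d2, d1)>,
                     affine_map<(d0, d1, d2) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel", "reduction"]}
    ins(%a, %b : tensor<128x64xf32>, tensor<64x256xf32>)
    outs(%c : tensor<128x256xf32>) {
^bb0(%lhs: f32, %rhs: f32, %acc: f32):
  %mul = arith.mulf %lhs, %rhs : f32
  %add = arith.addf %acc, %mul : f32
  linalg.yield %add : f32
} -> tensor<128x256xf32>
```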
-iree-codegen-gpu-greedily-distribute-to-threads
Greedily distributes all remaining tilable ops to threads
-iree-codegen-gpu-infer-memory-space
Pass to infer and set the memory space for all alloc_tensor ops.
-iree-codegen-gpu-lower-to-global-loads
Emit direct global load instructions.
-iree-codegen-gpu-multi-buffering
Pass to perform multi-buffering.
Options
-num-buffers : Number of buffers to use.
-iree-codegen-gpu-pack-to-intrinsics
Packs matmul-like operations and converts them to iree_codegen.inner_tiled
-iree-codegen-gpu-pad-convs
Pass to pad the operands of a convolution using the provided padding configuration.
-iree-codegen-gpu-pad-operands
Pass to pad the operands of ops using the provided padding configuration.
-iree-codegen-gpu-pipelining
Pass to perform software pipelining.
Options
-epilogue-peeling : Try to use an unpeeled epilogue when false, a peeled epilogue otherwise.
-pipeline-depth : Number of stages
-schedule-index : Allows picking a different schedule for the pipelining transformation.
-transform-file-name : Optional filename containing a transform dialect specification to apply. If left empty, the IR is assumed to contain one top-level transform dialect operation somewhere in the module.
-iree-codegen-gpu-promote-matmul-operands
Pass to insert copies with a different lowering configuration on matmul operands
Looks for all matmuls annotated with promote_operands = I64Array and
inserts copies on the specified operands with a thread lowering config
optimized for coalesced loads.
If the matmul is also annotated with promotion_types = ArrayAttr, the
logic for "promoting" an operand is deferred to an attribute interface
allowing for custom logic.
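As a sketch, the annotation the pass looks for lives in the op's lowering config; the attribute spelling below is illustrative and may differ between IREE versions:

```mlir
// promote_operands = [0, 1] asks for the LHS (operand 0) and RHS (operand 1)
// of the matmul to receive promotion copies with a coalesced-load thread config.
%0 = linalg.matmul {lowering_config = #iree_gpu.lowering_config<{promote_operands = [0, 1]}>}
     ins(%a, %b : tensor<128x64xf16>, tensor<64x256xf16>)
     outs(%c : tensor<128x256xf32>) -> tensor<128x256xf32>
```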
-iree-codegen-gpu-reduce-bank-conflicts
Pass to try to reduce the number of bank conflicts by padding memref.alloc ops.
Options
-padding-bits : Padding size (in bits) to introduce between rows.
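For example, with padding-bits=128 (four f32 elements) a shared-memory tile might be rewritten roughly as follows; the shapes are illustrative, and accesses go through a view with the original logical shape:

```mlir
// Before: 32 rows of 32 f32 elements. The 128-byte row stride maps the same
// column of every row to the same bank on a typical 32-bank shared memory.
%a = memref.alloc() : memref<32x32xf32, #gpu.address_space<workgroup>>

// After: each row padded by 128 bits, with a subview restoring the logical shape.
%p = memref.alloc() : memref<32x36xf32, #gpu.address_space<workgroup>>
%v = memref.subview %p[0, 0] [32, 32] [1, 1]
     : memref<32x36xf32, #gpu.address_space<workgroup>>
       to memref<32x32xf32, strided<[36, 1]>, #gpu.address_space<workgroup>>
```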
-iree-codegen-gpu-reuse-shared-memory-allocs
Pass to reuse shared memory allocations with no overlapping liveness.
-iree-codegen-gpu-tensor-alloc
Pass to create allocations for some tensor values to use GPU shared memory
-iree-codegen-gpu-tensor-tile
Pass to tile tensor (linalg) ops within a GPU workgroup
Options
-distribute-to-subgroup : Distribute the workloads to subgroups if true, otherwise distribute to threads.
-iree-codegen-gpu-tensor-tile-to-serial-loops
Pass to tile reduction dimensions for certain GPU ops
Options
-coalesce-loops : Collapse the generated loops into a single loop
-iree-codegen-gpu-tile
Tile Linalg ops with tensor semantics to invocations
-iree-codegen-gpu-tile-and-convert-conv-to-matmul
Convert convolution to matmul using tiling.
-iree-codegen-gpu-tile-reduction
Pass to tile linalg reduction dimensions.
-iree-codegen-gpu-vector-alloc
Pass to create allocations for contraction inputs to copy to GPU shared memory
-iree-codegen-gpu-verify-distribution
Pass to verify writes before resolving distributed contexts.
-iree-codegen-reorder-workgroups
Reorder workgroup ids for better cache reuse
Options
-strategy : Workgroup reordering strategy, one of: '' (none), 'transpose'
-iree-codegen-vector-reduction-to-gpu
Convert vector reduction to GPU ops.
Options
-expand-subgroup-reduction : Lower subgroup reductions to gpu ops immediately where possible.