'iree_gpu' Dialect

A dialect for common functionality used by GPU-focused IREE code generation.

This dialect provides operations and attributes to aid in code generation for GPU targets. The functionality in this dialect can be hardware specific, but is intended to be independent of the lowering target. Late lowerings to SPIR-V/LLVM are handled separately.

Operations

iree_gpu.barrier_region (GPU::BarrierRegionOp)

Synchronizes uses of a shared tensor.

Syntax:

operation ::= `iree_gpu.barrier_region` (`ins` `(` $inputs^ `:` type($inputs) `)` )?
              $region attr-dict `:` type($results)

This op is designed to represent synchronization of workers on the operands and results of the given region. This operation naturally arises when combining the regions of producer-consumer scf.forall operations that share a mapping type.

For example, consider the following pair of parallel loops.

  %0 = scf.forall (%idy, %idx) in (2, 32) shared_outs(%init = %empty) -> (tensor<4x128xf32>) {
    %in = ...
    %2 = affine.apply affine_map<(d0) -> (d0 * 2)> (%idy)
    %3 = affine.apply affine_map<(d0) -> (d0 * 4)> (%idx)
    scf.forall.in_parallel {
      tensor.parallel_insert_slice %in into %init[%2, %3] [2, 4] [1, 1]
        : tensor<2x4xf32> into tensor<4x128xf32>
    }
  } {mapping = [#gpu.thread<y>, #gpu.thread<x>]}
  %1 = scf.forall (%idy, %idx) in (8, 8) -> (tensor<128x128xf32>) {
    %4 = affine.apply affine_map<(d0) -> (d0 * 16)> (%idx)
    %extracted_slice = tensor.extract_slice %0[0, %4] [4, 16] [1, 1]
      : tensor<4x128xf32> to tensor<4x16xf32>
    ...
  } {mapping = [#gpu.thread<y>, #gpu.thread<x>]}

Because these loops share the same worker type and total count, the bodies of these two loops can be merged, with a barrier, an insert_slice, and a shuffle placed where the boundary between the loops currently is.

  %0 = scf.forall (%idy, %idx) in (8, 8) -> (tensor<4x128xf32>) {
    %alloc = bufferization.alloc_tensor {memory_space = #gpu.address_space<workgroup>}
      : tensor<4x128xf32>
    %barrier = iree_gpu.barrier_region ins(%alloc : tensor<4x128xf32>) {
    ^bb0(%shared: tensor<4x128xf32>):
      %ids = affine.delinearize_index %idy * 8 + %idx to (2, 32) : index
      %in = ...
      %2 = affine.apply affine_map<(d0) -> (d0 * 2)> (%ids#0)
      %3 = affine.apply affine_map<(d0) -> (d0 * 4)> (%ids#1)
      %inserted_slice = tensor.insert_slice %in into %shared[%2, %3] [2, 4] [1, 1]
        : tensor<2x4xf32> into tensor<4x128xf32>
      iree_gpu.yield %inserted_slice : tensor<4x128xf32>
    } : tensor<4x128xf32>
    %4 = affine.apply affine_map<(d0) -> (d0 * 16)> (%idx)
    %slice = tensor.extract_slice %barrier[0, %4] [4, 16] [1, 1] : tensor<4x128xf32> to tensor<4x16xf32>
    ...
  } {mapping = [#gpu.thread<y>, #gpu.thread<x>]}

A barrier_region can be lowered to two barriers, one on the input operands and a second one on the results.
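
For instance, after bufferization the example above could lower to a structure like the following sketch (the memref-level ops, allocation, and indexing shown here are illustrative only, not the exact output of any particular pass):

  %alloc = memref.alloc() : memref<4x128xf32, #gpu.address_space<workgroup>>
  // Barrier on the input operands: previous uses of the shared buffer must be finished.
  gpu.barrier
  // Region body: each thread writes its 2x4 slice of the shared allocation.
  vector.transfer_write %in, %alloc[%2, %3] {in_bounds = [true, true]}
    : vector<2x4xf32>, memref<4x128xf32, #gpu.address_space<workgroup>>
  // Barrier on the results: all writes must be visible before they are read.
  gpu.barrier
  // Each consumer thread then reads its 4x16 slice of the synchronized buffer.
  %slice = vector.transfer_read %alloc[%c0, %4], %cst {in_bounds = [true, true]}
    : memref<4x128xf32, #gpu.address_space<workgroup>>, vector<4x16xf32>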

Motivation and Intended Use Cases:

The primary way this op is generated is when fusing parallel loops with tensor results. This operation helps make lowerings more progressive and flexible.
- Lowering directly to an alloc plus reads and writes breaks the dependency chain, making transformations like barrier placement and pipelining potentially more difficult.
- It allows the option of non-vector-based lowering paths.

Traits: AlwaysSpeculatableImplTrait, SingleBlockImplicitTerminator<mlir::iree_compiler::IREE::GPU::YieldOp>, SingleBlock

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Operands:
Operand Description
inputs variadic of any type
Results:
Result Description
results variadic of any type

iree_gpu.multi_mma (GPU::MultiMmaOp)

Models a contraction of multiple mma operations

Syntax:

operation ::= `iree_gpu.multi_mma` $lhs `,` $rhs `,` $acc attr-dict
              `:` type($lhs) `,` type($rhs) `into` type($acc)

Computes the sum of inner MMA operations along a set of outer dimensions. Logically this matches closely with a vector.contraction operation; however, the combiner type is a specific intrinsic rather than a generic combiner type.

Similar to vector.contraction, an iterator type attribute list must be specified, where each element of the list represents an iterator over one of the outer dimensions. Iteration of inner dimensions is defined solely by the intrinsic and may be opaque.

An indexing map attribute list must be specified with an entry for lhs, rhs and acc arguments. An indexing map attribute specifies a mapping from each outer loop iterator in the iterator type list, to each dimension of each operand.

The combiner type is defined by the intrinsic.

Example:

#contraction_accesses = [
 affine_map<(i, j, k) -> (i, k)>,
 affine_map<(i, j, k) -> (k, j)>,
 affine_map<(i, j, k) -> (i, j)>
]
#contraction_trait = {
  indexing_maps = #contraction_accesses,
  iterator_types = ["parallel", "parallel", "reduction"],
  kind = #iree_gpu.mma_layout<MFMA_F32_16x16x16_F16>
}
%3 = iree_gpu.multi_mma %0, %1, %2 #contraction_trait
  : vector<2x3x4xf16>, vector<3x5x4xf16> into vector<2x5x4xf32>

// Takes tensors as well, however the inner dimensions must always be
// static.
%7 = iree_gpu.multi_mma %4, %5, %6 #contraction_trait
  : tensor<?x?x4xf16>, tensor<?x?x4xf16> into tensor<?x?x4xf32>

The example above can be logically lowered directly to loops like this (ignoring type conversions from tensor to vector needed for the mfma).

%outer_m = tensor.dim %6, %c0 : index
%outer_n = tensor.dim %6, %c1 : index
%outer_k = tensor.dim %4, %c1 : index
%7 = scf.for %i = %c0 to %outer_m iter_args(%arg0 = %6) {
  %8 = scf.for %j = %c0 to %outer_n iter_args(%arg1 = %arg0) {
    %9 = scf.for %k = %c0 to %outer_k iter_args(%arg2 = %arg1) {
      %lhs = tensor.extract_slice %4 [%i, %k, 0] [1, 1, 4] [1, 1, 1] : tensor<4xf16>
      %rhs = tensor.extract_slice %5 [%k, %j, 0] [1, 1, 4] [1, 1, 1] : tensor<4xf16>
      %acc = tensor.extract_slice %arg2 [%i, %j, 0] [1, 1, 4] [1, 1, 1] : tensor<4xf32>
      %res = amdgpu.mfma %lhs, %rhs, %acc : tensor<4xf32>
      %ret = tensor.insert_slice %res into %arg2 [%i, %j, 0] [1, 1, 4] [1, 1, 1] : tensor<?x?x4xf32>
      scf.yield %ret : tensor<?x?x4xf32>
    }
    scf.yield %9 : tensor<?x?x4xf32>
  }
  scf.yield %8 : tensor<?x?x4xf32>
}

Or, alternatively, unrolled to a single intrinsic when operating on vectors.

#contraction_accesses = [
 affine_map<() -> ()>,
 affine_map<() -> ()>,
 affine_map<() -> ()>
]
#contraction_trait = {
  indexing_maps = #contraction_accesses,
  iterator_types = [],
  kind = #iree_gpu.mma_layout<MFMA_F32_16x16x16_F16>
}
%3 = iree_gpu.multi_mma %0, %1, %2 #contraction_trait
  : vector<4xf16>, vector<4xf16> into vector<4xf32>

This operation can represent an intrinsic both in subgroup/warp and distributed (thread) abstractions through the intrinsic attribute interface. It does so semi-opaquely by including optional permutations of each MMA fragment with respect to the "canonical" MNK row major matrix multiply.

Since the canonical dimensionality of the inner dimensions is somewhat intrinsic-specific, verification of this op requires only that element counts of the inner dimensions match the intrinsic.

For example, an MMT product of inner dimensions with warp semantics can be represented with the following. Permutations are only allowed for ops with subgroup semantics and must be resolved before distribution.

#contraction_accesses = [
 affine_map<(i, j, k) -> (i, k)>,
 affine_map<(i, j, k) -> (k, j)>,
 affine_map<(i, j, k) -> (i, j)>
]
#contraction_trait = {
  indexing_maps = #contraction_accesses,
  iterator_types = ["parallel", "parallel", "reduction"],
  kind = #iree_gpu.mma_layout<MFMA_F32_16x16x16_F16>,
  rhs_permutation = [1, 0]
}
%7 = iree_gpu.multi_mma %4, %5, %6 #contraction_trait
  : tensor<?x?x16x16xf16>, tensor<?x?x16x16xf16> into tensor<?x?x16x16xf32>

Motivation, Design Choices, and Pitfalls

The idea behind this operation is to decouple the layout setting/tiling required to target certain intrinsics from the lowering to them. Tiling of this sort typically happens on tensor operands, while the target intrinsics operate on vectors, so this operation is used to bridge the gap. The choice of a shared operation is intended to ease the lowering process and allow for different transformations at different stages of the pipeline without needing to essentially clone this op.

The choice to let the inner dimensions required to compute the intrinsic be implicit based on the indexing maps was made to make this operation easier to generate and to skip the need for type conversion ops. However, this comes at the expense of ease of verification for the operation. It is also implicitly linked to a lane-level parent scf.forall operation.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TilingInterface, VectorUnrollOpInterface

Effects: MemoryEffects::Effect{}

Attributes:
Attribute MLIR Type Description
indexing_maps ::mlir::ArrayAttr array attribute
iterator_types ::mlir::ArrayAttr Iterator type should be an enum.
kind IREE::GPU::MmaInterfaceAttr buffer-like constant attribute values
lhs_permutation ::mlir::DenseI64ArrayAttr i64 dense array attribute
rhs_permutation ::mlir::DenseI64ArrayAttr i64 dense array attribute
acc_permutation ::mlir::DenseI64ArrayAttr i64 dense array attribute
Operands:
Operand Description
lhs ranked tensor or vector of any type values
rhs ranked tensor or vector of any type values
acc ranked tensor or vector of any type values
Results:
Result Description
result ranked tensor or vector of any type values

iree_gpu.value_barrier (GPU::ValueBarrierOp)

Synchronizes workers on a value semantic tensor or vector.

Syntax:

operation ::= `iree_gpu.value_barrier` $inputs attr-dict `:` type($inputs)

This operation acts as a barrier on value-semantic SSA values (tensors or vectors). It takes multiple operands and produces a value equivalent to each input. This does not have copy and/or data movement semantics; it simply represents a barrier on all writes in the tensor case, and a barrier until all threads acquire the input vector in the vector case.

The inputs must be either all tensors, or all vectors.

This operation is a no-op when not present in a parallel context. This operation is pure as it only requires synchronization for the value it produces.
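
For example, a minimal use following the syntax above (the shapes are illustrative; all inputs must be tensors or all must be vectors):

%synced = iree_gpu.value_barrier %write : tensor<4x128xf32>
%v0_sync, %v1_sync = iree_gpu.value_barrier %v0, %v1 : vector<16xf32>, vector<16xf32>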

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Operands:
Operand Description
inputs variadic of ranked tensor or vector of any type values
Results:
Result Description
results variadic of ranked tensor or vector of any type values

iree_gpu.yield (GPU::YieldOp)

Yield values from a region

Syntax:

operation ::= `iree_gpu.yield` attr-dict ($values^ `:` type($values))?

This operation is used to yield values from within a region.

Traits: AlwaysSpeculatableImplTrait, HasParent<::mlir::iree_compiler::IREE::GPU::BarrierRegionOp>, ReturnLike, Terminator

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface

Effects: MemoryEffects::Effect{}

Operands:
Operand Description
values variadic of any type

Attributes

ComputeBitwidthsAttr

Supported bitwidths for compute

Syntax:

#iree_gpu.compute_bitwidths<
  ::mlir::iree_compiler::IREE::GPU::ComputeBitwidths   # value
>

Enum cases: * fp64 (FP64) * fp32 (FP32) * fp16 (FP16) * int64 (Int64) * int32 (Int32) * int16 (Int16) * int8 (Int8)

Parameters:
Parameter C++ type Description
value ::mlir::iree_compiler::IREE::GPU::ComputeBitwidths an enum of type ComputeBitwidths

DataTiledMMAAttr

Syntax:

#iree_gpu.data_tiled_mma_layout<
  ::mlir::iree_compiler::IREE::GPU::MMAIntrinsicAttr,   # intrinsic
  int64_t,   # unroll_m
  int64_t,   # subgroups_m
  int64_t,   # unroll_n
  int64_t,   # subgroups_n
  int64_t   # unroll_k
>

This mma variant represents MMA ops with data-tiling details. The |intrinsic| field specifies which particular MMA intrinsic is targeted by the data-tiling.

The other fields default to one. With those defaults, the attribute describes a single intrinsic, equivalent to MMAAttr; values greater than one describe wider "kernels" consisting of multiple intrinsics, with the data layout already swizzled into a tile layout that lets each intrinsic access its data at an offset that is as simple a mapping from the thread ID as possible.
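
As an illustration, a hypothetical instance that unrolls the MFMA_F32_16x16x16_F16 intrinsic twice along M on the same thread and distributes N across four subgroups could be written as follows (parameter names follow the syntax above; unspecified fields keep their default of one, and the exact printed form may differ):

#iree_gpu.data_tiled_mma_layout<intrinsic = MFMA_F32_16x16x16_F16,
                                unroll_m = 2, subgroups_n = 4>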

Parameters:
Parameter C++ type Description
intrinsic ::mlir::iree_compiler::IREE::GPU::MMAIntrinsicAttr
unroll_m int64_t Unrolling along the M dimension, on the same thread.
subgroups_m int64_t Unrolling along the M dimension, distributed across this many more threads.
unroll_n int64_t Unrolling along the N dimension, on the same thread.
subgroups_n int64_t Unrolling along the N dimension, distributed across this many more threads.
unroll_k int64_t Unrolling along the K dimension, on the same thread, with interleaved layout.

DerivedThreadConfigAttr

drive lowering of an operation by deriving thread distribution when needed.

Syntax: #iree_gpu.derived_thread_config

Lowering config for a single thread tiling level that is inferred after previous (often reduction) levels of tile + fuse. This is intended for fused operations where it is much easier to compute the tile sizes to use after previous levels of tile + fuse, rather than trying to pre-propagate tiling configs.
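
As a hypothetical sketch of where such a config could end up, a fused elementwise op might be tagged with it so that its thread-level tile sizes are derived late, after the surrounding reduction tiling (the op, shapes, and attribute placement are illustrative):

%fused = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]}
    ins(%in : tensor<64x64xf32>) outs(%out : tensor<64x64xf32>)
    attrs = {lowering_config = #iree_gpu.derived_thread_config} {
  ^bb0(%a: f32, %b: f32):
    %e = math.exp %a : f32
    linalg.yield %e : f32
  } -> tensor<64x64xf32>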

DotProductOpsAttr

Supported dot product ops

Syntax:

#iree_gpu.dotproduct_ops<
  ::mlir::iree_compiler::IREE::GPU::DotProductOps   # value
>

Enum cases: * none (None) * dp4xi8toi32 (DP4xI8ToI32)

Parameters:
Parameter C++ type Description
value ::mlir::iree_compiler::IREE::GPU::DotProductOps an enum of type DotProductOps

GPUPipelineOptionsAttr

GPU pipeline options attribute.

Syntax:

#iree_gpu.pipeline_options<
  BoolAttr,   # prefetch_shared_memory
  BoolAttr,   # no_reduce_shared_memory_bank_conflicts
  BoolAttr,   # use_igemm_convolution
  ReorderWorkgroupsStrategyAttr   # reorder_workgroups_strategy
>

This attribute describes lowering-pipeline-specific configuration options:
- prefetch_shared_memory: Boolean option indicating whether or not to run the loop prefetching pass in the lowering pipeline.
- no_reduce_shared_memory_bank_conflicts: Boolean option indicating whether or not to skip the bank conflict reduction pass in the lowering pipeline.
- reorder_workgroups_strategy: Enum attribute indicating which strategy to choose for the workgroup reordering pass. Options are None and Transpose.
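
For example, a hypothetical options attribute that enables prefetching, keeps bank-conflict reduction on, and requests transposed workgroup reordering could be written as (field names follow the syntax above; the exact printed form may differ):

#iree_gpu.pipeline_options<prefetch_shared_memory = true,
                           no_reduce_shared_memory_bank_conflicts = false,
                           reorder_workgroups_strategy = Transpose>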

Parameters:
Parameter C++ type Description
prefetch_shared_memory BoolAttr
no_reduce_shared_memory_bank_conflicts BoolAttr
use_igemm_convolution BoolAttr
reorder_workgroups_strategy ReorderWorkgroupsStrategyAttr

IteratorTypeAttr

Iterator type

Syntax:

#iree_gpu.iterator_type<
  ::mlir::utils::IteratorType   # value
>

Enum cases: * parallel (parallel) * reduction (reduction)

Parameters:
Parameter C++ type Description
value ::mlir::utils::IteratorType an enum of type IteratorType

LaneIdAttr

Syntax:

#iree_gpu.lane_id<
  int64_t   # dim
>

An attribute for mapping scf.forall ops to subgroup lanes.
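
For example, it can appear as the mapping attribute of a lane-level scf.forall (a hypothetical sketch with the body elided):

scf.forall (%lane) in (64) {
  ...
} {mapping = [#iree_gpu.lane_id<0>]}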

Parameters:
Parameter C++ type Description
dim int64_t

LoweringConfigAttr

drive lowering of an operation for gpu compilation.

Syntax:

#iree_gpu.lowering_config<
  DictionaryAttr   # attributes
>

GPU-specific implementation of a lowering config. This carries just a dictionary attribute to store any relevant fields. This is the simplest form of a lowering config, offering flexibility at the cost of structure.
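
A hypothetical instance storing workgroup and reduction tile sizes as dictionary entries could look like the following (the key names are illustrative; the attribute itself only prescribes a dictionary):

#iree_gpu.lowering_config<{workgroup = [64, 64, 0], reduction = [0, 0, 4]}>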

Parameters:
Parameter C++ type Description
attributes DictionaryAttr The configured fields, including tiling levels

MMAAttr

Attribute describing a particular shape of matrix-multiply and accumulate instruction. Abstractly, all attributes of this type represent the following unit of arithmetic for matrices A, B, and C.

  C += A x B

Where the shape of matrix A is [m, k], B is [k, n], and C is [m, n]. This intentionally leaves the layout information abstract and uses interface methods to materialize layout information only when needed. The shape of the mma intrinsic is stored explicitly here as that information is queried frequently.

The element types for this particular mma intrinsic are |aType|, |bType|, and |cType| for matrices A, B, and C respectively.


This mma variant describes configurations for MMA ops. The |intrinsic| field specifies which particular MMA intrinsic this refers to, with each intrinsic implicating a specific MNK shape and operand types. See IREEGPUEnums.td for the definition of the intrinsics.
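
For example, the multi_mma examples above select such an intrinsic as their kind:

kind = #iree_gpu.mma_layout<MFMA_F32_16x16x16_F16>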

Parameters:
Parameter C++ type Description
intrinsic MMAIntrinsicAttr
mSize int64_t
nSize int64_t
kSize int64_t
aType ::mlir::Type
bType ::mlir::Type
cType ::mlir::Type

MMAIntrinsicAttr

Descriptor for different MMA intrinsics

Syntax:

#iree_gpu.mma_intrinsic<
  ::mlir::iree_compiler::IREE::GPU::MMAIntrinsic   # value
>

Enum cases: * MFMA_F32_16x16x4_F32 (MFMA_F32_16x16x4_F32) * MFMA_F32_16x16x16_F16 (MFMA_F32_16x16x16_F16) * MFMA_F32_32x32x8_F16 (MFMA_F32_32x32x8_F16) * MFMA_I32_16x16x16_I8 (MFMA_I32_16x16x16_I8) * MFMA_I32_32x32x8_I8 (MFMA_I32_32x32x8_I8) * MFMA_F32_16x16x8_BF16 (MFMA_F32_16x16x8_BF16) * MFMA_F32_32x32x4_BF16 (MFMA_F32_32x32x4_BF16) * MFMA_F64_16x16x4_F64 (MFMA_F64_16x16x4_F64) * MFMA_F32_16x16x16_BF16 (MFMA_F32_16x16x16_BF16) * MFMA_F32_32x32x8_BF16 (MFMA_F32_32x32x8_BF16) * MFMA_F32_16x16x32_F8E5M2FNUZ (MFMA_F32_16x16x32_F8E5M2FNUZ) * MFMA_F32_16x16x32_F8E5M2FNUZ_F8E4M3FNUZ (MFMA_F32_16x16x32_F8E5M2FNUZ_F8E4M3FNUZ) * MFMA_F32_16x16x32_F8E4M3FNUZ (MFMA_F32_16x16x32_F8E4M3FNUZ) * MFMA_F32_16x16x32_F8E4M3FNUZ_F8E5M2FNUZ (MFMA_F32_16x16x32_F8E4M3FNUZ_F8E5M2FNUZ) * MFMA_F32_32x32x16_F8E5M2FNUZ (MFMA_F32_32x32x16_F8E5M2FNUZ) * MFMA_F32_32x32x16_F8E5M2FNUZ_F8E4M3FNUZ (MFMA_F32_32x32x16_F8E5M2FNUZ_F8E4M3FNUZ) * MFMA_F32_32x32x16_F8E4M3FNUZ (MFMA_F32_32x32x16_F8E4M3FNUZ) * MFMA_F32_32x32x16_F8E4M3FNUZ_F8E5M2FNUZ (MFMA_F32_32x32x16_F8E4M3FNUZ_F8E5M2FNUZ) * MFMA_I32_16x16x32_I8 (MFMA_I32_16x16x32_I8) * MFMA_I32_32x32x16_I8 (MFMA_I32_32x32x16_I8) * WMMA_F32_16x16x16_F16 (WMMA_F32_16x16x16_F16) * WMMA_F16_16x16x16_F16 (WMMA_F16_16x16x16_F16) * WMMA_F32_16x16x16_BF16 (WMMA_F32_16x16x16_BF16) * WMMA_BF16_16x16x16_BF16 (WMMA_BF16_16x16x16_BF16) * WMMA_I32_16x16x16_I8 (WMMA_I32_16x16x16_I8) * NV_WMMA_F32_16x16x16_F16 (NV_WMMA_F32_16x16x16_F16) * NV_WMMA_F16_16x16x16_F16 (NV_WMMA_F16_16x16x16_F16)

Parameters:
Parameter C++ type Description
value ::mlir::iree_compiler::IREE::GPU::MMAIntrinsic an enum of type MMAIntrinsic

MMAOpsArrayAttr

Syntax:

#iree_gpu.mma_ops<
  ::llvm::ArrayRef<MMAAttr>   # value
>
Parameters:
Parameter C++ type Description
value ::llvm::ArrayRef<MMAAttr>

MMAScheduleAttr

Syntax:

#iree_gpu.mma_schedule<
  ::mlir::iree_compiler::IREE::GPU::MmaInterfaceAttr,   # intrinsic
  int64_t,   # subgroup_m_count
  int64_t   # subgroup_n_count
>

A schedule consisting of an MMA intrinsic instruction and various levels of tile sizes used to solve a specific contraction problem.
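
A hypothetical schedule pairing one intrinsic with a 2x2 subgroup distribution could be written as (parameter names follow the syntax above; the exact printed form may differ):

#iree_gpu.mma_schedule<intrinsic = #iree_gpu.mma_layout<MFMA_F32_16x16x16_F16>,
                       subgroup_m_count = 2, subgroup_n_count = 2>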

Parameters:
Parameter C++ type Description
intrinsic ::mlir::iree_compiler::IREE::GPU::MmaInterfaceAttr
subgroup_m_count int64_t
subgroup_n_count int64_t

ReorderWorkgroupsStrategyAttr

Strategy for workgroup reordering

Syntax:

#iree_gpu.reorder_workgroups_strategy<
  ::mlir::iree_compiler::IREE::GPU::ReorderWorkgroupsStrategy   # value
>

Enum cases: * None (None) * Transpose (Transpose)

Parameters:
Parameter C++ type Description
value ::mlir::iree_compiler::IREE::GPU::ReorderWorkgroupsStrategy an enum of type ReorderWorkgroupsStrategy

StorageBitwidthsAttr

Supported bitwidths for storage

Syntax:

#iree_gpu.storage_bitwidths<
  ::mlir::iree_compiler::IREE::GPU::StorageBitwidths   # value
>

Enum cases: * b64 (B64) * b32 (B32) * b16 (B16) * b8 (B8)

Parameters:
Parameter C++ type Description
value ::mlir::iree_compiler::IREE::GPU::StorageBitwidths an enum of type StorageBitwidths

SubgroupOpsAttr

Supported subgroup ops

Syntax:

#iree_gpu.subgroup_ops<
  ::mlir::iree_compiler::IREE::GPU::SubgroupOps   # value
>

Enum cases: * none (None) * shuffle (Shuffle) * arithmetic (Arithmetic)

Parameters:
Parameter C++ type Description
value ::mlir::iree_compiler::IREE::GPU::SubgroupOps an enum of type SubgroupOps

TargetAttr

Full GPU target attribute

Syntax:

#iree_gpu.target<
  ::llvm::StringRef,   # arch
  ::llvm::StringRef,   # features
  TargetWgpAttr,   # wgp
  TargetChipAttr   # chip
>

This attribute describes a full GPU target. It contains a few fields:
- The canonical target architecture for compilation, e.g., sm_80 for CUDA, gfx942 for HIP.
- A TargetWgpAttr describing the GPU features and limits in a single GPU workgroup processor (WGP), that is, an AMD compute unit or an NVIDIA streaming multiprocessor.
- An optional TargetChipAttr describing GPU features for the final chip or product, e.g., the WGP count.
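
A heavily elided, illustrative instance (the wgp and chip bodies are omitted here; real targets carry the full TargetWgpAttr and optional TargetChipAttr described below):

#iree_gpu.target<arch = "gfx942", features = "", wgp = <...>, chip = <...>>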

Parameters:
Parameter C++ type Description
arch ::llvm::StringRef target architecture
features ::llvm::StringRef target features
wgp TargetWgpAttr
chip TargetChipAttr

TargetChipAttr

Chip level target description

Syntax:

#iree_gpu.target_chip<
  uint32_t,   # wgp_count
  DictionaryAttr   # extra
>

This attribute contains hardware features/limits at a single GPU chip level. Here a GPU chip means the hardware functionality scope where the whole software compute grid is scheduled onto. A chip typically contains many AMD compute units or NVIDIA streaming multiprocessors; it's the final SKU.

Parameters:
Parameter C++ type Description
wgp_count uint32_t
extra DictionaryAttr

TargetWgpAttr

Workgroup processor level target description

Syntax:

#iree_gpu.target_wgp<
  ComputeBitwidthsAttr,   # compute
  StorageBitwidthsAttr,   # storage
  SubgroupOpsAttr,   # subgroup
  DotProductOpsAttr,   # dot
  MMAOpsArrayAttr,   # mma
  DenseI32ArrayAttr,   # subgroup_size_choices
  DenseI32ArrayAttr,   # max_workgroup_sizes
  int32_t,   # max_thread_count_per_workgroup
  int32_t,   # max_workgroup_memory_bytes
  DenseI32ArrayAttr,   # max_workgroup_counts
  std::optional<int32_t>,   # max_load_instruction_bits
  std::optional<int32_t>,   # simds_per_wgp
  std::optional<int32_t>,   # vgpr_space_bits
  DictionaryAttr   # extra
>

This attribute contains hardware features/limits at a single GPU workgroup processor (WGP) level. Here a GPU workgroup processor means the basic hardware functionality unit where a software workgroup is scheduled onto; that is, a compute unit for AMD GPUs or a streaming multiprocessor for NVIDIA GPUs.

Parameters:
Parameter C++ type Description
compute ComputeBitwidthsAttr
storage StorageBitwidthsAttr
subgroup SubgroupOpsAttr
dot DotProductOpsAttr
mma MMAOpsArrayAttr
subgroup_size_choices DenseI32ArrayAttr
max_workgroup_sizes DenseI32ArrayAttr
max_thread_count_per_workgroup int32_t
max_workgroup_memory_bytes int32_t
max_workgroup_counts DenseI32ArrayAttr
max_load_instruction_bits std::optional<int32_t>
simds_per_wgp std::optional<int32_t>
vgpr_space_bits std::optional<int32_t>
extra DictionaryAttr

VirtualMMAAttr

Syntax:

#iree_gpu.virtual_mma_layout<
  ::mlir::iree_compiler::IREE::GPU::VirtualMMAIntrinsicAttr   # intrinsic
>

This mma variant represents "virtual" MMA ops whose native layouts are modified by unrolling along K and/or interleaving reads. The |intrinsic| field selects among the kinds of "virtual" MMA ops that have proven helpful.

These interleaving and/or unrolling changes to the layout are especially useful for coalescing reads from shared memory to registers or for aligning layouts in a chained-matmul operation.
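
For example, an instance selecting one of the virtual intrinsics listed below (the parameter name follows the syntax above; the printed form may differ):

#iree_gpu.virtual_mma_layout<intrinsic = VMFMA_F32_16x16x32_F16>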

Parameters:
Parameter C++ type Description
intrinsic ::mlir::iree_compiler::IREE::GPU::VirtualMMAIntrinsicAttr

VirtualMMAIntrinsicAttr

Descriptor for different Virtual MMA intrinsics

Syntax:

#iree_gpu.virtual_mma_intrinsic<
  ::mlir::iree_compiler::IREE::GPU::VirtualMMAIntrinsic   # value
>

Enum cases: * VMFMA_F32_16x16x32_F16 (VMFMA_F32_16x16x32_F16) * VMFMA_F32_32x32x16_F16 (VMFMA_F32_32x32x16_F16) * VMFMA_F32_16x16x32_F8E4M3FNUZ (VMFMA_F32_16x16x32_F8E4M3FNUZ) * VMFMA_F32_32x32x16_F8E4M3FNUZ (VMFMA_F32_32x32x16_F8E4M3FNUZ)

Parameters:
Parameter C++ type Description
value ::mlir::iree_compiler::IREE::GPU::VirtualMMAIntrinsic an enum of type VirtualMMAIntrinsic

Enums

ComputeBitwidths

Supported bitwidths for compute

Cases:

Symbol Value String
FP64 1 fp64
FP32 2 fp32
FP16 4 fp16
Int64 8 int64
Int32 16 int32
Int16 32 int16
Int8 64 int8

DotProductOps

Supported dot product ops

Cases:

Symbol Value String
None 0 none
DP4xI8ToI32 1 dp4xi8toi32

MMAFragment

Descriptor for a particular fragment of an MMA operation

Cases:

Symbol Value String
Lhs 0 Lhs
Rhs 1 Rhs
Acc 2 Acc

MMAIntrinsic

Descriptor for different MMA intrinsics

Cases:

Symbol Value String
MFMA_F32_16x16x4_F32 4112 MFMA_F32_16x16x4_F32
MFMA_F32_16x16x16_F16 4128 MFMA_F32_16x16x16_F16
MFMA_F32_32x32x8_F16 4129 MFMA_F32_32x32x8_F16
MFMA_I32_16x16x16_I8 4288 MFMA_I32_16x16x16_I8
MFMA_I32_32x32x8_I8 4289 MFMA_I32_32x32x8_I8
MFMA_F32_16x16x8_BF16 4384 MFMA_F32_16x16x8_BF16
MFMA_F32_32x32x4_BF16 4385 MFMA_F32_32x32x4_BF16
MFMA_F64_16x16x4_F64 4352 MFMA_F64_16x16x4_F64
MFMA_F32_16x16x16_BF16 4640 MFMA_F32_16x16x16_BF16
MFMA_F32_32x32x8_BF16 4641 MFMA_F32_32x32x8_BF16
MFMA_F32_16x16x32_F8E5M2FNUZ 4656 MFMA_F32_16x16x32_F8E5M2FNUZ
MFMA_F32_16x16x32_F8E5M2FNUZ_F8E4M3FNUZ 4657 MFMA_F32_16x16x32_F8E5M2FNUZ_F8E4M3FNUZ
MFMA_F32_16x16x32_F8E4M3FNUZ 4658 MFMA_F32_16x16x32_F8E4M3FNUZ
MFMA_F32_16x16x32_F8E4M3FNUZ_F8E5M2FNUZ 4659 MFMA_F32_16x16x32_F8E4M3FNUZ_F8E5M2FNUZ
MFMA_F32_32x32x16_F8E5M2FNUZ 4660 MFMA_F32_32x32x16_F8E5M2FNUZ
MFMA_F32_32x32x16_F8E5M2FNUZ_F8E4M3FNUZ 4661 MFMA_F32_32x32x16_F8E5M2FNUZ_F8E4M3FNUZ
MFMA_F32_32x32x16_F8E4M3FNUZ 4662 MFMA_F32_32x32x16_F8E4M3FNUZ
MFMA_F32_32x32x16_F8E4M3FNUZ_F8E5M2FNUZ 4663 MFMA_F32_32x32x16_F8E4M3FNUZ_F8E5M2FNUZ
MFMA_I32_16x16x32_I8 4800 MFMA_I32_16x16x32_I8
MFMA_I32_32x32x16_I8 4801 MFMA_I32_32x32x16_I8
WMMA_F32_16x16x16_F16 6176 WMMA_F32_16x16x16_F16
WMMA_F16_16x16x16_F16 6177 WMMA_F16_16x16x16_F16
WMMA_F32_16x16x16_BF16 6178 WMMA_F32_16x16x16_BF16
WMMA_BF16_16x16x16_BF16 6179 WMMA_BF16_16x16x16_BF16
WMMA_I32_16x16x16_I8 6336 WMMA_I32_16x16x16_I8
NV_WMMA_F32_16x16x16_F16 8224 NV_WMMA_F32_16x16x16_F16
NV_WMMA_F16_16x16x16_F16 8225 NV_WMMA_F16_16x16x16_F16

MMAScope

Descriptor for a particular scope of an MMA operation

Cases:

Symbol Value String
Workgroup 0 Workgroup
Subgroup 1 Subgroup

ReorderWorkgroupsStrategy

Strategy for workgroup reordering

Cases:

Symbol Value String
None 0 None
Transpose 1 Transpose

StorageBitwidths

Supported bitwidths for storage

Cases:

Symbol Value String
B64 1 b64
B32 2 b32
B16 4 b16
B8 8 b8

SubgroupOps

Supported subgroup ops

Cases:

Symbol Value String
None 0 none
Shuffle 1 shuffle
Arithmetic 2 arithmetic

TilingLevel

Descriptor for tiling levels for GPU lowering configs

Cases:

Symbol Value String
Workgroup 0 Workgroup
Reduction 1 Reduction
PartialReduction 2 PartialReduction
Thread 3 Thread
Subgroup 4 Subgroup
Lane 5 Lane

VirtualMMAIntrinsic

Descriptor for different Virtual MMA intrinsics

Cases:

Symbol Value String
VMFMA_F32_16x16x32_F16 0 VMFMA_F32_16x16x32_F16
VMFMA_F32_32x32x16_F16 1 VMFMA_F32_32x32x16_F16
VMFMA_F32_16x16x32_F8E4M3FNUZ 2 VMFMA_F32_16x16x32_F8E4M3FNUZ
VMFMA_F32_32x32x16_F8E4M3FNUZ 3 VMFMA_F32_32x32x16_F8E4M3FNUZ

IteratorType

Iterator type

Cases:

Symbol Value String
parallel 0 parallel
reduction 1 reduction