Virtual Dense MFMAs for Skinny GEMM
When we have a GEMM A * B = C in which A has a
small number of rows and many columns, we classify the problem as a skinny
GEMM. The decode phase of LLM inference is a common source of this problem: a
small batch of tokens multiplies against a large weight matrix. Skinny GEMMs are
less convenient for modern GPU architectures than their non-skinny cousins. One
reason is that modern GPUs rely on matrix core units, which offer
instructions that are specifically designed for matrix multiplication and
operate on fixed tile sizes, and skinny GEMMs are too small to fill those
tiles.
On AMD GPUs, and in particular on the MI3XX Instinct (CDNA) series, these
instructions are known as MFMA instructions; for example,
V_MFMA_F32_16x16x16_F16. One useful part of the name is the MxNxK
tile shape consumed, where M is the number of rows of the left-hand matrix,
N is the number of columns of the right-hand matrix, and K is the dimension
the two matrices share.
For the ordinary dense GEMM MFMA path available on the AMDGPU CDNA series, the
relevant 16-bit and 8-bit MFMAs have at least 16 rows in M. Consider M=8, which
is larger than the shapes IREE handles on its GEMV-like path, but
smaller than 16. The previous codegen path in IREE handled this by padding the
workgroup M tile to 16 and using the ordinary dense MFMA configuration. The IR
snippet below shows this
directly: the logical M=8 operation is configured with padding = [16, ...],
a dense mma_layout, and a workgroup tile of 16 rows.
IR with padding

    %10 = linalg.generic {
      indexing_maps = [
        affine_map<(d0, d1, d2) -> (d0, d2)>,
        affine_map<(d0, d1, d2) -> (d1, d2)>,
        affine_map<(d0, d1, d2) -> (d0, d1)>
      ],
      iterator_types = ["parallel", "parallel", "reduction"]
    } ins(%6, %7 : tensor<8x16384xf16>, tensor<13312x16384xf16>)
      outs(%9 : tensor<8x13312xf32>)
      attrs = {
        lowering_config = #iree_gpu.lowering_config<{
          mma_kind = #iree_gpu.mma_layout<MFMA_F32_16x16x16_F16>,
          padding = [16, 64, 128],
          promote_operands = [0, 1],
          reduction = [0, 0, 8],
          subgroup = [1, 2, 0],
          workgroup = [16, 64, 0]
        }>
      } {
      ...
    }
Padding is simple and robust, but it wastes cycles on rows that are not present in the original matrix. The question is whether we can use the 16 physical rows of the hardware instruction more effectively.
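To put a number on the waste, here is a minimal sketch of the row utilization after padding. The helper name `useful_row_fraction` is hypothetical and only illustrates the arithmetic; it is not part of IREE.

```python
def useful_row_fraction(logical_m: int, tile_m: int) -> float:
    """Fraction of MFMA tile rows that carry real data once M is padded up to tile_m."""
    padded_m = -(-logical_m // tile_m) * tile_m  # round M up to a multiple of tile_m
    return logical_m / padded_m


# M=8 padded to the 16-row dense MFMA tile: half of the rows are padding.
print(useful_row_fraction(8, 16))  # 0.5
```

For the M=8 case in this post, only half of each dense 16-row tile does useful work.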
Removing Padding with Sparse MFMA
AMD sparse MFMA instructions, V_SMFMAC, are matrix-core accumulate
instructions for a 4:2 structured-sparse A matrix and a dense B matrix. The
old D value is the accumulator, and the encoded third source is sparse index
metadata, not a separate C matrix operand. The 4:2 structured-sparse
operand is defined along K: in each group of four K positions, the sparse
index metadata tells the
instruction which two positions are non-zero.
On CDNA3/gfx942, the relevant sparse instruction has the same physical 16x16
output tile and the same cycle count as its dense counterpart. For
F16/BF16, dense V_MFMA_F32_16X16X16_F16 and sparse
V_SMFMAC_F32_16X16X32_F16 are both 16-cycle instructions on gfx942. For
8-bit inputs, the analogous 16-cycle sparse instruction is 16x16x64.
The idea, described in the Hugging Face
MI300 kernel article, is to make
two sparse rows represent one dense row. One lane selects positions
{0, 1} in each group of four. Its paired lane selects positions {2, 3}.
Together, the two lanes cover the dense K positions for one logical row. The
benefit, in addition to removing padding, is that a 16-cycle sparse
instruction covers twice the logical K depth of the corresponding dense
16-cycle F16/BF16 MFMA.

Figure: "Using sparsity for skinny inputs" from Creating custom kernels for the AMD MI300.
After the sparse MFMA, the four-element native accumulator contains pairs of
partial sums for the same logical rows. The lowering adds those pairs together,
so the result again has the normal dense M=8 meaning.
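The reason the pair sum recovers the dense result can be checked with plain arithmetic. The sketch below is a scalar model, not hardware behavior: one "row" keeps only positions {0, 1} of every group of four K values, its partner keeps {2, 3}, and the two partial dot products add up to the dense dot product. All function names are illustrative.

```python
def dense_dot(a_row, b_col):
    """Ordinary dot product over the full K range."""
    return sum(a * b for a, b in zip(a_row, b_col))


def sparse_row_dot(a_row, b_col, keep):
    """Dot product where only positions in `keep` (within each group of four K values) are non-zero."""
    return sum(a * b for k, (a, b) in enumerate(zip(a_row, b_col)) if k % 4 in keep)


a_row = list(range(1, 17))              # 16 K values of one logical row of A
b_col = [3 * k + 1 for k in range(16)]  # matching column of B

even_partial = sparse_row_dot(a_row, b_col, keep={0, 1})  # even lane's sparse row
odd_partial = sparse_row_dot(a_row, b_col, keep={2, 3})   # odd lane's sparse row

# The two structured-sparse partials sum to the dense result.
assert even_partial + odd_partial == dense_dot(a_row, b_col)
```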
Original Hugging Face Approach
On the standard path to an MFMA instruction, data moves through
global memory -> LDS/shared memory -> registers -> MFMA instruction*.
In the original Hugging Face skinny GEMM kernel, data from matrix A is shuffled
on the way into LDS. The shuffle is necessary to satisfy the requirements of the
sparse trick. If we were to use even lanes to select positions {0,1} and odd
lanes to select positions {2,3}, then for a load with 8 contiguous elements along
K:
K0 K1 K2 K3 K4 K5 K6 K7
we would want even lanes to hold:
K0 K1 K4 K5
and odd lanes to hold:
K2 K3 K6 K7
In other words, the data loaded from LDS looks exactly like:
lane 0: K0 K1 _ _ K4 K5 _ _
lane 1: _ _ K2 K3 _ _ K6 K7
Together (as an even/odd pair), and across all threads in the subgroup, these precisely reconstruct the original dense rows. After the loop over the inner K tile, the partial sums are reduced to yield the full dense result.
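The even/odd split described above can be sketched as a small packing routine. This is an illustration of the data arrangement only, with a hypothetical function name; it does not reproduce the Hugging Face kernel's actual LDS addressing.

```python
def split_for_lane_pair(k_values):
    """Split contiguous K values into the packed data held by an even/odd lane pair.

    Within each group of four, the even lane keeps positions {0, 1} and the
    odd lane keeps positions {2, 3}.
    """
    if len(k_values) % 4 != 0:
        raise ValueError("K length must be a multiple of 4")
    even, odd = [], []
    for start in range(0, len(k_values), 4):
        group = k_values[start:start + 4]
        even.extend(group[0:2])
        odd.extend(group[2:4])
    return even, odd


even, odd = split_for_lane_pair([f"K{i}" for i in range(8)])
print(even)  # ['K0', 'K1', 'K4', 'K5']
print(odd)   # ['K2', 'K3', 'K6', 'K7']
```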
* Shared-memory hierarchy note
This path is a simplified picture. The actual shared-memory hierarchy has more detail than is useful for the VDMFMA discussion; refer to the AMDGPU ISA documentation for the full memory hierarchy and instruction-level behavior.
Adaptation in IREE as VDMFMA
The HF kernel makes A sparse-trick "friendly" before we read it from shared
memory. If IREE wanted to materialize that shuffled A form as a
compiler-owned tensor or storage layout, the natural existing mechanism would be
data tiling: attach an encoding, carry the encoded tensor type through the
producer/consumer boundary, and materialize the layout change with
packing/unpacking or other physical layout operations when needed. That is the
model described in IREE's data-tiling path. In the
GPU data-tiling path, encoded contractions reach
#iree_gpu.data_tiled_mma_layout on iree_codegen.inner_tiled.
Instead, we take advantage of "virtual" MMAs in IREE. A virtual MMA is used in
the same way as a real MFMA intrinsic, but its lowering is composed of, or is
a modification of, ordinary MFMAs.
#iree_gpu.virtual_mma_layout is an MMA/inner-tile descriptor: it supplies the
semantic tile shape, distributed thread layout, and target lowering, while the
promoted/shared-memory layouts remain unchanged. The
subgroup-level MMA lowering keeps A as is when loaded from LDS and performs a
per-lane shuffle of the B matrix register data. Choosing to shuffle B in
registers keeps this part local to the virtual MMA; shuffling A into LDS would
also need a matching promotion/read layout for that operand. The final assembly
uses ds_read2_b64 LDS reads, which incidentally load
twice as much data from LDS as the HF kernel.
VDMFMA gives us flexibility and keeps the sparse trick from becoming a
skinny-only tensor layout. The current selector still uses it conservatively:
only when the problem's total
M fits in the virtual M=8 tile and total K is divisible by the VDMFMA
selection tile. The abstraction itself, however, is an 8-row virtual MMA, not an
encoded storage format for an entire matmul. A future selector could tile a larger
multiple-of-8 M problem into VDMFMA-sized pieces.
Concretely, we represent VDMFMA in the following form:
#iree_gpu.virtual_mma_layout<VDMFMA_F32_8x16x64x2_F16>
Read this as a dense 8x16x64 virtual operation with F16 inputs and F32
accumulation. The trailing x2 says that, on the
CDNA3 F16 path, the virtual operation lowers to two native sparse MFMA
instructions along K.
At the virtual MMA level, each lane sees dense fragments:
A : vector<8xf16>
B : vector<16xf16>
Acc : vector<2xf32>
The sparse instruction wants a different physical view:
A : vector<4xf16>
B : vector<8xf16>
Acc/D : vector<4xf32>
SparseIndex : vector<4xi8>
VDMFMA is the adapter between these two views. It expands the accumulator,
chooses sparse metadata from lane parity, slices A and B, shuffles the per-lane
B register fragment, issues the sparse MFMAs, and collapses the accumulator
back to the dense virtual shape.
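The per-lane vector lengths in both views follow from dividing each tile's stored elements across the 64-lane subgroup. The sketch below is a sanity check of that arithmetic; the helper name and the `stored_fraction` parameter are assumptions made for illustration, with (2, 4) standing for an operand that stores two of every four K values.

```python
SUBGROUP_SIZE = 64  # CDNA3 subgroup (wavefront) size


def elements_per_lane(rows, cols, stored_fraction=(1, 1)):
    """Elements each lane holds when a rows x cols tile is spread over the subgroup.

    stored_fraction is (kept, out_of); (2, 4) models the structured-sparse A
    operand, where only two of every four K values are physically stored.
    """
    kept, out_of = stored_fraction
    stored = rows * cols * kept // out_of
    assert stored % SUBGROUP_SIZE == 0, "tile must divide evenly across lanes"
    return stored // SUBGROUP_SIZE


# Virtual dense 8x16x64 view.
virtual = {
    "A": elements_per_lane(8, 64),
    "B": elements_per_lane(64, 16),
    "Acc": elements_per_lane(8, 16),
}
# Native sparse 16x16x32 view.
native = {
    "A": elements_per_lane(16, 32, stored_fraction=(2, 4)),
    "B": elements_per_lane(32, 16),
    "Acc": elements_per_lane(16, 16),
}
print(virtual)  # {'A': 8, 'B': 16, 'Acc': 2}
print(native)   # {'A': 4, 'B': 8, 'Acc': 4}
```

The virtual A and B fragments are exactly twice the native ones, which matches the two sparse instructions along K, while the accumulator goes from two to four elements because the native tile has 16 physical rows.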
For one lane pair, the two instructions can be visualized as follows. The K
numbering below is the numbering in the dense per-lane fragment after
distribution. -- marks A positions that are implied zero for that physical
sparse row. The non-zero A samples are packed, and sparse index metadata maps
them back to positions within each K group of four.

                        first smfmac                         second smfmac
    sparse indices      0   1   2   3   | 0   1   2   3      0   1   2   3   | 0   1   2   3
    L0, selector 0x44   K0  K1  --  --  | K2  K3  --  --     K4  K5  --  --  | K6  K7  --  --
    L1, selector 0xEE   --  --  K8  K9  | --  --  K10 K11    --  --  K12 K13 | --  --  K14 K15
    B after shuffle     B0  B1  B8  B9  | B2  B3  B10 B11    B4  B5  B12 B13 | B6  B7  B14 B15
The corresponding shuffle indices in the lowering are:
first smfmac B shuffle: [0, 1, 8, 9, 2, 3, 10, 11]
second smfmac B shuffle: [4, 5, 12, 13, 6, 7, 14, 15]
The lowering may thus be logically represented as:
acc = [d0, d1] -> [d0, 0, d1, 0]
sparse_index = (lane_id & 1) ? 0xEE : 0x44
acc = smfmac(A[0:4], shuffle(B, [0, 1, 8, 9, 2, 3, 10, 11]), acc, sparse_index)
acc = smfmac(A[4:8], shuffle(B, [4, 5, 12, 13, 6, 7, 14, 15]), acc, sparse_index)
acc = [d0, d1, d2, d3] -> [d0 + d1, d2 + d3]
The accumulator conversions are wrapped in util.hoistable_conversion. In
IREE, this marks temporary marshaling between the layout used by inner_tiled
and the layout expected by the target intrinsic, so matching conversions can be
moved out of loops or canceled when the surrounding IR permits it. For VDMFMA,
that marshaling expands the logical two-element accumulator into the
four-element SMFMAC form before the sparse MFMA chain, then collapses the native
accumulator back by summing lane-pair partials.
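The logical lowering can be checked with a scalar model. The sketch below is a toy, not the hardware lane layout: `toy_smfmac_row` stands in for one physical sparse row times one B column, and `decode_selector` assumes each nibble of the selector packs two 2-bit positions (low bits first), an encoding chosen because it makes 0x44 select {0, 1} and 0xEE select {2, 3} as in the diagram. All function names are hypothetical.

```python
FIRST_B_SHUFFLE = [0, 1, 8, 9, 2, 3, 10, 11]
SECOND_B_SHUFFLE = [4, 5, 12, 13, 6, 7, 14, 15]


def decode_selector(selector):
    """Assumed encoding: one nibble per group of four, two 2-bit positions per nibble."""
    nibbles = (selector & 0xF, (selector >> 4) & 0xF)
    return [(n & 0x3, (n >> 2) & 0x3) for n in nibbles]


def shuffle(values, indices):
    return [values[i] for i in indices]


def toy_smfmac_row(a_packed, b_groups, acc, selector):
    """Scalar stand-in: 4 packed A values against 2 groups of 4 B values."""
    for g, positions in enumerate(decode_selector(selector)):
        for j, pos in enumerate(positions):
            acc += a_packed[2 * g + j] * b_groups[4 * g + pos]
    return acc


def toy_vdmfma(a_rows, b_col, acc):
    """a_rows: two logical rows of 16 K values; b_col: 16 K values; acc: [d0, d1]."""
    native = [acc[0], 0, acc[1], 0]  # expand: [d0, d1] -> [d0, 0, d1, 0]
    for phys in range(4):
        logical, parity = divmod(phys, 2)
        # Even lane holds the lower 8 K values of the row, odd lane the upper 8.
        a_lane = a_rows[logical][8 * parity:8 * parity + 8]
        selector = 0xEE if parity else 0x44
        native[phys] = toy_smfmac_row(
            a_lane[0:4], shuffle(b_col, FIRST_B_SHUFFLE), native[phys], selector)
        native[phys] = toy_smfmac_row(
            a_lane[4:8], shuffle(b_col, SECOND_B_SHUFFLE), native[phys], selector)
    # collapse: [d0, d1, d2, d3] -> [d0 + d1, d2 + d3]
    return [native[0] + native[1], native[2] + native[3]]


a_rows = [list(range(16)), list(range(16, 32))]
b_col = [2 * k + 1 for k in range(16)]
acc = [5, -3]
expected = [acc[r] + sum(a * b for a, b in zip(a_rows[r], b_col)) for r in range(2)]
assert toy_vdmfma(a_rows, b_col, acc) == expected
```

In this model, the B shuffle is what lines up each lane's packed A values with the B values at the positions its selector picks, so the even and odd partials together cover all 16 K positions exactly once.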
Virtual MMA Layout in VDMFMA
The virtual MMA layout uses MMASingleSubgroupLayout, so it is worth unpacking
the terminology.
A single subgroup layout describes how one operand of one subgroup-level matrix
operation is distributed across lanes in IREE. More precisely, it maps a lane id
and a per-lane vector element index to semantic operand dimensions such as M,
N, and K. For each semantic operand dimension, it has:
- outer: outer repetitions of element tiles in the logical per-thread operand vector;
- thread: the logical thread grid over all dimensions;
- tstrides: the lane-id stride for moving by one element tile along that dimension;
- element: the contiguous logical element tile within that vector.
For each dimension, outer[i] * thread[i] * element[i] is the semantic tile
size. For the F16 VDMFMA LHS, IREE uses:
outer = {1, 1}
thread = {8, 4}
tstrides = {2, 16}
element = {1, 16}
The semantic dimensions are M and K, so this is an 8x64 LHS tile:
1 * 8 * 1 = 8 rows and 1 * 4 * 16 = 64 reduction elements. The thread-grid
part can be visualized as adjacent lane pairs over the 8x4 M/K grid:

         K thread coordinate
         0          1          2          3
    M0   T0, T1     T16, T17   T32, T33   T48, T49
    M1   T2, T3     T18, T19   T34, T35   T50, T51
    M2   T4, T5     T20, T21   T36, T37   T52, T53
    M3   T6, T7     T22, T23   T38, T39   T54, T55
    M4   T8, T9     T24, T25   T40, T41   T56, T57
    M5   T10, T11   T26, T27   T42, T43   T58, T59
    M6   T12, T13   T28, T29   T44, T45   T60, T61
    M7   T14, T15   T30, T31   T46, T47   T62, T63
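The tile size and the grid above can be reproduced from the four layout fields. This is a minimal sketch that assumes the thread coordinate along each dimension is `(lane // tstride) % thread`; that formula is an assumption chosen because it matches the grid shown here, and the function names are hypothetical rather than IREE APIs.

```python
OUTER = (1, 1)
THREAD = (8, 4)
TSTRIDES = (2, 16)
ELEMENT = (1, 16)


def tile_shape(outer, thread, element):
    """Semantic tile size per dimension: outer[i] * thread[i] * element[i]."""
    return tuple(o * t * e for o, t, e in zip(outer, thread, element))


def thread_coords(lane):
    """(M, K) thread-grid coordinate of a lane, assuming (lane // stride) % thread."""
    return tuple((lane // s) % t for s, t in zip(TSTRIDES, THREAD))


print(tile_shape(OUTER, THREAD, ELEMENT))  # (8, 64)

# Group the 64 lanes by grid coordinate; each cell ends up with one lane pair.
grid = {}
for lane in range(64):
    grid.setdefault(thread_coords(lane), []).append(lane)

print(grid[(0, 1)])  # [16, 17]
print(grid[(7, 3)])  # [62, 63]
```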
For ordinary layouts, prod(outer) * prod(element) is the actual per-lane
vector length. Here, the product of thread is 32, while the CDNA3 subgroup
size is 64, so lanes 2p and 2p+1 share the same
logical M/K thread-grid coordinates. IREE then splits the divisible element
dimension, K, so lane 2p
receives the lower 8 elements of the 16-wide K element tile and lane 2p+1
receives the upper 8. Each lane therefore holds 8 LHS elements rather than 16.
The RHS and accumulator layouts have thread products of
64, so their logical thread-grid positions already match the physical lanes.
This is the layout-side part that gives VDMFMA the "virtual dense" behavior: the
compiler still distributes a dense 8x64 LHS tile, but the physical lanes are
grouped so that each even/odd lane pair owns the two dense halves that the sparse
instruction trick will reinterpret.
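To make the lane-pair split concrete, the sketch below maps each lane to the row and the K elements it would own in the dense 8x64 LHS tile. It assumes the K thread coordinate selects a contiguous chunk of 16 elements and that lane parity selects the lower or upper half of that chunk, as described above; `lhs_fragment` is a hypothetical helper, not an IREE function.

```python
def lhs_fragment(lane):
    """Return (row, K range) owned by a lane in the dense 8x64 LHS tile."""
    row = (lane // 2) % 8        # tstride 2, thread 8 along M
    k_thread = (lane // 16) % 4  # tstride 16, thread 4 along K
    half = lane % 2              # even lane: lower 8, odd lane: upper 8
    start = k_thread * 16 + half * 8
    return row, range(start, start + 8)


print(lhs_fragment(0))   # (0, range(0, 8))
print(lhs_fragment(1))   # (0, range(8, 16))
print(lhs_fragment(17))  # (0, range(24, 32))

# Every element of the 8x64 tile is owned by exactly one lane.
owned = [(row, k) for lane in range(64)
         for row, ks in [lhs_fragment(lane)] for k in ks]
assert len(owned) == 8 * 64 and len(set(owned)) == 8 * 64
```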
Selecting VDMFMA
VDMFMA is not selected for every matmul. IREE has multiple codegen pipelines,
and the one relevant to skinny GEMM shapes is
TileAndFuse. TileAndFuse derives VDMFMA candidates from the target's concrete
MFMA capabilities. On the CDNA3 F16 path, the
virtual VDMFMA_F32_8x16x64x2_F16 candidate is derived from
MFMA_F32_16x16x16_F16.
There is one tuning detail that is easy to miss. Since sparse MFMAs have twice the K depth of dense MFMAs, the compute phase is shorter than the padded dense MFMA sequence it replaces. In a software-pipelined loop, that can reduce the amount of compute available to hide the next tile's memory latency. The final selection change scales the reduction tile count by the virtual intrinsic's K unroll factor to compensate for the shorter compute phase.
With VDMFMA selected for the same shape, the new IR excerpt
has no M=16 padding. The workgroup M tile is 8, and the MMA kind is the
virtual layout.
IR with VDMFMA

    %10 = linalg.generic {
      indexing_maps = [
        affine_map<(d0, d1, d2) -> (d0, d2)>,
        affine_map<(d0, d1, d2) -> (d1, d2)>,
        affine_map<(d0, d1, d2) -> (d0, d1)>
      ],
      iterator_types = ["parallel", "parallel", "reduction"]
    } ins(%6, %7 : tensor<8x16384xf16>, tensor<13312x16384xf16>)
      outs(%9 : tensor<8x13312xf32>)
      attrs = {
        lowering_config = #iree_gpu.lowering_config<{
          mma_kind =
            #iree_gpu.virtual_mma_layout<VDMFMA_F32_8x16x64x2_F16>,
          promote_operands = [0, 1],
          reduction = [0, 0, 4],
          subgroup = [1, 2, 0],
          workgroup = [8, 64, 0]
        }>
      } {
      ...
    }
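A rough reading of the two lowering configs shows how much K each reduction step covers. The sketch assumes that each reduction entry counts intrinsic K tiles, which is consistent with the padded config's K padding of 128 = 8 * 16; that interpretation and the helper name are assumptions, not statements about IREE's implementation.

```python
TOTAL_K = 16384  # K of the 8x13312x16384 problem in both IR excerpts


def k_elements_per_step(reduction_tiles, intrinsic_k):
    """K elements consumed per reduction-loop step, assuming tiles count intrinsic K."""
    return reduction_tiles * intrinsic_k


dense_step = k_elements_per_step(8, 16)    # reduction = [0, 0, 8], MFMA K = 16
virtual_step = k_elements_per_step(4, 64)  # reduction = [0, 0, 4], VDMFMA K = 64

print(dense_step, TOTAL_K // dense_step)      # 128 128
print(virtual_step, TOTAL_K // virtual_step)  # 256 64
```

Under this reading, each VDMFMA loop step covers twice as much K as the padded dense configuration, which fits the description of scaling the reduction tile to lengthen the compute phase.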
Performance
The first end-to-end 16-bit selection change reported the following numbers on CDNA3, compared with the padded dense baseline:
| Shape | VDMFMA | Baseline | Improvement |
|---|---|---|---|
| f16_8x13312x16384 | 189 us | 206 us | +8.3% |
| f16_8x13312x8192 | 117 us | 116 us | - |
| f16_8x2304x16384 | 133 us | 138 us | +3.6% |
| f16_8x2304x8192 | 103 us | 110 us | +6.4% |
| f16_8x6656x16384 | 127 us | 130 us | +2.3% |
| f16_8x6656x8192 | 102 us | 109 us | +6.4% |
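The improvement column appears to be the time saved relative to the baseline, (baseline - VDMFMA) / baseline. That formula is inferred from the numbers rather than stated in the original report; the sketch below recomputes it from the reported timings.

```python
# (VDMFMA us, baseline us) as reported in the table above.
timings_us = {
    "f16_8x13312x16384": (189, 206),
    "f16_8x13312x8192": (117, 116),
    "f16_8x2304x16384": (133, 138),
    "f16_8x2304x8192": (103, 110),
    "f16_8x6656x16384": (127, 130),
    "f16_8x6656x8192": (102, 109),
}


def improvement_pct(vdmfma_us, baseline_us):
    """Percentage of baseline time saved; negative means VDMFMA was slower."""
    return (baseline_us - vdmfma_us) / baseline_us * 100


for shape, (vdmfma_us, baseline_us) in timings_us.items():
    print(f"{shape}: {improvement_pct(vdmfma_us, baseline_us):+.1f}%")
```

The one shape marked "-" in the table comes out slightly negative under this formula, that is, roughly on par with the baseline.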
Conclusion
VDMFMA is a small compiler abstraction around a target-specific instruction
mapping. This is represented in the IR as a "virtual dense" 8x16xK MMA.
The generated code for the F16 kernel above uses paired ds_read2_b64 LDS reads
to form dense per-lane fragments; the virtual MMA lowering then uses lane
parity, B register shuffling, sparse MFMA instructions and accumulator
reduction to fulfill the conditions of the sparse trick for skinny GEMMs. At
configuration time, it is currently selected only for skinny shapes where the
total M fits within the virtual M=8 tile and total K
is divisible by the VDMFMA selection tile. The result is an end-to-end
adaptation of a hand-written HIP optimization into IREE's AMDGPU codegen
pipeline.