2026

July 23, 2026
in Performance
8 min read

Running models on RISC-V with IREE

IREE compiles machine learning models to native RISC-V CPU code, with support for the RISC-V Vector extension (RVV), hand-written microkernels, and data-tiling. This post walks through the full flow for a model: importing it from PyTorch, compiling it for a RISC-V target, running it, and benchmarking the result.

All commands below run under qemu-riscv64. The flow on real hardware is the same, except that the tools run natively on the target instead of under QEMU.

May 28, 2026
in Performance
10 min read

Virtual Dense MFMAs for Skinny GEMM

We call a GEMM A * B = C skinny when A has few rows and many columns. The decode phase of LLM inference is a common source of this problem: a small batch of tokens multiplies against a large weight matrix. Skinny GEMMs are a worse fit for modern GPU architectures than their non-skinny cousins. One reason is that modern GPUs rely on matrix core units, whose instructions are designed specifically for matrix multiplication and operate on fixed tile sizes, and a skinny GEMM has too few rows to fill those tiles.
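To make the tile-filling problem concrete, here is a minimal Python sketch. It assumes a hypothetical matrix instruction whose tile has 16 rows and that A is padded up to a whole number of tiles. The helper name `tile_row_utilization` and the 16-row default are illustrative assumptions, not part of any library.

```python
import math


def tile_row_utilization(rows: int, tile_rows: int = 16) -> float:
    """Fraction of tile rows that hold real data.

    Assumes (for illustration only) that the `rows` rows of A are padded
    up to a whole number of tiles of `tile_rows` rows each.
    """
    if rows <= 0 or tile_rows <= 0:
        raise ValueError("rows and tile_rows must be positive")
    tiles = math.ceil(rows / tile_rows)
    return rows / (tiles * tile_rows)


if __name__ == "__main__":
    # Small decode batches leave most of each tile's rows as padding.
    for rows in (1, 4, 8, 16, 64):
        print(f"rows={rows:3d}  utilization={tile_row_utilization(rows):.2%}")
```

Under these assumptions, a batch of 4 tokens uses only a quarter of each tile's rows, while a batch of 16 or 64 fills them completely.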

On AMD GPUs, and in particular on the MI3XX Instinct (CDNA) series, these instructions are known as MFMA instructions; for example, V_MFMA_F32_16x16x16_F16. One useful part of the name is the MxNxK tile shape it consumes, where M is the number of rows of the left-hand matrix, N is the number of columns of the right-hand matrix, and K is the dimension the two matrices share.
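As a small illustration of reading the tile shape out of a name, the Python sketch below extracts M, N, and K from a mnemonic such as the one above. The helper `parse_mfma_shape` is hypothetical, and its regular expression assumes only that the name contains an `MxNxK` group preceded by an underscore, as in this example; it is not an official decoder for the instruction set.

```python
import re


def parse_mfma_shape(mnemonic: str) -> tuple[int, int, int]:
    """Return (M, N, K) from a mnemonic like V_MFMA_F32_16x16x16_F16.

    Assumption: the name contains an underscore followed by three integers
    separated by 'x' (case-insensitive), matching the example in the text.
    """
    match = re.search(r"_(\d+)x(\d+)x(\d+)", mnemonic, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"no MxNxK tile shape found in {mnemonic!r}")
    m, n, k = (int(group) for group in match.groups())
    return m, n, k


if __name__ == "__main__":
    m, n, k = parse_mfma_shape("V_MFMA_F32_16x16x16_F16")
    # An MxK tile times a KxN tile performs M * N * K multiply-accumulates.
    print(f"M={m} N={n} K={k}  multiply-accumulates per tile={m * n * k}")
```

For the example mnemonic, the tile multiplies a 16x16 block of the left-hand matrix by a 16x16 block of the right-hand matrix.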