# IREE / MLIR / Linalg tutoriallink

## Introductionlink

This tutorial is simultaneously about IREE, MLIR, and specifically the MLIR Linalg dialect.

### What is MLIR?link

MLIR is a programming language, but MLIR in itself is almost just an empty shell. What it really provides is a framework allowing to define MLIR dialects which are where the features come from.

The "IR" part of the MLIR name stands for "intermediate representation". It means that MLIR is meant to be primarily for compiler-internal representations of code. But MLIR is actually fairly nice for humans to work with, and it's not hard to hand-author some MLIR programs from scratch. That is exactly the topic of this tutorial.

The "ML" part of the MLIR name stands for "multi-level" (not machine learning!). It means that MLIR allows for multiple dialects to be freely mixed in the same MLIR programs. Each dialect can define operations, types and attributes, and each single MLIR statement can mix ops, types and attributes coming from different dialects.

### What is the Linalg dialect?link

Linalg is a MLIR dialect that
essentially consists of a single op, `linalg.generic`

, with most other ops in
this dialect being just convenience aliases for special cases of
`linalg.generic`

. So, to describe Linalg dialect is essentially to describe
`linalg.generic`

.

The point of this is that this single op, `linalg.generic`

, is:

- General enough to express the entirety of usual machine learning workloads in any quantization scheme at all.
- High-level enough to be lowered to efficient code for any target (CPU, GPU, ...)
- Designed to be a good fit for compiler IR-to-IR transformations.

These traits make the Linalg dialect an ideal "middle-end" IR for a machine learning compiler.

### What is IREE?link

IREE is a MLIR compiler and runtime that can lower MLIR programs through successive, ever lower-level dialects, ultimately producing machine code for various CPU, GPU and other hardware targets. Check out the Developer overview docs and the ML frameworks docs.

Front-ends can ingest source programs from various machine-learning frameworks into MLIR Linalg dialect. Boundaries are in flux, but it is a good enough mental model to think of anything up to Linalg as "front-end". One example is, for ingesting PyTorch programs, the front-end is torch-mlir and end-users are encouraged to use iree-turbine, which integrates IREE, torch-mlir and PyTorch.

This tutorial is only concerned about the Linalg dialect, and we are going to learn to hand-author some Linalg programs. The point of the above tangent about front-ends is to make it clear that no matter which way you feed a program into IREE, it will internally be rewritten into a Linalg program, because that really is the intermediate representation in this compiler.

### Getting IREE binarieslink

IREE builds can be downloaded or installed as Python packages or built from sources.

## First linalg programslink

Before we start: there is also an official Linalg tutorial. It takes a different approach compared to the present tutorial, so the two are complementary.

### Static-shape, element-wise addition of two 1D arrayslink

Here is our first Linalg function. The scalar type used in this program, `f32`

,
is 32-bit floating-point.

Notice some elements of MLIR syntax:

- The
`%`

prefix on an identifier indicates a SSA value, like here`%result`

. - The
`@`

prefix on an identifier indicates a function, like here`@foo`

. - The
`^`

prefix on an identifier indicates a block, like here`^bb0`

. - The
`#`

prefix on an identifier indicates an attribute alias, like here`#map_1d_identity`

. - The
`x`

letter is used as delimiter in shapes, and between the shape and the element type, like here`10xf32`

meaning a 1D shape of size 10 with element type`f32`

. - Operations have the form
`dialect.name`

. For example,`tensor.empty`

is the`empty`

operation within the`tensor`

dialect, and`func.func`

is the`func`

operation within the`func`

dialect.

```
// The 1D identity map, used below.
#map_1d_identity = affine_map<(m) -> (m)>
// Define a function @foo taking two tensor arguments `%lhs` and `%rhs` and returning a tensor.
func.func @foo(
%lhs : tensor<10xf32>,
%rhs : tensor<10xf32>
) -> tensor<10xf32> {
// A constant used below.
%c0f32 = arith.constant 0.0 : f32
// Create a result "init value". Think of it as an abstract "allocation",
// creating a tensor but not giving its elements any particular value. It would be
// undefined behavior to read any element from this tensor.
%result_empty = tensor.empty() : tensor<10xf32>
// Perform the computation. The following is all a single linalg.generic op.
%result = linalg.generic {
// This {...} section is the "attributes" - some compile-time settings for this op.
indexing_maps=[
// Indexing maps for the parameters listed in `ins(...)`
#map_1d_identity,
#map_1d_identity,
// Indexing maps for the parameters listed in `outs(...)`
#map_1d_identity
],
// There is one tensor dimension, and it's a parallel-iteration dimension,
// meaning that it occurs also as a result tensor dimension. The alternative
// would be "reduction", for dimensions that do not occur in the result tensor.
iterator_types=["parallel"]
} // End of the attributes for this linalg.generic. Next come the parameters:
// `ins` is where we pass regular input-parameters
ins(%lhs, %rhs : tensor<10xf32>, tensor<10xf32>)
// `outs` is where we pass the "outputs", but that term has a subtle meaning
// in linalg. Here we are passing a tensor.empty, meaning just a placeholder
// for the output with no preexisting element values. In other examples with
// an accumulator, this is where the accumulator would be passed.
outs(%result_empty : tensor<10xf32>)
// End of parameters. The next {...} part is the "code block".
{
// bb0 is a code block taking one scalar from each input tensor as argument, and
// computing and "yielding" (ie returning) the corresponding output tensor element.
^bb0(%lhs_entry : f32, %rhs_entry : f32, %unused_result_entry : f32):
%add = arith.addf %lhs_entry, %rhs_entry : f32
linalg.yield %add : f32
} // End of the basic block. Finally, we describe the return type.
-> tensor<10xf32>
// End of the linalg.generic op.
// Return the function's return value.
return %result : tensor<10xf32>
}
```

Compile it like this:

```
iree-compile --iree-hal-target-backends=llvm-cpu prog.mlir -o /tmp/prog.vmfb
```

Note

These are just minimalist `iree-compile`

flags for running on CPU without
trying to maximize performance.

- To run on GPU or other non-CPU targets, explore other values for
`--iree-hal-target-backends=`

. You will then need to pass a matching`--device=`

to`iree-run-module`

below. - To cross-compile, explore
`--iree-llvmcpu-target-triple=`

. - To enable higher CPU performance by enabling CPU features:
- On x86, explore
`--iree-llvmcpu-target-cpu=`

(e.g.`--iree-llvmcpu-target-cpu=znver4`

to target AMD Zen4). - On other architectures, explore
`--iree-llvmcpu-target-cpu-features=`

. - To optimize for running on the same machine that the compilation ran
on, pass
`--iree-llvmcpu-target-cpu=host`

. That works regardless of CPU architecture.

- On x86, explore
- Check out
these docs for
more useful
`iree-compile`

flags.

Run it like this:

```
$ iree-run-module --module=/tmp/prog.vmfb \
--input=10xf32=[0,1,2,3,4,5,6,7,8,9] \
--input=10xf32=[90,80,70,60,50,40,30,20,10,0]
EXEC @foo
result[0]: hal.buffer_view
10xf32=90 81 72 63 54 45 36 27 18 9
```

Here, each `--input`

parameter specifies one input. First its shape and element
type, `10xf32`

, then the example array elements in `[...]`

brackets. The output
of `iree-run-module`

above shows the contents of the result.

### Dynamic-shape, element-wise addition of two 1D arrayslink

While we are going to mostly focus on static shapes for simplicity in the rest of this tutorial, let us give one dynamic-shape example to at least show that that's not a problem. Here is the dynamic-shape equivalent of the previous example.

```
#map_1d_identity = affine_map<(m) -> (m)>
func.func @foo(
%lhs : tensor<?xf32>,
%rhs : tensor<?xf32>
) -> tensor<?xf32> {
%c0f32 = arith.constant 0.0 : f32
%c0 = arith.constant 0 : index
%size = tensor.dim %lhs, %c0 : tensor<?xf32>
%result_empty = tensor.empty(%size) : tensor<?xf32>
%result = linalg.generic {
indexing_maps=[
// Indexing maps for the parameters listed in `ins(...)`
#map_1d_identity,
#map_1d_identity,
// Indexing maps for the parameters listed in `outs(...)`
#map_1d_identity
],
iterator_types=["parallel"]
} ins(%lhs, %rhs : tensor<?xf32>, tensor<?xf32>)
outs(%result_empty : tensor<?xf32>)
{
^bb0(%lhs_entry : f32, %rhs_entry : f32, %unused_result_entry : f32):
%add = arith.addf %lhs_entry, %rhs_entry : f32
linalg.yield %add : f32
}
-> tensor<?xf32>
return %result : tensor<?xf32>
}
```

This program can be compiled and run exactly like the previous one, except that
now the `iree-run-module`

command may specify inputs of arbitrary length. The
only requirement is that both inputs have the same length, otherwise the
`linalg.generic`

will have undefined behavior.

```
$ iree-compile --iree-hal-target-backends=llvm-cpu prog.mlir -o /tmp/prog.vmfb
$ iree-run-module --module=/tmp/prog.vmfb \
--input=10xf32=[0,1,2,3,4,5,6,7,8,9] \
--input=10xf32=[90,80,70,60,50,40,30,20,10,0]
EXEC @foo
result[0]: hal.buffer_view
10xf32=90 81 72 63 54 45 36 27 18 9
```

### Passing one of the inputs in `outs`

link

Here is a more concise variant achieving the same result in fewer lines of code,
and giving us a first taste of that that `outs(...)`

parameters list can do. We
didn't want to show it first, because it's less idiomatic. `outs`

will only
become really necessary (and idiomatic) when we will look at `reduction`

iterators. In the previous examples, we had only passed a `tensor.empty`

placeholder for `outs`

. This new example shows that we can actually pass there
any of the inputs that are shaped like the result.

```
#map_1d_identity = affine_map<(m) -> (m)>
func.func @foo(
%lhs : tensor<10xf32>,
%rhs : tensor<10xf32>
) -> tensor<10xf32> {
%result = linalg.generic {
indexing_maps=[
// Indexing maps for the parameters listed in `ins(...)`
#map_1d_identity,
// Indexing maps for the parameters listed in `outs(...)`
#map_1d_identity
],
iterator_types=["parallel"]
} ins(%lhs : tensor<10xf32>)
outs(%rhs : tensor<10xf32>)
{
^bb0(%lhs_entry : f32, %rhs_entry : f32):
%add = arith.addf %lhs_entry, %rhs_entry : f32
linalg.yield %add : f32
}
-> tensor<10xf32>
return %result : tensor<10xf32>
}
```

```
$ iree-compile --iree-hal-target-backends=llvm-cpu prog.mlir -o /tmp/prog.vmfb
$ iree-run-module --module=/tmp/prog.vmfb \
--input=10xf32=[0,1,2,3,4,5,6,7,8,9] \
--input=10xf32=[90,80,70,60,50,40,30,20,10,0]
EXEC @foo
result[0]: hal.buffer_view
10xf32=90 81 72 63 54 45 36 27 18 9
```

### A first `reduction`

example: summing a 1D arraylink

This function takes a 1D array of floats and returns their sum. `tensor<f32>`

is a 0-dimensional tensor type. We could as well extract the single `f32`

element and return that, but we wanted to make this example as simple as
possible.

What's subtle here is how the `bb0`

block in the `linalg.generic`

now actively
uses the `%result_entry`

as an operand to `arith.addf`

, yielding the result of
this addition on every iteration. Implicitly, this stores the result of that
addition to the destination, from where it is re-loaded on the next iteration
again as `%result_entry`

. So the SSA value `%result_entry`

has a different
value on each iteration.

Because the values from the `outs`

parameter are now actually used, we can't
directly pass there the `tensor.empty`

, whose elements are uninitialized. We
have to initialize the result entries as zeroes, which is achieved by the
`linalg.fill`

.

```
#map_1d_identity = affine_map<(m) -> (m)>
#map_1d_proj_0d = affine_map<(m) -> ()>
func.func @foo(
%input : tensor<10xf32>) -> tensor<f32> {
%result_empty = tensor.empty() : tensor<f32>
%cst_0 = arith.constant 0.0 : f32
%result_init = linalg.fill ins(%cst_0 : f32) outs(%result_empty : tensor<f32>) -> tensor<f32>
%result = linalg.generic {
indexing_maps=[
// Indexing maps for the parameters listed in `ins(...)`
#map_1d_identity,
// Indexing maps for the parameters listed in `outs(...)`
#map_1d_proj_0d
],
iterator_types=["reduction"]
} ins(%input : tensor<10xf32>)
outs(%result_init : tensor<f32>)
{
^bb0(%input_entry : f32, %result_entry : f32):
%add = arith.addf %input_entry, %result_entry : f32
linalg.yield %add : f32
}
-> tensor<f32>
return %result : tensor<f32>
}
```

```
$ iree-compile --iree-hal-target-backends=llvm-cpu prog.mlir -o /tmp/prog.vmfb
$ iree-run-module --module=/tmp/prog.vmfb --input=10xf32=[0,1,2,3,4,5,6,7,8,9]
EXEC @foo
result[0]: hal.buffer_view
f32=45
```

### Combining `parallel`

and `reduction`

iterators: summing each row of a 2D array.link

This is our first 2D example so for the first time we have to start explaining
how the `iterator_types`

are enumerated and we start seeing some more
interesting examples of `affine_map`

.

```
#map_2d_identity = affine_map<(m, n) -> (m, n)>
#map_2d_proj_first = affine_map<(m, n) -> (m)>
func.func @foo(
%input : tensor<3x5xf32>) -> tensor<3xf32> {
%result_empty = tensor.empty() : tensor<3xf32>
%cst_0 = arith.constant 0.0 : f32
%result_init = linalg.fill ins(%cst_0 : f32) outs(%result_empty : tensor<3xf32>) -> tensor<3xf32>
%result = linalg.generic {
indexing_maps=[
// Indexing maps for the parameters listed in `ins(...)`
#map_2d_identity,
// Indexing maps for the parameters listed in `outs(...)`
#map_2d_proj_first
],
iterator_types=[
// Rule: the i-th iterator_type corresponds to the i-th coordinate in the
// source space of the affine_maps defined above, (m, n). So:
"parallel", // This refers to the `m` coordinate in the affine-maps.
// This is the coordinate that is preserved in the result,
// see the map_2d_proj_first map given above.
"reduction" // This refers to the `n` coordinate in the affine-maps.
// This is the coordinate that is dropped by the map_2d_proj_first
// given above and thus not present in the 1D result.
]
} ins(%input : tensor<3x5xf32>)
outs(%result_init : tensor<3xf32>)
{
^bb0(%input_entry : f32, %result_entry : f32):
%add = arith.addf %input_entry, %result_entry : f32
linalg.yield %add : f32
}
-> tensor<3xf32>
return %result : tensor<3xf32>
}
```

```
$ iree-compile --iree-hal-target-backends=llvm-cpu prog.mlir -o /tmp/prog.vmfb
$ iree-run-module --module=/tmp/prog.vmfb \
--input=3x5xf32=[[0,1,2,3,4],[5,6,7,8,9],[10,11,12,13,14]]
EXEC @foo
result[0]: hal.buffer_view
3xf32=10 35 60
```

## Matrix multiplication as a `linalg.matmul`

and as a `linalg.generic`

link

We are now ready to see how to express matrix multiplication as a
`linalg.generic`

. But actually, rather than just writing that by hand, we are
going to let Linalg do it for us. Indeed, in addition to `linalg.generic`

,
Linalg contains a number of "named ops", which are essentially just short-hand
notation for special cases of `linalg.generic`

. One of them is `linalg.matmul`

,
doing matrix multiplication accumulating into an existing accumulator. Here is
a simple function performing a matrix-multiplication-with-accumulation using
`linalg.matmul`

. Also in this example, we use dynamic shapes (the `?`

in the
shapes, see the above section where we encountered that), but we could just as
well use static shapes.

```
func.func @foo(%lhs: tensor<?x?xf32>, %rhs: tensor<?x?xf32>, %acc: tensor<?x?xf32>) -> tensor<?x?xf32> {
%result = linalg.matmul
ins(%lhs, %rhs: tensor<?x?xf32>, tensor<?x?xf32>)
outs(%acc: tensor<?x?xf32>)
-> tensor<?x?xf32>
return %result: tensor<?x?xf32>
}
```

```
$ iree-compile --iree-hal-target-backends=llvm-cpu prog.mlir -o /tmp/prog.vmfb
$ iree-run-module --module=/tmp/prog.vmfb \
--input=2x2xf32=[[1,2][3,4]] \
--input=2x2xf32=[[1,4][3,2]] \
--input=2x2xf32=[[0,0][0,0]]
EXEC @matmul_dynamic
result[0]: hal.buffer_view
2x2xf32=[7 8][15 20]
```

Now we encounter another IREE tool: `iree-opt`

. Unlike `iree-compile`

which
compiles a MLIR program all the way down to a `.vmfb`

that's ready to run on the
target device, `iree-opt`

only applies selected transformations.

We run:

```
iree-opt --linalg-generalize-named-ops prog.mlir
```

And that prints:

```
#map = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
module {
func.func @foo(%arg0: tensor<?x?xf32>, %arg1: tensor<?x?xf32>, %arg2: tensor<?x?xf32>) -> tensor<?x?xf32> {
%0 = linalg.generic {indexing_maps = [#map, #map1, #map2], iterator_types = ["parallel", "parallel", "reduction"]} ins(%arg0, %arg1 : tensor<?x?xf32>, tensor<?x?xf32>) outs(%arg2 : tensor<?x?xf32>) {
^bb0(%in: f32, %in_0: f32, %out: f32):
%1 = arith.mulf %in, %in_0 : f32
%2 = arith.addf %out, %1 : f32
linalg.yield %2 : f32
} -> tensor<?x?xf32>
return %0 : tensor<?x?xf32>
}
}
```

So that's the `linalg.generic`

implementing matrix multiplication equivalently
to the above `linalg.matmul`

form. We can compile and run that like the above
program and it will have exactly the same result.

Here the 3 listed `iterator_types`

, `["parallel", "parallel", "reduction"]`

,
correspond to the 3 listed coordinates in the `affine_map`

's, `(d0, d1, d2)`

.
So, `d0`

and `d1`

are parallel dimensions and `d2`

is the reduction dimension.
That's why the first two `affine_map`

's results involve `d2`

(they are
respectively for the LHS `%arg0`

and RHS `%arg1`

) and the last `affine_map`

's
result only involves the parallel `d0`

and `d1`

, as it refers to the result
matrix.

Note

Some current IREE compiler optimizations are only triggering on named ops
like `linalg.matmul`

, not on the equivalent `linalg.generic`

form. Think of
that as a non-essential current limitation, and the intent is over time to
overcome these, but in the near term do use `linalg.matmul`

when performance
matters.

### Integer element typeslink

MLIR defines integer types for absolutely any bit-width, including non-power-of-two bit-widths, and in three signedness flavors:

- Signed integers, indicated by the letters
`si`

. - Unsigned integers, indicated by the letters
`ui`

. - Sign-less integers indicated by the letter
`i`

. "Sign-less" means that the integer type does not convey signedness; the integer value may be used as either a signed or an unsigned value but that's a property of the*operation*using that value as an operand, that's not encoded in the*type*.

So for instance, `si16`

is the 16-bit signed integer type, `ui24`

is the 24-bit
unsigned integer type, and `i8`

is the sign-less 8-bit integer type.

Now here is a very important principle of how the MLIR dialects that are relevant to us in IREE operate:

Note

Only use sign-less types. Always encode signedness in operations, not in types.

For example, here is how we perform a matrix multiplication where the LHS is
signed 8-bit integers, the RHS is unsigned 8-bit integers, and the accumulator
is signed 32-bit integers. Notice how the fact that LHS is signed and the RHS is
unsigned is encoded only in the implementation of the `linalg.generic`

basic
block, where the LHS and RHS entries are extended, respectively as signed
(`arith.extsi`

) and unsigned (`arith.extui`

):

```
#map = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
module {
func.func @foo(%lhs: tensor<?x?xi8>, %rhs: tensor<?x?xi8>, %acc: tensor<?x?xi32>) -> tensor<?x?xi32> {
%result = linalg.generic
{indexing_maps = [#map, #map1, #map2],
iterator_types = ["parallel", "parallel", "reduction"]}
ins(%lhs, %rhs : tensor<?x?xi8>, tensor<?x?xi8>)
outs(%acc : tensor<?x?xi32>) {
^bb0(%lhs_entry: i8, %rhs_entry: i8, %acc_entry: i32):
%lhs_extended = arith.extsi %lhs_entry : i8 to i32
%rhs_extended = arith.extui %rhs_entry : i8 to i32
%mul = arith.muli %lhs_extended, %rhs_extended : i32
%add = arith.addi %acc_entry, %mul : i32
linalg.yield %add : i32
} -> tensor<?x?xi32>
return %result : tensor<?x?xi32>
}
}
```

```
$ iree-compile --iree-hal-target-backends=llvm-cpu prog.mlir -o /tmp/prog.vmfb
$ iree-run-module --module=/tmp/prog.vmfb \
--input=2x2xi8=[[-1,-2][-3,-4]] \
--input=2x2xi8=[[1,4][3,2]] \
--input=2x2xi32=[[0,0][0,0]]
EXEC @foo
result[0]: hal.buffer_view
2x2xi32=[-7 -8][-15 -20]
```

Note

A current runtime limitation,
https://github.com/iree-org/iree/issues/16241,
prevents passing sub-byte-bit-width integers on the `iree-run-module`

command line.