I/O Benchmarking with FIO - Part 1

Introduction

Performance is often an afterthought in software development and DevOps, much like security. No one seems to care about it until everything is suddenly too slow.

More often than not, performance problems boil down to I/O, since storage is frequently the weakest link in the chain. Understanding how I/O works and how to fine-tune it can work wonders for your infrastructure. That’s exactly what we’ll be discussing in this post.

We’ll be going over the basics of I/O benchmarking and trying to emulate various workloads while using the open source tool fio, short for Flexible I/O tester.

Setup

Luckily we don’t need a fancy or complicated setup to do our benchmarking. All we need is:

One Linux VM with a distro of your choice. Personally I use Ubuntu 24.04 LTS.
An HDD volume attached to the VM.
fio for I/O benchmarking.

You can run the commands below to set up and mount your volume:

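# create the mount point (-p avoids an error if it already exists)
sudo mkdir -p /mnt/dev

# check where your volume's located (/dev/vdb is used below; yours may differ)
lsblk
# any filesystem would work; note that mkfs wipes whatever is on the device
sudo mkfs.ext4 /dev/vdb
sudo mount /dev/vdb /mnt/dev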

Journaling and any other I/O running on the system will affect the benchmark, so run it on an otherwise idle system to get accurate measurements.

Installation

fio can be installed through your distro’s package manager; check its documentation for the exact package name. On Ubuntu, the command is:

sudo apt install -y fio

FIO Overview

This section will focus mainly on high-level concept explanation. The technical, nitty-gritty details will come after.

fio has an overwhelming number of parameters to simulate all sorts of workloads. We’ll be focusing on a handful of key ones that are enough to cover most scenarios:

bs: block size
size: File size to read/write from
iodepth: Queue size for I/O submissions
numjobs: Number of processes to create and perform the I/O operations
rw: Type of I/O operation to perform, such as read, write, or rw (mixed reads and writes). Check the documentation for further options.
runtime: How long to run the test, in seconds unless a unit is given
direct: Boolean. Set to 1 for direct (non-buffered) I/O, or 0 for buffered I/O, which is the default.
ioengine: Two most common engines are io_uring and libaio.
output-format: Defaults to normal, the human-readable format shown in this post. json is recommended for parsing and automated processing.

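fio runs in one of two modes: size-based, which stops once size bytes have been transferred, and time-based, which runs until the runtime is exhausted. If you set runtime on its own, the job still ends early when the size is reached first; add time_based to keep looping over the file for the full duration. I’d recommend deciding on one mode up front instead of mixing the two by accident.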

`numjobs` vs `iodepth`

These two parameters are a common point of confusion, and mixing them up often leads to very different results. In short:

numjobs controls process-level parallelism: fio spawns that many separate processes, each performing the same I/O task.
iodepth controls the queue depth at the job level, that is, how many I/O requests each job keeps in flight. Values above 1 only matter with asynchronous engines such as libaio or io_uring.

The maximum number of in-flight I/O requests is numjobs * iodepth.
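As a quick illustration of that formula, here is a minimal Python sketch. The function name max_in_flight is made up for this post, and the result is an upper bound that assumes an asynchronous engine.

```python
def max_in_flight(numjobs: int, iodepth: int) -> int:
    """Upper bound on outstanding I/O requests across all fio jobs."""
    if numjobs < 1 or iodepth < 1:
        raise ValueError("numjobs and iodepth must both be >= 1")
    return numjobs * iodepth


# The same total queue depth, reached in two different ways.
print(max_in_flight(numjobs=1, iodepth=32))  # one process, deep queue
print(max_in_flight(numjobs=8, iodepth=4))   # eight processes, shallow queues
```

Both calls give the same total, but the two setups are not equivalent: the first stresses a single submission queue, while the second also adds process-level concurrency, so they can produce different numbers on the same device.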

Block size

Choosing the right block size is tricky. A poor choice can skew the entire benchmark and give a false impression. Block sizes fall into three broad categories:

Small block size: Typically 4k up to 16k in size. Useful for random I/O and database/transactional workloads. AWS uses 4k block size by default for their EBS service
Mid-range block size: Typically 16k up to 64k and is used for mixed workloads. MS SQL Server uses 64k block size by default
Large block size: 128k and up to 4M. Recommended for sequential reads, backups and data warehousing. AWS Redshift uses 1M block size

Each application has its own requirements, and the I/O has to be fine-tuned for the workload itself. A good rule of thumb is to use 4k, 64k and 1M for small, mid-range and large block sizes respectively.
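To get a feel for why block size matters so much, the sketch below works out how many IOPS a device has to sustain to reach a given throughput at each of the three rule-of-thumb sizes. The helper names (parse_size, iops_needed) and the 100 MiB/s target are illustrative assumptions, and size suffixes are treated as powers of 1024.

```python
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(text: str) -> int:
    """Convert a size such as '4k' or '1M' to bytes (powers of 1024)."""
    text = text.strip().lower()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def iops_needed(target_mib_per_s: float, bs: str) -> float:
    """IOPS required to sustain a throughput at a given block size."""
    return target_mib_per_s * 1024**2 / parse_size(bs)


# Hypothetical target of 100 MiB/s, purely for illustration.
for bs in ("4k", "64k", "1M"):
    print(f"bs={bs:>3}: {iops_needed(100, bs):>8.0f} IOPS for 100 MiB/s")
```

The same throughput takes 256 times more operations per second at 4k than at 1M, which is why small-block tests are dominated by per-request latency and large-block tests by raw bandwidth.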

`ioengine`

libaio, the library interface to Linux native asynchronous I/O, used to be the standard way to do async I/O on Linux. It isn’t deprecated, and it remains a sensible choice for testing HDD storage and legacy async I/O behavior. io_uring is the go-to for SSD and NVMe storage.

fio supports plenty of engines, including sync for synchronous I/O, mmap, windowsaio and more. You can find the full list in the documentation.

Always read the documentation when choosing an engine, since not all parameters are compatible with every engine. For example, the fio documentation notes that libaio may only deliver queued behavior with non-buffered I/O (direct=1).

direct vs non-direct I/O

Direct I/O means that requests go straight to the disk and bypass the OS page cache, hence the name. Non-direct I/O, on the other hand, is buffered I/O: reads can be served from the page cache, and writes land in the cache first and are flushed to disk later.

Buffered I/O will almost always look faster than direct I/O, though it’s not a fair comparison since the two measure different things and serve different purposes. The natural question then becomes: when should you use which? Use direct I/O to test raw storage performance, and non-direct I/O to test your RAM and caching layer.

To sum things up, as a rule of thumb:

Use direct=1 when benchmarking storage hardware
Use direct=0 when benchmarking the cache system

HDD vs SSD Considerations

Different storage types behave very differently, so fio parameters should be tuned for the device under test. HDDs, being mechanical, are latency-bound and benefit from low queue depths and single-job workloads (numjobs=1). Random reads and writes are slow on them, so small blocks (4k) are a good fit for testing database-like workloads, while larger blocks (1M) suit sequential tests that resemble typical user applications.

SSDs, on the other hand, are throughput-bound, handle high parallelism very well, and reach maximum performance with a higher iodepth and multiple jobs. Block sizes again depend on the workload we’re trying to simulate.

Tuning numjobs, iodepth, and bs to the storage type is critical: it ensures the benchmark reflects the device’s true performance and avoids erroneous conclusions.

A good rule of thumb:

Storage Type	Recommended `numjobs`	Recommended `iodepth`	Typical `bs`	I/O Engine
HDD	1	1–8	4k (small) / 1M (large)	`libaio`
SSD / NVMe	4–16+	16–64+	4k–128k (random) / 1M+ (sequential)	`io_uring`

Our First FIO Command

We’re finally ready to test our first real FIO command after having built up a lot of background info and knowledge:

sudo fio --name=hello-fio-read \
  --numjobs=1 \
  --iodepth=1 \
  --rw=read \
  --directory=/mnt/dev/ \
  --ioengine=libaio \
  --bs=1M \
  --direct=1 \
  --size=1G \
  --runtime=30 \
  --time_based

The command above spawns a single process that reads sequentially from a file in the /mnt/dev/ directory using the libaio engine with a 1M block size. The process performs O_DIRECT I/O (--direct=1) to bypass the cache. Since we only specified --directory, fio creates the test file for us, which is why it needs a --size, and --time_based keeps it looping over that file until the 30-second runtime is up. When the run finishes, we get a summary of the output:

hello-fio: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
Jobs: 1 (f=1): [R(1)][100.0%][r=346MiB/s][r=346 IOPS][eta 00m:00s]
  read: IOPS=166, BW=166MiB/s (174MB/s)(4991MiB/30005msec)
     lat (usec): min=613, max=376663, avg=6008.79, stdev=16419.17
   iops        : min=   20, max=  370, avg=164.86, stdev=87.26, samples=59
Run status group 0 (all jobs):
   READ: bw=166MiB/s (174MB/s), 166MiB/s-166MiB/s (174MB/s-174MB/s), io=4991MiB (5233MB), run=30005-30005msec

Some of the output has been removed for brevity.

The key metrics to focus on are:

lat: Latency
BW: Bandwidth
iops: Number of I/O operations per second
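If you only have the human-readable output at hand, the headline numbers can be pulled out with a regular expression. This is a rough sketch rather than a robust parser: parse_summary is a hypothetical helper, it only understands the KiB/MiB/GiB units seen in this post, and it assumes that a k suffix on IOPS means thousands. For anything automated, the json output format is the better target.

```python
import re

# Matches e.g. "read: IOPS=166, BW=166MiB/s" or "write: IOPS=128k, BW=499MiB/s".
SUMMARY = re.compile(
    r"IOPS=(?P<iops>[\d.]+)(?P<suffix>k?),\s*BW=(?P<bw>[\d.]+)(?P<unit>[KMG]iB)/s"
)
UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def parse_summary(line: str) -> dict:
    """Extract IOPS and bandwidth (in MiB/s) from a fio summary line."""
    m = SUMMARY.search(line)
    if m is None:
        raise ValueError("not a fio read/write summary line")
    # Assumption: fio abbreviates large IOPS values with a decimal 'k'.
    iops = float(m["iops"]) * (1000 if m["suffix"] else 1)
    bw_mib_s = float(m["bw"]) * UNIT_TO_MIB[m["unit"]]
    return {"iops": iops, "bw_mib_s": bw_mib_s}


line = "  read: IOPS=166, BW=166MiB/s (174MB/s)(4991MiB/30005msec)"
print(parse_summary(line))
```

With a 1M block size, IOPS and MiB/s come out as the same number, which is a handy sanity check that the run used the block size you intended.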

Parameter Comparison

Earlier we covered the difference between direct and non-direct I/O, and between large and small block sizes. Now let’s see it in practice.

All commands and tests are run on the same machine to get accurate and consistent results.

Small Block size vs Large Block size

We’ll be using this base command:

sudo fio --name=hello-fio \
  --numjobs=1 \
  --iodepth=1 \
  --direct=1 \
  --rw=write \
  --directory=/mnt/dev \
  --ioengine=libaio \
  --runtime=30 \
  --size=1G \
  --time_based

Let’s start by adding --bs=4k to the base command:

hello-fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
  write: IOPS=895, BW=3582KiB/s (3668kB/s)(105MiB/30001msec); 0 zone resets
     lat (usec): min=358, max=29885, avg=1114.65, stdev=844.16
   iops        : min=  412, max= 1564, avg=893.86, stdev=304.19, samples=59

Run status group 0 (all jobs):
  WRITE: bw=3582KiB/s (3668kB/s), 3582KiB/s-3582KiB/s (3668kB/s-3668kB/s), io=105MiB (110MB), run=30001-30001msec

And now with bs=1M, we get:

  write: IOPS=69, BW=69.8MiB/s (73.2MB/s)(2094MiB/30011msec); 0 zone resets
     lat (msec): min=6, max=222, avg=14.29, stdev=10.75
   iops        : min=   38, max=   84, avg=69.76, stdev= 8.34, samples=59

Run status group 0 (all jobs):
  WRITE: bw=69.8MiB/s (73.2MB/s), 69.8MiB/s-69.8MiB/s (73.2MB/s-73.2MB/s), io=2094MiB (2196MB), run=30011-30011msec

The plot below (generated with Python’s Matplotlib) summarizes the results:

4k vs 1M Block Size Comparison

We can notice a pattern. The 4k block size has:

Low throughput (about 3.5 MiB/s)
Low latency (about 1.1 ms on average)
High IOPS (895)

While the 1M block size is the complete opposite:

High throughput (69.8 MiB/s)
High latency (about 14.3 ms on average)
Low IOPS (69)
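The three metrics are tied together, which is why they move in opposite directions. With a single job and iodepth=1 there is only one request in flight, so IOPS is roughly the inverse of the average latency (Little's law), and bandwidth is IOPS times the block size. Here is a small sketch that applies this to the average latencies from the two runs above; the function names are made up for illustration.

```python
def expected_iops(avg_latency_us: float, in_flight: int = 1) -> float:
    """Little's law: throughput = requests in flight / time per request."""
    return in_flight * 1_000_000 / avg_latency_us


def bandwidth_mib_s(iops: float, block_size_bytes: int) -> float:
    """Bandwidth in MiB/s for a given IOPS figure and block size."""
    return iops * block_size_bytes / 1024**2


# Average latencies taken from the two runs above.
runs = {
    "4k": (1114.65, 4 * 1024),      # lat avg in usec, block size in bytes
    "1M": (14.29 * 1000, 1024**2),  # 14.29 msec converted to usec
}
for name, (lat_us, bs) in runs.items():
    iops = expected_iops(lat_us)
    print(f"bs={name}: ~{iops:.0f} IOPS, ~{bandwidth_mib_s(iops, bs):.2f} MiB/s")
```

The estimates land at roughly 897 and 70 IOPS, close to the 895 and 69 that fio reported. When measured IOPS and latency disagree by a wide margin at queue depth 1, it is usually a sign that something other than the device (a cache, for instance) is in the path.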

Direct vs Non-Direct

Now let’s see the difference with the direct parameter. Just as before, we’ll use this base command:

sudo fio --name=hello-fio \
  --numjobs=1 \
  --iodepth=1 \
  --rw=write \
  --directory=/mnt/dev \
  --ioengine=libaio \
  --bs=4k \
  --runtime=30 \
  --size=1G \
  --time_based

It’s important to run both tests with the SAME block size so that the comparison is as accurate as possible.

First, we’ll test direct=1 (same as the command in the earlier section):

hello-fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
  write: IOPS=895, BW=3582KiB/s (3668kB/s)(105MiB/30001msec); 0 zone resets
     lat (usec): min=358, max=29885, avg=1114.65, stdev=844.16
   iops        : min=  412, max= 1564, avg=893.86, stdev=304.19, samples=59

Run status group 0 (all jobs):
  WRITE: bw=3582KiB/s (3668kB/s), 3582KiB/s-3582KiB/s (3668kB/s-3668kB/s), io=105MiB (110MB), run=30001-30001msec

Now, running with direct=0, we get:

  write: IOPS=128k, BW=499MiB/s (523MB/s)(14.6GiB/30001msec); 0 zone resets
     lat (usec): min=2, max=2795, avg= 3.15, stdev= 3.20
   bw (  KiB/s): min=  840, max=1247432, per=100.00%, avg=797612.32, stdev=385436.40, samples=37
   iops        : min=  210, max=311858, avg=199403.24, stdev=96359.24, samples=37

Run status group 0 (all jobs):
  WRITE: bw=499MiB/s (523MB/s), 499MiB/s-499MiB/s (523MB/s-523MB/s), io=14.6GiB (15.7GB), run=30001-30001msec

Let’s plot the comparison just like before:

Direct vs Non-Direct Comparison

Direct I/O performs so badly by comparison that we can barely see its throughput: about 3.5 MiB/s against 499 MiB/s, a gap of roughly 140x. We could’ve applied normalization or log scaling, but the point was to show how drastic the real difference is.

Then again, the non-direct results are too good to be true. This is the tricky part of benchmarking: knowing which results are realistic and which aren’t. Setting direct=0 means you end up testing a completely different scenario (in this case, the page cache in RAM) rather than the disk.

Testing Other Operations

Testing other rw options is straightforward with fio. We just need to change read to any of the other supported options, for example:

write
rw
randread
randwrite

The base command remains the same. So if we want to test our write performance, the command would be adjusted as such:

sudo fio --name=hello-fio-write \
  --numjobs=1 \
  --iodepth=1 \
  --rw=write \
  --directory=/mnt/dev/ \
  --ioengine=libaio \
  --bs=4k \
  --direct=1 \
  --size=1G \
  --runtime=30 \
  --time_based

And that’s it: changing the rw parameter is all it takes to test a completely different scenario. That is also what makes I/O testing tricky. A single parameter change can produce completely different results, so it’s important to define carefully what needs to be tested and which parameters to tinker with.
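When you want to walk through several rw modes in one go, generating the commands is less error-prone than editing them by hand. The sketch below only builds and prints the command lines; it does not run anything. BASE and fio_command are hypothetical names, and the size and time_based settings are borrowed from the base commands used earlier in this post.

```python
import shlex

# Parameters shared by every run; only rw (and the job name) changes.
BASE = {
    "numjobs": 1,
    "iodepth": 1,
    "directory": "/mnt/dev/",
    "ioengine": "libaio",
    "bs": "4k",
    "direct": 1,
    "size": "1G",
    "runtime": 30,
    "time_based": True,
}


def fio_command(rw: str, **overrides) -> list:
    """Build an fio argument list for one rw mode (nothing is executed)."""
    params = {"name": f"hello-fio-{rw}", "rw": rw, **BASE, **overrides}
    args = ["fio"]
    for key, value in params.items():
        # A value of True stands for a bare flag such as --time_based.
        args.append(f"--{key}" if value is True else f"--{key}={value}")
    return args


for mode in ("read", "write", "rw", "randread", "randwrite"):
    print(shlex.join(fio_command(mode)))
```

Keyword overrides make it easy to vary a second parameter deliberately, for example fio_command("read", bs="1M"), while everything else stays fixed, so each run differs from the previous one in a way you chose.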

Conclusion

We’ve covered plenty of concepts and topics in this post, providing a solid foundation for running and interpreting fio benchmarks.

The key takeaway is that running a benchmark is only half the work. Understanding what you’re measuring and why is just as important. Small parameter changes can lead to vastly different results, and some trial and error is needed when defining realistic workloads. In the end, benchmark results are only meaningful if the workload matches reality.

In the next post, we’ll take this a step further by automating fio runs and diving deeper into the JSON output, including how to parse and analyze results at scale.