A High Performance Write File System

I would like to take some time to describe how to write a high performance file system that is optimized for writes. For simplicity's sake, I call it the Write File System (WFS), and it has three goals:

Three Goals of A Write File System

1) USER APPLICATION

We do not want any kernel code. Why? Because user code is much easier to port to a new platform, whether the change is in the hardware or in the operating system. In addition, user code is more reliable: an error results in a core dump of one process instead of a crash of the whole system.

2) OPTIMAL WRITE PERFORMANCE

We want sustained write throughput to match the speed of the system's disk controllers.

3) UFS FILE SYSTEM INTERFACE

We want to keep the established file interface that existing code already uses on UFS, namely the C standard I/O calls {fopen(), fclose(), fread(), fwrite()}. This makes it much easier to integrate into existing user code.

Basic Principles

The following are the basic principles that determine whether a write file system is optimized:

RAW IO TO DISK.

The file system buffer cache is useful for general purpose file IO and generally improves performance. But for pure disk write performance, raw IO is far superior. Raw IO bypasses the buffer cache, which frees up memory and eliminates the overhead of copying data through the cache.
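As a rough illustration of what bypassing the buffer cache can look like from user code, the sketch below (Python on Linux) opens a file with O_DIRECT and writes from a page-aligned buffer. The helper name write_raw() and the 4096-byte alignment are assumptions made for this example, not part of the design above. Where O_DIRECT is unavailable, the sketch falls back to an ordinary write followed by fsync().

```python
import mmap
import os

ALIGN = 4096  # assumed alignment; O_DIRECT needs block-aligned buffers and lengths


def write_raw(path, payload):
    """Write payload to path, bypassing the buffer cache when possible.

    The file is zero-padded up to a multiple of ALIGN.
    Returns True if O_DIRECT was used, False if the fallback path ran.
    """
    # Round the length up to a multiple of ALIGN, as O_DIRECT requires.
    padded = max(-(-len(payload) // ALIGN) * ALIGN, ALIGN)
    # An anonymous mmap is page-aligned and zero-filled.
    buf = mmap.mmap(-1, padded)
    buf[:len(payload)] = payload
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    direct = getattr(os, "O_DIRECT", 0)  # not defined on every platform
    try:
        if not direct:
            raise OSError("O_DIRECT not available")
        fd = os.open(path, flags | direct, 0o644)
        try:
            os.write(fd, buf)  # a sketch: short writes are not retried here
        finally:
            os.close(fd)
        used_direct = True
    except OSError:
        # Fallback: buffered write, then force the data out with fsync().
        fd = os.open(path, flags, 0o644)
        try:
            os.write(fd, buf)
            os.fsync(fd)
        finally:
            os.close(fd)
        used_direct = False
    buf.close()
    return used_direct
```

The padding is a side effect of the alignment rule, so a real writer would need to record the true payload length somewhere.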

LARGE WRITES.

We want to use optimally sized buffers for writing to the disks. If we need to write out 32K bytes, then it is much faster to write this data in a single write as opposed to 32K writes of one byte each.

Write File System & Optimal Buffer Size

So what is the optimal buffer size?

Write a little program that writes to a disk using various buffer sizes. Buffer sizes should be powers of 2. Run enough buffer writes so that the test takes at least 5 minutes. If you time each test, you will see that performance improves as the buffer size increases, up to a point. If you increase the buffer size beyond this point, performance does not improve and might even degrade slightly.

This break-even point is usually the size of the disk controller's internal buffer. Be that as it may, we now have the buffer size that results in the optimal write performance.
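The little test program described above could look something like this Python sketch. The function name time_buffer_sizes() is hypothetical. For simplicity it uses ordinary writes followed by fsync() where a real test would use raw IO, and the caller picks the total number of bytes.

```python
import os
import time


def time_buffer_sizes(path, total_bytes, sizes):
    """Return {buffer_size: seconds} for writing total_bytes with each size."""
    results = {}
    for size in sizes:
        chunk = b"\0" * size
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        start = time.perf_counter()
        try:
            # Write the same total amount of data, one buffer at a time.
            for _ in range(total_bytes // size):
                os.write(fd, chunk)
            os.fsync(fd)  # include the time needed to reach the disk
        finally:
            os.close(fd)
        results[size] = time.perf_counter() - start
    os.remove(path)  # clean up the scratch file
    return results
```

A call such as time_buffer_sizes("bench.tmp", 64 * 1024 * 1024, [2 ** k for k in range(9, 21)]) would cover buffer sizes from 512 bytes to 1 MiB. For a meaningful result, the total should be raised until each run lasts several minutes.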

ONE WRITER PER DISK CONTROLLER.

We want the disk heads to move sequentially and not have to perform seeks when writing to the disks. We thus want only a single writer per disk. So each disk controller will have a dedicated disk writer daemon – fs_writer().

PARALLEL WRITES.

If we have multiple disk controllers in the system, we want to take advantage of parallelism. Since each disk controller has its own dedicated, independent disk writer, the fs_writer() daemons can all run in parallel.

This parallelism is hidden from the user by a dedicated File System Controller daemon called fs_control(). All user write operations are thus sent to fs_control().
fs_control() is then responsible for determining which fs_writer() is best suited to the task.
A Round Robin scheduling technique could be employed. Or perhaps each fs_writer() could indicate when it is busy.
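The two ideas can be combined: go around the writers in round robin order, but skip any that have reported themselves busy. The sketch below shows one way fs_control() might do this. The WriterPool class and its method names are hypothetical.

```python
class WriterPool:
    """Hypothetical picker that fs_control() could use to choose a writer."""

    def __init__(self, writer_ids):
        self.writer_ids = list(writer_ids)
        self.busy = set()    # writers that have reported themselves busy
        self.next_index = 0  # where the next round robin scan starts

    def set_busy(self, writer_id, is_busy):
        """Record a busy or idle report from one fs_writer()."""
        if is_busy:
            self.busy.add(writer_id)
        else:
            self.busy.discard(writer_id)

    def pick(self):
        """Return the next idle writer in round robin order, or None."""
        n = len(self.writer_ids)
        for step in range(n):
            index = (self.next_index + step) % n
            writer_id = self.writer_ids[index]
            if writer_id not in self.busy:
                self.next_index = (index + 1) % n
                return writer_id
        return None  # every writer is busy
```

When pick() returns None, fs_control() would have to queue the request until a writer reports idle. That policy is left open here.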

SHARED MEMORY.

fopen() returns a File Handle, which lives in shared memory.
Subsequent calls to fwrite() copy data into the shared memory File Handle.
fclose() sends a message to fs_control() that the file is ready for creation.
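To make the shared memory File Handle concrete, here is a minimal Python sketch built on the standard multiprocessing.shared_memory module. The layout (an 8-byte count followed by the data) and the function names handle_create(), handle_write() and handle_read() are assumptions for illustration only.

```python
import struct
from multiprocessing import shared_memory

COUNT = struct.Struct("<Q")  # assumed layout: 8-byte count of bytes written so far


def handle_create(capacity):
    """Allocate a shared memory File Handle with room for capacity bytes."""
    shm = shared_memory.SharedMemory(create=True, size=COUNT.size + capacity)
    COUNT.pack_into(shm.buf, 0, 0)  # nothing written yet
    return shm


def handle_write(shm, data):
    """Copy data into the handle after what is already there, as fwrite() would."""
    (used,) = COUNT.unpack_from(shm.buf, 0)
    start = COUNT.size + used
    if start + len(data) > shm.size:
        raise ValueError("File Handle is full")
    shm.buf[start:start + len(data)] = data
    COUNT.pack_into(shm.buf, 0, used + len(data))


def handle_read(name):
    """Attach to a handle by name, as a daemon might, and return its bytes."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        (used,) = COUNT.unpack_from(shm.buf, 0)
        return bytes(shm.buf[COUNT.size:COUNT.size + used])
    finally:
        shm.close()
```

In this sketch, the message sent at close time only needs to carry the handle's name (shm.name), since any process can attach to the same memory with it.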

WRITE ONCE – READ FOREVER.

Do not rewrite a file. Files are written once and read forever. This removes the need for locking.

CLOUD-LIKE PARTITIONING.

WFS uses cloud-like partitioning. That is, directories are flat; they are not nested.
In addition, directories are segmented into buckets. Buckets are of fixed size – which is configurable.

So tying it all together, this is how data moves from a user program to the disk:

1) FILE = fopen(bucket, file)
The user program calls fopen() with a bucket name and file name.
fopen() returns a pointer to a shared memory object.

2) fwrite(FILE, data)
Data is copied to the File Handle in shared memory.

3) fclose(FILE)
This function sends the File Handle pointer to fs_control().

4) fs_control()
fs_control() sends the File Handle pointer to the best available disk writer – fs_writer().

5) fs_writer()
Each fs_writer() keeps an internal buffer for each directory type that has been defined.

The buffer size is optimal and has already been determined.
A buffer is flushed to disk (using raw IO) when it is full or after a long enough timeout.
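The flush rule in step 5 (flush when full, or after a long enough timeout) can be sketched as follows. The WriterBuffer class is hypothetical. The disk write is passed in as flush_fn so the buffering logic stays separate from the raw IO, and the clock is injectable so the timeout can be tested.

```python
import time


class WriterBuffer:
    """Hypothetical buffer held by fs_writer() for one directory type."""

    def __init__(self, capacity, timeout, flush_fn, clock=time.monotonic):
        self.capacity = capacity  # the optimal buffer size found earlier
        self.timeout = timeout    # seconds a partial buffer may wait
        self.flush_fn = flush_fn  # called with the bytes to write to disk
        self.clock = clock
        self.pending = bytearray()
        self.first_append = None  # when the oldest unflushed data arrived

    def append(self, data):
        """Add data, flushing each time a full buffer has accumulated."""
        if not self.pending:
            self.first_append = self.clock()
        self.pending += data
        while len(self.pending) >= self.capacity:
            self.flush_fn(bytes(self.pending[:self.capacity]))
            del self.pending[:self.capacity]
            self.first_append = self.clock() if self.pending else None

    def poll(self):
        """Flush a partial buffer that has waited at least the timeout."""
        if self.pending and self.clock() - self.first_append >= self.timeout:
            self.flush_fn(bytes(self.pending))
            self.pending.clear()
            self.first_append = None
```

The writer's main loop would call append() for each incoming file and poll() periodically, so that a slow trickle of small files still reaches the disk.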

Now as to reading Write File System files: a file is identified by the tuple {bucket, offset}.
When a file is created in a bucket, a header containing the necessary metadata (the bucket name and offset) is prepended to the file. Utilities are provided to list the files in a given bucket and to search buckets for a given file name. Pretty simple, actually.
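As a sketch of how the {bucket, offset} scheme might work, the Python code below stores each bucket as one flat file and prepends a fixed-size header to every file written into it. The header layout is an assumption: the bucket name and offset come from the description above, while the payload length field is added here so that a reader knows where the file ends. All function names are hypothetical.

```python
import struct

# Assumed header: 32-byte bucket name, 8-byte offset, 8-byte payload length.
FILE_HEADER = struct.Struct("<32sQQ")


def append_file(bucket_path, bucket_name, payload):
    """Append header + payload to the bucket and return the file's offset."""
    with open(bucket_path, "ab") as f:
        offset = f.seek(0, 2)  # end of the bucket is where this file starts
        f.write(FILE_HEADER.pack(bucket_name.encode(), offset, len(payload)))
        f.write(payload)
    return offset


def read_file(bucket_path, offset):
    """Return (bucket_name, payload) for the file stored at offset."""
    with open(bucket_path, "rb") as f:
        f.seek(offset)
        name, stored_offset, length = FILE_HEADER.unpack(f.read(FILE_HEADER.size))
        if stored_offset != offset:
            raise ValueError("no file header at this offset")
        return name.rstrip(b"\0").decode(), f.read(length)


def list_files(bucket_path):
    """Yield the offset of every file in the bucket by walking the headers."""
    with open(bucket_path, "rb") as f:
        while True:
            raw = f.read(FILE_HEADER.size)
            if len(raw) < FILE_HEADER.size:
                return
            _, offset, length = FILE_HEADER.unpack(raw)
            yield offset
            f.seek(length, 1)  # skip the payload to reach the next header
```

Because files are never rewritten, an offset returned by append_file() stays valid for the life of the bucket. Searching by file name would need the name stored in the header as well, which this sketch leaves out.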

So you now know how to create a write file system that will max out a disk controller. This will enable you to take full advantage of available disk bandwidth.