Database Performance with OpenStack Swift Improvements – Part 3

In Part 2 of this blog post series, I proposed replacing SQLite with MySQL as a database engine and changing the database schema to use chunking. These changes ensure that database performance is consistent for Account and Container Databases.

But what about Objects? As previously stated, Object data is stored in files. Are there problems with this?

Yes. Storing Object data in files is appealingly simple, and Object data is easy to find – just look in the file system. But what about performance?

A file system adds a lot of overhead to the equation; its functionality is not free. There are a number of problems with this approach.

Database Performance Problem 1: Buffer Cache

File systems use an in-memory buffer cache to speed up repetitive, sequential, small-file I/O. But Swift Objects are written once, with no partial updates or reads: an Object is always completely rewritten and always completely read. So the buffer cache sucks up a lot of memory and does not improve performance. In fact, it degrades performance because of its overhead.
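One way to cope with this today is to write the object, flush it, and then hint the kernel that its pages will not be needed again. A minimal sketch of that technique, assuming a Linux host (the function name is mine, not Swift's actual API):

```python
import os

def write_object_uncached(path, data):
    """Write a whole object, then hint the kernel to evict its pages.

    Write-once/read-whole objects gain nothing from the buffer cache,
    so POSIX_FADV_DONTNEED asks the kernel to drop the cached pages
    once the data is safely on disk. (Linux-only illustrative sketch;
    the helper name is hypothetical.)
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # make sure the pages hit disk before dropping them
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

This only works around the problem per-write; the memory and CPU spent managing the cache in the first place is still overhead.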

Database Performance Problem 2: Multi Use of the File System

You will recall that the file system also stores the Account and Container databases. In addition, XFS supports Extended Attributes, which allow name/value pairs to be attached to a file. Swift uses XFS Extended Attributes to store user-defined headers (metadata).
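The xattr mechanism itself is easy to demonstrate. A quick sketch, assuming a Linux file system with user xattrs enabled (the `user.demo.` key prefix is made up for illustration – Swift serializes headers into its own xattr keys):

```python
import os

def attach_metadata(path, headers):
    """Store user-defined headers as extended attributes on a file.

    Illustrative only: the 'user.demo.' prefix is hypothetical;
    Swift uses its own xattr key naming and serialization.
    """
    for name, value in headers.items():
        os.setxattr(path, 'user.demo.' + name, value.encode('utf-8'))

def read_metadata(path, name):
    """Read one header back from the file's extended attributes."""
    return os.getxattr(path, 'user.demo.' + name).decode('utf-8')
```

Note that every one of these lookups is another trip through the file system, separate from reading the object data itself.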

The end result is that the file system is multi-use: database, metadata, and object data. This results in poor disk performance. All three components use the file system in different ways, so each affects it differently, and the mix causes constant disk head seeks. For the best performance, we want to minimize seeks and keep the disk heads moving in one direction.

Database Performance Problem 3: Using File System Meta Data

Querying file system metadata performs poorly. File systems are great at moving data; they are terrible at querying data. Traversing inodes and data blocks is expensive. Consider the code to do an “ls”. Conceptually the problem is straightforward: get the inode for a directory, perform multiple reads on the inode data, then filter and sort the results. In practice this takes a substantial amount of code. Now consider if the metadata were in a database: the “ls” code becomes a single database query.
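To make the contrast concrete, here is what an “ls” looks like once the metadata lives in a database. This is a toy SQLite schema, used purely to illustrate the shape of the query – not Swift's actual schema:

```python
import sqlite3

# Toy table standing in for file system metadata.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE inode (parent TEXT, name TEXT, size INTEGER)')
db.executemany('INSERT INTO inode VALUES (?, ?, ?)', [
    ('/objects', 'b.dat', 10),
    ('/objects', 'a.dat', 20),
    ('/accounts', 'db1', 30),
])

def ls(directory):
    """The whole 'ls': filter by parent and sort by name, in one query."""
    rows = db.execute(
        'SELECT name FROM inode WHERE parent = ? ORDER BY name',
        (directory,))
    return [name for (name,) in rows]
```

The filtering and sorting that the file system version hand-codes is pushed down into the database engine, which is built for exactly this.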

Database Performance Problem 4: All Components in a Single Repository

The database, metadata, and object data all live in the same file system, and therefore in the same repository. If that repository is down, nothing is available. Now consider moving the metadata into the database and moving the database to a different repository. The result: the database is still available if the file system is down, and the file system is still available if the database is down.

Database Performance Solution

The solution to these four problems is to move the file system metadata into the database and then move the database to another repo. In addition, scrap the file system and change the database to point directly into raw disk partitions. Object data would no longer be stored as files; instead, the Object Table (in the Container Database) would record the partition name, partition offset, and object size. An object would thus be defined by the three-tuple (partition, offset, size). We remove the file system, with all of its complexity and overhead, and use raw disk partitions instead. To summarize:

  1. Replace SQLite with MySQL.
  2. Separate the database repo from the file system.
  3. Remove the file system.
  4. Move all metadata into the database.
  5. Change Object Table to use raw disk partitions.
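Under that scheme, reading an object reduces to a single positioned read on the raw device. A sketch, assuming the (partition, offset, size) tuple has already been fetched from the Object Table (a regular file stands in for the raw partition here):

```python
import os

def read_object(partition_path, offset, size):
    """Fetch one object given its (partition, offset, size) tuple.

    partition_path would be a raw device such as /dev/sdb1 in the
    proposed design; a single pread() replaces the directory walk,
    inode lookup, and extent mapping a file system would perform.
    """
    fd = os.open(partition_path, os.O_RDONLY)
    try:
        return os.pread(fd, size, offset)
    finally:
        os.close(fd)
```

Writes are the mirror image: pick a free (offset, size) region on a partition, pwrite() the object there, and record the tuple in the Object Table.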

That’s how I’ve fixed the problem. How have you solved these database performance issues?