Last time we theorized why the RESAR Swift MySQL approach has such dismal performance results. In this post we will discuss a new approach.
B-Trees & Hashing
Specifically, we will define the stream star schema that will result in a database that is optimized for insertion performance. This performance improvement will primarily be the result of the improved attribute indexing mechanism. Many relational databases like MySQL use B-trees to implement attribute indexes. Index table insertion time thus depends on the table size and is O(log n) where n is the number of entries in a table. For the stream star schema, we proposed using Hash Tables instead of B-Trees.
Hash table insertion time is a constant and thus not dependent on table size. But the problem was how to implement hashing in our new database. The answer was in the judicious use of Python dictionaries. In Python, arrays or tables can be defined as lists or dictionaries.
Lists are arrays that are indexed by position. Dictionaries are indexed by attribute name. Python dictionaries are analogous to data structures in the C programming language, where each attribute is identified by a unique name. In Python, dictionary attribute names are hashed. So a dictionary is actually a hash table.
Once we had solved the hashing problem, we needed to create a generic stream star schema engine that would allow custom databases to be created. We thus defined an XML Schema Definition (XSD) that is used to define the star schema database. This XSD allows for the creation of multiple tables. A table can include multiple attributes. Attributes have two qualities: type and index. Attribute type is one of: int8, int16, int32, int64, uint8, uint16, uint32, uint64, string. Attribute index is a boolean that allows attributes to be hashed. If attribute index is FALSE, then the attribute is defined as a simple data value. On the other hand if attribute index is TRUE, then the attribute will be defined as a dictionary of lists. That is, a hash table, where each entry is a list of table indecies that contain that particular attribute value. For a given stream star schema (SSS), fact table (F), dimension (D), dimension value (V), data row (R) then the Python code to insert an attribute value into an attribute hash table is:
- if “index” in sss[‘fact_tables’][F][‘dimensions’][D]:
- if V not in sss[‘fact_tables’][F][‘dimensions’][D][‘index’]:
- sss[‘fact_tables’][F][‘dimensions’][D][‘index’][V] if R not in sss[‘fact_tables’] if [F][‘dimensions’][D][‘index’][V]:
- sss[‘fact_tables’][F][‘dimensions’][D][‘index’][V].append(R)
This code snippet clearly has constant performance time. Thus, performance for the stream star schema attribute indexing method was constant and thus optimal. With this technology in hand, we are now able to define the Python data structures for the stream star schema:
Stream Star Schema & MySQL Compatible
The final hurdle that we had to cross so that the stream star schema would be compatible and as capable as MySQL was how to store the database on disk? We envisioned writing a complex library that would flush a stream star schema from memory to disk and then allow the converse. Turns out that this problem has already been solved in Python.
The solution is called “pickle“.
For a given file (F) and stream star schema (SSS), the following code shows how a
stream star schema database is stored on disk:
- fd = open(F, “wb”)
- pickle.dump(SSS, fd)
- fd.close()
For a given file (F) the following code show how a stream star schema (SSS) is
populated from disk:
- fd = open(F, “rb”)
- SSS = pickle.load(fd)
- fd.close()
Amazingly simple.