Data Mining in the Cloud – Part 1

Data Mining in the Cloud is a hot topic nowadays, so I’d like to define the Data Mining Stack and explain how it fits into the Cloud.

Data Mining Stack & The Cloud

Firstly, I will define the Data Mining Stack to consist of three layers: Algorithms (Heuristics), HyperCube, and Ingest DataBase. The first layer (Algorithms) consists of mathematical methods that draw conclusions from data sets. Some of the more common methods are Gaussian Elimination, Neural Networks, and Association Rules. It is of the utmost importance to note that deriving inferences is based on the field of statistics, and statistics gets its power from large data sets. It does not make sense to base decisions on small data sets: estimates drawn from a small sample vary widely from one sample to the next, so they are rarely statistically significant.
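To make the point about small data sets concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the data is simulated from a standard normal distribution, and the function name, sample sizes, trial count, and seed are all made up for the example. It draws many samples of a given size and measures how much the sample mean moves around from one sample to the next.

```python
import random
import statistics


def spread_of_sample_means(sample_size, trials=200, seed=0):
    """Standard deviation of the sample mean across repeated samples.

    Hypothetical helper for illustration; the data is simulated,
    not taken from any real data set.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        # Draw one sample of the requested size from a standard normal.
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)


small = spread_of_sample_means(10)     # small data set
large = spread_of_sample_means(1000)   # 100 times more data

print(f"spread of the mean with   10 records: {small:.4f}")
print(f"spread of the mean with 1000 records: {large:.4f}")
```

In theory the spread shrinks with the square root of the sample size, so 100 times more data should give an estimate that is roughly 10 times more stable.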

So how do Data Mining Algorithms access large data sets? There are two different mechanisms for data access. The older is OnLine Transaction Processing (OLTP), which is usually implemented as a Relational DataBase. Probably the most widely used relational databases are Oracle (enterprise) and MySQL (embedded). Relational DataBase technology was invented back in the 70s, and it was originally designed to process MegaBytes (MB) of data. Well guess what? Today we deal with PetaBytes (1MB x 1MB x 1KB) and quite possibly ExaBytes (1MB x 1MB x 1MB) of data. Processing this amount of data in real time is a bit of a challenge (to say the least).
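The unit arithmetic above can be checked in a few lines of Python. This is a rough sketch using decimal (power of ten) units. The scan rate of 1 GigaByte per second is purely my assumption for illustration, not a measurement of any particular database.

```python
# Decimal storage units, in bytes.
KB = 10**3
MB = 10**6
GB = 10**9

PB = MB * MB * KB   # PetaByte: 10**15 bytes
EB = MB * MB * MB   # ExaByte:  10**18 bytes


def scan_seconds(total_bytes, bytes_per_second):
    """Time to read every byte once at a constant rate."""
    return total_bytes / bytes_per_second


# ASSUMPTION: a sustained scan rate of 1 GB per second.
assumed_rate = 1 * GB

seconds = scan_seconds(PB, assumed_rate)
days = seconds / 86_400  # seconds in a day

print(f"1 PB = {PB:,} bytes")
print(f"1 EB = {EB:,} bytes")
print(f"One full pass over 1 PB at 1 GB/s takes about {days:.1f} days")
```

Even under that generous assumption, a single pass over one PetaByte takes on the order of days, which is why real time processing at this scale is hard.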

A new type of database was thus created to deal with modern hyper data sets, called OnLine Analytical Processing (OLAP). This type of database is commonly called a HyperCube, because it is multi-dimensional. In geometry, a one-dimensional object is a line. A two-dimensional object is a plane. A three-dimensional object is a cube. And objects with dimensions greater than three are called HyperCubes. Each dimension of the OLAP HyperCube is indexed by a dimension value.

It is important to note that OLAP HyperCubes store Data Counts; they do not store values. Relational DataBases store Data Values. So a HyperCube Cell contains a count that represents the number of data records (in the OLTP DataBase) that have that particular dimensional equivalence. That is, the count of the number of data records that have the same value for dimension 1, the same value for dimension 2, the same value for dimension 3, and so on. An OLAP HyperCube in essence coalesces an OLTP DataBase so that the data can be more readily used by Data Mining Algorithms.
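Here is a minimal Python sketch of how relational records coalesce into HyperCube Cells. The records, the dimension names (region, product, year), and the helper function are all hypothetical. A real OLAP engine is far more elaborate, but the counting idea is the same.

```python
from collections import Counter

# Hypothetical OLTP records: each row stores data values.
records = [
    {"region": "east", "product": "widget", "year": 2012},
    {"region": "east", "product": "widget", "year": 2012},
    {"region": "east", "product": "gadget", "year": 2012},
    {"region": "west", "product": "widget", "year": 2013},
]

# The dimensions of the cube (names are assumptions for this example).
DIMENSIONS = ("region", "product", "year")


def build_cube(rows, dimensions):
    """Count the records that share the same value on every dimension.

    Each key is one cell of the cube, addressed by its dimension values.
    Each value is the number of records that fall into that cell.
    """
    return Counter(tuple(row[d] for d in dimensions) for row in rows)


cube = build_cube(records, DIMENSIONS)

for cell, count in sorted(cube.items()):
    print(cell, count)
```

Note that the cube keeps only the counts. The two identical east/widget/2012 rows collapse into a single cell with a count of 2, and a cell that no record falls into simply reads as 0.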

The bottom layer in the Data Mining Stack (as you have probably already guessed) is the Ingest DataBase, which is commonly a Relational DataBase. The Ingest DataBase must be efficient enough to keep up with the input rate into the database. This is not a trivial matter. Today’s Network Interface Cards (NICs) commonly support 10 and 20 Gigabit IO, and the use of 100 Gigabit networks is very common. A single 10 Gigabit link running at full rate delivers 1.25 GigaBytes of data every second.
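To get a feel for what keeping up with the input rate means, here is a back-of-the-envelope Python sketch. The record size of 1,000 bytes is my own assumption, the function name is made up, and the calculation ignores protocol overhead, so treat the result as an upper bound rather than a benchmark.

```python
def max_records_per_second(link_gigabits, record_bytes):
    """Upper bound on records arriving per second over a saturated link.

    Ignores protocol overhead. Both arguments are illustrative inputs.
    """
    bytes_per_second = link_gigabits * 10**9 / 8  # 8 bits per byte
    return bytes_per_second / record_bytes


# ASSUMPTION: each incoming record is 1,000 bytes.
ASSUMED_RECORD_BYTES = 1_000

for gigabits in (10, 100):
    rate = max_records_per_second(gigabits, ASSUMED_RECORD_BYTES)
    print(f"{gigabits:>3} Gigabit link: up to {rate:,.0f} records per second")
```

Under these assumptions, the Ingest DataBase behind a saturated 10 Gigabit link would have to absorb over a million inserts per second.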

So to tie this whole thing together, the Data Mining Stack consists of three layers. The top layer is made up of the Data Mining Algorithms. The middle layer is the OLAP HyperCube. And the bottom layer is the Ingest DataBase.

In my next blog post, I will discuss how the Data Mining Stack fits into the Cloud.