Data Mining in the Cloud Part 2

In my last Data Mining post, I introduced the Data Mining Stack, which consists of three layers.  The top layer is made up of Data Mining Algorithms, the middle layer is the OLAP HyperCube, and the bottom layer is the Ingest DataBase.  So how does the Data Mining Stack fit into the Cloud?  And why would we want to implement it in the Cloud in the first place?

Ingest DataBase Layer

Consider the Ingest DataBase layer.  The limiting factor of this layer is the insert rate.  If a single Ingest DataBase cannot keep up with the input rate, then multiple instances of the Ingest DataBase have to be created and the input stream divided among them.
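To make the arithmetic concrete, here is a minimal sketch of how many Instances a given input rate calls for and how a record could be assigned to one of them. It assumes every Instance sustains the same fixed insert rate and that each record carries a string key. The function names and the CRC32-based routing are illustrative choices of mine, not part of any particular product.

```python
import math
import zlib


def instances_needed(input_rate: float, insert_rate_per_instance: float) -> int:
    """Smallest number of Instances whose combined insert rate covers the input rate.

    Both rates are in records per second (an assumed unit).
    """
    if insert_rate_per_instance <= 0:
        raise ValueError("insert_rate_per_instance must be positive")
    # Always keep at least one Instance running, even when the stream is idle.
    return max(1, math.ceil(input_rate / insert_rate_per_instance))


def route(record_key: str, instance_count: int) -> int:
    """Pick the Instance index for a record by hashing its key.

    CRC32 is used because it gives the same value on every run and machine,
    unlike Python's built-in hash() for strings.
    """
    return zlib.crc32(record_key.encode("utf-8")) % instance_count
```

For example, an input rate of 25,000 records per second against Instances that each sustain 10,000 inserts per second calls for three Instances. Hash routing keeps all records with the same key on the same Instance, which a simple round-robin split would not.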

The easiest way to multiplex the Ingest DataBase is to implement it in the Cloud.  Each Ingest DataBase Instance will be a Virtual Machine in the Cloud Grid.  The Cloud Grid would then ensure that each Instance is alive and well.  If an Instance stops processing input (the Instance dies or hangs), then the Cloud Grid will terminate the sick Instance and create a new one.  Input data streams are often dynamic: the input rate is not constant and can fluctuate.  The Cloud Grid could then allocate Instances as needed (when the input rate rises sharply) and deallocate them when they are no longer needed (when the input rate drops sharply).
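The supervision logic described above can be sketched as a single decision function. This is only an illustration under stated assumptions: each Instance reports a heartbeat timestamp, an Instance is considered sick once its heartbeat is older than a timeout, and the desired Instance count follows directly from the current input rate. The names `Instance` and `reconcile`, and the 30-second timeout, are hypothetical and do not come from any real Cloud API.

```python
import math
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    last_heartbeat: float  # time of the last sign of life, in seconds


def reconcile(instances, now, input_rate, rate_per_instance, heartbeat_timeout=30.0):
    """Decide what the Cloud Grid should do next.

    Returns (names of Instances to terminate, number of new Instances to create).
    """
    healthy = [i for i in instances if now - i.last_heartbeat <= heartbeat_timeout]
    sick = [i.name for i in instances if now - i.last_heartbeat > heartbeat_timeout]

    # Desired capacity follows the current input rate; never drop below one.
    desired = max(1, math.ceil(input_rate / rate_per_instance))
    surplus = len(healthy) - desired

    if surplus > 0:
        # Input rate has dropped: retire the surplus healthy Instances too.
        return sick + [i.name for i in healthy[-surplus:]], 0
    # Input rate is steady or rising: replace sick Instances and add any shortfall.
    return sick, -surplus
```

Calling `reconcile` on a timer and acting on its result gives both behaviors at once: sick Instances are always replaced, and the healthy count tracks the input rate up and down.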

Cloud Grid Efficiency

The Cloud Grid is thus efficient with respect to resources.  You only create Cloud Virtual Machines when you need them, so you only pay for Cloud resources when you need them.  This approach is also Highly Available, because the Cloud Grid replaces a failed Instance automatically, which keeps Instance downtime to a minimum.

Compared to an implementation that consists of physical servers, the Cloud Grid is much more flexible.  The Cloud Grid easily supports Operating System upgrades.  Try doing that with a physical server.  It is a trivial matter to increase or decrease the memory allocated to a Cloud Virtual Machine.  Try doing that with a physical server.  It is a trivial matter to increase or decrease the disk space allocated to a Cloud Virtual Machine.  Try doing that with a physical server.

OLAP HyperCube Layer

Now consider the OLAP HyperCube layer.  The limiting factor of this layer is the construction rate.  That is, how long does it take to create a HyperCube cell?  For example, if it takes time T to create C HyperCube cells and time T is too long, then multiple instances of the OLAP DataBase could be created and the HyperCube could be built in parallel.
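Here is a minimal sketch of parallel HyperCube construction, assuming each input row is a (region, product, amount) tuple and each cell holds the summed amount for one (region, product) pair. Those dimensions and the function names are made up for illustration. A local thread pool stands in for the separate Virtual Machines, so this shows the split-and-merge structure rather than a real speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def build_cells(rows):
    """One engine: aggregate its share of the rows into HyperCube cells."""
    cells = Counter()
    for region, product, amount in rows:
        cells[(region, product)] += amount
    return cells


def build_hypercube(rows, engine_count):
    """Split the rows across engines, build partial cubes in parallel, then merge."""
    # Deal the rows out like cards so every engine gets a similar share.
    chunks = [rows[i::engine_count] for i in range(engine_count)]
    with ThreadPoolExecutor(max_workers=engine_count) as pool:
        partials = list(pool.map(build_cells, chunks))

    cube = Counter()
    for partial in partials:
        cube.update(partial)  # Counter.update adds the values of matching cells
    return cube
```

The merge step works because summing is associative: adding partial sums gives the same cell values as summing all rows in one place. An aggregate without that property would need a different merge.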

This problem is essentially identical to the Ingest DataBase problem.  And guess what?  The solution is identical.  The easiest way to multiplex the OLAP HyperCube is to implement it in the Cloud.  Each OLAP HyperCube Engine will be a Virtual Machine in the Cloud Grid.

Data Mining Algorithm Layer

Now consider the Data Mining Algorithm layer.  The limiting factor of this layer is how fast the Algorithms can process the HyperCube.  If the Algorithms are not fast enough, then the problem can be parallelized with the Divide and Conquer method, a classic technique that is well understood in the analysis of algorithms.  In this case, it essentially consists of creating separate Data Mining Algorithm Engines.  Each Engine processes a portion of the HyperCube, and the results from each Algorithm Engine are then coalesced.
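The divide, process, and coalesce steps can be sketched as follows. The "algorithm" here is deliberately trivial (count, total, and maximum of the cell values) and stands in for a real Data Mining Algorithm. The HyperCube is assumed to be a non-empty dictionary that maps a cell key to a number, and all names are hypothetical.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Summary:
    count: int
    total: float
    maximum: float


def analyze(cells):
    """One Algorithm Engine: summarize its portion of the HyperCube."""
    values = list(cells.values())
    return Summary(len(values), sum(values), max(values))


def coalesce(a, b):
    """Combine the results of two Engines into one result."""
    return Summary(a.count + b.count, a.total + b.total, max(a.maximum, b.maximum))


def split(cube, parts):
    """Divide the HyperCube into at most `parts` non-empty portions."""
    items = list(cube.items())
    return [dict(items[i::parts]) for i in range(parts) if items[i::parts]]


def mine(cube, parts):
    """Divide the cube, analyze each portion, and coalesce the results."""
    summaries = [analyze(portion) for portion in split(cube, parts)]
    return reduce(coalesce, summaries)  # assumes the cube is not empty
```

In the Cloud, each call to `analyze` would run on its own Virtual Machine. The approach only works when a `coalesce` step exists that turns partial results into the same answer a single Engine would have produced over the whole HyperCube.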

This problem is essentially identical to the Ingest DataBase problem and the OLAP HyperCube problem.  And once again, the solution is identical.  The easiest way to multiplex the Data Mining Algorithm Engine is to implement it in the Cloud.  Each Algorithm Engine will be a Virtual Machine in the Cloud Grid.

In my next blog post, I will discuss how all the different pieces fit together.