Data Mining in the Cloud Part 3

In my first Data Mining blog post, I defined the concept of the Data Mining Stack, which consists of three layers.  The top layer comprises the Data Mining Algorithms.  The middle layer consists of the OLAP HyperCube.  And the bottom layer is the Ingest DataBase.

In the second Data Mining post, I discussed how the Data Mining Stack fits into the Cloud, and why we want to implement it in the Cloud in the first place.

Cloud Storage vs. Cloud Computing

So how is this whole thing tied together?  The Ingest DataBase Cloud Grid is responsible for creating Relational DataBase Chunks as needed.  These Chunks will be written to Global Cloud Storage.  For example, in Amazon Cloud Computing, Global Cloud Storage is called Elastic Block Store (EBS).  This is NOT the same thing as Simple Storage Service (S3), otherwise known as Cloud Storage.  Global Cloud Storage is thus a part of Cloud Compute and does not have anything to do with Cloud Storage.  Global Cloud Storage can only be attached to one Virtual Machine at a time.  Thus such storage would be attached to an Ingest DataBase Instance while the Relational DataBase Chunk is being created.  Once the Chunk is complete, it is closed and thus detached from the Ingest DataBase Instance.  It is then available for reading by the OLAP Layer.

The OLAP Layer then uses Relational DataBase Chunks (which are stored in Global Cloud Storage) as input to create HyperCube Chunks.  And just like in the Ingest DataBase Layer, HyperCube Chunks are created in Global Cloud Storage.

Which brings us to a problem.  As previously stated, Global Cloud Storage can only be attached to one Virtual Machine at a time. So then, one and only one OLAP Instance could read a Relational DataBase Chunk at a time? This will result in a bottleneck and thus hurt performance and scaling.

The solution is to allow Global Cloud Storage to be attached to multiple Instances as long as the attachment is Read Only.  This enhancement should not be too hard to implement.  Perhaps it is already in the works.
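
To make the proposed rule concrete, here is a minimal sketch in Python.  It is only a toy model of the attachment rule, not a real cloud API: the Volume class, its method names, and the instance names are all made up for illustration.  The idea is that a read-write attachment stays exclusive, while any number of read-only attachments are allowed once the Chunk has been closed.

```python
class Volume:
    """Hypothetical model of one Global Cloud Storage volume holding one Chunk."""

    def __init__(self, name):
        self.name = name
        self.sealed = False    # True once the Chunk is complete and closed
        self.writer = None     # instance currently attached read-write
        self.readers = set()   # instances currently attached read-only

    def attach(self, instance, read_only=False):
        if read_only:
            # Read-only attachments may be shared, but only after the Chunk is closed.
            if not self.sealed:
                raise RuntimeError("chunk is still being written")
            self.readers.add(instance)
        else:
            # Read-write attachment stays exclusive: one instance at a time.
            if self.sealed:
                raise RuntimeError("chunk is closed; attach read-only instead")
            if self.writer is not None:
                raise RuntimeError("volume is already attached read-write")
            self.writer = instance

    def close_and_detach(self):
        """The writer finishes the Chunk: close it and release the volume."""
        if self.writer is None:
            raise RuntimeError("no writer attached")
        self.writer = None
        self.sealed = True

    def detach_reader(self, instance):
        self.readers.discard(instance)


# An Ingest instance writes the Chunk, then two OLAP instances read it at once.
vol = Volume("rdb-chunk-0001")
vol.attach("ingest-1")
vol.close_and_detach()
vol.attach("olap-1", read_only=True)
vol.attach("olap-2", read_only=True)
print(sorted(vol.readers))  # ['olap-1', 'olap-2']
```

Because a closed Chunk never changes again, sharing it among readers needs no locking, which is why this rule looks cheap to add.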

Be that as it may, Global Cloud Storage will be attached to each OLAP Instance when the HyperCube Chunk is being created. Once the Chunk is complete, it is closed and thus detached from the OLAP Instance. It is then available for reading by the Data Mining Algorithm Layer.

Data Mining Algorithm Layer

The Algorithm Layer then uses HyperCube Chunks (which are stored in Global Cloud Storage) as input to create Algorithm Chunks. And just like in the other layers, Algorithm Chunks are created in Global Cloud Storage.

So in conclusion, the input to the Ingest DataBase Grid is the Ingest Data Stream.  The output from the Ingest DataBase Grid is multiple Relational DataBase Chunks, which are the input to the OLAP Grid.  The output from the OLAP Grid is multiple HyperCube Chunks, which are the input to the Data Mining Algorithm Grid.  The Algorithm Grid uses multiple Algorithm Chunks to generate the final answer.
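
The flow of Chunks through the three Grids can be sketched as a small Python pipeline.  This is a toy under stated assumptions, not an implementation of the Stack: the function names, the (region, product, amount) row layout, the fixed chunk size, and the "which region sold the most" question are all invented for illustration, and plain in-memory lists and counters stand in for Chunks stored in Global Cloud Storage.

```python
from collections import Counter
from itertools import islice


def ingest_grid(stream, chunk_size):
    """Ingest layer: cut the data stream into Relational DataBase Chunks (lists of rows)."""
    it = iter(stream)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def olap_grid(db_chunks):
    """OLAP layer: build one HyperCube Chunk per database chunk.
    Here a 'cube' is just amounts summed by (region, product)."""
    for rows in db_chunks:
        cube = Counter()
        for region, product, amount in rows:
            cube[(region, product)] += amount
        yield cube


def algorithm_grid(cube_chunks):
    """Algorithm layer: build one Algorithm Chunk per HyperCube Chunk.
    Here each Algorithm Chunk is a partial total per region."""
    for cube in cube_chunks:
        partial = Counter()
        for (region, _product), total in cube.items():
            partial[region] += total
        yield partial


def final_answer(algorithm_chunks):
    """Merge all Algorithm Chunks and return the top region (None if no data)."""
    merged = Counter()
    for partial in algorithm_chunks:
        merged.update(partial)  # Counter.update adds counts together
    return merged.most_common(1)[0][0] if merged else None


def run_pipeline(stream, chunk_size):
    return final_answer(algorithm_grid(olap_grid(ingest_grid(stream, chunk_size))))


stream = [
    ("east", "a", 5), ("west", "a", 7),
    ("east", "b", 4), ("west", "b", 1),
    ("east", "a", 2),
]
print(run_pipeline(stream, chunk_size=2))  # east
```

Each stage only ever reads finished Chunks from the stage below it, so in a real Grid the per-Chunk work in every layer could be spread across many Instances.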