Cloud computing is definitely here to stay. The key premise is that cloud providers can build large data centers cheaply and then rent compute and storage capacity to small and medium-sized businesses far more cheaply than those businesses could build the same capacity on their own. Moreover, the cloud providers make money while helping these businesses. Thus, it seems like a win-win situation.
There are different types of cloud providers, such as infrastructure cloud providers (Amazon, Google), application cloud providers (Salesforce), and storage cloud providers (Amazon, Nirvanix). So the key question is: is there a one-size-fits-all back-end cloud storage architecture for all of these different types of clouds? Is it cost effective for the cloud provider to always use a shared-nothing architecture (where storage is directly attached to each of the nodes) for content depot applications? Is it always cost effective for the cloud provider to keep 3 copies of every data item? There are certain non-technical reasons that have clouded (no pun intended) the thinking process, and I would like to articulate them first before providing some technical analysis.
·
Don’t use Google and Amazon’s technical solutions as the reference architecture: One thing I have noticed is that many other companies are trying to mimic the Google File System architecture. It is important to note that Google and Amazon’s business models give them the luxury of storing 3 copies of data. Similarly, since they have high computation needs of their own for their internal map-reduce applications, they can leverage those CPUs and afford nodes with relatively few disks per CPU in their clusters.
·
High cost does not mean bad architecture: The higher cost of the solutions offered by traditional storage controller vendors does not mean that their storage architecture is flawed. The business models of these companies prevent them from dropping the prices of their boxes to the floor, but fundamentally these storage boxes are also made out of commodity components, and thus their cost of goods sold (COGS) is not that high. As a result, many cloud providers get away with storing 3 copies of data, and in some cases with having few disks behind underutilized CPUs, because the total cost of the solution is still less than what they would have to pay a storage controller vendor.
So, from a technical standpoint (with respect to space efficiency, CPU utilization, etc.), I don’t think one architecture is optimal for all types of workloads and cloud deployments. Let me briefly articulate the different types of requirements:
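The cost argument in the two points above can be made concrete with a rough calculation. The following is only a minimal sketch: every price and overhead factor in it is a hypothetical placeholder, not a real vendor or cloud figure. It compares the cost per usable terabyte of 3-way replication on cheap hardware with a lower-overhead layout on more expensive hardware.

```python
def cost_per_usable_tb(raw_cost_per_tb: float, overhead_factor: float) -> float:
    """Cost of one usable TB, given the raw price per TB and the
    raw-to-usable overhead factor (3.0 for three full copies)."""
    if raw_cost_per_tb < 0 or overhead_factor < 1:
        raise ValueError("price must be >= 0 and overhead factor >= 1")
    return raw_cost_per_tb * overhead_factor


# Hypothetical numbers, chosen only to illustrate the shape of the argument.
commodity = cost_per_usable_tb(raw_cost_per_tb=100.0, overhead_factor=3.0)
controller = cost_per_usable_tb(raw_cost_per_tb=500.0, overhead_factor=1.25)

# Raw price per TB at which the lower-overhead box would match the
# 3-copy commodity deployment.
break_even_raw_price = commodity / 1.25

print(f"3-copy commodity: {commodity:.2f} per usable TB")
print(f"controller-based: {controller:.2f} per usable TB")
print(f"break-even raw price for the controller: {break_even_raw_price:.2f} per TB")
```

With these made-up inputs, the 3-copy deployment remains cheaper even though it wastes two thirds of its raw capacity, which is the situation described above. The comparison flips once the raw price of the more space-efficient box drops below the break-even value.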
·
Map-Reduce Applications: In these applications, the CPU processing and memory needs scale evenly with the storage needs. Thus, a shared-nothing architecture with light-weight nodes (light-weight with respect to the number of disks attached to a node) makes sense, because the CPUs are kept busy and keeping the storage local to the processing node cuts down on network utilization.
·
Content Depot Applications: In these applications, objects (content) are stored in repositories. There are archival content depots (write once and read maybe) and active content depots (write once and read frequently). These applications warrant heavy-weight storage nodes; that is, there are thousands of disks behind a CPU complex to keep the CPU busy.
Thus, there is no single type of architecture that is appropriate for both of these workload types. Even with the emergence of flash, the light-weight shared-nothing cluster model is not useful if the application is not IOPS intensive. That is, flash is a good replacement for disks only if the workload is random-IOPS intensive. Otherwise, if one just wants capacity for content depots, then heavy-weight cluster nodes with thousands of disks are still the way to go.
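The difference between light-weight and heavy-weight nodes can be sketched as a simple sizing rule: attach as many disks to a node as its CPU complex can keep up with. The snippet below is an illustrative sketch only. The "CPU units" and the per-disk demand figures are hypothetical, not measurements of any real workload.

```python
def disks_to_keep_cpu_busy(cpu_capacity_units: int, cpu_demand_per_disk: int) -> int:
    """Number of disks one node can serve before its CPU saturates.

    Both arguments use the same abstract, hypothetical "CPU units".
    """
    if cpu_capacity_units <= 0 or cpu_demand_per_disk <= 0:
        raise ValueError("capacity and per-disk demand must be positive")
    return cpu_capacity_units // cpu_demand_per_disk


# Hypothetical per-disk CPU demand: map-reduce processes the data it stores,
# while content depots mostly just hold it.
HYPOTHETICAL_DEMAND_PER_DISK = {
    "map-reduce": 250,
    "active content depot": 10,
    "archival content depot": 1,
}

NODE_CPU_CAPACITY = 1000  # hypothetical capacity of one CPU complex

sizing = {
    workload: disks_to_keep_cpu_busy(NODE_CPU_CAPACITY, demand)
    for workload, demand in HYPOTHETICAL_DEMAND_PER_DISK.items()
}

for workload, disks in sizing.items():
    print(f"{workload}: about {disks} disks per node")
```

Under these assumed inputs, the map-reduce node ends up with a handful of disks while the archival depot node ends up with around a thousand, which mirrors the light-weight versus heavy-weight split described above.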
In conclusion, I have the following different technical recommendations for different folks:
·
Cloud Providers: At the end of the day, if you use a single type of cloud storage architecture for all different types of workloads, then you will have inefficient deployments. There will be other hungrier cloud providers who will be willing to have different types of storage deployments for different workloads to gain better resource efficiencies, and thus, will be able to provide better pricing to their customers.
·
Storage Vendors: It is important for storage vendors to be adaptive and provide storage solutions for both types of workloads. They need to provide cheap and deep solutions for content depot type of workloads, and they need to provide a distributed shared-nothing storage solution that can scale to thousands of nodes for the map-reduce type of environments (if they want to be the provider for those types of applications also).