As I mentioned in the previous part of this blog, the model of keeping primary storage at the host (direct-attached storage, or DAS) is definitely going to get more traction. This second part tries to articulate the different models for where the second copy of the data can be stored. A second copy is required for availability (to overcome machine failures, site failures and data corruption), for performance (load balancing) and potentially for data integrity verification. In this blog entry I am not focusing on the performance reasons for having the second copy.
The two prevalent models analyzed in this entry are the peer-to-peer model and the host-storage controller model. In the peer-to-peer model, primary storage is backed up at other peer nodes, whereas in the host-storage controller model, the primary host storage is backed up at a storage controller. A peer is a light node with only 4 to 8 disks that also runs application code, whereas a storage controller can support hundreds of disks and usually does not run application software. One could easily replace the storage controller with a storage cloud infrastructure. Furthermore, one could also use tape instead of disk-based storage to store the second copy.
Now I will articulate which option suits the second copy of the data under which circumstances:
· Case for Peer-Peer: Map-Reduce applications need a lot of processing nodes. Nowadays, in commodity configurations, most of these nodes also come packaged with some storage. If this storage is not fully utilized for primary data, the peers can use the available space to back up other peers' data. If the CPU/memory requirements of the application do not scale with the number of nodes, then pursuing a peer-to-peer approach just for storing backup data will not be cost effective, because CPU/memory utilization at the nodes will be low. Similarly, if the storage media at the hosts cost more than non-host storage media, it will not make economic sense to store the second copy at the host.
· Case for Peer-Peer: Some companies have argued that a peer-to-peer setup built from commodity parts is more cost effective than using storage controllers. In some cases they have argued that it is cheaper to keep even multiple copies of the data in the peer-to-peer setup than to store the data in a storage controller. This is primarily the case if one is storing data in Tier-1 storage controllers with full hardware redundancy and specialized non-commodity hardware. Furthermore, there is a difference between what it costs to build a storage controller and the price the vendor charges. Thus, a vendor that uses commodity hardware components has the flexibility to reduce the price markup. In some cases, however, storage controller vendors will not be able to reduce the price because of their business models.
· Case for Storage Controller: If cost is really the issue, one could build a storage controller using commodity parts. A storage controller places many more disks behind each CPU complex than a host does, which spreads the CPU cost over more storage and avoids leaving CPU resources underutilized.
· Case for Storage Controller: One could argue that the data layout format in a storage controller can be optimized for storage efficiency instead of performance. Furthermore, one can focus on operational efficiency by using very dense shelves and by powering down unused disks. It will be very difficult to realize these features in a peer-to-peer environment where the peers store both primary data and the second copy.
· Case for Cloud Storage: For a small to medium size company, the operational efficiency benefits (power, space, storage management) of putting the second copy in a cloud managed by someone else make it an attractive alternative. Internally, the cloud provider could employ either a peer-to-peer or a storage controller model. However, one has to trade off cost and operational efficiency against the privacy, wide-area network performance and availability concerns associated with cloud storage. Many organizations are experimenting with putting their archival storage into a cloud. There are other cost-related reasons why small to medium organizations would also want to put their primary copy in a cloud (that is a separate topic for discussion).
· Case for tape based storage: Depending on the access pattern and purpose of the second copy, it could potentially be stored on tape. For example, people store backup data on tapes. People also store archival data on tapes, since the probability of ever retrieving it is close to zero. However, disk densities are approaching tape densities, so unless tape is significantly cheaper, disk will be the preferred storage medium for the second copy. Disk supports random I/O, and the disk head is packaged with the disk platters (that is, one does not have to worry about whether a particular head can read a particular cartridge).
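To make the utilization argument in the cases above a bit more concrete, here is a rough sketch of a cost model in Python. It is only an illustration under my own assumptions: every price, disk count and utilization figure below is hypothetical, and the rule of charging the idle share of a node's CPU/memory cost to storage is a simplification I chose, not an established costing method.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A box that holds disks: either a peer or a storage controller."""
    name: str
    node_cost: float        # hypothetical cost of CPU, memory and chassis
    disks: int              # number of disks behind this CPU complex
    disk_cost: float        # hypothetical cost per disk
    disk_tb: float          # capacity per disk in TB
    app_utilization: float  # fraction of CPU/memory used by applications (0..1)


def cost_per_tb(node: Node) -> float:
    """Hardware cost per raw TB, charging the idle CPU share to storage."""
    # The part of the node cost that applications do not use is treated
    # as overhead paid purely for hosting the disks (an assumption).
    storage_share = node.node_cost * (1.0 - node.app_utilization)
    total_tb = node.disks * node.disk_tb
    return (storage_share + node.disks * node.disk_cost) / total_tb


# All numbers below are made up for illustration only.
busy_peer = Node("busy peer", 3000.0, 8, 100.0, 2.0, app_utilization=0.9)
idle_peer = Node("idle peer", 3000.0, 8, 100.0, 2.0, app_utilization=0.1)
controller = Node("controller", 20000.0, 200, 100.0, 2.0, app_utilization=0.0)

if __name__ == "__main__":
    for node in (busy_peer, idle_peer, controller):
        print(f"{node.name}: {cost_per_tb(node):.2f} per TB")
```

With these made-up inputs, the busy peer comes out cheaper per TB than the controller, while the mostly idle peer comes out more expensive than the controller. That is the same trade-off as in the bullets: peers make sense when the application keeps their CPUs busy anyway, and a controller makes sense when the CPU would otherwise sit idle in front of a handful of disks.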
Thus, as can be seen, people will choose different options based on the situation.
