November 09, 2009

Invitation to be the Guest Editor for ACM Transactions on Storage (Special Cloud Storage Issue)

Folks, I was recently invited to be the guest editor for a special issue on Cloud Storage of the ACM journal Transactions on Storage. The ACM Transactions series of journals covers many key areas of computing science and is generally regarded as a premier set of technical journals. The purpose of this blog post is 1) to share my thinking about how to construct this special issue and get your feedback and 2) to encourage you all to submit paper ideas or paper abstracts. I also want to sincerely thank the editor-in-chief Sree for this opportunity.

Currently there is a lot of buzz about Cloud Computing. Many people are wondering whether there are any new technical problems in this area in comparison to grid/utility computing, or whether this is really a business paradigm disruption and not a technical disruption. There are many different ways to classify clouds, such as public or private, and infrastructure, application or storage clouds. Moreover, different providers are advocating different types of competing storage architectures for the clouds. The primary objectives of this special issue are: 1) to provide an introduction to the area of storage for clouds, how it relates to other types of clouds, and the key motivation/use cases; 2) to take the fluff out and highlight the key technical problems in this area from the standpoint of storage vendors, different types of cloud providers, and cloud users; and 3) to provide high quality technical papers that address some of these technical challenges.

So, currently I am thinking about articles in the following key areas for this special issue (I would like your feedback):

·

Intro to Storage Clouds: Motivates this area, provides classification, and then highlights the key technical challenges

·

Security: These articles will deal with different types of threat models for the cloud, with how to ensure that data that is not being accessed is still safe, and with provenance and meta-data management issues.

·

QoS/SLA Management: Performance based SLA management is currently a major problem for the cloud providers. They are providing some initial forms of protection and availability guarantees. These articles deal with how to express SLAs/SLOs, discuss different types of workloads, and describe underlying management mechanisms that help the providers to guarantee SLAs. QoS management is an extremely hard problem in light of multi-tenancy, and multi-tenancy is a key tenet for increasing cloud resource utilization. SLA management becomes especially challenging when the scope of the cloud spans many data centers and geographies. The design of efficient fine-grained chargeback and the design of penalty functions are also interesting research areas in this space.

·

Protocols and Namespace: Currently the Amazon S3 namespace, access model and protocol seem to be the dominant paradigm due to first-mover advantage. There is definitely a need to understand the interoperability between this access model and the traditional file system access models, and also to understand the performance implications of this new HTTP/object level access model.

·

Storage Architectures: Currently, different providers and vendors are advocating different types of cloud storage architectures such as a large cluster of shared nothing nodes (with few disks behind each computing head) or large nodes with many disks behind each head. Some vendors are also advocating gateway nodes in front of traditional storage controllers to provide support for cloud protocols and policy based (SLA/SLO) management.

·

Caching Technologies: Currently, many customers would like to store their archival copies in public clouds and have quick access to these copies from their private clouds. Similarly, another use case is where an organization has all of its data in the cloud and would like quick access to this data from geographically distributed locations (WAN distances). There are other use cases where a provider/vendor would like to provide caching functionality for the storage it is currently hosting.

·

Standards: There is definitely a lot of opportunity and need to standardize cloud storage access models. Some might argue that it is too early to standardize because this will curtail innovation. However, I believe that many users are already moving quite fast towards adopting clouds. Thus, lack of standards will lock these users in, and will be a general liability towards further acceptance of clouds. There is therefore a need for papers that describe various cloud storage standards like SNIA CDMI, and how they are related to each other.

·

Data Mobility: The amount of data being generated by most organizations is exponentially increasing. Moving this data from the private cloud to the public cloud and vice-versa is a very important problem. Since WAN bandwidth is still relatively expensive, there is a need for new types of compression/de-duplication techniques, and eager/lazy copy techniques, to efficiently move data across WAN distances.

·

Storage for Server Virtualized Environments: Many cloud providers are using some form of virtualization technology to increase server utilization. Please note that not all cloud providers are using standard off the shelf virtualization stacks but instead have some form of light-weight home grown stack to provide benefits similar to a hypervisor. There are numerous new challenges with respect to performance, management, protection etc when one accesses storage via a hypervisor due to differences in management granularity, due to loss of application context when going through a hypervisor, and due to overhead introduced by the hypervisor. Thus, there are a lot of opportunities for good technical papers in this area.
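To make the Data Mobility point above a bit more concrete, here is a minimal sketch of how de-duplication and compression can cut down what has to cross the WAN. It is only an illustration under stated assumptions: the fixed 4 KB chunk size, the function names (plan_transfer, apply_transfer) and the in-memory "remote store" are all hypothetical, and real systems typically use more sophisticated (e.g. variable-size) chunking.

```python
import hashlib
import zlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size in bytes


def chunks(data: bytes, size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks (the last one may be shorter)."""
    for i in range(0, len(data), size):
        yield data[i:i + size]


def plan_transfer(data: bytes, remote_index: set):
    """Decide what must be shipped across the WAN.

    Returns (recipe, payload):
      recipe  - ordered list of chunk fingerprints needed to rebuild data
      payload - fingerprint -> compressed chunk, only for chunks that the
                remote side does not already hold
    """
    recipe, payload = [], {}
    for chunk in chunks(data):
        fp = hashlib.sha256(chunk).hexdigest()
        recipe.append(fp)
        if fp not in remote_index and fp not in payload:
            payload[fp] = zlib.compress(chunk)
    return recipe, payload


def apply_transfer(recipe, payload, remote_store: dict) -> bytes:
    """Remote side: store the new chunks, then rebuild data from the recipe."""
    for fp, blob in payload.items():
        remote_store[fp] = zlib.decompress(blob)
    return b"".join(remote_store[fp] for fp in recipe)
```

In this sketch, a second transfer of the same data ships only the recipe (a list of fingerprints) and no chunk payload, which is the effect one is after when WAN bandwidth is the expensive resource.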

In conclusion, I think some of the above problems have been looked at by people in the general systems area. Thus, I encourage folks from the database area, networking area, operating systems area, security area, fault-tolerance area, performance analysis area, and computer systems architecture area to also think about submitting papers to this special issue.

September 25, 2009

Cloud Computing is a Game Changer for Developing Countries

In the 1970s it was extremely difficult for the common man in India to get a land telephone line installed at his residence. The whole notion of phoning for help during emergencies was a privilege not available to the common man. Then, with the advent of cellular telephone technology, advances in telephone switching technology, and low cost mass production of cellular phones, most Indians bypassed the land line telephone generation. Within the context of India, this development was disruptive and can definitely be termed a revolution in the field of telecommunications.

Does cloud computing provide similar hope for developing countries like India (both individuals as well as small and medium sized companies) to bypass a generation of computing technology? Most small and medium sized organizations in North America are looking at cloud computing as a technology and business model that allows them to cut down on their capital (initial purchase costs) and operational (power, cooling and operator costs) expenditures. In North America, the model of cloud computing has existed in some form (grid, utility computing) in the past. However, Amazon and Google have made this model very popular. The main reason is that they had to build their data centers to cope with the peak load on their systems (e.g. the December shopping season in the case of Amazon), and thus they wanted to rent out their idle hardware resources during off-peak times. Business wise this gave them a lot of power to control the price of their cloud resources and make cloud computing attractive to small and medium sized businesses. Commodity based hardware architectures and virtualization based software architectures helped Amazon and other cloud providers reduce their costs and increase their resource utilization. In addition to understanding the existing technical challenges in the area of cloud computing, it is also important to understand the impact of cloud computing from an Indian context.

The following important questions need to be discussed with respect to how cloud computing can potentially help a developing country like India:

·

How can the combination of the $100 computer, cellular networks and cloud computing make computing power and various applications affordable to the masses? Customers can access applications (the most common case), run their own applications on the cloud infra-structure, or just back up their local data remotely. In most cases running hosted applications will make software affordable to the masses.

·

What are the key business challenges in realizing the above dream? What should be the pricing structure for providing the different types of cloud services?

·

Should government stay out of this venture? What is the role of government in both regulating as well as building the necessary manufacturing as well as data center infra-structure to help realize this vision?

·

What are the cradle to grave power requirements for managing potentially a billion computers and millions of data centers? Where and how will a developing country get the power/energy to realize this vision?

·

Is it possible to build a solar powered, low power consuming laptop that brings the operational cost of owning a computer further down?

·

Can we start a cottage industry for servicing these green laptops in the villages? How does one build the infra-structure to train the repair people?

 

There are still numerous technical challenges in realizing the above vision and here are a few that are immediately obvious:

·

Upgrading Cellular infra-structures to handle the extra load due to data traffic in addition to the voice traffic.

·

Federation of data centers to provide user mobility. In many cases the data centers will be in different administrative domains, and thus, standards have to be developed for the interoperability of the data centers.

·

QoS features especially when dealing with various classes of potential customers who are sharing the same underlying physical resources.

·

Scale of the system with respect to the number of users, number of logical objects, and number of data centers.

·

Security aspects with respect to secure access and data integrity guarantees.

·

Trust issues: with the increase in the number of system administrators, there are bound to be some rogue elements.

·

Monitoring/reporting at the level of individual users given multiple users can share the underlying physical resources.

·

Caching technologies at the low cost client devices, edge gateways and also at the servers in the cloud. Since most of the clients will share many common applications, caching technology (similar to what CDNs provide) becomes extremely important.

·

Since these low cost clients are being operated by people who cannot afford electricity for a long period of time, both efficient conservation of power at the computer level, as well as generation of electricity from alternate approaches (like solar power at the house level or at the individual computer level) have to be aggressively pursued.

·

If batteries are part of the solution, then the design of more efficient batteries and safe battery disposal strategies are a very important part of the overall solution.

Overall, the impact of realizing the above vision cannot be expressed in words. I will just provide three examples here to illustrate my point:

1) Just imagine if every classroom in a small school in remote villages has access to digital teaching resources.

2) The ability to remotely diagnose patients in remote villages.

3) The ability of farmers to find market rates for their produce and to directly sell their goods to the highest bidder via on-line auctions.



September 18, 2009

Is Quality a Zero-Sum Game?

Is Quality a Zero-Sum game?

That is, in order to get good quality do you have to give up on new functionality? Unfortunately for many companies without a coherent long-term strategy, quality has become a zero-sum game. That is, one gives up on functionality for the sake of higher quality. However, in this blog entry I will argue that this doesn’t have to be the case:

Before diving into the details, I would like to briefly articulate some general business trends that are magnifying the need for quality:

·

Tier-1 quality from Tier 2 Systems: Customers are expecting Tier-1 availability and performance predictability features from systems that were initially designed to be tier-2 products.

·

Pressure from Open Source Solutions: With the availability of open-source storage solutions, customers are expecting higher quality for solutions for which they actually have to pay. That is, they have higher quality expectations for even Tier-3 and below systems.

·

Cradle to Grave Quality: Quality is not constrained to just the product. Most customers expect a positive cradle to grave experience, from the buying experience all the way to moving the box out of their data center. This encompasses every service call, web experience and other forms of services.

·

End to End Quality: Customers care about end to end quality. That is, they don’t want excuses with respect to whether the problem is in the storage controller or SAN switch or host HBA etc. They want the entire solution (including management solution) to be well integrated and tested.

·

Downtime leads to big losses: Many companies cannot afford downtime of their core IT systems because the downtime translates into financial losses that are critical for the survival of the company.

 

Many people quickly assume that in order to have good quality one needs to have good processes and quantitative measurements in place. While these things are definitely important, I believe the solution to the problem of obtaining good quality is much larger in nature. I believe there are 3 fundamental strategic directions that will help one attain good quality and they are:

·

Cultural Reasons (Disciplined and Open Culture): I define culture as an attitude. You can learn it from watching others. Culture percolates itself into people, products, processes, schedules, meetings etc.

o

Train new employees (and managers) right from day 1 that they should bring to the attention of their managers when they find that development or testing processes are cumbersome or inefficient like long builds, lack of clear documentation, and lack of state of the art development tools.

o

Encourage middle tier employees to question strategic direction changes if they feel that those changes have not been well thought through. Encourage them to question unclear requirements. Encourage them to not hide problems just because they will delay schedules.

o

Encourage senior employees to listen to their juniors and not castigate them when ideas are challenged or new ideas are proposed.

o

Teach employees how to deliver the above feedback in a constructive manner and not in a confrontational manner.

o

Encourage employees to have the discipline to focus on their tasks and reward focused employees. Lack of focus leads to slippage in schedules.

o

Finally, the most important part of culture is when employees take pride in and ownership of their work. People will only take ownership if they feel there is something at stake for them. The employee evaluation system should not just evaluate employees on whether they completed a task. Instead it should evaluate them on whether they took ownership of their assigned tasks. The difference between ownership and just completing a task on time is the extra energy the employee expends in ironing out issues, helping others, constantly thinking about how to improve things, committing to schedules, thinking about the good of the project and team instead of selfish goals, and being a positive source of energy in the group. Please note, being positive is not a substitute for technical competence. However, arrogance and selfish behavior should not be tolerated for the sake of technical brilliance because they poison the entire culture.

·

Architectural Reasons (Modular Architectures and aggressive External Integration): Modular design and integration of your product or solution with others in the eco-system are two key technical mantras that lead to high quality.

o

A modular architecture a) helps isolate bad design and code, b) makes it easier and faster to test code, and c) makes it easier to develop and integrate new functionality, especially when the requirements are constantly changing. Training engineers in modular design techniques and aggressive review of designs ensures modularity. Many people wrongly assume that programming in C++ or Java will ensure a modular product. In those cases where companies are dealing with legacy code that isn’t modular in nature, those companies have to temporarily take a step back and re-architect their product in order to be a much more nimble competitor in the future (this is the only place where my argument that quality is not a zero-sum game is contradicted).

o

It is very important for a company to ensure that their products are fully integrated and tested in the customer’s environment because customers want end to end quality. This requires vendors to aggressively integrate and test their products with the products from other vendors.

·

Change Management: A company needs to manage change of different types in an efficient and structured manner. For example:

o

From product testing standpoint, one needs to be able to have an automated test infra-structure that can handle changes to ensure relevance. Most people focus on building an automated test infra-structure, but only few people focus on how to have a testing infra-structure that can accommodate changes efficiently.

o

From a mergers and acquisitions (M&A) standpoint, a company needs to very carefully decide when and if to integrate the new company's technology into its products. It is important to note that it is not always beneficial to integrate the acquired company's technology. Most mergers and acquisitions end up as failures. So, it is extremely important to hire people with a successful M&A background. In most cases cultural integration is a must in order to have a successful technical integration of the products.

o

The inability of the senior technical leaders to carefully think through the ramification of changes at the strategic/architectural level cannot be overcome via good design, implementation and testing. Thus, senior technical people have to do thorough due diligence and they need to carefully think through the solution many levels deep before advocating it to the others. There is no substitute for quantitative analysis, large scale system development experience, and thorough domain knowledge.

At the end of the day, in order to have a quality product, one needs to have a quality organization where every employee is trained to think, "Would I purchase the service or product if I were the customer?" This mentality has to trickle down to the smallest of processes, and the company culture and reward system should be targeted towards employee ownership where each employee feels they can make a difference.

July 20, 2009

All Roads Lead to SLO Management

Currently, important trends like 1) cloud computing 2) leveraging of Flash 3) management of storage in server virtualized environments and 4) power efficient storage are getting most of the attention.  However, the underlying core problem that needs to get solved for all of these trends is the same. Service Level Objective management is the core underlying problem that needs to be solved in some flavor in order to provide cost effective solutions for each of the above trends. In this blog I will describe why storage management is a hard problem:

 

·    Combination of complex hardware and software features:  In the past one only had to be concerned with selecting different RAID levels, different types of disks, and types of boxes with different CPUs/memory. Nowadays the functionality inside a storage box is more complex, with features like de-duplication/compression, encryption, combinations of flash and disks, caching layers, different types of replication services, non-disruptive migration, data striped across cluster nodes, shared-nothing or shared-disk architectures, storage at hosts (application servers) etc. Thus, the interplay of these various features and their impact on performance, protection etc. is non-trivial to determine.

·    Multi-tenancy:  In the past, most companies over-provisioned the required amount of storage to avoid sharing of the underlying resources, and thus, get some guaranteed level of service.  In many cases they used different controllers to host storage belonging to different applications. However, now due to cost cutting reasons the storage belonging to many applications is co-hosted on the same storage hardware. In many cases, the sharing of the underlying resources across multiple workloads makes it very difficult to guarantee performance notions to applications. One also needs to be able to monitor and quantify the amount of resources being used by the various applications.

·    Use of same underlying resources to offer different classes of service: In the past, people leveraged fibre-channel disks to provide storage for IOPS intensive applications and SATA drives to provide storage for non-IOPS intensive or capacity intensive applications. However, going forward, a storage box will contain both flash SSDs as well as low speed SATA drives. In many cases the system will partition the pool of SSDs and HDDs to carve up and offer different classes of service. Finding the optimum way to partition these resources across competing workloads is a non-trivial task.

·    End to End requirements: Customers do not care about the SLO of their storage in isolation. They want end to end SLO guarantees right from the application/hypervisor all the way to its storage. Thus, it is necessary for storage management tools to operate in conjunction with other management tools in a seamless manner.

·    Change of Requirements:  Customers are not operating in a static environment. The value for different pieces of data changes according to either pre-determined policies or based on changing business conditions. The ability of the underlying system to ensure that the right type of data resides on the right type of storage at the right time is a non-trivial task.

·    Global Optimization across Data Centers:  Most large enterprises have their resources spread across multiple data centers. Thus, it is necessary for management tools to be able to manage resources across multiple data centers in an integrated manner.

 

In conclusion, I want to re-iterate that SLO management is one of the most important core problems amongst the hype associated with clouds, flash and server virtualization. Modeling of hardware/software resources, SLO qualification of storage resources, ability to do end to end planning, and developing monitoring/discovery engines and correlating end to end discovered data are some of the key challenges that need to be solved in the area of SLO management.
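As a small illustration of what "monitoring against an SLO" can look like in practice, here is a minimal sketch that checks one workload's observed latency and throughput against a stated objective. The SLO fields, the nearest-rank percentile method and all names below are assumptions made for this example, not a description of any particular product.

```python
import math
from dataclasses import dataclass


@dataclass
class SLO:
    """Hypothetical per-workload objective."""
    max_latency_ms: float  # latency bound at the given percentile
    percentile: int        # e.g. 95 means "95% of I/Os within the bound"
    min_iops: float        # throughput floor the workload was promised


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of samples (p in 1..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def check_slo(slo, latencies_ms, observed_iops):
    """Return the list of objectives that the observed behavior violates."""
    violations = []
    if percentile(latencies_ms, slo.percentile) > slo.max_latency_ms:
        violations.append("latency")
    if observed_iops < slo.min_iops:
        violations.append("iops")
    return violations
```

In a multi-tenant box, a check like this would have to run per workload, which is why the ability to attribute resource usage to individual applications matters so much.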

July 18, 2009

Why not use a traditional RDBMS to store embedded meta-data?

Traditional relational database management systems (DBMSs) are under pressure to continue their growth and in some cases to defend their existing territory. Traditional DBMSs still have a stronghold in OLTP environments and DSS environments. However, their use is being challenged in Web 2.0 space and in the embedded meta-data management arena (e.g. meta-data management in storage systems). In both these cases, the users feel that the traditional DBMSs are too heavy-weight. In this blog, I will articulate the key factors that are causing people to move away from using traditional DBMSs or the reasons for the emergence of new DBMS companies that are targeting their DBMSs for these newer use case scenarios. Here are some reasons as to why people are moving away from traditional DBMSs:

·

Do not want Strong ACID semantics: ACID stands for Atomicity, Consistency, Isolation and Durability. DBMSs were fundamentally designed to provide ACID semantics. Ensuring consistency in a distributed environment is an expensive task with respect to latency and number of network messages. One of the fundamental premises for the new emerging applications is that they can tolerate a slight amount of inconsistency. This, in turn, allows them to be deployed on highly scalable architectures with thousands of nodes.

·

Ability to Auto-Scale with the addition of nodes: One of the key requirements in these new Web 2.0 environments and storage system environments is for new nodes to be seamlessly added to the existing cluster. The DBMS system needs to appropriately migrate/replicate/partition data in order to balance the load and reduce the query latency. Thus, many of the web 2.0 infra-structure builders feel that the traditional DBMSs cannot scale to these large cluster sizes. [Some people also feel that the traditional DBMSs do not scale wrt the required number of objects that need to be managed. However, I believe that high end DBMSs do scale and manage very large number of items].

·

Column Store Semantics: In many cases, the queries are over a single or few attributes of the rows. That is, unlike retrieving all the content of the rows, the queries focus on a few of the columns. Thus, the layout of the storage on the disks is optimized for column based queries.

·

Do not want sophisticated query processing and transaction semantics: Most of these applications want support for simple key-value lookup operations. This means that they do not need a sophisticated query engine. Similarly, due to relaxed consistency requirements, distributed transaction management capabilities across nodes are not required.

·

Integrated meta-data management: In many cases, users store their meta-data in a DBMS and their data in a storage controller. Ideally, users want to manage their data and meta-data in an integrated manner (that is, coordinated data migration, deletion, security management etc). Thus, if a storage system additionally also provides a meta-data store then the overall infra-structure management costs will be reduced because now the administrators do not need to manage their meta-data and data infra-structures separately.

·

Low overhead management: Administrators do not need to define schemas, indices, buffer sizes, and complex queries/views. They also do not need to separately worry about where to store the data (volumes), backup/restores, disaster recovery, data partitioning across the cluster nodes etc. Thus, there should be no additional management tasks for managing meta-data, and a storage administrator should be able to handle the administrative tasks associated with meta-data.

·

Cost: In the Web 2.0 space, the companies do not want to pay high license fees to use industrial strength DBMSs, especially when the existing DBMSs do not quite satisfy their needs. Thus, many companies have developed their own tailored data stores for storing meta-data information (like Google, Amazon and Yahoo).
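To illustrate the Column Store Semantics point from the list above, here is a toy sketch contrasting a row layout with a column layout for the same meta-data. The attribute names and values are made up for this example, and the sketch assumes every row has the same attributes. The point is only that a query over one attribute touches a single column in the second layout.

```python
# Row layout: each record keeps all of its attributes together.
# (Hypothetical meta-data records, for illustration only.)
rows = [
    {"object_id": 1, "owner": "alice", "size_bytes": 4096},
    {"object_id": 2, "owner": "bob", "size_bytes": 8192},
    {"object_id": 3, "owner": "alice", "size_bytes": 1024},
]


def to_columns(rows):
    """Convert a row layout into a column layout (attribute -> list of values)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_size_row_store(rows):
    """Has to visit every full record even though it needs one attribute."""
    return sum(row["size_bytes"] for row in rows)


def total_size_column_store(columns):
    """Reads only the one column the query is about."""
    return sum(columns["size_bytes"])


columns = to_columns(rows)
```

Both functions return the same answer; the difference a column store exploits is how much data has to be read from disk to get it.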

In this blog, I am not going to argue whether existing DBMSs with some modifications can address the above needs, or whether the architectures of the existing DBMSs are inherently too inflexible and bloated to be able to satisfy the above needs. The following two trends are definitely arguing for the latter: 1) many startup DBMS companies are trying to build DBMSs that satisfy the above requirements and 2) many Web 2.0 companies are trying to use homegrown or open source DBMS solutions.

April 30, 2009

My Storage Systems TextBook

Three years ago I was asked to teach a graduate course on Storage Systems at the University of California as a visiting faculty member. As I was preparing the curriculum I realized that there did not exist a good textbook on storage systems. By Storage Systems I mean the area below a database system. There are many textbooks on database systems, communication systems, programming languages and operating systems, but in my opinion there does not exist a good textbook on storage systems. So, I planned the lectures such that each week's lectures would contain the content of one chapter of such a textbook. Here is an outline that can be used by prospective authors of a storage systems textbook:

Storage Architectures (block storage, file storage, object storage, SAN file systems, clustered systems): These lectures discussed the different types of storage architectures. We also discussed the trade-offs between these different types of architectures.

Storage Devices (Storage Controllers, Disks, Tapes, SSDs, Optical Devices): These lectures described the architectural and operational details of these different devices (and the variants in each device type). We also discussed the relative tradeoffs between these different device types.

File System: These lectures first dealt with the basics of a file system, and then discussed the design choices that were made by different types of file systems like GPFS, GFS, WAFL, EXT3 etc.

Storage Protocols (parallel SCSI, FCP, iSCSI, SATA, SAS, NFS, CIFS, pNFS, different RDMA protocols, WebDAV): These lectures dealt with different types of storage protocols. Once again we discussed the trade-offs between the different types of protocols.

Storage Protection Mechanisms (RAID algorithms, checksum techniques, Redundancy components in the controller, scrubbing techniques, DR services, long term data preservation): In these lectures we covered the different ways to protect oneself from disk failures, head failures, site failures, and also different ways to detect and correct data corruption.

Storage Efficiency Techniques (De-duplication, Compression, Thin Provisioning): I did not have time to cover this topic. However, storage efficiency has become an extremely important area of research.

Storage Management: In these lectures we covered the basic framework for monitoring, analyzing, planning and executing storage management tasks such as provisioning, backup/recovery, performance management etc. We also dealt with different types of storage virtualization techniques, and also storage management within the context of server virtualization at the host.

Storage Security/privacy: These lectures dealt with on-disk, on-wire, access control, authentication, provenance and trust issues. We also discussed the different types of possible threats and how to protect against them.

Storage Power Management: These lectures dealt with power management metrics, proactive (high density disks, efficient power sources, compression, de-dup) and reactive power management techniques (disks spin-down, shutdown), and also briefly dealt with data center cooling techniques.

Performance Enhancement (Caching, Pre-fetching, Log Structured file system, Data Layout, I/O scheduling): These lectures discussed different techniques for improving the overall I/O performance.

Workload Classification: I did not have time to cover this topic but the goal is to discuss the various types of workloads (HPC, DSS, OLTP, Archival etc) and their impact on the design of the storage systems with respect to performance.

So, as you can see a lot was covered in that class. Most importantly, the students enjoyed the class (they gave a positive evaluation) and many of them are doing very well with respect to working at good companies, and also in their PhD research with respect to publishing papers at reputed conferences. I am interested in getting feedback wrt what other topics should be covered if I write this book.


Storage Architectures for Clouds: One Size Does Not Fit All

 

Cloud computing is definitely here to stay. The key premise is that cloud providers can build large data centers cheaply and then rent compute and storage space out to small and medium sized businesses much more cheaply than those businesses would have paid had they built the data centers on their own. Moreover, the cloud providers are making money while helping the small and medium sized companies. Thus, it seems like a win-win situation.

There are different types of cloud providers, like infrastructure cloud providers (Amazon, Google), application cloud providers (Salesforce), storage cloud providers (Amazon, Nirvanix) etc. So, the key question is: is there a one-size-fits-all type of back end cloud storage architecture for all of these different types of clouds? Is it cost effective for the cloud provider to always use a shared-nothing architecture (where storage is directly attached to each of the nodes) for content depot type of applications? Is it always cost effective for the cloud provider to keep 3 copies of every data item? There are certain non-technical reasons that have clouded (no pun intended) the thinking process and I would like to first articulate them before providing some technical analysis.

·

Don’t use Google and Amazon’s technical solutions as the reference architecture: One thing I have noticed is that many other companies are trying to mimic the Google file system architecture. It is important to note that Google and Amazon’s business model gives them the luxury of storing 3 copies of data. Similarly, since they themselves have high computation needs for their internal map-reduce applications anyway, they are able to leverage that and have fewer disks per CPU in the nodes of their clusters.

·

High cost does not mean bad architecture: The higher cost of the solutions being offered by traditional storage controller vendors does not mean that their storage architecture is flawed. The business models of these companies prevent them from flooring the prices of their boxes, but fundamentally these storage boxes are also made out of commodity components, and thus their COGS (cost of goods sold) is not that high. Thus, many of the cloud providers are getting away with storing 3 copies of data, and in some cases having few disks behind poorly utilized CPUs, because the total cost of the solution is still less than what they would have to pay to a storage controller vendor.
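To make the "3 copies is still cheaper" argument concrete, here is a rough back-of-the-envelope sketch in Python. Everything in it is my own illustrative assumption: the function names, the 8+2 parity group, and the 100 TB figure are hypothetical, and the sketch ignores rebuild time, availability and performance differences. It only compares raw capacity overhead and shows how much more per raw TB a parity-protected box could cost before 3-way replication on cheaper hardware wins.

```python
def raw_capacity_replication(usable_tb: float, copies: int = 3) -> float:
    """Raw TB needed when every data item is stored `copies` times."""
    return usable_tb * copies


def raw_capacity_parity(usable_tb: float, data_disks: int, parity_disks: int) -> float:
    """Raw TB needed for a parity-protected group (e.g. 8 data + 2 parity disks)."""
    return usable_tb * (data_disks + parity_disks) / data_disks


def break_even_price_ratio(copies: int, data_disks: int, parity_disks: int) -> float:
    """How many times more expensive per raw TB the parity-protected solution
    can be before plain replication becomes the cheaper option."""
    return copies * data_disks / (data_disks + parity_disks)


# Hypothetical example: 100 TB of usable data.
replicated_tb = raw_capacity_replication(100.0)       # 3 full copies
parity_tb = raw_capacity_parity(100.0, 8, 2)          # 8+2 parity group
price_ratio = break_even_price_ratio(3, 8, 2)
```

With these made-up numbers, replication needs 300 raw TB against 125 raw TB for the parity group, so replication is only cheaper if the parity-protected solution costs more than 2.4 times as much per raw TB.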

So, from a technical standpoint (wrt space efficiency, CPU utilization etc), I don’t think one architecture is optimum for all types of workloads and cloud deployments. Let me briefly articulate the different types of requirements:

·

Map-Reduce Applications: In these applications the CPU processing and memory needs of an application scale evenly along with the storage needs of the application. Thus, a shared-nothing architecture (with light-weight nodes with respect to the number of disks attached to a node) makes sense because the CPUs are being kept busy, and also by having the storage local to the processing node one cuts down on network utilization.

·

Content Depot Applications: In these applications objects (content) are stored in repositories. There are archival content depots (write once and read maybe) and active content depots (write once and read frequently). These applications warrant heavy-weight storage nodes. That is, there are thousands of disks behind a CPU complex to keep the CPU busy.

Thus, there is no single type of architecture that is appropriate for both of these workload types. Even with the emergence of flash, the light-weight shared-nothing cluster model is not useful if the application is not IOPS intensive. That is, flash is a good replacement for disks if the workload is random-IOPS intensive. Otherwise, if one just wants capacity for content depots, then heavy-weight cluster nodes with thousands of disks are still the way to go.
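The light-weight versus heavy-weight argument can be sketched with a small sizing calculation. This is only a toy model under my own assumptions: the node shapes (4 disks or 1000 disks behind 8 cores) and the two workloads are hypothetical numbers picked to illustrate the point, not measurements from any real deployment.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    name: str
    disks_per_node: int
    cores_per_node: int


@dataclass(frozen=True)
class Workload:
    name: str
    disks_needed: int   # capacity requirement, expressed in disks
    cores_needed: int   # compute requirement, expressed in CPU cores


def nodes_required(w: Workload, n: NodeType) -> int:
    """Nodes needed to satisfy both the capacity and the compute requirement."""
    by_capacity = math.ceil(w.disks_needed / n.disks_per_node)
    by_compute = math.ceil(w.cores_needed / n.cores_per_node)
    return max(by_capacity, by_compute)


def cpu_utilization(w: Workload, n: NodeType) -> float:
    """Fraction of the deployed cores the workload actually uses."""
    return w.cores_needed / (nodes_required(w, n) * n.cores_per_node)


def disk_slot_utilization(w: Workload, n: NodeType) -> float:
    """Fraction of the deployed disk slots the workload actually uses."""
    return w.disks_needed / (nodes_required(w, n) * n.disks_per_node)


# Hypothetical node shapes and workloads (illustrative numbers only).
LIGHT = NodeType("light-weight", disks_per_node=4, cores_per_node=8)
HEAVY = NodeType("heavy-weight", disks_per_node=1000, cores_per_node=8)
MAP_REDUCE = Workload("map-reduce", disks_needed=4000, cores_needed=8000)
CONTENT_DEPOT = Workload("content depot", disks_needed=4000, cores_needed=32)
```

In this toy model the content depot needs 1000 light-weight nodes (with almost all cores idle) but only 4 heavy-weight nodes, while the map-reduce workload keeps the light-weight nodes fully busy and would leave nearly all disk slots of heavy-weight nodes empty.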

In conclusion, I have the following different technical recommendations for different folks:

·

Cloud Providers: At the end of the day, if you use a single type of cloud storage architecture for all different types of workloads, then you will have inefficient deployments. There will be other hungrier cloud providers who will be willing to have different types of storage deployments for different workloads to gain better resource efficiencies, and thus, will be able to provide better pricing to their customers.

·

Storage Vendors: It is important for storage vendors to be adaptive and provide storage solutions for both types of workloads. They need to provide cheap and deep solutions for content depot type of workloads, and they need to provide a distributed shared-nothing storage solution that can scale to thousands of nodes for the map-reduce type of environments (if they want to be the provider for those types of applications also).


February 28, 2009

2nd Annual NetApp University Day

The second annual NetApp University Day on Feb 24th, 2009 was a big hit. Twenty professors from top notch systems schools from across the US and Canada visited NetApp. We had profs from Harvard, Duke, Wisconsin-Madison, CMU, Michigan, Stony Brook, UC Berkeley, UC Santa Cruz, UC San Diego, Waterloo, Toronto (student), Tennessee, Georgia Tech, Brown, UIUC, University of Illinois Chicago, Johns Hopkins, and Cornell attend this event. Some other professors from MIT and Stanford also wanted to attend but could not due to scheduling conflicts. There were technical presentations by Steve Kleiman (Chief Scientist NetApp) and by the CTO and Vice President of Engineering of a cloud provider (not naming them for privacy reasons). There were lively discussions on various topics such as a) Flash b) Cloud Computing c) Virtualization and d) Storage Management. A lot of ideas got exchanged in the room and everyone learned something new. Scott Dawkins (VP NetApp Advanced Technology Group) also gave a presentation on the NetApp University funding/relationship model. From NetApp's standpoint it was a great opportunity to get feedback from the professors about our university relationship model and our research direction. By the way, a byproduct of NetApp's university relationship is the joint publishing of 3 papers in this year's FAST conference with university co-authors. Moreover, one of these papers also got the best paper award in FAST 2009. Finally, whether we hire them or whether they accept our offer or not, most of the top graduate students from the above universities at least interview at NetApp.

Cloud Standards

I was asked to sub in as a co-chair at the 2009 FAST conference SNIA BOF on cloud computing by Alan Yoder. Mark Carlson from Sun was the primary chair person at this event. After this event, here are some things I am thinking about:

·

Involve non-traditional big boys: Any storage standardization effort without participation from non-traditional storage vendors like Google, VMWare and Amazon will not be that successful. Thus, it is important to actively lobby them to join SNIA and take their input.

·

Other Standards Initiatives: Any standard by SNIA has to work with the following other standards: a) OVF, a virtualization standard b) SMI-S, a storage management standard c) Cloud Computing Interoperability Forum (CCIF) and other server and network management standards.

·

Don’t standardize everything: It is not the right time to standardize everything. For example, basic data access protocols from Amazon S3 have de-facto become the data access standard that others are copying. However, other features like policy management, data auditing etc are not yet ready for standardization. That is, innovation should not be stifled by trying to prematurely standardize these features. [Mind you, policy management notions have been around for a while, but the previous efforts by SNIA in this area have not been successful].

·

Government Regulations: Standards and technology by themselves will not be enough to bring order into the cloud space. Some government regulations are also necessary to ensure protection for customers with respect to providers going out of business, or what the providers do with the customer data or where and how they store it.

·

End-End Standards: Ultimately the cloud storage related standards that SNIA comes up with need to interoperate with the overall server, network related cloud standards. For example, in some cases customers want to move their entire infra-structure from one data center to another data center (not just storage).

·

Taxonomy: SNIA is correctly targeting to initially come up with an agreed upon cloud storage related taxonomy before trying to standardize different features. For example, there are different types of clouds like a) storage as a service cloud like Amazon S3 b) application as a service cloud (application is also provided by the cloud provider) like Salesforce.com c) infra-structure provider cloud like Amazon EC2. There are public clouds as well as private clouds.

February 04, 2009

Why NetApp is Number 1 for a Storage Researcher!!

You have all probably heard by now that NetApp has been selected as the best place to work in America by Fortune Magazine in 2009. This nomination is primarily based on the feedback received from NetApp employees. Now, let me give you my perspective on why I think NetApp is the best company to work for in North America. I work in the Advanced Technology Group at NetApp. Our group is part of the CTO office and it consists primarily of PhD and MSc students from top universities/research labs in the world. My perspective would be relevant for aspiring Masters and PhD graduate students who want an industrial research lab job. So, here are the top reasons why I think NetApp is number 1:

·

Get to meet with real customers on a regular basis: Every ATG member gets numerous opportunities to interact with real customers. This is a great opportunity for us to learn about real customer problems. Thus, it is not difficult to find real customer problems to work on. Trust me, this also makes for great problem motivation when writing papers.

·

Opportunity to work on a diverse range of problems in the systems area: Many students have the misconception that in a purely storage company they will not have a diverse range of problems to work on. We work on problems in areas such as storage management, on how to leverage new hardware technologies, coding theory, new emerging architectures, and data mining algorithms. You name the new emerging technical area like virtualization, Web 2.0, new memory technologies etc, and we are working on those areas.

·

Opportunity to work with product groups and make product impact: Unlike some other research labs, the work you do has a very good chance of ending up in products. Ultimately, there is no greater feeling than seeing your work being used by actual customers. You get guidance and feedback from product group architects on a regular basis. Usually, you solve a problem and build a prototype. Subsequently, the product group leverages your experience and they, in turn, build products or make changes.

·

Support for publishing in top conferences: NetApp employees get an opportunity to publish both in an internal journal and also at top conferences. Our employees are collaborating with many professors from top notch US universities. We also have an excellent summer internship program where we get top students from top universities. NetApp has an excellent publication track record at reputed top conferences like FAST and USENIX.

·

Opportunity to interact with professors and attend university retreats and conferences: NetApp employees get opportunities to attend university retreats and leading conferences. NetApp also hosts an annual University Day where top systems professors from across the world come to NetApp and meet with NetApp employees.

·

ATG labs in multiple locations: We have labs in Boston, Raleigh, Sunnyvale California, and Bangalore India. Thus, people have the opportunity to work in a geographic area of their choice.

·

Proactive management and top notch colleagues: Initially, I was very nervous about leaving IBM research labs after working there for many years and moving to NetApp. But believe me, NetApp management is very proactive and pragmatic. They work very hard to make sure that their employees are happy and productive. Lastly, but most importantly, my co-workers are very positive and helpful. The atmosphere is very conducive for stimulating new ideas (a lot of patents are being filed by our team members) and for building complex prototypes.
