December 01, 2008

Resurgence of Direct Attached Storage Model

Hey, isn’t 2009 just around the corner? So why am I still talking about DAS (direct attached storage)?  The shared storage (NAS/SAN) proponents argued that NAS/SANs were desirable because a) they de-coupled compute purchasing from storage purchasing and b) one could consolidate storage administration for multiple applications by making the applications share the storage infrastructure, and thus reduce operational costs by having dedicated storage administrators.

In the DAS model, the local storage attached to an application server is only accessible to that particular application server, whereas, in the NAS/SAN model, multiple application servers can access the common storage. Nowadays, the DAS model is making a comeback for the following important reasons:

  • Application Vendors are providing DAS solutions: Many application vendors are encouraging their customers to use direct attached storage (as an appliance) instead of shared storage to reduce hardware costs. The application vendors are providing replication functionality to overcome box failures.  The idea behind this approach is that the application administrator will do end-to-end management (both application and storage) of that box.
  • Emergence of Flash Storage: With the emergence of Flash technology, one could potentially have enough flash at the host to fit the entire working set of an application in the flash storage at the host. This will definitely help to cut down on network latency. Furthermore, Flash is beginning to provide a competitive IOPS/dollar equation.
  • Emergence of Map-Reduce (web-indexing, data mining etc) Applications: The CPU, memory and disk requirements of these applications scale evenly. Therefore, it makes sense for these applications to pursue a DAS model. Some of these architectures pursue an asymmetric meta-data server model (like in Hadoop File System).

Now, some important questions that need to be answered are: 1) should these DAS boxes back up their data onto a shared secondary storage box that provides storage efficiency via de-dup/compression, power savings, search/indexing, disaster recovery etc? Or 2) should these DAS boxes be connected to each other in a peer-to-peer model and back up their data at other peers?   When would one want to use the former approach, when is the latter approach desirable, and when is a combination of the two desirable? I will analyze the answers to these questions in my next posting.

November 30, 2008

Storage Vendor Requirements for Cloud Computing

Currently, there is a lot of hype regarding clouds. There are many application level (Salesforce, Google, Oracle), compute processing level (Amazon EC2), and storage level (Amazon S3, Nirvanix) public cloud providers. The public cloud providers are primarily targeting small and medium sized businesses. There are also service companies like IBM that are aspiring to provide private cloud computing solutions to enterprise level customers with multiple locations and data centers. Finally, there are many computer companies like VMWare (VCloud), EMC (Atmos), IBM (BlueCloud) etc that are aspiring to provide hardware and software to public and private cloud providers.

The basic objective of a cloud provider is to provide a web service/SLA based interface to a combination of hardware and software resources. Moreover, the cloud provider provides the necessary management support and is able to dynamically adapt the supply of hardware or software resources based on the user demand. In the past, people have advocated grid/utility computing paradigms that also provided similar benefits. People are trying to analyze the differences/similarities between cloud computing and grid/utility computing, but in our opinion this analysis is very subjective in nature and it does not provide much value-add.

In this blog entry, I will try to list all of the important cloud friendly attributes that need to be provided by a storage vendor. These features can be leveraged by storage clouds, or indirectly via application clouds.

Requirements

Cloud User Requirements

Interface requirements:

Non-Posix Interface

Web Services Based Interface

Search Capability

Ability to attach Meta-Data

Transaction Support

Partial file loading

Basic file I/O commands

Policies/SLAs requirements

Data Availability (protection from various types of failures)

Performance

-Workloads

-Some applications require fast read throughput

-Some applications are archival in nature

-There are minimal read/write or write/write conflicts

Reliability

Security

Spin Down

Data Copies and Placement

Object De-Dup

Object Versioning

Geographic Multi-Site Requirements:

Global Namespace

Global Policy Scope/Engine

Data accessible from anywhere

Integration with Edge Caches

Management Requirements

Reporting/Chargeback

Application Level Cloud Requirements on Storage

Application level clouds will be more popular than storage only clouds. Therefore, it is very important to have good integration with applications like Exchange, VMWare, Oracle, SAP.

Interop-Heterogeneity requirements

Leverage/Incorporate existing legacy resources into cloud

Clouds containing heterogeneous resources

Data Migration to clouds and from clouds

Additional Cloud Provider Requirements

Global Efficiency Requirements

Global Storage Efficiency

Global Resource Utilization Efficiency

Physical Space Efficiency

Power Usage Efficiency

Management Requirements

Change Management

Capacity Planner

Provisioning

Global DR Setup

Reporting

Multi-Tenancy  (ensuring there are secure partitions for the different tenants)

October 23, 2008

Top Storage Management Challenges

In this blog entry, I will articulate the key technical trends that will drive the
architectures of future storage management software. I have been thinking about this
topic for the past few days, so I decided to post my thoughts. Some of the key
trends that will influence storage management architectures are:

* Application Driven Management: Traditionally, server, storage, fabric and
application management tasks have been treated separately. However, going
forward, most of these management tasks will be driven by business requirements
via application level policies.  Application vendors (like Oracle Automatic
Storage Management), hypervisor vendors (like VMWare Virtual Center) and
server vendors (like IBM Director) are starting to provide storage management as
an integral part of application, hypervisor and server management respectively. 
Thus, it is imperative for storage vendors to ensure that their management stacks
interoperate (via well defined APIs) and also integrate (GUI, database, agents)
with these different management stacks. 

* End-to-End SLA Management/Policy-Based Management/Autonomic
Management:  With an increase in a) the number of devices in the data center, b)
the number of tunable knobs on hardware and software resources, c) the number
of competing applications with different workload characteristics sharing the
storage infrastructure, and d) the number of hardware and software layers between
the application server and the storage controller, manual planning and manual
SLA enforcement have become tedious and error prone processes.  Thus, there is
a need for both pro-active policy-based planning tools (for provisioning, disaster
recovery setup etc) and reactive SLA enforcement tools (like automatic migration,
workload throttling etc). However, it is important to still keep the administrator in
the loop. That is, plans should be presented to administrators, and they should be
given the opportunity to override or change them. It is also important to provide
end to end policy-based management. For example, a policy-based provisioning
tool should generate plans for host, fabric and storage resources. That is, figuring
out which storage controller or storage pool to use is just part of the required
solution. The generated plan should determine the number of paths in the fabric,
how the devices should be zoned, the number of switch hops etc.  Finally, with
the emergence of the cloud computing phenomenon, SLA management has become an
even more important problem because customers using the cloud want SLA
guarantees.

* Dynamic Data Centers:  The state of a data center is constantly changing
because of the constantly changing business (application) requirements. That is,
changes in the number of users, changes in capacity and performance requirements,
and changes in security and availability requirements result in a data center that is
in a constant state of flux. In general, in order to have a dynamic data center one
needs to have sophisticated change management tools.  In complex data center
environments, it is wise to pro-actively perform “What-if” analysis before
actually making the changes. Re-active problem debugging, after making a
change, is difficult and time consuming. Thus, there is a need for pro-active
planning and analysis tools.

* Increase in scale with respect to data centers, devices, management objects
and meta-data: The amount of data getting digitized and stored persistently is
increasing at a very fast pace. Furthermore, the number of copies for a data item is
also increasing. Finally, due to compliance requirements, and for sentimental
reasons, data items are being kept around for longer periods of time. All of these
factors are contributing towards an increase in the number of objects being
managed (in the range of billions of objects) and also towards an increase in the
number of devices. Furthermore, most enterprises have multiple data centers for
disaster recovery and latency reduction reasons. In addition, the amount of meta-
data associated with an object is also increasing because of system, inferred and
user provided meta-data. Thus, it is important to have a scalable management server
design (either a monolithic meta-data server or a federation of meta-data servers)
that is able to handle billions of objects and thousands of devices both with
respect to GUI design as well as with respect to our data store design. The data
management architecture also needs to be able to perform planning operations
across multiple data centers. It is also very important to re-visit the notion of
sending meta-data from storage devices to the meta-data server because this data
shipping approach will not scale.

* Primary Storage not at the Storage Controller:  Storage at host (direct attached
storage) is being advocated by application vendors (like Microsoft, Oracle etc)
and is also being leveraged by Web 2.0 companies which want to scale CPU,
memory and disks (co-locate applications and storage). The DAS
architecture will become more prevalent as flash gains popularity.
With the increased capacity of HDDs and SSDs, a large portion of the application
working set will fit at the hosts. Thus, primary storage will now exist at hosts.
With primary storage being present at hosts, one can envision  a) a peer to peer
DAS architecture or b) a client-server architecture where the DAS box is the
client and a storage controller is the backup box or c) a combination of both types
of architectures. The management software provider needs to determine whether
to pursue a distributed (where there is a management server on each host) or a
centralized (where management meta-data is shipped to a centralized management
server) management architecture. Furthermore, it is not clear whether the host
side storage should be managed by application management software or storage
management software or both. It is important to assess the benefits of each of
these different alternatives.

* Storage Management as a Service: There is a limit on the amount of storage that
can be managed by a single storage administrator. This upper bound is improving
with the use of storage management tools. However, it is still not keeping pace
with the enterprise data consumption rates. Thus, enterprises are constantly in
search of experienced storage administrators. Furthermore, not all storage
management tasks are performed at the same frequency. For example,
capacity planning tasks are performed at a lower frequency than data provisioning
tasks. Hence, it makes sense for an organization to outsource either all or some of
the storage management tasks. In many cases, the vendor providing storage
management services will not have storage management personnel dedicated for
just one customer. Instead, they would like to either remotely access the storage
management meta-data, or ship the storage management meta-data from the
customer site to their site, in order to perform the management tasks. Storage
management solution vendors need to ensure that their management solutions
handle the security and scalability issues of remote storage management.

* Heterogeneous Management:  In order to ensure that they don’t get locked into
a single vendor, in the future data center operators will definitely have storage boxes
from multiple storage vendors. Data center operators do not like using many
different storage management tools to manage the storage from the different
vendors. Ideally, if there is a single management tool that can manage storage
from the different vendors, that software will become a key control point. SMI-S
standards are still evolving. They provide many basic profiles that help to perform
basic monitoring and control operations. However, many of the vendors have
their own extensions to the basic SMI-S profile for the various resources, and
many of the advanced features (different types of copy services operations) are
not available via SMI-S interfaces. Thus, there is a lot of effort required to build a
heterogeneous management framework.

October 21, 2008

How can researchers make Product Impact?

When new PhDs join an industrial research group, one of the most important questions they wrestle with is how to make product impact.  Many major IT companies like IBM, HP, Microsoft, Yahoo, Sun, NetApp have a research or advanced technology group. Usually these research groups are separate from the revenue generating product or business units. Most of these new grads are well versed in the technical aspects of the emerging technologies. However, they are usually in for a rude surprise when they find out that technical excellence is not sufficient for making product impact.  So, what is the secret to making product impact?

Different people have different opinions about what constitutes product impact. Many times there is disagreement between the business unit technical leads and the researcher about what can be considered product impact. There is no single answer to this question. Instead, a researcher and a product architect need to come to a mutual agreement as to what constitutes product impact. Some ways in which one can make product impact are:

  • Develop a new algorithm that is implemented by a product group.
  • Answer questions that improve the understanding of unknowns (and thus minimize risk). In some cases even negative results are useful.
  • Develop an advanced feature and push it into a product (either at the architecture phase, or via implementation and testing).
  • Build a prototype which, in turn, generates new feature ideas for different products.

However, having a bright idea, getting a best paper award, or having a cool prototype does not usually result in product impact. So, once again, what is the secret to making product impact?

Relationships, relationships, relationships……

In addition to having an idea/prototype, having a good working relationship with the product architects is very important towards making product impact. There are several ways in which one can cultivate a good working relationship:

  • Involve the product architect in all phases of idea development: That is, don’t just develop a prototype and say, “Can’t you see this is so cool? Why don’t you use it?” Instead, get periodic input from the architect from the very beginning. Try to jointly come up with new ideas.
  • Build your prototype using the existing product base: If possible, try to implement your prototype in the existing product with the architect’s help. This will help to increase the credibility of your idea.
  • Enhance your ideas via discussions with the product architects because they will have many unique insights that are very valuable. File joint patents and write papers with them on these jointly developed ideas.
  • Don’t leave the product architect to deal with all the problems after dropping the prototype into their lap. Instead, continue to proactively help them with the problems because, as they say, “A happy customer is a repeat customer”.
  • Don’t worry about who is leading the joint effort: That is, don’t worry about the organizational structure of the joint effort. Let the product architect lead the implementation effort. Your management should be mature enough to observe your contributions.
  • Have joint meetings between customers and product architects/managers: Product managers usually have a huge laundry list of high priority things to do. By facilitating joint meetings between customers who want these new features and the product managers, you will help the product managers to realize the value of the new ideas/features you are proposing.
  • Most importantly, be confident about your ideas but also be humble. That is, you should not flash your PhD as an offensive weapon. Instead, your PhD should give you the wisdom to deal smartly with people. Most experienced architects are smart enough that they will not perceive humility as weakness.

In conclusion, a pro-active approach will yield more fruit than engaging with the architects only after the prototype has been built. It usually takes multiple months (even years) to build a good solid relationship. But a good relationship is for life.

September 02, 2008

Green Revolution

This article of mine will appear in Logistics 2.0 magazine's special issue on Green IT.

There was a green revolution in the 1960s and 1970s which increased the yield of agricultural production in many developing countries. Similarly, there is another type of green revolution that is occurring in today’s environmentally conscious world for information technology (IT) systems. A data center that hosts application servers, network switches and storage devices is considered to be green if it consumes less power for running and cooling the computer systems hosted by it. Data center operators are very conscious about power consumption because there is a limit on the amount of power that can be provided by the power sub-station supporting the data center. Once that limit is reached, the data center operator has to open a new data center, and the new data center will initially not be fully utilized. Thus, in today’s cost conscious era, most data center operators are vigilant about a) ensuring that the computing resources are not underutilized (because underutilized resources still consume a lot of power even when they are not doing useful work), b) procuring devices that consume less power and c) ensuring that they are using efficient data center cooling mechanisms. Thus, in this article we will discuss how data center administrators deal with the following issues:

* How to increase resource utilization?
* How to evaluate the resources they are purchasing with respect to power?
* How to ensure the devices are consuming less power?
* How to design/leverage efficient data center cooling mechanisms?

How to increase resource utilization?

Memory, CPU, fans, network cards and disks are some of the important computer resources that consume power. The input workload determines how much work these resources have to perform, and this, in turn, dictates the amount of power consumed by the computer system. It is interesting to note that the amount of power consumed by a system that is 50 percent utilized is not much more than that consumed by a system that is 10 percent utilized (due to the presence of fixed power consumption costs). Hence, less power is consumed by a single box that is 50 percent utilized than by five boxes that are each only 10 percent utilized. In the past, the software running on a box was very tightly coupled to the hardware, and thus it was very difficult to dynamically move an application from one computer to another. However, with the emergence of hypervisor (server virtualization) technologies like Xen, HyperV, and VMWare, it is now possible to dynamically move applications between computer systems, and thus increase the overall system utilization. The hypervisor technology allows multiple applications to run on a single box, and the failure of a single application does not affect the execution of the other applications running on the computer.  The hypervisor technology has been around since the 1960s (IBM VM operating system), but only now has this technology been made available on commodity hardware systems, and thus it has become more prevalent. Most data center operators are re-designing their data centers to leverage this technology.
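The consolidation argument can be illustrated with a simple linear power model: a fixed idle cost plus a part that grows with utilization. This is only a sketch. The helper name `server_power_watts` and the idle and peak wattages are illustrative assumptions, not measurements of any real system.

```python
def server_power_watts(utilization, idle_watts=200.0, peak_watts=300.0):
    """Linear power model: fixed idle cost plus a utilization-dependent part.

    The default wattages are assumed, illustrative values.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return idle_watts + (peak_watts - idle_watts) * utilization


# Same total work: five boxes at 10 percent versus one box at 50 percent.
spread_out = 5 * server_power_watts(0.10)
consolidated = server_power_watts(0.50)

print(f"five boxes at 10 percent: {spread_out:.0f} W")
print(f"one box at 50 percent:    {consolidated:.0f} W")
```

Under these assumed numbers the fixed idle cost dominates, which is why the single consolidated box draws far less power than the five lightly loaded ones.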

How to evaluate resources with respect to their power consumption properties?

Standards organizations like SNIA (Storage Networking Industry Association) are grappling with the task of how to rate a storage device with respect to its power consumption properties. For example, a storage device might use flash technology instead of disk drives, and thus, it can consume less power while being more expensive. So, it is not prudent to just look at the amount of power that is consumed by a computer device in isolation. Instead, one should look at power consumption in conjunction with performance, availability, reliability, physical shelf space requirements, and cost considerations. Therefore, system administrators have to make trade-offs between these different parameters, and there is a need for new power related metrics like IOs/Watt or IOs/Watt/dollar, or Watts/Cubic Feet etc.  Furthermore, the administrators need to look at their respective application workloads to select the proper type of computer resources. For example, there is a difference in the I/O characteristics of archival workloads and on-line transaction processing (OLTP) type workloads. In archival workloads one does not care about high throughput, whereas, in OLTP workloads throughput requirements are very important. Thus, one can purchase a storage system with slower RPM disks for archival workloads than for OLTP workloads because slower RPM disks consume less power and are usually cheaper.
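To make metrics such as IOs/Watt and IOs/Watt/dollar concrete, here is a minimal sketch that computes them for two devices. The `StorageDevice` class and every number in it are hypothetical and chosen only for illustration; they do not describe real products or any metric definition published by a standards body.

```python
from dataclasses import dataclass


@dataclass
class StorageDevice:
    name: str
    iops: float           # I/O operations per second (assumed figure)
    watts: float          # average power draw (assumed figure)
    price_dollars: float  # purchase price (assumed figure)

    def iops_per_watt(self) -> float:
        return self.iops / self.watts

    def iops_per_watt_per_dollar(self) -> float:
        return self.iops_per_watt() / self.price_dollars


# Hypothetical devices: flash is faster but more expensive.
flash = StorageDevice("flash", iops=20000.0, watts=10.0, price_dollars=2000.0)
disk = StorageDevice("disk", iops=200.0, watts=10.0, price_dollars=200.0)

for device in (flash, disk):
    print(device.name,
          device.iops_per_watt(),
          device.iops_per_watt_per_dollar())
```

With these made-up figures, flash wins by a factor of 100 on IOs/Watt but only by a factor of 10 once price is included, which is the kind of trade-off an administrator has to weigh against the workload.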

How to ensure that devices are consuming less power?

There are both pro-active and re-active techniques with respect to reducing power consumption in computer devices.

Pro-active techniques:  These techniques ensure a priori that devices consume less power. For example, one can cut down on the number of disks being used by using higher capacity disks. A one Terabyte disk will consume less power than ten 100 Gigabyte disks. Similarly, one can reduce the number of copies of data, and use data compression, thin provisioning and data de-duplication techniques to reduce the amount of data being stored on disks. This, in turn, reduces the number of disk drives being used which, in turn, leads to less power consumption. One can also use Flash drives or higher efficiency power supplies to pro-actively cut down on the amount of power being consumed.
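The drive-count arithmetic behind these pro-active techniques can be sketched as follows. The function name `drives_needed`, the 10 W per-drive figure, and the 2:1 reduction ratio are assumptions made for illustration; the sketch also assumes that a larger drive draws roughly the same power as a smaller one.

```python
import math


def drives_needed(logical_gb, drive_capacity_gb, reduction_ratio=1.0):
    """Number of drives required after compression/de-duplication.

    reduction_ratio is logical size divided by stored size (assumed input).
    """
    physical_gb = logical_gb / reduction_ratio
    return math.ceil(physical_gb / drive_capacity_gb)


WATTS_PER_DRIVE = 10.0  # assumed, and assumed equal for both drive sizes

small = drives_needed(1000, 100)     # ten 100 GB drives
large = drives_needed(1000, 1000)    # one 1 TB drive
reduced = drives_needed(1000, 100, reduction_ratio=2.0)

print(small, small * WATTS_PER_DRIVE)      # drives, watts
print(large, large * WATTS_PER_DRIVE)
print(reduced, reduced * WATTS_PER_DRIVE)
```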

Re-active techniques: In re-active techniques one dynamically changes the state of a physical resource from a high power consuming state to a low power consuming state. The state of CPU, memory and disk drives can be dynamically transitioned between different power states. It is important to note that there is a trade-off between power consumption and performance when one transitions a device to a lower power state. For example, if we spin down or shut down a disk drive, the next time we want to read data from that drive we will incur higher latency. This behavior is not acceptable for every type of workload. For example, interactive applications cannot wait for disks to spin up.
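One way to reason about the spin-down trade-off is a break-even calculation: spinning down only saves energy if the disk stays idle long enough to repay the extra energy spent on spinning down and back up. The sketch below uses a deliberately simplified energy model, and all wattage and energy figures are assumed values rather than data for any real drive. It also ignores the latency penalty, which has to be judged separately per workload.

```python
def break_even_idle_seconds(idle_watts, standby_watts, transition_joules):
    """Idle time beyond which spinning down saves energy.

    transition_joules is the extra energy for one spin-down plus spin-up
    cycle (an assumed input in this simplified model).
    """
    if idle_watts <= standby_watts:
        raise ValueError("standby must draw less power than idle")
    return transition_joules / (idle_watts - standby_watts)


def should_spin_down(expected_idle_seconds, threshold_seconds):
    """Spin down only when the expected idle period exceeds the break-even."""
    return expected_idle_seconds > threshold_seconds


# Assumed figures: 8 W spinning idle, 1 W in standby, 140 J per cycle.
threshold = break_even_idle_seconds(8.0, 1.0, 140.0)
print(threshold)
print(should_spin_down(5.0, threshold))
print(should_spin_down(600.0, threshold))
```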

How to design efficient cooling mechanisms?

In the past people assumed that for every 1 watt of power consumed, one requires 1 watt of power for cooling. However, now people are building sophisticated data centers to reduce the power required for cooling. Data center builders are using the notion of hot aisles and cold aisles, and are also encasing (insulating) the racks to ensure that hot air does not mix with the freshly brought cold air. Data center designers are also using blanking panels to fill up empty space in racks in order to manage air flow efficiently. People are also locating data centers in regions where the outside air temperature and humidity are optimal (temperature range of 20 to 25 degrees C, and humidity range of 40 to 45 % with a maximum dew point of 17 degrees C). Some system designers have started to leverage water cooling in lieu of air cooling in order to more efficiently remove the heat from the hot systems. However, the plumbing infrastructure requirement for water cooling leads to higher startup costs. Data center designers are also employing raised floor designs to facilitate better air flow circulation. Overall, the use of these cooling techniques is now leading to a ratio of less than 1 watt of cooling for every 1 watt of power consumed.
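The cooling ratio discussed above maps directly onto the commonly used power usage effectiveness (PUE) figure, which is total facility power divided by IT equipment power. The sketch below converts a cooling ratio into a PUE value. The function name is hypothetical, the 0.5 ratio is an assumed illustrative value, and non-cooling overheads such as power distribution losses are set to zero by default for simplicity.

```python
def pue_from_cooling_ratio(cooling_watts_per_it_watt,
                           other_overhead_watts_per_it_watt=0.0):
    """PUE for a facility, expressed per 1 W of IT load.

    PUE = (IT power + cooling power + other overhead) / IT power.
    """
    return 1.0 + cooling_watts_per_it_watt + other_overhead_watts_per_it_watt


legacy = pue_from_cooling_ratio(1.0)    # the old 1 W cooling per 1 W rule
improved = pue_from_cooling_ratio(0.5)  # assumed ratio for a newer design

print(legacy, improved)
```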

In conclusion, it is important to note that, in addition to performance, power consumption is another quantitative way of evaluating a system. Going forward, as standards bodies produce new power measurement units, this will become another key differentiator between the products from different vendors.

July 18, 2008

Why Should Industrial Research Labs Sponsor University Research

NetApp recently hosted its 1st Annual University Day at the NetApp Sunnyvale Campus. Top notch professors from Wisconsin-Madison, CMU, UC Berkeley, UCSC, MIT, UCSD, Harvard, UIUC, Waterloo, Indian Institute of Science, Johns Hopkins, and University of London attended this successful event. In addition, faculty members from Georgia Tech, Michigan, Stony Brook, and Duke wanted to attend but had to cancel due to scheduling conflicts. [Note: If professors from other universities are interested in attending the next NetApp University Day, please send me an email.]  During this event we got feedback from all of the professors that “NetApp has got it right” with respect to how it interacts with and supports universities. NetApp provides a) research funding, b) access to its internal system usage logs that provide valuable information about workloads, device failures etc, c) equipment, d) opportunities for grad students to intern at NetApp and e) most importantly, it allows its researchers to interact with faculty and students, and thus provides access to real-world customer problems. As a result, NetApp’s reputation as a premier research lab in the storage field is definitely growing.

Now, what does NetApp get in return? How does supporting University Research positively affect NetApp’s bottom line? What edge does a company gain by supporting University Research when research results are publicly published and are available to everyone? Isn’t having a research division enough? Why should a company also support university research?

In this blog, I will try to articulate the benefits of supporting university research. The brain trust of the NetApp advanced technology group in the CTO office had the foresight and vision to support university research. Supporting university research can help a company in the following ways:

  • Enhances Reputation as an innovator: Perception influences the decision making process of most human beings. If CIOs perceive a company as an innovator, then this will definitely influence their purchasing decisions in that company’s favor. For example, NetApp collaborated with Wisconsin-Madison, UCSC, UIUC and Johns Hopkins to publish 6 papers at the 2007 FAST conference. FAST is the premier storage conference, which is attended by people from top universities, government agencies, and storage vendors. The positive press NetApp received in the blogs and magazines with respect to the innovations in these papers cannot be measured even in gold. Many CIOs were interested in these technologies and contacted NetApp to ask when we would ship products with these cool features. In addition, many new professors now want to interact with NetApp, and most top notch graduating students want to interview at NetApp. These papers were made possible by our collaboration with the above mentioned universities.
  • Access to top notch Graduate Students: A company’s success or failure depends upon the type of people it hires. The inability to attract top people for a prolonged period of time is the first sign that the company’s future is spiraling downwards. It is essential to build relationships with graduate students from good universities if you expect them to apply for a job at your company. These relationships cannot be built overnight. They have to be cultivated over a 3 year time span by meeting the students at their universities, inviting them to give talks at your company, collaborating with them on papers, and inviting them as summer interns at your company. Professors happily provide access to their graduate students if they have a working relationship with your organization. The endorsement of a company by a faculty member positively influences the graduate student about working for that company. The “Google effect” sucked in many top students and made them unavailable to other companies for a period of 3-4 years. However, now due to the limited amount of external publication by Google researchers (or lack of publication by many top grad students after they joined Google) and reduced monetary benefits in the post-IPO era, many graduate students are once again giving other research labs a fair shot.
  • Paper/Patent Collaboration: As described above, funding research in universities makes the professors more receptive towards collaboration with the funding organization. Universities appreciate money, but they appreciate “smart money” even more. Smart money means interacting with and advising students in addition to providing money. This collaboration results in numerous papers and patents.
  • Faculty Sabbaticals: Different organizations have different cultures and outlooks. The confluence of ideas from these different organizations can lead to creative positive energy. This energy if channeled positively can lead to many innovations. Thus, sabbaticals can be highly effective in cross-pollination of ideas.
  • Influence Purchasing Decisions: In countries like India, faculty members participate in government committees that have been constituted to make purchasing decisions for different government organizations and public sector companies. If a faculty member has a positive view about a company, this will definitely help that target company.
  • Invitation to Retreats: University retreats are an excellent place to meet with faculty, students and also researchers from other organizations. In these events, researchers from various companies usually keep their guards down, and they freely discuss new ideas and problems they are facing. Once again these events foster cross-pollination of ideas.
  • Invited speakers: Faculty members are more receptive to traveling and giving talks at sponsoring organizations. The fresh perspective provided via these talks can stimulate the thinking cells in a large number of people in the hosting organization.
  • Access to research results:  Many universities have a research consortium setup. That is, many industrial companies fund a group of professors. The results/patents obtained by the professors are available to the sponsoring consortium companies for early evaluation. The companies usually have to pay a nominal fee if they want to license or use the patents in their products.
  • Opportunity to teach at Universities: Industrial researchers are invited to teach graduate courses or lectures at universities. I have personally taught a storage system graduate course for one semester at UCSC, and have given lectures at UC Berkeley. This is a great opportunity to interact with graduate students. This gives the company an insider’s track wrt good graduate students (who can be subsequently invited for summer internships).

  • Access to Federal grants: Governments allow universities to jointly apply for funding in collaboration with industrial labs. This is another source of income for corporations to develop technologies for government. This same technology (with modifications) can potentially be also leveraged by the company in its products.
  • Consultation Help: Companies can get consultation help from experts from academia in areas where they do not have resident experts. Many times companies get leads to expert faculty members from other faculty members who already have a good working relationship with the sponsoring company.
  • Invitation to be on Program Committees, PhD committees, Journal Editorial Boards: Researchers in companies get invited to be on various committees. This is usually due to the existence of excellent working relationships with faculty members who are conference chairs or PhD supervisors. The presence of industrial researchers in these committees helps to boost the reputation of their respective organizations.

In conclusion, just as it is extremely important for a company to have excellent relationships with its customers, it is equally important for an organization to have good relationships with the top talent producing institutions of the world (note: it is important to develop university relationships with institutions across the world because no single country has a monopoly on creativity). As they say, “Great people make great companies”.  Thus, if a company does not have sustained access to good talent, then it is just a matter of time before that company becomes a mediocre “me-too” company.

July 07, 2008

5 Disruptive Trends in the Storage Industry

Folks, in this post I will talk about what I consider to be 5 disruptive trends in the storage industry. By disruptive I mean trends that can hurt the bottom line of a storage company (the traditional SAN/NAS storage controller/filer vendors) if it misses out on them.  The 5 disruptive trends (in no particular order) are:

1.   Flash (NVRAM technologies)

2.   Server Virtualization and Hypervisor Technologies

3.   Green Storage

4.   Web 2.0

5.   Cloud Computing/Storage Services

Now, I will briefly discuss why I consider these things to be disruptive (in my next blog posting I will list out the outstanding research challenges in each of these areas):

  • Flash: The cost of NAND Flash (SLC) technology is falling rapidly, and Flash is now cost effective in comparison to high end fast disk drives.  Flash provides much better random I/O performance than disks (disks are competitive for sequential I/O performance).  Flash has slow writes, wear-out, read disturb and program disturb problems. So, one needs to have software that can overcome these problems. Flash read speeds are not as fast as DRAM, and thus, it will not be a replacement for RAM.  Thus, if vendors can figure out how to put random read intensive data on flash, and other types of data on disks, then one can get the best of both worlds.  Another interesting aspect is that one can potentially put the entire working set in Flash, and in that case the IOPs/dollar and IOPs/Watt benefits of flash are extremely competitive wrt disks.  Finally, one can put 2 Flash SSDs in a 1 U server box. Thus, a server can potentially host its entire working set in Flash. This development will change the role of a traditional storage box. That is, it will be used more as a secondary store than as a primary store. Thus, companies which figure out how to best leverage flash at the host will definitely have an upper hand over companies that simply use flash as primary storage or as a cache. Furthermore, companies that figure out how to store the right type of data on flash and disks (hybrid solutions) will definitely have an advantage over companies that store all of the data just on flash or just on disks.

  • Server Virtualization/Hypervisor Technology:
    • Server Virtualization technology is allowing companies to consolidate their physical server boxes leading to power, space and purchase cost savings. This, in turn, affects storage companies in the following interesting ways:
      • Storage companies have to ensure that their data management abstractions match the management abstractions being used by the server virtualization companies. For example, a VMware VMDK should match a storage volume, instead of forcing the users to put multiple VMDKs into a storage volume.
      • Hypervisor companies want to provide end-end management solutions. That is, ideally they would like smarts in their management software (like VMware Virtual Center) and they would want a common dumb storage (least common denominator) interface for managing the underlying storage from different vendors. So, the storage vendors have to re-examine the role of their respective storage management software. Currently, the storage management functionality in the Hypervisor management software is not that advanced, but that will change in the future.
    • Storage controller companies have to leverage the hypervisor technology themselves in order to provide advanced functionality to their customers. Some storage vendors are already running their storage controller software on top of Hypervisors. Hypervisor technology can be leveraged by storage vendors to:

      • Run application software on storage controllers.
      • Re-design storage software into finer grained modules for better fault isolation.
      • Scale the storage software as the number of CPU cores increases.
      • Provide non-disruptive upgrade functionality

Thus, companies that excel in better mapping their storage management constructs to the server virtualization constructs, and which can leverage hypervisor technology in their storage boxes will definitely have a clear advantage over other companies.

  • Green Storage: During the past 2 decades most computer companies have been obsessed with performance because this was one of the few quantitative ways of comparing systems. Going forward, “Power” will become the next quantitative metric that will be primarily used for comparing computer systems. SNIA has already started to design comparison metrics such as IOPs/Watt, Watts/Cubic Feet, Watts/Gigabyte etc. So, storage companies will have to design power efficient systems not just for archival workloads but also for regular OLTP type workloads.  People are pursuing both pro-active as well as re-active techniques in this regard. Pro-active techniques like data de-duplication, compression, thin-provisioning, using high capacity disks help to cut down on the number of disks, and thus, reduce the power consumption. Reactive techniques try to dynamically spin-down or shut-down the storage devices to save power. New technologies like Flash will also play a key role in improving the power efficiency of storage systems. So, going forward, the companies that can excel in pro-active and re-active power management will definitely have an upper hand.
  • Web 2.0 Architectures: Google wanted to use 1 U commodity servers as the base for its computation framework. As they were designing this, they realized that they could use the same underlying framework to also satisfy their storage needs because each of their 1 U servers had 2 disks. The key point is that they decided not to use a traditional RAID 5 type protection mechanism. Instead they decided to use a 3-copy protection mechanism with weak consistency semantics between the copies. Thus, Google was able to drastically cut its upfront capital expenditure by combining commodity hardware with home grown distributed file system software. In exchange for these cost gains, they gave up the more advanced functionality offered by storage vendors, such as better data integrity and reliability guarantees, better performance and better capacity utilization. This approach has spread like wildfire amongst the Web 2.0 companies. There is even open source distributed data management software similar to Google File System, such as Hadoop File System and Lustre, that many of the Web 2.0 companies use in lieu of the Google File System software. The combination of this open source software and commodity hardware has the potential to do to the storage industry what Linux has done to the server operating system industry. This is a triple threat for traditional vendors because:
    • Web 2.0 companies are not using traditional storage boxes.
    • Web 2.0 companies are starting to offer storage hosting services (while not using traditional storage boxes) to enterprises.
    • Server/Service companies like IBM/HP can offer storage service solutions, hosted at the customer sites, that are built from non-traditional storage vendor boxes.

Thus, storage companies have to package their offerings in such a way that they have lower upfront costs. This will then entice the Web 2.0 companies to purchase storage from them instead of building their own solutions. Furthermore, in the future, if the Web 2.0 companies want more advanced storage management features, then the storage companies can provide this functionality and charge the Web 2.0 companies accordingly.

  • Cloud Computing or Storage Services or Storage Hosting: Companies like Google and Amazon are starting to offer compute/storage hosting services to enterprises. Now, this trend is different than the failed storage backup services that were offered in the past in the following key ways:
    • Eat their own cooking: Google and Amazon anyway need large data centers for their own internal needs. They are leveraging this internal need to amortize costs for their customers.
    • Higher Value Add Services: They are offering more than just storage hosting or compute hosting services. Instead they are offering an eco-system where they provide database, security, file system, messaging etc services to their customers. Thus, customers can quickly build their applications.
    • Trust/Reputation: Google and Amazon are trusted companies that will not go bankrupt unlike smaller storage hosting companies.

There are trust/security issues, availability issues and performance issues that one still needs to overcome. Moreover, there are still no well defined SLAs being offered wrt performance by these companies. However, the convenience and pricing structure of these offerings is making people lose their inhibitions about using these services (similar to how people are using TurboTax to do their taxes and e-file them via Intuit).  The combination of cloud computing and the fact that the companies offering these hosted services are not using traditional storage boxes is definitely a potentially disruptive trend for traditional storage vendors.  It is important to note that the current Web 2.0 solutions cannot quite offer the reliability, performance, availability characteristics desired by traditional OLTP type workloads. However, over time, these things will be worked upon and will be improved. So, it is important for storage companies to assess whether they want to be suppliers (both low end and high end functionality) to these storage/application hosting companies or whether they want to get into the storage hosting game themselves.
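The capacity utilization trade-off mentioned under the Web 2.0 trend (3-copy protection versus RAID 5 type protection) can be made concrete with a little arithmetic. The following is a minimal sketch, not anything taken from a real product: the function names are made up for illustration, and the 8-disk RAID 5 group size is an assumption. It also ignores hot spares, file system overhead and, importantly, the price difference per raw terabyte, which is the whole reason the 3-copy approach can still come out cheaper.

```python
from fractions import Fraction


def usable_fraction_replication(copies: int) -> Fraction:
    """Fraction of raw capacity that holds unique data with N full copies."""
    if copies < 1:
        raise ValueError("need at least one copy")
    return Fraction(1, copies)


def usable_fraction_raid5(disks_in_group: int) -> Fraction:
    """RAID 5 spends one disk's worth of capacity per group on parity."""
    if disks_in_group < 3:
        raise ValueError("RAID 5 needs at least 3 disks per group")
    return Fraction(disks_in_group - 1, disks_in_group)


def raw_capacity_needed(user_tb: int, usable_fraction: Fraction) -> Fraction:
    """Raw terabytes to buy in order to store user_tb terabytes of data."""
    return Fraction(user_tb) / usable_fraction


# Hypothetical comparison: 100 TB of user data.
three_copy = raw_capacity_needed(100, usable_fraction_replication(3))
raid5_7_plus_1 = raw_capacity_needed(100, usable_fraction_raid5(8))
print(float(three_copy), round(float(raid5_7_plus_1), 1))
```

Under these assumptions, 100 TB of user data needs 300 TB of raw disk with 3 copies, versus roughly 114.3 TB with 7+1 RAID 5 groups. So the commodity approach only wins on capital cost when its raw capacity is well under half the price per terabyte of the traditional box.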

February 07, 2008

Web 2.0 Storage Architecture: Is it a threat to traditional storage controller (server) vendors?

Since the publication of the Google File System article in SOSP 2003, there has been a lot of interest among the Web 2.0 companies in forgoing the traditional Storage Controller (server) based NAS and SAN architectures in favor of clusters of servers with direct attached storage at each of the server nodes. This storage model can be disruptive for the traditional storage vendors because Google and Amazon are not only using this architecture internally, but are also offering hosted storage services based on this architecture.  In this post, I will try to analyze whether the world is really coming to an end for the traditional storage controller vendors, and what exactly the trade-offs are between the traditional storage controller functionality and the commodity server based model being pursued by the Web 2.0 vendors. The Web 2.0 architecture consists of commodity servers with either 2 or 4 direct attached SATA disks.  They run a clustered file system, and typically also provide an object interface to their storage infra-structure.

Google and other Web 2.0 companies are making the following basic assumptions with respect to their environments:

  • The workloads are primarily append only. That is, there are no read/write or write/write locking conflicts, and they do not try to optimize performance for random workloads (like OLTP). Usually, they don’t have to deal with multiple types of workloads.
  • Initially, Google had a need for the CPU cycles on the server nodes. However, other Web 2.0 companies are still pursuing the same architecture even if they don’t really have the need for the CPUs because the cost per Gigabyte is much lower than the traditional storage controllers. The Web 2.0 companies are also beginning to increase the server to disk ratio.
  • They need a scalable multi-site file system architecture that can support thousands of nodes.
  • Different Web 2.0 applications have different levels of replica consistency requirements.
  • Many of the Web 2.0 companies prefer an object interface model. That is, they want to put/get objects into persistent storage.
  • Many of these Web 2.0 applications do not care about data loss because a) they can always re-create the data and b) since they do not charge their customers any money for the storage services, they employ the “Buyer Beware” policy. Most of the Web 2.0 architectures employ a 3 copy replication model to provide availability. In addition to the original, one copy is kept locally on a different server, and another copy is kept at a remote site. The 3-copy replica mechanism is adequate for ensuring availability. They employ check-summing functionality at the file system level to detect data corruption.
  • They employ home-grown storage/infra-structure management tools.
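To make the assumptions above a little more tangible, here is a minimal sketch of a put/get object interface with 3-copy replication and checksums. It is purely illustrative and does not reflect how Google File System or any other real system is implemented: the `Node` and `ReplicatedStore` classes are hypothetical, each server is modeled as an in-memory dictionary, copies are written synchronously (real systems use weaker consistency between copies), and every site is assumed to have at least two servers with at least two sites in total.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Content checksum used to detect corruption on read."""
    return hashlib.sha256(data).hexdigest()


class Node:
    """Hypothetical commodity server; its attached disks are modeled as a dict."""

    def __init__(self, name: str, site: str):
        self.name = name
        self.site = site
        self.blobs = {}  # object key -> (data, checksum)


class ReplicatedStore:
    """Toy object store: one primary, one same-site copy, one remote copy."""

    def __init__(self, nodes):
        self.nodes = nodes

    def placement(self, obj_key: str):
        # Deterministic pseudo-random ordering of the servers for this key.
        ordered = sorted(
            self.nodes, key=lambda n: checksum((obj_key + n.name).encode())
        )
        primary = ordered[0]
        # Assumption: the primary's site has another server, and a second site exists.
        local = next(n for n in ordered if n.site == primary.site and n is not primary)
        remote = next(n for n in ordered if n.site != primary.site)
        return [primary, local, remote]

    def put(self, obj_key: str, data: bytes) -> None:
        digest = checksum(data)
        for node in self.placement(obj_key):
            node.blobs[obj_key] = (data, digest)

    def get(self, obj_key: str) -> bytes:
        # Try each replica in turn and skip any copy whose checksum does not match.
        for node in self.placement(obj_key):
            entry = node.blobs.get(obj_key)
            if entry is None:
                continue
            data, digest = entry
            if checksum(data) == digest:
                return data
        raise KeyError(obj_key)


nodes = [Node("a1", "A"), Node("a2", "A"), Node("b1", "B"), Node("b2", "B")]
store = ReplicatedStore(nodes)
store.put("photo-1", b"hello")
print(store.get("photo-1"))
```

The point of the sketch is only that availability and corruption detection come from software running across cheap boxes, rather than from RAID or high reliability hardware inside a single box.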

Even though the current storage controllers can satisfy the above set of requirements, for cost reasons, the Web 2.0 companies are not using the traditional storage controller boxes.  The storage controller companies are typically providing the following added functionality that is seemingly not required by the Web 2.0 companies:

  • Provide good performance for OLTP type workloads. More importantly, since they are general purpose storage boxes they have to provide decent performance for a wide variety of workloads. They typically have very large read caches and NVRAM to provide fast write performances.
  • Provide support for different NAS/SAN protocol interfaces.
  • Provide high availability (it varies from Tier-1 boxes that have for practical purposes zero down time, to Tier-2 boxes that also have decent high availability). High end controllers are usually configured in active-active configurations to provide RTO/RPO support of 0. The 3-copy availability model could also potentially provide the same level of availability, however, the high end controllers try to provide this in a space efficient manner, and they also try to provide fast rebuild support. More importantly, they use higher reliability hardware components to reduce the number of failures/rebuilds and the periods of potential exposure in comparison to using commodity parts.
  • Provide varying types of continuous and point-in-time data copy services. These services help to provide protection against disasters and also help to quickly test applications before they are brought on-line.
  • Provide thin-provisioning support. That is, volume space is allotted on-demand.
  • Provide remote debugging/diagnostic support. Most of the high end controllers provide customer support by which the vendor can remotely diagnose failures, and also predict potential failures.
  • The storage controller model allows for a de-coupling of server/storage purchasing decision. That is, customers don’t have to purchase servers if they really only want more storage capacity. But the cost (and warranty service) of commodity servers is so attractive that the Web 2.0 companies do not care about not utilizing the CPUs on the servers (that is, only use them for the attached storage).
  • Provide storage resource management software that provides monitoring, planning, analysis and workflow based action enforcing mechanisms. Currently, the web 2.0 vendors are developing their own customized management applications because most of the storage management products primarily support the hardware from the specific storage vendor. Recently, with the emergence of storage management standards, storage management software vendors are beginning to provide basic management support for hardware from other vendors.
  • Provide virtualization, encryption, data de-duplication, and compression functionality. This functionality is currently not considered to be high priority by the Web 2.0 vendors. However, one could argue that data-de-duplication and compression can help to reduce the overall cost of the Web 2.0 company data infra-structure.
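Of the features listed above, thin-provisioning is probably the easiest one to illustrate in a few lines. The sketch below is a toy model under my own assumptions, not any vendor's implementation: the `ThinPool` class and its method names are hypothetical, and blocks are tracked in a dictionary rather than on disk. The only idea it shows is that a volume advertises a large logical size, while physical blocks are taken from a shared pool on the first write to each block.

```python
class ThinPool:
    """Toy thin-provisioned pool: physical blocks are allotted on first write."""

    def __init__(self, physical_blocks: int):
        self.free_blocks = physical_blocks
        self.volumes = {}  # volume name -> {"size": logical blocks, "map": {block: data}}

    def create_volume(self, name: str, logical_blocks: int) -> None:
        # Creating a volume reserves no physical space at all.
        self.volumes[name] = {"size": logical_blocks, "map": {}}

    def write(self, name: str, block: int, data: bytes) -> None:
        vol = self.volumes[name]
        if not 0 <= block < vol["size"]:
            raise IndexError("block outside the volume's logical size")
        if block not in vol["map"]:
            # First write to this block: take one physical block from the pool.
            if self.free_blocks == 0:
                raise RuntimeError("pool exhausted (volumes are over-committed)")
            self.free_blocks -= 1
        vol["map"][block] = data

    def read(self, name: str, block: int):
        vol = self.volumes[name]
        if not 0 <= block < vol["size"]:
            raise IndexError("block outside the volume's logical size")
        # Blocks that were never written have no physical backing yet.
        return vol["map"].get(block)


pool = ThinPool(physical_blocks=2)
pool.create_volume("vol_a", logical_blocks=100)
pool.create_volume("vol_b", logical_blocks=100)
pool.write("vol_a", 7, b"x")
print(pool.free_blocks)
```

Here two volumes that together advertise 200 logical blocks sit on a pool of only 2 physical blocks, which is exactly the over-commitment that makes monitoring of the pool important.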

In conclusion, I would like to make the following key points:

  1. I believe that the Web 2.0 companies have correctly/smartly recognized that their needs can be adequately satisfied by cheaper commodity based storage architecture.
  2. The traditional storage vendors can potentially de-couple their software from their proprietary hardware and provide the subset of functionality (in a lighter version of their product) that is required by the Web 2.0 customers to reduce costs. In other words, the storage companies can provide software-only solutions to the Web 2.0 customers.
  3. There are many other customers (and, for most companies, mission critical data) that require good performance and high availability. The current Web 2.0 architectures do not target these markets, and therefore, there is still a need for the boxes from the traditional storage vendors. In a nutshell, the traditional storage companies will continue to be in business.
  4. There is also a need to design storage systems where based on policies one can re-configure the software characteristics such as the selection of a replication consistency model, or the selection of a buffer management mechanism etc. Business case will dictate whether the traditional storage vendors will pursue these options.
  5. The Web 2.0 companies can definitely use the Google storage model to provide storage hosting support for archival types of workloads. Since this is an important class of workloads, I anticipate that most of the traditional storage vendors will either get into the storage hosting business themselves, or will provide software based solutions that can work with commodity hardware which in turn will be used by storage hosting service companies.

December 10, 2007

How to consistently publish papers in an industrial research setting

Welcome to the “Adventures of a researcher in an industrial lab” blog. If you have a freshly minted PhD or MSc and have started to work in an industrial lab, or if you are a university student who wants to know what it is like to work in an industrial setting, then this blog is for you.  In this blog I will cover some of the technical and non-technical challenges one has to overcome in order to succeed in an industrial lab setting. In industrial labs you will be evaluated with respect to your product, patent and paper contributions (3Ps).  Here in this weekly posting I will muse about both technical and non-technical things: I will a) help you enjoy your daily work, b) discuss some new upcoming disruptive technologies and c) discuss some best practices that will help you satisfy the 3Ps requirements.  In my first blog entry, I will talk about how to consistently publish a paper in a top conference as a researcher working in an industrial lab.

Most of the researchers in universities and in industrial labs aspire to publish papers in top conferences. Nowadays, papers are published in journals either for archival purposes or, in rare cases, because their content cannot fit (without compromising quality) in the space typically allotted to a conference paper. Thus, publishing a paper in a top notch conference is the desire of most researchers.  SOSP, OSDI, USENIX, FAST, SIGMETRICS, VLDB, SIGMOD, ISCA and DSN are some of the top conferences in which storage researchers like to publish papers. What are some of the unique challenges that an industrial storage researcher faces, in comparison to a researcher in a university lab, with respect to publishing papers, and how can one overcome them?

One of the fundamental problems is that ideas have to be cutting edge and novel in order for someone to be able to publish them in a top notch conference. Top conferences also require an implementation (prototype is acceptable) and thorough experimentation based on real workloads.  Nowadays "NO IMPLEMENTATION means NO PAPER". In many cases, these cutting edge ideas are not mature enough to be put into a product. Thus, it takes a lot of extra work to make them ready for products. Industrial researchers have to make a conscious decision with respect to whether they want to spend their energy in trying to productize their idea or whether they want to spend their energy running the necessary experiments and writing the paper. Furthermore, in some cases, the product environment is not ready for these cutting edge ideas. Thus, many researchers pursue less risky ideas that have a higher probability of getting productized in order to not show a goose egg in the product impact column at the end of the year, because many organizations give higher marks to product impact in comparison to paper impact.

From my past experience most storage researchers fall into either the paper impact camp or the product impact camp. That is, the researchers in the paper camp focus more on publishing papers at the expense of making product impact, and vice versa. The order of importance between papers, patents and product impact varies from one research lab to another (actually it even varies from manager to manager).  An industrial researcher can always publish a one-off paper, but is there a formula for an industrial researcher to publish papers (while also making a product impact) on a consistent basis? The answer interestingly enough is “YES”.

There are the following three types of paper that can be published by an industrial researcher (each paper should be evaluated on its technical merit, and thus, I do not like to rate one category of paper to be better than other categories):

  • Product related: A product related paper corresponds to product work that the industrial researcher is either actively pursuing or has completed in the past. The difficulty in publishing papers in this category has been already discussed above. It is important to note that as an industrial researcher this is not the only category of papers that one can produce. One has to be realistic: a researcher can produce about one product related paper every 2 to 3 years. One should set realistic expectations and plan accordingly for writing product related papers.
  • Collaboration with Universities: Industrial researchers have access to real workloads and they have problem databases that contain information about failure rates, and customer usage/configuration patterns. Industrial researchers also get an opportunity to interact with real customers, and thus, they are privy to many real customer problems. Collaboration with universities leads to a synergistic win-win situation where the universities bring their technical expertise and graduate student manpower to the table, and the industrial lab researchers bring the above mentioned knowledge to the table.  Provided that the patent issues can be worked out between the two groups, this arrangement can lead to some good publications. The time commitment on the industrial researchers’ part is not as much as in the product related paper case, and thus, they can still work separately on making a product impact via a different idea.
  • Collaboration with a summer intern: A good summer intern can really help the industrial researcher publish a good paper by providing relief with prototype implementation and experimentation efforts. The paper idea does not necessarily have to be related to the product related work that the industrial researcher might be actively pursuing. Summer internships typically last for 3 months, and thus, the research problem has to be identified before the summer intern’s arrival, and the intern must have some background in the selected area in order for one to realistically have a shot at success.  Getting a good summer intern is not easy. One has to cultivate relationships with the faculty at good schools in order to get the referrals. Ideally, one wants a student who is good at implementing, conducting experiments and writing papers. It is very rare to find a student who is proficient in all three aspects. Thus, an industrial researcher should have the patience to train a student. In some cases, it might take multiple summer internships with the same student to provide the necessary training. In some rare cases, if the student can intern for longer than 3 months or has time to work on the same problem even after returning to the university, then one can dramatically increase the chance for good publications.

Making both product and paper impact simultaneously is not impossible. As discussed above, one should systematically plan and try to write at least one paper belonging to either one of the above categories every year. In addition to smart planning, and hard work, one also needs luck in order to be able to publish in top conferences.  Thus, to keep the industrial researchers motivated,  some partial credit should be also given to them for submitting good papers to top conferences. Team work is very very important in order to publish papers in an industrial lab setting. Don't worry about who is the first author or the second author of a paper. As a PhD student it is important to be the first author in order to get full-time employment, but once you are in the industry, it is more important to co-operate and operate as a team in order to share ideas and the required effort, and this is the first required step towards publishing good papers. Having a good team will cut down on the amount of time one has to individually spend, and that saved time can be used to make other product and patent related contributions. Also, do not over extend yourself by simultaneously working on too many papers. You will end up doing a crappy job on all of your papers. Always think "Quality over Quantity".  Finally, work with experienced people with good publication track records. There is nothing better than learning from a good "Guru".
