Back on October 6, in a post on the decision by the London Stock Exchange to drop its Microsoft-based trading system, I mentioned a question I always ask when someone tells me that they want to use Windows servers to implement their infrastructure:
> But when people ask me if I think that they should install Microsoft Windows servers and infrastructure, I always ask them the same question: What do you think you know about running servers that Google, Amazon, and Yahoo! (as well as, now, the LSE) don’t know?
I was reminded of that question once again when I saw a note at DatacenterKnowledge.com about a presentation given by Google Fellow Jeff Dean at an ACM conference on “Large Scale Distributed Systems and Middleware”, in which Mr. Dean discusses aspects of Google’s current hardware and software architecture, as well as some of its plans for future development. (The slides from the presentation can be downloaded here [PDF].) Since Google has historically been somewhat reticent about describing its operations, this is a most interesting look “under the covers”.
Mr. Dean starts off by observing that computing seems to be gravitating toward a world that combines very small devices (smartphones, netbooks) and very large ones (a Google data center). Within a data center, Google uses clusters of racks, each rack containing 40-80 individual servers and an Ethernet switch. A typical server has 16 GB of RAM and 2 TB of disk. A rack with 80 servers will thus contain about 1 TB of RAM and 160 TB of disk, and a cluster of 30 racks will contain about 30 TB of RAM and around 5 petabytes of disk. The design must take into account not only the capacities, but also the latency and bandwidth of the different kinds of storage. Typical values, depending on where the storage sits in the cluster, are shown below:
| Drive Location | Capacity | Latency | Transfer Rate |
|---|---|---|---|
| Server | 2 TB | 10 ms | 200 MB/s |
| Local Rack | 160 TB | 11 ms | 100 MB/s |
| Cluster | 4.8 PB | 12 ms | 10 MB/s |
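Those figures translate directly into how long it takes to move data around. Here is a quick back-of-the-envelope sketch in Python, using the values from the table above; the "seek once, then stream" model is my own simplification, not something from the slides:

```python
# Back-of-the-envelope reading times for 1 GB at each level of the
# storage hierarchy, using the approximate figures from the table above.

levels = {
    # name: (capacity_tb, latency_ms, transfer_mb_per_s)
    "local server": (2, 10, 200),
    "local rack":   (160, 11, 100),
    "cluster":      (4800, 12, 10),
}

def read_time_seconds(size_mb, latency_ms, transfer_mb_per_s):
    """Pay the latency once, then stream the data at the given rate."""
    return latency_ms / 1000.0 + size_mb / transfer_mb_per_s

for name, (capacity_tb, latency_ms, rate) in levels.items():
    t = read_time_seconds(1024, latency_ms, rate)   # 1 GB = 1024 MB
    print(f"{name:12s}: ~{capacity_tb} TB available, 1 GB read takes ~{t:.1f} s")

# local server: ~2 TB available, 1 GB read takes ~5.1 s
# local rack  : ~160 TB available, 1 GB read takes ~10.3 s
# cluster     : ~4800 TB available, 1 GB read takes ~102.4 s
```

In other words, the farther the data lives from the CPU that needs it, the more capacity is available but the slower it is to get at, which is why data placement matters so much at this scale.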
Another issue that has to be considered is hardware failure. Modern hardware is very reliable, but even small probabilities of failure for single components translate into fairly frequent failures in the overall system. Mr. Dean points out that if you use super-reliable servers with an MTBF of 30 years, and build a system with 10,000 of them, approximately one server will fail per day. He presented a list of Google’s expectations for hardware failure in the first year of life of a typical cluster; here is a sampling:
- 1-5% of the disk drives will die
- Servers will crash at least twice (2-4% failure rate)
- Three routers will fail
- Twelve routers will have to be re-started
- One power distribution unit will fail
Figures like this make it clear that designing resilience and fault tolerance into the system from the beginning is a necessity, not an option.
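Incidentally, the one-server-per-day estimate above is simple arithmetic. A minimal sketch, assuming failures are independent and uniformly spread over the MTBF:

```python
# Expected machine failures per day for a fleet of "super-reliable" servers,
# assuming independent failures and a 30-year mean time between failures.

servers = 10_000
mtbf_years = 30

failures_per_server_per_day = 1 / (mtbf_years * 365)
expected_failures_per_day = servers * failures_per_server_per_day

print(f"{expected_failures_per_day:.2f} failures/day")   # ~0.91, i.e. roughly one a day
```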
Google’s servers run a version of the Linux operating system, using distributed scheduling software developed by Google, and the Google File System (GFS). GFS is designed to handle very large quantities of data:
- 200+ clusters, many with 1,000 or more servers
- 4-5 petabytes per filesystem
- Thousands of clients
- Read/write load of 40 GB per second
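Google described GFS in a published paper (“The Google File System”, SOSP 2003): files are split into large chunks (64 MB in the paper), a single master holds only the metadata, and chunkservers store replicated chunk data. The sketch below is a deliberately simplified, hypothetical rendering of that read path; the class and method names are mine, not Google’s:

```python
# A highly simplified, hypothetical sketch of a GFS-style read path:
# the client asks a metadata master which chunkservers hold each chunk,
# then reads the chunk data directly from one of the replicas.

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks, as in the published GFS paper

class Master:
    """Holds only metadata: which chunkservers store each (file, chunk index)."""
    def __init__(self, chunk_locations):
        self.chunk_locations = chunk_locations   # {(path, index): [chunkserver, ...]}

    def lookup(self, path, chunk_index):
        return self.chunk_locations[(path, chunk_index)]

class Chunkserver:
    """Stores the actual chunk bytes (here, just an in-memory dict)."""
    def __init__(self, chunks):
        self.chunks = chunks                     # {(path, index): bytes}

    def read(self, path, chunk_index, offset, length):
        return self.chunks[(path, chunk_index)][offset:offset + length]

def read_file_range(master, path, offset, length):
    """Translate a byte range into chunk reads, contacting one replica per chunk."""
    data = b""
    while length > 0:
        index, chunk_offset = divmod(offset, CHUNK_SIZE)
        replicas = master.lookup(path, index)                   # metadata from the master
        n = min(length, CHUNK_SIZE - chunk_offset)
        data += replicas[0].read(path, index, chunk_offset, n)  # data from a chunkserver
        offset += n
        length -= n
    return data
```

The key design point is that the master is never on the data path: it only answers “where is chunk N?”, and the bulk bytes flow directly between clients and chunkservers.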
A typical Google Web search (yes, those things you type into the little rectangular box) may touch 50 or more network services, and thousands of servers.
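With that many services in the critical path, a single slow backend can hold up the whole page, so responses are typically assembled from whatever sub-requests finish within a deadline. The sketch below shows that general scatter-gather pattern, not Google’s actual implementation; the backend names and latencies are invented:

```python
# A minimal scatter-gather sketch: fan a query out to several backend
# services in parallel, wait up to a deadline, and assemble whatever
# results come back in time. Backend names and latencies are invented.

import asyncio
import random

async def query_backend(name, query):
    """Stand-in for an RPC to one backend service."""
    await asyncio.sleep(random.uniform(0.01, 0.3))   # pretend network + work
    return f"{name} results for {query!r}"

async def search(query, backends, deadline=0.1):
    tasks = [asyncio.create_task(query_backend(b, query)) for b in backends]
    done, pending = await asyncio.wait(tasks, timeout=deadline)
    for task in pending:          # give up on backends that missed the deadline
        task.cancel()
    return [t.result() for t in done]

if __name__ == "__main__":
    backends = ["web-index", "images", "news", "ads", "spelling"]
    results = asyncio.run(search("distributed systems", backends))
    print(f"{len(results)}/{len(backends)} backends answered in time")
```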
Mr. Dean went on to discuss two of Google’s software innovations: the BigTable system, which adapts the database table idea (though not the full relational model) to store data that can be thought of as very large, sparse matrices; and the MapReduce system, a tool for processing very large data sets. With MapReduce, the typical problem is factored like this (a toy example follows the list):
- Read a lot of data
- (Map) Extract the information you care about
- Shuffle, sort, slice and dice it
- (Reduce) Aggregate, summarize, filter, or transform
- Produce output
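To make that factoring concrete, here is a toy, in-memory word-count example; the real system distributes the map and reduce phases across thousands of machines, so this sketch only shows the shape of the computation:

```python
# A toy, single-machine rendering of the MapReduce pattern described above:
# word count over a handful of "documents".

from collections import defaultdict

def map_phase(doc_id, text):
    """Map: extract the information we care about -- one (word, 1) pair per word."""
    for word in text.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    """Reduce: aggregate the values collected for each key."""
    return word, sum(counts)

def mapreduce(documents):
    # Shuffle/sort: group intermediate values by key.
    intermediate = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_phase(doc_id, text):
            intermediate[key].append(value)
    # Reduce each key group and produce the output.
    return dict(reduce_phase(k, v) for k, v in sorted(intermediate.items()))

documents = {
    "doc1": "the quick brown fox",
    "doc2": "the lazy dog and the quick fox",
}
print(mapreduce(documents))   # {'and': 1, 'brown': 1, ..., 'fox': 2, 'the': 3}
```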
Perhaps the most intriguing part of the presentation was an overview of a new system Google is working on, called Spanner. This is intended to be a management system that spans all of Google’s data centers. It is certainly being designed for large scale: on the order of 10 million servers, 10<sup>13</sup> directories, 10<sup>18</sup> bytes of storage, and 1 billion simultaneous clients around the world. The three functional goals of the project are:
- One unified global namespace across all systems
- Choice of settings for trading off data consistency vs. performance
- High degree of operational automation
One of the key challenges in building this system is developing a usable, general model of the tradeoff between data consistency and performance.
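The slides do not say what that model will look like. Just to make the idea concrete, here is a purely speculative sketch of the kind of knob an application might be given: each read states how much staleness it will tolerate in exchange for being served from a nearby replica. None of this reflects Spanner’s actual design.

```python
# Purely speculative sketch of a per-request consistency/latency knob.
# Nothing here reflects Spanner's actual design; it only illustrates the
# kind of tradeoff the slides describe as a goal.

import time

class Replica:
    def __init__(self, name, rtt_ms):
        self.name = name
        self.rtt_ms = rtt_ms          # round-trip time to reach this replica
        self.data = {}                # key -> (value, last_updated_timestamp)

    def read(self, key):
        return self.data.get(key)

def read(key, replicas, authoritative, max_staleness_s=None):
    """max_staleness_s=None means a strong read from the authoritative copy;
    otherwise, accept the nearest replica whose copy is fresh enough."""
    if max_staleness_s is not None:
        for replica in sorted(replicas, key=lambda r: r.rtt_ms):
            entry = replica.read(key)
            if entry is not None and time.time() - entry[1] <= max_staleness_s:
                return entry[0]       # fast path: nearby, possibly slightly stale
    entry = authoritative.read(key)   # slow path: consistent, but maybe far away
    return entry[0] if entry else None
```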
This is fascinating stuff. The problems of running a very large Internet service, which Google does and does well, are different both quantitatively and qualitatively from those of running a small office network. To the extent that “cloud computing” gets established, it will depend on finding good solutions to these sorts of problems.
Updated Friday, October 23, 20:35
I accidentally omitted some HTML tags from the paragraph about Spanner, so some of the figures did not turn out as exponents. Mea culpa.