Tuesday, 23 October 2007

How Google Works!

The size of Google's infrastructure has never been officially confirmed. Some reports estimate that Google maintains over 450,000 servers, arranged in racks located in clusters in cities around the world, with major centers in Mountain View, California; Virginia; Atlanta, Georgia; Dublin, Ireland; and new facilities constructed in The Dalles, Oregon and Saint-Ghislain, Belgium. In 2009 Google plans to open one of its first sites in the upper Midwest, in Council Bluffs, Iowa, close to abundant wind power resources for meeting its green energy objectives and near fiber optic communications links.

When an attempt to connect to Google is made, Google's DNS servers perform load balancing to allow the user to access Google's content most rapidly. This is done by sending the user the IP address of a cluster that is not under heavy load, and is geographically proximate to them. Each cluster has thousands of servers, and upon connection to a cluster further load balancing is performed by hardware in the cluster, in order to send the queries to the least loaded Web Server. This makes Google one of the biggest and most complex known content delivery networks.
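The DNS-level selection described above can be sketched roughly as follows. This is only an illustration, not Google's actual logic: the `Cluster` record, the `pick_cluster` function, the `max_load` threshold, the coordinates, and the IP addresses (taken from the reserved documentation range) are all assumptions made for the example.

```python
from dataclasses import dataclass
from math import radians, sin, cos, asin, sqrt


@dataclass
class Cluster:
    name: str
    ip: str
    lat: float
    lon: float
    load: float  # hypothetical: fraction of capacity in use, 0.0 to 1.0


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def pick_cluster(clusters, user_lat, user_lon, max_load=0.8):
    """Return the IP of the nearest cluster that is not under heavy load.

    If every cluster is above max_load, fall back to the least loaded one.
    """
    candidates = [c for c in clusters if c.load < max_load]
    if not candidates:
        return min(clusters, key=lambda c: c.load).ip
    nearest = min(candidates, key=lambda c: distance_km(user_lat, user_lon, c.lat, c.lon))
    return nearest.ip


# Hypothetical clusters; coordinates are approximate city locations.
CLUSTERS = [
    Cluster("dublin", "192.0.2.10", 53.35, -6.26, 0.40),
    Cluster("atlanta", "192.0.2.20", 33.75, -84.39, 0.30),
    Cluster("mountain-view", "192.0.2.30", 37.39, -122.08, 0.20),
]
```

A user near London would be sent to the Dublin address while Dublin is lightly loaded, and to the next nearest cluster once Dublin's load crosses the threshold.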

Racks are custom-made and contain 40 to 80 servers (20 to 40 1U servers on either side), while new servers are 2U rackmount systems. Each rack has a switch. Servers are connected to the local switch via a 100 Mbit/s Ethernet link. Switches are connected to a core gigabit switch using one or two gigabit uplinks.

Since queries are composed of words, an inverted index of documents is required. Such an index maps each query word to the list of documents that contain it. The index is very large due to the number of documents stored in the servers.
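To make the idea of an inverted index concrete, here is a minimal in-memory sketch. The function names and the whitespace tokenization are assumptions for illustration; a production index would be sharded, compressed, and stored on disk.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of docids whose text contains it.

    `documents` is a dict of docid -> text. Tokenization is a plain
    lowercase whitespace split, which is a simplification.
    """
    index = defaultdict(set)
    for docid, text in documents.items():
        for word in text.lower().split():
            index[word].add(docid)
    return index


def search(index, query):
    """Return docids that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {
    1: "google file system",
    2: "google web search",
    3: "file server",
}
index = build_inverted_index(docs)
```

Looking up a single word is one dictionary access, and a multi-word query is an intersection of the per-word docid sets.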

Server types

Google's server infrastructure is divided into several types, each assigned to a different purpose:

* Google DNS Servers answer the DNS requests and serve as intelligent, worldwide load-balancers. They guess the data center nearest to the user to speed up all HTTP requests.

* Google Web Servers coordinate the execution of queries sent by users, then format the result into an HTML page. The execution consists of sending queries to index servers, merging the results, computing their rank, retrieving a summary for each hit (using the document server), asking for suggestions from the spelling servers, and finally getting a list of advertisements from the ad server.

* Data-gathering servers are permanently dedicated to spidering the Web. They update the index and document databases and apply Google's algorithms to assign ranks to pages.

* Index servers each contain a set of index shards. They return a list of document IDs ("docid"), such that documents corresponding to a certain "docid" contain the query word. These servers need less disk space, but suffer the greatest CPU workload.

* Document servers store documents. Each document is stored on dozens of document servers. When performing a search, a document server returns a summary for the document based on query words. They can also fetch the complete document when asked. These servers need more disk space.

* Ad servers manage advertisements offered by services like AdWords and AdSense.

* Spelling servers make suggestions about the spelling of queries.
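The way a web server coordinates the other server types can be sketched as a single function. Everything here is a stand-in: plain dictionaries play the role of index shards, the document server, the spelling server, and the ad server, and the scoring (a sum of per-word scores) is an assumption, not Google's ranking. HTML formatting of the result is omitted.

```python
def snippet(text, words, width=40):
    """Return a short excerpt starting at the first query word found."""
    lower = text.lower()
    positions = [lower.find(w) for w in words if w in lower]
    start = min(positions) if positions else 0
    return text[start:start + width]


def handle_query(query, index_shards, doc_store, spell_fix, ads):
    """Coordinate one query across hypothetical back-end services."""
    words = query.lower().split()

    # 1. Ask every index shard for matching docids and merge the scores.
    hits = {}
    for shard in index_shards:
        for word in words:
            for docid, score in shard.get(word, {}).items():
                hits[docid] = hits.get(docid, 0.0) + score

    # 2. Rank: highest combined score first, docid as a tie-breaker.
    ranked = sorted(hits, key=lambda d: (-hits[d], d))

    # 3. Fetch a summary for each hit from the document store.
    results = [(docid, snippet(doc_store[docid], words)) for docid in ranked]

    # 4. Ask the spelling service for a corrected query, if any.
    corrected = " ".join(spell_fix.get(w, w) for w in words)
    suggestion = corrected if corrected != " ".join(words) else None

    # 5. Collect advertisements keyed on query words.
    matched_ads = [ads[w] for w in words if w in ads]

    return {"results": results, "suggestion": suggestion, "ads": matched_ads}


shards = [
    {"google": {1: 2.0}},
    {"google": {2: 1.0}, "cluster": {2: 1.5}},
]
doc_store = {
    1: "Google runs many servers.",
    2: "A Google cluster has thousands of servers.",
}
page = handle_query("google cluster", shards, doc_store,
                    spell_fix={"gogle": "google"},
                    ads={"cluster": "Ad: rack hardware"})
```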

Server hardware and software

Original hardware

The original hardware (ca. 1998) used by Google when it was located at Stanford University included:

* Sun Ultra II with dual 200 MHz processors, and 256MB of RAM. This was the main machine for the original Backrub system.
* 2 x 300 MHz dual Pentium II servers donated by Intel, with 512MB of RAM and 9 x 9GB hard drives between the two. The main search ran on these.
* F50 IBM RS/6000 donated by IBM, with 4 processors, 512MB of memory and 8 x 9GB hard drives.
* Two additional boxes included 3 x 9GB hard drives and 6 x 4GB hard drives respectively (the original storage for Backrub). These were attached to the Sun Ultra II.
* IBM disk expansion box with another 8 x 9GB hard drives donated by IBM.
* Homemade disk box which contained 10 x 9GB SCSI hard drives.

Current hardware

Servers are commodity-class x86 PCs running customized versions of Linux. The goal is to purchase CPU generations that offer the best performance per unit of power, not the best absolute performance. Estimates of the power required for over 450,000 servers range upwards of 20 megawatts, which could cost on the order of US$2 million per month in electricity charges.


* Over 450,000 servers[1] ranging from a 533 MHz Intel Celeron to a dual 1.4 GHz Intel Pentium III (as of 2005)
* One or more 80GB hard disks per server (2003)
* 2–4 GiB of memory per machine (2004)
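As a rough consistency check on the power and cost estimates quoted in this section, the short calculation below derives the average draw per server and the electricity price those figures would imply. The 30-day month and the assumption of constant draw are mine; the three input numbers are the estimates quoted here, not measured values.

```python
def power_estimate(servers=450_000, total_power_mw=20, monthly_bill_usd=2_000_000,
                   hours_per_month=24 * 30):
    """Derive per-server draw and implied price from the quoted estimates."""
    # Average draw per server in watts (20 MW spread over all servers).
    watts_per_server = total_power_mw * 1_000_000 / servers
    # Energy used in a month at constant draw, in kilowatt-hours.
    energy_kwh = total_power_mw * 1_000 * hours_per_month
    # Electricity price that would produce the quoted monthly bill.
    implied_usd_per_kwh = monthly_bill_usd / energy_kwh
    return watts_per_server, energy_kwh, implied_usd_per_kwh


watts, kwh, rate = power_estimate()
print(f"{watts:.1f} W per server, {kwh:,} kWh per month, ${rate:.3f} per kWh")
```

Because the 20 megawatt figure is a lower bound, the per-server draw computed here is also a lower bound.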

The exact size and whereabouts of the data centers Google uses are unknown, and official figures remain intentionally vague. According to a 2000 estimate, Google's server farm consisted of 6000 processors and 12,000 common IDE disks (one processor and two disks per machine) at four sites: two in Silicon Valley, California and two in Virginia.[6] Each site had an OC-48 (2488 Mbit/s) internet connection and an OC-12 (622 Mbit/s) connection to other Google sites. The connections were eventually routed down to 4 x 1 Gbit/s lines connecting up to 64 racks, each rack holding 80 machines and two Ethernet switches.

Project 02

Google is currently developing a supercomputer at a data center located in The Dalles, Oregon, on the Columbia River, approximately 80 miles from Portland. The project, codenamed "Project 02", is expected to add substantially to Google's current global network, which is capable of processing billions of search queries per day and supports a growing repertoire of other services. The new complex is approximately the size of two football fields, with cooling towers four stories high.
For those who read about Google buying a dark fib

Server operation

Most operations are read-only. When an update is required, queries are redirected to other servers, so as to simplify consistency issues. Queries are divided into sub-queries, which may be sent to different servers in parallel, thus reducing latency.
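The latency benefit of parallel sub-queries can be illustrated with a thread pool. This is a sketch under stated assumptions: each shard is a plain dictionary, `query_shard` and `parallel_lookup` are hypothetical names, and the `time.sleep` call only simulates network and disk delay.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def query_shard(shard, word, delay=0.05):
    """Look up one word on one shard; the sleep simulates server latency."""
    time.sleep(delay)
    return shard.get(word, set())


def parallel_lookup(shards, word):
    """Send the same sub-query to every shard at once and merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = pool.map(lambda shard: query_shard(shard, word), shards)
        return set().union(*partials)


shards = [
    {"google": {1, 2}},
    {"google": {7}},
    {"other": {9}},
]
docids = parallel_lookup(shards, "google")
```

With the lookups issued in parallel, total waiting time is close to that of the slowest shard rather than the sum over all shards.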

To lessen the effects of unavoidable hardware failure, data stored in the servers may be mirrored using hardware RAID. Software is also designed to be fault tolerant. Thus when a system goes down, data is still available on other servers, which increases the reliability.
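The effect of keeping data on several servers can be shown with a small failover read. The `Replica` class, the `ReplicaDown` exception, and `read_with_failover` are hypothetical names invented for this sketch; it models only the software side of fault tolerance, not hardware RAID.

```python
class ReplicaDown(Exception):
    """Raised when a replica cannot serve a request."""


class Replica:
    def __init__(self, name, data, up=True):
        self.name = name
        self.data = data
        self.up = up

    def read(self, key):
        if not self.up:
            raise ReplicaDown(self.name)
        return self.data[key]


def read_with_failover(replicas, key):
    """Try each replica in turn and return the first successful read."""
    for replica in replicas:
        try:
            return replica.read(key)
        except ReplicaDown:
            continue  # this copy is unavailable, try the next one
    raise ReplicaDown("all replicas unavailable")


copies = [
    Replica("server-a", {"doc42": "cached page"}, up=False),
    Replica("server-b", {"doc42": "cached page"}),
]
value = read_with_failover(copies, "doc42")
```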

Google File System

Google File System is often considered the most powerful clustered file system. I sometimes see it more as a huge grid file system (by the way, there are a couple of open source projects that let you build your own grid applications and file systems).

The nature of Google's operation required the creation of what is called the Google File System. Its design assumptions include:
0) The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.
1) The system stores a modest number of large files. Google expects a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but the system need not optimize for them.
2) The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Files are often used as producer-consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently append to a file. Atomicity with minimal synchronization overhead is essential. The file may be read later, or a consumer may be reading through the file simultaneously.
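The concurrent-append requirement can be illustrated with a single-process stand-in. This is not how the Google File System is implemented, since the real system coordinates appends across many machines. The `AppendLog` class and `run_producers` helper are hypothetical, and a single lock stands in for whatever makes each append atomic.

```python
import threading


class AppendLog:
    """In-memory log where each append is atomic and returns its offset."""

    def __init__(self):
        self._records = []
        self._lock = threading.Lock()

    def record_append(self, data):
        # The log, not the caller, chooses the offset of the new record.
        with self._lock:
            offset = len(self._records)
            self._records.append(data)
            return offset

    def read_from(self, offset=0):
        # A consumer may call this while producers are still appending.
        with self._lock:
            return list(self._records[offset:])


def run_producers(log, n_producers, n_records):
    """Start several producer threads that append to the same log."""
    def produce(pid):
        for i in range(n_records):
            log.record_append(f"{pid}:{i}")

    threads = [threading.Thread(target=produce, args=(p,)) for p in range(n_producers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


log = AppendLog()
run_producers(log, n_producers=8, n_records=100)
records = log.read_from(0)
```

Records from different producers interleave in an arbitrary order, but no record is lost or torn, and each producer's own records keep their relative order.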
Reference:
You can find the Google File System white paper here.

  1. Carr, David F. "How Google Works." Baseline Magazine. July 6, 2006. Retrieved on July 10, 2006.
  2. "[1]." Invest Wallonia. April 27, 2007. Retrieved on May 10, 2007.
  3. "[2]." Council Bluffs. July 9, 2007. Retrieved on August 21, 2007.
  4. Barroso, Luiz André; Dean, Jeffrey; Hölzle, Urs. "Web Search for a Planet: The Google Cluster Architecture."
  5. "Google Stanford Hardware." Stanford University (provided by Internet Archive). Retrieved on July 10, 2006.
