Papers at Google Labs
Here are some interesting papers spewed out by Google Labs. A few abstracts are below.
The Google File System
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
...
The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.
Web Search for a Planet: The Google Cluster Architecture
Amenable to extensive parallelization, Google's Web search application lets different queries run on different processors and, by partitioning the overall index, also lets a single query use multiple processors. To handle this workload, Google's architecture features clusters of more than 15,000 commodity-class PCs with fault-tolerant software. This architecture achieves superior performance at a fraction of the cost of a system built from fewer, but more expensive, high-end servers.
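To make the abstract's two kinds of parallelism a bit more concrete, here's a minimal sketch in Python (entirely my own illustration with made-up shard data and function names, not Google's code): a single query is fanned out across index partitions and the partial results are merged, while separate queries could just as well go to separate replicas.

```python
# Illustrative only: fan one query out across index partitions ("shards")
# in parallel, then merge and rank the partial results.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical inverted-index shards: each maps a term to (doc_id, score) pairs.
INDEX_SHARDS = [
    {"google": [(1, 0.9), (4, 0.4)]},
    {"google": [(7, 0.8)], "labs": [(7, 0.6)]},
    {"labs": [(2, 0.7)]},
]

def search_shard(shard, term):
    """Look up one term in a single index partition."""
    return shard.get(term, [])

def search(term, top_k=10):
    """Send the query to every shard in parallel, then merge by score."""
    with ThreadPoolExecutor(max_workers=len(INDEX_SHARDS)) as pool:
        partials = pool.map(lambda shard: search_shard(shard, term), INDEX_SHARDS)
    hits = [hit for partial in partials for hit in partial]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)[:top_k]

if __name__ == "__main__":
    print(search("google"))  # [(1, 0.9), (7, 0.8), (4, 0.4)]
```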
Who Links to Whom: Mining Linkage between Web Sites
Previous studies of the web graph structure have focused on the graph structure at the level of individual pages. In actuality the web is a hierarchically nested graph, with domains, hosts and web sites introducing intermediate levels of affiliation and administrative control. To better understand the growth of the web we need to understand its macro-structure, in terms of the linkage between web sites. In this paper we approximate this by studying the graph of the linkage between hosts on the web. This was done based on snapshots of the web taken by Google in Oct 1999, Aug 2000 and Jun 2001. The connectivity between hosts is represented by a directed graph, with hosts as nodes and weighted edges representing the count of hyperlinks between pages on the corresponding hosts. We demonstrate how such a "hostgraph" can be used to study connectivity properties of hosts and domains over time, and discuss a modified "copy model" to explain observed link weight distributions as a function of subgraph size. We discuss changes in the web over time in the size and connectivity of web sites and country domains. We also describe a data mining application of the hostgraph: a related host finding algorithm which achieves a precision of 0.65 at rank 3.
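As a rough illustration of the "hostgraph" idea (again just a sketch of mine, not the paper's code, with invented example URLs), collapsing page-level hyperlinks into a host-level directed graph with link-count edge weights might look like this:

```python
# Illustrative only: build a host-level graph from page-level links,
# with edge weights counting hyperlinks between pages on the two hosts.
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical page-level links: (source page URL, target page URL).
page_links = [
    ("http://news.example.com/a", "http://www.example.org/x"),
    ("http://news.example.com/b", "http://www.example.org/y"),
    ("http://www.example.org/x", "http://news.example.com/a"),
]

hostgraph = defaultdict(int)  # (source host, target host) -> link count
for src, dst in page_links:
    src_host, dst_host = urlparse(src).netloc, urlparse(dst).netloc
    if src_host != dst_host:  # keep only cross-host edges
        hostgraph[(src_host, dst_host)] += 1

print(dict(hostgraph))
# {('news.example.com', 'www.example.org'): 2,
#  ('www.example.org', 'news.example.com'): 1}
```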
More research publications here. [Via Google Blogoscoped.]