The Internet Archive, a non-profit organization dedicated to creating a digital archive and library of Internet content, has just celebrated its collection reaching 10 petabytes (10,000,000,000,000,000, or 1.0×10^16, bytes). The collection contains approximately 150 billion historical Web pages, as well as texts, images, audio, and video. The Internet Archive provides the Wayback Machine to allow retrieval of archived pages, as well as more general search tools.
The Internet Archive also announced the availability, for research purposes, of 80 terabytes (8.0×10^13 bytes) of archived Web crawl data from 2011. The data set characteristics are:
- Crawl start date: 09 March, 2011
- Crawl end date: 23 December, 2011
- Number of captures: 2,713,676,341
- Number of unique URLs: 2,273,840,159
- Number of hosts: 29,032,069
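To give a feel for the scale of these figures, here is a minimal Python sketch that derives a few rough ratios from them. The variable names are illustrative, and the average size per capture assumes that the quoted 80 terabytes are decimal terabytes and cover only the listed captures, which the announcement does not spell out.

```python
from datetime import date

# Figures copied from the data set characteristics listed above.
crawl_start = date(2011, 3, 9)
crawl_end = date(2011, 12, 23)
captures = 2_713_676_341
unique_urls = 2_273_840_159
hosts = 29_032_069

# Assumption: 80 TB means 80 * 10^12 bytes, matching the 8.0x10^13 figure.
total_bytes = 80 * 10**12

# Length of the crawl window in days (end date minus start date).
crawl_days = (crawl_end - crawl_start).days

# Rough averages; real distributions are likely to be heavily skewed.
captures_per_url = captures / unique_urls
urls_per_host = unique_urls / hosts
bytes_per_capture = total_bytes / captures

print(f"Crawl length:        {crawl_days} days")
print(f"Captures per URL:    {captures_per_url:.2f}")
print(f"Unique URLs per host: {urls_per_host:.1f}")
print(f"Bytes per capture:   {bytes_per_capture:,.0f}")
```

These are simple averages over the whole data set, so they say nothing about how captures are spread across individual hosts or URLs.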
Interested researchers can get in touch with the Archive to arrange access. In the words of the Archive's announcement:
“If you would like access to this set of crawl data, please contact us at info at archive dot org and let us know who you are and what you’re hoping to do with it. We may not be able to say ‘yes’ to all requests, since we’re just figuring out whether this is a good idea, but everyone will be considered.”
The San Francisco Chronicle recently had a front-page profile of the Internet Archive and its founder, Brewster Kahle.