Some of you may have used Google’s Books search tool to identify books related to a particular phrase or set of keywords. It is part of a very ambitious project to, in effect, create a digital card catalogue for most of the world’s books. There are three main sources for the information:
- Books whose copyright has expired, and which are in the public domain. Google makes the complete text of these available.
- Books from a group of academic libraries around the world.
- Books included by arrangement with their authors or publishers.
In the case of copyrighted material, Google will show you basic bibliographic data and perhaps some excerpts from the text, and tell you where you can buy the book, or borrow it from a library.
This week, the New York Times has an article describing a fascinating new resource that Google has made available as an outgrowth of the Books project. It is a database of word usage, culled from approximately 5.2 million books that have been scanned and digitized by the project.
> The digital storehouse, which comprises words and short phrases as well as a year-by-year count of how often they appear, represents the first time a data set of this magnitude and searching tools are at the disposal of Ph.D.’s, middle school students and anyone else who likes to spend time in front of a small screen. It consists of the 500 billion words contained in books published between 1500 and 2008 in English, French, Spanish, German, Chinese and Russian.
In other words, the database is something like a concordance of word and phrase usage, as reflected in the sample of books used. The sample contains only about one-third of the books that Google has digitized; the subset was selected from those that have the best available “metadata” (that is, information like date and place of publication) and scan quality.
The database does not contain the actual texts of the included books (that would present copyright issues); instead, Google constructed what it calls n-grams: sequences of n words found together in the text. For example, ‘hamburger’ is a 1-gram, and ‘Jimmy Carter’ a 2-gram.
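To make the n-gram idea concrete, here is a minimal sketch (my own illustration in Python, not anything from Google’s pipeline) of counting every run of n consecutive words in a piece of text; splitting on whitespace is an assumption made purely to keep the example short:

```python
from collections import Counter

def extract_ngrams(text, n):
    """Split text into whitespace-separated tokens and count every
    run of n consecutive tokens (an n-gram)."""
    tokens = text.split()
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )

# Count 2-grams in a tiny piece of text.
sample = "Jimmy Carter met Jimmy Carter supporters"
print(extract_ngrams(sample, 2))
# Counter({'Jimmy Carter': 2, 'Carter met': 1, 'met Jimmy': 1, 'Carter supporters': 1})
```

The real database records such counts separately for each year of publication, which is what makes the trend plots described below possible.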
A group of researchers from Harvard, working with Google, has published an initial paper [abstract] in Science, which uses the data to try to identify cultural and linguistic trends, an endeavor the research team calls culturomics. (Science also has an overview article available online.) For example, one phenomenon examined was the transformation of irregular verb forms (such as ‘burnt’) into regular forms (such as ‘burned’); it is interesting that the pattern of this change was often different in the US and the UK (as Shaw said, two countries divided by a common language). The team also looked for cultural trends:
> With a click you can see that “women,” in comparison with “men,” is rarely mentioned until the early 1970s, when feminism gained a foothold. The lines eventually cross paths about 1986.
The team estimates that the data set contains information from about 4% of all the books ever published. That may not seem like much, but it’s a very respectable sample percentage for such a large underlying population.
One of the most intriguing aspects of this announcement is that Google is making the entire n-gram database available online, for viewing or downloading; there is also a Web-based query tool that you can use to explore questions of interest.
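For anyone tempted to download the raw files, here is a small sketch of how one might tally the year-by-year counts for a single n-gram. The tab-separated layout assumed in the comments (n-gram, year, occurrence count, …) is my reading of the published format and should be checked against Google’s own documentation; the file name in the usage line is purely hypothetical.

```python
import csv
from collections import defaultdict

def yearly_counts(path, target):
    """Sum occurrence counts per year for one n-gram in a downloaded file.

    Assumes tab-separated lines of the (assumed) form:
        ngram <TAB> year <TAB> match_count <TAB> ...
    Check the exact column layout against Google's documentation.
    """
    counts = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if row and row[0] == target:
                counts[int(row[1])] += int(row[2])
    return dict(sorted(counts.items()))

# Hypothetical usage: track "hamburger" year by year in one 1-gram file.
# print(yearly_counts("googlebooks-eng-1gram.txt", "hamburger"))
```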
As with any new idea — especially in the social sciences — there will be some controversy about what this data means, how representative it is, possible biases in the sample selection, and so on. Nonetheless, having it available is of great potential value; and it is something that just could not have been done, in any practical way, without the assistance of technology.
There are also articles on this development at Wired, Technology Review, New Scientist, and Ars Technica.
Update, Saturday, 18 December, 15:05 EST
The “Law & Disorder” blog at Ars Technica has an amusing post, using the Google n-gram database to chronicle “A History of computing flamewars”.
Update, Saturday, 18 December, 17:48 EST
The “Short, Sharp Science” blog at New Scientist has results from another initial exploration of the data, including some of its warts. Note to George W. Bush: sorry, you didn’t invent “misunderestimate”.