Last summer, in one of my earlier posts on IBM’s Watson project to build a computer system that could play Jeopardy!, I mentioned Wolfram|Alpha, another system designed to answer queries expressed in natural language. Yesterday, Stephen Wolfram, the designer of Wolfram|Alpha and of the Mathematica software, published a blog post on Watson, comparing and contrasting it with Wolfram|Alpha.
The most fundamental difference between the two systems is the sort of information that they process. Watson is fundamentally designed to work with unstructured text data, while Wolfram|Alpha uses a “curated” data base that attempts to represent knowledge directly.
As Wolfram writes: “The key point is that Wolfram|Alpha is not dealing with documents, or anything derived from them. Instead, it is dealing directly with raw, precise, computable knowledge. And what’s inside it is not statistical representations of text, but actual representations of knowledge.”
Whereas Watson starts with a large body of text, and tries to extract and classify information from it, a lot of the classification work for Wolfram|Alpha is done in the process of setting up the computable knowledge data base.
In Wolfram’s words: “In Wolfram|Alpha most of the work is just adding computable knowledge to the system. Curating data, hooking up real-time feeds, injecting domain-specific expertise, implementing computational algorithms—and building up our kind of generalized grammar that captures the natural language used for queries.”
As Wolfram points out, there are, generally speaking, two types of data stores in corporations and other organizations. The first is the traditional data base, which embeds knowledge of the data domain in its structure. (For example, think of the entity-relationship diagrams used in designing data bases.) The second type includes large amounts of unstructured information: things like memos, letters, product literature, images, and E-mails. In a very broad sense, the Watson project is really aimed at this second category of data.
As Wolfram puts it: “There are typically two general kinds of corporate data: structured (often numerical, and, in the future, increasingly acquired automatically) and unstructured (often textual or image-based). The IBM Jeopardy approach has to do with answering questions from unstructured textual data—with such potential applications as mining medical documents or patents, or doing e-discovery in litigation.”
What Wolfram is doing with Alpha is, in a sense, finding ways to support free-form, unstructured queries on the structured (or curated) data in the first category. In many ways, the two approaches are complementary. Wolfram suggests, for example, that Watson might pre-process text data to make it easier to structure for Wolfram|Alpha.
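To make the distinction between the two data categories concrete, here is a deliberately miniature sketch. The countries, figures, and the regex-based extraction step are all invented for illustration; neither system works this simply. The point is only the shape of the pipeline Wolfram suggests: a Watson-style text-mining step pulls facts out of unstructured prose, and once a fact lands in the structured store, answering becomes direct lookup and computation rather than text search.

```python
import re

# Category 1: structured, "curated" data -- knowledge represented directly,
# ready for computation (the Wolfram|Alpha style). Values are invented.
curated = {
    ("France", "population"): 67_000_000,
    ("France", "capital"): "Paris",
}

# Category 2: unstructured text, the kind of source a Watson-style
# system starts from.
documents = [
    "Germany has a population of about 83,000,000 people.",
    "The capital of Germany is Berlin.",
]

def extract_population(text):
    """Toy text-mining step: pull a (country, population) fact out of
    free text. A stand-in for the pre-processing Wolfram imagines a
    Watson-like system could feed into a curated store."""
    m = re.search(r"(\w+) has a population of about ([\d,]+)", text)
    if m:
        key = (m.group(1), "population")
        value = int(m.group(2).replace(",", ""))
        return key, value
    return None

# Pre-process the documents and add any extracted facts to the store.
for doc in documents:
    fact = extract_population(doc)
    if fact:
        key, value = fact
        curated[key] = value

# Answering from the structured store is now a lookup, not a text search.
print(curated[("Germany", "population")])  # 83000000
```

Of course, the hard part that this sketch waves away is exactly what each project spends its effort on: Watson on extracting and ranking candidate facts from messy text at scale, and Wolfram|Alpha on curating the store and parsing free-form queries against it.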
The whole post is worth a read; it gives a good overview of both technologies.