Automatic Language Decipherment

July 4, 2010

Many readers will be familiar with the automated translation services offered on the Web (by Google, for example).  If you have used them, you’ve probably observed, as I have, that they can be very helpful in at least getting the gist of a passage in an unfamiliar language.  These translation services rely primarily on a statistical analysis of a large body of already-translated texts.  (For example, many official documents of the European Union are published in official translations in multiple languages.)   Although they usually do a satisfactory job, occasionally they are flummoxed.  Many of you have probably also heard of some of the legendary problems with earlier, AI-based machine translation: one program is reported to have translated the English expression, “Out of sight, out of mind”, into the Russian equivalent of “Invisible insanity”.

A bulletin issued recently by the MIT News Office describes a new modeling technique for language translation developed by researchers at the Computer Science and Artificial Intelligence Lab.  The technique has been tested on a harder problem than translation between known languages: the decipherment of an unknown written language.  For the test, of course, the subject language, Ugaritic, was one that had already been deciphered by human analysts; otherwise, there would be no way to measure the quality of the result.

The technique that the researchers developed, and their tests of it, are described in a paper [PDF] to be presented at the Annual Meeting of the Association for Computational Linguistics, being held in Sweden next month.  The model attempts to incorporate some of the paths to insight used by human analysts.  First, it assumes that there is a known language that is, in some sense, close to the unknown language being analyzed.  In the actual (human) decipherment of Ugaritic, the known similar language was Hebrew.  The model then uses an iterative process to try to find similarities in three key areas:

  • Alphabet: If the unknown language has a different alphabet, the model’s initial “intuition” is that each letter in the unknown language will map to at most a few letters in the known language’s alphabet.
  • Words: The known and unknown languages will share a reasonable number of cognate words, like homme and hombre in French and Spanish, or Wasser and water in German and English.  (A toy illustration of cognate scoring follows this list.)
  • Morphemes: Morphemes are the smallest units of meaning in a language.  For example, the English word unwind has two morphemes, un- and wind.  The model’s “intuition” here is that the known and unknown languages will have similar patterns of morphemes in similar words.
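
To make the cognate idea concrete, here is a toy Python sketch that scores a candidate word pair by normalized edit distance.  This is only an illustration of the intuition, under my own assumptions; the researchers’ actual model is a probabilistic one that learns character-level correspondences, and the function names here are invented for the example.

    def edit_distance(a, b):
        """Classic Levenshtein distance via dynamic programming."""
        m, n = len(a), len(b)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[m][n]

    def cognate_score(w1, w2):
        """1.0 for identical words, falling toward 0.0 as they diverge."""
        return 1.0 - edit_distance(w1, w2) / max(len(w1), len(w2))

    print(cognate_score("wasser", "water"))  # ~0.67: plausible cognates
    print(cognate_score("wasser", "fire"))   # ~0.17: unlikely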

Using these assumptions as a starting point, the model applies an iterative statistical technique to search for a decipherment scheme that exhibits the greatest internal consistency.  The principal test was carried out on Ugaritic, treating it as an unknown language and using Hebrew as the similar, known language.  The model managed, in a few hours, to make a pretty respectable start on solving the puzzle.
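
The sketch below conveys the flavor of that iterative search, though not the actual algorithm: it treats decipherment as a toy substitution-cipher problem and hill-climbs on a crude consistency score (how many decoded words appear in the known language’s lexicon).  The real model learns a probabilistic letter mapping and scores morpheme structure as well; everything here is an assumption-laden simplification.

    import random

    def decode(word, mapping):
        return "".join(mapping[c] for c in word)

    def consistency(corpus, mapping, lexicon):
        """How many unknown-language words decode into known words?"""
        return sum(decode(w, mapping) in lexicon for w in corpus)

    def decipher(corpus, unknown_alphabet, known_alphabet, lexicon, iters=20000):
        # Start from an arbitrary one-to-one guess, then repeatedly propose
        # swapping two letter assignments, keeping a swap only if the
        # consistency score does not get worse.
        letters = list(known_alphabet)
        random.shuffle(letters)
        mapping = dict(zip(unknown_alphabet, letters))
        best = consistency(corpus, mapping, lexicon)
        for _ in range(iters):
            a, b = random.sample(list(unknown_alphabet), 2)
            mapping[a], mapping[b] = mapping[b], mapping[a]
            score = consistency(corpus, mapping, lexicon)
            if score >= best:
                best = score
            else:
                mapping[a], mapping[b] = mapping[b], mapping[a]  # undo the swap
        return mapping, best

    # Tiny demo: the "unknown" text is English under a substitution cipher.
    lexicon = {"the", "cat", "sat", "mat", "hat"}
    cipher = str.maketrans("thecasm", "qxzpmrv")
    corpus = [w.translate(cipher) for w in ["the", "cat", "sat", "mat"]]
    mapping, score = decipher(corpus, "qxzpmrv", "thecasm", lexicon)
    print(score, {k: mapping[k] for k in sorted(mapping)})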

The Ugaritic alphabet has 30 letters, and the system correctly mapped 29 of them to their Hebrew counterparts. Roughly one-third of the words in Ugaritic have Hebrew cognates, and of those, the system correctly identified 60 percent.

The researchers reported that many of the system’s errors were small ones (off by only one or two letters), so a human analyst using the results could probably correct them fairly easily (by considering context, for example, which the system does not use).

Of course, there are other difficulties in learning to read unknown languages in the real world: poor-quality manuscripts, transcription errors, and so on.  But some of the insights gained through this kind of modeling approach may prove helpful in making ordinary machine translation more robust and useful.


Faster Random Numbers

July 4, 2010

I’ve written here before about the importance of random numbers in applications ranging from simulation experiments to cryptography.  One of the basic problems in this field is that, although some physical processes (such as radioactive decay, or thermal noise in electronic components) can produce a stream of numbers that, at least as far as anyone knows, are truly random, using those processes tends to be cumbersome or slow.  Algorithmic methods for generating “pseudo-random” numbers can be quite fast, and can produce number streams that pass statistical tests for randomness, but they are fundamentally deterministic.
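
That determinism is easy to see in a small example.  The generator below is a 32-bit xorshift (one of George Marsaglia’s designs); its output looks statistically random, yet re-seeding it reproduces the identical stream.

    def xorshift32(seed):
        """Marsaglia's 32-bit xorshift pseudo-random generator."""
        state = seed & 0xFFFFFFFF or 1  # the state must never be zero
        while True:
            state ^= (state << 13) & 0xFFFFFFFF
            state ^= state >> 17
            state ^= (state << 5) & 0xFFFFFFFF
            yield state

    gen = xorshift32(2010)
    print([next(gen) for _ in range(3)])

    gen = xorshift32(2010)                # same seed...
    print([next(gen) for _ in range(3)])  # ...same "random" numbers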

Now, according to an article in Technology Review, Intel has developed a technique for building what it claims is a true random number generator on a chip.  Essentially, it combines a bistable memory element (which can settle into either a 0 or a 1 state) with circuitry that allows the value to be “flipped” by thermal noise from the surrounding circuitry.  (This sounds somewhat similar to the approach I wrote about in my earlier post, referenced above, although neither article has enough specifics to be sure.)  The Intel design allows the random number generator to be integrated onto the processor chip, making some side-channel attacks much more difficult.  Intel claims that the circuit, which has a 45-nanometer minimum feature size, can produce a random bit stream at 2.4 Gbps, and that the resulting stream passes the NIST benchmarks for true “randomness”.
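
Here is a software caricature of the idea (emphatically not Intel’s circuit, whose details neither article describes): a bistable element sits at its unstable balance point, simulated thermal noise tips it into one state or the other, and the resulting bits are checked with the NIST frequency (monobit) test.  Since the noise here comes from a pseudo-random generator, the output is not truly random; the sketch only illustrates the resolve-and-test loop.

    import math
    import random

    def noisy_latch_bit():
        # The latch sits at an unstable equilibrium; Gaussian "thermal
        # noise" decides which stable state (0 or 1) it falls into.
        return 1 if random.gauss(0.0, 1.0) > 0.0 else 0

    def monobit_p_value(bits):
        """NIST SP 800-22 frequency (monobit) test.

        Large p-values are consistent with an unbiased source."""
        s = abs(sum(2 * b - 1 for b in bits)) / math.sqrt(len(bits))
        return math.erfc(s / math.sqrt(2))

    bits = [noisy_latch_bit() for _ in range(100_000)]
    print(f"fraction of ones: {sum(bits) / len(bits):.4f}")
    print(f"monobit p-value:  {monobit_p_value(bits):.3f}")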

It is certainly too soon to say that the problem of generating random numbers has been solved; there are several previous examples of promising schemes that have turned out to have vulnerabilities.  But it is good to see that work is progressing on this front, because a good solution is of significant importance in so many areas.


Happy Birthday, Rube Goldberg

July 4, 2010

Today is Independence Day in the United States; it is also the anniversary of the birth, in 1883, of the cartoonist Rube Goldberg.  Although he was a newspaper political cartoonist, and won a Pulitzer Prize for that work, he is most widely remembered for his cartoons of “Rube Goldberg machines” that used a convoluted, chain-reaction process to carry out a simple task.  In honor of Goldberg’s birthday, and of Independence Day, Google has a Rube Goldberg version of its logo on the main search page.  Click on the “Stars and Stripes” arrow at the left to see how the skyrocket is launched.

