Many readers will be familiar with the automated translation services offered on the Web (by Google, for example). If you have used them, you’ve probably observed, as I have, that they can be very helpful in at least getting the gist of a passage in an unfamiliar language. These translation services rely primarily on a statistical analysis of a large body of already-translated texts. (For example, many official documents of the European Union are published in official translations in multiple languages.) Although they usually do a satisfactory job, occasionally they are flummoxed. Many of you have probably also heard of some of the legendary problems with earlier, AI-based machine translation: one program is reported to have translated the English expression, “Out of sight, out of mind”, into the Russian equivalent of “Invisible insanity”.
A bulletin issued recently by the MIT News Office describes a new modeling technique for language translation developed by researchers at the Computer Science and Artificial Intelligence Laboratory (CSAIL). The modeling technique has been tested on a harder problem than translation between known languages: the decipherment of an unknown written language. For the test, of course, the subject language, Ugaritic, was one that had already been deciphered by human analysts; otherwise, there would have been no way to measure the quality of the result.
The technique that the researchers developed, and their tests of it, are described in a paper [PDF] to be presented at the Annual Meeting of the Association for Computational Linguistics, being held in Sweden next month. The model attempts to incorporate some of the paths to insight used by human analysts. First, it assumes that there is a known language that is, in some sense, close to the unknown language being analyzed. In the actual (human) decipherment of Ugaritic, the known similar language was Hebrew. The model then uses an iterative process to try to find similarities in three key areas:
- **Alphabet.** If the unknown language has a different alphabet, the model’s initial “intuition” is that each letter in the unknown language will map to at most a few letters in the known language’s alphabet.
- **Words.** The known and unknown languages will have a reasonable number of cognate words, like homme and hombre in French and Spanish, or Wasser and water in German and English.
- **Morphemes.** Morphemes are the smallest meaningful units in a language; some, like prefixes, cannot stand on their own. For example, the English word unwind has two morphemes, un- and wind. The model’s “intuition” here is that the known and unknown languages will have similar patterns of morphemes in similar words. (A toy sketch of all three intuitions follows this list.)
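To make these three intuitions concrete, here is a minimal, purely illustrative Python sketch. Nothing in it comes from the paper: the miniature alphabets, the word lists, and the `transliterate` helper are all invented, and a real system would learn its candidate mappings rather than have them written down by hand.

```python
# Alphabet: a sparse candidate mapping -- each unknown-script letter
# (left) is allowed to correspond to only a few known letters (right).
candidate_map = {
    "A": ["a"],
    "B": ["b", "p"],
    "C": ["k", "g"],
}

def transliterate(word, mapping):
    """Return every possible reading of an unknown-script word."""
    readings = [""]
    for letter in word:
        readings = [r + k for r in readings for k in mapping.get(letter, [])]
    return readings

# Words: a reading that lands on a word of the known language is
# evidence of a cognate, just as homme/hombre signal relatedness.
known_lexicon = {"bag", "pak"}
for reading in transliterate("BAC", candidate_map):
    if reading in known_lexicon:
        print("possible cognate:", reading)

# Morphemes: a mapping that also aligns recurring word pieces
# (prefixes, suffixes, stems -- like un- + wind) consistently across
# many words would score higher than one that matches whole words by luck.
```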
Starting from these assumptions, the model applies an iterative statistical technique to search for the decipherment scheme that exhibits the greatest internal consistency. The principal test was carried out on Ugaritic, treating it as an unknown language and using Hebrew as the similar, known language. The model managed, in a few hours, to make a pretty respectable start on solving the puzzle.
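The paper’s actual model is a richer Bayesian one, with the mapping inferred by sampling; as a rough flavor of what “iterating toward internal consistency” means, here is a toy hill-climbing sketch. Everything in it is invented for illustration: the three-letter alphabets, the corpus, the lexicon, and the crude consistency score, which simply counts how many decoded words turn out to be real words of the known language.

```python
import random

UNKNOWN = list("ABC")          # invented unknown-script alphabet
KNOWN = list("abc")            # invented known-language alphabet
corpus = ["AB", "BC", "CAB"]   # invented unknown-script "corpus"
lexicon = {"ab", "bc", "cab"}  # invented known-language word list

def decode(word, mapping):
    return "".join(mapping[ch] for ch in word)

def consistency(mapping):
    # Crude stand-in for the model's objective: how many decoded
    # words are actual words of the known language?
    return sum(decode(w, mapping) in lexicon for w in corpus)

random.seed(0)
# Start from a random one-to-one guess, then repeatedly propose
# swapping two letter assignments, keeping any swap that does not
# make the decodings less consistent.
mapping = dict(zip(UNKNOWN, random.sample(KNOWN, len(KNOWN))))
best = consistency(mapping)
for _ in range(200):
    a, b = random.sample(UNKNOWN, 2)
    mapping[a], mapping[b] = mapping[b], mapping[a]
    new = consistency(mapping)
    if new >= best:
        best = new
    else:
        mapping[a], mapping[b] = mapping[b], mapping[a]  # undo the swap

print(mapping, "score:", best)
```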
The Ugaritic alphabet has 30 letters, and the system correctly mapped 29 of them to their Hebrew counterparts. Roughly one-third of the words in Ugaritic have Hebrew cognates, and of those, the system correctly identified 60 percent.
The researchers reported that many of the system’s errors were off by only a letter or two, so that a human using the results could probably correct them fairly easily (by considering context, for example, which the system does not use).
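As a small aside on that last point, one can imagine a post-processing pass that snaps a near-miss decoding to the closest word of the known language. The sketch below uses Python’s standard difflib for string similarity; the words are invented examples, not data from the study, and a human analyst would of course lean on context rather than raw string distance.

```python
from difflib import get_close_matches

# Toy illustration only: snap a decoding that is off by a letter or
# two to the nearest known-language word.
lexicon = ["shalom", "melek", "bayit"]
decoded = "melev"                                # one letter off from "melek"
print(get_close_matches(decoded, lexicon, n=1))  # -> ['melek']
```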
Of course, there are other difficulties in learning to read unknown languages in the real world: poor quality manuscripts, transcription errors, and so on. But some of the insights gained through using this kind of modeling approach may prove helpful in making ordinary machine translation more robust and useful.