Language and Wikipedia

March 10, 2013

In addition to being, in my view, the finest news magazine published in English, The Economist has a number of interesting and highly literate blogs on its site, covering a wide range of topics.  I think I have occasionally mentioned the “Babbage” blog, which covers science and  technology.  Another of my favorites is “Johnson”, named for the writer of dictionaries, a harmless drudge, Samuel Johnson; it covers the “use and abuse” of language around the world.

A recent post discusses the multi-lingual character of Wikipedia, the Internet encyclopedia that is just over twelve years old.  Most readers probably know that Wikipedia has articles in a number of languages, but might be surprised to learn that there are now official versions of Wikipedia in 285 languages.  There is of course a considerable amount of variation in the number of articles available in different languages.  It is no surprise that English has the most content, with 4,182,130 articles at present.  There are four other languages that have more than 1 million articles.  Three of these are not too surprising: German, French, and Italian.  But the other, Dutch, is a language that, as “Johnson” points out, has only about 20 million native speakers.  The post also points out that virtually every student in The Netherlands studies English; and I can confirm, from business and pleasure trips, that virtually everyone one meets, at least in cities, speaks excellent English.  Perhaps the number of articles in Dutch reflects the availability of a large group of potential translators.

The next group of languages, those which have more than 100,000 Wikipedia articles, presents an interesting assortment.  It includes some obvious candidates, “big” languages like Russian, Spanish, and Japanese; but it also has languages that I, at least, had never heard of, like Cebuano, a language spoken by about 20 million people (yes, about the same as Dutch) in the Philippines, with 273,316 articles.  The “made up” language, Esperanto, makes the cut with 177,002 articles.

There are further listings of languages with 10,000+, 1,000+, 100+, 10+, and 1+ Wikipedia articles.  When you get toward the bottom of the list, I’d wager that most of the entries will be unfamiliar to you, unless you are a professional linguist.  Some are local African languages, some are American Indian, and some come from Pacific islands, for example.  (One handy feature of the listing is that, if you click on the English name of the language, in the second column, you will get the Wikipedia page that describes that language.)

The listing also gives some other interesting statistics on the various Wikipedias, including the number of registered and active users, the number of administrators, and the number of edits.  It also includes a measure called “depth”, defined as:

Edits/Articles × Non-Articles/Articles × Stub-ratio

This gives a rough measure of how frequently articles are updated, and is one aspect of the articles’ quality.  Again, it is hardly surprising that English has the highest depth score, at 749; almost all other languages have scores less than half as large, although there are a few local languages (for example, Fijian, depth 451) that get relatively high depth scores despite having only 265 articles.  This probably reflects a small group of “hard core” enthusiasts.  Gothic also shows a high depth score at 394, with 431 articles, despite having no native speakers; the language was effectively extinct by about the ninth century AD.  (There is a considerable extant corpus of written material in Gothic.)
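To make the arithmetic of the “depth” measure concrete, here is a minimal sketch in Python that simply multiplies the three factors as quoted above.  The function name and the sample figures are my own inventions for illustration, not real Wikipedia statistics; and since the listing's definition of “Stub-ratio” is not reproduced here, the sketch takes it as an input rather than guessing how it is derived.

```python
def wikipedia_depth(edits, articles, non_articles, stub_ratio):
    """Depth as quoted above: Edits/Articles x Non-Articles/Articles x Stub-ratio.

    stub_ratio is passed in directly, because its definition is an
    assumption left to the caller.
    """
    if articles <= 0:
        raise ValueError("articles must be positive")
    edits_per_article = edits / articles
    non_articles_per_article = non_articles / articles
    return edits_per_article * non_articles_per_article * stub_ratio


# Made-up numbers, chosen only to keep the arithmetic easy to follow:
# (1000 / 100) * (200 / 100) * 0.5
example_depth = wikipedia_depth(edits=1000, articles=100,
                                non_articles=200, stub_ratio=0.5)
print(example_depth)
```

Because the number of articles appears in both denominators, a very small Wikipedia with a handful of busy editors can score surprisingly high, which fits the Fijian and Gothic examples.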

All of this is interesting to browse through, and speculate about; I imagine it could be a useful resource for students of linguistics.  The availability of articles in so many languages is a positive sign that the Internet is doing something useful to spread knowledge around the world.


The Best Tweeting Languages

April 1, 2012

This week’s issue of The Economist has an amusing article on the suitability of different languages for short message services, like the micro-blogging site Twitter.   Twitter limits “tweets” to 140 characters, but the amount of information that can be incorporated varies significantly with the language used.

This 78-character tweet in English would be only 24 characters long in Chinese.

This is largely due, of course, to the use of logograms in written Chinese, rather than an (approximately) phonetic alphabet.   Japanese, which uses the Chinese characters (called Kanji) as a part of its writing system, is also quite succinct.   Arabic works well, too, because vowels are customarily omitted in the written language.  And, as the illustration accompanying the article suggests, hieroglyphics might have worked a treat.  European languages, especially Romance languages, with their many inflected forms, tend to produce long messages by comparison.

A chart accompanying the article gives the average change in length when translating a 1000-character English message into various other languages.  The changes range from a reduction in length of more than 60% when translating to Chinese to an increase of about 40% when translating to Spanish.
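The arithmetic behind these percentages is simple enough to sketch in a few lines of Python.  The function names are my own, and the 60% and 40% figures are only the rounded values mentioned above, not exact numbers from the chart.

```python
TWEET_LIMIT = 140  # Twitter's character limit, as described above


def relative_change(original_chars, translated_chars):
    """Fractional change in length; negative means the translation is shorter."""
    return (translated_chars - original_chars) / original_chars


def translated_length(english_chars, change):
    """Approximate length after applying a fractional change in length."""
    return round(english_chars * (1 + change))


# The 78-character English example that becomes 24 characters in Chinese:
caption_change = relative_change(78, 24)

# A 1000-character English message, using the rounded figures from the text:
chinese_chars = translated_length(1000, -0.60)
spanish_chars = translated_length(1000, 0.40)

print(f"caption example: {caption_change:.0%}")
print(f"Chinese: about {chinese_chars} characters; Spanish: about {spanish_chars}")
```

The caption example works out to a reduction of roughly 69%, which is consistent with the chart's figure of more than 60% for Chinese.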

Of course, people do not generally use formal language for their text messages or Twitter, and informal English more than holds its own.  As with the language itself, we use a hodge-podge of abbreviations, homophones (‘4’ in place of ‘for’, ‘U’ for ‘you’), and other shortcuts; and we muddle along quite nicely.


Positively Speaking

September 2, 2011

The use of language has always been held up as one of the things that separates Homo sapiens from other animals.  Although we now know that using sound to communicate is not a uniquely human activity, the complexity and expressive range of human language are definitely unusual.  What prompted the development of language is not clear.  In part, language is a mechanism for conveying and organizing information: here is where the big animals are, and here is how we can hunt them.  But language also serves a social function, in a very social species; humans seek to understand and explain the world by telling stories, and people everywhere are inveterate gossips and chatterboxes.

Wired has an article on the “Wired Science” blog that describes some interesting new research that may shed a bit of light on this second, social function of language. Researchers from the University of Vermont and Cornell University attempted to measure the emotional content of language; from the abstract [full PDF available]:

Within the last million years, human language has emerged and evolved as a fundamental instrument of social communication and semiotic representation. People use language in part to convey emotional information, leading to the central and contingent questions: (1) What is the emotional spectrum of natural language? and (2) Are natural languages neutrally, positively, or negatively biased?

There have been past attempts to answer these questions via psychology experiments, with somewhat mixed results.  The new work is interesting because it was conducted by mathematicians, led by the University of Vermont’s Isabel Kloumann, who took a different, statistical approach.  They assembled four large bodies of English text, taken from different sources:

  • 3.29 million Google Books, containing 361 billion words
  • 821 million tweets, from 2008 through 2010, containing 9 billion words
  • 1.8 million New York Times articles, from 1987 to 2007, containing 1 billion words
  • Lyrics from 295,000 popular songs, containing 58.6 million words

The team compiled a list of the 5,000 most common words in each corpus, and then combined these lists to get a final list of 10,122 common words.  Then, for each word, they got ratings from 50 different people (using Amazon’s “Mechanical Turk” service), rating the words on an emotional content scale, ranging from 1 (extremely negative) to 9 (extremely positive).  In all four samples, words with positive emotional connotations significantly outnumbered words with negative connotations; furthermore, the positive words were more frequently used.
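The procedure is easy to sketch in code.  What follows is a toy illustration in Python, not the researchers' actual software: the function names, the tiny word counts, and the ratings are all invented, and I have used the top 3 words per corpus and three raters per word in place of the study's 5,000 and 50.

```python
from collections import Counter
from statistics import mean


def combined_word_list(corpora, n):
    """Union of the n most frequent words from each corpus."""
    combined = set()
    for counts in corpora.values():
        combined |= {word for word, _ in counts.most_common(n)}
    return combined


def average_ratings(ratings):
    """Mean rating per word on the 1 (negative) to 9 (positive) scale."""
    return {word: mean(scores) for word, scores in ratings.items()}


def weighted_average(counts, averages):
    """Average rating of a corpus, weighting each rated word by its frequency."""
    rated = [word for word in counts if word in averages]
    total = sum(counts[word] for word in rated)
    return sum(counts[word] * averages[word] for word in rated) / total


# Invented word counts for two tiny "corpora".
corpora = {
    "tweets": Counter({"the": 10, "love": 5, "hate": 4, "happy": 3}),
    "lyrics": Counter({"the": 9, "love": 8, "baby": 4, "cry": 2}),
}
words = combined_word_list(corpora, n=3)

# Invented ratings from three hypothetical raters per word.
ratings = {"the": [5, 5, 5], "love": [9, 8, 9],
           "hate": [2, 1, 3], "baby": [7, 6, 8]}
averages = average_ratings(ratings)

positive = sorted(w for w in words if averages[w] > 5)
negative = sorted(w for w in words if averages[w] < 5)
tweet_score = weighted_average(corpora["tweets"], averages)

print("positive words:", positive)
print("negative words:", negative)
print("frequency-weighted score for tweets:", round(tweet_score, 2))
```

The two comparisons at the end mirror the two findings: first count how many words fall on each side of the neutral midpoint, then weight each word by how often it is actually used.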

The findings “suggest that a positivity bias is universal,” wrote Kloumann and colleagues. “In our stories and writings we tend toward pro-social communication.”

The implications of this bias are not entirely clear, and it remains to be seen whether the results are similar in other languages.  Still, our use of language certainly contains some clues to what we are thinking; as Yogi Berra reportedly said, sometimes “you can observe a lot by just watching” — or, in this case, listening.


Reality SATs?

March 20, 2011

Taking the SAT test is one of the traditional rites of spring for high school students.  It seems that, in a test administration given last weekend, the essay question on some students’ exams has generated some controversy.  (For readers of my approximate vintage who may not know, the SAT was changed in 2005 to include an essay section.  We didn’t have that; you are not having a middle-aged moment.)  It seems that the essay question, or “prompt”, was related to the recent rapid growth in popularity of “reality television” shows.

Many of the negative comments from parents about this question ran more or less along the lines of, “My kid is serious and works hard; (s)he doesn’t have time to watch reality TV.”    There were also similar comments from students.  The College Board, which administers the SAT, said that the prompt contained sufficient information to write a top-scoring essay, and did not require any detailed knowledge of reality TV shows.  Here is the actual question from the exam, as quoted in the Washington Post:

Reality television programs, which feature real people engaged in real activities rather than professional actors performing scripted scenes, are increasingly popular. These shows depict ordinary people competing in everything from singing and dancing to losing weight, or just living their everyday lives.

Most people believe that the reality these shows portray is authentic, but they are being misled. How authentic can these shows be when producers design challenges for the participants and then editors alter filmed scenes?

Do people benefit from forms of entertainment that show so-called reality, or are such forms of entertainment harmful?

In this particular case, I think the College Board has the better argument.  Recall that the purpose of the essay section is to evaluate the student’s ability to formulate an argument, and express it in writing.  I certainly don’t watch reality TV shows — I can’t stand them — but I think I could write an acceptable essay on the basis of this question.  (In fact, I did talk about them a little in the context of the Colorado “Balloon Boy” back in the fall of 2009.)  I also have a bit of difficulty believing that there are high school students leading such sheltered lives that they have never seen one of these shows.   What I find a little disturbing about some of the complaints is the underlying notion that the essay had to be mainly an exercise in regurgitating facts, rather than an expression of ideas.

Alexandra Petri has an amusing blog post at the Washington Post site, on how this will affect the obsessive parent.


The Evolution of Machine Translation

February 22, 2011

One of the potential applications of computers that people have always found intriguing is the automatic translation of natural (human) languages.  It is easy to understand the appeal of the idea; I certainly would love to be able to communicate easily with any person on Earth.  Yet, despite some serious efforts, for many years computer translations were mostly a joke.  There is a classic story, quite possibly apocryphal, of the program that translated the English phrase, “Out of sight, out of mind”, into the Russian equivalent of “Invisible insanity”; going the other direction, it translated the verse from the Bible, “The spirit is willing, but the flesh is weak”, to “The vodka is good, but the meat is rotten.”

Of course, to be fair to the machines, translation can be a tricky business.  We have all probably puzzled over the assembly instructions for something, as translated from the original Chinese, or have encountered some decidedly odd variants of English in various places.  I remember a sign in my room in a small German hotel, which requested, “Please not to smoke while being in bed.”  This was the translation of the perfectly straightforward, “Bitte nicht rauchen im Bett”, which is more or less literally, “Please do not smoke in bed”. Humans are also far from perfect at the translator’s job.

Today’s Washington Post has an interesting article on the evolution of machine translation.  Beginning in the early 1950s, the general approach to the translation problem was to build a rule-based system.  That is, the system “knew” about the rules of grammar, how to conjugate verbs, and so on.  (Of course any translation system must also have a comprehensive dictionary of some sort.)   The idea was that, knowing the rules of the source language and the target language, one could be reliably transformed into the other.   But it is fair to say that these systems never did much to threaten interpreters’ job security.

The problem is not that there are no rules, but that there are too many of them.  According to the work done by Chomsky and others, there is a certain amount of deep structure common to all languages.  But there are an enormous number of special rules, idiosyncratic to particular languages, that have to be taken into account.  (Recall when you were learning to spell.  Let’s see, it’s “I before E, except after C …” except when it isn’t.  Mark Twain once remarked that he would rather decline three free drinks than one German adjective.)

As the article points out, machine translation has improved considerably, mostly because newer efforts have taken a different approach.  Instead of trying to specify all the rules of a language, they approach translation as an exercise in statistical inference.  By examining a large body of parallel text in two or more languages, the system can learn common constructions and word usages in each language.  In a sense, the approach is like that used in trying to break an unknown cipher.

Warren Weaver, a mathematician at the Rockefeller Foundation, had first raised the idea of a statistical model for translation in a 1947 letter in which he wrote: “When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols.’ ”

(I wrote last summer about the use of a similar technique in the attempt to decipher unknown languages, like those of some ancient civilizations.)
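To give a flavor of how a system can learn word correspondences from nothing but parallel text, here is a minimal sketch in Python.  It is a stripped-down version of a textbook technique (an expectation-maximization loop in the spirit of the classic “IBM Model 1” word-alignment model), run on a three-sentence toy corpus of my own.  I am not claiming this is what Google or Yahoo actually run; the function names and the data are purely illustrative.

```python
from collections import defaultdict


def train_word_translations(pairs, iterations=20):
    """Estimate prob[(target_word, source_word)] from sentence pairs by EM."""
    source_vocab = {s for src, _ in pairs for s in src}
    target_vocab = {t for _, tgt in pairs for t in tgt}
    # Start with no knowledge: every target word is equally likely.
    prob = {(t, s): 1.0 / len(target_vocab)
            for s in source_vocab for t in target_vocab}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: share each target word among the source words in its
        # sentence, in proportion to the current probabilities.
        for src, tgt in pairs:
            for t in tgt:
                norm = sum(prob[(t, s)] for s in src)
                for s in src:
                    share = prob[(t, s)] / norm
                    count[(t, s)] += share
                    total[s] += share
        # M-step: turn the shared counts back into probabilities.
        for (t, s) in prob:
            prob[(t, s)] = count[(t, s)] / total[s]
    return prob


def best_translation(prob, source_word):
    """Most probable target word for a given source word."""
    candidates = {t: p for (t, s), p in prob.items() if s == source_word}
    return max(candidates, key=candidates.get)


# A toy English-German parallel corpus (invented for illustration).
corpus = [
    ("the house".split(), "das Haus".split()),
    ("the book".split(), "das Buch".split()),
    ("a book".split(), "ein Buch".split()),
]
model = train_word_translations(corpus)
for word in ["the", "house", "book", "a"]:
    print(word, "->", best_translation(model, word))
```

Note that the program is never given a dictionary or a rule of grammar.  It works out that “the” goes with “das” only because the two keep turning up together, and once that is settled, “house” has little left to pair with but “Haus”.  With millions of sentence pairs instead of three, the same idea scales up.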

The new, statistically-based techniques are the basis of Google’s translation service, and are also a significant part of Yahoo’s BabelFish service.   The quality of the results for European languages has also been helped as a side effect of the formation of the European Union.  Because all official documents must be translated and made available in all 23 official and working languages of the EU, and because governmental organizations produce documents as routinely as cattle produce cow-pats, there is a very large and steadily growing body of text to use as a source.  Having used some of the older systems, and the newer ones, I think it is fair to say that a significant improvement has been made.

I think, too, there’s an interesting parallel between the evolution of machine translation, and the evolution of “intelligent” systems — but that is a subject for a later post.