Language and Wikipedia

March 10, 2013

In addition to being, in my view, the finest news magazine published in English, The Economist has a number of interesting and highly literate blogs on its site, covering a wide range of topics.  I think I have occasionally mentioned the “Babbage” blog, which covers science and technology.  Another of my favorites is “Johnson”, named for Samuel Johnson, the writer of dictionaries and self-described “harmless drudge”; it covers the “use and abuse” of language around the world.

A recent post discusses the multi-lingual character of Wikipedia, the Internet encyclopedia that is just over twelve years old.  Most readers probably know that Wikipedia has articles in a number of languages, but might be surprised to learn that there are now official versions of Wikipedia in 285 languages.  There is of course a considerable amount of variation in the number of articles available in different languages.  It is no surprise that English has the most content, with 4,182,130 articles at present.  There are four other languages that have more than 1 million articles.  Three of these are not too surprising: German, French, and Italian.  But the other, Dutch, is a language that, as “Johnson” points out, has only about 20 million native speakers.  The post also points out that virtually every student in the Netherlands studies English; and I can confirm, from business and pleasure trips, that virtually everyone one meets, at least in cities, speaks excellent English.  Perhaps the number of articles in Dutch reflects the availability of a large group of potential translators.

The next group of languages, those which have more than 100,000 Wikipedia articles, presents an interesting assortment.  It includes some obvious candidates, “big” languages like Russian, Spanish, and Japanese; but it also has languages that I, at least, had never heard of, like Cebuano, a language spoken by about 20 million people (yes, about the same as Dutch) in the Philippines, with 273,316 articles.  The “made up” language, Esperanto, makes the cut with 177,002 articles.

There are further listings of languages with 10,000+, 1,000+, 100+, 10+, and 1+ Wikipedia articles.  When you get toward the bottom of the list, I’d wager that most of the entries will be unfamiliar to you, unless you are a professional linguist.  Some are local African languages, some are American Indian, and some come from Pacific islands, for example.  (One handy feature of the listing is that, if you click on the English name of the language, in the second column, you will get the Wikipedia page that describes that language.)

The listing also gives some other interesting statistics on the various Wikipedias, including the number of registered and active users, the number of administrators, and the number of edits.  It also includes a measure called “depth”, defined as:

Edits/Articles × Non-Articles/Articles × Stub-ratio

This gives a rough measure of how frequently articles are updated, and is one aspect of the articles’ quality.  Again, it is hardly surprising that English has the highest depth score, at 749; almost all other languages have scores less than half as large, although there are a few local languages (for example, Fijian, depth 451) that get relatively high depth scores despite having only 265 articles.  This probably reflects a small group of “hard core” enthusiasts.  Gothic also shows a high depth score at 394, with 431 articles, despite having no native speakers; the language was effectively extinct by about the ninth century AD.  (There is a considerable extant corpus of written material in Gothic.)
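
For readers who like to see the arithmetic, here is a minimal Python sketch of the depth formula.  It assumes that the last factor (written “Stub-ratio” above) is the share of non-article pages among all pages; that definition is not spelled out in the listing as quoted here, so treat it as an assumption.  The function name, parameter names, and input numbers are all made up for illustration and are not real Wikipedia statistics.

```python
def depth(edits: int, articles: int, total_pages: int) -> float:
    """Rough 'depth' score for one Wikipedia edition.

    Assumption (not stated in the post): the third factor is
    non-article pages divided by total pages.
    """
    if articles <= 0 or total_pages <= 0:
        return 0.0  # avoid dividing by zero for empty editions
    non_articles = total_pages - articles
    edits_per_article = edits / articles
    non_article_ratio = non_articles / articles
    third_factor = non_articles / total_pages  # assumed meaning of "Stub-ratio"
    return edits_per_article * non_article_ratio * third_factor


# Hypothetical edition: 500 articles, 2,000 pages in total, 10,000 edits.
print(depth(edits=10_000, articles=500, total_pages=2_000))
```

Under these assumptions, a small edition with many edits and many non-article pages (talk pages, user pages, and so on) per article gets a high score, which would be consistent with the idea of a small group of very active enthusiasts.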

All of this is interesting to browse through, and speculate about; I imagine it could be a useful resource for students of linguistics.  The availability of articles in so many languages is a positive sign that the Internet is doing something useful to spread knowledge around the world.

Prof. Felten’s New Blog

April 30, 2012

In discussing technology policy and security issues here, I’ve frequently mentioned Professor Ed Felten of Princeton, director of the University’s Center for Information Technology Policy [CITP], who is serving a term as the Chief Technologist of the US Federal Trade Commission [FTC].  I’ve just discovered that, in his new capacity, he has recently started a blog, Tech@FTC; he describes the goal this way:

Our goal is to talk about technology in a way that is sophisticated enough to be interesting to hard-core techies, but straightforward enough to be accessible to the broad public that knows something about technology but doesn’t qualify as expert.  Every post will have an identified author–usually me–who will speak to you in the first person.  We’ll aim for a conversational, common-sense tone–and if we fall short, I’m sure you’ll let us know in the comments.

I have not yet had a chance to read all the posts that are there, even though there are not that many yet, but I am sure that they will be worth reading.  I’ll mention two recent posts that I have read.  The first explains why “hashing” data, such as Social Security numbers, does not make the data anonymous.  The second discusses why pseudonyms aren’t anonymous, either.  (I’ve previously written a couple of times about the difficulty of “anonymizing” data.)
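
The point about hashing can be illustrated with a small Python sketch.  It is not taken from the Tech@FTC post; it simply assumes an unsalted SHA-256 hash and a made-up five-digit identifier (both are my choices for illustration), and shows that when the set of possible inputs is small, anyone can hash every candidate and look for a match.

```python
import hashlib
from typing import Optional


def sha256_hex(text: str) -> str:
    """Return the hex SHA-256 digest of an ASCII string (unsalted)."""
    return hashlib.sha256(text.encode("ascii")).hexdigest()


def recover(target_hash: str, digits: int) -> Optional[str]:
    """Try every zero-padded number with the given digit count.

    Returns the matching input, or None if no candidate hashes to
    target_hash.
    """
    for n in range(10 ** digits):
        candidate = f"{n:0{digits}d}"
        if sha256_hex(candidate) == target_hash:
            return candidate
    return None


# Hypothetical identifier; a real Social Security number has nine digits.
secret = "04217"
published = sha256_hex(secret)      # what a "hashed" data set would contain
print(recover(published, digits=5))  # the original value comes back
```

Five digits keeps the demonstration quick.  A nine-digit number has only a billion possible values, which is still a small enough space to enumerate in the same way.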

I’m looking forward to reading the rest of what’s there, and to Prof. Felten’s future posts.  At the time his appointment to the FTC post was announced, I was pleased that someone so well-qualified had been chosen.  Reading the new blog reinforces that feeling.
