Statistical Twisters

May 22, 2013

During yesterday evening’s ABC World News program, which was largely taken up with coverage of the tornado disaster in and around Moore, OK, there was a segment on a 90+ year-old resident who had lost her house to a tornado for the second time.  (The first time was in May 1999, when a similarly strong twister hit Moore.)  There was then a statement, which caught my attention, that the odds against this happening were “100 trillion to 1”.

Now, those are pretty long odds.  One hundred trillion is 100 × 10¹²; by way of comparison, it is about twenty times the estimated age of the universe, since the Big Bang, measured in days.  If the odds are true, we are talking about a really rare phenomenon.

Thinking about the question this morning, I decided to double-check the report — perhaps I had just misunderstood the number that was being quoted.  I found a report on the ABC News site, which actually made the whole odds business more questionable:

A recent tornado probability study, published by Weather Decision Technologies, predicted the odds of an E-F4 or stronger tornado hitting a house at one in 10,000.

That same study put the odds of that same house getting hit twice at one in 100 trillion.

It is almost impossible to imagine how both these probability assessments could be correct, or even reasonable guesses.  If the odds against the house being hit once are one in 10,000 (probability 0.0001), then, if tornado hits are independent, the probability of a house being hit twice is (0.0001)², or odds of 1 in 100 million.  That would make the quoted odds (1 in 100 trillion) off by a factor of one million.  Of course, if tornado hits are not independent, then my calculation is inappropriate.  But for the numbers to work as quoted, the first hit would have to, in effect, provide truly enormous protection against a second hit.  (If the odds against the first hit are one in 10,000, then the conditional odds against a second hit would have to be about one in 10 billion to produce cumulative odds of one in 100 trillion.)
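For anyone who wants to check the arithmetic, here is the back-of-the-envelope version in a few lines of Python.  (The 1-in-10,000 figure is just the one quoted in the ABC report, the independence assumption is the naive one discussed above, and 13.8 billion years is the usual estimate of the age of the universe.)

p_one_hit = 1 / 10_000                    # quoted odds of an EF4+ tornado hitting a given house
p_two_hits = p_one_hit ** 2               # two hits, naively assuming independence
print(f"two hits: 1 in {1 / p_two_hits:,.0f}")               # 1 in 100,000,000
print(f"quoted odds too long by {1e14 * p_two_hits:,.0f}")   # a factor of 1,000,000
print(f"100 trillion is about {100e12 / (13.8e9 * 365.25):.0f} times the age of the universe, in days")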

Now, I don’t actually believe that tornado hits are independent.  Tornadoes certainly do not occur uniformly across the world, or even across the United States.  The NOAA Storm Prediction Center’s Tornado FAQ Site has a map highlighting “tornado alley”, the area where most significant tornadoes occur.  Although a tornado may, in principle, occur almost anywhere, you are considerably more likely to encounter one in Kansas or Oklahoma than you are in northern Maine or the upper peninsula of Michigan.

This question of independence is directly relevant to the news segment I mentioned at the beginning; it turns out that the unfortunate lady who has lost two houses built the second one on the same site as the first one, destroyed in 1999.  If the odds are affected at all by location (as they seem to be, at least “in the large”), then this was not, perhaps, the best possible choice.

I’ve griped before about the widespread ignorance of journalists and others when it comes to statistical information.  I have tried to find a copy of the  “Tornado Probability Study” mentioned in the quote above, so far without success.  I’ll keep trying, and report on anything I discover.  If I’m missing something, I’d like to know; if the probabilities are just made up, I’d like to know that, too.


Homomorphic Encryption Library Released

May 2, 2013

One of the issues that tends to come up when people consider the possible use of “cloud computing” is data security.  The data can be stored in an encrypted form, of course, but it generally has to be decrypted in order to be used; this means that the data is potentially vulnerable while it is in plain text (decrypted) form.

I’ve written before about the potential of homomorphic encryption to address this problem.  The basic idea is that we would like to discover an encryption function such that a defined set of operations can be performed on encrypted data (say, adding them) to produce an encrypted result that, when decrypted, is the same as that obtained by operating on the plain text data.

In other words, if we have two numbers, α and β, and suitable encryption and decryption functions E(x) and D(x), respectively, and if

α + β = S

Then, if

E(α) + E(β) = S*

it will be true that

D(S*) = S

So we are able to add the two encrypted values to get a sum that, when decrypted, is the sum of the original (unencrypted) numbers.
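Addition alone can already be done with older schemes; here is a toy Python sketch of the Paillier cryptosystem that exhibits exactly the property above.  (This is my own illustration, not the BGV scheme that HElib implements, and primes this small provide no security whatsoever.)  Multiplying two ciphertexts modulo n² plays the role of the “+” on encrypted values:

import math, random

def keygen(p=10007, q=10009):                    # toy primes, for illustration only
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    mu = pow(lam, -1, n)                          # valid because we take g = n + 1
    return n, (lam, mu, n)

def encrypt(n, m):
    while True:                                   # random r coprime to n
        r = random.randrange(2, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(priv, c):
    lam, mu, n = priv
    return ((pow(c, lam, n * n) - 1) // n * mu) % n

n, priv = keygen()
alpha, beta = 1234, 5678
s_star = (encrypt(n, alpha) * encrypt(n, beta)) % (n * n)   # “add” the two ciphertexts
print(decrypt(priv, s_star))                                 # 6912, i.e. alpha + beta

The point is simply that the sum is recovered even though only ciphertexts were ever combined.  The much harder problem, and the one the rest of this post is about, is supporting a richer set of operations (multiplication as well as addition) on encrypted data.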

This sounds almost like magic, but it has been proved to be theoretically possible, and there is a considerable amount of ongoing work to try to reach a practical implementation.  (For example, back in 2011, a group of researchers at MIT introduced CryptDB, database software that incorporates homomorphic encryption in a form suitable for some applications.)

Now, according to an article at the I Programmer site, researchers from IBM’s Thomas J. Watson Research Center have released an open-source library for homomorphic encryption, HElib (documentation and source code available at GitHub).

HElib is a software library that implements homomorphic encryption (HE). Currently available is an implementation of the Brakerski-Gentry-Vaikuntanathan (BGV) scheme, along with many optimizations to make homomorphic evaluation runs faster, focusing mostly on effective use of the Smart-Vercauteren ciphertext packing techniques and the Gentry-Halevi-Smart optimizations.

Although, as Bruce Schneier has observed, it will take a fair while for any of this technology to be scrutinized thoroughly enough by enough knowledgeable people to ensure that it doesn’t have serious flaws, getting algorithms and code out and available for inspection is an essential part of that process.

Update Thursday, May 2, 22:59 EDT

I have changed the headline/title of this post; originally, it was “IBM Releases Homomorphic Encryption Library”; that could be interpreted as describing an “official”  corporate action of IBM.  Since I have no knowledge, one way or another, about this sort of thing, I thought the new headline was less likely to lead to misunderstanding.


Bletchley Park Trust Joins Google Cultural Institute

March 25, 2013

I’ve written here previously about Bletchley Park, the home during World War II of the UK Government Code and Cypher School, also known as Station X.  The work of the cryptanalysts at Bletchley Park was responsible for breaking the German Enigma machine encryption on a large scale, as well as the more difficult Lorenz cipher, used by Hitler to communicate with his field commanders.  Some historians estimate that this work shortened the war in Europe by two or more years.  The site is now run by the Bletchley Park Trust, and also houses the UK National Museum of Computing.

A project to restore the Bletchley Park facility, along with some of its specialized equipment, was launched a couple of years ago.  I noted then that Google had taken an active role in supporting the project.

A recent post on the Official Google Blog describes some further developments in this relationship.  The Bletchley Park Trust has become a member of the Google Cultural Institute, which features an online gallery of exhibits dealing with (relatively) recent history.  The Bletchley Park exhibit has an overview of the work that was done at Station X.  It includes images of the Bombe machines that were used to break the Enigma cipher on a production basis, and of Colossus, the electronic computer used, along with the Tunny Machine, in breaking the Lorenz cipher.

The blog post also has an interesting short video presentation by Ms. Jean Valentine, one of the original Bombe operators.

In her role operating the Bombe, Jean directly helped to decipher messages encoded by Enigma. In this film Jean gives us a firsthand account of life at Bletchley Park during the war, and demonstrates how the Bombe worked using a replica machine now on show at the museum.

Much of this history remained a closely-guarded secret for many years after the end of WWII.  It’s fascinating to see how much truly creative work was done under very difficult conditions.


Happy Pi Day, 2013

March 14, 2013

Today, March 14, is one of the days that is sometimes celebrated as “Pi Day”, in honor of the best-known irrational and transcendental number, the ratio of the circumference of a circle to its diameter, usually written as the Greek letter π (pi).  The date, 3/14, is chosen because the approximate value of π is 3.14159265…   Legend has it that the number was named π because that is the first letter of the Greek word “περίμετρος”, meaning perimeter.

The New Scientist reports that this year, to observe Pi Day, Professor Marcus du Sautoy of the University of Oxford is sponsoring Pi Day Live, a project to “crowd source” the calculation of π (pi).  The value has, of course, already been calculated to trillions of decimal places; because it is an irrational number, it cannot be represented exactly by any finite decimal expansion.  (It is also transcendental, of course.)  Pi Day Live is suggesting some relatively easy methods of getting an approximate value for π, including Buffon’s Needle.  I mentioned Buffon’s Needle in a Pi Day post back in 2010.  The New Scientist headline calls it an “ancient” method, which I think is a bit over the top for something described in the 18th century.
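If you would rather let a computer do the needle-dropping, here is a quick Monte Carlo version in Python.  (This is a sketch of my own, not the Pi Day Live code; and yes, using math.pi to pick the random angle is a well-known wrinkle of simulating this on a machine.)

import math, random

def buffon_pi(n_drops=1_000_000, needle=1.0, spacing=1.0):
    # Drop a needle of length <= spacing onto a floor ruled with parallel lines.
    # The crossing probability is 2*needle/(pi*spacing), so pi ≈ 2*needle*N/(spacing*hits).
    hits = 0
    for _ in range(n_drops):
        centre = random.uniform(0.0, spacing / 2)      # distance from needle centre to nearest line
        theta = random.uniform(0.0, math.pi / 2)       # acute angle between needle and the lines
        if centre <= (needle / 2) * math.sin(theta):   # the needle crosses a line
            hits += 1
    return 2 * needle * n_drops / (spacing * hits)

print(buffon_pi())    # typically something like 3.14, give or take about 0.01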

That earlier Pi Day post also tells a related story, of the Indiana state legislature’s attempt to set the value of pi by law, one of the all-time great accomplishments of legislative lunacy.

Finally, spare a thought today for the 134th anniversary of the birth of Albert Einstein.

Update Thursday, 14 March, 15:35 EDT

I’ve just noticed that there is a rendering error (at least in Firefox) on the “Find Pi” page I linked above.  The equation for the estimated value of pi is a bit garbled (where it reads 2L\over xp); the correct equation (using the “Find Pi” variable names) is:

\pi = \dfrac {2 L}{x p}

I’ve dropped the site a note with the correction.


Open-Access Mathematics Journals

January 29, 2013

I have written here a number of times before about the movement toward providing open access to scholarly research.  I’ve noted the decisions by a number of different organizations, including Princeton University, the Royal Society, the JSTOR research archive, and the World Bank, to provide open access to some or all of their research publications.  There have been launch announcements from some new open-access journals, notably in particle physics and in the life sciences.

Now Nature is reporting, in a recent news article, that a new series of open-access journals in mathematics is being put together.  The plan is that these journals will have a peer review process similar to traditional print journals, but will post their articles on the arXiv pre-print site, hosted by the Cornell University Library.

The initiative, called the Episciences Project, hopes to show that researchers can organize the peer review and publication of their work at minimal cost, without involving commercial publishers.

“It’s a global vision of how the research community should work: we want to offer an alternative to traditional mathematics journals,” says Jean-Pierre Demailly, a mathematician at the University of Grenoble, France, who is a leader in the effort. Backed by funding from the French government, the initiative may launch as early as April, he says.

The “epijournals” would provide Web directories to the articles approved by their review processes, along with editorial reviews, and possibly forums for comments.   Readers might have to give up something; for example, reviewed articles might not follow formatting standards to the same extent as articles in traditional journals.   On the other hand, the general availability of articles would be substantially increased.

One of the supporters of this project is the Cambridge University mathematician Timothy Gowers, a recipient of the Fields Medal (often referred to as the “Nobel Prize of mathematics”).   He has a blog post that explains the idea of these “overlay journals” in more detail.

What is an arXiv overlay journal? It is just like an electronic journal, except that instead of a website with lots of carefully formatted articles, all you get is a list of links to preprints on the arXiv. The idea is that the parts of the publication process that academics do voluntarily — editing and refereeing — are just as they are for traditional journals, and we do without the parts that cost money, such as copy-editing and typesetting.

There was a time when the typesetting and copy editing function provided real economic value (although, of course, not necessarily what the publishers were charging for it).  Today, though, better technology (think MathML or LaTeX) allows authors to prepare publishable drafts with reasonable effort.

Mr. Gowers was also a prime mover in the Elsevier boycott movement, launched early in 2012.  He’s apparently done some “sounding out” regarding the possibilities in one of his areas of interest:

Apparently, the plan is for the whole thing to start this April. Because I have known about the project for some time, I have quietly sounded out a few people in additive combinatorics, and it seems that there is enough enthusiasm that we will be able to start an epijournal broadly in that area …

I’m glad to hear of this development, and I hope that the new journals will be a success.   As I’ve said, one of the most important potential benefits of the “Internet Age” is the wider availability of knowledge, particularly to a large chunk of humanity that would otherwise, for reasons of geography, politics, or economics, never have had a chance.


Detecting Election Fraud

October 3, 2012

As I’m sure readers know, this is a presidential election year in the United States.  (There are also elections for members of the House of Representatives and for some Senate seats.)   A side attraction in this year’s festivities is an ongoing political and legal tussle over the attempts, in some states, to impose more stringent identification requirements for those wishing to vote.  Proponents of these measures argue that they are necessary to prevent election fraud.   Actual evidence that this occurs, at least in the form (impersonation) that these measures would address, is in remarkably short supply.  Nonetheless, it does prompt the question: how can election fraud be detected?

The Proceedings of the National Academy of Sciences recently published a paper [abstract, full PDF download available], that takes an interesting new approach to this question.   The authors (Peter Klimek, Yuri Yegorov, Rudolf Hanel, and Stefan Thurner) write:

Democratic societies are built around the principle of free and fair elections, and that each citizen’s vote should count equally. National elections can be regarded as large-scale social experiments, where people are grouped into usually large numbers of electoral districts and vote according to their preferences. The large number of samples implies statistical consequences for the polling results, which can be used to identify election irregularities.

There have, of course, been previous studies that used statistical methods to try to uncover election fraud.  (I wrote about an analysis of Iranian election results, back in 2009, that used Benford’s Law, and similar techniques.)  The authors of the current paper argue that these generally have two drawbacks.

  • They can provide a strong suggestion of fraud; however, since there is no theory of how particular types of fraud (e.g., ballot box stuffing) should change the results, they are far from conclusive.
  • The results may vary depending on the degree to which the election data are aggregated (for example, the size of the voting precincts).

To address these concerns, they develop a parameterized model for the distribution of several election variables (e.g., voter turnout).  This allows them to predict the effect that ballot stuffing should have on the results.  In particular, they find that the distribution of results should have higher kurtosis† in a fraudulent election.   When tested against real-world election data, the model seems to work well across a range of aggregation levels.
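As a rough illustration of why kurtosis is a natural statistic to look at (this is my own toy simulation, not the authors’ parametric model): take a set of district turnouts that are approximately normal, push a small fraction of districts toward near-complete turnout, and the excess kurtosis of the turnout distribution jumps.

import random

def excess_kurtosis(xs):
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3.0                     # roughly 0 for a normal distribution

random.seed(1)
clean = [min(max(random.gauss(0.60, 0.05), 0.0), 1.0) for _ in range(5000)]
# "stuff" ballots in about 2% of districts, pushing their turnout toward 100%
stuffed = [random.uniform(0.95, 1.0) if random.random() < 0.02 else t for t in clean]

print(round(excess_kurtosis(clean), 2))           # close to 0
print(round(excess_kurtosis(stuffed), 2))         # noticeably positive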

This is a fascinating area of research.  The success of techniques of this kind (which have also been used to help spot financial fraud) depends, at least in part, on people’s lack of intuition about seemingly random phenomena, and on their general inability to construct a convincing fake.

____

Kurtosis is a statistical measure of the shape of a distribution, in particular the degree to which it is “peaked” around the mean.  (The word comes from the Greek κυρτός, meaning “bulging”.)  The most common measure, based on the fourth moment of the distribution, is “excess kurtosis” relative to the normal (Gaussian) distribution.  A distribution with positive excess kurtosis has a narrower, more acute central “hump” and fatter tails than the normal distribution, and is called leptokurtic; the Poisson distribution is an example.  A distribution with negative excess kurtosis, called platykurtic, has a wider, less pronounced hump, and thinner tails; the uniform distribution is an example.