Programming for the Cloud

I’ve written here before about some of the issues and problems raised by the development of multi-core processors, and the attempt to parallelize computations to take advantage of the additional hardware capability. Programming effectively for the “cloud computing” environment is a similar, but more difficult, problem. The degree of concurrency (that is, the number of available processors) is often not known in advance. Timing problems and potential race conditions can be tricky even on a single machine with multiple processors; on a “machine” composed of many distinct physical machines, connected by IP networks, relative timing is essentially impossible to predict.

According to an article at Technology Review, a research group at the University of California, Berkeley, is developing a new set of tools to make programming for the cloud easier. (The Technology Review article is part of the TR10 series, an annual list of the most important emerging technologies, as chosen by the editors.) Their starting point is the idea behind database programming languages, ranging from the venerable Structured Query Language [SQL] to more complex systems like Google’s MapReduce. These languages describe what is to be done with the data, not how it is to be processed. This abstracts the data manipulation from the details of how the data is stored and retrieved. It also provides a framework (for example, the relational algebra underlying SQL) in which the data problem can be analyzed.
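To make the “what, not how” point concrete, here is a small, self-contained Python script (a hypothetical toy of my own, not from the article) using the standard sqlite3 module. The SQL statement specifies only the desired result; the database engine decides how to scan, index, and aggregate to produce it.

    # A toy table of orders; the query below says nothing about loops,
    # access paths, or aggregation strategy -- the engine works those out.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [("alice", 10.0), ("bob", 4.5), ("alice", 7.25)])

    # Declarative: state the result wanted, not the procedure to get it.
    for customer, total in conn.execute(
            "SELECT customer, SUM(amount) FROM orders GROUP BY customer"):
        print(customer, total)

The same query would run unchanged if the table were indexed differently or stored on a different engine; that independence from physical layout is exactly the abstraction described above.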

(I can remember the introduction of the first relational database products that used SQL.  In the IBM mainframe world, they were DB2, for MVS systems, and SQL/DS, for VM systems.  Earlier database systems were notoriously difficult to program, because the application had to be aware of how the data was stored, indexed, and so on.)

The group at Berkeley, led by Joseph Hellerstein, proposes to extend this idea to incorporate the specification of temporal variation in the data:

The solution, Hellerstein explains, is to build into the language the notion that data can be dynamic, changing as it’s being processed. This sense of time enables a program to make provisions for data that might be arriving later – or never.

This is a very intriguing idea.  Just as SQL, for example, allows query optimization to be done “under the covers,” without the direct involvement of the application programmer, these tools might allow concurrency problems to be handled without adding unnecessary complexity to applications.
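Bloom builds the notion of time into the language itself, but the underlying intuition can be sketched in a few lines of ordinary Python (a toy sketch of mine, not the project’s code). If a program is expressed as a monotonic rule over a growing set of facts, the runtime can simply re-derive its results whenever data arrives, and the order of arrival – early, late, or scrambled – cannot change the final answer:

    import itertools

    def reachable(links):
        """Transitive closure of a set of (src, dst) link facts."""
        paths = set(links)
        while True:
            new = {(a, d) for (a, b) in paths for (c, d) in paths if b == c}
            if new <= paths:
                return paths              # fixed point: nothing new to derive
            paths |= new

    def run_incrementally(stream):
        """Feed facts in one at a time, re-deriving after each arrival."""
        facts, result = set(), set()
        for fact in stream:
            facts.add(fact)               # data "lands" at some arbitrary time
            result = reachable(facts)     # runtime re-derives the answer
        return result

    links = [("a", "b"), ("b", "c"), ("c", "d")]

    # Every possible arrival order -- including facts that show up "late" --
    # yields the same final answer, because the rule is monotonic over sets.
    answers = {frozenset(run_incrementally(order))
               for order in itertools.permutations(links)}
    assert len(answers) == 1
    print(sorted(run_incrementally(links)))

Data that never arrives simply leaves the result smaller; nothing blocks waiting for it. That order-insensitivity is what would let a compiler or runtime, rather than the application programmer, worry about coordination.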

The project, called BOOM (for Berkeley Orders Of Magnitude), is still in a relatively early stage, but the team hopes to have a version of its new, open-source declarative language (called, somewhat confusingly, Bloom) ready for release in late 2010.  They have already done some preliminary development with their new tools:

So far, Hellerstein’s group has used the Bloom language and its predecessors to quickly rebuild and add major features to popular cloud tools such as Hadoop, a platform used to manipulate very large amounts of data. By lowering the complexity barrier, these languages should increase the number of developers willing to tackle cloud programming, resulting in a wave of ideas for new types of powerful applications.

The project has a list of technical papers that are available for download, as well as a FAQ page, which talks about some of the work done with Hadoop.
