Happy Birthday, WWW

April 30, 2013

Most readers are probably acquainted with at least the outline history of the World Wide Web [WWW], developed originally, beginning in 1989, by Sir Tim Berners-Lee and Robert Cailliau at the European nuclear research establishment, CERN (Organisation Européenne pour la Recherche Nucléaire).  At the time, the Internet was very much a new thing, and that first project was aimed at using hypertext to make accessing scientific information easier.  (There were other search and indexing tools available, like Archie and Gopher, but none had really caught on in a big way.)  The new WWW was made accessible to the public via the Internet in August 1991.

As an article at Ars Technica reminds us, it was twenty years ago today, on April 30, 1993, that CERN announced the conclusion of an internal debate, making the WWW technology freely available to anyone by placing three software packages in the public domain: a basic Web server, a basic client (a line-mode browser), and a common code library.  Quoting from the announcement:

CERN’s intention in this is to further compatibility, common practices, and standards in networking and computer supported collaboration.

CERN has announced today that, in commemoration of that 1993 decision, it is starting a project to restore the world’s first website, which was hosted on Berners-Lee’s NeXT workstation and explained how to use the new technology.  (A slightly later copy is available here.)  It also intends to restore related files and documents.

To mark the anniversary of the publication of the document that made web technology free for everyone to use, CERN is starting a project to restore the first website and to preserve the digital assets that are associated with the birth of the web. To learn more about the project and the first website, visit http://info.cern.ch

CERN also has a restoration project page.


HTML5 Now “Feature Complete”

December 20, 2012

Earlier this week, the World Wide Web Consortium [W3C] announced that the definition of HTML5 and the accompanying Canvas 2D graphics specification are now “feature complete”.

The World Wide Web Consortium (W3C) published today the complete definition of the HTML5 and Canvas 2D specifications. Though not yet W3C standards, these specifications are now feature complete, meaning businesses and developers have a stable target for implementation and planning.

This means that the set of capabilities to be provided is now, essentially, frozen.  These definitions are not yet official Web standards, but they now have “Candidate Recommendation” status; the focus of work going forward will be on testing and checking interoperability.  Web developers would, ideally, like to have a set of standards that is implemented consistently in all browsers.  Having a feature-complete standard means that all the browser makers have a common target to aim for.

During this stage, the W3C HTML Working Group will conduct a variety of activities to ensure that the specifications may be implemented compatibly across browsers, authoring tools, email clients, servers, content management systems, and other Web tools. The group will analyze current HTML5 implementations, establish priorities for test development, and work with the community to develop those tests.

Innovation and creativity on the part of browser makers have helped drive the development of the Web; having standards helps avoid a chaotic mess of incompatible implementations.


Strict Transport Security Adopted as Web Standard

November 23, 2012

Most Web users are familiar with the secure version of the basic HTTP protocol, denoted by https: at the start of a URL, and typically marked by a small padlock icon in the browser.  The secure protocol provides for identification of the site, using a cryptographic certificate, and encrypts all communications between the user’s browser and the server.  This helps assure the user that (s)he is interacting with the desired site, and not an impostor; it also provides protection against session “sniffing” (otherwise trivially easy on wireless networks) and man-in-the-middle attacks.  Many sites, from banks to Facebook, offer HTTPS connections.  But users still have to choose to use them, although some sites (Gmail, for example) allow the user to set a preference to always use HTTPS.

Another step in the direction of better security has just been taken, according to an article in the Australian publication, Computer World.  The Internet Engineering Task Force, a group responsible for setting Internet technical standards, has just approved a standard [RFC 6797] for HTTP Strict Transport Security (HSTS).  

This specification defines a mechanism enabling web sites to declare themselves accessible only via secure connections and/or for users to be able to direct their user agent(s) to interact with given sites only over secure connections. This overall policy is referred to as HTTP Strict Transport Security (HSTS).

Essentially, the standard gives a site a way to declare that it will accept only secure connections, and a method for browsers to enforce that policy.
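
Concretely, RFC 6797 has the server advertise this policy in a Strict-Transport-Security response header, which a conforming browser remembers for the period the header specifies.  The sketch below is only an illustration of that mechanism, not code from the standard or from any of the sites mentioned; the handler class, certificate file names, and one-year max-age value are assumptions chosen for the example.

    # Minimal sketch: an HTTPS server that sends the RFC 6797
    # Strict-Transport-Security header with every response.
    import http.server
    import ssl

    class HSTSHandler(http.server.SimpleHTTPRequestHandler):
        def end_headers(self):
            # Ask conforming browsers to reach this host (and its
            # subdomains) only over HTTPS for the next year.
            self.send_header("Strict-Transport-Security",
                             "max-age=31536000; includeSubDomains")
            super().end_headers()

    if __name__ == "__main__":
        httpd = http.server.HTTPServer(("", 443), HSTSHandler)
        # Hypothetical certificate and key files; the header only has
        # effect when it is received over a valid TLS connection.
        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
        ctx.load_cert_chain("server.crt", "server.key")
        httpd.socket = ctx.wrap_socket(httpd.socket, server_side=True)
        httpd.serve_forever()

Once a browser has recorded the header, it rewrites subsequent http: requests for that host to https: before they ever leave the machine.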

The new standard fixes some loopholes and bad design choices in the original HTTPS standard.  For example, when a browser attempts to set up an HTTPS connection, it will generally issue a warning message if there is some problem with the site’s cryptographic certificate; but the user can choose to proceed anyway.   In many cases, this is OK; the certificate problem is not serious.  Unfortunately, though, sometimes the problem really is serious; this is a Bad Thing if users have become accustomed to just clicking “OK”.  With HSTS, the browser will just refuse to make the connection.   This may seem draconian, but users are typically not well qualified to evaluate certificate problems, so this approach amounts to “better safe than sorry”.  The new standard also addresses a variety of other security issues.

At present, not many sites have support for the new HSTS standard (though PayPal, Twitter, and some Google sites do).  I hope that, with the adoption of the formal standard, more sites will provide support for a mechanism that can significantly improve security.


Internet Archive Celebrates 10 Petabytes

October 28, 2012

The Internet Archive, a non-profit organization dedicated to creating a digital archive and library of Internet content, has just celebrated its collection reaching 10 petabytes (10,000,000,000,000,000, or 1.0×10¹⁶ bytes).  The collection contains approximately 150 billion historical Web pages, as well as texts, images, audio, and video.  The Internet Archive provides the Wayback Machine to allow retrieval of archived pages, as well as more general search tools.
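
As a rough illustration of how archived pages can be retrieved programmatically, the sketch below asks the Wayback Machine for the capture of a page closest to a given date.  The availability endpoint, its JSON field names, and the example target are assumptions based on the Archive’s publicly documented API rather than details from the announcement.

    # Minimal sketch: query the Wayback Machine's "availability" endpoint
    # for the archived snapshot of a URL closest to a given timestamp.
    import json
    import urllib.request

    def nearest_snapshot(url, timestamp="20111001"):
        query = ("https://archive.org/wayback/available?url={}&timestamp={}"
                 .format(url, timestamp))
        with urllib.request.urlopen(query) as resp:
            data = json.load(resp)
        closest = data.get("archived_snapshots", {}).get("closest")
        return closest["url"] if closest else None

    # Example: look for a 2011-era capture of CERN's original info server.
    print(nearest_snapshot("info.cern.ch"))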

The Internet Archive also announced the availability, for research purposes, of 80 terabytes (8.0×10¹³ bytes) of archived Web crawl data from 2011.  The data set characteristics are:

  • Crawl start date: 09 March, 2011
  • Crawl end date: 23 December, 2011
  • Number of captures: 2,713,676,341
  • Number of unique URLs: 2,273,840,159
  • Number of hosts: 29,032,069

Interested researchers can get in touch with the Archive to arrange access.

If you would like access to this set of crawl data, please contact us at info at archive dot org and let us know who you are and what you’re hoping to do with it.  We may not be able to say “yes” to all requests, since we’re just figuring out whether this is a good idea, but everyone will be considered.

The San Francisco Chronicle recently had a front-page profile of the Internet Archive and its founder, Brewster Kahle.

