Is Drupal on your IT map yet? Chances are pretty good that either you are nodding your head vigorously in the affirmative, or you have no idea what I’m talking about. Drupal is an open source web content management system … though this is actually a little like saying that a Jaguar is a car; it’s true as far as it goes, but the description doesn’t really do Drupal justice.
Drupal started out in 2000 as a community project in the Netherlands originally called Druppel. The creator, Dries Buytaert, originally planned to call it Dorp (which means Village in Dutch), but he introduced a typo when filling out the domain registration, and liked the way that it sounded. The idea behind it was simple - build a CMS that promoted the concept of community rather than simply being a way to store content. To do this, Drupal was built early on around the idea of content nodes (think of them as very simple documents with a title and body) and the heavy use of syndication.
Drupal has long been just underneath the radar. My first encounter with it was in 2003, when I became involved with a group of programmers supporting the Howard Dean campaign, where we settled upon Drupal as a good foundation for an easy-to-roll-out web CMS that could support local grassroots groups. Part of the ability of the Dean campaign to raise funds on what amounted to a shoestring budget could be attributed to the Drupal implementation, at the time dubbed DeanSpace. After the election, the developers submitted the extensions back into the Drupal code base as CivicSpace, and this is still one of the most widely used open source political sites to date.
Drupal gained further clout with the rise of blogging in 2004-2005, and much of the functionality that has been added to Drupal has served only to strengthen this blogging capability. The company Bryght was founded in Vancouver in 2004 to sell hosted Drupal services, and along the way significantly pushed Drupal into the spotlight as one of the premier blogging and social networking platforms (Bryght was recently acquired by sibling Vancouver company Raincity Studios, combining a cutting edge web design company with the hosting services Bryght itself provides).
Drupal has gone through a number of iterations, and just recently released version 6.0. Having worked with it myself in its beta incarnation, I have to commend the many open source developers who put countless hours into this version - Drupal is finally coming into its own as perhaps one of the best web platforms out there … and the irony is that blogging, while still an integral part of its underlying system, has taken a back seat to its extensive use of taxonomies, feed manipulations and effective use of AJAX systems.
The Power of Classified Syndication
One of the more intriguing sites I’ve seen recently is the just-released Eureka! Science News site. I’m something of a science junkie, and regularly monitor feeds from Scientific American, the National Science Foundation, Discovery, and many other sites. The idea behind Eureka! is relatively simple - they subscribe to the same data feeds and a host of others, but use a processing algorithm on all incoming content to apply classifications to them automatically. Michael Imbeault, the brains behind Eureka!, described the frustrations with existing tools that motivated his search for a new solution:
First, a little bit of history about how I discovered Drupal; I launched Biology News Net 4 years ago using Movable Type - biology is the #1 science and I found it weird that no site was dedicated to biology news. The site quickly became popular (#1 on Google for ‘biology news’) - it was unexpected, as the site was started as a hobby project / blog and thus I hit the limitations of Movable Type really fast; adding functionality was complicated, performance was not great (even on a dedicated server), customization was not really doable. Just as an example, the forum is actually a phpBB installation that has its sessions tied to the Movable Type sessions - it’s clunky even if it works, and upgrading is a nightmare.
As I could not really expect more from a blogging engine - Movable Type served me well - I searched for something better - this is when I found Drupal (about 2 years ago) and fell in love with it! It has a significant learning curve, but it is so powerful that the time invested to learn it is easily worth it in the long run. While I do not have time to contribute much to the actual development of Drupal, I help when I can and maintain one module (quickstats.module, coded by chx with small improvements from me).
One of the most important innovations within Drupal is the use of Taxonomies and Views as the basis for nearly everything. Most people, especially those in the XML and data modeling community, see a taxonomy as a collection of names - the tags used in a given XML document; database people have a similar definition. The concept behind Drupal taxonomies is considerably more robust - a taxonomy is a collection of terms within a given vocabulary (something akin to a namespace), each term of which can in turn be attached to one or more nodes of content.
For instance, suppose that I have two articles - one about the successful mission in getting the Mars Lander to the red planet, the other about the collision of the Milky Way and the Andromeda galaxies - that I wish to add to the site. The first could be classified with the terms “Mars,Mars Lander,Robot,Astronomy,Planets,Science” and the second “Milky Way, Andromeda, Galaxy,Astronomy,Science”. In this particular case, the two stories “intersect” at the terms “Astronomy” and “Science” - meaning that if I selected either of these terms, I could have the system create a list of both of these stories, whereas “Galaxy” would in turn collect only a single entry.
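The intersection described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Drupal's own data model: the article ids and the helper names (`build_term_index`, `virtual_folder`, `shared_terms`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical article ids, tagged with the terms from the example above.
ARTICLES = {
    "mars-lander": ["Mars", "Mars Lander", "Robot", "Astronomy", "Planets", "Science"],
    "galaxy-collision": ["Milky Way", "Andromeda", "Galaxy", "Astronomy", "Science"],
}

def build_term_index(articles):
    """Invert article -> terms into term -> set of article ids."""
    index = defaultdict(set)
    for article_id, terms in articles.items():
        for term in terms:
            index[term].add(article_id)
    return index

def virtual_folder(index, term):
    """Return the sorted ids of every article carrying the given term."""
    return sorted(index.get(term, set()))

def shared_terms(articles, first, second):
    """Return the terms at which two articles intersect."""
    return sorted(set(articles[first]) & set(articles[second]))

index = build_term_index(ARTICLES)
print(virtual_folder(index, "Astronomy"))  # both stories
print(virtual_folder(index, "Galaxy"))     # only the galaxy story
print(shared_terms(ARTICLES, "mars-lander", "galaxy-collision"))
```

Nothing is stored "in" a folder here; each listing is computed on demand from the term assignments.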
These collections are analogous to the collections of news feeds (indeed, there’s a DEEP connection here that I’ll explore in a subsequent article), meaning that I could actually syndicate collections of items based solely upon the categorization terms. Intriguingly, this kind of process can be thought of as the creation of virtual (parametric) folders of content.
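As a rough sketch of what syndicating by term might look like, the following builds a minimal RSS 2.0 document for a single term using only the Python standard library. The item titles, the example.org URLs, the `/taxonomy/` path and the function name `feed_for_term` are all assumptions made for illustration; they are not Drupal's actual feed paths or API.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Hypothetical items; titles and links are placeholders.
ITEMS = [
    {"title": "Mars Lander reaches the red planet",
     "link": "https://example.org/node/1",
     "terms": {"Mars", "Mars Lander", "Robot", "Astronomy", "Planets", "Science"}},
    {"title": "Milky Way and Andromeda on a collision course",
     "link": "https://example.org/node/2",
     "terms": {"Milky Way", "Andromeda", "Galaxy", "Astronomy", "Science"}},
]

def feed_for_term(term, items, site="https://example.org"):
    """Build an RSS 2.0 feed containing only the items tagged with `term`."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"Stories tagged {term}"
    ET.SubElement(channel, "link").text = f"{site}/taxonomy/{quote(term)}"
    ET.SubElement(channel, "description").text = f"Virtual folder for the term {term}"
    for item in items:
        if term in item["terms"]:
            node = ET.SubElement(channel, "item")
            ET.SubElement(node, "title").text = item["title"]
            ET.SubElement(node, "link").text = item["link"]
            # Expose every term of the item as an RSS category.
            for item_term in sorted(item["terms"]):
                ET.SubElement(node, "category").text = item_term
    return ET.tostring(rss, encoding="unicode")

print(feed_for_term("Galaxy", ITEMS))
```

The term acts as the parameter of the folder: change the term and the same pool of items yields a different feed.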
Now, if you take the taxonomy and define relationships between terms, then this analogy becomes even stronger. For instance, you could create the relationship:
+ Science
    + Astronomy
        + Planets
            - Mars
        + Galaxies
            - Milky Way
            - Andromeda
+ Technology
    + Robot
        - Mars Lander
This relationship basically means that if you click on Science, you’ll get both of the stories, while clicking on Planets will only yield (“contain”) one story. In other words, you have the containment capabilities of folders without having to specifically store content within those folders - you’re only storing the relationships of the corresponding taxonomic terms.
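A minimal sketch of this containment behaviour follows, assuming that each article is tagged only with its most specific terms and that a parent term "contains" whatever its descendants contain. The exact nesting (in particular, treating Technology as its own top-level branch) and the names `CHILDREN`, `TAGGED`, `descendants` and `folder` are assumptions for illustration.

```python
# Assumed parent -> children relationships between terms.
CHILDREN = {
    "Science": ["Astronomy"],
    "Astronomy": ["Planets", "Galaxies"],
    "Planets": ["Mars"],
    "Galaxies": ["Milky Way", "Andromeda"],
    "Technology": ["Robot"],
    "Robot": ["Mars Lander"],
}

# Hypothetical articles, tagged only with their most specific terms.
TAGGED = {
    "mars-lander": {"Mars", "Mars Lander"},
    "galaxy-collision": {"Milky Way", "Andromeda"},
}

def descendants(term, children=CHILDREN):
    """Return the term itself plus every term nested beneath it."""
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, []))
    return found

def folder(term, tagged=TAGGED, children=CHILDREN):
    """List articles tagged with the term or with any of its descendants."""
    scope = descendants(term, children)
    return sorted(article for article, terms in tagged.items() if scope & terms)

print(folder("Science"))  # both stories
print(folder("Planets"))  # only the Mars Lander story
```

Only the term-to-term and term-to-article relationships are stored; the "folder" contents are derived each time they are requested.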
The challenge with such tagging is that it is both unpredictable and time consuming, especially when trying to tag with multiple terms. This in fact has often been the Achilles’ heel of most folksonomy collections - the process of categorization is comparatively expensive unless you’re disciplined enough to do it or are passionate about the topic in question. What this means is that folksonomies work well in some areas, but if you’re trying to get free taxonomies implemented in a business or research environment, getting people beyond the creators to tag content becomes untenable fairly quickly.
This is why the approach Imbeault has taken should be looked at very closely by other organizations. He recognized that the best approach that he could take was not to attempt to manually tag the content himself, but rather to analyse the content via Bayesian filters to determine which terms in a previously defined vocabulary most closely matched the topics, perhaps with a separate set of terms in a free taxonomy that matched unique or near unique terms in each of the articles found in a syndication feed. This way, rather than creating a wholesale text-index system for each article, he only needed to keep the links and the corresponding keyword terms in one or two vocabulary sets.
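To make the idea of Bayesian tagging concrete, here is a minimal multinomial naive Bayes sketch with add-one smoothing. It is not Eureka!'s actual algorithm, whose details are not described here; the class name `NaiveBayesTagger`, the training sentences and the two vocabulary terms are all invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayesTagger:
    """Assigns the most probable vocabulary term to a piece of text."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # term -> word frequencies
        self.doc_counts = Counter()              # term -> number of training texts
        self.vocabulary = set()                  # every word seen in training

    def train(self, text, term):
        words = tokenize(text)
        self.word_counts[term].update(words)
        self.doc_counts[term] += 1
        self.vocabulary.update(words)

    def scores(self, text):
        """Return the log-probability score of each known term for the text."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        result = {}
        for term, counts in self.word_counts.items():
            total_words = sum(counts.values())
            log_prob = math.log(self.doc_counts[term] / total_docs)
            for word in words:
                # Add-one (Laplace) smoothing keeps unseen words from zeroing the score.
                log_prob += math.log((counts[word] + 1) / (total_words + len(self.vocabulary)))
            result[term] = log_prob
        return result

    def best_term(self, text):
        scored = self.scores(text)
        return max(scored, key=scored.get)

tagger = NaiveBayesTagger()
tagger.train("The rover landed on Mars and sampled the planet's soil", "Planets")
tagger.train("Orbiters mapped the surface of the red planet Mars", "Planets")
tagger.train("The Andromeda galaxy will collide with the Milky Way galaxy", "Galaxies")
tagger.train("Astronomers measured stars in a distant spiral galaxy", "Galaxies")

print(tagger.best_term("A new lander touches down on Mars"))
print(tagger.best_term("Two spiral galaxies drift toward the Milky Way"))
```

A real deployment would train on far more text per term and would likely keep the top few terms above a threshold rather than a single best match, but the stored result is the same: a link plus a handful of vocabulary terms, not a full-text index.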
Wholesale text-indexing of content for any given sector is in general not cost-effective unless you have the ability to handle large (even massive) server farms; search engine optimizations can help, but realistically, relevance at an affordable price involves using Bayesian methods and stochastic (probabilistic) tools that can be tailored to your specific audience, by performing the analysis and assignment of articles to the vocabularies which work best for you.
Content Creation, Content Syndication
The manual version of this process is in fact one of the major roles that an editor brings to the table; most content falls into one of two types of formats. The first kind consists of articles that fall into specific topical buckets - biology, physics, sports, finance, and so forth. These topics perform the same role as the taxonomic terms described above, save that in most print media, very few articles will be in more than one such bucket at any given time. On the web, where it’s trivial to assign such vocabulary terms to a given article, you can instead have the same article appear in several different buckets at once. This fact alone is forcing the editor to become more of a formal taxonomist than he or she was in the past.
The second type of article is the “column”, which today is basically synonymous with the blog. In this case, it is the authority of the content creator that provides the terms of a vocabulary. However, even here, there’s usually a secondary taxonomy at work that is topical in nature - what the blogger is talking about in this particular issue. Editors of online content understand that this “reputation taxonomy” is a major organizing principle of “opinion content”, but also recognize that much of this opinion content holds topical interest to readers beyond the reputation of the author.
Stochastic analysis of content introduces something new to this mix, however; it provides a very inexpensive semantic layer that doesn’t necessarily require human intervention. In this sense, a stochastic analyser begins to look increasingly like a compiler - something that will apply a set of rules to determine the abstract “keyword space” of an article. Interestingly, most early compilers were often seen as tools that would get you started, but a programmer may very well have to go back in and “tweak” the results with special assembly language code in order to get better performance. Today, however, most compilers are so sophisticated that there are almost no tweaks that can be made afterwards that won’t in fact degrade the quality or performance of the final application.
We’re probably some ways from the point where hand-tweaking of stochastic analysers will be counterproductive, but the trend will be towards that - computers will become better at abstracting information from an article than people are, at least for those processes that are relevant to classification or navigation.
The effect of this upon news and related factual content will be (is already) profound. The role of editor as arbiter and gatekeeper is increasingly becoming automated because the taxonomy systems are becoming too complex for any one person to keep abreast of. However, this is also important because taxonomy is the new navigation, something which I believe Drupal does inordinately well. Most news sites have transcended the level where a human being can reasonably serve to build navigation, and search engines face a problem of geometric expansion of content in the long term; thus it’s likely that taxonomic navigation will be the dominant face of finding news moving forward.
Watch the space of stochastic taxonomic analyzers; I suspect it will be a significant growth industry in the comparatively near term. The irony of course is that in building the initial web, the metaphor most commonly used was that of the magazine, but as with any new technology, the metaphors that drove the initial adoption eventually fade away as the capabilities of the new technology shape the parameters of what can be done in that medium. Whether the existing news providers will in fact survive that transition remains to be seen.
Kurt Cagle is the managing editor of XML.com. He lives in Victoria, British Columbia, and is beginning to wonder what happened to summer.