Pinderkent

Pain and glory from the trenches of the IT world.

It's rare to see a database schema as clean as MediaWiki's.

Posted on Saturday, January 24, 2009 at 3:36 PM.

Over the past week or so, I've seen links to this diagram of the MediaWiki database schema posted on a number of other sites and blogs. Now, MediaWiki is no minor piece of software. It is used to power Wikipedia, which Alexa currently states is one of the top ten most popular Web sites on the Internet.

At many of the companies I have consulted with or worked for, it is rare to find a document or diagram as precise and effective as the one for the MediaWiki database schema. It's even rarer to find a real-world database schema that is so consistent and sensible. Clearly, a great amount of care and experience has gone into the development of this database.

Even without studying it for very long, it is apparent that it is a rather clean database schema. One thing that is apparently almost immediately is the prefixing of the table columns. This is a technique often used to reduce the ambiguity between identically-named columns of separate tables when those tables are joined. It also helps make it more clear from which table a given column is coming from when looking at the results of a query.

The one minor inconsistency I do see is that the names of some tables contain words that are separated with an underscore (for example, user_groups, user_newtalk, page_restrictions and site_stats), while others do not separate the words (for example, filearchive, oldimage, pagelinks, searchindex, and recentchanges). At least the tables related to certain functionality tend to have similar naming conventions.

The normalization and the relationships between the tables appear to be quite sensible. The short descriptions, even if only a sentence or two long, do significantly help make it clear why a certain table exists. This is something that is lacking from many enterprise databases.

Performance is clearly an issue, so we see some caching-related tables in the top-right corner of the current database schema diagram. We see other tables, like job, testitem and testrun, that show that this is a real-world database where we have external batch jobs, as well as testing. This isn't just some academic exercise dreamed up in a classroom where the necessities of the real world are not to be found.

Any piece of software backed by a database should strive to have a similar schema diagram. And it needs to be kept up-to-date as changes are made to the database. This single schema diagram teaches us much about the MediaWiki architecture, but it also speaks to the quality of the product. Software that is built upon a sensible data model often proves to be very natural, and thus in many cases easier to develop, which often leads to a higher-quality product. If a production database schema can be kept as clean as that of MediaWiki, then we'll likely see greater programmer productivity, and higher software quality.

Permalink: http://pinderkent.phumblog.com/post/2009/01/its_rare_to_see_a_database_schema_as_clean_as_mediawikis
Share:
Feeds
  • RSS 2.0 Feed
  • Atom 2.0 Feed
Tags
Archives