The MediaWiki parser, uncovered

The MediaWiki parser is one of the most essential and yet one of the most complex pieces of code in the entire MediaWiki project. Without it, you would not be able to mark up Wikipedia pages with sections, links, or images, nor view or easily change the markup of others. Yet it is still flexible enough that beginners and HTML experts alike can contribute to pages. That flexibility has made the parsing code somewhat complex, and it has gone through many iterations over the years. Even today, it is still fast enough for Wikipedia, one of the largest websites in the world. Let's take a look under the covers of this underappreciated (and perhaps slightly daunting) piece of code.

A short history

First, a disclaimer: this history is as I understand it, taken largely from discussions I've followed intently over the years on the Wikimedia mailing lists, as well as discussions at the 2006 Wikimania conference. Up until about a year ago, the MediaWiki parser suffered from extreme complexity, partly because of the need to keep it single-pass (for speed), and partly because new rules kept getting tacked onto the existing code. Over time, it became a spaghetti mess that was difficult to debug and even tougher to improve. Rewriting it was made almost impossible by the fact that it was so essential to the software: millions of pages on Wikipedia could easily have been turned into gobbledygook in an instant if changes were not handled correctly.

What to do

There were a lot of discussions about solving the problem. They included rewriting the parser in C, which would greatly improve speed and thus allow for potential multi-pass parsing, an approach that could cope with the increasing number of templates, and templates of templates, being transcluded in Wikipedia pages. They also included changing the MediaWiki syntax so that certain ambiguities would be removed (such as whether '''''bold or italic?''bold or italic?''''' starts with bold or with italics, or whether nested curly brackets are meant as triple-bracket arguments or double-bracket templates). In the end, what they decided on, which I think was a brilliant idea, was to leave the parser in PHP (rewriting it in C would probably have produced two classes of MediaWiki developers) and divide parsing into two steps: preprocessing and parsing. The job of the preprocessor is to produce an XML DOM representation of the wikitext. The parsing step then iterates through the DOM structure as many times as it needs (e.g. to expand templates) to produce valid, static HTML. Iterating through a DOM is lightning fast, very natural from an XHTML point of view, and well supported in PHP.

The preprocessor

There are two versions of the preprocessor, a Hash version and a DOM version, found in /includes/parser/Preprocessor_Hash.php and /includes/parser/Preprocessor_DOM.php respectively. We will concentrate only on the DOM version, which is practically identical to the Hash version but faster, because it takes advantage of PHP's XML support (an optional component in PHP). The most important function in the preprocessor class is called preprocessToObj(). Inside the Preprocessor_DOM.php file there are a couple of other important classes that the preprocessor uses: PPDStack, PPDStackElement, PPDPart, PPFrame_DOM, and PPNode_DOM.
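
If you want to poke at this yourself from inside a MediaWiki install (say, from maintenance/eval.php or an extension hook), a minimal sketch of the entry point looks something like the following. Treat it as an assumption-laden illustration rather than production code:

// Hedged sketch: hand the preprocessor some wikitext and get back its
// object representation. $wgParser is MediaWiki's global parser instance.
global $wgParser;
$preprocessor = $wgParser->getPreprocessor();   // Preprocessor_DOM or Preprocessor_Hash, depending on configuration
$node = $preprocessor->preprocessToObj( '{{mytemplate}} this is a [[test]]' );
// $node is a PPNode_DOM wrapping the <root> element of the XML shown below.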

The preprocessor produces less than you think

So what does the MediaWiki XML look like? Here’s an example of the text representation of the XML output of the wikitext “{{mytemplate}} this is a [[test]]”:

<root><template><title>mytemplate</title></template> this is a [[test]]</root>

Notice how the internal link is not preprocessed at all. While the current code hints that this may change in the future (and it would make sense to do so), the only real work the preprocessor does is create XML elements for templates and a couple of other items. Here are the possible items, i.e. base nodes, in full:

  • template, tplarg, comment, ext, ignore, h

If you’ve ever worked with MediaWiki wikitext, you should already know what specific text each of these base nodes corresponds to. Nonetheless, here they are:

  • template = double-brackets, ( {{…}} )
  • tplarg = triple-brackets, ( {{{…}}} )
  • comment = Any type of HTML comment ( <!-- --> )
  • ext = Node reserved for anything that should get parsed in an extension
  • ignore = Node for wrapping escaped tags of type noinclude, as well as the tags and text of type includeonly
  • h = Node for wrapping sections

That's it. Anything else is ignored and handed back to the parser as its original wikitext.
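
To make that concrete: because the preprocessor output is ordinary XML, plain PHP DOM code (no MediaWiki classes at all) can walk it, which is exactly why keeping the parser in PHP worked out so well. A minimal, self-contained sketch using the sample output from above:

<?php
// Walk the preprocessor output for "{{mytemplate}} this is a [[test]]"
// using nothing but PHP's DOM extension.
$xml = '<root><template><title>mytemplate</title></template> this is a [[test]]</root>';

$doc = new DOMDocument();
$doc->loadXML( $xml );

foreach ( $doc->documentElement->childNodes as $child ) {
    if ( $child->nodeType === XML_TEXT_NODE ) {
        // Anything that is not a base node comes through as untouched wikitext.
        echo "text:    ", $child->nodeValue, "\n";
    } else {
        // One of the base nodes: template, tplarg, comment, ext, ignore, h.
        echo "element: <", $child->nodeName, "> ", $child->textContent, "\n";
    }
}
// Prints (roughly):
//   element: <template> mytemplate
//   text:     this is a [[test]]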

How the preprocessor works

There is nothing special here, but it is worth noting. In order to produce the XML representation it needs, the preprocessor must iterate through each character in the wikitext. There is no other way to correctly account for recursive templates, which the syntax allows to be written in a myriad of ways. So if our Wikipedia article is 40,000 characters long, the main loop will very likely run around 40,000 times. Now you're beginning to see why speed was such an issue.
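
To make the character-by-character idea concrete, here is a deliberately tiny scanner in the same spirit. This is not the real preprocessToObj(), which also handles template arguments, headings, comments, extension tags, and unbalanced braces via PPDStack; it only wraps well-formed {{…}} pairs and assumes balanced braces:

<?php
// Toy version of the preprocessor's main loop: walk the wikitext one
// character at a time and wrap {{...}} pairs in <template> nodes.
function toyPreprocess( $text ) {
    $out   = '<root>';
    $stack = array();   // open-template markers (cf. PPDStack in the real code)
    $i     = 0;
    $len   = strlen( $text );

    while ( $i < $len ) {
        if ( substr( $text, $i, 2 ) === '{{' ) {
            $stack[] = $i;                     // remember where this template opened
            $out    .= '<template><title>';
            $i      += 2;
        } elseif ( substr( $text, $i, 2 ) === '}}' && count( $stack ) > 0 ) {
            array_pop( $stack );
            $out .= '</title></template>';
            $i   += 2;
        } else {
            $out .= htmlspecialchars( $text[$i] );
            $i   += 1;
        }
    }
    return $out . '</root>';
}

echo toyPreprocess( '{{mytemplate}} this is a [[test]]' ), "\n";
// => <root><template><title>mytemplate</title></template> this is a [[test]]</root>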

The real deal: parsing

Glossing over the remaining details of the preprocessor and how it uses the classes mentioned above to produce the XML, let's turn our attention to the parser and look at a typical pass it makes when you click on a Wikipedia page and ask for the HTML representation of the wikitext that was saved. Keep in mind that wiki pages are cached whenever possible, so you may not be calling the parser directly when you click on a page.

Here’s a typical, generalized function call tree of the parser (of a current revision of a page), starting with the Article object.

[Figure: parse function call tree]

Let's take a look at these functions. Again, these are the _major_ functions at play, not all of them. Numbers 2-4 retrieve and return the article's wikitext from the database. This text is passed to outputWikiText, which prepares it for Parser::parse(). Things get interesting again at numbers 8-11. Inside replaceVariables, the text is preprocessed into its DOM representation, iterating through each character in the article to find the beginning and ending marks of templates, subtemplates, and the other nodes mentioned above.

Number 11 is an interesting step, which I'm going to skip over for the moment because it requires some knowledge of the other classes in the Preprocessor_DOM.php file (mentioned above). expand() is very important and does a lot of things (among them, calling itself recursively), but suffice it to say that its job is to actually get the text within the DOM's nodes (remember that templates can be nested, so you may not yet have the full output text from each transcluded page) and return valid expanded HTML text, with the exception of three main areas: tables, links, and (un)numbered lists. So in our example above, "{{mytemplate}} this is a [[test]]", the expand() return value would be:

“I’ve included the [[text]] from my template. this is a [[test]]”

As you can see in this simplified example, at this point everything except tables, links, and the (un)numbered lists is parsed.
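
A rough way to picture what expand() does with those template nodes is the stripped-down sketch below. It is conceptual only: the real PPFrame_DOM::expand() resolves titles through frames, substitutes arguments, detects loops, and re-preprocesses the fetched wikitext, whereas here the "transcluded" content is just a hard-coded, already-preprocessed string (an assumption for illustration).

<?php
// Hypothetical stand-in for fetching and preprocessing a transcluded page.
$templateXml = array(
    'mytemplate' => "<root>I've included the [[text]] from my template.</root>",
);

function expandNode( DOMNode $node, array $templateXml ) {
    $out = '';
    foreach ( $node->childNodes as $child ) {
        if ( $child->nodeType === XML_TEXT_NODE ) {
            $out .= $child->nodeValue;                  // untouched wikitext passes through
        } elseif ( $child->nodeName === 'template' ) {
            $title = trim( $child->getElementsByTagName( 'title' )->item( 0 )->textContent );
            if ( isset( $templateXml[$title] ) ) {
                $doc = new DOMDocument();
                $doc->loadXML( $templateXml[$title] );  // pretend we fetched and preprocessed it
                $out .= expandNode( $doc->documentElement, $templateXml );
            }
        } else {
            // tplarg, comment, ext, ignore, h: all handled specially in the real code
            $out .= $child->textContent;
        }
    }
    return $out;
}

$doc = new DOMDocument();
$doc->loadXML( '<root><template><title>mytemplate</title></template> this is a [[test]]</root>' );
echo expandNode( $doc->documentElement, $templateXml ), "\n";
// => I've included the [[text]] from my template. this is a [[test]]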

Links are special

Yes, links get their own separate section. Not only are they probably the most essential ingredient of what makes a wiki a wiki (besides the editing capability), they are also the most specially handled of all the markup in the parser code (currently, that is). What makes them special is that they are handled in two passes: the first marks every link with a unique ID, and the second replaces these "link holders" with valid HTML. So in our example, here is the output after the first pass.

“I’ve included the <!--LINK 0--> from my template. this is a <!--LINK 1-->”

As you can imagine, there is also an array that maps the link text to the LINK IDs; it is a Parser class variable called mLinkHolders. Besides the mapping, it also stores the Title object of each link.

So the second pass of link parsing uses this array to do a simple find and replace. Now we're done! Ship the parsed text out the door!
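
The two-pass trick is easy to mimic outside MediaWiki. Here is a hedged sketch of the idea, not the real replaceLinkHolders()/mLinkHolders machinery (which works with Title objects, checks page existence for red links, and batches database queries); the regex and the "/wiki/" URL shape are assumptions for the sketch.

<?php
// Pass 1: swap every [[...]] for a numbered placeholder and remember the mapping.
$text = "I've included the [[text]] from my template. this is a [[test]]";

$linkHolders = array();
$text = preg_replace_callback(
    '/\[\[([^\]|]+)(?:\|([^\]]+))?\]\]/',
    function ( $m ) use ( &$linkHolders ) {
        $id = count( $linkHolders );
        $linkHolders[$id] = array(
            'target' => $m[1],
            'text'   => isset( $m[2] ) ? $m[2] : $m[1],   // piped link text, if any
        );
        return "<!--LINK $id-->";
    },
    $text
);
// $text now reads: I've included the <!--LINK 0--> from my template. this is a <!--LINK 1-->

// Pass 2: a simple find and replace of each placeholder with real HTML.
foreach ( $linkHolders as $id => $link ) {
    $href = '/wiki/' . rawurlencode( str_replace( ' ', '_', $link['target'] ) );
    $html = '<a href="' . $href . '">' . htmlspecialchars( $link['text'] ) . '</a>';
    $text = str_replace( "<!--LINK $id-->", $html, $text );
}

echo $text, "\n";
// => I've included the <a href="/wiki/text">text</a> from my template. this is a <a href="/wiki/test">test</a>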

Next up

In installment 2 of 2, I'm going to concentrate more on the preprocessor and detail what each of the classes in the Preprocessor_DOM.php file does and how they are used to build the initial XML DOM. I'll also talk about how I hacked this to cache infoboxes for faster retrieval in an extension called Unbox.


1 Comment so far

  1. Litso March 31st, 2009 5:08 am

    Hey,

    Very interesting article, this covers a lot about the parser that I couldn’t figure out myself from reading the code.

    What I’m still not quite sure of is when additional html-like tags in extensions (like for the Flash extension) get parsed. I guess I’ll have to wait for part two of this article?
