Where pegs grow legs: hanging ideas on words

“I have no special talents. I am only passionately curious.” ~ Albert Einstein

Archive for the 'wiki' Category

Wikimania 2009: Recent work a boon for government communities

Wikipedia is growing up. That was my overall impression of the recent 2009 Wikimania conference in Buenos Aires. This is great news for the government community, which has followed Wikipedia’s lead: more or less culturally, and certainly technically. In a similar project, for example, there is the idea of “sysops”: community organizers and wiki administrators. There is also the idea of open editing, in which every individual’s edits are considered equal across all topics. There is no delineation of specialty other than the one contributors make in choosing which articles to work on. While there are some major differences between the projects, the general ethos is the same.

While an amazing success, Wikipedia has been plagued by usability problems, subtle bias in articles (relating to trust), and a true lack of understanding of its own community. These issues are not new either – I remember us talking about them in 2006. But this year felt different: the Wikimedia Foundation seems ready to tackle these issues head on, and there is concrete progress to show.

Technical Progress

I was glad to see some very promising projects at the conference that the technical community has been working on, many of which will benefit the government community. Anyone who has ever used the MediaWiki discussion system knows that it leaves a lot to be desired. LiquidThreads is an extension that looks like it will be a great replacement for the hodgepodge of colons, double-colons, and other syntax used to visually denote a “thread” in the current discussion system.

[Image: new discussion system, from techblog.wikimedia.org]

The new system makes it easier to start new discussions, follow existing ones, and quickly summarize the contents of an entire discussion. Take a look at a test system here.

Flagged Revisions

Wikipedia appears to be interested in introducing the idea of “flagging” revisions: marking revisions that have been reviewed by a group of pre-ordained experts on a particular topic. The reviewers would be experienced individuals who have clearly demonstrated their expertise in some area, e.g. biology. These reviewers won’t be allowed to change a revision, but they can rate it (flag it). This assures readers such as newspapers that, at a particular point in time, the article did not have egregious errors in it and can be better trusted for use as a reference. Last-minute edits that introduce embarrassing errors in an article won’t end up in the NY Times.

For the government community, this is huge. One of the major complaints about the raw wiki system is that it lessens the role of the subject matter experts, who form the vast majority of people doing the analysis that is produced today. Unlike the volunteers in Wikipedia, they are paid to work on subjects that are of importance to the organizations they work for. The flagged revision extension will hopefully provide a middle ground between the two camps. I can only see good things coming out of this, particularly as the extension and the idea underlying it are refined.

PDF export

This was also the first time that I saw the PDF export tool, which was developed with help from PediaPress, the organization that publishes Wikipedia articles in printed format. This was always something many folks asked for, and I think they’ve finally gotten the implementation down well. Kudos.

More Trust

Additionally, Wikipedia will be testing an extension that adds “trust info” to articles. The extension is developed by Luca de Alfaro and others at UCSC. It batch processes every revision of an article to determine the authorship of each word in the current text, and it infers which phrases are likely to be controversial or hotly debated from the number of times a phrase has been reverted. They’ve implemented the capability as a Firefox extension. You can download that here. This could be interesting for understanding more about the regularity of content. I also think the government community could extend this to look at groups of people, by categorizing editors by type (e.g. agency).
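To make the “authorship of each word” idea concrete, here is a minimal Python sketch. It is not the UCSC extension’s algorithm; it is a stand-in that diffs consecutive revisions with the standard library’s difflib and credits each surviving word to whoever first introduced it. The function name attribute_words and the sample revisions are made up for illustration, and revert counting is left out entirely.

```python
from difflib import SequenceMatcher


def attribute_words(revisions):
    """Credit each word of the latest revision to an author.

    `revisions` is a list of (author, text) pairs, oldest first.
    Hypothetical helper for illustration: a plain word-level diff,
    not the trust extension's real algorithm.
    """
    attributed = []  # list of (word, author) for the revision seen so far
    for author, text in revisions:
        words = text.split()
        old_words = [word for word, _ in attributed]
        matcher = SequenceMatcher(a=old_words, b=words, autojunk=False)
        updated = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged words keep their original author.
                updated.extend(attributed[i1:i2])
            elif tag in ("replace", "insert"):
                # New or rewritten words belong to this revision's author.
                updated.extend((word, author) for word in words[j1:j2])
            # "delete": the words are gone, so nothing is carried over.
        attributed = updated
    return attributed


if __name__ == "__main__":
    history = [
        ("alice", "the cat sat"),
        ("bob", "the black cat sat down"),
    ]
    for word, author in attribute_words(history):
        print(f"{word}\t{author}")
```

Grouping editors by type, as suggested above, would then just be a matter of mapping each author name to a category before tallying.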

Usability

New beta version of Wikipedia

Lastly, I think some of the greatest work, and the most telling sign of the project’s maturation, is the work on usability. It is fairly well established that editing on Wikipedia is not as easy as it should be. The Wikimedia Foundation has hired several developers to tackle some of the problems. Here is the list of things in their current release.

  • Tab reorganization – The new interface, called Vector, provides a clear indication of the “read” and “edit” states whether you are in an article or a discussion page.
  • Edit toolbar improvements – An action-grouped expandable toolbar hides infrequently used tool icons. Power users can expand the toolbar to access specialized tools. Special characters and help references (cheat sheet) are built into the toolbar for easy access. Special characters are displayed based on the configuration of each language site. Toolbar icons are redesigned by reusing Tango and Gnome icons.
  • Improved search interface – The search result page is often the entry point to articles. The visibility of relevant results is increased by removing the clutter.
  • General aesthetic improvements – Some aesthetic improvements have been applied for visual enhancement, and redundant information has been removed.
  • Opt-in/Opt-out switch and survey – As the features above are deployed as user preferences, the Opt-in/Opt-out page allows logged-in users to turn multiple usability initiative preferences on and off at the same time. Users are asked to participate in a short survey when they opt out.

You can try it too, by clicking on the “Try Beta” link at the top of the page. They are looking for feedback at this point, so please let them know what you think!

Statistics are huge

So much can still be learned from the data in Wikipedia: from the works themselves, to the rate at which pages are created, to the rate at which new users from different countries drop off the project. At the keynote, Jimmy Wales showed a slide that highlighted cultural differences between Wikipedias, as well as areas of potential concern. There is so much more to learn from this massive data store, and we’ve only scratched the surface. I’m hoping to get involved in building tools that automate new ways of looking at the data, and that also provide a strategic mechanism to ensure we’re answering the questions most interesting to the community. I’m hoping this type of work will help Wikimedia gain a deeper understanding of the dynamics of its community and make the project an even bigger success.

The MediaWiki parser, uncovered

The MediaWiki parser is one of the most essential, and yet one of the most complex, pieces of code in the entire MediaWiki project. Without it, you would not be able to mark up Wikipedia pages with sections, links, or images, nor view or easily change the markup of others. Yet it is still flexible enough to allow beginners and HTML experts alike to contribute to pages. This has made the parsing code somewhat complex, and it has gone through many iterations over the years. Yet even today, it is still fast enough for Wikipedia, one of the largest web sites in the world. Let’s take a look under the covers of this underappreciated (and perhaps slightly daunting) piece of code.

A short history

First, a disclaimer: this history is as I understand it, much of it taken from discussions I’ve listened to intently over the years on the Wikimedia mailing lists, as well as discussions at the 2006 Wikimania Conference. Up until about a year ago, the MediaWiki parser suffered from extreme complexity, partly because of its need to remain single-pass (for speed), but also because additional, sometimes new rules were tacked on to the existing code. Over time, it became a spaghetti mess that was difficult to debug, and even tougher to improve. Rewriting it was made almost impossible by the fact that it was so essential to the software. Millions of pages on Wikipedia could easily have been turned into gobbledygook in an instant if changes were not handled correctly.

What to do

There were a lot of discussions about solving the problem. They included rewriting the parser in C, which would greatly improve speed and thus allow for multi-pass parsing, an approach to dealing with the increasing number of templates, and templates of templates, being transcluded in Wikipedia pages. They also included changing the MediaWiki syntax to remove certain ambiguities (such as in '''''bold or italic?''bold or italic?''''', or the intent behind triple braces versus double braces in templates).

In the end, what they decided on, which I think was a brilliant idea, was to leave the parser in PHP (rewriting it in C would probably have produced two classes of MediaWiki developers) and divide parsing into two steps: preprocessing and parsing. The job of the preprocessor is to produce an XML DOM representation of the wikitext. The parsing step then iterates through the DOM structure as many times as it needs (e.g. to expand templates) to produce valid, static HTML. Iterating through the DOM is lightning fast, and also very natural from an XHTML point of view. There is also good support for it in PHP.

The preprocessor

You will find two versions of the preprocessor, Hash and DOM, in /includes/parser/Preprocessor_Hash.php and /includes/parser/Preprocessor_DOM.php respectively. We will concentrate only on the DOM version, which is practically identical to the Hash version but faster, because it takes advantage of PHP’s XML support, an optional component in PHP. The most important function in the preprocessor class is called preprocessToObj(). Inside the Preprocessor_DOM.php file, there are a few other important classes the preprocessor uses: PPDStack, PPDStackElement, PPDPart, PPFrame_DOM, and PPNode_DOM.

The preprocessor produces less than you think

So what does the MediaWiki XML look like? Here’s an example of the text representation of the XML output of the wikitext “{{mytemplate}} this is a [[test]]”:

<root><template><title>mytemplate</title></template> this is a [[test]]</root>

Notice how the internal link is not preprocessed at all. While the current code seems to hint that this may change in the future (and it makes sense to do so), the only real work the preprocessor does is create XML elements for templates and a couple of other items. Here are the possible items, i.e. base nodes, in full:

  • template, tplarg, comment, ext, ignore, h

If you’ve ever worked with MediaWiki wikitext, you should already know what specific text each of these base nodes corresponds to. Nonetheless, here they are:

  • template = double braces ( {{…}} )
  • tplarg = triple braces ( {{{…}}} )
  • comment = Any type of HTML comment ( <!-- --> )
  • ext = Node reserved for anything that should get parsed in an extension
  • ignore = Node for wrapping escaped tags of type noinclude, as well as tag and text of type includeonly
  • h = Node for wrapping sections

That’s it. Anything else gets ignored and returned in its original wikitext to the parser.

How the preprocessor works

There is nothing special here, but it is worthy of note. In order to produce the XML representation we need, the preprocessor must iterate through each character in the wikitext. There is no other way to account for recursive templates correctly, which could be represented in a myriad of ways due to the syntax. So if our Wikipedia article is 40,000 characters long, it is very likely it will run through a loop around 40,000 times. Now you’re beginning to see why speed was such an issue.
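To give a feel for that character-by-character loop, here is a toy sketch in Python (not the real PHP preprocessToObj()). It only understands {{…}} templates, ignores template arguments, tplarg, comments, and everything else in the node list, and uses a plain list as the stack of open templates. The function name preprocess and all the details are my own simplifications.

```python
from xml.sax.saxutils import escape


def preprocess(wikitext):
    """Toy preprocessor: wrap {{...}} in <template><title>...</title></template>.

    Illustration only. Triple braces, template arguments, comments and
    headings are not handled; links pass through untouched.
    """
    stack = [[]]  # stack[0] collects root output; each "{{" pushes a new list
    i = 0
    while i < len(wikitext):  # one pass over the text, a character or two at a time
        if wikitext.startswith("{{", i):
            stack.append([])  # open a (possibly nested) template
            i += 2
        elif wikitext.startswith("}}", i) and len(stack) > 1:
            title = "".join(stack.pop())  # close the innermost template
            stack[-1].append(f"<template><title>{title}</title></template>")
            i += 2
        else:
            stack[-1].append(escape(wikitext[i]))  # plain text, XML-escaped
            i += 1
    # Any "{{" that never closed is put back as literal text.
    while len(stack) > 1:
        unclosed = "".join(stack.pop())
        stack[-1].append("{{" + unclosed)
    return "<root>" + "".join(stack[0]) + "</root>"


if __name__ == "__main__":
    print(preprocess("{{mytemplate}} this is a [[test]]"))
```

Run on the earlier example wikitext, this should print the same XML shown above, with the internal link left as raw wikitext.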

The real deal: parsing

Glossing over the remaining details of the preprocessor and how it uses the classes mentioned above to produce the XML, let’s turn our attention to the parser and look at a typical pass when you click on a Wikipedia page and ask for the HTML representation of the wikitext that was saved. Keep in mind that wiki pages are cached whenever possible, so you may not be calling the parser directly when you click on a page.

Here’s a typical, generalized function call tree of the parser (of a current revision of a page), starting with the Article object.

Parse function call tree

Let’s take a look at these functions. Again, these are the _major_ functions at play, not all of them. Numbers 2-4 retrieve and return the article’s wikitext from the database. This text is passed to outputWikiText, which prepares the parse for Parser::parse(). It gets interesting again from 8-11. Inside replaceVariables, the text is preprocessed into its DOM representation, iterating through each character in the article to find beginning and ending marks for templates, subtemplates, and the other nodes mentioned above.

Number 11 is an interesting step which I’m going to skip over for the moment, because it requires some knowledge of the other classes in the Preprocessor_DOM.php file (mentioned above). Expand is very important and does a lot of things (among other things, it is called recursively), but suffice it to say that it has the job of actually getting the text within the DOM’s nodes (remember that templates can be nested, so you may not already have the full output text from each transcluded page) and returning valid expanded HTML text, with the exception of three main areas: tables, links, and (un)numbered lists. So in our example above, “{{mytemplate}} this is a [[test]]”, the expand() return value would be:

“I’ve included the [[text]] from my template. this is a [[test]]”

As you can see in this simplified example, at this point everything except tables, links, and the (un)numbered lists is parsed.
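Here is a rough Python sketch of that expansion step, again a toy and not the real PPFrame_DOM::expand(). It assumes template pages are available as already-preprocessed XML strings in a plain dictionary (the TEMPLATES store and the depth limit of 40 are made up for this example), and it walks the DOM recursively, swapping each template node for its expanded text.

```python
import xml.etree.ElementTree as ET

# Hypothetical template store: page title -> already-preprocessed XML.
TEMPLATES = {
    "mytemplate": "<root>I've included the [[text]] from my template.</root>",
}

MAX_DEPTH = 40  # arbitrary guard against templates that include themselves


def expand(node, templates, depth=0):
    """Return the text of `node` with every <template> child expanded."""
    if depth > MAX_DEPTH:
        raise RecursionError("template nesting too deep")
    parts = [node.text or ""]
    for child in node:
        if child.tag == "template":
            title_node = child.find("title")
            title = ""
            if title_node is not None:
                # The title may itself contain templates, so expand it too.
                title = expand(title_node, templates, depth + 1).strip()
            if title in templates:
                body = ET.fromstring(templates[title])
                parts.append(expand(body, templates, depth + 1))
            else:
                parts.append("{{" + title + "}}")  # unknown template: leave as is
        else:
            parts.append(expand(child, templates, depth + 1))
        parts.append(child.tail or "")  # text that follows the child node
    return "".join(parts)


if __name__ == "__main__":
    dom = ET.fromstring(
        "<root><template><title>mytemplate</title></template>"
        " this is a [[test]]</root>"
    )
    print(expand(dom, TEMPLATES))
```

Note that the links come out untouched, which is exactly the state described above: templates are resolved, links are still wikitext.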

Links are special

Yes, links get their own separate section. Not only are they probably the most essential part of what makes a wiki a wiki (besides the editing capability), they are also the most specially handled of all markup in the parser code (currently, that is). What makes them special is that they are handled in two parts: the first marks every link with a unique ID, and the second replaces the “link holders” with valid HTML. So in our example, here is the output after the first part.

“I’ve included the <!--LINK 0--> from my template. this is a <!--LINK 1-->”

As you can imagine, there is also an array that maps the link text to the LINK #IDs, which is a Parser class variable called mLinkHolders. Besides the mapping, it also stores the Title objects of each link.

So the second part of the link parsing is to use this array and do a simple find and replace. Now we’re done! Ship the parsed text out the door!
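The two-part link handling can be sketched in a few lines of Python. This is only an illustration of the idea, not the parser’s code: the regular expression handles just [[target]] and [[target|label]], the holders list stands in for mLinkHolders, and the /wiki/ URL prefix is an assumption on my part.

```python
import re
from html import escape

# Matches [[target]] and [[target|label]]; a simplification of real link syntax.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")
HOLDER_RE = re.compile(r"<!--LINK (\d+)-->")


def hold_links(text):
    """Part one: swap every link for a numbered placeholder comment."""
    holders = []  # index -> (target, label); stand-in for mLinkHolders

    def stash(match):
        target = match.group(1)
        label = match.group(2) or target
        holders.append((target, label))
        return f"<!--LINK {len(holders) - 1}-->"

    return LINK_RE.sub(stash, text), holders


def replace_holders(text, holders):
    """Part two: find each placeholder and replace it with an HTML anchor."""

    def render(match):
        target, label = holders[int(match.group(1))]
        href = "/wiki/" + target.replace(" ", "_")  # assumed URL scheme
        return f'<a href="{escape(href)}">{escape(label)}</a>'

    return HOLDER_RE.sub(render, text)


if __name__ == "__main__":
    held, holders = hold_links(
        "I've included the [[text]] from my template. this is a [[test]]"
    )
    print(held)
    print(replace_holders(held, holders))
```

Splitting the work this way means everything that runs between the two parts sees only opaque placeholders, so it cannot mangle link text by accident.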

Next up

In installment 2 of 2, I’m going to concentrate more on the preprocessor and detail what each of the classes in the Preprocessor_DOM.php file does and how it is used in building the initial XML DOM. I’ll also talk about how I hacked this to cache infoboxes for faster retrieval in an extension called Unbox.

What if random IO speed wasn’t a database limitation?

This isn’t so far-fetched, particularly today with the growing use of solid state drives. In what scenarios would MySQL perform better if it used a DRAM-based SSD? With extremely low latency and very fast random IO compared to mechanical devices, IO-bound database workloads could change considerably. Clustered indexes would matter less: as long as the number of page accesses doesn’t increase, there is little to gain from arranging data so it can be read purely sequentially (that is, data that has some relationship to its adjacent node, which is why it’s stored for sequential access). Joins lose some of their cons, because random IO doesn’t matter as much. It would be interesting to see benchmark data comparing a database’s performance on SSDs versus traditional disks. Has anyone done this type of analysis?

Deki Wiki Worth Checking Out

Many of you probably know that I’m a little bit of a wiki geek, so when I do the rare thing of pointing out a wiki product, it’s because it’s got potential. I just stumbled on Deki Wiki today, and after checking out the interface and doing a quick browse of the sources, I can say it’s definitely a breath of fresh air. Where did this product come from?! They derived it from MediaWiki, then added WYSIWYG editing, decoupled the presentation layer from a powerful business logic API written in C# for Mono, and made it open source. It’s got all the things you’d expect living in a MediaWiki-centric world, but many times the flexibility. Now if only they had an upgrade path that translated existing MediaWiki markup into their XHTML markup formatting…

Check out the explanation of its architecture in this video. (Once I figure out how to embed a Viddler video, I’ll do that instead)
