Where pegs grow legs: hanging ideas on words

“I have no special talents. I am only passionately curious.” ~ Albert Einstein

Metadata vs. data: an artificial but existential distinction

Two interesting blog posts arguing that the distinction between data and metadata is artificial, and merely a functional difference, caught my eye, so I thought I’d join in. We were also talking about this just the other day in terms of data.gov, the site cataloging government data feeds. We felt that the site might be more accurately termed metadata.gov, since it stores data about data. (I don’t think it necessarily should be renamed, because that would be more confusing to most people.)

I would argue that the physical world naturally distinguishes data from metadata, and not just in computer software. Human memory seems to do so, recording just the most important snippets of events and occurrences so that we are able to reconstruct past events later and, more importantly, find them in their complete form when we need them. I remember places, but just enough to be able to find them again (and, when I’m lucky, enough to be able to name and describe them to other interested people). I remember ideas in articles and books, but just enough to be able to summarize them later, and maybe find the full excerpt when I need it. What my brain remembers is clearly data, but a distinct subset of the data.

Thought of in that way, metadata is the most pertinent, useful data about the data. That certainly makes it functional data. However, defining what that data is is difficult, because it clearly depends on who you ask. I find some things more pertinent than other people do, and vice versa. But even in the physical world, I think it is entirely natural to separate data from metadata, as that is the only way the finite, limited capacities of our minds are able to store our existence in a functional way.

So the distinction between metadata and data is a functional difference, but not merely one. Rather, the difference is existential, and without it in the physical world, we wouldn’t be able to function. In the computer world, though, we could do without it, and hopefully, with projects like FluidDB (I haven’t checked it out yet), as well as new approaches to the limitations that keep metadata around (B-tree lookups are still much faster than full scans), we’ll have interesting new possibilities in the digital world.

Of course, that doesn’t keep it from being messy. Where the distinction between data and metadata lies is, well, in the eye of the beholder.


Wikimania 2009: Recent work a boon for government communities

Wikipedia is growing up. That was my overall impression of the recent 2009 Wikimania conference in Buenos Aires. This is great news for the government community, which, more or less culturally and certainly technically, has followed Wikipedia’s lead. In a similar government project, for example, there is the idea of “sysops”, community organizers and wiki administrators. There is also the idea of open editing, in which all individuals’ edits are considered equal across all topics. There is no delineation of specialty other than the one a contributor makes in choosing which articles to work on. While there are some major differences between the projects, the general ethos is the same.

While an amazing success, Wikipedia has been plagued by usability problems, subtle bias in articles (a matter of trust), and a real lack of understanding of its own community. These issues are not new, either – I remember us talking about them in 2006. But this year felt different: the Wikimedia Foundation seems ready to tackle these issues head on, and there is concrete progress to show.

Technical Progress

I was glad to see some very promising projects at the conference that the technical community has been working on, many of which will benefit the government community. Anyone who has ever used MediaWiki’s discussion system knows that it leaves a lot to be desired. LiquidThreads is an extension that looks like it will be a great replacement for the hodgepodge of colons, double colons, and other syntax currently used to visually denote a “thread” in a discussion.
From techblog.wikimedia.org: the new discussion system.

The new system makes it easier to start new discussions, follow existing ones, and quickly summarize the contents of an entire discussion. Take a look at a test system here.

Flagged Revisions

Wikipedia appears to be interested in introducing the idea of “flagged” revisions: revisions that have been reviewed by a group of designated experts on a particular topic. The reviewers would be experienced individuals who have clearly demonstrated their expertise in some arena, e.g. biology. These reviewers won’t be allowed to change a revision, but they can rate (flag) it. This ensures that consumers such as newspapers can be confident that, at a particular point in time, an article did not contain egregious errors and can be better trusted for use as a reference. Last-minute edits that introduce embarrassing errors into an article won’t end up in the NY Times.

For the government community, this is huge. One of the major complaints about the raw wiki system is that it lessens the role of subject matter experts, who form the vast majority of the people doing the analysis produced today. Unlike the volunteers on Wikipedia, they are paid to work on subjects that matter to the organizations they work for. The flagged revisions extension will hopefully provide a middle ground between the two camps. I can only see good things coming out of this, particularly as the extension and the idea underlying it are refined.

PDF export

This was also the first time I saw the PDF export tool, which was developed with help from PediaPress, the organization that publishes Wikipedia articles in printed form. It was always an item many folks asked for, and I think they’ve finally gotten the implementation right. Kudos.

More Trust

Additionally, Wikipedia will be testing an extension that adds “trust info” to articles. The extension, developed by Luca de Alfaro and others at UCSC, batch processes every revision of an article to determine the authorship of each word in the current version, and it also infers which phrases are likely controversial or hotly debated, based on the number of times a phrase has been reverted. They’ve implemented the capability as a Firefox extension, which you can download here. This could be potentially interesting for understanding more about the stability of content. I also think the government community could extend this to look at groups of people, by categorizing editors by type (e.g. agency).

Usability

New beta version of Wikipedia

Lastly, I think some of the greatest work, and the most telling as to the project’s maturation, is the work on usability. It is fairly well established that editing Wikipedia is not as easy as it should be. The Wikimedia Foundation has hired several developers to tackle some of the problems. Here is what is in their current release.

  • Tab reorganization – The new interface, called Vector, provides a clear indication of the “read” and “edit” state, whether you are on an article or a discussion page.
  • Edit toolbar improvements – An action-grouped, expandable toolbar hides infrequently used tool icons; power users can expand the toolbar to access specialized tools. Special characters and help references (a cheat sheet) are built into the toolbar for easy access, and special characters are displayed based on each language site’s configuration. Toolbar icons were redesigned by reusing Tango and GNOME icons.
  • Improved search interface – The search results page is often the entry point to articles. The visibility of relevant results is increased by removing clutter.
  • General aesthetic improvements – Some aesthetic improvements have been applied and redundant information has been removed.
  • Opt-in/opt-out switch and survey – Since the features above are deployed as user preferences, an opt-in/opt-out page allows logged-in users to turn the usability initiative’s preferences on and off all at once. Users are asked to take a short survey when they opt out.

You can try it too, by clicking on the “Try Beta” link at the top of the page. They are looking for feedback at this point, so please let them know what you think!

Statistics are huge

So much can still be learned from the data in Wikipedia: from the works themselves, to the rate at which pages are created, to the rate at which new users from different countries drop off the project. At the keynote, Jimmy Wales showed a slide that highlighted cultural differences between Wikipedias, as well as areas of potential concern. There is so much more that can be learned from this massive data store, and we’ve only scratched the surface. I’m hoping to get involved in building tools that help automate additional, unique ways of looking at the data, while also providing a strategic mechanism to ensure we’re answering the questions that are most interesting to the community. I’m hoping this type of work will help Wikimedia gain a deeper understanding of the dynamics of its community, making it an even bigger success.


GM – preparing for their future

I finally found the first real article today on Opel, GM’s large European division that was recently sold to Magna International, the Canadian car parts manufacturer, as part of the GM bankruptcy plight. When I first heard about GM killing off Saturn (a line of cars that are essentially Opel designed and built, wearing the Saturn badge), I was surprised that no one was talking about how ridiculous a decision that would be for the future of GM. From a political and American-jobs perspective, the sale entirely makes sense: it will save American jobs that would otherwise have gone to Germany, while at the same time helping the effort to consolidate the insane number of brands GM maintains.

Yet when I was searching for a new car last year, the Saturn Astra was the only American car that became a final contender for my money. Of the 10 or so cars I looked at and test drove, 3 were American, 1 was German, and the rest were Japanese. When I narrowed it down, the top three were the Saturn Astra, the Subaru Outback Sport, and the Mazda 3. At the end of the day, I bought the Mazda 3. Saturn was the Hail Mary: it barely made the top-10 cut (I almost didn’t hear about it), then rocketed out of the blue to come in a close second. Of all the cars I test drove, I was most pleasantly surprised by the Saturn (er, Opel) brand – finally, I felt, GM was doing something right. They really were starting to listen to what Americans want to drive.

So when I heard GM was going to nix Saturn and sell off most of Opel, I was very surprised that no one was talking about it. Until today, anyway. Read this article when you get the chance; it provides a good critique of the GM merger and explains why, even though the recent decisions will help GM in the short term, they are exactly the wrong decisions for GM in the long term. As an American taxpayer, be prepared for a long and frustrating story that ends with more public money wasted than necessary.


Your first MySQL source code patch

At the MySQL Hackfest Camp at the MySQL conference, Mark Callaghan, MySQL coder extraordinaire, helped us hack our first change into the MySQL source code. In an hour, we implemented an additional command, SHOW HELLO, in the MySQL server. The command isn’t very useful in and of itself, but I wanted to share what Mark showed us about modifying MySQL’s SQL parsing code. Getting started on your own patch isn’t as hard as you might think!

It was nice having a MySQL code expert by our side, telling us which files we needed to touch to make our changes work. Here is how he showed us to add a “SHOW HELLO” command to the existing SHOW commands in the MySQL server. Once you’re done, it works like the following:

mysql> SHOW HELLO;
+--------------------+
| Hello Output       |
+--------------------+
| Good day to you!   |
+--------------------+
1 row in set (0.00 sec)

mysql>

Files we’ll be modifying

There are a few files we’ll have to open. Luckily, all of them are in the sql/ directory. This makes sense since we’re modifying the parser. Here’s the full list at a glance.

  • sql/sql_yacc.yy – modifications in 4 places – bison then converts this file into .h and .cc files
  • sql/lex.h – modifications in 1 place – reserve HELLO as a symbol
  • sql/sql_lex.h – modifications in 1 place – let server know about new command
  • sql/sql_parse.cc – modifications in 1 place – maps parsing to execution code
  • sql/mysql_priv.h – modifications in 1 place – declare function that does execution in sql_show.cc
  • sql/sql_show.cc – implement the actual function that does the work

The modifications

sql_yacc.yy

Just after my_yyoverflow(), around line 321, add a HELLO_SYM token to the token list.

%token  HAVING
%token  HELP_SYM
%token  HELLO_SYM
%token  HEX_NUM
%token  HIGH_PRIORITY

Around line 6913, tell the parser to recognize the new SHOW command.

| MUTEX_SYM STATUS_SYM
  { Lex->sql_command = SQLCOM_SHOW_MUTEX_STATUS; }
| opt_full PROCESSLIST_SYM
  { Lex->sql_command= SQLCOM_SHOW_PROCESSLIST;}
| HELLO_SYM
  { Lex->sql_command= SQLCOM_SHOW_HELLO;}
| opt_var_type  VARIABLES wild_and_where
  {
    LEX *lex= Lex;
    lex->sql_command= SQLCOM_SELECT;
    lex->orig_sql_command= SQLCOM_SHOW_VARIABLES;
    lex->option_type= $1;
    if (prepare_schema_table(YYTHD, lex, 0, SCH_VARIABLES))
      YYABORT;
  }

Around line 8047, add HELLO to the keyword list.

| GLOBAL_SYM            {}
| HASH_SYM              {}
| HELLO_SYM             {}
| HOSTS_SYM             {}
| HOUR_SYM              {}

lex.h

In the symbols[] array, around line 226.

{ "HASH",             SYM(HASH_SYM)},
{ "HAVING",           SYM(HAVING)},
{ "HELLO",            SYM(HELLO_SYM)},
{ "HELP",             SYM(HELP_SYM)},
{ "HIGH_PRIORITY",    SYM(HIGH_PRIORITY)},

sql_lex.h

In the enum_sql_command enum, add the new command.

SQLCOM_SHOW_INNODB_STATUS, SQLCOM_SHOW_NDBCLUSTER_STATUS, SQLCOM_SHOW_MUTEX_STATUS,
SQLCOM_SHOW_PROCESSLIST, SQLCOM_SHOW_MASTER_STAT, SQLCOM_SHOW_SLAVE_STAT, SQLCOM_SHOW_HELLO,
SQLCOM_SHOW_GRANTS, SQLCOM_SHOW_CREATE, SQLCOM_SHOW_CHARSETS,

sql_parse.cc

Somewhere inside mysql_execute_command()’s switch statement, add the following case. I added mine around line 3725.

case SQLCOM_SHOW_PROCESSLIST:
  if (!thd->security_ctx->priv_user[0] &&
      check_global_access(thd, PROCESS_ACL))
    break;
  mysqld_list_processes(thd,
                        (thd->security_ctx->master_access & PROCESS_ACL ?
                         NullS :
                         thd->security_ctx->priv_user),
                        lex->verbose);
  break;
case SQLCOM_SHOW_HELLO:
  mysqld_print_hello(thd);
  break;
case SQLCOM_SHOW_STORAGE_ENGINES:
  res= mysqld_show_storage_engines(thd);
  break;

mysql_priv.h

Around line 909, under the /* sql_show.cc */ comment, declare the new function.

/* sql_show.cc */
bool mysqld_show_open_tables(THD *thd,const char *wild);
bool mysqld_show_logs(THD *thd);
void mysqld_print_hello(THD *thd);
void append_identifier(THD *thd, String *packet, const char *name,
                       uint length);
int get_quote_char_for_identifier(THD *thd, const char *name, uint length);

sql_show.cc

Around line 1267, implement the new command that will run when you call SHOW HELLO.

#ifdef HAVE_EXPLICIT_TEMPLATE_INSTANTIATION
template class I_List<thread_info>;
#endif

void mysqld_print_hello(THD *thd)
{
  sql_print_error("enter_hello");
  Item *field;
  List<Item> field_list;
  field_list.push_back(new Item_empty_string("Hello Output",16));
  Protocol *protocol= thd->protocol;
  if (protocol->send_fields(&field_list,
                            Protocol::SEND_NUM_ROWS | Protocol::SEND_EOF))
  {
    sql_print_error("can't send_fields - hello");
    return;
  }
  protocol->prepare_for_resend();
  protocol->store("Good day to you!", system_charset_info);
  if (protocol->write())
  {
    sql_print_error("cant write - hello");
    return; /* purecov: inspected */
  }
  send_eof(thd);
  sql_print_error("exit_hello");
  return;
}

void mysqld_list_processes(THD *thd,const char *user, bool verbose)
{
  Item *field;
  List<Item> field_list;

Testing

Once you’ve made the changes, compile MySQL using make and test it. You should do this for every change you intend to contribute back to MySQL. To test, you need to create a .test and a .result file inside the mysql-test/t and mysql-test/r directories, respectively. You can name the files whatever you want; just make sure both use the same name.

mysql-test/t/hello_file.test

# Test for show hello
show hello;

mysql-test/r/hello_file.result

show hello;
Hello Output
Good day to you!

Once you’ve added those files, go into the mysql-test directory and run ./mtr hello_file. This runs just the new test you added. If it passes, your change works; if not, you need to go back and figure out what’s wrong.

Summary

That’s it! That wasn’t too hard, was it? I realize you probably would have had a much harder time finding all these places on your own, but once someone shows you which pieces are important, you should be able to make other changes easily. Hopefully there are others very familiar with the MySQL code base who would be willing to write a post about how to write a UDF or add a new SQL function. Plus, I’d love to hear how best to send MySQL’s row change log output to something like Hypertable or HBase!


GPS Tracker – save your entire trip on a keychain

I recently bought a pocket-sized GPS tracker that is literally meant to fit on your keychain.

I’ve been looking for something this simple for a while now – all I wanted was to keep track of where I’ve _been_ so that I can refer to it later on a full computer; for that, this device exceeds expectations. I have a motorcycle, and it’s really nice to be able to show off routes and back roads I recommend taking, not to mention remind myself where “that picturesque sunset view” was when I serendipitously passed through a not-so-well-known part of town. Here’s a quick lowdown of what the Taiwanese-designed Qstarz can do.

Features

The BT-Q1300S is primarily geared toward runners and the fitness-minded, since it includes a nice sweatproof arm band for physical activities. Don’t let that fool you, though, because it packs a lot of functionality into a small device. It tracks latitude/longitude, plus altitude and speed, over time. Like I was saying, it’s about the size of a regular keychain, and it has only one button, a couple of LEDs for status, and a mini-USB port to connect to your computer in order to make the nice pretty graphs I’ve got below.

The device’s interface for starting and stopping logging, setting a waypoint, and turning it on and off takes a little time to learn, but it is fairly simple and straightforward once you know the gestures. For example, to turn the device on, hold the button for 4 seconds. Once it’s locked onto satellites (one of the LEDs starts blinking to let you know), press and hold the button again for another 2 seconds to start logging. From this point, you can save a waypoint by pressing the button each time you want to record a location, or just let it save waypoints automatically. To stop logging, press and hold for another 2 seconds; now you’re back in power-save mode. Hold for 4 more seconds to turn it off completely. Voila.

Smart waypoint saving

If you’ve ever used a GPS device before, you’ll understand how frustrating it is to later view your trip and see “clusters of waypoints” saved at the same location, or worse, no waypoints at all between important roads because you were traveling too fast. This is the result of using only one dimension (time) to decide when to save waypoints, and it’s all too common in modern devices. One of my favorite things about the Qstarz is that you can configure it to save waypoints using three separate criteria: time, distance, and speed. These can also be combined in a logical AND fashion, which adds to its flexibility. For example: save a waypoint every 1 second AND when you’ve gone more than 50 feet AND you’re traveling faster than 5 mph. This is awesome for not logging unnecessary waypoints when you’ve come to a complete stop, or when you’re traveling really, really slowly and don’t need an update every x seconds.
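For illustration, here is a minimal Python sketch of that decision logic (my own reading of how the combined criteria behave, not the device’s actual firmware): a waypoint is recorded only when every enabled condition is satisfied.

from dataclasses import dataclass

# Illustrative sketch of combined logging criteria (not Qstarz firmware).
@dataclass
class Criteria:
    min_seconds: float = 1.0   # at least 1 s since the last waypoint
    min_feet: float = 50.0     # moved at least 50 ft
    min_mph: float = 5.0       # traveling faster than 5 mph

def should_log(elapsed_s: float, distance_ft: float, speed_mph: float,
               c: Criteria) -> bool:
    # Logical AND: every criterion must be met before a waypoint is saved.
    return (elapsed_s >= c.min_seconds and
            distance_ft >= c.min_feet and
            speed_mph > c.min_mph)

print(should_log(10.0, 0.0, 0.0, Criteria()))    # stopped at a light -> False
print(should_log(1.2, 80.0, 35.0, Criteria()))   # cruising a back road -> True

With thresholds like these, sitting at a red light logs nothing, while cruising logs a point roughly every second.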

Let’s get to the demo!

OK, the cool stuff now. Here is an example of a trip I took recently. The software can export an HTML file (hooking into Google Maps for the map), which I’ve included as an iframe here.

It also creates some interesting line graphs to view your data points. Here is an example of speed over time.

An example of one of the line graphs Qstarz creates.

What’s missing

I really wish I could give this device 5 stars, but there are a couple of important things missing that unfortunately keep me from giving it ultimate honors. Most importantly, the software has no way to edit waypoints. This matters if you want to build a view of just part of a trip taken from the device, or if you find that some of the waypoints are inaccurate and you just want to delete them (this has happened). Since it already creates the full KML and other formats for you on export, and since the software uses the Google Maps API internally to view your waypoints as they are imported, this would be a trivial feature to add.

Additionally, though less important, the device’s software won’t create an export that plugs straight into a blog like this one using an iframe. You see this on a lot of sites nowadays (including Google Maps), so it’d be nice to see here as well. I’m sure the vast majority of bloggers wouldn’t figure out how to take the HTML export and make it work in an iframe (I also had to add my site’s Google Maps API key to the HTML source, so if I weren’t a web software engineer, I think it would have been hard to get working).


The beauty of ssh

I thought I’d share one little, less commonly used, but very useful capability of ssh. It is technically called local application-level port forwarding, and I use it quite often. There are two use cases I find it useful for:

  1. Secure browsing at conferences and the like
  2. Providing an easy way to access local network resources, such as an internal wiki, on my home network.

The command: ssh -l chris -D 12345 local.musialek.org

Setup and Assumptions

I assume that you have a broadband connection to the internet with some sort of router behind it (very common these days). Also, in my example, local.musialek.org has been set to the IP address of my router. Last, in my router’s configuration, I have set up port forwarding to send packets arriving on port 22 (standard ssh traffic) to the machine on the local network that actually runs sshd. There are other ways of doing this, but I’m not going to go through them, as that is not the point of this post.

Browser proxy configuration

Assuming these things are set up, open a command line window, type in the command, and log in to your ssh server. Next, we need to configure a SOCKS proxy in the browser. By connecting to our ssh server with the -D option, we’ve also opened a local listening port on 12345 that forwards packets along our ssh “tunnel”. SOCKS is an interesting protocol whose job is simply to facilitate communication between other protocols (but only at the higher layers of the OSI model); it is what allows us to talk multiple protocols over our tunnel. To be precise, any protocol above layer 5, the session layer, can be proxied, which includes FTP, HTTP, HTTPS, LDAP, DNS, DHCP, etc. It’s most commonly used with HTTP, however, and that’s what I want to show today.

Since it’s more easily configurable, I use FoxyProxy to set up the SOCKS proxy in Firefox. You’ll see options to configure SOCKS.

FoxyProxy settings

Make sure to set the port to 12345, which is what we configured our ssh client to listen on (with the -D option). Hit OK, and now you’re browsing the internet over your ssh tunnel. That’s it!
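If you want to sanity-check the tunnel outside the browser, here’s a small illustrative Python snippet. It assumes the requests library is installed with SOCKS support (pip install requests[socks]) and uses a public “what’s my IP” service as an example; through the tunnel, it should report your home connection’s address rather than the network you’re sitting on.

import requests

# The same local port we asked ssh to listen on with -D 12345.
proxies = {
    "http":  "socks5h://localhost:12345",   # socks5h also resolves DNS at the far end
    "https": "socks5h://localhost:12345",
}

# Illustrative what's-my-IP service; any similar service works.
print(requests.get("https://ifconfig.me/ip", proxies=proxies, timeout=10).text)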

Use #1 – Browsing securely

So, use case number one: browsing securely. With the above setup, you’re at a conference, people are potentially looking at your traffic, and you don’t want that. So just start an ssh session back home, configure your browser’s proxy, and voila! Your entire session is now encrypted and comes out of your wireless router back home.

Use #2 – Browse internal resources

I’ve got a wiki on my local network that you can’t reach from the internet, and I use it for keeping track of more personal things like my grocery list and the recipes I’ve collected over the years. Obviously, when I’m home and on my local network it is accessible, but when I’m not home, I can’t get to it. With the beauty of ssh, now I can. Best of all, since DNS is also proxied (via the browser), any internal DNS I configure doesn’t have to be set up on my laptop!


The MediaWiki parser, uncovered

The MediaWiki parser is one of the most essential and yet most complex pieces of code in the entire MediaWiki project. Without it, you would not be able to mark up Wikipedia pages with sections, links, or images, nor view or easily change the markup of others. Yet it is still flexible enough to allow beginners and HTML experts alike to contribute to pages. This has made the parsing code somewhat complex, and it has gone through many iterations over the years. Yet even today, it is still fast enough for Wikipedia, one of the largest web sites in the world. Let’s take a look under the covers of this underappreciated (and perhaps slightly daunting) piece of code.

A short history

First, a disclaimer: this history is as I understand it, taken largely from discussions I’ve followed intently over the years on the Wikimedia mailing lists, as well as discussions at the 2006 Wikimania conference. Up until about a year ago, the MediaWiki parser suffered from extreme complexity, driven by the need to stay single-pass (for speed), but also because additional, sometimes new rules kept being tacked onto the existing code. Over time, it became a spaghetti mess that was difficult to debug and even tougher to improve. Rewriting it was made almost impossible by the fact that it was so essential to the software: millions of pages on Wikipedia could easily have been turned into gobbledygook in an instant if changes were not handled correctly.

What to do

There were a lot of discussions about solving the problem. They included rewriting the parser in C, which would greatly improve speed and thus allow for potential multi-pass parsing, an approach that could deal with the increasing number of templates, and templates of templates, being transcluded into Wikipedia pages. They also included changing the MediaWiki syntax so that certain ambiguities (such as whether a run of apostrophes means bold or italic, or whether triple brackets or double brackets are intended in templates) would be removed. In the end, what they decided on, which I think was a brilliant idea, was to leave the parser in PHP (rewriting in C would probably have produced two classes of MediaWiki developers) and divide parsing into two steps: preprocessing and parsing. The job of the preprocessor is to produce an XML DOM representation of the wikitext. The parsing step then iterates through the DOM structure as many times as it needs (e.g. to expand templates) to produce valid, static HTML. Iterating through the DOM is lightning fast and also very natural from an XHTML point of view, and there is good support for it in PHP.

The preprocessor

You will find two versions of the preprocessor, the Hash version and the DOM version, in /includes/parser/Preprocessor_Hash.php and /includes/parser/Preprocessor_DOM.php respectively. We will concentrate only on the DOM version, which is practically identical to the Hash version but faster, because it takes advantage of PHP’s XML support (an optional component in PHP). The most important function in the preprocessor class is called preprocessToObj(). Inside the Preprocessor_DOM.php file, there are a couple of other important classes that the preprocessor uses: PPDStack, PPDStackElement, PPDPart, PPFrame_DOM, and PPNode_DOM.

The preprocessor produces less than you think

So what does the MediaWiki XML look like? Here’s an example of the text representation of the XML output of the wikitext “{{mytemplate}} this is a [[test]]”:

<root><template><title>mytemplate</title></template> this is a [[test]]</root>

Notice how the internal link is not preprocessed at all. While the current code hints that this may change in the future (and it would make sense to do so), the only real work the preprocessor does is create XML elements for templates and a couple of other items. Here are the possible items, i.e. base nodes, in full:

  • template, tplarg, comment, ext, ignore, h

If you’ve ever worked with MediaWiki wikitext, you should already know what specific text each of these base nodes corresponds to. Nonetheless, here they are:

  • template = double-brackets, ( {{…}} )
  • tplarg = triple-brackets, ( {{{…}}} )
  • comment = Any type of HTML comment ( <!-- --> )
  • ext = Node reserved for anything that should get parsed in an extension
  • ignore = Node for wrapping escaped tags of type noinclude, as well as tag and text of type includeonly
  • h = Node for wrapping sections

That’s it. Anything else gets ignored and returned in its original wikitext to the parser.

How the preprocessor works

There is nothing special here, but it is worth noting. In order to produce the XML representation we need, the preprocessor must iterate through each character in the wikitext. There is no other way to account correctly for recursive templates, which can be written in a myriad of ways due to the syntax. So if our Wikipedia article is 40,000 characters long, it is very likely the loop will run around 40,000 times. Now you’re beginning to see why speed was such an issue.
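To make the idea concrete, here’s a toy Python sketch of that single character-by-character scan (not MediaWiki’s PHP code, and ignoring template parameters, comments, headings, and the other node types): it wraps {{…}} pairs in <template> nodes the way the example above was produced, recursing so nested templates become nested nodes.

def preprocess(text):
    # Single pass over the text; only the template case is handled here.
    out, i, n = [], 0, len(text)
    while i < n:
        if text.startswith("{{", i):
            depth, j = 1, i + 2
            while j < n and depth:          # find the matching "}}"
                if text.startswith("{{", j):
                    depth += 1; j += 2
                elif text.startswith("}}", j):
                    depth -= 1; j += 2
                else:
                    j += 1
            inner = preprocess(text[i + 2:j - 2])   # recurse for nested templates
            out.append("<template><title>%s</title></template>" % inner)
            i = j
        else:
            out.append(text[i]); i += 1    # everything else passes through untouched
    return "".join(out)

print("<root>%s</root>" % preprocess("{{mytemplate}} this is a [[test]]"))
# -> <root><template><title>mytemplate</title></template> this is a [[test]]</root>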

The real deal: parsing

Glossing over the remaining details of the preprocessor and how it uses the classes mentioned above to produce the XML, let’s turn our attention to the parser and take a look at a typical pass when you click on a Wikipedia page and ask for the HTML representation of the wikitext that was saved. Keep in mind that wiki pages are cached whenever possible, so you may not be calling the parser directly when you click on a page.

Here’s a typical, generalized function call tree of the parser (of a current revision of a page), starting with the Article object.

Parse function call tree

Let’s take a look at these functions. Again, these are the _major_ functions at play, not all of them. Steps 2-4 retrieve and return the article’s wikitext from the database. This text is passed to outputWikiText, which prepares the text for Parser::parse(). It gets interesting again in steps 8-11. Inside replaceVariables, the text is preprocessed into its DOM representation, iterating through each character of the article to find beginning and ending marks for templates, subtemplates, and the other nodes mentioned above.

Step 11 is an interesting one that I’m going to skip over for the moment, because it requires some knowledge of the other classes in the Preprocessor_DOM.php file (mentioned above). expand() is very important and does a lot of things (among them calling itself recursively), but suffice it to say that its job is to actually get the text within the DOM’s nodes (remember that templates can be nested, so you may not yet have the full output text from each transcluded page) and return valid expanded HTML text, with the exception of three main areas: tables, links, and (un)numbered lists. So in our example above, “{{mytemplate}} this is a [[test]]”, the expand() return value would be:

“I’ve included the [[text]] from my template. this is a [[test]]”

As you can see in this simplified example, at this point everything except tables, links, and the (un)numbered lists is parsed.

Links are special

Yes, links get their own section. Not only are they probably the most essential element of what makes a wiki a wiki (besides the editing capability), they are also the most specially handled of all markup in the parser code (currently, that is). What makes them special is that they are handled in two parts: the first marks every link with a unique ID, and the second replaces the “link holders” with valid HTML. So in our example, here is the output after the first part:

“I’ve included the <!--LINK 0--> from my template. this is a <!--LINK 1-->”

As you can imagine, there is also an array that maps the link text to the LINK IDs, a Parser class variable called mLinkHolders. Besides the mapping, it also stores the Title object of each link.

So the second part of the link parsing is to use this array and do a simple find and replace. Now we’re done! Ship the parsed text out the door!
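As a rough sketch of that two-pass approach (in Python rather than the parser’s PHP, ignoring piped links and namespaces, and with a made-up /wiki/ URL scheme standing in for MediaWiki’s real link resolution):

import re

# Pass 1: swap each [[target]] for a numbered placeholder and remember the
# target in a holder table (the analogue of mLinkHolders).
def replace_links(text):
    holders = []
    def hold(match):
        holders.append(match.group(1))
        return "<!--LINK %d-->" % (len(holders) - 1)
    return re.sub(r"\[\[([^\]|]+)\]\]", hold, text), holders

# Pass 2: once everything else is parsed, swap the placeholders for HTML.
def replace_link_holders(text, holders):
    for i, target in enumerate(holders):
        href = "/wiki/" + target.replace(" ", "_")   # illustrative URL scheme
        text = text.replace("<!--LINK %d-->" % i,
                            '<a href="%s">%s</a>' % (href, target))
    return text

marked, holders = replace_links("I've included the [[text]] from my template. this is a [[test]]")
print(marked)                                 # ... <!--LINK 0--> ... <!--LINK 1-->
print(replace_link_holders(marked, holders))  # ... <a href="/wiki/text">text</a> ...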

Next up

In installment 2 of 2, I’m going to concentrate more on the preprocessor and detail what each of the classes in the Preprocessor_DOM.php file does and how they are used to build the initial XML DOM. I’ll also talk about how I hacked this to cache infoboxes for faster retrieval in an extension called Unbox.


Tagging files for archive

In search of a way to add folksonomy tags to files that I’d like to find again at some point in the long-term future, I came across these software tools for the Mac worth noting.

  • Tagit and Leap – Simple tagging and rating for OS X
  • Tagbot – Spotlight File Tagging For Mac OS X
  • FileSpot from Synthesis Studios

Personally, Tagit and Leap combined felt the most intuitive to use, although all of them did a decent job of integrating with Finder.


A best-of-the-best of data visualizations

While it may be old news for some, I still wanted to share what I think is definitely one of the best of the best in data visualization and data exploration that I have seen. We should all take notes on, and look up to, how Hans Rosling uses data to paint a much more vivid and valuable picture of his topic: global health.


Encrypt your Gmail – Firefox extension

I came across this today: someone has created a Firefox extension that lets you use OpenPGP with your Gmail account. It will allow you to encrypt and/or sign any of your Gmail messages. After you install the extension, when you compose a message in Gmail, it adds buttons to the interface to encrypt and/or sign your message using your imported GPG key. Because it’s not a key manager, you will also have to install GnuPG if you’re on Mac/Linux, or WinPT and GPG on Windows. Pretty cool!

