Why would you want to do this?
If you have ever browsed Wikipedia (and who hasn't?), then you have probably experienced the feeling of being happily lost, browsing from one interesting topic to the next and encountering information that you would never have searched for. Wikipedia's articles are peppered with hundreds of millions of links, which explain the topics being discussed, and create an environment where serendipitous encounters with information are commonplace.
The work described here aims to bring the same explanatory links—and the accessibility and serendipity they provide—to all documents.
Hasn't this been done already?
You may have seen other wikifier systems, like this one from the University of North Texas. Our system is fairly different. It doesn't just use Wikipedia as a source of information to link to, but also as training data for how best to do the linking. In other words, it has been trained to make the same decisions as the people who edit Wikipedia. Try them both out and see the difference for yourself.
Where can I read more?
Smarter Searching with Wikipedia
You probably use search engines every single day. They are our go-to guys for everything from homework, to song lyrics, to advice on getting the baby to sleep. Still, they could be better. The big names—Google, Yahoo, Microsoft—all agree: search is not solved. Oddly enough, it might not be just the geeks who solve it. If you have ever written anything for the online encyclopedia Wikipedia, then you may also have a part to play.
The big limitation of current search engines is that they don't understand us. When they are out crawling the web, they don't see people or places or events. They just see patterns of letters.
To put yourself in their shoes, imagine being a librarian: your job is to find books for people. Now imagine transferring to a library full of ancient Egyptian scrolls, where every request is given in obscure hieroglyphs. Your job would dissolve into tediously checking whether scrolls contained the same characters as the requests—just like a search engine does.
Sure, you would get better as time went on. You would remember common patterns, or that people prefer some scrolls over others. The real breakthrough, though, would be when you actually learned the language. For search engines, that breakthrough hasn't happened yet.
This is where Wikipedia comes in. To steal its own advertising blurb, Wikipedia is the free encyclopedia that anyone can edit. Since anyone can contribute, it grows incredibly quickly. It covers more-or-less everything, from nanotechnology to the working habits of Barbie dolls. This makes for a great place to search—it's one of the most visited websites on the planet.
Wikipedia might have even more potential. It might be able to teach search engines how to understand the rest of the web.
At the University of Waikato, we've been using Wikipedia to train a program to automatically recognize topics when they are mentioned. This learns from those little blue links that pepper Wikipedia's articles. In a recent experiment, we were able to take those links out and put most of them back in automatically. Only 26% of the old links were missing, and only 25% of the new ones didn't belong. That's not bad, considering you and I would argue about what should and shouldn't be linked.
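Those two percentages map onto the standard retrieval measures of recall and precision. The short sketch below just does that arithmetic: it assumes "26% of the old links were missing" means recall is 74%, and "25% of the new ones didn't belong" means precision is 75%. The combined F-measure is not a figure reported in this article; it is derived here only for illustration, and the function name is made up for the example.

```python
def precision_recall_f1(missing_rate: float, spurious_rate: float):
    """Convert the two error rates quoted above into precision, recall and F1.

    missing_rate  -- fraction of the original links that were not recovered
    spurious_rate -- fraction of the predicted links that did not belong
    """
    recall = 1.0 - missing_rate        # share of original links put back
    precision = 1.0 - spurious_rate    # share of predicted links that were right
    # Harmonic mean of the two (an illustrative addition, not from the article).
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(missing_rate=0.26, spurious_rate=0.25)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")
```

In other words, roughly three out of four original links come back, and roughly three out of four suggested links are ones a Wikipedia editor had actually made.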
We can run the same program on any webpage or document. Running it on this article, for example, yields 20 links to Wikipedia topics such as search engines, nanotechnology, and the University of Waikato. A search engine can use this to see people, places, events and ideas rather than just letters. This deeper understanding can be used to make the article easier to search for. For example, Wikipedia tells us other ways to look for search engines, such as web searching, search services, and even (in Spanish) motor de búsqueda.
We can also detect how important these topics are: this document is about search engines and Wikipedia, and mentions Barbie dolls only in passing. This summarizes the article, and provides tags under which it can be organized. It also tells search engines that this article isn't going to be helpful for someone searching for hieroglyphs.
We are also using Wikipedia to measure how topics relate to each other. This helps to organize documents, by showing that all of the important topics in this article relate to technology and belong in that section rather than sport or entertainment. This can also be used to suggest new topics for the reader. If you have persevered long enough to read this far, for example, then our algorithms suggest you investigate information retrieval and query expansion.
In the near future, Wikipedia might make such investigation a lot easier. In the meantime, why don't you share what you know by becoming a Wikipedian? Your contributions might have more of an impact than you think.
Can I make this part of my own system?
Yes! In fact, there are a few options:
The wikify web service is machine-readable, so feel free to call it from within your code. It can be made to return results directly (without the surrounding web interface) by appending &wrapInXml=false to the URL (like this). You can also make it return XML (with additional information) by appending &xml (like this).
There are several other parameters available for specifying the format of the document, how many links to create, etc. Check here for details.
You can also host the service yourself, or deal directly with the source code. It has been incorporated into the Wikipedia Miner Toolkit and released as open source under the GNU General Public License.
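If you go the web service route, building the request URL might look something like the sketch below. Only the &wrapInXml=false and &xml suffixes come from the description above. The base address is a placeholder, and the name of the parameter that carries the document ("source") is an assumption made for this example, so check the parameter details before relying on either.

```python
from urllib.parse import urlencode

# Placeholder address: substitute the real location of the wikify service.
BASE_URL = "http://example.org/services/wikify"


def build_wikify_url(document: str, *, raw: bool = False, xml: bool = False,
                     base_url: str = BASE_URL) -> str:
    """Build a request URL for the wikify service.

    The "source" parameter name is assumed for illustration; the two optional
    suffixes are the ones described above.
    """
    query = urlencode({"source": document})  # percent-encodes the text safely
    if raw:
        query += "&wrapInXml=false"  # results without the surrounding web interface
    if xml:
        query += "&xml"              # XML output with additional information
    return f"{base_url}?{query}"


print(build_wikify_url("search engines", raw=True))
print(build_wikify_url("search engines", xml=True))
```

The resulting string can be handed to whatever HTTP client your system already uses. Keeping URL construction in one small function makes it easy to add the other parameters later.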