The WikipediaSimilarity 353 Test Collection

The WikipediaSimilarity 353 Test Collection is a dataset for measuring semantic relatedness between articles in Wikipedia. It is an adaptation of an earlier dataset (the WordSimilarity 353 Test Collection) for measuring semantic relatedness between words.

Word Similarity

The original WordSimilarity 353 Test Collection is a set of 353 term pairs, each associated with between 12 and 15 human-assigned similarity judgements. It is available for download (along with details of how the data was obtained) here.

The following paper provides additional details:

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman and Eytan Ruppin, (2002) Placing Search in Context: The Concept Revisited, ACM Transactions on Information Systems 20(1), pp. 116-131

Wikipedia Similarity

The WikipediaSimilarity 353 Test Collection is a copy of the original dataset in which each term has been manually disambiguated to refer to a particular article in Wikipedia. It was used to develop and evaluate the Wikipedia Link Based Measure, an algorithm for measuring relatedness using Wikipedia's link structure as a source of background knowledge. The algorithm is available as open source (and can be used online) as part of the Wikipedia Miner toolkit.

The following paper provides additional details:

David Milne, Ian H. Witten (2008) An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceedings of the first AAAI Workshop on Wikipedia and Artificial Intelligence, Chicago, IL.
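
To make the idea of using link structure as background knowledge more concrete, here is a minimal sketch of one link-based relatedness score, modeled on the Normalized Google Distance applied to the sets of articles that link to each of two articles. It is an illustration only, not the Wikipedia Miner implementation. The function name, its parameters, the handling of non-overlapping link sets, and the conversion from distance to relatedness (1 minus distance, clipped at 0) are all assumptions made for this example.

```python
import math


def link_relatedness(links_a, links_b, total_articles):
    """Relatedness in [0, 1] from two incoming-link sets (illustrative sketch).

    links_a, links_b: ids of the articles that link to article A and article B.
    total_articles:   total number of articles in the collection.
    """
    a, b = set(links_a), set(links_b)
    common = a & b
    # Assumption: with no shared incoming links, treat the pair as unrelated.
    if not common:
        return 0.0

    larger, smaller = max(len(a), len(b)), min(len(a), len(b))
    numerator = math.log(larger) - math.log(len(common))
    denominator = math.log(total_articles) - math.log(smaller)
    # Degenerate case: the smaller link set already covers the whole collection.
    if denominator <= 0:
        return 0.0

    distance = numerator / denominator
    # Assumption: convert distance to relatedness and clip to [0, 1].
    return max(0.0, 1.0 - distance)


# Made-up link sets, for illustration only.
score = link_relatedness({1, 2, 3, 4}, {3, 4, 5}, total_articles=1000)
```

Two articles with identical incoming links score 1.0 under this sketch, and the score falls as the overlap shrinks relative to the size of the link sets.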

Availability and Usage

wikipediaSimilarity353.csv

If you publish results based on this dataset, please cite the papers listed above.
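
A common way to use a collection like this is to score every pair with an automatic measure and then compute a rank correlation against the human judgements. The sketch below shows that workflow with only the Python standard library. The column layout of wikipediaSimilarity353.csv is not described here, so the column names are passed in as parameters, and the names and rows used in the demo ("article_a", "article_b", "score") are hypothetical and made up.

```python
import csv
import io


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def evaluate(csv_file, measure, col_a, col_b, col_score):
    """Correlate measure(a, b) with the human scores in an open CSV file."""
    human, predicted = [], []
    for row in csv.DictReader(csv_file):
        human.append(float(row[col_score]))
        predicted.append(measure(row[col_a], row[col_b]))
    return spearman(human, predicted)


# Demo with made-up rows and a toy measure; the real file's columns may differ.
sample = io.StringIO(
    "article_a,article_b,score\n"
    "Tiger,Cat,7.0\n"
    "Tiger,Stock,1.0\n"
    "Cat,Stock,0.5\n"
)
toy_scores = {("Tiger", "Cat"): 0.8, ("Tiger", "Stock"): 0.2, ("Cat", "Stock"): 0.1}
correlation = evaluate(
    sample, lambda a, b: toy_scores[(a, b)], "article_a", "article_b", "score"
)
```

To run this against the real file, open it with open("wikipediaSimilarity353.csv", newline="") and pass the column names that actually appear in its header.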

Questions

If you have any questions, then please contact me at [email protected].