Data

Wikipedia Data

Various text and graph data extracted from Wikipedia.

English/French: Algebraic Geometry

These are paired Wikipedia documents in two languages. Starting at the Algebraic geometry page in the English Wikipedia, we follow links in the Wikipedia graph out to two steps. From the resulting documents, we remove those that do not have a corresponding page in the French Wikipedia. Here "corresponding" means that there is a link from the English page to the French page (in the language links on the left-hand side of the page) and a link from the French page back to the English one. The files below are ordered consistently: the graphs are edge lists in which the node number is the document number, and document i in English corresponds to document i in French.
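The selection procedure described above can be sketched as follows. This is only an illustration on toy data, not the code used to build the dataset: the function names (two_step_neighborhood, mutually_linked) and the dictionaries standing in for the Wikipedia link graph and the language links are all hypothetical.

```python
from collections import deque


def two_step_neighborhood(links, start, max_steps=2):
    """Return every page reachable from `start` in at most `max_steps` hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        page, depth = frontier.popleft()
        if depth == max_steps:
            continue  # do not expand pages that are already two steps out
        for nxt in links.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen


def mutually_linked(pages, en_to_fr, fr_to_en):
    """Keep English pages whose French counterpart links back to them."""
    pairs = {}
    for page in pages:
        fr = en_to_fr.get(page)
        if fr is not None and fr_to_en.get(fr) == page:
            pairs[page] = fr
    return pairs


# Toy link graph and language links (made up for illustration).
links = {
    "Algebraic geometry": ["Scheme", "Variety"],
    "Scheme": ["Sheaf"],
    "Sheaf": ["Topology"],  # "Topology" is three steps out, so it is excluded
}
en_to_fr = {
    "Algebraic geometry": "Géométrie algébrique",
    "Scheme": "Schéma",
    "Sheaf": "Faisceau",
}
fr_to_en = {
    "Géométrie algébrique": "Algebraic geometry",
    "Schéma": "Scheme",
    "Faisceau": "Bundle",  # does not link back to "Sheaf", so the pair is dropped
}

pages = two_step_neighborhood(links, "Algebraic geometry")
pairs = mutually_linked(pages, en_to_fr, fr_to_en)
```

In this toy example, "Variety" is dropped because it has no French link, and "Sheaf" is dropped because the French page does not link back to it.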

The data are found in: agdata.tgz. This compressed tar file contains the following files:

agen.edgelist The English Wikipedia graph
agfr.edgelist The French Wikipedia graph
agen.titles English titles
agfr.titles French titles
categories Wikipedia categories (English)
classes Class labels (see below)
In addition, there are five files listing the documents in each class: dates, locations, math, people, things. Each class label is one of these five classes or "other".

In addition to these, there are three files: agen.wch and agfr.wch (the word-count histograms) and agxml.tgz, a tar archive containing the original Wikipedia XML (parsed only minimally, so that (mostly) only the text remains). Each language has its own XML document.

The word-count-histogram files are in the format:

"id:title" "word1:count1" "word2:count2" ...

Fields are separated by tabs. The "id" comes from Wikipedia, has no relationship to the vertex labels or anything else, and should generally be ignored. Most punctuation has been removed, except that single quotes remain. No effort has been made to ensure that each word is actually a word in the purported language. Wikipedia articles in other languages often contain English words, as well as links to other language editions (written in that language), and these can end up in the WCH. Finally, words with Greek roots often have the original Greek word in the text, and these will also show up in the WCH.
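A minimal sketch of reading one line of a word-count-histogram file, based only on the format described above. Whether the double quotes appear literally in the files is an assumption; the code strips them if present and works either way. The function name parse_wch_line and the sample line are hypothetical.

```python
def parse_wch_line(line):
    """Split one tab-separated WCH line into (id, title, {word: count})."""
    fields = [f.strip('"') for f in line.rstrip("\n").split("\t") if f]
    # The first field is "id:title"; the title itself may contain colons,
    # so split only on the first one.
    doc_id, _, title = fields[0].partition(":")
    counts = {}
    for field in fields[1:]:
        # Split on the last colon so a word containing ':' is kept whole.
        word, sep, count = field.rpartition(":")
        if not sep or not count.isdigit():
            continue  # skip malformed fields rather than guessing
        counts[word] = counts.get(word, 0) + int(count)
    return doc_id, title, counts


# Made-up sample line; note the single quote that survives in "d'une".
sample = '"12345:Algebraic geometry"\t"variety:7"\t"d\'une:2"\t"scheme:3"\n'
doc_id, title, counts = parse_wch_line(sample)
```

Since the id has no relationship to the vertex labels, the line number within the file, not doc_id, is what ties a histogram to a node in the edge lists.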

English/Farsi

Like the Algebraic geometry dataset, these are pairs of corresponding documents in the English and Farsi Wikipedia. These are all the documents with a bijective link (as of the date the data were obtained, sometime in 2008). Untarring enfawch.tgz creates a directory "Matched" containing two word-count histogram files, one for each language. Like the AG data, these are ordered: line i in English corresponds to line i in Farsi. As with the AG data, almost no processing was done to these words, so out-of-language words and words containing single quotes appear in these histograms. They can be ignored, or used, as you see fit. The file docs contains a list of the rows corresponding to document pairs in which both documents contain a "reasonable" number of words: 500 or more words and at least 100 distinct words, in both languages. There are 2448 of these.
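The "reasonable" criterion can be sketched as below, assuming each document has already been read into a dictionary mapping words to counts. The function names and toy histograms are hypothetical, and the row indices here are zero-based; the numbering convention used in the docs file is not stated above, so check it before comparing.

```python
def is_reasonable(counts, min_words=500, min_distinct=100):
    """True if a histogram has enough total words and enough distinct words."""
    return sum(counts.values()) >= min_words and len(counts) >= min_distinct


def reasonable_rows(en_docs, fa_docs):
    """Zero-based rows where BOTH languages pass the threshold."""
    return [
        i
        for i, (en, fa) in enumerate(zip(en_docs, fa_docs))
        if is_reasonable(en) and is_reasonable(fa)
    ]


# Toy histograms (made up): only `big` satisfies both conditions.
big = {f"w{i}": 5 for i in range(100)}   # 500 words, 100 distinct
few = {f"w{i}": 1 for i in range(200)}   # 200 words, 200 distinct
narrow = {"a": 1000}                     # 1000 words, 1 distinct

en_docs = [big, big, narrow]
fa_docs = [big, few, big]
rows = reasonable_rows(en_docs, fa_docs)
```

Only row 0 survives here, because the pair is kept only when the English and the Farsi document both meet the two thresholds.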