Various text and graph data extracted from Wikipedia.
English/French: Algebraic Geometry
These are paired Wikipedia documents in two languages. First, starting
at the Algebraic geometry page in English, we move two steps out, following
links in the Wikipedia graph. Given all those documents, we remove those
that do not have a corresponding page in the French Wikipedia. Here
"corresponding" means there is a link from the English to the French
(on the left hand side of the page) and a link from the French back to
the English. The files below are ordered: the graphs are edge lists with
node number as document number, and document i in English corresponds
to document i in French.
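The selection procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used to build the dataset: the link structure is assumed to be a plain dictionary from page title to linked titles, and the names within_steps, paired_pages, en_to_fr and fr_to_en are hypothetical.

```python
from collections import deque


def within_steps(links, start, max_steps=2):
    """Return every page reachable from `start` in at most `max_steps` links.

    `links` is assumed to map a page title to an iterable of linked titles.
    """
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] == max_steps:
            continue  # do not expand beyond the step limit
        for target in links.get(page, ()):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return set(depth)


def paired_pages(pages, en_to_fr, fr_to_en):
    """Keep only pages whose French link points back to the same English page."""
    return {
        page: en_to_fr[page]
        for page in pages
        if page in en_to_fr and fr_to_en.get(en_to_fr[page]) == page
    }
```

For example, with links = {"A": ["B"], "B": ["C"], "C": ["D"]}, a two-step walk from "A" keeps A, B and C but not D; paired_pages then drops any page whose interlanguage link is missing or not reciprocated.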
The data are found in agdata.tgz. This compressed tar file contains
the following files:
|| The English Wikipedia graph
|| The French Wikipedia graph
|| Wikipedia categories (English)
|| Class labels (see below)
In addition, there are five files with lists of the documents in each class.
The class labels correspond to one of these five classes plus "other".
In addition to these, there are three files:
agfr.wch (the word-count histograms) and
agxml.tgz, a tar archive containing the original Wikipedia
XML, parsed very minimally so that (mostly) only the
text remains. Each language has its own XML document.
The word-count-histogram files are in the format:
"id:title" "word1:count1" "word2:count2" ...
The separator is a tab. The "id" comes from Wikipedia, has no
relationship to the vertex labels or anything else, and can safely
be ignored.
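A line in this format could be parsed along the following lines. This is a minimal sketch, not an official reader: the function name parse_wch_line is hypothetical, and it assumes that the double quotes shown in the format may or may not be literally present (they are stripped if found), that the id never contains a colon, and that every count is an integer.

```python
def parse_wch_line(line):
    """Split one word-count-histogram line into (id, title, counts)."""
    fields = line.rstrip("\n").split("\t")
    # The first field is "id:title"; titles may contain colons themselves,
    # so split on the first colon only.
    doc_id, _, title = fields[0].strip('"').partition(":")
    counts = {}
    for field in fields[1:]:
        field = field.strip('"')
        if not field:
            continue  # tolerate a trailing tab
        # Split on the last colon so the count is always the final piece.
        word, _, count = field.rpartition(":")
        counts[word] = counts.get(word, 0) + int(count)
    return doc_id, title, counts
```

Calling parse_wch_line on each line of a .wch file in order gives one (id, title, counts) triple per document, with the line position (not the id) identifying the document.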
Punctuation has been mostly removed, except that single quotes remain.
Also, no effort has been made to ensure that each word is actually a word
in the purported language. Wikipedia articles in other languages often
contain English words, as well as links to other languages (written in
that language), and these can be included in the WCH. Finally, words with
Greek roots often have the original Greek word in the text, and these
will also show up in the WCH.
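If the non-Latin entries are unwanted in the English/French histograms, one possible filter is sketched below. It is an assumption about how a user might clean the data, not part of the dataset: the names is_latin_token and drop_non_latin are hypothetical, and the check relies on Unicode character names, so it removes Greek (or other non-Latin script) tokens but cannot detect English words inside French text.

```python
import unicodedata


def is_latin_token(token):
    """True if the token has letters and all of them are Latin-script."""
    letters = [ch for ch in token if ch.isalpha()]
    if not letters:
        return False  # purely numeric or punctuation-only tokens
    return all(unicodedata.name(ch, "").startswith("LATIN") for ch in letters)


def drop_non_latin(counts):
    """Return a copy of a word-count dictionary without non-Latin tokens."""
    return {word: n for word, n in counts.items() if is_latin_token(word)}
```

Single quotes are not letters, so tokens such as l'espace pass the check unchanged.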
Like the Algebraic geometry dataset, the English/Farsi data are pairs of
corresponding documents in the English and Farsi Wikipedia. These are all
the documents with a bijective link (as of the date the data were obtained,
sometime in 2008). Untarring enfawch.tgz creates a directory
"Matched" containing two word-count histogram files, one for each
language. Like the AG data, these are ordered: line i in English
corresponds to line i in Farsi. As with the AG data, almost no processing
was done to these words, so out-of-language words and words containing
single quotes appear in these histograms. They can be ignored, or
used, as you see fit. The file docs contains
a list of the rows corresponding to document pairs in which each document
contains a "reasonable" number of words: 500 or more words and at least
100 distinct words, in both languages. There are 2448 of these.
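The "reasonable" criterion can be expressed as a short check. The sketch below assumes each document has already been read into a dictionary mapping word to count; the names is_reasonable and reasonable_rows are hypothetical, rows are numbered from zero here (the numbering used in docs is not stated), and the result is not guaranteed to reproduce the list in docs exactly.

```python
def is_reasonable(counts, min_total=500, min_distinct=100):
    """True if a histogram has enough words in total and enough distinct words."""
    return sum(counts.values()) >= min_total and len(counts) >= min_distinct


def reasonable_rows(english, farsi):
    """Return the row indices where both documents of a pair pass the check.

    `english` and `farsi` are parallel lists of word-count dictionaries,
    so entry i in one corresponds to entry i in the other.
    """
    return [
        i
        for i, (en, fa) in enumerate(zip(english, farsi))
        if is_reasonable(en) and is_reasonable(fa)
    ]
```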