Wikipedia is the most important and comprehensive source of open, encyclopedic knowledge.
The English Wikipedia alone features over 4,280,000 entities, described by basic data points in so-called infoboxes as well as by natural language texts.
The DBpedia project has been extracting, mapping, converting and publishing Wikipedia data since 2007, establishing the Linked Open Data (LOD) cloud and becoming its center in the process.
The article texts and the data they contain have received comparatively little focus, although they make up the largest part of most articles in terms of writing effort, informational content and size.
Only the text of the first, introductory section of each article, called the abstract, is extracted and included in DBpedia.
Links inside the articles are extracted only as an unordered bag, recording an unspecified relation between the linking and the linked article, but neither where in the text the linked article was mentioned nor which relation holds between the two. Since the links are set by Wikipedia contributors themselves, they represent entity mentions intellectually disambiguated by URL. This property makes extracting the abstracts, including the links and their exact positions in the text, an interesting opportunity to create a corpus usable for, among other things, evaluating named entity recognition (NER) and named entity linking (NEL) algorithms.
This corpus is a conversion of the Wikipedia abstracts in seven languages (Dutch, English, French, German, Italian, Japanese and Spanish) into the NLP Interchange Format (NIF) [1]. It contains the abstract texts as well as the position, surface form and linked article of every link in the text. As such, it contains entity mentions manually disambiguated to Wikipedia/DBpedia resources by native speakers, which makes it well suited for NER training and evaluation.
Furthermore, the abstracts represent a special form of text that lends itself to more sophisticated tasks, such as open relation extraction. Their encyclopedic style, following the Wikipedia guidelines on opening paragraphs, adds further interesting properties. The first sentence puts the article in a broader context. Most anaphora refer to the original topic of the text, making them easier to resolve. Finally, should the same string occur again with a different meaning, Wikipedia guidelines suggest that the new meaning should again be linked for disambiguation. In short: this type of text is highly interesting.
All corpora come in a number of gzipped Turtle files. Linked Data is not available at the moment.
Language    Size (triples)    Abstracts    Average abstract length    Links
Dutch       114,284,973       1,740,494    317.82                     11,344,612
English     387,953,239       4,415,993    523.86                     39,650,948
French      116,205,859       1,476,876    349.73                     11,763,080
German      153,626,686       1,556,343    471.88                     15,859,142
Italian     75,698,533        907,329      398.31                     7,705,247
Japanese    59,647,215        909,387      154.94                     6,660,236
Spanish     111,293,569       1,038,639    517.66                     11,558,121
We are using NIF 2.0.
First, there is a nif:Context resource per abstract. It holds the abstract text in nif:isString, the length of this string in nif:endIndex, and the URL of the source page in nif:sourceUrl.
Links themselves are resources of type nif:Phrase or nif:Word (depending on the number of words they contain). They reference the context they occur in and mark their position in its string via offsets. They also point to the respective DBpedia page via itsrdf:taIdentRef.
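To make this concrete, here is a minimal hand-written Turtle sketch of one abstract with a single link. The URIs, the abstract text and the offsets are invented for illustration; the actual corpus files use their own resource URIs and may carry additional NIF types and properties. In this sketch the surface form of the link is given by nif:anchorOf.

@prefix nif:    <http://persistence.uni-leipzig.de/nlp2rdf/ontologies/nif-core#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

# The context resource holding the full abstract text (URI and text invented)
<http://example.org/Leipzig?nif=context>
    a nif:Context ;
    nif:isString   "Leipzig is the largest city in the federal state of Saxony, Germany."^^xsd:string ;
    nif:beginIndex "0"^^xsd:nonNegativeInteger ;
    nif:endIndex   "68"^^xsd:nonNegativeInteger ;
    nif:sourceUrl  <http://en.wikipedia.org/wiki/Leipzig> .

# A link in the abstract: its surface form, its character offsets in the
# context string, and the DBpedia resource it points to
<http://example.org/Leipzig?nif=word_52_58>
    a nif:Word ;
    nif:referenceContext <http://example.org/Leipzig?nif=context> ;
    nif:anchorOf   "Saxony"^^xsd:string ;
    nif:beginIndex "52"^^xsd:nonNegativeInteger ;
    nif:endIndex   "58"^^xsd:nonNegativeInteger ;
    itsrdf:taIdentRef <http://dbpedia.org/resource/Saxony> .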
Abstracts were additionally enriched with links, using surface forms extracted from the corpus itself. The corpus was enriched in two steps: first, surface forms were collected from the links already set in the abstracts; then, further occurrences of these surface forms in the abstract texts were annotated with the corresponding links.
Enrichment produces two distinct types of links, distinguished by a prov:wasAttributedTo property. Links with prov:wasAttributedTo http://wikipedia.org are found in the original abstracts and were set directly by Wikipedia editors. Links with prov:wasAttributedTo http://nlp.dbpedia.org/surfaceforms were added during enrichment.
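Continuing the invented example above, the provenance of a link is expressed like this (resource URIs again purely illustrative):

@prefix prov: <http://www.w3.org/ns/prov#> .

# Link present in the original abstract, set by a Wikipedia editor
<http://example.org/Leipzig?nif=word_52_58>
    prov:wasAttributedTo <http://wikipedia.org> .

# Link added automatically during surface form enrichment
<http://example.org/Leipzig?nif=word_60_67>
    prov:wasAttributedTo <http://nlp.dbpedia.org/surfaceforms> .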
We extracted the surface forms from the abstracts and publish them here. The format is TSV, with each line formatted as
Surfaceform\tLink\tCount
where 'Count' is the number of occurrences of this specific surface form in the corpus.
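For example, a hypothetical entry for the surface form 'Saxony' linked to its DBpedia resource would look like this (the count is invented):

Saxony\thttp://dbpedia.org/resource/Saxony\t1234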
Compiling a new language is quite a process, but you can send requests to me. For existing languages, I can trivially generate domain-specific corpora from lists of resource URIs, too!
The license, as per Wikipedia and DBpedia, is CC-BY. For NIF, please cite [1]. There is no publication for this corpus yet, so if you have any cool use cases, feel free to contact me at bruemmer@informatik.uni-leipzig.de
[1] S. Hellmann, J. Lehmann, S. Auer, M. Brümmer: Integrating NLP using Linked Data. In: Proceedings of the 12th International Semantic Web Conference (ISWC), Sydney, Australia, October 2013.
Martin Brümmer, AKSW, University of Leipzig, 2015