Wikipedia is the most important and comprehensive source of open, encyclopedic knowledge.
The English Wikipedia alone features over 4,280,000 entities, described by basic data points in so-called infoboxes as well as by natural language texts.
The DBpedia project has been extracting, mapping, converting and publishing Wikipedia data since 2007, establishing the Linked Open Data (LOD) cloud and becoming its center in the process.
The article texts and the data they contain have received comparatively little focus, although they make up the largest part of most articles in terms of writing effort, informational content and size.
Only the text of the first, introductory section of each article, called the abstract, is extracted and included in DBpedia.
Links inside the articles are extracted only as an unordered bag, recording an unspecified relation between the linking and the linked article, but neither where in the text the linked article was mentioned nor which relation holds between the two. Since the links are set by Wikipedia contributors themselves, they represent entity mentions intellectually disambiguated by URL. This property makes extracting the abstracts, including the links and their exact positions in the text, an interesting opportunity to create a corpus usable for, among other things, evaluating named entity recognition (NER) and named entity linking (NEL) algorithms.
This corpus is a conversion of the Wikipedia abstracts in seven languages (Dutch, English, French, German, Italian, Japanese and Spanish) into the NLP Interchange Format (NIF) [1]. It contains the abstract texts as well as the position, surface form and linked article of every link in the text. As such, it contains entity mentions manually disambiguated to Wikipedia/DBpedia resources by native speakers, which makes it well suited for NER training and evaluation.
Furthermore, the abstracts represent a special form of text that lends itself to more sophisticated tasks, such as open relation extraction. Their encyclopedic style, following the Wikipedia guidelines on opening paragraphs, adds further interesting properties. The first sentence puts the article in a broader context. Most anaphora refer to the original topic of the text, making them easier to resolve. Finally, should the same string occur again with a different meaning, Wikipedia guidelines suggest that the new meaning should again be linked for disambiguation. In short: this type of text is highly interesting.
All corpora come in a number of gzipped Turtle files. Linked Data is not available at the moment.
Language    Size (triples)    Abstracts    Average abstract length    Links
Dutch       114,284,973       1,740,494    317.82                     11,344,612
English     387,953,239       4,415,993    523.86                     39,650,948
French      116,205,859       1,476,876    349.73                     11,763,080
German      153,626,686       1,556,343    471.88                     15,859,142
Italian     75,698,533        907,329      398.31                     7,705,247
Japanese    59,647,215        909,387      154.94                     6,660,236
Spanish     111,293,569       1,038,639    517.66                     11,558,121
We are using NIF 2.0.
First, there is a nif:Context resource per abstract. It holds the abstract text in nif:isString, the length of this string in nif:endIndex, and the URL of the source page in nif:sourceUrl.
Links themselves are resources of type nif:Phrase or nif:Word (depending on the number of words they contain). They reference the context they occur in and mark their position in its string via offsets. They also point to the respective DBpedia page via itsrdf:taIdentRef.
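To make this concrete, here is a minimal hand-written Turtle sketch of one abstract with a single link. The URIs, the abstract text and the offsets are invented for illustration; the actual corpus files use their own resource URIs and may carry additional NIF types and properties. In this sketch the surface form of the link is given by nif:anchorOf.

@prefix nif:    <http://persistence.uni-leipzig.de/nlp2rdf/ontologies/nif-core#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

# The context resource holding the full abstract text (URI and text invented)
<http://example.org/Leipzig?nif=context>
    a nif:Context ;
    nif:isString   "Leipzig is the largest city in the federal state of Saxony, Germany."^^xsd:string ;
    nif:beginIndex "0"^^xsd:nonNegativeInteger ;
    nif:endIndex   "68"^^xsd:nonNegativeInteger ;
    nif:sourceUrl  <http://en.wikipedia.org/wiki/Leipzig> .

# A link in the abstract: its surface form, its character offsets in the
# context string, and the DBpedia resource it points to
<http://example.org/Leipzig?nif=word_52_58>
    a nif:Word ;
    nif:referenceContext <http://example.org/Leipzig?nif=context> ;
    nif:anchorOf   "Saxony"^^xsd:string ;
    nif:beginIndex "52"^^xsd:nonNegativeInteger ;
    nif:endIndex   "58"^^xsd:nonNegativeInteger ;
    itsrdf:taIdentRef <http://dbpedia.org/resource/Saxony> .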
Abstracts were additionally enriched with links, using surface forms extracted from the corpus itself. The corpus was enriched in two steps: first, surface forms were collected from the links already set in the abstracts; then, further occurrences of these surface forms in the abstract texts were annotated with the corresponding links.
Enrichment produces two distinct types of links, distinguished by a prov:wasAttributedTo property. Links with prov:wasAttributedTo http://wikipedia.org are found in the original abstracts and were set directly by Wikipedia editors. Links with prov:wasAttributedTo http://nlp.dbpedia.org/surfaceforms were added during enrichment.
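Continuing the invented example above, the provenance of a link is expressed like this (resource URIs again purely illustrative):

@prefix prov: <http://www.w3.org/ns/prov#> .

# Link present in the original abstract, set by a Wikipedia editor
<http://example.org/Leipzig?nif=word_52_58>
    prov:wasAttributedTo <http://wikipedia.org> .

# Link added automatically during surface form enrichment
<http://example.org/Leipzig?nif=word_60_67>
    prov:wasAttributedTo <http://nlp.dbpedia.org/surfaceforms> .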
We extracted the surface forms from the abstracts and publish them here. The format is TSV, with each line formatted as
Surfaceform\tLink\tCount
where 'Count' is the number of occurrences of this specific surface form in the corpus.
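For example, a hypothetical entry for the surface form 'Saxony' linked to its DBpedia resource would look like this (the count is invented):

Saxony\thttp://dbpedia.org/resource/Saxony\t1234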
Compiling a new language is quite a process, but you can send requests to me. For existing languages, I can trivially generate domain-specific corpora from lists of resource URIs, too!
The license, as per Wikipedia and DBpedia, is CC-BY. For NIF, please cite [1]. There is no publication for this corpus yet, so if you have any cool use cases, feel free to contact me at bruemmer@informatik.uni-leipzig.de
[1] S. Hellmann, J. Lehmann, S. Auer, M. Brümmer: Integrating NLP using Linked Data. In: Proceedings of the 12th International Semantic Web Conference (ISWC), Sydney, Australia, October 2013.
Martin Brümmer, AKSW, University of Leipzig, 2015