# DBpedia wikistatsextraction

Contains statistics extracted from Wikipedia dump files (article counts, pair-wise occurrence counts, surface form counts, and context vectors) for DBpedia Spotlight.

DBpedia wikistatsextraction version 2022.03.01

## Update, September 2020

This update generates the artifacts of dbpedia-models and wikistatsextractor for all languages and uploads them to the DBpedia Databus. It contains some fixes from [Klaus82](https://github.com/dbpedia-spotlight/model-quickstarter/pull/20).

The update is available in [Julio-Noe's GitHub repo](https://github.com/Julio-Noe/model-quickstarter) and will be merged soon into the spotlight repo.

## wikistatsextractor (spotlight-wikistats)

The [wikistatsextractor](https://github.com/dbpedia-spotlight/wikistatsextractor/) extracts statistics from Wikipedia dump files. It extracts the same 4 files initially produced by pignlproc for DBpedia Spotlight.

File | Long name | Line Format
------------- | ------------- | -------------
uriCount | Articles counts | `\t<uri>\t<count>`
pairCount | Pair-wise occurrence counts | `<surface form>\t<uri>\t<count>`
sfAndTotalCount | Surface forms counts | `<surface form>\t<count as SF>\t<count as token>`
tokenCount | Context vectors | `<uri (wikipedia style)>\t{(context_token1,count1),(context_token2, count2),... }`

These files are now available for download in the Databus spotlight-wikistats repository. A short description of each file follows (examples can be found in the [DBpedia wiki](https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Raw-data)):

- uriCount: Contains the number of mentions for each article.
- pairCount: Defines the number of times a surface form (plain text) was used to reference a resource (DBpedia URI).
- sfAndTotalCount: Establishes how many times a surface form was used as an anchor and how many times it occurred in text (regardless of whether it was part of an anchor).
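
As an illustration of the pairCount line format listed in the table above, the following is a minimal Python sketch that reads `<surface form>\t<uri>\t<count>` lines and groups the candidate URIs by surface form. The function names (`parse_pair_count_line`, `candidates_by_surface_form`) and the sample lines are hypothetical and not part of wikistatsextractor; the sample counts and URIs are invented for the example, and real files may need extra handling (encoding, malformed lines).

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def parse_pair_count_line(line: str) -> Tuple[str, str, int]:
    """Split one pairCount line into (surface form, uri, count).

    Assumes the tab-separated layout from the table above. Splitting from
    the right keeps the last two fields as uri and count.
    """
    surface_form, uri, count = line.rstrip("\n").rsplit("\t", 2)
    return surface_form, uri, int(count)


def candidates_by_surface_form(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Group pairCount entries as {surface form: {uri: count}}."""
    table: Dict[str, Dict[str, int]] = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        surface_form, uri, count = parse_pair_count_line(line)
        # Sum counts in case the same pair appears more than once.
        table[surface_form][uri] = table[surface_form].get(uri, 0) + count
    return dict(table)


# Invented sample lines, for illustration only.
sample = [
    "Berlin\thttp://dbpedia.org/resource/Berlin\t120\n",
    "Berlin\thttp://dbpedia.org/resource/Berlin_(band)\t3\n",
    "Paris\thttp://dbpedia.org/resource/Paris\t200\n",
]
table = candidates_by_surface_form(sample)
```

For a downloaded file, the same function could be fed an open file handle, for example `candidates_by_surface_form(open(path, encoding="utf-8"))`, where `path` points to a decompressed pairCount file.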

### model-quickstarter (spotlight-model)

The dbpedia-models artifacts were generated with the [DBpedia model-quickstarter](https://github.com/dbpedia-spotlight/model-quickstarter) tool.

## Citation

If you use the current (statistical) version of DBpedia Spotlight or the data/models created using this repository, please cite the following paper.

```bibtex
@inproceedings{isem2013daiber,
  title = {Improving Efficiency and Accuracy in Multilingual Entity Extraction},
  author = {Joachim Daiber and Max Jakob and Chris Hokamp and Pablo N. Mendes},
  year = {2013},
  booktitle = {Proceedings of the 9th International Conference on Semantic Systems (I-Semantics)}
}
```