GSoC/GCI Archive
Google Summer of Code 2014

DBpedia & DBpedia Spotlight

License: GNU General Public License version 2.0 (GPLv2)

Web Page: http://wiki.dbpedia.org/gsoc2014/ideas

Mailing List: https://lists.sourceforge.net/lists/admindb/dbpedia-gsoc

DBpedia (http://dbpedia.org) and DBpedia Spotlight (http://spotlight.dbpedia.org) are two projects with strong ties (in topic, in data and in community), which is why we decided to join forces and apply together this year.

Almost every major Web company has now announced work on a knowledge graph, including Google’s Knowledge Graph, Yahoo!’s Web of Objects, Walmart Labs’ Social Genome, Microsoft's Satori Graph / Bing Snapshots and Facebook’s Entity Graph. DBpedia is a community-run project that has been building a free, open-source knowledge graph since 2006. DBpedia currently exists in 97 languages, is interlinked with many other databases (e.g. Freebase, New York Times, CIA Factbook) and, hopefully with this GSoC, will be interlinked with Wikidata too.

The knowledge in DBpedia is exposed through a set of technologies called Linked Data. Linked Data has been revolutionizing the way applications interact with the Web. While Web 2.0 technologies opened up much of the “guts” of websites for third parties to reuse and repurpose data on the Web, they still require developers to create one client per target API. With Linked Data technologies, all APIs are interconnected via standard Web protocols and languages. One can navigate this Web of facts with standard Web browsers or automated crawlers, or pose complex queries with SQL-like query languages (e.g. SPARQL). Have you ever thought of asking the Web for all cities with low crime, warm weather and open jobs? That is the kind of query we are talking about.

This new Web of interlinked databases provides useful knowledge that can complement the textual Web in many ways. See, for example, how bloggers tag their posts or assign them to categories in order to organize and interconnect their blog posts. This is a very simple way to connect “unstructured” text to a structure (a hierarchy of tags).
For more advanced examples, see how the BBC created its World Cup 2010 website by interconnecting textual content and facts from their knowledge base. Identifiers and data provided by DBpedia were heavily involved in creating this knowledge graph. Or, more recently, did you see that IBM's Watson used DBpedia data to win the Jeopardy! challenge?

DBpedia Spotlight is an open-source (Apache license) text annotation tool that connects text to Linked Data by marking names of things in text (we call that spotting) and selecting between multiple interpretations of these names (we call that disambiguation). For example, “Washington” can be interpreted in more than 50 ways, including a state, a government or a person. You can already imagine that this is not a trivial task, especially when we are talking about 3.64 million “things” of 320 different “types” with over half a billion “facts” (July 2011).

After two successful GSoC editions we have some brand new and exciting ideas, and we hope you will get excited too!
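The kind of structured query described above can be posed to DBpedia's public SPARQL endpoint over plain HTTP. The Python sketch below, using only the standard library, builds such a request; the endpoint URL is DBpedia's well-known https://dbpedia.org/sparql, while the ontology terms used in the query (dbo:City, dbo:populationTotal) are illustrative assumptions rather than a vetted query.

```python
from urllib.parse import urlencode

# Sketch: posing a SPARQL query to DBpedia's public endpoint via HTTP GET.
# The ontology terms below (dbo:City, dbo:populationTotal) are illustrative
# assumptions; consult the DBpedia ontology for the properties you need.
ENDPOINT = "https://dbpedia.org/sparql"

QUERY = """\
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?city ?population WHERE {
  ?city a dbo:City ;
        dbo:populationTotal ?population .
  FILTER (?population > 1000000)
}
LIMIT 10
"""

def build_request_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query as a GET request asking for JSON results."""
    params = urlencode({
        "query": query,
        "format": "application/sparql-results+json",
    })
    return f"{endpoint}?{params}"

url = build_request_url(ENDPOINT, QUERY)
# To actually execute the query (requires network access):
#   import json, urllib.request
#   bindings = json.load(urllib.request.urlopen(url))["results"]["bindings"]
print(url)
```

Real-world queries mix properties from several vocabularies (weather, crime statistics, job postings) across interlinked datasets; the request mechanics stay the same.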

Projects

  • Abbreviation Base – a multilingual knowledge base for abbreviations. The goal of this project is to derive an abbreviation dictionary from DBpedia. The project involves extracting as many abbreviations as possible from as many DBpedia datasets (in different languages) as possible. These are then aggregated and published as a Linked Open Data knowledge base using the Lemon model. The knowledge base will be interlinked with DBpedia and queryable via SPARQL. The project is based on Semantic Web technology and is an effective application of DBpedia and Linked Open Data.
  • Distributed extraction of Wikipedia data dumps for DBpedia. The DBpedia project “extracts structured, multilingual knowledge from Wikipedia and makes it freely available on the Web using Semantic Web and Linked Data technologies”. Large-scale data processing can be given a big performance boost if it is distributed over a cluster of computers. The aim of this project is to parallelize the download of Wikipedia dumps using different tools, and to distribute their extraction over multiple machines using Apache Spark to ensure speed and scalability.
  • Fine-grained massive extraction of Wikipedia content. This project aims at creating a new infrastructure that gives people an easy way to contribute custom extractors to DBpedia.
  • Media Extractor for DBpedia. The goal is to make DBpedia more generic and flexible by supporting multimedia data sources other than Wikipedia.
  • Natural language question answering engine with DBpedia. The major goal of this project is to develop a QA (question answering) engine for DBpedia. I will describe how to implement each module in a typical QA engine, including template generation, disambiguation, query generation, answer generation and rendering, and show how to extend it to multiple languages. In the end, the project will deliver a complete QA engine for DBpedia that is able to answer a variety of questions in different languages.
  • New DBpedia Interfaces: Resource Widgets. This is a proposal to create an autosuggest DBpedia plugin, an embeddable widget for DBpedia entities, and an application to showcase the autosuggest plugin and the widgets. The widgets will display DBpedia resources in an easy-to-understand and visually appealing format.
  • Wikimedia Commons extraction. The proposal, based on an idea suggested by DBpedia, aims to enhance DBpedia's Information Extraction Framework to extract metadata from Wikimedia Commons and incorporate it into the RDF data currently being exported from Wikipedia.
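The spotting-then-disambiguation pipeline that DBpedia Spotlight performs, and that several of these project ideas build on, can be illustrated with a toy Python sketch. The candidate lists, context cues and overlap scoring below are invented stand-ins for Spotlight's real statistical model, purely to show the shape of the two steps.

```python
# Toy sketch of Spotlight-style annotation: first spot surface forms in text,
# then disambiguate each one against candidate DBpedia resources.
# All data below is invented for illustration, not taken from Spotlight.

# Hypothetical mapping from surface forms to candidate DBpedia resources.
CANDIDATES = {
    "Washington": ["dbr:Washington_(state)", "dbr:George_Washington",
                   "dbr:Washington,_D.C."],
    "Amazon": ["dbr:Amazon_River", "dbr:Amazon_(company)"],
}

# Crude context cues standing in for a real disambiguation model.
CONTEXT_CUES = {
    "dbr:Washington_(state)": {"state", "seattle", "pacific"},
    "dbr:George_Washington": {"president", "general"},
    "dbr:Washington,_D.C.": {"capital", "congress"},
    "dbr:Amazon_River": {"river", "rainforest"},
    "dbr:Amazon_(company)": {"retailer", "cloud"},
}

def spot(text):
    """Spotting: find known surface forms in the text (naive substring match)."""
    return [sf for sf in CANDIDATES if sf in text]

def disambiguate(surface_form, text):
    """Disambiguation: pick the candidate whose cues overlap the text most."""
    words = set(text.lower().split())
    return max(CANDIDATES[surface_form],
               key=lambda uri: len(CONTEXT_CUES[uri] & words))

text = "Washington is the state in the Pacific Northwest near Seattle"
annotations = {sf: disambiguate(sf, text) for sf in spot(text)}
print(annotations)  # → {'Washington': 'dbr:Washington_(state)'}
```

The real Spotlight uses learned statistics over Wikipedia link anchors and much richer context models, but the two-phase structure (spot, then disambiguate) is the same.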