Google Summer of Code 2015

DBpedia & DBpedia Spotlight

License: GNU General Public License version 2.0 (GPLv2)

Web Page: http://wiki.dbpedia.org/gsoc2015/ideas

Mailing List: https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc

DBpedia (http://dbpedia.org) and DBpedia Spotlight (http://spotlight.dbpedia.org) are two projects that have strong ties (topic-wise, data-wise and community-wise). That is why we decided to join forces and apply together this year.

Almost every major Web company has now announced work on a knowledge graph, including Google's Knowledge Graph, Yahoo!'s Web of Objects, Walmart Labs' Social Genome, Microsoft's Satori Graph / Bing Snapshots and Facebook's Entity Graph. DBpedia is a community-run project that has been building a free, open-source knowledge graph since 2006.

DBpedia currently describes 38.3 million “things” of 685 different “types” in 125 languages, with over 3 billion “facts” (September 2014). It is interlinked with many other databases (e.g., Freebase, Wikidata, the New York Times, the CIA World Factbook). The knowledge in DBpedia is exposed through a set of technologies called Linked Data. Linked Data has been revolutionizing the way applications interact with the Web. While Web 2.0 technologies opened up much of the “guts” of websites for third parties to reuse and repurpose data on the Web, they still require that developers create one client per target API. With Linked Data technologies, all APIs are interconnected via standard Web protocols and languages. One can navigate this Web of facts with standard Web browsers or automated crawlers, or pose complex queries in SQL-like query languages (e.g., SPARQL).

Have you thought of asking the Web about all cities with low criminality, warm weather and open jobs? That's the kind of query we are talking about.
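
DBpedia alone does not carry crime or weather statistics, but here is a minimal Python sketch of the same kind of query, assuming the SPARQLWrapper library and the public endpoint (any SPARQL client would do): it asks for cities above a population threshold via the dbo:populationTotal property.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Ask the public DBpedia endpoint for large cities.
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?city ?population WHERE {
            ?city a dbo:City ;
                  dbo:populationTotal ?population .
            FILTER (?population > 5000000)
        }
        ORDER BY DESC(?population)
        LIMIT 10
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["city"]["value"], row["population"]["value"])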

This new Web of interlinked databases provides useful knowledge that can complement the textual Web in many ways. See, for example, how bloggers tag their posts or assign them to categories in order to organize and interconnect what they write. This is a very simple way to connect “unstructured” text to a structure (a hierarchy of tags). For a more advanced example, see how the BBC created its World Cup 2010 website by interconnecting textual content and facts from its knowledge base; identifiers and data provided by DBpedia played a central role in building that knowledge graph. Or, more recently, did you see that IBM's Watson used DBpedia data to win the Jeopardy! challenge?

DBpedia Spotlight is an open-source (Apache license) text annotation tool that connects text to Linked Data by marking names of things in text (we call that Spotting) and selecting between multiple interpretations of these names (we call that Disambiguation). For example, “Washington” can be interpreted in more than 50 ways, including as a state, a government or a person. You can already imagine that this is not a trivial task, especially when we are talking about millions of things and hundreds of types.
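
As a quick taste of annotation, here is a minimal Python sketch against Spotlight's REST annotate interface; the demo endpoint URL, the confidence parameter and the JSON response keys reflect the service as documented at the time and may change.

    import requests

    # "confidence" filters out low-scoring annotations.
    resp = requests.get(
        "http://spotlight.dbpedia.org/rest/annotate",
        params={"text": "Washington opposed the British crown.",
                "confidence": 0.4},
        headers={"Accept": "application/json"},
    )
    for resource in resp.json().get("Resources", []):
        print(resource["@surfaceForm"], "->", resource["@URI"])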

After three successful GSoC editions, we have some brand new and exciting ideas, and we hope you will get excited too!

Projects

  • Better context vectors for disambiguation For entity disambiguation, DBpedia Spotlight uses a vector model to evaluate similarities between contexts and candidate entities. Word2Vec, GloVe and other models might be very effective replacements for the currently used discrete context vector model. This is a proposal for extending Spotlight to include some of these more advanced models and to evaluate whether they lead to better disambiguation (a minimal sketch follows this list).
  • DBpedia Live scaling and new interface This project is divided into two parts: the first is to remove the MySQL database used as a cache in the Live extraction; the second is to create an administrative interface where the user can start and stop the extraction, queue items and explore a variety of statistics. To replace the MySQL database I will introduce an instance of a NoSQL database, MongoDB (a sketch of the cache follows this list). The administrative interface will use the open-source Jetty server, with real-time updates implemented using AJAX.
  • DBpedia Schema Enrichment on WebProtege DBpedia is a crowd-sourced community project aiming to extract structured content from Wikipedia. It allows users to ask sophisticated queries against Wikipedia and to link other datasets on the Web to Wikipedia data. The goal of this project is to improve the quality of the data through schema enrichment using the machine learning framework DL-Learner. It will be realized by implementing a DL-Learner plugin for the collaborative ontology editor WebProtege.
  • Fact Extraction from Wikipedia Text The goal is to extract factual information from the free text of Wikipedia articles. An automated way of doing this would give a great boost to the whole DBpedia ecosystem, as most of the information is currently extracted from infoboxes, leaving a considerable amount of data untouched. The idea is to use frame semantics and machine learning to extract relevant information and to classify the facts into an ontology (a toy illustration follows this list).
  • Get up and walk! - Adding live-ness to the Triple Pattern Fragments server Semantic Web projects have a reliability problem: maintaining SPARQL endpoints is expensive. Triple Pattern Fragments (TPF) were designed to offload processing requirements from the servers and put the processing on the clients. This project aims to give the TPF server the ability to update its dataset using the triples provided by the live extraction framework, just like DBpedia Live does. In short, we want to make TPF live (a sketch follows this list).
  • Keyword Search on DBpedia DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. The aim of this project is to develop a more Google-like (i.e., scalable, keyword-based) approach to searching DBpedia (a minimal sketch follows this list).
  • Parallel processing in the DBpedia Extraction Framework The project deals with making the distributed DBpedia extraction framework easily accessible on different distributed computing platforms. Scripts will be used to preconfigure the setup for running the extraction. Distributing the workload is expected to provide performance gains over the current approach (a sketch follows this list).
  • Remodeling pignlproc for generating Named Entity Recognition models The aim of the project is to rebuild the currently running pignlproc. The remodeled version should run faster and with fewer issues, and the solution should scale to different languages.
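
For the context-vector project, here is a minimal Python sketch, assuming `vectors` is any mapping from tokens to pre-trained word2vec- or GloVe-style vectors; the function names are illustrative, not Spotlight's API. It replaces discrete context counts with averaged dense vectors and ranks candidate entities by cosine similarity.

    import numpy as np

    def embed(tokens, vectors):
        # Average pre-trained word vectors into one dense context vector;
        # out-of-vocabulary tokens are skipped.
        known = [vectors[t] for t in tokens if t in vectors]
        return np.mean(known, axis=0) if known else None

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def disambiguate(context_tokens, candidates, vectors):
        # candidates maps an entity URI to the tokens of its description;
        # the winner is the entity closest to the mention's context.
        ctx = embed(context_tokens, vectors)
        best_uri, best_score = None, -1.0
        for uri, description_tokens in candidates.items():
            cand = embed(description_tokens, vectors)
            if ctx is not None and cand is not None:
                score = cosine(ctx, cand)
                if score > best_score:
                    best_uri, best_score = uri, score
        return best_uri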
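
For the DBpedia Live scaling project, a hypothetical sketch of the MongoDB-backed cache using pymongo; the collection name and document layout are assumptions, not the project's final design.

    from pymongo import MongoClient

    # One document per Wikipedia page, keyed by page ID.
    cache = MongoClient("localhost", 27017)["dbpedia_live"]["page_cache"]

    def update_page(page_id, triples):
        # The upsert replaces the previous extraction result for the page,
        # which is the role the MySQL cache plays today.
        cache.replace_one({"_id": page_id},
                          {"_id": page_id, "triples": triples},
                          upsert=True)

    def cached_triples(page_id):
        doc = cache.find_one({"_id": page_id})
        return doc["triples"] if doc else []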
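
For the fact-extraction project, a deliberately toy illustration of the frame idea: a trigger verb maps to a DBpedia ontology property, with the syntactic subject and object filling the frame's two roles. Real frame semantics would need a parser and role labeling, not a regular expression.

    import re

    # A trigger verb maps to an ontology property; subject and object
    # fill the frame's two roles.
    FRAMES = {"married": "dbo:spouse"}

    def extract_facts(sentence):
        facts = []
        for verb, prop in FRAMES.items():
            match = re.match(rf"(.+?)\s+{verb}\s+(.+?)\.?$", sentence)
            if match:
                facts.append((match.group(1), prop, match.group(2)))
        return facts

    print(extract_facts("Marie Curie married Pierre Curie."))
    # [('Marie Curie', 'dbo:spouse', 'Pierre Curie')]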
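
For the live TPF project, a sketch of replaying DBpedia Live-style changesets (paired files of added and removed N-Triples) into an in-memory rdflib graph standing in for the TPF server's backing store; the file names are hypothetical.

    from rdflib import Graph

    # In-memory stand-in for the TPF server's backing store.
    store = Graph()
    store.parse("dbpedia-dump.nt", format="nt")

    def apply_changeset(added_path, removed_path):
        # Replay one changeset: drop the removed triples, add the new ones.
        for triple in Graph().parse(removed_path, format="nt"):
            store.remove(triple)
        for triple in Graph().parse(added_path, format="nt"):
            store.add(triple)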
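
For the keyword search project, a minimal inverted index over entity labels with AND semantics; a scalable version would use an engine such as Lucene, but the core data structure looks like this.

    from collections import defaultdict

    index = defaultdict(set)  # token -> set of entity URIs

    def add_entity(uri, label_text):
        for token in label_text.lower().split():
            index[token].add(uri)

    def search(query):
        # AND semantics: an entity must match every keyword in the query.
        hits = [index[token] for token in query.lower().split()]
        return set.intersection(*hits) if hits else set()

    add_entity("dbr:Berlin", "Berlin capital of Germany")
    print(search("germany capital"))  # {'dbr:Berlin'}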
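
For the parallel extraction project, a PySpark sketch that distributes per-page extraction over a cluster, assuming the dump has been preprocessed to one page per line; `extract_triples` is a placeholder for the real extraction logic and the HDFS paths are hypothetical.

    from pyspark import SparkContext

    def extract_triples(page_line):
        # Placeholder for the real logic: parse one page's wiki markup
        # and return a list of N-Triples strings.
        return []

    sc = SparkContext(appName="dbpedia-distributed-extraction")
    (sc.textFile("hdfs:///dumps/enwiki-pages.txt")  # one page per line
       .flatMap(extract_triples)
       .saveAsTextFile("hdfs:///out/triples"))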