GSoC/GCI Archive
Google Summer of Code 2014

Apache Software Foundation

License: Apache License, 2.0

Web Page:

Mailing List: No central list, see the lists of Apache projects at and Students can approach the GSOC Admins via

Established in 1999, the all-volunteer Foundation oversees nearly one hundred fifty leading Open Source projects, including Apache HTTP Server, the world's most popular Web server software. Through the ASF's meritocratic process known as "The Apache Way," more than 350 individual Members and 3,000 Committers successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's official user conference, trainings, and expo. The ASF is a US 501(c)(3) not-for-profit charity, funded by individual donations and corporate sponsors including Citrix, Facebook, Google, Yahoo!, Microsoft, AMD, Basis Technology, Cloudera, Go Daddy, Hortonworks, HP, Huawei, InMotion Hosting, IBM, Matt Mullenweg, PSW GROUP, SpringSource/VMware, and WANDisco.

Our ideas page can be filtered by the labels documented at


  • (PIG-2597) Move Grunt from JavaCC to ANTLR Apache Pig is a platform for analyzing large-scale data sets using a high-level data flow language. Currently Pig uses two frameworks, JavaCC and ANTLR, for generating language processors. In this Google Summer of Code project, I will reimplement GruntParser using ANTLR and merge it into QueryParser. Removing GruntParser's dependency on JavaCC will help the Apache Pig project keep a clean, easy-to-maintain codebase and continue to be improved by many contributors.
  • [CLOUDSTACK-6045] Create GUI to add primary storage based on plug-ins The project’s goal is to provide admins who want to add primary storage to CloudStack that is not based on the default storage plug-in with a GUI plugin (or modified version of a current window) that enables them to invoke the addStoragePool API.
  • [CLOUDSTACK-6114] Create config management recipes to install CloudStack The aim of this project is to use Vagrant and one of the many configuration management tools to easily bring up CloudStack environments for development or testing purposes.
  • A SpatialPrefixTree based on the Hilbert Curve and variable grid sizes A SpatialPrefixTree that uses the Hilbert curve to map multi-dimensional points to one dimension, together with Morton binary coding and variable-sized grids, to efficiently process geodetic and non-geodetic queries will be introduced in Apache spatial.
  • Add Security capabilities to Airavata Thrift services and clients There is a huge need for Airavata based science gateways to have secure communication with the Airavata server. The main goal of this project is to explore and add security capabilities to Airavata Thrift services and clients.
  • Amazon Glacier support for jclouds The following document is my proposal to add support for Amazon Glacier to jclouds. The following pages explain the need for Glacier support, summarize my vision of how the development process should be approached, and describe the main goals I’d like to achieve.
  • Apache Bloodhound : Batch create tickets Currently Apache Bloodhound provides a method for creating tickets by filling in a form with the relevant information. But because it only supports adding one ticket at a time, creating a large number of tickets can be a really tedious process. The main idea of this project is to implement functionality that lets users batch-create tickets by converting WikiFormatted tables into sets of tickets.
  • Apache Helix Dashboard The proposed project mainly focuses on developing a dashboard for Helix. Helix is a generic cluster management framework used for the automatic management of partitioned, replicated and distributed resources hosted on a cluster of nodes. Helix uses ZooKeeper for maintaining the cluster state and change notifications. Helix's approach is to use a distributed state machine with constraints on states and transitions. Additional info:
  • Apache Nutch web GUI The goal of this project is to create a web application that allows users to manage Apache Nutch without any effort.
  • Apache Thrift - Test Framework harmonization across all languages Apache Thrift is a software framework that enables communication between servers and clients that use different languages, different protocols and different transports. The main purpose of this project is to extend the current Thrift cross-language test suite to cover every client-server-protocol-transport-socket combination with a test case.
  • COMDEV108 - Add Solr to Bloodhound The Bloodhound Search plugin supports different search backends, but only Whoosh has been implemented so far. Apache Solr is a search platform focused on delivering enterprise-class, high-performance search functionality [2]. It provides great scalability and great performance under heavy usage, making it a great tool for services with a high number of users and a large amount of data.
  • CSV PropertyTable Support for Apache Jena (JENA-625) This project is about getting CSVs into a form that is amenable to Jena SPARQL processing, and doing so in a way that is not specific to CSV files. It includes getting the right architecture in place for regular table shaped data, using the core abstraction of PropertyTable.
  • Dynamic clientside autocompletion features for the Apache Bloodhound ticket The ticket system needs to assist users when creating and modifying tickets, to reduce potential errors and to minimize the amount users need to type by autocompleting the fields. (COMDEV-111)
  • Enhancement Workflows. Enterprise Integration Patterns in Apache Stanbol Stanbol provides a set of components for Semantic Content Management. One of the components is the Enhancer, which can be used to extract features from content. The Enhancer is organized using Enhancement Chains, which define how the content will be processed, but they don't allow the current process to be integrated with the business layer. The goal of the proposal is to bring EIP to Stanbol to ease the integration of the Enhancement workflows within the business layer of enterprise systems.
  • GCE support for Stratos Apache Stratos is an open source PaaS framework that provides a cloud computing platform along with software components to develop, test and run scalable applications by making use of IaaS providers; Amazon EC2 and OpenStack are currently supported. Stratos uses jclouds to integrate with various IaaS providers. This project is to provide support for GCE (Google Compute Engine), which is Google’s IaaS solution, and to run large-scale tests on it.
  • GenApp Integration with Apache Airavata GenApp is a modular framework for multiscale science computations. Apache Airavata is a software framework for executing and managing computational jobs and workflows on distributed computing resources. The primary goal of this project is to integrate GenApp with Airavata so that it can execute long-running computational jobs. GenApp also needs to be integrated with Airavata to enable composition and execution of workflows, and with Airavata's Messaging System as well.
  • Google Cloud Storage support for jclouds Apache jclouds is an open source library that allows Java and Clojure developers to write highly portable code by abstracting cloud-specific features through views such as Blobstore, Compute and Loadbalancer. At the same time, jclouds allows developers to make full use of vendor-specific features through provider implementations.
  • GUI Client for Apache SIS SIS is a framework for geo-information analysis; currently it has a command-line tool and a web application for spatial data analysis. The objective of this project is to implement a JavaFX desktop application for spatial data analysis. This application will provide features such as extracting and visualizing metadata from data files and projecting spatial data onto a map for different queries.
  • Implementation of the LDP service for Apache Marmotta Based on SPARQL 1.1 This project will provide an alternative implementation of the Marmotta Linked Data Platform (LDP) service based on SPARQL 1.1, inspired by the Marmotta JIRA issue:
  • Improve GCE and EC2 python bridges to CloudStack The project aims to extend the current bridges between CloudStack and EC2/GCE, add APIs not currently covered, and fix existing bugs.
  • Improvements to Auto-scaling in Apache Stratos Apache Stratos is a polyglot PaaS framework, providing users a cloud-based environment for developing, testing, and running scalable applications. It consists of a number of components that provide capabilities, flexibility and scalability. Auto-scaling is one such component; it enables users to automatically launch or terminate instances based on user-defined policies, health status checks, and schedules. This GSoC project is about improving auto-scaling.
  • Improving classification module in Lucene This project aims to make the Lucene classification module more usable and to speed it up. The module is approximately one year old and has some minor issues and deficiencies. I want to resolve some optimization problems with an online and a caching classification model, make result sets more configurable, and make the whole product testable. I want to implement at least one new classification algorithm and integrate the module into Solr.
  • Integrate YAGO2 and AIDA NED with Apache Stanbol Apache Stanbol is currently equipped with data sets such as DBpedia and Freebase. This project will integrate the YAGO2 data set, which has cleaner thematic domains and an efficient representation of the temporal and spatial dimensions. It will also integrate the AIDA named entity disambiguation framework with Stanbol.
  • LUCENE-4396: BooleanScorer should sometimes be used for MUST clauses I aim to improve the performance of BooleanQuery by implementing a better rule for choosing a Scorer. To support the new rule, my proposed work also includes a new mechanism that lets BooleanScorer handle MUST clauses well, and an improved implementation of BooleanScorer that allows it to act as a sub-Scorer under some conditions. I'll also compare the performance of my new rule against the old one.
  • Mavenize Pig The project aims to move Apache Pig from an Ant build system to Maven.
  • ODE -- OModel refactoring and migration This proposal aims to refactor the OModel for the ODE project. The current OModel is not evolvable. When BPEL evolves, new versions of OModel objects break backward compatibility, and the compiler/runtime is not flexible enough to accommodate OModel changes such as new extension support. I propose JSON serialization and an extension plugin mechanism to address these two problems respectively. This will reduce the work needed for later migrations.
  • Optical Character Recognition for Apache PDFBox Apache PDFBox is widely used as a tool for extracting text from PDF files. But with the current approach, text cannot be extracted from image content or from text with corrupted character encodings. This project introduces a new approach to extracting text from PDFs using Optical Character Recognition.
  • Porting EC2 support with JClouds integration in Apache Airavata Apache Airavata is a software framework to compose, manage, execute and monitor computational jobs and workflows that run on a variety of resources ranging from local grids to commercial clouds. This project offers improvements to the existing module for EC2 job submission as well as enhancements to execute single jobs on the Amazon cloud through the orchestrator.
  • Proposal for PDFBOX-1915 Implement shading with Coons and tensor-product pa This proposal contains my basic information and background. It also gives a concrete description of the work I have done on this project, the work I will do next, and the approach I will use this summer. Several formulas may not display properly in the proposal; the PDF version of this proposal can be obtained through the additional info URL.
  • Prototype Airavata Support for Application Scheduling The proposed project is an enhancement to the Orchestrator component of Apache Airavata, which will collect and make use of application performance data for job run time prediction and scheduling purposes.
  • Re-Architecturing the Apache Airavata Database System The Apache Airavata Registry currently contains all data related to persistence, metadata, provenance, inputs and outputs, workflows, etc. This makes the system highly sensitive to changes: adding a new feature to any one of those requires extensively remodeling and reimplementing the whole system. For example, the recent Data Model changes took a lot of development time to implement. This project is to make the Airavata Database system more stable and more flexible for feature additions.
  • SPARQL commands in Jena rules The goal of this project is to enrich the Jena framework with rules based on SPARQL commands, or to include SPARQL commands in a rule. Combining rules with SPARQL commands in this way increases the expressiveness of Jena.
  • Speech to Text Enhancement Engine for Apache Stanbol The TBD enhancement engine uses the Sphinx library to convert the captured audio. A media (audio/video) data file is parsed with the ContentItem and converted to the proper audio format by the Xuggler libraries. Speech is then extracted from the audio by Sphinx as 'plain/text', annotated with the temporal position of the extracted text. Sphinx uses an acoustic model and a language model to map utterances to text, so the engine will also support uploading acoustic and language models.