GSoC/GCI Archive
Google Summer of Code 2015

R Project for Statistical Computing

License: GNU General Public License version 2.0 (GPLv2)

Web Page: https://github.com/rstats-gsoc/gsoc2015/wiki/table-of-proposed-coding-projects

Mailing List: gsoc-r@googlegroups.com

The R Foundation (as the legal entity behind the R Project) is a
not-for-profit organization working in the public interest. It was founded by
members of the R Development Core Team in order to
- Provide support for the R project and other innovations in statistical computing. We believe that R has become a mature and valuable tool and we would like to ensure its continued development and the development of future innovations in software for statistical and computational research.
- Provide a reference point for individuals, institutions, or commercial enterprises that want to support or interact with the R development community.
- Hold and administer the copyright of R software and documentation.
R is an official part of the Free Software Foundation's GNU project, and the R Foundation has similar goals to other open source software foundations like the Apache Foundation or the GNOME Foundation.
Among the goals of the R Foundation are the support of continued development of R, the exploration of new methodology, teaching and training of statistical computing and the organization of meetings and conferences with a statistical computing orientation.

Projects

  • A C++ implementation of ARPACK and Its Integration with R This project aims to implement a new version of the ARPACK library in modern C++ and to integrate it with R through the RcppArmadillo interface. ARPACK is widely used to solve large-scale eigenvalue problems, but it was written in the now-dated Fortran 77 language and its development has been inactive for a long time. This project will rewrite its main functionality and redesign the API so that R, as well as other programs, can easily interface with this library.
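Iterative methods of the kind ARPACK implements compute a few extreme eigenvalues without factorizing the whole matrix. A minimal power-iteration sketch in base R illustrates the idea (purely illustrative; this is not the project's API):

```r
# Power iteration: repeatedly multiply by A and renormalize; the iterate
# converges to the dominant eigenvector, and the Rayleigh quotient to the
# dominant eigenvalue. ARPACK's Lanczos/Arnoldi methods refine this idea.
power_iter <- function(A, tol = 1e-10, maxit = 1000) {
  x <- rnorm(ncol(A))
  x <- x / sqrt(sum(x^2))
  lambda <- 0
  for (i in seq_len(maxit)) {
    y <- A %*% x
    lambda_new <- drop(crossprod(x, y))   # Rayleigh quotient x' A x
    x <- drop(y) / sqrt(sum(y^2))
    if (abs(lambda_new - lambda) < tol) break
    lambda <- lambda_new
  }
  list(value = lambda_new, vector = x)
}

set.seed(42)
A <- crossprod(matrix(rnorm(100), 10, 10))  # symmetric PSD test matrix
res <- power_iter(A)
# The dense solver eigen() computes *all* eigenvalues; for large sparse
# matrices that is exactly the cost ARPACK-style methods avoid.
```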
  • Advanced techniques in Risk and Asset Allocation This project's goal is to build on the work of two successful GSoC projects (2012, 2013) that expanded on the work of Ram Ahluwalia at Wingfoot Capital and established the standalone package Meucci by porting Attilio Meucci's original MATLAB source. The main part of this project focuses on extending the package's current functionality to cover the latest technical developments in this area; the second part will focus on the continued improvement of the Meucci package.
  • An accelerated SDCA framework for statistical learning toolbox in R Stochastic learning procedures have become an increasingly popular framework, especially for large-scale optimization problems. Our goal in this project is to develop a generic computational system for a large family of linear regression/classification models via the accelerated stochastic dual coordinate ascent (SDCA) method. We describe the design and implementation plan of our project and propose a timeline for our development.
  • Biodiversity Data from Social Networking Sites Ecology, biodiversity, climate change, and invasive species research all use primary biodiversity occurrence data as their basic unit. Even the Global Biodiversity Information Facility, the largest repository of this data, which serves ~530M records, has gaps and biases in terms of taxonomy, geography, and seasonality. We propose an R package that wraps the APIs of social networking sites such as Flickr and Picasa and makes the data available to users in prevailing international standard formats like Darwin Core and Audubon Core.
  • Convert Kevin Dowd's MATLAB code from 'Measuring Market Risk' Risk analysis is an integral part of portfolio management and is widely used by researchers, students, and practitioners alike. We aim to convert Dowd's MATLAB code for the measurement and visualization of market risk to R and package it for distribution. Considering the wider use of R compared to MATLAB for statistical analysis, thousands of users will benefit from this. Expanding the PerformanceAnalytics package with functions from the MATLAB toolbox will benefit an equally large user base.
  • Covariance Matrix Estimation for Markov Chain Monte Carlo in R I propose to extend the current mcmcse package in R to include multivariate extensions of the batch means, overlapping batch means, and spectral variance estimators of the Monte Carlo error. In addition, I propose to add functions to this package that make use of this joint estimation.
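The univariate batch means estimator that the multivariate work generalizes fits in a few lines of base R (an illustrative sketch only; the package's own interface is the mcmcse functions, not this helper):

```r
# Batch means: split a correlated chain into b batches of size m; the
# batch means are approximately independent, so m * var(batch means)
# estimates the asymptotic variance of the chain's sample mean.
batch_means_se <- function(x, m = floor(sqrt(length(x)))) {
  n <- length(x)
  b <- floor(n / m)                            # number of batches
  means <- colMeans(matrix(x[1:(b * m)], nrow = m))
  sigma2 <- m * var(means)                     # asymptotic variance estimate
  sqrt(sigma2 / n)                             # Monte Carlo standard error
}

set.seed(1)
x <- arima.sim(list(ar = 0.5), n = 1e4)  # autocorrelated stand-in for an MCMC chain
se <- batch_means_se(x)
```

The multivariate extension replaces the scalar `var(means)` with the covariance matrix of vector-valued batch means, which is what enables joint inference across parameters.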
  • Covariance Matrix Estimators There exists a wide range of methods for estimating covariance matrices for asset returns that are useful in risk models and portfolio optimization. The methods include shrinkage estimators, estimators for handling unequal histories, and robust estimators that are not much influenced by outliers, among others. This project will create an R covariance estimators package whose resulting estimators may be easily used in factor-based risk models and in portfolio optimization.
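To give a flavor of the shrinkage family mentioned above, here is a minimal sketch that shrinks the sample covariance toward a scaled identity target (the fixed intensity `lambda` is an assumption for illustration; Ledoit-Wolf-type estimators choose it from the data):

```r
# Linear shrinkage: convex combination of the noisy sample covariance S
# and a well-conditioned target, here mean(diag(S)) * I.
shrink_cov <- function(R, lambda = 0.2) {
  S <- cov(R)
  target <- mean(diag(S)) * diag(ncol(R))
  (1 - lambda) * S + lambda * target
}

set.seed(3)
R <- matrix(rnorm(60 * 10), 60, 10)  # 60 periods of returns on 10 assets
Sigma <- shrink_cov(R)               # better conditioned than cov(R) alone
```

Shrinkage trades a little bias for a large variance reduction, which is why such estimators behave better than the raw sample covariance when the number of assets is large relative to the history length.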
  • Expanding mlr's visualization and exploratory data analysis capabilities I will expand comparison plots of learners, add interactive visualization via ggvis that duplicates all existing plotting functionality, and add methods to visualize performance on subsets of the feature space. I will add partial and marginal dependence for all supervised learners, associated plotting methods, and feature importance measures. This will be done in C++ via Rcpp where possible.
  • Fast Implementation of SVM-based ensemble learners in mlr Ensemble learning is a popular and useful machine learning technique. However, only a limited number of R packages focus on this task. I will implement several SVM-based ensemble learners in the R package mlr. The implementation will focus on both accuracy and efficiency; efficiency will be boosted by C++-based code. There will also be improvements in ensemble selection algorithms for stacked learners.
  • GEO-AWS: Gene Expression Omnibus Analysis with Shiny This project will involve the development of a web-based tool, using the R web framework package Shiny, which will provide users with a simple interface for analyzing gene expression datasets from the Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/), thereby making statistical computing and data visualization more accessible.
  • Implementing new predict methods for spatial econometrics R packages Extending the predict methods of the spdep, splm, and sphet R packages with new predictors from the research literature. With the provision of these predict methods, the usefulness of the corresponding estimation methods will increase considerably in applied fields such as real estate, public economics, environmental impacts, epidemiology, criminology, and industrial location.
  • Implementing Statistical Fitting Algorithms for R through NIMBLE The NIMBLE R package implements statistical algorithms that can be used with models written in the BUGS language. We plan to expand the inferential capabilities of the NIMBLE package by implementing a variety of important algorithms that are not yet featured in NIMBLE. Specifically, we plan on implementing various flavors of sequential Monte Carlo, as well as Laplace approximation and integrated nested Laplace approximation.
  • Improving markovchain package The markovchain package contains classes and methods to easily handle Discrete Time Markov Chain (DTMC) processes. It is currently very helpful for performing standard analyses of DTMCs and is easy to use. Work is needed both on internal code optimization (keeping the end-user interface the same) and on enhancements to the statistical and probabilistic functions provided (more fitting methods, estimation of parameters by maximum likelihood, and improved algorithms to classify states).
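The computations the package wraps in S4 classes are ultimately transition-matrix algebra; a base-R sketch shows two of the standard DTMC analyses (the two-state "weather" chain here is a made-up example):

```r
# A DTMC is defined by its row-stochastic transition matrix P.
P <- matrix(c(0.7, 0.3,
              0.4, 0.6), nrow = 2, byrow = TRUE,
            dimnames = list(c("sunny", "rainy"), c("sunny", "rainy")))

# Two-step transition probabilities: the matrix power P %*% P.
P2 <- P %*% P

# Stationary distribution: the normalized left eigenvector of P for
# eigenvalue 1 (R's eigen() sorts asymmetric eigenvalues by modulus,
# so it comes first for an irreducible aperiodic chain).
ev <- eigen(t(P))
statd <- Re(ev$vectors[, 1])
statd <- statd / sum(statd)   # long-run share of sunny vs. rainy days
```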
  • Improving mlr's hyperparameter and tuning system The R package mlr is a powerful tool for executing various machine learning tasks. Thanks to its unified interface to different learners, users may employ them without learning the syntax of each one. By implementing task-dependent defaults for the learners' parameters, mlr will become more user-friendly and the model fit of the learners will improve. The parameter setup will also be integrated into further methods, e.g. allowing tuning methods to use these defaults.
  • Interactivity with Legends and other new Features for Animint Animint is an R package for making interactive animated data visualizations on the web using ggplot syntax. This project will enhance the current package in several ways: allowing the user to interact with the legends, adding fragment identifiers to the animations, and improving the Animint documentation. These improvements will help Animint be a more powerful and accessible tool for R programmers.
  • New features and optimizations for animint This project aims to implement several new features and to optimize compilation for animint. The new features include aesthetics that only make sense on the web/SVG, aesthetics for configuring the appearance of selected items, and shape and arrow aesthetics. In addition, .parallel and .compress options and a new approach to generating chunk TSV files will be added to the animint compiler as optimizations.
  • Nlmeviz In the field of population pharmacokinetics, modeling software such as NONMEM and R packages such as nlme are used to fit models. The aim of this project is to create nlmeviz, an R package with dynamic interactive plots for diagnosing nonlinear mixed effects models. This package will provide dynamically linked plots to help diagnose these models.
  • R Implementation of a General Semi-parametric Shared Frailty Model Convenient and reliable means of survival analysis are critical to practitioners. Some tools exist for estimating the parameters of a shared frailty model, but they are currently incomplete and lag behind the state of the art. This work proposes an R package based on established modern research in estimating the parameters of a shared frailty model.
  • Stochastic Average Gradient: Large Scale Statistical Learning for R The efficiency, and even feasibility, of large-scale statistical learning is highly dependent on the computational complexity of the underlying optimization algorithm. Traditional optimization techniques do not scale linearly with sample size. This project aims to bring the Stochastic Average Gradient (SAG) method of Le Roux, Schmidt, and Bach (2014) to R. SAG is a large-scale learning algorithm that, in the case of strongly convex cost functions, scales linearly with sample size.
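The key idea of SAG is to keep one stored gradient per sample and refresh only one of them per iteration, so each step costs as much as a stochastic gradient step but uses an average over all samples. A minimal sketch for noiseless least squares (illustrative only; the fixed step size and the `sag_ls` helper are assumptions, not the project's interface):

```r
# SAG for f(w) = (1/n) * sum_i (x_i'w - y_i)^2 / 2.
# G stores the last evaluated gradient of each sample; d is their sum.
sag_ls <- function(X, y, alpha = 0.01, iters = 5000) {
  n <- nrow(X); p <- ncol(X)
  w <- numeric(p)
  G <- matrix(0, n, p)   # per-sample gradient memory
  d <- numeric(p)        # running sum of stored gradients
  for (t in seq_len(iters)) {
    i <- sample.int(n, 1)
    g <- X[i, ] * drop(X[i, ] %*% w - y[i])  # fresh gradient of sample i
    d <- d - G[i, ] + g                      # replace its stored copy
    G[i, ] <- g
    w <- w - (alpha / n) * d                 # step along the averaged direction
  }
  w
}

set.seed(2)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), n, p)
w_true <- rnorm(p)
y <- drop(X %*% w_true)   # noiseless, so the minimizer is exactly w_true
w_hat <- sag_ls(X, y)
```

Because only one gradient is recomputed per iteration, the per-step cost is O(p) rather than O(np), which is where the linear scaling in sample size comes from.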
  • Subsetted and parallel computations in matrixStats I aim to improve the performance of the matrixStats package to provide users with faster and more memory-conservative functions. The proposed work includes expanding existing matrixStats functions to support optional arguments for subsetted computations, implementing support for parallel processing, and tuning parallel setup arguments in order to make matrixStats smarter about parallel processing. I will also write sufficient tests, benchmark reports, and relevant documentation.
  • Test timings on Travis The idea of this project is to provide a package with functions that make it easy for R package developers to track quantitative performance metrics of their code over time. It focuses on reporting changes in a package's performance metrics, most importantly time and memory, across subsequent development versions. It integrates with the git version control system and the Travis CI service to provide and visualize the aforementioned metrics, among other related functions.
  • Turning R objects into Pandoc's markdown Pander is an R package for rendering R objects as markdown. It provides extensive support for different R classes combined with a lot of rendering options. During this GSoC session I want to improve four things about the pander package: 1. improve the current test suite; 2. refactor the pandoc.table function; 3. add rendering support for not-yet-supported R classes; 4. create use-case-specific vignettes and add more examples.
  • WRDS R package Create a new R package for working with data from the CRSP/Compustat Merged (CCM) database from Wharton Research Data Services (WRDS). WRDS is a common portal for accessing the Compustat database of corporate fundamental data and the CRSP database of security prices and returns. This data is typically downloaded in large flat files which need ECTL operations performed prior to the data being usable for research and modeling. The wrds package is intended to automate and simplify this process.