GSoC/GCI Archive
Google Summer of Code 2012

R project for statistical computing

Web Page: http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2012&s=gsoc

Mailing List: https://groups.google.com/group/gsoc-r?pli=1

The R Foundation (as the legal entity behind the R Project) is a not-for-profit organization working in the public interest. It was founded by members of the R Development Core Team in order to

  • Provide support for the R project and other innovations in statistical computing. We believe that R has become a mature and valuable tool and we would like to ensure its continued development and the development of future innovations in software for statistical and computational research.
  • Provide a reference point for individuals, institutions or commercial enterprises that want to support or interact with the R development community.
  • Hold and administer the copyright of R software and documentation.

R is an official part of the Free Software Foundation’s GNU project, and the R Foundation has goals similar to those of other open source software foundations such as the Apache Software Foundation or the GNOME Foundation.

Among the goals of the R Foundation are the support of continued development of R, the exploration of new methodology, teaching and training in statistical computing, and the organization of meetings and conferences with a statistical computing orientation.

Projects

  • Access and visualization of biodiversity data in R rOpenSci is a collaborative effort to develop R-based tools for facilitating Open Science. Biodiversity occurrence records can be accessed using rOpenSci tools, but there is a need to strengthen that functionality and to develop visualizations and tutorials that help users learn and apply the tools. This proposal addresses that need and also proposes back-end database support to work around R's memory limits when using rOpenSci with larger data sets (see the first sketch after this list).
  • Add additional closed form and global optimizer backends to PortfolioAnalytics This project would first add support for quadratic, linear, and conic solvers to PortfolioAnalytics, along with some intelligence to detect optimization objectives suitable for this kind of back-end. For example, mean-variance portfolios with simple box constraints are quadratic, mean-Gaussian-ETL objectives are linear or conic, and so on. The project would additionally add support for more global optimization solvers, such as a particle swarm optimizer (a quadratic programming sketch follows this list).
  • Additional performance measures and attribution functionality to PerformanceAnalytics PerformanceAnalytics is an R package that provides a collection of econometric functions for performance and risk analysis. It applies current research on return-based analysis of strategy or fund returns for risk, autocorrelation and illiquidity, persistence, style analysis and drift, clustering, quantile regression, and other topics. PerformanceAnalytics has long enjoyed contributions from users who would like to see specific functionality included. In particular, Diethelm Wuertz of ETHZ (and the well-known Rmetrics packages) has very generously contributed a large set of functions based on Bacon (2008). This is a great starting point, but many of these functions need to be finished and converted so that the functions and interfaces are consistent with other PerformanceAnalytics functions, and are appropriately documented. In addition, the functionality proposed in Bacon is incomplete in the contributed functions; completing it will require writing or re-writing functions with reference to Bacon (2008) and other underlying references. Lastly, all the functions must be well documented using the roxygen2 package (a documentation sketch follows this list).
  • Aggregate CRAN package download statistics across multiple mirrors The goal of this project is to collect package download data from CRAN mirrors in a central location. Using cloud-computing services such as Amazon Web Services, Rackspace Cloud, or Google App Engine, the data will be aggregated and relevant statistics will be computed. The proposed approach enables us to collect the number of downloads of a package and break it down by package version, R version, and operating system (an aggregation sketch follows this list). The statistics would be presented on a user-friendly website accessible to the public.
  • Bayesian Spatial Econometrics with R A collection of R functions for Bayesian and spatial econometrics.
  • Biganalysis: A robust, general-purpose R package for large scale classification We propose to develop a robust, general-purpose R package for large-scale classification. Our aim is to implement several novel and reliable rank-based classification and feature selection methods, including linear discriminant comparison analysis (LDCA), pairwise-comparison-based classification and regression trees (TSP-CART), the TSP-CART based random forest, and the TSP-CART based gradient boosting algorithm. We will also consider structured versions of these algorithms, which allow us to easily incorporate prior structural information into data analysis. Targeted applications of this package include large-scale scientific data analysis, marketing data analysis, and web data analysis.
  • BigMatrix: Super Scalable Predictive Analytics for Big Matrices in R The big data era is coming, and it calls for revolutionary new levels of capacity and performance for the statistical analysis of very big matrices in the popular R language. This project aims to provide super-scalable predictive analytics to meet that challenge. Using the proposed “BigMatrix” package, R users will be able to analyze (feature selection, classification, clustering, etc.) data sets that are both big (even larger than RAM) and messy (lots of missing values); a file-backed storage sketch follows this list. The project is targeted at a wide range of researchers, and potential applications include, but are not limited to, financial marketing data, big genomic data, and huge web data.
  • Develop an R package interfacing the computer algebra system Maxima The idea is to create a package that gives R users access to the capabilities of the computer algebra system Maxima, which is much more powerful than the symbolic algebra facilities currently available for R (a minimal interfacing sketch follows this list).
  • Dynamic report generation in the web with R Dynamic report generation can rescue us from repetitive and unnecessary manual tasks. In this project, we will extend the knitr package for better applications on the web, so that web pages can be easily compiled with R, like Sweave documents. The knitr package is a general-purpose tool for dynamic report generation; the basic idea is like Sweave's: we write source code in a document mixed with normal text, then use a tool to evaluate the code and produce output, so that we do not need to copy and paste results from the computing engine into the output document. Knitr has a flexible design and can be extended to multiple output formats via output hooks. In this project, we will enhance the HTML features of the knitr package, including code highlighting, animation support, markdown conversion, and integration with website-building tools (a minimal knitr sketch follows this list).
  • Extend RTAQ for additional high frequency time series analysis The economic value of analyzing high-frequency financial data is now obvious, in both the academic and the financial world. The goal of this project is to develop much-needed extra functionality in this growing area. The project aims at (i) providing a convenient interface for high-frequency data management, and (ii) making the latest developments in (co-)volatility estimation and forecasting based on high-frequency financial data available to the community.
  • HyperSpec: Parallelization and Optimization Provide parallelization for targeted areas of the hyperSpec package. Offer a flexible interface allowing a choice between implicit and explicit parallelization, and types thereof (a small parallel sketch follows this list). Incorporate data.table. Produce benchmarks of observed results on large data sets.
  • Improvements to xts time series visualization and subsetting A major extension to the plotting and analytic capabilities of the xts package, with the intent of providing a full time series suite around a single S3 class designed for high-performance time series work. Intended work includes major extensions to plot.xts to allow multi-column time series and block highlighting (a baseline plotting sketch follows this list), and implementation of major ts functions [ARIMA, Holt-Winters, StructTS] for irregular data. Possible work includes new data structures to allow multiple data types in a single xts/zoo object.
  • Inclusion of Attilio Meucci's implementations in ReturnAnalytics The existing PerformanceAnalytics, PortfolioAnalytics and FactorAnalytics packages already include tools and functions for performance and risk analysis as well as portfolio optimization. More functions related to the advanced risk and portfolio management methods proposed by Attilio Meucci can be added. This proposal focuses on extending the mentioned packages with those developments.
  • Interactive dendrogram An interactive dendrogram + heatmap plot for exploring the results of hierarchical clustering analysis (a static baseline sketch follows this list).
  • Portfolio Performance Measurement and Benchmarking The aim of the project is to add new functionality to R, as described in the book "Portfolio Performance Measurement and Benchmarking" by J. Christopherson, D. Carino, and W. Ferson. This will involve adding new functions to existing R packages (PerformanceAnalytics, Blotter). Some of this functionality has already been implemented to some extent, while other parts (performance attribution) are not available in any existing R package. The following directions of work are suggested: 1. Performance attribution: Ch. 14-19 (Carino, 2009), Ch. 5-10 (Bacon, 2008); 2. Calculation of returns: Ch. 3-5 (Carino, 2009), Ch. 3 (Bacon, 2008); 3. Return and risk metrics: Ch. 10, 12, 13 (Carino, 2009), Ch. 4 (Bacon, 2008).
  • SAM: A General-purpose Classifier for Modern Predictive Data Analysis The increasing complexity of modern data acquisition poses a great challenge to traditional SVM classifiers in predictive data analysis. This project aims at providing an efficient and scalable implementation of the Sparse Additive Machine (SAM), which can conduct reliable non-linear classification and variable selection simultaneously. The package has the potential to become a general-purpose classifier for a wide range of data analysis practitioners. It targets large-scale classification in scientific data analysis (e.g. genomics, proteomics, bio-imaging), social media data analysis (e.g. image, audio, video, text modeling), and financial time-series analysis.
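
Code sketches

The sketches below are minimal, hedged illustrations of techniques the proposals above refer to. They are not the projects' actual code; file names, column layouts, and helper functions introduced here are hypothetical, and each sketch assumes the named package is installed.

Biodiversity data access (rOpenSci): a sketch of fetching occurrence records with the rgbif package. occ_search() is the current rgbif interface; the 2012-era function names differed.

    library(rgbif)

    ## Look up the GBIF backbone key for a species, then fetch occurrences.
    key <- name_backbone(name = "Puma concolor")$usageKey
    occ <- occ_search(taxonKey = key, limit = 50)

    ## Columns suitable for plotting records on a map.
    occ$data[, c("scientificName", "decimalLatitude", "decimalLongitude")]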
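
Closed-form portfolio back-ends (PortfolioAnalytics): a sketch of the kind of quadratic solver such a back-end would dispatch to, here a long-only minimum-variance portfolio solved with the quadprog package on simulated toy returns.

    library(quadprog)

    set.seed(42)
    R     <- matrix(rnorm(200 * 4, sd = 0.01), ncol = 4)  # toy return series
    Sigma <- cov(R)
    n     <- ncol(R)

    ## Minimize w' Sigma w  subject to  sum(w) = 1 (equality)  and  w >= 0.
    Amat <- cbind(rep(1, n), diag(n))   # first column is the equality constraint
    sol  <- solve.QP(Dmat = 2 * Sigma, dvec = rep(0, n),
                     Amat = Amat, bvec = c(1, rep(0, n)), meq = 1)
    round(sol$solution, 4)              # optimal portfolio weights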
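
PerformanceAnalytics documentation: a sketch of the roxygen2 style the converted Bacon (2008) functions would follow. The gain-loss ratio implementation below is illustrative, not the package's own code.

    #' Bernardo and Ledoit gain-loss ratio (illustrative)
    #'
    #' Ratio of the expected value of positive returns to the expected
    #' value of negative returns, one of the measures in Bacon (2008).
    #'
    #' @param R a vector of periodic returns
    #' @return the gain-loss ratio as a single numeric value
    #' @references Bacon, C. Practical Portfolio Performance Measurement
    #'   and Attribution, 2nd edition, Wiley, 2008.
    #' @export
    BernardoLedoitRatio <- function(R) {
      mean(pmax(R, 0)) / mean(pmax(-R, 0))
    }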
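
Download-statistics aggregation (CRAN mirrors): a sketch of the tabulation step, assuming mirror logs have already been normalized into a CSV with hypothetical columns package, version, r_version, and os; real Apache access logs would need parsing first.

    logs <- read.csv("mirror-log.csv", stringsAsFactors = FALSE)  # hypothetical file
    logs$count <- 1

    ## Downloads broken down by package version, R version, and OS.
    tab <- aggregate(count ~ package + version + r_version + os,
                     data = logs, FUN = sum)
    head(tab[order(-tab$count), ])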
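
Larger-than-RAM matrices (BigMatrix): a sketch of the file-backed storage a package like the proposed BigMatrix would build on, shown with the existing bigmemory package rather than the proposal's own API.

    library(bigmemory)

    ## A 1,000,000 x 10 matrix whose data live on disk, not in RAM.
    X <- filebacked.big.matrix(nrow = 1e6, ncol = 10, type = "double",
                               backingfile = "X.bin", descriptorfile = "X.desc")
    X[1, ] <- rnorm(10)   # read/write through the usual subscript syntax
    dim(X)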
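
Interfacing Maxima: a minimal sketch of the bridge the package would formalize, shelling out to a local Maxima binary (assumed to be installed and on the PATH) and capturing its text output. A real interface would manage a persistent session and parse results back into R objects.

    maxima_eval <- function(expr) {
      ## --very-quiet suppresses the banner; --batch-string runs one command.
      arg <- paste0("--batch-string=", shQuote(paste0(expr, ";")))
      system2("maxima", args = c("--very-quiet", arg), stdout = TRUE)
    }

    maxima_eval("integrate(x^2, x)")   # result, e.g. x^3/3, appears in the output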
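
Dynamic reports with knitr: a minimal sketch of the workflow the project extends, compiling a markdown source with an embedded R chunk. Converting the resulting .md file to HTML (e.g. with the markdown package) is the step the web integration targets.

    library(knitr)

    writeLines(c(
      "# A tiny report",
      "",
      "```{r cars-summary}",
      "summary(cars)",
      "```"
    ), "report.Rmd")

    knit("report.Rmd")   # evaluates the chunk, writes report.md with output inlined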
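
hyperSpec parallelization: a sketch of the explicit-parallelization option, applying a hypothetical per-spectrum function over a list of spectra with the parallel package; mclapply() forks on Unix-alikes and falls back to serial on Windows.

    library(parallel)

    spectra <- replicate(16, rnorm(2048), simplify = FALSE)  # stand-in spectra
    f <- function(s) sum(s * s)          # hypothetical per-spectrum operation

    serial <- lapply(spectra, f)                  # serial baseline
    forked <- mclapply(spectra, f, mc.cores = 2)  # explicit parallel version
    identical(serial, forked)                     # same results either way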
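
xts plotting baseline: a sketch of what plot.xts handles today, a single univariate series; multi-column plotting and block highlighting are among the proposed extensions.

    library(xts)

    x <- xts(cumsum(rnorm(100)), order.by = Sys.Date() - 99:0)
    plot(x, main = "A univariate xts series")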
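
Dendrogram + heatmap baseline: a sketch of the static plot the interactive version would start from, using base R hierarchical clustering on a built-in data set.

    X  <- scale(USArrests)                  # standardized numeric matrix
    hc <- hclust(dist(X))                   # hierarchical clustering
    plot(as.dendrogram(hc))                 # plain dendrogram

    ## Heatmap with the same row dendrogram attached.
    heatmap(X, Rowv = as.dendrogram(hc), scale = "none")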