GSoC/GCI Archive
Google Summer of Code 2013

National Evolutionary Synthesis Center (NESCent)

Web Page: http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2013

Mailing List:

NESCent facilitates synthetic research on grand challenge questions in evolutionary biology and also works to address critical needs in software infrastructure and education through promoting open, collaborative development of interoperable and standards-supporting open-source software. NESCent participates in GSoC as an umbrella organization for many projects in evolutionary informatics. 

Projects

  • Codon Alignment and Analysis in Biopython: A codon alignment is an alignment of nucleotide sequences in which trinucleotides correspond directly to amino acids in the translated product. Codon alignment is useful because it distinguishes different types of nucleotide substitution and is frequently used to test neutrality and calculate selection pressure. My job in the project is to develop codon alignment support for Biopython. This will facilitate quick, large-scale evolutionary analysis in the Biopython environment.
  • Extend PartitionFinder to automatically partition DNA and protein alignments: PartitionFinder is a piece of phylogenetics software that combines similar sets of sites in a DNA or amino acid alignment into a partitioning scheme. The advantage of using a partitioning scheme is that each subset of sites can be modeled independently, which can have a substantial impact on the results of a phylogenetic analysis. PartitionFinder currently requires that the user predefine subsets (users often choose to do so by gene or codon position). Such assignments can be arbitrary and may not result in the best-fit model. This is especially true for alignments that include several different types of data (ultraconserved elements, introns, etc.), which are becoming increasingly common. This project will expand the utility of the software by implementing a new algorithm that automatically splits either user-defined subsets or entire alignments into one or more new subsets using site-specific substitution rates. This new functionality will result in partitioning schemes designed to closely reflect biological processes.
  • Identifying problems with gene predictions: Genome sequencing is now possible at almost no cost. However, accurate gene predictions remain hard to obtain with existing technology. GeneValidator is a tool that identifies problems with gene predictions, based on similarities with data from public databases. We apply a set of validation tests that provide useful information about the problems that appear in the predictions, giving evidence for how a gene should be curated or whether a given predicted gene should be excluded from other analyses.
  • Implementing Machine Learning Algorithms for Classification and Feature Selection in mothur: A proposal to evaluate, select, and implement machine learning algorithms for classification and feature selection in the metagenomic data analysis program mothur.
  • Imran's proposal to Extend PhyML to use the BEAGLE library: As the title suggests, this project aims to use CUDA/OpenCL in PhyML (via the BEAGLE library) for the likelihood calculations. These form the core of the computation for maximum likelihood estimation of a phylogeny. The challenge is in interpreting the various phylogenetic parameters (tree topology, evolutionary rate categories, eigendecomposition of the substitution matrix, etc.) and making the appropriate BEAGLE calls (which are subsequently pipelined to the CPU/GPU). I look forward to contributing in a meaningful way. Moreover, this project will also help me to align my graduate research direction.
  • Phylogenetics in Biopython: Filling in the gaps: Biopython is a set of open-source Python packages and modules for bioinformatics work. The Bio.Phylo package already implements some basic phylogenetics tasks: basic tree operations, parsers for Newick, Nexus and PhyloXML, and wrappers for PhyML, RAxML and PAML. However, some important components remain to be implemented to better support phylogenetic workflows. These include simple tree construction algorithms, consensus tree searching, and tree visualization.
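
To make the codon alignment definition from the first project above concrete, here is a minimal sketch in plain Python. It assumes a protein alignment is already available and threads each unaligned coding sequence onto its aligned protein row, so that every amino acid column becomes one trinucleotide column. The function name `thread_codons` and the toy sequences are hypothetical illustrations, not part of Biopython or of the project's actual design, and the sketch does not check that each codon really translates to the residue it is paired with.

```python
def thread_codons(aligned_protein, nucleotides, gap="-"):
    """Build one codon-alignment row from an aligned protein row
    and the matching unaligned coding sequence (hypothetical helper)."""
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(nucleotides) != 3 * n_residues:
        raise ValueError("coding sequence length must be 3x the residue count")

    codons = []
    pos = 0  # index of the next unused nucleotide
    for aa in aligned_protein:
        if aa == gap:
            # A protein gap becomes a three-column nucleotide gap.
            codons.append(gap * 3)
        else:
            # Each residue consumes exactly one codon.
            codons.append(nucleotides[pos:pos + 3])
            pos += 3
    return "".join(codons)


# Toy input: two aligned protein rows and their coding sequences.
protein_alignment = {"seq1": "MK-L", "seq2": "MKAL"}
coding_sequences = {"seq1": "ATGAAACTG", "seq2": "ATGAAGGCTCTG"}

codon_alignment = {
    name: thread_codons(protein_alignment[name], coding_sequences[name])
    for name in protein_alignment
}

for name, row in codon_alignment.items():
    print(name, row)
```

In this toy case the second codon differs between the two rows (AAA versus AAG) while both encode lysine, which is the kind of synonymous difference a codon alignment lets an analysis separate from amino-acid-changing substitutions.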