Tuesday, March 3, 2015

GreenGenes - Phylogenetics Background

Been working on this for a few weeks, but I'll summarize:

Brief overview of Phylogenetics:

Multiple Sequence Alignment
aligns three or more sequences; most tools start by generating a score between every pair of sequences

MUSCLE - multiple alignment software; includes fast distance estimation using k-mer counting
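
A rough sketch of what a k-mer based distance can look like. The formula here (one minus the fraction of shared k-mers) is a toy definition chosen for illustration, not MUSCLE's actual implementation, and the function names are made up:

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_distance(seq_a, seq_b, k=3):
    """1 - (shared k-mers / k-mers in the shorter sequence).

    Illustrative definition only; 0.0 means identical k-mer content.
    """
    shortest = min(len(seq_a), len(seq_b)) - k + 1
    if shortest <= 0:
        raise ValueError("sequences must be at least k long")
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    shared = sum((counts_a & counts_b).values())  # multiset intersection
    return 1.0 - shared / shortest

print(kmer_distance("ACGTACGT", "ACGTACGT"))  # identical sequences
print(kmer_distance("ACGTACGT", "ACGTTCGT"))  # one substitution
print(kmer_distance("AAAAAAAA", "CCCCCCCC"))  # no k-mers in common
```

No alignment is needed to compute this, which is why k-mer distances are cheap enough to use on unaligned input.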

ClustalW - takes a set of input sequences and carries out progressive alignment
   --> sequences are aligned in pairs in order to generate a distance matrix
   --> uses the Neighbor-Joining method to produce an unrooted tree, which serves as the guide for the multiple alignment

INPUT DATA                          METHOD
2 - 100 protein seqs                MUSCLE
100 - 500 seqs globally aligned
> 500 seqs
small number of large seqs          ClustalW

Genetic Distance and Nucleotide Substitution Models
Genetic Distance - estimate of the evolutionary distance between two sequences, e.g. the expected number of substitutions per site
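
As a small worked example, here is a sketch of two common distances: the raw proportion of differing sites (p-distance) and the Jukes-Cantor (JC69) correction, d = -3/4 ln(1 - 4p/3). JC69 is picked here only because it is the simplest substitution model; gaps are not handled and would simply count as mismatches:

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return mismatches / len(seq_a)

def jukes_cantor_distance(seq_a, seq_b):
    """JC69-corrected distance in expected substitutions per site."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        raise ValueError("p-distance too large for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

a = "ACGTACGTAC"
b = "ACGTTCGTAA"
print(p_distance(a, b))             # observed differences per site
print(jukes_cantor_distance(a, b))  # larger: corrects for hidden multiple hits
```

The corrected distance is always at least as large as the p-distance, because some sites will have changed more than once.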

Rate Heterogeneity among sites - the rate of nucleotide substitution can vary substantially between positions
   --> Modeled with a Gamma distribution of relative rates - expectation 1.0 with variance 1/alpha, so a small alpha means strong rate variation
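
A quick way to check the mean and variance claim is to sample relative rates with shape alpha and scale 1/alpha. This is just a simulation sketch using the standard library; the function name and the alpha values are arbitrary choices:

```python
import random
import statistics

def sample_site_rates(alpha, n_sites, seed=0):
    """Draw one relative substitution rate per site from Gamma(alpha, 1/alpha)."""
    rng = random.Random(seed)
    # shape = alpha, scale = 1/alpha gives mean 1.0 and variance 1/alpha
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

for alpha in (0.5, 2.0):
    rates = sample_site_rates(alpha, 100_000)
    print(alpha, statistics.mean(rates), statistics.pvariance(rates))
```

With alpha = 0.5 most sites get a rate near zero and a few evolve very fast; with alpha = 2.0 the rates cluster much more tightly around 1.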

Phylogenetic Inference based on Distance Methods
Try to fit a tree to a matrix of genetic distances

Minimum Evolution (ME) - distance method that constructs additive trees and picks the one with the smallest total branch length
Neighbor-Joining (NJ) - greedy approximation of ME: at each step joins the pair of neighboring OTUs that minimizes the total tree length
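
To show how the pair of neighbors gets picked, here is a sketch of a single Neighbor-Joining step using the standard Q criterion, Q(i,j) = (n - 2) d(i,j) - r(i) - r(j), where r is the row sum of the distance matrix. The function name and the 5-OTU matrix are made up for illustration, and a full implementation would go on to merge the pair and repeat:

```python
def nj_join_step(names, d):
    """One NJ step: return the pair to join and its two branch lengths."""
    n = len(names)
    if n < 3:
        raise ValueError("need at least 3 OTUs")
    r = [sum(row) for row in d]  # total distance from each OTU to all others
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - r[i] - r[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    # Branch lengths from each joined OTU to the new internal node
    len_i = 0.5 * d[i][j] + (r[i] - r[j]) / (2 * (n - 2))
    len_j = d[i][j] - len_i
    return names[i], names[j], len_i, len_j

names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_join_step(names, dist))  # → ('a', 'b', 2.0, 3.0)
```

Note that d and e are the closest pair in the raw matrix (distance 3), but a and b are joined first: the Q criterion accounts for how far each OTU is from everything else, not just from its partner.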

Phylogenetic Inference using Maximum Likelihood (ML) Methods
Looks for the highest probability of the observed data under a set of parameters
Determines the tree topology, branch lengths, and parameters of the evolutionary model that maximize the probability of observing the sequences in a particular arrangement
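
The smallest possible version of this is a "tree" with two sequences and one branch. The sketch below assumes the JC69 model and finds the branch length with the highest likelihood by a plain grid search; real ML programs use numerical optimizers and full trees, so this is only meant to show the principle. All names are made up:

```python
import math

def jc69_log_likelihood(seq_a, seq_b, d):
    """Log-likelihood of two aligned sequences separated by branch length d."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e  # P(site ends on the same base)
    p_diff = 0.25 - 0.25 * e  # P(site ends on one specific other base)
    log_l = 0.0
    for x, y in zip(seq_a, seq_b):
        # 0.25 is the equilibrium frequency of the starting base under JC69
        log_l += math.log(0.25 * (p_same if x == y else p_diff))
    return log_l

def ml_branch_length(seq_a, seq_b, step=0.001, max_d=2.0):
    """Grid search for the branch length that maximizes the likelihood."""
    grid = [step * k for k in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(seq_a, seq_b, d))

a = "ACGTACGTAC"
b = "ACGTTCGTAA"
print(ml_branch_length(a, b))
```

For these two sequences (2 differences in 10 sites) the grid optimum should land next to the analytic JC69 distance, -3/4 ln(1 - 4(0.2)/3), which is roughly 0.233.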

--> GOAL - to find the tree, among all possible tree topologies, that maximizes the global likelihood
However, evaluating all possible trees is computationally infeasible -> need heuristics:
1. Stepwise Addition
2. Star Decomposition
3. Neighbor-Joining
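
To see why exhaustive search is out of the question: the number of distinct unrooted, fully bifurcating trees for n taxa is (2n - 5)!! = 3 x 5 x ... x (2n - 5). A small sketch (function name is arbitrary):

```python
def count_unrooted_trees(n_taxa):
    """Number of distinct unrooted bifurcating trees: (2n - 5)!!"""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):  # 3 * 5 * ... * (2n - 5)
        count *= k
    return count

for n in (4, 5, 10, 20, 50):
    print(n, count_unrooted_trees(n))
# 10 taxa already give 2027025 trees
```

At 50 taxa the count is far beyond anything that could be scored one tree at a time, which is why every practical ML program searches heuristically.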

PhyML - uses a fast distance-based method to quickly compute a full initial tree, which is then improved under ML
RAxML - builds a starting tree with maximum parsimony and optimizes it with a variant of subtree pruning and regrafting (SPR)
   Uses Lazy Subtree Rearrangement (LSR) - sets a maximal distance between the pruning and insertion points for SPR operations to restrict the size of the neighborhood
   Optimizes only the branch that originates at the pruning point
   Repeats using the current best tree
   Takes the 20 best trees found during LSR and re-optimizes their likelihood by adjusting branch lengths

Branch Support - all of the methods above produce a single tree (plus a likelihood value for ML), with no indication of how reliable each branch is
Bootstrapping:
1. Pseudo-samples are created by randomly drawing, with replacement, l columns from the original l-column alignment
2. From each pseudo-sample, a tree is reconstructed, and a consensus tree is built from all of these trees
   Consensus Tree - incorporates branches that occur in the majority of trees
   Bootstrap Values (how often a branch appears among the pseudo-sample trees) are used as an indicator for the reliability of branches
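
The two steps above can be sketched like this. Tree reconstruction itself is left out; for the support part, each tree is assumed to be given as a set of clades (frozensets of taxon names), which is a hypothetical representation chosen to keep the example short:

```python
import random
from collections import Counter

def bootstrap_alignment(alignment, rng):
    """Draw l columns with replacement from an l-column alignment."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    # The same column picks are applied to every sequence, so columns stay intact
    return ["".join(seq[c] for c in picks) for seq in alignment]

def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Create n_replicates pseudo-samples of the alignment."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]

def clade_support(trees):
    """Fraction of trees that contain each clade (the bootstrap value)."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade: n / len(trees) for clade, n in counts.items()}

def majority_rule(trees):
    """Clades found in more than half of the trees."""
    return {c for c, s in clade_support(trees).items() if s > 0.5}

alignment = ["ACGTAC", "ACGTTC", "ACCTAC"]
for replicate in bootstrap_replicates(alignment, 3):
    print(replicate)

# Pretend these came from three pseudo-samples
trees = [
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AC"), frozenset("BD")},
]
print(clade_support(trees)[frozenset("AB")])  # 2 of 3 trees
```

In practice each pseudo-sample would be passed to the same tree-building method used on the original alignment, and hundreds to thousands of replicates are typical.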
