Research: GreenGenes - Phylogenetics Background

Been working on this for a few weeks, but I'll summarize:

Brief overview of Phylogenetics:

Multiple Sequence Alignment
lines up three or more sequences so that homologous positions share a column; most programs start by scoring every pair of sequences

MUSCLE - multiple alignment software; its first stage estimates distances between unaligned sequences from shared k-mers
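
To make the k-mer distance idea concrete, here is a minimal sketch. It is not MUSCLE's actual formula (the function names and the default k=3 are my own assumptions); it only shows how a distance can be estimated from shared k-mers without aligning anything.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """Return 1 - (shared k-mers / k-mers in the shorter sequence).

    Simplified illustration only; real aligners use their own variants.
    """
    shortest = min(len(a), len(b)) - k + 1
    if shortest <= 0:
        raise ValueError("both sequences must be at least k long")
    # Counter & Counter keeps the minimum count of each shared k-mer
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1.0 - shared / shortest
```

Identical sequences get distance 0 and sequences with no k-mer in common get distance 1.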

Clustalw - takes a set of input sequences and carries out progressive alignment
--> sequences are first aligned in pairs in order to generate a distance matrix
--> uses a Neighbor-Joining method to produce an unrooted tree, which serves as the guide for multiple alignment
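
A rough sketch of the "align in pairs, then build a distance matrix" step. This is not Clustalw's code: the scoring values (match=1, mismatch=-1, gap=-2) and the distance definition (fraction of differing non-gap columns) are assumptions chosen to keep the example short.

```python
from itertools import combinations


def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def pairwise_distance(a, b):
    """Fraction of aligned, non-gap columns where the residues differ."""
    x, y = global_align(a, b)
    compared = [(p, q) for p, q in zip(x, y) if p != "-" and q != "-"]
    if not compared:
        return 1.0
    return sum(p != q for p, q in compared) / len(compared)


def distance_matrix(seqs):
    """Symmetric matrix of pairwise distances for a list of sequences."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = pairwise_distance(seqs[i], seqs[j])
    return d
```

The resulting matrix is what a guide-tree method such as Neighbor-Joining would take as input.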

INPUT DATA                        METHOD
2 - 100 protein seqs              MUSCLE
100 - 500 seqs globally aligned
> 500 seqs
small number of large seqs        Clustalw

Genetic Distance and Nucleotide Substitution Models
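Genetic Distance - estimate of the evolutionary divergence between two sequences, usually expressed as the expected number of substitutions per site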
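
As one concrete example of a substitution model, the Jukes-Cantor (JC69) correction turns the observed proportion of differing sites p into a distance d = -3/4 ln(1 - 4p/3). The notes do not name a specific model, so picking JC69 here is my assumption; it is simply the simplest one. A small sketch:

```python
import math


def p_distance(a, b):
    """Observed proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    sites = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in sites) / len(sites)


def jukes_cantor(p):
    """JC69 distance (substitutions per site) from the observed proportion p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

The corrected distance is always at least as large as p, because some sites have changed more than once and the raw count misses those hidden substitutions.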

Rate Heterogeneity among sites - rate of nucleotide substitution can vary substantially for different positions
--> Use Gamma Distribution for the per-site rates - expectation 1.0 with variance 1/alpha, so a small alpha means strong rate variation among sites
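
A quick way to check the "expectation 1.0, variance 1/alpha" statement is to draw rates from a gamma distribution with shape alpha and scale 1/alpha. The function name and the seed are my own choices for this sketch.

```python
import random
import statistics


def sample_site_rates(alpha, n_sites, seed=0):
    """Draw one relative rate per site from Gamma(shape=alpha, scale=1/alpha)."""
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale); mean = shape * scale = 1.0
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


rates = sample_site_rates(alpha=0.5, n_sites=20000)
print(statistics.fmean(rates))      # close to 1.0
print(statistics.pvariance(rates))  # close to 1/alpha = 2.0
```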

Phylogenetic Inference based on Distance Methods
Try to fit a tree to a matrix of genetic distances

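Minimum Evolution (ME) - distance method for constructing additive trees; picks the tree whose total length (sum of all branch lengths) is smallest
Neighbor-Joining - at each step joins the pair of OTUs whose merging gives the smallest total branch length, then repeats on the reduced distance matrix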
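
One Neighbor-Joining step can be sketched as follows: compute Q(i,j) = (n-2) d(i,j) - sum_k d(i,k) - sum_k d(j,k) for every pair, join the pair with the smallest Q, and compute the two new branch lengths. This only covers picking a single pair; the full algorithm also replaces the pair with a new node and repeats. The function name and the toy matrix are made up for illustration.

```python
def nj_pick_pair(dist):
    """Return ((i, j), (branch_i, branch_j)) for the pair NJ would join first."""
    n = len(dist)
    if n < 3:
        raise ValueError("need at least 3 OTUs")
    totals = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    # Branch lengths from each joined OTU to the new internal node
    branch_i = 0.5 * dist[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
    branch_j = dist[i][j] - branch_i
    return (i, j), (branch_i, branch_j)


# Toy distance matrix for five OTUs (illustrative values only)
toy = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_pick_pair(toy))
```

Note that the pair with the smallest raw distance (OTUs 3 and 4 here) is not necessarily the pair with the smallest Q, which is why NJ does not simply join the closest pair.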

Phylogenetic Inference using Maximum Likelihood (ML) Methods
Highest probability of observed data under a set of parameters
Determines tree topology, branch lengths, and parameters of evolutionary model that maximizes the probability of observing the sequences in a particular arrangement
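
The smallest possible ML problem is two sequences joined by one branch. The sketch below assumes the Jukes-Cantor model (my choice, the notes do not fix a model) and finds the branch length t that maximizes the probability of the observed same/different site counts by a simple grid search. Real programs use proper numerical optimizers and whole trees.

```python
import math


def jc69_log_likelihood(n_same, n_diff, t):
    """Log-likelihood of two aligned sequences separated by branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * e   # P(site keeps the same base)
    p_diff = 0.25 - 0.25 * e   # P(site changes to one specific other base)
    # 0.25 is the equilibrium frequency of the base in the first sequence
    return (n_same * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))


def best_branch_length(n_same, n_diff, step=0.001, t_max=2.0):
    """Grid search for the branch length with the highest likelihood."""
    grid = [step * k for k in range(1, int(t_max / step) + 1)]
    return max(grid, key=lambda t: jc69_log_likelihood(n_same, n_diff, t))


print(best_branch_length(n_same=7, n_diff=3))
```

With 3 differing sites out of 10, the maximum sits at roughly 0.38 substitutions per site, which matches the Jukes-Cantor distance formula for p = 0.3.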

--> GOAL - to find the tree, among all possible tree structures, that maximizes the global likelihood
However, evaluating every possible tree is infeasible for more than a handful of sequences -> need heuristics
1. Stepwise Addition
2. Star Decomposition
3. Neighbor-Joining
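
To see why exhaustive search is out of reach: the number of unrooted binary trees for n taxa is (2n-5)!! = 3 x 5 x ... x (2n-5). A small sketch (the function name is mine):

```python
def count_unrooted_trees(n_taxa):
    """Number of distinct unrooted, fully resolved trees for n_taxa leaves."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):  # 3, 5, ..., 2n - 5
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

Ten taxa already give about two million topologies, and the count keeps growing faster than exponentially, hence the heuristics listed above.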

PHYML - uses a fast distance-based method (BIONJ) to quickly compute a full initial tree, which is then improved under ML
RAxML - builds a starting tree with maximum parsimony and optimizes it with a variant of subtree pruning and regrafting (SPR)
Uses Lazy Subtree Rearrangement (LSR) - sets a maximal distance between the pruning and insertion points for SPR operations to restrict the size of the neighborhood searched
Optimizes only the branch that originates at the pruning point
Repeats using the current best tree
Takes the 20 best trees found during LSR to reoptimize ML by adjusting branch lengths

Branch Support - all methods produce a single tree (and, for ML, a likelihood value) but give no indication of how reliable each individual branch is
Bootstrapping:
1. Pseudo-samples are created by randomly drawing with replacement l columns from the original l column alignment
2. From each pseudo-sample, a tree is reconstructed and a consensus tree is made
Consensus Tree - incorporates branches that occur in the majority of trees
Bootstrap Values used as an indicator for reliability of branches
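
The bootstrap steps above can be sketched like this. Tree building itself is left out: the sketch assumes each replicate tree has already been reduced to a set of splits (each split a frozenset of taxon names on one side of a branch), which is my own simplification, as are the function names.

```python
import random
from collections import Counter


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the same width."""
    n_cols = len(alignment[0])
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[c] for c in cols) for seq in alignment]


def majority_splits(trees_as_splits):
    """Map each split found in more than half of the trees to its support."""
    n_trees = len(trees_as_splits)
    counts = Counter(s for tree in trees_as_splits for s in tree)
    return {s: c / n_trees for s, c in counts.items() if c / n_trees > 0.5}


aln = ["ACGT", "ACGA", "TCGA"]
replicate = bootstrap_alignment(aln, random.Random(1))

# Three hypothetical replicate trees, each described by one split
trees = [{frozenset("AB")}, {frozenset("AB")}, {frozenset("AC")}]
print(majority_splits(trees))
```

The fraction attached to each surviving split is the bootstrap value for that branch in the consensus tree.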

Tuesday, March 3, 2015