Need to analyze EMP metadata:
Raw data - EMP_10k_merged_mapping_final.txt and full_emp_table_w_tax.biom
I was able to pull out Curto OTUs last week (see post from 3/12/15) from the full_emp.biom and convert to .txt file to be able to manipulate further.
Giant EMP_10k file has 14095 samples. The Curto OTU table only has 2882 samples.
Took the list of samples that appear in Curto OTU table and made a list. Wrote a rough code sample-ids-curto.py to parse EMP_10k file and extract only samples from Curto OTU table.
Creates a new file - curto-samples.csv
$ wc -l curto-samples.csv 
2490 
This means there were roughly 390 samples missing. So either code has a bug OR samples NOT in the EMP_10k file. Turns out, they are not in the EMP_10k file (no idea why?).
$vimdiff file1 file2 
Samples not included in further analysis found in samples-not-in-csv.txt
Need to redefine Curto OTU table - eliminate samples that are not found in the EMP_10k file.
Modified previous code slightly to parse Curto OTU Table and pull out correct samples.
*First had to transpose OTU table to get in correct format - code parses the first string in each row
                             OLD FORMAT                                           
               Sample1   Sample2   Samplen
OTU1
OTU2
OTUn
Creates a new file - full-emp-curto-only-with-found-samples.csv
Compare number of samples to check: both files have 2490 samples
Sort both curto-samples.csv and full-emp-curto-only-with-found-samples.csv 
$ sort curto-samples.csv curto-samples-sorted.csv
#and for other file
Combine two files and check to make sure sample IDs match-up *they should since they were sorted
Creates a new file - combined-samples-otu-table.csv
                                    NEW FORMAT                                           
                     OTU1   OTU2   OTUn …   METADATA
Sample1
Sample2
Samplen
Eliminate all columns in Metadata that contain "na" or "None" for every sample
--> 205 columns were eliminated
--> GRAND TOTAL = combined-samples-otu-table-annotated.xlsx
53 OTUs with 2489 samples with 271 columns of Metadata!
Tuesday, March 24, 2015
Tuesday, March 17, 2015
EMP, GreenGenes - Make Local DB and BLAST
Create a reference database from my GreenGenes + 16S strains
I used the rep_seqs that were generated when I created my phyla tree as my database.
Made a new file - curto-db.fasta
*these are aligned rep_set seqs
Two ways to create your own local database:
1. Use the BLAST command line
The sequences need to be in a specific format:
Ex.
>gnl|831711|Microbacteriaceae_Candidatus_Rhodoluna
DNA here
makeblastdb
$ makeblastdb -in curto-db.fasta -dbtype nucl -out curto.db
Find out more details HERE
2. Use Geneious
Tools -> Sequence Search
Window pops up and click "Add/Remove Databases" - select "Add Sequence Database"
Follow instructions (ie. select 'nucleotide' and 'custom BLAST')
Perform Sequence Search again, but this time Select "Database" and scroll to your new custom database!
___________________________________________________________________________
Next, BLAST the EMP seqs against my local database.
*The EMP seqs were generated from QIIME assign_taxonomy.py and took those who identified with Curtobacterium with greater 0.67 quality score
*The seqs are also extremely short - less than 200 bp
Export the data to a .txt file
I used the rep_seqs that were generated when I created my phyla tree as my database.
Made a new file - curto-db.fasta
*these are aligned rep_set seqs
Two ways to create your own local database:
1. Use the BLAST command line
The sequences need to be in a specific format:
Ex.
>gnl|831711|Microbacteriaceae_Candidatus_Rhodoluna
DNA here
makeblastdb
$ makeblastdb -in curto-db.fasta -dbtype nucl -out curto.db
Find out more details HERE
2. Use Geneious
Tools -> Sequence Search
Window pops up and click "Add/Remove Databases" - select "Add Sequence Database"
Follow instructions (ie. select 'nucleotide' and 'custom BLAST')
Perform Sequence Search again, but this time Select "Database" and scroll to your new custom database!
___________________________________________________________________________
Next, BLAST the EMP seqs against my local database.
*The EMP seqs were generated from QIIME assign_taxonomy.py and took those who identified with Curtobacterium with greater 0.67 quality score
*The seqs are also extremely short - less than 200 bp
Export the data to a .txt file
Really strange results - EMP seqs hit rep_seqs at equal frequency
Need to look at seqs in Geneious and check alignments!
Thursday, March 12, 2015
EMP - OTU Table
FINALLY FIGURED OUT HOW TO GET OTU TABLE!
$ biom subset-table -i full_emp_table_hdf5.h5 -a observation -s curto-only-ids.txt -o full_emp_table_curto.biom
$ biom convert -i full_emp_table_curto.biom -o full_emp_table_curto.txt --to_tsv --header-key taxonomy
- Remember the EMP Open .biom file was too large (too much memory - crashed Python)
- Converted format to HDF5 file for easier manipulation
- Found this convenient python class
- Which then enables (if biom is installed...) and only if hdf5 file is in correct format
$ biom subset-table -i full_emp_table_hdf5.h5 -a observation -s curto-only-ids.txt -o full_emp_table_curto.biom
$ biom convert -i full_emp_table_curto.biom -o full_emp_table_curto.txt --to_tsv --header-key taxonomy
Monday, March 9, 2015
GreenGenes - Pipeline and Phylogenies
1. Download the entire GreenGenes database
Need:
gg_13_5.fasta
gg_13_5_taxonomy.txt
2. Search for taxonomy of interest - start with Microbacteriaceae
#creates a text file with IDs matching search
$ egrep "f__Microbacteriaceae" gg_13_5_taxonomy.txt | awk '{print $1}' > ./gg-microbacteriaceae.txt
3. micro-only.py
#searches fasta file and creates a new fasta file with only IDs from gg-microbacteriaceae.txt
#found 5707 sequences
4. Combine my 16S reads
$ cat my-16S-reads.fasta gg-micro.fasta > output.fasta
5. QIIME - pick_otus.py - generates 327 OTUs
-m uclust
-s 0.97
-A #optimal search
***Swarm loses OTUs when running due to its algorithm
6. QIIME - pick_rep_set.py
-f gg-all-microbacteriaceae-with-16S.fasta
-r my-16S-reads.fasta
-m longest
6B. fasta-rename.py #renames all seqs with new names on fasta header
7. Align rep_sest sequences with SINA
8. Eliminate all OTUs with <20 seqs EXCEPT for Curtobacterium OTUs (also did <50 seqs)
9. JModel Test - Computes likelihood scores with PHYML
Base Frequencies +F
Rate Variation +I +G nCat=4
ML Optimized
Base Tree Search = NNI
Best Models:
Models BIC Calculation
TlM1 + G 27589
TrN + G 27593
GTR + G 27608
10. Run TrN+G model on MEGA
-Maximum Likelihood
-Nucleotide Substitution = TrN
-Bootstrap Method = 100
-Gamma Distributed = 5
-Complete Deletion
-NNI
11. Run GTR+G on RAxML - see RAxML manual for help
$ raxmlHPC -s input_file.phy -n output_name -m GTRGAMMA -# 100 -x 100 -p 2389 -f a -o outgroup_name
Need:
gg_13_5.fasta
gg_13_5_taxonomy.txt
2. Search for taxonomy of interest - start with Microbacteriaceae
#creates a text file with IDs matching search
$ egrep "f__Microbacteriaceae" gg_13_5_taxonomy.txt | awk '{print $1}' > ./gg-microbacteriaceae.txt
3. micro-only.py
#searches fasta file and creates a new fasta file with only IDs from gg-microbacteriaceae.txt
#found 5707 sequences
4. Combine my 16S reads
$ cat my-16S-reads.fasta gg-micro.fasta > output.fasta
5. QIIME - pick_otus.py - generates 327 OTUs
-m uclust
-s 0.97
-A #optimal search
***Swarm loses OTUs when running due to its algorithm
6. QIIME - pick_rep_set.py
-f gg-all-microbacteriaceae-with-16S.fasta
-r my-16S-reads.fasta
-m longest
6B. fasta-rename.py #renames all seqs with new names on fasta header
7. Align rep_sest sequences with SINA
8. Eliminate all OTUs with <20 seqs EXCEPT for Curtobacterium OTUs (also did <50 seqs)
9. JModel Test - Computes likelihood scores with PHYML
Base Frequencies +F
Rate Variation +I +G nCat=4
ML Optimized
Base Tree Search = NNI
Best Models:
Models BIC Calculation
TlM1 + G 27589
TrN + G 27593
GTR + G 27608
10. Run TrN+G model on MEGA
-Maximum Likelihood
-Nucleotide Substitution = TrN
-Bootstrap Method = 100
-Gamma Distributed = 5
-Complete Deletion
-NNI
11. Run GTR+G on RAxML - see RAxML manual for help
$ raxmlHPC -s input_file.phy -n output_name -m GTRGAMMA -# 100 -x 100 -p 2389 -f a -o outgroup_name
 MEGA - TrN + G with OTUs > 50 seqs
Tuesday, March 3, 2015
GreenGenes - Phylogenetics Background
Been working on this for a few weeks, but I'll summarize:
Brief overview of Phylogenetics:
Multiple Sequence Alignment
generates a score between pairs of sequences
MUSCLE - multiple alignment software includes distance estimations using Kmer
Clustalw - takes a set of input sequences and carry out progressive alignment
--> aligned in pairs in order to generate a distance matrix
--> uses a Neighbor-Joining method to produced unrooted tree which serves as the guide for multiple alignment
INPUT DATA METHOD
2 - 100 protein seqs MUSCLE
100 - 500 seqs globally aligned
> 500 seqs
small number of large seqs Clustalw
Genetic Distance and Nucleotide Substitution Models
Genetic Distance - evolutionary distance
Rate Heterogeneity among sites - rate of nucleotide substitution can vary substantially for different positions
--> Use Gamma Distribution - expectation 1.0 with variance 1/alpha
Phylogenetic Inference based on Distance Methods
Try to fit a tree to a matrix of genetic distances
Minimum Evolution (ME) - distance method for constructing additive trees to minimize length of tree
Neighbor-Joining - minimizes steps by finding a pair of neighboring OTUs
Phylogenetic Inference using Maximum Likelihood (ML) Methods
Highest probability of observed data under a set of parameters
Determines tree topology, branch lengths, and parameters of evolutionary model that maximizes the probability of observing the sequences in a particular arrangement
--> GOAL - to find tree among all possible tree structures that maximizes the global likelihood
However, impossible to compute all possible trees -> need to add heuristics
1. Stepwise Addition
2. Star Decomposition
3. Neighbor-Joining
PHYML - fast distance based method to quickly compute a full initial tree
RAxML - builds tree on maximum parsimony and optimizes with a variant of sub-tree
Uses Lazy Subtree Arrangement (LSR) - assigns maximal distance between pruning and insertion point for Subtree prune and regraft (SPR) operations to restrict size of neighborhood
Optimizes only the branch that originates at the pruning point
Repeats using the current best tree
Takes the 20 best trees found during LSR to reoptimize ML by adjusting branch lengths
Branch Support - all methods produce a single tree and ML values
Bootstrapping:
1. Pseudo-samples are created by randomly drawing with replacement l columns from the original l column alignment
2. From each pseudo-sample, a tree is reconstructed and a consensus tree is made
Consensus Tree - incorporates branches that occur in the majority of trees
Bootstrap Values used as an indicator for reliability of branches
Brief overview of Phylogenetics:
Multiple Sequence Alignment
generates a score between pairs of sequences
MUSCLE - multiple alignment software includes distance estimations using Kmer
Clustalw - takes a set of input sequences and carry out progressive alignment
--> aligned in pairs in order to generate a distance matrix
--> uses a Neighbor-Joining method to produced unrooted tree which serves as the guide for multiple alignment
INPUT DATA METHOD
2 - 100 protein seqs MUSCLE
100 - 500 seqs globally aligned
> 500 seqs
small number of large seqs Clustalw
Genetic Distance and Nucleotide Substitution Models
Genetic Distance - evolutionary distance
Rate Heterogeneity among sites - rate of nucleotide substitution can vary substantially for different positions
--> Use Gamma Distribution - expectation 1.0 with variance 1/alpha
Phylogenetic Inference based on Distance Methods
Try to fit a tree to a matrix of genetic distances
Minimum Evolution (ME) - distance method for constructing additive trees to minimize length of tree
Neighbor-Joining - minimizes steps by finding a pair of neighboring OTUs
Phylogenetic Inference using Maximum Likelihood (ML) Methods
Highest probability of observed data under a set of parameters
Determines tree topology, branch lengths, and parameters of evolutionary model that maximizes the probability of observing the sequences in a particular arrangement
--> GOAL - to find tree among all possible tree structures that maximizes the global likelihood
However, impossible to compute all possible trees -> need to add heuristics
1. Stepwise Addition
2. Star Decomposition
3. Neighbor-Joining
PHYML - fast distance based method to quickly compute a full initial tree
RAxML - builds tree on maximum parsimony and optimizes with a variant of sub-tree
Uses Lazy Subtree Arrangement (LSR) - assigns maximal distance between pruning and insertion point for Subtree prune and regraft (SPR) operations to restrict size of neighborhood
Optimizes only the branch that originates at the pruning point
Repeats using the current best tree
Takes the 20 best trees found during LSR to reoptimize ML by adjusting branch lengths
Branch Support - all methods produce a single tree and ML values
Bootstrapping:
1. Pseudo-samples are created by randomly drawing with replacement l columns from the original l column alignment
2. From each pseudo-sample, a tree is reconstructed and a consensus tree is made
Consensus Tree - incorporates branches that occur in the majority of trees
Bootstrap Values used as an indicator for reliability of branches
Subscribe to:
Comments (Atom)
 


