Monday, March 9, 2015

GreenGenes - Pipeline and Phylogenies

1. Download the entire GreenGenes database

Need:
gg_13_5.fasta
gg_13_5_taxonomy.txt

2. Search for taxonomy of interest - start with Microbacteriaceae
#creates a text file with IDs matching search
$ egrep "f__Microbacteriaceae" gg_13_5_taxonomy.txt | awk '{print $1}' > ./gg-microbacteriaceae.txt

3. micro-only.py 
#searches fasta file and creates a new fasta file with only IDs from gg-microbacteriaceae.txt
#found 5707 sequences
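
#micro-only.py itself is not shown in this post. A minimal sketch of this kind of ID filter is below. It assumes one ID in the first column of gg-microbacteriaceae.txt and FASTA headers that begin with that ID. The function names and the main() wrapper are hypothetical, and gg-micro.fasta is assumed to be the output name because step 4 uses it.

```python
def read_ids(lines):
    """Collect the first whitespace-delimited field of each non-empty line."""
    return {line.split()[0] for line in lines if line.strip()}


def filter_fasta(fasta_lines, wanted_ids):
    """Yield only the FASTA lines that belong to records whose ID is wanted."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Record ID = header text after ">" up to the first whitespace.
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted_ids
        if keep:
            yield line


def main(id_path="gg-microbacteriaceae.txt",
         fasta_path="gg_13_5.fasta",
         out_path="gg-micro.fasta"):
    """Hypothetical driver: file names follow the ones used in this post."""
    with open(id_path) as id_handle:
        wanted = read_ids(id_handle)
    with open(fasta_path) as fasta_in, open(out_path, "w") as fasta_out:
        fasta_out.writelines(filter_fasta(fasta_in, wanted))
```

Calling main() with the three file names would write the filtered FASTA file. The file is streamed line by line, so the full GreenGenes FASTA never has to fit in memory.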

4. Combine my 16S reads with the GreenGenes Microbacteriaceae sequences
$ cat my-16S-reads.fasta gg-micro.fasta > output.fasta

5. QIIME - pick_otus.py - generates 327 OTUs
-m uclust
-s 0.97
-A #optimal search

***Note: Swarm loses OTUs when run, due to its clustering algorithm

6. QIIME - pick_rep_set.py
-f gg-all-microbacteriaceae-with-16S.fasta
-r my-16S-reads.fasta
-m longest

6B. fasta-rename.py #renames all seqs with new names on fasta header
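
#fasta-rename.py is not shown here either, and the naming scheme it uses is not given. A rough sketch of header renaming, assuming a lookup from old sequence ID to new name (the rename_fasta function and the new_names mapping are hypothetical):

```python
def rename_fasta(fasta_lines, new_names):
    """Rewrite FASTA headers using new_names; sequence lines pass through.

    Headers whose ID is not in new_names keep their ID. Any description
    text after the ID is dropped in this sketch.
    """
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            old_id = fields[0] if fields else ""
            yield ">" + new_names.get(old_id, old_id) + "\n"
        else:
            yield line
```

Example: list(rename_fasta(open_fasta_handle, {"old_id": "new_name"})) gives the renamed lines, ready for writelines().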

7. Align rep_set sequences with SINA

8. Eliminate all OTUs with <20 seqs EXCEPT for Curtobacterium OTUs (also did <50 seqs)
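
#How this filtering was done is not recorded here. One possible sketch works from the OTU map written by pick_otus.py (tab-separated: OTU ID, then its sequence IDs). The set of Curtobacterium OTU IDs is assumed to be known already, and both function names are hypothetical.

```python
def parse_otu_map(lines):
    """Read a QIIME-style OTU map: OTU ID, then tab-separated sequence IDs."""
    otus = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            otus[fields[0]] = fields[1:]
    return otus


def filter_otus(otus, min_size, always_keep=frozenset()):
    """Keep OTUs with at least min_size sequences, plus any in always_keep."""
    return {
        otu_id: seq_ids
        for otu_id, seq_ids in otus.items()
        if len(seq_ids) >= min_size or otu_id in always_keep
    }
```

For the cutoffs in this step, min_size would be 20 (or 50), with always_keep holding the Curtobacterium OTU IDs.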

9. jModelTest - computes likelihood scores with PhyML
Base Frequencies +F
Rate Variation +I +G nCat=4
ML Optimized
Base Tree Search = NNI

Best Models:
      Model           BIC score
      TIM1 + G        27589
      TrN + G         27593
      GTR + G         27608
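
#Lower BIC is better. For reference, a small sketch of the standard BIC formula (BIC = -2 lnL + k ln n) and of the gap between each model and the best one, using the three scores listed above. The function names are made up for this example, and the log-likelihoods and parameter counts behind the scores are not recorded in this post.

```python
import math

# BIC scores as reported by jModelTest in the table above.
BIC_SCORES = {"TIM1+G": 27589, "TrN+G": 27593, "GTR+G": 27608}


def bic(log_likelihood, n_params, sample_size):
    """Standard BIC: -2 * lnL + k * ln(n)."""
    return -2.0 * log_likelihood + n_params * math.log(sample_size)


def delta_bic(scores):
    """Difference between each model's BIC and the lowest BIC."""
    best = min(scores.values())
    return {model: score - best for model, score in scores.items()}


print(delta_bic(BIC_SCORES))  # {'TIM1+G': 0, 'TrN+G': 4, 'GTR+G': 19}
```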

10. Run TrN+G model on MEGA
-Maximum Likelihood
-Nucleotide Substitution = TrN
-Bootstrap Method = 100 replicates
-Gamma Distributed (G), discrete categories = 5
-Complete Deletion 
-NNI

11. Run GTR+G on RAxML - see RAxML manual for help
$ raxmlHPC -s input_file.phy -n output_name -m GTRGAMMA -# 100 -x 100 -p 2389 -f a -o outgroup_name


RAxML - GTR+G with OTUs > 50 seqs

MEGA - TrN+G with OTUs > 50 seqs
