Tuesday, March 24, 2015

EMP - Matrix and OTU table

Need to analyze EMP metadata:

Raw data - EMP_10k_merged_mapping_final.txt and full_emp_table_w_tax.biom

I was able to pull out Curto OTUs last week (see post from 3/12/15) from the full_emp.biom and convert to .txt file to be able to manipulate further.

Giant EMP_10k file has 14095 samples. The Curto OTU table only has 2882 samples.

Took the list of samples that appear in Curto OTU table and made a list. Wrote a rough code sample-ids-curto.py to parse EMP_10k file and extract only samples from Curto OTU table.
Creates a new file - curto-samples.csv

$ wc -l curto-samples.csv 
2490 

This means there were roughly 390 samples missing. So either code has a bug OR samples NOT in the EMP_10k file. Turns out, they are not in the EMP_10k file (no idea why?).
$vimdiff file1 file2 
Samples not included in further analysis found in samples-not-in-csv.txt

Need to redefine Curto OTU table - eliminate samples that are not found in the EMP_10k file.
Modified previous code slightly to parse Curto OTU Table and pull out correct samples.
*First had to transpose OTU table to get in correct format - code parses the first string in each row

                             OLD FORMAT                                          
               Sample1   Sample2   Samplen
OTU1
OTU2
OTUn

Creates a new file - full-emp-curto-only-with-found-samples.csv

Compare number of samples to check: both files have 2490 samples

Sort both curto-samples.csv and full-emp-curto-only-with-found-samples.csv 
$ sort curto-samples.csv curto-samples-sorted.csv
#and for other file

Combine two files and check to make sure sample IDs match-up *they should since they were sorted
Creates a new file - combined-samples-otu-table.csv

                                    NEW FORMAT                                          
                     OTU1   OTU2   OTUn …   METADATA
Sample1
Sample2
Samplen

Eliminate all columns in Metadata that contain "na" or "None" for every sample
--> 205 columns were eliminated

--> GRAND TOTAL = combined-samples-otu-table-annotated.xlsx
53 OTUs with 2489 samples with 271 columns of Metadata!

Tuesday, March 17, 2015

EMP, GreenGenes - Make Local DB and BLAST

Create a reference database from my GreenGenes + 16S strains

I used the rep_seqs that were generated when I created my phyla tree as my database.
Made a new file - curto-db.fasta
*these are aligned rep_set seqs

Two ways to create your own local database:

1.   Use the BLAST command line

The sequences need to be in a specific format:

Ex.
>gnl|831711|Microbacteriaceae_Candidatus_Rhodoluna
DNA here

makeblastdb
$ makeblastdb -in curto-db.fasta -dbtype nucl -out curto.db

Find out more details HERE

2. Use Geneious

Tools -> Sequence Search

Window pops up and click "Add/Remove Databases" - select "Add Sequence Database"
Follow instructions (ie. select 'nucleotide' and 'custom BLAST')

Perform Sequence Search again, but this time Select "Database" and scroll to your new custom database!

___________________________________________________________________________

Next, BLAST the EMP seqs against my local database.
*The EMP seqs were generated from QIIME assign_taxonomy.py and took those who identified with Curtobacterium with greater 0.67 quality score
*The seqs are also extremely short - less than 200 bp

Export the data to a .txt file

Really strange results - EMP seqs hit rep_seqs at equal frequency
Need to look at seqs in Geneious and check alignments!

Thursday, March 12, 2015

EMP - OTU Table

FINALLY FIGURED OUT HOW TO GET OTU TABLE!


  1. Remember the EMP Open .biom file was too large (too much memory - crashed Python)
  2. Converted format to HDF5 file for easier manipulation
  3. Found this convenient python class
  4. Which then enables (if biom is installed...) and only if hdf5 file is in correct format


$ biom subset-table -i full_emp_table_hdf5.h5 -a observation -s curto-only-ids.txt -o full_emp_table_curto.biom

$ biom convert -i full_emp_table_curto.biom -o full_emp_table_curto.txt --to_tsv --header-key taxonomy


Monday, March 9, 2015

GreenGenes - Pipeline and Phylogenies

1. Download the entire GreenGenes database

Need:
gg_13_5.fasta
gg_13_5_taxonomy.txt

2. Search for taxonomy of interest - start with Microbacteriaceae
#creates a text file with IDs matching search
$ egrep "f__Microbacteriaceae" gg_13_5_taxonomy.txt | awk '{print $1}' > ./gg-microbacteriaceae.txt

3. micro-only.py 
#searches fasta file and creates a new fasta file with only IDs from gg-microbacteriaceae.txt
#found 5707 sequences

4. Combine my 16S reads
$ cat my-16S-reads.fasta gg-micro.fasta > output.fasta

5. QIIME - pick_otus.py - generates 327 OTUs
-m uclust
-s 0.97
-A #optimal search

***Swarm loses OTUs when running due to its algorithm

6. QIIME - pick_rep_set.py
-f gg-all-microbacteriaceae-with-16S.fasta
-r my-16S-reads.fasta
-m longest

6B. fasta-rename.py #renames all seqs with new names on fasta header

7. Align rep_sest sequences with SINA

8. Eliminate all OTUs with <20 seqs EXCEPT for Curtobacterium OTUs (also did <50 seqs)

9. JModel Test - Computes likelihood scores with PHYML
Base Frequencies +F
Rate Variation +I +G nCat=4
ML Optimized
Base Tree Search = NNI

Best Models:
        Models          BIC Calculation
      TlM1 + G                27589
      TrN + G                 27593
      GTR + G                 27608

10. Run TrN+G model on MEGA
-Maximum Likelihood
-Nucleotide Substitution = TrN
-Bootstrap Method = 100
-Gamma Distributed = 5
-Complete Deletion 
-NNI

11. Run GTR+G on RAxML - see RAxML manual for help
$ raxmlHPC -s input_file.phy -n output_name -m GTRGAMMA -# 100 -x 100 -p 2389 -f a -o outgroup_name


RAxML - GTG+G with OTUs > 50 seqs

 MEGA - TrN + G with OTUs > 50 seqs

Tuesday, March 3, 2015

GreenGenes - Phylogenetics Background

Been working on this for a few weeks, but I'll summarize:

Brief overview of Phylogenetics:

Multiple Sequence Alignment
generates a score between pairs of sequences

MUSCLE - multiple alignment software includes distance estimations using Kmer

Clustalw - takes a set of input sequences and carry out progressive alignment
   --> aligned in pairs in order to generate a distance matrix
   --> uses a Neighbor-Joining method to produced unrooted tree which serves as the guide for multiple alignment

                       INPUT DATA                         METHOD
                   2 - 100 protein seqs                      MUSCLE
          100 - 500 seqs globally aligned            
                        > 500 seqs
              small number of large seqs                Clustalw

Genetic Distance and Nucleotide Substitution Models
Genetic Distance - evolutionary distance

Rate Heterogeneity among sites - rate of nucleotide substitution can vary substantially for different positions
   --> Use Gamma Distribution - expectation 1.0 with variance 1/alpha

Phylogenetic Inference based on Distance Methods
Try to fit a tree to a matrix of genetic distances

Minimum Evolution (ME) - distance method for constructing additive trees to minimize length of tree
Neighbor-Joining - minimizes steps by finding a pair of neighboring OTUs

Phylogenetic Inference using Maximum Likelihood (ML) Methods
Highest probability of observed data under a set of parameters
Determines tree topology, branch lengths, and parameters of evolutionary model that maximizes the probability of observing the sequences in a particular arrangement

--> GOAL - to find tree among all possible tree structures that maximizes the global likelihood
However, impossible to compute all possible trees -> need to add heuristics
1. Stepwise Addition
2. Star Decomposition
3. Neighbor-Joining

PHYML - fast distance based method to quickly compute a full initial tree
RAxML - builds tree on maximum parsimony and optimizes with a variant of sub-tree
   Uses Lazy Subtree Arrangement (LSR) - assigns maximal distance between pruning and insertion point for Subtree prune and regraft (SPR) operations to restrict size of neighborhood
   Optimizes only the branch that originates at the pruning point
   Repeats using the current best tree
   Takes the 20 best trees found during LSR to reoptimize ML by adjusting branch lengths

Branch Support - all methods produce a single tree and ML values
Bootstrapping:
1. Pseudo-samples are created by randomly drawing with replacement l columns from the original l column alignment
2. From each pseudo-sample, a tree is reconstructed and a consensus tree is made
   Consensus Tree - incorporates branches that occur in the majority of trees
   Bootstrap Values used as an indicator for reliability of branches