Wednesday, October 21, 2015

Field Season

I was finally able to get the field experiment set up and deployed in time for the California 'wet' season. Hopefully this elevational study has some interesting things to tell me! Just have to wait and see...

Saturday, July 18, 2015

Microorganisms!

I will just preface this by saying, I am not a mycologist.
With that said, I am a microbiologist, characterizing the mechanisms that drive bacterial diversity. Now, studying microbial ecology is a bit different from traditional ecology. You do not get the wonderful experience of capturing your study organism in the field, or seeing a glorified mating ritual. What you do get is the excitement of seeing a successful next-gen PCR run come back clean, ready to be sequenced. Or the fascination of processing terabytes of sequence data for downstream analysis. Further, the fieldwork, at least for me, is almost nonexistent. Now, don't get me wrong - I live for this field. I love processing data, and I find it utterly fascinating how microorganisms dictate almost all broad-scale ecological processes. To this day, I am blown away by how large an impact these litter bugs can have.
So, when I first ventured into the field down here in Costa Rica, I was expecting to see insects, the coolest frogs ever, and, most excitingly, large mammals. I did see a lot of insects; however, they were not the most welcome ones - mosquitoes! In all seriousness, though, these giant primary tropical forests almost appear desolate. You have this amazing biodiversity of plant species, but, to the untrained eye, there is really nothing else to see. You rarely see a howler monkey grace you with its presence in the tree canopy, nor do you ever see the elaborate and beautiful snakes that are emblematic of the tropics. But, as you look more closely, an entire world begins to unfold.
At the smaller scale, there is an entire world to be seen. You have colonies of leaf-cutter ants marching through the forest, harvesting leaves for their fungal gardens. There are dung beetles carving out their delicious meals, ever ready to present their glorified treasure to a mate. And as you continue to examine this world, you come across one of the most interesting branches of life. Fungi. This clade is responsible for the vast portion of decomposition in terrestrial systems. Without fungi operating at the level that they do, life would simply not be the same in the tropics. The soil here is very nutrient poor, and most of the usable nutrients are locked in living biomass. Because of this, the turnover of these nutrients by fungi (and of course bacteria!) is essential. I told you it's cool that microorganisms dictate everything!
Of course, this information is not new, nor was it new to me before I set foot in Costa Rica. However, what I didn't expect to see was the morphological diversity that was on full display in all its glory. I would have never thought that a decomposing log in the middle of the forest would captivate and demand my attention. But these logs in particular are the playground for these eukaryotes. There are clonal colonies of these fungi creating a vast and integrated network of mycelium, culminating in the production of these beautiful mushrooms. Beyond the decomposing logs, you see mushrooms sprouting up aboveground, evidence of the potential for these organisms to grow beyond belief. What few people seem to realize is that many fungi form these mycelial mats, creating organisms of massive size with the ability to become the largest organism in the world!
Fungi are not limited to decomposition; they can also have detrimental, pathogenic effects on all forms of life. Most interesting is the story of the zombie fungus (Ophiocordyceps unilateralis), as described by Alfred Russel Wallace in the 1800s. I had the privilege of witnessing its effects in all their glory on a trip to La Selva. This fungus infects social insects (in my case, the mighty bullet ant (Paraponera clavata)) by using enzymes that have been deposited within the fungal spores to break down the armor that is the exoskeleton. Next, the fungal spread within the insect causes a truly unique and horrifying effect. Inevitably, the insect becomes a puppet, fully manipulated by the fungal pathogen as it reprograms the ant's entire social behavior. The obedient and systematic social insect that has developed over eons of evolutionary time is disrupted within just a few days. The ant leaves its nest or foraging trail, abandoning its family, to find a suitable habitat for its newfound master. The ant then climbs onto a stem and secures its place on the underside of a leaf, using its giant mandibles to fix itself in place. It is here that the fungal pathogen shuts down the ant altogether: muscles atrophy and the infamous fungal 'death grip' is in full effect. The mighty ant, which is capable of lifting many times its own body weight, is left helpless and paralyzed on what will eventually be its final resting place. The hyphae continue to spread throughout the ant, eventually killing the host, which, at this point, has served its full purpose. Eventually, fruiting bodies grow out of the head of the ant, releasing spores from this advantageous position high up on the leaf of this plant. These spores disperse and are ready to fall onto the next unsuspecting ant brigade, starting this fascinating process once again.
Zombie fungus (Ophiocordyceps unilateralis) fruiting body erupting out of a bullet ant
(special thanks to Bernal Matarrita and Alex Wild)
In conclusion, microorganisms are awesome! The more you learn, the more convinced you will become, I guarantee it. Since I cannot take photos of bacteria (which equally have a number of amazing stories), I settled for fungi. I could not stop from taking photos of the vast diversity of fruiting bodies - some smaller than a pencil point, while others were as large as a person. Below you will find some of my favorites, most of which I have zero idea what they are (all input will be much appreciated!). To conclude, I am not a mycologist, but that doesn't stop me from enjoying and appreciating the unique stories and beauty each of these little guys has to offer! PURA VIDA
[Photo gallery: 14 images of fungal fruiting bodies]

Wednesday, July 8, 2015

La Selva

This summer, I ventured down to Costa Rica to participate in the OTS Tropical Biology course. It has been one of the best experiences of my life. But, in particular, I wanted to share a project that a group of us put together. Now, this wasn't a typical project for this course, which usually consists of a week to plan, execute, present, and write up a research project. This project was geared towards science outreach. With the help of some amazing visiting scientists, filmmakers, and producers, we had the privilege to create a short film in 3 days! Our group decided to concentrate on what makes La Selva Biological Research Station so important. Enjoy!

untitled (La Selva)

Special thanks to Nathan Dappan at Day's Edge Production, Sarah Joseph at National Geographic, and Alex Wild 

Here I am with Michel Alejandro (Univ. of Puerto Rico) sitting atop the canopy tower at La Selva Biological Station

Tuesday, March 24, 2015

EMP - Matrix and OTU table

Need to analyze EMP metadata:

Raw data - EMP_10k_merged_mapping_final.txt and full_emp_table_w_tax.biom

I was able to pull out Curto OTUs last week (see post from 3/12/15) from the full_emp.biom and convert to .txt file to be able to manipulate further.

Giant EMP_10k file has 14095 samples. The Curto OTU table only has 2882 samples.

Made a list of the samples that appear in the Curto OTU table. Wrote a rough script, sample-ids-curto.py, to parse the EMP_10k file and extract only the samples from the Curto OTU table.
Creates a new file - curto-samples.csv

$ wc -l curto-samples.csv 
2490 
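
The extraction step above could look roughly like the sketch below. This is not the original sample-ids-curto.py; the function name, the tab delimiter, and the toy mapping rows are all assumptions for illustration. It also reports which requested IDs were never found, which is the check that matters for the missing samples.

```python
import csv
import io

def filter_mapping(mapping_lines, wanted_ids, delimiter="\t"):
    """Keep mapping-file rows whose first column is a wanted sample ID.

    Returns (header, kept_rows, missing_ids).
    """
    reader = csv.reader(mapping_lines, delimiter=delimiter)
    header = next(reader)
    wanted = set(wanted_ids)
    kept = [row for row in reader if row and row[0] in wanted]
    found = {row[0] for row in kept}
    missing = sorted(wanted - found)
    return header, kept, missing

# Made-up rows standing in for the real mapping file (assumption)
demo_mapping = io.StringIO(
    "#SampleID\tenv\tlatitude\n"
    "S1\tsoil\t33.6\n"
    "S2\tmarine\t10.4\n"
    "S3\tsoil\t45.0\n"
)
header, kept, missing = filter_mapping(demo_mapping, ["S1", "S3", "S9"])
print(len(kept), "samples kept; not found:", missing)
```

For the real file, `mapping_lines` would be an open file handle and `kept` would be written back out with `csv.writer`.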

This means roughly 390 samples were missing (2882 - 2490 = 392). So either the code has a bug OR the samples are NOT in the EMP_10k file. Turns out, they are not in the EMP_10k file (no idea why?).
$ vimdiff file1 file2
Samples not included in further analysis are listed in samples-not-in-csv.txt

Need to redefine Curto OTU table - eliminate samples that are not found in the EMP_10k file.
Modified previous code slightly to parse Curto OTU Table and pull out correct samples.
*First had to transpose OTU table to get in correct format - code parses the first string in each row

                             OLD FORMAT                                          
               Sample1   Sample2   Samplen
OTU1
OTU2
OTUn

Creates a new file - full-emp-curto-only-with-found-samples.csv
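
A minimal sketch of the transposition step, assuming the table fits in memory as a rectangular list of rows (the tiny table here is made up). After transposing, each row starts with a sample ID, which is what a parser that reads the first string in each row needs.

```python
def transpose(rows):
    """Swap rows and columns of a rectangular table (list of lists)."""
    return [list(col) for col in zip(*rows)]

# Made-up OTU-by-sample table (assumption): OTUs as rows, samples as columns
otu_by_sample = [
    ["OTU_ID", "Sample1", "Sample2"],
    ["OTU1", "5", "0"],
    ["OTU2", "2", "7"],
]
sample_by_otu = transpose(otu_by_sample)
for row in sample_by_otu:
    print("\t".join(row))
```

Note that `zip` silently truncates ragged rows, so it is worth confirming every row has the same number of fields first.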

Compare number of samples to check: both files have 2490 samples

Sort both curto-samples.csv and full-emp-curto-only-with-found-samples.csv 
$ sort curto-samples.csv > curto-samples-sorted.csv
#and the same for the other file

Combine two files and check to make sure sample IDs match-up *they should since they were sorted
Creates a new file - combined-samples-otu-table.csv

                                    NEW FORMAT                                          
                     OTU1   OTU2   OTUn …   METADATA
Sample1
Sample2
Samplen
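
Rather than relying on both files being sorted the same way, the combine step could key on the sample ID directly. The sketch below is one way to do that; the function name and the toy rows are assumptions, and it raises an error instead of silently misaligning rows.

```python
def combine_by_sample(otu_rows, meta_rows):
    """Append each sample's metadata to its OTU counts, matching on sample ID."""
    meta = {row[0]: row[1:] for row in meta_rows}
    combined = []
    for row in otu_rows:
        sample_id = row[0]
        if sample_id not in meta:
            raise KeyError("no metadata for sample " + sample_id)
        combined.append(row + meta[sample_id])
    return combined

# Made-up rows (assumption); note the two inputs are in different orders
otu_rows = [["Sample2", "0", "7"], ["Sample1", "5", "2"]]
meta_rows = [["Sample1", "soil", "33.6"], ["Sample2", "marine", "10.4"]]
combined = combine_by_sample(otu_rows, meta_rows)
print(combined)
```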

Eliminate all columns in Metadata that contain "na" or "None" for every sample
--> 205 columns were eliminated
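
The column clean-up can be sketched as below, assuming the table is already split into a header list and row lists. The exact matching on "na" and "None" follows the note above; whether the real file also uses other spellings (e.g. "NA") is not something this sketch knows.

```python
def drop_empty_columns(header, rows, empty_values=("na", "None")):
    """Drop columns whose value is 'na' or 'None' in every row.

    Returns (new_header, new_rows, number_of_columns_dropped).
    """
    keep = [
        i for i in range(len(header))
        if any(row[i] not in empty_values for row in rows)
    ]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows, len(header) - len(keep)

# Made-up metadata (assumption)
header = ["SampleID", "ph", "depth", "notes"]
rows = [
    ["S1", "6.5", "na", "None"],
    ["S2", "na", "na", "None"],
]
new_header, new_rows, dropped = drop_empty_columns(header, rows)
print(new_header, dropped)
```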

--> GRAND TOTAL = combined-samples-otu-table-annotated.xlsx
53 OTUs with 2489 samples with 271 columns of Metadata!

Tuesday, March 17, 2015

EMP, GreenGenes - Make Local DB and BLAST

Create a reference database from my GreenGenes + 16S strains

I used the rep_seqs that were generated when I created my phyla tree as my database.
Made a new file - curto-db.fasta
*these are aligned rep_set seqs

Two ways to create your own local database:

1.   Use the BLAST command line

The sequences need to be in a specific format:

Ex.
>gnl|831711|Microbacteriaceae_Candidatus_Rhodoluna
DNA here

makeblastdb
$ makeblastdb -in curto-db.fasta -dbtype nucl -out curto.db
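
Headers in the layout shown in the example above could be generated with a small helper like this one. The function name and the rule of replacing whitespace with underscores are assumptions; the notes only give the one example header.

```python
def gnl_header(seq_id, taxon_name):
    """Build a FASTA header of the form >gnl|<id>|<name_with_underscores>."""
    clean_name = "_".join(str(taxon_name).split())
    return ">gnl|{}|{}".format(seq_id, clean_name)

print(gnl_header(831711, "Microbacteriaceae Candidatus Rhodoluna"))
```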

Find out more details HERE

2. Use Geneious

Tools -> Sequence Search

Window pops up and click "Add/Remove Databases" - select "Add Sequence Database"
Follow instructions (ie. select 'nucleotide' and 'custom BLAST')

Perform Sequence Search again, but this time Select "Database" and scroll to your new custom database!

___________________________________________________________________________

Next, BLAST the EMP seqs against my local database.
*The EMP seqs came from QIIME assign_taxonomy.py; I took those assigned to Curtobacterium with a confidence score greater than 0.67
*The seqs are also extremely short - less than 200 bp

Export the data to a .txt file

Really strange results - EMP seqs hit rep_seqs at equal frequency
Need to look at seqs in Geneious and check alignments!

Thursday, March 12, 2015

EMP - OTU Table

FINALLY FIGURED OUT HOW TO GET OTU TABLE!


  1. Remember the EMP Open .biom file was too large (too much memory - crashed Python)
  2. Converted the format to an HDF5 file for easier manipulation
  3. Found this convenient python class
  4. That class then enables the commands below (only if biom is installed and the HDF5 file is in the correct format)


$ biom subset-table -i full_emp_table_hdf5.h5 -a observation -s curto-only-ids.txt -o full_emp_table_curto.biom

$ biom convert -i full_emp_table_curto.biom -o full_emp_table_curto.txt --to_tsv --header-key taxonomy


Monday, March 9, 2015

GreenGenes - Pipeline and Phylogenies

1. Download the entire GreenGenes database

Need:
gg_13_5.fasta
gg_13_5_taxonomy.txt

2. Search for taxonomy of interest - start with Microbacteriaceae
#creates a text file with IDs matching search
$ egrep "f__Microbacteriaceae" gg_13_5_taxonomy.txt | awk '{print $1}' > ./gg-microbacteriaceae.txt

3. micro-only.py 
#searches fasta file and creates a new fasta file with only IDs from gg-microbacteriaceae.txt
#found 5707 sequences
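
A rough sketch of what a script like micro-only.py might do (this is not the original script). It assumes the sequence ID is the first whitespace-separated token of each FASTA header; the function names and toy records are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def filter_fasta(lines, wanted_ids):
    """Return records whose ID (first token of the header) is wanted."""
    wanted = set(wanted_ids)
    return [
        (header, seq) for header, seq in read_fasta(lines)
        if (header.split() or [""])[0] in wanted
    ]

# Made-up records (assumption); sequences may span several lines
demo = [">101 some taxon", "ACGT", "ACGT", ">202", "GGCC", ">303", "TTAA"]
records = filter_fasta(demo, ["101", "303"])
print(len(records), "sequences found")
```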

4. Combine my 16S reads
$ cat my-16S-reads.fasta gg-micro.fasta > output.fasta

5. QIIME - pick_otus.py - generates 327 OTUs
-m uclust
-s 0.97
-A #optimal search

***Swarm loses OTUs when running due to its algorithm

6. QIIME - pick_rep_set.py
-f gg-all-microbacteriaceae-with-16S.fasta
-r my-16S-reads.fasta
-m longest

6B. fasta-rename.py #renames all seqs with new names on fasta header

7. Align rep_set sequences with SINA

8. Eliminate all OTUs with <20 seqs EXCEPT for Curtobacterium OTUs (also did <50 seqs)
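
The size filter in step 8 can be expressed as a short sketch. It assumes OTU sizes and taxonomy strings are available as dictionaries keyed by OTU ID, and that Curtobacterium OTUs can be recognized by the genus name appearing in the taxonomy string; all names and values below are made up.

```python
def keep_otus(otu_sizes, otu_taxonomy, min_seqs=20, protected="Curtobacterium"):
    """Keep OTUs with at least min_seqs sequences, plus any protected-genus OTU."""
    return sorted(
        otu for otu, size in otu_sizes.items()
        if size >= min_seqs or protected in otu_taxonomy.get(otu, "")
    )

# Made-up sizes and taxonomy (assumption)
sizes = {"OTU1": 150, "OTU2": 12, "OTU3": 3, "OTU4": 25}
taxonomy = {
    "OTU1": "f__Microbacteriaceae; g__Microbacterium",
    "OTU2": "f__Microbacteriaceae; g__Curtobacterium",
    "OTU3": "f__Microbacteriaceae; g__Agrococcus",
    "OTU4": "f__Microbacteriaceae; g__Frigoribacterium",
}
print(keep_otus(sizes, taxonomy))               # threshold of 20
print(keep_otus(sizes, taxonomy, min_seqs=50))  # threshold of 50
```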

9. jModelTest - Computes likelihood scores with PhyML
Base Frequencies +F
Rate Variation +I +G nCat=4
ML Optimized
Base Tree Search = NNI

Best Models:
        Models          BIC Calculation
      TIM1 + G                27589
      TrN + G                 27593
      GTR + G                 27608
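
For reference, the BIC used to rank models is BIC = -2 ln(L) + k ln(n), where L is the maximized likelihood, k the number of free parameters, and n the sample size (lower is better). The sketch below only illustrates the formula; the log-likelihoods, parameter counts, and alignment length are invented, and how jModelTest counts parameters and sample size for the table above is not recorded in these notes.

```python
import math

def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion: -2 lnL + k ln(n). Lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_sites)

# Hypothetical models: (lnL, number of free parameters) - not real results
candidates = {
    "modelA": (-13700.0, 70),
    "modelB": (-13690.0, 74),
}
n_sites = 1400  # hypothetical alignment length
ranked = sorted(
    ((name, bic(lnl, k, n_sites)) for name, (lnl, k) in candidates.items()),
    key=lambda pair: pair[1],
)
for name, score in ranked:
    print(name, round(score, 1))
```

In this made-up case the simpler model ranks first even though its likelihood is lower, because the ln(n) penalty on the extra parameters outweighs the gain.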

10. Run TrN+G model on MEGA
-Maximum Likelihood
-Nucleotide Substitution = TrN
-Bootstrap Method = 100
-Gamma Distributed = 5
-Complete Deletion 
-NNI

11. Run GTR+G on RAxML - see RAxML manual for help
$ raxmlHPC -s input_file.phy -n output_name -m GTRGAMMA -# 100 -x 100 -p 2389 -f a -o outgroup_name


RAxML - GTR+G with OTUs > 50 seqs

 MEGA - TrN + G with OTUs > 50 seqs

Tuesday, March 3, 2015

GreenGenes - Phylogenetics Background

Been working on this for a few weeks, but I'll summarize:

Brief overview of Phylogenetics:

Multiple Sequence Alignment
generates a score between pairs of sequences

MUSCLE - multiple alignment software includes distance estimations using Kmer

Clustalw - takes a set of input sequences and carries out progressive alignment
   --> sequences are aligned in pairs in order to generate a distance matrix
   --> uses a Neighbor-Joining method to produce an unrooted tree, which serves as the guide for multiple alignment

                       INPUT DATA                         METHOD
                   2 - 100 protein seqs                      MUSCLE
          100 - 500 seqs globally aligned            
                        > 500 seqs
              small number of large seqs                Clustalw

Genetic Distance and Nucleotide Substitution Models
Genetic Distance - evolutionary distance
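
As a concrete example of a genetic distance, here is a sketch of the simplest substitution model, Jukes-Cantor (JC69), which corrects the observed proportion of differing sites p to d = -(3/4) ln(1 - 4p/3). JC69 is my choice for illustration, not a model used elsewhere in these notes; the function names and sequences are made up, and sites with gaps or ambiguous bases are simply skipped.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences (A/C/G/T only)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [
        (a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def jc69_distance(p):
    """Jukes-Cantor corrected distance for an observed p-distance."""
    if p >= 0.75:
        raise ValueError("p must be below 0.75 for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

p = p_distance("ACGTACGT", "ACGTACGA")
print(p, jc69_distance(p))
```

The corrected distance is always at least as large as p, since it accounts for multiple substitutions at the same site.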

Rate Heterogeneity among sites - rate of nucleotide substitution can vary substantially for different positions
   --> Use Gamma Distribution - expectation 1.0 with variance 1/alpha
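
A quick numerical check of that statement: a gamma distribution with shape alpha and scale 1/alpha has mean 1 and variance 1/alpha. The sketch below draws per-site rates with the standard library; the function name, seed, and alpha value are arbitrary choices for illustration.

```python
import random
import statistics

def sample_site_rates(alpha, n_sites, seed=0):
    """Draw per-site relative rates from Gamma(shape=alpha, scale=1/alpha)."""
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

alpha = 0.5  # small alpha = strong rate heterogeneity (arbitrary example value)
rates = sample_site_rates(alpha, n_sites=20000)
mean_rate = statistics.fmean(rates)
var_rate = statistics.pvariance(rates)
print("mean ~", round(mean_rate, 2), "variance ~", round(var_rate, 2))
```

With alpha = 0.5 the sample mean should land near 1.0 and the variance near 2.0; tree programs typically approximate this distribution with a few discrete rate categories rather than sampling from it.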

Phylogenetic Inference based on Distance Methods
Try to fit a tree to a matrix of genetic distances

Minimum Evolution (ME) - distance method for constructing additive trees to minimize length of tree
Neighbor-Joining - minimizes steps by finding a pair of neighboring OTUs

Phylogenetic Inference using Maximum Likelihood (ML) Methods
Highest probability of observed data under a set of parameters
Determines tree topology, branch lengths, and parameters of evolutionary model that maximizes the probability of observing the sequences in a particular arrangement

--> GOAL - to find tree among all possible tree structures that maximizes the global likelihood
However, impossible to compute all possible trees -> need to add heuristics
1. Stepwise Addition
2. Star Decomposition
3. Neighbor-Joining

PhyML - uses a fast distance-based method to quickly compute a full initial tree
RAxML - builds a starting tree with maximum parsimony and optimizes it with a variant of subtree rearrangement
   Uses Lazy Subtree Rearrangement (LSR) - assigns a maximal distance between the pruning and insertion points for subtree prune and regraft (SPR) operations to restrict the size of the neighborhood
   Optimizes only the branch that originates at the pruning point
   Repeats using the current best tree
   Takes the 20 best trees found during LSR to reoptimize ML by adjusting branch lengths

Branch Support - all methods produce a single tree and ML values
Bootstrapping:
1. Pseudo-samples are created by randomly drawing with replacement l columns from the original l column alignment
2. From each pseudo-sample, a tree is reconstructed and a consensus tree is made
   Consensus Tree - incorporates branches that occur in the majority of trees
   Bootstrap Values used as an indicator for reliability of branches
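
Step 1 of the bootstrap can be sketched as follows: draw as many column indices as the alignment has columns, with replacement, and rebuild every sequence from those columns. The function name and the toy alignment are assumptions; tree building and the consensus step are left to the phylogenetics software.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all sequences must have the same aligned length")
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }

# Made-up alignment (assumption)
alignment = {"seqA": "ACGTAC", "seqB": "ACGTTC", "seqC": "ACCTAC"}
pseudo = bootstrap_alignment(alignment, random.Random(1))
for name, seq in pseudo.items():
    print(name, seq)
```

Because the same column indices are used for every sequence, each resampled column is still a real column of the original alignment; only their frequencies change between pseudo-samples.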

Thursday, February 26, 2015

BACE - DNA combination and Ship

Some samples still have poor yields, so combine samples and reconcentrate

Followed protocol for Amicon Pro Purification System:

 MCBA15 004 - combine 4.1 and 4.2 from 2/25
 MCBA15 007 - combine from both extraction days
 MCBA15 015 - combine from both extraction days
 MCBA15 017 - combine from both extraction days
 MCBA15 019 - combine 19.1 and 19.2 from 2/25
 MMLR15 020 - combine from both extraction days

Quantified with Qubit on BioTek at 485/530 nm

Sample ID     Concentration (ng/uL)   Volume (uL)
MCBA15 004           20.8                 95
MCBA15 007            7.7                 90
MCBA15 015           34.9                100
MCBA15 017*          81.0                100 
MCBA15 019           11.0                 90
MMLR15 020            4.9                 95
 *split in two  
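
Total DNA per tube is just concentration times volume. The sketch below uses the values from the table above; the function name and the rounding to one decimal are my choices. Where a sample was split or only part of the volume was shipped, the shipped amount will be lower than the total computed here.

```python
def total_dna_ng(conc_ng_per_ul, volume_ul):
    """Total DNA in ng = concentration (ng/uL) x volume (uL)."""
    return round(conc_ng_per_ul * volume_ul, 1)

# (sample ID, concentration in ng/uL, volume in uL), copied from the table above
samples = [
    ("MCBA15 004", 20.8, 95),
    ("MCBA15 007", 7.7, 90),
    ("MCBA15 015", 34.9, 100),
    ("MCBA15 017", 81.0, 100),
    ("MCBA15 019", 11.0, 90),
    ("MMLR15 020", 4.9, 95),
]
for sample_id, conc, vol in samples:
    print(sample_id, total_dna_ng(conc, vol))
```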

Shipped out sample on 3/3/15 (due to weather):

Sample ID        Total DNA (ng)
MCBA15 004          1976.0
MCBA15 007           693.0
MMLR15 010*          672.0
MMLR15 011*          616.0
MCBA15 015          3490.0
MCBA15 017          3645.0
MCBA15 019           990.0

Samples delivered and received 3/4/15 - email from Michael

_____________________________________
Samples not sent out due to poor yields:

MMLR15 018 - slow growing
MCBA15 021 - Frigoribacterium; did not redo
MMLR15 022 - Frigoribacterium; did not redo

Tuesday, February 24, 2015

BACE - DNA Extraction Pt II

Try and extract DNA from samples that I could not get enough DNA from on 2/17
   ***for samples with really poor yields from last time, extracted two sets

Need more Lysozyme - 60 uL of 10 mg/mL x 20 samples

   TEN Buffer: 

      40 mM Tris-HCl pH=7.5
      1 mM EDTA pH=8.0
      150 mM NaCl

      Stock Solutions:      
      400 mM Tris-HCl = 6.30 g in 100 mL dH2O
      100 mM EDTA = 2.92 g in 100 mL dH2O
      300 mM NaCl = 1.75 g in 100 mL dH2O

      --> 1500 uL TEN Buffer = 750 uL NaCl + 15 uL EDTA + 150 uL Tris-HCl + 585 uL ddH2O

   Add 1 mL TEN Buffer + 10 mg Lysozyme = 10 mg/mL
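
The buffer volumes above follow from C1 x V1 = C2 x V2. A small sketch that reproduces them from the stock and target concentrations listed in the recipe (the function name is made up):

```python
def stock_volume_ul(stock_mM, final_mM, final_volume_ul):
    """Volume of stock needed so that stock_mM * V1 = final_mM * final_volume_ul."""
    return final_mM * final_volume_ul / stock_mM

final_volume = 1500  # uL of TEN buffer
# component: (stock concentration in mM, target concentration in mM)
recipe = {
    "Tris-HCl": (400, 40),
    "EDTA": (100, 1),
    "NaCl": (300, 150),
}
volumes = {
    name: stock_volume_ul(stock, final, final_volume)
    for name, (stock, final) in recipe.items()
}
volumes["ddH2O"] = final_volume - sum(volumes.values())
for name, vol in volumes.items():
    print(name, vol, "uL")
```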

Followed Promega Wizard DNA Purification Kit Protocol for gram-positive bacteria
   EXCEPT:
      Added 2 mL of liquid grown culture
      Added 60 uL of 10 mg/mL lysozyme + 60 uL ddH2O = 120 uL
      Added 60 uL of Rehydration Solution

Quantified with Qubit kit on BioTek at 485/530 nm

Sample ID     Concentration (ng/uL)
MCBA15 004.1      7.9
MCBA15 004.2     20.8
MCBA15 007        4.9
MMLR15 010        0.0 - probably lost pellet 
MCBA15 015       10.7
MCBA15 017.1      8.7
MCBA15 017.2      0.3
MMLR15 018.1      1.0 - grows slow, not much input
MMLR15 018.2      3.1 - grows slow, not much input
MCBA15 019.1      7.5
MCBA15 019.2     13.5
MMLR15 020       16.6

Tuesday, February 17, 2015

BACE - DNA Extractions and Shipment

Shipped out samples to MIT - Martin Polz and Michael Cutler

Curtobacterium samples (n=11) sent on 2/17/15:

Sample ID      Total DNA (ng)
MCBA15 001         981.0
MMLR15 002        1254.2
MCBA15 003         893.8
MCBA15 005         953.9
MMLR15 006         943.6
MCBA15 008        1657.8
MCBA15 009         922.6
MCBA15 012         962.3
MCBA15 013        1240.4
MMLR15 014        1153.8
MCBA15 016         756.7

Samples Received on 2/19/15

Thursday, February 12, 2015

BACE - DNA Extraction

Followed Promega DNA Extraction Kit Protocol

Results were better than Spin Column method, but still not great for some samples.


Wednesday, January 28, 2015

EMP - Align OTUs

Took the rep set of sequences and pulled out Curtobacterium OTUs only (Curtobacterium were assigned by GreenGenes database)

Aligned curto only sequences with SINA

Sequences are really short (~150 bp) - see how they incorporate into sequenced data from BACE litter (align with all sequences that were a hit for Microbacteriaceae)

BioCluster

Gained access to the BioCluster

Login: ssh username@hpc.oit.uci.edu

Help: cat /data/help/cheat-sheet.txt

Guidebook on how to use the BioCluster created by Kevin Thornton:

Example to run jobs on the cluster:

Batch jobs are jobs that contain all of necessary information and instructions to run inside a script. You create a script with your favorite editor (like emacs) and then submit the script to the scheduler to run.

Some jobs can run for days, weeks, or longer so batch is the way to go for such work. Once you submit a job to the scheduler, you can log off and come back at a later time and check on the results.

Serial batch jobs are usually the simplest to use. Serial jobs run with only one core and are also the slowest since they only consume 1-core per job.
Consider the following serial job script available from the HPC demo account.
  $ cat ~demo/serial.sh
#!/bin/bash
#$ -N TEST
#$ -q free64
#$ -m beas

date  > out
Grid Engine Directive    What It Does
#!/bin/bash              Running shell to use (the bash shell)
#$ -N TEST               Our job name is TEST. If output is produced to standard out, you will see a file
                         named TEST.o<jobid>, and TEST.e<jobid> for errors (if any occurred)
#$ -q free64             Request the free64 queue
#$ -m beas               Send you email of job status: (b)egin, (e)nd, (a)bort, (s)uspend
The first line #!/bin/bash is the shell to use. Grid Engine (GE) directives start with #$. GE directives are needed in order to tell the scheduler what queue to use, how many cores to use, whether to send email or not, etc.
The last line in our serial.sh script is the program to run. In this example it is a simple date program writing the output to out file.
date > out
Now that we have a basic understanding let’s run our first serial batch job on the HPC Cluster. First create a test directory, change to the test directory, copy the demo serial.sh script to our new directory and submit the job.
From your HPC account, do the following:
$ mkdir serial-test
$ cd serial-test
$ cp ~demo/serial.sh .
$ qsub serial.sh
$ qstat -u $USER
After you submit the job (qsub), GE will respond with a job ID:
Your job 1961 ("TEST") has been submitted
and qstat will display something similar to this:
job-ID  prior   name   user     state submit/start  queue       slots

  1961 0.00000  TEST  jfarran   qw    08/16/2012                 1
The state of our job is qw queue wait (meaning the job is sitting in the queue waiting for a compute node). The core count (slots) shows as 1 (this is the default which is one core).
When we run qstat -u $USER again a few seconds later, we see:
job-ID  prior   name   user    state submit/start  queue               slots

  1961 0.50659  TEST  jfarran   r   08/16/2012    free64@compute-7-11   1
The scheduler found compute-7-11 on free64 queue available with 1 core (slots) and started our job #1961 on it. The job state changed from queue wait qw to running r.
Note: Once you submit your job (qsub), things happen rather quickly, so you may need to type qstat repeatedly and fast to see your job. Or open a new window and run: watch -d "qstat -u $USER"
Once the job completes you will get an email notification and the qstat output will be empty.
Now do an ls and you will see the following files:
out  serial.sh
The serial.sh is the batch job we submitted and file out is the output from the date program. To see the output type:
$ cat out

Monday, January 26, 2015

BACE, Curto - Phylogeny

Used ARB-Silva SINA to align sequences and ARB to generate phylogeny

***May take out sequences (see below post; specifically AB_3.17L, AB_3.19L, AB_3.27L) that do not have full 16S gene - do not fit great into phylogeny


EDIT: new phylogeny with scale generated by ARB tree generator 


Wednesday, January 21, 2015

BACE, Curto - Sequence Data

Received sequence data from Beckman Genomics Institute

Trimmed sequences:
   Eliminated below 5% probability and trimmed first 20 bp (length of primers)

Primers used:
Forward Primer 
AGAGTTTGATCCTGGCTCAG
Reverse Primer
AAGGAGGTGATCCAGCCGCA
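
The fixed-length trim and the reverse-complement needed to line up forward and reverse reads can be sketched as below. This is only an illustration of the two operations, not the trimming tool actually used; the function names and the toy read are assumptions, and quality-based trimming is not covered.

```python
FORWARD_PRIMER = "AGAGTTTGATCCTGGCTCAG"
REVERSE_PRIMER = "AAGGAGGTGATCCAGCCGCA"

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

def trim_start(read, n_bases=len(FORWARD_PRIMER)):
    """Drop the first n_bases of a read (20 bp here, the primer length)."""
    return read[n_bases:]

def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A/C/G/T/N only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

# Made-up read (assumption): forward primer followed by a short insert
read = FORWARD_PRIMER + "GGCTTAACCC"
print(trim_start(read))
print(reverse_complement(REVERSE_PRIMER))
```

Reverse-complementing the reverse read puts it in the same orientation as the forward read, which is what an assembler needs to find an overlap.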
Assembled de novo:
   Samples with No contig overlap - could not assemble (n = 12) - used either F or R strand for id
      AB 3.02L - gel shows blurry line at 1500 bp
      AB 3.04L - multiple bands 
      AB 3.05L - bright band at 1500 bp
      AB 3.12L - bright, smeared band at 1500 bp
      AB 3.17L - multiple bands ~1500 bp
      AB 3.19L - very faint band at 1500 bp
      AB 3.27L - no visible band
      AB 3.37L - bright, smeared band at 1500 bp
      *AB 3.04B - multiple bands
      *AB 3.09B - multiple bands
      AB 3.13B - bright band at 1500 bp
      *Curto 145 (redo) - no visible band
      *low quality read percentage - did not include 

Blast samples - blast against nr/nt database