Wednesday, January 28, 2015

EMP - Align OTUs

Took the representative set of sequences and pulled out the Curtobacterium OTUs only (taxonomy was assigned using the Greengenes database)
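
A minimal sketch of how this filtering step could be scripted, not necessarily how it was done here. It assumes a tab-delimited taxonomy file (OTU ID, then a Greengenes-style string such as "k__Bacteria; ... g__Curtobacterium; s__") and a rep set FASTA whose headers begin with the OTU ID. The function name and file layout are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the FASTA records whose OTU ID is assigned to a genus.
# Assumed inputs (not taken from the notebook):
#   $1 = taxonomy file, tab-delimited: OTU_ID <TAB> taxonomy string with g__Genus
#   $2 = FASTA file whose header lines start with ">OTU_ID"
#   $3 = genus name, e.g. Curtobacterium
extract_genus_otus() {
    local taxonomy="$1" fasta="$2" genus="$3"
    awk -F'\t' -v tag="g__${genus}" '
        # First file: remember OTU IDs whose taxonomy string contains the genus tag.
        NR == FNR { if (index($2, tag)) keep[$1] = 1; next }
        # Second file: at each header, decide whether this record is kept.
        /^>/ { id = substr($1, 2); sub(/[ \t].*/, "", id); printing = (id in keep) }
        # Print header and sequence lines while inside a kept record.
        printing
    ' "$taxonomy" "$fasta"
}
```

Example call (file names are placeholders): extract_genus_otus otu_taxonomy.txt rep_set.fna Curtobacterium > curto_only.fna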

Aligned the Curtobacterium-only sequences with SINA

Sequences are very short (~150 bp). Next: see how they fit with the sequence data from BACE litter (align with all sequences that were a hit for Microbacteriaceae)

BioCluster

Gained access to the BioCluster

Login: ssh username@hpc.oit.uci.edu

Help: cat /data/help/cheat-sheet.txt

Guidebook on how to use the BioCluster created by Kevin Thornton:

Example to run jobs on the cluster:

Batch jobs are jobs that contain all of the necessary information and instructions to run inside a script. You create a script with your favorite editor (like emacs) and then submit the script to the scheduler to run.

Some jobs can run for days, weeks, or longer so batch is the way to go for such work. Once you submit a job to the scheduler, you can log off and come back at a later time and check on the results.

Serial batch jobs are usually the simplest to use. Serial jobs run with only one core and are also the slowest since they only consume 1-core per job.
Consider the following serial job script available from the HPC demo account.
$ cat ~demo/serial.sh
#!/bin/bash
#$ -N TEST
#$ -q free64
#$ -m beas

date  > out
Grid Engine Directive    What It Does
#!/bin/bash              Shell used to run the script (the bash shell)
#$ -N TEST               Our job name is TEST. If output is written to standard out, you will see a file named TEST.o<jobid>, plus TEST.e<jobid> for errors (if any occurred)
#$ -q free64             Request the free64 queue
#$ -m beas               Send you email of job status: (b)egin, (e)rror, (a)bort, (s)uspend
The first line #!/bin/bash is the shell to use. Grid Engine (GE) directives start with #$. GE directives are needed in order to tell the scheduler what queue to use, how many cores to use, whether to send email or not, etc.
The last line in our serial.sh script is the program to run. In this example it is the simple date program, writing its output to a file named out.
date > out
Now that we have a basic understanding let’s run our first serial batch job on the HPC Cluster. First create a test directory, change to the test directory, copy the demo serial.sh script to our new directory and submit the job.
From your HPC account, do the following:
$ mkdir serial-test
$ cd serial-test
$ cp ~demo/serial.sh .
$ qsub serial.sh
$ qstat -u $USER
After you submit the job (qsub), GE will respond with a job ID:
Your job 1961 ("TEST") has been submitted
and qstat will display something similar to this:
job-ID  prior   name   user     state submit/start  queue       slots

  1961 0.00000  TEST  jfarran   qw    08/16/2012                 1
The state of our job is qw (queue wait), meaning the job is sitting in the queue waiting for a compute node. The core count (slots) shows as 1, which is the default of one core.
When we run qstat -u $USER again a few seconds later, we see:
job-ID  prior   name   user    state submit/start  queue               slots

  1961 0.50659  TEST  jfarran   r   08/16/2012    free64@compute-7-11   1
The scheduler found compute-7-11 on the free64 queue available with 1 core (slot) and started our job #1961 on it. The job state changed from queue wait (qw) to running (r).
Note: Once you submit your job (qsub), things happen rather quickly, so you may need to type qstat repeatedly and fast to see your job. Or open a new window and run: watch -d "qstat -u $USER"
Once the job completes you will get an email notification and the qstat output will be empty.
Now do an ls and you will see the following files:
out  serial.sh
The serial.sh is the batch job we submitted and file out is the output from the date program. To see the output type:
$ cat out
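
The repeated qstat checks in the walkthrough above could be scripted. This is a minimal sketch, assuming qstat output laid out like the examples shown (job ID in column 1, state in column 5) and that a finished job no longer appears in the output. The function names job_state and wait_for_job are hypothetical, not part of Grid Engine.

```shell
#!/bin/bash
# Hypothetical helper: read qstat output on stdin and print the state
# (e.g. qw or r) of the job whose ID is given as $1. Prints nothing if
# the job is not listed.
job_state() {
    awk -v id="$1" '$1 == id { print $5; exit }'
}

# Hypothetical helper: block until the job no longer appears in qstat.
# Assumes qstat is on the PATH, as it is on the cluster login node.
wait_for_job() {
    local id="$1"
    while [ -n "$(qstat -u "$USER" | job_state "$id")" ]; do
        sleep 10   # poll every 10 seconds; adjust as needed
    done
}
```

For example, qstat -u $USER | job_state 1961 would print qw while the job above was waiting and r once it started.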

Monday, January 26, 2015

BACE, Curto - Phylogeny

Used ARB-Silva SINA to align sequences and ARB to generate phylogeny

***May take out sequences that do not have the full 16S gene (see the post below; specifically AB_3.17L, AB_3.19L, AB_3.27L) - they do not fit well into the phylogeny


EDIT: new phylogeny with scale generated by ARB tree generator 


Wednesday, January 21, 2015

BACE, Curto - Sequence Data

Received sequence data from Beckman Genomics Institute

Trimmed sequences:
   Eliminated below 5% probability and trimmed first 20 bp (length of primers)
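
A minimal sketch of the fixed-length part of this trim (removing the first 20 bp), for illustration only. The notebook does not say which tool did the trimming, so this is not the method used here, and it does not cover the quality filter. It assumes a FASTA file with each sequence on a single line. The function name is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: remove the first N bases (default 20, the primer
# length) from every sequence in a FASTA file.
# Assumes each sequence sits on one line (no wrapped sequence lines).
#   $1 = FASTA file, $2 = number of leading bases to remove
trim_leading_bases() {
    local fasta="$1" n="${2:-20}"
    awk -v n="$n" '
        /^>/ { print; next }           # header lines pass through unchanged
        { print substr($0, n + 1) }    # sequence lines lose their first n bases
    ' "$fasta"
}
```

Example call (file names are placeholders): trim_leading_bases reads.fasta 20 > reads_trimmed.fasta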

Primers used:
Forward Primer 
AGAGTTTGATCCTGGCTCAG
Reverse Primer
AAGGAGGTGATCCAGCCGCA
Assembled de novo:
   Samples with no contig overlap - could not assemble (n = 12) - used either the F or R strand for ID
      AB 3.02L - gel shows blurry line at 1500 bp
      AB 3.04L - multiple bands 
      AB 3.05L - bright band at 1500 bp
      AB 3.12L - bright, smeared band at 1500 bp
      AB 3.17L - multiple bands ~1500 bp
      AB 3.19L - very faint band at 1500 bp
      AB 3.27L - no visible band
      AB 3.37L - bright, smeared band at 1500 bp
      *AB 3.04B - multiple bands
      *AB 3.09B - multiple bands
      AB 3.13B - bright band at 1500 bp
      *Curto 145 (redo) - no visible band
      *low quality read percentage - did not include 

Blast samples - blast against nr/nt database