Research: EMP

Showing posts with label EMP. Show all posts

Tuesday, March 24, 2015

EMP - Matrix and OTU table

Need to analyze EMP metadata:

Raw data - EMP_10k_merged_mapping_final.txt and full_emp_table_w_tax.biom

I was able to pull out Curto OTUs last week (see post from 3/12/15) from the full_emp.biom and convert to .txt file to be able to manipulate further.

Giant EMP_10k file has 14095 samples. The Curto OTU table only has 2882 samples.

Took the list of samples that appear in Curto OTU table and made a list. Wrote a rough code sample-ids-curto.py to parse EMP_10k file and extract only samples from Curto OTU table.
Creates a new file - curto-samples.csv

$ wc -l curto-samples.csv
2490

This means there were roughly 390 samples missing. So either code has a bug OR samples NOT in the EMP_10k file. Turns out, they are not in the EMP_10k file (no idea why?).
$vimdiff file1 file2
Samples not included in further analysis found in samples-not-in-csv.txt

Need to redefine Curto OTU table - eliminate samples that are not found in the EMP_10k file.
Modified previous code slightly to parse Curto OTU Table and pull out correct samples.
*First had to transpose OTU table to get in correct format - code parses the first string in each row

OLD FORMAT
Sample1 Sample2 Samplen
OTU1
OTU2
OTUn

Creates a new file - full-emp-curto-only-with-found-samples.csv

Compare number of samples to check: both files have 2490 samples

Sort both curto-samples.csv and full-emp-curto-only-with-found-samples.csv
$ sort curto-samples.csv curto-samples-sorted.csv
#and for other file

Combine two files and check to make sure sample IDs match-up *they should since they were sorted
Creates a new file - combined-samples-otu-table.csv

NEW FORMAT
OTU1 OTU2 OTUn … METADATA
Sample1
Sample2
Samplen

Eliminate all columns in Metadata that contain "na" or "None" for every sample
--> 205 columns were eliminated

--> GRAND TOTAL = combined-samples-otu-table-annotated.xlsx
53 OTUs with 2489 samples with 271 columns of Metadata!

Tuesday, March 17, 2015

EMP, GreenGenes - Make Local DB and BLAST

Create a reference database from my GreenGenes + 16S strains

I used the rep_seqs that were generated when I created my phyla tree as my database.
Made a new file - curto-db.fasta
*these are aligned rep_set seqs

Two ways to create your own local database:

1. Use the BLAST command line

The sequences need to be in a specific format:

Ex.
>gnl|831711|Microbacteriaceae_Candidatus_Rhodoluna
DNA here

makeblastdb
$ makeblastdb -in curto-db.fasta -dbtype nucl -out curto.db

Find out more details HERE

2. Use Geneious

Tools -> Sequence Search

Window pops up and click "Add/Remove Databases" - select "Add Sequence Database"
Follow instructions (ie. select 'nucleotide' and 'custom BLAST')

Perform Sequence Search again, but this time Select "Database" and scroll to your new custom database!

___________________________________________________________________________

Next, BLAST the EMP seqs against my local database.
*The EMP seqs were generated from QIIME assign_taxonomy.py and took those who identified with Curtobacterium with greater 0.67 quality score
*The seqs are also extremely short - less than 200 bp

Export the data to a .txt file

Really strange results - EMP seqs hit rep_seqs at equal frequency

Need to look at seqs in Geneious and check alignments!

Thursday, March 12, 2015

EMP - OTU Table

FINALLY FIGURED OUT HOW TO GET OTU TABLE!

Remember the EMP Open .biom file was too large (too much memory - crashed Python)
Converted format to HDF5 file for easier manipulation
Found this convenient python class
Which then enables (if biom is installed...) and only if hdf5 file is in correct format

$ biom subset-table -i full_emp_table_hdf5.h5 -a observation -s curto-only-ids.txt -o full_emp_table_curto.biom

$ biom convert -i full_emp_table_curto.biom -o full_emp_table_curto.txt --to_tsv --header-key taxonomy

Wednesday, January 28, 2015

EMP - Align OTUs

Took the rep set of sequences and pulled out Curtobacterium OTUs only (Curtobacterium were assigned by GreenGenes database)

Aligned curto only sequences with SINA

Sequences are really short (~150 bp) - see how they incorporate into sequenced data from BACE litter (align with all sequences that were a hit for Microbacteriaceae)

Thursday, December 11, 2014

EMP Biom Files Pt. VII

Went back to the hdf5 file and attempted to solve problem in MatLab (Rich helped a lot!)

Solved the array issue:

%% Load the data
a=h5info('name of hdf file');% return structured array of the hdf
% hieracrchy for reference
observdata=h5read('name of hdf file','/observation/matrix/data');
observIndices=h5read('name of hdf file','/observation/matrix/indices');
observIndptr=h5read('name of hdf file','/observation/matrix/indptr');
ids=h5read('name of hdf file','/observation/ids');

%% Get the OTU indices
% You need a cell array of strings of your desired OTUs stored as variable
% qList

for i=1:length(qList)
qInd=find(strcmp(qList,ids));
end

%% find the data

outmat=zeros(length(qList),length(sampleIndptr));

for i=1:length(qInd)
p=observdata(observIndptr(qInd(i)):observIndptr(qInd(i)+1));
pI=observIndices(observIndptr(qInd(i)):observIndptr(qInd(i)+1));
for j=1:length(pI)
outmat(i,pI(j)+1)=p(j); %plus one to correct for matlab python coordinate changes
end
end

Merged files and got the following with all metadata!!!!!

EMP Biom Files Pt. VI - Green Planet

Got access to the Green Planet server and am able to login.

Was informed by Chad Cantwell that QIIME may or may not be installed in the new environment (server was updated recently and it may not be running)

Contacted Steve Hatosy (from Adam's lab) and was super helpful.
When you login to the server (through ssh), type: . /sopt/qiime/set_paths_1.4
This should set the paths to the QIIME script

Code to run:
split_otu_table_by_taxonomy.py
-i full_emp_table_w_tax.biom #open source biom file
-L 5 #split taxonomy at Family level
-o ./L5/ #file directory

>> SystemError: Negative size passed to PyString_FromStringAndSize

This is (I think) due to large size of the input file (biom file 2.63 GB). From my correspondence with people in the Knight Lab, the file should need ~30GB of memory to load.

Protocol for Cluster:

ssh -Y abchase@gplogin3.ps.uci.edu
enter login information
pwd #gets cluster directory
exit cluster
from local terminal. you can upload files to the cluster:

scp /Users/MartinyLab/Desktop/alexs-stuff/EMP/EMPopen/full_emp_table_w_tax.biom abchase@gplogin1.ps.uci.edu:/home/abchase

Run QIIME code with cluster file directory

Tuesday, December 2, 2014

EMP Biom Files Pt. V

Still working on open source files. not really sure how to access array data from hdf5 files

However, wrote short code to organize the 2 OTUs from the closed_ref_emp_table to combine with merged mapping file:

import os
import csv

mydir = os.path.expanduser("~/Desktop/alexs-stuff/EMP/")

in_file = mydir + "EMP_10k_merged_mapping_final.txt" #master mapping file

#need txt file with sample ids that had curto hits
wanted_file = mydir + "EMPclosed/sample-ids-curto.txt"

out_file = mydir + "EMPclosed/curto-samples.csv"

wanted = set()

with open(wanted_file) as f:
for line in f:
line = line.strip()
if line != "":
wanted.add(line)

count = 0

with open(in_file, "rb") as tsvin, open(out_file, "wb") as csvout:
tsvin = csv.reader(tsvin, delimiter = '\t')
csvout = csv.writer(csvout)

for row in tsvin:

if row[0] in wanted:
count = count + 1
csvout.writerows([row])

print "Converted %i records" % count

Output looks like this in excel after some editing:
Total samples: 136 with 111 in the merged mapping file

Friday, November 21, 2014

EMP Biom Files Pt. IV

Got in touch with Daniel MacDonald from the Knight Lab:

Sent him the full_emp biom file and he said it is fine but takes about ~30GB to parse (really prohibitive). Converted the open reference biom file into hdf5 format:

ftp://thebeast.colorado.edu/pub/full_emp_table_w_tax.hdf5

Wrote the following code. Only outputs one column (OTUs), but did confirm that curt OTUs are present in the file

import os

import h5py

mydir = os.path.expanduser("~/Desktop/alexs-stuff/")

in_file = mydir + "EMP/EMPopen/full_emp_table_hdf5.h5"

wanted_file = mydir + "EMP/greengenes-curto-only.txt"

out_file = mydir + "EMP/emp-curto-only.txt"

wanted = set()

with open(wanted_file) as f:

for line in f:

line = line.strip()

if line != "":

wanted.add(line)

hdf5_file = h5py.File(in_file, "r")

count = 0

with open(out_file, "w") as h:

for keys in hdf5_file["observation"]["ids"]:

if keys in wanted:

count = count + 1

h.write(keys + "\n")

print "Converted %i records" % count

hdf5_file.close()

Monday, November 17, 2014

EMP Biom Files Pt. III

Got in touch with Sean Gibbons and he was able to forward some code:
https://github.com/klocey/rare-bio/blob/master/tools/ConvertBiom/ConvertBiom.py
***script scans through giant biom file in smaller pieces, rather than loading entire file into memory

Output is a sparse abundance matrix
each row is: OTU, site, number of reads

Example (first 10 lines from open reference biom file):
0 0 7.0
0 1 10.0
0 2 13.0
0 3 7.0
0 4 2.0
0 5 3.0
0 6 3.0
0 7 3.0
0 158 320.0
0 159 32.0

Not really sure how to interpret data

Friday, November 14, 2014

EMP Biom Files Pt. II

***ALL DONE ON THE MAC IN THE LAB***

Need to make biom file into classic format to pull out Curto files
Should look something like:

Sample

Taxonomy

OTU 1

OTU 2

…

OTU n

Biom Convert

biom convert
-i full_emp_table_w_tax_closedref.biom
-o full_emp_closedref_taxonomy.txt
--biom-to-classic-table
--header-key taxonomy

Generated .txt file but too large to export into Excel - ERROR - not enough memory

Need to breakdown master .biom file:

split_otu_table_by_taxonomy.py
-i full_emp_table_w_tax_closedref.biom
-L 3 #level3 taxonomic split (class level)
-o ./L3/

Need to split further - maybe at L5:

split_otu_table_by_taxonomy.py
-i full_emp_table_w_tax_closedref.biom
-L 5 #level5 taxonomic split (family level)
-o ./L5/

Problem: only 2 Curto OTUs present in outputted .biom file. Asked Sean Gibbons what he thought:
"Yep, any of the OTUs with 'New' in the name did not hit the reference database, so you won't find them in the closed ref table. Number-only labels are Greengenes IDs, and those should all be in the closed ref table."

So - try again with open reference database? crashed last time when I ran previous code on the open reference biom file (see Pt I)
RESULT: SystemError: Negative size passed to PyString_FromStringAndSize

^Probably due to too large of input file
full_emp_table_w_tax.biom is 2.63 GB - probably crashes as a security measure

Tuesday, November 11, 2014

EMP Biom Files

Not too sure what to do with .biom files from S.Gibbons

Two files:

full_emp_table_w_tax_closedref.biom
full_emp_table_w_tax.biom

Need to figure out how to pull Curtobacterium metadata from above master files

***Neither attempt has been able to utilize the open reference .biom file - ERROR***

Attempt 1:

QIIME - filter_samples_from_otu_table.py

Under "List-based Filtering":

-i full_emp_table_w_tax_closedref.biom
--sample_id_fp curto-only.txt

RESULT: nothing
PROBLEM: curto-only.txt contains OTUs, not individual samples

Attempt 2:

QIIME - filter_otus_from_otus_table.py

Use feature to extract Curtobacterium OTUs

-i full_emp_table_w_tax_closedref.biom
-e curto.only.txt #this excludes samples from new .biom
--negate_ids_to_exclude

RESULT: generates new .biom file
PROBLEM: I think it only contains a handful of OTUs, or not working?

Summary table from .biom

Num samples: 15481
Num observations: 2
Total count: 197
Table density (fraction of non-zero values): 0.005

Compared to Summary table from master .biom

Num samples: 15481
Num observations: 69444
Total count: 654448644
Table density (fraction of non-zero values): 0.016

Generated by "biom summarize-table" function

EMP Update

I was able to figure out which OTUs from the rep_set file were Curtobacterium:

Searched taxonomic assignment file from S.Gibbons for "Microbacteriaceae" n=2713

searchfile = open("rep_set_tax_assignments.txt", "r")
for line in searchfile:
if "f__Microbacteriaceae" in line: print line
searchfile.close()

Created a smaller fasta file by pulling out Microbacteriaceae sequences from giant 'rep_set.fna' file from S.Gibbons

from Bio import SeqIO
fasta_file = "rep_set.fna" #input fasta file
wanted_file = "microbacteriaceae-only.txt" #input interesting sequence IDs, one per line
result_file = "microbacteriaceae-only.fasta" #output fasta file
wanted = set()
with open(wanted_file) as f:
for line in f:
line = line.strip()
if line != "":
wanted.add(line)
fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
count = 0
with open(result_file, "w") as f:
for seq in fasta_sequences:
if seq.id in wanted:
count = count + 1
SeqIO.write([seq], f, "fasta")

print "Coverted %i records" % count

QIIME - assign_taxonomy.py on new 'microbacteriaceae-only.fasta'

Aligned with GreenGenes core set (same reference as GenBank protocol)

Performed above procedure to generate 'curtobacterium-only.fasta' n=53

Monday, November 10, 2014

Earth Microbiome (EMP)

EMP has been down for months. Trying to get access to their databases.
Heard from EMP: rebuilding database. any day now it will be up and running

Got access to the EMP files from Jack Gilbert and Sean Gibbons

open reference OTU table (.biom)
closed reference OTU table (.biom)
rep sequences (.fna)
phylogenetic tree (.tre)
metadata file (.txt)
taxonomic assignments (.txt) - from emp_10k_rdp