Research: November 2014

Friday, November 21, 2014

EMP Biom Files Pt. IV

Got in touch with Daniel MacDonald from the Knight Lab:

Sent him the full_emp biom file. He said the file itself is fine but takes ~30 GB to parse (really prohibitive). The open reference biom file was converted into HDF5 format:

ftp://thebeast.colorado.edu/pub/full_emp_table_w_tax.hdf5

Wrote the following code. It only outputs one column (OTU IDs), but it did confirm that Curto OTUs are present in the file:

import os

import h5py

mydir = os.path.expanduser("~/Desktop/alexs-stuff/")
in_file = mydir + "EMP/EMPopen/full_emp_table_hdf5.h5"
wanted_file = mydir + "EMP/greengenes-curto-only.txt"
out_file = mydir + "EMP/emp-curto-only.txt"

# Read the wanted OTU IDs, one per line, skipping blank lines
wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)

# Scan the observation (OTU) IDs in the HDF5 table and keep the wanted ones
count = 0
hdf5_file = h5py.File(in_file, "r")
with open(out_file, "w") as h:
    for otu_id in hdf5_file["observation"]["ids"]:
        # h5py may return IDs as bytes; decode so they compare against str
        if isinstance(otu_id, bytes):
            otu_id = otu_id.decode("utf-8")
        if otu_id in wanted:
            count = count + 1
            h.write(otu_id + "\n")
hdf5_file.close()

print("Converted %i records" % count)

Monday, November 17, 2014

Got in touch with Sean Gibbons and he was able to forward some code:
https://github.com/klocey/rare-bio/blob/master/tools/ConvertBiom/ConvertBiom.py
***script scans through giant biom file in smaller pieces, rather than loading entire file into memory

Output is a sparse abundance matrix
each row is: OTU, site, number of reads

Example (first 10 lines from open reference biom file):
0 0 7.0
0 1 10.0
0 2 13.0
0 3 7.0
0 4 2.0
0 5 3.0
0 6 3.0
0 7 3.0
0 158 320.0
0 159 32.0

Not really sure how to interpret data
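
One possible reading of the output above, assuming each row really is an (OTU, site, number of reads) triplet as described: the first two numbers are probably positions in the table's OTU and sample lists rather than the IDs themselves, and the third is the read count for that OTU at that site. A minimal sketch that regroups such rows by OTU; the function name `parse_triplets` and the in-memory sample text are illustrative only and are not part of ConvertBiom.py.

```python
from collections import defaultdict


def parse_triplets(lines):
    """Group 'otu site reads' rows into {otu_index: {site_index: reads}}."""
    table = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        otu, site, reads = int(fields[0]), int(fields[1]), float(fields[2])
        table[otu][site] = reads
    return table


# A few rows copied from the example output above
sample = """0 0 7.0
0 1 10.0
0 2 13.0
0 158 320.0
0 159 32.0"""

table = parse_triplets(sample.splitlines())
total_reads = sum(table[0].values())
print("OTU 0 found at %d sites, %.0f reads in total" % (len(table[0]), total_reads))
```

Read this way, every site missing from the list has zero reads for that OTU, which is why the matrix is called sparse.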

Friday, November 14, 2014

EMP Biom Files Pt. II

***ALL DONE ON THE MAC IN THE LAB***

Need to convert the biom file into classic format to pull out Curto OTUs
Should look something like:

         Sample    Taxonomy
OTU 1
OTU 2
…
OTU n

Biom Convert

biom convert \
    -i full_emp_table_w_tax_closedref.biom \
    -o full_emp_closedref_taxonomy.txt \
    --biom-to-classic-table \
    --header-key taxonomy

Generated the .txt file, but it is too large to open in Excel - ERROR - not enough memory

Need to break down the master .biom file:

# -L 3: level 3 taxonomic split (class level)
split_otu_table_by_taxonomy.py \
    -i full_emp_table_w_tax_closedref.biom \
    -L 3 \
    -o ./L3/

Need to split further - maybe at L5:

# -L 5: level 5 taxonomic split (family level)
split_otu_table_by_taxonomy.py \
    -i full_emp_table_w_tax_closedref.biom \
    -L 5 \
    -o ./L5/

Problem: only 2 Curto OTUs are present in the output .biom file. Asked Sean Gibbons what he thought:
"Yep, any of the OTUs with 'New' in the name did not hit the reference database, so you won't find them in the closed ref table. Number-only labels are Greengenes IDs, and those should all be in the closed ref table."

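Sean's rule of thumb can be sketched as a quick check on an OTU ID list: number-only labels are treated as Greengenes IDs (expected in the closed reference table) and everything else as de novo OTUs (open reference table only). The function name `split_otu_ids` and the example IDs below are made up for illustration; real EMP labels may look different.

```python
def split_otu_ids(otu_ids):
    """Separate number-only (Greengenes) IDs from all other (de novo) IDs."""
    greengenes, de_novo = [], []
    for otu_id in otu_ids:
        if otu_id.isdigit():
            greengenes.append(otu_id)
        else:
            de_novo.append(otu_id)
    return greengenes, de_novo


# Hypothetical IDs, for illustration only
example_ids = ["1234567", "New.ReferenceOTU42", "7654321", "New.CleanUp.ReferenceOTU7"]
greengenes, de_novo = split_otu_ids(example_ids)
print("%d Greengenes IDs, %d de novo IDs" % (len(greengenes), len(de_novo)))
```

Running this on the Curto OTU list would show how many of the OTUs can be expected in the closed reference table at all.
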
So - try again with the open reference table? It crashed last time when I ran the previous code on the open reference biom file (see Pt I)
RESULT: SystemError: Negative size passed to PyString_FromStringAndSize

^Probably due to the input file being too large
full_emp_table_w_tax.biom is 2.63 GB - more likely a size overflow than a security measure: anything over 2 GB (2^31 - 1 bytes) no longer fits in a signed 32-bit length, which would explain the negative size
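
A plausible explanation for the "Negative size" message, offered as an assumption rather than a confirmed diagnosis: if some step in the parser stores the string length in a signed 32-bit integer, a 2.63 GB input exceeds the largest value that type can hold and wraps around to a negative number. The sketch below only demonstrates that arithmetic; it does not reproduce the actual crash.

```python
import ctypes

file_size_bytes = int(2.63e9)  # approximate size of full_emp_table_w_tax.biom
max_signed_32 = 2**31 - 1      # largest value a signed 32-bit integer can hold

# Reinterpret the size the way a C 32-bit signed int would store it
wrapped = ctypes.c_int32(file_size_bytes).value

print("size exceeds 32-bit limit: %s" % (file_size_bytes > max_signed_32))
print("size wraps to a negative number: %s" % (wrapped < 0))
```

The same holds if "GB" here means GiB, since that would make the file even larger.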

Tuesday, November 11, 2014

EMP Biom Files

Not too sure what to do with .biom files from S.Gibbons

Two files:

full_emp_table_w_tax_closedref.biom
full_emp_table_w_tax.biom

Need to figure out how to pull Curtobacterium metadata from above master files

***Neither attempt has been able to utilize the open reference .biom file - ERROR***

Attempt 1:

QIIME - filter_samples_from_otu_table.py

Under "List-based Filtering":

-i full_emp_table_w_tax_closedref.biom
--sample_id_fp curto-only.txt

RESULT: nothing
PROBLEM: curto-only.txt contains OTUs, not individual samples

Attempt 2:

QIIME - filter_otus_from_otu_table.py

Use feature to extract Curtobacterium OTUs

-i full_emp_table_w_tax_closedref.biom
-e curto-only.txt #list of OTU IDs; by default these are excluded from the new .biom
--negate_ids_to_exclude #flips -e so that only the listed OTUs are kept

RESULT: generates new .biom file
PROBLEM: it seems to contain only a handful of OTUs - or is it not working?

Summary table from .biom

Num samples: 15481
Num observations: 2
Total count: 197
Table density (fraction of non-zero values): 0.005

Compared to Summary table from master .biom

Num samples: 15481
Num observations: 69444
Total count: 654448644
Table density (fraction of non-zero values): 0.016

Generated by "biom summarize-table" function
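
To compare the two summaries, the density can be turned back into an approximate number of non-zero cells, since density is the fraction of non-zero values out of samples x observations. A rough sketch using the numbers above; `estimated_nonzero` is a made-up helper name, and because the reported densities are rounded to three decimals the results are only estimates.

```python
def estimated_nonzero(num_samples, num_observations, density):
    """Approximate count of non-zero cells from a table's reported density."""
    return density * num_samples * num_observations


filtered = estimated_nonzero(15481, 2, 0.005)
master = estimated_nonzero(15481, 69444, 0.016)

print("filtered table: ~%d non-zero cells" % round(filtered))
print("master table:   ~%d non-zero cells" % round(master))
```

So the filtered table is not empty: the two retained OTUs together show up in roughly 150 of the 15481 samples.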

EMP Update

I was able to figure out which OTUs from the rep_set file were Curtobacterium:

Searched taxonomic assignment file from S.Gibbons for "Microbacteriaceae" (n=2713)

# Print every taxonomy assignment line for the family Microbacteriaceae
with open("rep_set_tax_assignments.txt", "r") as searchfile:
    for line in searchfile:
        if "f__Microbacteriaceae" in line:
            print(line.rstrip("\n"))

Created a smaller fasta file by pulling out Microbacteriaceae sequences from giant 'rep_set.fna' file from S.Gibbons

from Bio import SeqIO

fasta_file = "rep_set.fna"  # input fasta file
wanted_file = "microbacteriaceae-only.txt"  # input interesting sequence IDs, one per line
result_file = "microbacteriaceae-only.fasta"  # output fasta file

# Collect the wanted sequence IDs, skipping blank lines
wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)

# Stream through the large fasta file and write out only the wanted records
count = 0
with open(result_file, "w") as f:
    for seq in SeqIO.parse(fasta_file, "fasta"):
        if seq.id in wanted:
            count = count + 1
            SeqIO.write([seq], f, "fasta")

print("Converted %i records" % count)

QIIME - assign_taxonomy.py on new 'microbacteriaceae-only.fasta'

Aligned with GreenGenes core set (same reference as GenBank protocol)

Performed above procedure to generate 'curtobacterium-only.fasta' n=53

Monday, November 10, 2014

BACE - Leaf Litter

Jen contacted Jeff Dukes at Purdue to get leaf litter from their grassland site in Boston.

Hopefully, this will be shipped next week (11/17).
***Claudia sent them a box on 11/13/14***

Need to find the leaf litter media from lab intranet (Kristen could help).

Found the following:

- Grind litter for 30 seconds to break into smaller chunks. Continue until you have ~200 ml of ground litter.

- Add litter and 1 liter of DI water to large flask. Cover top with foil.

- Place on stir plate for 24 hours.

- Allow litter to settle for 24-48 hours.

- Decant liquid (siphon or scoop) into clean flask (you should have ~800 ml of media).

- Filter media through 100 μm, 8 μm, 3 μm, and 0.8 μm membranes (I use larger sizes first to decrease the use of expensive 0.8 μm filters).

- Transfer media into autoclavable jug. Add 15 g agar and fill with DI water until the final volume reaches 1 liter.

- Autoclave (Liquid cycle)

Earth Microbiome (EMP)

EMP has been down for months. Trying to get access to their databases.
Heard from EMP: they are rebuilding the database, and it should be up and running any day now.

Got access to the EMP files from Jack Gilbert and Sean Gibbons

open reference OTU table (.biom)
closed reference OTU table (.biom)
rep sequences (.fna)
phylogenetic tree (.tre)
metadata file (.txt)
taxonomic assignments (.txt) - from emp_10k_rdp