Friday, November 21, 2014

EMP Biom Files Pt. IV

Got in touch with Daniel MacDonald from the Knight Lab:

Sent him the full_emp biom file and he said it is fine but takes ~30 GB to parse (really prohibitive). Converted the open reference biom file into hdf5 format:

Wrote the following code. Only outputs one column (OTUs), but did confirm that Curto OTUs are present in the file.
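import os
import h5py

mydir = os.path.expanduser("~/Desktop/alexs-stuff/")
in_file = mydir + "EMP/EMPopen/full_emp_table_hdf5.h5"
wanted_file = mydir + "EMP/greengenes-curto-only.txt" #Curto OTU IDs, one per line
out_file = mydir + "EMP/emp-curto-only.txt"

#load the wanted OTU IDs into a set for fast lookup
wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)

#scan the OTU (observation) IDs in the hdf5 table and write out the wanted ones
hdf5_file = h5py.File(in_file, "r")
count = 0
with open(out_file, "w") as h:
    for key in hdf5_file["observation"]["ids"]:
        if not isinstance(key, str):
            key = key.decode("utf-8") #IDs come back as bytes under Python 3
        if key in wanted:
            count = count + 1
            h.write(key + "\n")
hdf5_file.close()
print("Converted %i records" % count)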
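
Since the script above only writes the OTU column, a possible next step is to pull the per-sample read counts for each matching OTU. The sketch below assumes the table is stored row-wise in compressed sparse row (CSR) form, which in a BIOM hdf5 file would be the observation/matrix/data, observation/matrix/indices and observation/matrix/indptr datasets next to observation/ids and sample/ids. That layout is an assumption to check against the actual file. rows_for_wanted and the toy arrays are hypothetical and only for illustration.

```python
def rows_for_wanted(obs_ids, sample_ids, indptr, indices, data, wanted):
    """Yield (otu_id, sample_id, count) for every non-zero cell of the wanted OTUs.

    indptr, indices and data describe the table in CSR form: the non-zero
    cells of row i are data[indptr[i]:indptr[i + 1]] and their column
    numbers are indices[indptr[i]:indptr[i + 1]].
    """
    for i, otu_id in enumerate(obs_ids):
        if otu_id not in wanted:
            continue
        for k in range(indptr[i], indptr[i + 1]):
            yield otu_id, sample_ids[indices[k]], data[k]


# Toy table: 3 OTUs x 4 samples (made-up IDs and counts)
obs_ids = ["1001", "New.0.ReferenceOTU5", "2002"]
sample_ids = ["s0", "s1", "s2", "s3"]
indptr = [0, 2, 3, 5]
indices = [0, 3, 1, 0, 2]
data = [7.0, 2.0, 10.0, 3.0, 13.0]

result = list(rows_for_wanted(obs_ids, sample_ids, indptr, indices, data,
                              {"1001", "2002"}))
for otu_id, sample_id, n in result:
    print("%s\t%s\t%s" % (otu_id, sample_id, n))
```

With a real file, the same function could be fed the arrays read through h5py, so only the ID lists and the rows of the matching OTUs are touched.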

Monday, November 17, 2014

EMP Biom Files Pt. III

Got in touch with Sean Gibbons and he was able to forward some code:
***script scans through giant biom file in smaller pieces, rather than loading entire file into memory

Output is a sparse abundance matrix
each row is: OTU, site, number of reads

Example (first 10 lines from open reference biom file):
0 0 7.0
0 1 10.0
0 2 13.0
0 3 7.0
0 4 2.0
0 5 3.0
0 6 3.0
0 7 3.0
0 158 320.0
0 159 32.0

Not really sure how to interpret data
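
One way to read the triples above, assuming the first two columns are zero-based row and column positions in the table (OTU index and sample index) rather than real IDs: each line says "OTU number i had this many reads at site number j". A minimal sketch that totals sites and reads per OTU index. summarize_triples is a hypothetical helper, and the input is just the ten lines shown above.

```python
from collections import defaultdict


def summarize_triples(lines):
    """Return {otu_index: (number_of_sites, total_reads)} from 'otu site reads' lines."""
    sites = defaultdict(int)
    reads = defaultdict(float)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        otu = int(parts[0])
        sites[otu] += 1
        reads[otu] += float(parts[2])
    return dict((otu, (sites[otu], reads[otu])) for otu in sites)


example = [
    "0 0 7.0", "0 1 10.0", "0 2 13.0", "0 3 7.0", "0 4 2.0",
    "0 5 3.0", "0 6 3.0", "0 7 3.0", "0 158 320.0", "0 159 32.0",
]
summary = summarize_triples(example)
print(summary)
```

For these ten lines, OTU index 0 shows up at 10 sites with 400 reads in total. Turning the indices into names would still need the OTU and sample ID lists from the biom file.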

Friday, November 14, 2014

EMP Biom Files Pt. II


Need to make biom file into classic format to pull out Curto files
Should look something like:
OTU 1 

biom convert
-i full_emp_table_w_tax_closedref.biom
-o full_emp_closedref_taxonomy.txt
--header-key taxonomy 

Generated .txt file, but it is too large to open in Excel - ERROR - not enough memory

Need to break down master .biom file:
-i full_emp_table_w_tax_closedref.biom
-L 3 #level3 taxonomic split (class level)
-o ./L3/

Need to split further - maybe at L5:
-i full_emp_table_w_tax_closedref.biom
-L 5 #level5 taxonomic split (family level)
-o ./L5/
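
For reference, the -L levels count ranks from the left of the taxonomy string, so L3 cuts at class and L5 at family. A small sketch of that truncation, assuming Greengenes-style strings with ";"-separated ranks (k__, p__, c__, o__, f__, g__). truncate_taxonomy is a hypothetical helper, not the QIIME code, and the example lineage is typed by hand.

```python
def truncate_taxonomy(lineage, level):
    """Keep only the first `level` ranks of a ';'-separated taxonomy string."""
    ranks = [rank.strip() for rank in lineage.split(";") if rank.strip()]
    return "; ".join(ranks[:level])


# Hand-typed example lineage (assumed format)
lineage = ("k__Bacteria; p__Actinobacteria; c__Actinobacteria; "
           "o__Actinomycetales; f__Microbacteriaceae; g__Curtobacterium")

level3 = truncate_taxonomy(lineage, 3)  # class level
level5 = truncate_taxonomy(lineage, 5)  # family level
print(level3)
print(level5)
```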

Problem: only 2 Curto OTUs present in outputted .biom file. Asked Sean Gibbons what he thought:
"Yep, any of the OTUs with 'New' in the name did not hit the reference database, so you won't find them in the closed ref table. Number-only labels are Greengenes IDs, and those should all be in the closed ref table."

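Following Sean's rule of thumb, the Curto OTU list can be split into IDs expected in the closed ref table (number-only, Greengenes IDs) and IDs that can only show up in the open reference table. A minimal sketch. split_otu_ids is a hypothetical helper and the IDs below are made up for illustration.

```python
def split_otu_ids(otu_ids):
    """Split OTU IDs into (number-only reference IDs, all other IDs)."""
    reference, de_novo = [], []
    for otu_id in otu_ids:
        otu_id = otu_id.strip()
        if not otu_id:
            continue  # ignore blank lines
        if otu_id.isdigit():
            reference.append(otu_id)
        else:
            de_novo.append(otu_id)
    return reference, de_novo


# Made-up IDs, only to show the two kinds of label
ids = ["4365130", "New.0.CleanUp.ReferenceOTU12", "1109844", ""]
reference_ids, de_novo_ids = split_otu_ids(ids)
print("closed ref candidates: %s" % reference_ids)
print("open ref only: %s" % de_novo_ids)
```
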
So - try again with open reference database? crashed last time when I ran previous code on the open reference biom file (see Pt I)
RESULT: SystemError: Negative size passed to PyString_FromStringAndSize

^Probably because the input file is too large
full_emp_table_w_tax.biom is 2.63 GB - probably crashes as a security measure
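
Another possible reading of the error (an assumption, not confirmed here): a 2.63 GB file is larger than 2**31 - 1 bytes, the largest value a signed 32-bit integer can hold, so a byte count stored in such an integer would wrap around to a negative number, which would match the "Negative size" wording. The arithmetic, with as_int32 as a hypothetical helper:

```python
INT32_MAX = 2**31 - 1  # largest signed 32-bit integer


def as_int32(n):
    """Reinterpret a non-negative integer as a signed 32-bit value (wraps on overflow)."""
    return (n + 2**31) % 2**32 - 2**31


file_size = 2630 * 10**6  # 2.63 GB, assuming decimal gigabytes

print(file_size > INT32_MAX)    # the size does not fit in 32 bits
print(as_int32(file_size) < 0)  # and the wrapped value is negative
```

If this is the cause, splitting the file or using the hdf5 version would get around it, while a smaller file (like the closed ref table that did run) would not trigger it.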

Tuesday, November 11, 2014

EMP Biom Files

Not too sure what to do with .biom files from S.Gibbons
Two files:

Need to figure out how to pull Curtobacterium metadata from above master files

***Neither attempt has been able to utilize the open reference .biom file - ERROR***

Attempt 1:
  • QIIME -
    • Under "List-based Filtering":
      • -i full_emp_table_w_tax_closedref.biom
      • --sample_id_fp curto-only.txt
  • RESULT: nothing
  • PROBLEM: curto-only.txt contains OTUs, not individual samples
Attempt 2:
  • QIIME -
    • Use feature to extract Curtobacterium OTUs
      • -i full_emp_table_w_tax_closedref.biom
      • -e curto-only.txt #OTU IDs to exclude from new .biom
      • --negate_ids_to_exclude #negates -e, so only the listed OTUs are kept
  • RESULT: generates new .biom file
  • PROBLEM: new .biom only contains 2 OTUs - either very few Curto OTUs are in this table, or the filter is not working?
    • Summary table from .biom
      • Num samples: 15481
      • Num observations: 2
      • Total count: 197
      • Table density (fraction of non-zero values): 0.005
    • Compared to Summary table from master .biom
      • Num samples: 15481
      • Num observations: 69444
      • Total count: 654448644
      • Table density (fraction of non-zero values): 0.016
    • Generated by "biom summarize-table" function 
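
The density figure is the number of non-zero cells divided by the total number of cells (observations x samples). A quick check, assuming that definition, of how many non-zero cells each summary implies. implied_nonzero_cells is a hypothetical helper, and the inputs are the rounded values from the summaries above, so the results are only approximate.

```python
def density(nonzero_cells, num_observations, num_samples):
    """Fraction of table cells that are non-zero."""
    return float(nonzero_cells) / (num_observations * num_samples)


def implied_nonzero_cells(table_density, num_observations, num_samples):
    """Invert the density formula to estimate the number of non-zero cells."""
    return table_density * num_observations * num_samples


filtered = implied_nonzero_cells(0.005, 2, 15481)     # Curto-only table
master = implied_nonzero_cells(0.016, 69444, 15481)   # master table
print("filtered table: about %d non-zero cells" % round(filtered))
print("master table: about %d non-zero cells" % round(master))
```

For the filtered table this gives roughly 155 non-zero cells, which is consistent with a total count of 197 reads spread over 2 OTUs.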

EMP Update

I was able to figure out which OTUs from the rep_set file were Curtobacterium:

  • Searched taxonomic assignment file from S.Gibbons for "Microbacteriaceae" n=2713

with open("rep_set_tax_assignments.txt", "r") as searchfile:
    for line in searchfile:
        if "f__Microbacteriaceae" in line:
            print(line.rstrip("\n"))
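
To turn those matches into an ID list (one OTU ID per line) for the fasta step, a sketch along these lines could work. It assumes each line of the assignment file is tab-separated with the OTU ID first and the taxonomy string second. That column layout, the helper name ids_for_taxon and the sample lines are assumptions for illustration.

```python
def ids_for_taxon(lines, taxon):
    """Return the OTU IDs (first column) whose taxonomy (second column) contains taxon."""
    ids = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and taxon in fields[1]:
            ids.append(fields[0])
    return ids


# Made-up lines in the assumed "ID <tab> taxonomy <tab> confidence" layout
sample_lines = [
    "101\tk__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales; f__Microbacteriaceae; g__Curtobacterium\t0.98\n",
    "102\tk__Bacteria; p__Proteobacteria; c__Gammaproteobacteria\t0.91\n",
    "103\tk__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales; f__Microbacteriaceae; g__Microbacterium\t0.87\n",
]

family_ids = ids_for_taxon(sample_lines, "f__Microbacteriaceae")
genus_ids = ids_for_taxon(sample_lines, "g__Curtobacterium")
print("\n".join(family_ids))
```

Writing the returned IDs to a text file, one per line, gives the kind of list the fasta filter reads as wanted_file.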

  • Created a smaller fasta file by pulling out Microbacteriaceae sequences from giant 'rep_set.fna' file from S.Gibbons

from Bio import SeqIO

fasta_file = "rep_set.fna" #input fasta file
wanted_file = "microbacteriaceae-only.txt" #input interesting sequence IDs, one per line
result_file = "microbacteriaceae-only.fasta" #output fasta file

#load the wanted sequence IDs into a set
wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)

#write out only the records whose ID is in the wanted set
fasta_sequences = SeqIO.parse(fasta_file, "fasta")
count = 0
with open(result_file, "w") as f:
    for seq in fasta_sequences:
        if seq.id in wanted:
            count = count + 1
            SeqIO.write([seq], f, "fasta")

print("Converted %i records" % count)

  • QIIME - on new 'microbacteriaceae-only.fasta' 
    • Aligned with Greengenes core set (same reference as GenBank protocol)
  • Performed above procedure to generate 'curtobacterium-only.fasta' n=53

Monday, November 10, 2014

BACE - Leaf Litter

Jen contacted Jeff Dukes at Purdue to get leaf litter from their grassland site in Boston.

Hopefully, this will be shipped next week (11/17).
***Claudia sent them a box on 11/13/14***

Need to find the leaf litter media from lab intranet (Kristen could help).

Found the following:
-       Grind litter for 30 seconds to break into smaller chunks. Continue until you have ~200ml of ground litter.
-       Add litter and 1 liter of DI water to large flask. Cover top with foil.
-       Place on stir plate for 24 hours.
-       Allow litter to settle for 24-48 hours.
-       Decant liquid (siphon or scoop) into clean flask (you should have ~800 ml of media).
-       Filter media through 100 μm, 8 μm, 3 μm, and 0.8 μm membranes (I use larger sizes first to decrease the use of expensive 0.8 μm filters).
-       Transfer media into autoclavable jug. Add 15g agar and fill with DI water until the final volume reaches 1 liter.

-       Autoclave (Liquid cycle)

Earth Microbiome (EMP)

EMP has been down for months. Trying to get access to their databases.
Heard from EMP: they are rebuilding the database, and it should be up and running any day now.

Got access to the EMP files from Jack Gilbert and Sean Gibbons
  • open reference OTU table (.biom)
  • closed reference OTU table (.biom)
  • rep sequences (.fna)
  • phylogenetic tree (.tre)
  • metadata file (.txt)
  • taxonomic assignments (.txt) - from emp_10k_rdp