Friday, November 21, 2014

EMP Biom Files Pt. IV

Got in touch with Daniel MacDonald from the Knight Lab:

Sent him the full_emp biom file; he said the file is fine but takes about 30 GB to parse (really prohibitive). Converted the open reference biom file into hdf5 format:
ftp://thebeast.colorado.edu/pub/full_emp_table_w_tax.hdf5

Wrote the following code. It only outputs one column (the OTU IDs), but it did confirm that the Curto OTUs are present in the file:
import os
import h5py

mydir = os.path.expanduser("~/Desktop/alexs-stuff/")
in_file = mydir + "EMP/EMPopen/full_emp_table_hdf5.h5"
wanted_file = mydir + "EMP/greengenes-curto-only.txt"
out_file = mydir + "EMP/emp-curto-only.txt"

# Read the wanted OTU IDs, one per line, skipping blank lines
wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)

# Write out every observation (OTU) ID in the table that is in the wanted set
hdf5_file = h5py.File(in_file, "r")
count = 0
with open(out_file, "w") as h:
    for key in hdf5_file["observation"]["ids"]:
        # IDs may come back as bytes depending on the h5py/Python version
        if isinstance(key, bytes) and not isinstance(key, str):
            key = key.decode("utf-8")
        if key in wanted:
            count = count + 1
            h.write(key + "\n")
print("Converted %i records" % count)
hdf5_file.close()
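
To get past the single OTU column, the per-sample counts would have to come from the sparse matrix itself. A minimal sketch, assuming the file follows the BIOM 2.x HDF5 layout, where observation/matrix/data, observation/matrix/indices and observation/matrix/indptr hold the table in compressed sparse row form (one row per OTU). The helper name csr_row and the toy arrays are made up for illustration; with h5py the same three arrays would be read from hdf5_file["observation"]["matrix"].

```python
def csr_row(data, indices, indptr, row):
    """Return {column_index: value} for one row of a CSR matrix."""
    start, stop = indptr[row], indptr[row + 1]
    return {int(indices[k]): float(data[k]) for k in range(start, stop)}

# Toy 3 OTU x 4 sample table (hypothetical values):
#   OTU 0: [7, 0, 0, 2]
#   OTU 1: [0, 0, 0, 0]
#   OTU 2: [0, 5, 1, 0]
data = [7.0, 2.0, 5.0, 1.0]      # non-zero counts, row by row
indices = [0, 3, 1, 2]           # sample (column) index of each count
indptr = [0, 2, 2, 4]            # where each OTU's slice starts and stops

sample_ids = ["s0", "s1", "s2", "s3"]  # placeholder sample IDs
row = csr_row(data, indices, indptr, 2)
counts = {sample_ids[j]: v for j, v in row.items()}
print(counts)
```

Only the slice belonging to one OTU is touched, so the whole table never has to be loaded at once.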

Monday, November 17, 2014

EMP Biom Files Pt. III

Got in touch with Sean Gibbons and he was able to forward some code:
https://github.com/klocey/rare-bio/blob/master/tools/ConvertBiom/ConvertBiom.py
***script scans through giant biom file in smaller pieces, rather than loading entire file into memory

Output is a sparse abundance matrix
each row is: OTU, site, number of reads

Example (first 10 lines from open reference biom file):
0 0 7.0
0 1 10.0
0 2 13.0
0 3 7.0
0 4 2.0
0 5 3.0
0 6 3.0
0 7 3.0
0 158 320.0
0 159 32.0

Not really sure how to interpret data
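
One possible reading, assuming (not confirmed from the script) that the first two columns are zero-based row and column positions in the biom table rather than real IDs: the line "0 158 320.0" would then mean the first OTU had 320 reads in the 159th sample. The sketch below groups such lines by OTU; the function name is made up for illustration.

```python
from collections import defaultdict

def parse_triplets(lines):
    """Group 'otu_index sample_index count' lines into {otu: {sample: count}}."""
    table = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        otu, site, reads = int(parts[0]), int(parts[1]), float(parts[2])
        table[otu][site] = reads
    return table

# A few of the lines from the example above
example = ["0 0 7.0", "0 1 10.0", "0 2 13.0", "0 158 320.0", "0 159 32.0"]
table = parse_triplets(example)
total_reads = sum(table[0].values())
print(len(table[0]), total_reads)
```

To turn the indices into names, they would need to be looked up in the table's OTU and sample ID lists, in the same order as the original biom file.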

Friday, November 14, 2014

EMP Biom Files Pt. II

***ALL DONE ON THE MAC IN THE LAB***

Need to convert the biom file into classic (tab-delimited) format to pull out the Curto OTUs
Should look something like:
          Sample    Taxonomy
OTU 1
OTU 2
OTU n

biom convert \
    -i full_emp_table_w_tax_closedref.biom \
    -o full_emp_closedref_taxonomy.txt \
    --biom-to-classic-table \
    --header-key taxonomy

Generated the .txt file, but it is too large to open in Excel (ERROR: not enough memory)
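
Instead of Excel, the classic table could be read one line at a time so that memory use stays small. A minimal sketch, assuming the table is tab-separated with the taxonomy string in the last column and header/comment lines starting with "#"; the function name, the "g__Curtobacterium" search string and the toy rows are illustrative and not taken from the real file.

```python
import io

def filter_classic_table(in_handle, out_handle, needle="g__Curtobacterium"):
    """Copy header lines plus rows whose last column contains `needle`."""
    kept = 0
    for line in in_handle:
        if line.startswith("#"):
            out_handle.write(line)  # keep header/comment lines
            continue
        taxonomy = line.rstrip("\n").split("\t")[-1]
        if needle in taxonomy:
            out_handle.write(line)
            kept += 1
    return kept

# Toy stand-in for the real classic table (hypothetical rows)
toy = io.StringIO(
    "# Constructed from biom file\n"
    "#OTU ID\tS1\tS2\ttaxonomy\n"
    "101\t0.0\t3.0\tk__Bacteria; f__Microbacteriaceae; g__Curtobacterium\n"
    "102\t5.0\t0.0\tk__Bacteria; f__Bacillaceae; g__Bacillus\n"
)
out = io.StringIO()
n_kept = filter_classic_table(toy, out)
print(n_kept)
```

With the real data, the two handles would be the opened full_emp_closedref_taxonomy.txt and an output file.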

Need to break down the master .biom file:

split_otu_table_by_taxonomy.py \
    -i full_emp_table_w_tax_closedref.biom \
    -L 3 \
    -o ./L3/
# -L 3: level 3 taxonomic split (class level)

Need to split further - maybe at L5:

split_otu_table_by_taxonomy.py \
    -i full_emp_table_w_tax_closedref.biom \
    -L 5 \
    -o ./L5/
# -L 5: level 5 taxonomic split (family level)
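
What the -L value selects can be illustrated on a single lineage. A sketch, assuming Greengenes-style taxonomy strings with ranks separated by ";" and levels counted from 1 (kingdom), so that level 3 is class and level 5 is family as in the notes above. The helper name and the example lineage are illustrative only; this is not the QIIME implementation.

```python
def truncate_taxonomy(lineage, level):
    """Keep the first `level` ranks of a ';'-separated lineage string."""
    ranks = [r.strip() for r in lineage.split(";") if r.strip()]
    return "; ".join(ranks[:level])

# Hypothetical example lineage for a Curtobacterium OTU
lineage = ("k__Bacteria; p__Actinobacteria; c__Actinobacteria; "
           "o__Actinomycetales; f__Microbacteriaceae; g__Curtobacterium")

print(truncate_taxonomy(lineage, 3))  # grouping key for a class-level split
print(truncate_taxonomy(lineage, 5))  # grouping key for a family-level split
```

OTUs sharing the same truncated string would end up in the same output table.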

Problem: only 2 Curto OTUs are present in the output .biom file. Asked Sean Gibbons what he thought:
"Yep, any of the OTUs with 'New' in the name did not hit the reference database, so you won't find them in the closed ref table. Number-only labels are Greengenes IDs, and those should all be in the closed ref table."

So, try again with the open reference table? It crashed last time when I ran the previous code on the open reference biom file (see Pt. I)
RESULT: SystemError: Negative size passed to PyString_FromStringAndSize

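^Probably because the input file is too large
full_emp_table_w_tax.biom is 2.63 GB, which is more than the ~2.1 GB (2^31 - 1 bytes) that fits in a signed 32-bit integer, so the "negative size" looks more like an integer overflow than a security measure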
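
A quick arithmetic check on that file size, as a sketch: it assumes 2.63 GB means roughly 2.63e9 bytes and that the parser stores the string length in a signed 32-bit C integer, which is an inference from the error message and not something verified. If so, the length would wrap around to a negative number, which would match the "Negative size" error.

```python
import ctypes

INT32_MAX = 2**31 - 1       # largest value a signed 32-bit integer can hold
file_size = int(2.63e9)     # approximate size of full_emp_table_w_tax.biom in bytes

# Reinterpret the size as a signed 32-bit integer (ctypes wraps silently)
wrapped = ctypes.c_int32(file_size).value

print(file_size > INT32_MAX, wrapped)
```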

Tuesday, November 11, 2014

EMP Biom Files

Not too sure what to do with .biom files from S.Gibbons
Two files:
full_emp_table_w_tax_closedref.biom
full_emp_table_w_tax.biom


Need to figure out how to pull Curtobacterium metadata from above master files

***Neither attempt has been able to utilize the open reference .biom file - ERROR***

Attempt 1:
  • QIIME - filter_samples_from_otu_table.py
    • Under "List-based Filtering":
      • -i full_emp_table_w_tax_closedref.biom
      • --sample_id_fp curto-only.txt
  • RESULT: nothing
  • PROBLEM: curto-only.txt contains OTUs, not individual samples
Attempt 2:
  • QIIME - filter_otus_from_otu_table.py
    • Use feature to extract Curtobacterium OTUs
      • -i full_emp_table_w_tax_closedref.biom
      • -e curto-only.txt #OTU IDs to exclude from the new .biom
      • --negate_ids_to_exclude #flips -e so that only the listed OTUs are kept
  • RESULT: generates new .biom file
  • PROBLEM: the new file only contains 2 OTUs, so either very few Curto OTUs are in the table or the filter is not working?
    • Summary table from .biom
      • Num samples: 15481
      • Num observations: 2
      • Total count: 197
      • Table density (fraction of non-zero values): 0.005
    • Compared to Summary table from master .biom
      • Num samples: 15481
      • Num observations: 69444
      • Total count: 654448644
      • Table density (fraction of non-zero values): 0.016
    • Generated by "biom summarize-table" function 
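
The two summaries can be compared by turning table density back into an approximate number of non-zero cells, since density = non-zero cells / (observations x samples). A small sketch using the numbers above; the helper name is made up, and the results are only approximate because the reported densities are rounded.

```python
def approx_nonzero(num_samples, num_observations, density):
    """Estimate the number of non-zero cells from a rounded table density."""
    return density * num_samples * num_observations

filtered = approx_nonzero(15481, 2, 0.005)       # filtered Curto table
master = approx_nonzero(15481, 69444, 0.016)     # master closed-ref table

print(round(filtered), round(master))
```

So the 2 retained OTUs show up in something on the order of 150 sample entries, against roughly 17 million non-zero cells in the master table.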

EMP Update

I was able to figure out which OTUs from the rep_set file were Curtobacterium:

  • Searched taxonomic assignment file from S.Gibbons for "Microbacteriaceae" n=2713

# Print every taxonomy assignment line that mentions the family Microbacteriaceae
with open("rep_set_tax_assignments.txt", "r") as searchfile:
    for line in searchfile:
        if "f__Microbacteriaceae" in line:
            print(line.rstrip("\n"))

  • Created a smaller fasta file by pulling out Microbacteriaceae sequences from giant 'rep_set.fna' file from S.Gibbons

from Bio import SeqIO

fasta_file = "rep_set.fna" #input fasta file
wanted_file = "microbacteriaceae-only.txt" #input interesting sequence IDs, one per line
result_file = "microbacteriaceae-only.fasta" #output fasta file

# Read the wanted sequence IDs, skipping blank lines
wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)

# Stream through the big fasta file and write out only the wanted records
fasta_sequences = SeqIO.parse(fasta_file, "fasta")
count = 0
with open(result_file, "w") as f:
    for seq in fasta_sequences:
        if seq.id in wanted:
            count = count + 1
            SeqIO.write([seq], f, "fasta")

print("Converted %i records" % count)

  • QIIME - assign_taxonomy.py on new 'microbacteriaceae-only.fasta' 
    • Aligned with GreenGenes core set (same reference as GenBank protocol)
  • Performed above procedure to generate 'curtobacterium-only.fasta' n=53

Monday, November 10, 2014

BACE - Leaf Litter

Jen contacted Jeff Dukes at Purdue to get leaf litter from their grassland site in Boston.

Hopefully, this will be shipped next week (11/17).
***Claudia sent them a box on 11/13/14***

Need to find the leaf litter media recipe on the lab intranet (Kristen could help).

Found the following:
-       Grind litter for 30 seconds to break into smaller chunks. Continue until you have ~200ml of ground litter.
-       Add litter and 1 liter of DI water to large flask. Cover top with foil.
-       Place on stir plate for 24 hours.
-       Allow litter to settle for 24-48 hours.
-       Decant liquid (siphon or scoop) into clean flask (you should have ~800 ml of media).
-       Filter media through 100 μm, 8 μm, 3 μm, and 0.8 μm membranes (I use larger sizes first to decrease the use of expensive 0.8 μm filters).
-       Transfer media into autoclavable jug. Add 15g agar and fill with DI water until final volume reaches 1 liter.
-       Autoclave (Liquid cycle)

Earth Microbiome (EMP)

EMP has been down for months. Trying to get access to their databases.
Heard from EMP: they are rebuilding the database and it should be up and running any day now.

Got access to the EMP files from Jack Gilbert and Sean Gibbons
  • open reference OTU table (.biom)
  • closed reference OTU table (.biom)
  • rep sequences (.fna)
  • phylogenetic tree (.tre)
  • metadata file (.txt)
  • taxonomic assignments (.txt) - from emp_10k_rdp