Sent him the full_emp biom file; he said it is fine but takes about 30 GB to parse (really prohibitive). Converted the open-reference biom file into HDF5 format:
ftp://thebeast.colorado.edu/pub/full_emp_table_w_tax.hdf5
Wrote the following code. It only outputs one column (the OTU ids), but it did confirm that the curto OTUs are present in the file:
import os
import h5py

mydir = os.path.expanduser("~/Desktop/alexs-stuff/")
in_file = mydir + "EMP/EMPopen/full_emp_table_hdf5.h5"
wanted_file = mydir + "EMP/greengenes-curto-only.txt"
out_file = mydir + "EMP/emp-curto-only.txt"

wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)

hdf5_file = h5py.File(in_file, "r")
count = 0
with open(out_file, "w") as h:
    for keys in hdf5_file["observation"]["ids"]:
        if keys in wanted:
            count = count + 1
            h.write(keys + "\n")

print "Converted %i records" % count
hdf5_file.close()
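Note to self: the script above is Python 2. A rough Python 3 sketch of the same id filter follows, assuming h5py hands the observation ids back as bytes there (in which case comparing them directly against the str ids in the wanted set would silently match nothing). The helper names (to_text, load_wanted, matching_ids, filter_biom_ids) are made up for this sketch and are not part of h5py or biom.

```python
def to_text(value):
    """Return an HDF5 id as str; assumes bytes ids are UTF-8 encoded."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return str(value)


def load_wanted(path):
    """Read one id per line into a set, skipping blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def matching_ids(ids, wanted):
    """Yield each id (as text) that is in the wanted set, in input order."""
    for raw in ids:
        text = to_text(raw)
        if text in wanted:
            yield text


def filter_biom_ids(in_file, wanted_file, out_file):
    """Write the wanted observation ids found in in_file; return the count."""
    import h5py  # third-party; imported here so the helpers above work without it

    wanted = load_wanted(wanted_file)
    count = 0
    with h5py.File(in_file, "r") as table, open(out_file, "w") as out:
        # Read the whole id vector in one go; it is small next to the matrix.
        ids = table["observation"]["ids"][:]
        for otu_id in matching_ids(ids, wanted):
            out.write(otu_id + "\n")
            count += 1
    return count
```

Usage would be something like filter_biom_ids(in_file, wanted_file, out_file) with the same three paths as in the script above. Like the original, it still only writes the OTU id column.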