Tuesday, November 11, 2014

EMP Update

I was able to figure out which OTUs from the rep_set file were Curtobacterium:

  • Searched the taxonomic assignment file from S. Gibbons for "Microbacteriaceae" (n=2713)

# Print every taxonomy assignment line that belongs to the family Microbacteriaceae
with open("rep_set_tax_assignments.txt", "r") as searchfile:
    for line in searchfile:
        if "f__Microbacteriaceae" in line:
            print(line.rstrip("\n"))
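
The search above prints whole assignment lines, while the extraction step reads a file with one sequence ID per line. A minimal sketch of one way to bridge the two follows. It assumes the assignment file is tab-separated with the OTU ID in the first column (the usual QIIME layout, but not confirmed here), and the helper name write_matching_ids is hypothetical:

```python
def write_matching_ids(tax_path, label, ids_path):
    """Write the ID column of every assignment line that contains `label`.

    Assumes a tab-separated file whose first field is the OTU ID.
    Returns the number of IDs written.
    """
    count = 0
    with open(tax_path) as tax, open(ids_path, "w") as out:
        for line in tax:
            if label in line:
                # Keep only the first tab-separated field (the OTU ID)
                out.write(line.split("\t", 1)[0].strip() + "\n")
                count += 1
    return count
```

Calling write_matching_ids("rep_set_tax_assignments.txt", "f__Microbacteriaceae", "microbacteriaceae-only.txt") would then produce the ID list in one pass, and the returned count can be compared with the number of matches found by the search.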

  • Created a smaller FASTA file by pulling the Microbacteriaceae sequences out of the giant 'rep_set.fna' file from S. Gibbons

from Bio import SeqIO

fasta_file = "rep_set.fna"  # input FASTA file
wanted_file = "microbacteriaceae-only.txt"  # input sequence IDs of interest, one per line
result_file = "microbacteriaceae-only.fasta"  # output FASTA file

# Collect the wanted sequence IDs, skipping blank lines
wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)

# Copy every record whose ID is in the wanted set to the output file
count = 0
with open(result_file, "w") as f:
    for seq in SeqIO.parse(fasta_file, "fasta"):
        if seq.id in wanted:
            count = count + 1
            SeqIO.write([seq], f, "fasta")

print("Converted %i records" % count)

  • QIIME: ran assign_taxonomy.py on the new 'microbacteriaceae-only.fasta'
    • Aligned with the GreenGenes core set (same reference as the GenBank protocol)
  • Repeated the above procedure to generate 'curtobacterium-only.fasta' (n=53)
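
As a sanity check on counts like the ones above, the genus labels in an assignment file can be tallied directly. This is only a sketch: it assumes a tab-separated file with a Greengenes-style taxonomy string in the second column (ranks separated by ";" and genus prefixed with "g__"), and count_genera is a hypothetical helper name.

```python
from collections import Counter


def count_genera(tax_path):
    """Tally genus-level labels (e.g. 'g__Curtobacterium') in an assignment file.

    Assumes tab-separated lines with the taxonomy string in the second field.
    """
    counts = Counter()
    with open(tax_path) as tax:
        for line in tax:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            for rank in fields[1].split(";"):
                rank = rank.strip()
                if rank.startswith("g__"):
                    counts[rank] += 1
    return counts
```

Looking up counts["g__Curtobacterium"] in the result gives the number of OTUs assigned to that genus, which can be compared with the number of records written to 'curtobacterium-only.fasta'.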
