Thursday, October 9, 2014

GenBank - Curto-Only FASTA File

1. Start with the master FASTA file (BLAST-combined.fasta)
2. Need to extract only the sequences with curto and frigo taxonomic assignments
3. Create a .txt file of accession numbers
     a. Search BLAST-combined-curto_tax_assignments.txt for curto and frigo entries
          i. This file was generated by QIIME's assign_taxonomy.py
     b. Save the accession numbers of only the curto and frigo entries as curto-accession-numbers.txt
4. Run curto-only2.py to cross-reference the .txt file against the master .fasta file
     a. The script scans BLAST-combined.fasta and pulls out each record whose accession number is listed in curto-accession-numbers.txt
     b. PROBLEM: the script keeps EVERY record with a matching accession number (n = 4355, should be n = 982)
          i. Output: curto-and-frigo-only-with-dups.fasta
     c. The extra records come from duplicate accession numbers, which need to be filtered out
5. Run duplicate-removal.py to filter out duplicate accession numbers (not duplicate sequences)
6. FINALLY, the result is a FASTA file with only curto and frigo sequences (n = 982, verified)
     a. Output: curto-and-frigo-only.fasta
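
For reference, steps 3 through 5 could be sketched in one pass roughly like this. This is NOT the actual curto-only2.py or duplicate-removal.py, just a minimal stand-in under a few assumptions: the tax assignments file is tab-separated with the sequence ID in the first column and the taxonomy string in the second, the accession number is the first whitespace-delimited token of each FASTA header, and "duplicate" means the same accession number showing up again later in the master file (the first copy is kept). The function names and the toy records are made up for illustration.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def wanted_accessions(tax_handle, keywords=("curto", "frigo")):
    """Collect IDs whose taxonomy string contains any keyword (case-insensitive).

    Assumes tab-separated lines: ID, taxonomy string, then any other columns.
    """
    wanted = set()
    for line in tax_handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        seq_id, taxonomy = fields[0], fields[1].lower()
        if any(keyword in taxonomy for keyword in keywords):
            wanted.add(seq_id)
    return wanted


def filter_unique(fasta_handle, wanted):
    """Yield the first record for each wanted accession; skip later copies."""
    seen = set()
    for header, sequence in read_fasta(fasta_handle):
        tokens = header.split()
        if not tokens:
            continue  # header line with no ID
        accession = tokens[0]  # assumption: the ID is the first header token
        if accession in wanted and accession not in seen:
            seen.add(accession)
            yield header, sequence


# Toy data (invented) standing in for the real files.
tax_file = io.StringIO(
    "ACC1\tk__Bacteria; g__Curtobacterium\t0.98\n"
    "ACC2\tk__Bacteria; g__Bacillus\t0.91\n"
    "ACC3\tk__Bacteria; g__Frigoribacterium\t0.87\n"
)
master_fasta = io.StringIO(
    ">ACC1 first copy\nACGT\nACGT\n"
    ">ACC2 other genus\nGGGG\n"
    ">ACC1 second copy\nTTTT\n"
    ">ACC3 only copy\nCCAA\n"
)

wanted = wanted_accessions(tax_file)
records = list(filter_unique(master_fasta, wanted))
for header, sequence in records:
    print(">" + header)
    print(sequence)
```

With the real files, the io.StringIO objects would be swapped for open("BLAST-combined-curto_tax_assignments.txt") and open("BLAST-combined.fasta"). Tracking a "seen" set during the filter avoids writing the intermediate with-dups file, and counting the records afterwards gives a quick check against the expected n.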
