Thursday, October 9, 2014

GenBank - Protocol for Metadata Extraction from GenBank

Protocol for Curtobacterium (Curto) Sequences

1. BLAST the GreenGenes (GG) representative sequences and keep the top 5000 hits per query sequence
2. Search GenBank for "microbacteriaceae curtobacterium 16S ribosomal RNA gene"
     a. Returned 1255 results
     b. Concatenate these results onto the GG rep sets
3. Created GenBank file with all results (n = 41246)
     a. combined-curto.gb
4. Run extract-data-from-genbank.py and export the results to a .csv file
     a. Extracted the accession number, GenBank ID, title, isolation source, host, and rep sequence
     b. Tallied the number of unique records (n = 11484)
5. Convert the .gb file to a .fasta file using gb-to-fasta.py
6. Use QIIME's assign_taxonomy.py to assign taxonomy, using PyNAST
7. Add taxonomic information to .csv file
8. Delete duplicate accession numbers and match the taxonomic information to the GenBank information
     a. Created a master sheet with the sequences (n = 11419) that aligned with the GreenGenes database
     b. Of these, 9237 had isolation sources
     c. Duplicates were removed in Excel
9. Took the accession numbers that aligned with Curtobacterium (n = 959) and their isolation sources (n = 736)
     ***NOTE*** Many sequences aligned only to the family level
     Master file with the extracted information: BLAST-gg-aligned.xlsx
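
Steps 4 and 5 could be sketched roughly as follows. This is not the original extract-data-from-genbank.py or gb-to-fasta.py, neither of which is shown in this post. It is a minimal standard-library illustration that assumes the usual GenBank flat-file layout, treats the "GenBank ID" as the GI number on the VERSION line (left empty when absent), and uses hypothetical function names. The SAMPLE record is synthetic and only there to show the layout.

```python
import csv
import io
import re

# Column order for the exported .csv (an assumption based on step 4a).
FIELDS = ["accession", "gi", "title", "isolation_source", "host", "sequence"]


def _header(chunk, keyword):
    """Return a header field such as DEFINITION, joining indented continuation lines."""
    m = re.search(rf"^{keyword}\s+(.*(?:\n {{12}}.*)*)", chunk, re.M)
    return " ".join(m.group(1).split()) if m else ""


def _qualifier(chunk, name):
    """Return the first /name="value" feature qualifier, or "" if it is missing."""
    m = re.search(rf'/{name}="([^"]*)"', chunk)
    return " ".join(m.group(1).split()) if m else ""


def parse_genbank(text):
    """Yield one dict per record; each record ends with a line holding only //."""
    for chunk in re.split(r"^//[ \t]*$", text, flags=re.M):
        if "LOCUS" not in chunk:
            continue
        gi = re.search(r"GI:(\d+)", chunk)
        accession = _header(chunk, "ACCESSION").split()
        # Sequence lines follow the ORIGIN line; drop that line, keep letters only.
        origin_body = chunk.partition("\nORIGIN")[2].partition("\n")[2]
        yield {
            "accession": accession[0] if accession else "",
            "gi": gi.group(1) if gi else "",
            "title": _header(chunk, "DEFINITION"),
            "isolation_source": _qualifier(chunk, "isolation_source"),
            "host": _qualifier(chunk, "host"),
            "sequence": re.sub(r"[^A-Za-z]", "", origin_body).upper(),
        }


def to_csv(records):
    """Serialise the records as CSV text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


def to_fasta(records):
    """Serialise the records as FASTA text, one sequence line per record."""
    return "".join(
        f">{r['accession']} {r['title']}\n{r['sequence']}\n" for r in records
    )


# Synthetic record, not taken from GenBank.
SAMPLE = """\
LOCUS       XX000001                  20 bp    DNA     linear   BCT 01-JAN-2014
DEFINITION  Curtobacterium sp. 16S ribosomal RNA gene, partial
            sequence.
ACCESSION   XX000001
VERSION     XX000001.1  GI:1
FEATURES             Location/Qualifiers
     source          1..20
                     /organism="Curtobacterium sp."
                     /isolation_source="leaf litter"
ORIGIN
        1 acgtacgtac gtacgtacgt
//
"""

if __name__ == "__main__":
    records = list(parse_genbank(SAMPLE))
    print(to_csv(records))
    print(to_fasta(records))
```

With records parsed this way, the tally in step 4b amounts to len({r["accession"] for r in records}). For real files, a maintained parser such as Biopython's SeqIO is likely more robust than these regular expressions.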

10. Run sequence-cleaner.py, then export the accession numbers with get-accession.py to record which sequences were duplicates
     a. List of duplicate sequences: duplicate-BLAST-sequences.xlsx
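
The duplicate check in step 10 might look something like the sketch below. It is not the original sequence-cleaner.py or get-accession.py; it assumes that "duplicate" means an identical sequence (ignoring case), that the accession number is the first token of each FASTA header, and that the first occurrence is the one kept. The function names are hypothetical.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def split_duplicates(records):
    """Separate first occurrences from repeats of the same sequence.

    Returns (unique, duplicates): unique is a list of (accession, sequence)
    pairs, and duplicates maps each kept accession to the accessions that
    were dropped because they carried the same sequence.
    """
    first_seen = {}  # upper-cased sequence -> accession that was kept
    unique = []
    duplicates = {}
    for header, seq in records:
        accession = (header.split() or [""])[0]
        key = seq.upper()
        if key in first_seen:
            duplicates.setdefault(first_seen[key], []).append(accession)
        else:
            first_seen[key] = accession
            unique.append((accession, seq))
    return unique, duplicates


if __name__ == "__main__":
    # Made-up accessions and sequences, for illustration only.
    example = ">A1 first\nACGT\nACGT\n>A2 second\nacgtacgt\n>A3 third\nGGGG\n"
    kept, dropped = split_duplicates(read_fasta(example))
    print(kept)
    print(dropped)
```

The duplicates mapping is the kind of table that could be exported as a reference list like duplicate-BLAST-sequences.xlsx.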
