Tuesday, March 24, 2015

EMP - Matrix and OTU table

Need to analyze EMP metadata:

Raw data - EMP_10k_merged_mapping_final.txt and full_emp_table_w_tax.biom

Last week I was able to pull the Curto OTUs out of full_emp.biom (see post from 3/12/15) and convert them to a .txt file for further manipulation.

The giant EMP_10k file has 14095 samples. The Curto OTU table has only 2882 samples.

Made a list of the samples that appear in the Curto OTU table. Wrote a rough script, sample-ids-curto.py, to parse the EMP_10k file and extract only the samples from the Curto OTU table.
Creates a new file - curto-samples.csv
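
A minimal sketch of what a filter like sample-ids-curto.py might look like. This is not the original script: it assumes the mapping file is tab-delimited with the sample ID in the first column, and the function name and demo rows are made up for illustration.

```python
import csv

def filter_mapping_rows(mapping_lines, wanted_ids, delimiter="\t"):
    """Keep only the mapping rows whose first field is a wanted sample ID.

    Assumption: the first line is a header and the sample ID is column 0.
    """
    reader = csv.reader(mapping_lines, delimiter=delimiter)
    header = next(reader)
    wanted = set(wanted_ids)  # set lookup keeps this fast for ~14k rows
    kept = [row for row in reader if row and row[0] in wanted]
    return header, kept

# Hypothetical stand-in for the EMP_10k mapping file.
demo_mapping = [
    "#SampleID\tenv\tdepth",
    "S1\tsoil\t5",
    "S2\twater\t10",
    "S3\tsoil\t2",
]
header, kept = filter_mapping_rows(demo_mapping, ["S1", "S3", "S9"])
print(len(kept))  # 2 (S9 is not in the mapping, so it is silently skipped)
```

With a real file, passing an open file handle as mapping_lines should work the same way, since csv.reader accepts any iterable of lines.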

$ wc -l curto-samples.csv 
2490 

This means roughly 390 samples were missing (2882 - 2490 = 392). So either the code has a bug OR the samples are NOT in the EMP_10k file. Turns out they are not in the EMP_10k file (no idea why).
$ vimdiff file1 file2
Samples excluded from further analysis are listed in samples-not-in-csv.txt
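
As an alternative to eyeballing the two lists in vimdiff, the missing samples can be computed with a set difference. A rough sketch; the function name and the demo IDs are hypothetical, not taken from the actual files.

```python
def missing_ids(expected_ids, found_ids):
    """Return the IDs in expected_ids that never appear in found_ids, sorted."""
    return sorted(set(expected_ids) - set(found_ids))

# Hypothetical IDs standing in for the OTU table and the mapping file.
otu_table_ids = ["S1", "S2", "S3", "S9"]
mapping_ids = ["S1", "S2", "S3"]

not_in_mapping = missing_ids(otu_table_ids, mapping_ids)
print(not_in_mapping)  # ['S9']
```

Writing the result out one ID per line would give a file along the lines of samples-not-in-csv.txt.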

Need to redefine the Curto OTU table - eliminate samples that are not found in the EMP_10k file.
Modified the previous code slightly to parse the Curto OTU table and pull out the correct samples.
*First had to transpose the OTU table to get it into the correct format, because the code parses the first string in each row

                             OLD FORMAT                                          
               Sample1   Sample2   Samplen
OTU1
OTU2
OTUn

Creates a new file - full-emp-curto-only-with-found-samples.csv
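
The transpose step mentioned above can be sketched as follows, so that each row starts with a sample ID instead of an OTU ID. This assumes the table has already been read into a list of rows (for example with the csv module); the function name and demo values are illustrative only.

```python
def transpose_table(rows):
    """Swap rows and columns of a rectangular list-of-lists table."""
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        # zip() would silently truncate ragged rows, so fail loudly instead.
        raise ValueError("table is not rectangular")
    return [list(col) for col in zip(*rows)]

# Hypothetical OTU table in the old format: OTUs as rows, samples as columns.
otu_by_sample = [
    ["#OTU ID", "Sample1", "Sample2"],
    ["OTU1", "3", "0"],
    ["OTU2", "7", "1"],
]
sample_by_otu = transpose_table(otu_by_sample)
for row in sample_by_otu:
    print(row)
# ['#OTU ID', 'OTU1', 'OTU2']
# ['Sample1', '3', '7']
# ['Sample2', '0', '1']
```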

Compare number of samples to check: both files have 2490 samples

Sort both curto-samples.csv and full-emp-curto-only-with-found-samples.csv
$ sort curto-samples.csv > curto-samples-sorted.csv
# and the same for the other file

Combine the two files and check that the sample IDs match up *they should, since both files were sorted
Creates a new file - combined-samples-otu-table.csv
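
Pasting two sorted files side by side only works if both sorts put the IDs in exactly the same order. A sketch of a more defensive option is to join on the sample ID itself. This is an assumption about how the combine step could be done, not the method actually used here; the function name and demo rows are hypothetical.

```python
def join_on_sample_id(otu_rows, meta_rows):
    """Join two tables on sample ID.

    Assumption: both tables have a header row first and the sample ID in
    column 0. Row order follows otu_rows, so no pre-sorting is needed.
    """
    otu_header, *otu_body = otu_rows
    meta_header, *meta_body = meta_rows
    meta_by_id = {row[0]: row[1:] for row in meta_body}
    combined = [otu_header + meta_header[1:]]
    for row in otu_body:
        if row[0] not in meta_by_id:
            raise KeyError(f"sample {row[0]!r} has no metadata row")
        combined.append(row + meta_by_id[row[0]])
    return combined

# Hypothetical data, deliberately in different orders.
otu_rows = [["SampleID", "OTU1", "OTU2"], ["S2", "0", "1"], ["S1", "3", "7"]]
meta_rows = [["SampleID", "env"], ["S1", "soil"], ["S2", "water"]]

combined = join_on_sample_id(otu_rows, meta_rows)
for row in combined:
    print(row)
# ['SampleID', 'OTU1', 'OTU2', 'env']
# ['S2', '0', '1', 'water']
# ['S1', '3', '7', 'soil']
```

Raising on an unmatched ID means a mismatch between the two files shows up immediately rather than as a shifted row.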

                                    NEW FORMAT                                          
                     OTU1   OTU2   OTUn …   METADATA
Sample1
Sample2
Samplen

Eliminate all columns in Metadata that contain "na" or "None" for every sample
--> 205 columns were eliminated
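
A rough sketch of how the all-missing columns could be dropped programmatically. It assumes the missing markers are exactly the strings "na" and "None" (case-sensitive) and that the first row is a header; the function name and demo table are made up for illustration.

```python
MISSING = {"na", "None"}  # assumption: exact, case-sensitive markers

def drop_empty_columns(rows, missing=MISSING):
    """Drop every column whose value is a missing marker in all data rows."""
    header, *body = rows
    keep = [
        i for i in range(len(header))
        if any(row[i] not in missing for row in body)
    ]
    return [[row[i] for i in keep] for row in rows]

# Hypothetical combined table: "ph" and "notes" are empty for every sample.
table = [
    ["SampleID", "env", "ph", "notes"],
    ["S1", "soil", "na", "None"],
    ["S2", "water", "None", "na"],
]
cleaned = drop_empty_columns(table)
print(len(table[0]) - len(cleaned[0]))  # 2 columns eliminated
```

A column with a real value for even one sample is kept, which matches the "for every sample" rule above.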

--> GRAND TOTAL = combined-samples-otu-table-annotated.xlsx
53 OTUs with 2489 samples with 271 columns of Metadata!
