ITS accuracy at GenBank
Conrad Schoch, Barbara Robbertse
Improving accuracy
- Barcode tag in GenBank
- Barcode submission tool
- Standards
- RefSeq Targeted Loci
- Well-validated sequences already in GenBank
- Bacteria: all type sequences
- Limited fungal sequences
Formal selection of the fungal DNA barcode (Schoch et al. 2012).
ITS sequence standards
1. Standardized sequence title should be "Fungal ITS barcode".
2. Annotation
3. Length
4. Quality of sequence
5. Unique or not?
6. Metadata
Difference between GenBank and RefSeq Targeted Loci
GenBank:
- Not curated
- Author submits
- Only author can revise
- Multiple records for same loci common
- Records can contradict each other
- No limit to species included
- Data exchanged among INSDC members
- Akin to primary literature
- Proteins identified and linked
- Access via NCBI Nucleotide databases
RefSeq:
- Curated
- NCBI creates from existing data
- NCBI revises as new data emerge
- Single records for each molecule of major organisms
- Limited to model organisms
- Exclusive NCBI database
- Akin to review articles
- Proteins and transcripts identified and linked
- Access via Nucleotide & Protein databases
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.section.genbank_asm
150 yeast sequences
1. Same ITS accession associated with different species/strains in the list.
2. Taxon name associated with the accession in the provided list not found in NCBI; strain name indicated on the GenBank accession not found in the culture collection database (wrong strain name in GenBank?).
3. Incomplete accession identifier in the list.
4. A few accessions in the list do not exist in GenBank.
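Two of the list problems above (the same accession tied to different taxa, and incomplete accession identifiers) can be screened for without querying any database. The following is a minimal sketch under stated assumptions, not the procedure used for the 150 yeast sequences: the function name `audit_accession_list`, the input shape, and the example rows are hypothetical, and the pattern only covers the common GenBank nucleotide accession shapes (1 letter + 5 digits, 2 letters + 6 digits, 2 letters + 8 digits). Checking whether an accession or taxon name actually exists at NCBI would need a database lookup, which is not shown.

```python
import re
from collections import defaultdict

# Common GenBank nucleotide accession shapes, with an optional ".version" suffix.
ACCESSION_RE = re.compile(
    r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})(?:\.\d+)?$"
)


def audit_accession_list(rows):
    """Screen (accession, taxon_name) pairs for two kinds of list errors.

    Returns a dict with:
      "malformed":   accessions that do not match a known accession shape
      "conflicting": accessions associated with more than one taxon name
    """
    taxa_by_accession = defaultdict(set)
    malformed = set()
    for accession, taxon in rows:
        acc = accession.strip().upper()
        if not ACCESSION_RE.match(acc):
            malformed.add(acc)
            continue
        # Drop the version so "AB000001.1" and "AB000001" count as one record.
        taxa_by_accession[acc.split(".")[0]].add(taxon.strip())
    conflicting = {
        acc: sorted(taxa)
        for acc, taxa in taxa_by_accession.items()
        if len(taxa) > 1
    }
    return {"malformed": sorted(malformed), "conflicting": conflicting}


# Hypothetical rows, invented for illustration only.
example_rows = [
    ("AB000001", "Species A strain 1"),
    ("AB000001", "Species B strain 2"),  # same accession, different taxon
    ("AF12345", "Species C"),            # one digit short
    ("U12345.1", "Species D"),
]
report = audit_accession_list(example_rows)
print(report["malformed"])
print(report["conflicting"])
```

Flagged entries still need manual review against GenBank and the culture collection, since the script cannot tell which of two conflicting names is correct.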
Checklist for ITS accessions added to the RefSeq Targeted Loci project:
1) Sourced from a type specimen.
2) Primary GenBank name and current name at CBS are the same.
3) Strand in the correct orientation.
4) Type info added from CBS to /note.
5) Added feature /culture_collection
6) Added feature /identified_by (source CBS)
7) Moved information in note to /isolation_source
8) All 26S labeled 28S
9) Reannotated (used 5.8S Rfam borders; used the 3' 18S boundary (CATTA motif) and the 5' 28S border (GACCT motif) as guides in an alignment).
10) Added PMID if available.
11) Checked hits with MOLE-BLAST.
12) Example defline (note it has no strain info): DEFINITION Trichosporon veenhuisii 18S ribosomal RNA gene, partial sequence; internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence.
13) Features used: /rrna for 18S and 28S and /misc_rna for ITS1 and ITS2
14) Standardized names used in product qualifier: 18S ribosomal RNA, 5.8S ribosomal RNA, 28S ribosomal RNA, internal transcribed spacer 1, internal transcribed spacer 2
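The product-name rules in the checklist (relabel 26S as 28S, then accept only the five standardized names) are easy to express as a small normalizer. This is an illustrative sketch only: the function name `normalize_product` and the tolerance for case and spacing differences are assumptions, not a description of the curation tooling actually used.

```python
# The five standardized product names listed in the checklist.
STANDARD_PRODUCTS = (
    "18S ribosomal RNA",
    "5.8S ribosomal RNA",
    "28S ribosomal RNA",
    "internal transcribed spacer 1",
    "internal transcribed spacer 2",
)


def normalize_product(name):
    """Return the standardized product name, or None if unrecognized.

    Whitespace runs are collapsed and matching ignores case (an assumption
    made here for convenience).
    """
    cleaned = " ".join(name.split())
    # Checklist rule: every 26S label becomes 28S.
    if cleaned.lower().startswith("26s "):
        cleaned = "28S" + cleaned[3:]
    for standard in STANDARD_PRODUCTS:
        if cleaned.lower() == standard.lower():
            return standard
    return None


for raw in ("26S ribosomal RNA", "internal  transcribed spacer 1", "ITS1"):
    print(repr(raw), "->", normalize_product(raw))
```

Returning None for anything unrecognized, rather than guessing, leaves abbreviations such as "ITS1" for a curator to resolve.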
Annotation of 150 ITS records in GenBank
#records | Features | Annotation in note or product
127 | /misc_rna | contains 18S ribosomal RNA, internal transcribed spacer 1, 5.8S ribosomal RNA, internal transcribed spacer 2, and 28S ribosomal RNA (or 26S ribosomal RNA)
4 | /misc_rna or /misc_feature | contains internal transcribed spacer 1, 5.8S ribosomal RNA and internal transcribed spacer 2
8 | /misc_rna and /rrna | internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2
2 | /misc_feature and /gene and /rrna | internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2
6 | /rrna and /gene and /misc_feature | 18S ribosomal RNA; internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2; 26S ribosomal RNA
3 | /rrna and /misc_rna | 18S ribosomal RNA; internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2; 28S ribosomal RNA (or 26S ribosomal RNA)
Annotation of 150 ITS records in RefSeq
#records | Features | Annotation in product
138 | /rrna and /misc_rna | 18S ribosomal RNA; internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2; 28S ribosomal RNA
2 | /rrna and /misc_rna | 18S ribosomal RNA; internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2
10 | /misc_rna and /rrna | internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2
RefSeq
- Accession number
- Project number
- RefSeq references
- Expanded qualifiers
MOLE-BLAST algorithm:
1. For each query, run a BLAST search against nr and collect the top five hits.
2. Cluster query sequences into groups corresponding to different loci.
3. For each locus:
   - Compute a multiple alignment for the queries and their top five BLAST hits.
   - Compute a phylogenetic tree based on the multiple alignment.
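Steps 1 and 2 of the algorithm above can be illustrated on precomputed BLAST results. The slide does not say how MOLE-BLAST decides that two queries belong to the same locus, so the rule used here (queries sharing at least one top-five hit are grouped together) is an assumption made purely for illustration, as are the function names and the toy data. The alignment and tree-building steps rely on external tools and are not shown.

```python
from collections import defaultdict


def top_hits(hits, n=5):
    """hits: list of (subject_id, bit_score). Return the n best subject ids."""
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    return [subject for subject, _score in ranked[:n]]


def cluster_by_shared_hits(blast_results, n=5):
    """Group queries that share at least one top-n hit, using union-find.

    blast_results: dict mapping query id -> list of (subject_id, bit_score).
    Returns a sorted list of sorted query-id groups.
    """
    parent = {query: query for query in blast_results}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_query_for_hit = {}
    for query, hits in blast_results.items():
        for subject in top_hits(hits, n):
            if subject in first_query_for_hit:
                # Merge this query's group with the group that saw the hit first.
                parent[find(query)] = find(first_query_for_hit[subject])
            else:
                first_query_for_hit[subject] = query

    groups = defaultdict(list)
    for query in blast_results:
        groups[find(query)].append(query)
    return sorted(sorted(members) for members in groups.values())


# Toy results, invented for illustration only.
toy_results = {
    "q1": [("s1", 900.0), ("s2", 850.0)],
    "q2": [("s2", 870.0), ("s3", 800.0)],
    "q3": [("s9", 500.0)],
}
print(cluster_by_shared_hits(toy_results))
```

Each resulting group, together with its collected hits, would then be passed to a multiple aligner and a tree builder for step 3.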
Adding microbial type strain data to the taxonomy database
- Upload all types together with names in NCBI Taxonomy
- Cross-reference this as a property in Entrez
- Enable search restricted to ex-type sequences
- Start with Euzeby list
What next?
- Expand other markers for the known universe
- Secondary barcode-type markers: list and communicate resources
- Highlight problematic ITS taxa
- Provide barcodes for all genomes
- Ensure genome samples are correctly identified
- Integrate sequences with fungal names
BaG (Barcode all genera) of Fungi, proposed goals
- Sequence for more than 3000 genera in GenBank
- Compare GenBank and MycoBank taxonomies
- Highlight types in GenBank taxonomy
- Target lists for all fungal genera, focused on type species
- 16 000 genera (5000 with full metadata in MycoBank)
One name one fungus = opportunity
Acknowledgments
- ITS metadata: Centraalbureau voor Schimmelcultures (CBS)
- MOLE-BLAST: Grzegorz Boratyn, Tom Madden
- Taxonomy type updates: Scott Federhen