ITS accuracy at GenBank. Conrad Schoch, Barbara Robbertse



Improving accuracy
- Barcode tag in GenBank
- Barcode submission tool
- Standards
- RefSeq Targeted Loci
- Well validated sequences already in GenBank
- Bacteria all type sequences
- Limited fungal sequences

Formal selection of the fungal DNA barcode (Schoch et al. 2012).

ITS sequence standards
1. Standardized sequence title should be "Fungal ITS barcode".
2. Annotation
3. Length
4. Quality of sequence
5. Unique or not?
6. Metadata

Difference Between GenBank and RefSeq Targeted Loci

GenBank | RefSeq
Not curated | Curated
Author submits | NCBI creates from existing data
Only author can revise | NCBI revises as new data emerge
Multiple records for same loci common | Single records for each molecule of major organisms
Records can contradict each other |
No limit to species included | Limited to model organisms
Data exchanged among INSDC members | Exclusive NCBI database
Akin to primary literature | Akin to review articles
Proteins identified and linked | Proteins and transcripts identified and linked
Access via NCBI Nucleotide databases | Access via Nucleotide & Protein databases

Source: http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.section.genbank_asm

150 yeast sequences
1. Same ITS accession associated with different species/strains in the list.
2. Taxon name associated with accession in the provided list not found in NCBI. Strain name indicated on GenBank accession not found at culture collection database (wrong strain name in GenBank?).
3. Incomplete accession identifier in the list.
4. A few accessions in the list do not exist in GenBank.
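Two of the problems listed above (incomplete accession identifiers, and one accession paired with several taxon names) can be screened for before any database lookup. The following is a minimal sketch, not the procedure used for the 150 yeast sequences: the function name `audit_accession_list`, the (accession, taxon) input layout, and the sample rows are all hypothetical, and the pattern assumes only the common GenBank nucleotide accession shapes (1 letter + 5 digits, 2 letters + 6 digits, 2 letters + 8 digits, optional version suffix).

```python
import re
from collections import defaultdict

# Assumed accession shapes: 1 letter + 5 digits, 2 letters + 6 digits,
# or 2 letters + 8 digits, optionally followed by a ".version" suffix.
ACCESSION_RE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})(?:\.\d+)?$")


def audit_accession_list(rows):
    """Screen (accession, taxon) pairs for two of the listed problems.

    Returns (malformed, conflicting):
      malformed   - accessions that do not match the assumed pattern
      conflicting - accessions paired with more than one taxon name
    """
    malformed = [acc for acc, _ in rows if not ACCESSION_RE.match(acc)]

    taxa_by_accession = defaultdict(set)
    for acc, taxon in rows:
        taxa_by_accession[acc].add(taxon)
    conflicting = {
        acc: sorted(names)
        for acc, names in taxa_by_accession.items()
        if len(names) > 1
    }
    return malformed, conflicting


# Invented placeholder rows, not real GenBank records.
rows = [
    ("XX000001", "Species A"),
    ("XX000001", "Species B"),   # same accession, different taxon
    ("XX00002", "Species C"),    # one digit short: incomplete identifier
    ("X00003.1", "Species D"),
]
malformed, conflicting = audit_accession_list(rows)
print(malformed)
print(conflicting)
```

A screen like this only catches formatting and internal consistency problems; whether an accession actually exists in GenBank, or whether a strain name is known to a culture collection, still needs a lookup against those databases.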

Checklist for ITS accessions added to the RefSeq Targeted Loci project:
1) Sourced from a type specimen.
2) Primary GenBank name and current name at CBS are the same.
3) Strand in the correct orientation.
4) Type info added from CBS to /note.
5) Added feature /culture_collection.
6) Added feature /identified_by (source CBS).
7) Moved information in note to /isolation_source.
8) All 26S labeled 28S.
9) Reannotated (used 5.8S Rfam borders; used the 3' 18S boundary (CATTA motif) and the 5' 28S border (GACCT motif) as guides in an alignment).
10) Added PMID if available.
11) Checked hits with MOLE-BLAST.
12) Example defline (note it has no strain info): DEFINITION Trichosporon veenhuisii 18S ribosomal RNA gene, partial sequence; internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence.
13) Features used: /rRNA for 18S and 28S and /misc_RNA for ITS1 and ITS2.
14) Standardized names used in product qualifier: 18S ribosomal RNA, 5.8S ribosomal RNA, 28S ribosomal RNA, internal transcribed spacer 1, internal transcribed spacer 2.
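The strand-orientation check and the motif-guided boundaries in the checklist above can be illustrated with a small sketch. It is a simplification under stated assumptions: the checklist uses the motifs as guides within an alignment, whereas this code looks for exact, first-occurrence string matches, which real sequences may not contain and which can hit by chance with motifs this short. The function names `its_span` and `orient` and the toy sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def its_span(seq, ssu_motif="CATTA", lsu_motif="GACCT"):
    """Locate the region between the 18S-end motif and the 28S-start motif.

    Returns (start, end) so that seq[start:end] is the stretch between the
    two motifs, or None when either motif is missing. Exact matching is an
    assumption of this sketch.
    """
    ssu_at = seq.find(ssu_motif)
    if ssu_at == -1:
        return None
    start = ssu_at + len(ssu_motif)
    end = seq.find(lsu_motif, start)
    if end == -1:
        return None
    return start, end


def orient(seq):
    """Return (sequence, span) in the orientation where both motifs are found.

    Tries the sequence as given, then its reverse complement. Returns
    (None, None) if neither orientation yields a span.
    """
    for candidate in (seq, reverse_complement(seq)):
        span = its_span(candidate)
        if span is not None:
            return candidate, span
    return None, None


# Toy sequence built around the two motifs; not a real ITS record.
toy = "GGG" + "CATTA" + "ACGTACGT" + "GACCT" + "TTT"
flipped = reverse_complement(toy)
oriented, span = orient(flipped)
print(oriented == toy, span)
```

In practice the 5.8S borders would still come from the Rfam model, and any span found this way would be confirmed against the alignment rather than trusted directly.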

Annotation of 150 ITS records in GenBank

#records | Features | Annotation in note or product
127 | /misc_RNA | contains 18S ribosomal RNA, internal transcribed spacer 1, 5.8S ribosomal RNA, internal transcribed spacer 2, and 28S ribosomal RNA (or 26S ribosomal RNA)
4 | /misc_RNA or /misc_feature | contains internal transcribed spacer 1, 5.8S ribosomal RNA and internal transcribed spacer 2
8 | /misc_RNA and /rRNA | internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2
2 | /misc_feature and /gene and /rRNA | internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2
6 | /rRNA and /gene and /misc_feature | 18S ribosomal RNA; internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2; 26S ribosomal RNA
3 | /rRNA and /misc_RNA | 18S ribosomal RNA; internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2; 28S ribosomal RNA (or 26S ribosomal RNA)

Annotation of 150 ITS records in RefSeq

#records | Features | Annotation in product
138 | /rRNA and /misc_RNA | 18S ribosomal RNA; internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2; 28S ribosomal RNA
2 | /rRNA and /misc_RNA | 18S ribosomal RNA; internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2
10 | /misc_RNA and /rRNA | internal transcribed spacer 1; 5.8S ribosomal RNA; internal transcribed spacer 2

RefSeq
- Accession number
- Project number
- RefSeq references
- Expanded qualifiers

MOLE-BLAST
Algorithm:
1. For each query, run a BLAST search against nr and collect the top five hits.
2. Cluster query sequences into groups corresponding to different loci.
3. For each locus:
   - Compute a multiple alignment for the queries and their top five BLAST hits.
   - Compute a phylogenetic tree based on the multiple alignment.
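The first two steps of the algorithm above can be sketched on precomputed BLAST results. This is an illustration under assumptions, not MOLE-BLAST's implementation: the slide does not say how queries are clustered into loci, so grouping queries that share at least one top hit is a stand-in chosen for this example, and the function names, hit IDs, and scores are hypothetical. The alignment and tree steps are left out.

```python
from collections import defaultdict


def top_hits(hits, n=5):
    """Return the subject IDs of the n best-scoring hits.

    hits is a list of (subject_id, bit_score) pairs for one query.
    """
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    return [subject for subject, _ in ranked[:n]]


def cluster_by_shared_hits(query_hits):
    """Group queries that share at least one top hit.

    query_hits maps query ID -> list of subject IDs. Sharing a hit is an
    assumed proxy for "same locus"; the grouping uses a small union-find.
    """
    parent = {query: query for query in query_hits}

    def find(query):
        while parent[query] != query:
            parent[query] = parent[parent[query]]  # path halving
            query = parent[query]
        return query

    first_query_for = {}
    for query, subjects in query_hits.items():
        for subject in subjects:
            if subject in first_query_for:
                parent[find(query)] = find(first_query_for[subject])
            else:
                first_query_for[subject] = query

    groups = defaultdict(list)
    for query in query_hits:
        groups[find(query)].append(query)
    return sorted(sorted(group) for group in groups.values())


# Invented BLAST results: query -> [(subject_id, bit_score), ...]
blast_results = {
    "q1": [("A", 900), ("B", 850)],
    "q2": [("B", 870), ("C", 800)],
    "q3": [("X", 500)],
}
top = {query: top_hits(hits) for query, hits in blast_results.items()}
print(cluster_by_shared_hits(top))
```

Each resulting group, together with the top hits of its members, would then be passed to a multiple aligner and a tree-building program of choice.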

Adding microbial type strain data to the taxonomy database
- Upload all types together with names in NCBI Taxonomy
- Cross reference this as a property in Entrez
- Enable search restricted to ex-type sequences
- Start with Euzeby list

What next?
- Expand other markers for the known universe
- Secondary barcode-type markers
- List and communicate resources
- Highlight problematic ITS taxa
- Provide barcodes for all genomes
- Ensure genome samples are correctly identified
- Integrate sequences with fungal names

BaG (Barcode all genera) of Fungi, proposed goals
- Sequence for more than 3000 genera in GenBank
- Compare GenBank and MycoBank taxonomies
- Highlight types in GenBank taxonomy
- Target lists for all fungal genera focused on type species
- 16 000 genera (5000 with full metadata in MycoBank)

One name one fungus = opportunity

Acknowledgments
- ITS metadata: Centraalbureau voor Schimmelcultures (CBS)
- MOLE-BLAST: Grzegorz Boratyn, Tom Madden
- Taxonomy type updates: Scott Federhen