Comparative genomics of E. coli and Shigella: Identification and characterization of pathogenic variants based on whole genome sequence analysis David A. Rasko PhD. University of Maryland School of Medicine Institute for Genome Sciences Department of Microbiology and Immunology
E. coli Diarrheagenic E. coli (DEC) Category B Pathogen (food and water borne pathogen) Causes ~300,000 deaths each year DEC strains can be categorized according to several distinct pathogenic variants (pathovars). Diverse serotypes (>100 serotypes) Phylogeny and evolution are diverse and complex as calculated by MLST and MLEE typing
Not all E. coli are created equal Commensal strain - colonization of humans without disease Kaper et al 2004
Why Genomics? Don't we know everything about E. coli yet? Organism diversity emergence of new pathogens? Outbreak identification? O104:H4 Virulence factor identification, regulatory pathway identification and SNP typing? Current typing methods are not adequate.
E. coli genomics genome structure Phage Phage O-antigen cluster Genome synteny is conserved, but there can be up to 1.5 Mb of novel DNA in each strain
MLST as a typing schema 3400 nucleotides (PubMLST) Without manual intervention the phylotypes are not well resolved Pathotypes are not restricted to any one phylotype
Whole genome sequencing as a typing method 2.7 million bases shared between 44 E. coli/Shigella Gene independent > 290,000 variable columns in the alignment Each of the phylotypes is clearly distinguished
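As a rough illustration of what "variable columns in the alignment" means, the sketch below counts columns of a toy multiple sequence alignment in which more than one base is observed. The function name `variable_columns`, the toy sequences, and the choice to ignore gaps and `N` are assumptions made for illustration only; this is not the pipeline used to produce the numbers on this slide.

```python
def variable_columns(aligned_seqs, ignore="-N"):
    """Return 0-based indices of alignment columns with more than one distinct base.

    aligned_seqs: equal-length strings, one per genome (hypothetical input format).
    ignore: characters (gaps, ambiguous calls) that do not count as a base.
    """
    lengths = {len(s) for s in aligned_seqs}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    skipped = set(ignore)
    variable = []
    # zip(*...) walks the alignment column by column
    for i, column in enumerate(zip(*aligned_seqs)):
        bases = {b.upper() for b in column} - skipped
        if len(bases) > 1:
            variable.append(i)
    return variable


# Toy alignment of four "genomes" (made-up sequences)
toy_alignment = [
    "ATGCCGTA",
    "ATGACGTA",
    "ATGCCGTT",
    "ATGCCGTA",
]
print(variable_columns(toy_alignment))  # → [3, 7]
```

Only the variable columns carry phylogenetic signal, which is why an approach of this kind does not depend on gene annotation.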
MLST vs Whole genome phylogeny Whole genome phylogeny is much more robust and provides greater discriminating power than MLST Currently almost no difference in cost
Context is important!!!
Whole genome phylogeny Shigella in relation to E. coli S. dysenteriae S. flexneri Phylogeny built on the conserved 2.2 million bases of these 96 E. coli/Shigella genomes The four Shigella species form 2 distinct clusters when compared to E. coli
Whole genome phylogeny Shigella only comparisons Phylogeny based on the conserved 2.2 million bases of these 63 Shigella genomes 5 clades are supported: Clade 1 S. boydii dominant, with a few S. flexneri Clade 2 S. sonnei Clade 3 S. dysenteriae Type 2 and S. boydii Clade 4 S. dysenteriae Type 1 Clade 5 S. flexneri
O104 Clade Whole genome phylogeny context By placing the outbreak strains in the context of representative isolates of E. coli we could quickly identify that the outbreak isolate was an EAEC with Shiga toxin Mark Pallen will expand later Rasko et al. NEJM 2011
Case study: Attaching and Effacing E. coli
Attaching and Effacing E. coli (AEEC) Diverse group of E. coli that contain: Locus of enterocyte effacement (LEE) Type III secretion system Responsible for pedestal formation Variable Shiga-toxin presence Large number of phage Relatively few genomes examined outside of O157 Focus on the identification of novel virulence factors in this group More accurate diagnostic of these isolates
Comparative Genomics Whole Genome Phylogeny 113 AEEC genomes have been sequenced at UMB/IGS in collaboration with SSI and other investigators Analysis includes 23 reference genomes Prototype isolates of pathotypes may not always be the best genomic representatives Total 136 genome comparison
Inconsistency of typing with whole genomes or MLST Whole genome MLST
Phylogeny is neat and making these nice circles is cool, but what impact does this have on pathogenesis, virulence or therapeutic development? Carolyn Morris, Grad Student Rasko lab
AEEC Virulence - TTSS Majority of AEEC contain the LEE region (defining feature of the group) - Diversity in certain parts of the cluster, but not linked to virulence or disease severity - Secreted effectors are included in this gene cluster; other effectors are located elsewhere in the genome
Using whole genome alignments to identify novel genome features Comparative methods were adequate for the pairwise comparison of relatively few genomes How do you identify interesting regions in hundreds of genomes? Development and application of Genomic Epidemiology (or at least how we define it)
Application of Genomic Epidemiology Gene-independent identification of regions associated with each group - Comparison of defined pathotype isolates identified regions that were exclusive to each group - Known exclusive virulence factors identified
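One way to picture "regions that were exclusive to each group" is as a set operation on a presence/absence table: a region is group-exclusive if every genome in the group carries it and no genome outside the group does. The sketch below assumes that such a table already exists; the function name `exclusive_regions`, the genome names, the group labels and the region identifiers are all hypothetical, and the strict all-or-none criterion is a simplifying assumption rather than the method used in this work.

```python
def exclusive_regions(presence, groups):
    """Find regions present in every genome of a group and in no genome outside it.

    presence: dict mapping genome name -> set of region identifiers it carries
    groups:   dict mapping genome name -> group (e.g. pathotype) label
    Returns a dict mapping group label -> sorted list of exclusive regions.
    """
    result = {}
    labels = {groups[g] for g in presence}
    for label in sorted(labels):
        members = [g for g in presence if groups[g] == label]
        others = [g for g in presence if groups[g] != label]
        # Regions shared by all members of the group
        core = set.intersection(*(presence[g] for g in members))
        # Regions seen in at least one genome outside the group
        seen_elsewhere = set().union(*(presence[g] for g in others))
        result[label] = sorted(core - seen_elsewhere)
    return result


# Toy presence/absence data (made-up genomes and region ids)
presence = {
    "genome_1": {"R1", "R2", "R5"},
    "genome_2": {"R1", "R2"},
    "genome_3": {"R1", "R3", "R5"},
    "genome_4": {"R1", "R3"},
    "genome_5": {"R5"},
}
groups = {
    "genome_1": "group_A",
    "genome_2": "group_A",
    "genome_3": "group_B",
    "genome_4": "group_B",
    "genome_5": "group_C",
}
print(exclusive_regions(presence, groups))
# → {'group_A': ['R2'], 'group_B': ['R3'], 'group_C': []}
```

In this toy example `R1` is shared by two groups and `R5` is scattered across groups, so neither is reported; with real data a tolerance for missing or misassembled regions would probably be needed.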
Application of Genomic Epidemiology Other features also identified that were not previously known to play a role in virulence - Functional analysis underway
Can we use this phylogenetic signal as a method to identify and develop group specific biomarkers?
Attaching and Effacing E. coli (AEEC) Novel algorithm established for the identification of AEEC pathogens
The current problems with the development of EPEC diagnostics Defining the pathogen Serotype or virulence is not sufficient? Case Control studies with well defined parameters for disease presentation, isolate source, patient and isolate metadata Ideally include host parameters as well Immune status Microbiota (can we use microbiota as a treatment?) Lack of understanding of population structure How many distinct EPEC isolates are within one individual? What is the rate of variation within a host? Environment?
Future Directions Discussion Points Rapid diagnostic genomics requires appropriate comparison to close relatives Proper comparison requires knowledge of pathogen Vibrio species comparisons require a SNP-based analysis E. coli/Shigella comparisons are on the gene/region presence/absence Programs are required to obtain large-scale collections of isolates with extensive metadata Commensal species for those organisms that have this type of interaction with the host Data needs to rapidly go public for use by entire scientific community
Acknowledgments IGS/UMB Julia Redman Jason Sahl Tracy Hazen Carolyn Morris Sam Angioli James Kaper Claire Fraser-Liggett Jacques Ravel Genome Resource Center and Informatics Resource Center groups CDC Cheryl Bopp Michele Parsons Ciara O'Reilly Michael Humphrys Eric Mintz PacBio Dale Webster Ali Bashir Ellen Paxinos Andrew Kasarkis Eric Schadt UVa James Nataro Nadia Boisen SSI Flemming Scheutz Jakob Frimodt-Moller Carsten Struve Andreas Petersen Karen Krogfelt PHAC Matt Gilmour Funding National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services under contract number HHSN272200900007C and grant numbers 1R01AI089894, 1RC4AI092828 and 1U19AI090873 as well as Startup funds from the University of Maryland School of Medicine and EntVac
Questions?