CS 312: Algorithms Analysis. Gene Sequence Alignment. Overview: Objectives: Code:

Similar documents
Novel Coronavirus 2012

Coronaviruses. Virion. Genome. Genes and proteins. Viruses and hosts. Diseases. Distinctive characteristics

Coronaviruses cause acute, mild upper respiratory infection (common cold).

Part I. Content: History of Viruses. General properties of viruses. Viral structure. Viral classifications. Virus-like agents.

Trends of Pandemics in the 21 st Century

Update August Questions & Answers on Middle East Respiratory Syndrome Coronavirus (MERS CoV)

INFLUENZA-2 Avian Influenza

Chapter 6- An Introduction to Viruses*

History electron microscopes

Gastroenteritis and viral infections

Antigenic Epitope Prediction towards SARS Corona Virus using Bioinformatics Analysis *1

Malik Sallam. Ola AL-juneidi. Ammar Ramadan. 0 P a g e

CONSTRUCTION OF PHYLOGENETIC TREE USING NEIGHBOR JOINING ALGORITHMS TO IDENTIFY THE HOST AND THE SPREADING OF SARS EPIDEMIC

Sequencing-Based Identification of a Novel Coronavirus in Ferrets with Epizootic Catarrhal Enteritis and Development of Molecular Diagnostic Tests

بسم هللا الرحمن الرحيم

FACT SHEET FOR ADDITIONAL INFORMATION CONTACT

Julianne Edwards. Retroviruses. Spring 2010

Scientific Method in Vaccine History

Viruse associated gastrointestinal infection

Cover Page. The handle holds various files of this Leiden University dissertation

Avian Influenza Virus H7N9. Dr. Di Liu Network Information Center Institute of Microbiology Chinese Academy of Sciences

Evolution of influenza

Rama Nada. - Malik

HS-LS4-4 Construct an explanation based on evidence for how natural selection leads to adaptation of populations.

DOLPHIN RESEARCH CENTER Viruses and Dolphins

Discovery and Development of SARS-CoV 3CL Protease Inhibitors

Tis the Season Respiratory that is

INFLUENZA SURVEILLANCE

Cough, Cough, Sneeze, Wheeze: Update on Respiratory Disease

Overview: Chapter 19 Viruses: A Borrowed Life

Viral reproductive cycle

Acute respiratory illness This is a disease that typically affects the airways in the nose and throat (the upper respiratory tract).

Unit 2: Lesson 2 Case Studies: Influenza and HIV LESSON QUESTIONS

Influenza viruses. Virion. Genome. Genes and proteins. Viruses and hosts. Diseases. Distinctive characteristics

Virion Genome Genes and proteins Viruses and hosts Diseases Distinctive characteristics

Epitope discovery. Using predicted MHC binding CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS

Influenza: The past, the present, the (future) pandemic

VIRAL GASTRO-ENTERITIS

LESSON 4.4 WORKBOOK. How viruses make us sick: Viral Replication

Structure & Function of Viruses

Seminar 4. Bugs on the Move. Carleton Learning in Retirement What s Bugging You? Daniel Burnside

LECTURE topics: 1. Immunology. 2. Emerging Pathogens

Bioinformatics Laboratory Exercise

Influenza: The Threat of a Pandemic

INFLUENZA A VIRUS. Structure of the influenza A virus particle.

Human Infection with Novel Influenza A Virus Case Report Form

11/15/2011. Outline. Structural Features and Characteristics. The Good the Bad and the Ugly. Viral Genomes. Structural Features and Characteristics

Chapter 08 Lecture Outline

LESSON 4.6 WORKBOOK. Designing an antiviral drug The challenge of HIV

LEC 2, Medical biology, Theory, prepared by Dr. AYAT ALI

MERS. G Blackburn DO, MACOI Clinical Professor of Medicine MSUCOM

Human Immunodeficiency Virus

New viruses causing respiratory tract infections. Eric C.J. Claas

Five Features of Fighting the Flu

Western Veterinary Conference 2013

Medical Virology. Herpesviruses, Orthomyxoviruses, and Retro virus. - Herpesviruses Structure & Composition: Herpesviruses

INFLUENZA VIRUS. INFLUENZA VIRUS CDC WEBSITE

Early February Surveillance of severe atypical pneumonia in Hospital Authority in Hong Kong. Initiate contacts in Guangdong

Phylogenetic Methods

Influenza Surveillance Report

Unit 4 Student Guided Notes

Influenza-Associated Pediatric Mortality rev Jan 2018

Lecture 5 (Ch6) - Viruses. Virus Characteristics. Viral Host Range

Hands-On Ten The BRCA1 Gene and Protein

MERS H7N9. coronavirus. influenza virus. MERS: Since September 2012: 160 cases, including 68 deaths (42%)

Grade Level: Grades 9-12 Estimated Time Allotment Part 1: One 50- minute class period Part 2: One 50- minute class period

SARS and the Clinical Laboratory

The Ebola Virus. By Emilio Saavedra

Influenza. By Allison Canestaro-Garcia. Disease Etiology:

2009 (Pandemic) H1N1 Influenza Virus

SECTION 25-1 REVIEW STRUCTURE. 1. The diameter of viruses ranges from about a. 1 to 2 nm. b. 20 to 250 nm. c. 1 to 2 µm. d. 20 to 250 µm.

Influenza RN.ORG, S.A., RN.ORG, LLC

Swine Acute Diarrhea Syndrome Coronavirus (SADS-CoV) in China

INFLUENZA (Outbreaks; hospitalized or fatal pediatric cases)

SUMMARY OF THE DISCUSSION AND RECOMMENDATIONS OF THE SARS LABORATORY WORKSHOP, 22 OCTOBER 2003

Human Influenza. Dr. Sina Soleimani. Human Viral Vaccine Quality Control 89/2/29. November 2, 2011 HVVQC ١

In the Name of God. Talat Mokhtari-Azad Director of National Influenza Center

Zika Virus. It may be devastating, But we might just get one step ahead of it. Learning in Retirement Winter 2017 Daniel Burnside. photo: Newsweek.

LEARNING FROM OUTBREAKS: SARS

Avian Influenza: Armageddon or Hype? Bryan E. Bledsoe, DO, FACEP The George Washington University Medical Center

Severe Acute Respiratory Syndrome (SARS) Updated Guidelines for the Global Surveillance of SARS. August 22 nd 2005

Useful Contacts. Essential information concerning travel, schools and colleges, and the workplace will be published on

IMPORTANT INFORMATION ABOUT SWINE FLU. This leaflet contains important information to help you and your family KEEP IT SAFE. Welsh bilingual version

Middle East respiratory syndrome coronavirus (MERS-CoV) and Avian Influenza A (H7N9) update

Viral Hemorrhagic Disease

Severe Acute Respiratory Syndrome (SARS) Coronavirus

Microbiology Laboratory Directors, Infection Preventionists, Primary Care Providers, Emergency Department Directors, Infectious Disease Physicians

Respiratory Syncytial Virus. Respiratory Syncytial Virus. Parainfluenza virus. Acute Respiratory Infections II. Most Important Respiratory Pathogens

Biology 350: Microbial Diversity

Viral Infection. Pulmonary Infections with Respiratory Viruses. Wallace T. Miller, Jr., MD. Objectives: Viral Structure: Significance:

LESSON 4.5 WORKBOOK. How do viruses adapt Antigenic shift and drift and the flu pandemic

Q: If antibody to the NA and HA are protective, why do we continually get epidemics & pandemics of flu?

Chapter. Severe Acute Respiratory Syndrome (SARS) Outbreak in a University Hospital in Hong Kong. Epidemiology-University Hospital Experience

Respiratory Viruses. Respiratory Syncytial Virus

Phylogenetic Tree Practical Problems

MERS CoV Outbreak Riyadh, July-Aug 2015

Ali Alabbadi. Bann. Bann. Dr. Belal

Appendix E1. Epidemiology

2.1 VIRUSES. 2.1 Learning Goals

Transcription:

CS 312: Algorithms Analysis Gene Sequence Alignment Overview: In this project, you will implement dynamic programming algorithms for computing the minimal cost of aligning gene sequences and for extracting optimal alignments. Objectives: To design and implement a Dynamic Programming algorithm that has applications to gene sequence alignment. To think carefully about the use of memory in an implementation. To solve a non trivial computational genomics problem. Code: You are provided with some scaffolding code to help you get started on the project. We provide a Visual Studio project with a GUI containing a 10x10 adjacency matrix, with a row and a column for each of 10 organisms. The organism at row #n is the same sequence for column #n. Note that this matrix is *not* the dynamic programming table; it is simply a place to store and display the final result from the alignment of the gene sequences for each pair of organisms. Thus, cell (i, j) in this table will contain the minimum cost alignment score of the genome for organism i with the genome for organism j. When you press the Process button on the toolbar, the matrix fills with numbers, one number for each pair, as shown in the figure below. You will generate these numbers by aligning the first 5000 characters (bases) in each sequence pair. Currently, the numbers are computed by a quick placeholder computation that is not particularly useful (these are the numbers shown in the figure below). Your job will be to replace that number with the output from a sequence alignment function that employs dynamic programming. You will fill the matrix with the pair wise scores.

Each gene sequence consists of a string of letters and is stored in the given database file. The scaffolding code loads the data from the database. For example, the record for the sars9 genome contains the following sequence (prefix shown here): atattaggtttttacctacccaggaaaagccaaccaacctcgatctcttgtagatctgttctctaaacgaactttaaaatctgtgtagctgtc gctcggctgcatgcctagtgcacctacgcagtataaacaataataaattttactgtcgttgacaagaaacgagtaactcgtccctct We encourage you to familiarize yourself with the few lines of existing code required for using the database file, in order to become familiar with this functionality. Note that you can also open the.mdb database file using Microsoft Access, if you have installed a copy (possibly through the MSDN Academic Alliance). To Do: 1. Make sure you understand the sequence alignment and solution extraction algorithms as presented in class. The edit distance algorithm in section 6.3 of the Dasgupta et al. textbook is essentially the same algorithm. 2. Plan out your data structures, and write pseudo code for the alignment and extraction algorithms that you need to implement. There are two different alignment algorithms. a. The first simply scores the alignment between two sequences but cannot be used to extract the actual alignment. Call this the scoring algorithm. b. The second aligns the sequences but does so such that the actual alignment can be extracted. Call this the extraction algorithm. Be sure that your scoring algorithm runs in at most O(n 2 ) time and O(n) space. Your extraction algorithm can use O(n 2 ) in time and space. 3. Implement the sequence alignment and extraction algorithms using dynamic programming and a table (in particular, use the Needleman/Wunsch algorithm). When scoring, you only need to

align the first 5000 characters (bases) of each sequence (taxa). 4. As an aid to debugging, the first two sequences in the database are the strings: polynomial and exponential. While these strings aren t biologically valid DNA sequences, they are the strings used in an example in the textbook. Their scores will occupy the first two rows and columns of the 10x10 results table. 5. Your scoring and extraction algorithms should use the following operator scores to compute the distance between two sequences: Substitutions, which are single character mis matches, are penalized 1 unit Insertions/Deletions ( indels ), which create gaps, are penalized 5 units Matches are rewarded 3 units Find the best alignment by minimizing the total distance. 6. Record the pair wise scores computed by the scoring algorithm in the given 10x10 adjacency matrix such that position (i, j) is the optimal distance from taxa i to taxa j for the first 5000 characters. The matrix should be symmetric. Correctness of your algorithm will be checked using this matrix. 7. Identify any two distinct sequences with a low distance in the table. Call this pair the matching pair. Extract the alignment of their first 100 characters using your extraction algorithm and display the alignment in a manner that easily reveals the matches, substitutions, and insertions/deletions. Performance Requirement: You must fill the 10x10 table in less than 61 seconds using O(n 2 ) time and O(n) space. Also, you must complete the alignment of the matching pair using O(n 2 ) time and space. Report: Turn in a paragraph in which you describe how you met the big O bounds in the performance requirement.

Improvement Ideas: Here are a couple of suggestions for improvements. They are listed without prejudice against things that are or are not on this list. Experiment with alternative operator penalty weights/costs and discuss impact Try aligning longer (sub )sequences. Conduct an empirical analysis. Discuss impact on the adjacency matrix. Re implement your alignment algorithm in top down fashion using recursion and a memory function. How does this algorithm compare to the implementation using a table? Try a shortest path algorithm like Dijkstra s to solve this problem. Get ahead of the game, learn about the A* shortest path algorithm, and implement it with a tight admissible heuristic (we will get to this topic later in the course, but you might enjoy trying it in the context of the alignment problem)

Appendix: Optional Biological Background Reading The following explains the biological setting of this project. In light of the SARS outbreak a couple of years ago, we have chosen to use the SARS virus as the source of our DNA dataset. Next is some background on SARS and Coronaviruses in general from the Department of Microbiology and Immunology, University of Leicester, followed by concrete instructions and explanations for the assignment. Coronaviruses: Coronaviruses were first isolated from chickens in 1937. After the discovery of Rhinoviruses in the 1950 s, approximately 50% of colds still could not be ascribed to known agents. In 1965, Tyrell and Bynoe used cultures of human ciliated embryonal trachea to propagate the first human coronavirus (HCoV) in vitro. There are now approximately 15 species in this family, which infect not only man but cattle, pigs, rodents, cats, dogs, and birds (some are serious veterinary pathogens, especially in chickens). Coronavirus particles are irregularly shaped, ~60 220nm in diameter, with an outer envelope bearing distinctive, club shaped peplomers (~20nm long x 10nm at wide distal end). This crown like appearance (Latin, corona) gives the family its name. The center of the particle appears amorphous in negatively shaped stained EM preps, the nucleocapsid being in a loosely wound rather disordered state.

Most human coronaviruses do not grow in cultured cells, therefore relatively little is known about them, but two strains grow in some cell lines and have been used as a model. Replication is slow compared to other envelope viruses, e.g. influenza. Coronavirus infection is very common and occurs worldwide. The incidence of infection is strongly seasonal, with the greatest incidence in children in winter. Adult infections are less common. The number of coronavirus serotypes and the extent of antigenic variation is unknown. Re infections occur throughout life, implying multiple serotypes (at least four are known) and/or antigenic variation, hence the prospects for immunization appear bleak. SARS: SARS is a type of viral pneumonia, with symptoms including fever, a dry cough, dyspnea (shortness of breath), headache, and hypoxaemia (low blood oxygen concentration). Typical laboratory findings include lymphopaenia (reduced lymphocyte numbers) and mildly elevated aminotransferase levels (indicating liver damage). Death may result from progressive respiratory failure due to alveolar damage. The typical clinical course of SARS involves an improvement in symptoms during the first week of infection, followed by a worsening during the second week. Studies indicate that this worsening may be related to a patient s immune responses rather than uncontrolled viral replication.

The outbreak is believed to have originated in February 2003 in the Guangdong province of China, where 300 people became ill, and at least five died. After initial reports that a paramyxovirus was responsible, the true cause appears to be a novel coronavirus with some unusual properties. For one thing, SARS virus can be grown in Vero cells (a fibroblast cell line isolated in 1962 from a primate) a novel property for HCoV s, most of which cannot be cultivated. In these cells, virus infection results in a cytopathic effect, and budding of coronavirus like particles from the endoplasmic reticulum within infected cells. Amplification of short regions of the polymerase gene, (the most strongly conserved part of the coronavirus genome) by reverse transcriptase polymerase chain reaction (RT PCR) and nucleotide sequencing revealed that the SARS virus is a novel coronavirus which has not previously been present in human populations. This conclusion is confirmed by serological (antigenic) investigations. We now know the complete ~29,700 nucleotide sequence of many isolates of the SARS virus. The sequence appears to be typical of coronaviruses, with no obviously unusual features, although there are some differences in the make up of the non structural proteins which are unusual. There is currently no general agreement that antiviral drugs have been shown to be consistently successful in treating SARS or any coronavirus infection, nor any vaccine against SARS. However, new drugs targeted specifically against this virus are under development. Coronaviruses with 99% sequence similarity to the surface spike protein of human SARS isolates have been isolated in Guangdong, China, from apparently healthy masked palm civets (Paguma larvata), a catlike mammal closely related to the mongoose. The unlucky palm civet is regarded as a delicacy in Guangdong and it is believed that humans became infected as they raised and slaughtered the animals rather than by consumption of infected meat. Might SARS coronavirus recombine with other human coronaviruses to produce an even more deadly virus? Fortunately, the coronaviruses of which we are aware indicate that recombination has not occurred between viruses of different groups, only within a group, so recombination does not seem likely given the distance between the SARS virus and HCoV. SARS, The Disease and the Virus (from The National Center for Biotechnology Information): In late 2002, an outbreak of severe, atypical pneumonia was reported in Guangdong Province of China. The disease had an extremely high mortality rate (currently up to 15 19%), and quickly expanded to over 25 countries. The World Health Organization coined it "severe acute respiratory syndrome", or SARS. In April 2003, a previously unknown coronavirus was isolated from patients and subsequently proven to be the causative agent according to Koch's postulates in experiments on monkeys. The virus has been named SARS coronavirus (SARS CoV). The first complete sequence of SARS coronavirus was obtained in BCCA Genome Sciences Centre, Canada, about two weeks after the virus was detected in SARS patients. It was immediately submitted to GenBank prior to publication as a raw nucleotide sequence. GenBank released the sequence to the public the same day under accession number AY274119; the NCBI Viral Genomes Group annotated the sequence also the same day and released it in the form of the Entrez Genomes RefSeq record NC_004718 at 2 am next day. As of the beginning of May of 2003, all the SARS CoV RNA transcripts have been detected and sequenced in at least two laboratories; further experiments are underway.

The availability of the sequence data and functional dissection of the SARS CoV genome is the first step towards developing diagnostic tests, antiviral agents, and vaccines. Revised: 4/29/2008