Bioinformatics Laboratory Exercise

Biology is in the midst of the genomics revolution, the application of robotic technology to generate huge amounts of molecular biology data. Genomics has led to an explosion in biological data. For example, the amount of DNA sequence data has grown exponentially over the last several decades. At the current rate, the amount of known DNA sequence will double every 9 to 12 months. In spring 2005 there were 39 billion base pairs of sequence known. This semester, Spring 2005, there are 100 billion base pairs of sequence data known. Similar levels of data growth have occurred for protein sequences, gene expression data and protein interactions.

A new subdiscipline in biology, bioinformatics, has developed to deal with this vast amount of data. Bioinformatics involves the careful storage, organization and indexing of sequence information, and the development of computer software to analyze the data. The major depository of sequence data is the set of databases at the National Center for Biotechnology Information (NCBI), maintained by the National Institutes of Health. NCBI maintains a web site that includes a number of different interfaces for searching and analyzing the databases. The site can be accessed at the URL http://www.ncbi.nlm.nih.gov/.

We will use these databases to explore the types of mutations that accumulate in genes as species diverge. As we discussed in class, base substitutions within an open reading frame can be categorized based on how they affect the sequence of the encoded protein. For example, mis-sense mutations are base substitutions that result in a change of one amino acid in the polypeptide chain; silent mutations are base substitutions that do not affect the polypeptide chain; and non-sense mutations are base substitutions that generate a premature stop codon, shortening the polypeptide chain.

In this exercise, you will identify genes that code for ribosomal proteins in humans. You will use a program at NCBI to identify the codons within the gene and the amino acids that they encode. You will then identify the equivalent gene (an orthologue) in a closely related vertebrate species. From an alignment of these two genes you will identify base substitution mutations that have accumulated since the divergence of these species. You will then analyze whether these mutations are mis-sense, silent or non-sense mutations.

During the course of this exercise you will use the two major tools for identifying genes in the NCBI databases. The first method involves searching for keywords in the database entries. In addition to the DNA sequence, every entry in the DNA databases includes other information about the gene. This additional information may include the name of the species, information about the protein encoded by the gene, the names of the investigators or literature citations for the data. This information can be used to identify genes using an interface called Entrez. Entrez is a search engine similar to those that are used for literature searches (e.g. PubMed). It searches the database entries for the words in the query. For example, we will search the database using "Homo sapiens ribosomal protein" to identify genes encoding the human ribosomal proteins. This search term will identify these human genes. However, it will also identify numerous other entries that are not ribosomal proteins but have these four words somewhere in their database entry. Therefore, once the Entrez search has identified possible entries, they will have to be carefully examined to determine which ones correspond to human ribosomal proteins.

A second way to search the DNA databases is by sequence comparison. NCBI includes a search program called BLAST that will compare any DNA sequence to all 39 billion bp of sequence in the database, identify similar sequences, rank the sequences in order of similarity and provide sequence alignments for similar regions. We will use BLAST to identify vertebrate genes similar to the human ribosomal proteins.

The third piece of bioinformatics software that we will use is a program called ORF Finder. This program will identify the open reading frame in a fragment of DNA and identify which amino acid each codon encodes.

Procedure

I. Identifying genes for human ribosomal proteins

1. Using Internet Explorer, go to the NCBI website <http://www.ncbi.nlm.nih.gov/>.
2. On the NCBI home page click on the All Databases link on the blue bar.
3. Select the Nucleotide database.
4. In the search window type "homo sapiens ribosomal protein 60S" and click GO.

5. This will retrieve all the nucleotide database entries that include these words. Note that almost forty thousand database entries are retrieved. Most of these entries are not human genes and many may be partial sequences. To limit the analysis to genes that are better understood, click on the RefSeq tab.
6. RefSeq entries are the database entries that have been most carefully reviewed by NCBI. There are still more than 100 entries. Some of these will be genes for ribosomal proteins. Others may be human pseudogenes or genes encoding factors that interact with the ribosome.
7. To identify the genes for human ribosomal proteins, choose only entries that begin with Homo sapiens. Avoid mitochondrial proteins, pseudogenes and whole chromosomes. In this example the third entry is a ribosomal protein. As an example, click on the blue accession number to obtain the database entry. (For the actual exercise, everyone in the class will be assigned a separate ribosomal protein to analyze.)
8. Scroll down the entry to observe the type of information it contains. Notice that the DNA sequence is at the bottom of the entry. This sequence may include promoter elements, exons, introns etc. To obtain just the open reading frame (start codon to stop codon without any introns), click on the link CDS about halfway down the entry.
9. If you scroll down to the bottom of this screen you will see a sequence of DNA that corresponds to the open reading frame. It starts with a DNA version of a start codon and ends with one of the three stop codons.
10. Unfortunately, the computer programs cannot read this sequence because of the position numbers it contains. Therefore, before you run the other programs you need to convert it to another format called FASTA. To convert it, click the drop-down button next to GenBank Full and select FASTA. Then click the Display button and a new screen will come up.
11. The new screen shows the sequence in FASTA format. Copy the FASTA entry and paste it into a Word document. (Be sure to select the pasted text and convert it to Courier font, 8 pt.) See the example at the end of this lab handout.
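The properties described in steps 9 to 11 can also be checked by a short program before the sequence is pasted into other tools. The following is a minimal Python sketch, not part of the NCBI site: the function names `read_fasta` and `check_orf` and the toy record are hypothetical and were made up for illustration. It assumes a single-record FASTA text whose sequence is a complete open reading frame.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def read_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must begin with a '>' header line")
    header = lines[0][1:]
    # Sequence lines are simply concatenated; FASTA carries no position numbers.
    sequence = "".join(lines[1:]).upper()
    return header, sequence


def check_orf(sequence):
    """List the problems that keep a sequence from being a complete ORF."""
    problems = []
    if len(sequence) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if not sequence.startswith("ATG"):
        problems.append("does not begin with the start codon ATG")
    if sequence[-3:] not in STOP_CODONS:
        problems.append("does not end with TAA, TAG or TGA")
    return problems


# Toy record for illustration only; it is not a real database entry.
toy_record = """>toy_orf hypothetical example
ATGGCTAAA
GGTTAA
"""

header, sequence = read_fasta(toy_record)
print(header)
print(len(sequence))
print(check_orf(sequence))  # an empty list means no problems were found
```

An empty list from `check_orf` only says that the sequence has the outward shape of an open reading frame; it does not check for stop codons inside the frame.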

II. ORF Finder

1. Return to the home page by clicking the NCBI symbol in the upper left-hand side of the window. Scroll down under HotSpots to find the ORF Finder link. Click this link.
2. Scroll down the ORF Finder page and you will find a dialogue box. Paste the FASTA format of your gene in this box and click the OrfFind button.
3. The ORF Finder program will generate a series of green bars. Click on the top (and longest) bar to obtain the annotated ORF.
4. Scroll to the bottom of this page to find an annotated open reading frame. Copy the open reading frame and paste it into the Word document. (Be sure to select the pasted text and convert it to Courier font, 8 pt. Also convert the entire font to black.) See the end of the lab handout for examples.

III. BLAST Search

1. Return to the home page by clicking the NCBI symbol in the upper left-hand side of the window. Click on the BLAST link on the blue bar.
2. Under the Nucleotide column find Nucleotide-nucleotide BLAST and click on it.

3. In the new screen, paste the FASTA format for the human gene in the search dialogue box. Next to "Choose a database" click the drop-down box and select refseq_rna. Click "BLAST!" to launch the search.
4. A BLAST search response comes up. Click on "Format!" to see the results of the BLAST search. (Note that it may take a few minutes to return the results.)
5. Scroll down the results until you see a list of similar genes. The best matches are at the top of this list. You will use the first non-human gene on this list. In this example it is a dog gene. To see an alignment of the two sequences, click on the Score corresponding to the best match.
6. This will bring you to a BLAST alignment of the two sequences. Copy and paste this alignment into your Word document. (Be sure to select the pasted text and convert it to Courier font, 8 pt.)
7. The top sequence is the human gene; the second sequence is the dog gene. Matches between the two sequences are indicated by a vertical bar (|) between the two sequences. A mismatch between the two sequences suggests that a mutation has accumulated in one of these genes since the divergence of humans and dogs.

IV. Mutation Analysis

Starting at the 5' end of the gene, identify 10 single base substitutions. Ignore any double base substitutions. Note the location and substitution for these mutations on the annotated open reading frame. Using the genetic code, determine whether these mutations are silent, mis-sense or non-sense mutations. Report your analysis in a data collection sheet. See the example on the last page.

V. Short Lab Report

1. Submit the FASTA format for your human ribosomal protein open reading frame.
2. Submit the annotated open reading frame generated by ORF Finder.
3. Submit the BLAST alignment of the human gene to the most similar non-human gene.
4. Submit a data collection sheet formatted as in the example.
5. What percentage of the mutations were silent, mis-sense or non-sense mutations?
6. It is a basic tenet of evolutionary biology that mutations are random. If this is true, we would predict that mis-sense mutations would be more common than silent mutations. (Changes in either of the first two nucleotides of a codon generally result in a mis-sense mutation. Only mutations in the third position result in a silent mutation.) Explain why, in this analysis, silent mutations are more common than mis-sense mutations.
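The classification in section IV and the prediction in item 6 of the lab report can both be illustrated with a short program. The following Python sketch is an illustration under stated assumptions, not a required part of the exercise: it uses the standard genetic code, the names `classify` and `tally_all_single_substitutions` are hypothetical, and the tally treats every single-base substitution in every amino-acid-coding codon as equally likely, which real mutation processes need not be.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify(original, mutated):
    """Classify a change between two codons.

    Assumes the original codon encodes an amino acid (not a stop codon).
    """
    before = CODON_TABLE[original.upper()]
    after = CODON_TABLE[mutated.upper()]
    if before == after:
        return "silent"
    if after == "*":
        return "non-sense"
    return "mis-sense"


def tally_all_single_substitutions():
    """Count outcomes of every single-base change in every sense codon."""
    counts = {"silent": 0, "mis-sense": 0, "non-sense": 0}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # start only from codons that encode an amino acid
        for position in range(3):
            for base in BASES:
                if base == codon[position]:
                    continue
                mutant = codon[:position] + base + codon[position + 1:]
                counts[classify(codon, mutant)] += 1
    return counts


print(classify("CGA", "CGT"))  # Arg -> Arg, prints "silent"
print(classify("CGA", "CAA"))  # Arg -> Gln, prints "mis-sense"
print(classify("CGA", "TGA"))  # Arg -> stop, prints "non-sense"
print(tally_all_single_substitutions())
# prints {'silent': 134, 'mis-sense': 392, 'non-sense': 23}
```

Under this equal-weight tally, most of the 549 possible single-base changes alter the amino acid. That is the prediction that item 6 asks you to compare with the counts you actually observe.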

Example Analysis

NM_033625 Human ribosomal protein RPL34

>gi|16117788:67-420 Homo sapiens ribosomal protein L34 (RPL34), transcript variant 2, mRNA
ATGGTCCAGCGTTTGACATACCGACGTAGGCTTTCCTACAATACAGCCTCTAACAAAACTAGGCTGTCCC
GAACCCCTGGTAATAGAATTGTTTACCTTTATACCAAGAAGGTTGGGAAAGCACCAAAATCTGCATGTGG
TGTGTGCCCAGGCAGACTTCGAGGGGTTCGTGCTGTAAGACCTAAAGTTCTTATGAGATTGTCCAAAACA
AAGAAACATGTCAGCAGGGCCTATGGTGGTTCCATGTGTGCTAAATGTGTTCGTGACAGGATCAAGCGTG
CTTTCCTTATCGAGGAGCAGAAAATCGTTGTGAAAGTGTTGAAGGCACAAGCACAGAGTCAGAAAGCTAA
ATAA

Annotated Open Reading Frame

  1 atggtccagcgtttgacataccgacgtaggctttcctacaataca
     M  V  Q  R  L  T  Y  R  R  R  L  S  Y  N  T
 46 gcctctaacaaaactaggctgtcccgaacccctggtaatagaatt
     A  S  N  K  T  R  L  S  R  T  P  G  N  R  I
 91 gtttacctttataccaagaaggttgggaaagcaccaaaatctgca
     V  Y  L  Y  T  K  K  V  G  K  A  P  K  S  A
136 tgtggtgtgtgcccaggcagacttcgaggggttcgtgctgtaaga
     C  G  V  C  P  G  R  L  R  G  V  R  A  V  R
181 cctaaagttcttatgagattgtccaaaacaaagaaacatgtcagc
     P  K  V  L  M  R  L  S  K  T  K  K  H  V  S
226 agggcctatggtggttccatgtgtgctaaatgtgttcgtgacagg
     R  A  Y  G  G  S  M  C  A  K  C  V  R  D  R
271 atcaagcgtgctttccttatcgaggagcagaaaatcgttgtgaaa
     I  K  R  A  F  L  I  E  E  Q  K  I  V  V  K
316 gtgttgaaggcacaagcacagagtcagaaagctaaataa 354
     V  L  K  A  Q  A  Q  S  Q  K  A  K  *

>gi|57109193|ref|XM_535688.1| PREDICTED: Canis familiaris similar to ribosomal protein L34 (LOC478509), mRNA
Length = 577

Score = 543 bits (274), Expect = e-153
Identities = 334/354 (94%)
Strand = Plus / Plus

Query: 1   atggtccagcgtttgacataccgacgtaggctttcctacaatacagcctctaacaaaact 60
           ||||| ||||||||||||||||| |||||||| |||||||||||||||||||||||||||
Sbjct: 172 atggttcagcgtttgacataccgtcgtaggctgtcctacaatacagcctctaacaaaact 231

Query: 61  aggctgtcccgaacccctggtaatagaattgtttacctttataccaagaaggttgggaaa 120
           |||||||||||||| ||||| |||||||| ||||||||||||||||||||||||||||||
Sbjct: 232 aggctgtcccgaactcctggcaatagaatcgtttacctttataccaagaaggttgggaaa 291

Query: 121 gcaccaaaatctgcatgtggtgtgtgcccaggcagacttcgaggggttcgtgctgtaaga 180
           || ||||| ||||||||||| ||||| || ||| |||||||||| |||||||| || |||
Sbjct: 292 gcgccaaagtctgcatgtggcgtgtgtcctggccgacttcgaggtgttcgtgcggtgaga 351

Query: 181 cctaaagttcttatgagattgtccaaaacaaagaaacatgtcagcagggcctatggtggt 240
           |||||||| |||||||||||||| ||||| || |||||||||||||||||||||||||||
Sbjct: 352 cctaaagtccttatgagattgtctaaaacgaaaaaacatgtcagcagggcctatggtggt 411

Query: 241 tccatgtgtgctaaatgtgttcgtgacaggatcaagcgtgctttccttatcgaggagcag 300
           |||||||||||||||||||||||||||||||||||||||||||||||||| |||||||||
Sbjct: 412 tccatgtgtgctaaatgtgttcgtgacaggatcaagcgtgctttccttattgaggagcag 471

Query: 301 aaaatcgttgtgaaagtgttgaaggcacaagcacagagtcagaaagctaaataa 354
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 472 aaaatcgttgtgaaagtgttgaaggcacaagcacagagtcagaaagctaaataa 525

Data Collection Sheet

Mutation  Human      Dog        Type of mutation
1         GTC (Val)  GTT (Val)  Silent
2         CGA (Arg)  CGT (Arg)  Silent
3         CTT (Leu)  CTG (Leu)  Silent
4         ACC (Thr)  ACT (Thr)  Silent
Etc.
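The codon pairs on the data collection sheet can be cross-checked against the alignment with a short program. The following Python sketch is one way to do it, under stated assumptions: the function name `single_base_codon_changes` is hypothetical, the two aligned strings must contain no gaps, and the first base of each string must be the first base of a codon. It interprets "ignore any double base substitutions" as skipping codons that differ at more than one position, which is an assumption about the instruction in section IV.

```python
def single_base_codon_changes(query, subject):
    """Return (codon_number, query_codon, subject_codon) for each codon
    that differs at exactly one position.

    Both strings must be gap-free, equal in length, and start at the
    first base of a codon.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    changes = []
    usable_length = len(query) - len(query) % 3  # ignore a trailing partial codon
    for start in range(0, usable_length, 3):
        q_codon = query[start:start + 3].upper()
        s_codon = subject[start:start + 3].upper()
        differences = sum(q != s for q, s in zip(q_codon, s_codon))
        if differences == 1:
            changes.append((start // 3 + 1, q_codon, s_codon))
    return changes


# First row of the example alignment: human query bases 1-60 and the
# aligned dog subject bases.
human = "atggtccagcgtttgacataccgacgtaggctttcctacaatacagcctctaacaaaact"
dog = "atggttcagcgtttgacataccgtcgtaggctgtcctacaatacagcctctaacaaaact"

for number, human_codon, dog_codon in single_base_codon_changes(human, dog):
    print(number, human_codon, dog_codon)
# prints:
# 2 GTC GTT
# 8 CGA CGT
# 11 CTT CTG
```

Each reported pair can then be looked up in the genetic code to decide whether the substitution is silent, mis-sense or non-sense.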