Bioinformatics Laboratory Exercise

Biology is in the midst of the genomics revolution: the application of robotic technology to generate huge amounts of molecular biology data. Genomics has led to an explosion in biological data. For example, the amount of DNA sequence data has grown exponentially over the last several decades. At the current rate, the amount of known DNA sequence doubles every 9 to 12 months. In spring 2005 there were 39 billion base pairs of sequence known. This semester, Spring 2005, there are 100 billion base pairs of sequence data known. Similar levels of data growth have occurred for protein sequences, gene expression data and protein interactions.

A new subdiscipline of biology, bioinformatics, has developed to deal with this vast amount of data. Bioinformatics involves the careful storage, organization and indexing of sequence information, and the development of computer software to analyze the data. The major repository of sequence data is maintained by the National Center for Biotechnology Information (NCBI), part of the National Institutes of Health. NCBI maintains a web site that includes a number of different interfaces for searching and analyzing the databases. The site can be accessed at the URL http://www.ncbi.nlm.nih.gov/.

We will use this database to explore the types of mutations that accumulate in genes as species diverge. As we discussed in class, base substitutions within an open reading frame can be categorized by how they affect the sequence of the encoded protein. Mis-sense mutations are base substitutions that change one amino acid in the polypeptide chain; silent mutations are base substitutions that do not affect the polypeptide chain; and non-sense mutations are base substitutions that generate a premature stop codon, shortening the polypeptide chain. In this exercise, you will identify genes that code for ribosomal proteins in humans.
You will use a program at NCBI to identify the codons within the gene and the amino acids that they encode. You will then identify the equivalent gene (an orthologue) in a closely related vertebrate species. From an alignment of these two genes you will identify base substitution mutations that have accumulated since the divergence of these species. You will then analyze whether these mutations are mis-sense, silent or non-sense mutations.

During the course of this exercise you will use the two major tools for identifying genes in the NCBI databases. The first method involves searching for key words in the database entries. In addition to the DNA sequence, every entry in the DNA databases includes other information about the gene. This additional information may include the name of the species, information about the protein encoded by the gene, the names of the investigators or literature citations for the data. This information can be used to identify genes using an interface called Entrez. Entrez is a search engine similar to those used for literature searches (e.g. PubMed). It searches the database entries for words in the entry. For example, we will search the database using Homo sapiens ribosomal protein to identify genes encoding the human ribosomal proteins. This search term will identify these human genes. However, it will also identify numerous other entries that are

not ribosomal proteins but have these four words somewhere in their database entry. Therefore, once the Entrez search has identified possible entries, they will have to be carefully examined to determine which ones correspond to human ribosomal proteins.

A second way to search the DNA databases is by sequence comparison. NCBI includes a search program called BLAST that will compare any DNA sequence to all 39 billion bp of sequence in the database, identify similar sequences, rank the sequences in order of similarity and provide sequence alignments for similar regions. We will use BLAST to identify vertebrate genes similar to the human ribosomal proteins.

The third piece of bioinformatics software that we will use is a program called ORF Finder. This program will identify the open reading frame in a fragment of DNA and identify which amino acid each codon encodes.

Procedure

I. Identifying genes for human ribosomal proteins
1. Using Internet Explorer, go to the NCBI website <http://www.ncbi.nlm.nih.gov/>.
2. On the NCBI home page, click on the All Databases link on the blue bar.
3. Select the Nucleotide database.
4. In the search window, type homo sapiens ribosomal protein 60S and click GO.

5. This will retrieve all the nucleotide database entries that include these words. Note that almost forty thousand database entries are retrieved. Most of these entries are not human genes, and many may be partial sequences. To limit the analysis to genes that are better understood, click the RefSeq tab.
6. RefSeq entries are the database entries that have been most carefully reviewed by NCBI. There are still more than 100 entries. Some of these will be genes for ribosomal proteins. Others may be human pseudogenes or genes encoding factors that interact with the ribosome.
7. To identify the genes for human ribosomal proteins, choose only entries that begin with Homo sapiens. Avoid mitochondrial proteins, pseudogenes and whole chromosomes. In this example the third entry is a ribosomal protein. As an example, click on the blue accession number to obtain the database entry. (For the actual exercise, everyone in the class will be assigned a separate ribosomal protein to analyze.)

8. Scroll down the entry to see the type of information it contains. Notice that the DNA sequence is at the bottom of the entry. This sequence may include promoter elements, exons, introns, etc. To obtain just the open reading frame (start codon to stop codon, without any introns), click on the CDS link about halfway down the entry.
9. If you scroll down to the bottom of this screen you will see a sequence of DNA that corresponds to the open reading frame. It starts with the DNA version of a start codon and ends with one of the three stop codons.

10. Unfortunately, the computer programs cannot read this sequence because of the numbers it contains. Therefore, before you run the other programs you need to convert it to another format called FASTA. To convert it, click the drop-down button next to GenBank Full and select FASTA. Then click the Display button and a new screen will come up.
11. The new screen shows the entry in FASTA format. Copy the FASTA entry and paste it into a Word document. (Be sure to select the pasted text and convert it to Courier font, 8 pt.) See the example at the end of this lab handout.
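The checks described in steps 9 to 11 can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the NCBI tools: the function names `parse_fasta` and `looks_like_orf` and the short made-up record are assumptions introduced here. It splits a FASTA record into its header line and its sequence, then tests whether the sequence starts with ATG and ends with one of the three stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def parse_fasta(text):
    """Split a single FASTA record into (header, sequence).

    The header is the first line without its leading '>'.
    The sequence lines are joined and converted to upper case.
    """
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must begin with a '>' header line")
    header = lines[0][1:]
    sequence = "".join(lines[1:]).upper()
    return header, sequence


def looks_like_orf(sequence):
    """Return True if the sequence is a whole number of codons,
    starts with the start codon ATG and ends with a stop codon."""
    return (
        len(sequence) % 3 == 0
        and sequence.startswith("ATG")
        and sequence[-3:] in STOP_CODONS
    )


# A short made-up record used only for illustration (not a real gene).
record = """>made-up example record
ATGGTCCAG
CGTTAA
"""

header, seq = parse_fasta(record)
print(header)
print(seq, looks_like_orf(seq))
```

To check your own gene, replace the made-up record with the FASTA text you copied from NCBI.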

II. ORF Finder
1. Return to the home page by clicking the NCBI symbol in the upper left-hand side of the window. Scroll down under HotSpots to find the ORF Finder link. Click this link.
2. Scroll down the ORF Finder page and you will find a dialogue box. Paste the FASTA format of your gene into this box and click the OrfFind button.
3. The ORF Finder program will generate a series of green bars. Click on the top (and longest) bar to obtain the annotated ORF.

4. Scroll to the bottom of this page to find an annotated open reading frame. Copy the open reading frame and paste it into the Word document. (Be sure to select the pasted text and convert it to Courier font, 8 pt. Also change the font color of the entire text to black.) See the end of the lab handout for examples.

III. BLAST Search
1. Return to the home page by clicking the NCBI symbol in the upper left-hand side of the window. Click on the BLAST link on the blue bar.
2. Under the nucleotide column, find Nucleotide-nucleotide BLAST and click it.

3. In the new screen, paste the FASTA format for the human gene into the search dialogue box. Next to Choose a database, click the drop-down box and select refseq_rna. Click BLAST! to launch the search.
4. A BLAST search response comes up. Click on Format! to see the results of the BLAST search. (Note: it may take a few minutes to return the results.)
5. Scroll down the results until you see a list of similar genes. The best matches are at the top of this list. You will use the first non-human gene on this list. In this example it is a dog gene. To see an alignment of the two sequences, click on the Score corresponding to the best match.
6. This will bring you to a BLAST alignment of the two sequences. Copy this alignment and paste it into your Word document. (Be sure to select the pasted text and convert it to Courier font, 8 pt.)
7. The top sequence is the human gene; the second sequence is the dog gene. Matches between the two sequences are indicated by a vertical bar (|) between the two sequences. A mismatch between the two sequences suggests that a mutation has accumulated in one of these genes since the divergence of humans and dogs.

IV. Mutation Analysis
Starting at the 5' end of the gene, identify 10 single base substitutions. Ignore any double base substitutions. Note the location and substitution for each of these mutations on the annotated open reading frame. Using the genetic code, determine whether these mutations are silent, mis-sense or non-sense mutations. Report your analysis in a data collection sheet. See the example on the last page.

V. Short Lab Report
1. Submit the FASTA format for your human ribosomal protein open reading frame.
2. Submit the annotated open reading frame generated by ORF Finder.
3. Submit the BLAST alignment of the human gene to the most similar non-human gene.
4. Submit a data collection sheet formatted as in the example.
5. What percentage of the mutations were silent, mis-sense or non-sense mutations?
6. It is a basic tenet of evolutionary biology that mutations are random. If this is true, we would predict that mis-sense mutations would be more common than silent mutations. (Changes in either of the first two nucleotides of a codon generally result in a mis-sense mutation. Most silent mutations occur at the third position.) Explain why, in this analysis, silent mutations are more common than mis-sense mutations.
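The classification asked for in section IV, and the percentages asked for in the lab report, can be sketched in Python as a way to check work done by hand. This is a minimal sketch under stated assumptions, not the method of any NCBI program: the names `CODON_TABLE` and `classify` are made up for this illustration, the table is the standard genetic code, and the last two codon pairs in the example list are hypothetical values chosen only to show a mis-sense and a non-sense result. The first three pairs are taken from the example data collection sheet.

```python
from collections import Counter
from itertools import product

# Standard genetic code, listed in the order TTT, TTC, TTA, TTG, TCT, ...
# (third base changes fastest; '*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify(reference_codon, changed_codon):
    """Classify the change from reference_codon to changed_codon."""
    before = CODON_TABLE[reference_codon.upper()]
    after = CODON_TABLE[changed_codon.upper()]
    if before == after:
        return "silent"      # same amino acid (or both stop codons)
    if after == "*":
        return "non-sense"   # new premature stop codon
    if before == "*":
        return "other"       # stop codon lost; not one of the three classes
    return "mis-sense"       # different amino acid


# (human codon, dog codon) pairs.
pairs = [
    ("CGA", "CGT"),  # from the example data collection sheet
    ("CTT", "CTG"),  # from the example data collection sheet
    ("ACC", "ACT"),  # from the example data collection sheet
    ("AAA", "AGA"),  # hypothetical pair, for illustration only
    ("TAC", "TAA"),  # hypothetical pair, for illustration only
]

counts = Counter(classify(human, dog) for human, dog in pairs)
for kind, n in counts.items():
    print(f"{kind}: {n} of {len(pairs)} ({100 * n / len(pairs):.0f}%)")
```

Replace the list of pairs with the ten codon pairs from your own alignment to check the percentages you report.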

Example Analysis

NM_033625 Human ribosomal protein RPL34

>gi|16117788:67-420 Homo sapiens ribosomal protein L34 (RPL34), transcript variant 2, mRNA
ATGGTCCAGCGTTTGACATACCGACGTAGGCTTTCCTACAATACAGCCTCTAACAAAACTAGGCTGTCCC
GAACCCCTGGTAATAGAATTGTTTACCTTTATACCAAGAAGGTTGGGAAAGCACCAAAATCTGCATGTGG
TGTGTGCCCAGGCAGACTTCGAGGGGTTCGTGCTGTAAGACCTAAAGTTCTTATGAGATTGTCCAAAACA
AAGAAACATGTCAGCAGGGCCTATGGTGGTTCCATGTGTGCTAAATGTGTTCGTGACAGGATCAAGCGTG
CTTTCCTTATCGAGGAGCAGAAAATCGTTGTGAAAGTGTTGAAGGCACAAGCACAGAGTCAGAAAGCTAA
ATAA

Annotated Open Reading Frame

  1 atggtccagcgtttgacataccgacgtaggctttcctacaataca
    M  V  Q  R  L  T  Y  R  R  R  L  S  Y  N  T
 46 gcctctaacaaaactaggctgtcccgaacccctggtaatagaatt
    A  S  N  K  T  R  L  S  R  T  P  G  N  R  I
 91 gtttacctttataccaagaaggttgggaaagcaccaaaatctgca
    V  Y  L  Y  T  K  K  V  G  K  A  P  K  S  A
136 tgtggtgtgtgcccaggcagacttcgaggggttcgtgctgtaaga
    C  G  V  C  P  G  R  L  R  G  V  R  A  V  R
181 cctaaagttcttatgagattgtccaaaacaaagaaacatgtcagc
    P  K  V  L  M  R  L  S  K  T  K  K  H  V  S
226 agggcctatggtggttccatgtgtgctaaatgtgttcgtgacagg
    R  A  Y  G  G  S  M  C  A  K  C  V  R  D  R
271 atcaagcgtgctttccttatcgaggagcagaaaatcgttgtgaaa
    I  K  R  A  F  L  I  E  E  Q  K  I  V  V  K
316 gtgttgaaggcacaagcacagagtcagaaagctaaataa 354
    V  L  K  A  Q  A  Q  S  Q  K  A  K  *

>gi|57109193|ref|XM_535688.1| PREDICTED: Canis familiaris similar to ribosomal protein L34 (LOC478509), mRNA
Length = 577

Score = 543 bits (274), Expect = e-153
Identities = 334/354 (94%)
Strand = Plus / Plus

Query: 1   atggtccagcgtttgacataccgacgtaggctttcctacaatacagcctctaacaaaact 60
Sbjct: 172 atggttcagcgtttgacataccgtcgtaggctgtcctacaatacagcctctaacaaaact 231

Query: 61  aggctgtcccgaacccctggtaatagaattgtttacctttataccaagaaggttgggaaa 120
Sbjct: 232 aggctgtcccgaactcctggcaatagaatcgtttacctttataccaagaaggttgggaaa 291

Query: 121 gcaccaaaatctgcatgtggtgtgtgcccaggcagacttcgaggggttcgtgctgtaaga 180
Sbjct: 292 gcgccaaagtctgcatgtggcgtgtgtcctggccgacttcgaggtgttcgtgcggtgaga 351

Query: 181 cctaaagttcttatgagattgtccaaaacaaagaaacatgtcagcagggcctatggtggt 240
Sbjct: 352 cctaaagtccttatgagattgtctaaaacgaaaaaacatgtcagcagggcctatggtggt 411

Query: 241 tccatgtgtgctaaatgtgttcgtgacaggatcaagcgtgctttccttatcgaggagcag 300
Sbjct: 412 tccatgtgtgctaaatgtgttcgtgacaggatcaagcgtgctttccttattgaggagcag 471

Query: 301 aaaatcgttgtgaaagtgttgaaggcacaagcacagagtcagaaagctaaataa 354
Sbjct: 472 aaaatcgttgtgaaagtgttgaaggcacaagcacagagtcagaaagctaaataa 525

Data Collection Sheet

Mutation   Human   Dog   Type of mutation
1          GTC     GTT   Silent
           Val     Val
2          CGA     CGT   Silent
           Arg     Arg
3          CTT     CTG   Silent
           Leu     Leu
4          ACC     ACT   Silent
           Thr     Thr
Etc.
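The search for single base substitutions can also be sketched in Python as a cross-check on the rows of a data collection sheet. This is only an illustrative sketch: the function name `single_base_substitutions` is made up here, and it assumes that the two aligned sequences have no gaps, have the same length and begin at the first base of a codon. It also assumes that a "double base substitution" means two or more changes within the same codon. The two sequences are the first 60 bases of the human (Query) and dog (Sbjct) lines of the example alignment.

```python
def single_base_substitutions(query, subject):
    """Return (codon_number, query_codon, subject_codon) for every codon
    that differs at exactly one base between two aligned sequences.

    Assumes an ungapped alignment that starts in frame.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    results = []
    usable_length = len(query) - len(query) % 3  # ignore a trailing partial codon
    for start in range(0, usable_length, 3):
        q_codon = query[start:start + 3].upper()
        s_codon = subject[start:start + 3].upper()
        differences = sum(1 for q, s in zip(q_codon, s_codon) if q != s)
        if differences == 1:  # codons with two or three changes are skipped
            results.append((start // 3 + 1, q_codon, s_codon))
    return results


# First alignment block of the example (positions 1 to 60 of the human gene).
human = "atggtccagcgtttgacataccgacgtaggctttcctacaatacagcctctaacaaaact"
dog = "atggttcagcgtttgacataccgtcgtaggctgtcctacaatacagcctctaacaaaact"

for codon_number, human_codon, dog_codon in single_base_substitutions(human, dog):
    print(codon_number, human_codon, dog_codon)
# prints:
# 2 GTC GTT
# 8 CGA CGT
# 11 CTT CTG
```

Each reported codon pair can then be looked up in the genetic code to decide whether the change is silent, mis-sense or non-sense.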