Nucleotide sequence conservation in paramyxoviruses; the concept of codon constellation

Journal of General Virology (2015), 96, 939–955. DOI 10.1099/vir.0.070789-0

Review

Bert K. Rima

Correspondence: Bert K. Rima, b.rima@qub.ac.uk

Centre for Infection and Immunity, School of Medicine, Dentistry and Biomedical Sciences, Queen's University Belfast, Belfast BT9 7BL, Northern Ireland, UK

The stability and conservation of the sequences of RNA viruses in the field and the high error rates measured in vitro are paradoxical. The field stability indicates that there are very strong selective constraints on sequence diversity. The nature of these constraints is discussed. Apart from constraints on variation in cis-acting RNA and the amino acid sequences of viral proteins, there are others relating to the presence of specific dinucleotides such as CpG and UpA, as well as the importance of RNA secondary structures and RNA degradation rates. Other constraints recently identified in other RNA viruses, such as effects of secondary RNA structure on protein folding or modification of cellular tRNA complements, are also discussed. Using the family Paramyxoviridae, I show that the codon usage pattern (CUP) is (i) specific for each virus species and (ii) markedly different from that of the host; it does not vary even in vaccine viruses that have been derived by passage in a number of inappropriate host cells. The CUP might thus be an additional constraint on variation, and I propose the concept of codon constellation to indicate the informational content of the sequences of RNA molecules relating not only to stability and structure but also to the efficiency of translation of a viral mRNA resulting from the CUP and the numbers and position of rare codons.
The paradox between stability in the field and error rates measured in vitro

It is important to consider the constraints on sequence diversity in RNA viruses because there is a paradox between the measured error rates of the polymerases of these viruses and the remarkable stability that some of the RNA viruses display in the field. The explanation of this paradox remains daunting (Vignuzzi & Andino, 2012). The high error rate in the replication of the viral RNA genomes has frequently been put forward as an evolutionary advantage because the existence of these viruses as a quasi-species might enhance their ability to adapt to new hosts (Domingo & Holland, 1994; Domingo & Wain-Hobson, 2009). In this paper this apparent paradox is discussed using paramyxoviruses such as measles (MV), mumps (MuV), parainfluenza virus type 5 (PIV5) and type 3 (PIV3) as examples. It has been noted that these paramyxoviruses have very stable nucleotide sequences in the field. They do not display the levels of variation that might be expected on the basis of high polymerase error rates and the level of degeneracy of the genetic code. In other words, synonymous mutations are not as frequent as may be expected for these viruses, and both $d_S$ (the synonymous substitution rate) and $d_N$ (the non-synonymous substitution rate) are low.

The paradox is most easily demonstrated for MV, where both the error rate and the level of sequence divergence in the field have been assessed (see Box 1). Even using the most conservative figures, the potential number of errors that could be generated during one year of endemic circulation of MV is 50 000-fold higher than what is actually observed in the field. Based on error rate estimates measured in the laboratory, the MV nucleotide sequence could be completely randomized six times during one year of replication in the field.

Six supplementary tables are available with the online Supplementary Material.
That this does not happen indicates that there are very strong constraints on sequence diversity. By analysing the usage of synonymous codons in the sequences of paramyxoviruses, I show that the pattern of usage is constant in a given genotype of these viruses. From this and other recently described constraints on RNA virus sequence variation, I propose the concept of codon constellation, which states that there is informational content in the usage of synonymous codons, codon pairing and the numbers and positions of rare codons in a viral RNA sequence, and that this content not only determines RNA stability and higher order structure but also affects the efficiency of translation and the correct folding of proteins.

070789 © 2015 The Author. Printed in Great Britain
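The arithmetic behind this paradox, as set out in Box 1, can be checked with a short calculation. The following is a minimal Python sketch that assumes the Box 1 figures exactly as quoted there; the constant and function names are illustrative and do not come from the paper.

```python
# Figures quoted in Box 1; the names below are illustrative only.
GENOME_SIZE_NT = 15894        # measles virus genome length (nt)
ERROR_RATE = 5e-6             # lowest in vitro estimate, per nt per replication
TRANSMISSIONS_PER_YEAR = 26   # one transmission event every 2 weeks
INPUT_PFU = 10                # assumed infecting dose (p.f.u.)
DOUBLINGS = 14                # one doubling per day for 14 days
OBSERVED_RATE = 4e-4          # field substitutions per nt per annum


def potential_changes_per_year():
    """Genome size x error rate x number of replications (Box 1)."""
    replications = TRANSMISSIONS_PER_YEAR * INPUT_PFU * 2 ** DOUBLINGS
    return GENOME_SIZE_NT * ERROR_RATE * replications


def observed_changes_per_year():
    """Observed field rate scaled to the whole genome."""
    return OBSERVED_RATE * GENOME_SIZE_NT


potential = potential_changes_per_year()
observed = observed_changes_per_year()
print(f"potential changes per annum: {potential:,.0f}")
print(f"potential changes per site:  {potential / GENOME_SIZE_NT:.1f}")
print(f"observed changes per annum:  {observed:.1f}")
print(f"potential / observed:        {potential / observed:,.0f}")
```

With these inputs the potential figure comes out at roughly 338 000 changes per annum (about 20 per site) against about 6 observed, a ratio of the order of 50 000, in line with the figures given in the text.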

Box 1. The paradox between mutation rates and field stability.

Potential replacement rate per annum = genome size × mutation rate* × number of replications**:

$$15\,894 \times 5 \times 10^{-6} \times (26 \times 10 \times 16\,384) \approx 338\,000 \text{ per annum}$$

i.e. ~20 changes per site or ~6 complete randomizations of the viral sequence.

Observed rate***:

$$4 \times 10^{-4} \text{ per nucleotide per annum} \times 15\,894 \approx 6 \text{ nt changes per annum}$$

*The error rate has been measured by passage of virus in vitro in cell culture to be between $9 \times 10^{-5}$ (Schrag et al., 1999) and $5 \times 10^{-6}$ (Zhang et al., 2013). The first figure was revised downwards to $4.4 \times 10^{-5}$ on theoretical grounds by Sanjuan et al. (2010).

**The number of replications required to sustain measles endemically per annum is estimated conservatively by assuming that 26 transmission events take place per annum (2 weeks between infection and transmission), multiplied by the number of replications occurring in each patient. This latter figure is essentially unknown but can be estimated as being very high: assuming a doubling of the input virus (assumed 10 p.f.u.) every 24 h (conservative) for 14 days, it is at least $10 \times 2^{14} = 10 \times 16\,384$. The real value is probably orders of magnitude higher.

***Rima et al. (1997), direct assessment: $4 \times 10^{-4}$; Pomeroy et al. (2008), from bioinformatics: $6.5 \times 10^{-4}$.

Constraints on the variation of cis-acting sequences

Some of the constraints on RNA virus sequence variation are well understood. Obviously, all RNA viruses need to maintain all the sequence motifs that are important in the replication of their genome, such as the promoters for binding of the RNA-dependent RNA polymerase (RdRp) at the genome termini. It is also obvious that, especially in positive strand RNA viruses, RNA secondary structures that provide replication and packaging signals need to remain conserved.
However, in the negative stranded paramyxoviruses, RNA secondary structures are not likely to be important in the genome and antigenome, since they replicate by direct copying of the RNA template bound to the nucleocapsid protein. In this ribonucleoprotein complex (RNP), the viral RNA is contained in a protected fold of the helical assembly of the nucleocapsid proteins (Desfosses et al., 2011). The genome in these RNPs is modelled to be single stranded and stretched out along the RNP. The Y forms of nucleocapsids as replicative intermediates, which have been visualized for MV (Thorne & Dermott, 1977), show that newly synthesized RNA is immediately encapsidated during replication. Of course, considerations about higher order RNA structure are applicable to the mRNA transcripts of the paramyxoviruses.

In the paramyxoviruses and other members of the order Mononegavirales, the transcription promoter at the 3′ end of the negative strand also needs to be conserved, as do the signals for the processing of the RNA transcripts by polyadenylation, the co-transcriptional editing signals and the signals that control the gradient of gene expression (Lamb & Parks, 2013). These latter requirements are not absolute. Changing the gene order of vesicular stomatitis virus (Ball et al., 1999) shows that it is possible to maintain a transmissible genetic entity, albeit in vitro. However, growth is very much compromised in these viruses and no natural variants of gene order have been observed. Members of Mononegavirales with a split genome, in which the coding sequence for the RdRp protein is placed on a separate replicating RNA molecule, which makes the virus functionally bipartite (Takeda et al., 2006), can also be propagated in cells in vitro.

Another potential constraint on sequence variation could be periodicity in the RNA genome, due to its interaction with nucleocapsid protein. In the Paramyxovirinae, it has been shown that the length of the genomes is usually a multiple of six.
The so-called rule of six (Calain & Roux, 1993) derives from the genome being packaged in blocks of six nucleotides, each associated with one copy of the nucleocapsid protein. This leads to each nucleotide being in a specific phase (one to six) with respect to the RNP, which has been shown to be important in gene transcription and editing (Hausmann et al., 1996; Iseni et al., 2002; Kolakofsky et al., 1998). Comparative sequence analysis for the morbilliviruses, however, has not indicated any preference of the four nucleotides A, G, C or U for phase positions one to six (Rima et al., 2005).

One potential further constraint on sequence variation is derived from the frequency of stop codons in the +1 and −1 reading frames being marginally higher than expected (Rima & McFerran, 1997). It would be energetically favourable for an out-of-frame ribosome to encounter a stop codon as soon as possible, so that it stops wasting energy on the generation of a non-functional protein. This has, so far, not been explored experimentally.

The constraints discussed above are easily understood and their importance has been functionally demonstrated in many experiments. Mutations in the cis-acting sequences of the viruses, including promoters, have been shown to be attenuating, deleterious or fully lethal. It is also clear that the sequences of the genomes of RNA viruses or their mRNAs are constrained by the fact that the functionality of the proteins must be maintained, as variations in the viral proteins affect their functionality and the overall fitness of the virus. These constraints have been reviewed exhaustively and will not be considered further here.

Earlier described constraints on dinucleotide frequencies in RNA viruses; CpG suppression

Despite the above considerations, the degeneracy of the genetic code would still allow more nucleotide sequence variation than is observed between field isolates. This indicates that there are further constraints, but these are probably less strict and allow levels of violation and tolerance that make them difficult to verify in vitro. Assessing the effect of breaking these rules may in many cases require extensive studies of the pathogenesis of viruses in animal models. Nevertheless, one such weaker constraint, the suppression of certain dinucleotide frequencies, was identified initially by bioinformatics and has now been verified experimentally. In all RNA virus genomes, it has been noted that the dinucleotide CpG is present much less frequently than expected from the G+C content of the genome (Rima & McFerran, 1997; Simmonds et al., 2013).
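To illustrate how such suppression can be quantified, the sketch below computes an observed/expected ratio for a dinucleotide. It is a minimal Python example under stated assumptions: the expected frequency is taken as the product of the two single-base frequencies, which is one common convention and not necessarily the exact statistic used in the cited studies, and the function name and toy sequence are hypothetical.

```python
from collections import Counter


def dinucleotide_odds_ratio(seq, dinuc):
    """Observed/expected frequency of `dinuc` in `seq`.

    Assumption: the expectation is the product of the frequencies of the
    two constituent bases (not the statistic of any particular paper).
    """
    seq = seq.upper().replace("T", "U")
    dinuc = dinuc.upper().replace("T", "U")
    n = len(seq)
    if n < 2 or len(dinuc) != 2:
        raise ValueError("need a sequence of >= 2 nt and a 2-nt query")
    base_counts = Counter(seq)
    # Count overlapping occurrences over the n - 1 dinucleotide positions.
    hits = sum(1 for i in range(n - 1) if seq[i:i + 2] == dinuc)
    observed = hits / (n - 1)
    expected = (base_counts[dinuc[0]] / n) * (base_counts[dinuc[1]] / n)
    return observed / expected if expected else float("nan")


# Made-up toy sequence, not taken from any viral genome.
toy = "AUGGCAUCAGGAACAUCUGCAGGAUCAACAGGCAUGA"
for pair in ("CG", "UA", "CA"):
    print(pair, round(dinucleotide_odds_ratio(toy, pair), 2))
```

A ratio well below 1 indicates that the dinucleotide is underrepresented relative to base composition; applying the function to a real genome would require reading the sequence from a file, which is left out here.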
For DNA genomes, CpG suppression has been explained by the fact that the cytosine base of the CpG is often methylated, and the methylation of CpG dinucleotides in promoter regions provides a form of epigenetic transcriptional control (Bird, 1980; Chandler & Jones, 1988). Such a methylated C can be deaminated to a T residue, giving rise to a mutation which may affect the epigenetic control or, if it occurs elsewhere in coding sequences, simply lead to a transition. Mammalian hosts discriminate against the high CpG-containing genomic DNA of bacteria and other parasites. The parasite recognition functions by stimulation of the innate immune system through pattern recognition receptors, especially the Toll-like receptors (TLRs). Recognition of the deoxydinucleotide CpG by TLR9 has made it into a pathogen-associated molecular pattern (PAMP) (Werling & Jungi, 2003). The ability of mammalian host cells to mount such a response requires that viruses avoid this motif.

CpG ribonucleotide pairs were also shown to be underrepresented in the genomes of RNA viruses, first in poliovirus RNA (Rothberg & Wimmer, 1981) and later in a wider set of RNA viruses including representatives of all the major families (Karlin et al., 1994). These authors concluded that, especially for RNA viruses which had no DNA intermediate in their replication cycle, the standard explanations for CpG suppression, i.e. parasite recognition and control of transcription by methylation of C in CpG in promoter regions, did not apply. Krieg (2000) and colleagues identified an immune-stimulatory DNA motif, RRCGYY (R is purine and Y is pyrimidine), that stimulated innate immunity through interaction with TLR9. In RNA viruses, it was also shown that this most immune-stimulatory oligonucleotide motif, RRCGYY, was observed almost three times less frequently in a dataset of 60 RNA viruses than the YYCGRR motif, similar to its suppression in the host genes (Rima & McFerran, 1997).
However, further analysis showed that the preference for YYCGRR was entirely based on the preference for YC and GR dinucleotides in the dataset of RNA virus sequences available at the time (Rima & McFerran, 1997).

The mechanisms of CpG RNA-induced innate immune responses remain unclear. The interaction of RNA viruses with the innate immune system through TLRs 3, 7 and 8 and others has been well reviewed (Jensen & Thomsen, 2012; Randall & Goodbourn, 2008), as this is a vibrant research area in virology. RNA oligonucleotides rich in GU were shown to be able to act as PAMPs signalling through TLR7 and TLR8 (Forsbach et al., 2008). Sugiyama et al. (2005) demonstrated that CpG ribodinucleotides were able to stimulate cells from the immune system directly, but they also showed that TLR8 was either not involved or required some hitherto unidentified cofactor, as HEK cells transfected with TLR8 did not respond to the CpG RNA motif. At present, the number of TLRs is 13 and growing, and the ligands for some of these have not yet been identified.

Krieg (1996) also commented on the skewed overrepresentation of CpG dinucleotides in the 5′ and 3′ LTRs of the HIV genome and their low frequency in the remainder of the genome. This was potentially explained by the fact that reverse transcribed DNA generated in the infected lymphocyte during HIV-1 replication could act as a PAMP. However, an alternative explanation for this distribution of CpG dinucleotides in HIV-1 is that it mirrors what is observed in RNA viruses.

Suppression of UpA dinucleotides in the sequences of RNA viruses

The frequency of UpA dinucleotides is also suppressed significantly, and sometimes to a greater extent than that of CpG (Rima & McFerran, 1997). In contrast to the CpG suppression, no potential explanation could be suggested at the time for the reduced frequency of UpA dinucleotides.
However, since 1997, it has been shown that the introduction of UpA dinucleotides reduces the stability of eukaryotic mRNA molecules (Duan & Antezana, 2003), and this has been suggested to be linked mechanistically to the action of RNase L, which is part of the interferon-induced antiviral activity and specifically cleaves mRNAs after UpA or UpU dinucleotides in the single stranded part of stem loop structures (Brennan-Laun et al., 2014). One such cleavage product of hepatitis C virus has been shown to be a potent inducer of interferon through RIG-I (Malathi et al., 2010).

A recent careful analysis by the Simmonds group (Atkinson et al., 2014) has now provided direct experimental evidence for the effects on replication of enhancing the frequencies of CpG and UpA dinucleotides in the picornavirus EV7. The authors emphasize the lack of correlation between replication and the frequencies of these dinucleotides, and that the mechanism by which this occurs and the type of evolutionary pressure that this provides are still far from clear. The suppression of CpG and UpA dinucleotides was shown in both positive and negative stranded RNA viruses, but the strand specificity is irrelevant as UpA and CpG are palindromic.

Other constraints on diversity in the RNA viruses; the need to avoid introduction of viral sequences complementary to cellular RNAs

An RNA virus infecting a cell enters the RNA world of that cell, and this imposes a substantial but not yet quantified constraint on the sequence of the virus, as it must avoid sequences that are complementary to those of all cellular RNAs. Double-stranded RNA is a PAMP recognized by TLR3, and in the cytoplasm by the helicases of the RIG-I, MDA5 and LGP2 family (Randall & Goodbourn, 2008). Furthermore, cellular enzymes such as DICER that recognize double-stranded RNA are able to generate antisense sequences that silence the expression of mRNAs through selective degradation. Thus, an infecting RNA virus needs to avoid complementarity to all cellular RNAs, such as microRNAs, mRNAs, snRNPs and snoRNAs, to avoid eliciting RNA interference (Li et al., 2013; Maillard et al., 2013). MicroRNA seed sequences are only 7 nt in length (Lewis et al., 2003); a match to any given seed is therefore expected about once in a random sequence of 16 000 nt ($4^7 = 16\,384$) and may hence be present by chance (if not evolved) in the larger RNA viruses.
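The chance-occurrence argument for 7 nt seed sequences can be made concrete with a short calculation. The sketch below is a minimal Python illustration that assumes a random sequence with equal base frequencies; the function names, the example seed and the toy target sequence are hypothetical and not taken from the cited work.

```python
SEED_LENGTH = 7  # nt, as quoted from Lewis et al. (2003) in the text


def expected_spacing(k=SEED_LENGTH, alphabet_size=4):
    """Average number of positions per chance match to one fixed k-mer,
    assuming equal and independent base frequencies."""
    return alphabet_size ** k


def expected_chance_matches(genome_length, k=SEED_LENGTH):
    """Expected number of chance matches to one fixed k-mer in a genome."""
    positions = max(genome_length - k + 1, 0)
    return positions / expected_spacing(k)


def reverse_complement(rna):
    """Reverse complement of an RNA string (A, C, G, U only)."""
    return rna.upper().translate(str.maketrans("ACGU", "UGCA"))[::-1]


def find_seed_sites(genome, seed):
    """0-based start positions in `genome` that are complementary to `seed`."""
    target = reverse_complement(seed)
    genome = genome.upper()
    return [i for i in range(len(genome) - len(target) + 1)
            if genome[i:i + len(target)] == target]


print(expected_spacing())                # 4 ** 7
print(expected_chance_matches(15894))    # measles-sized genome
print(find_seed_sites("AAAUACCUCAGGG", "UGAGGUA"))  # arbitrary example seed
```

For a genome the size of that of MV (15 894 nt), the expectation under these assumptions is just under one chance match per seed; real genomes have skewed base composition, so this is only an order-of-magnitude guide.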
The importance of this constraint has not been evaluated so far, but it has been demonstrated that the introduction of microRNA targets in influenza A virus can reduce viral growth in vitro and in vivo (Langlois et al., 2013).

The presence and location of rare codons and secondary structures affect the translation of mRNAs

Fig. 1 shows a compilation of the minor constraints that may be important in retaining the optimal translation of a viral mRNA. Besides the effects of the introduction of specific dinucleotides, it also shows the effect of the introduction of rare codons in an mRNA, specifically at the 5′ end of the ORF. Recently published data on the role of rare codons at the 5′ end of ORFs in bacteria have indicated that their presence may affect the overall level of translation by a hitherto unexpectedly high factor (Goodman et al., 2013), as modification of the rare codon content affected translation levels by 1 to 2 logs. The presence of rare codons at the 5′ end of an ORF has been suggested to lead to ribosome ramping, where a number of ribosomes are piled up at the start of the ORF before they take off translating the rest of the mRNA (Tuller et al., 2010). Rare codons, assuming that their corresponding tRNAs are also present in low concentrations in the cell, are by the law of mass action more likely to be mistranslated, and the error rate is raised to $10^{-3}$ or $10^{-4}$. The normal error rate of mischarging is estimated to be $10^{-5}$ (Ogle & Ramakrishnan, 2005). Rare codons may also lead to a local accumulation of stalled ribosomes on an mRNA template (Cannarozzi et al., 2010).

A further important element in the efficiency of translation and the quality of the resulting protein has been highlighted in studies with HIV-1, which indicate that higher order RNA structures in the genome (mRNA) of HIV-1 are important in linking the folding of a protein with the speed of translation (Watts et al., 2009).
They found that at the borders of important domains of the gag-pol polyprotein, as well as in the envelope protein, there are secondary and tertiary structures in the RNA that may slow down translation and thereby allow proper folding of the nascent protein before folding could be influenced by strings of amino acids downstream of the domain border. This again indicates that there is informational content in the mRNA that is not directly linked to coding specific amino acid sequences: the positions and numbers of rare codons, as well as higher order RNA structures, influence the rates and outcomes of translation of a message.

The complement of tRNAs in the cell

Another easily envisaged constraint on diversity is the need to align the codon usage and sequence of an RNA virus or its mRNAs with the tRNA complement of the infected cell to maximize the efficiency of translation. However, while easily comprehended, this constraint too is not easily demonstrated experimentally. Firstly, it is surprising (or shocking) how little we know about the host cell types that are the main cells infected in many important human and animal virus infections. Furthermore, the main infected cell types do not need to be the ones most pathologically important or important for transmission. The cell types that are most important in initial infection, replication and transmission are often another unknown. Even if we knew which cell types were important, we do not know the tRNA complement of individual cell types, although we do know that it varies between cell types (Dittmar et al., 2006). Over 450 tRNA genes have been annotated in the human genome and their relative expression in various cell types is not known. For example, in human hosts it has been demonstrated that mRNAs which are present in cells at high copy numbers and which encode highly expressed proteins have a codon usage that differs from that of other genes (Dittmar et al., 2006).
Viral sequences would be expected to be optimized for high level expression in infected cells, and one could envisage that their codon usage would be similar to that of highly expressed cellular proteins, but in the absence of knowledge of the specific cell types that are important and their tRNA complements, it is impossible to make any prediction of the severity of this constraint. Hence, it is difficult to evaluate the recent suggestion that viruses that grow in epithelial cell types evolve more rapidly (Hicks & Duffy, 2014).

Fig. 1. Compilation of factors that affect the efficiency of translation of an mRNA in the cell. The effects of codon pairing, ribosome ramping, the cellular tRNA complement and the position of rare codons, RNase sensitivity and structure as well as signalling to the innate immune system and the need to avoid RNA sequences complementary to cellular RNAs are indicated. [Figure not reproduced. Annotations on the schematic mRNA: avoid introduction of NNU-ANN or NNC-GNN pairs; Schlafen 11 protein counteracts modifications of the cellular tRNA complement generated by HIV-1 infection to aid translation of its own mRNAs with codons that prefer A or U in the third position as opposed to C in human mRNAs; ribosome ramping; RNA secondary structures slow down translation to allow independent folding of protein domains; codons which complement rare tRNAs cause a slow down of translation and ribosomes pile up, and mass action laws increase the mistranslation error rate to $10^{-3}$ to $10^{-4}$ (error of tRNA charging $10^{-5}$); viruses must avoid complementarity to microRNA seed sequences; viruses need to avoid complementarity to all types of cellular RNAs; introduction of UpA dinucleotides increases rates of mRNA degradation; introduction of CpG dinucleotides affects RNA secondary structure and signalling to the innate immune system.]
However, that the cellular trna complement is important has recently become clear, because it has been shown that HIV-1 modifies the trna content of infected cells and that this modification is counteracted by the cellular protein Schlafen-11 (Li et al., 2012) through binding to specific trnas. As a result, proteins with a codon usage that is similar to that of HIV-1, such as an unmodified version of the ORF encoding GFP, were translated more efficiently in infected cells than the enhanced version of GFP, which has a codon usage optimized for mammalian cells. These experiments show unambiguously that viruses are capable of affecting the trna complement in a cell to increase the translation efficiency of their own mrnas. Codon order can affect translation efficiency A further set of recent experiments has indicated that virus growth is impaired and the level of expression of viral proteins is reduced if one alters the order of codons in a viral mrna. This has been demonstrated directly, using poliovirus as well as influenza virus. In the latter case, changing the codon order but keeping the sequence of the protein as well as the overall codon usage unchanged, but introducing less preferred codon pairs, reduced expression of the HA protein in engineered viruses, and this has been proposed as a rational method for attenuation of viruses (Yang et al., 2013). The same has been achieved earlier with poliovirus by the same group (Wimmer & Paul, 2011). Changing the codon order can, of course, influence higher order RNA structures that could be important as described http://vir.sgmjournals.org 943

Table 1. Comparison of the CUPs of some viruses of cow and man versus the CUPs of their hosts

Codon      Human  hPIV3 average  Polio  Bovine  bPIV3 average  FMDV  BEV   TBEV
Ala - GCU  1.05   1.07           1.24   1.00    1.04           0.86  0.80  1.08
Ala - GCC  1.61   0.53           0.87   1.71    0.50           1.54  1.38  1.16
Ala - GCA  0.90   2.26           1.47   0.80    2.20           0.99  1.38  1.12
Ala - GCG  0.43   0.14           0.42   0.48    0.25           0.62  0.44  0.65
Arg - CGU  0.50   0.15           0.44   0.49    0.24           1.03  0.74  0.39
Arg - CGC  1.15   0.12           0.44   1.17    0.08           1.83  1.14  0.80
Arg - CGA  0.67   0.62           0.19   0.68    0.36           0.46  0.17  0.66
Arg - CGG  1.25   0.15           0.44   1.32    0.21           0.51  0.34  0.66
Arg - AGA  1.21   3.92           3.00   1.14    3.70           1.14  1.77  1.85
Arg - AGG  1.21   1.03           1.50   1.20    1.41           1.03  1.83  1.63
Asn - AAU  0.92   1.38           0.80   0.81    1.35           0.19  0.87  0.65
Asn - AAC  1.08   0.62           1.20   1.19    0.65           1.81  1.13  1.35
Asp - GAU  0.92   1.45           0.92   0.84    1.36           0.55  1.14  0.82
Asp - GAC  1.08   0.55           1.08   1.16    0.64           1.45  0.86  1.18
Cys - UGU  0.89   1.53           1.10   0.85    1.42           0.86  0.62  1.11
Cys - UGC  1.11   0.47           0.90   1.15    0.58           1.14  1.38  0.89
Gln - CAA  0.51   1.34           0.94   0.46    1.28           0.88  1.02  0.72
Gln - CAG  1.49   0.66           1.06   1.54    0.72           1.12  0.98  1.28
Glu - GAA  0.83   1.49           1.14   0.78    1.36           0.78  0.85  0.78
Glu - GAG  1.17   0.51           0.86   1.22    0.64           1.22  1.15  1.22
Gly - GGU  0.65   0.88           1.04   0.64    0.84           0.72  1.11  0.75
Gly - GGC  1.37   0.33           0.73   1.43    0.32           1.45  1.11  0.76
Gly - GGA  0.98   2.06           1.32   0.95    1.93           1.14  0.92  1.45
Gly - GGG  0.99   0.73           0.90   0.99    0.91           0.70  0.87  1.03
His - CAU  0.82   1.49           0.71   0.75    1.34           0.33  0.94  0.93
His - CAC  1.18   0.51           1.29   1.25    0.66           1.67  1.06  1.07
Ile - AUU  1.07   0.95           1.17   0.98    0.91           1.01  1.44  0.76
Ile - AUC  1.45   0.66           1.08   1.57    0.63           1.93  1.06  1.43
Ile - AUA  0.48   1.39           0.74   0.45    1.46           0.06  0.50  0.81
Leu - UUA  0.44   2.14           0.77   0.38    2.22           0.00  0.44  0.23
Leu - UUG  0.76   0.87           1.27   0.71    0.83           0.82  0.99  1.33
Leu - CUU  0.77   0.81           0.84   0.70    0.89           0.85  1.77  0.84
Leu - CUC  1.17   0.53           0.87   1.26    0.49           2.09  1.06  1.06
Leu - CUA  0.42   1.11           1.11   0.36    1.09           0.27  0.82  0.32
Leu - CUG  2.44   0.55           1.14   2.59    0.49           1.97  0.92  2.21
Lys - AAA  0.84   1.42           1.02   0.78    1.27           0.87  0.65  0.91
Lys - AAG  1.16   0.58           0.98   1.22    0.73           1.13  1.35  1.09
Phe - UUU  0.91   1.18           0.93   0.85    1.21           0.60  0.85  0.83
Phe - UUC  1.09   0.82           1.07   1.15    0.79           1.40  1.15  1.17
Pro - CCU  1.13   1.44           1.01   1.08    1.41           1.10  1.01  1.10
Pro - CCC  1.31   0.42           0.67   1.39    0.49           1.45  0.97  0.89
Pro - CCA  1.09   1.85           1.88   1.00    1.81           0.79  1.29  1.28
Pro - CCG  0.46   0.29           0.44   0.53    0.30           0.66  0.73  0.73
Ser - UCU  1.10   1.24           0.79   1.04    1.04           0.76  1.15  0.73
Ser - UCC  1.31   0.38           1.27   1.37    0.50           1.78  1.37  0.70
Ser - UCA  0.88   2.44           1.91   0.79    2.30           0.93  1.15  1.32
Ser - UCG  0.34   0.23           0.40   0.39    0.29           0.89  0.31  0.55
Ser - AGU  0.90   1.06           0.83   0.87    1.15           0.40  0.75  1.32
Ser - AGC  1.46   0.65           0.79   1.53    0.72           1.24  1.28  1.38
Thr - ACU  0.97   1.09           1.25   0.89    1.12           0.84  1.29  0.65
Thr - ACC  1.45   0.46           1.25   1.55    0.55           1.61  1.56  1.18
Thr - ACA  1.12   2.22           1.18   1.01    2.16           0.89  0.75  1.27
Thr - ACG  0.47   0.22           0.33   0.56    0.18           0.65  0.40  0.89
Tyr - UAU  0.87   1.40           0.82   0.79    1.44           0.32  0.84  0.78
Tyr - UAC  1.13   0.60           1.18   1.21    0.56           1.68  1.16  1.22
Val - GUU  0.71   1.31           0.64   0.64    1.36           1.07  1.11  0.71

Table 1. cont.

Codon      Human  hPIV3 average  Polio  Bovine  bPIV3 average  FMDV  BEV   TBEV
Val - GUC  0.95   0.66           0.84   1.01    0.70           1.36  1.06  1.13
Val - GUA  0.46   1.29           0.81   0.40    1.35           0.16  0.61  0.17
Val - GUG  1.88   0.74           1.71   1.95    0.59           1.42  1.22  1.99

Key: RSCU is 1.50 or higher; RSCU is between 1.25 and 1.49; RSCU is between 0.75 and 0.50; RSCU is 0.50 or lower.

above, and indeed their analysis identified a new structural element important in poliovirus replication. Codon usage and codon-pair context have also been identified as important features of gene primary structure that influence the decoding fidelity of an mRNA (Moura et al., 2007). Large-scale comparative codon-pair context analysis has demonstrated the existence of general and species-specific codon-pair context rules, which govern the evolution of mRNAs in the three domains of life. Fundamental differences exist between prokaryotic and eukaryotic mRNA decoding rules, and these are partially independent of codon usage. Moura et al. (2007) found that, whilst the evolution of eubacterial and archaeal mRNA primary structure depends mainly on constraints imposed by the translational machinery, in eukaryotes DNA methylation and trinucleotide repeats impose strong biases on codon-pair context. Methylation is probably irrelevant for RNA viruses, but the strong bias against repeating the same codon in a pair may be important.

Cannarozzi et al. (2010) studied the location of synonymous codons in yeast in relation to their ability to interact with specific tRNAs, and demonstrated the phenomenon of codon correlation. Their analysis is based on assessing which tRNAs can decode the synonymous codons in a (non-consecutive) pair of codons for a given amino acid. If the two codons can be decoded by the same tRNA, they call the pair autocorrelated.
They show that ORFs preferentially reuse synonymous codons that are decoded by the same tRNA if these occur within a certain distance of each other. Their results established that sequences supporting tRNA reuse are expressed more efficiently (by up to 30 %) than sequences that require a different tRNA. Bioinformatic analysis showed that this phenomenon also occurs in the human genome and that the correlation is strongest in highly expressed genes (Cannarozzi et al., 2010); it may therefore be relevant for viruses.

The suppression of the dinucleotides CpG and UpA is also important in codon pairing in these viruses. For the human protein DRD2, the chance that a random reordering of codons that maintained the same amino acid sequence would decrease or preserve the CpG frequency was found to be <0.0001 (Duan & Antezana, 2003).

Studies of positive-strand RNA viruses demonstrated the existence of genome-scale ordered RNA secondary structures (GORS) and showed that their presence was associated with a persistent phenotype of the virus (Davis et al., 2008; Simmonds et al., 2004). In a recent study on norovirus, the attempted removal of one of these structures highlighted the need to avoid changing codon usage, although it is then difficult to avoid changing the occurrence of codon pairs, the potential stability and the local secondary structures of the RNA (McFadden et al., 2013). These results showed that removing such a structure led to attenuation in vivo in mouse models. Virus that persisted in the mice was shown to have acquired synonymous mutations that led to greater levels of potential secondary structure. More recently, studies from the same groups showed that artificial constructs with GORS remarkably reduced the signalling of IFN induction mediated through PKR activation (Witteveldt et al., 2014).
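The kind of reordering test reported for DRD2 can be illustrated with a small permutation sketch. This is not the procedure of Duan & Antezana (2003); it is a minimal illustration that assumes codons are permuted only among positions encoding the same amino acid, so that the protein sequence and the overall codon usage stay fixed while codon order, and hence CpG at codon junctions, changes. The function names, the toy ORF and the number of trials are hypothetical, and sequences are written in the DNA alphabet (T in place of U).

```python
import random
from collections import Counter, defaultdict

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def split_codons(orf):
    """Split an in-frame sequence into codons, dropping any trailing partial codon."""
    return [orf[i:i + 3] for i in range(0, len(orf) - len(orf) % 3, 3)]


def translate(orf):
    """Translate an ORF with the standard code ('*' marks a stop codon)."""
    return "".join(CODON_TABLE[codon] for codon in split_codons(orf))


def shuffle_synonymous(orf, rng):
    """Permute codons among positions that encode the same amino acid."""
    codons = split_codons(orf)
    positions = defaultdict(list)
    for index, codon in enumerate(codons):
        positions[CODON_TABLE[codon]].append(index)
    shuffled = list(codons)
    for indexes in positions.values():
        pool = [codons[i] for i in indexes]
        rng.shuffle(pool)
        for i, codon in zip(indexes, pool):
            shuffled[i] = codon
    return "".join(shuffled)


def count_cpg(seq):
    """Count CpG dinucleotides, including those spanning codon junctions."""
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")


def fraction_not_above(orf, trials=1000, seed=0):
    """Fraction of reorderings whose CpG count is at or below the observed count."""
    rng = random.Random(seed)
    observed = count_cpg(orf)
    hits = sum(count_cpg(shuffle_synonymous(orf, rng)) <= observed
               for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    toy_orf = "ATGGCAGCCGGAGGCCGTAGAAGACTGTTAGCAGCCTAA"  # hypothetical sequence
    print(count_cpg(toy_orf), fraction_not_above(toy_orf, trials=200))
```

A small fraction for a real ORF would suggest that the natural codon order carries fewer CpGs than most reorderings, which is the sense of the <0.0001 figure quoted above.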
These analyses focused on positive-strand RNA viruses, but similar arguments could well apply to the mRNAs of some of the negative-strand viruses, as some of these, e.g. the mRNA that encodes the RdRp, are of similar length to the picornavirus RNA genomes.

Codon usage patterns (CUPs) in RNA viruses

These observations on the potential importance of codon order in RNA viruses prompted the following initial analysis of codon usage by RNA viruses, to see to what extent it mirrored that of their hosts. To this end, the relative synonymous codon usage (RSCU) has been calculated for each codon in the complete set of (fused) ORFs in a given virus, as described previously (Sharp & Li, 1986). For example, arginine can be encoded in the standard codon assignment by six codons (CGN, AGA and AGG). The total number of arginine residues in the ORF has been counted and divided by six to calculate how many codons of each type would be expected if their usage were random and unbiased; the number of codons of a specific type has then been divided by this expected number to derive the RSCU. The listing of all the RSCUs in the tables for the ORF(s) in a given virus is then referred to as the CUP. Table 1 shows that, for a small selection of RNA viruses, codon usage is very different from that of their host (see Table S1, available in the online Supplementary Material,

Table 2. CUPs of paramyxoviruses

Table 2(a). RSCU values

Codon      Human  15-MV  10-MuV  9-PIV5  Akimota 1  Tupaia  Menangle  7-hPIV3  6-bPIV3  4-NDV
Ala - GCU  1.05   1.18   1.04    1.18    1.14       1.33    1.14      1.07     1.04     0.92
Ala - GCC  1.61   0.96   0.75    0.61    0.78       0.68    0.72      0.53     0.50     0.89
Ala - GCA  0.90   1.55   1.80    2.00    1.94       1.82    1.96      2.26     2.20     1.72
Ala - GCG  0.43   0.31   0.40    0.21    0.14       0.17    0.18      0.14     0.25     0.48
Arg - CGU  0.50   0.35   0.48    0.48    0.38       0.63    0.48      0.15     0.24     0.46
Arg - CGC  1.15   0.28   0.42    0.57    0.46       0.28    0.45      0.12     0.08     0.48
Arg - CGA  0.67   0.59   0.98    0.96    0.51       0.71    0.53      0.62     0.36     0.60
Arg - CGG  1.25   0.57   0.46    0.52    0.36       0.35    0.53      0.15     0.21     0.71
Arg - AGA  1.21   2.22   2.19    1.85    2.97       2.60    2.36      3.92     3.70     1.90
Arg - AGG  1.21   1.99   1.47    1.63    1.32       1.44    1.65      1.03     1.41     1.85
Asn - AAU  0.92   1.04   1.34    1.50    1.33       1.28    1.31      1.38     1.35     1.13
Asn - AAC  1.08   0.96   0.66    0.50    0.67       0.72    0.69      0.62     0.65     0.87
Asp - GAU  0.92   1.10   1.30    1.41    1.44       1.33    1.35      1.45     1.36     1.01
Asp - GAC  1.08   0.90   0.70    0.59    0.56       0.67    0.65      0.55     0.64     0.99
Cys - UGU  0.89   0.87   1.02    1.19    1.15       1.18    1.03      1.53     1.42     1.07
Cys - UGC  1.11   1.13   0.98    0.81    0.85       0.82    0.97      0.47     0.58     0.93
Gln - CAA  0.51   1.14   1.28    1.13    1.21       1.27    1.17      1.34     1.28     0.90
Gln - CAG  1.49   0.86   0.72    0.87    0.79       0.73    0.83      0.66     0.72     1.10
Glu - GAA  0.83   0.91   1.00    1.04    0.96       1.42    1.04      1.49     1.36     0.87
Glu - GAG  1.17   1.09   1.00    0.96    1.04       0.58    0.96      0.51     0.64     1.13
Gly - GGU  0.65   0.88   1.15    1.08    1.18       1.17    0.92      0.88     0.84     0.79
Gly - GGC  1.37   0.68   0.80    0.63    0.45       0.51    0.51      0.33     0.32     0.81
Gly - GGA  0.98   1.16   1.30    1.49    1.47       1.46    1.52      2.06     1.93     1.03
Gly - GGG  0.99   1.28   0.74    0.81    0.90       0.87    1.04      0.73     0.91     1.38
His - CAU  0.82   1.07   1.19    1.31    1.36       1.27    1.28      1.49     1.34     1.22
His - CAC  1.18   0.93   0.81    0.69    0.64       0.73    0.72      0.51     0.66     0.78
Ile - AUU  1.07   0.90   1.13    1.07    1.38       1.40    1.10      0.95     0.91     0.89
Ile - AUC  1.45   1.21   0.92    1.18    0.83       0.69    0.88      0.66     0.63     1.14
Ile - AUA  0.48   0.89   0.95    0.75    0.79       0.91    1.02      1.39     1.46     0.97
Leu - UUA  0.44   0.73   1.25    1.19    1.40       1.33    1.10      2.14     2.22     0.91
Leu - UUG  0.76   0.99   1.16    0.99    1.21       1.15    1.00      0.87     0.83     0.91
Leu - CUU  0.77   0.98   1.01    0.97    1.15       1.21    1.04      0.81     0.89     1.00
Leu - CUC  1.17   1.04   0.79    0.82    0.56       0.63    0.89      0.53     0.49     1.06
Leu - CUA  0.42   0.94   1.11    1.12    0.94       1.07    1.04      1.11     1.09     1.00
Leu - CUG  2.44   1.32   0.68    0.81    0.75       0.62    0.94      0.55     0.49     1.12
Lys - AAA  0.84   0.88   1.01    0.99    1.15       1.31    1.01      1.42     1.27     0.84
Lys - AAG  1.16   1.12   0.99    1.01    0.85       0.69    0.99      0.58     0.73     1.16
Phe - UUU  0.91   0.94   1.01    1.04    1.29       1.18    1.07      1.18     1.21     0.84
Phe - UUC  1.09   1.06   0.99    0.96    0.71       0.82    0.93      0.82     0.79     1.16
Pro - CCU  1.13   1.20   1.37    1.00    1.46       1.59    1.15      1.44     1.41     1.15
Pro - CCC  1.31   1.16   0.94    0.64    0.71       0.32    0.85      0.42     0.49     0.74
Pro - CCA  1.09   1.04   1.32    1.75    1.61       1.71    1.34      1.85     1.81     1.33
Pro - CCG  0.46   0.60   0.37    0.62    0.22       0.38    0.65      0.29     0.30     0.78
Ser - UCU  1.10   0.99   1.22    1.14    1.44       1.53    1.38      1.24     1.04     1.43
Ser - UCC  1.31   0.90   0.92    0.77    0.51       0.42    0.75      0.38     0.50     0.83
Ser - UCA  0.88   1.63   1.68    1.71    1.89       1.47    1.41      2.44     2.30     1.39
Ser - UCG  0.34   0.39   0.26    0.35    0.29       0.27    0.31      0.23     0.29     0.43
Ser - AGU  0.90   0.92   1.12    1.31    1.16       1.26    1.28      1.06     1.15     0.80
Ser - AGC  1.46   1.18   0.80    0.73    0.71       1.06    0.88      0.65     0.72     1.11
Thr - ACU  0.97   1.16   1.35    1.44    1.53       1.42    1.39      1.09     1.12     1.13
Thr - ACC  1.45   1.08   0.91    0.82    0.54       0.75    0.93      0.46     0.55     0.95
Thr - ACA  1.12   1.54   1.57    1.55    1.84       1.64    1.49      2.22     2.16     1.44
Thr - ACG  0.47   0.21   0.17    0.19    0.10       0.19    0.20      0.22     0.18     0.48
Tyr - UAU  0.87   0.88   1.19    1.21    1.28       1.22    1.25      1.40     1.44     0.99

Table 2. cont.

Table 2(a). RSCU values (cont.)

Codon      Human  15-MV  10-MuV  9-PIV5  Akimota 1  Tupaia  Menangle  7-hPIV3  6-bPIV3  4-NDV
Tyr - UAC  1.13   1.12   0.81    0.79    0.72       0.78    0.75      0.60     0.56     1.01
Val - GUU  0.71   1.09   1.30    0.96    1.27       1.54    0.90      1.31     1.36     0.58
Val - GUC  0.95   1.14   0.86    0.89    0.75       0.59    0.96      0.66     0.70     1.12
Val - GUA  0.46   0.78   1.07    1.04    1.15       1.06    1.12      1.29     1.35     1.04
Val - GUG  1.88   0.99   0.76    1.11    0.83       0.81    1.02      0.74     0.59     1.26

Table 2(b). RSCU values

Codon      Human  RSV A  RSV B  PVM   hMPV A  hMPV B  TRTV  Hendra  Nipah  Ferla  Beilong
Ala - GCU  1.05   1.47   1.51   1.65  1.18    1.14    1.12  1.52    1.53   1.09   1.75
Ala - GCC  1.61   0.58   0.54   0.92  0.48    0.50    0.76  0.57    0.63   0.74   0.75
Ala - GCA  0.90   1.88   1.83   1.34  2.15    2.20    1.99  1.65    1.55   1.62   1.19
Ala - GCG  0.43   0.06   0.13   0.08  0.19    0.15    0.13  0.26    0.30   0.55   0.30
Arg - CGU  0.50   0.43   0.46   0.32  0.20    0.24    0.16  0.35    0.46   0.33   0.43
Arg - CGC  1.15   0.15   0.15   0.08  0.20    0.00    0.03  0.24    0.07   0.08   0.52
Arg - CGA  0.67   0.31   0.38   0.29  0.33    0.17    0.19  0.59    0.58   0.38   0.52
Arg - CGG  1.25   0.23   0.15   0.21  0.20    0.17    0.21  0.12    0.17   0.30   0.47
Arg - AGA  1.21   4.14   3.86   2.95  3.93    4.06    2.91  3.19    3.34   2.84   2.21
Arg - AGG  1.21   0.74   0.99   2.15  1.15    1.36    2.51  1.51    1.38   2.08   1.85
Asn - AAU  0.92   1.27   1.28   0.95  1.18    1.19    1.10  1.35    1.31   1.16   1.16
Asn - AAC  1.08   0.73   0.72   1.05  0.82    0.81    0.90  0.65    0.69   0.84   0.84
Asp - GAU  0.92   1.53   1.42   1.28  1.20    1.18    0.99  1.34    1.24   1.25   1.28
Asp - GAC  1.08   0.47   0.58   0.72  0.80    0.82    1.01  0.66    0.76   0.75   0.72
Cys - UGU  0.89   1.20   1.15   1.27  1.21    1.27    1.19  1.22    1.37   1.41   1.12
Cys - UGC  1.11   0.80   0.85   0.73  0.79    0.73    0.81  0.78    0.63   0.59   0.88
Gln - CAA  0.51   1.66   1.48   1.09  1.41    1.23    1.12  1.17    1.24   1.11   0.78
Gln - CAG  1.49   0.34   0.52   0.91  0.59    0.77    0.88  0.83    0.76   0.89   1.22
Glu - GAA  0.83   1.58   1.52   0.91  1.48    1.35    1.12  1.12    1.13   0.94   0.85
Glu - GAG  1.17   0.42   0.48   1.09  0.52    0.65    0.88  0.88    0.87   1.06   1.15
Gly - GGU  0.65   1.39   1.39   1.44  1.06    1.12    0.70  1.04    0.94   0.94   1.09
Gly - GGC  1.37   0.62   0.53   0.89  0.66    0.37    0.65  0.51    0.50   0.62   0.62
Gly - GGA  0.98   1.46   1.64   0.98  1.41    1.53    1.32  1.35    1.50   1.18   1.08
Gly - GGG  0.99   0.53   0.44   0.69  0.87    0.98    1.32  1.11    1.06   1.26   1.22
His - CAU  0.82   1.42   1.45   1.27  1.41    1.18    1.15  1.43    1.18   1.22   1.18
His - CAC  1.18   0.58   0.55   0.73  0.59    0.82    0.85  0.57    0.82   0.78   0.82
Ile - AUU  1.07   0.77   0.83   0.93  0.85    0.96    0.86  1.02    1.04   0.79   1.23
Ile - AUC  1.45   0.58   0.56   0.67  0.66    0.50    0.82  0.99    0.93   1.08   0.83
Ile - AUA  0.48   1.65   1.61   1.40  1.49    1.54    1.32  0.99    1.03   1.13   0.95
Leu - UUA  0.44   2.27   2.26   1.48  2.36    2.18    1.14  1.13    1.07   1.16   0.87
Leu - UUG  0.76   1.01   0.84   1.23  0.71    0.76    1.36  1.30    1.17   1.19   0.94
Leu - CUU  0.77   0.80   0.95   0.94  0.67    0.74    0.88  1.04    1.10   0.81   1.39
Leu - CUC  1.17   0.52   0.36   0.68  0.47    0.39    0.45  0.83    0.68   0.68   0.81
Leu - CUA  0.42   1.00   1.17   0.84  1.17    1.34    0.99  0.88    1.06   1.10   0.94
Leu - CUG  2.44   0.41   0.42   0.84  0.63    0.59    1.19  0.83    0.91   1.07   1.05
Lys - AAA  0.84   1.50   1.49   1.15  1.45    1.51    1.23  1.05    1.18   0.99   0.93
Lys - AAG  1.16   0.50   0.51   0.85  0.55    0.49    0.77  0.95    0.82   1.01   1.07
Phe - UUU  0.91   1.06   1.20   1.07  1.16    1.14    1.15  1.07    1.13   0.79   1.14
Phe - UUC  1.09   0.94   0.80   0.93  0.84    0.86    0.85  0.93    0.87   1.21   0.86
Pro - CCU  1.13   1.60   1.58   1.79  1.48    1.27    1.09  1.42    1.64   1.21   1.94
Pro - CCC  1.31   0.83   0.63   0.87  0.54    0.68    0.73  0.68    0.71   0.80   0.70
Pro - CCA  1.09   1.53   1.69   1.08  1.80    1.84    1.82  1.42    1.19   1.23   0.88
Pro - CCG  0.46   0.05   0.10   0.26  0.19    0.22    0.36  0.47    0.47   0.76   0.48
Ser - UCU  1.10   1.13   1.08   1.19  1.06    1.03    0.88  1.31    1.38   1.20   1.31
Ser - UCC  1.31   0.65   0.56   0.76  0.49    0.45    0.52  0.65    0.58   0.82   0.72

Table 2. cont.

Table 2(b). RSCU values (cont.)

Codon      Human  RSV A  RSV B  PVM   hMPV A  hMPV B  TRTV  Hendra  Nipah  Ferla  Beilong
Ser - UCA  0.88   1.45   1.65   1.50  1.69    1.76    1.75  1.82    1.81   1.67   1.50
Ser - UCG  0.34   0.09   0.09   0.15  0.07    0.10    0.03  0.25    0.29   0.41   0.41
Ser - AGU  0.90   1.70   1.62   1.35  1.47    1.66    1.14  1.23    1.16   1.03   1.12
Ser - AGC  1.46   0.99   1.00   1.05  1.23    1.00    1.67  0.75    0.79   0.88   0.93
Thr - ACU  0.97   1.07   1.10   1.36  1.07    1.00    0.66  1.33    1.43   1.12   1.43
Thr - ACC  1.45   0.75   0.87   0.84  0.51    0.83    0.79  0.68    0.68   0.89   0.79
Thr - ACA  1.12   2.11   1.93   1.65  2.26    2.09    2.42  1.72    1.71   1.54   1.57
Thr - ACG  0.47   0.06   0.10   0.14  0.16    0.08    0.13  0.27    0.19   0.45   0.21
Tyr - UAU  0.87   1.41   1.25   1.17  1.27    1.36    1.13  1.14    1.16   0.95   1.15
Tyr - UAC  1.13   0.59   0.75   0.83  0.73    0.64    0.88  0.86    0.84   1.05   0.85
Val - GUU  0.71   1.25   1.09   1.16  1.19    1.24    1.15  1.15    1.44   1.06   1.29
Val - GUC  0.95   0.54   0.65   0.71  0.59    0.50    0.67  1.02    0.75   1.08   0.79
Val - GUA  0.46   1.29   1.49   0.87  1.37    1.49    1.00  0.78    1.05   0.95   0.69
Val - GUG  1.88   0.92   0.77   1.27  0.85    0.78    1.18  1.05    0.75   0.91   1.23

Key: RSCU is 1.50 or higher; RSCU is between 1.25 and 1.49; RSCU is between 0.75 and 0.50; RSCU is 0.50 or lower.

for the accession numbers of virus sequences analysed). The CUPs of neither bovine enterovirus (BEV) nor bovine parainfluenza virus type 3 (bPIV3) resemble those of the bovine host. Similarly, the CUPs of poliovirus and human parainfluenza virus type 3 (hPIV3) do not resemble those of the human host. In neither comparison is there any tendency for the human viruses to be more like the human host, or for the bovine viruses to resemble the bovine CUP. Tick-borne encephalitis virus (TBEV) has also been included in Table 1, as it has to be able to use human as well as tick host cells, and its CUP is different again. This virus also has a remarkably high level of UpA suppression, with an odds ratio of 0.39 (Rima & McFerran, 1997). The CUP of TBEV reflects this bias, as the UUA, CUA and GUA codons have very low RSCUs (Table 1).
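The text quotes dinucleotide odds ratios without defining them. A minimal sketch follows, assuming the commonly used definition (observed dinucleotide frequency divided by the product of the two mononucleotide frequencies, so that values below 1 indicate suppression); whether Rima & McFerran (1997) used exactly this form is not stated here. The function name and the example sequences are hypothetical.

```python
from collections import Counter


def dinucleotide_odds_ratio(seq, dinuc):
    """Observed/expected ratio for one dinucleotide in a single strand.

    Assumed definition: f(XY) / (f(X) * f(Y)), where f(XY) is taken over
    all overlapping dinucleotide positions. DNA input (T) is read as RNA (U).
    """
    seq = seq.upper().replace("T", "U")
    dinuc = dinuc.upper().replace("T", "U")
    if len(dinuc) != 2 or len(seq) < 2:
        raise ValueError("need a dinucleotide and a sequence of length >= 2")
    n = len(seq)
    mono = Counter(seq)
    observed = sum(1 for i in range(n - 1) if seq[i:i + 2] == dinuc)
    f_xy = observed / (n - 1)
    f_x = mono[dinuc[0]] / n
    f_y = mono[dinuc[1]] / n
    if f_x == 0 or f_y == 0:
        return float("nan")  # ratio is undefined if either base is absent
    return f_xy / (f_x * f_y)


if __name__ == "__main__":
    # Toy sequences, not viral genomes.
    print(dinucleotide_odds_ratio("UAUAUAUA", "UA"))
    print(dinucleotide_odds_ratio("ACGU", "CpG".replace("p", "")))
```

On this definition, the values of 0.39 and 0.44 quoted for TBEV and FMDV would mean that UpA occurs at well under half the frequency expected from base composition alone.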
The CUP of foot-and-mouth disease virus (FMDV), which has a UpA odds ratio of 0.44, also shows extremely low RSCUs for AUA, CUA and GUA. The UUA codon is never used in this virus, and only two of ninety-eight Ile residues are encoded by AUA.

A comprehensive analysis of CUPs in the paramyxoviruses

The restricted set of data in Table 1 indicated that there are specific viral CUPs. To analyse this in more detail, the CUPs of a large number of members of the family Paramyxoviridae have been analysed to study variations at the family, subfamily, genus and species (clade/genotype) levels. The family has been chosen because its members are strictly cytoplasmic RNA viruses, and their evolution has been studied extensively because the family contains clinically important human viruses and vaccines. For the calculation of the RSCUs, small overlapping ORFs have been ignored, as their small size means that they do not substantially affect the CUPs. Table 2 shows that each of the members of the Paramyxoviridae has a different CUP. The table shows mean values of the CUPs for 15 MV, 10 mumps virus (MuV), 9 parainfluenza virus type 5 (PIV5), 7 human and 6 bovine parainfluenza virus type 3 (h/bPIV3) and 4 Newcastle disease virus (NDV) strains, based on the individual CUPs of viruses from different genotypes. The CUPs of the paramyxoviruses have been compared here to the human CUP, because many of these viruses cause infections in human hosts, and stability in the field is often measured only in human epidemics. Whilst the human CUP has a preference for codons ending in C (NNC), the paramyxoviruses generally prefer NNA codons. This is the case even among the strongly suppressed CGN codons for arginine, whose suppression leads to the vast overrepresentation of the AGA and AGG arginine codons. Within the group of CGN codons, CGA is less severely suppressed than the CGU and CGG codons and the extremely rare CGC codon.
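The RSCU values behind these comparisons follow the calculation of Sharp & Li (1986) described earlier for arginine: the count of each codon is divided by the count expected if all synonymous codons for that amino acid were used equally. A minimal sketch of that calculation is given below; the function name and the toy ORF are hypothetical, and Met, Trp and stop codons are omitted because they are absent from the tables.

```python
from collections import Counter

# Synonymous codon families for the 18 amino acids listed in the tables.
SYNONYMOUS = {
    "Ala": ["GCU", "GCC", "GCA", "GCG"],
    "Arg": ["CGU", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "Asn": ["AAU", "AAC"],
    "Asp": ["GAU", "GAC"],
    "Cys": ["UGU", "UGC"],
    "Gln": ["CAA", "CAG"],
    "Glu": ["GAA", "GAG"],
    "Gly": ["GGU", "GGC", "GGA", "GGG"],
    "His": ["CAU", "CAC"],
    "Ile": ["AUU", "AUC", "AUA"],
    "Leu": ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG"],
    "Lys": ["AAA", "AAG"],
    "Phe": ["UUU", "UUC"],
    "Pro": ["CCU", "CCC", "CCA", "CCG"],
    "Ser": ["UCU", "UCC", "UCA", "UCG", "AGU", "AGC"],
    "Thr": ["ACU", "ACC", "ACA", "ACG"],
    "Tyr": ["UAU", "UAC"],
    "Val": ["GUU", "GUC", "GUA", "GUG"],
}


def rscu(orf):
    """Return {codon: RSCU} for an in-frame ORF (or several fused ORFs).

    RSCU = observed count / (family total / number of synonymous codons).
    Amino acids that do not occur in the ORF are left out of the result.
    """
    rna = orf.upper().replace("T", "U")
    codons = [rna[i:i + 3] for i in range(0, len(rna) - len(rna) % 3, 3)]
    counts = Counter(codons)
    values = {}
    for family in SYNONYMOUS.values():
        total = sum(counts[codon] for codon in family)
        if total == 0:
            continue
        expected = total / len(family)  # count per codon under unbiased usage
        for codon in family:
            values[codon] = counts[codon] / expected
    return values


if __name__ == "__main__":
    # Toy ORF with six arginine codons: 3 x AGA, 2 x AGG, 1 x CGA.
    for codon, value in sorted(rscu("AGAAGAAGAAGGAGGCGA").items()):
        print(codon, round(value, 2))
```

With six arginine residues the expected count per codon is one, so the RSCUs in the toy example equal the raw counts; the values within a family always sum to the number of synonymous codons.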
The reduced frequency of the NNC codons is probably explained by the need to avoid CpG dinucleotides in NNC GNN codon pairs. The preference for A in the third position can be explained similarly, because NNU would potentially give rise to UpA in NNU ANN codon pairs in the genome and/or antigenome of these viruses. Almost all RSCU values of less than 0.50 (marked in red) are for NCG or CGN codons, reflecting the strong CpG suppression. There is no consistent bias in the Paramyxovirinae for NNA+NNG over NNU+NNC codons. However, in all codon groups, the ratio of the frequencies of NNU over NNC is about 1.5, and NNA is

Table 3. Comparison of CUPs in the genus Rubulavirus

Codon      Bat mumps  10-MuV  LPMV  Mapuera  9-PIV5  hPIV2  SV41  hPIV4a  Menangle  Tioman  Akimota 1
Ala - GCU  1.27       1.04    1.18  1.26     1.18    1.25   1.28  1.36    1.14      1.23    1.14
Ala - GCC  0.51       0.75    0.91  0.73     0.61    0.63   0.72  0.44    0.72      0.75    0.78
Ala - GCA  2.09       1.80    1.50  1.83     2.00    1.94   1.73  2.01    1.96      1.70    1.94
Ala - GCG  0.13       0.40    0.40  0.18     0.21    0.18   0.27  0.19    0.18      0.32    0.14
Arg - CGU  0.53       0.48    0.62  0.49     0.48    0.58   0.65  0.35    0.48      0.55    0.38
Arg - CGC  0.37       0.42    1.11  0.59     0.57    0.39   0.35  0.11    0.45      0.45    0.46
Arg - CGA  0.79       0.98    1.20  0.74     0.96    0.66   0.76  0.56    0.53      0.58    0.51
Arg - CGG  0.58       0.46    0.74  0.79     0.52    0.39   0.74  0.37    0.53      0.60    0.36
Arg - AGA  2.30       2.19    1.34  2.05     1.85    2.74   1.88  3.47    2.36      2.26    2.97
Arg - AGG  1.43       1.47    0.99  1.33     1.63    1.24   1.62  1.15    1.65      1.56    1.32
Asn - AAU  1.45       1.34    1.25  1.35     1.50    1.40   1.19  1.55    1.31      1.18    1.33
Asn - AAC  0.55       0.66    0.75  0.65     0.50    0.60   0.81  0.45    0.69      0.82    0.67
Asp - GAU  1.47       1.30    1.16  1.14     1.41    1.51   1.32  1.53    1.35      1.32    1.44
Asp - GAC  0.53       0.70    0.84  0.86     0.59    0.49   0.68  0.47    0.65      0.68    0.56
Cys - UGU  1.20       1.02    1.23  0.97     1.19    1.24   1.21  1.23    1.03      1.14    1.15
Cys - UGC  0.80       0.98    0.77  1.03     0.81    0.76   0.79  0.77    0.97      0.86    0.85
Gln - CAA  1.22       1.28    1.04  1.06     1.13    1.31   1.20  1.41    1.17      1.11    1.21
Gln - CAG  0.78       0.72    0.96  0.94     0.87    0.69   0.80  0.59    0.83      0.89    0.79
Glu - GAA  0.98       1.00    0.95  0.88     1.04    1.13   1.09  1.42    1.04      0.92    0.96
Glu - GAG  1.02       1.00    1.05  1.12     0.96    0.87   0.91  0.58    0.96      1.08    1.04
Gly - GGU  1.26       1.15    1.12  0.78     1.08    1.21   1.07  1.43    0.92      0.90    1.18
Gly - GGC  0.55       0.80    0.55  0.75     0.63    0.37   0.62  0.47    0.51      0.70    0.45
Gly - GGA  1.31       1.30    1.35  1.05     1.49    1.58   1.29  1.33    1.52      1.25    1.47
Gly - GGG  0.87       0.74    0.99  1.42     0.81    0.83   1.02  0.77    1.04      1.15    0.90
His - CAU  1.36       1.19    1.17  1.03     1.31    1.36   1.33  1.56    1.28      1.23    1.36
His - CAC  0.64       0.81    0.83  0.97     0.69    0.64   0.67  0.44    0.72      0.78    0.64
Ile - AUU  1.22       1.13    1.14  1.12     1.07    1.23   1.18  1.18    1.10      1.06    1.38
Ile - AUC  0.78       0.92    1.21  1.10     1.18    0.71   0.93  0.57    0.88      1.02    0.83
Ile - AUA  1.00       0.95    0.65  0.78     0.75    1.05   0.89  1.25    1.02      0.92    0.79
Leu - UUA  1.78       1.25    0.92  1.02     1.19    1.54   1.38  2.42    1.10      1.00    1.40
Leu - UUG  1.23       1.16    1.11  0.95     0.99    0.56   0.85  0.71    1.00      1.01    1.21
Leu - CUU  0.99       1.01    0.97  1.05     0.97    1.70   1.02  0.94    1.04      1.11    1.15
Leu - CUC  0.56       0.79    1.18  1.04     0.82    0.72   0.86  0.52    0.89      0.96    0.56
Leu - CUA  0.88       1.11    0.93  0.95     1.12    1.14   1.15  0.94    1.04      1.10    0.94
Leu - CUG  0.56       0.68    0.89  0.99     0.81    0.35   0.73  0.48    0.94      0.81    0.75
Lys - AAA  1.04       1.01    0.98  0.93     0.99    1.27   1.04  1.32    1.01      1.03    1.15
Lys - AAG  0.96       0.99    1.02  1.07     1.01    0.73   0.96  0.68    0.99      0.97    0.85
Phe - UUU  1.25       1.01    0.95  1.00     1.04    1.25   1.10  1.24    1.07      0.89    1.29
Phe - UUC  0.75       0.99    1.05  1.00     0.96    0.75   0.90  0.76    0.93      1.11    0.71
Pro - CCU  1.35       1.37    1.22  1.39     1.00    1.44   1.13  1.47    1.15      1.52    1.46
Pro - CCC  0.48       0.94    0.88  0.92     0.64    0.64   0.67  0.58    0.85      0.61    0.71
Pro - CCA  1.74       1.32    1.30  1.09     1.75    1.69   1.96  1.64    1.34      1.32    1.61
Pro - CCG  0.43       0.37    0.60  0.60     0.62    0.23   0.24  0.31    0.65      0.55    0.22
Ser - UCU  1.41       1.22    1.11  1.39     1.14    1.46   1.35  1.36    1.38      1.53    1.44
Ser - UCC  0.44       0.92    0.82  0.72     0.77    0.91   0.89  0.48    0.75      0.73    0.51
Ser - UCA  2.02       1.68    1.41  1.59     1.71    1.78   1.43  2.11    1.41      1.41    1.89
Ser - UCG  0.26       0.26    0.40  0.34     0.35    0.12   0.28  0.20    0.31      0.25    0.29
Ser - AGU  1.16       1.12    1.14  0.99     1.31    1.14   1.29  1.07    1.28      1.18    1.16
Ser - AGC  0.72       0.80    1.12  0.97     0.73    0.60   0.77  0.79    0.88      0.90    0.71
Thr - ACU  1.47       1.35    1.15  1.55     1.44    1.74   1.57  1.53    1.39      1.45    1.53
Thr - ACC  0.59       0.91    1.10  0.84     0.82    0.49   0.88  0.51    0.93      0.77    0.54
Thr - ACA  1.77       1.57    1.52  1.27     1.55    1.68   1.34  1.80    1.49      1.56    1.84
Thr - ACG  0.17       0.17    0.22  0.34     0.19    0.10   0.22  0.16    0.20      0.22    0.10
Tyr - UAU  1.45       1.19    1.12  1.16     1.21    1.35   1.07  1.35    1.25      1.28    1.28
Tyr - UAC  0.55       0.81    0.88  0.84     0.79    0.65   0.93  0.65    0.75      0.72    0.72
Val - GUU  1.41       1.30    1.12  0.98     0.96    1.45   1.19  1.08    0.90      0.97    1.27

Table 3. cont.

Codon      Bat mumps  10-MuV  LPMV  Mapuera  9-PIV5  hPIV2  SV41  hPIV4a  Menangle  Tioman  Akimota 1
Val - GUC  0.72       0.86    0.99  0.83     0.89    0.72   0.83  0.73    0.96      0.89    0.75
Val - GUA  0.97       1.07    1.05  1.11     1.04    1.29   1.05  1.54    1.12      1.09    1.15
Val - GUG  0.90       0.76    0.84  1.08     1.11    0.54   0.93  0.65    1.02      1.06    0.83

Key: RSCU is 1.50 or higher; RSCU is between 1.25 and 1.49; RSCU is between 0.75 and 0.51; RSCU is 0.50 or lower.

preferred over NNG by a factor of 1.5. The latter calculation excludes the ratios of NCA over NCG codons, because of the extremely low frequency of NCG resulting from CpG dinucleotide suppression. The preference for NNA codons is particularly strong in the Pneumovirinae (Table S2), with the exception of pneumonia virus of mice (PVM) in the genus Pneumovirus and turkey rhinotracheitis virus (TRTV) in the genus Metapneumovirus. The human and bovine PIV3 dataset (again excluding the NCG sets) shows that NNA codons are used almost three times as frequently as NNG (Table 2). Analysis of the ratio of A+U over G+C in the third position of glycine, leucine and valine codons (thus ignoring the codon groups that contain NCG) in the Paramyxoviridae shows that, with the notable exception of the morbilliviruses, where A+U was equal to G+C, all others showed substantial biases for A+U over G+C in the third position, ranging from 1.3 in PIV5 to >2 in the PIV3 dataset. From similar analyses of nucleotide preferences in the third positions of codons, Zhang et al. (2013) concluded for torque teno sus virus 1 that mutagenic pressure alone could not explain the biases in codon usage in that virus, and that natural translational selection probably plays a more important role because of the skewed usage of A and U compared to G and C in the third position of codons. Whilst the human CUP strongly prefers CUG codons for leucine, the paramyxoviruses prefer UUA, and in general there appears to be no bias against NUA or UAN codons in this virus family.
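The third-position ratio quoted for glycine, leucine and valine can be approximated directly from the tabulated RSCUs. The sketch below assumes that the ratio is obtained by summing RSCU values over codons ending in A or U and dividing by the sum over codons ending in G or C; the text does not state the exact procedure, so this is an illustration rather than the author's method. The values are copied from the 9-PIV5 and 7-hPIV3 columns of Table 2(a); the names `RSCU_TABLE2` and `third_position_au_gc_ratio` are hypothetical.

```python
# RSCU values for Gly, Leu and Val codons, copied from Table 2(a).
RSCU_TABLE2 = {
    "PIV5": {
        "GGU": 1.08, "GGC": 0.63, "GGA": 1.49, "GGG": 0.81,
        "UUA": 1.19, "UUG": 0.99, "CUU": 0.97, "CUC": 0.82,
        "CUA": 1.12, "CUG": 0.81,
        "GUU": 0.96, "GUC": 0.89, "GUA": 1.04, "GUG": 1.11,
    },
    "hPIV3": {
        "GGU": 0.88, "GGC": 0.33, "GGA": 2.06, "GGG": 0.73,
        "UUA": 2.14, "UUG": 0.87, "CUU": 0.81, "CUC": 0.53,
        "CUA": 1.11, "CUG": 0.55,
        "GUU": 1.31, "GUC": 0.66, "GUA": 1.29, "GUG": 0.74,
    },
}


def third_position_au_gc_ratio(rscu_values):
    """Sum of RSCUs for codons ending in A/U over the sum for those ending in G/C."""
    au = sum(v for codon, v in rscu_values.items() if codon[2] in "AU")
    gc = sum(v for codon, v in rscu_values.items() if codon[2] in "GC")
    return au / gc


if __name__ == "__main__":
    for virus, values in RSCU_TABLE2.items():
        print(virus, round(third_position_au_gc_ratio(values), 2))
```

Under this assumption the PIV5 and hPIV3 columns give ratios of about 1.3 and 2.2, in line with the range quoted in the text.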
The absence of a bias against NUA or UAN codons agrees with the observation that UpA suppression is present in these viruses at a lower level than in other viruses, with odds ratios between 0.75 and 0.89 (Rima & McFerran, 1997). In the paramyxoviruses, where there is a choice, as in the isoleucine codons, there is a strong preference for the AUA codon and a low frequency of AUC (Table 2); however, this tendency is less strong in, for example, the genus Hendravirus (Table S3), indicating differences between viruses in the family. Table 2 also shows similarities in the CUPs between pneumoviruses, but also details specific differences between respiratory syncytial viruses (RSVs) of subgroups A and B and human metapneumoviruses (hMPVs) of subgroups A and B. The avulavirus NDV is an outlier in several instances, and appears to have a less strong bias against CGN and NCG codons than the other viruses. Whether this reflects the non-mammalian host of NDV has not been evaluated, but TRTV, another avian paramyxovirus, of the genus Metapneumovirus, does not show this. Table 2 shows that, whilst all the viruses have CUPs very different from the human CUP, they also differ from each other in ways that distinguish the various genera in the family.

Variations in CUPs within a genus

A comparison of the CUPs within the genus Rubulavirus is shown in Table 3. This is a genus with relatively low levels of amino acid sequence conservation, and its heterogeneity is also visible in the variation in CUPs. Notably, Mapuera virus has almost no preference for NNA over NNG codons, and the level of CpG suppression in La-Piedad-Michoacan-Mexico virus (LPMV) is quite low. This virus provides the only example in the dataset in which the RSCUs of some of the CGN codons are >1.00. Thus, in the genus Rubulavirus, there are very distinct CUPs for each of the viruses. The situation is different in the genus Morbillivirus (Table 4).
This is a much more homogeneous genus containing MV, rinderpest virus (RPV), peste-des-petits-ruminants virus (PPRV), cetacean morbillivirus (DMV), the related canine (CDV) and phocine (PDV) distemper viruses (related to each other but further removed from the others) (Barrett et al., 1991) and an outlier, the recently described feline morbillivirus (FeMoV) (Woo et al., 2012). FeMoV has a CUP more like those of PDV and CDV than those of the others, as demonstrated by a reversal of codon usage preferences in both the cysteine codons (UGU and UGC) and the lysine codons (AAA and AAG). The CUPs in this genus also indicate the relatively close relationship between MV and RPV, the more distant relationship of both with PPRV, and the even more distant relationship with CDV and PDV, which mirrors the phylogenetic analyses performed earlier (Barrett, 1999). The smaller differences between the CUPs in the genus Morbillivirus as compared to Rubulavirus are indicative of the historical nature of virus classification, which, lacking criteria based on current sequence analyses, has been shown to be somewhat arbitrary.