Title: Prediction of HIV-1 virus-host protein interactions using virus and host sequence motifs

Author's response to reviews

Authors:
Perry PE Evans (evansjp@mail.med.upenn.edu)
Will WD Dampier (wnd22@drexel.edu)
Lyle LU Ungar (ungar@cis.upenn.edu)
Aydin AT Tozeren (aydin.tozeren@drexel.edu)

Version: 2
Date: 20 April 2009

20 April 2009

Response to Reviewers:

We have revised our manuscript according to the detailed suggestions of Drs. Roger Ptak and Fred Davis. We are grateful for their valuable input and their comprehensive review of our work. Thank you.

Reviewer 1: Roger Ptak

Major Compulsory Revisions: None requested.

Minor Essential Revisions:

1. Provide rationale for Subtype B and C. Also abbreviations for HIV proteins need to be corrected as shown.
We introduced the following sentence into the first paragraph of the Methods section: "We focused on subtype B because it is most common in the industrialized world [20], and chose subtype C because it is most common globally [21]." In addition, we corrected the abbreviations as specified.

2. Missing word in Page 6, Methods
We corrected the aforementioned sentence.

3. Page 6, Methods, Validation using HIV-1, Human Interaction Database: At the end of this paragraph, HPP should be HHP.
We fixed it.

4. Methods: There is no description of the statistical methods/tests/software packages used to generate these results.
We have added a paragraph to the end of the Methods section to describe the p-value calculations presented in the manuscript.

5. Page 8, Typos and editorial mistakes in Results section
All corrected as specified.

6. Page 10, Figure 5 should be labeled as Figure 8 and the other Figures renumbered accordingly. In the last sentence, significance should be significant.

7. Page 12, Editorial errors in Discussion
All corrected.

8. The resolution of Figure 3 and Legend:

9. Figure 4 and Legend: Is there a reason why similar data for the other HIV-1 proteins are not provided in the manuscript? If this data is available, it would be helpful to add it as a supplemental figure or table.
We provided the p-values in Figure 4 for the 3 HIV-1 proteins for which a large amount of data existed in the HIV-1, Human Protein Interaction Database. Yes, 'a' refers to all KEGG pathways and 'e' refers to enriched pathways. This has now been indicated in the legend of Figure 4. Additional file 4 now has the corresponding data for all HIV-1 proteins.

10. Corrections involving Figure 8 and Legend:

11. Explanation for Additional File 2:

12. Additional File 3: The relevance of this file is not understood. It is not discussed in the text and could not be opened.
This is the Endnote file, mistakenly included.

Discretionary Revisions:

1. Add reference for broader description of the HIV-1, Human Protein Interaction Database.

2. Page 11, Discussion and Figure 8: In addition to the Brass et al. and König et al. data sets that are discussed and included in the lower table of Figure 8, the authors should include the similar data set recently published by Zhou et al.:
We added the Zhou et al. reference and discussion to the second paragraph of Discussion. We have updated Figure 7 and its legend to include the data from Zhou et al.

Thank you very much for this very careful reading of our manuscript and for providing us with detailed recipes to improve it.

Reviewer 2: Fred Davis

Major Compulsory revisions:

1. The authors should cite and discuss the work of Tastan, et al (Pac Symp Biocomput. 2009). In particular, how do the authors interpret Tastan's finding that the ELM-domain feature was the weakest contributor to their predictive model (Tastan Fig 3)? http://psb.stanford.edu/psb-online/proceedings/psb09/tastan.pdf
Please see the last paragraph of our revised Background section for an extensive discussion of Tastan et al. and how it relates to our work.

2. Simple false positive/negative assessment will ease comparison to other prediction methods.
We added this information to the second paragraph of the Results section.

3. P 6, line 26: The authors should quantify the background conservation per protein, and discuss how they chose the 70% threshold.

3b. P 5, line 2: How many viral ELM hits were removed by the conservation threshold?
We added the following sentences to the first paragraph of Methods: "A total of 99 ELMs were found on at least one virus protein sequence. The conservation threshold removed 43 of these, leaving 56 conserved ELMs for the HIV-1 proteome."

4. P 4, line 24: The authors should elaborate on how ELM hits were identified -
We accessed the ELM resource repeatedly (at least 25,000 times) with automated web queries that we designed, and crashed the ELM resource three times. We then revised our automated queries to increase the time between queries so that the load on the databases was reduced. We used Human as the species for both human and virus sequences, and specified no cellular location. The filters for disordered regions and globular domains were always on in these queries.

5. P 4, line 20: How many sequences are in the HIV protein alignments used by the authors?
These numbers are now presented in Additional File 1.

5b. P 4, line 20: Why did the authors choose sequences only from subtypes B and C?
We chose subtype B because it is most common in the industrialized world and subtype C because it is the most common globally.
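The conservation filter mentioned in the response to item 3b can be illustrated with a minimal sketch. It assumes that "conserved" means an ELM is present on at least 70% of the aligned sequences of a protein; the function name `conserved_elms`, the input layout, and the toy ELM names are hypothetical and are not taken from the manuscript.

```python
from typing import Dict, List


def conserved_elms(presence: Dict[str, List[bool]], threshold: float = 0.70) -> List[str]:
    """Return the ELMs found on at least `threshold` of the aligned sequences.

    `presence` maps an ELM name to one flag per aligned sequence
    (True if the motif was found on that sequence). This layout is an
    assumption made for illustration only.
    """
    kept = []
    for elm, flags in presence.items():
        if not flags:
            continue  # no sequences to judge conservation on
        fraction = sum(flags) / len(flags)
        if fraction >= threshold:
            kept.append(elm)
    return sorted(kept)


# Toy data: three made-up ELMs scored against ten aligned sequences.
toy_presence = {
    "ELM_A": [True] * 8 + [False] * 2,  # 80% of sequences, kept
    "ELM_B": [True] * 5 + [False] * 5,  # 50% of sequences, removed
    "ELM_C": [True] * 7 + [False] * 3,  # exactly 70%, kept
}
kept_elms = conserved_elms(toy_presence)
removed_count = len(toy_presence) - len(kept_elms)
```

Applied to the figures quoted above, a filter of this kind would take the 99 ELMs found on at least one virus sequence and keep the 56 that pass the threshold. Whether the boundary case is inclusive is a choice made here, not something the response states.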

6. Typos in P 6.

7. Given that the authors base their predictions on a model of direct interaction (at least in subset H1), what is the overlap between their predicted interactions and the ~1000 interactions in the HIV-1, Human database that are explicitly annotated as direct interactions ("interacts with")?
We have added a comparison of H1 and direct interactions (as defined by Tastan et al.) to Figure 4. Figure 4 looks at ENV, NEF, and TAT. All virus proteins are given in Additional File 2. A description of DHHE has been added to the first paragraph of Methods - Validation using HIV-1, Human Interaction Database.

8. P 7, line 6. Which viral ELMs have been verified as binding sites for human proteins?
We have added 10 examples of verified motifs on 7 HIV-1 proteins to the last paragraph of the Background section.

9. P 10, line 8: It is not clear what the fractions 0.25 and 0.5 refer to in the discussion of "ELM modules".
The thresholds were used as maximal bounds for ELM presence on human proteins. We have corrected the paragraph to clarify. We have not seen estimates of the expected frequencies of functional ELMs.

10. Throughout the manuscript, the tests used to calculate p-values should be explicitly described.
The end of the Methods section has been updated with a discussion of p-values.

10b. In figure 8, why is the p-value of the HHE, HHP overlap in the first row of the top panel (8.16E-015) different from the P-value in the pairwise table at the bottom: 1E-14?
This is a rounding error introduced by the corresponding author while editing the document. We now present the p-values exactly as obtained from R's hypergeometric test and have updated the figures accordingly.

Discretionary/Comments to authors:

1. How did the experimental overlap differ between the H1 and H2 sets? Does the H1 set of predicted direct interactions or the H2 set of out-competed proteins dominate the overlap? This comparison would be a useful note about the relative contribution of each approach.
The experimental overlap with the H1 and H2 sets is presented in the part of the Results section focusing on Figure 4.
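For readers unfamiliar with the hypergeometric test referred to in the response to item 10b, the following minimal sketch shows how an overlap p-value of this kind can be computed. It is an illustration only: the function name `overlap_p_value` and its arguments are hypothetical, the example uses toy numbers rather than the manuscript's data, and it is written in Python rather than the R code the authors used.

```python
from fractions import Fraction
from math import comb


def overlap_p_value(population: int, set_a: int, set_b: int, overlap: int) -> float:
    """Upper-tail hypergeometric p-value for the overlap of two sets.

    Returns P(X >= overlap), where X is the number of members of a set of
    size `set_a` that appear in a random draw of `set_b` items from a
    population of `population` items. In R this tail is commonly written
    as phyper(overlap - 1, set_a, population - set_a, set_b,
    lower.tail = FALSE); whether the authors used exactly that call is
    not stated in the response.
    """
    if not (0 <= set_a <= population and 0 <= set_b <= population):
        raise ValueError("set sizes must lie between 0 and the population size")
    upper = min(set_a, set_b)
    total = comb(population, set_b)
    tail = sum(
        comb(set_a, i) * comb(population - set_a, set_b - i)
        for i in range(max(overlap, 0), upper + 1)
    )
    # Exact integer arithmetic first, one conversion to float at the end,
    # so that very small p-values are not lost to intermediate rounding.
    return float(Fraction(tail, total))


# Toy example: 10 proteins, two sets of 5, all 5 shared.
p_toy = overlap_p_value(population=10, set_a=5, set_b=5, overlap=5)
```

Computing the tail exactly and rounding only for display avoids the kind of discrepancy raised in item 10b, where the same quantity appeared as both 8.16E-015 and 1E-14.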

2. The message of Figure 3D is unclear.
We expanded the discussion of this figure.

3. P 6. end of second paragraph: "HPP" should be "HHP"
Fixed.

4. P 12. "let those alone" should be "let alone those"
Fixed.

5. Can the authors' predictions be prioritized in any way? For example, out of the thousands of interactions, where should an experimentalist start?
Yes, human HHP in KEGG pathways gives better results than all of HHP. Experimentalists should start here. We have added the following sentence to the end of the second paragraph in Discussion: "Given that HHP in KEGG pathways is about half as large as all HHP, and has a stronger overlap with HHE, an experimentalist should begin verification with this set."

6. Firth et al (PLoS Comp Biol 2008) addressed the lack of specificity in ELM motifs and presented a motif finding algorithm that improved this specificity. Their refined ELM motifs are available for download, and could help with the overcall problem for ELM detection.
We are currently working on developing virus-specific short peptide motifs based on the conserved ELMs found on the HIV proteome, using a variety of computational methods. We plan to publish this ongoing research at a later time.

Thank you very much for the insightful comments and for helping us better place our work within the context of existing literature.

Sincerely,
Aydin
Aydin Tozeren, Ph.D.