EGFRIndb: Epidermal Growth Factor Receptor Inhibitor database

Inderjit S Yadav 1,3, Harinder Singh 2, Mohd. Imran Khan 1, Ashok Chaudhury 3, G.P.S. Raghava 2 and Subhash M. Agarwal 1 *

1 Bioinformatics Division, Institute of Cytology and Preventive Oncology, Noida-201301, India
2 Bioinformatics Centre, Institute of Microbial Technology, Sector 39A, Chandigarh, India
3 Department of Bio & Nano Technology, Guru Jambheshwar University of Science & Technology, Hisar, India

* Corresponding author
Dr. Subhash Mohan Agarwal, Bioinformatics Division, Institute of Cytology and Preventive Oncology, I-7, Sector-39, Noida-201301, India
Email address: smagarwal@yahoo.com
Background
Aberrant activity of epidermal growth factor receptor (EGFR) family proteins is associated with a number of human cancers, including lung and breast cancer. Consequently, the search for inhibitors of the EGFR family, a well-established target of pharmacological and therapeutic value, has been ongoing. Over the years, several small molecules that compete with ATP in the kinase domain have been synthesised, and some of them have proved effective in attenuating EGFR-mediated proliferation. As a result, the literature contains a vast amount of experimental data on EGFR tyrosine kinase inhibitors. In this paper, we describe EGFRIndb, a comprehensive database that contains details of the small-molecule inhibitors of the EGFR family.

Description
EGFRIndb is a literature-curated database of small synthetic molecular inhibitors of EGFR. It consists of 4581 compounds showing in vitro inhibitory activities (IC 50, IC 80, GI 50, GI 90, EC 50, K i, K d and percentage inhibition) against EGFR, its isoforms ErbB2 (v-erb-b2 avian erythroblastic leukemia viral oncogene homolog 2) and ErbB4 (v-erb-b2 avian erythroblastic leukemia viral oncogene homolog 4), or various mutants. For each compound, the database provides information on structure, experimentally determined inhibitory activity against the kinase as well as against various cell lines, properties (physical, elemental and topological) and drug likeness. Additionally, it provides information on irreversible as well as dual inhibitors, which have gained importance in recent years due to the emergence of clinical resistance to known drugs. As the activity of a compound against similar kinases is a measure of its selectivity and specificity, the database also provides this information. It offers simple search, advanced search and browse facilities, as well as a tool for structure-based searching.

Conclusion
EGFRIndb gathers biological and chemical information on EGFR inhibitors from the literature. It is hoped that it will serve cancer researchers as a useful resource in drug discovery and provide data for docking, virtual screening and quantitative structure-activity relationship (QSAR) model development.

Keywords
Database; Epidermal growth factor receptor; Cancer; Tyrosine kinase inhibitors; Dual inhibitors; Irreversible inhibitors
1. Introduction

The epidermal growth factor receptor (EGFR) is a member of the receptor tyrosine kinase family. It is a trans-membrane protein whose family comprises four closely related isoforms: EGFR/ErbB1/HER1, HER2/ErbB2, HER3/ErbB3 and HER4/ErbB4. All the members share a common architecture consisting of three main domains: an extracellular ligand-binding domain, a trans-membrane domain and an intracellular kinase domain [1]. Each member of the EGFR family is a potent mediator of normal cell growth and development [2, 3]. Three of the isoforms, EGFR, ErbB2 and ErbB3, are implicated in the development and progression of various cancers, while the fourth, ErbB4, is suggested to be involved in inhibition of cell growth rather than proliferation [3]. The association between EGFR and various cancers, including non-small cell lung cancer, has been well established over the past two decades [4, 5]. Similarly, overexpression or amplification of ErbB2 has been reported in a number of human tumors, including 25% to 30% of breast cancer cases [6] as well as ovarian and gastric cancer [3]. Thus, among the EGFR family members, both EGFR and ErbB2 have been well characterised and have become established as pharmacologically validated targets for cancer therapy [7]. As a result, a number of small-molecule tyrosine kinase inhibitors (TKIs) have been developed over the past several years and are currently available as therapeutics, including the first clinically used agents gefitinib (Iressa) [8] and erlotinib (Tarceva) [9], as well as afatinib (Tomtovok) [10] and dual inhibitors of EGFR and ErbB2 such as lapatinib (Tykerb) [11]. Treatment of patients with EGFR or ErbB2 inhibitors as targeted therapy has shown a significant reduction in cancer progression [12].

Clinically, it is well established that protein kinase inhibitors play a significant role in the treatment of cancer. According to a recent paper, 17 of the 24 small-molecule kinase inhibitors approved for use as therapeutic agents are for cancer [13].
It is also established that a significant proportion of activity in the oncology industry is directed at only eight common kinase targets, which include EGFR and HER2 [13]. Over this period, biological activity data for a large number of molecules have been published which, if pooled, would be useful for analysis and decision making in drug discovery. In recent years, a number of literature-curated databases related to genetic or mechanistic aspects of cancer have been developed [14-16]. A specialised database of somatic mutations of the tyrosine kinase domain of EGFR (SM-EGFR-DB), which catalogues EGFR somatic mutations identified in human cancers from the literature, has also been established [17]. Similarly, a number of excellent small-compound repositories have been developed, such as PubChem (http://pubchem.ncbi.nlm.nih.gov/), a database of structures and biological activities of small organic molecules; Chemical Entities of Biological Interest (ChEBI), which provides the ontological classification of compounds along with structural information [18]; and DrugBank, a resource for known drugs and their targets [19].

As a result of the emergence of the EGFR family as a pharmacological target and the success of protein kinase inhibitors as treatment agents for cancer, researchers have continuously synthesised small molecules and investigated them for anti-EGFR activity using a variety of in vitro cellular and enzymatic assay systems. This has led to the identification of a variety of bioactive compounds and has made a large amount of biological and structural information available in the public domain. Despite the large amount of data on EGFR inhibitors available in the biomedical literature, to our knowledge there is no dedicated resource that collectively provides information on these molecules on one platform. Keeping in view the importance of EGFR kinase inhibitors in anticancer therapeutics, and to complement other databases, we have developed the EGFRIndb database (http://crdd.osdd.net/raghava/egfrindb). EGFRIndb is a database of 4581 small molecules that have shown activity against any of the EGFR isoforms (EGFR, ErbB2, ErbB4) or against mutants of the EGFR-TK domain. It is expected that the availability of EGFRIndb will facilitate a better understanding of the molecular space of EGFR family inhibitors and its associated complexity.
2. Construction and content

2.1 Data source, compilation and structure

To collect compounds that have shown inhibition against the different EGFR isoforms or their mutants, we extensively searched the literature using PubMed and other resources such as Google Scholar, with different keywords or their combinations, to find the relevant research articles. Some of the keywords used for systematic searching in PubMed were EGFR inhibitors, ErbB2 inhibitors, EGFR dual inhibitors, EGFR irreversible inhibitors, EGFR covalent inhibitors and EGFR mutant inhibitors. The main criterion for including a compound in the database was that it must have activity against EGFR, its isoforms ErbB2 and ErbB4, or its mutants. Once the articles were collected, we read through them to obtain the different standard inhibition values: IC 50 (concentration required for 50% inhibition), IC 80 (concentration required for 80% inhibition), GI 50 (concentration required to inhibit growth by 50%), GI 90 (concentration required to inhibit growth by 90%), EC 50 (half maximal effective concentration), ED 50 (effective dose required to produce a therapeutic response in 50% of the population), K i (inhibition constant), K d (dissociation constant) and percentage inhibition. We also recorded whether the inhibition values were determined using an enzymatic assay or against a cell line under in vitro conditions. Finally, we noted the cell line and the corresponding inhibition values along with the literature id (PMID). As a number of activating or resistance mutations have been documented in the literature in recent years, information on compounds inhibiting mutant isoforms such as L858R, T790M and the double mutant L858R/T790M was also extracted. Further, to understand the inhibition mechanism of these compounds, they were classified into reversible and irreversible inhibitors.
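The kind of record produced by this curation step can be sketched as a small Python data structure. This is only an illustration of the fields described above: the class name, field names and validation rules are hypothetical and are not taken from the actual EGFRIndb implementation, and the example values are placeholders rather than curated data.

```python
from dataclasses import dataclass
from typing import Optional

# Measurement types, targets and assay types listed in the text.
# All identifiers below are hypothetical, not the EGFRIndb schema.
MEASURES = {"IC50", "IC80", "GI50", "GI90", "EC50", "ED50",
            "Ki", "Kd", "percent_inhibition"}
TARGETS = {"EGFR", "ErbB2", "ErbB4"}
ASSAYS = {"enzymatic", "cell line"}


@dataclass(frozen=True)
class ActivityRecord:
    compound_name: str
    target: str                      # EGFR, ErbB2 or ErbB4
    measure: str                     # one of MEASURES
    value: float
    unit: str                        # e.g. "nM", or "%" for percentage inhibition
    assay: str                       # "enzymatic" or "cell line"
    pmid: str                        # literature id of the source article
    mutation: Optional[str] = None   # e.g. "L858R", "T790M", "L858R/T790M"
    cell_line: Optional[str] = None  # expected for cell-line assays
    irreversible: bool = False       # reversible vs irreversible inhibitor

    def __post_init__(self) -> None:
        # Inclusion criterion: activity against EGFR, ErbB2 or ErbB4.
        if self.target not in TARGETS:
            raise ValueError(f"unsupported target: {self.target}")
        if self.measure not in MEASURES:
            raise ValueError(f"unsupported measure: {self.measure}")
        if self.assay not in ASSAYS:
            raise ValueError(f"unsupported assay type: {self.assay}")
        if self.assay == "cell line" and not self.cell_line:
            raise ValueError("cell-line assays need a cell line name")


# Placeholder record; the name, value and PMID are made up for illustration.
example = ActivityRecord(
    compound_name="example-compound", target="EGFR", measure="IC50",
    value=10.0, unit="nM", assay="enzymatic", pmid="00000000",
    mutation="L858R/T790M", irreversible=True,
)
```

Validating the target at construction time mirrors the inclusion criterion stated above; a record for any other target is rejected before it could reach the database.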
The 2D and 3D structures of the collected molecules were drawn using MarvinSketch 5.4.1, a chemical drawing tool that is part of the ChemAxon package (http://www.chemaxon.com/products/marvin/marvinsketch/). As the majority of research articles describe the structures as a common scaffold with different functional groups as variants, utmost care was taken while drawing the structures. Different structural and topological molecular descriptors were calculated for each molecule using MarvinSketch 5.4.1, including mass, IUPAC name, composition, InChI (IUPAC International Chemical Identifier), InChIKey, SMILES, SMARTS, pI, logP, van der Waals volume, polar surface area (PSA), hydrogen-bond donor and acceptor counts, and the number of chiral centers and double bonds. InChI is an identifier for chemical substances that is used in printed and electronic data to enable easier linking of diverse data compilations (http://www.iupac.org/home/publications/e-resources/inchi.html). InChIKey is a 27-character compacted (hashed) version of InChI that is designed for internet and database searching and indexing [20]. Simplified Molecular-Input Line-Entry System (SMILES) is a line notation for describing the structure of chemical molecules using short ASCII strings [21], while SMiles ARbitrary Target Specification (SMARTS) is a language for specifying substructural patterns in molecules (http://en.wikipedia.org/wiki/smiles_arbitrary_target_specification).

Once all the information was gathered, the data were entered into different tables. Overall, EGFRIndb is divided into seven tables: (i) a general molecule description table, which contains basic information about the molecule such as the (unique) compound ID, PubChem ID, InChI, InChIKey, SMARTS, SMILES and so forth; (ii) an enzyme activity table, which holds the inhibition activity of the compound against the different EGFR isoforms (EGFR/ErbB2/ErbB4) or their mutants; (iii) a cellular activity table, which details the inhibition values of the compound against different cell lines; (iv) a target table, which includes information on other protein targets inhibited by the compound along with the inhibitory values; (v) a property table containing elemental information on the compound; (vi) a topology information table; and (vii) a filter table containing information about computed filters (Lipinski's rule of five, Ghose filter, Veber filter, Muegge's filter and bioavailability) (Figure 1). The database is hosted on a Red Hat based Xeon server running Apache; the front end was developed in PHP, with MySQL (http://www.mysql.com/), an open source relational database management system, at the back end.
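A minimal sketch of how three of these tables might relate to one another is given below, using Python's built-in sqlite3 module as a stand-in for the MySQL back end. The table names, column names and the sample row are assumptions made for illustration and do not reproduce the actual EGFRIndb schema. The filter column is filled using the standard thresholds of Lipinski's rule of five (molecular weight at most 500, logP at most 5, at most 5 hydrogen-bond donors and at most 10 hydrogen-bond acceptors).

```python
import sqlite3


def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count how many of Lipinski's four criteria a compound violates."""
    return sum([mol_weight > 500, logp > 5, h_donors > 5, h_acceptors > 10])


# In-memory SQLite database standing in for the MySQL back end.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE molecule (              -- (i) general molecule description
    compound_id TEXT PRIMARY KEY,
    smiles      TEXT NOT NULL
);
CREATE TABLE enzyme_activity (       -- (ii) activity against EGFR isoforms
    compound_id    TEXT REFERENCES molecule(compound_id),
    target         TEXT,
    measure        TEXT,
    activity_value REAL,
    unit           TEXT,
    pmid           TEXT
);
CREATE TABLE filter_result (         -- (vii) computed drug-likeness filters
    compound_id         TEXT REFERENCES molecule(compound_id),
    lipinski_violations INTEGER
);
""")

# Placeholder row: the ID, structure and descriptor values are illustrative
# only and do not correspond to a real EGFRIndb entry.
conn.execute("INSERT INTO molecule VALUES (?, ?)", ("EX0001", "c1ccccc1"))
conn.execute(
    "INSERT INTO filter_result VALUES (?, ?)",
    ("EX0001", lipinski_violations(78.11, 2.1, 0, 0)),
)

# Compounds with at most one violation (a common convention, assumed here).
rows = conn.execute("""
    SELECT m.compound_id, f.lipinski_violations
    FROM molecule AS m
    JOIN filter_result AS f ON f.compound_id = m.compound_id
    WHERE f.lipinski_violations <= 1
""").fetchall()
```

Keeping the computed filters in their own table, keyed by the compound ID, means that drug-likeness queries reduce to a join against the molecule table and do not touch the activity data.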