USE OF CLUSTER ANALYSIS AS TRANSLATIONAL PHARMACOGENOMICS TOOL FOR BREAST CANCER GUIDED THERAPY. Ngozi Nwana


USE OF CLUSTER ANALYSIS AS TRANSLATIONAL PHARMACOGENOMICS TOOL FOR BREAST CANCER GUIDED THERAPY
By Ngozi Nwana
A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Biomedical Informatics
Department of Health Informatics
School of Health Related Professions
Rutgers, the State University of New Jersey
Fall 2014
Ngozi Nwana, All Rights Reserved

Final Dissertation Approval Form
Use of Cluster Analysis as Translational Pharmacogenomics Tool For Breast Cancer Guided Therapy
BY Ngozi Nwana
Dissertation Committee:
Shankar Srinivasan PhD, Committee Chair
Dinesh P. Mital PhD, Committee Member
Syed Haque PhD, Committee Member
Approved by the Dissertation Committee:
Date
Date
Date

Abstract

Use of Cluster Analysis as Translational Pharmacogenomics Tool for Breast Cancer Guided Therapy
BY Ngozi Nwana

Breast cancer epidemiology and disease diversity continue to be a major health concern today. Breast cancer is a malignant tumor that develops from the cells of the breast. It is currently the second leading cause of cancer death in women, exceeded only by lung cancer, although death rates have been declining since 1989 (with larger decreases in women younger than 50). The decline is believed to result from early detection through screening, increased awareness and improved treatment. However, despite all the advances in early detection and current treatment options, the race for sustainable, effective, personalized treatment and ultimately a cure continues to be a challenge.

Varying levels of breast cancer disease biology and treatment information exist today, and this has enabled varying levels of treatment success. However, despite spectacular examples of prolonged disease remission with some of the treatments available today, the statistical survival benefit in metastatic breast cancer patients is still estimated in months and not years. A patient's potential to survive continues to vary greatly and depends on factors such as the type and stage (spread) of the cancer and the degree of disease aggressiveness, in addition to the minimally analyzed genetic make-up. This treatment gap has been attributed to the diversity of breast cancer disease (diverse subtypes) and, at the granular level, to limitations in disease understanding and pharmacogenomics. All these factors present treatment limitations and challenges, thus resulting in increased incidence of the disease. Undoubtedly, the most impactful shortcoming in today's general breast cancer treatment is the inadequate incorporation of the genetic aspects of disease into sustainable personalized treatments for patients, in spite of the tremendous amount of genetic data that is available today. This gap generated my interest in researching breast cancer gene expression data in order to extract relevant genetic information for disease understanding.

Today, large volumes of breast cancer data are generated, mainly using DNA microarrays. This provides the opportunity to extract previously unrecognized biological structure and meaning from these gene expression data. However, one of the major challenges with such large genetic datasets is understanding and then analyzing the resulting gene expression data in order to understand gene behavior and extract significant patterns of disease from relevant gene activity. Approaches for mining genetic data include some existing unsupervised clustering techniques. Some of these methods, however, have been classified as non-robust and/or as lacking the ability to discover subtle, context-dependent biological patterns, and hence have not proven to be optimal for analyzing cDNA microarray breast cancer gene expression data.

In this study, the NMF (non-negative matrix factorization) Consensus algorithm from MIT's GenePattern analysis module, a robust clustering methodology designed for class discovery and clustering validation, is presented as the main clustering methodology for the analysis of breast cancer DNA microarray gene expression data obtained from the Broad Institute of MIT and Harvard. NMF is sensitive and adaptive to huge genetic data, and provides biological relevance and sensitivity to the resulting clusters. It was used in conjunction with preparatory unsupervised and sparse hierarchical clustering, both of which were used for the initial classification and baseline clustering before NMF clustering was performed. Comparative gene marker selection analysis was also run to evaluate genes and pathways for types of cancer other than breast cancer, in order to determine whether there are comparable genes/biomarkers that share similar pathways with breast cancer genes.

This study puts forth some important recommendations that would constitute a science-based, long-term, sustainable breast cancer data analysis approach, critical for disease understanding and for implementing effective, sustainable and personalized treatment solutions for all breast cancer patients. Some components of these recommendations are lacking in traditional treatment, and some are already partly incorporated into similar genetic-based breast cancer treatment today. However, they are yet to be adequately incorporated into standard breast cancer treatment for all patients and all subtypes.

Keywords: Genomic technologies, Pharmacogenomics, Translational informatics, Genomic instability, Apoptosis, Sustained angiogenesis (VEGF pathway inhibitors, COX2 inhibitors), Metastasis (MMP inhibitors), Anti-oncogene expression, Self-sufficiency in growth signals (ERBB pathway inhibitors, ER pathway inhibitors), Tissue invasion, Genomic instability (Conventional cytotoxic treatments), Breast cancer susceptibility genes (BRCA1 and BRCA2 genes), Comparative gene marker selection, Hierarchical Clustering (HC), Sparse Hierarchical Clustering (SHC), Non-negative Matrix Factorization (NMF)

Acknowledgement

My most sincere thanks and gratitude go first to God, who made all of this possible for me. Though only my name appears on the cover of this dissertation, a great many people have worked with me and contributed to its completion. My appreciation and many thanks therefore go to everyone who contributed to this work and made this dissertation possible.

I am especially grateful to my Advisor, Dr. Shankar Srinivasan, for his incredible guidance and insight throughout this research study. I was extremely fortunate to have an advisor who gave me the freedom to be an in-depth researcher and at the same time provided the guidance I needed to recover when my steps faltered. Dr. Shankar's patience and support helped me tremendously to overcome many urgent situations and finish this dissertation. I would also like to thank the Faculty members (Dr. Mital, Dr. Shibata and Dr. Haque) who at various points read this dissertation and offered their input throughout the review period.

My deepest gratitude goes to my immediate family for their immense support through it all. None of this would have been possible without the love, encouragement and patience of my family. My immediate family, to whom this dissertation is dedicated, has been a constant source of love, support and strength to me through all these years. In particular, I would like to express my most heartfelt gratitude and thanks to my husband, Acho Nwana, who encouraged and supported me above and beyond throughout this journey. I would also like to thank my three children (Chioma, Nwanacho and Muna Nwana) for their constant reminders, in so many ways, to finish the Ph.D. dissertation.

I wish to thank my entire extended family for their loving support, many prayers and encouragement throughout this entire period. Of special mention is my Mom (Abigail Ndukwe), who lived with me for most of my Ph.D. years and provided a great deal of support with my other motherly tasks, especially during the periods when I had to focus heavily on the dissertation. I would also like to remember my late Dad, Hezekiah Ndukwe, and my father-in-law, Harry Nwana, who both passed away before this Ph.D. degree was completed. I will always remember their tremendous love, support and encouragement throughout the years leading up to my pursuing the Ph.D. degree. I would also like to thank my brothers, sisters, mother-in-law, uncles, several cousins and friends, who were supportive in various ways, kept the dream alive and motivated me to strive relentlessly to complete this degree.

Lastly, I wish to thank several of my professional colleagues from work, who encouraged me, particularly during the very hectic periods, to stay focused and finish the degree. To all of you, I am truly grateful.

Table of contents

Abstract
Acknowledgement
Table of contents
List of figures
List of tables
Introduction
Statement of Problem (Purpose of study)
Background and Significance of the Problem
Necessary Path to closing the breast cancer treatment gap
Understanding Breast Cancer Pre-disposing factors
Understanding Breast Cancer subtypes
Understanding breast cancer gene types and gene mutation categories
Understanding breast cancer disease pathways
Understanding the interactions between breast cancer genes, disease subtypes and disease biological pathways
Understanding breast cancer pharmacogenomics
Research Hypotheses
Objectives of Research
Results (anticipated) and Significance of Research
LITERATURE REVIEW
Review of related literature
Discovery-driven Translational research in breast cancer
Breast cancer translational research specific study aspects
Breast cancer pharmacogenomics studies complementing Translational studies
Clustering analyses techniques used in prior Breast cancer studies
Microarray studies and their relevance in breast cancer studies
METHODOLOGY (Materials and Methods)
Materials
Methods (GenePattern Clustering Analyses Software)
Research Design and Methods
3.2.2 Research Timeline and Work Plan
Research goals/strategies
Study Design
Expected/Projected Study outcome
RESULTS AND DISCUSSIONS
Results and analyses
Results of Unsupervised/Sparse Hierarchical Clustering
Results Analyses for Unsupervised and Sparse hierarchical Clustering
NMF Clustering Analyses and Results
Model Selection for NMF Clustering
Validating the NMF Algorithm
Results analysis for the NMF Clustering
Consensus Clustering
Measuring consensus clustering
Consensus matrix re-ordering and visualization
Comparative Marker selection results and analyses
Comparative Marker Selection Results analysis
Study Conclusions
Potential Challenges with Research
Further Work to Close the Research Gap
Summary and Conclusions
Glossary of Acronyms
References

List of figures

Figure 1-1 Correspondence between Molecular Class and Clinicopathological Features of Breast Cancer [175]
Figure 1-2 Breast cancer risk factors and mechanisms (Adapted from Drug Discovery Today: Disease Mechanisms)
Figure 1-3 Breast cancer disease pathways from Human Pathway database [140]
Figure 1-5 p53 and RB tumour-suppressor pathways [163]
Figure 2-1 BRCA1 (Breast cancer 1) and BRCA2 (Breast cancer 2) genes [1]
Figure 5-1 Breast_A & Breast_B HC clusterview from cdt file
Figure 5-2 Sparse Hierarchical Clustering Results
Figure 5-3 NMF cluster results for Breast_A & Breast_B data: Convergence Plot K=
Figure 5-4 Consensus matrix k=2 Breast_A & Breast_B
Figure 5-5 Data Results from NMF clustering of Breast_A.gct and Breast_B.gct analyses
Figure 5-6 Consensus matrix k=5 dataset for Breast_A & Breast_B
Figure 5-7 The Cophenetic Coefficient plot for Breast_A & Breast_B.gct clustering
Figure 5-8 Data Results for NMF Clustering analyses for Breast_A.gct & Breast_B.gct (all 5 k.plot)
Figure 5-9 Lorenz Curve for Breast_A_data
Figure 5-10 Change in Gini for Breast_A_data
Figure 5-11 Consensus CDF (empirical cumulative distribution) Graph for Breast_A_data
Figure 5-12 Delta area Plot for Breast_A_data
Figure 5-13 Heat Map for Breast_A.sub78.srt.5.gct
Figure 5-14 Breast_B data Lorenz curve plot
Figure 5-15 Breast_B.data Change in Gini plot
Figure 5-16 Breast_B.data Delta area plot
Figure 5-17 Heatmap for Breast_B.sub39.srt.5.gct
Figure 5-18 Comparative marker selection FDR (BH) vs. Q-Value Br.vs.Rest (Colon, lung and prostate).odf
Figure 5-19 Comparative marker selection Lambda vs. TTO Br. Vs. Rest (Colon, lung and prostate).odf
Figure 5-20 Comparative marker selection Score Histogram Br. Vs. Rest (Colon, lung and prostate).odf
Figure 5-21 Comparative marker selection Q-Value Histogram Br. Vs. Rest (Colon, lung and prostate).odf
Figure 5-22 Comparative Marker Selection HEAT Map Breast_A_CLS.gct
Figure 5-23 Comparative Marker Selection HEAT Map Breast_B_CLS.gct
Figure 5-24 Comparative marker selection analysis results for breast cancer and other cancer types [179]

List of tables

Table 1-1 Breast cancer pathways and sources
Table 1-2 Breast cancer pathways and resulting effects
Table 1-3 Representative genes of major breast cancer pathways
Table 1-4 Breast Cancer pre-disposition genes
Table 2-1 Clinical outcome of tamoxifen-treated breast cancer patients using the AmpliChip CYP450 Test
Table 3-1 Breast Cancer Study Data Listing
Table 3-2 Unsupervised hierarchical clustering parameters
Table 3-3 Sparse Hierarchical clustering parameters
Table 3-4 NMF clustering input parameters
Table 3-5 Comparative Marker Selection input parameters
Table 5-1 Breast_A & Breast_B Weights- A[1][1] and B[1][1]
Table 5-2 Consensus clustering parameters
Table 5-3 Comparative marker selection.comp.marker.odf all_aml_train.preprocessed
Table 5-4 Comparative Marker selection results for Multi_A.comp.marker.Br.vs.Rest.odf
Table 5-5 Comparative Marker selection Multi_A.comp.marker.Br.vs.Rest Data list.odf

Chapter I
STATEMENT OF PROBLEM, SIGNIFICANCE AND BACKGROUND

1 Introduction

Breast cancer is a malignant tumor that develops from the cells of the breast. It is the second leading cause of cancer death in women, exceeded only by lung cancer, although death rates have been declining since 1989 (with larger decreases in women younger than 50). The decline is believed to result from earlier detection through screening, increased awareness and improved treatment. Except for skin cancer, it is the most common cancer among American women (about 1% of cases affect men). It constitutes about 18% of all cancers in women, and about one in every eight women (12%) develops invasive breast cancer in her lifetime.

Breast cancer generally originates from breast tissue, most commonly from the inner lining of the milk ducts (ductal carcinomas) or from the lobules (lobular carcinomas) that supply the ducts with milk. The most common breast cancers start in the cells lining the ducts, while others start in the cells that line the lobules. For both types, there are benign and malignant occurrences and diverse risk factors. Mammary ductal carcinoma is the most common type of breast cancer in women and comes in two forms: invasive ductal carcinoma (IDC), an infiltrating, malignant proliferation of neoplastic cells in the breast tissue, and ductal carcinoma in situ (DCIS), a noninvasive, possibly malignant neoplasm that is still confined to the lactiferous ducts, where breast cancer most often originates.

The risk factors associated with breast cancer are diverse and constitute the basis for the complexity of the disease. At a general level, the severity of the different breast cancer forms/subtypes has been attributed to the predisposing factors that are prevalent in each disease type, including the genetic component of the disease and the patient's pharmacogenomic make-up. Similarly, the relative survival rate for breast cancer is thought to vary within treatment groups and disease categories and to depend greatly on the same factors: type of breast cancer, stage (spread) at diagnosis and treatment, degree of aggressiveness and genetic make-up.

Relative survival statistics compare the survival of patients diagnosed with cancer with the survival of people in the general population who are of the same age, race and sex and who have not been diagnosed with cancer. An overall survival rate, on the other hand, is the percentage of people who are alive a specified period of time after diagnosis of a disease, and it may vary depending on each person's diagnosis and treatment. In the case of breast cancer, this rate is the percentage of people alive a specified time after diagnosis; it varies by breast cancer stage, and many people live much longer than the specified period. People diagnosed with stage 0, I or II breast cancers tend to have higher overall survival rates than people diagnosed with stage III or IV breast cancers. The five-year overall survival rate has generally been placed at about 90% for women with stage I breast cancer; that is, about 90 percent of women diagnosed with stage I breast cancer survive at least five years after diagnosis. There is also a population survival rate, which is used to assess survival at the population level.
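The two survival measures defined above can be made concrete with a short calculation. The following is a minimal sketch only: the function names and the cohort and population figures are hypothetical values introduced here for illustration, not data from this study, and it assumes the usual convention that relative survival is the observed survival in patients divided by the expected survival in a matched general population.

```python
def overall_survival_rate(alive_at_cutoff, diagnosed):
    """Fraction of diagnosed patients still alive at the cutoff (e.g. five years)."""
    if diagnosed <= 0 or not 0 <= alive_at_cutoff <= diagnosed:
        raise ValueError("need diagnosed > 0 and 0 <= alive_at_cutoff <= diagnosed")
    return alive_at_cutoff / diagnosed


def relative_survival_rate(observed_rate, expected_rate):
    """Observed survival in patients divided by expected survival
    in a general population matched on age, race and sex."""
    if not 0 < expected_rate <= 1:
        raise ValueError("expected_rate must be in (0, 1]")
    return observed_rate / expected_rate


# Hypothetical cohort: 1,000 women diagnosed with stage I disease, 900 alive at five years.
observed = overall_survival_rate(900, 1000)
# Hypothetical five-year survival of the matched general population.
expected = 0.95
relative = relative_survival_rate(observed, expected)
print(f"overall: {observed:.1%}, relative: {relative:.1%}")
```

Because the matched population also experiences some deaths over the same period, the relative rate in such a calculation is higher than the overall rate for the same cohort.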
Population-level survival is also referred to as SEER staging because it is used by the Surveillance, Epidemiology, and End Results (SEER) program, a part of the National Cancer Institute, to compile national cancer statistics in the US from collected cancer data.

The current state of variable treatment success for breast cancer exists predominantly because disease treatment has also been designed to address cure at a very general level. As expected, the progress made so far in breast cancer treatment is not yet adequate for determining sustainable treatments or a cure. Treatment is generally given uniformly on the basis of the type or stage of breast cancer, and is not necessarily geared towards addressing the underlying genetic basis of the disease and its diverse subtypes.

Today, breast cancer data are generated mainly from large amounts of genomic information using DNA microarrays. This provides the opportunity to extract previously unrecognized biological structure and meaning from these data. In this study, I present the NMF (non-negative matrix factorization) Consensus algorithm, a robust clustering methodology designed for class discovery and clustering validation, as the main clustering methodology for the analysis and understanding of breast cancer DNA microarray data. It was used in conjunction with preparatory unsupervised and sparse hierarchical clustering, both of which were used for the initial classification and baseline clustering before NMF clustering was performed.

Clustering analysis was chosen for this investigation over other methods (for example, factor analysis or neural network analysis) because of the large volume of breast cancer genomic data to be analyzed and, specifically, because of its overall classification potential, its adaptability to genetic data and its sensitivity in identifying relevant classes and sub-classes within a given group of primary data. Clustering techniques are generally well suited to large amounts of data because they run on algorithms that compare massive amounts of data for similarities. This is especially beneficial for my study, given the large volume of breast cancer gene expression cDNA microarray data that needed to be analyzed.
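The idea behind NMF consensus clustering described above can be illustrated with a minimal sketch. This is not the GenePattern module used in this study: it assumes the standard Euclidean multiplicative update rules for NMF (a swapped-in variant chosen for brevity), runs on a small made-up expression matrix, and the names `nmf`, `consensus_matrix` and `toy` are introduced here purely for illustration. Each run factors the genes-by-samples matrix into k metagenes, assigns every sample to its dominant metagene, and the consensus matrix records how often each pair of samples lands in the same cluster across random restarts.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, k, iters=200, seed=0):
    """Approximate non-negative V (genes x samples) by W (genes x k) times H (k x samples)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    eps = 1e-9  # guards against division by zero
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h


def consensus_matrix(v, k, runs=20):
    """Fraction of NMF runs in which each pair of samples shares a cluster."""
    m = len(v[0])
    counts = [[0.0] * m for _ in range(m)]
    for seed in range(runs):
        _, h = nmf(v, k, seed=seed)
        # Each sample is assigned to the metagene with the largest coefficient.
        labels = [max(range(k), key=lambda c: h[c][j]) for j in range(m)]
        for a in range(m):
            for b in range(m):
                if labels[a] == labels[b]:
                    counts[a][b] += 1
    return [[c / runs for c in row] for row in counts]


# Made-up data: 4 genes (rows) x 6 samples (columns) with two visible sample groups.
toy = [
    [5.0, 6.0, 5.0, 0.1, 0.2, 0.1],
    [6.0, 5.0, 6.0, 0.2, 0.1, 0.2],
    [0.1, 0.2, 0.1, 5.0, 6.0, 5.0],
    [0.2, 0.1, 0.2, 6.0, 5.0, 6.0],
]
for row in consensus_matrix(toy, k=2):
    print(" ".join(f"{x:.2f}" for x in row))
```

Entries near 1 or 0 indicate sample pairs that are consistently grouped together or apart across restarts, which is the kind of stability that summary measures such as the cophenetic coefficient are meant to capture when comparing values of k.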

MIT's GenePattern clustering module was chosen for this study over other clustering tools because it offers various clustering algorithms within the same framework, with full capability for efficient clustering and for identifying gene events that characterize specific disease forms and the biology of breast cancer subtypes. In addition, GenePattern's clustering modules consist of clustering suites that partition a gene expression dataset into clusters such that the members of each cluster share common expression traits according to the distance measure used. They also provide biological relevance and sensitivity to the resulting clusters. NMF clustering was chosen as the primary clustering technique for this study because of its adaptability to genetic data.

DNA microarray gene expression analysis was chosen as the data source for this study for the following reasons. Breast cancer data are currently generated mainly from large amounts of genomic information using DNA microarrays. With the development of microarray technology, it is now possible to examine efficiently how active thousands of genes are at any given time for a given patient, in order to determine the pathological presence of those genes in a disease state. In the past, scientists were only able to conduct genetic analyses on a few genes at once.

1.1 Statement of Problem (Purpose of study)

Breast cancer epidemiology and disease diversity continue to be a major health concern today, and the race for sustainable, effective treatment and ultimately a cure continues to be a challenge despite all the advances in early detection and current treatment options. Breast cancer is diverse, with various subtypes. The American Cancer Society's estimates for breast cancer in the United States for 2014 are:

- About 232,670 new cases of invasive breast cancer will be diagnosed in women.
- About 62,570 new cases of carcinoma in situ (CIS) will be diagnosed (CIS is noninvasive and is the earliest form of breast cancer).
- About 40,000 women will die from breast cancer [180].

The degree of incidence is also very variable from patient to patient. In a study published in 1995, it was projected that well-established risk factors accounted for 47% of cases, while only 5% were attributable to hereditary syndromes [25]. In spite of this preliminary knowledge about the risk factors and associated treatments, breast cancer treatment gaps still exist for the various subtypes, some more than others. This is because not all aspects of present breast cancer treatment and complete cure for the diverse disease forms and classifications are fully understood yet, as is reflected in the rate of treatment success. The variable pharmacogenomic aspects in particular continue to present the most important challenge and complexity for disease treatment. For example, with some of the current treatments, approximately 40% of patients with lymph node-positive disease will experience a relapse, and a majority of these patients will die from disseminated cancer. For patients with lymph node-negative disease, the 5-year recurrence rate is 25%, suggesting that the risk of recurrence and subsequent death is closely related to the state of the disease at the time of primary surgery and probably also to the long-term effectiveness of the treatment drug. Knowledge in this area is still inadequate, hence the inability to develop effective and sustainable treatment options.

To date, varying levels of breast cancer disease biology and treatment information exist, and this has enabled varying levels of treatment success. However, despite spectacular examples of prolonged disease remission with some of the treatments available today, the statistical survival benefit in metastatic breast cancer patients is still estimated in months and not years. Patients' potential to survive continues to vary greatly and traditionally centers on factors such as the type and stage (spread) of the cancer, the degree of disease aggressiveness and the minimally analyzed genetic make-up. At the granular level, a gap still exists in the understanding of breast cancer disease, and this has continued to present treatment limitations and challenges for finding a sustainable cure and personalizing care for patients.

Undoubtedly, the most impactful shortcoming in today's general breast cancer treatment is the inadequate incorporation of the genetic aspects of disease into sustainable personalized treatments for patients. Incorporating the genetic aspects into treatment is very important for determining gene state in relation to disease state. This is because, even though all of the cells in the human body contain identical genetic material, the same genes are not necessarily active in every cell, in healthy or diseased states, across all patients. This implies that different genetic factors may be specific to the different breast cancer subtypes. For example, microarray-based comparative genomic hybridization has revealed differences in the copy numbers of particular genes in different subtypes of breast cancer, which include Luminal A, Luminal B, HER2-enriched and Basal-like breast cancer. The increased copy-number variation in basal-like tumors indicates more genetic complexity than in the other subtypes, suggesting a greater degree of genetic instability in these tumors [ ]. Basal-like cancers are relatively enriched for low-level copy-number gains involving several chromosomal regions, whereas high-level amplification at any locus is infrequent. In contrast, high-level amplifications are seen more frequently in HER2-positive and luminal B tumors. Similar aberrant genomic patterns occur in familial breast cancers that are not associated with BRCA1 or BRCA2 [165, ]. Neither hereditary BRCA1-associated tumors nor sporadic basal-like tumors have markers of X-chromosome inactivation (Xi); duplication of the active X chromosome and loss of Xi suggest that X-chromosome abnormalities contribute to the pathogenesis of basal-like cancers [170,171].

These distinct transcriptional and genomic aberrations that differentiate the four subtypes of breast cancer indicate that these variants may arise from different transformed stem or progenitor cells, each with distinct biologic properties [ ]. Moreover, these subgroups track with prognosis and response to therapy. The low-grade luminal A tumors are indolent and sensitive to anti-estrogens. Luminal B tumors and tumors that are HER2-positive and ER-positive have incomplete sensitivity to endocrine therapy, and HER2-positive tumors, which have an aggressive natural history, are sensitive to trastuzumab, an anti-HER2 antibody. Basal-like tumors also have a more aggressive natural history, though they can be especially sensitive to chemotherapy [175]. Unfortunately, the additional clinical value of molecular classification for these subtypes may be limited by its close correspondence with ER, PR and HER2 status and with tumor grade (Figure 1-1).

Figure 1-1 Correspondence between Molecular Class and Clinicopathological Features of Breast Cancer [175].

Despite the limitations of molecular classification of breast cancer, several important insights that could reveal new therapeutic targets have resulted from exploring the molecular differences that underlie the various phenotypes of breast cancer. Examples, all detected through molecular classification, are the identification of a functional androgen-receptor pathway in a subgroup of ER-negative and PR-negative breast tumors, and defects in DNA-repair pathways in BRCA1 and BRCA2 carriers and probably in many basal-like breast cancers [ ]. Overall, molecular classification of breast cancer is changing the design of clinical trials and the basis of the patient selection process. However, investigations into disease biology and the genetic aspects of the various breast cancer subtypes need to continue in order to adapt more favorable treatment approaches and close the current gap in treatment success.

In the past, scientists typically classified cancers primarily by the organs in which the tumors develop and subsequently designed treatments based mainly on disease location, with little or no consideration of the genetic aspects. This location-based treatment approach neglected gene activity within the cells underlying the disease. It also failed to address the complexities of disease-specific gene interactions and molecular classifications that may be responsible for incidence and recurrence. Ultimately, this approach often resulted in potentially misguided treatment and/or a lack of personalized, effective treatment, owing to limited understanding of the individual patient's gene interactions for the disease. Overall, this limitation created a gap in the complete disease understanding needed to achieve sustainable treatment or cure for the various breast cancer subtypes.

In addition to the knowledge gap resulting from inadequate knowledge of molecular classifications of breast cancer for treatment determination, there are also clear indications that some intrinsic features of breast cancer biology may be compromising the efficiency of some of the current therapeutic strategies, even in consolidated age groups [25]. As a result, wonder-drugs have proven all but unattainable, leading researchers to seek incremental treatments that target specific mutations of particular types of cancer and are sometimes tailored only to very narrow patient groups. It is believed that an in-depth understanding of breast cancer biology with regard to the contributory hereditary/genetic implications (pharmacogenomic aspects) for therapeutic inhibition may provide significant insight into developing sustainable individualized treatment solutions.

Breast cancer data mining capabilities and the originating data sources and platforms are also suspected to have contributed to the slow progress made in breast cancer data analysis to date. Some current data mining capabilities have been useful in data analysis and in the meaningful translation of genetic information for effective treatment. Collectively, however, they have yet to result in optimal and efficient disease mapping or in effective treatment models/designs for stronger predictions of treatment outcomes. It is therefore evident from the degree of breast cancer treatment success that these data mining capabilities have yet to deliver consistent, effective and sustainable solutions for breast cancer treatment today.

Several approaches, including clustering analyses, for mining breast cancer data derived from several sources/platforms are available and are currently being used at varying levels in today's exploratory breast cancer data analyses. In particular, cluster analyses have been used in some breast cancer data mining studies with variable levels of success. Some have not been found to be fully effective in providing the data analysis required for understanding breast cancer genomic data. More specifically, some existing unsupervised clustering methods have been employed for breast cancer data analysis. Unfortunately, some of those methods have been classified as non-robust and/or as lacking the ability to discover subtle, context-dependent biological patterns, and hence did not prove to be optimal methods.

In addition to the limitations already described (inadequate knowledge of molecular classifications, inadequate data mining capabilities and intrinsic features of breast cancer biology), breast cancer data sources have contributed to the overall limitation. Some available breast cancer data sources and platforms have data structures or formats that present challenges when they are integrated onto an adaptable universal platform for comprehensive analysis. Inadequate understanding of these data structures or formats may further limit the potential to extract translational pharmacogenomic information from the analysis results. For example, even though breast cancer microarray data have been readily available for analysis for some time now, the major challenge remains that of understanding and analyzing the resulting complex genomic data, extracting the relevant patterns of disease and then translating that knowledge and framework into treatment solutions for the various subtypes.

Collectively, these gaps and complexities in breast cancer data sources and mining capabilities point to the general difficulty of fully understanding and classifying the diverse forms of breast cancer, in addition to finding effective treatment. Overall, this has resulted in the inability to deliver specific and targeted treatments or a cure for all breast cancer disease types. These less than optimal outcomes continue to call for more refined data mining and focused research in breast cancer analysis for treatment designs with effective solutions. Hence, my study projects that designing treatment approaches based on disease biology, genetic diversity and pharmacogenomics, in addition to using refined and focused data mining options to analyze and understand the data, will improve treatment success and reduce variability in treatment outcomes.

1.2 Background and Significance of the Problem

To date, more than 200,000 women in the United States develop breast cancer each year. Approximately 70 percent of these women are classified as having estrogen receptor-positive cancer. These patients are typically prescribed tamoxifen (an anti-estrogen drug), which has remained the drug of choice. Besides tamoxifen, other treatments exist for higher-risk, aggressive breast cancers, for example Adriamycin-based chemotherapy (CA: cyclophosphamide + doxorubicin) for aggressive types and monoclonal antibodies such as trastuzumab (Herceptin) for aggressive recurrent types. Mammography is still considered the only acceptable screening tool for early detection. For patients with primary breast cancer, surgery is usually prescribed and is often followed by adjuvant therapy, e.g., chemotherapy, radiotherapy and/or endocrine therapy. Radiation may also be applied at the surgical bed to control cancer cells that were missed by surgery. This usually extends survival, although radiation exposure to the heart may cause damage and heart failure in the following years. Hormonal treatments are also common, especially following surgery, for those cancers that require hormones to grow. These hormone-dependent cancers are treated with drugs that interfere with the hormones, usually tamoxifen, and with drugs that shut off the production of estrogen in the ovaries or elsewhere, even though this may damage the ovaries and end fertility. In addition, low-risk, hormone-sensitive breast cancers may be treated with hormone therapy and radiation alone after surgery. Breast cancers that lack hormone receptors, that have spread to the lymph nodes in the armpits, or that express certain genetic characteristics are higher-risk and are treated more aggressively.
Even with all these successes, the concept of therapeutic specificity as it relates to pharmacogenomics, that is, why successful treatment does or does not occur for certain breast cancer types, is still not clearly or completely understood. This is the same reason these treatment successes are not consistently sustainable across all breast cancer cases and subtypes. Interestingly, much work has focused on understanding how cells become resistant to conventional chemotherapeutic agents, but little has been done on why they are sensitive to begin with, that is, on therapeutic specificity. For example, the ERBB2 and ABL tyrosine kinases, which are targeted in breast cancer treatment, are expressed in many normal cell types as well as in some cancer cells [ ]. The question becomes: why should targeting these kinases destroy cancer cells but not normal cells, or why should cancer cells be sensitive to the targeted treatment when normal cells are not? Some studies have indicated that it is actually not the neoplastic cells but rather the tumor vasculature that is the actual target of many conventional chemotherapeutic agents [141]. This claim is further supported by research on apoptosis in the 1990s, which suggested that some tumor cells are less, rather than more, likely to undergo apoptosis after noxious stimuli. This is consistent with studies showing that many drugs used in the clinic to treat solid tumors do not function through the induction of apoptosis at the doses achieved in patients, but through another mechanism. This is yet another example of therapeutic specificity that has yet to be thoroughly investigated. Fortunately, despite the lack of a rigorous pharmacogenomics selection process to achieve therapeutic specificity with the available therapies, some of these treatments continue to produce considerable successes, at varying degrees of efficacy and sustainability.
Additional research in this area continues to lead to significant progress in gaining the knowledge required to develop better, more effective and sustainable targeted therapies for all forms of breast cancer, especially those whose disease mechanisms and pathways are relatively well understood.

However, in spite of all the progress made so far in understanding various aspects of therapeutic specificity, and despite the progress made in developing breast cancer drugs aimed at either neoplastic cells or the tumor vasculature, the current state of breast cancer treatment remains highly unpredictable. To date, there is still a lingering gap in understanding the basis for therapeutic selectivity (pharmacogenomics). The current treatments have yet to prove consistently reliable for all breast cancer subtypes and mutation categories. This is evident in the fact that, despite spectacular examples of prolonged disease remission with some of these notable therapies, the statistical survival benefit in metastatic breast cancer patients is still estimated in months and not years. Patients' potential to survive continues to vary greatly with these treatments and depends on several factors, including the type and stage (spread) of cancer at diagnosis, the degree of disease aggressiveness, and genetic make-up. Sustainable treatment and a possible cure for all patient categories continue to be a challenge, largely attributed to inadequate disease understanding, which in turn results from the complexities of molecular classification, prevailing breast cancer pre-disposing factors, diverse disease subtypes and the potential for variable and diverse disease pathways. The cumulative result of this gap is unpredictable treatment specificity or resistance, insufficient disease understanding for developing optimal targeted therapies for the various forms of breast cancer, and/or unreliable treatment outcomes. There is an undeniable need to fine-tune breast cancer research and deliver a higher rate of success with these treatments.
In order to understand drug applicability and efficacy for breast cancer treatment, the basis for the responsiveness of cancers to treatment drugs should be well understood. The concept of therapeutic specificity still needs to be adequately investigated in order to gain useful treatment information. This might become the key to how better agents against cancer can be explored and developed. Hence, my study has focused on analyzing breast cancer gene expression data from DNA microarrays, measured via mRNAs (messenger RNAs), in order to understand the specific gene interactions for breast cancer subtypes, determine their correlation to patient pharmacogenomics (severity, recurrence, and treatment success), and generate the information required for incorporating patient pharmacogenomics into standard disease treatment. It is projected that this approach will enable the fulfillment of my research hypotheses. My study projects that understanding breast cancer gene-specific interactions with the help of microarray technology will not only enable the classification of different types of cancers based on the organs in which the tumors develop, but will also help to further classify these cancers based on the molecular patterns of gene activity in the tumor cells. My study also projects that this analysis approach will provide further insight into therapeutic specificity, that is, into exactly how different therapies affect tumors. It will also enable the design of more effective, targeted and sustainable treatments for each specific breast cancer type, in line with the goals of this study.

This study presents the NMF (non-negative matrix factorization) Consensus algorithm as the main clustering methodology for the data analysis, used in conjunction with preparatory unsupervised and sparse hierarchical clustering, both of which were used in the initial classification and baseline clustering of the gene expression data. NMF refers to a group of algorithms in multivariate analysis in which a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements.
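The factorization of V into non-negative W and H can be illustrated with a minimal sketch. It assumes the widely used multiplicative update rules for the squared-error objective and a toy 4 x 4 matrix; the function names, the toy data and the choice of update rule are illustrative assumptions, not the implementation or data used in this study, and the consensus step (repeating the factorization over many random starts) is omitted.

```python
import random

def matmul(A, B):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(V, k, n_iter=500, seed=0, eps=1e-9):
    """Factor a non-negative V (genes x samples) into W (genes x k) and
    H (k x samples) with multiplicative updates (an assumed update rule)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H

def assign_clusters(H):
    """Assign each sample (column of H) to the row of H with the largest coefficient."""
    return [max(range(len(H)), key=lambda r: H[r][j]) for j in range(len(H[0]))]

# Toy expression matrix (hypothetical values): 4 genes x 4 samples, two sample groups.
V = [[5.0, 4.0, 0.0, 0.0],
     [4.0, 5.0, 0.0, 0.0],
     [0.0, 0.0, 3.0, 4.0],
     [0.0, 0.0, 4.0, 3.0]]

W, H = nmf(V, k=2)
labels = assign_clusters(H)
print(labels)
```

Each column of H gives the weight of every metagene (column of W) in one sample, so taking the largest entry per column is one simple way to turn the factorization into cluster labels.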
The usefulness of comparative gene-marker analyses was also evaluated as part of the clustering study to determine whether significant correlations of pathogenicity (incidence, recurrence, metastasis, aggressiveness and treatment sustainability) and insights into the disease can be derived from known cancer genes other than breast cancer genes. It is to be noted that other data analysis techniques besides the ones described exist and have been used for this type of analysis. However, many of them have been found to be far from optimal.

1.3 Necessary Path to closing the breast cancer treatment gap

In order to fully bridge the breast cancer disease understanding and treatment gap, the following aspects of breast cancer have to be thoroughly understood:

- Breast cancer pre-disposing factors
- Breast cancer subtypes
- Breast cancer gene types and mutation categories
- Breast cancer disease pathways
- Interactions between breast cancer genes, disease subtypes and disease biological pathways
- Breast cancer pharmacogenomics

1.3.1 Understanding Breast Cancer Pre-disposing factors

The prevalence of diverse breast cancer pre-disposing factors and their correlation to disease subtypes and disease pathways was highly significant to this study and was investigated. Known risk factors include age [1], sex [21], lack of childbearing or breastfeeding, genetic factors and higher hormone levels [23] [24]. In more recent years, research has linked diet and other behaviors to breast cancer incidence, including a high-fat diet [27], alcohol intake [28] [29], obesity [30], and environmental factors such as tobacco use, radiation [31], endocrine disruptors and shift work [32]. Radiation from mammography has also been linked to breast cancer. Even though mammography delivers a low dose of radiation, there has been some claim that a cumulative radiation effect can cause cancer [32].

On a broader molecular level, these diverse pre-disposing risk factors have been further classified into two main categories. One group includes factors that result from excessive exposure to estrogens. The other group includes factors such as germ-line mutations in high- or middle-penetrance genes (BRCA1, BRCA2, ATM, P53, CHEK2 and NBS1), which contribute to a deficiency in the maintenance of genomic integrity. More specific classifications within these broad categories include sub-factors such as the impact of environmental carcinogens, excessive exposure to ionizing radiation, possibly viruses, and constitutive chromosomal instability leading to a deficiency in maintaining genomic integrity [7]. Sub-categories of somatic genetic events have been reported to contribute to genetic instability (Fig. 1-2) [104] [7]. These include chromosomal, methylation, microsatellite, nucleotide and mitochondrial instabilities, as well as alteration of specific genetic pathways resulting in the activation of oncogenes (amplification, over-expression) and the inactivation of suppressor genes (allelic loss, intragenic mutations, promoter hyper-methylation, decreased expression). Chromosomal alterations have also been confirmed to contribute to breast cancer genetic events. Specifically, during malignant transformation, breast epithelial cells accumulate a high number of somatic genetic events (mainly gross chromosomal alterations and methylation abnormalities). These DNA alterations activate oncogenes and inactivate tumor suppressor genes, which eventually results in the manifestation of the hallmarks of cancer (Fig. 1-2) [7].

RISK FACTORS
  Environmental carcinogens (?)
  Excessive exposure to ionizing radiation
  Viruses (??)
  Deficiency in maintenance of genomic integrity (mainly heritable factors?)
    Germ-line mutation (high or middle penetrance): BRCA1, BRCA2, ATM, P53, CHEK2, NBS1
    Low-penetrance polymorphisms in genes
  Excessive exposure to estrogens (interplay between heritable and life style factors)
    Early menarche
    Late menopause
    Nulliparity or low parity
    Late age at first delivery
    Lack of breast feeding
    Post-menopausal obesity
    Tall height
    Prolonged hormone replacement therapy
    Hormonal contraceptives
    Increased alcohol intake
    Low physical activity
    Environmental exposure to endocrine disruptors (?)
    Low-penetrance polymorphisms in genes responsible for steroid metabolism

SOMATIC GENETIC EVENTS
  Genetic instability
    Chromosomal instability
    Methylation instability
    Microsatellite instability (?)
    Nucleotide instability (?)
    Mitochondrial instability (?)
  Alteration of specific genetic pathways
    Activation of oncogenes (amplification, overexpression)
    Inactivation of suppressor genes (allelic loss, intragenic mutations, promoter hypermethylation, decreased expression)

THE HALLMARKS OF CANCER
  Self-sufficiency in growth signals
  Insensitivity to antigrowth signals
  Evading apoptosis
  Limitless replicative potential
  Sustained angiogenesis
  Tissue invasion and metastasis
  Genome instability
  Contribution of surrounding stroma

THERAPEUTIC APPROACHES
  ERBB pathways inhibitors
  ER pathways inhibitors
  Demethylating agents aimed to restore antioncogene expression
  Bcl2 inhibitors
  Telomerase inhibitors
  VEGF pathway inhibitors
  COX2 inhibitors
  MMP inhibitors
  Conventional cytotoxic treatments?

Figure 1-2 Breast cancer risk factors and mechanisms (Adapted from Drug Discovery Today: Disease Mechanisms)

The risk factors listed above collectively are very significant to breast cancer disease treatment.
However, confirming the variety of prevalent risk factors and their function in a breast cancer disease event is a significant challenge. Oftentimes, the factors thought, solely on the basis of data analyses, to be the strongest predictors of metastasis (including lymph node status and histological grade) have failed to accurately classify breast cancer tumors according to clinical behavior [12, 13, 14]. To date, none of the reported breast cancer gene expression signatures has definitively allowed for patient-tailored therapy strategies or strongly predicted a short interval to distant metastasis (the "poor prognosis" signature) in all disease cases [15, 16, 17, 18, 19, 20, 21]. Nonetheless, understanding these risk factors and dissecting their roles (disease complexity, type, persistence, recurrence) in breast cancer cases is very important for developing effective and sustainable treatment solutions. In order to understand breast cancer effectively, the various disease risk factors and their relevance to disease subtypes and associated pathways will need to be fully understood.

1.3.2 Understanding Breast Cancer subtypes

Breast cancer is very diverse in nature and is known to exist in several subtypes. At a very high level, four significant molecular subtypes have been distinguished by gene-expression profiling and categorized as Luminal A, Luminal B, HER2-enriched and Basal-like breast cancer. These subtypes correspond reasonably well to clinical characterization on the basis of ER and HER2 status, as well as proliferation markers or histologic grade. Altogether, about 23 significant genes have been associated with the four molecular subtypes to various extents, each expressed at various stages of the disease subtype to which it is related. These 23 genes are: PIK3CA, TP53, MAP3K1, MAP2K4, GATA3, MLL3, CDH1, PTEN, PIK3R1, AKT1, RUNX1, CBFB, TBX3, NCOR1, CTCF, FOXA1, SF3B1, CDKN1B, RB1, AFF2, NF1, PTPN22 and PTPRD. Their association with each subtype is as follows:

- Luminal A: All 23 genes are associated with this subtype. These tumors are mostly ER-positive and histologically low-grade.
- Luminal B: All 23 genes are associated with this subtype. It is also mostly ER-positive, but may express low levels of hormone receptors and is often high-grade.
- HER2-enriched: All 23 genes except TBX3, NCOR1, RB1 and NF1 are associated with this subtype, which shows amplification and high expression of the ERBB2 gene and several other genes of the ERBB2 amplicon.
- Basal-like: All 23 genes except MAP3K1, MAP2K4, CDH1, PIK3R1, AKT1, RUNX1, CBFB, FOXA1, CDKN1B and PTPN22 are associated with this subtype, which mostly corresponds to ER-negative, progesterone-receptor (PR) negative tumors.

Microarray studies have shown that luminal types of tumors express high amounts of luminal cytokeratins and genetic markers of luminal epithelial cells of normal breast tissue [6]. In contrast, basal-like breast cancers do not express ER, PR, and ER-related genes and do not overexpress several genes that typify myoepithelial cells of normal breast tissue: luminal cytokeratins, smooth-muscle specific markers, and certain integrins. In some basal-like cancers, there is high expression of basal cytokeratins such as CK5 and of a variety of growth factor receptors, including high levels of epidermal growth factor receptor, c-kit (a tyrosine kinase in breast epithelium), and growth factors such as hepatocyte growth factor and insulin growth factor. The genes listed above are associated with the different disease subtypes at varying levels via the applicable pathways. Overall, in order to treat the disease at its root and develop sustainable solutions that interfere with gene functions, by either activation or inactivation, for the various disease forms, the gene interactions for the disease subtypes and associated pathways will need to be fully understood.

1.3.3 Understanding breast cancer gene types and gene mutation categories

Research studies have shown that invasive cancer (tumorigenesis) develops when there are alterations in three types of genes:

- Oncogenes, which cause cancer by activation
- Tumor-suppressor genes, which cause cancer by inactivation
- Stability genes, which cause cancer by inactivation, oftentimes through mutation

In general, mutations reduce the activity of the gene product.
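The gene-to-subtype associations given earlier for Luminal A, Luminal B, HER2-enriched and Basal-like tumors can be written down as set differences, which makes it easy to ask which subtypes a given gene is linked to. The sketch below is only a restatement of those lists; the variable and function names are illustrative, and gene symbols are written in their standard spelling (GATA3, CBFB, CDKN1B, NF1).

```python
# The 23 significant genes listed in the text.
ALL_GENES = {
    "PIK3CA", "TP53", "MAP3K1", "MAP2K4", "GATA3", "MLL3", "CDH1", "PTEN",
    "PIK3R1", "AKT1", "RUNX1", "CBFB", "TBX3", "NCOR1", "CTCF", "FOXA1",
    "SF3B1", "CDKN1B", "RB1", "AFF2", "NF1", "PTPN22", "PTPRD",
}

# Genes the text excludes from each subtype.
EXCLUDED = {
    "Luminal A": set(),
    "Luminal B": set(),
    "HER2-enriched": {"TBX3", "NCOR1", "RB1", "NF1"},
    "Basal-like": {"MAP3K1", "MAP2K4", "CDH1", "PIK3R1", "AKT1",
                   "RUNX1", "CBFB", "FOXA1", "CDKN1B", "PTPN22"},
}

# Each subtype's gene set is "all 23 genes except the excluded ones".
SUBTYPE_GENES = {name: ALL_GENES - excluded for name, excluded in EXCLUDED.items()}

def subtypes_for(gene):
    """Return the subtypes (sorted by name) that a gene is associated with."""
    return sorted(name for name, genes in SUBTYPE_GENES.items() if gene in genes)

for name, genes in SUBTYPE_GENES.items():
    print(name, len(genes))
print("FOXA1:", subtypes_for("FOXA1"))
```

Genes that appear in every subtype, such as TP53 and PIK3CA, fall out of the same structure as the intersection of the four sets.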

Oncogene activation can result from chromosomal translocations, gene amplifications, or subtle intragenic mutations affecting crucial residues that regulate the activity of the gene product. Oncogenes are mutated in ways that render the gene constitutively active, or active under conditions in which the wild-type gene is not. Tumor-suppressor genes are targeted in the opposite way: genetic alterations inactivate them. Such inactivations arise from missense mutations at residues that are essential for the gene product's activity, from mutations that result in a truncated protein, from deletions or insertions of various sizes, or from epigenetic silencing. A mutation in a tumor-suppressor gene is analogous to a dysfunctional brake in an automobile: the car doesn't stop even when the driver attempts to engage the brake [142]. Some recently described tumor-suppressor genes have been hypothesized to confer a selective advantage on a cell when only one allele is inactivated and the other remains functional (that is, haplo-insufficiency) [143]. However, mutations in both the maternal and paternal alleles of a tumor-suppressor gene are generally required to confer a selective advantage on the cell. This situation commonly arises through the deletion of one allele via a gross chromosomal event, such as loss of an entire chromosome or chromosome arm, coupled with an intragenic mutation of the other allele [144]. Despite the functional differences between oncogenes and tumor-suppressor genes, their mutations operate similarly at the physiologic level. Both mutation types drive the neoplastic process by increasing tumor cell number through the stimulation of cell birth or the inhibition of cell death or cell-cycle arrest. The increase can be caused by activating genes that drive the cell cycle, by inhibiting normal apoptotic processes, or by facilitating the provision of nutrients through enhanced angiogenesis.

A third class of cancer genes is called stability genes, or caretakers. This class includes the mismatch repair (MMR), nucleotide-excision repair (NER) and base-excision repair (BER) genes, which, when functioning properly, repair subtle mistakes made during normal DNA replication or induced by exposure to mutagens. Other stability genes control processes involving large portions of chromosomes, such as those responsible for mitotic recombination and chromosomal segregation (for example, BRCA1, BLM and ATM). Generally, stability genes keep genetic alterations to a minimum when they are functioning properly; when they are inactivated, mutations in other genes occur at a higher rate [145]. When mutated, stability genes therefore promote tumorigenesis in a completely different way from oncogenes and tumor-suppressor genes. All genes are potentially affected by the resulting increase in mutation rate, but only mutations in oncogenes and tumor-suppressor genes affect net cell growth and can thereby confer a selective growth advantage on the mutant cell. As with tumor-suppressor genes, both alleles of a stability gene generally must be inactivated for a physiologic effect to result. Understanding the genetic events and specific mutations that apply to the various breast cancer types would enable improvement in the not yet optimal drug development process and in patient treatment. In addition, targeting most major oncogenes in many cancer types is a clinical and operational advantage and should always be explored.

1.3.4 Understanding breast cancer disease pathways

Breast cancer disease pathways are as important to the disease as risk factors, breast cancer genes and disease subtypes. Although there is no precise definition of "biological pathway", most researchers regard it as a series of inter-connected molecular and cellular events that links molecular entities such as proteins, genes, and metabolites to a disease.
Breast cancer is, in essence, a genetic disease, oftentimes occurring through well-defined or fairly well-known disease pathways. Disease pathways have become very important in understanding the molecular and genetic details of breast cancer. In general, a biological pathway can be activated by extracellular stimuli and can be defined by molecular pathway events such as signal transduction, enzymatic reaction and genetic regulation events, which can lead to persistent changes in the biochemical state of cells and may in turn initiate a disease condition [162]. To date, many breast cancer disease pathways have been discovered and reasonably well studied. These pathways have been projected and mapped out by several sources, each often reflecting that source's best understanding of the disease. The integrated breast cancer disease pathway is described below.

Figure 1-3 Breast cancer disease pathways from Human Pathway database [140].

The integrated Human Pathway Database (HPD) pathway shown above incorporates the most important proteins for breast cancer and is derived from different sources such as Protein Lounge, BioCarta, KEGG, and NCI-Nature. More specifically, this integrated pathway is a breast cancer-specific similarity network of 25 HPD pathways. These pathways contain genes, proteins or metabolites with database annotations, some protein-protein interactions mapped onto the pathway, and some unconnected lines. The HPD pathway used the Rp score from the Connectivity-Maps (C-Maps) web server to rank the most important proteins for breast cancer, and only the pathway pairs with a similarity score and overlap above the threshold {$S_{i,j} \geq 0.2$ AND $|P_i \cap P_j| > 2$} are shown in the figure above. The protein-to-protein relations for these important proteins were determined by annotating the pathways. These proteins were then used to determine the most important pathways for the disease [140]. The colors and shapes in the diagram indicate the original HPD pathway data sources, according to the shape/color legend shown in the upper left corner. Sub-network pathways and their host pathways are indicated with directed cyan edges. Each edge is labeled with the number (in red) of molecular entities shared by the pair of connected pathways [140]. Another breast cancer pathway, involving just the FOXA1 gene, is shown below:

Figure 1-4 Integrated pathway model involving FOXA1 related breast cancer.

The figure above shows how information from different pathway database sources is readily integrated, queried, and analyzed together in HPD for FOXA1-related breast cancer signaling studies. Twenty pathways connected with breast cancer are shown below, with their pathway IDs, total molecules, hit molecules, similarity scores and pathway sources:

20 Pathways Involving Query Gene(s)

Pathway ID  Pathways Involved                                                     Pathway Source
HPD_707     Recruitment of repair and signaling proteins to double-strand breaks  Reactome
HPD_548     DNA Damage Induced Sigma Signaling                                    Protein Lounge
HPD_628     atm signaling pathway                                                 BioCarta
HPD_336     Chks in Checkpoint Regulation                                         Protein Lounge
HPD_644     role of brca1 brca2 and atr in cancer susceptibility                  BioCarta
HPD_1261    FOXA1 transcription factor network                                    NCI-Nature Curated
HPD_596     BRCA1 Pathway                                                         Protein Lounge
HPD_788     ATM Pathway                                                           Protein Lounge
HPD_1182    p53 Signaling                                                         Protein Lounge
HPD_1058    Aurora A signaling                                                    NCI-Nature Curated
HPD_1567    DNA Repair Mechanism                                                  Protein Lounge
HPD_212     brca1 dependent ub ligase activity                                    BioCarta
HPD_1070    BARD1 signaling events                                                NCI-Nature Curated
HPD_786     Molecular Mechanisms of Cancer                                        Protein Lounge
HPD_340     GADD45 Pathway                                                        Protein Lounge
HPD_12      cell cycle: g2/m checkpoint                                           BioCarta
HPD_988     Ubiquitin mediated proteolysis                                        KEGG
HPD_752     Fanconi's Anaemia Pathway                                             Protein Lounge
HPD_276     ATM mediated phosphorylation of repair proteins                       Reactome
HPD_953     JAK/STAT Pathway                                                      Protein Lounge

Table 1-1 Breast cancer pathways and sources
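The similarity threshold used to select pathway pairs for the HPD network (a similarity score of at least 0.2 and more than two shared molecules) can be made concrete with a small sketch. The source does not define how the similarity score is computed, so the Jaccard index is used here purely as an assumption; the pathway names and their member genes are hypothetical toy data, not HPD content.

```python
from itertools import combinations

def jaccard(a, b):
    """Shared members divided by total distinct members (assumed similarity score)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_edges(pathways, min_score=0.2, min_overlap=2):
    """Keep pathway pairs with score >= min_score and overlap > min_overlap."""
    edges = []
    for (id_i, p_i), (id_j, p_j) in combinations(sorted(pathways.items()), 2):
        overlap = len(p_i & p_j)        # number of shared molecules
        score = jaccard(p_i, p_j)
        if score >= min_score and overlap > min_overlap:
            edges.append((id_i, id_j, overlap, round(score, 3)))
    return edges

# Hypothetical pathway memberships, for illustration only.
toy_pathways = {
    "pathway_A": {"BRCA1", "BRCA2", "ATM", "TP53", "CHEK2"},
    "pathway_B": {"BRCA1", "ATM", "TP53", "CHEK2", "RB1"},
    "pathway_C": {"FOXA1", "ESR1", "GATA3"},
}

edges = similarity_edges(toy_pathways)
print(edges)
```

Each retained tuple corresponds to one edge of the similarity network, with the overlap count playing the role of the number printed on the edge.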

Certain breast cancer pathways and the resulting disease states are known. Examples are shown in Table 1-2 below:

Known Therapies for Disease Pathway                            Resulting Disease States
ERBB pathways inhibitors / ER pathway inhibitors               impacts self-sufficiency in growth signals
Demethylating agents aimed to restore antioncogene expression  when inactivated, results in insensitivity to anti-growth signals
Bcl2 inhibitors                                                when inactivated, evades apoptosis
Telomerase inhibition                                          activates limitless replicative potential
MMP inhibitors                                                 results in tissue invasion and metastasis
Conventional cytotoxic treatment                               applied when genomic instability occurs

Table 1-2 Breast cancer pathways and resulting effects

This study projects that once existing breast cancer disease pathways are fully understood and new ones discovered, the platform required to solve major breast cancer research challenges will be in place. First, it will enable the discovery of new genes that have a causal role in neoplasia, particularly those that initiate and conclude the process. Second, it will enable the delineation of the pathways through which these genes act and the basis for their varying actions in specific cell types. Third, it will enable the development of translational medicine, in the form of new ways to exploit the knowledge gained for the benefit of patients by either blocking or inactivating the significant disease pathways.

1.3.5 Understanding the interactions between breast cancer genes, disease subtypes and disease biological pathways

Breast cancer disease pathways have yet to be fully understood, investigated, or even adequately integrated into disease analysis or patient treatment schemes. Successful breast cancer treatment involves understanding the dynamics of all the risk factors and the specific disease pathway for each individual being treated. Treatment failure, on the other hand, is represented by an ineffective, unreliable, unpredictable and unsustainable outcome that does not consistently and successfully inactivate or activate the necessary genes on the disease path. It is important that this aspect be understood, because it will lead to an understanding of the relevant pathways for the various disease forms, which in turn will lend itself to successful treatment. What is currently known is that mammalian cells have multiple safeguards to protect them against the potentially lethal effects of cancer gene mutations, and that an invasive cancer (tumorigenesis) develops only when several genes are defective. In the last decade, many important genes responsible for the genesis of various cancers have been discovered and their mutations reasonably well identified. Research has also established that no single gene defect causes cancer, unlike diseases such as cystic fibrosis or muscular dystrophy, in which mutations in one gene can cause disease; most cancer pathways are therefore triggered by more than one gene event [157]. What is also known is that cancer-gene mutations, whether somatic or hereditary, ultimately enhance net cell growth. Research performed over the past decade also shows evidence that there are fewer breast cancer pathways than genes, because of composite gene-pathway combinations.
There are almost always a variety of genes which, when altered, lead to similar phenotypes [148] [149], implying that many genes can act through one pathway and hence that there are more genes than pathways. For example, several cancer genes directly control transitions from a resting stage (G0 or G1) to a replicating phase (S) of the cell cycle (Fig. 1-5, RB pathway).

Figure 1-5 p53 and RB tumour-suppressor pathways [163]

p53 and RB are at the heart of the two main tumor-suppressor pathways that control cellular responses to potentially oncogenic stimuli and control transitions from a resting stage (G0 or G1) to a replicating phase (S) of the cell cycle. Each pathway consists of several upstream regulators and downstream effectors. For simplicity, only four main components in each pathway are shown. Similarly, the pathways interact at several points, two of which are shown. In the p53 pathway, signals such as DNA damage induce the ARF product of the CDKN2A locus (also known as p14 in humans and p19 in mice). ARF increases p53 levels by sequestering MDM2, which facilitates the degradation and inactivation of p53. p53 has both transactivation and transrepression activity, and so controls the transcription of numerous genes. Among the p53 target genes are WAF1, an inhibitor of cyclin-dependent protein kinases (CDKs) that, among other activities, causes cell-cycle arrest, and BAX, which promotes apoptotic cell death. In the RB pathway, stress signals such as oncogenes induce INK4A, the other product of the CDKN2A locus. INK4A inhibits the CDKs that phosphorylate, and therefore inactivate, RB during the G1 phase of the cell cycle. RB also controls the expression of numerous genes, although it does so primarily by recruiting transcription factors and chromatin remodelling proteins. One downstream consequence of RB activity is the inhibition of E2F activity, which is important for the transcription of several genes that are required for progression through the G1 and S phases of the cell cycle. RB also regulates p53 activity through a trimeric p53-MDM2-RB complex [163].

The products of these genes include proteins as diverse as cdk4 (a kinase), cyclin D1 (which interacts with and activates cdk4), Rb (a transcription factor) and p16 (which interacts with and inhibits cdk4) [ ]. The genes encoding Rb and p16 are tumor-suppressor genes, inactivated by mutation, whereas those encoding cdk4 and cyclin D1 are oncogenes, activated by mutation. Detailed studies of individual tumor types provide compelling evidence that these four genes (cdk4, cyclin D1, Rb, and p16) function in a single pathway in human cancers [ ]. This concept of a composite gene-pathway has been applied in cancer studies to elucidate the biochemical functions of altered cancer genes in different cell culture systems. The results are used to derive and map potential disease pathways. Studies of the TP53 tumor-suppressor gene are yet another example of breast cancer research that focuses on pathways rather than individual genes. The p53 protein is a transcription factor that normally inhibits cell growth and stimulates cell death when induced by cellular stress [159] [160] [161]. The most common way to disrupt the p53 pathway is through a point mutation that inactivates its capacity to bind specifically to its cognate recognition sequence.
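The regulatory relationships described for the p53 and RB pathways can be read as a signed graph, where each step either activates (+1) or inhibits (-1) the next component and the net effect along a route is the product of the signs. The sketch below encodes only the relationships stated in the figure description; the edge signs are an interpretation of that prose, the p53 target genes are treated as being activated by p53, and the trimeric p53-MDM2-RB complex is left out for simplicity. All names are illustrative.

```python
# Signed edges: +1 = induces/activates, -1 = inhibits/inactivates.
EDGES = {
    "DNA damage": [("ARF", +1)],
    "ARF": [("MDM2", -1)],                  # ARF sequesters MDM2
    "MDM2": [("p53", -1)],                  # MDM2 promotes p53 degradation
    "p53": [("WAF1", +1), ("BAX", +1)],     # assumed activation of target genes
    "WAF1": [("CDK", -1)],
    "BAX": [("apoptosis", +1)],
    "oncogenic stress": [("INK4A", +1)],
    "INK4A": [("CDK", -1)],
    "CDK": [("RB", -1)],                    # CDKs phosphorylate and inactivate RB
    "RB": [("E2F", -1)],
    "E2F": [("G1/S progression", +1)],
}

def net_effects(source, target, sign=1, seen=()):
    """Return the set of net signs over all simple routes from source to target."""
    if source == target:
        return {sign}
    effects = set()
    for nxt, edge_sign in EDGES.get(source, []):
        if nxt not in seen:
            effects |= net_effects(nxt, target, sign * edge_sign, seen + (source,))
    return effects

print(net_effects("DNA damage", "G1/S progression"))
print(net_effects("DNA damage", "apoptosis"))
print(net_effects("MDM2", "G1/S progression"))
```

Read this way, DNA damage has a net inhibitory effect on cell-cycle progression and a net positive effect on apoptosis, while increased MDM2 activity has a net positive effect on progression.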
However, there are several other ways to achieve the same disruption of p53, including amplification of the MDM2 gene and infection with DNA tumor viruses whose products (such as the E6 protein of human papilloma virus) bind to p53 and functionally inactivate it (Fig. 2-4, p53 pathway). This led to major discoveries of the 1990s that established

that virtually all DNA tumor viruses that cause tumors in experimental animals or humans also encode proteins that inactivate both Rb and p53 [ ], thus demonstrating the same concept of a composite gene-pathway. Representative genes of major breast cancer pathways are shown in Table 1-3 below.

Gene^a (synonym) | Somatic mutation type^b | Cancers with mutant gene^c | Pathway^d
-----------------|-------------------------|----------------------------|----------
FBXW7 (CDC4) | Inactivating codon change | Colon, uterine, ovarian, breast | CIN
AKT2 | Amplification | Ovarian, breast | PI3K
PIK3CA | Activating codon change | Colon, stomach, brain, breast | PI3K
CCND1 (cyclin D1) | Amplification, translocation | Leukemias, breast | RB
ERBB2 | Amplification | Breast, ovarian | RTK*
NTRK1,3 | Translocation, activating codon change | Thyroid, secretory breast, colon | RTK
SMAD2 | Inactivating codon change | Colon, breast | SMAD
MAP2K4 (MKK4) | Inactivating codon change | Pancreas, breast, colon | Unknown

* RTK, receptor tyrosine kinase pathway; APOP, apoptotic pathway; CIN, ; PI3K, phosphoinositide 3 kinase pathway; RB, ; SMAD,

Table 1-3 Representative genes of major breast cancer pathways

The genes^a listed in the table above are associated with certain breast cancer pathways. The somatic mutation type^b represents the type of mutation associated with each cancer type. It can be an activating codon change: an intragenic mutation altering one or a small number of base pairs that activates the gene product, indicating that the gene is an oncogene. It can also be an inactivating codon change: any mutation (point mutation, small or large deletion, etc.) that inactivates the gene product, indicating that the gene is a tumor suppressor. The somatic mutation type^b can also be an amplification or a translocation, both of which generally affect oncogenes, although occasional translocations may disrupt a gene rather than activate it, as has been suggested to occur with the RUNX1 gene [141].
Also listed in the table above are the cancer types with mutant genes^c and the applicable pathways for those genes that are mutated somatically but not inherited in mutant form. The single pathway listed for each gene represents a best guess (when one can be made).

Some breast cancer genes are associated with hereditary breast cancer rather than with somatic mutations. These cancer predisposition genes that are implicated in hereditary breast cancer are given in Table 1-4 below. The table outlines the names of these predisposition genes, the syndrome, hereditary pattern, second hit, pathway and major heredity tumor types that could be impacted by the same genes.

Gene (synonym(s)) | Syndrome | Hereditary pattern | Second hit | Pathway | Major heredity tumor types
------------------|----------|--------------------|------------|---------|---------------------------
BRCA1, BRCA2 | Hereditary breast cancer | Dominant | Inactivation of WT allele | CIN | Breast, ovary
TP53 (p53) | Li-Fraumeni syndrome | Dominant | Inactivation of WT allele | p53 | Breast, sarcoma, adrenal, brain
FBXW7 (CDC4) | Inactivating codon change | Dominant | Inactivating codon change | CIN | Colon, uterine, ovarian, breast

Table 1-4 Breast Cancer pre-disposition genes

Therefore, understanding breast cancer genetics (the respective gene types for the disease, such as oncogenes, tumor-suppressor genes and stability genes, and the associated mutation types) and the associated breast cancer pathways could highlight the basis for treatment success and for understanding treatment resistance factors, which ultimately should lead to a change in the way future clinical trials are designed and conducted and in the way breast cancer is treated. It will also provide some insight into how drugs should be developed, as well as the basis for the selectivity and specificity of conventional chemotherapeutic agents for patient treatment for the different disease types.

Understanding breast cancer pharmacogenomics

Poor knowledge of disease pharmacogenomics in terms of genetic basis for disease, disease subtypes/diversity, disease pathways and associated treatment resistance has also been thought

to be a major contributing factor to the present challenge of inadequate disease understanding. The result is the lack of an effective treatment model and of solutions that can be implemented for more predictive diagnosis, drug efficacy, and successful and sustainable treatment outcomes. Inaccurate prognosis, ineffective patient treatment, variable treatment outcomes, adverse side effects, the effects of overtreatment and, in some extreme cases, death are some of the serious consequences of this limitation [6].

Cancer treatment should be based on pharmacogenomics. Treatment should be individualized for all patients and all disease types, with a strategy to uncover and target the specific genes that are significant or major oncogenes for the particular cancer type and to understand the degree of resistance that may be linked with those genes. Targeting those genes should represent a clinical and operational advantage in terms of the success of treatment decisions and a greater potential for reducing side effects. If the significant genes for the specific disease type are identified and confirmed, then the treatment decision should move to what role the genes play in the disease, and therefore to definitively targeting those genes at the disease root level.

Therefore, the pharmacogenomics treatment approach should be that, prior to a clinical trial for a breast cancer drug or any cancer drug, a pathway analysis and drug sensitivity test are performed. Patient pre-selection for a drug or clinical trial should be based on patients possessing the relevant genetic and pathway biomarkers to qualify for enrollment. With each clinical trial, a genetic analysis of patients who responded and those who did not respond should be performed in order to understand the underlying resistance and continue to refine the treatment model.

A comprehensive understanding and analysis of the breast cancer predisposing factors, as well as their underlying pharmacogenomics relationship within the specific disease type and pathway, is considered extremely important. It is very relevant to delivering a high degree of successful patient care and possible disease eradication. Unfortunately, this is lacking in today's treatment solutions. However, its applicability will enable a greater depth in understanding the disease mechanisms and possible adaptabilities of the diverse disease forms for which treatment successes are yet to be optimal.

Goals and Objectives of study:

The overall goal of this project was to identify the genetic factors that are necessary for breast cancer disease incidence, progression and recurrence across the various diverse forms. Specifically, the objectives are to utilize DNA microarray gene expression data from two classes of breast cancer patients, with disease less than 5 years and disease greater than 5 years, to determine:

1) what genetic factors are important for breast cancer incidence, progression and recurrence from DNA microarray gene expression analyses
2) whether genetic similarities or differences occur within the diverse breast cancer forms as a result of the associated prevailing risk factors
3) whether understanding the breast cancer genetic factors across the diverse patient groups can enable effective personalized treatment designs for the particular disease subtypes
4) whether there are differences in the disease severity based on the genetic component of the various disease subtypes

1.4 Research Hypotheses

Research Hypotheses of the Project:

Are there statistically significant associations between the associated genetic factors and breast cancer incidence, progression and recurrence?
Are there statistically significant differences or similarities in the genetic basis of disease in relation to the prevailing risk factors?
Is there a statistically significant association between understanding the breast cancer genetic factors across the diverse patient groups and developing effective personalized treatment for the particular disease subtypes?
Are there statistically significant differences in breast cancer disease severity based on the genetic component of the various disease subtypes?

The study hypotheses which were evaluated through this study include the following:

Hypothesis 1: This study hypothesized that there are significant associations between the associated genetic factors and breast cancer incidence, progression and recurrence. This hypothesis was tested here by investigating the indications that breast cancer incidence, progression and survivability can in large part be related to a patient's genetic make-up.

Hypothesis 2: This study hypothesized that there is a correlation between the genetic basis of breast cancer disease and the prevailing risk factors that are associated with the disease. This was also projected to be the basis of the different breast cancer subtypes. The hypothesis was tested in this study by evaluating the applicable predisposing factors that are associated with the various breast cancer disease forms and their relationship to the genetic basis/component of the particular disease subtype.

Hypothesis 3: This study hypothesized that in-depth understanding of breast cancer genetic factors across the diverse patient groups and subtypes is the foundation for developing effective personalized treatment for the particular disease subtypes. The central hypothesis of this aspect of the study is that variable treatment outcomes also occur as a result of inadequate understanding of the underlying disease gene behavior in terms of cancer genetics and cancer mutation specifics for the disease types. A good understanding of this aspect will enable treatments to target only the genes that are specifically relevant to the disease being treated, so that tailored, individualized and effective treatment can be provided for each patient for their specific disease. Furthermore, this hypothesis also claimed that breast cancer pharmacogenomics relevance (differences in disease type, drug response patterns and side effects), which may impact treatment outcome, safety and efficacy, is related to the specific disease pathway. The hypothesis was tested here by determining if the pharmacogenomics components reviewed are important and related to the genetic basis for the particular disease type.

Hypothesis 4: This study also hypothesized that there are significant correlations between the differences in breast cancer disease severity and the genetic component of the particular disease type and for the various disease subtypes.

Overall, this study projected that understanding the patient's relevant predisposing factors, disease subtype, genetic component for the disease, specific disease pathway, the interactions between the various significant factors and the patient's pharmacogenomics basis for disease will not only help with implementing effective treatment solutions, but will also help with reducing variability in treatment outcomes and providing the foundation for developing effective and sustainable treatments (a new treatment model) for known and future disease types. It should also bring a significant reduction in adverse side effects from the treatment drug; hence this study.

1.5 Objectives of Research

Having identified the complexity of the breast cancer predisposing factors, disease pathways, the associated treatment gap, and the challenges and path for obtaining effective and successful treatment for the various disease subtypes, the overall objective of this study is ultimately to provide a framework for developing effective and sustainable treatment solutions for the disease type based on disease characterization and cancer subtype confirmation.

Therefore, this study will focus mainly on utilizing several clustering methodologies for the analysis of the study breast cancer genetic data. The goal was to understand and analyze the breast cancer microarray genomic data, identify and confirm the prevalent predisposing factor(s) for the disease type and pathway, extract the relevant disease patterns (severity/aggressiveness, recurrence, metastatic potential) from within the study data and ultimately gain a sustainable understanding of breast cancer disease for successful treatment solutions and possibly a cure.

Generally, whole genome expression profiles are widely used to discover molecular subtypes of diseases. For this study, two distinct breast cancer genomic data sets obtained from the Broad Institute at MIT were used for the analyses. These two data sets represented two disease categories for breast cancer in terms of disease incidence less than and greater than 5 years respectively. It is expected that the molecular subtypes for the disease can be determined from these data types.

However, an attendant and recognized challenge with the whole-genome expression analysis approach is that of accurately identifying the correspondence or commonality of subtypes found in multiple, independent data sets which may be generated on different platforms. It is understood that while model-based supervised learning is often used to make these connections, the models can be biased to the training data set and thus may miss inherent, relevant substructure in the test data. Therefore this study will take this identified handicap into consideration and investigate methods other than model-based supervised learning for analyzing the genomic expression profiles within the study data sets. It is hoped that the methods employed in this study will ultimately analyze the genomic data objectively and extract the required components for disease identification and subtype confirmation.

It is important to note that, preceding this study, various approaches including supervised and unsupervised clustering techniques have been employed on several occasions for these types of genomic analyses. Some have been classified as non-robust and/or as lacking the ability to discover subtle, context-dependent biological patterns, and hence they did not prove to be optimal methods.
Therefore, adaptable clustering methodologies that are designed for class discovery and clustering validation, and geared towards genetic data analysis and extraction of the disease genetic component, will be used in this study.

These clustering methodologies are expected to provide robust, statistically grounded, gene-relevant clustering which will effectively classify the underlying factors that are responsible for the various breast cancer subtypes and assist in identifying the disease profile for treatment. It is also hoped that the intended cluster analysis tools will reveal associations and structure in the data which, though they may not have been evident in prior studies, could nevertheless provide genetic relevance and insight into disease biology through this study.

For this study, I have described the unsupervised hierarchical and sparse hierarchical methods as the preliminary clustering tools, which were used in the initial classification to provide baseline clusterings that were subsequently used to confirm and identify significant clusters. This was followed by the NMF clustering tool. NMF (non-negative matrix factorization) is a gene-based clustering methodology/algorithm, used specifically for analyzing microarray genetic data and investigating the genetic components of the disease required to reveal distinct and common relatable gene pools (within resulting clusters) between the independent data sets.

It was projected that in order for these analyses to be successful, a measure of correspondence had to be defined for distinct or qualifying relatable subtypes, while ensuring that the cluster analysis methods chosen did not impose the structure of one data set upon another, but rather would use a bi-directional approach to highlight the common substructures in both data sets in such a way that the significance of the clustering can be evaluated. It is projected that the methods for this study will reveal the correspondence between these cancer-related data sets and identify the common subtypes of breast cancer, including but not limited to estrogen receptor status, for all patients who share similar survival patterns, so as to improve the accuracy of a clinical outcome predictor.
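The NMF step described above can be illustrated with a minimal sketch. It assumes the standard multiplicative update rules for the squared-error objective, which is one common NMF formulation and not necessarily the one implemented in the software used for this study. The tiny expression matrix is made up, and the helper names (nmf, assign_samples, frob_error) are hypothetical. Each sample is assigned to the metagene with the largest coefficient once the column scale of W has been moved into H, a normalization chosen here so that the assignment does not depend on arbitrary scaling.

```python
import random

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_error(V, W, H):
    """Squared Frobenius norm of V - WH (the quantity the updates reduce)."""
    WH = matmul(W, H)
    return sum((v - r) ** 2 for rv, rr in zip(V, WH) for v, r in zip(rv, rr))

def nmf(V, k, n_iter=300, seed=0, eps=1e-9):
    """Factor a non-negative genes x samples matrix V into
    W (genes x k metagenes) and H (k metagenes x samples)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return W, H

def assign_samples(W, H):
    """Label each sample by its dominant metagene, after moving the
    column sums of W into the rows of H (illustrative normalization)."""
    scale = [sum(row[a] for row in W) for a in range(len(H))]
    return [max(range(len(H)), key=lambda a: H[a][j] * scale[a])
            for j in range(len(H[0]))]

# Toy data (assumption): 6 genes x 6 samples with two expression blocks.
V = [[5.0] * 3 + [0.5] * 3 for _ in range(3)] + \
    [[0.5] * 3 + [4.0] * 3 for _ in range(3)]

W, H = nmf(V, k=2)
labels = assign_samples(W, H)
print(labels)  # one metagene index per sample
```

On this toy matrix the first three samples should share one label and the last three the other; which label is 0 and which is 1 depends on the random initialization.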
It is hoped that NMF's classification potential, adapted to work for any stochastic clustering algorithm, would assist in the identification and extraction of relevant gene clusters with

pharmacogenomics relevance and so enable the understanding of the gene function(s) and mechanisms that may be responsible for the different breast cancer subtypes and for disease persistence, severity and recurrence. It is also projected that this robust statistical clustering will effectively identify, extract and classify the underlying factors that are responsible for the various breast cancer subtypes.

Specific research objectives include the following:

Utilize several clustering/statistical techniques to analyze the two breast cancer gene data sets obtained from Broad Institute/MIT. These sets of breast cancer data represent short term and long term disease sets (breast cancer disease data sets of less than 5 years and greater than 5 years).

Specific goals for this study include the following:

Specific Aim 1: Isolate and classify specific genes that persist in both breast cancer data sets that characterize the two distinct sets of patients with disease less than and greater than 5 years, representing short term and long term disease sets. The goal here is to identify the gene clusters that are potentially relevant for biological function within the two data sets respectively and develop a comprehensive clinical information model that will enable further investigation into the pharmacogenomics complexities of the disease.

Specific Aim 2: Next is to understand the biology of the genes that resulted from the clustering, investigate their relevance and contribution to disease (specific mutation types and disease pathway) within the derived clusters and identify their relevant function or aspect(s) that are significant to disease variability, aggressiveness, recurrence, treatment sustainability and/or favorable outcomes. Based on the resulting underlying genetic coding for the disease, the knowledge gathered from these investigations will provide insight into the genetic basis of the

disease, gene to gene interaction/dynamics and pathway for disease, disease category, potential treatment profiles and classifications, and possibly guided, effective and sustainable therapy options.

Specific Aim 3: Identify the genes that persist and are common within the two breast cancer datasets, and evaluate their significance for disease potential/type (short or long term disease), severity, metastasis, persistence, mechanism of action and pharmacogenomics implication for the disease. The long term goal here is to develop a targeted pharmacogenomics information guideline which can be applied towards understanding the breast cancer disease factors that characterize long or short term disease in terms of specific gene action(s) and the interactions that are relevant for identifying and confirming potential breast cancer type at diagnosis or at any stage of the disease, in order to derive the applicable tailored treatment format for the disease type and a sustainable model/guideline for successful treatment. The derived guideline, once determined for known cases, will represent a validated model and a framework for understanding new unclassified disease entrants and predicting disease form and mutation type. This information model would provide a standardized, predictive and sustainable structure and parameters suitable for designing effective treatment for most if not all variants of breast cancer disease for future disease management and general breast cancer care.

Specific Aim 4: Verify the second part of the central study hypothesis, which states that incorporating the pharmacogenomics component into disease diagnosis and treatment will improve the recognition of the particular disease type, basis, incidence/subtype and mutation type, the identification of an effective guided/tailored therapy, and favorable treatment outcome(s).

The main goal here is to evaluate the resulting pharmacogenomics model, its suitability for predicting gene assignments specific to disease type and its reliability for accurate prediction. Additionally, it will be determined if the model is reliable enough to support the pattern prediction of the various clinical forms of the disease and the applicable treatment type. Applying a pharmacogenomics model or framework at breast cancer diagnosis and initial treatment analysis can provide disease information that can be translated to the bedside during disease treatment and clinical care. It will also serve as a framework whereby information for existing disease types can be mapped for understanding newer disease forms and developing an appropriate individualized treatment model for those new forms.

My study proposed the following specific actions necessary to achieve the established sub-objectives. Fundamentally, it also involved conducting cluster analyses utilizing three different clustering analysis techniques/modules from MIT's GenePattern software to study the two distinct breast cancer data sets.

Implementation for the Specific Aims

For Specific Aim 1, the information content with respect to the specific cluster group formation and the responsible gene data elements were identified and extracted from both breast cancer data sets of less than and greater than 5 years. The functional data extracted from the resulting gene elements were utilized for the development of a candidate list/matrix of relevant gene/mutation categories that are significant to the specific disease type.

The specific clustering techniques used for these analyses included hierarchical clustering, sparse hierarchical clustering, non-negative matrix factorization (NMF) and consensus clustering, followed by pharmacogenomics evaluations.

Hierarchical and sparse hierarchical clustering were used in the initial preparatory clustering and preliminary classification of both study data sets. The goal here was to extract the fundamental clusters from both data sets in order to pre-ascertain the major cluster categories and establish strong or weak degrees of association between members of different clusters before performing NMF clustering. Given that hierarchical clustering is highly sensitive to the measurements used to assess distance, it was expected that these initial clusters would provide significant insight into relatable clusters that could further be confirmed using the NMF clustering algorithm.

Following this initial clustering, the NMF consensus clustering algorithm was then run with the expectation that it would extract features from the data that may more accurately correspond to biological processes and provide further intuitive decomposition by parts necessary to derive the clusters of gene events based on its non-negative matrix factorization. It was expected that NMF would further reduce the dimension of the gene expression data from thousands of genes to a handful of meta-genes or meta-samples, each of which represents a group of genes or samples respectively, with each cluster clearly describing the class to which its members belong [9].

Finally, the consensus clustering algorithm was run to further classify the relevant genes that are significant. The consensus technique was used to determine an optimal number of clusters by running its selected algorithm several times and then assessing the stability of discovered clusters.
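The consensus idea described above can be sketched as follows. This is a minimal illustration under stated assumptions: run-to-run variation comes from random subsampling of the samples, the base algorithm is average-linkage hierarchical clustering with Euclidean distance, and the stability score (the mean of |2c - 1| over all pairwise consensus values c) is a simple stand-in for the stability measures used by dedicated consensus clustering tools. The toy data and the function names are hypothetical, and the module used in the study may differ in its details.

```python
import random

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def average_linkage(points, k):
    """Agglomerative clustering cut at k clusters.
    Returns clusters as lists of indices into `points`."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = sum(euclidean(points[i], points[j])
                        for i in clusters[x] for j in clusters[y])
                d /= len(clusters[x]) * len(clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] += clusters.pop(y)  # y > x, so index x is unaffected
    return clusters

def consensus_matrix(samples, k, n_runs=200, n_sub=4, seed=0):
    """Fraction of runs in which two samples fall in the same cluster,
    counted over the runs in which both were drawn."""
    rng = random.Random(seed)
    n = len(samples)
    together = [[0] * n for _ in range(n)]
    drawn = [[0] * n for _ in range(n)]
    for _ in range(n_runs):
        idx = rng.sample(range(n), n_sub)
        for i in idx:
            for j in idx:
                drawn[i][j] += 1
        for cluster in average_linkage([samples[i] for i in idx], k):
            members = [idx[p] for p in cluster]
            for i in members:
                for j in members:
                    together[i][j] += 1
    return [[together[i][j] / drawn[i][j] if drawn[i][j] else 0.0
             for j in range(n)] for i in range(n)]

def stability(consensus):
    """1.0 when every pair is always together or always apart."""
    n = len(consensus)
    vals = [abs(2 * consensus[i][j] - 1)
            for i in range(n) for j in range(i + 1, n)]
    return sum(vals) / len(vals)

# Toy data (assumption): six samples measured on two hypothetical genes.
SAMPLES = [
    (0.0, 0.0), (0.0, 1.0), (2.0, 0.0),        # group A
    (10.0, 10.0), (10.0, 11.5), (12.5, 10.0),  # group B
]

for k in (2, 3):
    print(k, round(stability(consensus_matrix(SAMPLES, k)), 3))
```

In this toy example a cut at two clusters reproduces the same grouping in every run, so its stability score is 1.0, whereas a cut at three clusters splits one of the groups differently from run to run and scores lower. Comparing such scores across candidate values of k is one simple way to choose the number of clusters.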
The overall goal for the consecutive clustering analyses was to organize both the genes and samples and reduce the tens of thousands of genes within the genome down to a small number

of meta-genes in order to capture the alternative structures inherent in the data, extract the relevant biological correlations, or molecular logic, in both the Breast_A.gct and Breast_B.gct gene expression data sets, elucidate meaningful biological information from the cancer-related microarray data and, ultimately, recover the breast cancer subtypes using the different clustering algorithms.

For Specific Aim 2, the main goal is the following: the information content relevant for specific cluster formation and the responsible gene data elements will be identified and extracted from the resulting candidate gene list from both clustered breast cancer data sets. These extracted gene data elements will be used to develop a candidate gene list matrix and disease profile matrix that are significant to the disease. The gene data elements of the significant clustered genes from the candidate list will be evaluated for their importance in terms of disease pattern and their function for recurrence, treatment sustainability and outcomes.

For Specific Aim 3, the main goal is the following: the main purpose of Specific Aim 3 was to perform gene function analysis and classification in order to evaluate the genes within the formed matrix for disease potential in terms of mechanism and metastasis. This was achieved by conducting analyses to refine, confirm and validate the clustering results through model selection. A further purpose was to determine the biological functions/mechanisms of the extracted genes within the resulting clusters and their genomic implications within the disease clusters. Possible inter-gene relationships that impact the genetic coding for breast cancer were evaluated to further confirm the significance of the genes within the clustered pools.

Following gene classification, the next steps were to refine, confirm and validate the resulting clusters by evaluating the consistency, robustness and performance of the specific algorithm. This was performed in order to ensure that the clustering algorithms used in the clustering analyses returned meaningful clusters, while providing a biologically meaningful decomposition of the study genetic data. This process confirmed the validity of the clustering analyses and the obtained results.

For Specific Aim 4, the goal is the following: to develop a targeted pharmacogenomics information guideline and model which will be applied to address the treatment format for most breast cancer forms. This pharmacogenomics evaluation involved cross-functional bioinformatics analyses that utilized the gene function and coding of clustered genes to discern their correlation to disease incidence, progression, aggressiveness and recurrence. This was performed to verify whether incorporating the pharmacogenomics component into disease diagnosis and treatment will improve the recognition of the particular disease type, content and incidence for a more effective guided/tailored therapy and treatment outcome. A favorable outcome for this hypothesis would enable a framework for understanding new disease forms by utilizing relevant parameters from the existing validated model. This framework would represent a standardized, predictive and sustainable structure suitable for use at diagnosis and for determining effective treatment for most if not all variants of breast cancer disease for future disease management and general breast cancer care.

To achieve this specific aim, an evaluation study integrating all the resulting clusters and genes from the different clustering analyses was conducted in order to extract pharmacogenomics insights for model development.
The pharmacogenomics information model for gene assignments

and predictive treatment outcome in both short and long term breast cancer disease incidence will be evaluated to determine if the model is reliable enough to represent or support the different determined and/or undetermined clinical forms of the disease in terms of predictive treatment outcomes.

1.6 Results (anticipated) and Significance of Research

The anticipated study results are expected to contribute to the new definition of a formal classification scheme of the different gene pools for the various breast cancer disease forms, such as benign, malignant, aggressive or recurrent types. It is to be noted that this investigation was not intended to return recommendations based on the treatment drugs for the patient categories. This study will also not take into consideration the effect of the treatment drugs that were used for treating the patients in relation to the gene expression data analyses. However, this study is intended to return recommendations based on the genes that need to be targeted for treatment, which in turn will be based on the particular genes that are significant for the disease type, modeled from the clustering results that are returned in this investigation.

The emphasis of this study was to be able to diagnose the breast cancer disease type effectively once the disease is presented at the hospital and to utilize the knowledge gained from the patient's gene expression analyses (for example, ER positive or negative, basal or luminal type or HER2 positive, possible disease pathway and drugs that can effectively block that pathway and the genes from expressing) in further treatment of the patient with sustainable treatment solutions.

A significant outcome hoped for in this study was first to identify the particular genes and the biological functions that characterize persistent, aggressive or recurrent disease forms and then to understand the mechanisms of action of the genes that code for the various forms.

The study analyses will then suggest statistical models with which to describe the resulting breast cancer populations/gene pools, indicate rules for assigning new breast cancer cases to classes for identification and diagnostic purposes, determine exemplars to represent the subclasses and ultimately provide measures of definition for disease characterization and variability. This will bring about a change of diagnostic consideration in what previously were only broad concepts of breast cancer disease.

This study is also expected to help significantly in creating a model or framework for effective breast cancer treatment. It will also help to provide pharmacogenomics insight at diagnosis which can be translated to the bedside during treatment and clinical care. Once a pharmacogenomics structure/information model is determined for the various breast cancer forms, the model can then serve as a framework for understanding newer disease forms and developing an appropriate individualized treatment model for those new forms by mapping from existing disease types.

Finally, it is hoped that the results from this study will help physicians and oncologists to make better use of effective, well-guided therapies for individualized medicine and to eliminate unnecessary adverse effects of treatment for patients.

Chapter II

2 LITERATURE REVIEW

Review of relevant Literature

Breast cancer disease is extremely diverse. It can occur in various forms with varying degrees of aggressiveness, recurrence and gene involvement. These numerous facets of variability have continued to impact breast cancer research, the development of diagnostic tools, drug development, the advancement of disease treatment and/or total cure. Disease variability is the same reason sustainable effective treatments so far are possible only for narrow patient groups, while still presenting a significant challenge for the larger breast cancer patient population, where variable treatment outcomes are hugely evident. The issue of predictive breast cancer treatment and outcome variability continues to persist and has been linked primarily to the diversity of factors/gene implications that may underlie the various forms of breast cancer disease.

Breast cancer biology and gene implications

In almost all hereditary breast cancer cases, the two main implicated genes have been known to be the BRCA1 (Breast cancer 1) and BRCA2 (Breast cancer 2) genes, both early onset breast cancer genes. TP53 (tumor suppressor protein p53) is another gene relevant to hereditary breast cancer.

[Figure panels: Breast cancer 1, early onset (PDB rendering based on 1jm7); Breast cancer 2, early onset (PDB rendering based on 1n0w)]

Figure 2-1 BRCA1 (Breast cancer 1) and BRCA2 (Breast cancer 2) genes [1]

Both BRCA1 and BRCA2 belong to a class of genes known as the tumor suppressor gene family. Functionally, both genes help to ensure the stability of the cell's genetic material (DNA) and prevent uncontrolled cell growth by maintaining the genomic integrity needed to prevent dangerous genetic changes [2]. Typically, normal cells divide as many times as needed and then stop. They attach to other cells, stay in place in tissues and commit cell suicide (apoptosis) when they are no longer needed. Until then, normal cells are protected from cell suicide by several protein clusters and pathways. When normal cells divide, their DNA is copied with some mistakes, and error-correcting proteins are relied upon to fix those mistakes. The mutations known to cause breast cancer, such as those in BRCA1 and BRCA2, occur in these error-correcting mechanisms. Cells therefore become cancerous when such mutations destroy their ability to stop dividing, to attach to other cells or to stay where they belong. Presumably, these initial mutations of BRCA1 and BRCA2 allow other mutations, which in turn allow uncontrolled division, lack of attachment and metastasis to distant organs [19][23]. Studies of these mutations are therefore highly relevant to understanding how breast cancer develops.

Breast cancer mutations have been categorized into two main groups: those inherited from parents and those acquired after birth. With inherited gene mutations, certain inherited DNA changes increase the risk of developing cancer and are responsible for the cancers that run in some families. These mutations can be changes in one or a small number of DNA base pairs (the building blocks of DNA). In some BRCA1 mutation cases, large segments of DNA are rearranged. These changes, called large rearrangements, can be either a deletion or a duplication of one or several exons in the gene. BRCA2 mutations are usually insertions or deletions of a small number of DNA base pairs in the gene. As a result, the protein product of the BRCA2 gene is abnormal and does not function properly. Such mutations build up and cause cells to divide in an uncontrolled way to form a tumor.

Despite all this seemingly available information on breast cancer, what is currently and definitely known is that a woman's risk of developing breast, ovarian and other cancers is greatly increased if she inherits a deleterious (harmful) BRCA1 or BRCA2 gene mutation. However, just how this happens is still not fully understood, and this presents a very significant challenge to breast cancer treatment success, which is a motivation for this study and various others. It is therefore believed that a thorough evaluation and analysis of the huge volume of genetic data now available can provide significant insight into the genes that are relevant to disease proliferation or cure, and can lend more understanding of the various breast cancer mechanisms of action as they relate to the diverse types.

2.1 Review of related literature

Cancer, being a complex and diverse disease, has become a prime target for the application of novel technologies often referred to as the 'omic' platforms. It is expected that these new platforms may help to identify subgroups of cancer patients having specific molecular features associated with clinical outcomes [66-69]. The hope is that the solutions generated through these platforms will include complete genome knowledge, which will ultimately lead to a more predictive, individualized approach to cancer care and facilitate the selection of the treatment modalities most likely to benefit the individual patient. With the completion of the human genome project and the current availability of these novel and powerful technologies within genomics, proteomics and functional genomics, there is a general expectation that these technological advances, together with improved disease data sources and robust data mining techniques, will have a major impact on clinical practice, with the potential to change the way in which diseases are diagnosed, treated and monitored in the near future.

Several research studies on breast cancer disease understanding and treatment have been conducted and many more are ongoing. With emerging technologies, these studies include categories such as discovery-driven, genetic and translational studies. Some have involved the study of single cell lines and sometimes multiple cell lines. Some of these have generated significant solutions that have contributed to disease understanding, but overall they have yet to provide all the missing details for the various breast cancer disease forms in terms of disease incidence, varying disease persistence, recurrence and, most importantly, the selection of targeted, effective and optimal treatment options.

Some emerging studies involving microRNA expression profiling of human breast cancer have recently begun and are ongoing. MicroRNAs (miRNAs), a class of short non-coding RNAs found in many plants and animals, often act post-transcriptionally to inhibit gene expression. Overall, the function of human miRNAs is largely unknown. However, despite their recent discovery, a few miRNA studies have been conducted in breast cancer research. A number of miRNAs have been found to have oncogenic potential, and strong links between miRNAs and human cancer are emerging. The first miRNA study performed was used to classify breast tumors as luminal A, luminal B, basal-like, HER2+ and normal-like and to identify new markers of tumor subtype [185]. The study involved an integrated analysis of miRNA expression in 93 primary human breast tumors, together with mRNA expression and genomic changes in human breast cancer, using a bead-based flow cytometric miRNA expression profiling method as a suitable platform for classifying breast cancer into prognostic molecular subtypes. This may serve as a basis for functional studies of the role of miRNAs in the etiology of breast cancer. Two additional miRNA studies have shown that a number of miRNAs are deregulated in human breast cancer [182,183]. A third study found that a number of miRNAs were differentially expressed in breast tumor biopsies and that miRNA expression correlated with HER2 and estrogen receptor (ER) status [184].

Another breast cancer study combined microRNA analysis and microarray technology to evaluate breast cancer molecular subtypes. MicroRNAs regulate mRNA stability and translation through the action of the RNA-induced silencing complex (RISC). In this study, the researchers systematically identified endogenous miRNA target genes by using AGO2 immunoprecipitation (AGO2-IP) and microarray analyses in two breast cancer cell lines, MCF7

and MDA-MB-231, representing luminal and basal-like breast cancer, respectively [181]. Another study performed expression profiling of both miRNA and mRNA from the same breast cancer subtype samples. The researchers analyzed dual expression profiling data of miRNA and mRNA derived from the expression profiling of 489 miRNAs in 41 luminal A breast tumor samples and 15 basal-like samples. They defined a correlation coefficient ratio (CCR) and used it to examine the correlative dysregulated relationships between miRNAs and mRNAs [186].

Recently, cluster analysis methods have also been employed, to varying degrees and from varying perspectives, in breast cancer data mining studies. They have been used to investigate how the behavior of gene analysis data can reveal the ways in which genetic mutations and their adaptations underlie breast cancer disease.

The emergence of DNA microarray technologies has also made possible significant and very promising advances in genetic and genomic breast cancer studies. With the advent of DNA microarrays, it is now possible to analyze large amounts of gene expression data simultaneously and to monitor the expression of all the genes in the genome. This is currently the most common approach in recent genetic-based cancer studies. However, the mounting challenge with the outcomes of these genetic studies remains the ability to interpret the resulting data effectively, so as to gain insight into the biological processes and disease mechanisms that are relevant for adequate disease diagnosis and treatment selection. Specifics of some of these studies are described below.

2.2 Discovery-driven translational research in breast cancer

Discovery-driven translational research is an aspect of breast cancer study that was ushered in with the hope that it would provide additional solutions for some of the missing pertinent disease

information as well as treatment understanding, where others have failed or are lacking. Discovery-driven translational cancer research has only recently gathered momentum among the basic and clinical research community. As a result, there are currently few research programs that use tissue biopsies and bio-fluids as their main sources for generating multiple cancer data sets for translational research analyses, which could be translated into relevant pharmacogenomics information [66-69].

Discovery-driven translational research in breast cancer focuses mainly on the general mechanism of action of the disease, through the study of single and sometimes multiple cell lines, to generate translational pharmacogenomics data that will be relevant for correct diagnosis and treatment selection for all patients. It then makes the resulting pharmacogenomics information available to the physician, oncologist, specialist and other healthcare personnel at the point of care delivery (at diagnosis) and at the bedside, in a structured, translated, integrated and predictive format. In doing so, it pushes disease treatment to become more tailored and to follow individualized, guided patterns, with specific drugs prescribed for the applicable breast cancer type. The rationale is that failure to use this logical structure and content at diagnosis, and subsequently at the patient's bedside during treatment, can result in irrecoverable treatment inadequacies and, most unfortunately, delays in cure. Recently, however, discovery-driven translational research in breast cancer has been moving steadily from the study of cell lines to the analysis of clinically relevant samples that, together with the ever-increasing number of novel and powerful technologies available within genomics, proteomics and functional genomics, promise to have a major impact on the way breast cancer will be diagnosed, treated and monitored in the future [65].
At present, the translational breast cancer research platform is experiencing a transition from the use of single 'omic' platforms to their integrated use to obtain systems biology knowledge. It is hoped that this integrated

approach will lead to a better understanding of the underlying biology of living cells and organisms, resulting in turn in a more effective translation of basic discoveries into clinical applications. It is also believed that the solutions emerging from these developments may help lead to a more predictive, individualized approach to cancer care, as well as facilitate the selection of the treatment modalities most likely to benefit the individual patient.

2.3 Breast cancer translational research specific study aspects

Many valuable translational studies have been conducted in breast cancer research. A translational research study at the Danish Centre for Translational Breast Cancer Research, which focused on long-term ongoing strategies in search of markers for early detection and targets for therapeutic intervention, identified signaling pathways affected in individual tumors. This study integrated multiplatform 'omic' data sets collected from individual patients' tissue samples. It used a systems-biology approach that combined knowledge-based, complementary procedures to understand specific breast cancer mechanisms.

Another breast cancer research study investigated the cellular heterogeneity of cancer cells within the tumor microenvironment. The goal of this study was to gain further understanding of the disease environment and of malignant breast cancer cell differentiation. Tumors usually contain malignant cells showing different degrees of differentiation as well as other cell types, which together compose the 'tumor microenvironment' [17-29]. The study of breast cancer cell differentiation is a very important aspect of disease diagnosis and treatment selection. Micro-dissection technology, which allows for the dissection of a defined set of purified cell populations [30,31], has generally been used to address this cellular heterogeneity problem to a great extent. However, it has not been able to solve the problem altogether.
Immunohistochemistry, another technology, has been used in translational cancer studies to detect heterogeneity even in a small number of cells when several markers are assessed simultaneously.

Other translational research studies have evaluated the application of cDNA microarray-based expression profiling to allow for the analysis of single cells through elegant amplification strategies. Some of the studies reporting expression profiles generated from laser-captured cells have demonstrated the feasibility of the approach [10-11]. Similar single-cell analysis of cultured cells using gel-based proteomics [35-37], on the other hand, does not provide a viable alternative at the moment, as more sensitive protein detection methods are needed to identify a significant number of proteins [38].

Celis et al. [13] completed studies that investigated the impact of integrating proteomic and functional genomic technologies in translational cancer research. They concluded that array and proteomic technologies in particular can play a key role in the study and treatment of human cancers, as they provide invaluable resources to define and characterize regulatory and functional networks of genes and proteins within cells. They also concluded that proteomics provides tools that can be used to investigate the precise molecular defect(s) in cancer tissues, tools that enable the development of specific reagents for better understanding the different stages of disease pathology, and tools for identifying new clinically relevant drug targets, as well as functional insight for drug development [35-37].

In spite of the advances in breast cancer translational research, some of the research procedures unfortunately present limitations. First, they can be very demanding and time-intensive. Typical translational research procedures involve the application of high-throughput technologies to the analysis of tissue biopsies. These procedures are typically far more demanding than the analysis of bio-fluids required for pharmacogenomics analyses.
This is due to the heterogeneous nature of the tissues required for translational research and related sample preparation issues, as well as problems associated with managing long-term prospective/retrospective programs [10,11,5].

In addition, the implementation of discovery-driven translational cancer research requires the coordination of basic research activities, facilities and infrastructures. It also requires the creation of an integrated and multidisciplinary expert environment to build collaboration among the stakeholders, surgeons, oncologists, pathologists, epidemiologists, patients and patient advocacy groups participating in the cancer study. This intricate collaboration can add another level of complexity, which could introduce delays in delivering results. Further translational research challenges relate to cancer sample collection, sample handling and storage, standardization of translational protocols, references, number of patients, availability of normal controls, access to bio-banks, tissue arrays, clinical information, follow-up clinical data, and computational and statistical analysis, as well as ethical considerations, which are critical and must be carefully considered and dealt with from the beginning of such programs [10,11,5]. Together, these points are evidence of the challenges facing basic translational studies and of their resulting deficiencies in providing, without the pharmacogenomics aspects, all the pertinent information required to bridge the gap towards optimal, predictive and sustainable breast cancer treatment.

2.4 Breast cancer pharmacogenomics studies complementing translational studies

So far, it has been demonstrated that translational research studies, as a standalone treatment component, lack focused genetic investigation and have yet to deliver effective, predictive and sustainable solutions that can successfully reduce treatment outcome variability. This gap continues to present challenges to successful disease treatment. To bridge that gap, breast cancer treatment has recently been incorporating some pharmacogenomics and proteomics considerations at the point of diagnosis or throughout cancer

care. Genomics is the study of all the genes in a cell or organism, while proteomics is the study of all the proteins. Molecular diagnostics determines how these genes and proteins interact in a cell. It focuses on gene and protein activity patterns in different types of cancerous or precancerous cells, uncovers these sets of changes and captures the expression pattern information. Also called "molecular signatures," these expression patterns improve clinicians' ability to diagnose cancer. Soon all cancers may be diagnosed this way.

Pharmacogenomics helps to analyze a tumor's specific genetic make-up through gene expression studies in order to guide cancer treatment decisions, ensure better patient outcomes and avoid unnecessary toxicity. Pharmacogenomics studies inherited differences in inter-individual drug disposition and effects, with the goal of providing a guide towards selecting the optimal drug therapy and dosage for each patient. It also studies genetic polymorphisms in drug-metabolizing enzymes and other molecules responsible for much of the inter-individual difference in the efficacy and toxicity of many chemotherapy agents, which is not possible in standalone translational research. Overall, pharmacogenomics is especially important in oncology, as severe systemic toxicity and unpredictable efficacy are hallmarks of cancer therapies. Examples of existing pharmacogenomics studies are described below.

A pharmacogenomics study conducted by Pribylova et al. (2009) was part of a long-term study of 180 women with ER+ breast cancer. It investigated the clinical outcome of tamoxifen-treated patients with estrogen receptor-positive breast cancer, using the AmpliChip CYP450 Test to identify CYP2D6 genotypes that have an impaired ability to metabolize tamoxifen into the active metabolite, endoxifen [53].
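Preliminary results for 72 patients from the study showed evidence that extensive metabolizers (EM) and ultrarapid metabolizers (UM) have a significantly lower percentage of disease progression (22% and 0%, respectively) compared to intermediate metabolizers (IM) (67%). No progression was observed in poor metabolizers (PM), a finding that needs further monitoring and investigation [53].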

Within the same study, the authors also sought to determine the pharmacogenomic prediction of breast cancer treatment outcome with adjuvant tamoxifen in postmenopausal women with estrogen receptor-positive breast cancer [71]. In this part of the study, they sought to determine the share of ultrarapid (UM), extensive (EM), intermediate (IM) and poor (PM) metabolizers among postmenopausal women with estrogen receptor-positive breast cancer undergoing treatment with tamoxifen. They concluded that the clinical outcome of tamoxifen treatment may be influenced by the activity of the cytochrome P450 enzymes that catalyze the formation of the antiestrogenic metabolites endoxifen (the active metabolite) and 4-hydroxytamoxifen, which may affect drug uptake and treatment effectiveness. Their results tentatively supported the hypothesis that genetic variants associated with CYP2D6 activity could help clinicians determine the therapeutic strategy when selecting treatment options.

AmpliChip CYP450 Phenotype    No.    Disease Progression    %
EM                                                          %
UM                            1      0                      0%
IM                                                          %
PM                            5      0                      0%
TOTAL                         72

Table 2-1 Clinical outcome of tamoxifen-treated breast cancer patients using the AmpliChip CYP450 Test

Another pharmacogenomics study involved a Phase II trial of the oral PARP inhibitor olaparib in BRCA-deficient advanced breast cancer [73]. Olaparib (AZD2281; KU ) is a novel, orally active PARP inhibitor that induces synthetic lethality in homozygous BRCA-deficient cells. Eligibility criteria for patient recruitment included a confirmed BRCA1/BRCA2 mutation and recurrent, measurable, chemotherapy-refractory breast cancer. The primary aim of this study was

to test the efficacy of olaparib in confirmed BRCA1/BRCA2 carriers with advanced refractory breast cancer, and the secondary aim was to assess safety and tolerability in this population. The phase I trial of olaparib identified 400 mg bd as the maximum tolerated dose (MTD), with an initial signal of efficacy in BRCA-deficient ovarian cancers [73]. In an international, multicenter, proof-of-concept, single-arm, phase II study, two sequential patient (pt) cohorts received continuous oral olaparib in 28-day cycles, initially at the MTD of 400 mg bd (27 pts) and subsequently at 100 mg bd, a previously identified PARP inhibitory dose (27 pts). Overall, 54 pts who had received a median of three prior lines of chemotherapy were enrolled. 27 pts were dosed at 400 mg bd (18 BRCA1 deficient and 9 BRCA2 deficient), and 24 of these had databased RECIST assessments. All adverse events were reported using CTCAE v3. The primary efficacy endpoint was the best objective response rate (ORR; RECIST) post baseline. Progression-free survival (PFS) and clinical benefit rate were secondary endpoints. The ORR (currently based on unconfirmed responses) was 38% (9/24) at 400 mg bd. Causally related toxicity was mainly mild (grade 1-2) in severity: 9/27 pts (33%) had fatigue, 7/27 (26%) had nausea, 4/27 (15%) had vomiting and 1/27 (4%) had anemia. Causally related grade 3 or higher toxicities were seen in 5 pts (19%), with fatigue (3 pts), nausea (2 pts) and anemia (1 pt). 27 pts were treated in the subsequent 100 mg bd cohort, for which no data are currently available. The study concluded that olaparib at 400 mg bd is well tolerated and highly active in advanced chemotherapy-refractory BRCA-deficient breast cancer. Toxicity in BRCA1/BRCA2 carriers was similar to that reported previously in non-carriers.
This first study with olaparib in BRCA-deficient breast cancers provided positive proof of concept for high activity and tolerability of a genetically defined targeted therapy.
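The response and toxicity rates quoted above for the 400 mg bd cohort can be reproduced from the reported counts. The short Python sketch below is illustrative arithmetic only, using the figures given in the text; the function and variable names are hypothetical and are not taken from the cited study.

```python
# Illustrative arithmetic only: recompute the percentages quoted in the text
# from the counts reported for the 400 mg bd cohort. All names are hypothetical.

def pct(events: int, total: int) -> int:
    """Return events/total as a percentage rounded to the nearest integer."""
    return round(100 * events / total)

EVALUABLE = 24  # patients with databased RECIST assessments
TREATED = 27    # patients dosed at 400 mg bd

# Objective response rate among evaluable patients.
orr = pct(9, EVALUABLE)

# Causally related toxicities among all treated patients.
toxicity = {
    "fatigue": pct(9, TREATED),
    "nausea": pct(7, TREATED),
    "vomiting": pct(4, TREATED),
    "anemia": pct(1, TREATED),
}

# Patients with any causally related grade 3 or higher toxicity.
grade3_plus = pct(5, TREATED)

print(f"ORR: {orr}%")
for name, value in toxicity.items():
    print(f"{name}: {value}%")
print(f"grade 3 or higher: {grade3_plus}%")
```

Note that the ORR uses the 24 evaluable patients as its denominator, whereas the toxicity rates use all 27 treated patients.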

In most of the pharmacogenomics research cases described above, the research goals centered predominantly on the specific genomic/genetic aspect of the disease, for disease understanding and predictive treatment solutions. In general, the genetic changes involved in cancer result in altered proteins that disrupt the cell's communication network. Altered proteins along many different pathways cause signals to be garbled, intercepted, amplified or misdirected. These cancer-associated changes hijack what was once normal communication and use it to achieve uncontrolled tumor growth, and they may not necessarily be the same in all breast cancer subtypes. The hope for pharmacogenomics studies is therefore to understand the disease biology and variability of the various breast cancer subtypes, so as to provide tailored, effective and sustainable treatment with reduced or totally eliminated adverse reactions for the individual patient. It is expected that pharmacogenomics studies, in combination with discovery-driven translational studies, microarray technology and effective data mining techniques, will provide the much-needed integrated components necessary to overcome the limitations of standalone translational research and to understand breast cancer disease diversity. Additionally, this integrated approach will enable the evaluation of whole-gene expression information using microarray technology, in order to understand genomic and phenotypic disease details and to gain insight into gene polymorphisms that may influence the outcome of cancer therapy as well as disease progression/recurrence dynamics for the various subtypes.

2.5 Clustering analyses techniques used in prior breast cancer studies

In addition to the pharmacogenomics techniques, some cluster analysis techniques have also been used to mine breast cancer data.
Clustering techniques are effective analysis tools because of their overall classification potential and sensitivity for identifying relevant classes and sub-classes

of resulting data within a given group of primary data. Clustering techniques are also well suited to huge amounts of data, because they run on algorithms that compare massive amounts of data for similarities and differences and assign the clustered groups to classes, based on distances, at heightened sensitivity levels. Clustering techniques in general have been found effective and versatile in both biological and non-biological data analyses. Technically, as an exploratory data analysis tool for solving classification problems, the object of cluster analysis is to sort cases (here, gene events) at various levels of sensitivity into groups, or clusters, so that the degree of association is strong between members of the same cluster and weak between members of different clusters. Each cluster thus describes, in terms of the data collected, the class to which its members belong, with the description abstracted from the general class to the particular class or type.

The main challenge for cancer detection and diagnosis is to locate the renegade genes and proteins (the deranged, defective and dominating molecules) that hijack communication in once-normal cells. This usually requires opening the cell and observing and analyzing the biomolecules inside for early detection and diagnosis. This can be accomplished through cluster analysis, and several breast cancer studies have used cluster analysis techniques to mine cancer data. In the clustering study performed by Harris et al. (2002), the protein expression maps (PEMs) of 26 breast cancer cell lines and three cell lines derived from normal breast or benign disease tissue were visualized by high-resolution two-dimensional gel electrophoresis.
Analysis of the resulting data was performed with ChiClust and ChiMap, two analytical bioinformatics tools designed to facilitate recognition of specific patterns shared by a series of two or more PEMs. Both tools use PEMs that were matched by an image analysis program and locally written programs to create a

match table that is saved in an object-relational database. The ChiClust and ChiMap methods are not dependent on any particular commercial image analysis program, and the whole software package gives an integrated procedure for the comparison and analysis of a series of PEMs. In the referenced study, the ChiClust tool was used to order the breast cell lines into groups according to biological characteristics, including morphology in vitro and tumour-forming ability in vivo. It then used clustering and sub-clustering methods to extract statistically significant protein expression patterns from the large series of PEMs. ChiMap was then used to highlight eight major protein feature changes detected between breast cancer cell lines that either do or do not proliferate in nude mice [118]. The ChiMap tool calculated a differential value (either as a percentage change or as a fold change) and represented these values graphically. All such differentials, or just those identified using ChiClust, were then submitted to ChiMap.

Another cluster analysis study, by Markey et al. [119], focused on mammography, since it is the primary tool for the detection of breast lesions and for the subsequent decision to biopsy suspicious lesions. The purpose of the study was to identify and characterize clusters in a heterogeneous breast cancer computer-aided diagnosis database. A self-organizing map (SOM) was used to identify clusters in a large (2258 cases), heterogeneous computer-aided diagnosis database based on mammographic findings (BI-RADS™) and patient age. The study projected that the identification of subgroups within the database could help elucidate clinical trends and facilitate future model building. The resulting clusters were then characterized by their prototypes, determined using a constraint satisfaction neural network (CSNN). The clusters showed logical separation of clinical subtypes such as architectural distortions, masses and calcifications.
The broad categories of masses and calcifications were stratified into several clusters (seven for masses and three for calcifications).

The percentage of cases that were malignant differed notably among the clusters (ranging from 6 to 83%). A feed-forward back-propagation artificial neural network (BP-ANN) was used to identify likely benign lesions that may be candidates for follow-up rather than biopsy. The performance of the BP-ANN varied considerably across the clusters identified by the SOM. In particular, one identified cluster (#6) of mass cases (6% malignant) accounted for 79% of the recommendations for follow-up that would have been made by the BP-ANN. A classification rule based on the profile of cluster #6 performed comparably to the BP-ANN, providing approximately 25% specificity at 98% sensitivity. This performance was demonstrated to generalize to a large set (2177) of cases held out for model validation [119]. This study demonstrated that, through clustering and pharmacogenomic analyses, distinct clusters can be derived that show logical separation of clinical cancer subtypes such as architectural distortions, masses and calcifications.

Another cluster analysis study, by Pantaza et al [120], involved multidimensional data analysis, specifically the application of cluster analysis to the Wisconsin Breast Cancer dataset, a well-known cancer dataset. Multidimensionality has long been an obstacle in data analysis because of the difficulty of coping with more than three spatial dimensions. The study noted that, in order to be visualized, a multidimensional data set must be projected into a lower-dimensional space, where it will invariably lose some of its features. To this end, the study showed that data visualization and cluster analysis of a dataset can be achieved using the Kohonen model of self-organizing artificial neural networks, and also showed how the results of the analysis can be used to compare the performance of classification processes.
Another study by Meliker et al [121] evaluated the advances in spatial technology that enable epidemiologists to create detailed maps and employ spatial cluster statistics to garner insights about patterns of disease [134, 135]. Many of the early disease clustering studies focused on

cancer and attempted to shed light on spatially varying risk factors [ ]. In recent years, however, etiologic cancer cluster studies have been criticized for failing to consider the extended latency between exposure and disease, often embodied in a neglect of residential, occupational and other forms of mobility [ ]. In light of the challenges associated with the etiologic mapping of cancer, spatial epidemiologists have begun to map and analyze stages of cancer incidence to reveal factors concerning health care availability and mammography that influence stage of diagnosis. In demonstrating the utility of this approach, clusters have been detected using data aggregated into (a) zip codes in a study of breast cancer in situ in Wisconsin [129], (b) census tracts in studies of prostate cancer in New Jersey [130] and colorectal cancer in Massachusetts [131], and (c) census block groups in studies.

Other distinct cluster analysis techniques have also been employed in breast cancer studies to mine the resulting data. These techniques include hierarchical clustering (supervised and unsupervised), non-negative matrix factorization (NMF), consensus clustering, k-means clustering and self-organizing maps (SOMs). They aim to partition a set of data items into groups based on a measure of distance (or dissimilarity). The groups are called clusters, and their number may be pre-assigned (e.g. the k-means clustering procedure, which classifies objects into a pre-specified number of clusters by moving objects between clusters with the goal of minimizing intra-cluster variability while maximizing inter-cluster variability) or determined by the algorithm (e.g. joining algorithms, which aggregate, or amalgamate, increasingly larger clusters of increasingly dissimilar patterns) [3]. Because clustering is a relative notion, the intra-cluster and inter-cluster distances may vary from data set to data set.
As a result of this, a variable clustering threshold has to be defined for every given clustering problem. In some cases, this is equivalent to visually inspecting the amalgamation trees and finding the number of clusters. 75
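As an illustration of the k-means procedure described above (a pre-specified number of clusters, with objects moved between clusters to reduce intra-cluster variability), a minimal sketch in plain Python follows. It is not the GenePattern implementation; the function names, the toy data points and the fixed seed are assumptions made purely for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100, seed=0):
    """Illustrative k-means: returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no object changed cluster, so the partition is stable
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical toy data: two well separated groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_labels, toy_centroids = kmeans(toy_points, k=2)
```

The number of clusters k has to be chosen in advance, which is the main practical difference from the joining (hierarchical) algorithms, where the number of clusters follows from where the amalgamation tree is cut.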

The different cluster techniques described above help to organize observed data into meaningful structures in order to develop useful data-driven taxonomies or classifications [3]. However, each clustering technique has its own strengths for data mining. For example, the visual depiction of a multidimensional data set provided by self-organizing maps is useful in giving a first glimpse of the underlying features of the data and of the distribution of the data points in the multidimensional problem space. This helps to identify relations between multidimensional data as well as the number of clusters in the data sets. Unlike joining cluster analysis algorithms, self-organizing maps do not set any limit on the number of patterns in the data sets. The techniques are further discussed below:

Self-organizing maps (SOMs) (feature maps, Kohonen maps) [4] are biologically plausible models of artificial neural networks that provide a very convenient 2-dimensional visual representation of high-dimensional input data. SOMs use an unsupervised learning algorithm based on distance calculation, causing each input pattern to be associated with a zone on the resulting map. The real spatial distances between input patterns are not preserved, in the sense that two patterns that are close to one another topologically are not necessarily similar from the distance point of view. Such patterns are actually separated by a valley (Figure 2a) covering a wide region of the input data space that is void of input patterns. In contrast, two data points that are close in terms of distance will also be close topologically, but the boundary separating them will be thinner. In this way, the 2-dimensional mapping space becomes a warped projection of the multidimensional input space, with the input patterns squeezed into one another and separated by boundaries of variable width.

2.6 Microarray studies and their relevance in breast cancer studies

Relevance of DNA Microarray gene expression analysis to proposed study

DNA microarray gene expression studies have been utilized in some breast cancer studies. Currently, breast cancer data are generated mainly from large amounts of genomic information using DNA microarrays. With the development of microarray technology, it is now possible to efficiently examine how active thousands of genes are at any given time for a given patient in order to determine the pathological state of the applicable genes in disease. In the past, scientists were only able to conduct genetic analyses on a few genes at once. Results of previous DNA microarray gene expression studies confirmed that breast cancer is not a single disease with variable morphologic features and biomarkers, but rather a group of molecularly distinct neoplastic disorders. Other DNA microarray studies have demonstrated variable gene activity levels among patients with various breast cancer types. Profiling results have also supported this variability hypothesis and have shown that estrogen-receptor (ER) negative and ER-positive breast cancers originate from distinct cell types; they also point to biologic processes that govern metastatic progression. Such profiling has thus uncovered molecular signatures that could influence clinical care. DNA microarray technology has made it possible to study which genes are active and which are inactive in different cell types, and to understand both how these cells function normally and how they are affected when various genes do not perform properly. This has given researchers the platform required to change the classification of different types of cancers by including the genetic component, instead of a classification based solely on the organs in which the tumors develop.

As a result, it is now possible to further classify these types of cancers based on the patterns of gene activity in the tumor cells and to design treatment strategies targeted directly to each specific type of cancer. By examining the differences in gene activity between untreated and treated tumor cells (for example, those that are irradiated or oxygen-starved), researchers can now better understand how different therapies affect tumors and thus develop more effective treatments. It has also given researchers the opportunity to extract previously unrecognized biological structure and meaning from these gene expression data, the results of which hold promise to accelerate the transition from empirical to molecular medicine. For example, microarray-based comparative genomic hybridization has revealed differences in copy numbers of particular genes in different subtypes of breast cancer.

Description of the Microarray Technology: DNA microarrays are created by robotic machines that arrange minuscule amounts of hundreds or thousands of gene sequences on a single microscope slide. Researchers have a database of over 40,000 gene sequences that can be used for this purpose. When a gene is activated, cellular machinery begins to copy certain segments of that gene. The resulting product is known as messenger RNA (mRNA), which is the body's template for creating proteins. The mRNA produced by the cell is complementary to, and therefore will bind to, the original portion of the DNA strand from which it was copied. To determine which genes are turned on or off in a given cell, a researcher must first collect the messenger RNA molecules present in that cell. Each mRNA molecule is then labeled by using a reverse transcriptase enzyme (RT) that generates a cDNA complementary to the mRNA. During that process, fluorescent nucleotides are attached to the cDNA. The tumor and the normal samples are labeled with different fluorescent dyes.

Next, the labeled cDNAs are placed onto a DNA microarray slide. The labeled cDNAs that represent mRNAs in the cell will then hybridize, or bind, to their synthetic complementary DNAs attached to the microarray slide, leaving their fluorescent tags. A special scanner is then used to measure the fluorescent intensity of each spot on the microarray slide. If a particular gene is very active, it produces many molecules of messenger RNA and thus more labeled cDNAs, which hybridize to the DNA on the microarray slide and generate a very bright fluorescent area. Genes that are somewhat less active produce fewer mRNAs and thus fewer labeled cDNAs, which results in dimmer fluorescent spots. If there is no fluorescence, none of the messenger molecules have hybridized to the DNA on the microarray slide, indicating that the gene is inactive. This microarray hybridization technique is frequently used to examine the activity of various genes at different times. When tumor samples (red dye) and normal samples (green dye) are co-hybridized, they compete for the synthetic complementary DNAs on the microarray slide. As a result, if a spot is red, that specific gene is more highly expressed in tumor than in normal tissue (up-regulated in cancer). If a spot is green, the gene is more highly expressed in the normal tissue (down-regulated in cancer). If a spot is yellow, the gene is equally expressed in the normal and tumor states.
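The red/green/yellow reading of a two-color spot can be expressed as a log2 ratio of the tumor (red) and normal (green) intensities. The sketch below is only an illustration of that reading: the function name, the example intensities and the one-log2-unit (two-fold) cutoff are assumptions and are not values taken from this study.

```python
import math


def classify_spot(tumor_intensity, normal_intensity, threshold=1.0):
    """Return (log2 ratio, call) for one two-color microarray spot.

    threshold is in log2 units; 1.0 (a two-fold change) is an assumed,
    illustrative cutoff.
    """
    if tumor_intensity <= 0 or normal_intensity <= 0:
        raise ValueError("intensities must be positive")
    log_ratio = math.log2(tumor_intensity / normal_intensity)
    if log_ratio >= threshold:
        return log_ratio, "red: up-regulated in tumor"
    if log_ratio <= -threshold:
        return log_ratio, "green: down-regulated in tumor"
    return log_ratio, "yellow: similar expression in tumor and normal"


# Hypothetical intensities for three spots.
print(classify_spot(800.0, 200.0))
print(classify_spot(100.0, 400.0))
print(classify_spot(300.0, 300.0))
```

Working with the log ratio makes up- and down-regulation symmetric around zero, which is why expression ratios are commonly log-transformed before clustering.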

Chapter METHODOLOGY (Materials and Methods)

3.1 Materials

Discovery-driven MIT translational breast cancer data (the Breast-A [22] and Breast-B [23] breast cancer data sets), which were derived from whole genome expression profiles/cell lines studied on multiple platforms [113], were used in this study. These two gene expression breast cancer data sets represent data from two distinct groups of patients: one group who developed distant metastases within 5 years (Breast_B), and another group who continued to be disease-free after a period of at least 5 years (Breast_A). The data files from the two data sets used for these analyses are listed below:

Data                       File
Breast-A: data set         Breast_A.gct
Breast-A: class labels     Breast_A.cls
Breast-B: data set         Breast_B.gct
Breast-B: class labels     Breast_B.cls

Table 3-1 Breast Cancer Study Data Listing

The genes from these two data sets (Breast-A and Breast-B) were selected primarily from gene expression cDNA microarray data from 98 primary breast cancers, 34 of which were derived from

patients who developed distant metastases within 5 years (Breast_B.gct) and 44 of which were derived from patients who continued to be disease-free after a period of at least 5 years (Breast_A.gct). A further 18 were from patients with BRCA1 germ line mutations, and 2 from BRCA2 carriers. All 'sporadic' patients were lymph node negative and under 55 years of age at diagnosis [22]. In this study, the two data sets represent breast cancer array data for disease of less than 5 years and of greater than 5 years respectively, and they were the input data sets for all the clustering analyses. Data sets with a 5-year reference point, rather than a longer follow-up period, were chosen for this study for the following reasons:

Typically, the 5-year observed survival rate refers to the percentage of patients who live at least 5 years after being diagnosed with cancer, even though many of these patients live much longer than 5 years after diagnosis. In order to obtain 5-year survival rates, doctors have to look at people who were treated at least 5 years ago. Improvements in treatment since then may result in a more favorable outlook for people now being diagnosed with breast cancer. Generally, disease progression of up to 5 years is considered significant and can be analyzed in a valid manner without interference from factors such as improvements in treatment or resulting alterations in the genes expressed for the disease. This is especially important for patients who might have other significant inherited diseases in addition to breast cancer (such as diabetes, Alzheimer's disease or Parkinson's disease). In such cases, survival rates may be heavily biased, and gene expression may be affected by the other diseases, which could progress as the breast cancer progresses.
Also, based on survival rate statistics, the literature reports that the statistical survival benefit in metastatic breast cancer patients is currently estimated in months and not years; the majority of

breast cancer patients with metastatic cancer may not survive up to 5 years. In addition, research also reports that even with some of the current treatments, approximately 40% of patients with lymph node-positive disease will experience a relapse, and a majority of these patients will die from disseminated cancer. For patients with lymph node-negative disease, the 5-year recurrence rate is about 25%, which makes the 5-year mark a good point at which to evaluate recurrence. Typically, the relative survival rate compares the observed survival with what would be expected for people without the cancer. This helps to correct for deaths caused by something other than cancer and is a more accurate way to describe the effect of cancer on survival. Relative survival rates are at least as high as observed survival rates, and in most cases are higher.

3.2 Methods (GenePattern Clustering Analyses Software)

The cluster analysis methods used in these analyses were run on MIT's freely available GenePattern software package with its various clustering modules. GenePattern is a genomic analysis platform that provides access to a wide range (150+ tools) of computational methods for analyzing gene expression and other genomic data. It allows researchers to analyze the data and examine the results without writing additional programs. Most importantly, it ensures reproducibility of analysis methods and results by capturing the provenance of the data and analytic methods, the order in which methods were applied, and all parameter settings. The GenePattern modules utilized for this study include hierarchical clustering, sparse hierarchical clustering, non-negative matrix factorization (NMF), K-Means, consensus clustering, comparative gene marker selection and self-organizing maps [114]. The Comparative gene marker selection analysis module was the final module utilized in these study analyses.
The expected results as well as the aspects being compared are described

as follows: Comparative gene marker selection was intended to provide a clustering analysis of gene markers from different cancer sources in order to identify biomarkers for some cancer types that may be applicable or relevant to another type of cancer (in this case, breast cancer). The goal is to identify and select, from the heterogeneous cancer data groups, gene markers from other diseases that are comparable to breast cancer gene markers (based on GenePattern's algorithm, with relevance to function). This has the potential to enable the identification of possible multiple indications for cancer treatment drugs. A drug that was previously designed for one particular cancer type can end up being applicable to another type of cancer if the significant markers for both diseases respond to and are inhibited by the same drug. Specifically, gene expression data analyses performed with GenePattern modules follow these three critical steps:

Step 1: Preparing the dataset, which entails converting gene expression data from any source (e.g., Affymetrix or cDNA microarrays) into a tab-delimited text file that contains a column for each sample, a row for each gene, and an expression value for each gene in each sample. GenePattern defines two file formats for gene expression data: GCT (gene cluster text file) and RES (ExpRESsion file format). The protocols run for these analyses used the GCT file format ( fileformats.html).

Step 2: Creating a tab-delimited text file that specifies the class or phenotype of each sample in the expression dataset, if available. GenePattern uses the CLS file format for this purpose.

Step 3: Preprocessing the expression data as needed, for example, to remove platform noise and genes that have little variation across samples. GenePattern provides the PreprocessDataset module for this purpose.
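To make Steps 1 and 2 concrete, the following is a minimal sketch of how a GCT expression file and a categorical CLS class file could be read. It assumes the GCT version 1.2 layout (a "#1.2" line, a line with the row and column counts, a header line beginning with Name and Description, then one tab-delimited row per gene) and the three-line CLS layout (counts, class names after "#", one label per sample). The gene names, sample names, class names and values in the example are hypothetical and are not taken from Breast_A or Breast_B.

```python
def parse_gct(text):
    """Parse GCT v1.2 text into (gene_names, sample_names, matrix)."""
    lines = text.strip().splitlines()
    if not lines[0].startswith("#1.2"):
        raise ValueError("expected a GCT version 1.2 file")
    n_rows, n_cols = (int(x) for x in lines[1].split()[:2])
    sample_names = lines[2].split("\t")[2:]  # skip Name and Description
    gene_names, matrix = [], []
    for line in lines[3:3 + n_rows]:
        fields = line.split("\t")
        gene_names.append(fields[0])
        matrix.append([float(v) for v in fields[2:]])
    if len(matrix) != n_rows or len(sample_names) != n_cols:
        raise ValueError("dimensions do not match the GCT header")
    return gene_names, sample_names, matrix


def parse_cls(text):
    """Parse categorical CLS text into (class_names, labels)."""
    lines = text.strip().splitlines()
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])
    class_names = lines[1].lstrip("#").split()
    labels = lines[2].split()
    if len(labels) != n_samples or len(class_names) != n_classes:
        raise ValueError("counts do not match the CLS header")
    return class_names, labels


# Hypothetical miniature files: 2 genes x 3 samples, 2 classes.
GCT_TEXT = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "GENE_A\tna\t1.5\t2.0\t0.5\n"
    "GENE_B\tna\t3.0\t1.0\t4.5\n"
)
CLS_TEXT = "3 2 1\n# class_0 class_1\n0 1 0\n"

genes, samples, expression = parse_gct(GCT_TEXT)
class_names, labels = parse_cls(CLS_TEXT)
```

Each column of the expression matrix corresponds, in order, to one label in the CLS file, which is how a clustering result can later be compared against known phenotypes such as ER status.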

However, Steps 1, 2 and 3 were omitted in these analyses because the pre-processed breast cancer data files, Breast_A.gct and Breast_B.gct, were used throughout the analyses. For the analyses, the Breast-A [22] and Breast-B [23] breast cancer data sets were cycled through a phase of exploration (or learning), followed by a phase of confirmation (both of which were iterated), using the different clustering algorithms until the resulting gene patterns were well understood and properly categorized into biologically meaningful clusters. The resulting clusters were in turn matched back, for comparable validation, against the already established gene subtypes (predetermined estrogen receptor (ER) gene subtypes) from which they were derived, before the clusters were merged and statistically segmented into distinct breast cancer data subsets/subtypes associated with estrogen receptor (ER) status.

Research Design and Methods

Translational medicine informatics methods, integrated with predictive pharmacogenomics and statistical analysis tools, were also utilized in this study in addition to the clustering analyses. These tools were used to evaluate gene persistence and the pharmacogenomic tendencies of breast cancer disease subtypes towards certain drug mechanisms. This study was designed to generate information that provides more understanding of breast cancer disease incidence, gene persistence, disease recurrence and the pharmacogenomic details of the disease.

Research Timeline and Work Plan

After a thorough review of all the Specific Aims involved, the expected timeframe for conducting and completing this proposed study was set at 6 months.

As the study results show, the targeted Specific Aims 2.3.1, and Specific Aim 3 were completed. The gene classification analyses with regard to genetic function were also completed. The biological functions and mechanisms of the extracted genes and the genomic implications of their presence within the disease clusters were also evaluated. The correlation between the resulting gene expression data clusters and the final genetic basis for the disease type was determined. Recommendations were returned once the inter-gene relationships that impact the genetic coding for breast cancer, and the significance of the genes within the clustered pools, had been evaluated and confirmed. This research investigation returned recommendations on significant genes that need to be targeted for treatment, based on the particular genes that are relevant for the disease type as modeled from the clustering analysis results. The emphasis of this study was on being able to diagnose the disease type effectively once the disease presents at the hospital, and then to translate and utilize the knowledge gained from the patient's gene expression analyses (for example, ER positive or negative status, basal, luminal or HER2-positive type, the possible disease pathway, and drugs that can effectively block that pathway and prevent the genes from being expressed) in further treatment of the patient with sustainable treatment solutions. It is to be noted that this study was not intended to take into consideration the effect of the drugs that were used to treat the patients on the resulting gene expression data analyses, or to return recommendations on drug effectiveness for the individual patients whose gene expression data were analyzed.
Specific Aim 4 was finalized to include the integrated evaluation that focused on extracting pharmacogenomic insights into the specific gene behavior responsible for variable treatment outcomes, thus supporting the central hypothesis that variable treatment outcomes occur as a result of the underlying disease gene behavior for the particular disease subtype. The pharmacogenomics informational model that results from this is derived

from the evaluations of the correlations with disease severity/disease progression and of treatment outcome variability due to gene persistence for the disease. Here also, the predictability and reliability of the resulting model were challenged with reference to the underlying gene coding for the disease form. The last phase of the study was completed and Specific Aim 5 was finalized. It involved evaluating the developed pharmacogenomics information model for its reliability in representing and supporting the different clinical forms of the disease in terms of predictive treatment outcome. In this aspect also, the predictability and reliability of the resulting model were challenged with reference to the underlying gene coding for the disease form, and in terms of producing effective and sustainable guided/tailored therapy as a result of incorporating the pharmacogenomic disease component into the evaluation of disease treatment. The preparation of the manuscripts and reports has been completed.

Research goals/strategies

The overall goal/strategy was to utilize different clustering techniques/algorithms from the GenePattern software (the Hierarchical, Consensus clustering, K-Means clustering, Self-organizing maps and Non-negative matrix factorization modules of the MIT GenePattern statistical software) for comparative analyses, in order to demonstrate the clinical utility and clinical validity of the resulting gene patterns and clusters. Ultimately, gene behavior for the extracted relevant genes was matched up with already established genes within the ER-positive and other confirmed subclasses in order to gain more insight into the diverse forms of breast cancer. Final conclusions from these clustering analyses were derived by applying fundamental genetics, biochemistry and where applicable

pharmacogenomics to identify and confirm genetic regulation mechanisms and to decipher regulatory interactions between genes and proteins, which may not necessarily be linear. An evaluation was also performed of why certain disease conditions or gene combination profiles may trigger a large effect or a significant biological response, whereas other comparable situations may not yield similar responses. The clustering analyses utilizing the various clustering methodologies were not run against any specific breast cancer data standards. It is not feasible to run the analyses against standards because it is not possible to identify any standard set of breast cancer disease data. The disease is diverse, with diverse risk factors that vary in terms of disease incidence across the numerous disease subtypes. Various factors contribute to the diversity of the disease, such as age (young or old) and whether the disease is hereditary or acquired, malignant or benign, aggressive or non-aggressive, which makes it difficult to choose any one standard for the disease. Therefore, the goal of the analyses is to identify common gene incidence (based either on function or on expression levels) between the gene expression data sets of < 5 years and > 5 years, and to use that to understand which genes are significant to disease initiation, progression and recurrence.

Study Design

The study was designed to address the research goals as outlined.

Research Design Study 1: Finding the genes that persist within different breast disease types; an analysis that incorporates two verified breast cancer data sets, Breast-A and Breast-B.

The specific goal for this study aspect was to identify the corresponding commonalities within the Breast_A and Breast_B cancer data sets, identify the genes that persist within the two data

sets and evaluate the genes that are responsible for the incidence, persistence, duration and recurrence of disease from the reference gene data pool. The Breast_A and Breast_B cancer data were the input data sets for all the clustering analyses using the different modules. Consensus hierarchical and sparse hierarchical clustering were followed by NMF (non-negative matrix factorization) clustering, which was then used in the final validation analyses. Preparatory unsupervised hierarchical and sparse hierarchical clustering analyses were performed as the initial clustering analyses prior to NMF clustering in order to pre-ascertain the major breast cancer cluster categories, extract the fundamental similarities and measures between the genes, analyze temporal gene expression, provide molecular portraits of the breast tumors through the resulting gene cluster categories, and identify strong or weak degrees of association between members of different clusters. The two algorithms (HC and SHC) were used after merging the Breast_A and Breast_B data sets to cluster the 98 tumors on the basis of their similarities measured over the approximately 5,000 significant genes. HC calculated pair-wise similarity on the basis of expression ratio measurements across all tumors, and sparse hierarchical clustering was used to determine the dominant genes and to weight their significance. Given that hierarchical clustering is highly sensitive to the measure used to assess distance, these initial clusters were taken as providing insight into relatable clusters that could be further confirmed using the NMF clustering algorithm. The goal was to utilize the consensus hierarchical and sparse hierarchical cluster analysis techniques as preliminary cluster methods, followed by NMF clustering, to derive the consistent cluster(s) of the sought-after genes.
Consensus clustering generally runs a selected clustering

algorithm against perturbations of the original data set, such that the result is a consensus matrix that assesses the stability of the discovered clusters. Initial analyses conducted for this specific goal utilized the two independent breast cancer data sets, which were merged into a single data set. Hierarchical clustering was then performed to determine subtypes within the combined data. After merging the two data sets, the resulting subtypes within the merged data set were first determined by unsupervised clustering, given by predetermined phenotypes. It was expected that the results of the hierarchical clustering would reveal the dominant specific structure after normalization. It was also hoped that a close examination of the resulting clusters from the different clustering techniques would provide some pharmacogenomic insights into gene behavior for disease incidence, treatment variability and variable gene activity. The initial hierarchical and sparse clustering for this specific aim was performed with the Consensus clustering module, with a Kmax of 5 output clusters used for the learning phase. Next, the unsupervised hierarchical clustering algorithm with the Euclidean distance measure was used. Resampling iterations were set at 20 with a randomly generated seed value, and correlation was set at >0.5 for 10 genes. The unsupervised hierarchical clustering (HC) and sparse hierarchical clustering algorithms used for this study recursively merged items/genes with other items/genes, or with the results of previous merges, according to their pair-wise distances (with the closest item pairs being merged first) to obtain a tree structure (dendrogram) whose leaf nodes correspond to the original items and whose internal nodes correspond to the merges. The hierarchical clustering module also utilized some preprocessing operations. The order of these preprocessing operations was as follows:
1. Log Base 2 transform: Log transform the data before clustering.
2. Row (gene) center

3. Row (gene) normalize
4. Column (sample) center
5. Column (sample) normalize

Unsupervised hierarchical clustering input parameters

Parameter                 Value    Description
Input filename            Breast_A.gct.txt and Breast_B.gct.txt
column.distance.measure
row.distance.measure
clustering.method         m        pairwise complete linkage (hierarchical clustering method used)
log.transform             no       log-transform of the data before clustering
row.center                no       centering of each row (gene) in the data before clustering
row.normalize             no       normalizing of each row (gene) in the data before clustering
column.center             no       centering of each column (sample) in the data before clustering
column.normalize          no       normalizing of each column (sample) in the data before clustering
output.base.name          <input.filename_basename>

Table 3-2 Unsupervised hierarchical clustering parameters

The input files were the Breast_A and Breast_B independent data sets. Preparatory hierarchical and sparse hierarchical clustering analyses were first performed as the initial clustering analyses in order to pre-ascertain the major breast cancer cluster categories, extract the fundamental similarities and measures between the genes, analyze temporal gene expression and provide molecular portraits of the breast tumors. The two algorithms (unsupervised HC and SHC) were used after merging the Breast_A and Breast_B data sets to cluster the 98 tumors on the basis of their similarities measured over the approximately 5,000 significant genes. HC calculated pair-wise similarity on the basis of expression ratio measurements across all tumors by way of two-dimensional hierarchical clustering of the 98 tumor samples, and sparse hierarchical clustering was used to determine the dominant genes and to weight their significance. The Sparse/Unsupervised Hierarchical Clustering parameters used in this study included the input filename, row and column distance measures, clustering method, log transform, row and column centering, row and column normalization, and output base name.
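The preprocessing order listed above and the pairwise complete-linkage merging can be sketched in plain Python as follows. This is an illustrative sketch, not the GenePattern HierarchicalClustering module: "normalize" is assumed here to mean scaling each row to unit sum of squares, the Euclidean distance is used as stated in the text, and all function names and toy values are hypothetical.

```python
import math


def log2_transform(matrix):
    return [[math.log2(v) for v in row] for row in matrix]


def center_rows(matrix):
    """Subtract each row's mean from its values."""
    centered = []
    for row in matrix:
        mean = sum(row) / len(row)
        centered.append([v - mean for v in row])
    return centered


def normalize_rows(matrix):
    """Scale each row to unit sum of squares (assumed meaning of normalize)."""
    scaled = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row))
        scaled.append([v / norm if norm else 0.0 for v in row])
    return scaled


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def preprocess(matrix):
    """Log2, row center, row normalize, column center, column normalize."""
    rows_done = normalize_rows(center_rows(log2_transform(matrix)))
    cols_done = normalize_rows(center_rows(transpose(rows_done)))
    return transpose(cols_done)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage(items, n_clusters):
    """Merge the closest clusters until n_clusters remain.

    The distance between two clusters is the largest pairwise distance
    between their members (complete linkage).
    """
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(euclidean(items[i], items[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical toy profiles for four samples.
toy_samples = [[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.1, 7.9]]
print(complete_linkage(toy_samples, n_clusters=2))
```

Recording the order of the merges, rather than stopping at a fixed number of clusters, would give the dendrogram described in the text.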

The input data file is provided in the link below:

The Sparse hierarchical clustering parameters used included the following:

Parameter             Value
Input filename        Breast_A.gct.txt and Breast_B.gct.txt
Method type           average
wbound                -1
maxnumgenes           5000
cluster.features      false (set to true if a clustering of the genes with non-zero weights is desired)
method.features       average (the type of linkage used to cluster the features, if desired)
standardize.arrays    true
output.base.name      <input.filename_basename>
Data Source

Note: For the Sparse hierarchical clustering, wbound is the tuning parameter. If a non-negative number is given, that number is used as the tuning parameter; otherwise, a tuning parameter is selected via a permutation approach.

Table 3-3 Sparse Hierarchical clustering parameters

Research Design Study 2 for Specific Aims 2, 3 and 4: Evaluate and understand the biology of the resulting clustered genes, investigate their importance and relevance to the disease, and identify the relevant gene functions or aspects that are significant to disease variability, recurrence, treatment sustainability and favorable outcomes.

The NMF Clustering module, being more adaptable to genetic clustering, was utilized as the clustering approach for this study aspect. The analyses for this evaluation incorporated the two verified breast cancer data sets, Breast-A and Breast-B, representing data for patients with breast cancer of less than or greater than 5 years.

The NMF clustering algorithm was run, following preparatory Consensus hierarchical clustering and sparse hierarchical clustering, to provide an intuitive decomposition by parts of the formed clusters/gene events and to further reduce the dimension of the expression data from thousands of genes to a handful of meta-genes, with each cluster clearly describing the class to which its members belong [9]. NMF clustering was also used to further determine which gene function is responsible for the formed clusters and to evaluate the key molecular determinant of the disease for the various disease subtypes, using the following parameters.

Parameter             Value
Input filename        Breast_A.gct.txt and Breast_B.gct.txt
k.initial             2
k.final               5
num.clusterings       20
max.num.iterations    2000
error.function        divergence
random.seed           Random number selected
stop.convergence      40
stop.frequency        10

Table 3-4 NMF clustering input parameters

For these analyses, Breast_A.gct and Breast_B.gct were the input data sets, with k.initial set at 2 and k.final set at 5. num.clusterings was set at 20 while max.num.iterations was set at 2000. The divergence was used for error.function and random.seed was assigned a random number. stop.convergence was set at 40 and stop.frequency was set at 10. The minimum and maximum numbers of desired clusters were set at 2 and 5 (k.initial = 2 and k.final = 5), which are NMF algorithm driven. k.initial = 2 is the lowest statistically possible number of clusters that can be set and returned when there are both variability and similarities in the composite data being clustered. Setting k.initial at 1 would imply that the clustering analysis will return the entire set of data and identify no differences in the test data sets clustered, which is not possible in this case based on the initial baseline clustering results returned by the

Unsupervised Hierarchical and Sparse Hierarchical clustering, which produced clusters possessing both similarities and differences, distinguished by the weights (intensity of gene expression) returned for the genes that started and persisted within the disease. Also within the NMF clustering analyses, the maximum expected number of clusters chosen to be returned from the analyses for this study was set at 20 and represented by the NMF clustering parameter num.clusterings = 20. This value was considered adequate for the number of iterations (max.num.iterations = 2000) that were carried out. Prior to the NMF clustering, a measure of correspondence for the subtypes, and then its significance, was first defined [97]. Without imposing the structure of one data set upon another, but rather by using a bi-directional approach to highlight the common substructures in both data sets, the NMF clustering technique clustered the data by first reducing the dimension of the gene expression data from thousands of genes to a handful of meta-genes/meta-samples, each of which represented a group of genes or samples respectively, then decomposing the relevant genes, analyzing the resulting samples and summarizing the gene expression patterns of the meta-genes. Specifically, the breast cancer data sets were first merged and then statistically segmented into breast cancer data subsets/subtypes associated with estrogen receptor (ER) status. Following that, the merged breast cancer data were cycled through a phase of exploration (or learning), followed by a phase of confirmation (which was iterated), until the resulting gene patterns were well understood and properly categorized into biologically meaningful clusters. The resulting clusters were in turn matched back, for comparable validation, against the already established gene subtypes (pre-determined estrogen receptor (ER) gene subtypes) from which they were derived.
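The reduction of an expression matrix to a handful of meta-genes can be sketched with the standard multiplicative update rules for the divergence objective, followed by assigning each sample to its most highly expressed meta-gene. The sketch below is a small pure-Python illustration under stated assumptions: it is not the GenePattern NMF module, and the function names, the toy matrix, the iteration count, the seed and the small constant EPS (added to avoid division by zero) are all hypothetical choices.

```python
import math
import random

EPS = 1e-9  # assumed small constant guarding against division by zero


def matmul(W, H):
    return [[sum(W[i][a] * H[a][j] for a in range(len(H)))
             for j in range(len(H[0]))] for i in range(len(W))]


def divergence(A, W, H):
    """Sum over entries of A*log(A/(WH)) - A + WH."""
    WH = matmul(W, H)
    total = 0.0
    for i, row in enumerate(A):
        for j, a in enumerate(row):
            wh = WH[i][j] + EPS
            total += a * math.log((a + EPS) / wh) - a + wh
    return total


def nmf(A, k, iterations=300, seed=0):
    """Factor A (genes x samples) into non-negative W (genes x k), H (k x samples)."""
    rng = random.Random(seed)
    n, m = len(A), len(A[0])
    W = [[rng.uniform(0.1, 1.1) for _ in range(k)] for _ in range(n)]
    H = [[rng.uniform(0.1, 1.1) for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        WH = matmul(W, H)
        for a in range(k):  # update H using the current W
            w_sum = sum(W[i][a] for i in range(n)) + EPS
            for j in range(m):
                H[a][j] *= sum(W[i][a] * A[i][j] / (WH[i][j] + EPS)
                               for i in range(n)) / w_sum
        WH = matmul(W, H)
        for a in range(k):  # update W using the updated H
            h_sum = sum(H[a][j] for j in range(m)) + EPS
            for i in range(n):
                W[i][a] *= sum(H[a][j] * A[i][j] / (WH[i][j] + EPS)
                               for j in range(m)) / h_sum
    return W, H


def assign_clusters(H):
    """Place sample j in the cluster of its largest meta-gene coefficient."""
    return [max(range(len(H)), key=lambda a: H[a][j]) for j in range(len(H[0]))]


# Hypothetical toy matrix: genes 0-1 are high in samples 0-1,
# genes 2-3 are high in samples 2-3.
toy_A = [[5.0, 5.0, 0.1, 0.1],
         [4.0, 4.0, 0.1, 0.1],
         [0.1, 0.1, 6.0, 6.0],
         [0.1, 0.1, 5.0, 5.0]]
toy_W, toy_H = nmf(toy_A, k=2)
print(assign_clusters(toy_H))
```

Because the result depends on the random starting matrices, repeating the factorization with different seeds and recording how often each pair of samples lands in the same cluster gives the kind of consensus matrix used to judge cluster stability.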

Finally, the consensus clustering algorithm for the NMF further clustered and classified the significant relevant genes and determined the optimal number of significant clusters by running its selected algorithm several times and then assessing the stability of the discovered clusters. For each data set (Breast_A and Breast_B), the NMF algorithm started the clustering by randomly initializing the factor matrices W and H and then iteratively updating them to minimize a divergence functional. The divergence functional is related to the Poisson likelihood of generating A from W and H combined:

$$D = \sum_{i,j} \left( A_{ij} \log \frac{A_{ij}}{(WH)_{ij}} - A_{ij} + (WH)_{ij} \right)$$

At each step, W and H are updated using the coupled divergence equations [9]. This is the preferred form of implementation and has been demonstrated to capture a subclass that the norm-based update equations do not [9].

Mathematically, this corresponds to factoring matrix A into two matrices with positive entries, A ~ WH. Matrix W has size N × k, with each of the k columns defining a metagene; entry wij is the coefficient of gene i in metagene j. Matrix H has size k × M, with each of the M columns representing the metagene expression pattern of the corresponding sample; entry hij represents the expression level of metagene i in sample j. Fig. 6 & 7 show the simple case corresponding to k = 2. A rank-2 reduction of the microarray data of N genes and M samples, as seen below, is the minimum rank that can be set for a meaningful clustering of the data. Given this factorization A ~ WH, matrix H was used to group the M samples into k clusters. Each sample was placed into the cluster corresponding to the most highly expressed metagene in the sample; that is, sample j was placed in cluster i if hij was the largest entry in column j (Fig. 5-3). Often at the initial stages of clustering, there is a dual view of the decomposition A ~ WH,

which defines meta-samples (rather than meta-genes) and clusters the genes (rather than the samples) according to the entries of W.

In summary, the actual clustering steps used for this study (Specific Aim 3) are described below:

The breast cancer data sets were first merged and then statistically segmented into breast cancer data subsets/subtypes associated with estrogen receptor (ER) status. The merged breast cancer data were then cycled through a phase of exploration (or learning), followed by a phase of confirmation (which was iterated), until the resulting gene patterns were well understood and properly categorized into biologically meaningful clusters. The resulting clusters were in turn matched back, for validation, against the already established gene subtypes (pre-determined estrogen receptor (ER) gene subtypes) from which they were derived.

Overall, the NMF clustering extracted and confirmed the relevant breast cancer gene clusters from the two distinct groups of patients, Breast_A.gct and Breast_B.gct, revealed the correspondence between both breast cancer-related data sets and identified common subtypes of breast cancer associated with estrogen receptor status.

Research Design 3: Evaluate gene significance to disease potential, mechanism, variability, recurrence and metastasis (a pharmacogenomics pilot analysis)

The goal here was to use the comparative marker selection method to reveal and evaluate the commonalities or differences between the resulting breast cancer clusters/subtypes in relation to multiple gene expression data representing other types of cancers (breast, colon, kidney, lung, pancreatic, prostate and stomach cancers). The aim is to derive marker genes, along with possible associated pathways, that are either common to multiple types of cancers or specific to

individual cancers, in this case breast cancer, in comparison with normal and tumor samples. The method was also used in this aspect of the study to determine whether there are significant correlations of pathogenicity (incidence, recurrence, metastases, aggressiveness and treatment sustainability) and insights for the disease that can be derived from known genes of cancers other than breast cancer. It was used, notably, to confirm common gene function(s) within the breast cancer subtypes implicated in estrogen receptor status by identifying common gene markers, and furthermore to identify the subgroup of breast cancer patients that share similar survival patterns with other cancer types, thus improving the accuracy of a clinical outcome predictor.

GenePattern's Comparative Marker Selection protocol, which was found to be very efficient for these types of similarity connections, was used for this aspect of the analysis. The Comparative Marker Selection application suite is freely available as a GenePattern module. The protocol focuses on differential expression analysis, where the aim is to identify comparative gene markers (if any), from within several groups of heterogeneous cancer data, that are differentially expressed between distinct classes or phenotypes. The protocol identifies similar gene types that may be relevant to different types of cancer and compares their up-regulation between the heterogeneous groups. This comparative approach to gene marker selection can therefore enable the identification of possible multiple indications for cancer treatment drugs. A drug that was previously designed for one particular cancer type can end up being applicable to another type of cancer if the significant markers for both diseases respond to, or are inhibited by, the same drug.
However, comparative marker selection methods other than the one used in this analysis may be equally useful for this type of analysis.

The GenePattern Comparative Marker Selection module used in this analysis applies several approaches to determine the features that are most closely correlated with a class template and to identify the significance of that correlation. It consists of three modules that allow users to apply and compare different methods of computing significance for each marker gene, a viewer to assess the results, and a tool to create derivative datasets and marker lists based on user-defined significance criteria. Many statistical approaches for assigning significance values to genes are also integrated within GenePattern's comparative marker selection protocol. For each gene, the Comparative Marker Selection module uses a test statistic to calculate the difference in gene expression between the two classes and then estimates the significance (p-value) of the test statistic score.

These statistical approaches are needed to control false positives, because testing tens of thousands of genes simultaneously increases the possibility of mistakenly identifying a non-marker gene as a marker gene (a false positive). Therefore, the Comparative Marker Selection module corrects for multiple hypothesis testing by computing both the false discovery rate (FDR) and the family-wise error rate (FWER), which in general is stricter, or more conservative, than the FDR. The FDR represents the expected proportion of non-marker genes (false positives) within the set of genes declared to be differentially expressed, while the FWER represents the probability of having any false positives [138]. Because the FWER is stricter, and because of the noisy nature of microarray data, it may frequently fail to find marker genes among the large number of hypotheses being tested. Most researchers therefore prefer to identify marker genes based on the FDR rather than the more conservative FWER.
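The difference between the two corrections described above can be illustrated with a minimal sketch. It is not the GenePattern implementation; it assumes per-gene p-values are already available, and the p-values, the 0.05 cutoff and the function names are hypothetical and chosen only for illustration.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values (a simple way to control the FWER)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls the FDR)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for five genes (not data from this study).
p = [0.001, 0.008, 0.02, 0.03, 0.60]
fdr = benjamini_hochberg(p)
fwer = bonferroni(p)

# Genes declared markers at a hypothetical 0.05 cutoff.
markers_fdr = [i for i, q in enumerate(fdr) if q <= 0.05]
markers_fwer = [i for i, q in enumerate(fwer) if q <= 0.05]
```

In this toy input the FDR criterion retains more genes than the Bonferroni criterion, which mirrors the point made above that the FWER is the more conservative of the two.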

The input parameters for this ComparativeMarkerSelection analysis are the following:

Parameter                        Value
Module                           ComparativeMarkerSelection
input.filenames                  Multi_A.gct and Multi_A.cls; Breast_A.gct and Breast_B.gct; Breast_A.cls and Breast_B.cls
Confounding variable.cls.file    Breast_A.cls & Breast_B.cls
Test.direction                   2
Test.statistic                   T-test
Min.std                          none
Number.of.permutations
log.transformed.data             false
complete                         false
balanced                         false
random.seed
smooth.p.values                  true
phenotype.test                   one versus all
normalization.iterations         0
output.filename                  <input.file_basename>.comp.marker.odf

Table 3-5 Comparative Marker Selection input parameters

The data in the input file for ComparativeMarkerSelection were not log transformed (this is usually the default), so as not to affect the accuracy of calculations such as the fold change. The module computes significance values for the compared features using several metrics, including FDR (BH), Q Value, FWER, feature-specific p-value and Bonferroni, and outputs a file containing the following columns:

1. Rank: The rank of the feature within the dataset based on the value of the test statistic. If a two-sided p-value is computed, the rank is with respect to the absolute value of the statistic.
2. Feature: The feature name.
3. Description: The description of the feature.
4. Score: The value of the test statistic.
5. Feature P: The feature-specific p-value based on permutation testing.
6. Feature P Low: The estimated lower bound for the feature p-value.
7. Feature P High: The estimated upper bound for the feature p-value.
8. FDR (BH): An estimate of the false discovery rate by the Benjamini and Hochberg procedure (1). The FDR is the expected proportion of erroneous rejections among all rejections.
9. Q Value: An estimate of the FDR using the procedure developed by Storey and Tibshirani (6).

10. Bonferroni: The value of the Bonferroni correction applied to the feature-specific p-value.
11. maxT: The adjusted p-values for the maxT multiple testing procedure described in Westfall (7), which provides strong control of the FWER.
12. FWER (Family Wise Error Rate): The probability of at least one null hypothesis/feature having a score better than or equal to the observed one. This measure is not feature-specific.
13. Fold Change: The class zero mean divided by the class one mean.
14. Class Zero Mean: The class zero mean.
15. Class Zero Standard Deviation: The class zero standard deviation.
16. Class One Mean: The class one mean.
17. Class One Standard Deviation: The class one standard deviation.
18. k: If performing a two-sided test or a one-sided test for markers of class zero, k is the number of permuted scores greater than or equal to the observed score. If the testing is for markers of class one, k is the number of permuted scores less than or equal to the observed score.

In addition to the goal described above, the analysis strategy designed for this section was also to use the comparative clustering techniques/algorithms from the GenePattern software to demonstrate clinical validity and utility for the resulting comparative marker genes, and possibly to determine their relationship(s) to disease progression patterns over time. This disease-time relationship evaluation is important because, even with the unprecedented generation of cancer data since the complete mapping of the human genome, most of the data gathered in breast cancer studies describe the disease only under the conditions assessed at the time the disease was diagnosed.
This presents a far more complex challenge, because a single snapshot of the disease environment in time is not adequate as the sole basis for characterizing the disease in terms of understanding disease progression and/or continued drug efficacy at all stages of disease or disease incidence. Therefore, there is a need to investigate the disease age factor and its relationship to disease complexities and their interdependences with time [113]. For this evaluation of disease incidence and behavior over time in relation to sustained drug efficacy, a combination of clustering techniques such as hierarchical clustering in addition to

this comparative marker selection technique is often used and has been found adequate for thoroughly evaluating the disease phenomenon, the time dependence of biological responses and genomic evidence-based medicine in relation to disease progression. In this study, the behavior of the extracted genes was matched against already established genes within the ER positive subclasses derived from the initial hierarchical (unsupervised and sparse) and NMF clustering, in order to gain insight into the ER aspects of breast cancer. The aim of these combined evaluations/analyses was to identify and confirm genetic regulation mechanisms and, where applicable, pharmacogenomics factors, in order to gain insight into the regulatory interactions between the resulting genes and proteins, which may or may not be linear. The ultimate goal is to use the results to understand gene-to-disease relationships, in terms of why certain disease conditions or gene combination profiles may trigger a large effect on a biological response while other comparable situations may not yield similar responses.

The actual processing steps used for this study (Specific Aim 3) are described below:

The comparative gene marker selection analyses were adapted to identify comparative gene markers between the two breast cancer patient categories studied, in relation to normal, tumor and other cancer types, all contained within multi_a.gct.

The phenotype class file identifies the tumor and normal samples, for example:
# Normal Tumor

The confounding variable class file identifies the tissue type of each sample:
# colon kidney prostate uterus human-lung breast

Given these two class files, when performing permutations, ComparativeMarkerSelection shuffles the tumor/normal labels only among samples with the same tissue type. By default, the module performs a two-sided test; that is, the test statistic score is calculated assuming that the differentially expressed gene can be up-regulated in either phenotype class. The test direction parameter can optionally be used to specify a one-sided test, where the differentially expressed gene must be up-regulated for class 0 or for class 1; in this analysis it was set as shown in Table 3-5. After the analysis, the Comparative Marker viewer displays the test statistic score, its p-value, two FDR statistics and three FWER statistics for each gene.
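The restricted permutation described above (shuffling tumor/normal labels only among samples of the same tissue type) can be sketched as follows. This is an illustrative approximation, not the module's code: the Welch-style t score, the (k + 1)/(n + 1) p-value estimate, the function names and the toy expression values are all assumptions.

```python
import random
from statistics import mean, stdev


def t_score(values, labels):
    """Welch-style t score between class 0 and class 1 for one gene."""
    a = [v for v, lab in zip(values, labels) if lab == 0]
    b = [v for v, lab in zip(values, labels) if lab == 1]
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se


def shuffle_within_strata(labels, strata, rng):
    """Permute class labels only among samples that share a stratum (tissue)."""
    groups = {}
    for idx, tissue in enumerate(strata):
        groups.setdefault(tissue, []).append(idx)
    permuted = list(labels)
    for idxs in groups.values():
        shuffled = [labels[i] for i in idxs]
        rng.shuffle(shuffled)
        for i, lab in zip(idxs, shuffled):
            permuted[i] = lab
    return permuted


def permutation_p_value(values, labels, strata, n_perm=1000, seed=0):
    """Two-sided permutation p-value; k counts permuted scores >= observed."""
    rng = random.Random(seed)
    observed = abs(t_score(values, labels))
    k = 0
    for _ in range(n_perm):
        permuted = shuffle_within_strata(labels, strata, rng)
        if abs(t_score(values, permuted)) >= observed:
            k += 1
    return (k + 1) / (n_perm + 1), k


# Hypothetical expression values for one gene: 0 = normal, 1 = tumor.
strata = ["colon"] * 4 + ["kidney"] * 4 + ["breast"] * 4
labels = [0, 0, 1, 1] * 3
values = [1.0, 1.2, 3.1, 3.3, 5.0, 5.1, 7.2, 7.0, 2.0, 2.3, 4.1, 4.4]
p_value, k = permutation_p_value(values, labels, strata)
```

Because labels never move between tissues, differences in baseline expression between tissue types cannot by themselves produce a significant score.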

Chapter IV

4 Expected/Projected Study Outcome

The outcome of this study was geared towards providing recommendations, from the results of the gene expression analysis, that form a framework for a sustainable clinical model. Such a model can be used to identify and classify breast cancer disease patterns and to predict treatment outcome for the disease based on patient genetic makeup. When adequately applied, these recommendations can make effective translational medicine techniques available for integration into early drug development and into discovery research on disease data during treatment management. This will be beneficial for investigating new drugs. Dynamic, static and patient disease data modeling approaches for treatment relevance can be determined from these recommendations and implemented towards bridging the translational medicine gap.

Using these translational research informatics tools, including the dynamic investigative approaches with genotype-guided statistical modeling, a deep understanding of the molecular biology pathways involved in breast cancer can be obtained and used in profiling patients' treatment outcomes. A reliable predictive model that can identify an individual patient's treatment outcome can be generated by analyzing the patient's data and determining the genetic profile that may inform either disease control or disease progression with a particular drug. This model can be extended not only to predict pathways but also to explain why and how genetic profiles affect drug response, in terms of accurately predicting, from gene expression profile analysis, which patient pools are disease enabling or resistant.

A patient's pharmacogenomics aspects (probable baseline risk evaluation, disease-initiating events, preclinical progression, disease control or progression risks after treatment initiation in relation to typical therapies, and the applicable treatment points (reactive/proactive/sporadic intervention)) can be better understood and derived during the course of the disease. This can provide opportunities for optimal drug intervention, in terms of selecting therapeutic options for breast cancer patients and even for patient selection/stratification during early drug trials.

A translational update of the probability of success for the various breast cancer therapies/drug options across clinical studies can be made possible by using a comprehensive framework for disease behavior to determine p(success), in other words, drug success. p(success), a key indicator for drug portfolio prioritization during drug development selection and valuation, can be confidently determined through early translational evaluations. Translational medicine adds intermediate steps to drug development at which p(success) is updated. To take advantage of this effort for better and faster decision making, it is important to evaluate p(success) thoroughly so that it can be relied upon for updating trials.

Decision support tools derived from this study can be used during the course of the disease to evaluate the various aspects/levels of disease burden, in support of therapeutic decisions on drug prediction/selection for personalized care. These aspects include the following:
- assessing risk for disease for baseline risk determination,
- refining assessment for preclinical disease progression at the stage when the disease event is initiating and at diagnosis, as well as during disease progression,
- evaluating all therapeutic options to predict and diagnose disease effectively.

Overall, hierarchical and sparse hierarchical clustering was expected to present a highly sensitive, common and valuable approach for measuring the distance between the relevant genes within the resulting clusters from both breast cancer data sets, and for extracting the significance of those genes in relation to the disease type. Note that hierarchical and sparse hierarchical clustering may require a subjective definition of clusters to facilitate the grouping of gene elements based on how close they are to one another and the classification of the resulting clusters by assessing gene distance.

The non-negative matrix factorization (NMF) clustering technique was expected to cluster the data by breaking it down into metagenes or meta-samples, each of which represents a group of genes or samples respectively, and to further refine, confirm and validate the candidate list of gene data elements. It was also expected to extract features from the resulting genes within the clusters that may correspond more accurately to biological processes.

The consensus clustering algorithm, used in conjunction with the NMF clustering method, was expected to further classify and cluster the genes that are significant. This technique was expected to help determine an optimal number of clusters by running its clustering algorithm several times and then assessing the stability of the discovered clusters.

Adapting/integrating a K-Means clustering algorithm helps to further obtain and classify unique clusters. The algorithm randomly selects a center data point for each of k clusters and assigns each data point to the nearest cluster center. It then iteratively recalculates a new center for each cluster, based on the mean value of its members, and reassigns all data points to the closest cluster center until the cluster centers stop moving between consecutive iterations and k stable clusters are obtained.
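The K-Means procedure just described can be sketched in a few lines. This is a minimal illustration under stated assumptions: Euclidean distance, initial centers drawn at random from the data points, and convergence declared when assignments stop changing. The toy two-dimensional points are hypothetical and are not study data.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers taken from the data
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:  # assignments stable: converged
            break
        assignment = new_assignment
        # Recompute each center as the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centers


# Hypothetical expression profiles (two features per sample).
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
assignment, centers = k_means(points, k=2)
```

The result depends on the random initial centers, which is one reason K-Means is usually repeated with several seeds in practice.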

Pharmacogenomics evaluations were expected to implement a robust, integrated, cross-functional bioinformatics approach for investigating the biological function of the clustered genes and discerning their correlation to disease incidence, progression, aggressiveness and recurrence, with the expectation of ultimately deriving results that fit an effective pharmacogenomics model.

The comparative gene marker selection module provided effective algorithms for determining, from a heterogeneous cancer population, the gene markers that were relevant within more than one cancer group. These similarities were used to assess the possibility of applying a drug to more than one form of cancer. This approach can enable the application of one drug to more than one type of cancer if the relevant genes are up-regulated in the different cancer types.

Chapter V

5 RESULTS AND DISCUSSIONS

5.1 Results and analyses

The results of hierarchical, sparse hierarchical, NMF and consensus clustering and of comparative marker selection are given below.

Results of Unsupervised/Sparse Hierarchical Clustering

There were 4,968 significant genes across both groups of Breast_A and Breast_B data. Each row in the data files represents a tumor and each column a single gene. Overall, the results from the initial unsupervised hierarchical clustering showed significant, functionally relatable clusters of genes. Pair-wise similarity metrics among genes were calculated on the basis of expression ratio measurements across all tumors. Similarly, for tumor clustering, pair-wise similarity measures among tumors were calculated based on expression ratio measurements across all significant genes. The results showed significant dominant genes that were specifically present in each data set, as well as genes that persisted across both data sets in very distinct gene clusters. As shown in the color bar in the dendrograms below, red indicates up-regulation, green down-regulation, black no change, and grey no data available. The yellow line marks the subdivision into two dominant tumor clusters. Each gene is labeled by its gene name or accession number from GenBank. Contig ESTs ending with RC are the reverse complement of the named contig EST.

[Figure panels: Breast_A HC clusterview from cdt file; Breast_B HC clusterview from cdt file]
Figure 5-1 Breast_A & Breast_B HC clusterview from cdt file

Sparse hierarchical clustering, on the other hand, further determined the weights of the significant genes. A total of 24 genes were weighted for significance in Breast_A data, while a total of 29 genes were weighted for significance in Breast_B data. The sparse hierarchical clustering resulted in very distinct gene clusters (see Table 5-1 below), showed the significant genes in the analyses and provided the weights for the resulting gene significance. The weights for the resulting genes are described below, and further analyses of these gene functions in relation to their significance have also been performed.

In this study, sparse hierarchical clustering demonstrated its ability to analyze large, flat data sets (data sets containing a large number of features (genes) and a relatively small number of known observations) such as the Breast_A and Breast_B breast cancer data sets. It clustered the observations within the Breast_A and Breast_B data sets using an adaptively chosen subset of the provided features, producing a significant feature/gene selection for the relevant genes. It also allowed for better coverage of the clustering features (features that help explain the true underlying clusters), while excluding a large fraction of noise features (non-clustering features) that may hide the underlying clusters, as could happen with classical hierarchical clustering.

Breast_A[1][1] and Breast_B[1][1] weights (sparse hierarchical clustering). Columns in the original table: Gene Name, Gene Description, Weight. [Gene descriptions and weight values are missing from this transcription; gene names are listed in table order.]

Breast_A gene    Breast_B gene
SCGB2A2          TFF1
NPY1R            TFF3
TFF1             CYP2B7P1
TFF3             LTF
ESR1             KRTHB1
S100A8           KRT15
AREG             GFBP2 (IGFBP2)
MAGEA1           CEACAM6
S100A7           NAT1
DHRS2            S100A1
CYP2B7P1         CD14
TNNT1            KRT5
CP               RARRES1
PRAME            MMP7
S100A9           CRAT
CPB1             EEF1A2
MIA              CHI3L1
UNG2             KRT17
LTF              CRABP1
ACOX2            IFI27
MMP7             ACTG2
GATA3            SERPINB5
S100P            TRIM29
LY6D             C4A
                 GDF15
                 GRIA2
                 LAD1
                 PFKP
                 MIA

Table 5-1 Breast_A & Breast_B Weights- A[1][1] and B[1][1]

Genes including SCGB2A2, NPY1R, TFF1, TFF3, ESR1, S100A8, AREG, MAGEA1, S100A7, DHRS2, MMP7 & CYP2B7P1 were significant in Breast_A data (data from patients with disease less than 5 years), while genes such as TFF1, TFF3, CYP2B7P1, LTF, KRTHB1, KRT15, GFBP2, CEACAM6, NAT1, S100A1 & CD14 were significant in Breast_B data (data derived from patients with disease greater than 5 years). The genes that persisted across both data sets include TFF1, TFF3, CYP2B7P1, MMP7 and S100A7. They are functionally significant, especially within the Breast_B data clusters, where they are significantly up-regulated in the poor prognosis signature noted in the Breast_B weighted group.

Figure 5-2 Sparse Hierarchical Clustering Results

Results Analyses for Unsupervised and Sparse Hierarchical Clustering

Specifically, the resulting significant gene groups from these analyses, in terms of function, included the cyclin, metalloproteinase and VEGF receptor gene groups, which are mainly involved

in cell cycle, invasion, metastasis, angiogenesis and signal transduction. These genes were significantly up-regulated in the poor prognosis signature, as shown in the cluster results derived from Breast_B (data from patients who developed distant metastases at 5 years and greater) and Breast_A (data from patients with disease less than 5 years). These up-regulated genes were generally associated with distinct meta-gene pools for BRCA1, BRCA2, ATM, P53, CHEK2 and NBS1 and were important considerations for the potential for disease metastases. Cyclin E2, MCM6, the metalloproteinases MMP7 and MP1, RAB6B, PK428, ESM1 and the VEGF receptor FLT1 are known as persistent genes present in the metastasis group, thus providing some insight into tumor classification in terms of the underlying biological mechanism with a potential for rapid metastases [11]. Specific genes clustered with particular disease categories: NPY1R, SCGB2A2 and TNNT1 for Breast_A data, and TFF1, TFF3 and CYP2B7P1 for Breast_B data. In addition, some genes persisted across the two breast cancer data sets, including MMP7, TFF1, TFF3 and ESM1. The functional annotation for these clustered genes in the unsupervised and sparse hierarchical clustering study was also similar to that of another, supervised hierarchical clustering study in which 231 potential prognostic reporter genes were evaluated, and in which more genes (RAD21, cyclin B2, PCTAIRE, CDC25B, CENPF, VEGF, PGK1, MAD2, CKS2, BUB1) belonging to typical breast cancer functional categories became apparent in the clustering results.

5.2 NMF Clustering Analyses and Results

The NMF clustering returned meaningful biological information based on the meta-gene expression patterns that resulted from the decomposition of the relevant genes. A maximum of 5 clusters was set for these analyses, with the NMF algorithm grouping the M samples into k clusters.
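The factorization A ~ WH and the assignment of samples to clusters can be sketched as follows. This is a simplified illustration rather than the GenePattern module: it uses the standard multiplicative updates for the divergence objective, a fixed number of iterations instead of the stop.convergence/stop.frequency checks, and it rescales each metagene by the column sum of W before comparing entries of H. That rescaling, the function names and the toy matrix are assumptions made for this example.

```python
import math
import random


def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]


def divergence(A, W, H):
    """D = sum_ij ( A_ij * log(A_ij / (WH)_ij) - A_ij + (WH)_ij )."""
    total = 0.0
    for a_row, wh_row in zip(A, matmul(W, H)):
        for a, wh in zip(a_row, wh_row):
            if a > 0:
                total += a * math.log(a / wh)
            total += wh - a
    return total


def ratio(A, W, H, eps):
    """Element-wise A_ij / (WH)_ij, used by both update rules."""
    WH = matmul(W, H)
    return [[a / (wh + eps) for a, wh in zip(ar, wr)] for ar, wr in zip(A, WH)]


def nmf_divergence(A, k, n_iter=200, seed=0, eps=1e-9):
    rng = random.Random(seed)
    n, m = len(A), len(A[0])
    W = [[rng.random() + eps for _ in range(k)] for _ in range(n)]  # N x k
    H = [[rng.random() + eps for _ in range(m)] for _ in range(k)]  # k x M
    for _ in range(n_iter):
        R = ratio(A, W, H, eps)
        for a in range(k):  # update metagene expression patterns H
            w_sum = sum(W[i][a] for i in range(n))
            for j in range(m):
                H[a][j] *= sum(W[i][a] * R[i][j] for i in range(n)) / w_sum
        R = ratio(A, W, H, eps)
        for a in range(k):  # update metagene definitions W
            h_sum = sum(H[a])
            for i in range(n):
                W[i][a] *= sum(H[a][j] * R[i][j] for j in range(m)) / h_sum
    return W, H


def assign_clusters(W, H):
    """Place sample j in the cluster of its most highly expressed metagene."""
    k, m = len(H), len(H[0])
    scale = [sum(row[a] for row in W) for a in range(k)]  # assumed rescaling
    return [max(range(k), key=lambda a: H[a][j] * scale[a]) for j in range(m)]


# Hypothetical 4-gene x 6-sample matrix with two sample groups.
A = [[9, 8, 9, 1, 1, 2],
     [8, 9, 8, 2, 1, 1],
     [1, 2, 1, 9, 8, 9],
     [2, 1, 1, 8, 9, 8]]
W, H = nmf_divergence(A, k=2)
clusters = assign_clusters(W, H)
```

Because W and H are initialized at random, different seeds can give different factorizations, which is what the consensus step over repeated runs is meant to address.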

The predominant genes that persisted in both data sets were Erp_LNn, Erp_LNp, Ern_LNp and Ern_LNn. These were up-regulated within the 5 clusters of the Breast_A and Breast_B data sets. In the heat maps shown in Fig 5-8 above, the off-diagonal entries of the resultant consensus matrix served as the similarity measure among samples. The degree of dispersion of the visually inspected, reordered consensus matrix provided substantial insight (see Fig. 9a & 9b). The deep blue color corresponds to a numerical value of 0, meaning that the samples were never assigned to the same cluster (not up-regulated within the two data groups). The dark red color corresponds to 1, which means that the samples always appeared in the same cluster and were up-regulated. Overall, the use of consensus clustering with the NMF approach made the selection of the number of classes an objective consideration of the quantitative cophenetic coefficient rather than a subjective evaluation.

The NMF algorithm yielded a sparse, parts-based representation of the data, which enabled the recognition of biological features in the data sets. The metagene profiles became more localized in sample space, and their supports overlapped less, as more NMF iterations were performed, and thus provided metagene profiles that were positive, sparse, localized and relatively independent. The resulting clusters showed a nested structure as k increased from 2 to 4, with the nesting capturing the subtypes. Where the clustering into k classes was strong, sample assignment to clusters did not vary much across random starting points. The consensus matrix reflects the probability that each pair of samples is clustered together, and this enabled the NMF method to identify the known nested structure of the breast cancer classes, recover biologically significant phenotypes and uncover substructures whose robustness could be evaluated by a cophenetic correlation coefficient.
The heat map of the re-ordered consensus matrix for k = 2 is shown below for Breast_A and Breast_B data (Fig. 5-3).

Figure 5-3 NMF cluster results for Breast_A & Breast_B data: Convergence Plot K=2
[Figure panels: NMF cluster results for Breast_A data, Convergence Plot K=2; NMF cluster results for Breast_B data, Convergence Plot K=2]

Figure 5-4 Consensus matrix k=2 Breast_A & Breast_B

The complete NMF cluster results for Breast_A and Breast_B data are given below:

Figure 5-5 Data Results from NMF clustering of Breast_A.gct and Breast_B.gct analyses

5.3 Model Selection for NMF Clustering

Model selection was determined by using an approach within the module that exploited the stochastic nature of the NMF consensus clustering algorithm [9]. This was used to evaluate the consistency and robustness of the algorithm's performance and to determine whether a given rank k decomposed the samples into meaningful clusters while providing a biologically meaningful decomposition of the data. 100 runs were selected for these analyses. It is recommended to select the number of runs by continuing until the consensus matrix appears to stabilize. For each run, the sample assignment was defined by a connectivity matrix C (average connectivity

matrix over many clustering runs, denoted $\bar{C}$, with entries ranging from 0 to 1 to reflect the probability that samples i and j cluster together) with entry cij = 1 if samples i and j belong to the same cluster, and cij = 0 if they belong to different clusters. The dispersion of the entries of $\bar{C}$ between 0 and 1 measured the reproducibility of the class assignments with respect to random initial conditions. Using the off-diagonal entries of $\bar{C}$ as a measure of similarity among samples, average linkage HC was used to reorder the samples and thus the rows and columns of $\bar{C}$ [9]. Typically, as the number of runs increases, the meta-gene expression patterns across the samples become more localized with decreasing overlapping support, resulting in a sparse, localized and compact representation [11]. Studies have shown that NMF runs are usually sufficient to provide stability to the clustering [10],[11]. Given the random initial conditions for the runs, the NMF algorithm did not always converge to the same solution on each run, but since the clustering into k classes seemed to be strong, the sample assignment to clusters varied little from run to run.

Figure 5-6 Consensus matrix k=5 dataset for Breast_A & Breast_B
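As a minimal sketch of the connectivity and consensus matrices defined above, the following builds one connectivity matrix per run and averages them. The cluster labels are hypothetical, and the names `connectivity_matrix`, `consensus_matrix` and `C_bar` are chosen only for this illustration.

```python
def connectivity_matrix(assignment):
    """c_ij = 1 if samples i and j share a cluster in this run, else 0."""
    n = len(assignment)
    return [[1 if assignment[i] == assignment[j] else 0 for j in range(n)]
            for i in range(n)]


def consensus_matrix(assignments):
    """Average of the connectivity matrices over all clustering runs."""
    n = len(assignments[0])
    total = [[0] * n for _ in range(n)]
    for assignment in assignments:
        C = connectivity_matrix(assignment)
        for i in range(n):
            for j in range(n):
                total[i][j] += C[i][j]
    runs = len(assignments)
    return [[total[i][j] / runs for j in range(n)] for i in range(n)]


# Hypothetical cluster labels for 4 samples from three runs. Label names may
# be swapped between runs; connectivity only depends on co-membership.
runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 1, 1]]
C_bar = consensus_matrix(runs)
```

Here samples 2 and 3 cluster together in every run, so their consensus entry is 1, while samples 0 and 1 cluster together in only two of the three runs.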

With a clustering that was stable, C tended not to vary among runs, and the entries of $\bar{C}$ were close to 0 or 1.

Validating the NMF Algorithm

The next important step in the clustering process involved validating the NMF algorithm based on the cophenetic correlation coefficient, which evaluates the stability of the clustering associated with a given rank k. The cophenetic correlation coefficient, ρk($\bar{C}$), is a measure of the dispersion of the consensus matrix. ρk is computed as the Pearson correlation of two distance matrices: the first, I − $\bar{C}$, is the distance between samples induced by the consensus matrix, and the second is the distance between samples induced by the linkage used in the reordering of $\bar{C}$. For these analyses, the reordered consensus matrices and the corresponding cophenetic correlations were evaluated.

The Cophenetic Coefficient plot for Breast_A.gct clustering
The Cophenetic Coefficient plot for Breast_B.gct clustering

Figure 5-7 The Cophenetic Coefficient plot for Breast_A & Breast_B.gct clustering
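The cophenetic correlation described above can be sketched with SciPy. This is an illustrative reconstruction under stated assumptions, not the module's own code: the function name `cophenetic_correlation` is hypothetical, average linkage is assumed, and the two small consensus matrices are made-up examples.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

def cophenetic_correlation(consensus):
    """Pearson correlation between 1 - consensus and the linkage-induced distances."""
    dist = 1.0 - np.asarray(consensus, dtype=float)
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)   # upper-triangle distance vector
    tree = linkage(condensed, method="average")  # average-linkage HC
    rho, _ = cophenet(tree, condensed)
    return rho

# A perfect two-block consensus matrix (all entries 0 or 1).
perfect = np.array([[1, 1, 0, 0],
                    [1, 1, 0, 0],
                    [0, 0, 1, 1],
                    [0, 0, 1, 1]], dtype=float)

# A dispersed consensus matrix with entries scattered between 0 and 1.
noisy = np.array([[1, 1, 1/3, 0],
                  [1, 1, 1/3, 0],
                  [1/3, 1/3, 1, 2/3],
                  [0, 0, 2/3, 1]], dtype=float)

rho_perfect = cophenetic_correlation(perfect)
rho_noisy = cophenetic_correlation(noisy)
print(rho_perfect, rho_noisy)
```

Comparing the coefficient across candidate ranks k, and choosing the rank where it begins to fall, is the selection rule the text describes.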

Although visual inspection of the reordered matrix provided substantial insight (see Fig. 5-6), it was also important to have a quantitative measure in the form of the cophenetic correlation coefficient (see Fig. 5-7). Coefficients ρ for the hierarchically clustered matrices of the Breast_A and Breast_B datasets, k = 5, were computed by the NMF algorithm as shown in Fig. 5-7 above. In a perfect consensus matrix (all entries = 0 or 1), the cophenetic correlation coefficient equals 1. When the entries are scattered between 0 and 1, the cophenetic correlation coefficient is < 1, and the values of k were selected where the magnitude of the cophenetic correlation coefficient began to fall (see Fig. 5-7). The reordered consensus matrices and ρ were derived from the 50 connectivity matrices calculated at k = 2 to 5.

The cophenetic correlation (averaged over all 5 clusters) for the Breast_A.gct data shows that ρ drops when k increases from 3 to 4, indicating that a three-cluster split of the data is more stable than a four-cluster split. However, ρ moved back up when k increased from 4 to 5, in contrast to the drop observed at k = 4. Therefore, it seems that further division of the cancer subtypes may be possible. On the other hand, the cophenetic correlation ρ (averaged over all 5 clusters) for Breast_B.gct showed a consistent drop as k went from 2 to 5.

Higher ranks k revealed further partitioning of the samples (Fig. 5-7), as shown by the consensus matrices generated for ranks k = 2, 3, 4, 5. Clear block-diagonal patterns attest to the robustness of models with 2, 3, and 4 classes, whereas a rank-5 factorization showed increased dispersion (Fig. 5-8). This qualitative observation is reflected quantitatively in the decreased value of the cophenetic correlation ρ4 (Fig. 5-8).

Figure 5-8 Data Results for NMF Clustering analyses for Breast_A.gct & Breast_B.gct (all 5 k.plot)

Results analysis for the NMF Clustering

NMF clustering returned meaningful biological information based on the meta-gene expression patterns that resulted from the decomposition of the relevant genes. A maximum of 5 clusters was set for these analyses, with the NMF algorithm grouping the M samples into k clusters. The predominant genes that persisted in both data sets were Erp_LNn, Erp_LNp, Ern_LNp and Ern_LNn. These were up-regulated within the 5 clusters of the Breast_A and Breast_B data sets.

In the heat maps shown in Fig. 5-8 above, the off-diagonal entries of the resultant consensus matrix served as the similarity measure among samples. Visual inspection of the degree of dispersion in the reordered consensus matrix provided substantial insight (see Fig. 9a & 9b). Deep blue corresponds to a numerical value of 0, meaning that the samples were never assigned to the same cluster. Dark red corresponds to 1, meaning that the samples always appeared in the same cluster.

Overall, the use of the consensus clustering algorithm with the NMF approach made the selection of the number of classes an objective consideration of the quantitative cophenetic coefficient rather than a subjective evaluation.

5.4 Consensus Clustering

Consensus clustering is a clustering methodology often used in conjunction with resampling techniques for class discovery and clustering validation, tailored to the task of analyzing gene expression data. As an analysis approach, it provides a method to represent the consensus across multiple runs of a clustering algorithm, to determine the number of clusters within the data, and to assess the stability of the discovered clusters. More specifically, the method can be used to represent the consensus over multiple runs of a clustering algorithm with random restart (such as K-means, model-based Bayesian clustering, SOM, etc.), so as to account for its sensitivity to the initial conditions. Finally, it provided a visualization tool that was used to inspect cluster number, membership, and boundaries.

Consensus clustering emerged as an important elaboration of the classical clustering problem. Consensus clustering, also called aggregation of clusterings (or partitions), refers to the situation in which a number of different (input) clusterings have been obtained for a particular dataset and it is desired to find a single (consensus) clustering which is a better fit in some sense than the existing clusterings [136][137]. Consensus clustering is thus the method that resolves the problem of reconciling clustering information about the same data set coming from different sources or from different runs of the same algorithm. The main motivation for the consensus clustering methodology was the need to assess the stability of the discovered clusters, that is, the robustness of the putative clusters to sampling variability.
The basic assumption of this method is intuitively simple: if the data represent a sample of items drawn from distinct sub-populations, and if we were to observe a different sample drawn from the same sub-populations, the induced cluster composition and number should not be radically different. Therefore, the more robust the attained clusters are to sampling variability, the more confident we can be that these clusters represent real structure. To this end, perturbations of the original data were simulated by resampling techniques. The clustering algorithm was then applied to each of the perturbed data sets, and the agreement, or consensus, among the multiple runs was used to assess the effectiveness of the methodology in discovering biologically meaningful clusters.

Measuring consensus clustering

For this clustering, a consensus matrix was first defined as a means of representing and quantifying the agreement among the clustering runs over the perturbed datasets. A consensus matrix is an (N × N) matrix that stores, for each pair of items, the proportion of clustering runs in which the two items are clustered together. The consensus matrix is obtained by taking the average over the connectivity matrices of every perturbed dataset. These selections are included in the input data below.

The Consensus Clustering module from the GenePattern software was used in the analyses below as follows:

Module: Consensus Clustering
urn: lsid:broad.mit.edu:cancer.software.genepattern.module.analysis

Input data was derived as shown below and the algorithm was run using the parameters indicated below:

The Consensus clustering parameters include the following:

Parameter                  Value
input.filenames            Breast_A[1].gct and Breast_B[1].gct
kmax                       5 (must be >1)
resampling.iterations      20
seed.value
clustering.algorithm       Hierarchical
cluster.by                 rows/genes
distance.measure           EUCLIDEAN
resample                   subsample
merge.type                 average
descent.iterations         2000
output.stubs               <input.filename_basename>
normalize.type             -n1
normalization.iterations   0
create.heat.map            yes
heat.map.size              2 (number between 1 and 20)

Table 5-2 Consensus clustering parameters

Consensus matrix re-ordering and visualization

The consensus matrix lends itself naturally to use as a visualization tool to help assess cluster composition and number. In particular, if we associate a color gradient with the 0 to 1 range of real numbers, so that white corresponds to 0 and dark red corresponds to 1, and if we assume the matrix is arranged so that items belonging to the same cluster are adjacent to each other (with the same item order used to index both the rows and the columns of the matrix), a matrix corresponding to perfect consensus will be displayed as a color-coded heat map characterized by red blocks along the diagonal on a white background.
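The resampling procedure behind the consensus matrix can be sketched as follows. This is a simplified illustration and not the GenePattern implementation: the function name `resampled_consensus`, the 80% subsample fraction, and the synthetic two-group data are assumptions. Only the hierarchical algorithm, Euclidean distance, average merge type, and 20 resampling iterations mirror Table 5-2.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def resampled_consensus(X, k, iterations=20, fraction=0.8, seed=0):
    """Consensus matrix over subsampled hierarchical clusterings of the rows of X."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    together = np.zeros((n, n))  # times a pair was placed in the same cluster
    sampled = np.zeros((n, n))   # times a pair was drawn into the same subsample
    size = max(2, int(round(fraction * n)))
    for _ in range(iterations):
        idx = rng.choice(n, size=size, replace=False)
        tree = linkage(X[idx], method="average", metric="euclidean")
        labels = fcluster(tree, t=k, criterion="maxclust")
        same = (labels[:, None] == labels[None, :]).astype(float)
        together[np.ix_(idx, idx)] += same
        sampled[np.ix_(idx, idx)] += 1.0
    # Proportion of co-clustering among the runs in which both items were drawn.
    with np.errstate(invalid="ignore", divide="ignore"):
        consensus = np.where(sampled > 0, together / sampled, 0.0)
    np.fill_diagonal(consensus, 1.0)
    return consensus

# Synthetic data (an assumption for illustration): two well-separated groups of 5 items.
noise = np.random.default_rng(1).normal(scale=0.1, size=(10, 2))
X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)]) + noise
M = resampled_consensus(X, k=2)
print(np.round(M, 2))
```

With items ordered by group, the printed matrix shows the block-diagonal pattern that the heat map visualizes: ones within each group and zeros between groups.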

Figure 5-9 Lorenz Curve for Breast_A_data

Figure 5-10 Change in Gini for Breast_A_data

Figure 5-11 Consensus CDF (empirical cumulative distribution) Graph for Breast_A_data

Figure 5-12 Delta area Plot for Breast_A_data

Figure 5-13 Heat Map for Breast_A.sub78.srt.5.gct

Figure 5-14 Breast_B data Lorenz curve plot

Figure 5-15 Breast_B.data Change in Gini plot

Figure 5-16 Breast_B.data Delta area plot

Figure 5-17 Heatmap for Breast_B.sub39.srt.5.gct

The consensus clustering results are represented in the heat maps (each corresponding to the sorted consensus matrix), the Lorenz curves, the Gini index plots and the consensus CDF (empirical cumulative distribution) plots, a series of statistics that can be used to determine the best number of clusters. They show that the clusters resulting from the consensus clustering were consistent with the clusters derived from the other clustering algorithms used in the analyses.

5.5 Comparative Marker selection results and analyses

An important step in analyzing expression profiles from microarray data is to identify genes that can discriminate between distinct classes of samples.

In this study, microarray gene expression data (Multi_A.gct and Multi_A.cls) were downloaded for four cancer types, namely breast, colon, lung and prostate cancer, from the cancer data sets of the Broad Institute of MIT and Harvard. To ensure that the prediction results can be generalized to different datasets, two independent test sets (all_aml_train.gct and all_aml_train.cls) were used to evaluate the robustness of the predicted gene markers obtained from the training set. Results for the training sets are shown below as the resulting comparative gene marker list (all_aml_train.preprocessed.comp.marker.odf).

Table 5-3 Comparative marker selection.comp.marker.odf all_aml_train.preprocessed

The training data containing the all_aml_train.preprocessed data groups shows that of the 7129 features compared, 3829 were up-regulated in the ALL group while 3300 were up-regulated in the AML group, showing equitable comparative analyses.

In the actual test analysis for the identification of differentially expressed genes from samples belonging to two phenotypes, as well as from the datasets containing multiple phenotypes (Multi_A.gct and Multi_B.gct) derived from the Broad Institute of MIT and Harvard cancer datasets, the Comparative Marker Selection module was used to perform the required permutations. A test statistic (e.g., t-test) was chosen to assess the differential expression between the two classes of samples, and the significance (nominal p-value) of marker genes was computed using a permutation test, which is a commonly used method for assessing the significance of marker genes. Permutation tests have the advantage of not assuming a parametric underlying distribution of expression values and, importantly, they preserve gene-gene correlations, which affect some measurements of significance.

For those datasets with unpaired cancer and control samples from the same patients within the Multi_A.gct dataset, a test within the module similar to the Mann-Whitney test was applied to identify genes that are differentially expressed in cancer versus control samples. For those datasets with paired information, the test is represented as follows. Given the hypothesis that a particular gene is not differentially expressed in cancer versus the control group, rejection of this hypothesis means that the gene is differentially expressed in cancer. Let $x_i$ and $y_i$ be the gene's expression levels in the control and cancer tissues of the i-th patient, $i = 1, \dots, m$, where m is the number of patients. If the hypothesis is true, then the probability $P(x_i < y_i) = P(x_i > y_i) = 0.5$, assuming the gene's expression is a continuous random variable. Let K be the number of patients with $x_i < y_i$; then the random variable K/m approximately follows a normal distribution (according to the Central Limit Theorem or de Moivre-Laplace Theorem) with mean 0.5 and standard deviation $\sqrt{0.25/m}$, or equivalently $(K/m - 0.5)/\sqrt{0.25/m}$ follows a normal distribution N(0,1). Thus the p-value can be estimated as $P(X > (k/m - 0.5)/\sqrt{0.25/m})$, where $X \sim N(0,1)$ and k is the observed number of patients satisfying $x_i < y_i$. Overall, a gene is considered to be differentially expressed if its p-value is less than 0.05 and its fold-change is at least 2.

To construct a distribution of the test statistic under the null hypothesis of no differential expression using GenePattern's Comparative Marker Selection module, the phenotype labels were randomly re-assigned to samples and the test statistic was recomputed for the relabeled data set. This procedure was repeated for a given number of relabelings to yield the empirical null distribution.
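Both significance procedures described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the module's code: the function names and toy data are hypothetical, the sign test is one-sided (testing for higher expression in cancer), and the permutation test uses an absolute difference in class means as a simple stand-in for the t-statistic.

```python
import math
import random
from statistics import mean

def sign_test_pvalue(control, cancer):
    """One-sided normal-approximation sign test for paired expression values."""
    m = len(control)
    k = sum(1 for x, y in zip(control, cancer) if x < y)
    z = (k / m - 0.5) / math.sqrt(0.25 / m)
    return 0.5 * math.erfc(z / math.sqrt(2))  # P(X > z) for X ~ N(0, 1)

def permutation_pvalue(values, labels, n_perm=1000, seed=0):
    """Two-sided permutation p-value for a difference in class means (labels 0/1)."""
    rng = random.Random(seed)

    def statistic(lab):
        a = [v for v, l in zip(values, lab) if l == 0]
        b = [v for v, l in zip(values, lab) if l == 1]
        return abs(mean(a) - mean(b))

    observed = statistic(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)               # random re-assignment of phenotype labels
        if statistic(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)        # add-one correction avoids p = 0

# Hypothetical paired data: the gene is higher in cancer for all 16 patients.
control = list(range(16))
cancer = [c + 1 for c in control]
p_sign = sign_test_pvalue(control, cancer)

# Hypothetical unpaired data for one gene across two classes.
p_perm = permutation_pvalue([1, 2, 3, 4, 11, 12, 13, 14], [0, 0, 0, 0, 1, 1, 1, 1])
print(p_sign, p_perm)
```

The add-one correction in the permutation estimate is a common convention and is an assumption here; the source does not state how the module handles zero counts.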

It should be noted that the total number of possible exhaustive permutations is a function of the number of samples in each class. In cases where the number of permutations is insufficient to estimate a significant p-value, the module also provided the option of computing asymptotic p-values based on the t-test. In addition to reporting the nominal p-value, the module also reports the estimated 95% confidence interval for the nominal p-value to assess p-value accuracy, and it includes an option to perform all possible relabelings to obtain exact p-values.

Selecting class markers is a particular instance of the general multiple hypothesis testing (MHT) problem. Since several thousand hypotheses (genes) were tested at once, the nominal p-values had to be corrected to account for the increased number of potential false positives. For example, in the case of the training samples, where 7129 genes were tested for differential expression, a nominal p-value threshold of 0.01 would only ensure that the expected number of false positives is less than approximately 71 (0.01 × 7,129). In the test case, as seen below, where 5565 features were tested for differential expression, a nominal p-value threshold of 0.01 would only ensure that the expected number of false positives is less than approximately 56 (0.01 × 5,565). One approach to adjusting for multiple hypothesis testing within the module is to control the false discovery rate (FDR), which is the expected fraction of false positives among all genes reported as significant. The FDR cut-off level controls the fraction of false leads that the study can tolerate.

The data files used in these analyses include the following:

The module used in these analyses is the following:

Comparative Marker Selection
urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis

The Comparative Marker Selection results were as follows:

input.file:
cls.file: Multi_A_modified.cls

Multi_A.comp.marker.Br.vs.Rest.odf (1.3 MB)
Multi_A.comp.marker.Pr.vs.Rest.odf (1.3 MB)
Multi_A.comp.marker.Lu.vs.Rest.odf (1.3 MB)
Multi_A.comp.marker.Co.vs.Rest.odf (1.3 MB)
gp_execution_log.txt (1.0 KB)

The complete set of Comparative Marker Selection analyses for breast cancer comparative genes versus the rest (colon, lung and prostate) is provided below for the following aspects: Upregulated features, FDR (BH) vs. Q Value, Lambda vs TTO (TT0=0.381) and Score Histogram. Data tables for the gene list with the rankings are provided separately:

Table 5-4 Comparative Marker selection results for Multi_A.comp.marker.Br.vs.Rest.odf

Of the 5565 features tested, a total of 3421 features/genes were up-regulated in breast cancer (from Multi_A.gct; data less than 5 years) versus 2144 features in the remaining colon, prostate and lung cancers. Early breast cancer data, including the Breast_A.gct file, were chosen in order to evaluate the genes that are prevalent in the early stages and that code for disease initiation/incidence in the various cancer types. It is believed that the potential for recovering gene markers is greater during the early periods of disease.
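The multiple-testing arithmetic and the Benjamini-Hochberg (BH) adjustment behind the FDR (BH) values can be sketched as follows. This is a minimal illustration, not the module's implementation: the function names and the four example p-values are hypothetical, and only the feature counts (7129 and 5565) and the 0.01 threshold come from the text.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives at a nominal threshold if all nulls hold."""
    return n_tests * alpha

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (FDR estimates) in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(expected_false_positives(7129, 0.01))  # training case from the text
print(expected_false_positives(5565, 0.01))  # test case from the text

# Hypothetical nominal p-values for four genes.
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(adj)
```

A gene would then be reported as a marker when its adjusted value falls below the FDR cut-off that the study can tolerate.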

Figure 5-18 Comparative marker selection FDR (BH) vs. Q-Value Br.vs.Rest (Colon, lung and prostate).odf

Figure 5-19 Comparative marker selection Lambda vs. TTO Br. Vs. Rest (Colon, lung and prostate).odf

Figure 5-20 Comparative marker selection Score Histogram Br. Vs. Rest (Colon, lung and prostate).odf

Figure 5-21 Comparative marker selection Q-Value Histogram Br. Vs. Rest (Colon, lung and prostate).odf

Table 5-5 Comparative Marker selection Multi_A.comp.marker.Br.vs.Rest Data list.odf

Figure 5-22 Comparative Marker Selection HEAT Map Breast_A_CLS.gct

Figure 5-23 Comparative Marker Selection HEAT Map Breast_B_CLS.gct


More information

It is a malignancy originating from breast tissue

It is a malignancy originating from breast tissue 59 Breast cancer 1 It is a malignancy originating from breast tissue including both early stages which are potentially curable, and metastatic breast cancer (MBC) which is usually incurable. Most breast

More information
