Nearest Shrunken Centroid as Feature Selection of Microarray Data


Myungsook Klassen, Computer Science Department, California Lutheran University, 60 West Olsen Rd, Thousand Oaks, CA
Nyunsu Kim, Computer Science Department, California Lutheran University, 60 West Olsen Rd, Thousand Oaks, CA

Abstract

The nearest shrunken centroid classifier uses shrunken centroids as prototypes for each class, and test samples are classified to the class whose shrunken centroid is nearest to them. In our study, the nearest shrunken centroid classifier was used simply to select important genes prior to classification. Random Forest, a decision-tree-based classification algorithm, was chosen as the classifier for seven cancer microarray data sets for correct diagnosis. Classification was also performed using the nearest shrunken centroid classifier, and its results are compared to those from Random Forest. Our study demonstrates that the nearest shrunken centroid classifier is simple yet efficient in selecting important genes, but does not perform well as a classifier. We report that the performance of Random Forest as a classifier is far superior to that of the shrunken centroid classifier.

1. Introduction

The development of microarray technology made the analysis of gene expression possible. Analyzing gene expression data from microarray devices has many important applications in medicine and biology: the diagnosis of disease, accurate prognosis for particular patients, and understanding the response of a disease to drugs, to name a few. However, the amount of data in each microarray presents significant challenges to data mining. Microarray data typically have many attributes and few samples because of the difficulty of collecting and processing samples, especially for human data. The typical number of samples is small, often fewer than 100. In contrast, the number of attributes, which represent genes, is very large, typically tens of thousands. This causes several problems. With irrelevant and redundant attributes, the data set is unnecessarily large and classification takes too long. Having so many attributes and so few samples also creates a high likelihood of finding false positives through overfitting. Reducing the number of attributes is therefore an important step before applying classification methods. It should be done while preserving a maximum of discriminant information, so as to improve learning accuracy. Good attributes should have the same expression pattern for all samples of the same class, but different expression patterns for samples belonging to different classes. Tibshirani et al. [11] proposed the nearest shrunken centroid method for class prediction in DNA microarray studies. It uses shrunken centroids as prototypes for each class and identifies subsets of genes that best characterize each class. It "shrinks" each of the class centroids toward the overall centroid for all classes by a threshold. This shrinkage makes the classifier more accurate by eliminating the effect of noisy genes, and as a result it performs automatic gene selection. The gene expression profile of a new sample is compared to each of these class centroids, and the class whose centroid is closest in squared distance is the predicted class for that sample. This part is the same as the usual nearest centroid rule. This type of classifier is sensitive to small disturbances, and its performance is inferior to that of contemporary classifiers such as neural networks, support vector machines, and decision trees, where more complex criteria are used for classification.
Random Forest is a statistical method for classification first introduced by Leo Breiman in 2001 [2]. It is a decision-tree-based supervised learning algorithm. It is an ensemble classification method whose accuracy is among the best of current data mining algorithms. It has been used for intrusion detection [14], probability estimation [15], and classification [13], but was not widely used for microarray data classification until recently. Reported results [12][13] of a Random Forest classifier on microarray data are very promising. In this paper, we attempt to improve microarray data classification rates. The shrunken centroid method is used to extract good attributes, and Random Forest is used as the classifier. Classification results from this architecture are compared to those obtained by using the shrunken centroid method alone as a classifier. This paper is organized as follows. We present previous relevant work in Section 2. Then we discuss the concepts of the Random Forest classifier and the

shrunken centroid in Section 3. The seven microarray data sets used in this paper are described in Section 4. Numerical results are presented in Section 5, followed by discussion and conclusions in Section 6.

2. Related Works

Random Forest has been used in several problems. Zhang and Zulkernine used it for network intrusion detection [14]. Prediction of a chemical compound's quantitative or categorical biological activity was performed by Svetnik et al. [13]. Random Forest performance was investigated by Klassen [9] with different parameter values: the number of trees and the number of attributes used to split a node in a tree. Several research works have focused on reducing attributes: genetic algorithms [10], wrapper approaches [6], support vector machines [3], maximum likelihood [5], statistical methods [1][11], neural networks [7], and probability estimation [15]. Hoffmann [4] used pediatric acute lymphoblastic leukemia (ALL) data to determine a small set of genes for optimal subgroup distinction and achieved very high overall ALL subgroup prediction accuracies of about 98%. Uriarte et al. [23] investigated the use of Random Forest for classification of nine microarray data sets including leukemia, colon, lymphoma, prostate, and SRBCT, and reported classification rates of 94.9%, 87.3%, 99.1%, 99.23%, and 99.79% for leukemia, colon, lymphoma, prostate, and SRBCT, respectively. Prostate cancer gene expression data was tested with a Multilayer Perceptron to distinguish grade levels by Moradi et al. [16], and an average classification rate of 85% was reported. In Khan's work [7], principal component analysis (PCA) of the SRBCT tumor data reduced the dimensionality to 10 dominant PCA components; using these components as inputs to neural networks, all 20 test samples were correctly classified. Klassen [8] reported a 100% classification rate with the SRBCT data using Back Propagation neural networks trained with the Levenberg-Marquardt (LM) algorithm after attribute selection with the nearest shrunken centroid method. Tibshirani et al. [11] reported a 100% classification rate with the SRBCT data and 94% with the leukemia data using the nearest shrunken centroid classifier. Feng Chu [24] used support vector machines with four effective feature reduction methods and reported 100%, 94.12%, and 100% for SRBCT, leukemia, and lymphoma, respectively.

3. Background

3.1 Random Forest

Random Forest is a meta-learner which consists of many individual trees. Each tree votes on an overall classification for the given data, and the Random Forest algorithm chooses the classification with the most votes. There are two sources of randomness in Random Forest: a random training set (bootstrap sampling) and random selection of attributes. Using a random selection of attributes to split each node yields favorable error rates and is more robust with respect to noise. These attributes form nodes using standard tree-building methods; diversity is obtained by randomly choosing attributes at each node and using the attribute that provides the highest level of learning. Each tree is grown as fully as possible, without pruning, until no more nodes can be created. With a smaller number of attributes used at each split, the correlation between any two trees decreases, but the strength of each tree also decreases; with a larger number of attributes, both increase. These two quantities have opposite effects on the error rate of Random Forest: lower correlation decreases the error rate, while lower strength increases it. The number of attributes used at each split should therefore be chosen to balance these two effects, as in the sketch below.
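As a concrete illustration of these two parameters, the following is a minimal sketch using scikit-learn's RandomForestClassifier; the experiments in this paper use WEKA, so this only mirrors the same setup, and the data here are synthetic placeholders. n_estimators plays the role of the number of trees and max_features the role of the number of attributes tried at each split (Mtry).

# Minimal Random Forest sketch with scikit-learn (not the WEKA setup used in the paper).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 25))      # 60 samples, 25 selected genes (toy data)
y_train = rng.integers(0, 2, size=60)    # two classes
X_test = rng.normal(size=(20, 25))

forest = RandomForestClassifier(
    n_estimators=10,    # number of trees in the forest
    max_features=5,     # attributes sampled at random at each split (Mtry)
    bootstrap=True,     # each tree is trained on a bootstrap sample of the training set
    random_state=0,
)
forest.fit(X_train, y_train)
predictions = forest.predict(X_test)     # majority vote over the trees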
3.2 Shrunken Centroids

There are two factors to consider in selecting good genes: within-class distance and between-class distance. When the expression levels of a gene are fairly consistent, with small variance, for all samples of the same class, but differ substantially between samples of different classes, the gene is a good candidate for classification: it carries discriminant information about the classes. In the nearest shrunken centroid method, the variance within a class is taken into consideration to measure the goodness of a gene. The difference between a class centroid and the overall centroid for a gene is divided by the within-class standard deviation, giving greater weight to genes whose expression is stable among samples of the same class. A threshold is then applied to the resulting normalized centroid differences; if they are small for all classes, they are set to zero, meaning the gene is eliminated. This reduces the number of genes used in the final predictive model. The mathematics of the shrunken centroid algorithm can be found in Tibshirani's work [11]. Assume there are $n$ patients, $p$ genes, and $K$ classes, with $n_k$ samples in class $C_k$. Let $\bar{x}_{ik}$ be the centroid of gene $i$ in class $k$ and $\bar{x}_i$ its overall centroid. The standardized difference for gene $i$ in class $k$ is

$$d_{ik} = \frac{\bar{x}_{ik} - \bar{x}_i}{m_k (s_i + s_0)},$$

where $s_i$ is the pooled within-class standard deviation for gene $i$,

$$s_i^2 = \frac{1}{n - K} \sum_{k} \sum_{j \in C_k} (x_{ij} - \bar{x}_{ik})^2, \qquad m_k = \sqrt{\frac{1}{n_k} + \frac{1}{n}},$$

and $s_0$ is a small positive constant that avoids large $d_{ik}$ values when a gene has a small standard deviation $s_i$. In the nearest shrunken centroid method, $d_{ik}$ is evaluated for each gene $i$ in each class $k$ and compared with a chosen threshold $\Delta$. If $|d_{ik}|$ is too small for all classes, gene $i$ is considered not significant and is eliminated from the gene list. Otherwise $d_{ik}$ is updated by soft thresholding, $d'_{ik} = \mathrm{sign}(d_{ik})(|d_{ik}| - \Delta)_+$. Once the optimal threshold is found, all centroids are updated accordingly (shrunken centroids) using

$$\bar{x}'_{ik} = \bar{x}_i + m_k (s_i + s_0)\, d'_{ik}.$$

If a gene is shrunk to zero for all classes, it is considered not important and is eliminated. After the centroids are determined in the training stage, classification can take place on new samples: a test sample is assigned to the class whose shrunken centroid is nearest to it, as sketched below.
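The computation above can be summarized in a short sketch. This is our own illustrative code following the definitions in [11]; the variable names, the choice of $s_0$ as the median of the $s_i$, and the toy usage are ours, and the class-prior correction term of the full discriminant score is omitted.

import numpy as np

def nearest_shrunken_centroids(X, y, delta):
    """X: (n_samples, n_genes) expression matrix, y: class labels, delta: threshold."""
    classes = np.unique(y)
    n, p = X.shape
    K = len(classes)
    overall = X.mean(axis=0)                                  # overall centroid per gene
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    n_k = np.array([(y == c).sum() for c in classes])

    # pooled within-class standard deviation s_i for each gene
    within_ss = sum(((X[y == c] - centroids[i]) ** 2).sum(axis=0)
                    for i, c in enumerate(classes))
    s = np.sqrt(within_ss / (n - K))
    s0 = np.median(s)                                         # small positive constant s_0
    m = np.sqrt(1.0 / n_k + 1.0 / n)                          # m_k

    d = (centroids - overall) / (m[:, None] * (s + s0))       # standardized differences d_ik
    d_shrunk = np.sign(d) * np.maximum(np.abs(d) - delta, 0)  # soft thresholding by delta
    shrunken = overall + m[:, None] * (s + s0) * d_shrunk     # shrunken class centroids
    selected = np.any(d_shrunk != 0, axis=0)                  # genes that survive shrinkage
    return classes, shrunken, s, s0, selected

def classify(X_new, classes, shrunken, s, s0):
    # assign each sample to the class with the nearest shrunken centroid,
    # measured by the standardized squared distance (class priors ignored here)
    dist = (((X_new[:, None, :] - shrunken[None, :, :]) / (s + s0)) ** 2).sum(axis=2)
    return classes[np.argmin(dist, axis=1)]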

4. Data

SRBCT data [7]: small round blue cell tumors. The data contain 88 samples from four types of cancer cells: 18 neuroblastoma (NB), 25 rhabdomyosarcoma (RMS), 29 Ewing family of tumors (EWS), and 11 Burkitt lymphoma (BL). There are 2287 genes in total.

Acute leukemia data [17]: The total number of genes is 7129 and the number of samples is 72, all acute leukemia patients: 47 acute lymphoblastic leukemia (ALL) and 25 acute myelogenous leukemia (AML).

Prostate data [21]: There are two classes, tumor and normal. There are 102 samples in total: 52 normal and 50 tumor.

Lymphoma data [18]: The total number of genes to be tested is 4026 and the number of samples is 62. There are three types of lymphoma: chronic lymphocytic leukemia (CLL) with 11 patients, follicular lymphoma (FL) with 9, and diffuse large B-cell lymphoma (DLBCL) with 42.

Colon data [19]: This data set contains 62 samples. Among them, 40 biopsies are from tumors (labeled "negative") and 22 biopsies (labeled "positive") are from healthy parts of the colons of the same patients.

Lung cancer data [20]: There are 181 tissue samples, of which 31 belong to the malignant pleural mesothelioma (MPM) class and 150 to the adenocarcinoma (ADCA) class.

MLL leukemia data [22]: This data set contains three kinds of leukemia samples, compared to the binary-class leukemia data set. It contains 72 leukemia samples: 24 ALL, 20 MLL, and 28 AML.

5. Experiment and Results

5.1 Data Cleaning and Set-up

All data sets except SRBCT, which already had a train set and a test set defined by the author [11], were divided into a train set and a test set while preserving the ratios of the sample classes. Roughly 60-80% of the samples are used for training and the remainder for testing. The detailed breakdown for each data set is shown in Table 1. The data sets were preprocessed to remove redundant data and to handle missing values. The freely available shrunken centroid software PAM, which was used for our experiments, has a function to handle missing values: it uses the k-nearest neighbor algorithm to find the best value to fill in missing data. The default k value of 10 was used in our experiments, as illustrated in the sketch below.
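For illustration, the same idea (not the PAM implementation itself) can be reproduced with scikit-learn's KNNImputer; the data below are made-up placeholders.

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[2.1, 0.5, np.nan],
              [1.9, 0.4, 3.2],
              [2.0, np.nan, 3.0],
              [5.5, 1.8, 7.1]])

# each missing entry is replaced by the average of that feature over the
# k nearest samples; the paper uses PAM's default of k = 10
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)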
Data set    Train samples                        Test samples
SRBCT       63 (23 EWS, 8 BL, 12 NB, 20 RMS)     20 (6 EWS, 3 BL, 6 NB, 5 RMS)
Acute       38 (27 ALL, 11 AML)                  34 (20 ALL, 14 AML)
Prostate    80 (40 tumor, 40 normal)             22 (12 tumor, 12 normal)
Lymphoma    47 (32 DLBCL, 7 FL, 8 CLL)           15 (10 DLBCL, 2 FL, 3 CLL)
Colon       46 (30 tumor, 16 normal)             16 (10 tumor, 6 normal)
Lung        149 (134 ADCA, 15 MPM)               32 (16 ADCA, 16 MPM)
MLL         50 (16 ALL, 14 MLL, 20 AML)          22 (8 ALL, 6 MLL, 8 AML)

Table 1: Train samples and test samples with the number of class members

5.2 Gene Selection with the shrunken centroid

Using the shrunken centroid algorithm, training errors are plotted against threshold values, as shown in Figure 1 for the colon cancer data. A range of threshold values from 2.5 to 3.8 gives the lowest error. The largest threshold value in this range was chosen in our study to obtain the smallest number of attributes to be used with the Random Forest classifier; the sketch below outlines this selection step.
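The selection step just described can be sketched as follows, using scikit-learn's NearestCentroid with shrink_threshold as a stand-in for the PAM software; the data are synthetic placeholders, and the gene count is an approximation based on which genes keep class-specific centroids.

import numpy as np
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(1)
X = rng.normal(size=(62, 500))        # e.g. 62 colon samples, 500 genes (toy data)
y = rng.integers(0, 2, size=62)

for delta in np.linspace(0.5, 4.0, 8):
    clf = NearestCentroid(shrink_threshold=delta)
    clf.fit(X, y)
    train_error = 1.0 - clf.score(X, y)            # training error at this threshold
    # a gene still contributes if its shrunken class centroids differ
    genes_kept = int(np.sum(np.ptp(clf.centroids_, axis=0) > 0))
    print(f"threshold={delta:.2f}  training error={train_error:.3f}  genes kept={genes_kept}")

In the paper's experiments, the largest threshold within the lowest-error range is then carried forward to the Random Forest stage.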

The same process was performed for all seven data sets, and Table 2 shows the number of genes selected for each.

Figure 1: Colon cancer nearest shrunken centroid training errors

Table 2: Threshold value and the number of genes selected by the shrunken centroid method for each data set

5.3 Classification with Random Forest and its results

In our study, the Random Forest implementation in the WEKA software developed by the University of Waikato was used. Two parameters affect the performance of Random Forest: the number of attributes (called Mtry in WEKA) selected at random to split a node in a tree, and the number of trees generated in the forest. The tree depth was set to 0 for unlimited depth. Forests with three different numbers of trees (10, 15, and 20) were initially used, and twenty Mtry values, from 1 to 20, were used for each forest size. Since the numbers of selected attributes in the seven data sets are between 5 and 52, with an average of 24, Mtry values up to 20 are sufficient. When needed, the number of trees was changed to explore further; a sketch of this sweep follows Table 3. Statistics such as the average, minimum, maximum, and standard deviation of the classification rates, along with the total number of times a 100% classification rate occurred, were gathered over the 20 Mtry values. A summary of results is shown in Table 3, which gives the highest classification rate obtained and the number of trees giving it; if several forest sizes gave a 100% classification rate, the smallest is shown. It also shows the number of times a 100% classification rate occurred among the 20 different Mtry values for that forest size. With SRBCT and lymphoma, we were able to obtain a 100% classification rate with all tree values and with several Mtry values; Mtry values play an important role, while the number of trees matters less, and the difference between the highest and lowest classification rates is larger than for the other data sets. With acute leukemia, prostate, colon, and MLL leukemia, classification rates did not reach 100%: Random Forest did not perform well for any combination of Mtry values and numbers of trees, and the difference between the highest and lowest classification rates is small compared to SRBCT and lymphoma. The interesting case is lung cancer: a forest of 10 trees gave an excellent result of 100% classification with almost all Mtry values, but with 15 and 20 trees we were not able to reach 100%.

Table 3: Test sample classification rates with Random Forest
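A sketch of this parameter sweep is given below, with scikit-learn standing in for WEKA, synthetic placeholder data, and our own summary code.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X_train, y_train = rng.normal(size=(63, 24)), rng.integers(0, 4, size=63)
X_test, y_test = rng.normal(size=(20, 24)), rng.integers(0, 4, size=20)

for n_trees in (10, 15, 20):
    rates = []
    for mtry in range(1, 21):                               # Mtry values 1..20
        forest = RandomForestClassifier(n_estimators=n_trees,
                                        max_features=mtry,
                                        random_state=0)
        forest.fit(X_train, y_train)
        rates.append(forest.score(X_test, y_test))          # test classification rate
    rates = np.array(rates)
    print(f"trees={n_trees}: mean={rates.mean():.3f} min={rates.min():.3f} "
          f"max={rates.max():.3f} std={rates.std():.3f} "
          f"100% runs={int((rates == 1.0).sum())} of {rates.size}")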

5.4 Classification with the shrunken centroid classifier

We ran the shrunken centroid classifier with the same train and test sets used with Random Forest, keeping the threshold values the same. The classification rates we obtained are shown in Table 4 along with those from Random Forest. Table 4 shows that the shrunken centroid classifier gave classification rates of 86.6%, 93.7%, and 95.4% on three data sets (lymphoma, lung, and MLL leukemia, respectively), while Random Forest produced 100% for all three. With acute leukemia and prostate, both gave the same classification rates of 94.11% and 90.91%, respectively. With colon cancer, the classification rate with the shrunken centroid is 75%, versus 81% with Random Forest.

Table 4: Test sample classification rates with the shrunken centroid classifier and Random Forest

6. Evaluation and discussion

The classification rates obtained with Random Forest are higher than those with the shrunken centroid for all data sets except the acute leukemia and prostate data, where the two classifiers gave the same rates. The numbers of genes for these two data sets are relatively small: 21 for acute leukemia and 6 for prostate. We therefore investigated how different numbers of genes affect classification rates.

6.1 A larger number of attributes for Random Forest

A range of threshold values gives the lowest training error, as shown in Figure 1. We chose a few threshold values smaller than the largest one in order to generate more attributes, the idea being that more attributes may increase classification rates with the Random Forest classifier. Table 5 shows that the prostate classification rate increased to 96.77% with these larger gene sets; with the largest threshold value, 6.33, 6 genes were selected and the classification rate was 90.91% (see Table 3). With the shrunken centroid classifier, the rate first went up from 90.91% to 93.55%, but then went down to 83.87%. With acute leukemia, Random Forest was able to classify all test samples correctly, while the shrunken centroid classifier went down from 94.11% to 90% and did not change with different threshold values (see Table 6).

The shrunken centroid classifier assigns test samples to the nearest shrunken centroid using a simple standardized squared distance. This works well when the genes with nonzero components in each class are mutually exclusive, as is the case with the SRBCT data. Otherwise the simple distance is not a good measure for classification: a sample with average values for all genes and a sample with a few large and a few small gene values may end up in the same class. The second term used in the shrunken centroid discriminant, the frequency of a class in the population, may not help when the classes have similar proportions.

Table 5: Prostate classification rates with different threshold values. The Random Forest rates at the four settings were 96.77% (ntree=14, Mtry=2), 96.77% (ntree=9, Mtry=4), 96.77% (ntree=8, Mtry=5), and 93.54% (ntree=8, Mtry=2).

Table 6: Acute leukemia classification rates with different threshold values. The Random Forest rates at the four settings were 100% (ntree=5, Mtry=3), 95% (ntree=5, Mtry=3), 100% (ntree=6, Mtry=4), and 95% (ntree=6, Mtry=2).

Random Forest uses a decision-tree-based supervised learning algorithm and is a meta-learner that makes a classification decision collectively from the votes of the individual trees. The problems mentioned above for the shrunken centroid do not arise here. Genes selected for

splitting a node in a tree will properly assign samples to the right decision node.

6.2 The k-nearest neighbor algorithm for handling missing values

Preprocessing data is an important part of data mining and can affect the training and testing processes. We explored how well the k-nearest neighbor method fills in missing values. We chose 62 samples with no missing values from the colon data, selected one gene, T53360, and deleted its values from two randomly selected samples, sample 5 (tumor) and sample 48 (normal). We then filled in the missing values with k values from 1 to 10. The k value giving the best match with the original deleted value is different for each class, so the average over all 10 k values is used and is shown in Table 7. The table shows that the k-nearest neighbor method computes values much closer to the original values than a simple average does. A similar result was obtained with the lymphoma data.

Table 7: Effectiveness of the k-nearest neighbor method for handling missing values with colon cancer data (original value, k-nearest neighbor average, and simple average for sample 5 (tumor) and sample 48 (normal))

References

[1] E. Blair and R. Tibshirani, Machine learning methods applied to DNA microarray data can improve the diagnosis of cancer, SIGKDD Explorations, Vol. 5, Issue 2, 2002.
[2] L. Breiman, Random Forests, Machine Learning, 45(1), 2001, pp. 5-32.
[3] I. Guyon, Gene selection for cancer classification using support vector machines, Machine Learning, Vol. 46, 2002.
[4] K. Hoffmann, Translating microarray data for diagnostic testing in childhood leukaemia, BMC Cancer, 6:229.
[5] T. Ideker, Testing for differentially-expressed genes by maximum likelihood analysis of microarray data, Journal of Computational Biology, Vol. 7, 2000.
[6] I. Inza, Gene selection by sequential wrapper approaches in microarray cancer class prediction, Journal of Intelligent and Fuzzy Systems, 2002.
[7] J. Khan, Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nature Medicine, Vol. 7, No. 6, 2001.
[8] M. Klassen, Classification of cancer microarray data using neural networks, Proceedings of the IADIS International Conference on Applied Computing, 2007.
[9] M. Klassen, M. Cummings, and G. Seldana, Investigation of Random Forest performance with cancer microarray data, Proceedings of the ISCA 24th International Conference on Computers and Their Applications.
[10] L. Li, Gene assessment and sample classification for gene expression data using a genetic algorithm/k-nearest neighbor method, Combinatorial Chemistry and High Throughput Screening, 2001.
[11] R. Tibshirani, Diagnosis of multiple cancer types by shrunken centroids of gene expression, PNAS, Vol. 99, No. 10, 2002.
[12] T. Shi, Tumor classification by tissue microarray profiling: Random Forest clustering applied to renal cell carcinoma, Modern Pathology, Vol. 18, 2005.
[13] V. Svetnik, Random Forest: A Classification and Regression Tool for compound classification and QSAR modeling, J. Chem. Inf. Computer Science, 43, 2003.
[14] J. Zhang and M. Zulkernine, A Hybrid Network Intrusion Detection Technique Using Random Forests, Proceedings of the First International Conference on Availability, Reliability and Security (ARES'06), 2006.
[15] Ting-Fan Wu, Chih-Jen Lin, and Ruby C. Weng, Probability estimates for multi-class classification by pairwise coupling, The Journal of Machine Learning Research, Volume 5.
[16] M. Moradi, P. Mousavi, and P. Abolmaesoumi, Pathological distinction of prostate cancer tumors based on DNA microarray data,
CSCBC 2006 Conference Proceedings, Ontario, Canada.
[17] Golub, et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science, 15 October 1999, Vol. 286, No. 5439.
[18] Alizadeh, et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature, 2000 Feb 3; 403(6769).
[19] Alon, et al., Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues, Proceedings of the National Academy of Sciences, 1999.
[20] Gordon, et al., Translation of Microarray Data into Clinically Relevant Cancer Diagnostic Tests Using Gene Expression Ratios in Lung Cancer and Mesothelioma, Cancer Research, 2002.
[21] Singh, et al., Gene expression correlates of clinical prostate cancer behavior, Cancer Cell, 2002 Mar; 1(2).
[22] Armstrong, et al., A gene expression profile analysis of acute lymphoblastic leukemia suggests a new subset of leukemia, The Scientist, 2001, 2(1).
[23] Ramon Diaz-Uriarte and Sara Alvarez de Andres, Gene selection and classification of microarray data using Random Forest, BMC Bioinformatics, Vol. 7, No. 3.
[24] Feng Chu and Lipo Wang, Application of support vector machines to cancer classification with microarray data, International Journal of Neural Systems, 2005, Vol. 15, No. 6.
