
BIOINFORMATICS    Vol. 19 no. 16 2003, pages 2031-2038
DOI: 10.1093/bioinformatics/btg275

A novel strategy for microarray quality control using Bayesian networks

Sampsa Hautaniemi 1,*, Henrik Edgren 2,3, Petri Vesanen 1, Maija Wolf 3, Anna-Kaarina Järvinen 2, Olli Yli-Harja 1, Jaakko Astola 1, Olli Kallioniemi 3 and Outi Monni 2

1 Institute of Signal Processing, Tampere University of Technology, PO Box 553, 33101 Tampere, Finland, 2 Biomedicum Biochip Center, PO Box 63, 00014 University of Helsinki, Helsinki, Finland and 3 Medical Biotechnology Group, VTT Technical Research Centre of Finland and University of Turku, PO Box 106, 20521 Turku, Finland

Received on December 3, 2002; revised on March 3, 2003; accepted on April 18, 2003

* To whom correspondence should be addressed. The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

ABSTRACT

Motivation: High-throughput microarray technologies enable measurements of the expression levels of thousands of genes in parallel. However, microarray printing, hybridization and washing may create substantial variability in the quality of the data. As erroneous measurements may have a drastic impact on the results by disturbing the normalization schemes and by introducing expression patterns that lead to incorrect conclusions, it is crucial to discard low-quality observations in the early phases of a microarray experiment. A typical microarray experiment consists of tens of thousands of spots on a microarray, making manual extraction of poor-quality spots impossible. Thus, there is a need for a reliable and general microarray spot quality control strategy.

Results: We suggest a novel strategy for spot quality control using Bayesian networks, which have many appealing properties in the spot quality control context. We illustrate how a non-linear least squares based Gaussian fitting procedure can be used to extract features for a spot on a microarray. The features used in this study are: spot intensity, size of the spot, roundness of the spot, alignment error, background intensity, background noise, and bleeding. We conclude that Bayesian networks are a reliable and useful model for microarray spot quality assessment.

Contact: sampsa.hautaniemi@tut.fi

Supplementary information: http://sigwww.cs.tut.fi/ticsp/SpotQuality/

1 INTRODUCTION

Gene expression profiling has been revolutionized in recent years by the development of microarray-based technologies. cDNA microarrays offer a powerful method for rapid determination of expression levels for thousands of genes simultaneously in an experiment (Schena et al., 1995). Typically, an experiment is performed by first labeling two RNA samples (e.g. cDNA derived from test and reference samples) with different fluorophores (most commonly Cy3 and Cy5) and then hybridizing the labeled samples simultaneously onto a microscope slide containing thousands of cDNA clones (in subsequent sections referred to as spots). The detected differences in the fluorescent intensities for a cDNA clone reflect the relative differences in the abundance of the corresponding mRNA transcripts in the hybridized samples. A vast amount of data is obtained from a single experiment, and even a small sample set requires a great deal of data handling, including quality assessment, prior to extraction of meaningful information.
Important applications of the cDNA microarray technique include, for example, disease classification for diagnostic and prognostic purposes, therapeutic developments for the identification of new targets, and profiling drug responses. Most microarray-related studies concentrate on presenting different applications of microarray analysis, but only recently have a few studies exploring microarray quality control in image analysis been published (Brown et al., 2001; Wang et al., 2001; Model et al., 2002; Chen et al., 2002; Ruosaari and Hollmén, 2002).

It is essential to employ a spot quality control strategy, since a microarray experiment, even under optimal conditions, results in several spots whose intensities may vary due to experimental variation. It is also crucial to exclude spots whose quality is poor in the early stages of a microarray study, because microarray data normalization methods typically involve an estimation phase. Algorithms used in estimation may get confused if the number of aberrant values is large, easily leading to incorrect conclusions. Furthermore, microarray data are frequently used by researchers other than the original authors. As distributing the original microarray images requires massive storage space, a reliable quality measure for

all spots in the microarray would enhance the usefulness of the distributed data by allowing replication of the original analysis without the need to reanalyze the images themselves.

We define the spot quality to be a realization from an exclusive and exhaustive set of qualitative categories, where each category corresponds to the nature of a spot in a microarray. In practice, experimenters are interested in excluding unreliable spots from further analysis by dividing the spots into two categories. The first category consists of spots that are excluded from further analysis (in subsequent sections this category is referred to as bad or poor) and the second of spots that are used in the actual analysis phase (referred to as good).

Spot quality assessment is essentially a classification problem. Accordingly, we suggest a quality control strategy that utilizes Bayesian networks for computing the spot quality value from spot-specific features. In this study, we both demonstrate that Bayesian networks have several theoretical properties that make them useful in spot quality control and illustrate a case study that verifies the effectiveness of the Bayesian networks in an actual microarray experiment.

The spot quality strategy we suggest is designed to be utilized for microarray technologies in which the probes are in the form of distinct spots. These include, for example, cDNA and spotted oligonucleotide microarrays and possibly also several other emerging microarray formats, such as protein arrays. However, other parameters would need to be utilized and optimized for use with in situ synthesized, oligo-based arrays, such as those made by photolithography.

2 TYPICAL PROBLEMS IN cDNA MICROARRAY IMAGES

There are several features that affect the quality of a spot on a microarray. In this section we introduce the ones encountered most frequently in microarray experiments and explain the most common reasons for these aberrations. Examples of the features discussed in this section are available at the supplementary web site.

2.1 Spot intensity

Signal intensity is traditionally considered one of the most important features affecting spot quality. The reason is that if a signal is weak, it is very difficult to distinguish the actual signal from the background. Accordingly, spots with higher signal intensity have a better signal-to-noise ratio and thus are more reliable than spots with low signal intensity. Most weak spots are caused by the fact that many genes are physiologically expressed at very low levels, at or close to the sensitivity limit of the DNA microarrays. In addition, experimental factors that may cause low signal intensities are:

- Low amount of DNA in the spot.
- Molecular and physical composition of the DNA spot (purity of the DNA, attachment to glass, availability to hybridization, etc.).
- Suboptimal labeling.
- Uneven or incomplete hybridization.
- Signal bleaching, low sensitivity of the scanner.

It is also likely that some mRNA molecules have secondary structures, which prevent proper reverse transcription, leading to a weak or non-existent signal. Regardless of the actual cause, as experimental and biological factors are not readily distinguishable from each other during image analysis, spots having low intensity are considered less reliable than spots with high intensity.

2.2 Spot size

As microarray spots are produced by depositing an equal amount of liquid onto the microscope slide, all spots are expected to be of roughly equal size.
Deviations can be caused by, for example, precipitates, impurities and debris in the printing solution, a printing needle that is not making adequate contact with the array surface, damaged or dirty needles, or too little liquid being printed onto the slide. Spots that are too large can be caused by high humidity during printing or uneven coating of the slide. Both of these can cause the spot to spread over a larger area than intended, which can lead to problems with detection and quantification of surrounding spots. If a spot is significantly larger or smaller than expected, it is an indication that there has been some error during manufacturing, which should be reflected when determining the quality of the spot.

2.3 Spot morphology (spot roundness and bleeding)

Spots are expected to be roughly circular in shape, but manufacturing of the microarrays may cause some variation in spot morphology. Causes include scratches on the microarray surface and uneven coating of the microarray. Other imperfections on the microarray's surface can also cause the spot to spread unevenly during printing. Furthermore, if the signal is weak, the spot may have an irregular shape because the spot signal cannot be distinguished from the background.

The phenomenon where a spot spreads so much that it mixes with its neighbors is referred to as bleeding. In such cases it is generally impossible to reliably separate the signals of the spots from each other. Therefore, all spots affected by bleeding are usually excluded from further analysis.

2.4 Pixel intensity distribution

A spot is expected to contain approximately equal amounts of DNA over all of its area. Consequently, brightness should be relatively uniform over all of the spot. Deviations from this are visible as brighter areas inside the spot and can be caused by uneven distribution of the printed DNA in the spot or non-specific binding.

3 BAYESIAN NETWORKS

A Bayesian network is a directed acyclic graph, where each node represents a real-life feature such as spot quality or intensity (Pearl, 1988). Each discrete node has mutually exclusive and exhaustive values. For example, the values of spot quality could be {bad, good}. Arcs between the nodes represent the direction of dependence. Together the nodes and the arcs define the structure of the Bayesian network. An example of a Bayesian network with four nodes is illustrated in Figure 1. A node is called a child if it is directly dependent on a node in the network. For example, in Figure 1, nodes B and C are children of A. A node having children is called a parent.

The idea behind Bayesian networks is to decompose the joint probability distribution as a product of conditional distributions and then take advantage of conditional independences. For example, in Figure 1 node D is not directly dependent on A or C, so P(D | A, B, C) reduces to P(D | B) and the joint distribution factorizes as P(A, B, C, D) = P(A) P(B | A) P(C | A) P(D | B). Accordingly, arcs coming into a node can be thought of as representing the conditional probability distribution between the node and its parents (arc from A to C in Fig. 1).

In Bayesian networks the central concept is propagation of the observed information. There are several algorithms for evaluating the network (Pearl, 1988). In brief, information propagation proceeds as follows. When a feature is observed, the corresponding node in the Bayesian network is instantiated (i.e. the belief of the node is set to follow the observation) and the node transmits the observed evidence to its parents (λ-message) and to its children (π-message), which transmit the message forward until all nodes in the network are updated. The belief of each node is a product of λ and π, so observed evidence affects every node in the network. An example of λ- and π-messages is illustrated in Figure 1.

Fig. 1. Bayesian network with four nodes. λ- and π-messages denote propagation of diagnostic and causal support, respectively. The arc between A and C shows that C is directly dependent on A, and the dependency is coded by the probability distribution P(C | A).

3.1 Bayesian networks in spot quality control

Bayesian networks have several properties that are very appealing in spot quality assessment:

- Nodes in a Bayesian network can represent an arbitrary phenomenon. For example, a node may represent spot-specific quantities such as bleeding or intensity. Several statistical tests can also be included as nodes in a Bayesian network.
- It is not required to observe all the features in order to compute the quality of the spot. However, if the data set contains missing values, it may be good to apply an expectation-maximization algorithm (Dempster et al., 1977).
- A priori information is straightforward to include in the model.
- The structure of the Bayesian network together with the conditional probability distributions between the nodes explicitly shows the influence of a feature on the spot quality.

The last point in particular is a remarkable property. For example, let us assume that the quality (Q) of a spot is either bad (q1) or good (q2) and that one of the features is the intensity of the spot (I), taking three values (i ∈ {low, medium, high}). With a training data set (a set of spots whose quality is known) it is possible to estimate the conditional probability distribution between Q and I. When both features are discrete, a conditional probability distribution reduces to a conditional probability table (CPT).
In our case study, spot intensity is one of the 14 features used in spot quality classification. Below is an estimated CPT between Q and I:

\[
\begin{bmatrix}
P(i_1 \mid q_1) & P(i_1 \mid q_2) \\
P(i_2 \mid q_1) & P(i_2 \mid q_2) \\
P(i_3 \mid q_1) & P(i_3 \mid q_2)
\end{bmatrix}
=
\begin{bmatrix}
0.49 & 0.04 \\
0.39 & 0.33 \\
0.12 & 0.63
\end{bmatrix}.
\]

The CPT above tells, for example, that if an arbitrary spot is bad, the probability that its intensity is low is 0.49 [P(I = i1 | q1) = 0.49], but for a good spot the probability that the intensity is low is only 0.04. Furthermore, it is evident that a good spot is very likely to have high intensity.

CPTs are used effectively in Bayesian networks. Assume that for a spot whose quality is to be determined, we observe that the intensity is low (I = i1). Accordingly, the λ-message from I to Q is [0.49 0.04], implying that the quality of the spot is bad. All observations are transmitted to node Q in a similar fashion, and finally Q's belief indicates the final quality of the spot.

In order to use Bayesian networks in spot quality assessment, the user needs to estimate the following before processing the measurements. First, the structure of the Bayesian network must be defined. This phase includes identification of a set of appropriate features and their dependences. Therefore, the definition of the structure of the Bayesian network may be laborious, although methods for estimating the structure of the Bayesian network from the data also exist (Heckerman, 1996). Second, conditional probability distributions must be defined between the nodes. This is straightforward to accomplish with a training data set, with a priori knowledge, or a combination of the two.
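As a concrete illustration of this second step, the following minimal Python sketch (not part of the original paper; the toy training data, the add-one smoothing and all variable names are illustrative assumptions) estimates a CPT for discretized intensity from labeled training spots and combines the resulting λ-message with a class prior into a posterior belief over spot quality, under the naïve (conditionally independent) structure used later in the case study.

```python
import numpy as np

# Hypothetical training data: each spot has a discretized intensity
# (0 = low, 1 = medium, 2 = high) and an expert label (0 = bad, 1 = good).
intensity = np.array([0, 0, 1, 2, 2, 2, 1, 0, 2, 1])
quality   = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1])

def estimate_cpt(feature, label, n_feature_values, n_classes):
    """Estimate P(feature value | class) with add-one (Dirichlet) smoothing."""
    counts = np.ones((n_feature_values, n_classes))      # uniform pseudo-counts
    for f, q in zip(feature, label):
        counts[f, q] += 1.0
    return counts / counts.sum(axis=0, keepdims=True)    # each column sums to one

cpt_intensity = estimate_cpt(intensity, quality, 3, 2)    # rows: low/medium/high

# Belief update for a new spot: belief(Q) is proportional to
# P(Q) multiplied by P(observed feature | Q) for every observed feature.
prior = np.array([0.2, 0.8])                  # e.g. the subjective "80% good" prior
observed_intensity = 0                        # low intensity observed
lam = cpt_intensity[observed_intensity, :]    # λ-message from the intensity node to Q
posterior = prior * lam
posterior /= posterior.sum()
print("P(bad), P(good) =", posterior)
```

With several observed features, the corresponding rows of their CPTs would simply be multiplied into the same product before the final normalization.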

Third, parameter prior distributions must be defined. In essence, the parameter distributions to be defined are P(θ), P(Q | θ) and P(X | Q, Y, θ). Here Q denotes the class variable (spot quality), X is a feature that is dependent on the spot quality and on the set of the other features Y, and Θ is defined to be a variable whose values θ correspond to the true values of the physical probability (Heckerman, 1996). Usually in discrete classification problems P(θ) is assumed to follow the Dirichlet distribution (Heckerman, 1996). The selection of the prior distributions is a controversial issue, which has raised a lot of discussion (e.g. Spiegelhalter et al., 1994; Myllymäki et al., 2002).

4 CASE STUDY

The data for this study were obtained from two different hybridizations. Specifications, protocols, and the data are available at the supplementary web site.

4.1 Features

One of the most critical phases in spot quality classification is the choice of appropriate features. Based on a priori knowledge, the literature, and preliminary experiments (data not shown), we chose seven features that are known to affect the spot quality:

1. Bleeding.
2. Size of the spot.
3. Roundness of the spot.
4. Alignment error.
5. Spot intensity.
6. Intensity of the background.
7. Background noise.

Each feature was measured separately using the Cy5 and Cy3 channels, so altogether we used 14 features.

The first task in spot quality assessment is to extract the features. Our approach to extracting feature values is to employ a Gaussian fitting procedure, for the following reasons. First, through a Gaussian fitting procedure we are able to avoid segmentation for all features except bleeding. The choice of a segmentation algorithm is usually arbitrary, so in order to obtain general results it is important to ensure that segmentation does not affect feature extraction. Second, the Gaussian fitting procedure returns a value for each feature, so there is no need to deal with missing values. However, for some spots the background noise is much higher than the actual signal and the Gaussian fitting does not converge. Accordingly, there are no values for any of the features, but as the signal is weaker than the noise, these spots are of bad quality. Third, a Gaussian surface fitted to a spot has very similar shape characteristics to the original spot, which allows reliable feature extraction. Despite the fact that we fit a Gaussian surface to the data, we do not assume that spots would, even ideally, have this shape; rather, the parameters of the fitted Gaussian surface correspond to related properties of a common spot.

Fig. 2. The idea of Gaussian fitting. On the left is the original spot distribution. On the right is the resulting two-dimensional Gaussian fit. Parameters of the Gaussian fit enable determination of several features.

A Gaussian fitting procedure has been used earlier for estimating how much label is attached to a spot in a microarray (Brändle et al., 2001). Here we have utilized the Gaussian fitting procedure for measuring spot features as follows. Each spot was processed individually and the image coordinates of a bounding rectangle of the spot were assumed to be known. As the features were extracted identically for each spot, we consider here only one spot. We denote the image coordinates within the bounding rectangle of the spot by vectors x_1, ..., x_N in R^2 and the pixel intensity values within these coordinates by real values y_1, ..., y_N.
We have used a procedure that fits a Gaussian-shaped surface to the data, as illustrated in Figure 2. The fitting procedure is a standard non-linear least squares procedure (Draper and Smith, 1998). Let R_φ denote a rotation matrix given by

\[
R_\phi = \begin{bmatrix} \cos(\phi) & \sin(\phi) \\ -\sin(\phi) & \cos(\phi) \end{bmatrix}
\]

and Λ_{σ1,σ2} denote a diagonal matrix given by

\[
\Lambda_{\sigma_1,\sigma_2} = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}.
\]

We denote by S_{φ,σ1,σ2} the product matrix

\[
S_{\phi,\sigma_1,\sigma_2} = R_\phi^{T} \, \Lambda_{\sigma_1,\sigma_2} \, R_\phi .
\]

A function f representing an ideal Gaussian surface, with real-valued parameters A, B, φ, σ_1 and σ_2 and a vector-valued parameter m in R^2, is defined as

\[
f(x; \psi) = A \exp\!\left(-(x - m)^{T} S_{\phi,\sigma_1,\sigma_2} (x - m)\right) + B, \qquad (1)
\]

where ψ is defined as the six-tuple (A, B, φ, σ_1, σ_2, m) for notational convenience. An objective function J representing the fitting error of the Gaussian surface defined in Equation (1) is given by

\[
J(\psi) = \sum_{i=1}^{N} \bigl(f(x_i; \psi) - y_i\bigr)^2 . \qquad (2)
\]
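The paper does not list its implementation, but the minimization of Equation (2) under the model of Equation (1) can be sketched in Python as follows; the use of scipy.optimize.least_squares, the pixel-array layout, the initial guesses and the parameter bounds are assumptions made for illustration, and the returned dictionary anticipates the feature definitions given below.

```python
import numpy as np
from scipy.optimize import least_squares

def gaussian_surface(params, coords):
    """Equation (1): A * exp(-(x - m)^T S (x - m)) + B evaluated at every pixel."""
    A, B, phi, s1, s2, mx, my = params
    R = np.array([[np.cos(phi), np.sin(phi)],
                  [-np.sin(phi), np.cos(phi)]])
    S = R.T @ np.diag([s1 ** 2, s2 ** 2]) @ R
    d = coords - np.array([mx, my])
    quad = np.einsum('ni,ij,nj->n', d, S, d)     # (x - m)^T S (x - m) per pixel
    return A * np.exp(-quad) + B

def fit_spot(pixels):
    """Fit the Gaussian surface to one spot's bounding rectangle (2-D intensity array)."""
    h, w = pixels.shape
    yy, xx = np.mgrid[0:h, 0:w]
    coords = np.column_stack([xx.ravel(), yy.ravel()]).astype(float)
    y = pixels.ravel().astype(float)
    # Crude initial guess: background = minimum, amplitude = range, centre of rectangle.
    p0 = [y.max() - y.min(), y.min(), 0.0, 0.3, 0.3, w / 2.0, h / 2.0]
    res = least_squares(lambda p: gaussian_surface(p, coords) - y, p0,
                        bounds=([0.0, -np.inf, -np.pi, 1e-3, 1e-3, 0.0, 0.0],
                                [np.inf] * 7))
    A, B, phi, s1, s2, mx, my = res.x
    rmse = np.sqrt(np.mean(res.fun ** 2))          # background-noise feature
    align = np.hypot(mx - w / 2.0, my - h / 2.0)   # alignment-error feature
    s1, s2 = sorted((abs(s1), abs(s2)), reverse=True)
    return {"intensity": A, "background": B, "alignment_error": align,
            "roundness": s1 / s2, "size": s1 * s2, "noise": rmse}
```

The lower bound of zero on the amplitude mirrors the non-negativity constraint mentioned below, and a fit that fails to converge can simply be taken as a sign of a bad spot, as described above.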

The objective function in Equation (2) is minimized with respect to the parameters, subject to the constraint A ≥ 0, and the corresponding minimizing parameters are transformed into the spot features. From now on we shall assume that the parameters A, B, φ, σ_1, σ_2 and m are the parameters minimizing the objective function, that σ_1 and σ_2 are positive, and that σ_1 ≥ σ_2. The parameters are not uniquely determined, especially the parameter φ. However, we do not expect this non-uniqueness to cause problems because, for example, the features do not depend on the parameter φ.

Parameters A and B are directly used as features representing the spot and background intensities, respectively. The distance between m and the center of the bounding rectangle is used as a feature representing the alignment error. The roundness of the spot is measured with the feature σ_1/σ_2 and the size of the spot with the feature σ_1 σ_2. The background noise is estimated with a feature that is the root mean square error:

\[
\sqrt{\frac{1}{N} \sum_{i=1}^{N} \bigl(f(x_i; \psi) - y_i\bigr)^2 }.
\]

Detection of spot bleeding is done as follows. A bounding rectangle larger than the original is introduced so as to also cover some pixels within the adjacent spots. The larger bounding rectangle is the original rectangle extended in each of the four directions by the width of one pixel. A threshold B + A/3 is used to divide pixels into spot pixels and background pixels. Eight masks inside the larger bounding rectangle are introduced to indicate the borders of the original bounding rectangle. The actual shapes of the masks are illustrated in Figure 3. The proportion of spot pixels within each of the eight masks is counted. For each direction the minimum of the proportions in the two masks is selected, and the feature representing spot bleeding is the maximum of these four minimums.

Fig. 3. An illustration of the masks used in bleeding detection (the figure distinguishes pixel centers, mask boundaries and the original bounding rectangle). Squares represent pixel centers within the image. The bounding rectangle around the spot is the 6 × 6 area of pixels bounded by a dashed line. The masks used in bleeding detection are the areas of pixels bounded by a solid line. The direction of a mask refers to the side of the bounding rectangle on which the mask lies.

4.2 Discretization

The continuous features were discretized as follows. The spot bleeding feature was discretized into two values and the other features into three values. The threshold for spot bleeding was a constant proportion of 65% of spot pixels. The other features were discretized so that 30% of the lowest feature values within the image were labeled with 1, 30% of the highest feature values with 3, and the remaining values with 2. The thresholds were determined empirically using the data sets.

4.3 Classifiers

When the problem domain is largely unknown, as in the case of spot quality assessment, a natural approach is to try simple methods first and increase the complexity of the model if the results are not satisfactory. Along this line, we have applied a discrete naïve Bayesian network model, where the assumptions are that the features are discrete, directly dependent on spot quality and independent of each other. We have also utilized a continuous Bayesian classifier, where the features are modeled with Gaussian distributions.

One of the most influential phases in spot quality classification is the choice of appropriate features.
In this study we have utilized seven features that are known to play a major role in the spot quality. However, a properly chosen subset of features often increases the classification performance. Moreover, it may be too conservative to use all the features in spot quality assessment. Consequently, the Bayesian network model could contain only a couple of essential features, for example bleeding and intensity. In this case, the spot would be deleted if bleeding is detected or the intensity is low.

In this study we have utilized a structure search for the discrete naïve Bayesian network. The structure search is done with the interactive web tool B-Course (C-Trail) (Myllymäki et al., 2002). B-Course chooses the best structure based on prediction accuracy. In brief, C-Trail searches for the best structure by choosing a new feature set that resembles the current best model. If the prediction accuracy of the new structure is better than the best so far, the new structure is declared the best model. The search for new models is stopped when continuing the search does not produce better classification accuracies. Here B-Course identified the features bleeding, spot roundness and spot intensity for the Cy5 channel, and bleeding, spot size, spot roundness, background intensity and fitting error for the Cy3 channel.

As the features measured from the Cy5 channel correlate strongly with the features from the Cy3 channel, it is unrealistic to assume that the features are independent of each other. Therefore we have modeled this dependence by including an

edge between the pair-wise features; for example, the parents of the intensity measured using the Cy5 channel are the intensity measured using the Cy3 channel and the spot quality. The feature bleeding differs from the other features in that if bleeding is detected, the spot is automatically of bad quality. Therefore we have modeled bleeding as a dummy node (Pearl, 1988), which means that the bleeding node transmits a λ-message only when bleeding is detected, thus forcing the quality to be bad regardless of the other observations.

As discussed in Section 3.1, there is a need to define priors for the parameters in Bayesian networks. For P(Q | θ) we have applied first a uniform prior, where all classes are equally probable a priori, and then a subjective prior as follows. Based on earlier experience we have assumed that approximately 80% of the spots in a microarray are of good quality a priori. We have assumed that P(X | Q, Y, θ) is uniformly distributed.

Finally, we have compared the performance of the Bayesian networks to decision tree and multi-layer perceptron artificial neural network (ANN) algorithms. In the ANN we have used five hidden neurons, tan-sigmoid transfer functions with a linear output function, and 15 training epochs. For the ANN, we preprocessed the continuous data to have zero mean, a standard deviation of one, and a scale from -1 to 1.

4.4 Results

In this section we compare several classifiers for spot quality assessment and demonstrate the need for a spot quality strategy. Our case study involves two data sets, both consisting of 160 spots, which were assigned to four quality categories {bad, moderate (close to bad), moderate (close to good), good} by three experts who have conducted microarray experiments for several years. As several aspects affect the reliability of a microarray experiment, it is not always straightforward to perform a visual assessment of spot quality. There were some spots where the experts disagreed between bad and moderate (close to bad), as well as between good and moderate (close to good). Only those spots where the experts were unanimous (n = 155) were chosen for classification. We performed binary classification by assigning the label of a spot to be bad if the label determined by the experts was {bad, moderate (close to bad)} and good otherwise. Altogether there were 97 spots in the category good and 58 in the category bad.

In order to test the accuracy of the classification we performed leave-one-out cross-validation (LOOCV), which is briefly described on the supplementary web page. We have summarized the accuracies of the classifiers in Table 1. The entry B-Course in Table 1 denotes the naïve Bayesian network whose structure is found by B-Course, which employs the structure search inside the LOOCV loop.

Table 1. Accuracies for the classifiers

  Method                        Accuracy (%)
  B-Course (subjective)         96.8
  Pair-wise NB (subjective)     95.5
  NB (subjective)               95.5
  NB (uniform)                  94.8
  Decision tree                 91.6
  ANN                           90.3
  Continuous Bayes (uniform)    85.2
  Ratio quality (ω = 1)         82.6
  Ratio quality (ω = 0.5)       69.0
  Ratio quality (ω = 0.1)       65.2

  NB denotes a naïve Bayesian network; the prior used is given in parentheses. B-Course refers to the NB classifier whose structure is searched by B-Course (C-Trail). ANN denotes an artificial neural network. All accuracies except those of the ratio quality method are determined with LOOCV.

We have also assessed the spot quality for all 155 spots using the ratio quality method (Chen et al., 2002). The ratio quality method results in a continuous quality value (ν) for each spot. As ν is a continuous value between zero (unreliable)
and one (perfect quality), it is sometimes difficult to determine an appropriate threshold below which the spot is bad. In Table 1 we illustrate three thresholds ω ∈ {1, 0.5, 0.1}; for example, if ω = 0.5, all spots having ν < 0.5 are considered unreliable, while spots having ν ≥ 0.5 are considered good. The receiver operating characteristic (ROC) curve for the discrete naïve Bayesian networks is given in the supplementary material. The supplementary material also contains quality values for all 320 spots.

In order to show the impact of the spot quality control we have applied the Bayesian networks to a so-called self versus self experiment (i.e. the test and reference samples were the same) on a microarray consisting of 12 656 spots. Ideally, a scatter plot of a self versus self experiment is a straight line, but very often some spots deviate from the straight line, as illustrated in Figure 4A. The scatter plot in Figure 4A compares well with other microarray figures presented in the literature. However, spot quality control, where the only feature used was bleeding for both channels, indicates that there are 883 unreliable spots. The reason for such a high number of unreliable spots is given in Figure 4B, which shows a zoomed-in view of the microarray slide. Several spots in the first and second subarrays are clearly located in a region of uneven hybridization. Interestingly, many of the unreliable spots are ones that lie along the straight line, as illustrated in Figure 4C, indicating that it may not be possible to discard unreliable spots by applying statistical methods to the scatter plot.

The mean and standard deviation for the original data were 1.28 and 8.84, respectively. When we removed the 883 unreliable spots, the mean and standard deviation were reduced to 0.92 and 1.8, respectively. In order to gain statistical assurance, we randomly deleted a set of 883 spots from the original data set and compared the resulting standard deviation to the standard deviation computed using the quality-filtered data. This was done one million times and we obtained p < 10^-6, meaning that it is very unlikely to obtain a smaller standard deviation by deleting spots randomly.
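The randomization test described above can be reproduced along the following lines; this is a sketch only, the array names and the reduced number of rounds are assumptions (the study itself used one million rounds), and the per-spot values would be whatever quantity the scatter plot is based on.

```python
import numpy as np

rng = np.random.default_rng(0)

def randomization_p_value(values, flagged_idx, n_rounds=10_000):
    """Compare the standard deviation after removing the quality-flagged spots
    with standard deviations obtained by removing equally many random spots."""
    values = np.asarray(values, dtype=float)
    filtered_std = np.delete(values, flagged_idx).std()
    n_flagged = len(flagged_idx)
    hits = 0
    for _ in range(n_rounds):
        drop = rng.choice(values.size, size=n_flagged, replace=False)
        if np.delete(values, drop).std() <= filtered_std:
            hits += 1
    # If no random deletion does as well, report a conservative upper bound.
    return max(hits, 1) / n_rounds

# Hypothetical usage: 'ratios' holds the per-spot values of the self versus self
# experiment and 'flagged' the indices of the 883 spots failing quality control.
# p = randomization_p_value(ratios, flagged)
```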

Fig. 4. Impact of the spot quality control. (A) A scatter plot of all 12 656 spots. (B) A zoom-in figure of the scanned microarray. In (C), crosses denote unreliable spots in the scatter plot.

5 DISCUSSION

In the present study we have suggested a novel approach for assessing the quality of a microarray spot with Bayesian networks. Bayesian networks enable easy incorporation of a priori knowledge and explicit representation of the impact of the features on spot quality. These two properties make Bayesian networks superior to black-box models (e.g. ANNs), which are incapable of incorporating a priori knowledge or of showing explicitly the dependences between features. Furthermore, a node in a Bayesian network can represent an arbitrary phenomenon, which in turn enables incorporation of several statistics-based spot quality strategies in the Bayesian network model.

One of the most difficult phases in applying Bayesian networks to spot quality assessment is choosing appropriate priors. In our case study, for P(Q | θ) we applied both subjective and uniform priors, while P(X | Q, Y, θ) was assumed to be uniformly distributed. However, even though a uniform distribution would be an intuitive choice for a prior, the problem is that it is not invariant to variable transformations, which may lead to a situation where the marginal likelihoods of two Bayesian network models representing the same domain are different (Myllymäki et al., 2002). Thus, in real-life applications P(Q | θ) could be estimated from the test data set, for example by computing the fraction of good spots and then using that as the class prior for good spots in the actual microarray experiment. We have tested this estimation approach for assessing the prior for P(Q | θ) and the results were the same as when a uniform prior was used (data not shown). Estimation of appropriate prior distributions for P(X | Q, Y, θ) would first require identification of the features having the greatest impact on spot quality and then solving the dependences between these features and spot quality. These tasks would require additional experiments and are therefore beyond the scope of this study, but they would make a very interesting topic for further study.

The self versus self experiment highlights an apparent need for reliable spot quality control. The classification comparisons demonstrate that the Bayesian network models are capable of producing reliable spot quality estimates. Further, our case studies suggest that a discrete naïve Bayesian network is a feasible classifier for spot quality assessment. This result was somewhat expected, since it is well known that discrete naïve Bayesian networks sometimes perform better than general Bayesian networks, continuous classifiers assuming Gaussian features, and more complicated models such as decision trees. For a profound discussion of this phenomenon we refer to Kohavi and Sahami (1996) and Friedman et al. (1997).

In our approach we assume that a training data set is available, which may sometimes be difficult to obtain. However, we have stated the need for training data explicitly, contrary to existing statistics-based quality control methods that assume the same implicitly. For example, the method described in Chen et al. (2002) results in a continuous value that represents the quality of the spot. This value alone is not sufficient, since there is also a need to find a threshold below which the spot is of bad quality.
In practice, this threshold is found through trial and error, which is effectively the same as using a training data set.

The Bayesian networks could also be used for computing the quality of an entire microarray experiment. However, in that case the set of features used in quality assessment must be renewed. Suitable features could be, for example, the overall signal-to-noise ratio, the green-to-red balance and morphology. If a set of appropriate features for assessing the quality of the entire microarray experiment is found, the Bayesian networks could be part of an automatic quality control system. For example, if integrated into the image analysis software, the quality control

system could warn the researcher that the scanned slide is of poor quality and should be rejected from the analysis.

ACKNOWLEDGEMENTS

The authors thank Dr Spyro Mousses and Dr Yidong Chen for useful discussions. This work was supported in part by the Academy of Finland, the Sigrid Juselius Foundation, the University of Helsinki Research Foundation, the Finnish Cancer Society and the Helsinki University Central Hospital Research Fund.

REFERENCES

Brown,C., Goodwin,P. and Sorger,P. (2001) Image metrics in the statistical analysis of DNA microarray data. Proc. Natl Acad. Sci. USA, 98, 8944-8949.

Brändle,N., Chen,H.-Y., Bischof,H. and Lapp,H. (2001) Robust parametric and semi-parametric spot fitting for spot array images. Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, pp. 1-12.

Chen,Y., Kamat,V., Dougherty,E., Bittner,M., Meltzer,P. and Trent,J. (2002) Ratio statistics of gene expression levels and applications to microarray data analysis. Bioinformatics, 18, 1207-1215.

Dempster,A., Laird,N. and Rubin,D. (1977) Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B, 39, 1-38.

Draper,N. and Smith,H. (1998) Applied Regression Analysis, 3rd edition. Wiley, New York.

Friedman,N., Geiger,D. and Goldszmidt,M. (1997) Bayesian network classifiers. Mach. Learning, 29, 131-163.

Heckerman,D. (1996) A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06. Microsoft Advanced Technology Division, Redmond, USA.

Kohavi,R. and Sahami,M. (1996) Error-based and entropy-based discretization of continuous features. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. AAAI Press, pp. 114-119.

Model,F., König,T., Piepenbrock,C. and Adorján,P. (2002) Statistical process control for large scale microarray experiments. Bioinformatics, 1, 1-9.

Myllymäki,P., Silander,T., Tirri,H. and Uronen,P. (2002) B-Course: a web-based tool for Bayesian and causal data analysis. Int. J. Artificial Intell. Tools, 11, 369-387.

Pearl,J. (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, San Mateo, CA.

Ruosaari,S. and Hollmén,J. (2002) Image analysis for detecting faulty spots from microarray images. In Lange,S., Satoh,K. and Smith,C.H. (eds), Proceedings of the 5th International Conference on Discovery Science (DS2002). Springer, Berlin, pp. 259-266.

Schena,M., Shalon,D., Davis,R. and Brown,P. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270, 467-470.

Spiegelhalter,D., Freedman,L. and Parmar,M. (1994) Bayesian approaches to randomized trials. J. R. Stat. Soc. Ser. A, 157, 357-387.

Wang,X., Ghosh,S. and Guo,S. (2001) Quantitative quality control in microarray image processing and data acquisition. Nucleic Acids Res., 29, E75.