Statistical models for predicting number of involved nodes in breast cancer patients


Vol.2, No.7, 641-651 (2010) doi:10.4236/health.2010.27098 Health

Statistical models for predicting number of involved nodes in breast cancer patients

Alok Kumar Dwivedi 1*, Sada Nand Dwivedi 2, Suryanarayana Deo 3, Rakesh Shukla 1, Elizabeth Kopras 4

1 Center for Biostatistical Services, Department of Environmental Health, College of Medicine, University of Cincinnati, Cincinnati, USA; * Corresponding Author: alok_bhu1@yahoo.co.in
2 Department of Biostatistics, All India Institute of Medical Sciences, New Delhi, India
3 Department of Surgical Oncology, All India Institute of Medical Sciences, New Delhi, India
4 Department of Environmental Health, College of Medicine, University of Cincinnati, Cincinnati, USA

Received 12 March 2010; revised 8 April 2010; accepted 10 April 2010.

ABSTRACT

Clinicians need to predict the number of involved nodes in breast cancer patients in order to ascertain severity, prognosis, and design subsequent treatment. The distribution of involved nodes often displays over-dispersion, that is, a larger variability than expected. Until now, the negative binomial model has been used to describe this distribution assuming that over-dispersion is only due to unobserved heterogeneity. The distribution of involved nodes contains a large proportion of excess zeros (negative nodes), which can lead to over-dispersion. In this situation, alternative models may better account for over-dispersion due to excess zeros. This study examines data from 1152 patients who underwent axillary dissections in a tertiary hospital in India during January 1993-January 2005. We fit and compare various count models to test model abilities to predict the number of involved nodes. We also argue for using zero inflated models in such populations where all the excess zeros come from those who are at some risk of the outcome of interest. The negative binomial regression model fits the data better than the Poisson and zero hurdle/inflated Poisson regression models. However, zero hurdle/inflated negative binomial regression models predicted the number of involved nodes much more accurately than the negative binomial model.
This suggests that the number of involved nodes displays excess variability not only due to unobserved heterogeneity but also due to excess negative nodes in the data set. In this analysis, only skin changes and primary site were associated with negative nodes whereas parity, skin changes, primary site and size of tumor were associated with a greater number of involved nodes. In case of near equal performances, the zero inflated negative binomial model should be preferred over the hurdle model in describing the nodal frequency because it provides an estimate of negative nodes that are at high-risk of nodal involvement.

Keywords: Nodal Involvement; Count Models; Breast Cancer

1. INTRODUCTION

Accurate prediction of the number of involved nodes in breast cancer patients helps in grading severity of disease, helps avoid extensive axillary surgery dissections, and assists with treatment decisions such as the use of neoadjuvant chemotherapy [1,2]. Many studies have been performed to predict nodal status in breast cancer patients. Most of them merely predict the presence/absence of involved nodes rather than the number of involved nodes [3]. Until now, only two studies have tried to predict the number of involved nodes in breast cancer patients. Guern and Vinh-Hung [3] found that a negative binomial model describes the number of involved nodes better than the Poisson model due to excess variability, a condition called over-dispersion. Another study showed that the negative binomial model provides a better fit as compared to the Poisson model for the total number of involved nodes in breast cancer patients in a meta-analysis [4]. These studies used a negative binomial model, which posited that the over-dispersion occurred entirely due to unobserved heterogeneity and/or nodal clustering.

Copyright 2010 SciRes. Openly accessible at http://www.scirp.org/journal/health/

A. K. Dwivedi et al. / HEALTH 2 (2010) 641-651

However, count data often involve over-dispersion not only due to unobserved heterogeneity and/or clustering but also due to the preponderance of zero frequency (negative node in the case of cancer) [5]. Consequently, the nominal Poisson or the negative binomial distributions may not satisfactorily account for excess variability if this variability is indeed due to excess zeros. In such situations, use of these models is likely to underestimate the probability of negative node status, and may provide misleading results. Zero hurdle or zero inflated regression models can be used to increase predictability in situations with excess zeros. In count data, the observed zeros can be either structural zeros (e.g., the subject is at no risk of the event of interest) or sampling zeros (e.g., the subject is indeed at some risk of the event of interest). It has been suggested that zero hurdle models are more appropriate in case of excessive sampling zeros while zero inflated models should be preferred in cases of mixtures of zeros, i.e., involvement of both types of zeros [6]. In breast cancer, all the patients are indeed at some risk of having nodal involvement and thus all zeros are strictly sampling zeros. Thus, according to the prevailing wisdom, zero hurdle models could be employed to predict the nodal frequency among breast cancer patients.

In epidemiologic studies, count data generally involve zeros at some risk of the outcome of interest. In such circumstances, there exist alternative ways to conceptualize the so-called structural zeros and sampling zeros. Using the epidemiological parlance, we can conceptualize zeros in terms of disease onset and disease progression. In breast cancer patients, a lack of nodal involvement (observed zero) may be because the cancer is detected early enough in the disease progression (closer to the time of disease onset) or because the cancer itself is of slow progression and/or risk factors for a high rate of disease progression are absent. These kinds of zeros may be identified as true or structural zeros.
The rest of the zeros may be observed in the presence of various risk factors leading up to a high rate of disease progression. These latter types of zeros can be identified as false or sampling zeros. Thus, within the framework of zero inflated models, excess zeros can be modeled as a mixture of true zeros and false zeros. Note that the false zeros can also arise due to chance, false recording and/or false observation. It has been reported that some of the involved (positive) nodes may be recorded as negative due to misclassification by the pathologist (referred to as reporting error) [7]. One study reported that non-dissection of complete axillary lymph nodes might provide false negative nodes [8]. These false negative nodes may be more likely to be found among patients with a high risk of nodal involvement. This indicates a need for estimation of false negative nodes so that these patients can be followed up or reassessed for diagnostic accuracy. In these situations, we suggest use of the zero inflated models, not only to account for excess zeros, but also to estimate the proportion of false zeros or patients with zeros at high risk of nodal positivity.

Significant applications of zero hurdle and zero inflated models have been made in various fields of research [9-11]. In recent years, the application of these models and their comparisons with other count models has also increased in medical and health fields [12-19]. A review of the application of such models in health research is also reported [20]. Extensions of these models for describing correlated data have also been reported [21-24]. These studies illustrate that zero hurdle/inflated models should be used if over-dispersion in the data is due to excess zeros. Results also indicate that zero hurdle models should be preferred if only at-risk zeros are present in the population. However, to our knowledge, the relative performance of zero hurdle and inflated models in predicting the number of involved nodes has not been addressed.
In this paper, prediction of the number of involved nodes is made using Poisson regression (PR), negative binomial (NB), zero hurdle Poisson (ZHP), zero inflated Poisson (ZIP), zero hurdle negative binomial (ZHNB) and zero inflated negative binomial (ZINB) models. Zero hurdle models in many epidemiologic studies like the present one may satisfactorily account for excess zeros, perhaps even as well as zero inflated models. We arguably demonstrate that the zero inflated models have an added advantage over the former in describing the event of interest in relation to the disease process itself, including identification of the factors involved in predicting the disease onset and disease progression.

2. MATERIALS AND METHODS

2.1. Subjects

We utilized one of the largest breast cancer datasets available in India to assess the distribution of the number of involved nodes. The data were extracted from the computerized database of breast cancer patients maintained at the Department of Surgical Oncology, Institute Rotary Cancer Hospital (IRCH), All India Institute of Medical Sciences (AIIMS), New Delhi, India, a tertiary care center, during the period from January 1993 to January 2005. The dataset was updated using the original records kept in the record section of IRCH. Data from all patients who underwent surgery for breast cancer, including axillary lymph node dissections, were included in this study. Patients with recurrent breast cancer, bilateral breast carcinoma, any evidence of metastasis, unknown primary site and male breast carcinoma were excluded from the study.

Covariates and their forms were chosen based on breast cancer literature and an exploratory analysis of this dataset. Patients' age at presentation was stratified as younger (below 35 years) and elder (more than or equal to 35 years). Duration from onset of symptoms until presentation was classified as less than or equal to 2, 2-4, 4-8 and more than 8 months. Parity was categorized as nulliparous, single/doubleparous, and multiparous. Other covariates included menopausal status (post/pre); family history of breast cancer (absent/present); primary side (left/right); skin changes (no/yes); neoadjuvant chemotherapy (no/yes); primary site {medial (lower inner quadrant and upper inner quadrant)/lateral (lower outer quadrant and upper outer quadrant)/central (multiple, central and others)}; tumor type (infiltrating ductal carcinoma/infiltrating lobular carcinoma and others); and pathological tumor size according to TNM classification (<= 2/2-5/> 5 cm). The neoadjuvant chemotherapy and total number of dissected nodes were only used in the model for adjustment, because these variables are highly associated with involved nodes.

The study population consisted of all cases of breast cancer and the outcome in question was the number of involved nodes in a patient. Patients with negative nodes (zeros) were divided into two groups: those at low risk of nodal involvement and those at high risk of nodal involvement. A patient with negative nodes and a relatively low risk of nodal involvement was defined as an "at low risk" zero and labeled, in the context of modeling, as a true or structural zero. The remaining patients with negative nodes and a relatively high risk of nodal involvement due to the presence of various risk factors were defined as "at high risk" zeros. In the context of modeling, we label them as false or sampling zeros.

2.2. Statistical Models

The Poisson regression model (PR) describes count outcomes or proportions/rates. Generally, the PR model explains less variability of counts than the observed variability. As a result, this often gives misleading relationships between covariates and outcomes. Excess variability can be adjusted within the PR framework using inflation approaches of standard errors of the regression coefficients [25]. As such, it may be the appropriate model to use for drawing correct inferences in the case of over-dispersion due to unobserved heterogeneity and/or clustering/temporal dependency. However, it may not be the most appropriate in the case of excess zeros, as expected in assessing the distribution of number of involved nodes.

In the PR model, y_i is the number of involved nodes for the i-th patient, and λ_i is the mean number of involved nodes. If the number of involved nodes follows a Poisson distribution, its probability mass function can be expressed as:

$$f(y_i \mid x_i) = \frac{e^{-\lambda_i} \lambda_i^{y_i}}{y_i!}, \quad y_i = 0, 1, 2, \ldots; \; i = 1, 2, \ldots, n; \; \lambda_i > 0 \tag{1}$$

If β's are regression coefficients corresponding to the set of considered covariates x's, and k is the number of considered covariates, then the PR model can be expressed using Eq.1 as:

$$\log \lambda_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} \tag{2}$$

As an alternative to the PR model, the negative binomial (NB) model has an inbuilt provision to account for over-dispersion due to unobserved heterogeneity and/or temporal dependency [26]. As a result, this model helps not only in adjusting the standard errors of the regression coefficients but also provides a more flexible approach for prediction of the count outcome. Under the assumption of over-dispersion being merely due to unobserved heterogeneity and/or temporal dependency, the NB model was used. The unobserved heterogeneity may be due to unobserved predictors and/or too much variation in some of the clinical and pathological cofactors. Temporal dependency in nodes may be occurring due to clustering of nodal involvement within patients.
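Eq.1 and Eq.2 can be illustrated with a minimal Python sketch. The coefficient values and the two-covariate patient below are hypothetical and chosen only for illustration; they are not estimates from this study. The sketch also checks the property that makes over-dispersion a problem for the PR model: under a Poisson distribution the variance equals the mean.

```python
import math

def poisson_mean(beta, x):
    """Eq. 2: log(lambda_i) = beta_0 + beta_1*x_i1 + ... + beta_k*x_ik."""
    linear_predictor = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return math.exp(linear_predictor)

def poisson_pmf(y, lam):
    """Eq. 1: probability of observing y involved nodes when the mean is lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

# Hypothetical coefficients: intercept, skin changes (yes), tumor size > 5 cm.
beta = [0.8, 0.3, 0.4]
patient = [1, 0]  # skin changes present, tumor size not above 5 cm

lam = poisson_mean(beta, patient)
probs = [poisson_pmf(y, lam) for y in range(60)]

# Under the Poisson model the mean and the variance coincide (equidispersion).
mean = sum(y * p for y, p in enumerate(probs))
variance = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
print(f"lambda={lam:.3f} mean={mean:.3f} variance={variance:.3f}")
```

When the observed variance of the counts is much larger than their mean, this equality fails, which is the situation the NB and zero hurdle/inflated models are meant to address.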
The NB model is expressed as:

$$f(y_i \mid x_i) = \frac{\Gamma(y_i + \alpha^{-1})}{\Gamma(y_i + 1)\,\Gamma(\alpha^{-1})} \left(\frac{\alpha^{-1}}{\alpha^{-1} + \lambda_i}\right)^{1/\alpha} \left(\frac{\lambda_i}{\alpha^{-1} + \lambda_i}\right)^{y_i}, \quad y_i = 0, 1, 2, \ldots; \; i = 1, 2, \ldots, n \tag{3}$$

In this model, α is the over-dispersion parameter due to unobserved heterogeneity and λ_i is the mean number of involved nodes. The NB regression model can be obtained similar to Eq.2 by using Eq.3. The NB model may not be appropriate if the over-dispersion is due to excess zeros because it underestimates the probability of zeros and consequently underestimates the variability present in the outcome. In such situations, alternative models such as zero inflated/hurdle models that account for over-dispersion due to excess zeros are useful.

Zero hurdle models are typically used when the excess zeros arise from an at-risk population. Under the assumption that over-dispersion results from excess zeros arising from an at-risk group, zero hurdle Poisson (ZHP) was used. In this model, all zeros are considered to be observed from a non-counting process, as opposed to a counting process. Within this model, all zeros are typically described through logistic regression, whereas positive counts are described through a zero truncated Poisson model. In the ZHP model, p_i is the probability of at-risk negative nodes under the logistic model. Assuming the mean number of involved nodes (λ_i) under the zero truncated Poisson model, the ZHP distribution may be expressed [27] as:

$$f(y_i \mid x_i) = \begin{cases} p_i, & y_i = 0 \\ (1 - p_i)\,\dfrac{\exp(-\lambda_i)\,\lambda_i^{y_i}}{y_i!\,\left(1 - \exp(-\lambda_i)\right)}, & y_i \ge 1 \end{cases} \quad 0 \le p_i \le 1; \; \lambda_i > 0; \; i = 1, 2, \ldots, n \tag{4}$$

If γ's and β's are respective regression coefficients under logistic and zero truncated Poisson models corresponding to considered covariates (x's), and the number of considered covariates is k in each of the models, then using Eq.4 regression models can be expressed as:

$$\log\left(\frac{p_i}{1 - p_i}\right) = \gamma_0 + \gamma_1 x_{i1} + \gamma_2 x_{i2} + \cdots + \gamma_k x_{ik}, \qquad \log \lambda_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} \tag{5}$$

The ZHP model provides two sets of results. These results can also be obtained separately by fitting both a logistic regression and zero truncated Poisson model. This is why hurdle models are referred to as two-part models. The binary process model identifies factors associated with the presence/absence of nodal involvement, whereas modeling the count process yields factors associated with an increase in the number of involved nodes given that the patient has involved nodes. Note that the ZHP model accounts for over-dispersion due to excess zeros but not due to unobserved heterogeneity and/or temporal dependency in nodal involvement. In the latter case, one may use the zero hurdle negative binomial (ZHNB) model by considering the count process as a zero truncated negative binomial distribution. Substituting a zero truncated negative binomial distribution in Eq.4 yields the ZHNB distribution, and it can be expressed as Eq.6.

$$f(y_i \mid x_i) = \begin{cases} p_i, & y_i = 0 \\ (1 - p_i)\,\dfrac{\Gamma(y_i + \alpha^{-1})}{\Gamma(y_i + 1)\,\Gamma(\alpha^{-1})}\, \dfrac{\left(\frac{\alpha^{-1}}{\alpha^{-1} + \lambda_i}\right)^{1/\alpha} \left(\frac{\lambda_i}{\alpha^{-1} + \lambda_i}\right)^{y_i}}{1 - \left(\frac{\alpha^{-1}}{\alpha^{-1} + \lambda_i}\right)^{1/\alpha}}, & y_i \ge 1 \end{cases} \tag{6}$$

Zero inflated models are typically used when the excess zeros are a mixture of two types of zeros: true (structural zeros) and false (sampling zeros). We propose to categorize the negative nodes in our population as a mixture of two types, those with very low/no risk of nodal involvement (true zeros) and those with high risk of nodal involvement (false zeros). In this way, use of the zero inflated model framework not only accounts for the extra variability due to excess zeros but also estimates the relative proportion of these "at low risk" and "at high risk" zeros. Further, this can be used to identify subjects with a high likelihood of being in one or the other type of zero classification using the risk factors. In zero inflated models, occurrence of zeros is considered as a result of two distinct processes. Some of the zeros (zeros "at high risk") are considered to be observed from the counting process and others (zeros "at low risk") from the non-counting process. As an inbuilt mechanism within these models, true zeros are typically described through logistic regression, whereas false zeros are described through a simple count model. Like hurdle models, the zero inflated models also provide two sets of results. However, the interpretation of regression coefficients under inflated models is different from the hurdle models. Modeling the binary process provides factors associated with negative nodes in a low risk population as compared to a high risk population, whereas modeling the count process provides factors associated with the extent of the number of involved nodes, including false negative nodes, given that patients are in a high risk population. Here, the probability of observing negative nodes is the probability of negative nodes (true) under the logistic model plus the product of the probability that an individual is not in the binary process and the probability of negative nodes (false) under the considered count model.

If the count process follows the Poisson distribution then it is called a zero inflated Poisson (ZIP) model. To understand the ZIP model, consider the occurrence of at low risk negative nodes with probability p_i under a logistic model, and that of involved nodes (including at high risk false negative nodes) with probability (1 - p_i) under the Poisson model, having a mean number of involved nodes (λ_i). The ZIP distribution can then be expressed [28] as:

$$f(y_i \mid x_i) = \begin{cases} p_i + (1 - p_i)\exp(-\lambda_i), & y_i = 0 \\ (1 - p_i)\,\dfrac{\exp(-\lambda_i)\,\lambda_i^{y_i}}{y_i!}, & y_i \ge 1 \end{cases} \quad 0 \le p_i \le 1; \; \lambda_i > 0 \tag{7}$$
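The difference between Eq.4 and Eq.7 lies in where the zeros come from, and a short Python sketch may make that concrete. The values of `p` and `lam` are hypothetical, and `false_zero_share` is an illustrative helper name, not a quantity defined in the paper: it computes the share of all zeros that the ZIP model attributes to the counting process (the "at high risk" or false zeros).

```python
import math

def poisson_pmf(y, lam):
    return math.exp(-lam) * lam ** y / math.factorial(y)

def zhp_pmf(y, p, lam):
    """Zero hurdle Poisson (Eq. 4): every zero comes from the binary part."""
    if y == 0:
        return p
    # Positive counts follow a zero truncated Poisson distribution.
    return (1 - p) * poisson_pmf(y, lam) / (1 - math.exp(-lam))

def zip_pmf(y, p, lam):
    """Zero inflated Poisson (Eq. 7): zeros come from both parts."""
    if y == 0:
        return p + (1 - p) * math.exp(-lam)
    return (1 - p) * poisson_pmf(y, lam)

def false_zero_share(p, lam):
    """Share of all ZIP zeros generated by the count process (false zeros)."""
    sampling_zeros = (1 - p) * math.exp(-lam)
    return sampling_zeros / (p + sampling_zeros)

# Hypothetical parameter values, for illustration only.
p, lam = 0.3, 4.0
zhp_total = sum(zhp_pmf(y, p, lam) for y in range(80))
zip_total = sum(zip_pmf(y, p, lam) for y in range(80))
print(zhp_pmf(0, p, lam), zip_pmf(0, p, lam), false_zero_share(p, lam))
```

With the same `p` and `lam`, the ZIP model assigns more probability to zero than the hurdle model because the count process contributes its own zeros; in the hurdle model `p` is the whole zero probability.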

If γ's and β's are respective regression coefficients under logistic and Poisson models corresponding to considered covariates (x's), and the number of considered covariates is k in each of the models, then using Eq.7, regression models can be expressed as:

$$\log\left(\frac{p_i}{1 - p_i}\right) = \gamma_0 + \gamma_1 x_{i1} + \gamma_2 x_{i2} + \cdots + \gamma_k x_{ik}, \qquad \log \lambda_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} \tag{8}$$

If the count process does not follow the Poisson model then one may use the zero inflated negative binomial (ZINB) model by considering the count process as a negative binomial distribution. In contrast to ZIP, the ZINB model accounts for the over-dispersion due to both types of zeros as well as due to unobserved heterogeneity and/or temporal dependency. Substituting the negative binomial distribution in Eq.7, the ZINB distribution can be expressed as:

$$f(y_i \mid x_i) = \begin{cases} p_i + (1 - p_i)\left(\dfrac{\alpha^{-1}}{\alpha^{-1} + \lambda_i}\right)^{1/\alpha}, & y_i = 0 \\ (1 - p_i)\,\dfrac{\Gamma(y_i + \alpha^{-1})}{\Gamma(y_i + 1)\,\Gamma(\alpha^{-1})} \left(\dfrac{\alpha^{-1}}{\alpha^{-1} + \lambda_i}\right)^{1/\alpha} \left(\dfrac{\lambda_i}{\alpha^{-1} + \lambda_i}\right)^{y_i}, & y_i \ge 1 \end{cases} \tag{9}$$

2.3. Model Comparisons

The PR, NB, ZIP, ZHP, ZHNB and ZINB models were used to describe the number of involved nodes in breast cancer patients. The covariates found to be significant in univariate analysis with any of the regressions were included in all the regression models to keep the findings comparable. The nested models (e.g., PR versus NB and ZIP, NB versus ZINB, and ZHP versus ZHNB) were compared using a likelihood ratio test. A significant result of the likelihood ratio test of comparison (PR versus NB, NB versus ZINB, and ZHP versus ZHNB) indicates the presence of over-dispersion due to heterogeneity and/or temporal dependency. The non-nested models (PR with ZHP, PR with ZHNB, PR with ZINB, NB with ZHP, NB with ZIP, NB with ZHNB, ZHP with ZIP, ZHP with ZINB and ZHNB with ZINB) as well as nested models were also compared using the Vuong test [29]. A significant and better fit in the comparisons (PR with ZHP/ZIP, and NB with ZHNB/ZINB) explores whether or not the over-dispersion is due to excess zeros. To compare the predictive performance of the models, various indices such as log likelihood, Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), mean squared prediction error (MSPE) and mean absolute prediction error (MAPE) were also obtained.

A probability plot (observed probability minus predicted probability of positive nodes versus number of positive nodes) was constructed for each model. The probability plot was constructed after truncation at 10 positive nodes for ease of visual comparison. The best-fitted model was also validated using the leave-one-out cross validation method [30]. The p-values less than 5% were considered as significant results. STATA 9.0 package was used for all statistical analyses.

3. RESULTS

A total of 1152 patients were found to be eligible for this study. Of those in the study, the presence of involved nodes was found in 705 (61.2%) patients. The mean and standard deviation of the number of involved nodes per patient were 3.9 and 5.6 respectively (median 1 and range: 0-33). Median number of total dissected nodes per patient was 14 (range: 1-46). The mean age was 47.7 (standard deviation, 11.1) years and range 20-86 years. The distributions of covariates considered in the analysis are shown in Table 1. A descriptive comparison reveals that the cofactors parity, skin changes, primary site and pathological tumor size were consistently associated with outcome across all models. Three additional covariates, age, menopausal status and tumor type, were statistically significant only in the PR model. There was good concordance in the assessment of statistical significance in all aspects among ZHP, ZIP and NB models. A similar relation could also be seen between the ZINB and ZHNB models in providing factors associated with the extent of nodal involvement. In other words, parity, skin changes, primary site and tumor size were found associated with a greater number of involved nodes in both models. However, the ZHNB model provided primary site, skin changes and pathological tumor size associated with presence of positive nodes whereas the ZINB model provided only primary site and skin changes associated with presence of positive nodes in the at high-risk population.

Table 1. Zero inflated negative binomial model for number of involved nodes.

| Variables | N | Logistic portion*: odds ratio (95% CI) | NB portion: risk ratio (95% CI) |
|---|---|---|---|
| Age (year) | | | |
| > 35 | 977 | 1.00 | 1.00 |
| <= 35 | 175 | 0.98 (0.54, 1.80) | 1.12 (0.90, 1.38) |
| Symptom duration (month) | | | |
| <= 2 | 376 | 1.00 | 1.00 |
| 3-4 | 263 | 0.74 (0.43, 1.26) | 1.00 (0.82, 1.23) |
| 5-8 | 266 | 1.13 (0.71, 1.81) | 1.17 (0.95, 1.43) |
| >= 9 | 247 | 0.73 (0.43, 1.24) | 1.08 (0.88, 1.33) |
| Parity | | | |
| Nulliparous | 47 | 1.00 | 1.00 |
| P1/P2 | 445 | 1.18 (0.26, 5.31) | 1.82 (1.20, 2.77) |
| Multiparous | 660 | 1.67 (0.38, 7.44) | 1.95 (1.29, 2.95) |
| Menopausal | | | |
| Post menopausal | 587 | 1.00 | 1.00 |
| Pre menopausal | 565 | 0.69 (0.45, 1.04) | 1.01 (0.85, 1.18) |
| Primary side | | | |
| Left | 583 | 1.00 | 1.00 |
| Right | 569 | 0.87 (0.60, 1.26) | 0.91 (0.79, 1.06) |
| Primary site | | | |
| Medial (UIQ + LIQ) | 235 | 1.00 | 1.00 |
| Lateral (LOQ + UOQ) | 681 | 0.62 (0.40, 0.96) | 1.29 (1.05, 1.60) |
| Central/Multiple/Other | 236 | 0.38 (0.19, 0.74) | 1.24 (0.97, 1.58) |
| Skin changes | | | |
| No | 746 | 1.00 | 1.00 |
| Yes | 406 | 0.38 (0.23, 0.62) | 1.40 (1.19, 1.66) |
| Tumor type | | | |
| Other/ILC | 78 | 1.00 | 1.00 |
| IDC | 1074 | 0.62 (0.31, 1.22) | 1.14 (0.82, 1.57) |
| Tumor size (centimeter) | | | |
| <= 2 | 236 | 1.00 | 1.00 |
| 2-5 | 666 | 0.63 (0.40, 1.01) | 1.28 (1.03, 1.59) |
| > 5 | 250 | 0.61 (0.34, 1.09) | 1.49 (1.17, 1.91) |

* The odds ratio of negative nodes in the low risk group. All the results are adjusted in relation to neoadjuvant chemotherapy as well as total number of dissected nodes.

The significant Pearson chi square goodness of fit (gof) test (p < 0.001) along with other characteristics of model fit indicated that the PR model produced a poor fit for nodal involvement data. In the NB model, the estimated dispersion statistic (α) was 1.73 (95% CI: 1.54, 1.95). A significant likelihood ratio test (p < 0.001) of the dispersion statistic from zero favored the NB model over the PR model. Recall that more than one third of the patients had negative nodes, indicating an excess of negative nodes. Intuitively, this suggests that over-dispersion is most likely due to excess negative nodes. Firstly, all negative nodes were considered to arise from an at-risk group, justifying use of the ZHP model. Further, to estimate false negative nodes, it was considered that some of these negative nodes might be observed among patients who had a low risk of nodal positivity (true zeros) and some proportion might be observed among patients who had high risk of nodal involvement (false zeros).
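The excess of negative nodes noted above can be checked with simple arithmetic on the reported summary statistics (1152 patients, 705 with involved nodes, mean 3.9, standard deviation 5.6, NB dispersion 1.73). The Python sketch below is only a rough illustration: it assumes intercept-only models with the overall mean plugged in, and it reuses the covariate-adjusted NB dispersion estimate, so the numbers are not the fitted probabilities from the paper's regression models.

```python
import math

# Summary statistics reported in the Results section.
n_patients = 1152
n_node_positive = 705
mean_nodes = 3.9
sd_nodes = 5.6
alpha_nb = 1.73  # NB dispersion estimate, reused here as a rough plug-in

# Observed share of patients with negative nodes (zeros).
observed_zero_share = (n_patients - n_node_positive) / n_patients

# Variance-to-mean ratio; a Poisson distribution would give a ratio of 1.
dispersion_ratio = sd_nodes ** 2 / mean_nodes

# Probability of zero under intercept-only Poisson and NB models (assumption).
poisson_zero = math.exp(-mean_nodes)
nb_zero = (1 + alpha_nb * mean_nodes) ** (-1 / alpha_nb)

print(f"observed zeros: {observed_zero_share:.3f}")
print(f"variance/mean:  {dispersion_ratio:.2f}")
print(f"Poisson P(0):   {poisson_zero:.3f}")
print(f"NB P(0):        {nb_zero:.3f}")
```

Under these simplifying assumptions the observed share of zeros is about 0.39, against roughly 0.02 for the Poisson and roughly 0.31 for the NB, and the variance is about eight times the mean. This is in line with the text's point that a Poisson model badly underestimates negative nodes and that the NB model narrows, but may not close, the gap.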
With this more natural consideration, the ZIP model was used. Both the Vuong test (V = 12.60, p ≤ 0.001) and the significant likelihood ratio test favored the ZHP model over the PR model. However, the comparison of ZHP and ZIP using the Vuong test (V = 2.01, p = 0.04) slightly favored the ZIP model. The results of Vuong tests also favored the NB model over the ZHP model (V = 8.86, p < 0.001) and the ZIP model (V = 8.84, p < 0.001). The improved fit of the NB model over the PR and ZHP/ZIP models clearly indicates that over-dispersion arises from unobserved heterogeneity and/or clustering. In addition, the ZHP/ZIP models provided evidence of over-dispersion due to excess negative nodes, in comparison to the PR model. Hence, a model incorporating over-dispersion due to excess negative nodes as well as unobserved heterogeneity simultaneously was expected to provide improved predictability of the number of involved nodes.

Accordingly, ZHNB and ZINB models were used to predict the number of involved nodes. Under the ZHNB and ZINB models, the estimated dispersion parameters of the zero truncated negative binomial and NB models were different from zero, at α = 0.70 (95% CI: 0.56, 0.87) and α = 0.71 (95% CI: 0.57, 0.89), respectively. This suggests that ZHNB/ZINB models are more appropriate than ZHP/ZIP models in describing the number of involved nodes. The better fit of ZHNB/ZINB models over the NB model suggests that over-dispersion is due not only to unobserved heterogeneity and/or clustering but also to excessive negative nodes. The Vuong test showed no difference between ZHNB and ZINB models in predicting nodal frequency (V = 1.53, p = 0.13).

The model fit characteristics are shown in Table 2. The minimum BIC was observed for the NB model, followed by ZHNB/ZINB models. However, other validity indices of the model (maximum log likelihood, minimum AIC, MSPE and MAPE) favored ZHNB/ZINB models over all other models. The plot of observed minus predicted probability of involved nodes at each count is shown in Figure 1. The PR model underestimates the probability of occurrence of negative nodes and overestimates the occurrence of one positive node.
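The Vuong comparisons reported above rest on per-patient differences in log likelihood between two fitted non-nested models. The following is a minimal Python sketch of that statistic, assuming the per-observation log likelihoods of both models are already available. The function names `vuong_statistic` and `two_sided_p` are illustrative and not taken from the paper or from any package, no correction for the number of parameters is applied, and the toy numbers are made up.

```python
import math


def vuong_statistic(loglik_a, loglik_b):
    """Uncorrected Vuong z-statistic for model A versus model B.

    loglik_a, loglik_b: per-observation log likelihoods of the two models,
    in the same patient order. Positive values favor model A.
    """
    if len(loglik_a) != len(loglik_b):
        raise ValueError("both models must be evaluated on the same observations")
    n = len(loglik_a)
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    mean_diff = sum(diffs) / n
    # Population variance of the differences (divisor n), a choice made for this sketch.
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / n
    if var_diff == 0:
        raise ValueError("log-likelihood differences have zero variance")
    return math.sqrt(n) * mean_diff / math.sqrt(var_diff)


def two_sided_p(z):
    """Two-sided p-value of a standard normal z-statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


# Toy, made-up log likelihoods for four observations.
toy_a = [-1.0, -2.0, -1.5, -0.5]
toy_b = [-1.2, -2.1, -1.9, -0.9]
toy_z = vuong_statistic(toy_a, toy_b)
print(f"toy V = {toy_z:.3f}, p = {two_sided_p(toy_z):.4f}")
```

Under this sketch, `two_sided_p(2.01)` is about 0.04 and `two_sided_p(1.53)` is about 0.13, in line with the p-values quoted for the ZHP versus ZIP and ZHNB versus ZINB comparisons.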
The line of difference between observed minus predicted probability of positive nodes was close to the reference zero line, showing better fit of ZHNB/ZINB models than the other models. There is virtually no difference between ZHNB and ZINB models in all aspects of describing the number of involved nodes. The ZINB model provides slightly smaller validity indices as compared to ZHNB. Finally, the ZINB model was assessed by the leave-one-out cross-validation method. The MSPE in cross validation of the ZINB model was the lowest of all the models (0.0007), indicating that the ZINB model performs well for predicting nodal involvement in future patients. The ZINB model predicts that 70.6% of all negative nodes are low-risk zeros, and the remaining 29.4% are high-risk zeros. This indicates that almost 30% of the patients observed as negative for nodal involvement are at high risk of nodal involvement based on cofactors.

Table 1 displays the estimates of regression coefficients for various cofactors of both portions of the ZINB model. For ZINB, the results of both parts of the model together help in understanding the role of the factors on nodal distribution. The logistic portion showed that medial primary site and absence of skin changes significantly increased the chance of negative nodes in breast cancer patients. The negative binomial portion reveals that the risk of a greater number of involved nodes was 82 percent higher in patients with one or two births (P1/P2) versus nulliparous patients, given that the patients are in a high-risk group. Further, this was 95 percent higher among multiparous patients. The patients with lateral site involvement had 1.29 times higher likelihood of having a larger number of positive nodes than patients with the medial site. Women with skin changes had 1.40 times more involvement of higher positive nodes as compared to their counterparts (Table 1). The chance of increased positive nodes was 28 percent higher among patients with 2-5 cm tumor size, in comparison to patients with less than 2 cm tumor size. It was again 1.49 times more likely among patients with more than 5 cm tumor size as compared to less than 2 cm tumor size.

Figure 1. Plots of observed minus predicted probability of positive nodes versus number of positive nodes for six models.

Table 2. Comparison of model fit characteristics.

                  PR         NB         ZHP        ZIP        ZHNB       ZINB
Log likelihood    -4093.9    -2598.6    -3019.7    -3018.4    -2553.7    -2551.1
AIC               8221.8     5233.1     6107.4     6104.8     5185.4     5172.2
BIC               8307.6     5324.0     6279.0     6276.5     5382.3     5348.9
MSPE              4764.0     139.1      632.5      627.62     52.9       49.2
MAPE              27.5       6.2        13.1       13.0       4.8        4.7

4. DISCUSSION

The number of involved nodes is one of the most important therapeutic and prognostic factors for breast cancer [1]. Clinicians need to predict the number of involved nodes in breast cancer patients in order to improve health outcomes. To the best of our knowledge, few studies have described the number of involved nodes in breast cancer patients and tested statistical models to accurately predict the number of involved nodes. As with most count data, these studies found more variability in the nodal distribution than expected under a Poisson model. They also generally assumed the cause of over-dispersion to be solely unobserved heterogeneity, and therefore used the NB model to fit and describe nodal frequency [3,4]. However, data on nodal involvement often involve excess zeros, which also cause over-dispersion. This indicates a need to explore fitting zero hurdle and zero inflated models, which can also account for variability due to excessive zeros.

In the current paper, we fitted various count models to identify putative causes of over-dispersion, and to assess the predictive performance of these models with regard to the nodal status in a population of patients with breast cancer. We also illustrated the significance of using zero inflated models in count data involving zeros that emanate from subjects who are all at risk of the event of interest. The ZHNB/ZINB regression models provide the best fit when predicting the number of involved nodes in breast cancer patients. This confirms that the distribution of the involved nodes contained over-dispersion not only due to unobserved heterogeneity but also due to excessive negative nodes (zeros).
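The information criteria in Table 2 can be cross-checked from the reported log likelihoods. The Python sketch below assumes the standard definitions AIC = 2k - 2 log L and BIC = k ln(n) - 2 log L, reads the tabulated log likelihoods as negative values, and takes n = 1152 patients (the sum of the two age groups in Table 1). The number of parameters k is not stated in the text, so it is backed out from the reported AIC. All variable and function names are illustrative.

```python
import math

N_PATIENTS = 1152  # 977 + 175, the age rows of Table 1

# model: (log likelihood, reported AIC, reported BIC), values from Table 2.
# Log likelihoods are entered as negative numbers (assumption of this sketch).
TABLE2 = {
    "PR":   (-4093.9, 8221.8, 8307.6),
    "NB":   (-2598.6, 5233.1, 5324.0),
    "ZHP":  (-3019.7, 6107.4, 6279.0),
    "ZIP":  (-3018.4, 6104.8, 6276.5),
    "ZHNB": (-2553.7, 5185.4, 5382.3),
    "ZINB": (-2551.1, 5172.2, 5348.9),
}


def implied_parameters(loglik, aic):
    """Number of parameters k implied by AIC = 2k - 2 log L, rounded to an integer."""
    return round((aic + 2 * loglik) / 2)


def bic(loglik, k, n):
    """Bayesian information criterion, k ln(n) - 2 log L."""
    return k * math.log(n) - 2 * loglik


recomputed = {}
for model, (loglik, aic_reported, bic_reported) in TABLE2.items():
    k = implied_parameters(loglik, aic_reported)
    recomputed[model] = bic(loglik, k, N_PATIENTS)
    print(f"{model:5s} k={k:2d}  BIC recomputed={recomputed[model]:.1f}  reported={bic_reported:.1f}")

best_by_aic = min(TABLE2, key=lambda m: TABLE2[m][1])
best_by_bic = min(TABLE2, key=lambda m: TABLE2[m][2])
print("lowest AIC:", best_by_aic, "| lowest BIC:", best_by_bic)
```

Because BIC charges ln(1152), roughly 7.05, per parameter against 2 for AIC, the more heavily parameterized zero hurdle and zero inflated models are penalized more under BIC, which is consistent with the NB model having the lowest BIC while ZINB has the lowest AIC.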
As expected, the PR model had the worst prediction ability for nodal frequency. Accounting for only one source of over-dispersion, either excessive zeros or unobserved heterogeneity, improved the prediction of nodal frequency, as indicated by the NB, ZHP and ZIP models. However, use of ZHNB/ZINB models, which assume more than one source of over-dispersion, provided smaller prediction error. The ZHNB and ZINB models were consistent and similar for factor identification in the extent of nodal involvement as well as for prediction of the number of positive (involved) nodes. In the current study, we focused on predicting nodal frequency. On that basis, either model can be used to predict the number of involved nodes. Because its results are easier to interpret, the ZHNB model can be preferred over the ZINB model. These findings are supported by Rose et al. [6], who also found good concordance between the ZHNB and ZINB models on vaccine adverse event data, a case of only at-risk zeros similar to the data used in our study. They suggested that model selection should be determined by the study objectives and the data generating process, and they recommend using the ZHNB model when only at-risk zeros are involved. However, Baughman [31] suggested that model choice should be based on the rationale behind the data generating mechanism. Gilthorpe et al. [32] suggested that zero inflated models should be used according to the underlying disease process, i.e., considerations of disease onset and disease progression. In our opinion, zero hurdle models should be preferred if the data consist of zeros that all come from subjects at no risk of the outcome of interest, and over-dispersion is due to excess zeros. In such cases, zeros from the no-risk population arise from a non-counting process. However, zeros coming from an at-risk population belong to the count process, so model choice should follow the rationale behind the data generation of the at-risk population.
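The structural difference between the two model families discussed above can be made concrete with their probability mass functions. The sketch below is a minimal Python illustration using the common mean-dispersion (NB2) form of the negative binomial, which is an assumption about parameterization rather than a restatement of the paper's equations. In the hurdle form every zero comes from a single binary process, while in the zero inflated form zeros come from two sources. The names `nb_pmf`, `zinb_pmf` and `zhnb_pmf` and the parameter values are hypothetical.

```python
import math


def nb_pmf(y, mu, alpha):
    """Negative binomial pmf with mean mu and dispersion alpha (variance mu + alpha * mu**2)."""
    r = 1.0 / alpha
    log_p = (math.lgamma(y + r) - math.lgamma(y + 1) - math.lgamma(r)
             + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))
    return math.exp(log_p)


def zinb_pmf(y, p_zero, mu, alpha):
    """Zero inflated NB: a zero is either structural (prob. p_zero) or an NB zero."""
    count_part = (1 - p_zero) * nb_pmf(y, mu, alpha)
    return p_zero + count_part if y == 0 else count_part


def zhnb_pmf(y, p_zero, mu, alpha):
    """Zero hurdle NB: all zeros come from the binary part; positives are zero-truncated NB."""
    if y == 0:
        return p_zero
    return (1 - p_zero) * nb_pmf(y, mu, alpha) / (1 - nb_pmf(0, mu, alpha))


# Hypothetical parameter values chosen only for illustration.
P_ZERO, MU, ALPHA = 0.27, 4.0, 0.71

for y in range(4):
    print(y, round(zinb_pmf(y, P_ZERO, MU, ALPHA), 4), round(zhnb_pmf(y, P_ZERO, MU, ALPHA), 4))
```

With the same `p_zero`, the zero inflated form always assigns more probability to zero than the hurdle form, because it adds the zeros generated by the count process itself. This is the part that the zero inflated model interprets as high-risk zeros.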
In the present study, if diagnosis is close to or at disease onset, the risk of finding the event of interest (nodal involvement) would be minimal, whereas if the diagnosis is late and during disease progression, the risk of the event of interest would be relatively high. Previous studies note that the distribution of involved nodes often consists of some proportion of false negative nodes, which may often arise in the high-risk group [7,8]. There is ample evidence to consider at-risk zeros, at least in breast cancer, as a mixture of low-risk and high-risk zeros, thus suggesting the use of zero inflated models. Use of the ZINB model not only gives an estimate of the false negative nodes, i.e., zeros at high risk of nodal involvement, but also provides slightly better predictive performance than the ZHNB model. The ZINB model estimated that about 30 percent of the zeros can be considered false/high-risk negative nodes, suggesting that these patients are at high risk of nodal involvement. Among these, some patients might have been observed or reported falsely as having negative nodes. If so, then those patients might have been under-treated and/or misclassified, resulting in an inaccurate predicted prognosis. This model will help to identify such patients and reduce misclassification. There is a need to develop a sound strategy to classify patients into high-risk zeros and low-risk zeros. This issue is under investigation by us, and is the subject of a future publication.
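The split of observed zeros into low-risk and high-risk components follows from the mixture structure of a zero inflated model. As a hedged illustration, the Python sketch below applies Bayes' rule to a single patient, assuming a negative binomial count part with mean `mu` and dispersion `alpha`, and a probability `p_low_risk` of belonging to the low-risk (always zero) group. The function name and the input values are hypothetical. In the paper, these quantities vary with each patient's covariates, and the 70.6%/29.4% figures are model-based summaries over the whole sample.

```python
def zero_split(p_low_risk, mu, alpha):
    """Posterior shares of an observed zero: (low-risk zero, high-risk zero).

    p_low_risk: prior probability of the low-risk (structural zero) group
    mu, alpha:  mean and dispersion of the negative binomial count part
    """
    nb_zero = (1 + alpha * mu) ** (-1 / alpha)        # P(Y = 0) under the count part
    total_zero = p_low_risk + (1 - p_low_risk) * nb_zero
    low = p_low_risk / total_zero
    return low, 1 - low


# Hypothetical inputs for one patient profile, for illustration only.
low_share, high_share = zero_split(p_low_risk=0.30, mu=4.0, alpha=0.71)
print(f"low-risk zero: {low_share:.3f}, high-risk zero: {high_share:.3f}")
```

A patient-level rule for classifying zeros could threshold the high-risk share, but the choice of threshold is a separate question that this sketch does not address.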

The mean square prediction error of the ZINB model was about 35.4% of that of the NB regression model (49.2 versus 139.1 in Table 2), i.e., roughly 64.6% lower. In addition, the predictive performance of the ZINB model was significantly better than that of the NB regression model, indicating that the NB model may not always be appropriate for describing nodal distribution. The leave-one-out cross-validation assessment of the developed ZINB model provided the minimum mean square prediction error compared to the other developed models, indicating that the model performs well, even for future patients, in comparison to other models.

This study is the first report to analyze patterns of nodal involvement in breast cancer using a large dataset collected in India. In our study, 61.2% of the patients had involved nodes. Sandhu et al., using a different Indian dataset, also reported 61.6% nodal involvement [33]. A different study, also using a population from India, reported an even higher nodal positivity rate of 80.2% [34]. In our study, both a primary site other than medial and the presence of skin changes are associated with a high risk of nodal involvement and with a greater number of involved nodes. In addition to these two factors, higher parity and larger tumor size are also associated with an increased risk of a higher number of involved nodes, given that the patients are in the high-risk population. These factors are consistently found to be associated with the presence of involved nodes in other studies [35-41], and are directly or indirectly consequences of late diagnosis. Overall, these findings confirm the need for ongoing efforts to minimize diagnostic delay in patients suspected of having breast cancer.

One limitation of our study is that it uses a dataset not designed for our analysis. Important covariates, such as lymphatic vascular invasion and S-phase function, were not included in this database. These covariates could be significantly associated with involved nodes, as reported in various studies [42-45].
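The MSPE comparison between the ZINB and NB models can be reproduced from the Table 2 values. The short Python sketch below only redoes that arithmetic. The dictionary and function names are illustrative, and the definition of MSPE itself is taken as given by the paper.

```python
# MSPE row of Table 2.
MSPE = {"PR": 4764.0, "NB": 139.1, "ZHP": 632.5, "ZIP": 627.62, "ZHNB": 52.9, "ZINB": 49.2}


def relative_reduction(reference, candidate):
    """Fraction by which `candidate` is lower than `reference`."""
    return 1 - candidate / reference


ratio_zinb_to_nb = MSPE["ZINB"] / MSPE["NB"]
reduction_vs_nb = relative_reduction(MSPE["NB"], MSPE["ZINB"])
print(f"ZINB MSPE as a share of NB MSPE: {ratio_zinb_to_nb:.1%}")
print(f"Reduction relative to NB:        {reduction_vs_nb:.1%}")
```

Read this way, the 35.4% figure is the ratio of the two errors, and the corresponding reduction is about 64.6%.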
In addition, instead of adjusting these results in relation to the number of dissected nodes, an attempt could be made to model the proportion of positive nodes in patients through count data models or binomial models.

5. CONCLUSIONS

The ZHNB/ZINB regression models can be used to describe nodal distribution more appropriately than the NB model. However, the ability of the ZINB model to more accurately estimate high-risk zeros while having a comparatively lower prediction error, as compared to the ZHNB model, suggests that it is the best model for predicting and describing the number of involved nodes. Many of the factors associated with nodal involvement may be a result of diagnostic delay in breast cancer patients, indicating the need to minimize delay in the diagnosis of breast cancer patients. There is also a need to further investigate the consequences of using zero inflated models, as an alternative to zero hurdle models, in at-risk populations.

6. ACKNOWLEDGEMENTS

The authors would like to express their thanks to Dr. V. Sreenivas, Department of Biostatistics, All India Institute of Medical Sciences, New Delhi; Dr. Arvind Pandey, National Institute of Medical Statistics, New Delhi; and also Dr. Kishore Chaudhry and Dr. D. K. Shukla, Division of Non-Communicable Diseases, Indian Council of Medical Research, New Delhi, for their critical comments throughout this study.

REFERENCES

[1] Hernandez-Avila, C.A., Song, C., Kuo, L., Tennen, H., Armeli, S. and Kranzler, H.R. (2006) Targeted versus daily naltrexone: Secondary analysis of effects on average daily drinking. Alcoholism, Clinical and Experimental Research, 30(5), 860-865.
[2] Slymen, D.J., Ayala, G.X., Arredondo, E.M. and Elder, J.P. (2006) A demonstration of modeling count data with an application to physical activity. Epidemiologic Perspectives & Innovations, 3(3), 1-9.
[3] Horton, N.J., Kim, E. and Saitz, R. (2007) A cautionary note regarding count models of alcohol consumption in randomized controlled trials. BioMed Central Medical Research Methodology, 7(9), 1-9.
[4] Salinas-Rodriguez, A., Manrique-Espinoza, B. and Sosa-Rubi, S.G. (2009) Statistical analysis for count data: Use of health services applications. Salud Publica Mex, 51(5), 397-406.
[5] Asada, Y. and Kephart, G. (2007) Equity in health services use and intensity of use in Canada. Biomed Central Health Services Research, 7(41), 1-12.
[6] Grootendorst, P.V. (1995) A comparison of alternative models of prescription drug utilization. Health Economics, 4(3), 183-198.
[7] Afifi, A.A., Kotlerman, J.B., Ettner, S.L. and Cowan, M. (2007) Methods for improving regression analysis for skewed continuous or counted responses. Annual Review of Public Health, 28, 95-111.
[8] Hur, K., Hedeker, D., Henderson, W., Khuri, S. and Daley, J. (2002) Modeling clustered count data with excess zeros in health care outcomes research. Health Services and Outcomes Research Methodology, 3, 5-20.
[9] Lee, A.H., Wang, K., Scott, J.A., Yau, K.K. and McLachlan, G.J. (2006) Multi-level zero-inflated Poisson regression modeling of correlated count data with excess zeros. Statistical Methods in Medical Research, 15(1), 47-61.
[10] Yau, K.K. and Lee, A.H. (2001) Zero-inflated Poisson regression with random effects to evaluate an occupational injury prevention programme. Statistics in Medicine, 20(19), 2907-2920.
[11] Min, Y. and Agresti, A. (2005) Random effect models for repeated measures of zero-inflated count data. Statistical Modelling, 5(1), 1-19.
[12] Gardner, W., Mulvey, E.P. and Shaw, E.C. (1995) Regression analyses of counts and rates: Poisson, overdispersed Poisson, and negative binomial models. Psychological Bulletin, 118(3), 392-404.
[13] Hardin, J.W. and Hilbe, J.M. (2007) Generalized Linear Models and Extensions. A Stata Press Publication, StataCorp LP, Texas.
[14] Mullahy, J. (1986) Specification and testing of some modified count data models. Journal of Econometrics, 33(3), 341-365.
[15] Lambert, D. (1992) Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34(1), 1-14.
[16] Vuong, Q.H. (1989) Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57(2), 307-333.
[17] Picard, R. and Cook, D. (1984) Cross-validation of regression models. Journal of the American Statistical Association, 79(387), 575-583.
[18] Baughman, L.A. (2007) Mixture model framework facilitates understanding of zero-inflated and hurdle models for count data. Journal of Biopharmaceutical Statistics, 17(5), 943-946.
[19] Gilthorpe, M.S., Frydenberg, M., Cheng, Y. and Baelum, V. (2009) Modelling count data with excessive zeros: The need for class prediction in zero-inflated models and the issue of data generation in choosing between zero-inflated and generic mixture models for dental caries data. Statistics in Medicine, 28(28), 3539-3553.
[20] Sandhu, D.S., Sandhu, S., Karwasra, R.K. and Marwah, S. (2010) Profile of breast cancer patients at a tertiary care hospital in north India. Indian Journal of Cancer, 47(1), 16-22.
[21] Saxena, S., Rekhi, B., Bansal, A., Bagga, A., Chintamani and Murthy, N.S. (2005) Clinico-morphological patterns of breast cancer including family history in a New Delhi hospital, India: A cross-sectional study. World Journal of Surgical Oncology, 3, 67-75.
[22] Nouh, M.A., Ismail, H., Ali El-Din, N.H. and El-Bolkainy, M.N. (2004) Lymph node metastasis in breast carcinoma: Clinicopathologic correlations in 3747 patients. Journal of Egyptian National Cancer Institute, 16(1), 50-56.
[23] Gann, P.H., Colilla, S.A., Gapstur, S.M., Winchester, D.J. and Winchester, D.P. (1999) Factors associated with axillary lymph node metastasis from breast carcinoma: Descriptive and predictive analyses. Cancer, 86(8), 1511-1518.
[24] Olivotto, I.A., Jackson, J.S.H., Mates, D., Andersen, S., Davidson, W., Bryce, C.J. and Ragaz, J. (1998) Prediction of axillary lymph node involvement of women with invasive breast carcinoma: A multivariate analysis. Cancer, 83(5), 948-955.
[25] Ravdin, P.M., De Laurentiis, M., Vendely, T. and Clark, G.M. (1994) Prediction of axillary lymph node status in breast cancer patients by use of prognostic indicators. Journal of National Cancer Institute, 86(23), 1771-1775.
[26] Chua, B., Ung, O., Taylor, R. and Boyages, J. (2001) Frequency and predictors of axillary lymph node metastases in invasive breast cancer. Australian and New Zealand Journal of Surgery, 71(12), 723-728.
[27] Manjer, J., Balldin, G. and Garne, J.P. (2004) Tumour location and axillary lymph node involvement in breast cancer: A series of 3472 cases from Sweden. European Journal of Surgical Oncology, 30(6), 610-617.
[28] Manjer, J., Balldin, G., Zackrisson, S. and Garne, J.P. (2005) Parity in relation to risk of axillary lymph node involvement in women with breast cancer. European Surgical Research, 37(3), 179-184.
[29] Olivotto, I.A., Jackson, J.S.H., Mates, D., Andersen, S., Davidson, W., Bryce, C.J. and Ragaz, J. (1998) Prediction of axillary lymph node involvement of women with invasive breast carcinoma: A multivariate analysis. Cancer, 83(5), 948-955.
[30] Ravdin, P.M., De Laurentiis, M., Vendely, T. and Clark, G.M. (1994) Prediction of axillary lymph node status in breast cancer patients by use of prognostic indicators. Journal of National Cancer Institute, 86(23), 1771-1775.
[31] Chua, B., Ung, O., Taylor, R. and Boyages, J. (2001) Frequency and predictors of axillary lymph node metastases in invasive breast cancer. Australian and New Zealand Journal of Surgery, 71(12), 723-728.
[32] Cetintas, S.K., Kurt, M., Ozkan, L., Engin, K., Gokgoz, S. and Tasdelen, I. (2006) Factors influencing axillary node metastasis in breast cancer. Tumori, 92(5), 416-422.
[33] Fisher, B., Bauer, M., Wickerham, D.L., Redmond, C.L.K. and Fisher, E.R. (1983) Relation of number of positive axillary nodes to the prognosis of patients with primary breast cancer. Cancer, 52(9), 1551-1557.
[34] Harden, S.P., Neal, A.J., Al-Nasiri, N., Ashley, S. and Quercidella, R.G. (2001) Predicting axillary lymph node metastases in patients with T1 infiltrating ductal carcinoma of the breast. The Breast, 10(2), 155-159.
[35] Guern, A.S. and Vinh-Hung, V. (2008) Statistical distribution of involved axillary lymph nodes in breast cancer. Bull Cancer, 95(4), 449-455.
[36] Kendal, W.S. (2005) Statistical kinematics of axillary nodal metastases in breast carcinoma. Clinical & Experimental Metastasis, 22(2), 177-183.
[37] Cameron, A.C. and Trivedi, P.K. (1998) Regression Analysis of Count Data. Econometric Society Monograph, Cambridge University Press, New York.
[38] Rose, C.E., Martin, S.W., Wannemuehler, K.A. and Plikaytis, B.D. (2006) On the use of zero-inflated and hurdle models for modeling vaccine adverse event count data. Journal of Biopharmaceutical Statistics, 16(4), 463-481.
[39] Rampaul, R.S., Miremadi, A., Pinder, S.E., Lee, A. and Ellis, I.O. (2001) Pathological validation and significance of micrometastasis in sentinel nodes in primary breast cancer. Breast Cancer Research, 3(2), 113-116.
[40] Schaapveld, M., Otter, R., de Vries, E.G., Fidler, V., Grond, J.A., van der Graaf, W.T., de Vogel, P.L. and Willemse, P.H. (2004) Variability in axillary lymph node dissection for breast cancer. Journal of Surgical Oncology, 87(1), 4-12.
[41] Martin, T.G., Wintle, B.A., Rhodes, J.R., Kuhnert, P.M., Field, S.A., Low-Choy, S.J., Tyre, A.J. and Possingham, H.P. (2005) Zero tolerance ecology: Improving ecological inference by modeling the source of zero observations. Ecology Letters, 8(11), 1235-1246.
[42] Zorn, C.J.W. (1996) Evaluating zero-inflated and hurdle Poisson specifications. Midwest Political Science Association, San Diego.
[43] Boucher, J.P., Denuit, M. and Guillen, M. (2007) Risk classification for claim counts: A comparative analysis of various zero inflated mixed Poisson and hurdle models. North American Actuarial Journal, 11(4), 110-131.
[44] Bohning, D., Dietz, E., Schlattmann, P., Mendonca, L. and Kirchner, U. (1999) The zero inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology. Journal of the Royal Statistical Society (Series A), 162(2), 195-209.
[45] Cheung, Y.B. (2002) Zero-inflated models for regression analysis of count data: A study of growth and development. Statistics in Medicine, 21(10), 1461-1469.