Evaluation of logistic regression models and effect of covariates for case control study in RNA-Seq analysis

Size: px
Start display at page:

Download "Evaluation of logistic regression models and effect of covariates for case control study in RNA-Seq analysis"

Transcription

1 Choi et al. BMC Bioinformatics (2017) 18:91 DOI /s y METHODOLOGY ARTICLE Evaluation of logistic regression models and effect of covariates for case control study in RNA-Seq analysis Seung Hoan Choi 1, Adam T. Labadorf 2, Richard H. Myers 2, Kathryn L. Lunetta 1, Josée Dupuis 1 and Anita L. DeStefano 1,2* Open Access Abstract Background: Next generation sequencing provides a count of RNA molecules in the form of short reads, yielding discrete, often highly non-normally distributed gene expression measurements. Although Negative Binomial (NB) regression has been generally accepted in the analysis of RNA sequencing (RNA-Seq) data, its appropriateness has not been exhaustively evaluated. We explore logistic regression as an alternative method for RNA-Seq studies designed to compare cases and controls, where disease status is modeled as a function of RNA-Seq reads using simulated and Huntington disease data. We evaluate the effect of adjusting for covariates that have an unknown relationship with gene expression. Finally, we incorporate the data adaptive method in order to compare false positive rates. Results: When the sample size is small or the expression levels of a gene are highly dispersed, the NB regression shows inflated Type-I error rates but the Classical logistic and Bayes logistic (BL) regressions are conservative. Firth s logistic (FL) regression performs well or is slightly conservative. Large sample size and low dispersion generally make Type-I error rates of all methods close to nominal alpha levels of 0.05 and However, Type-I error rates are controlled after applying the data adaptive method. The NB, BL, and FL regressions gain increased power with large sample size, large log2 fold-change, and low dispersion. The FL regression has comparable power to NB regression. Conclusions: We conclude that implementing the data adaptive method appropriately controls Type-I error rates in RNA-Seq analysis. Firth s logistic regression provides a concise statistical inference process and reduces spurious associations from inaccurately estimated dispersion parameters in the negative binomial framework. Keywords: RNA-Sequencing analysis, Firth s logistic regression, Negative binomial regression, Covariate effect Background Next generation sequencing (NGS) gene expression measurement methods simultaneously quantify tens of thousands of unique Ribonucleic Acid (RNA) molecules extracted from biological samples. These RNA sequencing (RNA-Seq) methods produce data that can be transformed into numerical values that are proportional to the abundance of RNA molecules and reflect the expression and turnover of those molecules. Identifying differentially expressed (DE) genes is an important step * Correspondence: adestef@bu.edu 1 Department of Biostatistics, Boston University, 801 Massachusetts Avenue, Boston, Massachusetts, USA 2 Department of Neurology, Boston University, 72 East Concord Street, Boston, Massachusetts, USA to understanding the molecular mechanism of disease. As with any statistical analysis, the underlying structure of the data dictates appropriate methodology. The NGS methods provide a count of RNA molecules that do not follow a normal distribution. The Negative Binomial (NB) distribution appropriately models the biological dispersion of a gene, and NB regression has been used to analyze RNA-Seq data. When Y, a random variable, follows a NB distribution with mean (μ) and dispersion (ϕ), the parameterization of the probability mass function, expected value, and variance of Y are The Author(s) Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

2 Choi et al. BMC Bioinformatics (2017) 18:91 Page 2 of 13 Y NBðμ; ϕþ; where μ 0 and ϕ 0 such that 0 1ϕ 1 f ðyþ ¼ Γðy þ ϕ 1 Þ ϕ 1 A Γðϕ 1 ÞΓðy þ 1Þ μ þ ϕ 1 1 ϕ 1 μ þ ϕ 1 ; E½Y ¼μ; and var ½y ¼μ þ μϕ 2 : Because the total number of reads of each sample will likely be different, a normalization step is required prior to performing differential expression inferences between two conditions. Normalization approaches have been evaluated elsewhere [1]. Based on these evaluations, we implemented the DESeq normalization approach in this study [2]. Estimation of the dispersion parameter (ϕ) of each gene is challenging with the small number of observations typically available in RNA-Seq studies. An overestimated dispersion may result in loss of power to detect DE genes and an underestimated dispersion parameter may increase false discoveries. Two of the most sophisticated and widely used software packages for identifying DE genes are DESeq2 and edger [3, 4], which estimate the dispersion parameter of each gene using empirical Bayes shrinkage and Cox-Reid adjusted profile likelihood methods, respectively. Many RNA-Seq analysis methods have been evaluated in different settings including multi-group study designs [1, 5 9]. Soneson et al. [9] compared performance of RNA-Seq analysis tools (edger, DESeq, bayseq [10], NBPSeq [11], TSPM [12], EBSeq [13], NOIseq [14], SAMseq [15], ShrinkSeq [16] and limma [17]) using real and simulated data sets. They reported that when sample size is small, the results should be cautiously interpreted. However, when sample size is large, the limma using variance stabilizing transformation method performed well. Seyednasrollah et al. [18] evaluated eight computational methods such as edger, DESeq, bayseq, NOISeq, SAMseq, limma, Cuffdiff2 [19], and EBSeq using publically available human and mice RNA-Seq data. They concluded that no method fits for all situations and results from distinct methods could be largely different. Rapaport et al. [7] assessed commonly used analysis packages (Cuffdiff, edger, DESeq, PoissonSeq [20], bayseq, and limma) for RNA-Seq data. They analyzed human RNA-Seq data with those methods and emphasize the importance of large sample replicates to accurately detect association with genes. Tang et al. [6] included TCC [21], edger, DESeq, DESeq2, voom [22], SAMseq, PoissonSeq, bayseq, and EBSeq in their evaluation with multi-group data. The edger and DESeq2 packages were recommended in their assessment. DESeq2 and edger are generally accepted in the analysis of RNA-Seq data because these packages are designed to properly handle studies with small sample size and lowly expressed genes. However, the appropriateness of the NB model compared to the logistic model has not been exhaustively evaluated. Because many RNA-Seq studies are designed to compare cases and controls, we explore logistic regression as an alternative approach, in which disease status is modeled as a function of RNA-Seq reads. This is a reversal of the experimental and explanatory variables in the NB model in the RNA-Seq setting. An attractive feature of the logistic framework is that the estimation of a dispersion parameter for gene expression is not necessary. Although some studies analyzed their data modeled as a function of gene expression [23, 24], a comparison to standard NB methods was not conducted. We evaluate logistic regression models in which the dependent variable is disease status and gene expression is the independent variable. Because of the substantial costs associated with RNA-Seq technology and the challenges in obtaining appropriate tissue sample sizes may be limited in some studies. When sample size is small the distribution of test statistics may not achieve the expected asymptotic distribution. Hence, we incorporated the data adaptive method into NB and logistic regressions because this approach estimates a recalibrated distribution of test statistics. We evaluated the validity of the data adaptive method in RNA-Seq studies. Another important feature of differential expression analysis is covariate adjustment. Adjustment for confounders is crucial in protecting against spurious associations. We define a confounder as a covariate that is associated with both the experimental and explanatory variables. Covariates in RNA-Seq analysis are associated with disease status, technical artifacts from experimental methodology, or intrinsic biological properties of a system in RNA-Seq models. If these covariates affect the abundance measurements of RNA-Seq data, then the covariates could significantly confound the association between RNA-Seq and disease status. Again, we considered two approaches for differential expression analysis: 1) NB regression where gene expression values are the outcome variable and case control status is the predictor variable and 2) logistic regression where case control status is a function of gene expression values. If diseaseassociated covariates are not associated with gene expression, then these covariates are non-predictive (NP) covariates in models with RNA-Seq as the outcome. The effect of adjusting for covariates, when the relationship between covariates and gene expression is not assessed through statistical tests or prior studies, has not been extensively evaluated in RNA-Seq studies using the NB framework. If we alternatively consider a logistic model, the NP covariates in the NB model become nonconfounding predictive (NCP) covariates in the logistic model, because the covariates are not associated with the independent variable (gene expression) but are associated with the dependent variable (disease status).

3 Choi et al. BMC Bioinformatics (2017) 18:91 Page 3 of 13 We use both simulated data sets and an application to a real Huntington s disease (HD) RNA-Seq data set [25]. The results of this study will guide the selection of an appropriate regression model and guide decisions regarding covariate adjustment in RNA-Seq studies. Methods This study focuses on a gene as a unit; hence various gene-based scenarios are considered. Regression methods for analyzing RNA-Seq data Negative binomial (NB) regression NB regression uses the Maximum-Likelihood (ML) fitting process [26]. The generalized linear model (GLM) framework is used by DESeq2 and edger. In the current study, GLM was implemented using the glm(,family = negative.binomial(1/ϕ)) function in the R-package MASS and utilized either the estimated dispersion from ML, Quasi-likelihood (QL) or the true dispersion value from the simulation scenario. ML and QL methods are described in Additional file 1: Supplementary Method Section 1. In our real data application, the original data and permuted data sets were analyzed with DESeq2 [3]. Classical logistic (CL) regression We conducted GLM in a logistic regression framework using the logit link function using the glm(,family = binomial) function in R. In the RNA-Seq setting CL regression may be limited by small sample size and complete separation. If the expression values of a gene are completely or nearly completely separated between case and control groups, which may occur when the effect size is large, the ML estimation from CL regression may fail to converge. Because complete separation may be an indicator of differential expression, we implemented Bayes and Firth s logistic regressions, which overcome complete separation bias. Bayes logistic (BL) regression Gelman et al. [27] proposed a prior to estimate stable coefficients in a Bayesian framework, when data show separation. The proposed prior is the Cauchy distribution with center 0 and scale 2.5. They demonstrated that this flat-tailed distribution has robust inference in logistic regression and is computationally efficient. The procedure is implemented by incorporating an EM algorithm into iteratively reweighted least squares, using the bayesglm function in the R-package arm. Firth s logistic (FL) regression The ML estimators may be biased due to small sample size and small total Fisher information. Firth proposed a method that eliminates first-order bias in ML estimation by introducing a bias term in the likelihood function [28]. Heinze and Schemper demonstrated that Firth s method is an ideal solution when the data show separation [29]. The logistf function in the R-package logistf was used. Simulation study The performance of statistical models was evaluated through Type-I error and power scenarios using combinations of the parameter values in Table 1. Generating simulated RNA-Seq data For each scenario, the read counts (y i ) were sampled from the NB distribution with mean (μ) and dispersion (ϕ) as specified in Table 1. We simulated 10,000 independent replicates per scenario using the following steps and evaluated Type-I error rate and power per scenario based on results from these 10,000 independent replicates. First, we generated simulated sample (i) data with their case (D=1) and control (D=0) statuses determined by the study design shown in Table 1. Then, a gene (g) expression value for each sample (y ig ) was generated following the NB distribution conditioning on the disease status. The log2fc determined the mean expression value for cases (y gd=1 ) in power scenarios. When simulating under the null hypothesis (Type-I error scenarios) log2fc = 0 and the mean expression value (μ gd ) was equal for cases and controls. We considered only up-regulation of genes, and assumed the dispersion was the same for cases and controls. The simulation model for the RNA-Seq count data is: y ig e NBðμ gd ; ϕ gþ; where μ gd 0, ϕ g 0, D is the binary case control status of a sample (i), μ g is mean expression value of a gene g. μ gd=1 is the mean expression value for cases for gene g and is calculated as 2 log2fc μ gd=0. Table 1 Parameters and their values in simulation scenarios Parameter Values Design Balanced, Unbalanced2, Unbalanced4 Number of cases (N D=1 ) 10, 25, 75, 500 Mean expression value in controls (μ D=0 ) 50, 100, 1000, Dispersion (ϕ) 0.01, 0.01, 0.5, 1 Covariate OR (CovOR) 1, 1.2, 3, 5, 10 log 2 fold-change (log2fc) 0, 0.3, 0.6, 1.2, 2 Number of Covariates 0, 1, 2, 3, 5, 10

4 Choi et al. BMC Bioinformatics (2017) 18:91 Page 4 of 13 Generating simulated covariate data The binary covariates (X) were simulated to follow a binomial distribution conditioning on case control status of each sample. The conditional probability was calculated based on the CovOR. XjD e BðN D ; P D Þ; where D is disease status (control = 0; case = 1), N D is sample size of D, P D=0 =0.5,andP D=1 =CovOR/(CovOR+1). When the number of cases is 10, CovOR of 10 is not considered. For every 10 replications among 10,000 replications in a scenario, a new covariate set was generated to incorporate between and within variance of covariates. All covariates in a model were independent and had the same CovOR. Analyzing simulated data We performed NB regression with Model A and conducted the CL, BL, and FL regressions with Model B. Model A : logðey ½ Þ ¼ β 0 þ β 1 D X C β k¼1 kþ1 X k Model B : logitðed ½ Þ ¼ β 0 þ β 1 Y þ X C k¼1 β kþ1 X k where Y is read count, D is case/control status and C is the number of covariates. Models without covariates were analyzed using all regression methods in order to compare the performances of unadjusted models. Models adjusting for covariates were analyzed only using NB and FL regressions to evaluate different types of covariate effects. Scenarios for which log2fc = 0 are Type-I error studies. Otherwise, the scenarios are power studies. Type-I error rates, at significance (alpha) levels 0.05 and 0.01, were calculated based on replicates with converged results. For power studies, because different Type-I error rates were observed among the distinct regression methods, comparing the power of different regression methods using the same threshold is not appropriate. For fair comparison, an empirical threshold for each regression method was calculated. Then, we computed the empirical power of each regression method using their empirical thresholds. Although only positive log2fcs were simulated in those scenarios in which power was evaluated, we consistently used two-sided tests for both Type-I error and power studies. The equations for Type-I error rate and empirical power are shown in Additional file 1: Supplementary Method Section 2. Validation of data adaptive (DA) method using crossvalidation technique The DA method re-estimates a distribution of test statistics under the null hypothesis of no association [30]. The DA approach enables one to obtain a recalibrated distribution of test statistics because when sample size is small, the asymptotic distribution may not be appropriate. This method also avoids heavy computing burden compared to implementing permutation tests with all possible permutations. The results from the DA method are validated using a cross-validation technique. Detailed procedures are provided in Additional file 1: Supplementary Method Section 3. Huntington s disease (HD) study A real RNA-Seq data set was analyzed using the DESeq2 R-package, which implements a NB GLM. The data set was also analyzed utilizing R (v3.0.0) to implement CL, BL, and FL regressions. The logistic regressions modeled case control status as a function of normalized counts of a gene and covariates. The publicly available HD data set [25] was downloaded from the GEO database (accession number GSE64810, geo/query/acc.cgi?acc=gse64810). This data set contains 20 HD cases and 49 neurologically normal controls. Details of the HD data set are provided in Additional file 1: Supplementary Method Section 4. Generating permuted HD Data The original HD study used RNA Integrity Number (RIN) and Age at Death (AAD) as covariates. RIN was included due to the potential confounding effect between HD and the abundance of RNAs. To remove the effect of RIN in our permutations, at first, samples were divided by RIN categories. Then, each gene was resampled within each category of RIN. Because AAD was included in the regression model due to its association with HD, the relationship between HD and AAD was preserved during the permutation process. We generated 10,000 Monte-Carlo permutations. Analyzing permuted HD data For the original HD data and each permutated data set, DE genes were identified using the NB model (Model A) as implemented in DESeq2. We also implemented the CL, BL, and FL regressions (Model B) to compare statistical models. In logistic models, we used normalized counts from DESeq2 as an independent variable. For these analyses, we adjusted for AAD and RIN. The Type-I error rates at our alpha levels using supplementary equation (2.1), and the exact p-values [31] were calculated with the results from the 10,000 permutations. The DA method was applied using our permutation results [30] to measure Type-I error rates and to obtain adjusted p-values of each gene. For HD data analysis, asymptotic p-values, exact p- values using 10,000 permutation results, and DA using 1000 p-values were calculated and corrected for multiple

5 Choi et al. BMC Bioinformatics (2017) 18:91 Page 5 of 13 testing by imposing a false discovery rate (FDR) of To assess the model adequacy, QQ-plots of original, exact and DA p-values were generated, and genomic inflation factors, λ gc, were calculated. This λ gc is the ratio of the median of the observed test statistics divided by the expected median of an asymptotic distribution. If a λ gc is greater than 1, this may suggest inflation in the test statistics and may indicate the presence of systemic bias such as hidden population structures, latent covariates, technical artifacts, etc. [32]. Analyzing HD data with simulated covariates To evaluate the effect of covariates in a model, the same method for generating covariates in our simulation study was applied to the HD data set to create simulated covariates. In this real data application, we focused on a moderate covariate effect on HD status (CovOR = 1.2). The HD data were analyzed using the NB GLM in DESeq2 with Model A and using the FL regression with Model B with 1000 replicated sets of simulated covariates (C = 1, 2, 3, 5, or 10). The change of λ gc with the addition of a varying number of simulated covariates in a model was evaluated. Results Simulation results of NB vs. Logistic regressions Type-I error simulations Type-I error rates from the simulated results of the scenarios at two alpha levels are presented in Table 2. All analyses presented in Table 2 were converged. The NB regressions using ML, QL and true dispersions show Table 2 Type-I error rates of regression methods from the balanced design having μ D=0 = 1000 Alpha N D=1 Disp NB CL BL FL <0.001 < < almost identical levels of performance (see Additional file 2: Table S1). When the sample size is small or the dispersion is high, NB regression shows inflated Type-I error rates but the CL and BL regressions are conservative (see Table 2). Large sample size and low dispersion generally yielded Type-I error rates that were close to the specified alpha levels as shown in Additional file 3: Figure S1. The increment of μ D=0 is not influential. The FL regression performs well or presents moderate conservativeness at both alpha levels. The Type-I error rates of the FL regression are less affected by the small sample size and the large dispersion than other logistic regressions. The Type-I error rates of additional scenarios exhibit patterns that are consistent with results in Table 2. The unbalanced design results shown in Additional file 4: Table S2 are consistent with the balanced design results in Table 2. In most scenarios, the DA method reduces the inflation observed with NB regression and the controls conservativeness observed with the CL, BL, and FL regressions. However, with small sample size, conservative results are still observed at alpha level 0.01 for the CL and BL models. The DA method with NB and FL regressions showed well-controlled Type-I error rates at all alpha levels even with small sample size. Power simulations We summarize the empirical power results from the balanced design with 10 cases in Fig. 1. The performance of the NB regressions with ML, QL and true dispersions are almost identical, as seen in Additional file 5: Table S3. Larger sample sizes increase power for all regression methods. The influence of mean expression in controls appears with small log2fc in Fig. 1. When sample size, log2fc, and dispersion are small, increase of mean expression in controls leads to an increase of power at both alpha levels. When log2fc is large and dispersion is small, the CL regression shows very low power. The NB, BL, and FL regressions gain power with large log2fc and low dispersion. These three regression methods have comparable empirical power and CL regression results in lower power in all scenarios. Application to RNA-Seq data in Huntington s disease Type-I error permutations The Type-I error rates from the permuted data sets at two alpha levels are shown in Figs. 2 and 3. We categorize genes in 5 groups by the estimated dispersion of a gene: (0, 0.05), (0.05, 0.15), (0.15, 0.8), (0.8, 1.5), and (1.5, 10). We define genes with dispersion >0.8 as largely dispersed. In the NB results from DESeq2, as dispersion increases, the Type-I error rates increase when genes are in the categories of (0, 0.05), (0.05, 0.15), and (0.15, 0.8). However, genes in the (0.8, 1.5), and (1.5, 10) categories

6 Choi et al. BMC Bioinformatics (2017) 18:91 Page 6 of 13 Fig. 1 Empirical power of regression methods from the balanced design with N D=1 = 10. Power of the Negative Binomial with true dispersion (NB), Classic Logistic (CL), Bayes Logistic (BL), and Firth s Logistic (FL) regressions at alpha levels of 0.05 and 0.01 are shown. The black dotted horizontal lines represent 95 and 90% power. Mean expression values (μ = 50 and 1000) are separated by black dotted vertical lines. Four dispersion values (0.01, 0.1, 0.5, and 1) are placed within each mean expression value. Dotted lines within each symbol imply 95% confidence interval. a The figure represents the results from the balanced design with N D=1 = 10 and log2fc = 0.3. b This figure shows the summary from the balanced design with N D=1 =10 and log2fc = 0.6 exhibit decreasing Type-I error rates (Additional file 6: Figure S2). Genes in the (0.8, 1.5), and (1.5, 10) categories largely have very low mean expression values. After excluding genes having mean expression values less than 3, Type-I error rates increase as the estimated dispersion increases as shown in Fig. 2. These increasingly liberal Type-I error rates are observed at both alpha levels and are consistent with our simulation results. In the CL, BL and FL regression results, we observe that genes in the categories of (0, 0.05), (0.05, 0.15), and (0.15, 0.8) produce increasingly conservative Type-I error rates at both alpha levels, as presented in Fig. 3. However, these increasingly conservative Type-I error rates are attenuated in the (0.8, 1.5), and (1.5, 10) categories (Additional file 7: Figure S3). Because we observe this inconsistent pattern of Type-I error rates among extremely lowly expressed genes in the DESeq2 results, we also examined the set of genes excluding those with mean expression values less than or equal to 3. After exclusion of genes with low expression, the remaining genes show consistent increasingly conservative Type-I error rates as dispersion increases as shown

7 Choi et al. BMC Bioinformatics (2017) 18:91 Page 7 of 13 Fig. 2 Type-I error rates from DESeq2 analysis of the permuted HD data. Type-I error rates from the DESeq2 (negative binomial model) analysis of the permuted HD data at alpha levels of 0.05 and 0.01 are presented in the figure. Each black empty dot represents the Type-I error rate of a gene. The red dots denote average values of Type-I error rates in each category of dispersion groups. The black dotted horizontal lines are our alpha levels. a shows Type-I error rates of genes having mean expression value of greater than 3 at alpha level of b displays Type-I error rates of genes having mean expression value of greater than 3 at alpha level of 0.01 Fig. 3 Type-I error rates from logistic regressions of the permuted HD data. Figure 3 contains Type-I error rates from Classical Logistic (CL), Bayes Logistic (BL), Firth s Logistic (FL) regressions of the permuted HD data at alpha levels of 0.05 and Each empty dot represents Type-I error rate of a gene. The dots filled with colors inside of boxes denote average values of Type-I error rates in each category of dispersion groups. The black dotted horizontal lines are our alpha levels. a shows Type-I error rates of genes having mean expression value of greater than 3 at alpha level of b displays Type-I error rates of genes having mean expression value of greater than 3 at alpha level of 0.01

8 Choi et al. BMC Bioinformatics (2017) 18:91 Page 8 of 13 in Fig. 3. Although Type-I error rates from the FL regression are also more conservative when dispersion is large, Type-I error rates are relatively well controlled compared to CL and BL regressions. The Type-I error rates observed in the real data set using logistic regression confirm our simulation results. The DA method controls Type-I error rates well for the DESeq2 results (Additional file 8: Figure S4) and the FL regression results (Additional file 9: Figure S5) at both alpha levels, regardless of dispersions of all genes. The Type-I error rates from CL and BL regressions at significance level of 0.01 are conservative as seen in Additional file 9: Figure S5. We also observed the bias of the regression methods using permuted HD data sets. As shown in Additional file 10: Figure S6, FL regression revealed the smallest bias. HD results from the NB vs. logistic regressions We analyze the HD data using NB GLM in the DESeq2 R-package, and using CL, BL, and FL regressions with the R-functions described in the Methods section. All regression results are corrected with the DA method using 1000 permutation results, and are adjusted for multiple testing using an FDR of The Q-Q plots and λ gc are shown in Additional file 11: Figure S7. The DA method reduced the mean λ gc from the results of DESeq2 and increased the mean λ gc from the results of the CL, BL, and FL regressions. As shown in Additional file 12: Figure S8, we identified 3,203 genes that were significant across all methods. The FL regression also identified 307 genes as DE genes that were not identified by the other methods. The DESeq2 approach identified 944 genes that were not significant using the other methods. The 10 most significant genes (FDR < 0.05) from FL regression that are not significant (FDR > 0.05) in the DESeq2 but significant in CL, BL and FL regressions are shown in Table 3. The most significant gene is SLC1A6 with p-value equal to 3.2E-06 from FL regression. Of the genes that are not significant (FDR > 0.05) in the CL, BL and FL analyses, the 10 most significant (FDR < 0.05) from DESeq2 are shown in Additional file 13: Table S4. The genes that are significant in FL analysis are listed in Additional file 14: Table S5. Simulation results of various covariate models from the NB and FL regressions The properties of the NB regression that are shown in Simulation results of NB vs. Logistic regressions are also presented in the simulation results with various covariate models. Type-I error simulations Type-I error rates from the simulated data to evaluate the inclusion of covariates in the models are presented in Table 4. When dispersion is small, an increasing number of NP covariates do not increase Type-I error rates in the NB models. Although the number of covariates appears to increase Type-I error rates when dispersion is 0.01 and CovOR is 5 (Table 4), this slightly increased Type-I error rate is close to the Type-I error rates in the model when dispersion is 0.01 and CovOR is 1.2. Adding more NP covariates when the dispersion is large increases Type-I error rates. The effects of large CovOR on Type-I error rates are not notable in NB models. Large sample size weakens the inflation that arises from a large number of NP covariates within large dispersion in the NB model. Type-I error rates with distinct dispersions are almost identical at both alpha levels (Additional file 15: Table S6). The same covariates that are non-predictive (NP) covariates in a NB model are non-confounding predictive (NCP) covariates in logistic models. Unlike the NB regression, even with small sample size (Table 4), when the CovOR is small, the FL regression is robust regarding the increment in the number of NCP covariates. When CovOR is large, Type-I error rates from FL regression become very conservative as the number of NCP covariates increase. Type-I error rates are not affected by large dispersion. When sample size is Table 3 Top 10 significant genes from FL regression among genes not significant in DESeq2 Gene Mean. Exp. Case Mean. Exp. Cont Disp NB.Pval CL.Pval BL.Pval FL.Pval SLC1A E E E-06 SERHL E E E-05 KCNK E E E-05 DISP E E E-05 SPOCK E E E-05 C20orf E E E-05 IST E E E-05 ARC E E E-04 STRADB E E E-04 PCP E E E-04

9 Choi et al. BMC Bioinformatics (2017) 18:91 Page 9 of 13 Table 4 Type-I error rates with covariate models from balanced design of N D=1 = 10 and μ D=0 = 1000 Disp CovOR Ncov Alpha = 0.05 Alpha = 0.01 NB FL NB FL increased, Type-I error rates are less affected by increased number of NCP covariates with large CovOR. In all scenarios, the DA method controls Type-I error rates well in the NB and FL regressions at both alpha levels. The newly approximated distribution of test statistics diminishes deviated Type-I error rates that were not controlled when many covariates were included in the NB and FL models. In addition to Type-I error rates, the bias is also calculated for each scenario. As shown in Additional file 16: Table S7, as dispersion increases, bias from NB regression increases. However, the bias from FL regression is robust with increasing dispersion. The bias does not change much for any of the models, as CovOR and the number of covariates changes. Power simulations The results of the power simulations to evaluate the inclusion of covariates in the models are summarized in Fig. 4. The empirical power of the NB regressions using different dispersion estimation methods was similar for all power scenarios (Additional file 17: Table S8). When sample size is increased the overall power is increased in both NB and FL regression. In our simulation, the power of NB and FL regression is affected by three factors 1) Dispersion, 2) CovOR, and 3) Number of NP/NCP covariates in a model, when sample size is fixed. Large dispersion, large CovOR, and increasing number of NP/NCP covariates in a model decrease power. NB regression is less sensitive to an increase of NP covariates with small dispersion. As shown Fig. 4a, NB regression results in marginally more power than FL regression when the number of covariates is large and dispersion is small. When dispersion is large but CovOR is small, the loss of power in NB regression is more sensitive to the increase in number of NP covariates than in FL regression as seen in Fig. 4b. Especially, in Fig. 4b with the CovOR of 1.2, the power of FL regression with 10 covariates in a model is more powerful than NB regression with 10 covariates. Regardless of dispersion, when CovOR and the number of covariates in a model are large, NB regression shows slightly better power than FL regression. This is demonstrated in Fig. 4 with CovOR equal to 5. HD results from covariate models The HD data were analyzed using DESeq2 and FL regression with additional simulated covariates. A summary of λ gc is presented in Table 5. An increase of the NP/NCP covariates leads to a marginally lower median of the λ gc. The standard deviations of the λ gc are increased with the increase of NP/NCP covariates. Discussion In this study, we propose using a logistic regression framework as an alternative to NB regression to analyze RNA-Seq data for case control studies. We have shown in our simulations that FL regression performs well in terms of controlling Type-I error rates and shows comparable empirical power. The dispersion is not estimated in the logistic framework, thus avoiding potential false association resulting from incorrectly estimated dispersion. The simulations presented focused on single genes varying relevant parameters (mean, dispersion, log 2 foldchange); transcriptome-wide data were not simulated. The Type-I error simulations presented demonstrate that NB regression has inflated Type-I error rates with small sample size. The degree of inflation is varied by the scale of the dispersion parameter with constant sample size. The relationship between increased Type-I error and dispersion was confirmed through permutation of a real data set. Although large sample size reduced the inflation from NB, the high cost of RNA-Seq technology and difficulty of obtaining certain sample tissues may preclude a larger sample size in some studies. The distinct Type-I error rates observed with varying dispersion parameter values may violate the general assumption that p-values from non-de genes follow a uniform distribution. However, the current simulation and permutation studies validate that the DA method is a suitable alternative approach that controls Type-I error rates in all regression methods. The empirical power of the NB, BL, and FL regressions are comparable across all scenarios. Lower power was observed for CL regression, which appears to be driven by scenarios of complete separation and a failure to converge. With large log2fc and small dispersion, simulated data are likely to show complete separation; in these scenarios the NB, BL and FL regressions are more powerful. For many circumstances with small sample size, the CL regression demonstrated the lowest empirical power among all methods because the CL regression is not able to accommodate complete separation.

10 Choi et al. BMC Bioinformatics (2017) 18:91 Page 10 of 13 Fig. 4 Empirical power of covariate models from balanced design with N D=1 = 10 and μ D=0 = The power of Negative Binomial with true dispersion (NB), and Firth s Logistic (FL) regressions at significance level 0.05 and 0.01 is shown in the figure. Black dotted horizontal lines represent 95 and 90% power. The odds ratios between covariates and case control status (CovOR = 1.2 and 5) are partitioned by vertical black dotted lines. The number covariates (0, 1, 2, 3, 5 (, and 10)) in the model are positioned within each CovOR. Dotted lines within each symbol represent the 95% confidence interval. a Balanced design from N D=1 =10, μ D=0 = 1000, dispersion = 0.01, and log2fc = 0.3. b Balanced design of N D=1 =25, μ D=0 = 1000, dispersion = 1, and log2fc = 2 Table 5 Summary of λ gc from HD analyses with simulated covariates Ncov Median_NB SD_NB Median_FL SD_FL Analysis of the HD data showed λ gc was decreased after applying the DA method to the results from NB GLM in DESeq2 but was increased after applying DA method to the results from CL, BL and FL regression. The exact p- values from 10,000 permutations revealed the same pattern. This pattern is consistent with our simulation results where Type-I error rates were inflated in the NB framework and conservative in the logistic framework when test statistics were compared with a theoretical asymptotic distribution. Although it is unknown which genes are truly DE in the HD data set, we compared DE genes identified in

11 Choi et al. BMC Bioinformatics (2017) 18:91 Page 11 of 13 the HD data by different statistical approaches. We found that SLC1A6 (solute carrier family 1, member 6; EAAT4) did not show evidence of association with HD when using DESeq2, but was highly significant when using FL regression, as shown in Table 3. SLC1A6, which is highly expressed in the cerebellum of human brain compared to other regions [33], showed lower levels of expression in prior studies of mood disorder diseases such as bipolar and major depression disorders in the striatum in situ hybridization study [34]. In addition, Utal et al. [35] showed that Purkinje cell protein 4 (PCP4), also known as PEP-19, had dramatic reduction in HD. This gene was not significantly associated with HD status when using DESeq2 (p-value = 0.086) but showed strong association when using FL regression (pvalue = ). We also found that some genes expressed highly in both cases and controls may not be detected in the NB framework, because it utilizes the ratio of mean expressions of cases and controls. For instance, the normalized mean expression value of SPOCK2 is 12,370 in cases and 15,649 in controls. Although the difference of the means is large, the gene might not be statistically significant due to the small effect size (log2fold-change = 0.34) in the NB framework. However, this gene is strongly associated with HD in our logistic framework (Table 3). SPOCK2, also known as SPARC/osteonectin and Testican-2, plays an important role in the central nervous system [36]. As a member of the testican group, the expression in various neuronal cell types including cell types cerebral cortex, thalamus, hippocampus, cerebellum was reported by in situ hybridization [36]. Several studies showed evidence of associations with prostate, colon, and breast cancer and bronchopulmonary dysplasia [37, 38]. The top genes that showed associations exclusively in NB GLM, except for gene S100A11, have low average counts as shown in Additional file 13: Table S4. The estimated dispersions for these genes are also fairly large suggesting that they may be false positives. The effect of including covariates has not been investigated for RNA-Seq studies. Identifying relationships with covariates for all genes is computationally demanding and existing software do not allow for defining genewise models for all genes, which makes this approach challenging. Therefore, RNA-Seq studies that include covariates in a single model applied to all genes will likely result in some gene expression models that include unassociated covariates. Hence, it is important to investigate the effect of NP covariates in RNA-Seq analysis. Simulations that included NP covariates in the NB model showed inflated Type-I error rates and a loss of power. With large dispersion, this inflation and loss of power becomes severe. The Type-I error in the FL regression is not notably affected by the increment of number of NCP covariates when the CovOR is small. With large CovOR and increased number of NCP covariates, conservative Type-I error rates are observed. The DA method effectively controls the increase of Type-I error rates even with larger CovOR and high number of NP/ NCP covariates. Our empirical power results show that the FLregressionismoregreatlyinfluencedbytheincreaseof covariates than the NB regression, when CovOR is large. Our HD analyses with simulated NP/NCP covariates demonstrated that an increase in the number of NP/NCP covariates results in a less stable λ gc (Table 5). Adding more NP covariates in an NB model slightly decreases the median of λ gc, and hence an increased in the median of p- values. In other words, many p-values in a set are generally increased. Based on our simulation results with NP covariates, Type-I error rates of largely dispersed genes are likely to be inflated and power of differentially expressed genes are likely to be decreased. Therefore, this slightly decreased median may indicate that the loss of power is greater than the gain of Type-I error rates. Adding more NCP covariates in a model also slightly decreases the median of the λ gc in FL regression. This decreased median might be caused by the loss of power, and this loss may occur from NCP covariates in a model according to our simulation results. Under a moderate CovOR, the number of NCP covariates in a model does not affect the Type-I error rates. The change in the median λ gc with additional covariates is larger in FL regression than in NB regression because the FL regression results are solely affected by the loss of power. NB regression results are influenced by both increased Type-I error and decreased power. The standard deviation of λ gc is increased with adding NP/NCP covariates in a model. This means that the results generated from a model that includes many covariates is not likely to be reliable, even if these covariates are associated with case control status but not gene expression. Conclusions In conclusion, unlike NB, CL and BL regressions, FL regression controls Type-I error rates well and maintains comparable power even with small sample size. Firth s logistic regression is an excellent alternative to NB regression for analysis of RNA-Seq data in case control studies. We recommend implementing the DA method in analysis of RNA-Seq data to appropriately control Type-I error rates. If computational burden of permutations required for the DA method precludes using this approach, FL regression is the best option for controlling Type-I errors with comparable power to NB regression. However, a parsimonious model is necessary to obtain robust results in the FL regression setting. This approach can be extended in multiple classes of disease status using a multinomial logistic regression method.

12 Choi et al. BMC Bioinformatics (2017) 18:91 Page 12 of 13 Additional files Additional file 1: Supplementary Method. This document provides detailed descriptions of the dispersion estimation method, Type-I error rates and empirical power calculations, procedures of data adaptive method using cross-validation technique, and Huntington sdiseasedata.(docx56kb) Additional file 2: Table S1. Type-I error rates of the NB regression from the balanced design with N D=1 = 10. Mean: The mean expression values in cases and controls, Disp: Dispersion, NB: Negative Binomial regression, MLD: Maximum likelihood estimated dispersion, QLD: Quasi-likelihood estimated dispersion, TD: True dispersion specified in the simulation. (DOCX 55 kb) Additional file 3: Figure S1. Type-I error rates of regression methods from the balanced design. Type-I error rates of the Negative Binomial with true dispersion (NB), Classic Logistic (CL), Bayes Logistic (BL), and Firth slogistic(fl) regressions at alpha levels of 0.05 and 0.01 are shown. The black dotted horizontal lines represent 5 and 1% of Type-I error rates. Dispersion values (ϕ = 0.01 and 1) are separated by black dotted vertical lines. Four values of the number of cases (10, 25, 75 and 500) are placed within each dispersion value. Dotted lines within each symbol imply 95% confidence interval. Figure S1 (A): The figure presents the Type-I error rates when μ = 50.FigureS1(B):Thisfigure shows the Type-I error rates when μ = (PNG 399 kb) Additional file 4: Table S2. Type-I error rates of regression methods from the unbalanced design with μ D=0 = Alpha: Significance levels, N D=1 : The number of cases, N D=0 : The number of controls, Disp: Dispersion, NB: Negative binomial regression with true dispersion, CL: Classical logistic regression, BL: Bayes logistic regression, FL: Firth s logistic regression. (DOCX 65 kb) Additional file 5: Table S3. Empirical power of NB regression from the balanced design with N D=1 = 10 and log2fc = 0.3. Mean: The mean expression values in cases and controls, Disp: Dispersion, NB: Negative Binomial regression, MLD: Maximum likelihood estimated dispersion, QLD: Quasi-likelihood estimated dispersion, TD: True dispersion specified in the simulation. (DOCX 56 kb) Additional file 6: Figure S2. Type-I error rates from DESeq2 analysis of the permuted HD data. This contains Type-I error rates from DESeq2 (negative binomial model) analysis of the permuted HD data at alpha levels of 0.05 and Each black empty dot represents Type-I error rate of a gene. The red dots denote average values of Type-I error rates in each category of dispersion groups. The black dotted horizontal lines are our alpha levels. Figure S2 (A) shows Type-I error rates of all genes at alpha level of Figure S2 (B) displays Type-I error rates of all genes at alpha level of (PNG 309 kb) Additional file 7: Figure S3. Type-I error rates from logistic model analyses of the permuted HD data. Figure S3 contains Type-I error rates from Classical Logistic (CL), Bayes Logistic (BL), Firth s Logistic (FL) regressions of the permuted HD data at alpha levels of 0.05 and Each empty dot represents Type-I error rate of a gene. The dots filled with colors inside of boxes denote average valuesoftype-ierrorratesineachcategoryofdispersiongroups.theblack dotted horizontal lines are our alpha levels. Figure S3 (A) shows Type-I error rates of all genes at alpha level of Figure S3 (B) displays Type-I error rates of all genes at alpha level of (PNG 470 kb) Additional file 8: Figure S4. Type-I error rates from DESeq2 analysis with the DA method using the permuted HD data. Figure S4 contains Type-I error rates from DESeq2 (negative binomial model) analysis with DA method of the permuted HD data at alpha levels of 0.05 and Each black empty dot represents Type-I error rate of a gene. The red dots denote average values of Type-I error rates in each category of dispersion groups. The black dotted horizontal lines are our alpha levels. Figure S4 (A) summarizes Type-I error rates of all genes with DA method at alpha level of Figure S4 (B) displays Type-I error rates of all genes with DA method at alpha level of (PNG 223 kb) Additional file 9: Figure S5. Type-I error rates from logistic model analyses with the DA method using the permuted HD data. Figure S5 presents Type-I error rates from Classical Logistic (CL), Bayes Logistic (BL), Firth s Logistic (FL) regressions with the DA method of the permuted HD data at alpha levels of 0.05 and Each empty dot represents Type-I error rate of a gene. The dots filled with colors inside of boxes denote average values of Type-I error rates in each category of dispersion groups. The black dotted horizontal lines are our alpha levels. Figure S5 (A) shows Type-I error rates of all genes with DA method at alpha level of Figure S5 (B) represents Type- I error rates of all genes with DA method at alpha level of (PNG 453 kb) Additional file 10: Figure S6. Bias from regression methods using the permuted HD data with μ g > 3. Figure S6 contains bias from Negative Binomial regression using DESeq2, Classical Logistic regression (CL), Bayes Logistic regression (BL), and Firth s Logistic regression (FL). Each black empty dot represents the bias of a gene. The black dotted horizontal line is no bias point. The bias of each gene is calculated using effect sizes of 10,000 permutations. (PNG 53 kb) Additional file 11: Figure S7. Q-Q plots of the HD Analyses. Figure S7 exhibits the Q-Q plots from the HD analysis adjusting for age at death and RIN from DESeq2 (A), and Classical (B), Bayes (C), and Firth s (D) Logistic regressions. Each regression method contains three different ways of calculating p-values (Original, DA, and Perm). Original p-values (Blue dots) are estimated from asymptotic distribution. DA p-values (Black dots) are evaluated from data adaptive asymptotic distribution using 1,000 permutations. Perm p-values (Yellow dots) are calculated using 10,000 permutations. (PNG 157 kb) Additional file 12: Figure S8. Venn diagram of HD analysis results using DA method. Each colored circle represents a different regression method. The numbers inside of the circles are the number of genes significant at FDR 0.05 based on p-values adjusted using the Data Adaptive (DA) method. There were 3,203 significant genes in common across all the methods. The FL identified the largest number of significant genes compared to CL and BL. The NB independently identified 944 genes. (PNG 474 kb) Additional file 13: Table S4. Top 10 significant genes from DESeq2 among genes not significant in logistic regressions. Mean.Exp.Case: Normalized mean expression value in cases, Mean.Exp.Cont: Normalized mean expression value in controls, Disp: Dispersion, NB.Pval: P-values from negative binomial regression with true dispersion, CL.Pval: P-values from classical logistic regression, BL.Pval: P-values from Bayes logistic regression, FL.Pval: P-values from Firth s logistic regression. (DOCX 53 kb) Additional file 14: Table S5. All significant genes in FL regressions using the DA method. Mean.Exp.Case: Normalized mean expression value in cases, Mean.Exp.Cont: Normalized mean expression value in controls, Disp: Dispersion, NB,CL,BL,FL:P-values from negative binomial regression, classical logistic regression, Bayes logistic regression, Firth s logistic regression. (XLS 1013 kb) Additional file 15: Table S6. Type-I error rates of the NB regression from the balanced design with N D=1 = 10 and μ = Disp: Dispersion, CovOR: Odds ratios between covariates and case control status, Ncov: The number of covariates in a model, NB: Negative binomial regression, MLD: Maximum likelihood estimated Dispersion, QLD: Quasi-likelihood estimated Dispersion, TD: The dispersion is used for the sampling. (DOCX 59 kb) Additional file 16: Table S7. Bias with covariate models from the balanced design of N D=1 =10andμ D=0 = Disp: Dispersion, CovOR: Odds ratios between covariates and case control status, Ncov: The number of covariates in a model, NB_TD: Negative binomial regression with the dispersion is used for the sampling, FL: Firth s logistic regression. (DOCX 47 kb) Additional file 17: Table S8. Type-I error rates of the NB regression from the balanced design with N D=1 = 10, μ D=0 = 1000, and log2fc = 0.3. Disp: Dispersion, CovOR: Odds ratios between covariates and case control status, Ncov: The number of covariates in a model, NB: Negative binomial regression, MLD: Maximum likelihood estimated Dispersion, QLD: Quasi-likelihood estimated Dispersion, TD: The dispersion is used for the sampling. (DOCX 59 kb) Additional file 18: R code. This R code regenerates the simulated data sets. (R 6 kb) Abbreviations AAD: Age at death; BL: Bayes logistic; CL: Classical logistic; CovOR: Covariate OR; DA: Data adaptive; DE: Differentially expressed; FDR: False discovery rate; FL: Firth s logistic; GLM: Generalized linear model; HD: Huntington s disease; log2fc: log 2 fold-change; ML: Maximum-likelihood; NB: Negative Binomial; NCP: Non-confounding predictive; NGS: Next generation sequencing; NP: Non-predictive; QL: Quasi-likelihood; RIN: RNA Integrity number; RNA: Ribonucleic acid; RNA-Seq: RNA sequencing

Dichotomizing partial compliance and increased participant burden in factorial designs: the performance of four noncompliance methods

Dichotomizing partial compliance and increased participant burden in factorial designs: the performance of four noncompliance methods Merrill and McClure Trials (2015) 16:523 DOI 1186/s13063-015-1044-z TRIALS RESEARCH Open Access Dichotomizing partial compliance and increased participant burden in factorial designs: the performance of

More information

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n.

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n. University of Groningen Latent instrumental variables Ebbes, P. IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document

More information

Unit 1 Exploring and Understanding Data

Unit 1 Exploring and Understanding Data Unit 1 Exploring and Understanding Data Area Principle Bar Chart Boxplot Conditional Distribution Dotplot Empirical Rule Five Number Summary Frequency Distribution Frequency Polygon Histogram Interquartile

More information

Catherine A. Welch 1*, Séverine Sabia 1,2, Eric Brunner 1, Mika Kivimäki 1 and Martin J. Shipley 1

Catherine A. Welch 1*, Séverine Sabia 1,2, Eric Brunner 1, Mika Kivimäki 1 and Martin J. Shipley 1 Welch et al. BMC Medical Research Methodology (2018) 18:89 https://doi.org/10.1186/s12874-018-0548-0 RESEARCH ARTICLE Open Access Does pattern mixture modelling reduce bias due to informative attrition

More information

Lec 02: Estimation & Hypothesis Testing in Animal Ecology

Lec 02: Estimation & Hypothesis Testing in Animal Ecology Lec 02: Estimation & Hypothesis Testing in Animal Ecology Parameter Estimation from Samples Samples We typically observe systems incompletely, i.e., we sample according to a designed protocol. We then

More information

Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm

Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm Journal of Social and Development Sciences Vol. 4, No. 4, pp. 93-97, Apr 203 (ISSN 222-52) Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm Henry De-Graft Acquah University

More information

Lecture 21. RNA-seq: Advanced analysis

Lecture 21. RNA-seq: Advanced analysis Lecture 21 RNA-seq: Advanced analysis Experimental design Introduction An experiment is a process or study that results in the collection of data. Statistical experiments are conducted in situations in

More information

Multilevel analysis quantifies variation in the experimental effect while optimizing power and preventing false positives

Multilevel analysis quantifies variation in the experimental effect while optimizing power and preventing false positives DOI 10.1186/s12868-015-0228-5 BMC Neuroscience RESEARCH ARTICLE Open Access Multilevel analysis quantifies variation in the experimental effect while optimizing power and preventing false positives Emmeke

More information

Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection

Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection Author's response to reviews Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection Authors: Jestinah M Mahachie John

More information

Two-stage Methods to Implement and Analyze the Biomarker-guided Clinical Trail Designs in the Presence of Biomarker Misclassification

Two-stage Methods to Implement and Analyze the Biomarker-guided Clinical Trail Designs in the Presence of Biomarker Misclassification RESEARCH HIGHLIGHT Two-stage Methods to Implement and Analyze the Biomarker-guided Clinical Trail Designs in the Presence of Biomarker Misclassification Yong Zang 1, Beibei Guo 2 1 Department of Mathematical

More information

Selection and Combination of Markers for Prediction

Selection and Combination of Markers for Prediction Selection and Combination of Markers for Prediction NACC Data and Methods Meeting September, 2010 Baojiang Chen, PhD Sarah Monsell, MS Xiao-Hua Andrew Zhou, PhD Overview 1. Research motivation 2. Describe

More information

Supplement to SCnorm: robust normalization of single-cell RNA-seq data

Supplement to SCnorm: robust normalization of single-cell RNA-seq data Supplement to SCnorm: robust normalization of single-cell RNA-seq data Supplementary Note 1: SCnorm does not require spike-ins, since we find that the performance of spike-ins in scrna-seq is often compromised,

More information

Ecological Statistics

Ecological Statistics A Primer of Ecological Statistics Second Edition Nicholas J. Gotelli University of Vermont Aaron M. Ellison Harvard Forest Sinauer Associates, Inc. Publishers Sunderland, Massachusetts U.S.A. Brief Contents

More information

(b) empirical power. IV: blinded IV: unblinded Regr: blinded Regr: unblinded α. empirical power

(b) empirical power. IV: blinded IV: unblinded Regr: blinded Regr: unblinded α. empirical power Supplementary Information for: Using instrumental variables to disentangle treatment and placebo effects in blinded and unblinded randomized clinical trials influenced by unmeasured confounders by Elias

More information

Bayesian and Frequentist Approaches

Bayesian and Frequentist Approaches Bayesian and Frequentist Approaches G. Jogesh Babu Penn State University http://sites.stat.psu.edu/ babu http://astrostatistics.psu.edu All models are wrong But some are useful George E. P. Box (son-in-law

More information

Sequence balance minimisation: minimising with unequal treatment allocations

Sequence balance minimisation: minimising with unequal treatment allocations Madurasinghe Trials (2017) 18:207 DOI 10.1186/s13063-017-1942-3 METHODOLOGY Open Access Sequence balance minimisation: minimising with unequal treatment allocations Vichithranie W. Madurasinghe Abstract

More information

Mantel-Haenszel Procedures for Detecting Differential Item Functioning

Mantel-Haenszel Procedures for Detecting Differential Item Functioning A Comparison of Logistic Regression and Mantel-Haenszel Procedures for Detecting Differential Item Functioning H. Jane Rogers, Teachers College, Columbia University Hariharan Swaminathan, University of

More information

DiffVar: a new method for detecting differential variability with application to methylation in cancer and aging

DiffVar: a new method for detecting differential variability with application to methylation in cancer and aging Genome Biology This Provisional PDF corresponds to the article as it appeared upon acceptance. Fully formatted PDF and full text (HTML) versions will be made available soon. DiffVar: a new method for detecting

More information

Understandable Statistics

Understandable Statistics Understandable Statistics correlated to the Advanced Placement Program Course Description for Statistics Prepared for Alabama CC2 6/2003 2003 Understandable Statistics 2003 correlated to the Advanced Placement

More information

SUPPLEMENTAL MATERIAL

SUPPLEMENTAL MATERIAL 1 SUPPLEMENTAL MATERIAL Response time and signal detection time distributions SM Fig. 1. Correct response time (thick solid green curve) and error response time densities (dashed red curve), averaged across

More information

Business Statistics Probability

Business Statistics Probability Business Statistics The following was provided by Dr. Suzanne Delaney, and is a comprehensive review of Business Statistics. The workshop instructor will provide relevant examples during the Skills Assessment

More information

Chapter 1: Exploring Data

Chapter 1: Exploring Data Chapter 1: Exploring Data Key Vocabulary:! individual! variable! frequency table! relative frequency table! distribution! pie chart! bar graph! two-way table! marginal distributions! conditional distributions!

More information

Poisson regression. Dae-Jin Lee Basque Center for Applied Mathematics.

Poisson regression. Dae-Jin Lee Basque Center for Applied Mathematics. Dae-Jin Lee dlee@bcamath.org Basque Center for Applied Mathematics http://idaejin.github.io/bcam-courses/ D.-J. Lee (BCAM) Intro to GLM s with R GitHub: idaejin 1/40 Modeling count data Introduction Response

More information

SPRING GROVE AREA SCHOOL DISTRICT. Course Description. Instructional Strategies, Learning Practices, Activities, and Experiences.

SPRING GROVE AREA SCHOOL DISTRICT. Course Description. Instructional Strategies, Learning Practices, Activities, and Experiences. SPRING GROVE AREA SCHOOL DISTRICT PLANNED COURSE OVERVIEW Course Title: Basic Introductory Statistics Grade Level(s): 11-12 Units of Credit: 1 Classification: Elective Length of Course: 30 cycles Periods

More information

Multiple imputation for handling missing outcome data when estimating the relative risk

Multiple imputation for handling missing outcome data when estimating the relative risk Sullivan et al. BMC Medical Research Methodology (2017) 17:134 DOI 10.1186/s12874-017-0414-5 RESEARCH ARTICLE Open Access Multiple imputation for handling missing outcome data when estimating the relative

More information

Biostatistics II

Biostatistics II Biostatistics II 514-5509 Course Description: Modern multivariable statistical analysis based on the concept of generalized linear models. Includes linear, logistic, and Poisson regression, survival analysis,

More information

A COMPARISON OF IMPUTATION METHODS FOR MISSING DATA IN A MULTI-CENTER RANDOMIZED CLINICAL TRIAL: THE IMPACT STUDY

A COMPARISON OF IMPUTATION METHODS FOR MISSING DATA IN A MULTI-CENTER RANDOMIZED CLINICAL TRIAL: THE IMPACT STUDY A COMPARISON OF IMPUTATION METHODS FOR MISSING DATA IN A MULTI-CENTER RANDOMIZED CLINICAL TRIAL: THE IMPACT STUDY Lingqi Tang 1, Thomas R. Belin 2, and Juwon Song 2 1 Center for Health Services Research,

More information

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo Business Statistics The following was provided by Dr. Suzanne Delaney, and is a comprehensive review of Business Statistics. The workshop instructor will provide relevant examples during the Skills Assessment

More information

Data Analysis Using Regression and Multilevel/Hierarchical Models

Data Analysis Using Regression and Multilevel/Hierarchical Models Data Analysis Using Regression and Multilevel/Hierarchical Models ANDREW GELMAN Columbia University JENNIFER HILL Columbia University CAMBRIDGE UNIVERSITY PRESS Contents List of examples V a 9 e xv " Preface

More information

Write your identification number on each paper and cover sheet (the number stated in the upper right hand corner on your exam cover).

Write your identification number on each paper and cover sheet (the number stated in the upper right hand corner on your exam cover). STOCKHOLM UNIVERSITY Department of Economics Course name: Empirical methods 2 Course code: EC2402 Examiner: Per Pettersson-Lidbom Number of credits: 7,5 credits Date of exam: Sunday 21 February 2010 Examination

More information

Michael Hallquist, Thomas M. Olino, Paul A. Pilkonis University of Pittsburgh

Michael Hallquist, Thomas M. Olino, Paul A. Pilkonis University of Pittsburgh Comparing the evidence for categorical versus dimensional representations of psychiatric disorders in the presence of noisy observations: a Monte Carlo study of the Bayesian Information Criterion and Akaike

More information

Fundamental Clinical Trial Design

Fundamental Clinical Trial Design Design, Monitoring, and Analysis of Clinical Trials Session 1 Overview and Introduction Overview Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics, University of Washington February 17-19, 2003

More information

EPI 200C Final, June 4 th, 2009 This exam includes 24 questions.

EPI 200C Final, June 4 th, 2009 This exam includes 24 questions. Greenland/Arah, Epi 200C Sp 2000 1 of 6 EPI 200C Final, June 4 th, 2009 This exam includes 24 questions. INSTRUCTIONS: Write all answers on the answer sheets supplied; PRINT YOUR NAME and STUDENT ID NUMBER

More information

Advanced Bayesian Models for the Social Sciences

Advanced Bayesian Models for the Social Sciences Advanced Bayesian Models for the Social Sciences Jeff Harden Department of Political Science, University of Colorado Boulder jeffrey.harden@colorado.edu Daniel Stegmueller Department of Government, University

More information

Advanced Bayesian Models for the Social Sciences. TA: Elizabeth Menninga (University of North Carolina, Chapel Hill)

Advanced Bayesian Models for the Social Sciences. TA: Elizabeth Menninga (University of North Carolina, Chapel Hill) Advanced Bayesian Models for the Social Sciences Instructors: Week 1&2: Skyler J. Cranmer Department of Political Science University of North Carolina, Chapel Hill skyler@unc.edu Week 3&4: Daniel Stegmueller

More information

Computer Age Statistical Inference. Algorithms, Evidence, and Data Science. BRADLEY EFRON Stanford University, California

Computer Age Statistical Inference. Algorithms, Evidence, and Data Science. BRADLEY EFRON Stanford University, California Computer Age Statistical Inference Algorithms, Evidence, and Data Science BRADLEY EFRON Stanford University, California TREVOR HASTIE Stanford University, California ggf CAMBRIDGE UNIVERSITY PRESS Preface

More information

You must answer question 1.

You must answer question 1. Research Methods and Statistics Specialty Area Exam October 28, 2015 Part I: Statistics Committee: Richard Williams (Chair), Elizabeth McClintock, Sarah Mustillo You must answer question 1. 1. Suppose

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction 1.1 Motivation and Goals The increasing availability and decreasing cost of high-throughput (HT) technologies coupled with the availability of computational tools and data form a

More information

Effects of propensity score overlap on the estimates of treatment effects. Yating Zheng & Laura Stapleton

Effects of propensity score overlap on the estimates of treatment effects. Yating Zheng & Laura Stapleton Effects of propensity score overlap on the estimates of treatment effects Yating Zheng & Laura Stapleton Introduction Recent years have seen remarkable development in estimating average treatment effects

More information

Data Analysis in Practice-Based Research. Stephen Zyzanski, PhD Department of Family Medicine Case Western Reserve University School of Medicine

Data Analysis in Practice-Based Research. Stephen Zyzanski, PhD Department of Family Medicine Case Western Reserve University School of Medicine Data Analysis in Practice-Based Research Stephen Zyzanski, PhD Department of Family Medicine Case Western Reserve University School of Medicine Multilevel Data Statistical analyses that fail to recognize

More information

PSYCH-GA.2211/NEURL-GA.2201 Fall 2016 Mathematical Tools for Cognitive and Neural Science. Homework 5

PSYCH-GA.2211/NEURL-GA.2201 Fall 2016 Mathematical Tools for Cognitive and Neural Science. Homework 5 PSYCH-GA.2211/NEURL-GA.2201 Fall 2016 Mathematical Tools for Cognitive and Neural Science Homework 5 Due: 21 Dec 2016 (late homeworks penalized 10% per day) See the course web site for submission details.

More information

A re-randomisation design for clinical trials

A re-randomisation design for clinical trials Kahan et al. BMC Medical Research Methodology (2015) 15:96 DOI 10.1186/s12874-015-0082-2 RESEARCH ARTICLE Open Access A re-randomisation design for clinical trials Brennan C Kahan 1*, Andrew B Forbes 2,

More information

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES Correlational Research Correlational Designs Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are

More information

Evidence Based Medicine

Evidence Based Medicine Course Goals Goals 1. Understand basic concepts of evidence based medicine (EBM) and how EBM facilitates optimal patient care. 2. Develop a basic understanding of how clinical research studies are designed

More information

Performance of Median and Least Squares Regression for Slightly Skewed Data

Performance of Median and Least Squares Regression for Slightly Skewed Data World Academy of Science, Engineering and Technology 9 Performance of Median and Least Squares Regression for Slightly Skewed Data Carolina Bancayrin - Baguio Abstract This paper presents the concept of

More information

Score Tests of Normality in Bivariate Probit Models

Score Tests of Normality in Bivariate Probit Models Score Tests of Normality in Bivariate Probit Models Anthony Murphy Nuffield College, Oxford OX1 1NF, UK Abstract: A relatively simple and convenient score test of normality in the bivariate probit model

More information

THE INDIRECT EFFECT IN MULTIPLE MEDIATORS MODEL BY STRUCTURAL EQUATION MODELING ABSTRACT

THE INDIRECT EFFECT IN MULTIPLE MEDIATORS MODEL BY STRUCTURAL EQUATION MODELING ABSTRACT European Journal of Business, Economics and Accountancy Vol. 4, No. 3, 016 ISSN 056-6018 THE INDIRECT EFFECT IN MULTIPLE MEDIATORS MODEL BY STRUCTURAL EQUATION MODELING Li-Ju Chen Department of Business

More information

STATISTICAL METHODS FOR DIAGNOSTIC TESTING: AN ILLUSTRATION USING A NEW METHOD FOR CANCER DETECTION XIN SUN. PhD, Kansas State University, 2012

STATISTICAL METHODS FOR DIAGNOSTIC TESTING: AN ILLUSTRATION USING A NEW METHOD FOR CANCER DETECTION XIN SUN. PhD, Kansas State University, 2012 STATISTICAL METHODS FOR DIAGNOSTIC TESTING: AN ILLUSTRATION USING A NEW METHOD FOR CANCER DETECTION by XIN SUN PhD, Kansas State University, 2012 A THESIS Submitted in partial fulfillment of the requirements

More information

Lecture Outline. Biost 517 Applied Biostatistics I. Purpose of Descriptive Statistics. Purpose of Descriptive Statistics

Lecture Outline. Biost 517 Applied Biostatistics I. Purpose of Descriptive Statistics. Purpose of Descriptive Statistics Biost 517 Applied Biostatistics I Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture 3: Overview of Descriptive Statistics October 3, 2005 Lecture Outline Purpose

More information

1 Introduction. st0020. The Stata Journal (2002) 2, Number 3, pp

1 Introduction. st0020. The Stata Journal (2002) 2, Number 3, pp The Stata Journal (22) 2, Number 3, pp. 28 289 Comparative assessment of three common algorithms for estimating the variance of the area under the nonparametric receiver operating characteristic curve

More information

Mammographic density and risk of breast cancer by tumor characteristics: a casecontrol

Mammographic density and risk of breast cancer by tumor characteristics: a casecontrol Krishnan et al. BMC Cancer (2017) 17:859 DOI 10.1186/s12885-017-3871-7 RESEARCH ARTICLE Mammographic density and risk of breast cancer by tumor characteristics: a casecontrol study Open Access Kavitha

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

How to analyze correlated and longitudinal data?

How to analyze correlated and longitudinal data? How to analyze correlated and longitudinal data? Niloofar Ramezani, University of Northern Colorado, Greeley, Colorado ABSTRACT Longitudinal and correlated data are extensively used across disciplines

More information

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination Timothy N. Rubin (trubin@uci.edu) Michael D. Lee (mdlee@uci.edu) Charles F. Chubb (cchubb@uci.edu) Department of Cognitive

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION doi:10.1038/nature16467 Supplementary Discussion Relative influence of larger and smaller EWD impacts. To examine the relative influence of varying disaster impact severity on the overall disaster signal

More information

WDHS Curriculum Map Probability and Statistics. What is Statistics and how does it relate to you?

WDHS Curriculum Map Probability and Statistics. What is Statistics and how does it relate to you? WDHS Curriculum Map Probability and Statistics Time Interval/ Unit 1: Introduction to Statistics 1.1-1.3 2 weeks S-IC-1: Understand statistics as a process for making inferences about population parameters

More information

Introduction. We can make a prediction about Y i based on X i by setting a threshold value T, and predicting Y i = 1 when X i > T.

Introduction. We can make a prediction about Y i based on X i by setting a threshold value T, and predicting Y i = 1 when X i > T. Diagnostic Tests 1 Introduction Suppose we have a quantitative measurement X i on experimental or observed units i = 1,..., n, and a characteristic Y i = 0 or Y i = 1 (e.g. case/control status). The measurement

More information

Modelling heterogeneity variances in multiple treatment comparison meta-analysis Are informative priors the better solution?

Modelling heterogeneity variances in multiple treatment comparison meta-analysis Are informative priors the better solution? Thorlund et al. BMC Medical Research Methodology 2013, 13:2 RESEARCH ARTICLE Open Access Modelling heterogeneity variances in multiple treatment comparison meta-analysis Are informative priors the better

More information

Chapter 11: Advanced Remedial Measures. Weighted Least Squares (WLS)

Chapter 11: Advanced Remedial Measures. Weighted Least Squares (WLS) Chapter : Advanced Remedial Measures Weighted Least Squares (WLS) When the error variance appears nonconstant, a transformation (of Y and/or X) is a quick remedy. But it may not solve the problem, or it

More information

11/24/2017. Do not imply a cause-and-effect relationship

11/24/2017. Do not imply a cause-and-effect relationship Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are highly extraverted people less afraid of rejection

More information

2.75: 84% 2.5: 80% 2.25: 78% 2: 74% 1.75: 70% 1.5: 66% 1.25: 64% 1.0: 60% 0.5: 50% 0.25: 25% 0: 0%

2.75: 84% 2.5: 80% 2.25: 78% 2: 74% 1.75: 70% 1.5: 66% 1.25: 64% 1.0: 60% 0.5: 50% 0.25: 25% 0: 0% Capstone Test (will consist of FOUR quizzes and the FINAL test grade will be an average of the four quizzes). Capstone #1: Review of Chapters 1-3 Capstone #2: Review of Chapter 4 Capstone #3: Review of

More information

Comments on Significance of candidate cancer genes as assessed by the CaMP score by Parmigiani et al.

Comments on Significance of candidate cancer genes as assessed by the CaMP score by Parmigiani et al. Comments on Significance of candidate cancer genes as assessed by the CaMP score by Parmigiani et al. Holger Höfling Gad Getz Robert Tibshirani June 26, 2007 1 Introduction Identifying genes that are involved

More information

fl/+ KRas;Atg5 fl/+ KRas;Atg5 fl/fl KRas;Atg5 fl/fl KRas;Atg5 Supplementary Figure 1. Gene set enrichment analyses. (a) (b)

fl/+ KRas;Atg5 fl/+ KRas;Atg5 fl/fl KRas;Atg5 fl/fl KRas;Atg5 Supplementary Figure 1. Gene set enrichment analyses. (a) (b) KRas;At KRas;At KRas;At KRas;At a b Supplementary Figure 1. Gene set enrichment analyses. (a) GO gene sets (MSigDB v3. c5) enriched in KRas;Atg5 fl/+ as compared to KRas;Atg5 fl/fl tumors using gene set

More information

Correlation and regression

Correlation and regression PG Dip in High Intensity Psychological Interventions Correlation and regression Martin Bland Professor of Health Statistics University of York http://martinbland.co.uk/ Correlation Example: Muscle strength

More information

The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0

The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0 The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0 Introduction Loss of erozygosity (LOH) represents the loss of allelic differences. The SNP markers on the SNP Array 6.0 can be used

More information

A non-linear beta-binomial regression model for mapping EORTC QLQ- C30 to the EQ-5D-3L in lung cancer patients: a comparison with existing approaches

A non-linear beta-binomial regression model for mapping EORTC QLQ- C30 to the EQ-5D-3L in lung cancer patients: a comparison with existing approaches Khan and Morris Health and Quality of Life Outcomes 2014, 12:163 RESEARCH Open Access A non-linear beta-binomial regression model for mapping EORTC QLQ- C30 to the EQ-5D-3L in lung cancer patients: a comparison

More information

Bayesian Confidence Intervals for Means and Variances of Lognormal and Bivariate Lognormal Distributions

Bayesian Confidence Intervals for Means and Variances of Lognormal and Bivariate Lognormal Distributions Bayesian Confidence Intervals for Means and Variances of Lognormal and Bivariate Lognormal Distributions J. Harvey a,b, & A.J. van der Merwe b a Centre for Statistical Consultation Department of Statistics

More information

George B. Ploubidis. The role of sensitivity analysis in the estimation of causal pathways from observational data. Improving health worldwide

George B. Ploubidis. The role of sensitivity analysis in the estimation of causal pathways from observational data. Improving health worldwide George B. Ploubidis The role of sensitivity analysis in the estimation of causal pathways from observational data Improving health worldwide www.lshtm.ac.uk Outline Sensitivity analysis Causal Mediation

More information

Bias in regression coefficient estimates when assumptions for handling missing data are violated: a simulation study

Bias in regression coefficient estimates when assumptions for handling missing data are violated: a simulation study STATISTICAL METHODS Epidemiology Biostatistics and Public Health - 2016, Volume 13, Number 1 Bias in regression coefficient estimates when assumptions for handling missing data are violated: a simulation

More information

Use of GEEs in STATA

Use of GEEs in STATA Use of GEEs in STATA 1. When generalised estimating equations are used and example 2. Stata commands and options for GEEs 3. Results from Stata (and SAS!) 4. Another use of GEEs Use of GEEs GEEs are one

More information

Russian Journal of Agricultural and Socio-Economic Sciences, 3(15)

Russian Journal of Agricultural and Socio-Economic Sciences, 3(15) ON THE COMPARISON OF BAYESIAN INFORMATION CRITERION AND DRAPER S INFORMATION CRITERION IN SELECTION OF AN ASYMMETRIC PRICE RELATIONSHIP: BOOTSTRAP SIMULATION RESULTS Henry de-graft Acquah, Senior Lecturer

More information

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo Please note the page numbers listed for the Lind book may vary by a page or two depending on which version of the textbook you have. Readings: Lind 1 11 (with emphasis on chapters 10, 11) Please note chapter

More information

Understanding the cluster randomised crossover design: a graphical illustration of the components of variation and a sample size tutorial

Understanding the cluster randomised crossover design: a graphical illustration of the components of variation and a sample size tutorial Arnup et al. Trials (2017) 18:381 DOI 10.1186/s13063-017-2113-2 METHODOLOGY Open Access Understanding the cluster randomised crossover design: a graphical illustration of the components of variation and

More information

Method Comparison for Interrater Reliability of an Image Processing Technique in Epilepsy Subjects

Method Comparison for Interrater Reliability of an Image Processing Technique in Epilepsy Subjects 22nd International Congress on Modelling and Simulation, Hobart, Tasmania, Australia, 3 to 8 December 2017 mssanz.org.au/modsim2017 Method Comparison for Interrater Reliability of an Image Processing Technique

More information

Identification of population average treatment effects using nonlinear instrumental variables estimators : another cautionary note

Identification of population average treatment effects using nonlinear instrumental variables estimators : another cautionary note University of Iowa Iowa Research Online Theses and Dissertations Fall 2014 Identification of population average treatment effects using nonlinear instrumental variables estimators : another cautionary

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 10: Introduction to inference (v2) Ramesh Johari ramesh.johari@stanford.edu 1 / 17 What is inference? 2 / 17 Where did our data come from? Recall our sample is: Y, the vector

More information

A Case Study: Two-sample categorical data

A Case Study: Two-sample categorical data A Case Study: Two-sample categorical data Patrick Breheny January 31 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/43 Introduction Model specification Continuous vs. mixture priors Choice

More information

Thresholds for statistical and clinical significance in systematic reviews with meta-analytic methods

Thresholds for statistical and clinical significance in systematic reviews with meta-analytic methods Jakobsen et al. BMC Medical Research Methodology 2014, 14:120 CORRESPONDENCE Open Access Thresholds for statistical and clinical significance in systematic reviews with meta-analytic methods Janus Christian

More information

WELCOME! Lecture 11 Thommy Perlinger

WELCOME! Lecture 11 Thommy Perlinger Quantitative Methods II WELCOME! Lecture 11 Thommy Perlinger Regression based on violated assumptions If any of the assumptions are violated, potential inaccuracies may be present in the estimated regression

More information

Sampling Weights, Model Misspecification and Informative Sampling: A Simulation Study

Sampling Weights, Model Misspecification and Informative Sampling: A Simulation Study Sampling Weights, Model Misspecification and Informative Sampling: A Simulation Study Marianne (Marnie) Bertolet Department of Statistics Carnegie Mellon University Abstract Linear mixed-effects (LME)

More information

Analysis of Rheumatoid Arthritis Data using Logistic Regression and Penalized Approach

Analysis of Rheumatoid Arthritis Data using Logistic Regression and Penalized Approach University of South Florida Scholar Commons Graduate Theses and Dissertations Graduate School November 2015 Analysis of Rheumatoid Arthritis Data using Logistic Regression and Penalized Approach Wei Chen

More information

Correcting AUC for Measurement Error

Correcting AUC for Measurement Error Correcting AUC for Measurement Error The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters. Citation Published Version Accessed Citable

More information

6. Unusual and Influential Data

6. Unusual and Influential Data Sociology 740 John ox Lecture Notes 6. Unusual and Influential Data Copyright 2014 by John ox Unusual and Influential Data 1 1. Introduction I Linear statistical models make strong assumptions about the

More information

breast cancer; relative risk; risk factor; standard deviation; strength of association

breast cancer; relative risk; risk factor; standard deviation; strength of association American Journal of Epidemiology The Author 2015. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail:

More information

Meaning-based guidance of attention in scenes as revealed by meaning maps

Meaning-based guidance of attention in scenes as revealed by meaning maps SUPPLEMENTARY INFORMATION Letters DOI: 1.138/s41562-17-28- In the format provided by the authors and unedited. -based guidance of attention in scenes as revealed by meaning maps John M. Henderson 1,2 *

More information

ChIP-seq data analysis

ChIP-seq data analysis ChIP-seq data analysis Harri Lähdesmäki Department of Computer Science Aalto University November 24, 2017 Contents Background ChIP-seq protocol ChIP-seq data analysis Transcriptional regulation Transcriptional

More information

Bayesian Mediation Analysis

Bayesian Mediation Analysis Psychological Methods 2009, Vol. 14, No. 4, 301 322 2009 American Psychological Association 1082-989X/09/$12.00 DOI: 10.1037/a0016972 Bayesian Mediation Analysis Ying Yuan The University of Texas M. D.

More information

Detection of Unknown Confounders. by Bayesian Confirmatory Factor Analysis

Detection of Unknown Confounders. by Bayesian Confirmatory Factor Analysis Advanced Studies in Medical Sciences, Vol. 1, 2013, no. 3, 143-156 HIKARI Ltd, www.m-hikari.com Detection of Unknown Confounders by Bayesian Confirmatory Factor Analysis Emil Kupek Department of Public

More information

A Comparison of Sample Size and Power in Case-Only Association Studies of Gene-Environment Interaction

A Comparison of Sample Size and Power in Case-Only Association Studies of Gene-Environment Interaction American Journal of Epidemiology ª The Author 2010. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. This is an Open Access article distributed under

More information

On the use of the outcome variable small for gestational age when gestational age is a potential mediator: a maternal asthma perspective

On the use of the outcome variable small for gestational age when gestational age is a potential mediator: a maternal asthma perspective Lefebvre and Samoilenko BMC Medical Research Methodology (2017) 17:165 DOI 10.1186/s12874-017-0444-z RESEARCH ARTICLE Open Access On the use of the outcome variable small for gestational age when gestational

More information

Introduction & Basics

Introduction & Basics CHAPTER 1 Introduction & Basics 1.1 Statistics the Field... 1 1.2 Probability Distributions... 4 1.3 Study Design Features... 9 1.4 Descriptive Statistics... 13 1.5 Inferential Statistics... 16 1.6 Summary...

More information

MEA DISCUSSION PAPERS

MEA DISCUSSION PAPERS Inference Problems under a Special Form of Heteroskedasticity Helmut Farbmacher, Heinrich Kögel 03-2015 MEA DISCUSSION PAPERS mea Amalienstr. 33_D-80799 Munich_Phone+49 89 38602-355_Fax +49 89 38602-390_www.mea.mpisoc.mpg.de

More information

Today: Binomial response variable with an explanatory variable on an ordinal (rank) scale.

Today: Binomial response variable with an explanatory variable on an ordinal (rank) scale. Model Based Statistics in Biology. Part V. The Generalized Linear Model. Single Explanatory Variable on an Ordinal Scale ReCap. Part I (Chapters 1,2,3,4), Part II (Ch 5, 6, 7) ReCap Part III (Ch 9, 10,

More information

Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers

Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers Tutorial in Biostatistics Received 21 November 2012, Accepted 17 July 2013 Published online 23 August 2013 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/sim.5941 Graphical assessment of

More information

Cochrane Pregnancy and Childbirth Group Methodological Guidelines

Cochrane Pregnancy and Childbirth Group Methodological Guidelines Cochrane Pregnancy and Childbirth Group Methodological Guidelines [Prepared by Simon Gates: July 2009, updated July 2012] These guidelines are intended to aid quality and consistency across the reviews

More information

Role of respondents education as a mediator and moderator in the association between childhood socio-economic status and later health and wellbeing

Role of respondents education as a mediator and moderator in the association between childhood socio-economic status and later health and wellbeing Sheikh et al. BMC Public Health 2014, 14:1172 RESEARCH ARTICLE Open Access Role of respondents education as a mediator and moderator in the association between childhood socio-economic status and later

More information

Logistic Regression and Bayesian Approaches in Modeling Acceptance of Male Circumcision in Pune, India

Logistic Regression and Bayesian Approaches in Modeling Acceptance of Male Circumcision in Pune, India 20th International Congress on Modelling and Simulation, Adelaide, Australia, 1 6 December 2013 www.mssanz.org.au/modsim2013 Logistic Regression and Bayesian Approaches in Modeling Acceptance of Male Circumcision

More information

Single SNP/Gene Analysis. Typical Results of GWAS Analysis (Single SNP Approach) Typical Results of GWAS Analysis (Single SNP Approach)

Single SNP/Gene Analysis. Typical Results of GWAS Analysis (Single SNP Approach) Typical Results of GWAS Analysis (Single SNP Approach) High-Throughput Sequencing Course Gene-Set Analysis Biostatistics and Bioinformatics Summer 28 Section Introduction What is Gene Set Analysis? Many names for gene set analysis: Pathway analysis Gene set

More information

MMI 409 Spring 2009 Final Examination Gordon Bleil. 1. Is there a difference in depression as a function of group and drug?

MMI 409 Spring 2009 Final Examination Gordon Bleil. 1. Is there a difference in depression as a function of group and drug? MMI 409 Spring 2009 Final Examination Gordon Bleil Table of Contents Research Scenario and General Assumptions Questions for Dataset (Questions are hyperlinked to detailed answers) 1. Is there a difference

More information