Test Retest and Between-Site Reliability in a Multicenter fmri Study

Similar documents
INDIVIDUALIZATION FEATURE OF HEAD-RELATED TRANSFER FUNCTIONS BASED ON SUBJECTIVE EVALUATION. Satoshi Yairi, Yukio Iwaya and Yôiti Suzuki

Stochastic Extension of the Attention-Selection System for the icub

Interpreting Effect Sizes in Contrast Analysis

Nadine Gaab, 1,2 * John D.E. Gabrieli, 1 and Gary H. Glover 2 INTRODUCTION. Human Brain Mapping 28: (2007) r

APPLICATION OF THE WALSH TRANSFORM IN AN INTEGRATED ALGORITHM FO R THE DETECTION OF INTERICTAL SPIKES

Shunting Inhibition Controls the Gain Modulation Mediated by Asynchronous Neurotransmitter Release in Early Development

Why do we remember some things and not others? Consider

Influencing Factors on Fertility Intention of Women University Students: Based on the Theory of Planned Behavior

The Probability of Disease. William J. Long. Cambridge, MA hospital admitting door (or doctors oce, or appropriate

FUNCTIONAL MAGNETIC RESONANCE IMAGING OF LANGUAGE PROCESSING AND ITS PHARMACOLOGICAL MODULATION DISSERTATION

A Brain-Machine Interface Enables Bimanual Arm Movements in Monkeys

Objective Find the Coefficient of Determination and be able to interpret it. Be able to read and use computer printouts to do regression.

Technical and Economic Analyses of Poultry Production in the UAE: Utilizing an Evaluation of Poultry Industry Feeds and a Cross-Section Survey

Dietary Assessment in Epidemiology: Comparison of a Food Frequency and a Diet History Questionnaire with a 7-Day Food Record

PEKKA KANNUS, MD* Downloaded from at on April 8, For personal use only. No other uses without permission.

Systolic and Pipelined. Processors

Real-Time fmri Using Brain-State Classification

Simulation of the Human Glucose Metabolism Using Fuzzy Arithmetic

The Engagement of Mid-Ventrolateral Prefrontal Cortex and Posterior Brain Regions in Intentional Cognitive Activity

The number and width of CT detector rows determine the

Multiscale Model of Oxygen transport in Diabetes

Sex Specificity of Ventral Anterior Cingulate Cortex Suppression During a Cognitive Task

lsokinetic Measurements of Trunk Extension and Flexion Performance Collected with the Biodex Clinical Data Station

Activation of the Caudal Anterior Cingulate Cortex Due to Task-Related Interference in an Auditory Stroop Paradigm

Two universal runoff yield models: SCS vs. LCM

Osteoarthritis and Cartilage 19 (2011) 58e64

Noninvasive Measurement of the Cerebral Blood Flow Response in Human Lateral Geniculate Nucleus With Arterial Spin Labeling fmri

Disrupted Rich Club Network in Behavioral Variant Frontotemporal Dementia and Early-Onset Alzheimer s Disease

The Neural Signature of Phosphene Perception

COHESION OF COMPACTED UNSATURATED SANDY SOILS AND AN EQUATION FOR PREDICTING COHESION WITH RESPECT TO INITIAL DEGREE OF SATURATION

Journal of Applied Science and Agriculture

Alternative Methods of Insulin Sensitivity Assessment in Obese Children and Adolescents

Radiation Protection Dosimetry, Volume II: Technical and Management System Requirements for Dosimetry Services. REGDOC-2.7.

Structural Safety. Copula-based approaches for evaluating slope reliability under incomplete probability information

pneumonia from the Pediatric Clinic of the University of Padua. Serological Methods

Predictors of Maternal Identity of Korean Primiparas

A Mathematical Model of The Effect of Immuno-Stimulants On The Immune Response To HIV Infection

Measurement uncertainty of ester number, acid number and patchouli alcohol of patchouli oil produced in Yogyakarta

Heritability of Small-World Networks in the Brain: A Graph Theoretical Analysis of Resting-State EEG Functional Connectivity

SMARTPHONE-BASED USER ACTIVITY RECOGNITION METHOD FOR HEALTH REMOTE MONITORING APPLICATIONS

Visual stimulus locking of EEG is modulated by temporal congruency of auditory stimuli

Brain Activity During Visual Versus Kinesthetic Imagery: An fmri Study

OPTIMUM AUTOFRETTAGE PRESSURE IN THICK CYLINDERS

Prevalence and Correlates of Diabetes Mellitus Among Adult Obese Saudis in Al-Jouf Region

Collaborative Evaluation of a Fluorometric Method for Measuring Alkaline Phosphatase Activity in Cow s, Sheep s, and Goat s Milk

Advanced Placement Psychology Grades 11 or 12

Evaluation of the accuracy of Lachman and Anterior Drawer Tests with KT1000 ın the follow-up of anterior cruciate ligament surgery

Causal Beliefs Influence the Perception of Temporal Order

Frequency Domain Connectivity Identification: An Application of Partial Directed Coherence in fmri

Neural Activity of the Anterior Insula in Emotional Processing Depends on the Individuals Emotional Susceptibility

Could changes in national tuberculosis vaccination policies be ill-informed?

NUMERICAL INVESTIGATION OF BVI MODELING EFFECTS ON HELICOPTER ROTOR FREE WAKE SIMULATIONS

Stacy R. Tomas, David Scott and John L. Crompton. Department of Recreation, Park and Tourism Sciences, Texas A&M University, USA

The Egyptian Journal of Hospital Medicine (January 2018) Vol. 70, Page

Coach on Call. Thank you for your interest in When the Scale Does Not Budge. I hope you find this tip sheet helpful.

F1 generation: The first set of offspring from the original parents being crossed. F2 generation: The second generation of offspring.

Altered Sleep Brain Functional Connectivity in Acutely Depressed Patients

Probing feature selectivity of neurons in primary visual cortex with natural stimuli.

Eating behavior traits and sleep as determinants of weight loss in overweight and obese adults

METHODS FOR DETERMINING THE TAKE-OFF SPEED OF LAUNCHERS FOR UNMANNED AERIAL VEHICLES

The Effects of Rear-Wheel Camber on Maximal Effort Mobility Performance in Wheelchair Athletes

Moyamoya disease is a rare cerebrovascular occlusive disorder

Variance -Estimates and Confidence Intervals for the Kappa Measure of Classification Accuracy

A Cross-Modal System Linking Primary Auditory and Visual Cortices: Evidence From Intrinsic fmri Connectivity Analysis

National Institutes of Health, Bethesda, Maryland 4 Genomic Imaging Unit, Department of Psychiatry, University of Würzburg, Germany

Abstract. Background and objectives

GLOBAL PARTNERSHIP FOR YOUTH EMPLOYMENT. Testing What Works. Evaluating Kenya s Ninaweza Program

Relationship Between Task-Related Gamma Oscillations and BOLD Signal: New Insights From Combined fmri and Intracranial EEG

Structural Brain Magnetic Resonance Imaging of Pediatric Twins

Neural Correlates of Strategic Memory Retrieval: Differentiating Between Spatial-Associative and Temporal-Associative Strategies

Protecting Location Privacy: Optimal Strategy against Localization Attacks

The Regional Economics Applications Laboratory (REAL) of the University of Illinois focuses on the development and use of analytical models for urban

CHAPTER 1 BACKGROUND ON THE EPIDEMIOLOGY AND MODELING OF BIV/AIDS

Neuroticism and Conscientiousness Respectively Constrain and Facilitate Short-term Plasticity Within the Working Memory Neural Network

Assessment of postural balance in multiple sclerosis patients

Distress is an unpleasant experience of an emotional,

Re-entrant Projections Modulate Visual Cortex in Affective Perception: Evidence From Granger Causality Analysis

Positive and Negative Modulation of Word Learning by Reward Anticipation

QUEEN CONCH STOCK RESTORATION

Relationship of Mammographic Parenchymal Patterns with Breast Cancer Risk Factors and Risk of Breast Cancer in a Prospective Study

Analysis on Retrospective Cardiac Disorder Using Statistical Analysis and Data Mining Techniques

Electronic Supplementary Information

S[NCE the publication of The Authoritarian

Effects of Anterior Cingulate Fissurization on Cognitive Control During Stroop Interference

Adaptive and Context-Aware Privacy Preservation Schemes Exploiting User Interactions in Pervasive Environments

Boston Massachusetts 3 Department of Psychology, University of Arizona, Tucson, Arizona

VIII. FOOD AND llutrielfl' COMPOSITION OF SBP MEALS

Analytical and Numerical Investigation of FGM Pressure Vessel Reinforced by Laminated Composite Materials

Effect of Hydroxymethylated Experiment on Complexing Capacity of Sodium Lignosulfonate Ting Zhou

Modulation of Cortical Oscillatory Activity During Transcranial Magnetic Stimulation

ADAPTATION OF THE MODIFIED BARTHEL INDEX FOR USE IN PHYSICAL MEDICINE AND REHABILITATION IN TURKEY

Neural Correlates of Personally Familiar Faces: Parents, Partner and Own Faces

Chasing the AIDS Virus

Instructions For Living a Healthy Life

Sexual arousal and the quality of semen produced by masturbation

P states that is often characterized by an acute blood and

Jan Derrfuss, 1 * Marcel Brass, 2 D. Yves von Cramon, 3,4 Gabriele Lohmann, 3 and Katrin Amunts 1,5

By: Charlene K. Baker, Fran H. Norris, Eric C. Jones and Arthur D. Murphy

Reliability and Validity of the Korean Version of the Cancer Stigma Scale

Joint Iterative Equalization and Decoding for 2-D ISI Channels

Transcription:

Human Bain Mapping 29:958 972 (2008) Test Retest and Between-Site Reliability in a Multicente fmri Study Lee Fiedman, 1 * Hal Sten, 2 Gegoy G. Bown, 3 Daniel H. Mathalon, 4 Jessica Tune, 1 Gay H. Glove, 5 Randy L. Gollub, 6,7 John Lauiello, 8 Kelvin O. Lim, 9 Tyone Cannon, 10 Douglas N. Geve, 7 Heny Jeemy Bockholt, 11 Aysenil Belge, 12,13 Byon Muelle, 9 Michael J. Doty, 14 Jianchun He, 15 William Wells, 16 Padhaic Smyth, 17 Steve Piepe, 18 Seyoung Kim, 17 Maek Kubicki, 19 Mak Vangel, 6,20 andsteveng.potkin 1 1 Depatment of Psychiaty and Human Behavio, Univesity of Califonia Ivine, Ivine, Califonia 2 Depatment of Statistics, Univesity of Califonia Ivine, Ivine, Califonia 3 Psychology Sevice 116B, Univesity of Califonia San Diego, VA San Diego Healthcae System, San Diego, Califonia 4 Psychiaty Sevice 116A, VA Connecticut Healthcae System, Connecticut 5 Depatment of Radiology, Stanfod Univesity, Stanfod, Califonia 6 Depatment of Psychiaty, Massachusetts Geneal Hospital, Chalestown, Massachusetts 7 Matinos Cente fo Biomedical Imaging, Massachusetts Geneal Hospital, Chalestown, Massachusetts 8 Depatment of Psychiaty, Univesity of New Mexico, Albuqueque, New Mexico 9 Depatment of Psychiaty, Univesity of Minnesota, Minneapolis, Minnesota 10 Depatment of Psychology, Los Angeles, Califonia 11 Mophomety and Neuoinfomatics Coe, The MIND Institute, 1101 Yale Boulevad NE, Albuqueque, New Mexico 12 Duke-UNC Bain Imaging and Analysis Cente, Duke Univesity Medical Cente, Duham, Noth Caolina 13 Depatment of Psychiaty, Univesity of Noth Caolina at Chapel Hill, Chapel Hill, Noth Caolina 14 Biomedical Engineeing Coe, The MIND Institute, 1101 Yale Boulevad NE, Albuqueque, New Mexico 15 Depatment of Psychiaty, Univesity of Iowa Hospital and Clinic, Iowa City, Iowa 16 Depatment of Radiology, Havad Medical School and Bigham and Women s Hospital, Boston, Massachusetts 17 Depatment of Compute Science, Univesity of Califonia Ivine, Ivine, Califonia 18 Isomics, Inc., 203 Fanklin Steet, Cambidge, Massachusetts 19 Laboatoy of Neuoscience, Depatment of Psychiaty, Havad Medical School, Bockton, Massachusetts 20 MGH/MIT GCRC Biomedical Imaging Coe, Chalestown, Massachusetts Contact gant sponsos: National Cente fo Reseach Resouces (NCRR), National Institutes of Health (NIH); Contact gant numbe: 1 U24 RR021992. *Coespondence to: Lee Fiedman, 1312 Michael Hughes D. NE, Albuqueque, NM 87112, USA. E-mail: lfiedman10@comcast.net Received fo publication 20 June 2006; Accepted 23 May 2007 DOI: 10.1002/hbm.20440 Published online 17 July 2007 in Wiley InteScience (www. intescience.wiley.com). VC 2007 Wiley-Liss, Inc.

Test Retest and Between-Site Reliability Abstact: In the pesent epot, estimates of test etest and between-site eliability of fmri assessments wee poduced in the context of a multicente fmri eliability study (FBIRN Phase 1, www.nbin.net). Five subjects wee scanned on 10 MRI scannes on two occasions. The fmri task was a simple block design sensoimoto task. The impulse esponse functions to the stimulation block wee deived using an FIR-deconvolution analysis with FMRISTAT. Six functionally-deived ROIs coveing the visual, auditoy and moto cotices, ceated fom a pio analysis, wee used. Two dependent vaiables wee compaed: pecent signal change and contast-to-noise-atio. Reliability was assessed with intaclass coelation coefficients deived fom a vaiance components analysis. Test etest eliability was high, but initially, between-site eliability was low, indicating a stong contibution fom site and site-by-subject vaiance. Howeve, a numbe of factos that can makedly impove between-site eliability wee uncoveed, including inceasing the size of the ROIs, adjusting fo smoothness diffeences, and inclusion of additional uns. By employing multiple steps, between-site eliability fo 3T scannes was inceased by 123%. Dopping one site at a time and assessing eliability can be a useful method of assessing the sensitivity of the esults to paticula sites. These findings should povide guidance to othes on the best pactices fo futue multicente studies. Hum Bain Mapp 29:958 972, 2008. VC 2007 Wiley-Liss, Inc. Key wods: test etest; epoducibility; intaclass coelation coefficient; multicente; FMRI INTRODUCTION Multicente fmri studies have a numbe of advantages, as outlined by Fiedman et al. [2006] and Fiedman and Glove [2006]. They also pose seious challenges. One common goal of most such studies is to liteally mege the data fom seveal scannes to incease the sample size applied to a substantive question of inteest, fo example, the elationship between cetain imaging phenotypes and genetic infomation. Liteally meging such data equies data fom diffeent scannes to be intechangeable and is only easonable if scanne diffeences in fmri esults can be minimized, i.e., the assessments fom one scanne to anothe ae eliable. 1 One appoach to assessing measuement eliability is to pefom a eliability study pio to a scientifically substantive study. If the between-site eliability is low, souces of uneliability might be identified and coected. Classically, eliability of this fom of data is assessed with an intaclass coelation coefficient, which can be estimated diectly fom an appopiate analysis of vaiance table o fom vaiance components [ICC; Conbach et al., 1972; Shout and Fleiss, 1979]. The ICC is a numbe that anges fom 0.0 to 1.0. 1 Meging of multisite data is not the only easonable appoach. One can, fo example, model the site effects. The simplest case of this would be to teat the site facto as a fixed effect and estimate the mean shift fom site to site. One could also model diffeent within-site vaiance estimates fo each site independently. Moe complex models ae also possible. We believe that stating out with the meging stategy povides a basis fo the discovey of factos that might attenuate between-site eliability. Futhemoe, we believe that thee ae a numbe of advantages to modeling site as a andom effect, if possible. Cicchetti and Spaow [1981] [Cicchetti, 2001] pesented guidelines fo intepetation of ICCs as follows: poo (below 0.40), fai (0.41 0.59), good (0.60 0.74), and excellent (above 0.75). The ICC is a commonly used metic to assess test etest eliability in fmri [Aon et al., 2006; Kong et al., 2007; Manoach et al., 2001; Specht et al., 2003; Wei et al., 2004], although othe metics have been employed [Genovese et al., 1997; Le and Hu 1997; Liou et al., 2006; Maita et al., 2002; Zou et al., 2005]. Thee ae seveal types of ICC [Shout and Fleiss, 1979] but if the goal is to liteally mege data fom sites, then the choice is naowed to ICC (type 2, 1) in the nomenclatue of Shout and Fleiss [1979] (see below). This ICC measues the degee of absolute ageement of each ate (scanne) with each othe ate. If this is sufficiently lage, then one has a good basis fo meging data. We pesent the esults of vaiance components analyses and associated ICCs fo a multisite study pefomed by the Function Biomedical Infomatics Reseach Netwok (fbirn). The fbirn goup published a pevious aticle on these data showing the impact of scanne site, task un, and test occasion on functional magnetic esonance imaging esults [Zou et al., 2005]. Howeve, this initial study was not aimed at measuing the magnitude of the vaious souces of unwanted vaiance in multisite fmri data. In this aticle, we use vaiance components analysis to measue the magnitude of the vaiance components associated with scanne site, task un, and testing occasion on the amplitude of the fmri signal [Dunn, 2004]. The pesent study tests the following hypotheses: Hypothesis 1. The esults of fmri studies can be expessed in seveal ways, fo example, pecent signal change (PSC), 959

Fiedman et al. t-value, P-value, egession b-weight, Peason coelation coefficient, location of activation, numbe of activated voxels, etc. In the pesent study, we hypothesized that noise would be uneliable acoss sites and that theefoe measues, which have an estimate of noise in the denominato (CNR-type measues), would have lowe eliability than measues of signal magnitude only (PSC). This hypothesis was stimulated patly by the epot of Cohen and DuBois [1999], who noted that PSC povided moe eliable estimates than numbe of activated voxels. Hypothesis 2. Measues that ae based on the median value fom an ROI will be moe eliable than measues that ae based on the maximum value fom an ROI. This was based on the notion that the maximum value could be some sot of outlie o unusual value, pehaps highly influenced by local venous achitectue (especially at 1.5T) [Ugubil et al., 1999]. Hypothesis 3. In a pevious epot [Fiedman et al., 2006], we found that image smoothness (measued as a FWHM) was elated to task effect size. 2 In that epot, we found that a majo cause of site diffeences in image smoothness was the pesence and type of apodization (k-space) filte employed duing image econstuction. We hypothesize that adjusting fo smoothness diffeences between scannes will incease eliability estimates fo CNR type measues. Hypothesis 4. The size of an ROI could well have an impact on the eliability of the measues taken fom it. If an ROI is too small, it may not captue the pimay activation fom vaious sites, especially when the same ROI is used acoss field stength. It is known that geometic distotion of functional images is geate at 3.0T than at 1.5T such diffeences could lead to somewhat diffeent esults if the same ROI is used fo both. On the othe hand, an ROI that is too lage will have low anatomic specificity. In the pesent study, fou uns of a sensoimoto task wee collected in each visit. This allowed us to compae the eliability of the aveage of two, thee, and fou uns to a single un. Theoetically, as is clea fom ou fomulae (see below), eliability will incease with moe uns. This is analogous to esults fom classical test theoy, in which inceasing the numbe of test items will incease eliability of the test (Speaman-Bown Pophecy fomula [Lod and 2 The statistical use of the tem effect size is diffeent fom the imaging use of the tem. In statistics, effect size efes to a quantity computed as a measue of effect magnitude (the numeato) divided by a measue of esidual vaiance o eo vaiance (the denominato). Fo example, fo a two-sample test of means, Cohen s D is the diffeence in means divided by the pooled standad deviation. In imaging, the tem effect size is typically used to descibe an effect magnitude (b-weight o pecent signal change) uncontolled (not divided) by an estimate of noise. Thoughout this manuscipt, we always use the tem in the statistical sense. Novick, 1968]). Howeve, factos such as inceasing fatigue and inattention ove time may educe the eliability of late uns. Theefoe, it was of inteest to see how well empiical esults match the theoetical expectation that aveaging uns inceases eliability. The pesent epot is based on the FBIRN Phase I taveling subject study [Fiedman and Glove, 2006; Fiedman et al., 2006; Zou et al., 2005]. In this study, five subjects wee scanned on 10 diffeent scannes on two occasions. This povided an excellent dataset with which to assess test etest and between-site eliability and to test ou hypotheses. MATERIALS AND METHODS Subjects Five healthy, English-speaking males (mean age: 25.2, ange 5 20.2 29) paticipated in this study. All wee ighthanded, had no histoy of psychiatic o neuological illnesses and had nomal heaing in both eas. Each subject taveled to nine sites (10 scannes) (Table I), whee they wee scanned on two consecutive days fo a total of 20 scans pe paticipant. Thee wee no missing subject visits (scan sessions), i.e., all 100 visits (5 subjects 3 2 visits 3 10 scannes) wee available fo analysis. Howeve, one scan session (1%) was unusable fo technical easons. All subjects wee instucted to avoid alcohol the night befoe the study, caffeine 2 h pio to the study and to get a nomal night s sleep the night befoe a scan session. This study was conducted in compliance with the Code of Ethics of the Wold Medical Association (Declaation of Helsinki) and the Standads established by the Institutional Review Boad of each paticipating institution. Afte a full explanation of the pocedues employed, infomed consent was obtained fom evey subject befoe paticipation in this study and befoe evey scan session. Image Acquisition A bite ba was used to stabilize each subject s head and was placed in the subject s mouth at the beginning of each scan session. An initial T2-weighted, anatomical volume fo functional ovelay was acquied fo each subject (fast spin-echo, tubo facto 5 12 o 13, oientation: paallel to the AC-PC line, numbe of slices 5 35, slice thickness 5 4 mm, no gap, TR 5 4,000 ms, TE 5 68, FOV 5 22 cm, matix 5 256 3 192, voxel dimensions 5 0.86 mm 3 0.86 mm 3 4 mm). The paametes fo this T2 ovelay scan wee allowed to vay slightly fom scanne to scanne accoding to field stength o othe local technical factos. The anatomical scan was followed eithe by a woking memoy task (total of 14.7 min) o an attention task (total of 16 min). Thee subjects always pefomed the woking memoy task and two subjects always pefomed the attention task. Ove the next hou o so, subjects pefomed fou uns of a sensoimoto task (descibed below, 4.25 min pe un, total of 17 min), two uns of a beath-hold 960

Test Retest and Between-Site Reliability TABLE I. Desciption of hadwae and sequences of the nine sites (10 scannes) paticipating in this study, five 1.5T scannes, fou 3T scannes, and one 4T scanne Site code Abbeviation Field stength (T) Manufactue RF coil type Functional sequence 1 SITE 1 1.5 GE Nvi LX TR quadatue head Spial 2 SITE 2 1.5 GE Signa CV/i TR quadatue head EPI 3 SITE 3 1.5 Siemens Sonata RO quadatue head EPI 4 SITE 4 1.5 Philips/Picke RO quadatue head EPI 5 SITE 5 1.5 Siemens Symphony TR quadatue head EPI 6 SITE 6 3.0 GE GE TR eseach coil EPI 7 SITE 7 3.0 Siemens Tio TR quadatue head EPI -Dual Echo 8 SITE 8 3.0 Siemens Tio TR quadatue head EPI 9 SITE 9 3.0 GE CV/NVi Elliptical quadatue head Spial in/out 10 SITE 10 4.0 GE Nvi LX TR quadatue head Spial task (4.25 min), and two esting state scans (fixation on a cosshais, 4.25 min). These eight scans wee pefomed in a countebalanced ode. The entie scan session was epeated the following day. In the pesent epot, only data fom the fou sensoimoto uns is pesented. The functional data wee collected using echo-plana (EPI) tajectoies (seven scannes) o spial tajectoies (thee scannes) (Table I) (oientation: paallel to the AC- PC line, numbe of slices 5 35, slice thickness 5 4 mm, no gap, TR 5 3.0 sec, TE 5 30 ms on the 3T and 4T scannes, 40 ms on the 1.5T scannes, FOV 5 22 cm, matix 5 64 3 64, voxel dimensions 5 3.4375 mm 3 3.4375 mm 3 4 mm). SITE 7 employed a double echo EPI sequence and SITE 9 employed a spial in/spial out sequence. All the spial acquisitions wee collected on Geneal Electic (GE) scannes. The sensoimoto task poduced fou uns of 85 volumes each (85 TRs). The RF coils used vaied with each scanne (Table I). Sensoimoto Task The sensoimoto (SM) task was designed initially fo calibation puposes and employed a block design, with each block taking 10 TRs (30 sec) beginning with five TRs (15 s) of est (subject instucted to stae at fixation coss) and five TRs (15 s) of sensoimoto activity (see below). Thee wee eight full cycles of this followed by a five TR est peiod at the end fo a total of 85 TRs (4.25 min). Duing the active phase, subjects wee instucted to tap thei finges bilateally in synchony with binaual tones, while watching an altenating contast checkeboad. The checkeboad flash and tone pesentation wee simultaneous. The subjects wee instucted to tap thei finges in an altenating finge tapping patten (index, middle, ing, little, little, ing, middle, index, index...) in synchony with the tones and checkeboad flashes. The thumb was not used in this study. Each tone was 166 ms long with 167 ms of silence. The tone sequence utilized a dissonant seies geneated by a synthesize (Midi notes 60, 64, 68, 72, 76, 80, 84, 88, 86, 82, 78, 74, 70, 66, 62, 58). The subjects esponses wee ecoded and monitoed with the PST Seial Response Box (Psychology Softwae Tools, Pittsbugh, PA). Pepocessing fmri Data Analysis The fist step of image pocessing was accomplished using Analysis of Functional NeuoImage (AFNI) softwae [Cox, 1996]. The fist two volumes wee discaded to allow fo T1-satuation effects to stabilize. All lage spikes in the data wee emoved fom each sensoimoto un, and each un was motion-coected (i.e., spatially egisteed to the middle volume of the un, TR 5 42). The data wee then slice-time-coected. A mean functional (T2*) image was ceated. The mean T2* image fo each un was spatially nomalized to an EPI canonical image in MNI space using tools available in SPM5 (http://www.fil.ion.ucl.ac.uk/spm/). This included affine tansfomations and thee nonlinea iteations. We limited the nonlinea iteations to thee to contol the amount of defomation. The spatial tansfomations wee applied to the time seies data as well, and the time seies was esampled at a 4 3 4 3 4mm 3 voxel size. Measuing PSC and CNR The impulse esponse functions (IRF) fo each voxel wee estimated using Keith Wosley s package, FMRISTAT [Wosley et al., 2002] (http://www.math.mcgill.ca/keith/ fmistat/), accoding to the FIR-Deconvolution method outlined at the FMRISTAT web page. The IRFs wee 30 s long (10 time points, 3 s apat), coveing the duation of the on and off block peiods. Tempoal dift was emoved by adding a linea and a quadatic component to the model. The PSC IRFs wee calculated by dividing the estimates (bs) by the model intecept (mean baseline level) and multiplying by 100. The CNR IRFs wee the t-values fo each time point in the FIR model. Image data tansfe between pogams (AFNI, SPM5, and FMRISTAT) was geatly facilitated by the application of the new NifTi standad (http://nifti.nimh.nih.gov/ nifti-1/). ROIs The ROIs ae displayed in Figue 1. The ROIs include only voxels that wee activated at all 10 scannes with an 961

Fiedman et al. factos fo this analysis, and fixed-effects included fieldstength, ROI, and the field-stength by ROI inteaction. The fixed-effects wee tested using planned contasts between field stength fo all ROIs togethe and each ROI sepaately. Planned contasts included a test of 3T > 1.5T. (The 4T > 3T test was not pefomed since thee was only one site at 4T.) Since these contasts wee obvious diectionally specific hypotheses, the P-values povided ae one-tailed. The multiple ROI contasts wee contolled fo the false discovey ate (FDR) [Benjamini and Hochbeg, 1995] using SAS PROC MULTTEST. Cohen s D values ae also povided to compae effect sizes. Effect sizes fo the field-stength effect fom PSC and CNR wee compaed using a paied t-test. Measuing eliability Vaiance components wee estimated fo each scala using SAS PROC Vacomp (SAS, Cay, NC), which employed the esticted maximum likelihood (REML) method. The vaiance components ae estimated accoding to the following model: Figue 1. ROIs. Fou of six ROIs employed in this study. ROIs ae shown in white. Fo each ROI, an axial, coonal, and sagittal view is pesented. The emaining two ROIs (ight moto and ight auditoy) wee compaable to the contalateal ROIs shown in this figue. uncoected P value <0.00001 and in all five subjects with an uncoected P value <0.00001. Thee wee six ROIs identified: left and ight moto cotex (LM and RM), left and ight auditoy cotex (LA and RA), bilateal visual cotex (BV), and the bilateal supplementay moto aea (SM). Extacting Scala Values fom Each ROI Fou IRFs wee extacted fom each ROI: (1) median IRF in pecent change units, (2) maximum IRF in pecent change units, (3) median IRF in CNR units, and (4) maximum IRF in CNR units. To aive at a scala value fom each IRF (fo each un of the SM task) the peak value of each IRF fom 6 s postblock onset to 18 s postblock onset was obtained. Statistical Analysis Data wee available fo five subjects at 10 scannes. Each subject was scanned on two visits, and thee wee fou SM uns pe visit. Testing field-stength effects To test fo field-stength effects, a mixed-model ANOVA was caied out fo median PSC and median CNR. The andom factos noted below wee included as andom Y ijkl ¼ mean þ subject i þ site j þ site-by-subject ij þ visitðsite-by-subjectþ ijk þ unexplained ijkl with Y ijkl denoting the dependent measue fo subject i, site j, visit k, and un l. Each facto was teated as a andom effect. (The site facto is often thought of as a fixed effect, but in the context of a multicente study it is desiable to think about a population of potential sites.) In this fomulation, visits ae teated as nested within site-subject combinations and the esidual tem is an estimate of the vaiance fo uns nested within subject-by-site-by-visit. This fomulation models visits and the uns that occu on these visits as distinct measuement occasions. The model also allows fo possible day-to day vaiation in magnet pefomance. As discussed below, altenative models ae possible and do not appeciably change the eliability esults. 3 3 The pesent data set allows us to conside the effect on eliability estimates of vaying the statistical model used to estimate the vaiance components. Fo example, one might view data fom the cuent study as being geneated by a completely cossed statistical design, i.e., visit cossed with site, subject and thei inteaction. We did exploe the fully cossed model, still teating all factos as andom. The effects of visit and un ae extemely small as ae all of thei inteactions except fo the inteaction visit 3 subj 3 site, which is essentially equivalent to ou nested vaiance component fo visit and the full inteaction, which is essentially ou esidual vaiance component. Although the analysis based on a model with visit and un viewed as factos cossed with site and subject does poduce slightly diffeent eliability values, these values do not vay geatly fom those epoted fo the nested design. One might expect a bigge diffeence between the two appoaches if thee was a moe consistent patten acoss visits o un (e.g., leaning o habituation effects). 962

Test Retest and Between-Site Reliability TABLE II. Vaiance components and eliability estimates fo PSC ROI extaction method Field Region Site Subject Site 3 Subject Visit Unexplained ICC_BET a ICC_T-R b Median 1.5 BV 0.003 0.036 0.001 0.007 0.010 0.72 0.80 1.5 LA 0.009 0.001 0.002 0.001 0.004 0.08 0.85 1.5 LM 0.005 0.002 0.002 0.001 0.006 0.16 0.77 1.5 RA 0.012 0.001 0.002 0.002 0.005 0.07 0.83 1.5 RM 0.008 0.000 0.002 0.003 0.004 0.02 0.74 1.5 SM 0.007 0.006 0.002 0.002 0.011 0.29 0.75 3 BV 0.027 0.060 0.011 0.011 0.024 0.52 0.85 3 LA 0.001 0.004 0.003 0.007 0.007 0.22 0.47 3 LM 0.002 0.009 0.006 0.006 0.010 0.37 0.66 3 RA 0.004 0.006 0.008 0.001 0.010 0.29 0.83 3 RM 0.001 0.002 0.007 0.004 0.010 0.14 0.62 3 SM 0.000 0.008 0.016 0.007 0.020 0.22 0.68 Maximum 1.5 BV 0.193 3.731 0.000 3.941 2.195 0.44 0.47 1.5 LA 0.307 0.151 0.068 0.171 0.209 0.20 0.70 1.5 LM 0.019 0.014 0.030 0.137 0.173 0.06 0.26 1.5 RA 0.359 0.386 0.063 0.628 0.366 0.25 0.53 1.5 RM 0.027 0.000 0.054 0.102 0.148 0.00 0.37 1.5 SM 0.008 0.033 0.021 0.014 0.078 0.34 0.65 3 BV 1.971 3.001 0.000 1.726 1.187 0.43 0.71 3 LA 0.000 0.193 0.215 0.186 0.296 0.29 0.61 3 LM 0.000 0.282 0.338 0.460 0.537 0.23 0.51 3 RA 0.074 0.846 0.137 0.217 0.604 0.59 0.74 3 RM 0.000 0.000 0.149 0.468 0.425 0.00 0.21 3 SM 0.008 0.119 0.083 0.082 0.110 0.37 0.66 a ICC_BET 5 between-site ICC. b ICC_T-R 5 test etest ICC. Tables II and III pesents esults fom the vaiance components analysis of PSC (Table II) and CNR (Table III). Analyses ae caied out fo each of six bain egions, sepaately fo 1.5T and 3.0T imaging sites. Two diffeent measues of eliability ae developed and pesented in Tables II and III. To descibe these measues, we fist TABLE III. Vaiance components and eliability estimates fo CNR ROI extaction method Field (T) Region Site Subject Site 3 Subject Visit Unexplained ICC_BET a ICC_T-R b Median 1.5 BV 0.082 0.141 0.025 0.056 0.080 0.44 0.77 1.5 LA 0.050 0.018 0.017 0.019 0.053 0.15 0.72 1.5 LM 0.071 0.015 0.006 0.024 0.055 0.12 0.71 1.5 RA 0.084 0.026 0.015 0.015 0.044 0.17 0.83 1.5 RM 0.037 0.006 0.011 0.031 0.055 0.07 0.55 1.5 SM 0.050 0.072 0.014 0.029 0.108 0.37 0.71 3 BV 0.490 0.360 0.023 0.111 0.278 0.34 0.83 3 LA 0.095 0.127 0.075 0.120 0.132 0.28 0.66 3 LM 0.084 0.201 0.029 0.081 0.249 0.44 0.69 3 RA 0.299 0.180 0.165 0.121 0.160 0.22 0.80 3 RM 0.287 0.087 0.029 0.053 0.251 0.17 0.78 3 SM 0.228 0.251 0.187 0.073 0.317 0.31 0.81 Maximum 1.5 BV 0.124 0.632 0.110 0.446 0.820 0.42 0.57 1.5 LA 0.352 0.364 0.106 0.278 0.731 0.28 0.64 1.5 LM 0.125 0.013 0.074 0.065 0.492 0.03 0.53 1.5 RA 0.307 0.884 0.317 0.178 0.767 0.47 0.80 1.5 RM 0.047 0.000 0.202 0.088 0.589 0.00 0.51 1.5 SM 0.050 0.125 0.062 0.057 0.326 0.33 0.63 3 BV 3.965 2.740 0.824 0.837 2.101 0.31 0.85 3 LA 0.428 2.695 0.144 0.709 1.658 0.61 0.74 3 LM 0.069 0.444 0.091 0.325 1.140 0.37 0.50 3 RA 1.032 2.054 0.691 0.303 1.572 0.46 0.84 3 RM 0.543 0.328 0.033 0.251 1.395 0.22 0.60 3 SM 0.290 0.782 0.308 0.054 0.933 0.47 0.83 a ICC_BET 5 between-site ICC. b ICC_T-R 5 test-etest ICC. 963

Fiedman et al. Note that the between-site eliability coefficient povides an estimate of the eliability of an imaging paamete aveaged acoss fou uns fo a andomly selected subject studied at a single andomly selected site on one andomly selected occasion. A measue fo test etest eliability asks about the eliability (o coelation) of two visits fo the same subject at the same site but on diffeent days. This eliability is calculated as Test-Retest Reliability ¼ðVD subject þ VD site þ VD subject-by-siteþ=total Visit Vaiance The between-site eliability estimates ae analogous to ICC (type 2, 1) and the test etest eliability estimates ae analogous to ICC (type 1, 1) fom Shout and Fleiss [1979] but have been adapted to the moe complex design. RESULTS Field-Stength Effects (Fig. 2, Table IV) Figue 2. Field stength effects. (A) Mean PSC estimates (based on median ROI extaction) fo 1.5T, 3T, and the 4T scanne acoss six ROIs (BV5 bilateal visual cotex, LA 5 left auditoy cotex, LM 5 left moto cotex, RA 5 ight auditoy cotex, RM 5 ight moto cotex and SM 5 supplementay moto cotex). (B) Mean CNR estimates (based on median ROI extaction) fo 1.5T, 3T, and the 4T scanne acoss six ROIs. intoduce a total visit vaiance tem, which is the sum of all of the vaiance components associated with the above model: Total Visit Vaiance ¼ðVD subject þ VD site þ VD site-by-subject þ VD visit þðvd unexplained=4þþ whee VD stands fo vaiance due to. We divide VD_unexplained by 4 to get the total visit vaiance (aveage ove fou uns) athe than total vaiance fo a single un. A measue of between-site eliability (fo a visit consisting of fou uns) is the coelation of two measues (based on the mean of fou uns) fo the same subject but at diffeent sites. In tems of the vaiance components estimated as above (and pesented in Tables II and III) this is: Between-Site Reliability ¼ VD subject=total Visit Vaiance Median PSC and CNR both incease with field-stength (Fig. 2). The 3T scannes had highe median PSC than the 1.5T scannes (Fig. 2A, Table IV). The all-egion test was statistically significant as wee five of six ROI-specific tests. Contolling fo an FDR of 0.05, two of the ROI-specific tests wee statistically significant. The Cohen s D effect sizes wee vey lage. The 3T scannes had highe median CNR than the 1.5T scannes (Fig. 2B, Table IV). The allegion test was statistically significant as wee all of the ROI-specific tests. Contolling fo an FDR of 0.05, all of the ROI-specific tests wee statistically significant. The Cohen s D effect sizes wee also vey lage. Howeve, the effect sizes fo median CNR wee statistically significantly highe than the effect sizes fo median PSC (P 5 0.001, two-tailed, paied t-test). The pesence of consistent mean elevations fo the 4T scanne compaed to the 3T scannes (11 of 12 measues), indicates that including the 4T scanne with the 3T scannes will lowe eliability. Fo this eason, we chose not to lump the 4T in with the high field goup. Test Retest Reliability (Fig. 3A, Tables II and III) Test etest eliability was geneally high (Fig. 3A). Fo the median PSC measue, the median test etest eliability was 0.76 (25th pecentile 5 0.67, 75th pecentile 5 0.83). Fo the median CNR measue, the median test etest eliability was 0.74 (25th pecentile 5 0.70, 75th pecentile 5 0.80). Thus, the cental tendency of test-etest eliability is at the bode between good and excellent eliability. Between-Site Reliability (Fig. 3B, Tables II and III) Between-site eliability was much lowe than test etest eliability (Fig. 3B vesus Fig. 3A). Fo the median PSC measue, the median between-site eliability was 0.22 (25th 964

Test Retest and Between-Site Reliability TABLE IV. Field-stength effects Contast Region Estimate T df One-tailed P-value FDR P-value Cohen s D Pecent signal change 3 T > 1.5 T ALL 0.10 2.28 7 0.028 1.72 3T > 1.5 T BV 0.15 3.25 8.57 0.005 0.030 2.22 3T > 1.5 T LA 0.09 2.04 8.57 0.037 0.056 1.39 3T > 1.5 T LM 0.09 1.87 8.57 0.048 0.058 1.28 3T > 1.5 T RA 0.12 2.67 8.57 0.013 0.039 1.82 3T > 1.5 T RM 0.05 1.02 8.57 0.168 0.168 0.70 3T > 1.5 T SM 0.10 2.16 8.57 0.030 0.056 1.48 Contast-to-noise atio 3 T > 1.5 T ALL 0.91 3.68 7 0.004 2.78 3T > 1.5 T BV 0.85 3.39 7.56 0.005 0.005 2.47 3T > 1.5 T LA 0.90 3.56 7.56 0.004 0.005 2.59 3T > 1.5 T LM 1.03 4.08 7.56 0.002 0.005 2.97 3T > 1.5 T RA 0.94 3.72 7.56 0.003 0.005 2.71 3T > 1.5 T RM 0.85 3.36 7.56 0.005 0.005 2.44 3T > 1.5 T SM 0.90 3.57 7.56 0.004 0.005 2.60 pecentile 5 0.13, 75th pecentile 5 0.31). Fo the median CNR measue, the median between-site eliability was 0.25 (25th pecentile 5 0.16, 75th pecentile 5 0.35). Both ICCs can be descibed as poo. slope, each incease by 1 mm in smoothness (FWHM) is associated with a 0.03 decline in median PSC. With PSC estimates in the ange of 0.6 1.0, this amounts to between Compaing Median-based Measues to Maximum-based Measues on Between-Site Reliability (Fig. 4, Table V) Between-site eliability estimates fo median and maximum PSC (Fig. 4A) and CNR (Fig. 4B) measues ae summaized in Table V. Thee is no obvious winne in these compaisons, and Wilcoxon Paied Tests eveal no statistically significant diffeences. The maximum measues do have highe median eliability (Table V) but the patten is not consistent acoss egions and measues. Moeove, fo PSC fo the 1.5T_BV measue, whee eliability is geatest, thee is a maked dop in eliability estimate in going fom median to maximum measue. The Effect of Smoothness Adjustment on Between-Site Reliability (Fig. 5, Table VI) In a pevious epot [Fiedman et al., 2006; see also Lowe and Soenson, 1997], we found that image smoothness had a stong effect on activation effect size, i.e., the multiple R 2. This is most simila to the CNR measue used heein, although it would be bette elated to contast-tototal-tempoal-vaiance-atio. Having in ou possession smoothness estimates fom that pape and median PSC and CNR estimates hee, it was of inteest to detemine if pio adjustment fo smoothness diffeences would have a beneficial effect on between-site eliability. Median PSC and CNR measues wee egessed against aveage smoothness estimates (FWHM units) (Table VI). Geneally, thee was a negative slope fo the elationship between smoothness and median PSC. Although the effect is small, with a median slope of 20.032, the elationship was statistically significant fo six measues. With this Figue 3. Reliability. (A) Test etest eliability estimates fo PSC and CNR fo 1.5T and 3T scannes at six ROIs. See Figue 2 fo ROI definitions. (B) Between-site eliability estimates fo PSC and CNR fo 1.5T and 3T scannes at six ROIs. 965

Fiedman et al. Figue 4. Compaing the median and the maximum. (A) Between-site eliability estimates fo PSC based on eithe median ROI extaction o maximum ROI extaction fo 1.5T and 3T scannes at six ROIs. See Figue 2 fo ROI definitions. (B) Between-site eliability estimates fo CNR based on eithe median ROI extaction o maximum ROI extaction fo 1.5T and 3T scannes at six ROIs. Measue TABLE V. Between-site eliability estimates fo median-based and maximum-based measues ROI extaction method 25th pecentile Median 75th pecentile Pecent signal Median 0.13 0.22 0.31 change Pecent signal Maximum 0.17 0.27 0.39 change CNR Median 0.16 0.25 0.35 CNR Maximum 0.27 0.35 0.46 Figue 5. Effects of contolling fo smoothness. (A) The effect of pio adjustment fo smoothness on between-site eliability of median PSC at 1.5T and 3T fo six ROIs. See Figue 2 fo ROI definitions. (B) The effect of pio adjustment fo smoothness on between-site eliability of median CNR at 1.5T and 3T fo six ROIs. TABLE VI. Slope estimates fo the elationship between image smoothness and PSC o CNR Field Region PSC_Slope PSC P-value CNR_Slope CNR_P-value 1.5 BV 20.036 0.14 0.378 0.00000 1.5 LA 20.052 0.00007 0.201 0.00000 1.5 LM 20.071 0.00000 0.316 0.00000 1.5 RA 20.039 0.00890 0.237 0.00000 1.5 RM 20.101 0.00000 0.243 0.00000 1.5 SM 20.096 0.00000 0.247 0.00000 3 BV 20.028 0.55 0.466 0.00082 3 LA 20.011 0.58 0.477 0.00000 3 LM 20.013 0.58 0.541 0.00000 3 RA 0.056 0.01136 0.803 0.00000 3 RM 0.012 0.55 0.789 0.00000 3 SM 0.026 0.39 0.738 0.00000 966

Test Retest and Between-Site Reliability The Effect of ROI Size on Between-Site Reliability (Fig. 6) One concen was that ou ROIs wee too small, even though they wee defined to include activations fom all sites (see above). All the oiginal ROIs wee dilated in the x, y, and z diection by thee voxels. The dilated ROIs poduced statistically significant inceases in between-site ICC fo PSC (Wilcoxon Test, P 5 0.034), especially at 3T (impovement at 1.5T 5 47%, impovement at 3T 5 93%) (Fig. 6). The Effect of Dopping One Site on Between-Site Reliability (Table VII) Figue 6. Effect of dilating ROIs. Between-site ICCs fo median PSC with and without ROI dilation. Note the impoved eliability with dilated ROIs especially at 3T. 3 and 5% of PSC estimates. Fo median PSC (Fig. 5A) smoothness adjustment impoves eliability fo 9 of 12 measues (Wilcoxon Test, P 5 0.041, two-tailed). Fo median CNR (Fig. 5B) smoothness adjustment impoves eliability fo 10 of 12 measues (Wilcoxon Test, P 5 0.005, two-tailed). The notion of dopping one site to incease between-site eliability is obviously a dastic step, but in some cases may be waanted. At the least, an analysis of eliability with and without a site can be a poweful diagnostic tool. In Table VII, the effect of dopping one site at a time fom the vaiance components and between-site eliability calculations is compaed to the case whee all sites ae included. Fo this analysis, all the data come fom the BV ROI exclusively. Dopping SITE 2 fom the analysis inceased the eliability of median PSC fo 1.5T scannes fom 0.44 to 0.53. Dopping SITE 4 fom the analysis of median CNR fo 1.5T scannes inceased the eliability fom 0.72 to 0.81. Fo both measues at 3T, SITE 6 site was binging down the eliability. Fo CNR, the change was TABLE VII. The effect of dopping one site at a time on vaiance components and between-site eliability Site that was dopped Measue ROI extaction method Field Site Subject Site 3 subject Visit Residual Between-site ICC ALL_IN CNR Median 1.5 0.44 SITE 1 CNR Median 1.5 0.09635 0.13354 0.02948 0.06330 0.08511 0.39 SITE 2 CNR Median 1.5 0.03733 0.14157 0.02467 0.04876 0.06907 0.53 SITE 3 CNR Median 1.5 0.08097 0.15893 0.01838 0.06547 0.08711 0.46 SITE 4 CNR Median 1.5 0.09754 0.10948 0.02281 0.03093 0.07174 0.39 SITE 5 CNR Median 1.5 0.09937 0.16168 0.02832 0.06957 0.08726 0.42 ALL_IN PSC Median 1.5 0.72 SITE 1 PSC Median 1.5 0.00358 0.03336 0.00261 0.00601 0.00894 0.70 SITE 2 PSC Median 1.5 0.00456 0.03558 0.00000 0.00704 0.00946 0.72 SITE 3 PSC Median 1.5 0.00330 0.03821 0.00043 0.00844 0.00950 0.72 SITE 4 PSC Median 1.5 0.00000 0.03619 0.00103 0.00504 0.01013 0.81 SITE 5 PSC Median 1.5 0.00394 0.03598 0.00138 0.00929 0.01073 0.68 ALL_IN CNR Median 3 0.34 SITE 6 CNR Median 3 0.01004 0.40512 0.01930 0.15419 0.30249 0.61 SITE 7 CNR Median 3 0.67092 0.28861 0.00000 0.09184 0.27842 0.26 SITE 8 CNR Median 3 0.72608 0.35661 0.02849 0.14302 0.27573 0.27 SITE 9 CNR Median 3 0.54591 0.37612 0.10214 0.01753 0.25724 0.34 ALL_IN PSC Median 3 0.52 SITE 6 PSC Median 3 0.00347 0.05069 0.01885 0.01305 0.02654 0.55 SITE 7 PSC Median 3 0.03281 0.05497 0.00649 0.01502 0.02200 0.48 SITE 8 PSC Median 3 0.02891 0.05924 0.01550 0.00924 0.02194 0.50 SITE 9 PSC Median 3 0.04419 0.07350 0.00355 0.00743 0.02535 0.54 967

Fiedman et al. The bette the estimate of the PSC measue o CNR measue, the highe the eliability should be. One way to impove the estimates is to base them on aveages acoss uns. Such an analysis is pesented in Figue 7, fo test etest eliability. Thee ae six ROIs, two field stengths, and two measuement types (PSC and CNR) o 24 total analyses, all of which ae plotted in Figue 7. The six panels wee used to spead the cuves out fo visibility, and the souce data fo individual cuves ae not identified fo the pesent pupose. The points epesent eithe data fom un 1, the aveage of uns 1 and 2, the aveage of uns 1, 2, and 3 o the aveage of all fou uns. The mean cuve on the left plots pedicted values and standad eos fom a polynomial egession (F 5 25.6, df 5 1.7, 40, P 5 0.0001). As pedicted, test etest eliability inceases with aveaging, but not in evey single case. Reliability fo the aveage of fou uns is highe than eliability fo a single un (t 5 5.76, df 5 23, P 5 0.000004, paied t-test, onetailed). The aveage of fou uns demonstates highe eliability than the aveage of thee uns in 18 of 24 cases (t 5 2.55, df 5 23, P 5 0.009, paied t-test, one-tailed). Clealy one way to enhance test etest eliability is to incease the numbe of uns, and it seems likely that eliability will continue to incease beyond fou uns. The patten of inceasing eliability with inceasing the numbe of uns included in the aveage was not consistently appaent fo between-site eliability estimates. Concatenating Reliability Enhancing Steps (Fig. 8) Figue 7. Effect of inceasing numbe of uns. Relationship between the numbe of uns contibuting to the aveage estimate (abscissa) and test etest ICC fo 24 measues [two measuement types (PSC vs. CNR), six ROIs, and two field stengths). In the left most panel, the pedicted means and standad eos ae plotted fom a epeated measues polynomial contast model. The 24 cuves ae spead acoss six panels to enhance visibility of each cuve. The souce of the data in each cuve is not identified in the figue. quite maked (0.34 0.61), wheeas fo PSC the change was modest (0.52 0.55). The Effect of Numbe of Runs on Test Retest and Between-Site Reliability (Fig. 7) To illustate the effects of accumulating the benefits of seveal steps to impove between-site eliability, we began with the oiginal data fo median PSC fo 3T scannes (Fig. 8). In the fist step, SITE 6 site was dopped and this led to a substantial incease in between-site eliability. In the second step, we dilated the all of the ROIs as descibed above. This futhe inceased between-site eliability. In the final step, we adjusted fo smoothness diffeences between sites. This latte effect was almost unnoticeable except fo the BV ROI. The median initial eliability was 0.26 and the median final eliability was 0.58, a statistically significant (Wilcoxon, P 5 0.014, one-tailed) impovement of 123%. DISCUSSION A key goal of this epot was to assess test etest eliability and between-site eliability fo a multicente fmri study involving 10 sites and a obust sensoimoto activation task. Measues of eliability ae obtained fom a vaiance components analysis of the fmri activations fo the study in which five subjects visited each of 10 sites fo two visits (on consecutive days). In geneal, test etest eliability was high, but initially, between-site eliability was low. Seveal methods wee evaluated fo impoving betweensite eliability. By employing multiple methods, maked inceases in between-site eliability wee noted. Figue 8. Concatenating steps to impove eliability. Effect of a seies of steps, descibed in the text, on between-site eliability fo median PSC fom 3T scannes. 968

Test Retest and Between-Site Reliability Test Retest Reliability We epot high test etest eliability fo a 1 day inteval fo activations fom a obust sensoimoto task in seveal ROIs. This indicates that fmri tasks can be eliable when etested on subsequent days within a site, but this eliability level will likely depend highly on the obustness of the task and the numbe of uns studied. The pesent epot is simila in seveal espects to the ecent epot of Aon et al. [2006]. These authos compaed test etest eliability on a leaning task (without pactice effects) with a test etest inteval of 1 yea. The dependent measue used fo ICC calculation in Aon et al. [2006] was a measue of signal change (athe than CNR), and signal changes wee mean estimates fom ROIs. All of the ROI-based test etest ICCs Aon et al. [2006] epoted wee in the excellent ange. Kong et al. [2007] epoted test etest ICCs fo a unilateal finge tapping task, which included pimay moto and supplementay moto aeas. They employed a measue of signal magnitude and obtained an ICC fo the left moto aea of 0.68 quite simila to the ICCs fo the LM and RM ROIs epoted heein (Fig. 3). The test etest eliability fo the supplementay moto aea in the Kong et al. [2007] study was 0.51 somewhat lowe than we epot heein. As expected, an incease in the numbe of uns used in an aveage estimate was associated with incease in the test etest eliability. The effect of inceasing numbes of uns on the estimate of test etest eliability was geatest fo the diffeence between a single un and two uns, and less fo the addition of each addition un, but was still statistically significant fo the diffeence between thee uns and fou uns. Since we an only fou uns of the sensoimoto task, we could not empiically estimate the effect of moe uns. Howeve, the impovement in eliability with inceasing uns should incease as a function of 1/ N uns. Thus, it seems unlikely that significant impovement in eliability with inceasing numbes of uns will continue beyond 6 8 uns. On the othe hand, fatigue is accumulating though the scanning session, and it seems likely that attention and motivation will wane as moe and moe uns ae tested. Thus, we pedict that the empiical elationship between numbe of uns and eliability will peak at some point and then eithe plateau o actually decline. Theefoe, we ecommend that in peliminay studies pio to a majo multicente study, the fmri task be tested on many uns at one o moe sites, to detemine the optimal numbe of uns to enhance eliability. Pevious fmri eliability studies have epoted inceased eliability when moe data ae collected. Fo example, Genovese et al. [1997] epoted inceased test etest eliability within a scanning session as a function of inceasing the un duation (numbe of volumes collected). In a PET CBF study, Gabowski and Damasio [1996] epoted that the eplication ate fo activations inceased makedly when the analysis was based on two PET uns athe than a single un. Of paticula inteest is the epot by Maita et al., [2002], in which the estimates of eliability wee based on 2 12 eplications (1 un pe visit, 12 visits) of a finge-tapping task. They epot that the gain in eliability... is most ponounced when we move fom 2 to 3 eplications, and tapes off substantially at aound 5 o 6 eplications. Between Site Reliability It was hypothesized that a measue of pecent signal change (PSC) would poduce highe between-site eliability than a measue of contast-to-noise-atio (CNR). This hypothesis was based on the notion that noise would likely be highly vaiable acoss sites and this would lowe the eliability of the CNR measue but not the PSC measue. This was appaently not the case, since the compaison of between-site eliability based on PSC vesus CNR did not show a maked advantage fo eithe measue, and actually showed a slight advantage to CNR measues. This statement only applies to the methods of assessment of PSC and CNR tested heein. Thee ae numeous methods fo computing PSC and CNR that may poduce diffeent esults. Fo example, instead of PSC and CNR measues fom an FIR-deconvolution, one could have used PSC and CNR measues fom the egession of a pedetemined tempoal model [events convolved with a canonical hemodynamic esponse function (HRF)]. Futue eseach will be equied fo a compehensive compaison of diffeent measues of PSC and CNR. Cohen and Dubois [1999] compaed the stability of egession b-weights (a signal magnitude measue) to the stability of numbe of activated voxels and found the fome to be much moe stable than the latte. Howeve, high ICC values have been epoted fo t-values (a CNR measue) [Specht et al., 2003]. These authos employed an event-elated visual activation paadigm on two visits and computed within-site ICC estimates fo each voxel based on t-values. In thei attend condition, thee was vey high eliability fo t-values in the pimay visual cotex (ICCs > 8). We also hypothesized that the extaction of a median IRF fom each ROI would lead to moe eliable esults than the extaction of the maximum IRF. In many cases, data based on the maximum IRF was moe eliable than data based on the median IRF, and no clea winne was obvious. We also hypothesized that contolling fo site diffeences in the native smoothness of the images [Fiedman et al., 2006] would impove between-site eliability. This was based on the notion that CNR measues would be elated to image smoothness, and that scannes that poduced smoothe images would show highe CNR. This tuned out to be the case, since adjusting fo smoothness diffeences in CNR pio to between-site eliability estimation poduced a statistically significant incease in eliability. Futhemoe, adjusting fo smoothness diffeences in PSC also poduces a statistically significant incease in between-site eliability. The slope of the elationship between image smoothness and PSC was negative, indicating that the 969

Fiedman et al. smoothe the data, the lowe the PSC estimate. This could be the esult of spatial smoothness tending to ound off the peaks and toughs of the aw image data and thus leading to a educed PSC. Ou hypothesis that inceasing the ROI size would incease between-site eliability was bone out, paticulaly at 3T. Pehaps this is simply due to the fact that ou oiginal ROIs wee too small, even though they wee defined to include activations fom all sites. The maked beneficial effect of ROI dilation at 3T could be due to the established fact that spatial image distotions ae inceased at 3T compaed with 1.5T. In the pesent study, no B0-distotion coection pocedues wee applied fo the EPI acquisitions. Such coection pocedues should educe vaiance in the activation sites, especially at 3T. Futhemoe, diffeent spatial nomalization techniques may have diffeential effects on eliability. Futue FBIRN data acquisitions employ B0- coection pocedues, so this souce of unwanted vaiance should be minimized somewhat going fowad. The ROIs we studied eflected common activation acoss all scannes. Ou study examined souces of vaiation in signal magnitude fo voxels whee the P-value was consistently above a statistical theshold acoss sites. We chose this definition of an ROI because conjunction methods of deiving ROIs ae commonly used in fmri studies [Fiston et al., 1999; Quintana et al., 2003]. This ROI definition, based on multisite consistency, should poduce an elevated between-site eliability compaed to ROI definitions that ae not based on multisite consistency. Howeve, employment of this ROI without othe eliability enhancing pocedues poduced low eliability estimates (1.5T median: 0.22, 3T median: 0.25). Employment of ROIs that ae not defined in efeence to multisite activation consistency (e.g., atlas-based ROIs) ae likely to poduce eliability estimates even close to 0.0. The pesent study emphasizes the citical impotance of ROI definition on eliability. Clealy, follow-up studies which compae diffeent methods of ROI definition ae waanted. We also examined the notion that inceasing the numbe of uns of a task would incease the between-site eliability. Although such an effect was unequivocally demonstated fo test etest eliability, the effect of inceasing the numbe of uns on between-site eliability was not consistently obseved. Between-site eliabilities wee geneally poo initially, and the lack of effect of inceasing the numbe of uns may simply eflect the notion that uneliable entities ae not made moe eliable by epeated sampling. The notion of combining uns to incease accuacy only makes sense if thee ae no pactice effects fo the task unde consideation. A numbe of fmri studies have now documented pactice effects fo some tasks [fo eview, see Kelly and Gaavan, 2005]. If one is to gain the eliability incease associated with aveaging uns of a task, one should establish that pactice effects ae not an impotant chaacteistic of the task. The notion of dopping a site to impove between-site eliability is a dastic step. Nonetheless, the maked impovement in between-site eliability fo PSC at 3T afte dopping the SITE 6 site suggests that in the pesent case, this might be waanted. Regadless of whethe one chooses to actually dop a site, the dop-one-site analysis is a useful tool to highlight sites that ae paticulaly influential in inceasing o deceasing between-site eliability. In the case of SITE 6, we have shown in pevious publications [Fiedman and Glove, 2006; Fiedman et al., 2006] that this site had the weakest activations of any of the highfield scannes. Since we ae employing a eliability estimate that assesses absolute ageement, any site diffeence in PSC o CNR will lowe eliability. We have ecently been infomed that this site has been seveely impacted by envionmental noise fom a neaby subway tain fo yeas. Dopping SITE 2 fom the 1.5T scannes was also associated with a substantial incease in between-site eliability. As we have pointed out in an ealie epot [Fiedman et al., 2006], SITE 2 site employed a athe sevee k-space (apodization) filte and theefoe poduced unusually smooth images. This maked incease in smoothness would be expected to poduce makedly elevated CNR estimates. Removal of this site would homogenize CNR acoss sites and lead to geate between-site eliability. Removal of SITE 4 was associated with inceased eliability fo PSC at 1.5T. Futhe examination evealed that SITE 4 had the lowest PSC estimates of any 1.5T site fo fou of five ROIs. This was an olde Picke 1.5T scanne that has since been decommissioned and eplaced. This analysis emphasizes the impotance of fully evaluating the pefomance of each scanne pio to inclusion in multicente studies. A single unusual scanne can makedly affect between-site eliability estimates. Anothe goal of this epot was to illustate the usefulness of using within-site ICC estimates as a benchmak fo between-site ICC estimates. In the multicente context, between-site eliability has only subject vaiance in the numeato, wheeas test etest eliability has subject, site, and subject-by-site vaiance in the numeato. So, by definition, test etest eliability will always be highe than between-site eliability. Howeve, test etest eliability can be assessed in one o seveal unicente studies. If one accepts the goal that between-site eliability should be in the good (0.60 0.74) o excellent ange (above 0.75), it is pobably wise to assess test etest eliability at one o seveal sites pio to initiation of a multicente study. If test etest eliability is not in the excellent ange, it seems unlikely that between-site eliability will be in the good ange, due to the additional vaiance due to site and the subject-by-site inteaction. It seems likely that thee will always be some degadation of eliability due to changes in hadwae and setting. A concen fo the pesent study and fo futue studies of within-site o between-site eliability is the sample size of the eliability study. Seveal appoaches to pospectively estimating sample sizes fo eliability studies have been poposed [Chate, 1999; Giaudeau and May, 2001; Walte et al., 1998]. These appoaches elate sample sizes to the widths of the confidence intevals aound ICC esti- 970