IBM Research Report. Reliability for Networked Storage Nodes


J058 A0509-008 September 0, 2005
Computer Science

IBM Research Report

Reliability for Networked Storage Nodes

KK Rao, James L. Hafner, Richard A. Golding
IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, CA 95120-6099

Research Division: Almaden - Austin - Beijing - Haifa - India - T. J. Watson - Tokyo - Zurich

LIMITED DISTRIBUTION NOTICE: This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). Copies may be requested from IBM T. J. Watson Research Center, P. O. Box 218, Yorktown Heights, NY 10598 USA (email: reports@us.ibm.com). Some reports are available on the internet at http://domino.watson.ibm.com/library/cyberdig.nsf/home.

Abstract

High-end enterprise storage has traditionally consisted of monolithic systems with customized hardware, multiple redundant components and paths, and no single point of failure. Distributed storage systems realized through networked storage nodes offer several advantages over monolithic systems, such as lower cost and increased scalability. In order to achieve reliability goals associated with enterprise-class storage systems, redundancy will have to be distributed across the collection of nodes to tolerate node and drive failures. In this paper, we present alternatives for distributing this redundancy, and models to determine the reliability of such systems. We specify a reliability target and determine the configurations that meet this target. Further, we perform sensitivity analyses where selected parameters are varied to observe their effect on reliability.

1 Introduction

High-end enterprise storage systems currently deployed in production environments have traditionally been monolithic systems (so-called "big iron") with several symmetrical multiprocessors, multiple internal fabrics, large cache memories and no single point of failure. These systems are expensive, requiring customized hardware and multiple redundant components and paths to ensure that there is no single point of failure. In contrast, achieving scalability through distributed storage systems is becoming increasingly popular in research and development and, to some extent, in commercial deployments. A significant aspect of distributed systems is the ability to use common building blocks across a wide range of storage requirements: from a few terabytes to the scale of petabytes. This translates into several advantages: lower cost due to economies of scale, reduced number of inventory types, commonality of software across the product line, and so on. The distributed storage system in this paper is modeled after the Collective Intelligent Bricks project in IBM Research [5].
The storage system consists of several bricks, where each brick (or node) is a sealed unit consisting of a controller, power supply, networking interfaces and disk drives. Several components in the node represent single points of failure. In order to achieve reliability goals associated with high-end enterprise-class storage systems, redundancy has to be distributed across the collection of nodes to tolerate drive and node failure. In this paper, we will model the reliability of such a system and look at the alternatives for distributing redundancy between the nodes in order to meet reliability goals of large-scale enterprise systems.

The goal of this paper is to view redundancy requirements from a storage viewpoint. We assume that there is enough redundancy in switches and links so that reliability is limited by storage nodes and drives; that is, the interconnect fabric and topology are not a constraining factor in determining the overall reliability of the system. This is typically the case with [5].

We describe the different configurations for achieving reliability in distributed storage systems in Section 3. In Section 4, we describe the models used to obtain reliability for these configurations. The implication of distributing data across such a system and its impact on reliability is presented in Section 5. Section 6 presents a baseline reliability analysis. We analyze the sensitivity of the reliability of some of the configurations to several parameters in Section 7.

2 Related Work

Trivedi [6] covers reliability analysis and, in particular, the use of continuous-time Markov chains with absorbing states for crash failures. The modeling and analyses presented in this paper are based on this work. Xin et al. [7] present reliability for large distributed systems but do not consider node failures. Also, while uncorrectable sector errors are dealt with through a scheme of signatures, the reliability improvements through the use of this scheme are not characterized.
Snappy Disk and Petal [4] represent shared-disk, shared-metadata systems and partitioned-disk, partitioned-metadata systems respectively. The availability analysis presented in this paper is intended only to gain insights into the factors affecting availability rather than to derive accurate predictions. Frølund et al. [] describe an erasure coding algorithm in the context of a distributed storage system composed of inexpensive bricks, and Goodson et al. [] describe erasure-coded storage that tolerates Byzantine failures. Both these papers focus on the algorithms for erasure coding for distributed storage nodes and do not address the reliability analysis needed to ensure that erasure coded distributed storage will meet required reliability goals.

3 Redundancy Configurations

As mentioned earlier, a node consists of a controller card, network interfaces, a collection of disk drives and associated power supplies. Apart from the disk drives and the network interfaces, the other major components are not duplicated. Thus, the node is inherently unreliable, as the failure of any one of these components will result in node failure. Therefore, in order to build a highly reliable storage system out of a collection of such nodes, redundancy will have to be distributed through the collection.

We look at two dimensions to realizing redundancy in the collection of nodes: redundancy within nodes to tolerate internal drive failures, and redundancy across nodes to tolerate entire node failures. Within the nodes, we employ three possible configurations: no internal RAID, RAID 5 and RAID 6, which tolerate 0, 1 and 2 drive failures respectively. We achieve redundancy across nodes by applying three types of erasure codes between them: codes that can tolerate 1, 2 and 3 node failures respectively. The three node configurations and three erasure code types between nodes yield a total of 9 combinations.

We assume that the nodes in this system are enclosed entities that are not amenable to service actions. This implies a fail-in-place philosophy where failed components within a node are not replaced. Specifically, in the case of failure of one or more disk drives within a node, the node will continue to operate with a reduced set of disks until either all disks fail or some other critical component fails, rendering the node unusable. For nodes with internal RAID (RAID 5 or RAID 6), we assume that on a drive failure, data is re-striped, removing the failed drive from the array and thereby restoring redundancy at the end of this operation. The resulting loss in capacity is adjusted against the spare capacity, as described below.
The fail-in-place service model implies that, initially, storage capacity is over-provisioned so that loss in capacity with subsequent failures can be tolerated. The over-provisioned storage capacity is either sufficient to deal with expected failures over the operational life of the installation, or spare nodes are added at appropriate times (e.g., when overall capacity utilization increases above predetermined thresholds).

4 Reliability Models

In this reliability analysis, we are primarily interested in preventing data loss. Consequently, to compare redundancy configurations, we use the expected number of data loss events per unit time as a measure of reliability. We believe this metric is easier to comprehend and relate to than the more traditional Mean Time to Data Loss (MTTDL). We use Markov models to determine MTTDL, and use it to obtain the expected number of data loss events per year.

We look at three types of failures that lead to data loss: an uncorrectable read error from a disk drive, a failure of a disk drive and a failure of a node. A data loss event occurs when the above failures occur in a combination that cannot be handled by the data protection scheme used in the system. For example, a controller with a RAID 5 array can tolerate a single failure (a disk failure or an uncorrectable read error). When a drive in the RAID 5 array fails and the array is rebuilding to a spare or a replacement drive, if either a second drive fails or an uncorrectable read error occurs on any of the remaining drives, this results in a data loss event. Clearly, the failure of the second drive results in data loss of a much larger scale than the uncorrectable read error, but either failure results in data loss of some magnitude.

With respect to an uncorrectable read error, we assume it can result in a data loss event only if the array is in a critical state and cannot tolerate any further errors.
We believe this is a reasonable assumption because, as long as the array has not lost any drive, recovery from an uncorrectable read error just requires reading from the remaining drives and regenerating the data item that encountered the read error. This recovery can fail if another element in the same stripe encounters an uncorrectable read error, or if another drive fails during the recovery. We believe that both these conditions are extremely low probability occurrences and can be ignored.

To describe our modeling methodology, we illustrate the technique for a RAID 5 disk array. The following figure shows the Markov model for a RAID 5 array with mean time to failure of the disk drives (MTTF) and mean time to repair (rebuild) a drive failure (MTTR).

[Figure: Markov Model for a RAID 5 array]

State 0 is when the array is fully operational. State 1 corresponds to a drive failure that will not experience an uncorrectable error during the rebuild. State 2 represents a data loss state, either due to a second drive failure or due to an uncorrectable error during rebuild. The parameters are:

- number of drives in the array
- drive failure rate (1/MTTF)
- drive rebuild rate (1/MTTR)
- probability of an uncorrectable error during rebuild (an expression in C and HER)
- C: drive capacity
- HER: hard error rate, expressed in hard errors per number of bytes read

The methodology to solve a Markov model with absorbing states is described in [6]. Typically, the drive rebuild rate is much greater than the drive failure rate. Solving this model for MTTDL gives:

[equation not recovered; it involves C · HER]

4.1 Node Set and Redundancy Set

We introduce the concepts of node set and redundancy set for a storage system made up of networked storage nodes, as shown in the accompanying figure. Data objects are stored across multiple nodes in such a system in order to meet requirements such as performance and reliability (the focus of this paper). Each data object constitutes exactly one stripe of data; that is, the redundancy elements (parity) can be computed entirely from this data. For a given data object, the set of nodes that contain the data and its corresponding redundancy (parity) elements constitutes a redundancy set. The node set is the set of all the nodes in the storage system.

We assume that data is evenly distributed across all the nodes in the storage system. Thus, each node has one or more redundancy set relationships with every other node in the node set. The total number of redundancy sets of size R in a node set of size N is given by:

$$\binom{N}{R}$$

The even distribution of data implies that the failure domain is the entire node set and not just individual redundancy sets. For example, in a redundancy scheme that tolerates only a single failure, when such a failure has occurred and is being recovered from, a failure of any second node in the node set will result in data loss.
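As an illustration of the Markov approach described for the RAID 5 array earlier in this section, the following minimal sketch computes the mean time to absorption (that is, the MTTDL) of a small continuous-time Markov chain. The symbols `d`, `lam`, `mu` and `h`, the transition rates, and the closed form used for cross-checking are assumptions made for this example: they are one plausible reading of the three-state model described in the text, not the report's own expressions.

```python
from typing import Dict, List, Tuple


def mean_time_to_absorption(rates: Dict[Tuple[int, int], float],
                            transient: List[int], start: int) -> float:
    """Expected time to reach any absorbing state, starting from `start`.

    rates[(i, j)] is the transition rate from state i to state j.
    For each transient state i: out_rate(i) * T_i - sum_j rate(i, j) * T_j = 1,
    with T_j = 0 for absorbing states.
    """
    n = len(transient)
    index = {s: k for k, s in enumerate(transient)}
    # Augmented matrix [A | b] of the linear system A T = b.
    a = [[0.0] * (n + 1) for _ in range(n)]
    for (i, j), rate in rates.items():
        if i not in index:
            continue
        a[index[i]][index[i]] += rate
        if j in index:
            a[index[i]][index[j]] -= rate
    for row in a:
        row[n] = 1.0
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                for c in range(col, n + 1):
                    a[r][c] -= factor * a[col][c]
    k = index[start]
    return a[k][n] / a[k][k]


def raid5_mttdl(d: int, lam: float, mu: float, h: float) -> float:
    """MTTDL of the assumed 3-state RAID 5 chain (state 2 is data loss)."""
    rates = {
        (0, 1): d * lam * (1.0 - h),  # drive fails, rebuild will be clean
        (0, 2): d * lam * h,          # drive fails, rebuild hits a hard error
        (1, 0): mu,                   # rebuild completes
        (1, 2): (d - 1) * lam,        # second drive fails during rebuild
    }
    return mean_time_to_absorption(rates, transient=[0, 1], start=0)


def raid5_mttdl_closed_form(d: int, lam: float, mu: float, h: float) -> float:
    """Closed form for the same assumed chain, derived by hand."""
    a = (d - 1) * lam + mu
    return (a + (1.0 - h) * d * lam) / (d * lam * ((d - 1) * lam + h * mu))
```

When the rebuild rate is much larger than the failure rate, the closed form reduces to approximately `mu / (d * lam * ((d - 1) * lam + h * mu))`, which shows how the hard error probability `h` can dominate the second-failure term.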
[Figure: Node Sets and Redundancy Sets, showing a node, the node set and redundancy sets]

We present the modeling for systems where the nodes have internal RAID in Section 4.2, and the modeling for nodes without internal RAID in Section 4.3.

4.2 Nodes with Internal RAID

For a system in which the nodes have internal RAID, we use hierarchical Markov models to obtain MTTDL. We represent the RAID array internal to a node in a Markov model and obtain array failure rates from it. We then use these failure rates in a higher level Markov model representing the redundancy arrangement between nodes. Note that, because we assume that the nodes are not amenable to service actions and that on a drive failure the array is re-striped to remove the failed drive, the repair rate depicted in the RAID 5 Markov model is the array re-stripe rate and not the array rebuild rate. We already obtained the MTTDL for a RAID 5 array as:

[equation not recovered: MTTDL for a RAID 5 array, involving C · HER]

We define array failure as the failure of disk drives beyond the fault tolerance provided by the RAID scheme. From the above, we obtain $\lambda_D$, the rate of array failure, and $\lambda_S$, the rate of a sector error during a re-stripe. These are:

[equations not recovered: $\lambda_D$ and $\lambda_S$ for RAID 5, involving C · HER]

Figure 4 shows the Markov model for a RAID 6 array.

[Figure 4: Markov Model for a RAID 6 array]

State 0 is when the array is fully operational; state 1 is when a single drive has failed; state 2 corresponds to a second drive failure that will not experience an uncorrectable read error during rebuild; and state 3 represents a data loss state, either due to triple drive failure or an uncorrectable error when rebuilding with 2 drives failed. Solving this model for MTTDL gives:

[equation not recovered; it involves C · HER]

Correspondingly, we obtain $\lambda_D$ and $\lambda_S$ for RAID 6:

[equations not recovered: $\lambda_D$ and $\lambda_S$ for RAID 6, involving C · HER]

We use these rates in the higher level model for the erasure codes between nodes. Figure 5 shows the Markov model for nodes with internal RAID (either RAID 5 or RAID 6) and a redundancy arrangement with a fault tolerance of 1 between nodes.

[Figure 5: Markov Model for Fault Tolerance 1; Nodes with Internal RAID]

Here N is the number of nodes in the node set, and the other two parameters are the node failure rate and the node rebuild rate. The array failure rate, $\lambda_D$, and the rate of sector error during a re-stripe, $\lambda_S$, correspond to the internal RAID in the nodes. State 0 is when the storage system is fully operational. The system transitions to state 1 when either a node fails or a node experiences an array failure. In this state, the data of this node is rebuilt on the remaining nodes in the node set. This is described in Section 5.1. State 2 represents the data loss state caused by a second node or array failure, or by a sector error during an internal RAID re-stripe while the node rebuild is in progress. The MTTDL for this scheme (internal RAID, node fault tolerance 1) is given by:

[equation not recovered]

[Figure 6: Markov Model for Fault Tolerance 2; Nodes with Internal RAID]

Figure 6 shows the Markov model for nodes with internal RAID and an erasure code with a fault tolerance of 2 between nodes.
As can be seen above, this scheme tolerates two failures; a third failure during the node rebuild operation results in a data loss event (State 3). We explain the critical-set factor that appears in this model, and the corresponding factor for fault tolerance 3, in Section 5.2.1. The MTTDL for this scheme (internal RAID, node fault tolerance 2) is:

[equation not recovered: MTTDL for internal RAID, node fault tolerance 2]

[Figure 7: Markov Model for Fault Tolerance 3; Nodes with Internal RAID]

Figure 7 shows the Markov model for nodes with internal RAID and an erasure code with a fault tolerance of 3 between nodes. This scheme tolerates three failures; a fourth failure during the node rebuild operation results in a data loss event (State 4). The MTTDL for this scheme (internal RAID, node fault tolerance 3) is:

[equation not recovered]

4.3 Nodes without Internal RAID

In configurations for nodes without internal RAID, individual drives within each node are used to realize the erasure code between nodes. We assume that no more than one drive per node is used in each redundancy set, that is, each block of a data stripe is on a different node; thus, each node failure causes only a single erasure on each redundancy set. Figure 8 shows the Markov model for nodes without internal RAID and an erasure code of fault tolerance 1 between nodes.

[Figure 8: Markov Model for Fault Tolerance 1; Nodes without Internal RAID]

Although only a few new parameters are used in the above model, we list all the parameters:

- N: node set size
- drives per node
- node failure rate
- drive failure rate
- node rebuild rate
- drive rebuild rate
- probability of an uncorrectable error during node rebuild (an expression in C and HER)
- probability of an uncorrectable error during drive rebuild (an expression in C and HER)
- R: redundancy set size
- C: drive capacity
- HER: hard error rate, expressed in hard errors per number of bytes read

State 0 is when the system is fully operational. State 1 corresponds to a node failure that will not experience an uncorrectable error during node rebuild. State 2 corresponds to a drive failure that will not experience an uncorrectable error during drive rebuild. State 3 represents a data loss state, either due to a second node or drive failure or due to an uncorrectable error during rebuild. The MTTDL for this scheme (no internal RAID, node fault tolerance 1) is:

[equation not recovered]

[Figure 9: Markov Model for Fault Tolerance 2; Nodes without Internal RAID]

[Figure 10: Markov Model for Fault Tolerance 3; Nodes without Internal RAID]

Figure 9 shows the Markov model for nodes without internal RAID and an erasure code of fault tolerance 2 between nodes. The parameters with subscript xy are probabilities of encountering an uncorrectable error during a second node or drive rebuild (y = N or d respectively), after an initial node or drive failure (x = N or d respectively). We show how these parameters are determined in Section 5.2.2. Figure 10 shows the Markov model for nodes without internal RAID and an erasure code of fault tolerance 3 between nodes.

As can be seen, the Markov models for nodes without internal RAID become increasingly complex as the fault tolerance increases. This is because, without internal RAID, a drive failure state is distinct from a node failure state, and these states multiply as the fault tolerance increases. Consequently, using conventional techniques to obtain a parameterized closed form solution for these higher levels of fault tolerance is not practical. However, by comparing Figures 8, 9 and 10, we observe similarities. For instance, the state transitions in Figure 8 are represented in two subsets in Figure 9 (states 1, 2, 3 and 7; and states 4, 5, 6 and 7). Similarly, Figure 9 itself is represented in two subsets in Figure 10. From these observations, it can be seen that a recursive method can be developed to solve these Markov models. In the appendix, we describe a recursive method to obtain a closed form solution for nodes without internal RAID with arbitrary fault tolerance across nodes. The MTTDL for the last two schemes will be shown in Section 5.2, following the explanation of the parameters.

5 Implications of Distributed Data

5.1 Node Rebuild Time

We mentioned earlier that the fail-in-place service model implies that the set of nodes is over-provisioned with spare capacity to deal with subsequent failures that will result in a loss of usable

capacity. This model, coupled with the even distribution of data, implies that spare capacity is also evenly distributed among the nodes. Thus, when a node fails, the data on the failed node is rebuilt by all the remaining nodes, utilizing their spare capacity. Similarly, in configurations without internal RAID, when a drive fails, the data on the failed drive is rebuilt on all the remaining drives. This is not the case for nodes with internal RAID: a drive failure results in a re-striping operation, removing the failed drive from the array and restoring redundancy.

Rebuild time, and hence the rebuild rate, is a key component in the expressions for MTTDL. We describe a model to determine rebuild time accurately. Our model of rebuild time is based on the amount of data that is transferred during a rebuild. We assume that in a rebuild, the destination node receives all the required redundancy data and performs the necessary exclusive-OR or equivalent operations to generate the data it will write on its drives. For a node set size of N, a redundancy set size of R and a fault tolerance of t, we express the following amounts of data in units of a node's worth of data (the expressions themselves, functions of N, R and t, are not recovered):

- the number of nodes involved in the rebuild of one lost data object
- the amount of data rebuilt by each node
- the amount of data received by each node from the other nodes to rebuild the above
- the total data received by all the nodes, which equals the total data sourced by all nodes
- the total data sourced by each node

The effective rebuild time will be the maximum time required to move data in and out of nodes, to and from disks, and through the interconnecting network, depending on where the bottleneck lies. Hence, the quantities of interest are the total data in and out of a node, the total data to and from the disks in a node, and the total data flowing in the interconnecting network.

5.2 Scope of Sector Error

[Figure 11: Critical Redundancy Sets, showing redundancy sets and the critical redundancy set]

We stated earlier that we assume that an uncorrectable read error causes a data loss event only when the redundancy set is in a critical state.
The even distribution of data across all the nodes implies that, for fault tolerance 2 or higher, when a redundancy set is critical, only a portion of a node's data (or a drive's data in the case of no internal RAID) is critical. This is illustrated in Figure 11. Let us assume that we have an erasure code of fault tolerance 2 between nodes and that the nodes have internal RAID. The X's indicate failed nodes. Each failed node is a part of two redundancy sets, one shared with the other failed node and one otherwise independent. However, only the shared set is critical; the other has lost one node but can tolerate a second loss.

5.2.1 Nodes with Internal RAID

The fractions of redundancy sets that are critical, and hence can contribute to a sector loss, are represented by the corresponding factors in the MTTDL expressions for internal RAID, fault tolerance 2 and 3 respectively. Each node is a part of $\binom{N-1}{R-1}$ redundancy sets. Thus:

[equations not recovered: critical-set factors for fault tolerance 2 and 3]
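The critical fractions discussed in this section can be illustrated with a short counting sketch. It assumes, as the text does, that data is spread evenly over every possible redundancy set of size R in a node set of size N. The function names and the resulting ratios are assumptions made for illustration; they are derived here by counting and are not taken from the report's expressions.

```python
from fractions import Fraction
from math import comb


def sets_per_node(n: int, r: int) -> int:
    """Number of size-r redundancy sets that contain one given node."""
    return comb(n - 1, r - 1)


def critical_fraction(n: int, r: int, failed: int) -> Fraction:
    """Fraction of one failed node's redundancy sets that contain all
    `failed` failed nodes.

    When the fault tolerance equals `failed`, these are the sets that
    cannot tolerate any further error, i.e. the critical sets.
    """
    # Sets containing all failed nodes: choose the remaining r - failed
    # members from the n - failed surviving nodes.
    shared = comb(n - failed, r - failed)
    return Fraction(shared, sets_per_node(n, r))


# Example with N = 64 and R = 8, the node set and redundancy set sizes
# quoted in the baseline section.
two_failed = critical_fraction(64, 8, 2)    # equals (R-1)/(N-1)
three_failed = critical_fraction(64, 8, 3)  # equals (R-1)(R-2)/((N-1)(N-2))
```

Under these assumptions only a small share of a failed node's data is critical, which is why a sector error during a critical rebuild is weighted by such a fraction rather than by the whole node.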

[Figure 12: MTTDL for No Internal RAID, Node Fault Tolerance 2 and 3; the two expressions, which involve C · HER, are not recovered]

5.2.2 Nodes without Internal RAID

For nodes without internal RAID, we use subscripted terms to represent probabilities of encountering uncorrectable sector errors during critical rebuilds. These probabilities depend on the amount of critical data that must be read for a rebuild operation, which in turn is derived from critical redundancy sets. Unlike nodes with internal RAID, redundancy sets may be critical because of combinations of node and drive failures. The combinations and corresponding fractions of critical redundancy sets for fault tolerance 2 are (the fractions themselves are not recovered):

- two nodes: a fraction of a node
- two drives: a fraction of a drive
- a drive and a node: a fraction of a drive

The probability of encountering a hard error while rebuilding a drive if the entire drive is critical is HER · C. The subscripted probabilities for fault tolerance 2 are then expressed in terms of this value (expressions not recovered).

Similarly, the combinations and corresponding fractions of critical redundancy sets for fault tolerance 3 are:

- three nodes: a fraction of a node
- two nodes and a drive: a fraction of a drive
- two drives and a node: a fraction of a drive
- three drives: a fraction of a drive
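To give a feel for the HER · C product quoted above, the sketch below evaluates it for a hypothetical drive. The capacity and error rate are illustrative assumptions rather than the report's baseline values, and the error rate is taken per bit read, which is also an assumption. The exact form 1 - (1 - HER)^bits is shown alongside the linear approximation to indicate how close the two are when the product is small.

```python
import math


def hard_error_probability(capacity_bytes: float, her_per_bit: float) -> float:
    """Linear approximation HER * C: expected hard errors over a full read."""
    return capacity_bytes * 8 * her_per_bit


def hard_error_probability_exact(capacity_bytes: float,
                                 her_per_bit: float) -> float:
    """Probability of at least one hard error while reading the whole drive."""
    bits = capacity_bytes * 8
    # 1 - (1 - p)^bits, computed in a numerically stable way for tiny p.
    return -math.expm1(bits * math.log1p(-her_per_bit))


# Hypothetical drive: 300 GB capacity, one hard error per 1e14 bits read.
approx = hard_error_probability(300e9, 1e-14)
exact = hard_error_probability_exact(300e9, 1e-14)
```

For values of this order the linear product is a close upper bound on the exact probability, which is presumably why the simpler form is used throughout the models.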

The probability of encountering a hard error while rebuilding a drive if the entire drive is critical is C · HER. The subscripted probabilities for fault tolerance 3 are expressed in terms of this value (expressions not recovered). We use these parameters to solve the Markov models and obtain the corresponding MTTDLs, which are shown in Figure 12. A general solution for arbitrary fault tolerance is described in the appendix.

6 Baseline Reliability

We use the closed form solutions for the MTTDL for the various configurations and determine baseline reliability using the parameters defined below. We assume that desktop/ATA drives are used in the nodes.

- Node MTTF: 400,000 hours
- Drive MTTF: 00,000 hours
- HER (drive hard error rate): 1 sector in 10^14 bits read
- C (drive capacity): 00 GB
- Maximum drive throughput: 50 I/O operations/sec.
- Drive sustained transfer rate (average): 40 MB/sec.
- Node set size: 64
- Redundancy set size: 8
- Drives per node: [value missing]
- Re-stripe command size: [value missing] MB
- Rebuild command size: 8 KB
- Link speed: 0 Gbps (800 MB/sec. sustained)
- Capacity utilization: 75%
- Bandwidth utilization for rebuild, re-stripe: 0%

The link speed needs clarification. The rebuild performance depends on the total rate at which data can move in and out of the node over all links. We assume that nodes are physically sealed units shaped like cubes and are stacked together to build larger three-dimensional structures. Nodes communicate with adjacent nodes through links on each of their six surfaces. [] has more information on effective bandwidth of such structures.

We specify the reliability target in terms of data loss events per PB-year. We view reliability from a manufacturer's perspective and choose a target that tracks the field population of such storage systems. We set a reliability target that a field population of 00 systems, each with a petabyte of logical capacity, will experience less than one data loss event in 5 years. This translates to less than x 0 - data loss events per PB-year.

[Figure 13: Baseline Comparison. Data loss events per PB-year (log scale) versus node fault tolerance (1, 2, 3) for no internal RAID, internal RAID 5 and internal RAID 6]
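The arithmetic behind a target of this kind can be sketched as follows. The 5-year horizon and the one petabyte per system come from the text; the population size of 100 used in the example is an illustrative assumption, since the figure is incomplete in the text. The conversion from MTTDL to events per year uses the usual approximation that the event rate is the reciprocal of MTTDL, which is reasonable when MTTDL is much longer than the period considered.

```python
HOURS_PER_YEAR = 8760.0


def events_per_pb_year(mttdl_hours: float, capacity_pb: float) -> float:
    """Expected data loss events per PB-year for one system.

    Assumes the event rate is approximately 1 / MTTDL.
    """
    events_per_year = HOURS_PER_YEAR / mttdl_hours
    return events_per_year / capacity_pb


def target_events_per_pb_year(systems: int, pb_per_system: float,
                              years: float) -> float:
    """Largest rate for which the whole field population is expected to
    see less than one data loss event over the given number of years."""
    return 1.0 / (systems * pb_per_system * years)


# Hypothetical population of 100 one-petabyte systems over 5 years.
target = target_events_per_pb_year(100, 1.0, 5.0)

# A one-petabyte system meets this target only if its MTTDL exceeds
# HOURS_PER_YEAR / target hours.
required_mttdl_hours = HOURS_PER_YEAR / target
```

Expressing both the model output and the target per PB-year makes configurations of different capacity directly comparable.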
AID 6 Figure sows a baseline comparison of te 9 configurations using te parameters efine above. We observe te following:. Configurations wit noe fault tolerance of o not meet our reliability target.. Tere is no significant ifference between internal AID 5 an internal AID 6 especially for fault tolerance or iger. We will iscuss wy tis is te case in Section 8.. At fault tolerance, te internal AID configurations excee te target by 5 orers of magnitue. Base on te above observations, we will not consier configurations wit fault tolerance of between noes. Also, for configurations wit internal AID, we will only use AID 5 as AID 6 oes not provie any avantage. Furter, we will not inclue te configuration at fault tolerance, internal AID in te sensitivity analyses item. above. Tis will leave us wit tree configurations for sensitivity analyses: Fault Tolerance witout internal AID, Fault Tolerance wit internal AID 5, an Fault Tolerance witout internal AID. 7 Sensitivity Analyses We will perform sensitivity analyses of te reliability to te following parameters: rive MTTF, noe

MTTF, rebuil bloc size, lin spee, noe set size, reunancy set size, an rives per noe. As we vary tese parameters one at a time, we will eep all te oter parameters at teir baseline level, except for rive an noe MTTF. For te latter two, we will use two values, one at eac en of a practical range as sown ere: Drive MTTF ours: low 00,000; ig 750,000; oe MTTF ours: low 00,000; ig,000,000. Data Loss Events per PB-YEar.00E00.00E-0.00E-0.00E-0.00E-04.00E-05.00E-06.00E-07.00E-08.00E-09 00,000 50,000 500,000 750,000 Drive MTTF,000,000,50,000 Figure 4: Sensitivity to Drive MTTF FT o AID FT o AID FT AID 5 FT AID 5 FT o AID FT o AID Figure 4 sows te sensitivity to is rive MTTF. We observe tat te configuration at fault tolerance, no internal AID oes not meet te target at all for low noe MTTF, an marginally meets it for ig noe MTTF. Te oter two configurations excee te target some more comfortably tan te oters over te entire range. FT, Internal AID 5 appears to be relatively insensitive to rive MTTF, especially for low noe MTTF clearly, it is limite by noe MTTF an provies anoter view wy AID 6, wic protects from a furter rive failure, oes not offer any avantage. o Internal AID again oes not meet te target for te most part. Te rebuil bloc size affects te noe an te rive rebuil rate, an respectively. As we saw in Sections 4 an 5, tese are ey parameters for te MTTDL. From Figure 6, it can be seen tat te rebuil bloc size affects te reliability significantly. FT, o Internal AID oes not meet te target for low MTTF. Te oter two configurations meet te target if te rebuil bloc size is 64 KB or larger. Data Loss Events per PB-Year.00E0.00E0.00E00.00E-0.00E-0.00E-0.00E-04.00E-05.00E-06.00E-07.00E-08.00E-09 6 64 8 56 04 ebuil Bloc Size KB Figure 6: Sensitivity to ebuil Bloc Size FT o AID FT o AID FT AID 5 FT AID 5 FT o AID FT o AID Te rebuil rate is etermine by te slower of te ata transfers across te networ between noes or witin a noe to an from te is rives. 
Wit te parameters as efine rives per noe, 50 I/O operations/secon, an so on, te rebuil rate is constraine by te lin spee up to aroun Gb/s beyon wic it is constraine by te is rives. Tis can be seen in Figure 7 wic sows sensitivity to lin spee at points, 5 an 0 Gb/s. Tere is no ifference in reliability between te last two points. Data Loss Events per PB-Year.00E00.00E-0.00E-0.00E-0.00E-04.00E-05.00E-06.00E-07.00E-08.00E-09 00,000 50,000 500,000 750,000 oe MTTF,000,000,50,000 Figure 5: Sensitivity to oe MTTF FT o AID FT o AID FT AID 5 FT AID 5 FT o AID FT o AID Te sensitivity to noe MTTF is sown in Figure 5. FT, Internal AID 5 sows te most sensitivity to noe MTTF an all tree configurations sow increase sensitivity wit ig rive MTTF. FT, Data Loss Events per PB-Year.00E00.00E-0.00E-0.00E-0.00E-04.00E-05.00E-06.00E-07.00E-08.00E-09 5 0 Lin Spee Gb/s Figure 7: Sensitivity to Lin Spee FT o AID FT o AID FT AID 5 FT AID 5 FT o AID FT o AID We now loo at sensitivity to te configurable parameters noe set size, reunancy set size an rives per noe. Figure 8 sows te sensitivity to

noe set size. As can be seen, FT, o Internal AID sows some sensitivity to te noe set size, but te oter two configurations are relatively insensitive to it. Te sensitivity to reunancy set size is sown in Figure 9. It can be seen tat all configurations appear to become less reliable as te reunancy set size increases, wit about an orer of magnitue ifference between te extremes. Data Loss Events per PB-Year.00E00.00E-0.00E-0.00E-0.00E-04.00E-05.00E-06.00E-07.00E-08.00E-09 7 64 5 000 oe Set Size Figure 8: Sensitivity to oe Set Size Data Loss Events per PB-Year.00E0.00E00.00E-0.00E-0.00E-0.00E-04.00E-05.00E-06.00E-07.00E-08.00E-09 8 6 0 eunancy Set Size FT o AID FT o AID FT AID 5 Low MTTF FT AID 5 Hig MTTF FT o AID FT o AID FT o AID FT o AID FT AID 5 FT AID 5 FT o AID FT o AID Figure 9: Sensitivity to eunancy Set Size Data Loss Events per PB-Year.00E00.00E-0.00E-0.00E-0.00E-04.00E-05.00E-06.00E-07.00E-08.00E-09 4 8 6 Drives/oe Figure 0: Sensitivity to Drives per oe FT o AID FT o AID FT AID 5 FT AID 5 FT o AID FT o AID From Figure 0, it can be seen tat tere is very little sensitivity to te number of rives per noe. It soul be note tat we are measuring normalize reliability ata loss events per PB-Year. As a result, wit some parameters suc as rives per noe, tere is a cancellation effect. Increasing te number of rives in a noe can result in ecrease reliability per noe owever, fewer suc noes will be require to yiel a petabyte. 8 Discussion Te baseline reliability analysis in Section 6 sowe tat AID 6 oes not offer any avantage over AID 5 wen use internal to networe storage noes. Tis is because te reliability of a networe storage system as a wole is affecte by bot rive an noe failures. Wen AID 5 is use internally, te effect of rive failures is consierably minimize suc tat te susceptibility to noe failures becomes a ominant factor. Proviing furter tolerance to rive failures by using AID 6 oes not alleviate te susceptibility to noe failures. 
It is interesting to note that we need to obtain a balance of protection against both drive and node failures: increasing the protection for one without correspondingly increasing it for the other does not result in an overall increase in reliability.

The sensitivity analyses in Section 7 reveal interesting results. Firstly, we see that there is very little sensitivity to the configurable size parameters node set size and drives per node, and a little more pronounced sensitivity to redundancy set size. We alluded to the reason for the insensitivity to drives per node earlier. Similar arguments apply to the node set size. In the latter case, there is an additional factor: even though increasing the node set size increases the size of the failure domain, the fraction of critical redundancy sets decreases.

We also see that the reliability is constrained by disk drive bandwidth rather than network bandwidth if the link speed is […] Gb/s or higher, resulting in no change in reliability at higher link speeds. By using drive bandwidth more efficiently through the use of larger rebuild block sizes, we see significant improvements in reliability. In fact, the rebuild block size is the controllable parameter with the most significant impact on reliability. In contrast, drive and node MTTF are not easily controllable. Industry experience has indicated that drive MTTF can vary significantly between batches of drives, and the same can be expected of nodes.
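The interplay between link speed and rebuild block size described above can be sketched with a simple bottleneck model. This is an illustrative assumption rather than the report's own rebuild model: each drive is taken to deliver the smaller of (I/O operations per second times block size) and its sustained transfer rate, the node is limited by the slower of its aggregate drive bandwidth and its link bandwidth, and only a fixed fraction of that bandwidth is given to rebuild. All numeric values are hypothetical.

```python
def drive_rate_mb_s(iops, block_kb, sustained_mb_s):
    """Per-drive bandwidth: IOPS-bound for small blocks, capped by the sustained rate."""
    return min(iops * block_kb / 1024.0, sustained_mb_s)


def rebuild_rate_mb_s(link_mb_s, drives, iops, block_kb, sustained_mb_s, rebuild_fraction):
    """Effective rebuild bandwidth: slower of network and drives, times the rebuild share."""
    disk_mb_s = drives * drive_rate_mb_s(iops, block_kb, sustained_mb_s)
    return rebuild_fraction * min(link_mb_s, disk_mb_s)


# Hypothetical node: 12 drives, 150 IOPS and 40 MB/s per drive, 10% of bandwidth for rebuild.
NODE = dict(drives=12, iops=150, sustained_mb_s=40.0, rebuild_fraction=0.10)

small_blocks = rebuild_rate_mb_s(link_mb_s=800.0, block_kb=16, **NODE)
large_blocks = rebuild_rate_mb_s(link_mb_s=800.0, block_kb=1024, **NODE)
slow_link = rebuild_rate_mb_s(link_mb_s=100.0, block_kb=1024, **NODE)

print(f"16 KB blocks, fast link:   {small_blocks:.2f} MB/s (IOPS-bound drives)")
print(f"1024 KB blocks, fast link: {large_blocks:.2f} MB/s (drive transfer rate bound)")
print(f"1024 KB blocks, slow link: {slow_link:.2f} MB/s (link-bound)")
```

In this toy model a larger block raises the rebuild rate until the drives reach their sustained rate, and a faster link stops helping once the drives become the bottleneck, which mirrors the qualitative behavior reported for the rebuild block size and link speed.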

The numbers we have used in the baseline analysis are conservatively realistic, with the sensitivity analysis providing an insight into the available headroom from a reliability perspective. For the specific target we have chosen in this paper, it appears that either the [FT 2, Internal RAID 5] or the [FT 3, No Internal RAID] configuration meets the reliability requirement, on the condition that the rebuild block size is at least 64 KB.

9 Conclusions

We have developed effective reliability models for networked storage nodes based on Markov chains. We deal with the complexity of solving large Markov models in two different ways: hierarchical models and recursive models. Using these methods, we are able to generate closed-form parametric solutions that have broad utility. We have chosen a specific reliability target in order to focus on a few redundancy configurations. However, the closed-form solutions we have presented may be used to determine redundancy configurations for a spectrum of reliability targets, such as in systems that offer user-configurable goals. We have also developed a model that utilizes basic parameters, such as disk drive bandwidth and network link speed, to generate effective rebuild rates. System reliability, as we have seen, is impacted significantly by the rebuild rate; hence, obtaining a precise estimate using basic parameters ensures that the reliability results are accurate.

References

[1] C. Fleiner, D. R. Kenchammana-Hosekote, R. Garner, and W. Wilcke. Quantitative Study of the Performance and Reliability of a Resilient 3-D Mesh-based Server. Technical Report RJ […]008, IBM Research, November […]00[…].
[2] S. Frolund, A. Merchant, Y. Saito, S. Spence, and A. Veitch. A Decentralized Algorithm for Erasure-Coded Virtual Disks. Dependable Systems and Networks, June 2004.
[3] G. R. Goodson, J. J. Wylie, G. R. Ganger, and M. K. Reiter. Efficient Byzantine-tolerant erasure-coded storage. Dependable Systems and Networks, June 2004.
[4] E. K. Lee, C. A. Thekkath, C. Whitaker, and J. Hogg. A Comparison of Two Distributed Disk Systems. Research Report […]55, Digital Systems Research Center, April 1998.
[5] IBM Research. Collective Intelligent Bricks Hardware. http://www.almaden.ibm.com/storagesystems/autonomic_storage/cib_Hardware/index.shtml
[6] K. S. Trivedi. Probability and Statistics with Reliability, Queuing, and Computer Science Applications. Prentice-Hall, 1982.
[7] Q. Xin, E. L. Miller, T. Schwarz, D. D. E. Long, S. A. Brandt, and W. Litwin. Reliability Mechanisms for Very Large Storage Systems. IEEE/NASA Goddard Conference on Mass Storage Systems and Technologies, April 2003.

Appendix: Recursive Solution to Reliability Models with No Internal RAID

In this section we outline the recursive methodology used to solve for the MTTDL in the case of no internal RAID with redundancy of arbitrary fault tolerance $k$ across nodes. The results for $k = 1$, $2$, and $3$ of Sections 4.[…] and 5.[…] are special cases.

For a CTMC (see [6]) with state set $S$, absorbing states $A$, non-absorbing states $B = S \setminus A$, and mean time spent in state $i \in B$ given by $\tau_i$, the MTTDL is computed as

$$\mathrm{MTTDL} = \sum_{i \in B} \tau_i. \qquad \text{(A.1)}$$

The terms $\tau_i$ can be computed as the solution to the system of equations

$$\tau_B Q_B = -\pi_B(0),$$

where $\tau_B = (\ldots, \tau_i, \ldots)_{i \in B}$, $\pi_B(0)$ is the vector of initial probabilities for the states in $B$, and $Q_B$ is the submatrix, restricted to the non-absorbing states $B$, of the infinitesimal generator matrix $Q$. The matrix $Q$ is defined as follows: the off-diagonal entries are the transition rates for each pair of states in $S$ (these are non-negative); the diagonal entries are defined so that the row sums of $Q$ all equal zero (the diagonal entries are negative). In all our models, there is only one initial state (the first state in an enumeration of $B$), so that $\pi_B(0) = (1, 0, \ldots, 0)$. Consequently, we have $\tau_B = -(1, 0, \ldots, 0)\, Q_B^{-1}$ and

$$\mathrm{MTTDL} = -(1, 0, \ldots, 0)\, Q_B^{-1}\, (1, \ldots, 1)^t.$$

The vector on the right in this formula computes the sum in (A.1). We let $R = -Q_B$, so that $R$ has positive diagonal entries, non-positive off-diagonal entries, and

$$\mathrm{MTTDL} = (1, 0, \ldots, 0)\, R^{-1}\, (1, \ldots, 1)^t. \qquad \text{(A.2)}$$
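The computation $\mathrm{MTTDL} = (1, 0, \ldots, 0)\, R^{-1} (1, \ldots, 1)^t$ with $R = -Q_B$ can be tried on a small chain. The sketch below is only an illustration: the three-state chain (healthy, degraded, data loss), the rate names `fail_first`, `repair` and `fail_second`, and their values are assumptions chosen because the result can be checked against the closed form $(a + b + \mu)/(ab)$ for that chain; it is not one of the report's models. The solver is a plain Gaussian elimination so that the example has no dependencies.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [b] for row, b in zip(matrix, rhs)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("singular matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mttdl(q_b):
    """MTTDL for a chain that starts in the first non-absorbing state.

    q_b is the generator restricted to non-absorbing states. With R = -Q_B,
    solving R y = (1, ..., 1)^t gives y = R^{-1} (1, ..., 1)^t, and the first
    component of y is (1, 0, ..., 0) R^{-1} (1, ..., 1)^t.
    """
    r = [[-v for v in row] for row in q_b]
    return solve_linear(r, [1.0] * len(q_b))[0]


# Hypothetical rates per hour: first failure, repair, second failure (data loss).
fail_first, repair, fail_second = 2e-5, 0.1, 1e-5
q_b = [
    [-fail_first, fail_first],
    [repair, -(repair + fail_second)],
]
closed_form = (fail_first + fail_second + repair) / (fail_first * fail_second)
print(f"MTTDL from linear solve: {mttdl(q_b):.6e} hours")
print(f"MTTDL from closed form:  {closed_form:.6e} hours")
```

Solving one linear system instead of forming the inverse explicitly is a numerical convenience; the two agree because only the first row of $R^{-1}$ applied to the all-ones vector is needed.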

We call $R$ the absorption matrix for the model. Let $M$ be the expression on the right-hand side of (A.2). Recall the formula $R^{-1} = \mathrm{adj}(R) / \det(R)$, where $\mathrm{adj}(R)$ is the adjoint of $R$, that is, the transpose of the matrix of (signed) determinants of all one-less-dimension submatrices of $R$. Set

$$\mathrm{Num}(R) = (1, 0, \ldots, 0)\, \mathrm{adj}(R)\, (1, \ldots, 1)^t,$$

so that we have

$$M = \mathrm{Num}(R) / \det(R). \qquad \text{(A.3)}$$

(Num is an abbreviation for numerator.) We also define the notation $\mathrm{Sdet}(R)$ for the upper left corner of $\mathrm{adj}(R)$, that is, the determinant of the submatrix of $R$ after removing the first row and first column. If $R$ is a $1 \times 1$ scalar $r$, then set $\mathrm{Num}(R) = 1$, $\mathrm{Sdet}(R) = 1$, and $\det(R) = r$, so that $M = 1/r$. We use this notation and formulation later.

As we noted in Section 4.[…], the CTMC for the no internal RAID model with fault tolerance $k$ has a recursive structure. By a re-labeling, we can describe this recursion as follows. First build the model as in Fig. 8 for fault tolerance 1. Re-label the absorbing state as $A$ and the two failure states as $N$ and $d$, to indicate the type of failure (node or drive) that we model on the transition into these states. To create the model for general $k$ from the model for $k-1$, do the following:

1. Make two copies of the model for fault tolerance $k-1$ (inductively). Each non-absorbing state has a label of length $k-1$ in the letters $0$, $N$, $d$.
2. Merge the two absorbing states into one state $A$.
3. Prefix each state label in the first copy with an $N$ and in the second copy with a $d$.
4. In each copy, replace […] by […] and […] by […], etc. In the first copy, replace every subscript on each $h$ with a new subscript prefixed by $N$; in the second copy, prefix each $h$-subscript by $d$.
5. Add a new root state whose label is all 0's, of length $k$. Set the rate from this new state to the root state of the first copy (labeled $N0 \cdots 0$) to […], and back with […]; set the rate from this new state to the root state of the second copy (labeled $d0 \cdots 0$) to […], and back with […].

This completes the construction.

The general model is parameterized by […] and by the set of $h$-terms $\{ h_\alpha : \alpha \in \{N, d\}^k \}$, so the subscripts are all words of length $k$ in the letters $N$ and $d$ (assume this set is in reverse lexicographical order according to the subscripts).
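The labeling part of the recursive construction can be sketched in a few lines. The letters `N` (node failure) and `d` (drive failure), the function names, and the base-case labels `0`, `N`, `d` are assumptions made for this illustration; the transition rates are deliberately left out. The sketch only enumerates the non-absorbing states and picks out the two kinds of states that are described as having a transition into the absorbing state.

```python
def state_labels(k):
    """Labels of the non-absorbing states for fault tolerance k (assumed letters 0, N, d)."""
    if k < 1:
        raise ValueError("fault tolerance must be at least 1")
    if k == 1:
        return ["0", "N", "d"]
    inner = state_labels(k - 1)
    first_copy = ["N" + label for label in inner]    # copy reached by a node failure
    second_copy = ["d" + label for label in inner]   # copy reached by a drive failure
    return ["0" * k] + first_copy + second_copy      # new all-zero root comes first


def is_failure_word(label):
    """True if the label uses only the failure letters N and d (empty label included)."""
    return all(letter in "Nd" for letter in label)


def states_with_exit(k):
    """States with a transition to the absorbing state, in the two families of the text."""
    labels = state_labels(k)
    all_failed = [s for s in labels if is_failure_word(s)]
    one_short = [s for s in labels if s.endswith("0") and is_failure_word(s[:-1])]
    return all_failed, one_short


for k in (1, 2, 3):
    labels = state_labels(k)
    all_failed, one_short = states_with_exit(k)
    print(k, len(labels), all_failed, one_short)
```

Under these assumptions the number of non-absorbing states doubles plus one at each level, and for `k = 1` the second family contains only the root state, which matches the remark that a direct root-to-absorbing transition exists only at that level.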
Generally, we will suppress the last four parameters, as they do not depend on the level we are at in the recursive construction (only […] and […] change, as we see above). At times we suppress the dependence on […] and […] as well for notational brevity.

When $k > 1$, there are no transitions from the root state to the absorbing state. When $k = 1$, there is such a transition, and it is determined by […] and […] (see Figure 8). When $k > 1$, the only transitions to the absorbing state come at the innermost level of the recursion. For every state with a label containing only the letters $N$ and $d$, there is a transition to the absorbing state with rate […]. For every state whose label is of the form $\alpha 0$, where $\alpha$ contains only the letters $N$ and $d$, there is a transition to the absorbing state with rate […].

The construction (step 4) suggests the following notational operation for the sets of $h$-terms: for $x = N$ or $d$, define the dot operation as the set $\{ h_{x\alpha} : \alpha \in \{N, d\}^{k-1} \}$, in which every subscript is prefixed by $x$, so that […]. Given this notation and construction, it is easy to see that the absorption matrix $R$ for the model of fault tolerance $k$ has the form […], where […] represents a vector of the form $(1, 0, \ldots, 0)$ (similarly for […]), and for $k > 1$, […], since there is no transition from the root state (labeled with the all-zero word) to the absorbing state in this case. If $k = 1$ then […].

Figure A[…]: General form for MTTDL for Fault Tolerance $k$

The dimension of $R$ is […]. The matrices […] and […] are of the same structural form as $R$. Let $U$ be the matrix of size […] that is all zero except for a single one in the upper left corner. Then […] is the absorption matrix for the level $k-1$ model with parameters […] and […] replaced by […] and […], respectively. Similarly, […] is the absorption matrix for the level $k-1$ model with parameters […] and […] again replaced by […] and […], respectively. Symbolically, for $x = N$ or $d$,

[…] (A.4)

We now have a formal model of the recursive construction and of the effect this construction has on the absorption matrices and the parameters at each level. From the definitions of adj and det and a straightforward calculation, it is not difficult to prove the following lemma:

Lemma. For $k > 1$, $\mathrm{Num}(R) = $ […] and $\det(R) = $ […].

By (A.4), the terms on the left side, with $x = N$ or $d$ (and suppressing the […] and […]), satisfy

[…] (A.5)

and similarly for the Num terms. These formulas provide the basis for an inductive argument.

We need some additional notation in order to state the result, and assumptions on the relative size of the parameters in order to derive our approximation results. Set $L(x, y) = $ […], so that […]. Furthermore, for any ordered set $H$ of symbols, let $L(H) = $ […], where […] is the first […] elements of $H$ and […] is the last […] elements. So, for our special set we have […].

We can now state the general theorem:

Theorem. Assume […] is at least an order of magnitude smaller than both […] and […]. Then $\det(R) \approx $ […] and $\mathrm{Num}(R) \approx $ […].

From this and (A.3) we easily derive the approximation formula for the MTTDL for the general model of fault tolerance $k$ across nodes and no internal RAID, as shown in Figure A[…]. The proof of the theorem is a fairly straightforward induction, using the formula (A.5) and the Lemma. We leave out the details. The statements of MTTDL in Sections 4.[…] and 5.[…] for $k = 1$, $2$ and $3$ are easily seen to be special cases of this theorem, after replacing the parameters by their values as defined in those sections.
In particular, we see that the numerator of the quotient is simply […]. The denominator contains a term […] and two possibly comparable terms, depending on the relative orders of magnitude of the parameters.
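The quotient form $M = \mathrm{Num}(R) / \det(R)$ that underlies the appendix can be checked exactly on a small matrix. The sketch below uses rational arithmetic and cofactor expansion, which is only practical for small matrices; the function names follow the Num, Sdet and det notation, while the $2 \times 2$ example matrix and its rates are assumptions chosen for illustration, not one of the report's models.

```python
from fractions import Fraction


def minor(m, row, col):
    """Submatrix of m with the given row and column removed."""
    return [r[:col] + r[col + 1:] for i, r in enumerate(m) if i != row]


def det(m):
    """Determinant by cofactor expansion along the first row (empty matrix has det 1)."""
    if not m:
        return Fraction(1)
    return sum((-1) ** j * m[0][j] * det(minor(m, 0, j)) for j in range(len(m)))


def num(m):
    """(1,0,...,0) adj(m) (1,...,1)^t: the sum of the first row of the adjoint."""
    # Entry (0, j) of the adjoint is the cofactor of entry (j, 0) of m.
    return sum((-1) ** j * det(minor(m, j, 0)) for j in range(len(m)))


def sdet(m):
    """Upper left corner of the adjoint: det of m without its first row and column."""
    return det(minor(m, 0, 0))


# Hypothetical rates: a (first failure), mu (repair), b (second failure, data loss).
a, mu, b = Fraction(1, 100), Fraction(1, 2), Fraction(1, 50)
r = [[a, -a], [-mu, mu + b]]   # absorption matrix R = -Q_B of a two-state chain

print("Num(R)  =", num(r))
print("Sdet(R) =", sdet(r))
print("det(R)  =", det(r))
print("M       =", num(r) / det(r))
```

With these conventions a $1 \times 1$ matrix gives Num equal to 1 and Sdet equal to 1, in line with the scalar case described in the appendix.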