Copy Number Variation Methods and Data

Copy Number Varaton Methods and Data

Copy number varaton (CNV) Reference Sequence ACCTGCAATGAT TAAGCCCGGG TTGCAACGTTAGGCA Populaton ACCTGCAATGAT TAAGCCCGGG TTGCAACGTTAGGCA ACCTGCAATGAT TTGCAACGTTAGGCA ACCTGCAATGAT ACCTGCAATGAT TAAGCCCGGG TTGCAACGTTAGGCA Copy number varatons (CNVs) are regons >1kb n a genome that occur n dfferent copy number n a populaton.

CNVs n Cancer Cells Development of sold tumors s assocated wth acquston of complex genetc alteratons. These alteratons can be of any length, ncludng full chromosomes. Changes can be caused by underlyng falures n mantenance of genetc stablty, Canges may be promoted by postve selecton as they provde growth advantages. Regons commonly amplfed may contan genes that mprove the vablty and growth of the tumor cells.

CNVs n populatons CNVs rangng from 1kb - several Mb segregate n humans and model organsms. Medan length of known CNVs s ~100kb, but that number s shaky. CNVs often have lmted phenotypc mpact CNV-alleles are nherted n the germlne. >10% of the human genome has varable copy number. The genome of two ndvduals has an average dfference n length of ~10 Mb. CNVs cover genes (Redon et al(06): 2900 genes are covered by CNVs)

More CNV facts Most CNVs are sngletons In most genomc regons, CNV-mutatons are rare. CNVs are usually n hgh LD wth SNPs. 25-fold enrchment near segmental duplcatons. CNVs are generated by several types of events, e.g. non-allelc homologous recombnaton and retrotransposton. More detected CNVs are duplcatons, but t s not clear f ths s detecton bas.

CNVs are dstrbuted throughout the Genome

CNVs and dseases Ads: CNV coverng CCL3L1, large copy number s protectve of HIV/AIDS nfecton. Assocaton wth Crohn s dsease and BMI. Autsm: De novo deletons more common n patents wth Autsm. New Syndrome: Deleton on chromosome 17 defnes novel genetc dsorder.

How to fnd CNVs? Non-Mendelan Inhertance

How to fnd CNVs? Compettve Genomc Hybrdsaton Feuk et al. Nature Revews Genetcs 7, 85 97 (February 2006)

How to fnd CNVs? SNP assays rs2613837 1.40 1.20 Norm Intensty (B) 1 0.80 0.60 0.40 0.20 0 0 0.20 0.40 0.60 0.80 1 1.20 1.40 1.60 1.80 Norm Intensty (A)

How to fnd CNVs? Pared end resequencng

Method Matters

Challenges of the data analyss The sgnal s nosy, hybrdzaton ntenstes depend on many envronmental factors. In populaton samples, probes coverng CNVs are rare, about 1% of all probes cover the rare allele of a CNV. The state space s not well-defned. Especally n cancer, we may observe a large amplfcaton n copy number.

2 Algorthms CBS -Crcular Bnary Segmentaton HMM - Hdden Markov Model Basc dea: Consecutve probes are lkely to have the same copy number, thus the data s locally correlated.

Crcularly Bnary Segmentaton Ohlsen et al. Bostatstcs (2004), 5, 4, pp. 557 572 Method to analyze CGH data. Desgned for Cancer cell analyss-needs to model complex changes n copy number. Implemented n the program DNAcopy, whch s wdely used. Has a nce vsual output.

Change pont method Defnton: Let X 1, X 2,...X n be a sequence of random varables. An ndex ν s called a change-pont f X 1,..., X ν have a common dstrbuton functon F 0 and X ν+1,... have a dfferent common dstrbuton functon F 1 untl the next change-pont (f one exsts). Here, a change ponts represent the change n copy number along the genome.

Testng for the exstence of change ponts Let S = X 1 + + X, 1 n, be the partal sums. When the data are normally dstrbuted wth a known varance 2, the statstc for testng the null hypothess of no change aganst the alternatve of exactly one change at an unknown locaton s gven by Z B = max 1<n Z, where Z s a t-statstc for unequal sample szes and n- are estmates of the varance from data ponts 1,, and +1,,n. Crtcal values for ths test can be derved by Monte Carlo smulaton. The locaton of the change-pont s estmated to be such that Z B = Z.. 1 1 2 ˆ 1) ( ˆ 1) ( 1 2 2 n S S S n n n Z n n

Fndng more change ponts Assume v s a changepont. To dentfy addtonal change ponts, run the test on X 1,,X v and on X v+1,,x n. Repeated recursvely untl no further change pont can be detected. Ths algorthm does result n a multple testng problem as the probablty of fndng spurous change-ponts s a functon of the number of true change-ponts. Snce the true number of change-ponts s unknown no correcton s performed.

However Snce the bnary segmentaton procedure s based on a test to detect a sngle change, a potental problem wth t s that t cannot detect a small changed segment bured n the mddle of a large segment. Ths problem wth the bnary segmentaton procedure s due to the fact that t looks for only one change-pont at a tme. Segment 1 Segment 2

Crculary Bnary Segmentaton 3 segments 2segments

Crcular bnary segmentaton The test statstc s then Z C = max 1<jn Z j wth Note that Z C allows for both a sngle change( j = n) and two changes ( j < n). Once the null hypothess s rejected the change-pont(s) s (are) estmated to be (and j ) such that Z C = Z j Multple pars of change ponts can be detected wth the same recursve algorthm descrbed before. j n S S S j S S j n j n j n j Z j n j j n j j 1 2 2 1 1 2 ˆ 1) ( ˆ 1) (

Applcaton to data

HMM Frdlyand et al. Journal of Multvarate Analyss 90 (2004) 132 153 Goals: Identfy the number of states present n the data. Estmate the state at each probe s l.

Markov Chan Markov Chans are statstcal processes where each next state only depends on the present state. In a hdden Markov model, the actual states are unobserved, we only get ndrect sgnals.

CNVs as HMM 0 0 1 1 1 0 2 2 0 0 Probes coverng a CNV are locally correlated, a probe n a CNV s more lkely to be next to an other probe n a CNV.

HMM A HMM wth the fxed number of hdden states K can be characterzed n terms of three parameters: () the ntal state probabltes, ; () the transton probablty matrx, A () the collecton of Gaussan emsson probablty functons defned wthn each state, B. For HMMs, there are effcent EM-based algorthms to create maxmum lkelhood estmates of A, B and. Based on A, B and, ML- estmates for copy number status at each poston are generated.

HMM for CNVs Emsson dstrbutons are generally generated from outsde nformaton. Genotype nformaton can be analyzed as orthologous sgnal. Transton matrx provdes pror frequency of CNVs and pror length dstrbuton.

Comparng the methods Smulaton study by takng hybrdzaton ntenstes from a prmary breast tumor sample, assgnng the true CNV status dependent on the mean sgnal. CNV lengths followed the dstrbuton of stretches of hgh and low sgnal n the data. Gaussan nose was added. For each smulated dataset, breakponts were estmated usng three methods, DNAcopy, HMM and Glad. Successve CNVs were merged.

Comparson of Methods HMM had the greatest power to detect the shortest segments DNAcopy surpassng HMM for longer segments. DNAcopy had by far the lowest FDR for all segment lengths.

Comparng tagsnps and callng methods P(C) (freq. of mnor CNV allele) P(O A) (false postve rate) P(O C) (senstvty) 0.02 0.05 0.1 0.2 IF r 2 IF r 2 IF r 2 IF r 2 0.9 1.74 0.57 1.37 0.73 1.25 0.80 1.20 0.83 0.01 0.8 2.05 0.49 1.59 0.63 1.44 0.69 1.40 0.71 0.7 2.49 0.40 1.88 0.53 1.70 0.59 1.66 0.60 0.9 4.41 0.23 2.45 0.41 1.80 0.56 1.48 0.67 0.05 0.8 5.51 0.18 2.99 0.33 2.16 0.46 1.78 0.56 0.7 7.13 0.14 3.77 0.27 2.68 0.37 2.18 0.46 IF - Inflaton factor to overcome callng error.