Документ взят из кэша поисковой машины. Адрес оригинального документа : http://hea-www.harvard.edu/astrostat/Stat310_1112/SchwartzmanJaffeGavrilovMeyer2011_MultipleTestingChIP-SeqPeaks.pdf
Дата изменения: Tue Feb 7 23:14:32 2012
Дата индексирования: Tue Oct 2 06:34:35 2012
Кодировка:

Harvard University
Harvard University Biostatistics Working Paper Series
Year 2011 Paper 133

Multiple Testing of Local Maxima for Detection of Peaks in ChIP-Seq Data
Armin Schwartzman Yulia Gavrilov Andrew Jaffe Clifford A. Meyer

Harvard School of Public Health and Dana Farber Cancer Institute, armin@jimmy.harvard.edu Johns Hopkins Bloomberg School of Public Health Dana-Farber Cancer Institute Johns Hoplins Bloomberg School of Public Health This working paper is hosted by The Berkeley Electronic Press (bepress) and may not be commercially reproduced without the permission of the copyright holder. http://www.bepress.com/harvardbiostat/paper133 Copyright c 2011 by the authors.

Multiple Testing of Local Maxima for Detection of Peaks in ChIP-Seq Data
Armin Schwartzman1,2 , Andrew Jaffe3 , Yulia Gavrilov1,2 , Clifford A. Meyer
1,2 1,2 1,2

Department of Biostatistics, Harvard Scho ol of Public Health, and Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute 3 Department of Epidemiology and Department of Biostatistics, Johns Hopkins Blo omberg Scho ol of Public Health August 3, 2011
Abstract A topological multiple testing approach to peak detection is proposed for the problem of detecting transcription factor binding sites in ChIP-Seq data. After kernel smoothing of the tag counts over the genome, the presence of a peak is tested at each observed local maximum, followed by multiple testing correction at the desired false discovery rate level. Valid p-values for candidate peaks are computed via Monte Carlo simulations of smoothed Poisson sequences, whose background Poisson rates are obtained via linear regression from a Control sample and the local GC content. The proposed method resolves nearby binding sites that other metho ds do not.

1

Intro duction

The problem of detecting signal p eaks in the presence of background noise app ears often in the analysis of high-throughput data. In ChIP-Seq data, the problem of finding transcription factor binding sites along the genome translates to a large-scale p eak detection problem with a one-dimensional spatial structure, where the numb er, locations and heights of the p eaks are unknown. Recently, Schwartzman et al. (2010) (hereafter SGA) introduced a top ological multiple testing approach to p eak detection where, after kernel smoothing, the presence of a signal is tested not at each spatial location but only at the local maxima of the smoothed observed sequence. In this pap er, we show how that approach can b e used to formalize the inference problem of finding binding sites in ChIP-Seq data. To achieve this, we also prop ose a new regression-based method for estimating the local background binding rate from a Control sample.

1.1

ChIP-Seq data

ChIP-Sequencing or ChIP-Seq is an exp erimental method that is used to map the locations of binding sites of transcription factors along the genome in vivo Barski and Zhao (2009); Park (2009). Transcription factors control the transcription of genetic information from DNA to mRNA in living cells, and abnormalities in this process are often associated with cancer. Given a particular transcription factor of interest, ChIP-Seq combines chromatin immunoprecipitation 1

Hosted by The Berkeley Electronic Press

(ChIP) with massively parallel DNA sequencing, allowing enrichment of the DNA segments b ound by the transcription factor and mapping of their locations along the genome. The result is a long list of sequenced forward and reverse tags, also called reads, each associated with a sp ecific genomic address. After alignment of these tags, the data consists of a sequence of tag counts along the genome, with a tendency to a higher concentration of tags near the transcription factor binding sites. An example of a data fragment is shown in Rows 1 and 2 of Figure 1. The goal of the analysis is to identify the true binding sites. This translates to finding genomic locations where the binding rate is higher than it would b e if the transcription factor were not present. To this end, Johnson et al. (2007) suggested sequencing a Control input sample to provide an exp erimental assessment of the background tag distribution, helping reduce false p ositives. The cost currently associated with this technology often does not allow more than a single ChIP-Seq sample, also called IP sample, and a single Control sample. To illustrate the usefulness of the Control, Rows 1 and 2 of Figure 1 show a short fragment of the raw data after alignment in the Control and IP samples resp ectively for the same p ositions in the genome. The interesting p eaks are marked by red circles in Row 3, corresp onding to sites with high binding rate in the IP sample but lower rate in the Control. Other candidate p eaks, marked in blue, do not have a significantly higher binding rate in the IP sample than in the Control. As an additional condition, it is not enough that a site has higher binding rate relative to the Control, but it must have a high binding rate in absolute terms. In other words, sites with a very weak binding rate are not interesting, even if the corresp onding binding rate in the Control is even weaker.

1.2

Testing of lo cal maxima

The search for binding sites may b e set up as large-scale multiple testing problem where, at each genomic location, a test is p erformed for whether the binding rate is higher than the background. Testing at each genomic location is statistically inefficient b ecause it requires a multiple testing correction for a very large numb er of tests over the entire length of the genome. In ChIP-Seq, the binding rate at a true binding site has a unimodal p eak shap e that spreads into neighb oring locations, caused by the variability in the length of the sequenced segments. Thus, as argued by SGA, it is enough to test for high binding rates only at locations that resemble p eaks, that is, local maxima of the smoothed data. In this sense, the local maxima serve as top ological representatives of the candidate binding sites. Peak detection on the aligned data is carried out using the Smooth and Test Local Maxima (STEM) algorithm of SGA. It consists of: 1. kernel smoothing; 2. finding the local maxima as candidate p eaks; 3. computing p-values for the heights of the observed local maxima; and 4. applying a multiple testing procedure to the obtained p-values. For Step 1, following the `matched filter principle' recommended by SGA, we use a symmetric unimodal kernel that roughly matches the shap e of the p eaks to b e detected. This shap e corresp onds to the spatial spread of tag locations around a true binding site and is assumed to b e the same for all binding sites, up to an amplitude scaling factor, as dictated by the 2
http://www.bepress.com/harvardbiostat/paper133

Control 1.5 1.5

Control

0.0

194908000

194912000 location (chr3) Treatment (IP)

194916000

0.0

86310000

86314000 location (chr7) Treatment (IP)

86318000

1.5

0.0

194908000

194912000 location (chr3) Smoothed IP

194916000

0.0

1.5 86310000

86314000 location (chr7) Smoothed IP

86318000

x 1e-2

x 1e-2 194912000 location (chr3) Background rate lambda 194916000

60

194908000

0 10 86310000

0

86314000 location (chr7) Background rate lambda

86318000

x 1e-3

x 1e-3 194912000 location (chr3) log10(SNR) 194916000

3.5

1.5

194908000

1.5 86310000

4.0

86314000 location (chr7) log10(SNR)

86318000

0.5 2.0

194908000

194912000 location (chr3) log10(p-value)

194916000

0.5 86310000

2.0

86314000 location (chr7) log10(p-value)

86318000

-2

-10

194908000

194912000 location (chr3)

194916000

-10

-2 86310000

86314000 location (chr7)

86318000

Figure 1: A fragment of the Fox A1 aligned data featuring a few representative p eaks found by our method. Row 1: Control sample. Row 2: IP sample, same fragment as the Control. Row 3: Smoothed IP sample; significant p eaks are indicated in red, non-significant ones in blue. Row 4: Estimates of the background Poisson rate 0 (t) at local maxima of the smoothed IP sample. Row 5: Signal-to-noise ratio (SNR), equal to p eak height divided by background rate (log 10 scale). Row 6: P-values (log 10 scale)). Notice the difference in vertical scales b etween the left and right panels.

3

Hosted by The Berkeley Electronic Press

physics and chemistry exp erimental protocol. This shap e, up to an amplitude scaling factor, is estimated from the data during the alignment process. In Step 2, local maxima are defined as smoothed counts that are higher than their neighb ors after correcting for ties. In Step 3, p-values test the hyp othesis that the local binding rate is less or equal to the local background rate or a minimally interesting binding rate. The required distribution of the heights of local maxima is computed via Monte Carlo simulations. Finally, Step 4 is carried out using the Benjamini-Hochb erg (BH) procedure (Benjamini and Hochb erg, 1995), although in general, other multiple testing algorithms may b e used instead. The STEM algorithm is promising for ChIP-Seq data b ecause it was shown in SGA to provide asymptotic error control and p ower consistency under similar modeling assumptions. Like in ChIP-Seq data, SGA assumed that the signal p eaks are unimodal with finite supp ort and that the search occurs over a long observed sequence. Under those conditions, and assuming the noise to b e additive Gaussian stationary ergodic, SGA proved that the BH procedure controls the false discovery rate (FDR) of detected p eaks. Here, FDR is defined as the exp ected ratio of falsely detected p eaks among detected p eaks, where a detected p eak is considered true (false) if it occurs inside (outside) the supp ort of any true p eak. The control is asymptotic as b oth the search space and the signal strength get large. SGA also showed that the statistical p ower of the algorithm tends to one under the same asymptotic conditions. In ChIP-Seq data, the definitions of true and false detected p eaks apply within the spatial extent of the true p eak shap e, which is estimated here during the alignment process.

1.3

Estimation of the background rate and Monte Carlo calculation of pvalues

ChIP-Seq data differs from the modeling assumptions of SGA in that ChIP-Seq data consists of a long sequence of Poisson p ositive integer counts (Mikkelsen et al., 2007), which cannot b e modeled with Gaussian signal-plus-noise model. Moreover, the process generating the background noise counts is not globally stationary (Johnson et al., 2007). To make inference p ossible, we assume the Poisson rate to vary over the genome but not too fast so that it is approximately constant in the immediate vicinity of any candidate p eak. The background Poisson rate at any given location is estimated as a linear function of the local Control counts at three different scales and a nonlinear function of the local GC content. The linear coefficients are estimated from the data by multiple regression, automatically solving the normalization problem of having different sequencing depths b etween the IP and Control samples. Finally, as required by Step 3 of the STEM algorithm ab ove, for an observed local maximum of the smoothed ChIP-Seq data at a given location, its p-value is computed via Monte Carlo simulation using the background Poisson parameter estimated for that location. Note that the STEM algorithm requires an estimate of the background, but does not dep end on how that estimate was obtained. Here we prop ose a regression method, but that method could b e easily changed without changing the basic op eration of the STEM algorithm.

1.4

Other metho ds

A numb er of methods to analyze ChIP-Seq data have b een prop osed by others b efore. Examples include MACS (Zhang et al., 2008), cisGenome (Ji et al., 2008), QuEST (Valouev et al., 2008), and FindPeaks (Fejes et al., 2008). Some of these methods also view the problem of detecting binding sites as a p eak detection problem. While these methods use statistical models and estimate error rates, they do not formally state the inference problem they attempt to solve from

4
http://www.bepress.com/harvardbiostat/paper133

a statistical p oint of view (an exception, that frames the problem from a Bayesian p ersp ective, is BayesPeak (Spyrou et al., 2009); our approach is frequentist). Here we attempt to frame the ChIP-Seq analysis problem as a formal inference problem in multiple testing, relying on the error control prop erties proven in SGA, and using a regression method to estimate the background binding rate. As a reference, we compare the results of our analysis to those of MACS and cisGenome on two different datasets. A particular feature of our method is that it focuses on detecting binding sites rather than binding regions, and therefore has the ability to distinguish nearby binding sites that the other methods do not.

1.5

Datasets

We demonstrate our method on two different ChIP-Seq datasets. In the first, ChIP-Seq targetting the transcription factor FoxA1 was p erformed on the breast cancer cell line MCF-7 (Zhang et al., 2008). This dataset includes a ChIP-Seq sample (hereafter IP), in which the FoxA1 antib ody was used, and a Control input sample, in which the procedure was rep eated without the antib ody. Sequencing covered the entire genome, producing ab out 3.9 million tags in the IP sample and ab out 5.2 million tags in the Control sample. The second dataset concerns the growth-associated binding protein (GABP) (Valouev et al., 2008). This larger dataset consists of an IP sample with ab out 7.8 million tags and a Control sample with ab out 17.4 million tags. The methods in this pap er were develop ed on the FoxA1 dataset, and later applied to the GABP dataset as an indep endent testb ed. In b oth datasets, the goal of the analysis is to detect genomic loci in the IP sample that have a significantly high numb ers of tags b oth in absolute terms and relative to the Control sample. The methods in this pap er were implemented in R.

2
2.1

Peak detection for ChIP-Seq data
Alignment and estimation of the peak shape

Before statistical analysis, we follow the approach in MACS of first aligning the forward and reverse tags, after which that distinction is unnecessary and tags can b e counted together. Alignment requires estimating the amount by which tags need to b e shifted. This process, describ ed in the detail in App endix, also allows us to estimate the shap e of the spatial spread of the shifted tag counts around a p eak. For illustration, Figure 2a shows the spatial distributions of the forward and reverse tags in the IP sample of the FoxA1 dataset b efore alignment, obtained from 1000 strong and easily detectable p eaks in chromosome 1. These distributions are displaced with resp ect to one another. The optimal shift found in this case was 62 base pairs (bp), almost the same as the estimated shift of 63 found by MACS for the same data. Figure 2a shows the estimated p eak shap e after alignment of these distributions (black dashed). As a further refinement, it can b e observed in Figure 2a that the binding rate is approximately constant more than ab out 400 bp away from the center, and should not b e included as part of the p eak. As a correction, the supp ort of the kernel was reduced by multiplying the black-dashed shap e by a quartic biweight function of size W = 801, producing the estimate in solid black. This p eak shap e, normalized to unit sum, is used as a smoothing kernel in the STEM algorithm for p eak detection. The quartic biweight function has the effect of providing the kernel with continuous derivatives at the edges, a desirable prop erty to avoid spurious local maxima at that step of the algorithm. 5
Hosted by The Berkeley Electronic Press

80

60

Ctrl var (1k bins) -1000 -500 0 500 1000

counts

40

20

0.00 0.00

0

0.05

0.10

0.15

0.05

0.10

0.15

a

Displacement from center of peak (bp)

b

Ctrl mean (1k bins)

Figure 2: (a) Estimated distribution of tag counts in the forward strand (red) and in the reverse strand (blue) of the FoxA1 dataset (chromosome 1). Aligning the distributions and averaging the counts results in the joint count distribution and p eak shap e (black dashed). The p eak shap e is multiplied by a quartic biweight function (black solid). (b) Sample mean vs. sample variance of the aligned Control sequence in bins of size 1 Kb. The blue line has slop e 1.

2.2

The Poisson mo del and the STEM algorithm

After alignment, the data consists of a table of genomic locations, each with an associated tag count. The rest of genomic locations are assumed to have a count of zero. Since the data are given as p ositive integer counts, it is reasonable to model them as Poisson variables (Mikkelsen et al., 2007). Sp ecifically, we assume that the IP and Control counts IP(t) and C(t) at locations t are indep endent Poisson sequences IP(t) Po [IP (t)] , C(t) Po [C (t)] , t Z, (1)

where IP (t) 0 and C (t) 0 denote the mean rates at location t, which may vary over t. The values of the processes IP(t) and C(t) are assumed indep endent over t given IP (t) and C (t). As model validation, Figure 2b shows a graph of the sample mean vs. sample variance of the aligned Control sequence in the FoxA1 dataset, computed in bins of size 1 Kilobase (Kb). The two quantities are seen to b e nearly prop ortional with a prop ortionality constant of 1, as exp ected from the Poisson model (1). The IP sample exhibits a similar pattern (not shown). Regions of high binding frequency are represented by p eaks in the mean Poisson rates. The goal is to find regions where IP (t) is higher than the local background rate 0 (t), but also higher than a minimally interesting binding rate L . The rate L does not dep end on t and establishes a lower b ound on the height of p eaks to b e detected, ensuring that the detected p eaks are strong not only relative to the local background but also in absolute terms. This avoids detecting uninteresting weak p eaks where an even weaker local background would deem them statistically significant otherwise. At every t, the ab ove comparison translates to testing whether IP (t) 0 (t) and IP (t) L , that is, IP (t) max{0 (t),L }. To gain efficiency, rather than testing at every single location t, tests are only p erformed at local maxima of the smoothed IP sequence. This is carried out formally using the following adaptation of the STEM algorithm from SGA. 6

http://www.bepress.com/harvardbiostat/paper133

Algorithm 1 (STEM algorithm). 1. Let w(t) b e a unimodal kernel of length W . Apply kernel smoothing to the IP sequence to produce the smoothed sequence 1 IP(t) = w(t) IP(t) = W
(W +1)/2

w(s)IP(t - s).
s=-(W -1)/2

(2)

~ 2. Find all local maxima of IP(t) as candidate p eaks. Let T denote the set of locations of those local maxima. ~ 3. For each local maximum t T , compute a p-value p(t) for testing the null hyp othesis H0 (t) : IP (t) + (t) 0 vs . IP (t) > + (t) 0 (3)

in a neighb orhood of t, where + (t) = max{0 (t),L }. 0 4. Let m b e the numb er of local maxima. Apply a multiple testing procedure on the set of ~ p-values and declare significant all p eaks whose p-values are smaller than the threshold. Details on each of the steps are given in the following sections.

2.3

Smo othing and lo cal maxima

According to SGA, the b est smoothing kernel for the purp oses of p eak detection is that which maximizes the signal-to-noise ratio (SNR) after convolving the p eak shap e, assumed to underly the signal p eaks in the data, with the smoothing kernel. This is achieved by choosing the smoothing kernel to b e equal to the p eak shap e itself (up to a scaling factor), a principle called "matched filter theorem" in signal processing (Pratt, 1991; Simon, 1995). Note that this is not the same as the optimal kernel in nonparametric regression. In ChIP-Seq data, binding rate p eaks corresp onding to different binding sites for the same transcription factor are assumed to have the same shap e in terms of spatial spread, but may have different heights. The common p eak shap e is estimated in the alignment process (solid curve in Figure 2). It is unimodal, constrained to b e symmetric, and has heavier tails than the Gaussian density. In Step 1 of the STEM algorithm (Algorithm 1), smoothing was carried out setting w(t) equal to the solid curve in Figure 2, normalized to have unit sum, with W = 801. Rows 2 and 3 in Figure 1 compare the raw and smoothed IP data. The smoothed data is high at locations where the density of tag counts is high. Notice that kernel smoothing produces p ositive counts locations where the unsmoothed IP data may have no counts. In Step 2 of the STEM algorithm (Algorithm 1), local maxima of the smoothed sequence IP(t) are defined as values IP(t) that are greater than their immediate neighb ors IP(t - 1) and IP(t + 1). If the maximum is tied b etween two neighb oring values, then the p eak location is assigned the mean p osition of the tied maxima, and the height their maximum smoothed value. A useful prop erty of the kernel that avoids producing spurious local maxima is to have continuous derivatives. This was ensured by multiplication of the estimated p eak shap e by a quartic biweight function, as describ ed in Section 2.1 ab ove. Restricting the analysis to local maxima reduces the amount of data to process further. In the aligned IP sample of the FoxA1 dataset, the numb er of local maxima found was ab out 2.7 million, down from ab out 3.9 million original mapp ed tags. 7
Hosted by The Berkeley Electronic Press

Tr t mean (1k bins)

0.15

Tr t mean (1k bins) 0.00 0.05 0.10 0.15

0.10

0.000 0.0

0.00

0.002

0.05

0.004

0.006

0.2

0.4

0.6

0.8

a

Ctrl mean (1k bins)

b

GC content (1k bins)

Figure 3: Marginal distributions of the 1 Kb bin averages in the IP sample of the FoxA1 dataset as a function of: (a) the 1 Kb bin averages in the Control sample; (b) the 1 Kb bin GC content.

2.4

Estimation of the lo cal background rate

Computation of p-values in Step 3 of the STEM algorithm (Algorithm 1) requires knowledge of the background Poisson rate 0 (t) under the null hyp othesis. Estimation of 0 (t) is difficult b ecause it varies with t in an unknown fashion (Johnson et al., 2007). Here we prop ose a simple method to estimate the background rate from the local Control data and the local GC content, as follows. The GC content is included as it is known to generally affect binding rates. Since the Control sample is intended to represent the background process in the IP sample, it is reasonable to assume that the local background rate 0 (t) in the IP sample is a linear function of the corresp onding local background rate C (t) in the Control sample. The prop ortionality accounts for the difference in sequencing depth b etween the two samples. In the FoxA1 data, the IP sample has ab out 3.9 million tags, while the Control sample has ab out 5.2 million counts. Assuming that 0 (t) varies slowly with t, 0 (t) at a particular location t can b e estimated as a linear function of the corresp onding rate C (t) in the Control. The local Control rate C (t), in turn, may b e estimated as the average tag count in the Control sample within a window of size 1 Kb centered at t. The size 1 Kb is ab out the smallest precision that allows comparison of p eaks, usually of size a few hundred bp, against the background. Because counts may often b e sparse, to add stability to the parameter estimates we consider the local rate to also b e linearly related to the corresp onding rate in the Control within a window of size 10 Kb and within a window of size 100 Kb, all centered at t. To illustrate these relationships, Figure 3a shows a graph of the 1 Kb bin averages in the Control sample of the FoxA1 dataset against the 1 Kb bin averages in the IP sample. While there is a lot of variability, the main trend is seen to b e linear, captured in the figure by a marginal linear fit. The outliers in the upp er left corner corresp ond vaguely to the p eaks sought. However, their relative small numb er introduces little bias in the regression. Similar trends are observed when plotting the 1 Kb bin averages in the IP sample as a function of the 10 Kb and 100 Kb bin averages in the Control sample (not shown). Figure 3b shows the distribution of the 1 Kb bin averages in the IP sample of the FoxA1 dataset as a function of the GC content in the same 1 Kb bins. It is apparent that the local GC content affects the binding rates, but it does so in a nonlinear fashion. This relationship

8

http://www.bepress.com/harvardbiostat/paper133

Table 1: Estimated regression coefficients by non-negative least squares in the FoxA1 dataset. Standard errors were computed for reference from ordinary least squares. Predictor Coefficient Estimate Std. error 1 Kb a1 0.30378 0.00086 10 Kb a2 0.28403 0.00204 100 Kb a3 0.19787 0.00206 GC spline a4 0.00000 0.00004 a5 0.00000 0.00001 a6 0.00011 0.00001 a7 0.00036 0.00004 a8 0.00000 0.00009

may b e captured nonparametrically using 5 B-spline basis functions. Summarizing, 0 (t) is estimated from local windows of sizes 1 Kb, 10 Kb and 100 Kb centered at t via
13

^ ^ 0 (t) = a1 C

,1k

^ (t)+ a2 C

,10k

^ (t)+ a3

C,100k

(t)+
j =4

aj bj [GC1k (t)]

(4)

^ ^ ^ where C,1k (t), C,10k (t), and C,100k (t) are the Control averages in windows of size 1 Kb, 10 Kb and 100 Kb centered at t, GC1k (t) is the GC content in a window of size 1 Kb centered at t, b1 (·),... ,b5 (·) are B-spline basis functions defined on the interval (0, 0.8), and a1 ,... ,a8 are global parameters. To estimate the global parameters a1 ,... ,a8 , we set up a global linear regression as in (4), except that the predictors and the resp onse are replaced by the 1 Kb, 10 Kb, and 100 Kb bin averages, as in Figure 3. The linear regression is solved with the additional constraint that all coefficients, and thus the fitted values, are non-negative. Applying this regression in the FoxA1 dataset gave the estimates listed in Table 1. Interestingly, the three window sizes are given approximately equal weight, while the GC content is given very little weight. The coefficients also automatically account for sequencing depth. If the binding rate in the Control were constant, then the background estimate for the IP would b e approximately equal to the Control rate multiplied by the sum of the three Control window coefficients, equal to 0.786. This factor is close to the overall ratio b etween the total numb er of reads in the IP sample and the total numb er of reads in the Control sample, equal to 0.747. However, the multi-window model makes the estimate adaptive to the local variability in the background rate. ^ As an example, Row 4 of Figure 1 shows the local estimates 0 (t), roughly following the tag pattern observed in Row 1. Row 5 shows the SNR, defined as the ratio b etween the p eak ^ height IP(t) and the estimated background rate 0 (t). ^ Figure 4a shows the distribution of the estimated values of (t)0 over the entire genome for the FoxA1 dataset. A reasonable value for the minimally interesting binding rate L is ^ the median of this distribution. We thus defined L = mediant {0 (t)} = 0.00097. This value corresp onds to ab out 1 tag in a 1 Kb window, which is indeed low and uninteresting.

2.5

Computing p-values
ithm 1, the p-value p(t) of an observed location t is defined as the probability to higher under the least favorable null hyp For the purp oses of computing p-values, 9 local maximum of the smoothed obtain the observed height of the othesis IP (t) = + (t) in (3) in a 0 the null hyp othesis need only b e

In Step 3 of Algor sequence IP(t) at a local maximum or neighb orhood of t.

Hosted by The Berkeley Electronic Press

4e+05

0.08

1.0 0.8 0.6

3e+05

Frequency

2e+05

0.04

0.06

0.4 0.2 0.0

1e+05

0e+00

-4

-3

-2

-1

0.02

0.05

0.10

0.15

a Figure 4: (a) Histogram of estimated values in red) is 0.00097. (b) Right-tail distribution by Monte Carlo simulation, as a function of

b ^ of 0 in the FoxA1 dataset. The median (marked ^ (u; ) of the height u of local maxima, obtained sF the background rate .

assumed in a local neighb orhood of each candidate p eak b ecause IP(t) dep ends only on the data within a local neighb orhood, as dictated by the smoothing kernel w(t). In this section we assume that 0 (t) and L are known, having b een estimated according to the methods describ ed in Section 2.4 ab ove. In SGA, the background noise process was assumed stationary. In ChIP-Seq data, in contrast, the background rate 0 (t) is not constant. However, if the background process is locally stationary and 0 (t) varies slowly with t, then the background process in the neighb orhood of a ~ given location t = t may b e assumed to have similar statistical prop erties in that neighb orhood ~ as a stationary sequence with constant background rate 0 (t). In particular, the height of a ~ would have approximately the same distribution local maximum of the smoothed sequences at t in b oth cases. Sp ecifically, supp ose X (t; ) is a sequence of i.i.d. Poisson random variables with constant mean rate . Smoothing of X (t; ) with the kernel w(t) as in Step 1 of Algorithm 1 produces the smoothed sequence 1 ~ X (t; ) = w(t) X (t; ) = W
(W +1)/2

X (t - s; ).
s=-(W -1)/2

(5)

~ The height of a local maximum of the stationary sequence X (t; ) has the right cumulative distribution function (cdf ) ~ F (u; ) = P X (t; ) u t is a local maximum, . (6)

Then, assuming that IP(t) is locally stationary, the required distribution of the height of a local maximum of IP(t) at t may b e approximated by the distribution F (u; ) (6) corresp onding to the constant rate 0 (t). Finally, given the observed height IP(t) at t, its p-value under the null hyp othesis (3) is defined as p(t) = F IP(t); + (t) . 0 10 (7)

http://www.bepress.com/harvardbiostat/paper133

The distribution (6) is difficult to compute analytically. Instead, we resort to Monte Carlo simulations, where for each given value of , a sequence X (t; ) of indep endent Poisson variables is generated as defined ab ove smoothed using the kernel w(t), and its local maxima found. The distribution (6) is then estimated empirically from the obtained heights of the local maxima of ~ the smoothed simulated sequence X (t; ) (5). To reduce computations, rather than p erforming a new simulation for each new background rate 0 (t), a table of cdfs (6) is prepared in advance for a set of values of that covers the range of p ossible values of 0 (t) to b e found in the data. Then for any sp ecific value of 0 (t), the table is interp olated to find the distribution (6) corresp onding to that rate. ^ In the FoxA1 dataset, the smallest and largest values of 0 (t) found were 1.67 в 10-5 and -2 , resp ectively, giving values of + (t) in the range 0.00097 to 0.0749. Taking a safety ^ 7.49 в 10 0 margin of 25%, we p erformed the Monte Carlo simulation describ ed ab ove for 300 values of equally spaced on a logarithmic scale b etween 0.00073 and 0.0842. The length of the simulated Poisson sequences was set to b e as long as needed to obtain at least 100 nonzero counts, but not smaller than 1 в 105 . In order to reduce the variability from the simulation, the table of cdfs ^ F (u, ) was smoothed nonparametrically over for each fixed u via linear regression using 5 B-spline basis functions, with the additional constraint that all coefficients, and thus the fitted values, are non-negative. The safety margin mentioned ab ove ensured that none of the values of actually neded were near the edges of the table for the purp oses of spline smoothing. ^ Figure 4b shows the obtained function F (u; ) (6), given as a table of size 300 values of ^ (u; ) in (7) for any particular values of IP(t) and by 200 values of u. In order to evaluate F 0 (t), bilinear interp olation was used b etween the closest grid p oints. As an example, Row 6 of Figure 1 shows the calculated p-values p(t) in that data segment. Because of numerical precision in the Monte Carlo simulations, very low p-values could not b e distinguished from zero. In the figures, these are drawn as if they were equal to 10-10 .

2.6

Multiple testing

Following SGA, we applied the BH procedure on the sequence of m = 2683941 p-values from ~ the FoxA1 dataset, each corresp onding to a local maximum of the smoothed sequence IP(t). Of these local maxima, 17993 were declared significant at an FDR level of 0.01. Their associated addresses t are effectively p oint estimates of the locations of the binding sites they represent. As an example, in Row 3 of Figure 1, the significant local maxima are indicated by red circles. As final results, the detected p eaks were ranked according to their p-values. Of the 17993 significant p eaks, the top 7918 had p-values that could not b e distinguished from 0 b ecause of the numerical accuracy of our Monte Carlo simulations. These p eaks were ranked according to their SNR. To give a sense of the amount of signal in the data and the validity of the procedure, Figure 5a compares the observed marginal distribution of p-values to the exp ected marginal distribution under the complete null hyp othesis in the FoxA1 dataset. The observed marginal distribution of p-values (shown in black in Figure 5a) is given by the empirical distribution 1 ^ G(p) = m ~ 1 [p(t) p] ,
~ tT

0 p 1,

(8)

~ where T is the set of m locations of the local maxima of IP(t), with p-values given by (7). The ~ marginal distribution under the complete null hyp othesis is estimated in two different ways, one purely empirical and one more theoretical. 11
Hosted by The Berkeley Electronic Press

1.0

0.8

0.6

0.4

0.2

0.0

0.0

0.2

0.4

0.6

0.8

1.0

0.0 0.0

0.2

0.4

0.6

0.8

1.0

0.2

0.4

0.6

0.8

1.0

a

b

Figure 5: Marginal distribution of p-values in the FoxA1 dataset: (a) observed (black) and estimated under the global null hyp otesis empirically (red) and theoretically (blue); (b) sp ecific null distributions for = L = 0.00097 (red), = 0.00135 (blue), and = 0.00256 (green).

The empirical estimate (shown in red in Figure 5a) was obtained by running the entire analysis on the Control sample as if it was the IP, that is, searching for p eaks in the Control sample using the same Control sample for estimating the background. The obtained null distribution of p-values lies b elow the diagonal as required for validity, and it exhibits a high frequency of p-values equal to 1, corresp onding to p eaks with only one tag in them. The theoretical estimate (shown in blue in Figure 5a), was obtained as follows. Recall that ~ for a smoothed stationary Poisson sequence X (t; ) with constant rate , the distribution of the ~ height of a local maximum at t T is given by (6). Analogous to (7), define the corresp onding ~ ~ null p-value as p0 (t) = F X (t); for t T . Its distribution G0 (p; ) = P (p0 (t) p) for any t is given by G0 (p; ) = 1, F (u1 ; ) p F (uk ; ) , F (uk ; ) p < F (u (9)

k -1

; ) ,

k = 2, 3,...

~ where uk , k = 1, 2,..., are the discrete values taken by the smoothed process X (t; ) at the ~ b ecause of stationarity. In the local maxima. Note that G0 (p; ) is indep endent of t for t T ~ ChIP-Seq problem, we approximate the null distribution of the p-value at t T by the null ^+ (t)) corresp onding to a stationary process with constant rate = 0 (t), ^ distribution G0 (p; 0 which dep ends on t only through the value of . Since each of the observed p-values in (8) ^ corresp onds to a different background rate 0 (t), the estimated marginal distribution under the global null hyp othesis is given by the mixture distribution 1 ^ G0 (p) = m ~ ^ G0 p; + (t) , 0 0 p 1. (10) distribution, explains the uniform but ree examples

~ tT

Referring back to Figure 5a, the observed distribution is always ab ove the null and the large derivative at zero indicates the presence of a strong signal, which large numb er of significant p eaks found. Note that the null distribution is not stochastically larger. To b etter understand the mixture (10), Figure 5b shows th 12

http://www.bepress.com/harvardbiostat/paper133

of the individual null distributions (9). All are discrete and stochastically larger than the continuous uniform distribution. For small , the most common p-value is 1 as most local maxima take the smallest p ossible value u1 , equal to the mode of the kernel w(t), obtained when there is an isolated count of 1 in a neighb orhood of zeros. This explains the large jump at 1 in panel (a). As gets larger, the distribution b ecomes closer to the continuous uniform distribution.

3

Comparison to other metho ds

As a reference, we compared our method to MACS and cisGenome on b oth the FoxA1 and GABP datasets. While the FoxA1 dataset was used in the development of MACS and our method, the GABP dataset was not used in the development of any of the three methods, providing an impartial test of p erformance. The methods were compared by a motif analysis and in terms of their mutual agreement. On the FoxA1 dataset, the MACS and cisGenome software were applied using the default values. MACS returned a list of 13639 regions, which were ranked by p-value. In the case of cisGenome, binding regions were ranked by their FDR estimate and those with an FDR estimate of less than 0.05 were retained, yielding 7945 significant regions. In b oth cases, the locations of the p eaks were identified by the rep orted summit within each region. This makes the results comparable to ours, since our method produces p eak locations, rather than regions. On the GABP dataset, all three methods were applied using the default values. Our method, thresholding at an FDR level of 0.01, produced 4072 significant p eaks. MACS produced a list of 13828 p eaks, while cisGenome produced a list of 4275 p eaks with an FDR estimate of 0.05 or b elow.

3.1

Motif analysis

As biological validation, a motif analysis was p erformed where, for each p eak declared significant, the numb er of motifs related to the appropriate transcription factor was counted within 100 bp and 400 bp of the estimated p eak location. The distance of 400 bp approximately corresp onds to the spatial spread of the measurements b elonging to a binding site, as determined by the estimate in Figure 2a. Table 2 shows the average numb er of motifs and the prop ortion of p eaks with at least one motif within those distances for the top 7945 p eaks found by all methods in the FoxA1 dataset, and the top 4072 p eaks found by all methods in the GABP dataset. In the FoxA1 dataset, the numb er 7945 is the numb er of p eaks in the smallest of the three lists, in this case cisGenome. Taking the same numb er of top p eaks in each list makes the averages and prop ortions in the table comparable, as the p eak lists are ordered and the various methods use different criteria for their list cut-offs. In the GABP dataset, the numb er 4072 is again the numb er of p eaks in the smallest of the three lists, in this case our method. The table indicates that our method, lab eled "STEM+Regr" for simplicity, shows a similar but slightly b etter p erformance, particularly in the 400 bp range.

3.2

Peak overlap and discrepancies

To help explain the previous results, Table 3 shows the p ercentage of p eaks from the top 7945 from each method in the FoxA1 dataset, or the top 4072 from each method in the GABP dataset, that were also found by each of the other methods within a distance of 100 bp and 400 bp. The matrices in the table are not symmetric b ecause the corresp ondence b etween p eaks is 13
Hosted by The Berkeley Electronic Press

Table 2: Motif analysis comparing the p erformance of the prop osed method against MACS and cisGenome on two different datasets. Results are for the top 7945 p eaks in all methods for the FoxA1 dataset, and the top 4072 p eaks in all methods for the GABP dataset. Average numb er of Prop ortion with at motifs within least one motif within 100 bp 400 bp 100 bp 400 bp Dataset Method FoxA1 STEM+Regr 0.886 1.826 0.609 0.828 MACS 0.888 1.812 0.609 0.827 cisGenome 0.862 1.769 0.593 0.812 GABP STEM+Regr 0.836 1.652 0.556 0.779 MACS 0.832 1.650 0.557 0.776 cisGenome 0.827 1.597 0.546 0.757

Table 3: Numb er of p eaks from the methods listed in the columns that were also found by the methods listed in the rows within a distance of 100 bp and 400 bp. Results are for the top 7945 p eaks from all methods in the FoxA1 dataset, and the top 4072 p eaks from all methods in the GABP dataset. % found within 100 bp STEM+Regr MACS cisGenome 100 80.3 76.1 78.8 100 83.8 74.0 83.8 100 100 90.9 81.8 90.9 100 84.3 81.8 84.3 100 % found within 400 bp STEM+Regr MACS cisGenome 100 84.2 79.2 78.9 100 84.4 74.1 84.3 100 100 94.3 85.2 91.2 100 85.2 82.3 85.2 100

Dataset FoxA1

GABP

Method STEM+Regr MACS cisGenome STEM+Regr MACS cisGenome

not one-to-one; p eaks found by one method may b e represented by two or more p eaks found by another method. The table shows that there is a fair amount of overlap (ab out 74% to 94%) b etween the methods, our method aligning b etter with MACS than with cisGenome. To b etter understand the discrepancies, Figure 6 shows two examples of genomic segments from the GABP dataset after alignment. The left panel shows one of the 52 out of the top 4072 p eaks in the list produced by our method that were not found among the 13828 p eaks produced by MACS nor among the top 4275 p eaks produced by cisGenome. Our method detected a secondary p eak within 567 bp of a ma jor p eak (Row 3), in a binding region that was counted as a single region by b oth MACS and cisGenome. All the other p eaks in this group of 52 were found to b e secondary p eaks or sometimes tertiary p eaks. Of these, 48 were more than 400 bp away from its closest neighb or, with distances up to 1467 bp and averaging 604 bp. This indicates that these secondary p eaks, not distinguished by the other methods, may b e separate binding sites. The ability to resolve these p eaks is a consequence of our method searching for binding sites, rather than binding regions. The right panel of Figure 6 shows one of the 821 of the 13828 p eaks produced by MACS that were also found among the list of 4275 produced by cisGenome but not among the top 4072 produced by our method. This p eak was not called significant by our method b ecause its

14

http://www.bepress.com/harvardbiostat/paper133

associated p-value was not low enough (Row 6). This is b ecause the p eak height is low (notice the difference in vertical scale in Row 3 relative to the left panel), while the estimated local background rate is high (Row 4). Thus the SNR (Row 5) is lower than other p eaks in the genome, pushing this one down the ranking list. In fact, the global regression model (4) for this dataset gave all the weight in the coefficients to the 1 Kb window and none to the larger windows or the GC coefficients, so the local background estimate at the p eak (Row 4) directly reflects the high activity in the Control in the neighb orhood of the p eak (Row 1). Other p eaks in this group of 821 are similar. This example illustrates the imp ortance of the estimation of the local background rate in the analysis.

4
4.1

Discussion
Metho dological considerations

We have presented a method for detection of p eaks in ChIP-Seq data based on the STEM algorithm of SGA with promising results. The applicability of SGA to ChIP-Seq data relied on the common assumption that the signal p eaks, represented by a mean function, are unimodal and have the same shap e up to an amplitude scaling factor. The adaptation to ChIP-Seq data required two main modifications: 1) estimation of the local background rate; 2) use of Monte Carlo simulations to compute p-values. From a methodological p oint of view, estimation of the background rate 0 (t) is arguably the most crucial step in the analysis, as the inference for a particular local maximum is highly dep endent on the background rate at that location. In this pap er, we have focused on the inference asp ects of detecting p eaks with a spatial structure via the STEM algorithm. Estimation of the local background rate (the particle "Regr" in the acronym "STEM+Regr") is not part of the original STEM algorithm, but is necessary for the analysis of ChIP-Seq data b ecause the noise process is not stationary. This conceptual separation is helpful in that the background estimation method could b e replaced by a different method if desired, without affecting the general implementation of the STEM algorithm for p eak detection. Statistical methods for estimating the variable rate (t) in dynamic Poisson models or nonhomogeneous Poisson sequences have b een develop ed in other contexts. Bayesian methods (West et al., 1985; Harvey and Durbin, 1986; Bolstad, 1995) are computationally intensive, estimating (t) at each location t based on the estimates at locations 1, 2,... ,t - 1. This is computationally infeasible for long genomic sequences as in ChIP-Seq data. Other methods require either rep eated realizations (Arkin and Leenis, 2000) or a more sp ecific structure of the process (Zhao and Xie, 1996; Helmers et al., 2003), which are not available in ChIP-Seq data. In this pap er, we have prop osed a simple solution to local background estimation based on multiple linear regression. This method has the ability to easily incorp orate the Control sample and other covariates such as the local GC content. In addition, estimating the regression coefficients from the data automatically adjusts for sequencing depth and the relative weighting b etween the various window sizes. Given this relative weighting, the method adaptively estimates the local background at each location. In this sense, the regression model solves the normalization problem and gives a partial answer to the question of how slowly 0 (t) varies with t. In the FoxA1 dataset, the three windows of length 1 Kb, 10 Kb and 100 Kb were given equal weight by the regression, while in the GABP dataset all the weight was given to the 1 Kb window. Because the GABP dataset is richer in numb er of reads, the regression method automatically accounts for it and allows estimation of the background rate at a smaller spatial scale. 15
Hosted by The Berkeley Electronic Press

Control 1.5 1.5

Control

0.0

35204000

35208000 location (chr22) Treatment (IP)

35212000

0.0

55506000

55510000 location (chr17) Treatment (IP)

55514000

1.5

0.0

35204000

35208000 location (chr22) Smoothed IP

35212000

0.0

1.5 55506000

55510000 location (chr17) Smoothed IP

55514000

x 1e-2

x 1e-2 35204000 35208000 location (chr22) Background rate lambda 35212000

0 20 55506000

0 60

55510000 location (chr17)

55514000

Background rate lambda x 1e-3 40 10 55506000

x 1e-3

0 35204000

40

35208000 location (chr22) log10(SNR)

35212000

55510000 location (chr17) log10(SNR)

55514000

35204000

35208000 location (chr22) log10(p-value)

35212000

-0.2 0.8

0.0

1.0

55506000

55510000 location (chr17) log10(p-value)

55514000

-2

-10

35204000

35208000 location (chr22)

35212000

-3 55506000

0

55510000 location (chr17)

55514000

Figure 6: Left: A fragment of the aligned GABP data featuring a secondary p eak called by our method but not MACS or cisGenome. Right: A fragment of the aligned GABP data featuring a p eak called by MACS and cisGenome but not by our method. The variables plotted are the same as in Figure 1. Notice the difference in vertical scales b etween the left and right panels.

16

http://www.bepress.com/harvardbiostat/paper133

Often in ChIP-Seq data a Control sample is unavailable. In such case, the regression model (4) could have the 10 Kb and 100 Kb averages from the IP sample itself as predictors instead of those from the Control, with the 1 Kb window not included in the model. This would allow estimation of the background from the neighb orhood of each p eak, alb eit with some p ositive bias. The GC content contribution would remain. Fortunately, the p ositive bias would make the inference more conservative, affecting the detection p ower more than its validity. In the comparison with MACS and cisGenome, it was observed that the STEM algorithm p erforms comp etitively in terms of nearby motifs. In the datasets analyzed, all three methods found many of the same strong p eaks. However, our method found secondary and terciary p eaks near other strong p eaks that were not distinguished by the other methods, and may p ossibly b e separate binding sites. This is a result of our method searching for localized binding sites rather than binding regions of arbitrary size. On the other hand, our method did not call significant other p eaks that were called by the other methods. These p eaks were not strong enough when compared to their corresp onding background estimate at that location, at least according to the background estimation method used here. It is p ossible that a different background estimation method would have caused these p eaks to b e called significant. In fact, the other methods did b ecause they had different assumptions ab out what represents a strong p eak. In this pap er, we have attempted to frame the ChIP-Seq problem as a formal multiple testing problem. Assuming a Poisson model, we searched for p eaks in the IP sample whose binding rate is higher than the Control at that location and higher than a minimally interesting rate. The significance results and FDR levels may b e trusted as far those hyp otheses are concerned. Moreover, b ecause of the inherent spatial spread of the binding p eak shap e up to 400 bp and the spatial smoothing applied to the data, the detection procedure implictly defines a detected p eak to b e a true p ositive if it is within that distance of its true location. This helps explain the relatively low FDR cut-off of 0.01 used in the analysis. However the spatial spread also limits the spatial precision with which the detected p eak locations can b e estimated. The biological validity of the results is limited by the validity of the modeling assumptions. Particularly difficult is the background estimation, for which no good model exists to date. Because of its imp ortance, background estimation is where future research in ChIP-Seq analysis should focus its attention.

4.2

Computational considerations

In addition to the modelling considerations mentioned ab ove, the final ranking of the detected p eaks dep ends on the numerical accuracy with which p-values are computed. In the Monte Carlo simulations for computing the distribution of the heights of local maxima, the estimation is more accurate for high values of , as these produce more observations. In the simulations, we set the simulation length to b e at least 105 , or as long as needed in order to obtain at least 100 Poisson counts. The latter condition was necessary for very low Poisson rates, but cannot b e consider sufficient. The random variability was amelliorated by applying B-spline smoothing across in order to obtain the table in Figure 4b. In the analysis results, we observed that a large numb er of detected p eaks had a p-value of zero, meaning that the Monte Carlo simulation did not have enough numerical resolution to distinguish b etween their p-values. These p eaks were ranked sub-optimally by SNR. More accurate calculation of p-values could b e achieved with longer Monte Carlo simulations or by more sophisticated simulation techniques, such as Imp ortance Sampling. Computational complexity is also imp ortant in ChIP-Seq analysis b ecause of the large 17
Hosted by The Berkeley Electronic Press

amount of data to b e processed. The methods in this pap er were implemented in R to ease their development and sharing among researchers, but at the exp ense of computational sp eed. The main computational b ottleneck of our method is kernel smoothing, taking ab out 6 8 hours to run over the entire genome on a Dell Power Edge R710 server with CPU sp eed 2.67 GHz, 48 GB of memory and a Linux CentOS 5.5 op erating system. All the other processing steps together take ab out another hour. Kernel smoothing is mathematically simple, yet unfortunately inefficient in R for very long sequences. Computing time for kernel smoothing increases linearly with the kernel and the sequence size. In our implementation, the data was divided into subgroups of tags no more than 104 bp apart, trading off the length of the groups and their numb er. Computational time was also reduced by reducing the length of the kernel by multiplying it by a quartic biweight function of smaller supp ort. In comparison to the other methods, cisGenome is fast b ecause it is precompiled in C, while MACS is implemented in Python. Even though we implemented programming tricks for sp eed gains in R, such as run length encoding, in the future the ideas prop osed here could b e made computationally comp etitive by implementing them in C. At this stage, we do not intend that the method prop osed in this pap er is viewed as a monolithic direct comp etitor to the already existing methods for analyzing ChIP-Seq data, but rather as a suggestion of how rich inferential theory for multiple testing in spatial domains, such the one describ ed in SGA, can inform a more accurate detection of p eaks, at least from a statistical p oint of view. A software package implementing the methods in this pap er are available at: http://www.biostat.jhsph.edu/ajaffe/research.html.

A
A.1

Alignment details
Raw data

The FoxA1 raw data consists of a table of ab out 3.9 million rows for the IP sample and a table of ab out 5.2 million rows for the Control sample. Each row corresp onds to a mapp ed tag of length 35 bp and contains the b egininng and end genomic addresses for the tag and an indicator of whether the tag b elongs to the forward (+) or reverse (-) DNA strand. We define the location of a tag to b e given by its b eginning address, corresp onding to the lower address for the forward (+) tags and the higher address for the reverse (-) tags. The GABP data set, containing ab out 7.8 million tags in the IP sample and ab out 17.4 million tags in the Control sample, was converted to the same format b efore processing. Genomic locations not listed in the table were assumed to have an associated tag count of zero. Duplicate tags were considered measurement artifacts and were removed from the analysis. In order to b e counted together, the tags from the two strands need to b e aligned with each other. We followed an alignment method similar to that in MACS, shifting all tags by the same amount in the 3' direction of the tag sequence toward the most likely binding site: forward (+) tags toward higher genomic addresses and reverse (-) tags toward lower genomic addresses. Once shifted, tags coinciding at the same location are counted together. The result of this process is a table of genomic locations, each with an associated tag count. This aligned data is used as the input for p eak detection, describ ed in Section 2.

18
http://www.bepress.com/harvardbiostat/paper133

A.2

Estimation of the tag shift and peak shape

As in MACS, we estimate the size of the shift from the tag count distributions corresp onding to a set of strong and easily detectable p eaks, as describ ed b elow. We p erformed the shift estimation on Chromosome 1 b ecause of its likelihood to contain enough such strong p eaks, but other long chromosomes could b e used instead. As part of the process, the shift estimation also allows us to estimate the distribution of shifted tags counts around a p eak. This p eak shap e, normalized to unit sum, is used later as a smoothing kernel in the STEM algorithm for p eak detection. The estimation proceeds as follows. Algorithm 2 (Estimation of shift size and p eak shap e). 1. Temp orarily shift all tags ((+) forward and (-) back) by a tentative shift amount (default 100 bp). This produces a table of genomic locations, each with an associated tag count. 2. Perform p eak detection on the count data from the previous step and select a set of strong p eaks (details given b elow). Let t1 ,... ,tN be their locations. 3. Set a window size W (an odd numb er, default 2001 bp). The distribution of the forward tags is a vector of length W whose i-th entry is equal to the average numb er of forward tags at a constant distance (W +1)/2 - i from the p eak, that is, at locations tj - (W +1)/2+ i, j = 1,... ,N . Rep eat for the reverse tags. 4. Fit a spline to the distribution of forward tags and record its mode. Rep eat for the reverse tags. The estimated shift is half the distance b etween the two modes, rounded to the nearest integer. 5. To estimate the p eak shap e, shift the original forward and reverse distributions by the estimated shift, symmetrize the joint distribution by averaging b oth the forward and reverse tag distributions and their mirror images with resp ect to the center of the window, and fit a spline. The p eak detection step (Step 2) ab ove need not b e exact. Since the tag distribution is evaluated in a window around the strong p eaks, it is enough that the true location of those p eaks is contained somewhere near the center of that window. To achieve this, we apply the first half of the STEM algorithm, as follows. 2a. Set a tentative unimodal symmetric kernel (default Gaussian with standard deviation 50) and p erform kernel smoothing on the count data from Step 1. (Implementation details given in Section 2.3). 2b. Find the local maxima of the smoothed count sequence. (Implementation details given in Section 2.3). 2c. Select the N highest local maxima (default 1000). At the end of this process, the data consists of a long sequence of genomic addresses and associated counts 0, 1 or 2, ready for p eak detection analysis. The maximal count of 2 is a result of the elimination of duplicates from the original list of tags. Because binding rates are generally low, truncation at 2 does not greatly affect the Poisson model used thereafter.

19
Hosted by The Berkeley Electronic Press

References
Bradford L. Arkin and Lawrence M. Leenis. Nonparametric estimation of the cumulative intensity function for a nonhomogeneous Poisson process from overlapping realizations. Management Science, 46(7):989998, 2000. A. Barski and K. Zhao. Genomic location analysis by ChIP-Seq. Journal of Cel lular Biochemistry, 107:1118, 2009. Yoav Benjamini and Yosef Hochb erg. Controlling the false discovery rate: a practical and p owerful approach to multiple testing. J R Statist Soc B, 57(1):289300, 1995. W. M. Bolstad. The multiprocess dynamic Poisson model. J. Am. Statist. Assoc., 90(429): 227232, 1995. A. Fejes, G. Rob ertson, M. Bilenky, R. Varhol, M. Bainbridge, and S. Jones. FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology. Bioinformatics, 24(15):17201730, 2008. A. C. Harvey and J. Durbin. The effects of seat b elt legislation on British road casualties: A case study in structural time series modelling. J. Royal Statist. Soc., 149:187227, 1986. Roelof Helmers, I. Wayan Mangku, and Ri das Zitikis. Consistent estimation of the intensity car function of a cyclic Poisson process. J. Multivar. Anal., 84(1):1939, 2003. Hongkai Ji, Hui Jiang, Wenxiu Ma, David S. Johnson, Richard M. Myers, and Wing H. Wong. An integrated software system for analyzing ChIP-chip and ChIP-Seq data. Nature Biotechnology, 26(11):12931300, 2008. D. S. Johnson, A. Mortazavi, R. M. Myers, and B. Wold. Genome-wide mapping of in vivo protein-dna interactions. Science, 316:14971502, 2007. T. S. Mikkelsen, M. Ku, D. B. Jaffe, B. Issac, E. Lieb erman, G. Giannoukos, P. Alvarez, W. Brockman, T. K. Kim, R. P. Koche, W. Lee, E. Mendenhall, A. O'Donovan, A. Presser, C. Russ, X. Xie, A. Meissner, M. Wernig, R. Jaenisch, C. Nusbaum, E. S. Lander, and B. E. Bernstein. Genome-wide maps of chromatin state in plurip otent and lineage-committed cells. Nature, 448:553560, 2007. Peter J. Park. ChIP-seq: advantages and challenges of a maturing technology. Nature Rev. Genet., 10:669680, 2009. William K. Pratt. Digital image processing. Wiley, New York, 1991. Armin Schwartzman, Yulia Gavrilov, and Rob ert J. Adler. Multiple Testing of Local Maxima for Detection of Unimodal Peaks in 1D. Working pap er, Harvard University, http://www.bepress.com/harvardbiostat/paper131/, 2011. Marvin Simon. Digital communication techniques: signal design and detection. Prentice Hall, Englewood Cliffs, NJ, 1995. C. Spyrou, R. Stark., A. G. Lynch, and S. Tavar. Bayesian analysis of ChIP-Seq data. BMC Bioinformatics, 10:299, 2009.

20
http://www.bepress.com/harvardbiostat/paper133

A. Valouev, D. S. Johnson, A. Sundquist, C. Medina, E. Anton, S. Batzglou, R. M. Myers, and A. Sidow. Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nature Methods, 5:829834, 2008. M. West, P. J. Harrison, and H. S. Migon. Dynamic generalized linear models and Bayesian forecasting (with discussion). J. Am. Statist. Assoc., 80:7396, 1985. Yong Zhang, Tao Liu, Clifford A. Meyer, Jґ ^me Eeckhoute, David S. Johnson, Bradley E. ero Bernstein, Chad Nussbaum, Richard M. Myers, Myles Brown, Wei Li, and X. Shirley Liu. Model-based analysis of ChIP-Seq (MACS). Genome Biology, 9(9):R137, 2008. M. Zhao and M. Xie. On maximum likelihood estimation for a general non-homogeneous Poisson process. Scand. J. Statist., 23(4):597607, 1996.

21
Hosted by The Berkeley Electronic Press