White Paper White paper on the Transcription Factor ChIP-Seq September 11, 2015 Sample to Insight CLC bio, a QIAGEN Company Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 Fax: +45 86 20 12 22 www.clcbio.com [email protected] White paper: White paper on the Transcription Factor ChIP-Seq Contents 1 Abstract 3 2 Introduction 3 2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.2 State of the art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.3 Aims for a next-generation peak caller . . . . . . . . . . . . . . . . . . . . . . . . 5 2.4 New approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3 Signal detection using a Hotelling-filter 7 3.1 Quality control of ChIP-seq data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 3.2 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 3.3 Discovering obvious peaks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.4 Learning the peak shape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.5 Peak Shape Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.6 Peak-detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.7 The Transcription Factor ChIP-Seq . . . . . . . . . . . . . . . . . . . . . . . . . . 13 4 Performance evaluation 14 4.1 Gold-standard dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 4.2 Calculating performance metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 4.3 Running Transcription Factor ChIP-Seq with different input datasets . . . . . . . . 16 4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 5 Conclusions 20 5.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 5.2 Future Developments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 6 Materials and Methods 23 References 24 P. 2 White paper: White paper on the Transcription Factor ChIP-Seq 1 Abstract In this White Paper we present a performance analysis on the new Transcription Factor ChIP-seq tool available in CLC Genomics Workbench and CLC Genomics Server (version 7.5 and up), and in Biomedical Genomics Worbench and Server. Using an independent manually curated benchmark dataset [Rye et al., 2011] we show that the CLC implementation of the shape-based peak caller ranks well among popular state-of-the-art peak callers while requiring a minimum of interventionandparameterizationfromtheuser. Wedescribethealgorithmicprinciplesunderlying the implementation and provide an evaluation in terms of receiver-operator characteristic (ROC) plots. Finally, we identify the most promising avenues for further developments and future application areas. 2 Introduction 2.1 Background In order to identify functional elements in a genome, a number of experimental high-throughput techniques have been developed for investigating specific interactions between proteins and DNA. These protocols provide us with a deeper understanding of gene-regulatory and epigenetic mechanisms by identifying, for example, Transcription-Factor Binding Sites (TFBS) or the location of epigenetic marks. In this paper, we focus on transcription factors. An example is the Octamer Transcription Factor 1 (OCT1), which recognizes an 8 bp long, conserved DNA motif in promoter regions and activates the associated genes. The three dimensional structure of the transcription factor OCT1 bound to a DNA fragment is shown in figure 1. Figure1: AtranscriptionfactorbindingtoafragmentofDNA.ThisillustrationisbasedonPDBentry1gt0 of the Octamer-binding transcription factor OCT1, rendered in space-filling surface representation with blue/gray/redcoloringindicatinglower/medium/higherstructuralflexibility. TheDNAfragmentisshownin stick/ribbonrepresentation. ThefigurewascreatedusingCLCDrugDiscoveryWorkbench1.5. In broad terms, these techniques chemically cross-link proteins to those stretches of DNA they are bound to in vivo. After shearing the DNA, a protein of interest is extracted along with the cross-linked DNA fragments from the cell-lysate using specific antibodies. Following this Chromatin Immuno-Precipitation (ChIP) step, the short stretches of DNA attached to the protein of interest are identified by high-throughput sequencing. P. 3 White paper: White paper on the Transcription Factor ChIP-Seq For any targeted protein and a given cell-line or condition, this results in several million reads of raw sequencing data. Usually a control experiment is performed where the immuno-precipitation stepisleftout. Forexample,theENCODEProject(ENCyclopediaOfDNAElements)hasproduced data on hundreds of regulatory factors (see http://encodeproject.org/) in mouse and human. For more in-depth information we recommend the "ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia" [Landt et al., 2012,Marinov et al., 2014]. The initial step for all downstream computational analyses of the sequencing data starts by mapping the resulting set of raw reads to a reference genome. Obviously, the quality of the read mapping has an impact on the downstream analysis results. However, de- tails of the mapping process are beyond the scope of this white paper. In this context it suffices to highlight that the CLC read mapper is one of the most accurate and best per- forming tools available for this task (see http://www.clcbio.com/files/whitepapers/ whitepaper-on-CLC-read-mapper.pdf). In this paper we will assume that an accurate read mapping is provided. For the performance comparison, all peak callers used the same read mappings as input. 2.2 State of the art By plotting the number of reads mapped to genomic coordinates as a so-called coverage graph, consistentandspecificbindingsitesoftheproteinofinterestbecomevisuallyapparentaspeaks (see figure 2). Rather than identifying such regions by eye, subsequent bioinformatics analysis aimsatthereliableautomatedidentificationofprotein-DNAbindingeventsfromthereadmapping - a process referred to as peak calling. Figure 2: Coverage graphs and mapped reads from the NRSF dataset, showing a region of human Chromosome 1. The uppermost track marks two regions classified as positive peaks in the reference dataset with blue arrows. The two tracks below visualize the read mapping of the two replicate ChIP- samples. Thetwolowermosttracksdisplaythereadsfromthetwocontrolsamples. Inalltracks,forward readsareshowningreenandreversereadsareshowninred. BelowthereadsoftheChIP-seqsample dataatthepeakregions,thecoveragegraphofforwardandreversereadsisshown. From a computational viewpoint, the main lessons learned from the current generation of peak calling algorithms can be summarized as follows: The success of peak calling depends on how P. 4 White paper: White paper on the Transcription Factor ChIP-Seq well the statistical model of the ChIP-seq signal can be fitted to the data under consideration. In this context, parameterizing a peak caller can be seen as tweaking its intrinsic model to improve the fit to the data. However, this requires in-depth knowledge of the underlying algorithm and statistical model and a good grasp of how the behavior is affected by the parameters. Therefore the fine-tuning of parameters remains a "black art" to most biologists who want to analyze the results of their ChIP-seq experiments. Since parameter optimization is a hard and time-consuming task, it is recommended [Landt et al., 2012] to focus on improvements in experimental design in order to obtain better input data rather than attempting to optimize the downstream bioinformatics pipeline. In recent literature, it has been observed that current peak callers may miss peaks that are clearly visible to the human eye. More precisely the peaks exhibit distinct shapes which act as visual cues. However, as stated by Rye et al., 2011, "peak shape information is not fully exploited in the evaluated programs." There is still a gap between the peaks apparent to expert users and what is recognized algorithmically. In a recent paper, Heydarian et al., 2014 report: "This discordance led us to revisit our analyses of the ChIP-seq data, but we were unable to find parameters for the MACS algorithm that successfully discriminated many promoters as active, even though visual inspection clearly showed they had peaks in both chromatin modifications." One approach to improve peak calling is to take more of the observed specific characteristics of existing ChIP-seq datasets into account and encode them directly into the algorithm (see e.g. Kornacker et al., 2012). However, hard-coding more specific models into the heuristic is not a sustainable path for future developments in the long run, as this procedure may be overfitting certain kinds of datasets. Specially given the increasing variety of ChIP-seq related experimental protocols, this approach would ultimately lead to a similar variety of specialized heuristics. This varietyandspecializationwillmakeitincreasinglydifficulttoupdate,test,andreleasealgorithms while leaving users confused as to which tool and version is optimally suited to their data at hand. Nevertheless, novel peak-shape recognition algorithms have been shown to outperform existing heuristics, identifying peaks that were missed previously [Mendoza-Parra et al., 2013,Kumar et al., 2013]. Generally speaking, in contrast to a hard-coded internal model these approaches "learn" the peak-shape from the underlying data. In the "learning-phase" an initial set of positive examples is identified, i.e. regions that unambiguously contain peaks. From this initial set, a computational representation of the specific peak-shape is constructed. This representation is then applied to the entire dataset to perform the automated peak calling, based on a suitable statistical framework. 2.3 Aims for a next-generation peak caller Reflecting on the recent developments in the field combined with the practical lessons learned from a number of different current peak calling algorithms and the potential advances offered by the aforementioned shape-based approaches, we formulated the main requirements for our new peak calling toolset as follows: Generality In order to keep up with the steadily rising number of experimental protocols that require peak calling for data-analysis, the algorithmic engine has to be general enough to be applicable across datasets from many different *-seq technologies (i.e. ChIP-seq, FAIRE-seq, DNase-seq etc.). The toolset should be swiftly adaptable to new datasets as they become available without expensive recoding efforts. P. 5 White paper: White paper on the Transcription Factor ChIP-Seq Specificity For achieving optimal results, the specific characteristics of the peaks need to be recognized by the algorithm. The optimization and parameterization for the task at hand should not sacrifice the generality of the underlying algorithmic implementation. Robustness Rather than inventing ad-hoc scoring schemes, the algorithms need to be built on a mathematically and statistically well-founded framework. In particular, we employ methodologies from digital signal processing and machine learning, which have been extensively studied and are deeply understood. Simplicity Despite the algorithmic and statistical complexity of the data-analysis task, the implementationneedstobesuitableforageneralaudience. Thistranslatesintominimizing or even eliminating the need for parameterization and automating the most common tasks. At the same time, the algorithm needs to be transparent about its results and intuitive to use, such that advanced users can adopt the tools easily to their needs. 2.4 New approach Some of the aims formulated above may seem to be contradictory or even mutually exclusive to each other at first sight. However, recently a general approach has been described, based on the observation that the peak calling constitutes a special case of signal detection algorithms and that it "can be solved by adapting ’uniform’ and ’formally optimal’ techniques from the signal processing literature" [Kumar et al., 2013]. In addition to being general such that signals of arbitrary shape can be processed, it is based upon a well studied framework from signal processing theory. In broad terms, the specific shape of the signal is learned from a number of "positive" regions and "negative" regions (or noise) where the signal is absent or is the result of a sequencing artifact. The resulting shape of the signal minus the noise is encoded in a vector (the so-called Hotelling-observer, named after the mathematical statistician Harold Hotelling [Hotelling, 1931]), which is then evaluated against the data-stream. This resulting filter contains the information needed for peak detection and can easily be transferred and applied to other datasets as well. This approach lends itself to visual parameterization by example such that advanced users can define the set of regions from which the filter is constructed, making it transparent as to what pattern the algorithm is detecting. Examples of approaches along this rationale are described in Kornacker et al., 2012, Kumar et al., 2013, and Stanton et al., 2013. In contrast to the "black box approach" of current peak callers, these methods lend themselves to visual and interactive parameterization by displaying positive and negative examples in a user-friendly way. At the same time, the underlying implementation remains independent of the peak-shape it is detecting - analogous to text-search algorithms being independent of the text pattern or regular expressions. In order to build specific peak-shape filters without extensive manual annotation of positive and negative regions, we take into account that the vast majority of peak callers hardly disagree about top-scoring peaks [Wilbanks and Facciotti, 2010]. The differences in performance become apparent only for less obvious peaks; it is in this "gray zone" where the fit between the data and the intrinsic statistical model of the algorithms decides about their relative performance. This observation suggests a strategy for boot-strapping the shape-based approach: There are sufficient positive regions that can be safely and unambiguously identified by any of the currently available methods in a first pass. A more specific model of the peak-shape is then inferred from these clear-cut examples, allowing the algorithm to tune itself automatically to the data at hand. A second peak-detection step is performed, resulting in a much more sensitive peak-detection overall. P. 6 White paper: White paper on the Transcription Factor ChIP-Seq The components of the underlying "learn and apply"-cycle for peak-shapes are available as advanced tools, allowing for more manual intervention. These modules can be combined to producemorecomplexworkflowsorforiterativerefinementofalgorithmicperformance. Wedefer further details to an advanced tutorial, here we stick to the automated ChIP-seq functionality, akin to running existing algorithms with standard parameters. A more detailed description of the algorithmic basis and our implementation will be given in the next section, followed by a section on the performance evaluation, benchmarking the Transcription Factor ChIP-Seq against CisGenome [Ji et al., 2008], HOMER [Heinz et al., 2010], and MACS [Zhang et al., 2008]. 3 Signal detection using a Hotelling-filter Since the shape of the signal from ChIP-seq data depends on which protein was targeted in the immuno-preciptation reaction [Stanton et al., 2013,Kumar et al., 2013], the Transcription Factor ChIP-Seq is designed to take the characteristic peak-shape into account. For example, the typical signal shape of a transcription factor binding site like NRSF shows a high concentration of forward reads followed by a high concentration of reverse reads (figure 3). Figure3: Distributionofforward(green)andreverse(red)readsaroundabindingsiteofthetranscription factorNRSF.Thecentreoftheputativebindingsiteisindicatedbyaredverticalline. The average shape of the positive regions of the NRSF transcription factor for the forward and reverse strands is shown in figure 4. Figure4: AveragepeakshapeofthetranscriptionfactorNRSF. TheTranscriptionFactorChIP-Seqmakesuseofboththecharacteristicpeakshapeandenriched read coverage to identify peaks in ChIP-seq data. Next, we will outline the individual steps of the entire ChIP-seq peak calling pipeline, starting from quality control and normalization of the data, P. 7 White paper: White paper on the Transcription Factor ChIP-Seq describing the fundamentals of using a filter, learning a characteristic peak shape from highly enriched regions and calling peak regions including boundary refinement. Finally, we describe howthesestepsareworkingtogethertoresultinahighlyautomatedpeakdetectionpipelinethat provides near optimal results without the need for extensive parameterization by the user. 3.1 Quality control of ChIP-seq data During the first step of the analysis, the Transcription Factor ChIP-Seq computes several quality measures to check whether the input data satisfy the assumptions made by the algorithm. These measures can be derived from the cross-correlation profile between reads mapping to the forward and to the reverse strand. This plot is often used to investigate the quality of ChIP-seq experiments [Landt et al., 2012,Marinov et al., 2014]. A correlation profile is shown in figure 5. Fragment-length peak Fragment-length peak = 76,168 Read-length peak = 68,534 Read-length peak Background= 26,918 Read-length = 27 Read-length Estimated fragment-length peak = 101 Estimated fragment-length peak Figure 5: Inspection of the cross-correlation plot of a ChIP-seq experiment. On the left, full correlation profile. On the right, blow-up of the interesting region between 0 and 400 bp. Note that in the right plot, thecross-correlationvalueshavebeennormalizedsothattheareaunderthecurveis1. Several features of the cross-correlation plot are connected to the quality of the ChIP-seq experiment: Read-length peak Thecross-correlationfunctiontypicallyhasamaximum(oftencalledaphantom peak[Landtetal.,2012])whenthestrandshiftvalueisequaltothelengthofthesequenced reads. Inourexample(figure5, redlines), thereadlengthisknowntobe27andtheheight of the read-length peak is 68,534. Fragment-length peak The cross-correlation function typically has a maximum when the value of the strand-shift is close to the length of the DNA fragment being sequenced. This is indicated in green in figure 5, where the height of the peak is 76,168. This peak is a characteristic feature of a ChIP-seq experiment. One can expect a pronounced peak around the fragment length because the frame shift between reads mapping to the forward and to the reverse strand near a typical transcription factor binding site (figure 3) is on average equal to the fragment length [Stanton et al., 2013,Landt et al., 2012,Kharchenko et al., 2008]. Therefore, the(relative)heightofthispeakcanbeconsideredaproxyforthequality of the ChIP-seq experiment. Finally, the location of this peak can be used to estimate P. 8 White paper: White paper on the Transcription Factor ChIP-Seq the average length of the DNA fragments after the fragmentation step (e.g. sonication or MNAse digestion). The average fragment length in our example is estimated to be 101 bp. Background The value of the cross-correlation at a large distance. The Transcription Factor ChIP-Seq uses the value at a strand shift of 1,500, which can be considered to be far enoughfromthefragmentlengthdistribution[Kharchenkoetal.,2008]. Thislevelisshown in blue in figure 5, where it is equal to 26,918. This value can be considered a proxy for the background noise of the experiment. Normalized strand coefficient The normalized strand coefficient describes the ratio between the fragment-length peak and the background cross-correlation. In our example (figure 5), NSC = 76,168 2.8.Thisvalueshouldbegreaterthan1.05forChIP-seqexperiments[Landt 26,918 ⇡ et al., 2012]. Relative strand correlation The relative strand correlation describes the ratio between the fragment-length peak and the read-length peak in the cross-correlation plot obtained after correcting for the background. In our example (figure 5), RSC = 76,168�26,918 1.2. 68,534 26,918 ⇡ This value should be high (at least 0.8) for transcription factor binding sites�, which have a concentrated signal. However, it should be noted that this value can be low even for successful ChIP-seq experiments on histone modifications, whose signal can span large genomic regions [Landt et al., 2012]. ThosequalitymeasureshavebeeninvestigatedbythemodENCODEconsortiumandaredescribed in more detail in Landt et al., 2012. The cross-correlation profile shown in figure 5 is typical of a successfulChIP-seqexperiment. Ontheotherhand,cross-correlationplotswithoutapronounced fragment-length peak are typically reflective of poor quality and ragged cross-correlation profiles are typically caused by low yield (figure 6). Low quality Few reads read-length peak fragment-length peak background Figure6: Cross-correlationplotsoftwolow-qualityChIP-seqdatasets. Ontheleft,theread-lengthpeak issignificantlyhigherthanthefragment-lengthpeak(relativestrandcorrelationofaround0.5),indicating potential problems in the immuno-precipitation step. On the right, a very noisy cross-correlation profile indicatesaChIP-seqexperimentwhereaverysmallnumberofreadswassequenced. 3.2 Normalization The Transcription Factor ChIP-Seq analyzes the genomic coverage of the reads. For each read mapping, the 5’ position of the reads mapping to the forward strand and the 3’ position of the P. 9 White paper: White paper on the Transcription Factor ChIP-Seq reads mapping to the reverse strand are extracted. For each genomic position x, we define f(x) as the number of reads mapping to the forward strand where x is the 5’ position and r(x) as the number of reads mapping to the reverse strand where x is the 3’ position. A quantile standardization is then applied to f(x) and r(x) such that the normalized coverage functions f 0 and r follow a standard normal distribution, i.e. f (x),r (x) (0,1). 0 0 0 ⇠ N 3.3 Discovering obvious peaks The next step of the Transcription Factor ChIP-Seq is to build a filter, which can be used to identifygenomicregionswhosereadcoverageprofilematchesthecharacteristicpeakshapeand to determine the statistical significance of this match. In order to build such a filter, examples of positive (e.g. ChIP-seq peaks) and negative (e.g. background noise, PCR artifacts) profiles are needed as input. The rationale is that regions with very high coverage in the ChIP-seq experiment are positive examples and regions with high coverage in the control and low in the experimental ChIP-seq data are negative examples, as they most likely originate from regions with strong sequencing biases or from PCR artifacts. Positive regions are generally easy to find and are typically found by every peak caller [Wilbanks and Facciotti, 2010]. The Transcription Factor ChIP-Seq finds these peaks by building a Gaussian filter based on the mean and variance of the fragment length distribution, which are inferred from the cross-correlation profile (figure 6). An example of a filter is shown in figure 7. Fragment length standard deviation Fragment length mean Figure7: AGaussianfilterforthetranscriptionfactorNRSF. The filter is then applied to the input data as shown in figure 8 and the result is a score that indicates how likely a genomic position is to be a center of a peak. In detail, the score is calculated as score = genomic coverage?filter, (1) where ? denotes the cross-correlation operator. The cross-correlation between a function and a filter can be described as follows: For each genomic position x, we extract the genomic coverage profile of a window centered at x. We multiply this profile by the peak shape filter and we sum the result. The resulting number indicates how well the shape of the filter is matched. The score will reach a maximum at the center of a peak. Peaks are then identified as the regions whose centers are the genomic positions with highest score and whose size is the size of the filter. P. 10
Description: