
19.4: Primary data processing of ChIP data


    Read mapping

    The problem of read mapping seeks to assign a given read to the best matching location in the reference genome. Given the large number of reads and the size of human genome, one common requirement of all read mapping algorithms is that they be efficient in both space and time. Furthermore, they must allow mismatches due to sequencing errors and SNPs.

Based on previous lectures, we know various ways to perform mapping of reads: sequence alignment (O(mn) time) and hash-based approaches such as BLAST, for example. Other approaches exist as well: linear-time string matching (O(m + n) time) and suffix trees and suffix arrays (O(m) query time). However, a problem with all these techniques is their memory requirement: the dynamic-programming table is O(mn), and suffix trees, though linear in the genome size, carry large constant factors. Instead, state-of-the-art techniques based on the Burrows-Wheeler transform [1] are used. These run in O(m) time and require just O(n) space.

The Burrows-Wheeler transform originally arose from the need to compress information. It takes a long string and rearranges it so that identical characters tend to end up next to each other, forming runs. Such a string compresses well because, for example, instead of writing 100 A’s the computer can simply record that there are 100 A’s in a row. The Burrows-Wheeler transform also has some other special properties that we will exploit to search in time that depends on the query length rather than the genome length.

The Burrows-Wheeler transform produces a transformed string that is a permutation of the original (plus the added sentinel characters), so no information is lost: the transform can be reversed easily to regenerate the original string. Because the transform is built from the sorted rotations of the string, it also allows for easy searching. The details of the Burrows-Wheeler transform are described below and are illustrated in Figure 19.3.

    First, we produce a transform from an original string by the following steps. In particular, we produce a transform of the reference genome.

    1. For a given reference genome, add a special character at the beginning and end of the string (e.g., “BANANA” becomes ^BANANA@). Then generate all the rotations of this string (e.g., one such rotation would be NANA@^BA).
    2. Sort the rotations lexicographically — i.e., in alphabetical order — with special characters sorted last.
3. Only keep the last column of the sorted list of rotations. This column is the transformed string.

    Once a Burrows-Wheeler transform has been computed, it is possible to reverse the transform to compute the original string. This can be done with the procedure in Figure ??. Briefly, the reverse transformation works as follows: given the transformed string, sort the string characters in alphabetical order; this gives the first column in the transform. Combine the last column with the first to get pairs of characters from the original rotations. Sort the pairs and repeat.
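To make the forward and reverse procedures concrete, here is a minimal Python sketch (our own illustration, not the code of any particular aligner). It uses a single end-of-string sentinel “$” in place of the “^…@” pair above, and it builds the transform by explicitly sorting rotations, which is fine for a toy string like BANANA but not how production indexes are constructed.

```python
def bwt(text, end="$"):
    """Burrows-Wheeler transform: sort all rotations of text + sentinel,
    keep the last column."""
    s = text + end                      # sentinel marks the end of the string
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(transformed, end="$"):
    """Reverse the transform by repeatedly prepending the last column
    and re-sorting, reconstructing the table of sorted rotations."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        table = sorted(transformed[i] + table[i] for i in range(n))
    # The original string is the rotation that ends with the sentinel
    original = next(row for row in table if row.endswith(end))
    return original.rstrip(end)

print(bwt("BANANA"))                 # ANNB$AA (with "$" as the sentinel)
print(inverse_bwt(bwt("BANANA")))    # BANANA
```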

By sorting pointers rather than full strings, it is possible to generate this transform of the reference genome using space linear in its size. Furthermore, even with a very large number of reads, the forward transform only needs to be computed once. After counting the reads in the transformed space, it is then only necessary to do the reverse transform once to map the counts back to genome coordinates.

In particular, from the Burrows-Wheeler transform we observe that all suffixes beginning with the same characters end up next to each other in the sorted order rather than scattered throughout the genome. Moreover, the ith occurrence of a character in the first column corresponds to the ith occurrence of that character in the last column. Searching for substrings using the transform is also straightforward. Suppose we are looking for the substring “ANA” in the given string. The search problem then reduces to finding “ANA” as a prefix among the sorted rotations. The last letter of the substring (“A”) is first located among the first letters of the sorted rotations. Then, using the correspondence between first and last columns, these matches are extended by one character: the last two letters of the substring (“NA”) are searched for among the first two letters of the rotations reached in this way. Continuing with progressively longer suffixes of the query finds the full substring as the prefix of one or more rotations. Specifically, each read is searched for and found as a prefix of a rotation of the reference genome; this gives the position of the read in the genome. By doing a reverse transform, it is possible to find the genomic coordinates of the mapped reads.
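The backward-matching procedure described above can be sketched as follows. This is a simplified illustration of FM-index-style backward search (function names are our own); it builds the suffix array naively and counts occurrences with a linear scan, whereas real tools store checkpointed counts and a sampled suffix array to stay within O(n) space.

```python
from collections import Counter

def fm_index(text, end="$"):
    """Build the pieces needed for backward search: the BWT string, the count
    C[c] of characters smaller than c, and the suffix array (rotation starts)."""
    s = text + end
    sa = sorted(range(len(s)), key=lambda i: s[i:])   # suffix array (toy construction)
    last = "".join(s[i - 1] for i in sa)              # BWT = character preceding each suffix
    counts = Counter(s)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    return last, C, sa

def occ(last, c, i):
    """Occurrences of character c in last[:i] (linear here; real indexes precompute this)."""
    return last[:i].count(c)

def backward_search(query, last, C, sa):
    """Match the query right-to-left, as described above, and return its positions."""
    lo, hi = 0, len(last)                 # current range of sorted rotations
    for c in reversed(query):
        if c not in C:
            return []
        lo = C[c] + occ(last, c, lo)
        hi = C[c] + occ(last, c, hi)
        if lo >= hi:
            return []                     # query not present
    return sorted(sa[lo:hi])

last, C, sa = fm_index("BANANA")
print(backward_search("ANA", last, C, sa))   # [1, 3]: "ANA" occurs at positions 1 and 3
```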

    Note that this idea is no faster in theory than hashing, but it can be faster in practice because it uses a smaller memory footprint.

    Quality control metrics

    As with all experimental data, ChIP methods contain biases and their output may be of varied quality. As a result, before processing the data, it is necessary to control for these biases, to determine which reads in the data achieve a certain level of quality, and to set target thresholds on the quality of the data set as a whole. In this section we will describe these quality control problems and metrics associated with them.

    QC1: Use of input DNA as control

First, the reads given by ChIP are not uniformly scattered across the genome. For example, accessible regions of the genome can be fragmented more easily, leading to non-uniform fragmentation. To control for this bias, we can run the ChIP experiment on the same portion of DNA without using an antibody. This yields input DNA, which can then be fragmented and mapped to give a signal track that can be thought of as background, i.e., the reads we would expect by chance. (Indeed, even in the background we do not see uniformity.) Additionally, we have a signal track for the true experiment, which comes from the chromatin-immunoprecipitated DNA. This is shown in Figure 19.4.

    QC2: Read-level sequencing quality score threshold

When sequencing DNA, each base pair is assigned a quality score. Thus, the reads given by ChIP-seq carry base-level quality scores, where lower quality scores imply a greater probability of mismapping. We can easily use this information in a preprocessing step by simply rejecting any read whose average quality score falls below some threshold (e.g., only use reads where Q, the average quality score, is greater than 10). This is shown in Figure 19.5.
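A minimal sketch of such a filter, assuming standard Phred+33 (Sanger/Illumina 1.8+) quality encoding; the threshold of 10 mirrors the example above, and the function names are our own.

```python
def average_quality(quality_string, offset=33):
    """Mean Phred score of a read, assuming Phred+33 ASCII encoding."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)

def passes_quality_filter(quality_string, threshold=10):
    """Keep the read only if its average quality exceeds the threshold (Q > 10 above)."""
    return average_quality(quality_string) > threshold

# Hypothetical FASTQ quality line for one read:
print(passes_quality_filter("IIIIHHHGGGFFF###"))   # True: average quality is about 31.8
```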

    QC3: Fraction of short reads mapped

    Each read that passes the above quality metric may map to exactly one location in the genome, to multiple locations, or to no locations at all. When reads map to multiple locations, there are a number of approaches for handling this:

    • A conservative approach: We do not assign the reads to any location because we are so uncertain. Con: we can lose signal
    • A probabilistic approach: We fractionally assign the reads to all locations. Con: can add artifacts (unreal peaks)
    • A sampling approach: We only select one location at random for a read. Chances are, across many reads, we will assign them uniformly. Con: can add artifacts (unreal peaks)
    • An EM approach: We can map reads based on the density of unambiguous reads. That is, many unique reads that map to a region give a high prior probability that a read maps to that region. Note: we must make the assumption that the densities are constant within each region
    • A paired-end approach: Because we sequence both ends of a DNA fragment, if we know the mapping of the read from one end, we can determine the mapping of the read at the other end even if it is ambiguous.

Either way, there will likely be reads that do not map to the genome. One quality control metric is the fraction of reads that map; we may set a target of 50%, for instance. Similarly, there may be regions to which no reads map uniquely, either because of gaps in the assembly or because the region is repetitive and reads cannot be placed there unambiguously; we treat such unmappable regions as missing data.

    QC4: Cross-correlation analysis

An additional quality control is cross-correlation analysis. With single-end reads, a DNA-binding protein generates a peak of reads mapping to the forward strand offset by roughly the DNA fragment length from a peak of reads mapping to the reverse strand. A similar pattern arises with paired-end reads, in which the two read ends of each fragment fall into two groups separated by the same offset: one end maps to the forward strand and the other to the reverse strand. The average fragment length can be inferred by computing the correlation between the number of reads mapping to the forward strand and the number mapping to the reverse strand, as a function of the distance between the forward and reverse reads. The correlation peaks at the mean fragment length.
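A minimal sketch of this calculation, assuming we have already tallied per-base counts of 5’ read-start positions on each strand (variable and function names are our own):

```python
import numpy as np

def strand_cross_correlation(fwd_starts, rev_starts, max_shift=400):
    """Pearson correlation between forward-strand and shifted reverse-strand
    read-start coverage, for shifts 0..max_shift.  fwd_starts/rev_starts are
    per-base counts of 5' read starts on each strand."""
    correlations = []
    for shift in range(max_shift + 1):
        # compare forward coverage at x with reverse coverage at x + shift
        f = fwd_starts[: len(fwd_starts) - shift]
        r = rev_starts[shift:]
        correlations.append(np.corrcoef(f, r)[0, 1])
    return np.array(correlations)

# The estimated mean fragment length is the shift that maximizes the correlation:
# fragment_length = np.argmax(strand_cross_correlation(fwd, rev))
```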

The cross-correlation analysis also provides information on the quality of the ChIP-seq data set. Input DNA should not contain any real peaks, but it often shows a strong cross-correlation at a distance equal to the read length. This occurs because some reads map uniquely in between regions that are unmappable: if a read can map uniquely at position x between two unmappable regions on the forward strand, then a read can also map uniquely to the reverse strand at position x + r - 1, where r is the read length. Reads that map in this manner generate the strong cross-correlation at a distance equal to the read length in the input DNA. If a ChIP-seq experiment was unsuccessful and did not significantly enrich for the protein of interest, then a large component of the reads will behave like the unenriched input, producing a peak in the cross-correlation at the read length. Thus, the strength of the cross-correlation at the read length relative to the strength at the fragment length can be used to evaluate the quality of the ChIP-seq data set. Acceptable ChIP-seq libraries should have a cross-correlation at the fragment length at least as high as at the read length, and the higher the ratio between the fragment-length cross-correlation and the read-length cross-correlation, the better.

    QC5: Library Complexity

As a final quality control metric, we can consider the complexity of the library, or the fraction of reads that are non-redundant. In a region with signal, we might expect reads to come from many positions across that region; however, we sometimes see that only a small number of positions in a region have reads mapping to them. This may be the result of an amplification artifact in which a single DNA fragment is amplified far more than it should be. Consequently, we consider the non-redundant fraction of a library:

\[ \mathrm{NRF}=\frac{\text { No. of distinct uniquely-mapping reads }}{\text { Total no. of uniquely-mapping reads }} \nonumber \]

    This value measures the complexity of the library. Low values indicate low complexity, which may occur, for example, when there is insufficient DNA or one DNA fragment is over-sequenced. When working with at least 10 million uniquely mapped reads, we typically set a target of at least 0.8 for the NRF.
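A minimal sketch of the NRF calculation, assuming each uniquely mapped read is summarized by its mapping position (one common way of defining “distinct” reads); the input format is our own illustration:

```python
def non_redundant_fraction(mapped_positions):
    """NRF = distinct mapping positions / total uniquely mapped reads.
    `mapped_positions` is a list of (chromosome, strand, start) tuples,
    one per uniquely mapped read."""
    return len(set(mapped_positions)) / len(mapped_positions)

reads = [("chr1", "+", 100), ("chr1", "+", 100), ("chr1", "-", 250), ("chr2", "+", 60)]
print(non_redundant_fraction(reads))   # 0.75: one of the four reads is a duplicate
```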

    Peak Calling and Selection

After reads are aligned, signal tracks as shown in Figure 19.6 can be generated. These data can be arranged into a long histogram spanning the length of the genome, recording the number of reads (or the degree of fluorescence in the case of ChIP-chip) found at each position in the genome. More reads (or stronger fluorescence) suggest a stronger presence of the epigenetic mark of interest at that particular location.

In particular, to generate these signal tracks we transform the read counts into a normalized intensity signal. First, we can use the strand cross-correlation analysis to estimate the average fragment length f. Since we now know f, as well as the length of each read, we can extend each read (often only 36 bp) in the 5’ to 3’ direction so that its length equals f. Then, rather than just summing the intensity of each base in the original reads, we sum the intensity of each base in the extended reads from both strands. In other words, even though we only sequence a short read, we are able to use information about the entire fragment of which that read is a part. We can do this same operation on the control data. This yields signal tracks for both the true experiment and the control, as shown in Figure 19.7.
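A minimal sketch of this read-extension step, assuming reads are summarized by their leftmost alignment coordinate and strand (a SAM-like convention); how real pipelines handle chromosome boundaries and normalization is omitted.

```python
import numpy as np

def signal_track(read_starts, strands, read_length, fragment_length, genome_length):
    """Extend each read from its 5' end to the estimated fragment length and
    sum per-base coverage of the extended fragments from both strands.
    read_starts: leftmost alignment coordinates (0-based); strands: '+' or '-'."""
    track = np.zeros(genome_length)
    for start, strand in zip(read_starts, strands):
        if strand == "+":
            lo, hi = start, start + fragment_length              # extend 5' -> 3' rightward
        else:
            # for a reverse-strand read the 5' end is the rightmost base,
            # so the extension runs leftward from start + read_length
            lo, hi = start + read_length - fragment_length, start + read_length
        lo, hi = max(lo, 0), min(hi, genome_length)
        track[lo:hi] += 1
    return track

# Hypothetical usage: coverage = signal_track(starts, strands, 36, 200, 3_000_000)
```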

To process the data, we are first interested in using these signal tracks to discover regions (i.e., discrete intervals) of enrichment. This is the goal of peak calling. There are many programs that perform peak calling with different approaches. For example, MACS uses a local Poisson distribution as its statistical model, whereas PeakSeq uses a conditional binomial model.

One way to model the read count distribution is with a Poisson distribution. We can estimate the expected read count, \( \lambda_{\text{local}} \), from the control data. Then,

    \[ \operatorname{Pr}(\text { count }=x)=\frac{\lambda_{\text {local }}^{x} e^{-\lambda_{\text {local }}}}{x !} \nonumber \]
    Thus, the Poisson p-value for a read count x is given by Pr(count ≥ x). We specify a threshold p-value (e.g., 0.00001) below which genomic regions are considered peaks.
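A minimal sketch of this calculation using SciPy’s Poisson survival function; the example numbers are hypothetical.

```python
from scipy.stats import poisson

def poisson_peak_pvalue(read_count, lambda_local):
    """P(count >= x) under a Poisson background with rate lambda_local
    (estimated from the input-DNA control, as described above)."""
    return poisson.sf(read_count - 1, lambda_local)   # sf(x-1) = P(X >= x)

# A window with 25 ChIP reads where the control predicts 5 on average:
print(poisson_peak_pvalue(25, 5.0))   # on the order of 1e-10, far below a 1e-5 threshold
```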

We can transform this p-value into an empirical false discovery rate, or eFDR, by swapping the ChIP (true) experiment data with the input DNA (control) tracks. This yields the locations in the genome where the background signal is higher than the ChIP signal. For each p-value threshold, we can count the number of peaks called from both the ChIP data and the swapped control comparison. The eFDR at that threshold is then simply the number of control peaks divided by the number of ChIP peaks. With this, we can choose which peaks to call based on an eFDR threshold.
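A minimal sketch of the eFDR calculation, assuming we already have per-region p-values from the real (ChIP vs. control) and swapped (control vs. ChIP) comparisons; names are our own.

```python
def empirical_fdr(chip_pvalues, swapped_pvalues, threshold):
    """eFDR at a given p-value threshold: number of peaks called with ChIP and
    control swapped, divided by the number of peaks called in the real comparison."""
    n_chip = sum(p <= threshold for p in chip_pvalues)         # ChIP vs. control peaks
    n_control = sum(p <= threshold for p in swapped_pvalues)   # control vs. ChIP peaks
    return n_control / n_chip if n_chip else float("nan")
```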

A major problem that arises is that no single universal eFDR or p-value threshold can be used. Ideal thresholds depend on a range of factors, including the quality of the ChIP, the sequencing depth, and the ubiquity of the target factor. Furthermore, small changes in the eFDR threshold can yield very large changes in the peaks that are discovered. An alternative measure, the Irreproducible Discovery Rate (IDR), avoids these issues.

Irreproducible Discovery Rate (IDR)

A major drawback of using traditional statistical methods to evaluate the significance of ChIP-seq peaks is that FDR- and p-value-based approaches make particular assumptions regarding the relationship between enrichment and significance. Evaluating the significance of ChIP peaks using IDR rather than a p-value or FDR is advantageous because it allows us to leverage the information present in biological replicates to call peaks without setting a threshold for significance. IDR-based approaches rely upon the idea that real signal is likely to be reproducible between replicates, whereas noise should not be reproducible. Calling peaks with IDR then returns the peaks that satisfy a given reproducibility threshold. To determine which peaks are significant via IDR, the peaks in each biological replicate are ranked by their enrichment in descending order. The top N peaks in each replicate are then compared against each other, and the IDR for a given replicate is the fraction of its top N peaks that are not present in the top N peaks of the other replicates (i.e., the fraction of peaks that are not reproducible between replicates). To develop more mathematical intuition, the following (entirely optional) subsection introduces the concept of the IDR more rigorously.

    Mathematical Derivation of the IDR

Because the IDR uses ranks, the marginal distributions are uniform, and the information is mostly encoded in the joint distribution of the ranks across biological replicates. Specifically, when the marginal distributions are uniform, we can model the joint distribution with a copula model. Simply put, a copula is a multivariate probability distribution in which the marginal probability of each variable is uniform. Sklar’s theorem states that there exists at least one copula function that allows us to express the joint distribution in terms of its marginal distributions:

\[F_{k}\left(x_{1}, x_{2}, \ldots, x_{k}\right)=C\left(F_{X_{1}}\left(x_{1}\right), \ldots, F_{X_{k}}\left(x_{k}\right)\right) \nonumber \]

where C is the copula function and \(F_{X_{i}}\) is the marginal cumulative distribution function of variable \(X_{i}\). Given this, we introduce a Bernoulli variable \(K_{i} \sim \operatorname{Bern}\left(\pi_{1}\right)\) indicating whether the ith peak belongs to the consistent (reproducible) set (\(K_{i}=1\)) or the spurious set (\(K_{i}=0\)). Let \(\left(z_{i, 1}, z_{i, 2}\right)\) denote the latent scores of peak i in the two biological replicates. Conditional on \(K_{i}\), we model them as follows:

    \[ \left(\begin{array}{c}
    z_{i, 1} \\
    z_{i, 2}
    \end{array}\right) \mid K_{i}=k \sim N\left(\left(\begin{array}{c}
    \mu_{k} \\
    \mu_{k}
    \end{array}\right),\left(\begin{array}{cc}
    \sigma_{k}^{2} & \rho_{k} \sigma_{k}^{2} \\
    \rho_{k} \sigma_{k}^{2} & \sigma_{k}^{2}
    \end{array}\right)\right) \nonumber \]

The two component distributions describe whether a peak comes from the spurious set (denoted by 0) or the real set (denoted by 1). For the real set we have \(\mu_{1}>0\) and \(0<\rho_{1}<1\), whereas for the null (spurious) set we have \(\mu_{0}=0\), \(\sigma_{0}^{2}=1\), and \(\rho_{0}=0\). We can then define variables \(u_{i, 1}\) and \(u_{i, 2}\) with the following formulas:

    \[ u_{i, 1}=G\left(z_{i, 1}\right)=\pi_{1} \Phi\left(\frac{z_{i, 1}-\mu_{1}}{\sigma_{1}}\right)+\pi_{0} \Phi\left(z_{i, 1}\right) \nonumber \]

    \[u_{i, 2}=G\left(z_{i, 2}\right)=\pi_{1} \Phi\left(\frac{z_{i, 2}-\mu_{1}}{\sigma_{1}}\right)+\pi_{0} \Phi\left(z_{i, 2}\right) \nonumber \]

where Φ is the standard normal cumulative distribution function. Then let the observed values be \(x_{i, 1}=F_{1}^{-1}\left(u_{i, 1}\right)\) and \(x_{i, 2}=F_{2}^{-1}\left(u_{i, 2}\right)\), where \(F_{1}\) and \(F_{2}\) are the marginal distributions of the two coordinates. Thus, for a given peak i, we have:

\[P\left(X_{i, 1} \leq x_{1}, X_{i, 2} \leq x_{2}\right)=\pi_{0} h_{0}\left(G^{-1}\left(F_{1}\left(x_{1}\right)\right), G^{-1}\left(F_{2}\left(x_{2}\right)\right)\right)+\pi_{1} h_{1}\left(G^{-1}\left(F_{1}\left(x_{1}\right)\right), G^{-1}\left(F_{2}\left(x_{2}\right)\right)\right) \nonumber \]

We can express h0 and h1 with the following normal distributions, matching the conditional distributions of \(\left(z_{i, 1}, z_{i, 2}\right)\) defined above:

    \[\begin{aligned}
    &h_{0} \sim N\left(\left(\begin{array}{l}
    0 \\
    0
    \end{array}\right),\left(\begin{array}{ll}
    1 & 0 \\
    0 & 1
    \end{array}\right)\right)\\
    &h_{1} \sim N\left(\left(\begin{array}{c}
    \mu_{1} \\
    \mu_{1}
    \end{array}\right),\left(\begin{array}{cc}
    \sigma_{1}^{2} & \rho_{1} \sigma_{1}^{2} \\
    \rho_{1} \sigma_{1}^{2} & \sigma_{1}^{2}
    \end{array}\right)\right)
    \end{aligned} \nonumber \]

We can now infer the parameters \(\theta=\left(\mu_{1}, \rho_{1}, \sigma_{1}, \pi_{0}\right)\) using an EM algorithm, where the inference is based on \(P\left(K_{i}=1 \mid\left(x_{i, 1}, x_{i, 2}\right) ; \hat{\theta}\right)\). Thus, we can define the local irreproducible discovery rate as:

\[ \operatorname{idr}\left(x_{i, 1}, x_{i, 2}\right)=P\left(K_{i}=0 \mid\left(x_{i, 1}, x_{i, 2}\right) ; \hat{\theta}\right) \nonumber \]

To control the IDR at some level α, we rank the pairs \(\left(x_{i, 1}, x_{i, 2}\right)\) by their local idr values and select \(\left(x_{(i), 1}, x_{(i), 2}\right)\), \(i=1, \ldots, l\), where

\[l=\max \left\{i: \frac{1}{i} \sum_{j=1}^{i} \operatorname{idr}_{(j)} \leq \alpha\right\} \nonumber \]
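A minimal sketch of this selection rule, assuming the local idr values \(P(K_i = 0 \mid \text{data})\) have already been estimated by the EM fit (function name is our own):

```python
import numpy as np

def select_peaks_by_idr(local_idr, alpha=0.05):
    """Rank peaks by their local idr and keep the largest prefix whose
    running mean idr stays at or below alpha (the global IDR threshold)."""
    local_idr = np.asarray(local_idr)
    order = np.argsort(local_idr)                     # most reproducible peaks first
    running_mean = np.cumsum(np.sort(local_idr)) / np.arange(1, len(local_idr) + 1)
    n_selected = int((running_mean <= alpha).sum())   # the index l in the formula above
    return order[:n_selected]                         # indices of the selected peaks
```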

IDR is analogous to FDR control in this copula mixture model. This subsection summarizes the information provided in this lecture: www.biostat.wisc.edu/~kendzi...AT877/SK_2.pdf. The original paper, along with an even more detailed formulation of IDR, can be found in Li et al. [10].

    Advantages and use cases of the IDR

    IDR analysis can be performed with increasing N, until the desired IDR is reached (for example, N is increased until IDR=0.05, meaning that 5% of the top N peaks are not reproducible). Note that N can be different for different replicates of the same experiment, as some replicates may be more reproducible than others due to either technical or biological artifacts.

IDR is also superior to simpler approaches that use the reproducibility between experiments to define significance. One approach might be to take the union of all peaks in both replicates as significant; however, this method accepts both the real peaks and the noise in each data set. Another approach is to take the intersection of peaks in both replicates, that is, to count only peaks present in both data sets as significant. While this method very effectively eliminates spurious peaks, it is likely to miss many genuine peaks. IDR can be thought of as combining both approaches: it accepts all peaks, reproducible or not, so long as the peaks have sufficient enrichment to fall within the segment of the data whose overall irreproducibility rate is below a given threshold. Another advantage of IDR is that it can still be performed even if biological replicates are not available, which is often the case for ChIP experiments performed in rare cell types. Pseudo-replicates can be generated from a single data set by randomly assigning half the reads to one pseudo-replicate and half to the other.

    Interpreting Chromatin Marks

We now move on to techniques for interpreting chromatin marks. There are many ways to analyze epigenomic marks, such as aggregating chromatin signals (e.g., H3K4me3) over known feature types (e.g., promoters of genes with high or low expression levels), or applying supervised or unsupervised machine learning to derive epigenomic features that are predictive of different types of genomic elements such as promoters, enhancers, or large intergenic non-coding RNAs. In particular, in this lecture we examine in detail the analysis of chromatin marks as done in [7].


    This page titled 19.4: Primary data processing of ChIP data is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Manolis Kellis et al. (MIT OpenCourseWare) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.