Biology LibreTexts

31.3: Structure of an eQTL Study

    The basic approach behind an eQTL study is to treat each gene’s expression as a quantitative, multi-factor trait and to regress it on genotype, using principal components that explain variance in expression as covariates. First, cells of the tissue of interest are collected and their RNA is extracted. Expression of the genes of interest is then measured by either microarray or RNA-seq analysis. The expression level of each gene is regressed on genotypes, controlling for biological and technical noise, such that

    \[Y_{i}=\alpha+X_{i} \beta+\epsilon_{i}\nonumber\]

    where \(Y_{i}\) is the expression of gene \(i\), \(X_{i}\) is a vector containing the allelic composition of each SNP associated with the gene (each entry takes the value 0, 1, or 2 relative to a reference allele), \(\alpha\) and \(\beta\) are column vectors containing the regression coefficients, and \(\epsilon_{i}\) is the residual error (see Figure 31.5) [9]. In concept, such a study is extremely simple. In practice, there are hundreds of potential confounders and sources of statistical uncertainty that must be accounted for at every step of the process. However, the same regression model can be extended to include these confounders as covariates.
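
    The regression above can be sketched for a single gene and a single SNP as follows. This is a minimal illustration under stated assumptions, not the procedure of any particular study: the function name fit_eqtl, the simulated data, and the effect size of -0.8 are all hypothetical, and ordinary least squares is assumed as the fitting method.

```python
import numpy as np

def fit_eqtl(genotypes, expression):
    """Ordinary least squares fit of expression on genotype dosage (0, 1, 2)."""
    genotypes = np.asarray(genotypes, dtype=float)
    expression = np.asarray(expression, dtype=float)
    # Design matrix: a column of ones for the intercept, then the dosages.
    design = np.column_stack([np.ones_like(genotypes), genotypes])
    coef, _, _, _ = np.linalg.lstsq(design, expression, rcond=None)
    alpha, beta = coef
    residuals = expression - design @ coef
    return alpha, beta, residuals

# Simulated (hypothetical) data: each extra copy of the allele lowers expression.
rng = np.random.default_rng(0)
dosage = rng.integers(0, 3, size=200)
expr = 5.0 - 0.8 * dosage + rng.normal(0.0, 0.5, size=200)

alpha_hat, beta_hat, resid = fit_eqtl(dosage, expr)
print(f"intercept = {alpha_hat:.2f}, slope = {beta_hat:.2f}")
```

    A negative slope here corresponds to expression decreasing as the number of copies of the counted allele increases.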

    Figure 31.9 shows an example eQTL study conducted on asthma. The key result from the study is the linear model in the upper right: as the genotype tends more towards the "A" variant, expression of the target gene decreases.

    Considerations for Expression Data

    Quantifying the expression of genes is fraught with experimental challenges. For a more detailed discussion of these issues, see Chapter 14. One important consideration for this type of expression analysis is the SNP-under-probe effect: probe sequences that map to regions with common variants give inconsistent results, because variation within the probe target itself affects binding dynamics. Thus, experiments repeated with multiple sets of probes will produce a more reliable result. Expression analysis should also generally exclude housekeeping genes, which are not differentially regulated across members of a population and/or cell types, since these would only dilute the statistical power of the study.

    Considerations for Genomic Data

    There are two main considerations for the analysis of genomic data: the minor allele frequency and the search radius. The search radius determines the generality of the effect being considered: an infinite search radius corresponds to a full-genome cis- and trans-eQTL scan, while smaller radii restrict the analysis to cis-eQTLs. The minor allele frequency (MAF) cutoff sets the threshold below which a SNP site is not considered; it is a major determinant of the statistical power of the study. A higher MAF cutoff generally leads to higher statistical power, but MAF and search radius interact in nonlinear ways to determine the number of significant alleles detected (see Figure 31.6).
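
    The two filters described above can be sketched as a mask over candidate SNPs. This is an illustrative sketch only: the function names, the toy dosage matrix, the 0.05 cutoff, and the 100 kb radius are hypothetical choices, and all SNPs are assumed to lie on the same chromosome as the gene.

```python
import numpy as np

def minor_allele_frequency(dosages):
    """MAF per SNP from a (samples x SNPs) matrix of 0/1/2 dosages."""
    freq = np.asarray(dosages, dtype=float).mean(axis=0) / 2.0
    return np.minimum(freq, 1.0 - freq)

def select_snps(dosages, snp_pos, gene_pos, maf_cutoff=0.05, radius=None):
    """Boolean mask of SNPs that pass the MAF cutoff and lie within
    `radius` base pairs of the gene. radius=None applies no distance
    restriction (the 'infinite search radius' case)."""
    keep = minor_allele_frequency(dosages) >= maf_cutoff
    if radius is not None:
        keep &= np.abs(np.asarray(snp_pos) - gene_pos) <= radius
    return keep

# Toy example: 4 individuals, 3 SNPs (hypothetical values).
dosages = np.array([[0, 1, 0],
                    [0, 2, 0],
                    [1, 1, 0],
                    [2, 0, 0]])
snp_pos = np.array([1_000, 900_000, 1_500])

cis_mask = select_snps(dosages, snp_pos, gene_pos=2_000, radius=100_000)
all_mask = select_snps(dosages, snp_pos, gene_pos=2_000, radius=None)
print(cis_mask, all_mask)
```

    The third SNP is monomorphic in this toy sample, so it fails the MAF cutoff under either radius; the second SNP is kept only when the distance restriction is lifted.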

    Covariate Adjustment

    There are many possible statistical confounders in an eQTL study, both biological and technical. Many biological factors can affect the observed expression of any given mRNA in an individual; this is exacerbated by the impossibility of controlling the testing circumstances of the large population samples needed to achieve statistical significance. Population stratification and genomic differences between racial groups are additional contributing factors. Statistical variability also exists on the technical side: even samples run on the same machine at different times show markedly different clustering of expression results (Figure 31.7).

    Researchers have successfully used the technique of Principal Component Analysis (PCA) to separate the effects of these confounders. PCA can produce new coordinate axes along which SNP-associated gene expression data has the highest variance, thereby isolating unwanted sources of consistent variation (see Chapter 20.4 for a detailed description of Principal Component Analysis). After extracting the principal components of the gene expression data, we can extend the linear regression model to account for these confounders and produce a more accurate regression.
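
    One way to picture the extended regression is sketched below: principal component scores are computed from the expression matrix and added as extra columns of the design matrix. This is a hedged sketch, not the method of any specific study; the function names, the simulated hidden factor, the choice of one component, and the effect size of 0.5 are assumptions made for illustration.

```python
import numpy as np

def expression_pcs(expr_matrix, n_components):
    """Top principal component scores of a (samples x genes) matrix."""
    centered = expr_matrix - expr_matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def genotype_effect(dosage, expression, covariates):
    """Slope for the genotype term in expression ~ 1 + dosage + covariates."""
    design = np.column_stack([np.ones(len(dosage)), dosage, covariates])
    coef, *_ = np.linalg.lstsq(design, expression, rcond=None)
    return coef[1]

# Simulated (hypothetical) data: one hidden factor, such as a batch effect,
# shifts every gene; gene 0 additionally depends on genotype.
rng = np.random.default_rng(1)
n_samples, n_genes = 300, 50
hidden = rng.normal(0.0, 2.0, size=n_samples)
loadings = rng.normal(1.0, 0.2, size=n_genes)
dosage = rng.integers(0, 3, size=n_samples).astype(float)
expr = np.outer(hidden, loadings) + rng.normal(0.0, 0.5, size=(n_samples, n_genes))
expr[:, 0] += 0.5 * dosage

pcs = expression_pcs(expr, n_components=1)
beta_raw = genotype_effect(dosage, expr[:, 0], np.empty((n_samples, 0)))
beta_adj = genotype_effect(dosage, expr[:, 0], pcs)
print(f"unadjusted slope = {beta_raw:.2f}, PC-adjusted slope = {beta_adj:.2f}")
```

    In this simulation the first component mostly tracks the hidden factor, so including it removes a large share of the residual variance in gene 0 before the genotype effect is estimated.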

    FAQ

    Q: Why is PCA an appropriate statistical tool to use in this setting and why do we need it?

    A: Unfortunately, our raw data has several biases and external factors that make it difficult to infer good eQTLs. However, we can think of these biases as independent influences on the dataset that create artificial variance in the expression levels we see, obscuring the factors that give rise to actual variance. Using PCA, we can decompose the variance into principal components, identify the components that reflect these biases, and filter them out appropriately. Also, due to the complex nature of the traits being analyzed, PCA can help reduce the dimensionality of the data and thereby facilitate computational analysis.

    FAQ

    Q: How do we decide how many principal components to use?

    A: This is a tough problem. One possible solution is to try different numbers of principal components and examine the eQTLs found afterwards, then vary this number in future tests according to whether the resulting eQTLs are viable. Note that it is difficult to "optimize" the parameters of an eQTL study in general, because each dataset will have its own optimal number of principal components, its own best value for MAF, and so on.
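
    The trial-and-error approach in this answer can be sketched as a loop over candidate numbers of components, counting how many genes pass a significance threshold each time. Everything specific here is an assumption for illustration: the function name count_hits, the simulated data, and the fixed |t| > 4 threshold, which stands in for a proper multiple-testing procedure.

```python
import numpy as np

def count_hits(expr, dosage, n_pcs, t_threshold=4.0):
    """Count genes whose genotype t-statistic exceeds t_threshold in
    absolute value after adjusting for the top n_pcs expression PCs."""
    n = expr.shape[0]
    centered = expr - expr.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    pcs = u[:, :n_pcs] * s[:n_pcs]
    design = np.column_stack([np.ones(n), dosage, pcs])
    # lstsq fits every gene (column of expr) in one call.
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    resid = expr - design @ coef
    dof = n - design.shape[1]
    sigma2 = (resid ** 2).sum(axis=0) / dof
    xtx_inv = np.linalg.inv(design.T @ design)
    se = np.sqrt(sigma2 * xtx_inv[1, 1])  # standard error of the dosage slope
    t_stats = coef[1] / se
    return int(np.sum(np.abs(t_stats) > t_threshold))

# Simulated (hypothetical) data: a strong hidden factor plus three genes
# with a true genotype effect.
rng = np.random.default_rng(2)
n_samples, n_genes = 300, 100
hidden = rng.normal(0.0, 3.0, size=n_samples)
loadings = rng.normal(1.0, 0.2, size=n_genes)
dosage = rng.integers(0, 3, size=n_samples).astype(float)
expr = np.outer(hidden, loadings) + rng.normal(0.0, 0.5, size=(n_samples, n_genes))
expr[:, :3] += 0.4 * dosage[:, None]

hits = {k: count_hits(expr, dosage, k) for k in (0, 1, 2, 5)}
print(hits)
```

    Comparing the counts across values of k gives a rough sense of how sensitive the results are to this choice; whether the additional eQTLs are viable still has to be judged separately, as the answer notes.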

    Points to Consider

    The following are some points to consider when conducting an eQTL study.

    • The optimal strategy for eQTL discovery in a specific dataset, chosen from among the many possible normalization procedures, non-specific gene filters, search radii, and minor allele frequency cutoffs, may not be transferable to another eQTL study. Many scientists overcome this using greedy tuning of these parameters, running the eQTL study iteratively until a maximum number of significant eQTLs is found.

    • It is important to note that eQTL studies only find correlation between genetic markers and gene expression patterns, and do not imply causation.

    • When conducting an eQTL study, note that most significant eQTLs are found within a few kb of the regulated gene.

    • Historically, most eQTL studies have been found to be about 30-40% reproducible. This is a consequence of how each dataset is structured and of the different normalization and filtering strategies the respective researchers use. However, eQTLs that are found consistently in two or more cohorts follow a similar pattern of expression influence within each of the cohorts.

    • Many eQTLs are tissue-specific; that is, they may influence gene expression in one tissue but not in another. A possible explanation is the co-regulation of a single gene by multiple eQTLs, which depends on the gene having multiple alleles.


    31.3: Structure of an eQTL Study is shared under a not declared license and was authored, remixed, and/or curated by LibreTexts.
