Biology LibreTexts

Section 2: Characterization


    For each tumor, our aim is to obtain a complete, base-level characterization of that tumor, its evolutionary history, and the mechanisms that shaped it. Massively parallel sequencing can provide the base-level genome characterization, but this approach brings several challenges.

    1. Massive amounts of data

    The main challenge with larger amounts of data is the increase in the computational power required to analyze it, along with the storage costs of keeping track of all the sequenced genomes. An analysis pipeline (automated, standardized, reproducible) is also needed so that findings are consistent across the different characterization efforts. Finally, we need new ways of visualizing and reporting on large-scale data.

    2. Sensitivity / Specificity

    Cancer characterization starts with the proper identification of single nucleotide mutations present in cancer cells, and with the removal of as many false positives as possible. When tumor samples are selected, the extracted DNA is a mix of normal genomes and complex tumor genomes. The mutational allelic fraction (the fraction of DNA molecules from a locus that carry a mutation) is used to study the significance of a mutation and its prevalence in the cancer subtype. This fraction depends on the purity of the tumor sample, the local copy number, the multiplicity of the mutation, and the cancer cell fraction (CCF, the fraction of cancer cells that carry the mutation). Clonal mutations are carried by all cancer cells, and sub-clonal mutations are carried by a subset of the tumor cells.
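
    The dependence of the allelic fraction on purity, local copy number, multiplicity, and CCF can be made concrete with a small calculation. The sketch below is an illustration rather than a method taken from this text: it assumes the usual mixture model in which normal cells contribute a fixed number of unmutated copies (two by default), and the function and parameter names are hypothetical.

```python
def expected_allele_fraction(purity, tumor_copy_number, multiplicity, ccf,
                             normal_copy_number=2):
    """Expected fraction of DNA molecules at a locus that carry a mutation.

    Assumed mixture model (not from the source text):
      purity            fraction of cells in the sample that are cancer cells
      tumor_copy_number total copies of the locus in each cancer cell
      multiplicity      mutated copies per cancer cell that carries the mutation
      ccf               fraction of cancer cells that carry the mutation
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    if not 0.0 <= ccf <= 1.0:
        raise ValueError("ccf must be in [0, 1]")
    # Mutated copies come only from cancer cells that carry the mutation.
    mutant_copies = purity * ccf * multiplicity
    # All copies at the locus: cancer cells plus contaminating normal cells.
    total_copies = (purity * tumor_copy_number
                    + (1.0 - purity) * normal_copy_number)
    return mutant_copies / total_copies


# A clonal heterozygous mutation in a 60% pure, copy-neutral sample.
clonal = expected_allele_fraction(0.6, 2, 1, 1.0)
# The same mutation if only half of the cancer cells carry it.
subclonal = expected_allele_fraction(0.6, 2, 1, 0.5)
```

    Under these assumptions, halving the CCF halves the expected allelic fraction, which is why sub-clonal mutations in impure samples are supported by few reads.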

    As well as detecting the presence of clonal and subclonal mutations, proper analysis requires the removal of false positive mutation calls. Two types of false positives are sequencing errors and germline mutations. Sequencing errors can come from misread bases, sequencing artifacts, and misaligned reads, while germline mutations usually occur in predictable places in the genome (1000 per Mb known, 10-20 per Mb novel). With multiple reads of the same sequence, the likelihood that the same sequencing error is repeated drops rapidly, and by knowing where in the genome a germline mutation is likely, a filter can correct for the additional false positive probability. The overall sensitivity of detecting single nucleotide variations depends on the frequency of background mutations and the number of reads that support the alternate allele.
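
    How quickly repeated errors become unlikely can be estimated with a binomial tail probability. The sketch below is a simplification under stated assumptions: errors are independent across reads, every read has the same per-base error rate, and an erroneous base is equally likely to be any of the three wrong bases. The error rate and depth in the example are hypothetical values, not figures from this text.

```python
from math import comb


def prob_at_least_k_errors(n_reads, k, per_base_error):
    """Probability that at least k of n_reads show the same specific wrong base.

    Assumes independent errors and a uniform choice among the three
    possible wrong bases (both are simplifying assumptions).
    """
    p = per_base_error / 3.0  # chance of one particular wrong base per read
    return sum(
        comb(n_reads, i) * p**i * (1.0 - p) ** (n_reads - i)
        for i in range(k, n_reads + 1)
    )


# Hypothetical example: 30 reads covering a site, error rate 0.001 per base.
one_or_more = prob_at_least_k_errors(30, 1, 0.001)
three_or_more = prob_at_least_k_errors(30, 3, 0.001)
```

    Requiring several supporting reads therefore suppresses random sequencing errors strongly, although it does not protect against systematic artifacts or misalignment, which repeat across reads.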

    A third type of false positive can come from cross-patient contamination, when the tumor sample contains DNA from another person. ContEst is a method that detects such contamination by comparing the sequencing data to a SNP array.

    A mutation caller is a classifier that asks, at every genomic locus, "Is there a mutation here?" These classifiers are evaluated using Receiver Operating Characteristic (ROC) curves, which depend on the allele fraction, the coverage of the tumor and normal samples, and sequencing and alignment noise. MuTect is a highly sensitive somatic mutation caller. Its pipeline is as follows: the tumor and normal samples are scored with a variant detection statistic (which compares the variant model to the null hypothesis); the results are passed through site-based filters (proximal gap, strand bias, poor mapping, tri-allelic site, clustered position, observed in control), then compared to a panel of normal samples, and finally classified as candidate variants. Because MuTect can detect low allele fraction mutations, it is suited to studying impure and heterogeneous tumors.
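
    The idea of a statistic that compares a variant model to the null hypothesis can be illustrated with a log-odds score over the reads at one site. The sketch below is a simplified illustration and not MuTect's implementation: it assumes independent reads, a known per-base error probability for each read, and a variant allele fraction estimated as the observed fraction of alternate reads. All function names are hypothetical.

```python
from math import log10


def read_likelihood(base, ref, alt, allele_fraction, error):
    """Likelihood of one observed base given a variant at allele_fraction."""
    if base == ref:
        # Either a true reference read, or a variant read misread as reference.
        return allele_fraction * (error / 3.0) + (1.0 - allele_fraction) * (1.0 - error)
    if base == alt:
        # Either a true variant read, or a reference read misread as the variant.
        return allele_fraction * (1.0 - error) + (1.0 - allele_fraction) * (error / 3.0)
    return error / 3.0  # a third base can only arise from an error


def log_likelihood(reads, ref, alt, allele_fraction):
    """Sum of log10 likelihoods over (base, error) pairs; errors must be > 0."""
    return sum(
        log10(read_likelihood(base, ref, alt, allele_fraction, error))
        for base, error in reads
    )


def variant_log_odds(reads, ref, alt):
    """Log10 odds of the variant model against the null model (fraction 0)."""
    if not reads:
        raise ValueError("at least one read is required")
    n_alt = sum(1 for base, _ in reads if base == alt)
    estimated_fraction = n_alt / len(reads)
    return (log_likelihood(reads, ref, alt, estimated_fraction)
            - log_likelihood(reads, ref, alt, 0.0))


# Hypothetical pileups: 30 reads, error 0.001 each, with 0, 1 or 3 'T' reads.
no_alt = [("C", 0.001)] * 30
one_alt = [("C", 0.001)] * 29 + [("T", 0.001)]
three_alt = [("C", 0.001)] * 27 + [("T", 0.001)] * 3
scores = [variant_log_odds(r, "C", "T") for r in (no_alt, one_alt, three_alt)]
```

    A caller built on such a score would call a candidate when the log odds exceed a chosen threshold; sweeping that threshold is what traces out an ROC curve.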

    3. Discovering mutational processes

    Instead of detecting the presence of mutations in cancer genes, a different approach is to discover whether there are specific patterns among the mutations in the cancer samples. A "Lego plot" is a way to visualize patterns of mutations: each color represents one of the 6 types of base pair substitution, and the height of each bar gives the frequency of that substitution in one of the 16 different contexts (neighboring nucleotides) in which it can occur. The specific types of mutagenic events in each type of cancer can be plotted and analyzed. As an example, a novel mutation pattern (AA > AC) is found in esophageal cancer. Cancers can be grouped by these specific mutational spectra. Dimensionality reduction using non-negative matrix factorization (NMF) of Lego plot data can be used to identify fundamental spectral signatures.
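
    The 6 substitution types and 16 contexts give 6 x 16 = 96 categories, and tallying mutations into them produces the data behind a Lego plot. The sketch below is a minimal illustration with hypothetical function names. It assumes the common convention of reporting every mutation from the strand on which the mutated base is a pyrimidine (C or T), which is what reduces the 12 possible substitutions to 6.

```python
from collections import Counter
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

# 6 substitution types x 16 (5' base, 3' base) contexts = 96 channels.
ALL_CHANNELS = [
    f"{five}[{sub}]{three}"
    for sub in SUBSTITUTIONS
    for five, three in product("ACGT", repeat=2)
]


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def mutation_channel(trinucleotide, alt):
    """Map a mutation (reference trinucleotide, new middle base) to a channel."""
    if trinucleotide[1] in "AG":
        # Purine reference base: describe the mutation on the opposite strand.
        trinucleotide = reverse_complement(trinucleotide)
        alt = COMPLEMENT[alt]
    five, ref, three = trinucleotide
    return f"{five}[{ref}>{alt}]{three}"


def lego_counts(mutations):
    """Count mutations per channel, in the fixed order of ALL_CHANNELS."""
    counts = Counter(mutation_channel(tri, alt) for tri, alt in mutations)
    return [counts[channel] for channel in ALL_CHANNELS]


# Hypothetical mutations: the first two are the same event on opposite strands.
example = lego_counts([("CGT", "A"), ("ACG", "T"), ("TTA", "C")])
```

    Stacking one such 96-entry vector per sample gives a non-negative matrix of the kind that NMF can decompose into signatures.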

    4. Estimating purity, ploidy and cancer cell fractions

    As well as detecting mutations in cancer cells, removing false positives, and detecting patterns of mutations, a proper characterization of each tumor sample is required. Because of heterogeneity and sample impurities, the purity, absolute copy number, and cancer cell fraction (CCF) of the sequenced tumor sample must be estimated to get the correct total number and prevalence of the mutated alleles.
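
    Once purity and absolute copy number are known, the CCF of a mutation can be estimated from its read counts. The sketch below is one simple way to do this and is not the specific method used by any tool named in this text: it assumes a binomial model for the alternate read count, a flat prior over a grid of CCF values, and a known multiplicity. All names and the example numbers are hypothetical.

```python
from math import comb


def ccf_posterior(alt_reads, total_reads, purity, tumor_copy_number,
                  multiplicity=1, normal_copy_number=2, grid_size=100):
    """Posterior over CCF on an evenly spaced grid, assuming a flat prior."""
    total_copies = (purity * tumor_copy_number
                    + (1.0 - purity) * normal_copy_number)
    grid = [i / grid_size for i in range(1, grid_size + 1)]
    likelihoods = []
    for ccf in grid:
        # Expected allele fraction if this share of cancer cells is mutated.
        fraction = purity * ccf * multiplicity / total_copies
        likelihoods.append(
            comb(total_reads, alt_reads)
            * fraction**alt_reads
            * (1.0 - fraction) ** (total_reads - alt_reads)
        )
    norm = sum(likelihoods)
    if norm == 0.0:
        raise ValueError("read counts are incompatible with every grid value")
    return grid, [value / norm for value in likelihoods]


# Hypothetical example: 15 of 100 reads, 60% purity, two copies in the tumor.
grid, posterior = ccf_posterior(15, 100, purity=0.6, tumor_copy_number=2)
most_likely_ccf = grid[posterior.index(max(posterior))]
```

    A posterior concentrated near 1 suggests a clonal mutation, while one concentrated well below 1 suggests a sub-clonal one; the width of the distribution reflects how much the read depth limits that conclusion.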

    5. Tumor heterogeneity and evolution

    Samples can have large distributions of point mutations and copy number alterations, but a Bayesian clustering algorithm can help identify the mutations and copy number alterations in distinct subpopulations.


    Section 2: Characterization is shared under a not declared license and was authored, remixed, and/or curated by LibreTexts.
