Imagine this: you are a police officer on patrol and you receive a call that several 30-year-old Caucasian males were seen breaking into homes and stealing heirlooms in a nearby neighborhood. The suspects were last seen entering a convention center and, to your dismay, you arrive to find that the convention center is hosting an antique show containing several 30-year-old Caucasian males carrying heirlooms. How do you apprehend the perpetrators? You could arrest everyone who fits the description and interrogate them. Alternatively, you could scan the crowd for signs of a group that does not belong, or radio the police station for more information to narrow down the crowd. Needless to say, without more contextual information to guide your judgment, you may arrest the wrong men and let the criminals go free.
This is where cancer genomics is today: the sophistication of sequencing techniques has produced datasets that can detect every genomic mutation within cancer cells. Unfortunately, mutation rates are not equal among all genes. While this may seem a non-issue, it could lead scientists to conclude that a mutated gene is associated with cancer when, in fact, the gene that “matches the description” is simply more susceptible to mutation and has no role in oncogenesis. This is exactly what happened to researchers who found high mutation rates in olfactory genes within lung cancer [1]. Doubtful of the role of olfactory genes in lung tumorigenesis, these scientists ultimately concluded that the mutation of olfactory genes had no role in the transformation of lung epithelial cells [1].
In Nature, Lawrence et al. explored this issue further, showing that failure to correct for the variability of mutation rates across the genome can lead to false positives for cancer-associated genes [1]. To illustrate the importance of incorporating heterogeneity into data analysis, the authors compared datasets with similar mutation frequencies to datasets with different average mutation frequencies and found that, when variability in mutation rates was not taken into account, the number of genes falsely categorized as cancer-associated increased. Furthermore, the authors demonstrate that analyzing larger sample sizes, as in the “big data” datasets of the American Society of Clinical Oncology’s “CancerLinQ™” [2] and the Cancer Genome Atlas [3], without correcting mutation rates may increase the number of false positives, because a larger sample lowers the excess of mutations needed to reach statistical significance. Lawrence et al. postulated that three kinds of heterogeneity, left uncorrected, can distort the detection of cancer-associated genes: heterogeneity in mutation rates among samples of the same cancer type (patient-specific context), heterogeneity in mutation rates based on the nucleotides surrounding a mutated base (sequence-specific context), and heterogeneity in mutation rates based on when the gene is replicated or how much it is transcribed (replication/transcription-specific context). Using the mutated olfactory genes mentioned above, along with 3,083 tumor-normal pairs spanning 27 cancer types, the authors demonstrate the importance of these context-dependent mutation rates and construct an algorithm for context-based analysis, called MutSigCV.
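The sample-size effect described above can be illustrated with a toy calculation. The sketch below is not MutSigCV and is not taken from the paper: it assumes mutation counts follow a Poisson distribution, and every rate, gene length, and cohort size is a made-up number chosen only for illustration. It tests a single "passenger" gene whose true background rate is five times the genome-wide average, first against a uniform null (the average rate) and then against a gene-specific null.

```python
import math

def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam) by summing the upper tail."""
    if k <= 0:
        return 1.0
    # First term P(X = k), computed in log space to avoid overflow.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    while True:
        total += term
        i += 1
        term *= lam / i  # P(X = i) derived from P(X = i - 1)
        if i > lam and term <= 1e-17 * total:
            break
    return min(total, 1.0)

# All numbers below are hypothetical, chosen only for illustration.
GENE_LENGTH = 1_000                 # bases in the gene
AVERAGE_RATE = 2e-6                 # genome-wide mutations per base per sample
TRUE_GENE_RATE = 5 * AVERAGE_RATE   # this gene simply mutates 5x faster

def p_values(n_samples):
    """p-values for one passenger gene under a uniform vs. gene-specific null."""
    # Observed count is set to its expectation under the gene's true rate.
    observed = round(n_samples * GENE_LENGTH * TRUE_GENE_RATE)
    uniform = poisson_upper_tail(observed, n_samples * GENE_LENGTH * AVERAGE_RATE)
    corrected = poisson_upper_tail(observed, n_samples * GENE_LENGTH * TRUE_GENE_RATE)
    return uniform, corrected

results = {n: p_values(n) for n in (100, 1_000, 10_000)}
for n, (uniform, corrected) in results.items():
    print(f"n={n:>6}  uniform-null p={uniform:.2e}  gene-specific-null p={corrected:.2f}")
```

Under these assumptions, the p-value against the uniform null shrinks rapidly as the cohort grows, so the gene eventually looks "significant" even though nothing about it has changed, while the p-value against the gene-specific null stays well above 0.05 at every cohort size.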
Lawrence et al. studied 3,083 tumor-normal pairs across 27 tumor types and compared average mutation rates both between cancer types and between samples of the same type. Among all pairs and tumor types, the authors found roughly 1,000-fold variation in the median frequency of mutations. The lowest mutation frequencies were in hematological and pediatric cancers, while the highest were in tumors induced by environmental factors, such as smoking and radiation. Given how much mutation rates vary, this underscores the importance of treating different cancer types, as well as different patients with the same cancer, with a context-specific treatment protocol.
However, even after correcting for mutation frequencies attributable to tissue type, cancer type, and known carcinogens, the authors still found high mutational variability among samples of the same cancer type. Since this variability cannot be wholly accounted for by carcinogens, Lawrence et al. postulated that the nucleotide makeup of the gene sequence may play a role. The authors tested mutational heterogeneity in multiple tumors by tallying 96 possible mutations (taking flanking bases into account), which were simplified into a radial chart for analysis [1]. Lawrence et al. found that tumor types clustered according to which mutations, in which flanking-nucleotide contexts, were most common (for instance, lung cancers had a very high frequency of C-to-A mutations); a given pattern was predominant, but still varied, within a given cancer type.
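The figure of 96 follows from simple counting: six substitution classes (once each mutation is expressed relative to the pyrimidine of the base pair) multiplied by four possible bases on the 5' side and four on the 3' side. A minimal sketch of that bookkeeping follows; the `5'[ref>alt]3'` label format and the function names are illustrative conventions assumed here, not something specified in the paper.

```python
from itertools import product

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def all_contexts():
    """Enumerate every substitution class together with its flanking bases."""
    contexts = []
    for ref in "CT":                      # pyrimidine reference bases only
        for alt in BASES:
            if alt == ref:
                continue
            for five_prime, three_prime in product(BASES, repeat=2):
                contexts.append(f"{five_prime}[{ref}>{alt}]{three_prime}")
    return contexts

def classify(trinucleotide, alt):
    """Label a mutation of the middle base of `trinucleotide` to `alt`.

    Mutations at a purine (A or G) are flipped to the opposite strand so
    that the reference base is always C or T.
    """
    if trinucleotide[1] in "AG":
        trinucleotide = reverse_complement(trinucleotide)
        alt = COMPLEMENT[alt]
    five_prime, ref, three_prime = trinucleotide
    return f"{five_prime}[{ref}>{alt}]{three_prime}"

CONTEXTS = all_contexts()
print(len(CONTEXTS))            # number of distinct categories
print(classify("TGA", "T"))     # a G>T change reported on the opposite strand
```

Counting how often each of these labels occurs in a tumor gives the kind of mutational spectrum that can then be compared between samples and cancer types.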
While both the variation in median mutation rates and the predominance of specific sequence mutations within cancer types were significant, the most important aspect of mutational heterogeneity seems to be regional variation across the genome, which accounts for differences of more than fivefold in median mutation rates [1]. Lawrence et al. credited this to two factors: how much a gene is transcribed and the time at which that section of DNA is replicated. The authors discovered that mutation rates are highest in genes with low rates of transcription and late DNA replication. Comparing the falsely implicated olfactory receptor genes to known cancer-associated genes, Lawrence et al. demonstrate different transcription rates and replication times: olfactory genes are expressed at lower rates and replicate later, whereas cancer-associated genes have higher transcription rates and earlier replication times. In other words, while normal and cancer-associated genes both accumulate mutations, the events that lead to these mutations are different. Thus, without weighing mutation rates against replication timing and transcription, one may falsely assume that a similar mutation level marks a cancer-associated gene.
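One way to picture a background rate that respects these covariates is to estimate it for each gene from other genes that resemble it in expression level and replication timing. The sketch below is only loosely in that spirit and should not be read as the authors' implementation: the gene names, covariate values (assumed to be scaled to the range 0 to 1), mutation counts, and the choice of three nearest neighbors are all hypothetical. Silent mutations are used on the assumption that they are not under selection and so reflect the background process.

```python
import math
from collections import namedtuple

Gene = namedtuple(
    "Gene", "name expression replication_time silent_mutations covered_bases"
)

# Hypothetical toy data. expression and replication_time are assumed to be
# scaled to 0..1 (replication_time: 0 = early, 1 = late).
GENES = [
    Gene("late_low_1", 0.05, 0.90, 30, 1_000_000),
    Gene("late_low_2", 0.10, 0.85, 28, 1_000_000),
    Gene("late_low_3", 0.08, 0.95, 33, 1_000_000),
    Gene("late_low_4", 0.12, 0.88, 29, 1_000_000),
    Gene("early_high_1", 0.90, 0.10, 5, 1_000_000),
    Gene("early_high_2", 0.85, 0.15, 6, 1_000_000),
    Gene("early_high_3", 0.95, 0.05, 4, 1_000_000),
    Gene("early_high_4", 0.80, 0.20, 7, 1_000_000),
]

def neighbor_background_rate(target, genes, k=3):
    """Estimate a background rate from the k genes most similar to `target`.

    Similarity is Euclidean distance in (expression, replication_time)
    space; the rate is pooled silent mutations per covered base.
    """
    others = [g for g in genes if g.name != target.name]
    others.sort(
        key=lambda g: math.dist(
            (g.expression, g.replication_time),
            (target.expression, target.replication_time),
        )
    )
    nearest = others[:k]
    mutations = sum(g.silent_mutations for g in nearest)
    bases = sum(g.covered_bases for g in nearest)
    return mutations / bases

late_rate = neighbor_background_rate(GENES[0], GENES)    # olfactory-like gene
early_rate = neighbor_background_rate(GENES[4], GENES)   # highly expressed gene
print(f"late/low-expression background:   {late_rate:.2e} per base")
print(f"early/high-expression background: {early_rate:.2e} per base")
```

With this toy data, the late-replicating, weakly expressed gene is assigned a background rate several times higher than the early-replicating, highly expressed one, so the same raw mutation count would count as far weaker evidence for the former.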
In the end, the authors concluded that “the rich variation in mutational spectrum across tumours underscores the problems with using an overly simplistic model of the average mutational process for a tumour type and failing to account for heterogeneity within a tumor type.” They state that their new analysis algorithm, MutSigCV, takes these context-dependent nuances into account, allowing for cancer genomic analysis of mutations that eliminates these false positives. Using MutSigCV, Lawrence et al. were able to take a list of 450 suspected cancer-associated genes in lung carcinoma and narrow it to 11 genes, genes shown to be linked to cancer [1]. This underscores the importance of context-specific analysis of big data in cancer genomics. Without such a process, using whole-genome sequencing and mutation rates to find novel drug targets may be inadequate, sending many pharmaceutical and biotech companies toward therapeutic targets that, while they look like the right suspect, are just innocent bystanders that “fit the description”.
1 Lawrence, M. S. et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature, doi:10.1038/nature12213 (2013).
2 DeMartino, J. K. & Larsen, J. K. Data Needs in Oncology: “Making Sense of The Big Data Soup”. Journal of the National Comprehensive Cancer Network 11, S-1-S-12 (2013).
3 Cancer Genome Atlas Research Network. Comprehensive genomic characterization of squamous cell lung cancers. Nature 489, 519-525, doi:10.1038/nature11404 (2012).