Purity of TCGA tumour samples
We obtained gene expression profiles (RNA-seqV2), DNA methylation profiles (HumanMethylation450) and immunohistochemistry (IHC) analysis for 9,364 tumour samples and 1,958 adjacent normal samples across 21 solid tumour types from the TCGA repository (Supplementary Table 1). For each sample, we assigned purity estimates using four methods: ESTIMATE, which uses gene expression profiles of 141 immune genes and 141 stromal genes6; ABSOLUTE, which uses somatic copy-number data (estimations were available for only 11 cancer types)7; LUMP (leukocytes unmethylation for purity), which averages 44 non-methylated immune-specific CpG sites (Supplementary Fig. 1 and Methods); and IHC, as estimated by image analysis of haematoxylin and eosin-stained slides produced by the Nationwide Children's Hospital Biospecimen Core Resource. All estimates are in Supplementary Data 1. The purity estimates from the DNA, RNA and methylation-based methods had high concordance in most cancer types (Fig. 1a; Supplementary Fig. 2). The correlation of the three genomic-based methods with IHC was lower in all cancer types (Supplementary Fig. 3). However, for all three methods, the correlation coefficients with IHC were positive in all cancer types, suggesting that IHC provides a qualitative estimation of purity.
Tumour purity of TCGA cancer types.
Next, we determined the average purity level of all associated tumour samples for each cancer type, according to each method. In accordance with the results above, there was high concordance among the DNA, RNA and methylation-based methods, and little to no agreement with IHC (Supplementary Fig. 4). Average tumour purity across all samples from all cancer types was 81.1±13.9%, 76.1±16.1% and 75.7±21.2% (mean±s.d.) for ESTIMATE, LUMP and IHC, respectively. An exception was ABSOLUTE, with an average estimate of 62.3±19.9%. This difference may be explained by methodological differences: while ABSOLUTE is a direct measure of the tumour cells in a sample, ESTIMATE and LUMP estimate purity indirectly by measuring immune and stromal counterparts in the sample. Thus, the difference in the average estimates is arguably due to the presence of non-immune, non-stromal cells in a sample, such as contaminating adjacent normal cells, which are not measured by ESTIMATE and LUMP.
We used these methods to derive a consensus measurement of purity estimations (CPE). CPE is the median purity level after normalizing levels from all methods to give them equal means and s.d. (75.3±18.9%). CPE is used in all analyses below, but results were consistent with the majority of methods when individually tested (presented in Supplementary Figs 1–15). Regardless of the method used, purity patterns differ by cancer type, with high purity (>90%) in adrenocortical carcinoma (ACC) and in brain-originating tumours such as lower-grade glioma (LGG), and low purity (<70%) in cancers resulting from chronic mutagenic exposures, such as lung adenocarcinoma (LUAD), lung squamous cell carcinoma and head and neck squamous cell carcinoma (Fig. 1b; Supplementary Fig. 5).
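The CPE construction described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function name `consensus_purity`, the choice of the common target mean and s.d. (here, the averages of the per-method means and s.d.), and the handling of missing estimates (a method is simply skipped for samples it did not score) are all assumptions made for the example.

```python
from statistics import mean, median, pstdev


def consensus_purity(estimates, target_mean=None, target_sd=None):
    """Median of per-method purity estimates after rescaling every method
    to a common mean and s.d.

    estimates: dict mapping a method name to a list of purity values in the
    same sample order; None marks a missing estimate (assumed convention).
    """
    methods = list(estimates)
    n_samples = len(estimates[methods[0]])

    # Per-method mean and s.d. over the samples that have an estimate.
    stats = {}
    for m in methods:
        observed = [v for v in estimates[m] if v is not None]
        stats[m] = (mean(observed), pstdev(observed))

    # Assumed default: the common scale is the average of the method scales.
    if target_mean is None:
        target_mean = mean(mu for mu, _ in stats.values())
    if target_sd is None:
        target_sd = mean(sd for _, sd in stats.values())

    consensus = []
    for i in range(n_samples):
        rescaled = []
        for m in methods:
            v = estimates[m][i]
            if v is None:
                continue
            mu, sd = stats[m]
            # A method with zero spread carries no ranking information.
            z = 0.0 if sd == 0 else (v - mu) / sd
            rescaled.append(target_mean + target_sd * z)
        consensus.append(median(rescaled) if rescaled else None)
    return consensus
```

Taking the median rather than the mean keeps a single outlying method from dominating the consensus for a given sample, which matters here because ABSOLUTE and IHC behave differently from the other two methods.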
This variation in purity levels resembles variation in mutational rates between cancer types, which are commonly associated with more robust response to therapies targeting immune checkpoint pathways20,23. We examined correlations between median purity levels and median mutational burden of cancer types, and found a significant association between them (Pearson correlation R=−0.60, P=0.004; Fig. 2; Supplementary Fig. 6). This association seems to be highly consistent throughout the range of mutational burden. The only cancer types not following this strong correlation were thyroid papillary carcinoma (THCA), breast invasive carcinoma (BRCA) and kidney renal clear cell carcinoma (KIRC), each with lower mutational burden than expected by purity; and ACC and skin cutaneous melanoma (SKCM), each being higher than expected. Interestingly, ACC and SKCM had the lowest missense:silent ratio of all cancer types, while BRCA had the highest, suggesting an underlying force in those cancer types that drives them away from the purity-mutational burden curve. An analysis restricted to the remaining 16 cancer types gave an almost perfect correlation (Pearson correlation R=−0.94, P=4.4e−8). However, it is important to note that we did not observe significant negative correlations in a sample-by-sample correlation in specific tumour types, suggesting that this association is a property of cancer types rather than individual patients.
Tumour purity and mutational burden.
Within cancer types, we found major differences between samples. For example, in SKCM, 56 of the samples (11.8%) were highly pure (>90%), while 95 (20.0%) had poor purity (<60%). While only 1.9% of the samples had purity levels lower than the TCGA's minimum of 60% by IHC, 40.3, 8.9 and 18.5% had low purity according to ABSOLUTE, ESTIMATE and LUMP, respectively.
We investigated different samples from the same patient to determine whether they were concordant. Across all cancer types, 37 patients were analysed twice from two different portions of the same sample. We observed high concordance between the samples (Pearson correlation R=0.73, Supplementary Fig. 7). This result persisted even when analysing cancer types separately. For example, purity from the 10 LUAD patients with two samples was highly correlated (Pearson correlation R=0.82). We concluded that differences in purity levels among cancer patients and cancer types are robust, consistent and specific to the tumour.
Tumour purity versus clinical features and outcomes
The observation that tumour purity was maintained in different samples from the same tumour suggested that purity is an intrinsic property of the tumour. We sought to explore whether it was associated with clinical features. We examined associations between purity levels and all available clinical features provided by TCGA in each cancer type. We analysed 722 clinical features, spanning 299 unique features (Supplementary Data 2). Generally, characteristics including sex, age, ethnicity, alcohol use and smoking were not associated with tumour purity. However, we detected 11 associations (false discovery rate <1%) with features characterizing the tumour, most prominently with histological tumour analyses in different cancer types. Histological subtypes, which are classified based on cell type and pattern, are frequently treated as similar entities of the same cancer type, although there are obvious differences in the tumour's biological characteristics and prognosis. We observed differences in purity levels between the different histopathological subtypes of LGG, BRCA, THCA, and between cervical squamous and adenocarcinomas (CESC; Fig. 3a; Supplementary Fig. 8a–d). We additionally observed a consistent decrease in purity as tumour grade progressed in KIRC and LGG, which is consistent with the lower purity of glioblastoma (GBM) samples (grade 4), and in the primary grade of prostate adenocarcinoma (Fig. 3b; Supplementary Fig. 8e–g). In LGG, we found differences in purity at different tumour locations (Supplementary Fig. 8h). In BRCA, we also found differences between oestrogen receptor-positive and -negative samples (Supplementary Fig. 8i). The only significant non-pathological associations of purity we found were a history of thyroid gland disorder in THCA and presence of IDH1 mutation in LGG (Supplementary Fig. 8j–l). 
The latter association likely results from the fact that LGG tumours with wild-type IDH1 are molecularly and clinically similar to GBM24, which has lower purity levels. Divergent purity levels were found predominantly across pathologic diagnoses. Together with the lack of association between purity and patient characteristics, this suggests that purity differences are, at least in part, not an intrinsic characteristic of the tumour, but a result of the sampling by the surgeon and the difficulty of separating the tumour from its environment.
We employed a Cox proportional hazards regression analysis to test for association between purity and survival time. Purity estimates from three of the methods were associated with survival in KIRC and LGG (Fig. 3c; Supplementary Table 2). As described, purity in LGG samples differed between histological subtypes. Survival analysis was consistent with these findings, as astrocytomas tend to have poorer prognosis than other subtypes25. This result could also be explained by clinical outcomes associated with IDH1 mutation24, which is also associated with purity, as shown above. Our observation in KIRC may explain prognosis for this cancer as well, as we found lower purity in higher-grade tumours. These explanations reinforce our claim that purity differences are extrinsic.
Tumour purity confounds genomic analyses
We next examined the confounding effect of tumour purity on genomic analyses. We considered this effect in three commonly used bioinformatics methods: correlation, clustering and differential analysis. Our presentation focuses on gene expression profiles, but the analyses are equally applicable to other genomic measurements.
Correlative analyses are widely applied to genomics in the study of cancer. One key approach is the gene co-expression network, which assigns a score to a pair of genes based on their co-expression frequencies in different samples. Co-expression networks have been used extensively in cancer studies, with an aim to unravel hallmark pathways and prioritize novel candidate genes26. We found that identifying co-expression networks from genomics data without accounting for tumour purity is problematic. Gene expression profiles from bladder carcinoma illustrate the problem. For example, expression levels of colony-stimulating factor 1 receptor (CSF1R) and Janus kinase 3 (JAK3), tyrosine protein kinases and known cancer-driver genes27, are highly correlated with each other (Spearman correlation R=0.67, P<1e−20). Thus, one might suggest a shared co-expression network between them, which would be a novel finding. However, this correlation likely results from the high correlation of both genes with tumour purity (Fig. 4a).
Tumour purity confounds co-expression analysis.
We extended this observation to all available gene pairs. Strikingly, we found that the strongest gene networks, that is, groups of genes with correlated expression profiles, were composed of genes highly associated with purity (Fig. 4b; Supplementary Fig. 9). Group A, which contains 25.7% of the genes, was enriched with 60.0% of all co-expressing gene pairs (Spearman coefficient |R|>0.5), but also with genes negatively correlated with purity (91.1% of genes with R<−0.3). In total, 49.7% of co-expressing gene pairs were between genes that were both correlated with purity (|R|>0.3), compared with an expected ratio of only 0.6%. As expected, the group A gene ontology annotations were enriched with terms related to the immune system, but also with other terms such as extracellular matrix organization and other cellular functions (Supplementary Table 3). Group C, on the other hand, contained only genes positively correlated with purity. Those genes did not seem to share specific gene ontology annotations. While genes in both groups may be part of a shared co-expression network, the above analysis demonstrates that a correlation between them may be explained in large part by tumour purity. We attempted to address this bias by applying partial correlations that control for tumour purity in the co-expression analysis. The number of pairwise co-expressions in bladder carcinoma decreased by 39.7%, and the fraction of co-expressions between purity-associated genes decreased by 58.4%. Overall, in all 21 cancer types, we observed a 21.0% decrease in the number of pairwise co-expressions when controlling for purity (Fig. 4c; Supplementary Fig. 10), and a decrease of 48.7% in co-expressions when both genes are correlated with purity. This decrease was tightly correlated with the pairwise correlation of the genes with purity (defined as the product of the two genes' coefficients of correlation between expression and purity). 
For every 0.1 increase in the level of pairwise correlation with purity, we observed a 0.1 correlation decrease (Fig. 4d). We concluded that naive correlation between genomic profiling measures gives results that are highly confounded by tumour purity. We suggest that future co-expression analyses should employ partial correlation analysis by adjusting for tumour purity.
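The purity-adjusted co-expression described above can be illustrated with a first-order partial correlation on ranks. The sketch below is an assumption about one reasonable way to compute it, not the authors' code: it applies the standard partial-correlation formula to Spearman coefficients, and the function names (`ranks`, `pearson`, `spearman`, `partial_spearman`) are hypothetical helpers written for this example.

```python
from math import sqrt


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def partial_spearman(gene_a, gene_b, purity):
    """Correlation of two genes after removing what each shares with purity."""
    r_ab = spearman(gene_a, gene_b)
    r_ap = spearman(gene_a, purity)
    r_bp = spearman(gene_b, purity)
    denom = sqrt((1 - r_ap ** 2) * (1 - r_bp ** 2))
    if denom == 0:
        raise ValueError("a gene is perfectly rank-correlated with purity")
    return (r_ab - r_ap * r_bp) / denom
```

The numerator subtracts the product of the two gene-purity coefficients from the raw coefficient, which is the same product the text uses to define the pairwise correlation with purity. This is consistent with the observation that the drop in co-expression tracks that product.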
The subclassification of cancers based on genomic measurements has been a fundamental part of cancer research and therapeutics development in recent years. Numerous publications have applied molecular subtyping methods in different cancer types28,29, and have shown their power in facilitating precision medicine30. It should be emphasized that employing genomic measurements for subtyping tumours is distinct from histological subtyping by visual analysis, though there have been attempts to consolidate these two approaches. This study highlights the potential risk of confounding by tumour purity when applying unsupervised clustering for molecular subtyping. In three cancer types (breast, GBM and LUAD), the molecular subtypes and the subtyping method based on gene expression profiles are widely accepted, and in all three, we detected discrepancies in purity among subtypes. Four molecular subtypes of GBM have been proposed: classical, neural, proneural and mesenchymal31. Purity analysis on centroids of 840 genes revealed consistently lower purity in the mesenchymal and neural subtypes (Mann–Whitney U-test P=1.8e−9; Fig. 5a; Supplementary Fig. 11). Three LUAD subtypes have been proposed: magnoid, bronchoid and squamoid32. The classification utilizes centroids of 506 genes33. Again, purity is a dominant factor in distinguishing the three subtypes (Mann–Whitney U-test P=1.0e−9; Fig. 5b; Supplementary Fig. 12). We suspected that associations between purity and molecular subtyping resulted from the use of unsupervised clustering techniques, which emphasize genes that are associated with purity. Indeed, 47.1% and 45.4% of the genes used for subtyping in GBM and LUAD, respectively, were correlated with purity (|R|>0.3), compared with 21.2% and 10.7% of all genes (P=1.6e−18 and P=1.1e−50, Kolmogorov–Smirnov test; Fig. 5c). It should be noted that the differences in purity between subtypes might still be genuine and intrinsic characteristics of the subtypes. 
We suspect that this is the case in the molecular subtypes of BRCA. Our analysis detected differences in purity levels among the PAM50 molecular subtypes of BRCA34 (Supplementary Fig. 13); however, these differences are consistent with our finding of differences in oestrogen receptor status as obtained from pathologic analysis (Supplementary Fig. 8i). In the other cancer types, where classification is based on unsupervised clustering techniques and there are currently no non-molecular factors that distinguish subtypes, the confounding effect of tumour purity is alarming. We hypothesize that clustering with expression levels adjusted for purity will point to a different subtyping strategy for these samples.
Tumour purity confounds molecular subtyping.
Last, we analysed the confounding effect of purity on differential expression (DE) analysis. Identifying differentially expressed genes in tumours is an important tool for studying tumorigenesis, and has been routinely applied to identify diagnostic and prognostic markers and therapeutic targets. We use the term ‘purity' in a broad sense to define the proportions of non-immune counterparts in the sample, which can be calculated for both non-cancer and cancer samples, and can be estimated using ESTIMATE and LUMP. We applied a consensus estimate based on these two methods on normal samples of 13 cancer types with sufficient normal material (‘normal' describes adjacent non-tumour samples). We found high concordance between the two methods in all cancer types except LUAD (Supplementary Fig. 14a). We also observed high concordance between average purity estimates of TCGA normal samples and purity estimates of equivalent tissues taken from the Genotype-Tissue Expression project (Supplementary Fig. 14b)35. We found substantial differences in purity levels among different tissues and among different samples from the same tissue. Moreover, in several cancer types, we observed large discrepancies between tumours and adjacent normal tissue (Fig. 6a; Supplementary Fig. 14c,d). For example, purity in normal kidney samples was, on average, 28.3% higher than in the KIRC cancer samples. On the other hand, purity in normal lung samples was, on average, 26.7% lower than in the lung squamous cell carcinoma cancer samples.
Differential expression analysis adjusted to tumour purity.
We used the DESeq2 package36 to apply DE analysis to RNA-seq counts of tumour and normal samples across a dozen cancer types with sufficient normal tissue for sampling. We compared our findings with a DE analysis designed to include purity estimates, which is equivalent to adjusting gene expression by purity. This comparison found numerous marked differences in relative expression levels. Many genes were differentially expressed before purity adjustment, but showed no differences between cancer samples and controls after adjustment. Some genes even changed state from up- to downregulation or vice versa. Most importantly, we found differentially expressed genes after adjustment that had not been identified before. Figure 6b illustrates expression patterns of the immunotherapy target cytotoxic T-lymphocyte-associated protein 4 (CTLA4) and its ligand, CD86 (also known as B7.2), in traditional and adjusted DE analyses in three cancer types. Standard DE analysis labelled both genes as highly upregulated in KIRC samples. However, most of the difference from healthy samples could be ascribed to differences in purity. In LUAD, on the other hand, CTLA4 was detected as upregulated only after accounting for purity, while the downregulation of CD86 was again a byproduct of purity. In THCA, this trend was reversed: CTLA4 appeared downregulated until purity adjustment, while CD86 was detected as upregulated only after adjustment.
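The effect of including purity in the DE design can be illustrated with a deliberately simplified stand-in. The sketch below is not DESeq2 and does not reproduce its negative binomial model: it uses ordinary least squares on log-scale expression for a single gene, and obtains the purity-adjusted tumour effect by regressing purity out of both the expression values and the tumour indicator first (the Frisch-Waugh-Lovell result for linear models). The function names and the input layout are assumptions made for the example.

```python
def _centered(values):
    """Values with their mean subtracted."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def slope(x, y):
    """Least-squares slope of y on x, with an intercept in the model."""
    cx, cy = _centered(x), _centered(y)
    return sum(a * b for a, b in zip(cx, cy)) / sum(a * a for a in cx)


def residuals(x, y):
    """Residuals of the least-squares fit of y on x (with intercept)."""
    b = slope(x, y)
    cx, cy = _centered(x), _centered(y)
    return [yy - b * xx for xx, yy in zip(cx, cy)]


def tumour_effect(log_expr, is_tumour, purity=None):
    """Estimated tumour-versus-normal difference in log expression.

    log_expr:  log-scale expression of one gene, one value per sample
    is_tumour: 1 for tumour samples, 0 for adjacent normal samples
    purity:    optional purity estimate per sample; when given, the effect
               is the tumour coefficient of log_expr ~ is_tumour + purity
    """
    if purity is None:
        return slope(is_tumour, log_expr)
    # Remove the part of both variables that purity explains, then compare.
    return slope(residuals(purity, is_tumour), residuals(purity, log_expr))
```

With this toy model, a gene whose expression depends only on purity shows an apparent tumour effect whenever tumours and normals differ in purity, and that effect vanishes once purity is supplied. Conversely, a genuine tumour effect can be masked or exaggerated before adjustment, which mirrors the three behaviours described for CTLA4 and CD86.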
On average, 13.7% of the genes originally considered as DE were lost, and 11.0% of genes were newly detected as DE after adjustment (Table 1; Supplementary Fig. 15). By ranking all genes by DE P value, we extracted genes that would have been missed in traditional analysis (Supplementary Data 2). We used Ingenuity Pathway Analysis37 to identify enriched pathways for genes that after adjustment were ranked twice as high as before adjustment; this analysis revealed significant enrichments of many immune-related pathways in different cancer types (Fig. 6; Supplementary Data 3). Notably, the analysis highlighted different T-cell activation pathways in different cancer types, particularly the CTLA4, CD28 and iCOS-iCOSL signalling pathways in T cells, which are the key pathways in anti-CTLA-4 immunotherapy treatments38. As illustrated in Fig. 6b, genes in these pathways were prone to being ignored in traditional gene expression analysis, as their expression was masked by sample heterogeneity. We propose that considering tumour purity in DE analysis should be an integral tool for the discovery of novel genes and pathways altered in tumorigenesis.
Comparison of traditional and purity-adjusted differential expression analyses.
What are risk factors and causes of cancer?
Anything that may cause a normal body cell to develop abnormally potentially can cause cancer. Many things can cause cell abnormalities and have been linked to cancer development. Some cancer causes remain unknown while other cancers have environmental or lifestyle triggers or may develop from more than one known cause. Some may be developmentally influenced by a person's genetic makeup. Many patients develop cancer due to a combination of these factors. Although it is often difficult or impossible to determine the initiating event(s) that cause a cancer to develop in a specific person, research has provided clinicians with a number of likely causes that alone or in concert with other causes, are the likely candidates for initiating cancer. The following is a listing of major causes and is not all-inclusive as specific causes are routinely added as research advances:
Chemical or toxic compound exposures: Benzene, asbestos, nickel, cadmium, vinyl chloride, benzidine, N-nitrosamines, tobacco or cigarette smoke (contains at least 66 known potential carcinogenic chemicals and toxins), and aflatoxin
Radiation: Uranium, radon, ultraviolet rays from sunlight, and ionizing radiation from alpha, beta, gamma, and X-ray-emitting sources
Pathogens: Human papillomavirus (HPV), EBV or Epstein-Barr virus, hepatitis viruses B and C, Kaposi's sarcoma-associated herpes virus (KSHV), Merkel cell polyomavirus, Schistosoma spp., and Helicobacter pylori; other bacteria are being researched as possible agents.
Genetics: A number of specific cancers have been linked to human genes and are as follows: breast, ovarian, colorectal, prostate, skin and melanoma; the specific genes and other details are beyond the scope of this general article so the reader is referred to the National Cancer Institute for more details about genetics and cancer.
It is important to point out that almost everyone has risk factors for cancer and is exposed to cancer-causing substances (for example, sunlight, secondary cigarette smoke, and X-rays) during their lifetime, but many individuals do not develop cancer. In addition, many people have the genes that are linked to cancer but do not develop it. Why? Although researchers may not be able to give a satisfactory answer for every individual, it is clear that the higher the amount or level of cancer-causing materials a person is exposed to, the higher the chance the person will develop cancer. In addition, the people with genetic links to cancer may not develop it for similar reasons (lack of enough stimulus to make the genes function). Some people may also have a heightened immune response that controls or eliminates cells that are or potentially may become cancer cells. There is evidence that even certain dietary lifestyles may play a significant role in conjunction with the immune system to allow or prevent cancer cell survival. For these reasons, it is difficult to assign a specific cause of cancer to many individuals.
Recently, other risk factors have been added to the list of items that may increase cancer risk. Specifically, red meat (such as beef, lamb, and pork) was classified by the International Agency for Research on Cancer as probably carcinogenic to humans; in addition, processed meats (salted, smoked, preserved, and/or cured meats) were classified as carcinogenic to humans. Individuals who eat a lot of barbecued meat may also increase their risk due to compounds formed at high temperatures. Other less defined situations that may increase the risk of certain cancers include obesity, lack of exercise, chronic inflammation, and hormones, especially those hormones used for replacement therapy. Other items such as cell phones have been heavily studied. In 2011, the World Health Organization classified cell phone low energy radiation as "possibly carcinogenic," but this is a very low risk level that puts cell phones at the same risk as caffeine and pickled vegetables.
Proving that a substance does not cause or is not related to increased cancer risk is difficult. For example, antiperspirants are considered to possibly be related to breast cancer by some investigators and not by others. The official stance by the NCI is "additional research is needed to investigate this relationship and other factors that may be involved." This unsatisfying conclusion is presented because the data collected so far is contradictory. Other claims that are similar require intense and expensive research that may never be done. Reasonable advice might be to avoid large amounts of any compounds even remotely linked to cancer, although it may be difficult to do in complex, technologically advanced modern societies.