Cancer Diagnosis Recording Comparison: CPRD Aurum and GOLD (2026)

Cancer epidemiology depends on data quality, and the way diagnoses are captured across primary and secondary care can shift estimates of incidence and survival more than expected. Where the data come from, and how they are linked, can change the results.

Introduction

The Clinical Practice Research Datalink (CPRD) provides access to two primary-care databases, CPRD Aurum and CPRD GOLD, which hold deidentified electronic health records from UK general practices. Although both are widely used, differences in coverage, coding systems, and observation windows can influence how completely and accurately cancer diagnoses are recorded. CPRD Aurum draws mainly on practices in England, while CPRD GOLD includes practices from across the UK. The data can be enriched by linking to Cancer Registry (CR) data, Hospital Episode Statistics (HES) Admitted Patient Care data, and Office for National Statistics (ONS) death data. These linkages let researchers estimate how many cancer diagnoses are missing from primary-care records by comparing them with registry and hospital data.

Prior work has shown CPRD data to be highly accurate and complete for many conditions, prescriptions, and deaths. Yet cancer-record validity has varied by cancer type. For example, one study of 116,769 patients found that about 10% of cancer cases identified in CPRD GOLD or HES lacked a confirmatory CR diagnosis, and up to 32% of cancers identified in CR were missing from CPRD GOLD. Linking HES and CR to CPRD Aurum or GOLD can improve case capture for certain cancer types and is particularly valuable when outcomes like cancer stage or grade—often not present in GP records—are needed.

Most past comparisons looked at a subset of cancers. This study aims to map how well cancer diagnoses are recorded across CPRD Aurum and GOLD for 19 cancer types and to assess how linked data (CR and HES) influence recording, with the goal of guiding researchers on data-source choice for cancer research. The objectives are to (i) measure and compare annual incidence rates (IRs) for 19 cancers using CPRD Aurum and GOLD, with and without CR/HES linkage; and (ii) estimate cancer-specific survival using the ONS death registry.

By evaluating completeness and accuracy across multiple sources, the study seeks to illuminate the strengths and limitations of primary-care data for cancer epidemiology. The results can help researchers decide when linkage is worth the trade-offs in sample size, geography, and timeliness, and when primary-care data alone may suffice.

Methods

Study Design

This retrospective cohort study estimated annual IRs and survival probabilities for 19 cancer types among England-based patients. Analyses were conducted separately for CPRD Aurum and GOLD, each with and without linkage to HES and CR, and using the ONS death register for survival data.

Data Sources

CPRD Aurum and GOLD hold deidentified primary-care data, including demographics and diagnoses. CPRD Aurum (2017–present) covers roughly a quarter of the UK population, with data from 1995 onward, derived mainly from England via EMIS Web. CPRD GOLD (1987–present) covers about 4% of the UK population, drawn from Scotland, Wales, Northern Ireland, and England via Vision. England-based CPRD data can be linked to HES, CR, and ONS. HES provides hospital episodes since 1997, CR contains registrable cancers, and ONS holds official death dates and causes. Death certification in England is legally required, so ONS data are considered well ascertained.

Study Population

The 19 cancers studied were: ALL, AML, bladder, brain, breast, colorectal, esophageal, gastric, head and neck, lung, melanoma, MM, neuroendocrine, ovarian, pancreatic, prostate, renal, thyroid, and uterine cancers. Diagnoses were identified using ICD-10 codes and CPRD coding. Analyses for prostate cancer were limited to men, and analyses for ovarian and uterine cancers to women. The same selection criteria applied to both CPRD Aurum and GOLD: patients had to meet CPRD quality standards, be eligible for linkage, and be registered for at least one day during 2011–2018. Eligibility for linkage depended on GP participation and consent to linkage, so the population was England-based. Incident cancers were defined as the first recorded diagnosis in any dataset for a given cancer type within the study window. Follow-up for survival ran to 2020, using ONS death data. (Note: some cancers may be misclassified as incident if prior history is incomplete.)
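The incident-case definition above can be illustrated with a minimal Python sketch. The record layout, patient IDs, and dates are hypothetical, and the sketch assumes one reading of the definition: the earliest diagnosis of a cancer type across all sources is taken first, and the case counts as incident only if that earliest date falls inside the 2011–2018 window. The study's actual SAS code is not shown in the source.

```python
from datetime import date

# Hypothetical diagnosis records: (patient_id, cancer_type, source, diagnosis_date).
records = [
    ("p1", "lung", "CPRD", date(2012, 3, 1)),
    ("p1", "lung", "HES", date(2012, 1, 15)),
    ("p1", "lung", "CR", date(2012, 2, 10)),
    ("p2", "breast", "CPRD", date(2010, 6, 5)),   # first recorded before the window
    ("p2", "breast", "CR", date(2013, 4, 20)),
    ("p3", "breast", "CR", date(2019, 1, 2)),     # first recorded after the window
]

WINDOW_START = date(2011, 1, 1)
WINDOW_END = date(2018, 12, 31)


def incident_cases(records, start, end):
    """Return {(patient_id, cancer_type): (first_date, source)} for incident cases.

    Assumption: a case is incident only if its earliest recorded diagnosis,
    across all sources, falls within [start, end].
    """
    earliest = {}
    for patient_id, cancer_type, source, dx_date in records:
        key = (patient_id, cancer_type)
        if key not in earliest or dx_date < earliest[key][0]:
            earliest[key] = (dx_date, source)
    return {k: v for k, v in earliest.items() if start <= v[0] <= end}


cases = incident_cases(records, WINDOW_START, WINDOW_END)
for (patient_id, cancer_type), (dx_date, source) in cases.items():
    print(patient_id, cancer_type, dx_date.isoformat(), source)
```

In this toy data only p1's lung cancer qualifies. Patient p2 illustrates the misclassification risk noted above: if the 2010 primary-care record were missing, the 2013 registry entry would be counted as incident.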

Analytical Datasets

Separate analytical datasets were created to compare cancer diagnoses across data sources: CPRD Aurum alone, HES alone, CR alone, combinations like HES-CR, and fully linked CPRD Aurum-HES-CR; similarly for CPRD GOLD. These datasets underpin cross-source comparisons and survival analyses.
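As a rough illustration of how such datasets relate to each other, the sketch below treats each source as a set of patients with an incident diagnosis of one cancer type and builds the combined datasets as set unions. The patient IDs are invented, and treating a linked dataset as the union of its sources is an assumption made for illustration. It also computes the share of cases each dataset captures relative to the fully linked reference.

```python
# Hypothetical sets of patient IDs with an incident diagnosis of one cancer type,
# as recorded in each source (illustrative values only).
cprd = {"p1", "p2", "p3", "p5"}
hes = {"p2", "p3", "p4"}
cr = {"p1", "p3", "p4", "p6"}

# Assumption: a linked dataset contains a case if any contributing source records it.
datasets = {
    "CPRD": cprd,
    "HES": hes,
    "CR": cr,
    "HES-CR": hes | cr,
    "CPRD-HES-CR": cprd | hes | cr,
}


def capture_proportions(datasets, reference="CPRD-HES-CR"):
    """Return the fraction of reference cases found in each dataset."""
    ref = datasets[reference]
    return {name: len(cases & ref) / len(ref) for name, cases in datasets.items()}


proportions = capture_proportions(datasets)
for name, proportion in proportions.items():
    print(f"{name}: {proportion:.0%}")
```

Because the reference is the union of all sources, no single source can exceed 100%, and the gap below 100% is the share of cases that a study using only that source would miss.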

Statistical Analysis

Incidence counts and IRs (per 100,000 person-years) were calculated annually (2011–2018) for each dataset. New cases were defined as the first-ever diagnosis of a cancer type. 95% CIs used Poisson assumptions. Cross-dataset comparisons were descriptive. Proportions of incident diagnoses captured by each dataset, relative to the fully linked reference, were computed. Survival was estimated with Kaplan–Meier methods for cancers diagnosed in 2011–2018, using deaths from any cause and cancer-specific deaths as outcomes. Fully linked CPRD-HES-CR data were tied to ONS mortality data for unadjusted survival estimates.
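The IR calculation can be sketched in a few lines of Python. The source states only that the 95% CIs used Poisson assumptions; the sketch below uses Byar's approximation to the Poisson interval as one common choice, which may differ from the method used in the study. The function name and the example counts are hypothetical.

```python
import math


def incidence_rate(cases, person_years, per=100_000, z=1.96):
    """Return (rate, lower, upper) per `per` person-years.

    The interval uses Byar's approximation to the Poisson confidence limits
    for the case count, then rescales the limits by person-time.
    """
    if person_years <= 0:
        raise ValueError("person_years must be positive")
    if cases == 0:
        lower_count = 0.0
    else:
        term = 1 - 1 / (9 * cases) - z / (3 * math.sqrt(cases))
        lower_count = cases * max(0.0, term) ** 3
    c1 = cases + 1
    upper_count = c1 * (1 - 1 / (9 * c1) + z / (3 * math.sqrt(c1))) ** 3
    scale = per / person_years
    return cases * scale, lower_count * scale, upper_count * scale


# Hypothetical example: 150 incident cases over 250,000 person-years in one year.
rate, lower, upper = incidence_rate(cases=150, person_years=250_000)
print(f"IR = {rate:.1f} per 100,000 person-years (95% CI {lower:.1f} to {upper:.1f})")
```

The interval widens quickly as counts fall, which is one reason relative differences between data sources are harder to interpret for rarer cancers.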

Limitations include: time-at-risk was not censored at the event, potentially underestimating IRs; some missing cancer codes in certain types; and lack of adjustment for stage, histology, or treatment in survival analyses. All analyses were performed in SAS 9.4.
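The first limitation can be made concrete with a small worked example. The sketch below uses four invented patients and compares person-time with and without censoring at diagnosis. All values are hypothetical and chosen only to show the direction of the bias.

```python
# Hypothetical patients: (years registered in the study window,
#                         years from entry to diagnosis, or None if never diagnosed).
patients = [
    (8.0, None),
    (8.0, 2.0),
    (6.0, None),
    (8.0, 5.0),
]


def person_years(patients, censor_at_event):
    """Sum time at risk, optionally stopping each patient's clock at diagnosis."""
    total = 0.0
    for registered, time_to_dx in patients:
        if censor_at_event and time_to_dx is not None:
            total += min(registered, time_to_dx)
        else:
            total += registered
    return total


n_cases = sum(1 for _, time_to_dx in patients if time_to_dx is not None)
py_uncensored = person_years(patients, censor_at_event=False)
py_censored = person_years(patients, censor_at_event=True)

rate_uncensored = n_cases / py_uncensored
rate_censored = n_cases / py_censored
print(f"Without censoring: {rate_uncensored:.3f} cases per person-year")
print(f"With censoring:    {rate_censored:.3f} cases per person-year")
```

Keeping diagnosed patients in the denominator after their diagnosis inflates person-time, so the uncensored rate is lower than the censored one. For rare outcomes in large populations the difference is usually small.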

Results

Details on sample sizes and attrition are provided in the study’s tables. The counts in the fully linked CPRD Aurum-HES-CR and CPRD GOLD-HES-CR datasets varied over time because of software transitions (Vision to EMIS) in data collection.

Cancer Type–Specific Incidence Rates

In CPRD Aurum and its linked datasets, IRs were generally similar across data sources in absolute terms, with larger relative differences for rarer cancers. The fully linked CPRD Aurum-HES-CR dataset yielded the highest IRs, suggesting the most complete case capture. Prostate, breast, lung, and colorectal cancers had consistently high IRs. Head and neck cancer showed substantial variation by data source, being lowest in CPRD Aurum alone and highest in fully linked datasets. For cancers typically diagnosed in primary care (e.g., breast, prostate), CPRD Aurum alone provided relatively higher capture; for cancers more often diagnosed in secondary care (e.g., gastric, renal, bladder), HES and CR linked data showed higher IRs.

CPRD GOLD and Linked Datasets

IR patterns for CPRD GOLD closely resembled those for Aurum, but the fully linked CPRD GOLD-HES-CR dataset consistently showed higher IRs than any single source, indicating improved capture with linkage. Temporal trends in CR-derived IRs were similar between Aurum and GOLD.

Proportion of Incident Diagnoses by Dataset and Year

Compared with the fully linked reference, the proportions of diagnoses captured by CPRD Aurum, HES, and CR were broadly similar for many cancers. CPRD Aurum captured higher proportions for several common cancers (breast, prostate, melanoma, colorectal, pancreatic, renal, brain, thyroid). HES captured higher proportions of AML, ALL, MM, bladder, gastric, head and neck, and uterine cancers. CR data generally captured fewer cancers than CPRD Aurum or HES, and this pattern persisted over time. Similar patterns were observed for CPRD GOLD.

Survival Analyses

In the fully linked Aurum dataset, survival estimated with death from any cause as the endpoint was lower over time than survival estimated with cancer-specific death for most cancers. The difference was less pronounced for highly aggressive cancers (e.g., pancreatic). Survival patterns in the fully linked GOLD dataset aligned with those in Aurum.
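The gap between the two endpoints can be reproduced with a small Kaplan–Meier sketch in plain Python. The follow-up times and causes of death are invented. Treating deaths from other causes as censored observations for the cancer-specific curve is an assumption about how such an analysis is commonly set up, not a detail confirmed by the source.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct time with at least one event.

    times: follow-up time per patient.
    events: True if the endpoint occurred at that time, False if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Hypothetical follow-up (years) and cause of death; None means alive at last follow-up.
follow_up = [1.0, 2.0, 2.5, 3.0, 4.0, 5.0]
cause = ["cancer", "other", "cancer", None, "other", None]

all_cause = kaplan_meier(follow_up, [c is not None for c in cause])
# Assumption: deaths from other causes are treated as censored for this endpoint.
cancer_specific = kaplan_meier(follow_up, [c == "cancer" for c in cause])

print("All-cause:      ", [(t, round(s, 3)) for t, s in all_cause])
print("Cancer-specific:", [(t, round(s, 3)) for t, s in cancer_specific])
```

The all-cause curve drops at every death, while the cancer-specific curve drops only at cancer deaths, so it stays at or above the all-cause curve. When nearly all deaths are from the cancer itself, as with highly aggressive cancers, the two curves nearly coincide.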

Discussion

This study provides the most comprehensive comparison to date across 19 cancers, using CPRD Aurum and GOLD linked to HES and CR, to show the strengths and limitations of primary-care data for oncology research. Key takeaways: fully linked datasets yield the highest IRs and arguably the most complete capture, but some cancers show weaker primary-care recording because their care pathways rely more on secondary care. Linking data improves completeness but reduces sample size and geographic generalizability, and it can introduce a lag in data availability. Study design should balance these practical considerations against the research question.

Overall, IRs were similar between Aurum and GOLD and broadly consistent with external benchmarks. Variability across data sources by cancer type reflects diagnostic and care pathways, as well as coding practices. Primary-care-only data tend to capture cancers managed mostly in primary care (e.g., breast and prostate) well, while cancers requiring hospital-based management (e.g., lung, pancreatic) are better captured when linked to HES and CR.

The Cancer Registry (CR) is often treated as a gold standard, but its completeness varies by cancer type and time. Since the CR merged multiple regional registries into a national database in 2013, inconsistencies can arise due to multiple data sources feeding the registry. Initiatives like the 2-week wait referral and the Quality and Outcomes Framework have likely improved cancer recording in primary care, but gaps remain.

Unadjusted survival times were similar between Aurum and GOLD. Including ONS cause-of-death information showed that cancer-specific survival can differ from all-cause survival, particularly for less aggressive cancers, where deaths from other causes make up a larger share of all deaths.

The study period extends to early 2020, covering the initial COVID-19 surge. This period introduced disruptions to screening, diagnosis, and treatment, potentially biasing survival estimates. The authors note the need for caution when interpreting cancer survival during the pandemic, as observed differences may reflect both true effects and data artefacts from healthcare disruptions.

Limitations include the 2011–2018 observation window (limited by data availability for linked datasets at download), potential misclassification of incident cases due to incomplete historical data, and lack of adjustment for stage or treatment in survival analyses. Strengths include the large, linked dataset covering 19 cancers and a careful examination of data-source combinations to map the data landscape for cancer research.

Conclusions

This work represents the most exhaustive comparison of cancer-recording across 19 cancer types in CPRD Aurum and GOLD, with and without HES and CR linkage. It shows that CPRD data capture a high proportion of cancer diagnoses across most types, though the completeness varies by cancer and data source. For cancers with lower primary-care capture, linkage to HES and CR is recommended. Including ONS death data demonstrates that cancer-specific survival can differ from all-cause survival, underscoring the importance of selecting appropriate death endpoints in future analyses. Researchers should tailor data-source choices to their study question, balancing completeness with practical considerations. For breast, prostate, and lung cancers, using CPRD Aurum or GOLD alone may suffice, but linking to HES and/or CR is advisable for more comprehensive case capture.

Abbreviations
ALL, Acute lymphoblastic leukemia; AML, Acute myeloid leukemia; CPRD, Clinical Practice Research Datalink; CR, Cancer Registry; HES, Hospital Episode Statistics; ICD-10, International Classification of Diseases 10th Revision; IR, Incidence rate; MM, Multiple myeloma; ONS, Office for National Statistics; UK, United Kingdom.

Data Sharing and Ethics

Data used in this study come from CPRD under license and are not publicly available, but analysis outputs may be shared upon reasonable request. The research was conducted under appropriate governance and approvals, with data linked to HES and CR and death data under regulatory permissions. Ethical considerations centered on data privacy and the responsible use of linked health records.

Takeaway

When planning cancer epidemiology research in the UK using CPRD data, expect gains in case capture from linking GP data with hospital and registry information, especially for cancers frequently diagnosed in secondary care. However, weigh this against possible reductions in sample size and geographic reach, and stay mindful of how contemporary events (like a pandemic) can shape survival outcomes. How would the choice between primary-care-only and fully linked data affect your study design and conclusions? Would it influence your approach to defining incidence and mortality endpoints?
