Evaluating generalizability of oncology trial results to real-world patients using machine learning-based trial emulations

Trial selection

This study focused on aNSCLC, mBC, mPC and mCRC owing to their high prevalence in the community oncology population in the USA and globally. Our goal was to identify phase 3 RCTs related to these cancers that produced positive results and were clinically relevant to oncology practice at the time of study design in January 2023. Trials had to meet the following criteria: (1) they included only two interventional arms; (2) the treatments involved drugs or biologics recommended in the NCCN guidelines; (3) the trial protocol was accessible; and (4) at least 600 patients in the Flatiron Health database resembled the trial’s patient population (Extended Data Fig. 1).

After a review by experienced oncologists, the final selection included 11 trials spanning the four cancers: FLAURA (ref. 37), KEYNOTE-189 (ref. 38), CHECKMATE-078 (ref. 39), KEYNOTE-024 (ref. 40) and KEYNOTE-042 (ref. 41) for aNSCLC; PALOMA-2 (ref. 42), PALOMA-3 (ref. 43) and CLEOPATRA (ref. 44) for mBC; CHAARTED (ref. 45) and LATITUDE (ref. 46) for mPC; and FIRE-3 (ref. 47) for mCRC. Most of these trials investigated novel anti-cancer agents, including genotype-directed therapy, checkpoint inhibitors and hormone-based therapy, particularly in the first-line treatment of metastatic disease. Additional information about the trials is provided in Supplementary Table 14.

Data source

The Flatiron Health database was used for both the prognostic model development and trial emulation steps of the study. The database included patients diagnosed with aNSCLC or mBC on or after 1 January 2011 and those diagnosed with mPC or mCRC on or after 1 January 2013. Data lock dates varied by cancer type: March 2021 for aNSCLC, September 2022 for mBC and mCRC and October 2022 for mPC.

Flatiron Health collects real-world clinical patient data from EHRs used by cancer care providers, including both community practices and academic medical centers. Most patients in the database originate from community oncology settings; relative community/academic proportions may vary depending on study cohort. By aggregating data across approximately 280 cancer clinics throughout the USA, this de-identified database offers a more comprehensive and representative view of cancer care and outcomes than EHRs from a single healthcare center.

The database includes both structured and unstructured data. Structured data, such as laboratory results and diagnosis codes, undergo harmonization and mapping to common units and terminology (ref. 48). Unstructured data, such as clinical notes and pathology reports, are processed by oncology nurses and tumor registrars using a technology-enabled abstraction platform to extract variables such as biomarkers or date of disease progression (ref. 49).

The accuracy of Flatiron Health’s data curation process, particularly regarding mortality data, has been rigorously validated. Patient mortality information is captured through a composite variable that integrates multiple data sources, including both structured and unstructured content from EHRs, commercial sources and the Social Security Death Index. When benchmarked against data from the National Death Index, which is the gold standard for mortality data, the sensitivity of mortality capture in the Flatiron Health database exceeded 90% across various cancer types (ref. 50).

Prognostic model development

The primary objective of the cancer-specific prognostic models was to accurately estimate patient survival from the time of metastatic cancer diagnosis, so that well-defined prognostic phenotypes could be generated for the emulated trial section. For each cancer type, four distinct survival-based ML models were constructed in Python version 3.7 using the scikit-survival package (ref. 51). The first model was a gradient-boosted ensemble of regression trees using the negative log-partial likelihood of a Cox proportional hazards model as the loss function. The second model fit an ensemble of survival trees on different subsamples of the dataset, applying a log-rank split criterion. The third model was a linear survival support vector machine. Lastly, three variations of a penalized Cox proportional hazards model were constructed: ridge (\(\ell_2\) penalty), lasso (\(\ell_1\) penalty) and elastic net (a linear combination of the \(\ell_1\) and \(\ell_2\) penalties).
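To illustrate the loss that the gradient-boosted model optimizes, the negative log-partial likelihood can be written out directly. The sketch below is a plain-NumPy rendition under the simplifying assumption of no tied event times (the function name and inputs are illustrative, not the authors' code):

```python
import numpy as np

def cox_neg_log_partial_likelihood(risk_scores, times, events):
    """Negative log-partial likelihood of a Cox model (no tie handling).

    risk_scores: predicted linear predictors f(x_i)
    times: observed times (death or censoring)
    events: 1 if death was observed, 0 if censored
    """
    risk_scores = np.asarray(risk_scores, dtype=float)
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)

    loss = 0.0
    for i in np.flatnonzero(events):
        # Risk set: everyone still under observation at time t_i
        at_risk = times >= times[i]
        loss -= risk_scores[i] - np.log(np.sum(np.exp(risk_scores[at_risk])))
    return loss
```

Minimizing this quantity over the ensemble's predictions plays the same role as maximizing the partial likelihood when fitting an ordinary Cox model.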

Additionally, a non-penalized Cox proportional hazards model, inspired by recent publications (refs. 26–29), was constructed for each cancer type as a benchmark for comparison with the ML models. The selected published models were chosen based on their demonstrated discriminatory performance on externally validated datasets and their inclusion of variables available in the Flatiron Health database.

Data preparation

Each cancer dataset was split into two subsets: an 80% training set and a 20% test set. Given the anticipated prognostic significance of the year of metastatic diagnosis and its influence on treatment decisions, a stratified split was applied based on this variable to ensure an equitable distribution of diagnosis years across both subsets.
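A minimal sketch of such a stratified split, assuming only a per-patient array of diagnosis years (the function name and interface are illustrative, not from the paper):

```python
import numpy as np

def stratified_split(years, test_frac=0.2, seed=0):
    """Split patient indices into train/test sets, stratified by
    metastatic-diagnosis year so each year is represented ~equally.

    years: array-like of diagnosis years, one entry per patient
    Returns (train_idx, test_idx) as sorted index arrays.
    """
    rng = np.random.default_rng(seed)
    years = np.asarray(years)
    train, test = [], []
    for year in np.unique(years):
        idx = np.flatnonzero(years == year)
        rng.shuffle(idx)
        n_test = int(round(test_frac * len(idx)))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return np.array(sorted(train)), np.array(sorted(test))
```

In practice the same effect is achieved with scikit-learn's `train_test_split(..., stratify=years)`.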

The index date for the models was set as the time of metastatic diagnosis. A data collection window from 90 days before to 30 days after the index date was defined, and the feature values closest to the index date were selected. The upper bound of 30 days was chosen as it represented the median start time for first-line treatment initiation. First-line treatment initiation itself was not used as an upper bound owing to the absence of documented treatment in 10–20% of patients, varying by cancer type. The end date was defined as the time of death, and patients without a recorded death date were right-censored at their last recorded activity in the EHR.

Feature categories used for prognostic model construction included demographic data, cancer characteristics, biomarkers, weight, laboratory values, medical history and medications. Medical history was identified through International Classification of Diseases (ICD) coding and mapped to the Elixhauser comorbidity list. Several features were engineered by leveraging the longitudinal nature of the dataset, such as percentage change in weight and summary laboratory measures (maximum, minimum and slope). Treatment data were intentionally excluded from the final feature list because the objective was to stratify patients based on mortality risk before treatment initiation. Additionally, race, ethnicity and insurance were excluded to mitigate the risk of perpetuating historical or societal biases in the predictive model (ref. 52). For the comprehensive list of features used to construct each disease-specific model, see Supplementary Tables 15–18.

ECOG performance status and laboratory values had relatively high missing rates at the time of metastatic diagnosis, ranging from 30% to 50% depending on the cancer cohort. Special attention was given to their imputation considering their high prognostic value. The default imputation strategy involved imputing the median for laboratory values and creating a binary flag for missing values, and missing ECOG values were left as unknown. As an alternative imputation strategy, MICE was used. For each cancer type, five complete versions of the training and test sets were generated using the ‘miceforest’ Python package (ref. 53), which uses LightGBM as the back-end model for predicting missing values. A cancer-specific survival GBM was then trained on each training set, with model performance on the test set pooled using a modified version of Rubin’s rules (ref. 54).
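For reference, the textbook form of Rubin's rules for pooling an estimate and its variance across the five imputed datasets is sketched below. The paper used a modified version for the AUC whose details are not given, so this is only the standard formulation, not the authors' exact procedure:

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool point estimates and their variances across m imputed
    datasets using standard Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                 # pooled point estimate
    w = u.mean()                     # within-imputation variance
    b = q.var(ddof=1)                # between-imputation variance
    t = w + (1 + 1 / m) * b          # total variance
    return q_bar, t
```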

Model evaluation

Model performance was assessed by calculating a time-dependent AUC on the test set, specifically a cumulative/dynamic AUC (ref. 55). The AUC at time t is defined by equation (1), where \(y_i\) and \(y_j\) are the observed times of death for patients i and j. The function I(·) is an indicator function that returns 1 if a specified condition is true and 0 if the condition is false. The term ωi represents the inverse probability of censoring weight for patient i, which adjusts for the potential bias introduced by censored data. The values \(\hat{f}(\mathbf{x}_i)\) and \(\hat{f}(\mathbf{x}_j)\) are the predicted risk scores for patients i and j, based on their respective feature vectors \(\mathbf{x}_i\) and \(\mathbf{x}_j\).

$$\widehat{\mathrm{AUC}}(t)=\frac{\sum_{i=1}^{n}\sum_{j=1}^{n} I(y_j > t)\, I(y_i \le t)\,\omega_i\, I\big(\hat{f}(\mathbf{x}_j)\le \hat{f}(\mathbf{x}_i)\big)}{\left(\sum_{j=1}^{n} I(y_j > t)\right)\left(\sum_{i=1}^{n} I(y_i \le t)\,\omega_i\right)}$$

(1)

The numerator of the AUC formula sums over all pairs (i, j) where patient i has died by or at time t and patient j has survived beyond t, comparing their predicted risk scores and applying ωi to each comparison. The denominator normalizes this sum by multiplying the number of patients who survived beyond time t by the weighted count of patients who died by or at time t. The AUC can be interpreted as the weighted probability that, given two randomly chosen patients i and j, where patient i dies before or at time t and patient j survives past t, the model assigns a risk score to patient i that is greater than or equal to the risk score assigned to patient j.
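The estimator in equation (1) translates directly into code. The sketch below assumes the IPCW weights ωi are supplied precomputed (in practice, scikit-survival's `cumulative_dynamic_auc` handles the censoring weights internally):

```python
import numpy as np

def cumulative_dynamic_auc(times, pred_risk, t, weights=None):
    """Cumulative/dynamic AUC at time t, following equation (1).

    times: observed times y_i
    pred_risk: predicted risk scores f(x_i)
    t: evaluation time
    weights: IPCW weights omega_i (defaults to 1 for every patient)
    """
    y = np.asarray(times, dtype=float)
    f = np.asarray(pred_risk, dtype=float)
    w = np.ones_like(y) if weights is None else np.asarray(weights, dtype=float)

    cases = y <= t        # died by or at time t
    controls = y > t      # survived beyond time t
    num = 0.0
    for i in np.flatnonzero(cases):
        # Concordant pairs: case i scored at least as high as control j
        num += w[i] * np.sum(f[controls] <= f[i])
    denom = controls.sum() * np.sum(w[cases])
    return num / denom
```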

The AUC was chosen over other metrics, such as the concordance index, owing to its ability to assess a model’s discriminatory performance at clinically relevant timepoints. The AUC was calculated at 30-day intervals from the time of metastatic diagnosis up to 5 years afterwards on the test set. However, particular attention was given to the AUC at 1 year for aNSCLC and at 2 years for mBC, mCRC and mPC to determine the top-performing prognostic model. These timepoints were chosen because they align with the mOS in the Flatiron Health dataset. The 95% CIs for the AUC were bootstrapped using 10,000 iterations.

Hyperparameter tuning

Model hyperparameters were optimized to maximize AUC at 1 year for aNSCLC and at 2 years for mBC, mCRC and mPC. This optimization was achieved through a grid search combined with five-fold cross-validation on the training set. For ensemble models, an early stopping strategy was implemented to determine the optimal number of learners and reduce overfitting before fine-tuning the remaining hyperparameters through grid search. Detailed information on the hyperparameter search space can be found in Supplementary Table 19.

The following data transformations were implemented within the cross-validation loops: categorical variables were one-hot-encoded, and numerical variables underwent median imputation and standardization to have a mean of 0 and a standard deviation of 1.

Trial emulation

Eligibility matching

In the primary analysis, we selected Flatiron Health patients who received relevant treatments at the appropriate line of therapy and exhibited correct biomarkers near the time of treatment. Biomarkers were selected up to 30 days beyond the start of treatment, accounting for the known lag time in obtaining results and their crucial role in the eligibility criteria of many trials. Treatment arms were occasionally expanded beyond those explicitly investigated in the RCT if equivalent real-world therapies were present. For instance, in PALOMA-2, which evaluated the benefit of adding palbociclib to letrozole in untreated hormone receptor-positive/HER2-negative mBC, we included real-world patients receiving any first-line aromatase inhibitor with or without palbociclib, as differences in estrogen suppression between the aromatase inhibitors are not associated with clinically significant differences in efficacy.

Two alternative eligibility matching scenarios were evaluated as part of a sensitivity analysis. The first was a strict eligibility criteria scenario, aiming to fulfill as many original trial criteria as possible within the constraints of the Flatiron Health database. Patients meeting eligibility criteria in the primary analysis were removed if they: (1) had an ICD code associated with an exclusionary comorbidity in the year before treatment start; (2) had an ECOG score exceeding the RCT cutoff, based on the score closest to and before treatment start; (3) exhibited organ dysfunction in laboratory results, defined as having any of the following within 3 months before treatment: hemoglobin < 9 g dl⁻¹, creatinine > 2 mg dl⁻¹ or total bilirubin > 3 mg dl⁻¹; or (4) had exclusionary sites of metastases (for example, central nervous system) in the 3 months before treatment. However, certain criteria, such as measurable disease by Response Evaluation Criteria in Solid Tumors (RECIST), or specific past medical history, such as recent pregnancy or vaccinations, were not captured in the Flatiron Health database and could not be replicated. For a complete list of eligibility criteria satisfied in the emulated trials, see Supplementary Tables 20–30.

The second eligibility scenario included only patients who received a standard first dose of chemotherapeutic agents, as defined by NCCN guidelines (Supplementary Table 31). Patients with unknown first doses were excluded. Subsequent dose reduction or schedule modifications were allowed, as all RCTs permitted dose modification during the trial. Trials that included only checkpoint inhibitors, targeted therapies or hormonal therapies were not considered given the fixed-dose prescribing patterns of these agents. The CLEOPATRA trial was also not included in this subgroup analysis, despite the presence of chemotherapeutic agents, owing to low patient numbers in the primary analysis.

Prognostic phenotyping

Real-world patients meeting trial eligibility criteria underwent prognostic phenotyping using the top-performing model identified in step I, which was the GBM across all four cancers. The GBM generated individualized mortality risk scores for each patient, using a loss function based on the negative log-partial likelihood of a Cox proportional hazard model. This approach produces mortality risk scores analogous to the linear predictor in a Cox proportional hazards model. Patients in the emulated trials were then categorized into three prognostic phenotypes based on their mortality risk scores: high-risk (top tertile), medium-risk (middle tertile) and low-risk (bottom tertile).
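The tertile assignment can be sketched as follows (an illustrative helper, not the authors' code):

```python
import numpy as np

def assign_phenotypes(risk_scores):
    """Assign low/medium/high-risk prognostic phenotypes based on
    tertiles of the GBM mortality risk score."""
    scores = np.asarray(risk_scores, dtype=float)
    low_cut, high_cut = np.quantile(scores, [1 / 3, 2 / 3])
    return np.where(scores <= low_cut, "low-risk",
           np.where(scores <= high_cut, "medium-risk", "high-risk"))
```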

Survival analysis

Survival analysis was conducted for the emulated trials, with time zero defined as the start of the therapy of interest. IPTW was performed to balance features between the treatment and control arms. A propensity score, representing the probability of being assigned to the treatment arm, was calculated using a logistic regression model. This model included features considered influential for selection into the treatment arm, along with prognostically significant features. Prognostic significance was determined based on the frequency of feature appearance in the GBM regression trees (Fig. 2).

The specific set of features varied by trial but typically included demographics, area-level SES, insurance status, year of metastatic diagnosis, ECOG performance status, time from initial cancer diagnosis to metastatic disease, histology, prior treatments, biomarkers, albumin, change in weight and the mortality risk score estimated from the GBM. A patient’s SES quintile was determined by neighborhood factors, including income, home values, rental costs, poverty, unemployment and education level; however, this feature was available only in the breast and colorectal cancer datasets. Most feature values were selected closest to the time of metastatic diagnosis and before treatment initiation, with the exception of biomarkers, which were considered up to 30 days after treatment initiation due to potential reporting delays. To assess the balance of these features between treatment and control arms, we calculated standardized mean differences, aiming to achieve an absolute standardized mean difference of less than 0.2 (20%) for each feature.
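A weighted standardized mean difference for a continuous feature can be computed as in the sketch below. The exact variance convention used in the paper is not stated; this version uses weighted means and the pooled standard deviation:

```python
import numpy as np

def standardized_mean_difference(x_treat, x_ctrl, w_treat=None, w_ctrl=None):
    """Weighted standardized mean difference for one continuous feature,
    used to check covariate balance after IPTW (target |SMD| < 0.2)."""
    xt, xc = np.asarray(x_treat, float), np.asarray(x_ctrl, float)
    wt = np.ones_like(xt) if w_treat is None else np.asarray(w_treat, float)
    wc = np.ones_like(xc) if w_ctrl is None else np.asarray(w_ctrl, float)

    mt = np.average(xt, weights=wt)
    mc = np.average(xc, weights=wc)
    vt = np.average((xt - mt) ** 2, weights=wt)   # weighted variance, treated
    vc = np.average((xc - mc) ** 2, weights=wc)   # weighted variance, control
    pooled_sd = np.sqrt((vt + vc) / 2)
    return (mt - mc) / pooled_sd
```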

The propensity score, denoted as ei, is defined in equation (2), where Xi represents the features in the propensity score model for patient i, and Zi is the indicator variable representing whether patient i was in the treatment arm (Zi = 1) or control arm (Zi = 0). During the survival analysis, patient i was assigned a weight, ωi, defined in equation (3).

$$e_i=\Pr (Z_i=1|X_i)$$

(2)

$$\omega _i=Z_i/e_i+(1-Z_i)/(1-e_i)$$

(3)
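Equations (2) and (3) map directly to code. Given propensity scores from any fitted classifier, the weight computation is essentially one line (a sketch with hypothetical inputs):

```python
import numpy as np

def iptw_weights(z, e):
    """IPTW weights from equation (3): w_i = Z_i/e_i + (1 - Z_i)/(1 - e_i).

    z: treatment indicator (1 = treatment arm, 0 = control arm)
    e: propensity scores, e.g. predicted probabilities from a fitted
       logistic regression as in equation (2)
    """
    z = np.asarray(z, dtype=float)
    e = np.asarray(e, dtype=float)
    return z / e + (1 - z) / (1 - e)
```

Treated patients with low propensity scores (and controls with high ones) receive large weights, which is why positivity violations, explored in the semi-synthetic simulation below, destabilize the estimates.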

For the emulated trials consisting of the full cohort meeting strict eligibility criteria, HRs were calculated using a Cox-IPTW model. This model adjusted for features in the Cox proportional hazards regression model and balanced the distribution of features between treatment and control arms through IPTW. Robust standard errors for the HR were calculated using the Huber sandwich estimator. HRs from the emulated trials were compared with those from the RCTs by evaluating three metrics: (1) statistical significance agreement, defined as HR estimates and 95% CIs being on the same side of the null; (2) 95% CI agreement, determined by whether the emulated trial estimates fell within the 95% CI of the RCT results; and (3) standardized difference agreement, defined as an absolute standardized difference of less than 1.96 between HRs from the emulated trial and the RCT.
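The three agreement metrics can be sketched as below. The standard errors are recovered here from the reported 95% CIs on the log-HR scale; this is an assumption, since the paper does not state how the denominator of the standardized difference was formed:

```python
import math

def hr_agreement(hr_em, ci_em, hr_rct, ci_rct):
    """Three agreement metrics between emulated-trial and RCT hazard
    ratios; CIs are (lower, upper) tuples on the HR scale."""
    # SE of log(HR) recovered from a 95% CI (assumed derivation)
    se = lambda ci: (math.log(ci[1]) - math.log(ci[0])) / (2 * 1.96)

    # (1) Statistical-significance agreement: both CIs on the same side of HR = 1
    sig = lambda ci: ci[0] > 1 or ci[1] < 1
    significance_agree = sig(ci_em) == sig(ci_rct)

    # (2) 95% CI agreement: emulated estimate falls inside the RCT CI
    ci_agree = ci_rct[0] <= hr_em <= ci_rct[1]

    # (3) Standardized-difference agreement on the log-HR scale
    z = (math.log(hr_em) - math.log(hr_rct)) / math.hypot(se(ci_em), se(ci_rct))
    std_diff_agree = abs(z) < 1.96

    return significance_agree, ci_agree, std_diff_agree
```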

The treatment effect for phenotypes was estimated using the mOS and the RMST at 1 year beyond mOS from IPTW-adjusted Kaplan–Meier survival curves. The 1-year time horizon beyond mOS was selected for calculating RMST to allow adequate time for treatment and control arm separation, which is particularly relevant in the checkpoint inhibitor trials30. The 95% CIs for mOS, RMST, RMST difference and the Kaplan–Meier survival curves were calculated using 1,000 bootstraps, with IPTW performed within each bootstrap.

Validation

In the validation analysis, we assessed the agreement in treatment effect across phenotypes for the KEYNOTE-189 and PALOMA-2 trials using an independent holdout set from the Flatiron Health data. These trials were selected owing to their large sample sizes, which allowed for an adequately sized holdout set. The patient population meeting trial eligibility criteria was divided into training and holdout sets, each containing half of the patients. The division was stratified by year of metastatic diagnosis and treatment receipt to ensure equitable distribution.

The GBM was trained on the full cancer cohort, excluding the holdout set. Mortality risk scores were then calculated for both the training and holdout sets. Phenotype cutoffs were determined based on tertiles from the training set patients and applied to both sets. Survival analysis was performed independently on both sets, and the results from the holdout set were compared to those from the training set to validate the consistency of treatment effect across phenotypes.

Semi-synthetic data simulation

A semi-synthetic data simulation was performed for the KEYNOTE-189 trial to assess bias in the trial emulation HRs and evaluate the validity of resulting CIs. This simulation also analyzed how violations of key assumptions—specifically, positivity and no unmeasured confounders—impact the reliability of the HR estimates. The KEYNOTE-189 trial was selected owing to its large sample size and the lower likelihood of proportional hazards violations within phenotypes, as both treatment and control arms included chemotherapy as part of the interventions.

The semi-synthetic dataset was generated using a two-step procedure. First, a logistic regression model simulated treatment assignments based on patient features from the Flatiron Health database. The features selected were the same as those used in the KEYNOTE-189 propensity model in the primary analysis. Mean imputation was performed for missing values. Second, a Cox model was used to simulate survival time and death. Survival times were generated using an exponential distribution, where the rate parameter was the product of the baseline hazard and the individual hazard obtained from the Cox model. Event status was simulated using a binomial distribution, with the probability of death based on the mortality risk estimates from the Cox model, ensuring that the semi-synthetic dataset had similar hazard and survival characteristics to the original data.
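The second simulation step can be sketched as follows (the function name and the separate death-probability input are illustrative, not the authors' code):

```python
import numpy as np

def simulate_outcomes(baseline_hazard, lin_pred, p_death, rng=None):
    """Simulate survival times and event status for the semi-synthetic data.

    baseline_hazard: scalar baseline hazard rate
    lin_pred: per-patient linear predictor beta'x from the fitted Cox model
    p_death: per-patient probability of an observed death, from the
             Cox model's mortality risk estimates
    """
    rng = np.random.default_rng(rng)
    lin_pred = np.asarray(lin_pred, dtype=float)
    # Exponential survival time with rate = baseline hazard * individual hazard
    rate = baseline_hazard * np.exp(lin_pred)
    times = rng.exponential(scale=1.0 / rate)
    # Event status drawn from a binomial with the per-patient death probability
    events = rng.binomial(1, np.asarray(p_death, dtype=float))
    return times, events
```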

Four cases were assessed. In case 1, both the positivity and no-unmeasured-confounder assumptions were maintained. In case 2, the positivity assumption was violated, with 30% of patients randomly chosen not to receive treatment by setting their propensity score to 0 while maintaining the no-unmeasured-confounder assumption. Case 3 involved a more severe positivity violation, with 40% of patients never receiving treatment. In case 4, the positivity assumption was maintained, but the no-unmeasured-confounder assumption was violated by excluding variables such as weight percentage change at the time of advanced diagnosis and PD-L1 status from the propensity score estimation. These variables were specifically selected because they are often missing at treatment initiation in real-world databases. Of note, these variables were included in our trial emulations.

For each case, samples of 10,000, 15,000 or 20,000 patients from the semi-synthetic dataset were analyzed. A Cox-IPTW model was fitted to calculate the HR for the entire dataset and for each prognostic phenotype, with the model specifics varying by case. This process was repeated 2,000 times for each sample size. Bias was calculated as the difference between the estimated HR from the semi-synthetic dataset and the true HR, and average bias was computed across the 2,000 iterations. The coverage rate was calculated as the proportion of true HRs that fell within the 95% CIs of the estimated HRs across the 2,000 iterations.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
