Leveraging population-scale proteomic data with deep learning for head and neck cancer detection in saliva
npj Digital Medicine, 2026
Shakeel A., Merriel S., Smith J., McGough A., Suderman M., Abdallah Z., Yousefi P.
| Disease area | Application area | Sample type | Products |
|---|---|---|---|
| Oncology, Wider Bioinformatics Studies | Technical Evaluation, Data Science | Plasma, Saliva | Olink Target 96, Olink Explore 3072/384 |
Abstract
Identifying robust biomarkers for early cancer detection remains challenging, particularly when working with limited or heterogeneous datasets. Here, we present a proof-of-concept deep learning framework for cancer classification using blood-based proteomic profiles. Our approach leverages sample type transfer and synthetic data augmentation to improve performance and generalization across sample types. Models were trained on plasma proteome data from 13,208 pan-cancer cases and 39,806 controls in the UK Biobank. To address class imbalance and enrich the feature space, a convolutional neural network (CNN-Synth) was trained to detect cancer cases using data augmented with synthetic pan-cancer samples generated via a variational autoencoder. Performance was evaluated in an independent saliva-based dataset from a head and neck cancer case-control study (n = 156). CNN-Synth (AUC = 0.88) surpassed models trained without synthetic data (AUC ≤ 0.77). SHapley Additive exPlanations (SHAP) identified well-known cancer markers as key features. These results highlight the potential of sample type transfer and synthetic data augmentation, although further validation is needed.
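To make the class imbalance in the training data concrete, the short sketch below works through the arithmetic using the case and control counts reported in the abstract. The helper `synthetic_needed` and the 1:1 target ratio are illustrative assumptions; the abstract does not state how many synthetic samples were actually generated.

```python
# Counts reported in the abstract (UK Biobank plasma proteome training data).
N_CASES = 13_208
N_CONTROLS = 39_806


def synthetic_needed(n_cases: int, n_controls: int, target_ratio: float = 1.0) -> int:
    """Synthetic cases needed for cases / controls to reach target_ratio.

    target_ratio = 1.0 (full balance) is an illustrative assumption,
    not a figure taken from the paper.
    """
    needed = round(target_ratio * n_controls) - n_cases
    return max(needed, 0)  # no augmentation needed if already at the target


case_fraction = N_CASES / (N_CASES + N_CONTROLS)
n_synth = synthetic_needed(N_CASES, N_CONTROLS)

print(f"case fraction before augmentation: {case_fraction:.3f}")  # 0.249
print(f"synthetic cases for a 1:1 ratio: {n_synth}")  # 26598
```

Under this assumption, roughly one in four training samples is a cancer case before augmentation, and about twice as many synthetic cases as real ones would be needed to reach full balance.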