Leakage-Safe Machine Learning and Explainable Artificial Intelligence for Baseline Proteomic Signal Prioritization in Preclinical Rheumatoid Arthritis
Healthcraft Frontiers, 2026
Donmez T., Mansour M.
| Disease area | Application area | Sample type | Products |
|---|---|---|---|
| Immunological & Inflammatory Diseases | Data Science | Plasma | Olink Explore 3072/384 |
Abstract
Reliable baseline biomarkers for progression toward rheumatoid arthritis in anti-citrullinated protein antibody-positive at-risk individuals remain insufficiently characterized. An exploratory leakage-safe machine learning framework combined with explainable artificial intelligence was developed to prioritize circulating proteomic signals associated with progression status in a small at-risk cohort. Baseline clinical and Olink proteomic data from 47 individuals (16 progressors and 31 non-progressors) were analyzed, although limited follow-up among non-progressors rendered the endpoint exploratory rather than prognostic. Of 1,472 quantified proteins, 1,449 were retained after application of a ≤20% missingness threshold. Fold-internal feature selection, including Cohen’s d-based ranking, correlation filtering (|r| < 0.85), and top-30 protein selection, was embedded within repeated stratified five-fold cross-validation. Predictive performance remained modest, with the primary support vector classifier achieving a mean Receiver Operating Characteristic Area Under the Curve (ROC-AUC) of 0.675 and Precision-Recall Area Under the Curve (PR-AUC) of 0.447, while calibration remained weak (Brier score = 0.231; calibration slope = 0.455). A 500-iteration permutation audit did not reach statistical significance (p = 0.164). Regularized logistic regression failed to improve discrimination, and incorporation of routine clinical covariates did not yield a reproducible advantage over proteomic features alone. Extreme gradient boosting demonstrated lower discriminative performance and was retained only for secondary interpretability analyses.
Across Tree-based SHapley Additive exPlanations (TreeSHAP), Kernel SHapley Additive exPlanations (KernelSHAP) for the support vector classifier, and bootstrap perturbation analyses, trefoil factor 2 (TFF2), KIT proto-oncogene receptor tyrosine kinase (KIT), cadherin 3 (CDH3), angiopoietin-like 2 (ANGPTL2), interleukin-5 (IL5), and glypican-1 (GPC1) emerged as recurrent candidate proteins. Given the limited cohort size, weak calibration, and non-significant permutation testing, all findings should be regarded as exploratory. The primary contribution therefore lies in the establishment of a transparent, leakage-aware workflow for proteomic signal prioritization in severely underpowered p ≫ n settings, thereby supporting future longitudinal validation studies in preclinical rheumatoid arthritis.
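The fold-internal feature selection described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the `CohenDSelector` class, the synthetic data, the RBF kernel, the number of repeats, and the greedy order of the correlation filter are all assumptions made for illustration. Only the top-30 cutoff, the |r| < 0.85 threshold, the five folds, and the 47-sample cohort shape (16 vs. 31) come from the abstract. Placing the selector inside a scikit-learn pipeline means ranking and filtering are refit on each training fold, so held-out samples never influence which proteins are chosen.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


class CohenDSelector(TransformerMixin, BaseEstimator):
    """Hypothetical selector: rank by |Cohen's d|, drop correlated features, keep top k."""

    def __init__(self, k=30, max_abs_r=0.85):
        self.k = k
        self.max_abs_r = max_abs_r

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        pos, neg = X[y == 1], X[y == 0]
        n1, n0 = len(pos), len(neg)
        # Pooled standard deviation across the two classes.
        pooled = np.sqrt(
            ((n1 - 1) * pos.var(axis=0, ddof=1) + (n0 - 1) * neg.var(axis=0, ddof=1))
            / (n1 + n0 - 2)
        )
        # Zero-variance features get d = 0 and sink to the bottom of the ranking.
        d = (pos.mean(axis=0) - neg.mean(axis=0)) / np.where(pooled > 0, pooled, np.inf)
        order = np.argsort(-np.abs(d))
        corr = np.corrcoef(X, rowvar=False)
        kept = []
        # Greedy filter: accept a feature only if |r| with every kept feature is below the cap.
        for j in order:
            if all(abs(corr[j, i]) < self.max_abs_r for i in kept):
                kept.append(j)
            if len(kept) == self.k:
                break
        self.selected_ = np.array(kept)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float)[:, self.selected_]


# Synthetic stand-in for the cohort: 47 samples, 16 "progressors", 200 noise features.
rng = np.random.default_rng(0)
X = rng.normal(size=(47, 200))
y = np.array([1] * 16 + [0] * 31)

# Selection and scaling live inside the pipeline, so both are refit per training fold.
pipe = make_pipeline(CohenDSelector(k=30, max_abs_r=0.85), StandardScaler(), SVC(kernel="rbf"))
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)  # repeat count assumed
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(f"mean ROC-AUC over {len(scores)} folds: {scores.mean():.3f}")
```

Because the synthetic features are pure noise, the mean ROC-AUC should hover near chance. Running the selector once on the full dataset before cross-validation would typically inflate that estimate, which is the leakage the fold-internal design avoids. Real Olink data with missing values would also need an imputation step inside the same pipeline, which this sketch omits.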