Leakage-Safe Machine Learning and Explainable Artificial Intelligence for Baseline Proteomic Signal Prioritization in Preclinical Rheumatoid Arthritis

Healthcraft Frontiers, 2026

Donmez T., Mansour M.

Disease area: Immunological & Inflammatory Diseases
Application area: Data Science
Sample type: Plasma
Products: Olink Explore 3072/384

Abstract

Reliable baseline biomarkers for progression toward rheumatoid arthritis in anti-citrullinated protein antibody-positive at-risk individuals remain insufficiently characterized. An exploratory leakage-safe machine learning framework combined with explainable artificial intelligence was developed to prioritize circulating proteomic signals associated with progression status in a small at-risk cohort. Baseline clinical and Olink proteomic data from 47 individuals (16 progressors and 31 non-progressors) were analyzed, although limited follow-up among non-progressors rendered the endpoint exploratory rather than prognostic. Of 1,472 quantified proteins, 1,449 were retained after application of a ≤20% missingness threshold. Fold-internal feature selection, including Cohen’s d-based ranking, correlation filtering (|r| < 0.85), and top-30 protein selection, was embedded within repeated stratified five-fold cross-validation. Predictive performance remained modest, with the primary support vector classifier achieving a mean Receiver Operating Characteristic Area Under the Curve (ROC-AUC) of 0.675 and Precision-Recall Area Under the Curve (PR-AUC) of 0.447, while calibration remained weak (Brier score = 0.231; calibration slope = 0.455). A 500-iteration permutation audit did not reach statistical significance (p = 0.164). Regularized logistic regression did not improve discrimination, and incorporation of routine clinical covariates did not yield a reproducible advantage over proteomic features alone. Extreme gradient boosting demonstrated lower discriminative performance and was retained only for secondary interpretability analyses.
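To illustrate what "fold-internal" selection means in practice, the following is a minimal sketch, not the authors' implementation. It assumes NumPy, a greedy correlation filter applied in Cohen's d rank order, and synthetic data with the cohort's class sizes (16 versus 31). The function names (`cohens_d`, `select_features`, `stratified_folds`) and the reduced feature count are hypothetical choices made for illustration. The point it shows is that ranking and filtering see only the training rows of each fold.

```python
import numpy as np

def cohens_d(X, y):
    """Absolute Cohen's d per column (class 1 vs class 0, pooled SD)."""
    a, b = X[y == 1], X[y == 0]
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(axis=0, ddof=1)
                  + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(pooled_var + 1e-12)

def select_features(X_train, y_train, top_k=30, max_abs_r=0.85):
    """Rank by |d|, then greedily keep columns whose |r| with every
    already-kept column is below max_abs_r, until top_k are kept."""
    order = np.argsort(-cohens_d(X_train, y_train))
    corr = np.corrcoef(X_train, rowvar=False)
    kept = []
    for j in order:
        if all(abs(corr[j, k]) < max_abs_r for k in kept):
            kept.append(int(j))
        if len(kept) == top_k:
            break
    return kept

def stratified_folds(y, n_splits=5, rng=None):
    """Yield (train_idx, test_idx) pairs with each class spread across folds."""
    rng = np.random.default_rng(rng)
    fold_of = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        fold_of[idx] = np.arange(len(idx)) % n_splits
    for f in range(n_splits):
        yield np.flatnonzero(fold_of != f), np.flatnonzero(fold_of == f)

# Synthetic stand-in data (assumption): 47 samples, 200 noise features.
rng = np.random.default_rng(0)
y = np.array([1] * 16 + [0] * 31)
X = rng.normal(size=(47, 200))

selected_per_fold = []
for repeat in range(2):  # "repeated" CV: a new shuffle per repeat
    for train, test in stratified_folds(y, n_splits=5, rng=repeat):
        # Selection uses training rows only; test rows never influence it.
        cols = select_features(X[train], y[train])
        # A classifier would be fit on X[train][:, cols] and scored on
        # X[test][:, cols] at this point.
        selected_per_fold.append((train, test, cols))
```

Running selection once on all 47 samples before splitting would let held-out individuals influence which proteins are chosen, which is the leakage this design avoids.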
Across Tree-based SHapley Additive exPlanations (TreeSHAP), Kernel SHapley Additive exPlanations (KernelSHAP) for the support vector classifier, and bootstrap perturbation analyses, trefoil factor 2 (TFF2), KIT proto-oncogene receptor tyrosine kinase (KIT), cadherin 3 (CDH3), angiopoietin-like 2 (ANGPTL2), interleukin-5 (IL5), and glypican-1 (GPC1) emerged as recurrent candidate proteins. Given the limited cohort size, weak calibration, and non-significant permutation testing, all findings should be regarded as exploratory. The primary contribution therefore lies in the establishment of a transparent, leakage-aware workflow for proteomic signal prioritization in severely underpowered p ≫ n settings, thereby supporting future longitudinal validation studies in preclinical rheumatoid arthritis.
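The permutation testing mentioned above can be sketched as follows. This is a hedged illustration in pure Python, not the procedure used in the study. It assumes a hypothetical callable `score_fn(labels)` that reruns the entire pipeline (including fold-internal selection) on the given labels and returns a score such as mean ROC-AUC. It also assumes the common add-one correction for the p-value; the abstract does not state which estimator was used.

```python
import random

def auc(scores, labels):
    """ROC-AUC as the probability that a positive outranks a negative
    (ties count as 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def permutation_audit(score_fn, labels, n_perm=500, seed=0):
    """Compare the observed score with scores under shuffled labels.

    score_fn is a hypothetical stand-in for the full modelling pipeline.
    Returns (observed_score, p_value) with the add-one correction (assumption).
    """
    rng = random.Random(seed)
    observed = score_fn(labels)
    shuffled = list(labels)
    null_scores = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any label-feature association
        null_scores.append(score_fn(shuffled))
    exceed = sum(s >= observed for s in null_scores)
    return observed, (exceed + 1) / (n_perm + 1)

# Toy usage (assumption): a single noise "marker" scored by AUC.
labels = [1] * 16 + [0] * 31
noise_rng = random.Random(1)
marker = [noise_rng.gauss(0, 1) for _ in labels]
observed_auc, p_value = permutation_audit(lambda lab: auc(marker, lab), labels)
```

A p-value well above 0.05, as reported in the abstract, means the observed score is not clearly distinguishable from scores obtained with shuffled labels.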
