Olink

Part of Thermo Fisher Scientific

Unsupervised learning reveals novel disease-associated proteins in high-dimensional human proteomic data

Scientific Reports, 2026

Bernard E., Wang Y., Chen M., Xu S.

Disease area: Wider Bioinformatics Studies
Application area: Data Science
Sample type: Plasma
Products: Olink Explore 3072/384


Abstract

Recent advances in precision medicine have produced vast proteomic datasets that capture the concentrations of thousands of proteins across tens of thousands of participants. These datasets are traditionally analyzed with supervised learning methods, which are relatively simple to implement and whose output is easy to assess. However, this approach can overlook subtle patterns that might offer deeper insights. Unsupervised learning can reveal such hidden relationships, but it struggles with high dimensionality: a brute-force analysis could take millennia to complete. In this study, we developed the Dimensionality Reduction with Avoidance of Missing/COmmunity Detection (DIRAM/COD) framework to address this problem. It combines dimensionality reduction techniques with unsupervised learning to analyze the massive proteomic dataset of the UK Biobank, which includes the concentrations of 2,923 plasma proteins from 52,691 participants. With this approach, we not only confirmed well-established biomarkers for diseases such as hypertension (UBE2L6) and leukemia (LRCH4) but also identified novel protein candidates. For instance, we linked IGF2BP3, a protein previously linked to intestinal barrier function, to celiac disease, and identified several other proteins not yet associated with these diseases. This approach opens up possibilities for future research and may pave the way for the discovery of new biomarkers and therapeutic targets.
