Publication:
Efficient algorithms for improved information retention in integration of incomplete omics datasets

cris.customurl 20448
cris.virtual.department High Performance Computing
cris.virtual.departmentbrowse High Performance Computing
cris.virtualsource.department 96eeb7e3-b287-4320-836e-6f62697f2214
dc.contributor.advisor Neumann, Philipp
dc.contributor.author Schlumbohm, Simon
dc.contributor.grantor Helmut-Schmidt-Universität/Universität der Bundeswehr Hamburg
dc.contributor.referee Neumann, Julia E.
dc.contributor.referee Carraro, Thomas
dc.date.issued 2025-07-24
dc.description.abstract The acquisition of high-quality data in the biomedical field, particularly in omics studies such as proteomics or transcriptomics, poses a significant challenge due to incomplete measurements during data acquisition or simply small sample sizes. This issue results in datasets with low statistical power that are often further compromised by missing values, which impede downstream analysis and the accurate interpretation of biological phenomena. A common approach to mitigating such limitations is data integration, which combines multiple datasets to increase cohort sizes by incorporating data from different studies or laboratories. However, this approach introduces new challenges, notably the so-called batch effect, which introduces internal biases and obscures biological meaning. Moreover, infrequently measured features (e.g., proteins or genes) create additional gaps in the data during integration tasks. As the volume of available biological data continues to expand, there is an increasing need for computational methods capable of efficiently processing and analyzing these growing datasets. Expected future advancements in data acquisition throughput necessitate the development of computationally efficient and robust algorithms. In addition, to ensure accessibility and broad adoption, it is crucial that bioinformatics tools be user-friendly, allowing researchers with varying levels of technical expertise to use them effectively. To this end, an integration and batch-effect reduction tool, the HarmonizR algorithm, has been developed. This work presents various functionality built to tackle the aforementioned issues. Dataset integration aims to increase cohort sizes and sample counts, which is facilitated by the inclusion of a new unique removal approach.
It overcomes prior limitations regarding data retention, greatly increasing HarmonizR's value as a pipeline tool applied before data analysis by significantly expanding the number of usable features and data points in any given study. This may be paired with the added functionality of accounting for user-defined experimental information, such as treatment groups (i.e., covariate information), during adjustment, leading to more robust and higher-quality results. Regarding computational efficiency, a novel blocking approach exploits the given data structure to prepare the algorithm for current and future big-data challenges without negatively impacting adjustment quality. Furthermore, the algorithm's batch-effect adjustment capabilities are shown to be effective on various omics types, with a notable extension to single-cell count datasets through additional adjustment methodology, as well as on non-biological data in the form of an attention-deficit/hyperactivity disorder study. To address remaining challenges, the newly developed BERT algorithm introduces a novel architectural approach, offering improvements in information retention and computational efficiency. A comparative analysis of BERT and HarmonizR explores BERT's advantages in feature and overall data retention as well as reduced runtimes, providing a valuable complement to the existing framework. Lastly, to enhance accessibility and ease of use, plugins for the popular Perseus software have been created and are described, enabling seamless integration of both algorithms into established bioinformatics workflows and specifically aiding researchers less familiar with the technical aspects of the algorithms presented here and with bioinformatics in general. This work advances the field by enabling, for the first time, the adjustment of omics data with missing values without substantial information loss.
As a result, researchers can now confidently merge datasets at scales that were previously infeasible, unlocking new possibilities for large-scale, multi-cohort studies. These capabilities pave the way for more comprehensive and statistically powerful biomedical analyses.
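The core setting the abstract describes, batch-effect adjustment on data with missing values, can be illustrated with a minimal sketch. This is not the HarmonizR or BERT implementation; the per-batch mean-centering, the function name, and the data layout are all illustrative assumptions made for this example:

```python
import numpy as np

def center_batches(data, batches):
    """Toy batch-effect reduction: per-batch, per-feature mean-centering.

    data    : 2D array, rows = features, columns = samples; NaN = missing.
    batches : batch label per column.
    Missing values are ignored when computing means and stay NaN afterwards,
    mirroring the requirement that adjustment must tolerate incomplete data.
    """
    adjusted = np.asarray(data, dtype=float).copy()
    for b in set(batches):
        cols = [i for i, lab in enumerate(batches) if lab == b]
        block = adjusted[:, cols]
        # Feature-wise mean over observed (non-NaN) values of this batch only.
        means = np.nanmean(block, axis=1, keepdims=True)
        adjusted[:, cols] = block - means
    return adjusted
```

In the actual HarmonizR approach, the input matrix is dissected into complete sub-matrices and established adjustment methods are applied to each; the sketch above only shows why missingness must be handled explicitly rather than, for example, dropping every feature with a gap.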
dc.description.version VoR
dc.identifier.doi 10.24405/20448
dc.identifier.uri https://openhsu.ub.hsu-hh.de/handle/10.24405/20448
dc.language.iso en
dc.publisher Universitätsbibliothek der HSU/UniBw H
dc.relation.orgunit High Performance Computing
dc.rights.accessRights open access
dc.title Efficient algorithms for improved information retention in integration of incomplete omics datasets
dc.type Dissertation
dcterms.bibliographicCitation.originalpublisherplace Hamburg
dcterms.dateAccepted 2025-07-21
dspace.entity.type Publication
hsu.thesis.grantorplace Hamburg
hsu.uniBibliography
Files
Original bundle
Name: openHSU_20448.pdf
Size: 12.18 MB
Format: Adobe Portable Document Format