To combine or not to combine? The influence of combining training datasets on the robustness of deep learning models

An analysis for optical character recognition of handwriting
Publication date
2025-03-31
Document type
Conference paper
Author
Fischer-Brandies, Leopold  
Müller, Lucas
Rebholz, Benjamin
Buettner, Ricardo  
Organisational unit
Hybrid Intelligence  
DOI
10.1109/access.2025.3556582
URI
https://openhsu.ub.hsu-hh.de/handle/10.24405/23028
Publisher
IEEE
Series or journal
IEEE Access
ISSN
2169-3536
Periodical volume
13
Part of the university bibliography
✅
Additional Information
Language
English
Abstract
The present manuscript addresses the question of how training data should be sampled for deep learning models by analyzing and evaluating the impact of training data representation and complexity on the performance and robustness of deep learning models. To address this open question, we take a combinatorial approach and train three architecturally identical deep learning models on three combinations of handwritten digit datasets of varying complexity: EMNIST Digits, DIDA, and a newly composed third dataset combining the first two datasets. Each model was evaluated using withheld test data from all three datasets. We find that models trained exclusively on either EMNIST Digits or DIDA performed well on their respective datasets but poorly on unfamiliar datasets. However, the model trained on both datasets showed an overall solid performance, although not quite reaching the accuracy of the specialized models on their respective datasets. We conclude that while specializing in the training dataset can increase accuracy, a more diverse dataset enhances model robustness. In practice, deep learning models should thus be trained on data that represents the actual application environment as closely as possible or, if such data is not available, on diverse data.
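The evaluation design described in the abstract (one model per training set, each scored on every withheld test set) can be sketched as a small cross-evaluation matrix. This is a minimal illustration under stated assumptions, not the authors' implementation: `cross_evaluate`, the nearest-centroid classifier, and the names `set_a`, `set_b`, and `combined` are hypothetical, and the one-dimensional toy values are placeholders that merely stand in for EMNIST Digits, DIDA, and their combination.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]   # (feature vector, class label)
Dataset = List[Sample]
Model = Dict[int, List[float]]         # class label -> centroid


def fit_nearest_centroid(data: Dataset) -> Model:
    """Stand-in for model training: compute one centroid per class."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for features, label in data:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model: Model, features: Sequence[float]) -> int:
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    return min(
        model,
        key=lambda label: sum((f - c) ** 2 for f, c in zip(features, model[label])),
    )


def accuracy(model: Model, data: Dataset) -> float:
    """Fraction of samples in `data` that the model labels correctly."""
    correct = sum(1 for features, label in data if predict(model, features) == label)
    return correct / len(data)


def cross_evaluate(
    train_sets: Dict[str, Dataset],
    test_sets: Dict[str, Dataset],
    fit: Callable[[Dataset], Model],
    score: Callable[[Model, Dataset], float],
) -> Dict[str, Dict[str, float]]:
    """Train one model per training set and score it on every test set."""
    results: Dict[str, Dict[str, float]] = {}
    for train_name, train_data in train_sets.items():
        model = fit(train_data)
        results[train_name] = {
            test_name: score(model, test_data)
            for test_name, test_data in test_sets.items()
        }
    return results


# Toy placeholder data (NOT the paper's data): two sources with shifted features.
a_train: Dataset = [([0.0], 0), ([0.2], 0), ([1.0], 1), ([1.2], 1)]
a_test: Dataset = [([0.1], 0), ([1.1], 1)]
b_train: Dataset = [([5.0], 0), ([5.2], 0), ([6.0], 1), ([6.2], 1)]
b_test: Dataset = [([5.1], 0), ([6.1], 1)]

train_sets = {"set_a": a_train, "set_b": b_train, "combined": a_train + b_train}
test_sets = {"set_a": a_test, "set_b": b_test, "combined": a_test + b_test}

results = cross_evaluate(train_sets, test_sets, fit_nearest_centroid, accuracy)
for train_name, row in results.items():
    print(train_name, row)
```

Each row of `results` corresponds to one training configuration and each column to one withheld test set, so the diagonal holds in-distribution accuracy and the off-diagonal entries indicate how well a model transfers to unfamiliar data. The toy classifier is far too simple to reproduce the paper's findings; only the structure of the comparison carries over.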
Description
This work is licensed under a Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/).
Version
Published version
Access right on openHSU
Metadata only access
