27 May 2025 - New article by István Üveges and Orsolya Ring has been published | MTA TK Centre for Social Sciences Artificial Intelligence National Laboratory

István Üveges and Orsolya Ring have published a new article in the journal Information, titled “Evaluating the Impact of Synthetic Data on Emotion Classification: A Linguistic and Structural Analysis”.

Abstract:

Emotion classification in natural language processing (NLP) has recently witnessed significant advancements. However, class imbalance in emotion datasets remains a critical challenge, as dominant emotion categories tend to overshadow less frequent ones, leading to biased model predictions. Traditional techniques, such as undersampling and oversampling, offer partial solutions. More recently, synthetic data generation using large language models (LLMs) has emerged as a promising strategy for augmenting minority classes and improving model robustness. In this study, we investigate the impact of synthetic data augmentation on German-language emotion classification. Using an imbalanced dataset, we systematically evaluate multiple balancing strategies, including undersampling overrepresented classes and generating synthetic data for underrepresented emotions using a GPT-4–based model in a few-shot prompting setting. Beyond enhancing model performance, we conduct a detailed linguistic analysis of the synthetic samples, examining their lexical diversity, syntactic structures, and semantic coherence to determine their contribution to overall model generalization. Our results demonstrate that integrating synthetic data significantly improves classification performance, particularly for minority emotion categories, while maintaining overall model stability. However, our linguistic evaluation reveals that synthetic examples exhibit reduced lexical diversity and simplified syntactic structures, which may introduce limitations in certain real-world applications. These findings highlight both the potential and the challenges of synthetic data augmentation in emotion classification. By providing a comprehensive evaluation of balancing techniques and the linguistic properties of generated text, this study contributes to the ongoing discourse on improving NLP models for underrepresented linguistic phenomena.
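
For readers unfamiliar with the terms used in the abstract, the following is a minimal Python sketch of two of the ideas it mentions: undersampling overrepresented classes and measuring lexical diversity. It is an illustration only, not the authors' implementation. The function names `undersample` and `type_token_ratio`, the whitespace tokenization, the choice of type-token ratio as the diversity measure, and the toy German sentences are all assumptions made for this example.

```python
import random
from collections import defaultdict


def undersample(samples, seed=0):
    """Randomly downsample every class to the size of the smallest class.

    `samples` is a list of (text, label) pairs. This is plain random
    undersampling, assumed here for illustration only.
    """
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    # The smallest class determines how many examples each class keeps.
    target = min(len(group) for group in by_label.values())

    rng = random.Random(seed)  # fixed seed so the result is reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    return balanced


def type_token_ratio(texts):
    """Distinct tokens divided by total tokens (naive whitespace split).

    A lower value suggests less lexical variety. Returns 0.0 for empty input.
    """
    tokens = [tok.lower() for text in texts for tok in text.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Hypothetical toy data: four "joy" examples and one "anger" example.
data = [
    ("Ich bin so froh", "joy"),
    ("Was für ein schöner Tag", "joy"),
    ("Das freut mich sehr", "joy"),
    ("Ich bin glücklich", "joy"),
    ("Das macht mich wütend", "anger"),
]

balanced = undersample(data)
print(len(balanced))  # one example per class remains
print(type_token_ratio([text for text, _ in data]))
```

Comparing `type_token_ratio` on original and generated texts of similar total length gives a rough sense of whether one set is lexically narrower; the measure is sensitive to text length, so it should not be compared across samples of very different sizes.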

The article is available here:

Üveges, István, and Orsolya Ring. 2025. "Evaluating the Impact of Synthetic Data on Emotion Classification: A Linguistic and Structural Analysis" Information 16, no. 4: 330.

https://doi.org/10.3390/info16040330