Hungarian benchmark databases for machine learning development

Owing to rapid digitalization and the fast development of language technology tools, empirical, data-driven research that addresses social science questions by examining real text corpora has gained a prominent role in applied linguistics in recent years. In this sub-project, we improved CAP’s existing databases and created benchmark databases by extending our existing data in two directions. On the one hand, we included (oral) questions alongside the actual questions to ministers, quick-fire questions and speeches before the order of the day for the period between 2010 and 2022. On the other hand, we added the texts from the two parliamentary cycles between 2002 and 2010 to our existing data and corpora for those four types of texts.

Substantial manually or machine-annotated corpora can also serve as a resource for state-of-the-art techniques such as training machine learning models. Although almost all of the most advanced models are currently based on artificial neural networks, the use of earlier methods (logistic regression, support vector machines, etc.) is still justified for reasons of efficiency: using large models such as BERT is often very compute-intensive, not only during training but also when the model is applied for prediction.
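As a rough illustration of why such earlier methods remain cheap to run, the following is a minimal sketch of a bag-of-words logistic regression classifier in plain Python. It is not the project's implementation: the example sentences, the two labels (health versus transport) and all function names are invented for illustration, and a real experiment would rely on the annotated corpora and an established machine learning library.

```python
import math
from collections import Counter

# Hypothetical toy data, invented for this sketch (not taken from the corpora).
TEXTS = [
    "hospital waiting lists and doctors pay",
    "funding for rural hospitals and nurses",
    "railway line renovation and road tolls",
    "motorway construction and bus timetables",
]
LABELS = [1, 1, 0, 0]  # 1 = health, 0 = transport (assumed labels)


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real tokenizer)."""
    return text.lower().split()


def build_vocab(texts):
    """Map every token seen in training to a feature index."""
    vocab = {}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab


def featurize(text, vocab):
    """Sparse bag-of-words vector {feature_index: count}; unknown tokens are dropped."""
    counts = Counter(tok for tok in tokenize(text) if tok in vocab)
    return {vocab[tok]: float(n) for tok, n in counts.items()}


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logreg(texts, labels, epochs=200, lr=0.5):
    """Fit weights and bias with plain stochastic gradient descent on log loss."""
    vocab = build_vocab(texts)
    weights = [0.0] * len(vocab)
    bias = 0.0
    rows = [featurize(t, vocab) for t in texts]
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            z = bias + sum(weights[i] * v for i, v in x.items())
            err = sigmoid(z) - y  # gradient of the log loss with respect to z
            bias -= lr * err
            for i, v in x.items():
                weights[i] -= lr * err * v
    return vocab, weights, bias


def predict_proba(text, model):
    """Probability that the text belongs to class 1 under the fitted model."""
    vocab, weights, bias = model
    x = featurize(text, vocab)
    return sigmoid(bias + sum(weights[i] * v for i, v in x.items()))


model = train_logreg(TEXTS, LABELS)
print(predict_proba("hospitals need more nurses", model))
print(predict_proba("new railway and motorway", model))
```

Prediction here costs only a handful of multiplications per text, which is the efficiency argument in miniature. The trade-off is that such a model ignores word order and context, which is where neural models such as BERT tend to have the advantage.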

During the research, machine learning experiments were carried out on the resulting corpora in order to assess their practical value in addressing sociological issues. A wide spectrum of methods (e.g. SVM, LSTM, BERT) was employed in order to match the most appropriate technology to each research question. In addition to the benchmark database built during the research, two further resources were created: the Parlawspeech database, which includes the corpus and data of parliamentary speeches, bills and laws between 1994 and 2022, and the database of Magyar Nemzet front pages between 2002 and 2014. A study by Csaba Molnár entitled If there is nothing else to say: the local content of interpellations was also published in the Journal of Legislative Studies.

Project participants
László Kiss
Csaba Molnár
Tamás Barczikay
Rebeka Kiss
Adrienn Klein
Viktor Kovács
Bálint György Kubik
Zsanett Pokornyi
István Üveges

Publications
Molnár Csaba. If there is nothing else to say: the local content of interpellations. The Journal of Legislative Studies. Early Access, published online 2 October 2022, pp. 1–23.

Conference paper
Boda Zsolt, Kiss László, Molnár Csaba. Access to international comparative databases at the Centre for Social Sciences – Introducing the Comparative Agendas Project of CSS. Szöveg. Gép. Társadalom, Budapest, ELTE Társadalomtudományi Kar, 20 September 2022.

Repository
Hungarian PARLAWSPEECH dataset