The Text Mining of Political and Legal Texts (poltextLAB) project aims to apply Big Data methods to the analysis of repositories of Hungarian and foreign political and legal documents. Traditional approaches to analysing qualitative data sources (texts, images and videos) have typically relied on manual processing. While knowledge of the source material remains essential in any social science study, manual processing has obvious limitations, particularly in terms of the reliability and validity of research results. The sheer variety and volume of data sources (e.g. a country’s entire body of legislation) can make manual processing impractical. Quantitative text analysis and text mining approaches thus represent a new methodological standard for text-based Big Data projects in the social sciences.
During the project, we build large text corpora, used primarily to develop and test the effectiveness of various machine learning algorithms. In addition to developing new methodological solutions for the analysis of Hungarian texts, improving the effectiveness of existing algorithms and developing new hybrid methods for solving classification tasks, the project also aims to extend these methods to the analysis of non-Hungarian corpora. Various state-of-the-art large language models (e.g. BERT) have been used successfully for both Hungarian and multilingual classification tasks with up to 20 classes. At the end of 2022, we launched our latest innovation, the CAP BABEL MACHINE, which relies on a multilingual BERT model for the automated identification of policy areas in texts, using the main topics of the policy codebook of the Comparative Agendas Project (CAP). Through a form available at poltextlab.com/cap-machine or http://www.capbabel.com/, users can upload the files they wish to have coded. Once coding is finished, the processed data is returned to users shortly afterwards.
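To illustrate what assigning a policy topic to a text involves, the following is a minimal sketch of supervised topic classification. It does not reproduce the CAP BABEL MACHINE or its multilingual BERT model: a simple multinomial naive Bayes classifier is swapped in so that the example runs without external libraries. The training sentences, the three topic labels and all function names are invented for illustration only.

```python
import math
import re
from collections import Counter, defaultdict

# Toy training data (invented for illustration): (text, topic label).
# The labels are stand-ins for policy topics, not an actual codebook.
TRAIN = [
    ("the hospital budget funds new doctors and nurses", "health"),
    ("vaccination programme for patients in public clinics", "health"),
    ("schools hire teachers and revise the curriculum", "education"),
    ("university students receive tuition support", "education"),
    ("the central bank raises interest rates to curb inflation", "macroeconomics"),
    ("government deficit and tax revenue forecasts", "macroeconomics"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest naive Bayes log-score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        # Log prior: share of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothed log-likelihood of the word.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


MODEL = train(TRAIN)
print(classify("teachers and students in schools", MODEL))
```

A production workflow would replace the word counts with a fine-tuned language model and a much larger hand-coded training corpus, but the input and output have the same shape: a text goes in and a single topic label comes out.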
One of the priorities of the project is to set up a national and international network of like-minded researchers using text mining techniques. For years, we have been the main organiser of the international COMPTEXT conference, which aims to provide an opportunity for researchers relying on text mining techniques to hold regular meetings and learn about each other’s results. Our conferences are attended each year by around 150 participants from prestigious international universities and research institutions. As part of the project, 2 to 3 text mining training sessions are held for social scientists each year. Our Text Mining and Artificial Intelligence training program provides an introduction to supervised and unsupervised machine learning algorithms at both beginner and advanced levels. The program is built on our textbook Text Mining and Artificial Intelligence in R (authors: Sebők Miklós, Ring Orsolya, Máté Ákos), published in 2021, which is based on analyses of our corpora. Our Data Visualisation in R course aims to provide a practical and interactive overview of data visualisation using R’s ggplot2 package.
Project participants
Ágnes Dinnyés
István Járay
Péter Gelányi
Márk György Kis
Rebeka Kiss
Ádám Kovács
Viktor Kovács
Bálint Kubik
Richárd Lehoczki
Ákos Máté
Csaba Molnár
Orsolya Ring
Miklós Sebők
Anna Székely
István Üveges
Cooperating partners
Jagiellonian University Kraków
National University of Ireland, Galway
University of Public Service Graduate School for Public Administration
University of Pécs, Microsoft AI Knowledge Center
Reichman University
University of Szeged
University of Cologne
Publications
Gelányi Péter, Sebők Miklós, Ring Orsolya. A topikmodellezés lehetőségei és korlátai egy törvénykorpusz példáján (The possibilities and limitations of topic modelling illustrated by the example of a legal corpus). Statisztikai Szemle 100: 8, pp. 783–814., 2022
Kiss Rebeka, Sebők Miklós. Creating an Enhanced Infrastructure of Parliamentary Archives for Better Democratic Transparency and Legislative Research – Report on the OPTED forum in the European Parliament (Brussels, Belgium, 15 June 2022). International Journal of Parliamentary Studies, 2 (2), pp. 278–284, 2022
Sebők Miklós, Kacsuk Zoltán, Máté Ákos. The (real) need for a human touch: Testing a human-machine hybrid topic classification workflow on a New York Times corpus. Quality and Quantity: International Journal of Methodology 56, pp. 3621–3643., 23 p., 2022
Sebők Miklós, Kubik Bálint György, Molnár Csaba, Járay István, Székely Anna. Measuring legislative stability – A new approach with data from Hungary. European Political Science 21, pp. 491–521., 2022
Sebők Miklós, M. Balázs Ágnes, Molnár Csaba. Punctuated Equilibrium and Progressive Friction in Socialist Autocracy, Democracy and Hybrid Regimes. Journal of Public Policy 42(2), pp. 247–269., 2022
Sebők Miklós, Boda Zsolt (eds.). Policy Agendas in Autocracy, and Hybrid Regimes. London: Palgrave Macmillan, 2021
Sebők Miklós, Kacsuk Zoltán. The Multiclass Classification of Newspaper Articles with Machine Learning: The Hybrid Binary Snowball Approach. Political Analysis 29: 2, pp. 236–249., 14 p., 2021
Sebők Miklós, Kozák Sándor. From State Capture to “Pariah” Status? The Preference Attainment of the Hungarian Banking Association (2006–2014). Business and Politics 23: 2, pp. 179–201., 2021
Sebők Miklós, Ring Orsolya, Máté Ákos. Szövegbányászat és Mesterséges Intelligencia R-ben (Text Mining and Artificial Intelligence in R). Budapest: Typotex Kiadó, 184 p., 2021
Bolonyai Flóra, Sebők Miklós. Kvantitatív szövegelemzés és szövegbányászat (Quantitative text analysis and text mining). In: Jakab András, Sebők Miklós (eds.) Empirikus jogi tanulmányok. Budapest: Osiris Kiadó, MTA Társadalomtudományi Kutatóközpont, 660 p., pp. 361–380., 20 p., 2020
Sebők Miklós, Gajduschek György, Molnár Csaba (eds.). A magyar jogalkotás minősége: Elmélet, mérés, eredmények (The quality of Hungarian legislation: theory, measurement and results). Budapest: Gondolat Kiadó, 400 p., 2020
Major conferences
American Political Science Association (APSA) Political Methodology Specialist Group, Politics and Computational Social Science, Annual COMPTEXT Conference, OPTED Data4Parliaments
International grants awarded
OPTED – Observatory for Political Texts in European Democracies (2020–2023) Horizon 2020 Grant Agreement no. 951832.
Repositories
GitHub – poltextlab/textreuse_ch_hun
GitHub – poltextlab/text_mining_workshop
GitHub – poltextlab/CLARIN_ParlaMint_HU
GitHub – poltextlab/HunMineR: Companion package for the Hungarian text mining textbook
GitHub – poltextlab/tankonyv: Szövegbányászat és mesterséges intelligencia R-ben
GitHub – poltextlab/nyt_hybrid_classification_workflow: Replication material for Sebők, M., Kacsuk, Z., & Máté, Á. (2021). The (real) need for a human touch: testing a human–machine hybrid topic classification workflow on a New York Times corpus. Quality & Quantity, 1–23.
Multilingual comparable corpora of parliamentary debates ParlaMint 2.1
Keywords: machine learning, elaboration of machine learning algorithms, corpus construction, quantitative text analysis