A thematic exploration of textual research resources in CSS data repositories

Our pilot project aimed to improve archive searchability by testing various machine text analysis techniques on a sample of interview collections held in the digital social science archives of Voices of the 20th Century and of the Research Documentation Centre (KDK) of CSS. The project was carried out in cooperation between the KDK and SZTAKI’s Department of Distributed Systems. After the most appropriate technique had been selected, applied and validated, the results were integrated into the beta version of a repository search engine. In effect, metadata were automatically assigned to each interview. These metadata tell researchers what the texts (interviews or extracts of interviews) are about, and where to find texts and excerpts that are related to each other (even across several collections) and relevant to their research questions.

Subject headings and subject indexes were generated for the interview texts, first manually and then with machine assistance. Their adequacy was verified by our researchers. Even where documents were subjected to machine analysis only, without subsequent manual verification, the validation of results has led to efficient subject heading and subject index generation. The subject headings or labels associated with the interviews are not simply keywords, or synonyms of keywords, appearing in the texts. They are elements of a conceptual network created using ELSST, an international social science thesaurus, which helps to reveal the sociological phenomena inherent in the texts. We have also taken first steps towards named entity recognition (NER). We identified and then wikified name elements and time tags appearing in the texts, linking them to the Wikidata knowledge graph, Geonames, VIAF, PIM and other namespaces.
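The difference between keyword matching and concept-based labelling can be illustrated with a minimal Python sketch. It is not the project's actual pipeline, which is not detailed here. The thesaurus entries, the broader-term relations and the function name `assign_subject_headings` are all hypothetical and are not taken from ELSST. The sketch only shows how surface terms found in a text can be mapped to preferred concepts, and how broader concepts can be added from a conceptual network.

```python
import re
from collections import Counter

# Hypothetical mini-thesaurus: surface term -> preferred concept label.
# These entries are invented for illustration and are NOT taken from ELSST.
TERM_TO_CONCEPT = {
    "factory": "INDUSTRIAL ENTERPRISES",
    "shift work": "WORKING CONDITIONS",
    "wages": "WAGES",
    "salary": "WAGES",
    "commuting": "LABOUR MOBILITY",
}

# Hypothetical broader-term relations between concepts.
BROADER = {
    "WAGES": "INCOME",
    "WORKING CONDITIONS": "EMPLOYMENT",
    "LABOUR MOBILITY": "EMPLOYMENT",
}


def assign_subject_headings(text, min_count=1):
    """Return a sorted list of concept labels supported by the text.

    Each whole-word match of a surface term counts towards its preferred
    concept and, if one is defined, towards that concept's broader concept.
    """
    lowered = text.lower()
    counts = Counter()
    for term, concept in TERM_TO_CONCEPT.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        hits = len(re.findall(pattern, lowered))
        if hits:
            counts[concept] += hits
            broader = BROADER.get(concept)
            if broader:
                counts[broader] += hits
    return sorted(c for c, n in counts.items() if n >= min_count)


if __name__ == "__main__":
    sample = "My salary was low, and the wages at the factory never rose."
    print(assign_subject_headings(sample))
```

In this toy example, "salary" and "wages" both resolve to the same concept, and the broader concept is added even though the word for it never occurs in the text. A real system would also need lemmatisation for an inflecting language such as Hungarian, which this sketch leaves out.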

To improve searchability, abstract subject headings and name elements obtained through machine processing are associated with the documents in several new metadata fields. Existing documents are thus opened up for new research. The subject headings are translated into English, making our archives searchable for researchers abroad. As a result, domestic resources that were previously inaccessible due to the language barrier are now becoming visible and accessible to the international research community. The results of machine processing are visualised in two ways: name elements are highlighted and linked to the dictionary entries, and the frequency of certain topics and subject headings, as well as the relationships between them, are explored and displayed.

In this project, we worked in cooperation with CESSDA (Consortium of European Social Science Data Archives). As part of that cooperation, a Hungarian translation of the English-language ELSST social science thesaurus, containing more than 3,300 terms, was completed in collaboration with the Research Centre of Linguistics (NYTK). It has been available online since September 2022. The project also involved cooperation with the Budapest University of Technology and Economics (BME) to improve the efficiency of BEAST, a Hungarian database speech transcriber. BEAST is an open-source, research-ready system based on the SpeechBrain code, and it uses state-of-the-art transformer neural architectures. It was developed by the NYTK and BME, with financing from the Hungarian Scientific Research Fund (OTKA) and MILAB. Researchers interested in the sociological sources will be able to access the results of our exploration of interview documents on a common online search platform created for the repositories of the CSS Research Documentation Centre.

Project participants
Szabolcs Annus
Emese Antal
Júlia Egyed-Gergely
Georgina Filep
Judit Gárdos
Gergő Havadi
Anna Horváth
Miklós Jakab
Veronika Lipp
Márton Matyasovszky Németh
Enikő Meiszterics
Mária Neményi
Tamás P. Tóth
Bálint Sass
Melinda Szöllősi
Róza Vajda

Publication
Egyed-Gergely Júlia, Vajda Róza, Gárdos Judit, Horváth Anna, Meiszterics Enikő, Micsik András, Martin Dániel, Marx Attila, Pataki Balázs, Siket Melinda. Szociológia, kutatási adatok, mesterséges intelligencia: lehetőségek és tapasztalatok (Sociology, research data and artificial intelligence: opportunities and experiences). In: Tick József, Kokas Károly, Holl András (eds.) Valós térben - az online térért: Networkshop 31: national conference. 20–22 April 2022. University of Debrecen. Budapest, Hungary, HUNGARNET Society, MTA Library and Information Centre, 2022, 364 p., pp. 161–169.

Conference papers
Egyed-Gergely Júlia, Micsik András, Vajda Róza: Sociology, research data and artificial intelligence: opportunities and experiences – presentation, Networkshop 31: national conference, 20–22 April 2022
Gárdos Judit: FAIRsFAIR (EOSC subproject) Final Event. The National Perspective – online roundtable talk, 26 January 2022