Automated text analysis in the research data repositories of the Centre for Social Sciences

The aim of our pilot project is to test various methods of automated text analysis on selected samples taken from the interview repositories of two social science digital archives: Voices of the 20th Century and the Research Documentation Center at the Centre for Social Sciences. The tests will serve to identify the best method or methods, whose results will then be integrated into the two digital repositories. Our purpose is to attach metadata automatically to each individual interview, providing researchers with information about the content of the texts and locating the texts and text parts, connected within or across collections, that are relevant to a given research question. Using computational methods, indexes and subject headings will be generated for every text. Our researchers will first verify that the generated categories are adequate and relevant; the selected method should later produce appropriate indexes and subject headings for archived documents without subsequent manual checking. After the automated processing of the interview collections, the pilot project will also integrate the output of the automated text analysis into the metadata of the individual entries. The results of the process, which reveal the connections among topics and keywords, will be presented in visual form.
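As an illustration of what automatically generated index terms could look like, the following is a minimal sketch that ranks the words of each interview by TF-IDF, so that words frequent in one interview but rare across the collection rise to the top. TF-IDF is used here only as a simple stand-in: it is one of many candidate methods and is not claimed to be the method selected by the project. The function names, the `top_n` parameter and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters (accented letters included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def candidate_index_terms(interviews, top_n=3):
    """Return the top_n TF-IDF-ranked terms for each interview.

    interviews maps a (hypothetical) interview id to its transcript text.
    """
    tokenized = {iid: tokenize(text) for iid, text in interviews.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many interviews does each term occur?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    result = {}
    for iid, tokens in tokenized.items():
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result[iid] = ranked[:top_n]
    return result


# Invented sample texts, for illustration only.
sample = {
    "a": "factory work factory strike",
    "b": "village school work",
}
terms = candidate_index_terms(sample, top_n=2)
```

Terms produced this way are only candidates. In the workflow described above, researchers would still review them before any method is trusted to run without manual checking.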

The subject headings (or labels) attached to the interviews are not simply keywords (or their synonyms) found in the texts; they are meant to capture tacit sociological phenomena and characteristics, which requires an approach similar to sentiment analysis. We also intend to experiment with ways of simultaneously addressing NER (Named Entity Recognition) and anonymization. The conceptual subject headings will be associated with the documents in several new metadata fields, serving as a means of identifying resources for further research among the archived documents. The subject headings will be translated into English, thereby allowing foreign researchers to find and reuse Hungarian-language resources corresponding to their proposed research questions, a problem that otherwise remains unresolved in Central Europe because of language barriers.
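The sketch below illustrates, under stated assumptions, how recognized names could feed anonymization while subject headings and their English translations are stored in separate metadata fields. The list of names stands in for the output of an NER step, which is not implemented here. The field names, the placeholder format, the sample sentence and the heading translations are all hypothetical and do not reflect the repositories' actual metadata schema.

```python
import re


def pseudonymize(text, known_names):
    """Replace each known name with a numbered placeholder.

    known_names is assumed to come from an NER step (not shown).
    Longer names are replaced first so that a full name is not
    partially overwritten by a shorter name it contains.
    """
    replacements = {}
    for name in sorted(set(known_names), key=len, reverse=True):
        placeholder = f"[PERSON_{len(replacements) + 1}]"
        replacements[name] = placeholder
        text = re.sub(rf"\b{re.escape(name)}\b", placeholder, text)
    return text, replacements


def build_metadata(interview_id, text, known_names, headings_hu, translations):
    """Assemble a hypothetical metadata record for one interview."""
    anonymized, _ = pseudonymize(text, known_names)
    return {
        "id": interview_id,
        "anonymized_text": anonymized,
        "subject_headings_hu": list(headings_hu),
        # Only headings with a known translation appear in the English field.
        "subject_headings_en": [
            translations[h] for h in headings_hu if h in translations
        ],
    }


# Invented example data, for illustration only.
record = build_metadata(
    "interview-001",
    "Kovács Anna met Anna.",
    ["Anna", "Kovács Anna"],
    ["társadalmi mobilitás", "munkanélküliség"],
    {"társadalmi mobilitás": "social mobility", "munkanélküliség": "unemployment"},
)
```

Keeping the Hungarian and English headings in parallel fields would let a search interface match English queries against Hungarian-language interviews without translating the transcripts themselves.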