Causal machine learning: opportunities, limits and social science applications

Over the last decade, two lines of research have had a significant impact on social statistics and brought about a paradigm shift in the field. Structural causal modelling is breaking new ground in social research practice, while machine learning techniques originating in engineering are revolutionising econometrics. The two areas mark seemingly opposite paths of development for social data analysts. Structural causal models impose considerably stricter criteria than current practice on how social theoretical assumptions about relationships between variables are formalised statistically, and the modelling aims to support robust claims about the causal relationships between various factors. By contrast, machine learning techniques, which mostly aim to offer predictions based on a set of variables, are primarily data-driven and render unnecessary the very assumptions that widely applied statistical models typically rely on.

Over the last five years, the two research directions have in fact become closely intertwined in social statistics. First, data analysts who rely on machine learning techniques to support business decision-making have a strong demand for estimating the impact of decisions (treatments). Second, structural causal modelling emerged in the first place out of criticism of data-driven methods in artificial intelligence, and its leading researchers are currently among the most influential theorists in AI research. One of the key directions of research is the development of machine reasoning, which draws on structural causal models. Moreover, a significant part of the most influential results in causal machine learning over the last three to four years has been related to econometric research and to social science (or epidemiological) applications. The development of new procedures is accompanied by considerable theoretical and practical debate. These debates tend to take place between engineering researchers in data-driven or theory-driven artificial intelligence and researchers in causal machine learning with a background in engineering and the social sciences.

The purpose of the research is twofold. First, we aim to understand the new research directions and processes referred to above, as well as the root causes of the debates, and to disseminate our conclusions among researchers of CSS and MILAB and within the national social research community. Second, we aim to enable members of the CSS research community specialising in quantitative data analysis, and MILAB researchers working on predictive analysis of tabular data, to understand and master the basics of these machine learning techniques. A causal data analysis reading group was established at CSS in autumn 2020. We picked up that thread during the initial phase of the research, working in collaboration with relevant partners and disseminating the results. We organised training sessions on machine learning and lectures on causal statistics, and carried out statistical analyses to investigate whether the new methods are indeed more reliable than traditional regression techniques in estimating the effect of different treatments.
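The kind of comparison described above can be illustrated with a minimal sketch. It is not the project's own analysis: the data-generating process, the function names and all parameter values are assumptions made for illustration. The confounder acts on treatment and outcome through its square, so a linear regression adjustment misses it, while a simple stratified estimator, used here as a plain stand-in for the more flexible adjustment that causal machine learning methods provide, comes much closer to the known true effect.

```python
import math
import random

def simulate(n, tau=1.0, seed=0):
    """Toy data (assumed): Z confounds T and Y through z**2; tau is the true effect."""
    rng = random.Random(seed)
    z, t, y = [], [], []
    for _ in range(n):
        zi = rng.uniform(-2.0, 2.0)
        p = 1.0 / (1.0 + math.exp(-(zi * zi - 1.0)))  # treatment probability
        ti = 1.0 if rng.random() < p else 0.0
        yi = tau * ti + 3.0 * zi * zi + rng.gauss(0.0, 1.0)
        z.append(zi)
        t.append(ti)
        y.append(yi)
    return z, t, y

def mean(xs):
    return sum(xs) / len(xs)

def residualize(x, z):
    """Residuals of x after a simple linear regression on z (with intercept)."""
    mx, mz = mean(x), mean(z)
    slope = (sum((a - mx) * (b - mz) for a, b in zip(x, z))
             / sum((b - mz) ** 2 for b in z))
    return [(a - mx) - slope * (b - mz) for a, b in zip(x, z)]

def ols_effect(z, t, y):
    """Coefficient on t in y ~ 1 + t + z, via the Frisch-Waugh-Lovell theorem."""
    rt, ry = residualize(t, z), residualize(y, z)
    return sum(a * b for a, b in zip(rt, ry)) / sum(a * a for a in rt)

def stratified_effect(z, t, y, n_bins=20):
    """Size-weighted average of within-bin treated-minus-control mean differences."""
    bins = [([], []) for _ in range(n_bins)]  # (control outcomes, treated outcomes)
    for zi, ti, yi in zip(z, t, y):
        b = min(int((zi + 2.0) / 4.0 * n_bins), n_bins - 1)
        bins[b][int(ti)].append(yi)
    total, weight = 0.0, 0
    for control, treated in bins:
        if control and treated:  # skip bins lacking one of the two groups
            size = len(control) + len(treated)
            total += size * (mean(treated) - mean(control))
            weight += size
    return total / weight

z, t, y = simulate(20000)
ols_hat = ols_effect(z, t, y)
strat_hat = stratified_effect(z, t, y)
print(f"true effect 1.0 | linear regression {ols_hat:.2f} | stratified {strat_hat:.2f}")
```

Because the true effect is fixed by the simulation, the gap between each estimate and 1.0 can be read directly as that method's error under these particular assumptions; a different data-generating process could reverse the ranking.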

We also developed a data generation program as part of our sub-project on causal machine learning. The next phase of our project focuses on developing a demo version of a Datasheet tool optimised for both social science and business tasks, which can later be operated as a service to assist researchers and business data analysts before they carry out causal analyses. The Datasheet generates synthetic data based on the properties of a specific existing database and on researchers' assumptions, and uses these data to report the properties of various analysis alternatives. The software can also aid the design of future data collection.
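The general idea of such a tool can be sketched as follows. This is a hypothetical illustration and not the project's software: the function names, the example column and the structural equations are all assumptions. A summary of an existing column is combined with an assumed treatment effect and an assumed strength of confounding to generate synthetic data, and the bias of one analysis alternative (a naive comparison of group means) is then reported for each scenario.

```python
import math
import random
import statistics

def summarise(column):
    """Mean and standard deviation of one numeric column of the existing data."""
    return statistics.mean(column), statistics.stdev(column)

def generate(n, covariate_stats, assumed_effect, assumed_confounding, seed=0):
    """Synthetic rows (x, t, y) from an assumed model in which x affects both t and y."""
    rng = random.Random(seed)
    mu, sd = covariate_stats
    rows = []
    for _ in range(n):
        x = rng.gauss(mu, sd)                  # covariate matching the summary
        s = (x - mu) / sd                      # standardised covariate
        p = 1.0 / (1.0 + math.exp(-assumed_confounding * s))
        t = 1 if rng.random() < p else 0       # treatment depends on x
        y = assumed_effect * t + 2.0 * s + rng.gauss(0.0, 1.0)
        rows.append((x, t, y))
    return rows

def naive_difference(rows):
    """Analysis alternative under test: unadjusted difference in group means."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return statistics.mean(treated) - statistics.mean(control)

existing_ages = [23, 35, 41, 29, 52, 47, 38, 60, 31, 44]  # hypothetical column
stats = summarise(existing_ages)
true_effect = 2.0                                          # researcher's assumption
biases = {}
for strength in (0.0, 1.5):                                # assumed confounding
    rows = generate(5000, stats, true_effect, strength)
    biases[strength] = naive_difference(rows) - true_effect
    print(f"confounding {strength}: naive estimate is off by {biases[strength]:+.2f}")
```

Under these assumptions the naive comparison is close to the truth when treatment is unrelated to the covariate and clearly biased otherwise, which is the kind of information an analyst could use when choosing a method or planning data collection.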

Our results to date include a free-to-use, open-source, flexible data generation algorithm for stand-alone method-testing analyses and datasheet building; introductory courseware (annotated code) on machine learning for social scientists; collaboration on innovation with an external corporate partner; and consulting on the development of a causal data analysis module for Retail4, a commercial business information software package developed by a medium-sized Hungarian company.

Project participants
Jakab Buda
Gábor Hajdu
Béla Janky
Blanka Szeitl
Emese Thamó

Keywords: causal machine learning, structural causal models, machine reasoning, business and policy decision making