Automated multimedia knowledge extraction from publications using the example of structural equation models

The aim of the project was to develop a platform that enables the effective extraction, exploration and aggregation of knowledge from scientific publications. The first phase of the funded activity focused on knowledge extraction. Following a multimodal approach, the information contained in the publications was to be extracted from both text and images. To automate the time-consuming task of knowledge extraction as far as possible, specific classifiers were trained using supervised machine learning methods.

A key milestone was the provision of a comprehensive annotated data set for structural equation models. This data set is an important cornerstone for the application of supervised multimodal extraction methods. In this context, software for generating synthetic data sets was also developed.
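To illustrate what a synthetic data set means here, the following is a minimal sketch of a generator that produces a path diagram together with its ground-truth label, so that no manual annotation is needed. It is not the project's generator software: the function name `generate_sem_sample`, the SVG output, the label schema and the deliberately naive layout are all assumptions made for illustration.

```python
import json
import random


def generate_sem_sample(n_latent=3, seed=None):
    """Return (svg_text, label) for one randomly generated path diagram.

    Hypothetical sketch: names, layout and label schema are assumptions.
    """
    rng = random.Random(seed)
    names = [f"LV{i + 1}" for i in range(n_latent)]
    # Place the latent variables on a horizontal line (naive layout).
    positions = {name: (100 + 150 * i, 100) for i, name in enumerate(names)}

    # Only allow paths from earlier to later variables, so the model is acyclic.
    paths = []
    for i in range(n_latent):
        for j in range(i + 1, n_latent):
            if rng.random() < 0.6:
                paths.append({
                    "source": names[i],
                    "target": names[j],
                    "coefficient": round(rng.uniform(-0.9, 0.9), 2),
                })

    width = 200 + 150 * (n_latent - 1)
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="200">']
    # Draw the paths first so that the nodes are rendered on top of them.
    for p in paths:
        x1, y1 = positions[p["source"]]
        x2, y2 = positions[p["target"]]
        parts.append(f'<line x1="{x1}" y1="{y1}" x2="{x2}" y2="{y2}" stroke="black"/>')
        parts.append(
            f'<text x="{(x1 + x2) // 2}" y="{y1 - 45}" text-anchor="middle">'
            f'{p["coefficient"]}</text>'
        )
    for name, (x, y) in positions.items():
        parts.append(f'<ellipse cx="{x}" cy="{y}" rx="40" ry="25" fill="white" stroke="black"/>')
        parts.append(f'<text x="{x}" y="{y + 5}" text-anchor="middle">{name}</text>')
    parts.append("</svg>")

    # The label is known by construction and needs no manual annotation.
    label = {"latent_variables": names, "paths": paths}
    return "\n".join(parts), label
```

A call such as `svg, label = generate_sem_sample(n_latent=4, seed=1)` returns the image markup and a label that can be serialized with `json.dumps(label)`. Fixing the seed makes each sample reproducible, which helps when a data set has to be regenerated.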

In the later course of the project, the focus shifted primarily to expanding the application domains. This began with the creation of a further data set of structural formulas in chemistry. Building on existing textual representations of these molecules (SMILES, InChI), a very large data set of images and the associated labels could be compiled semi-automatically. This enabled us to reduce the manual effort involved in annotating chemical structural formulas to virtually zero.

The main impact of the activity was to initiate collaboration between the research groups of Mädche and Stiefelhagen. As part of this collaboration, the groups successfully demonstrated that the extraction of high-quality metadata from images in publications can be automated.