Integreat will inject linguistic knowledge into the neural modelling of natural language by devising novel, linguistically informed self-supervised training procedures for language models that capture structure and semantics during pre-training. We will introduce new modelling approaches based on multi-rooted directed graphs for a variety of central NLP tasks. In addition to being hierarchical in nature, the graph-parsing perspective unlocks new possibilities for integrating prior knowledge about the domain and the problem, both deterministic and stochastic. Further, the overarching goal of automated language understanding is currently broken down into numerous levels of analysis, each focusing on a separate sub-task, yet understanding human language ultimately requires solving all of these tasks, and more, jointly.
Recent studies show that linguistic properties essential for a full understanding of language, such as the effects of negation, are entirely missing from models trained only on surface-level, sequential prediction tasks. The further integration of knowledge in the form of domain-specific logical ontologies, lexical resources, or even image data will build on modelling techniques such as graph neural networks (GNNs), various types of transfer learning, multi-task learning, and multimodal pre-training. A further research area for Integreat will be probabilistic context understanding, where latent structure must be inferred under uncertainty. Several other research themes will use NLP as a testing arena for their new methods.
Key researchers in this theme: