Natural Language

Natural language allows us to communicate about matters of unlimited complexity and as such constitutes a very powerful representation of knowledge. Access to the knowledge contained in vast amounts of text and audio requires high-quality natural language processing (NLP) tools. The use of deep neural networks based on sequential modelling of text has led to a paradigm shift in NLP. Today, NLP research is dominated by classification and sequence-labelling techniques combined with language models that are pre-trained by predicting surface properties over vast amounts of raw text. Linguistically, however, this development is paradoxical: language structure is hierarchical in nature, so sequential models can at best approximate it.

Integreat will inject linguistic knowledge into the neural modelling of natural language by devising novel, linguistically informed self-supervised training procedures for language models that capture structure and semantics during pre-training. We will introduce new modelling approaches based on multi-rooted directed graphs for a variety of central NLP tasks. In addition to being hierarchical in nature, the graph-parsing perspective unlocks new possibilities for integrating prior knowledge, both deterministic and stochastic, about the domain and the problem. Further, the overarching goal of automated language understanding is currently broken down into numerous levels of analysis, each focusing on a separate sub-task. Clearly, understanding human language involves all of these tasks, and more, together.
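To make the notion of a multi-rooted directed graph concrete, here is a minimal, illustrative sketch in plain Python. The sentence, the edge labels, and the dictionary-based representation are our own simplifications for exposition, not Integreat's actual formalism: each clause's verb heads its own predicate-argument structure, so the semantic graph as a whole has more than one root.

```python
from collections import defaultdict

def build_graph(edges):
    """Build an adjacency map from (head, label, dependent) triples."""
    graph = defaultdict(list)
    for head, label, dep in edges:
        graph[head].append((label, dep))
    return graph

# Toy semantic graph for "Dogs bark; cats meow": two independent
# predicate-argument structures, hence two roots (hypothetical labels).
edges = [
    ("bark", "ARG1", "dogs"),
    ("meow", "ARG1", "cats"),
]
graph = build_graph(edges)

# A root is any head that never occurs as a dependent of another node.
dependents = {dep for _, _, dep in edges}
roots = sorted(h for h in graph if h not in dependents)
print(roots)  # ['bark', 'meow'] -- a multi-rooted directed graph
```

Unlike a tree, nothing here forces a single root or forbids a node from having several incoming edges, which is what makes graph representations a natural fit for semantic structures.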

Recent studies show that linguistic properties essential for a full understanding of language, such as the effects of negation, are completely missing from models trained only on surface-level, sequential prediction tasks. The further integration of knowledge in the form of domain-specific logical ontologies, lexical resources, or even image data will be based on modelling techniques such as graph neural networks (GNNs), various types of transfer learning, multi-task learning, and multimodal pre-training. A further research area for Integreat will be probabilistic context understanding, where latent structure must be inferred under uncertainty. Several other research themes will use NLP as a testing arena for their new methods.
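A toy example can illustrate why surface-level representations struggle with negation. The sketch below (our own illustration, not one of the studies referred to above) compares bag-of-words count vectors, a crude stand-in for purely surface-based features, for two sentences that differ only by a single "not": their surface similarity is very high even though their meanings are opposite.

```python
import math
from collections import Counter

def bow(sentence):
    """Bag-of-words count vector: a crude surface-level representation."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))

# Two sentences with opposite meanings, differing only by negation.
s1 = "the drug was effective in the trial"
s2 = "the drug was not effective in the trial"

sim = cosine(bow(s1), bow(s2))
print(round(sim, 2))  # 0.95 -- near-identical on the surface
```

The single negation token barely perturbs the vector, so any model that relies only on such surface statistics will treat the two sentences as near-synonyms; capturing the scope and effect of negation requires the kind of structural modelling discussed above.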


Key researchers in this research theme:

Published 3 July 2023 10:55 - Last modified 6 Sep. 2023 18:23