Incorporating Data Structure and Logical Knowledge in Learning

Most real-life data are structured; for example, they are often relational and stored in databases, hierarchical as in XML, or in the form of knowledge graphs in formats such as RDF and Property Graphs. Moreover, logic is used to systematise and formalise complex domain knowledge, e.g., in the form of ontologies, and logical inference can be used to enrich explicit structured data (e.g., relational databases or knowledge graphs) with implicit information that logically follows from the explicit data and the ontology. Machine learning (ML) can benefit tremendously from taking both the data structure and the logical knowledge into account. However, structured data are challenging as input for classical ML approaches, since these cannot operate on such data directly, and the challenge increases in the presence of logical knowledge.
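
To illustrate this kind of enrichment, consider the following minimal Python sketch (a toy example with assumed names, not a full reasoner): forward chaining over RDFS-style subclass axioms derives type facts that are implicit in the explicit triples.

```python
# Toy sketch of ontology-based enrichment: forward chaining over a small
# triple store with RDFS-style subclass axioms (illustrative, not a reasoner).

# Explicit data: (subject, predicate, object) triples.
data = {
    ("alice", "type", "PhDStudent"),
    ("bob", "type", "Professor"),
}

# Ontology: subclass axioms, e.g. "every PhDStudent is a Student".
subclass_of = {
    ("PhDStudent", "Student"),
    ("Student", "Person"),
    ("Professor", "Person"),
}

def enrich(triples, axioms):
    """Derive implicit 'type' facts until a fixed point is reached."""
    closure = set(triples)
    changed = True
    while changed:
        changed = False
        for (s, p, o) in list(closure):
            if p != "type":
                continue
            for (sub, sup) in axioms:
                if o == sub and (s, "type", sup) not in closure:
                    closure.add((s, "type", sup))
                    changed = True
    return closure

print(enrich(data, subclass_of) - data)
# e.g. ("alice", "type", "Student"), ("alice", "type", "Person"), ...
```

Running the sketch yields the implicit facts that alice is also a Student and a Person: exactly the kind of information a classical ML pipeline would otherwise have to re-learn from training examples.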

Similar difficulties arise when the output of an ML model is expected to have structure (strings, sequences, graphs). Currently, ML usually deals with structured data by employing ad-hoc embedding techniques and essentially re-learning the structure from training examples. Likewise, ML can learn the impact of logical ontologies only indirectly and inefficiently from training data. These indirect approaches lead to difficulties in learning good-quality models, unsustainable training costs, and poor interpretability. Furthermore, even though knowledge about geometric invariances forms the basis of most physical theories, it is only now starting to be exploited in ML. Among the most promising approaches that allow ML to cope with structure and logical knowledge are graph neural networks (GNNs) of various forms and representation learning in general.

To accommodate structured data, Integreat will first consolidate the theoretical foundations of GNN methods and investigate their practical applicability. We will study GNN variants, including their expressive and approximation power, ways to incorporate various types of knowledge, and their applicability to practical problems such as knowledge graph completion, query answering over incomplete graphs, and anomaly detection. For many GNN architectures, seemingly small variations lead to dramatic changes in accuracy. Incorporating knowledge about the structure in a transparent way can help overcome the opaqueness of ML models. We will study the impact of different knowledge-driven inductive biases, implementing them from first principles of symmetry and invariance to fully exploit geometric learning, as illustrated by the sketch below. We will also explore generalisations of methods such as random forests, association rules, and penalised high-dimensional regression to structured data. Introducing a topology in the input space can also generate structural knowledge. For probabilistic modelling, we will identify ways to infuse structured knowledge into statistical models, by means of, for example, implicit loss embeddings for structured prediction in Hilbert spaces and appropriate hierarchical graphical models, such as Bayesian and Markov networks, as well as their combinations with GNNs. Knowledge-based reduction of the input dimensionality will be a further theme.
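
As a concrete illustration of a symmetry-driven inductive bias, the following sketch (a minimal layer with mean aggregation that we assume for illustration, not a specific architecture from the literature) implements one message-passing GNN layer and verifies its permutation equivariance: relabelling the nodes of the graph permutes the output features in exactly the same way.

```python
import numpy as np

rng = np.random.default_rng(0)

def gnn_layer(A, X, W_self, W_neigh):
    """One message-passing layer: combine each node's state with the
    mean of its neighbours' states (a permutation-invariant aggregation)."""
    deg = A.sum(axis=1, keepdims=True).clip(min=1)  # avoid division by zero
    neigh_mean = (A @ X) / deg
    return np.tanh(X @ W_self + neigh_mean @ W_neigh)

n, d = 5, 4
A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1); A = A + A.T                      # undirected graph, no self-loops
X = rng.normal(size=(n, d))                         # node features
W_self, W_neigh = rng.normal(size=(d, d)), rng.normal(size=(d, d))

# Equivariance check: permuting the nodes permutes the output identically.
P = np.eye(n)[rng.permutation(n)]
out = gnn_layer(A, X, W_self, W_neigh)
out_perm = gnn_layer(P @ A @ P.T, P @ X, W_self, W_neigh)
assert np.allclose(P @ out, out_perm)
```

The equivariance here follows directly from aggregating neighbour features with a permutation-invariant operation (the mean); choosing such operations from first principles of symmetry is precisely the kind of knowledge-driven bias discussed above.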

Combining these approaches with other existing methods (Inductive Logic Programming, Markov Logic Networks, ontology-based embeddings, etc.), Integreat will develop new methods to intertwine ML with logical inference on deterministic and probabilistic knowledge in logical form. We will formulate new frameworks that utilise domain knowledge and obey known constraints, such as must-link and cannot-link relationships, which are currently not sufficiently taken into account. We will design novel translation algorithms from ML architectures, such as GNNs, to logical axioms and back. These algorithms may be used to verify whether an ML model satisfies the axioms imposed by a logical ontology, as well as to natively incorporate the ontology by amending the model to satisfy the axioms. We will also develop ontology learning approaches that learn logical axioms from data and models. Furthermore, we will design ML models that learn to perceive logical knowledge from data via abductive learning, which requires maximising the consistency between hypotheses and training data given background knowledge. For this purpose, new forms of dependency measures and optimisation methods are needed, e.g., building upon our matrix-based information-theoretic dependency measures; a minimal sketch is given below. To facilitate transparency and generalisability, a new framework will be developed that enables the use of differentiable inductive logic programming for the proposed knowledge-informed methods, condensing the learned policies into logic statements. Incorporating knowledge will accelerate learning and ensure explainability of the resulting models.
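
As a sketch of the kind of dependency measure involved (a minimal illustration in the style of matrix-based Rényi entropy functionals, assuming Gaussian kernels and α = 2; not the project's actual code), entropies can be estimated from the eigenvalues of unit-trace kernel Gram matrices, and a mutual-information-like dependence score obtained without any density estimation:

```python
import numpy as np

def gram(x, sigma=1.0):
    """Gaussian-kernel Gram matrix, normalised to unit trace."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2 * sigma ** 2))
    return K / np.trace(K)

def renyi_entropy(A, alpha=2.0):
    """Matrix-based Renyi alpha-entropy from the eigenvalues of A."""
    lam = np.clip(np.linalg.eigvalsh(A), 0, None)   # clip numerical negatives
    return np.log2(np.sum(lam ** alpha)) / (1 - alpha)

def dependence(x, y, alpha=2.0):
    """I(x; y) = S(A) + S(B) - S(A, B), with the joint entropy computed
    from the normalised Hadamard product of the two Gram matrices."""
    A, B = gram(x), gram(y)
    AB = A * B
    AB = AB / np.trace(AB)
    return renyi_entropy(A, alpha) + renyi_entropy(B, alpha) - renyi_entropy(AB, alpha)

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
print(dependence(x, x + 0.1 * rng.normal(size=(200, 1))))  # strongly dependent: large
print(dependence(x, rng.normal(size=(200, 1))))            # independent: near zero
```

Scores of this type are differentiable in the underlying Gram matrices, so in principle they can serve directly as training objectives when maximising the consistency between hypotheses and data.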


