The knowledge graph as the default data model for learning on heterogeneous knowledge
Xander Wilcke, Peter Bloem, Victor de Boer, Michel Dumontier
Abstract. In modern machine learning, raw data is the preferred input for our models. Where a decade ago data scientists were still engineering features, manually picking out the details we thought salient, they now prefer the data in their raw form. As long as we can assume that all relevant and irrelevant information is present in the input data, we can design deep models that build up intermediate representations to sift out relevant features. However, these models are often domain-specific and tailored to the task at hand, and therefore unsuited for learning on heterogeneous knowledge: information of different types and from different domains. If we can develop methods that operate on this form of knowledge, we can dispense with a great deal more ad-hoc feature engineering and train deep models end-to-end in many more domains. To accomplish this, we first need a data model capable of expressing heterogeneous knowledge naturally in various domains, in as usable a form as possible, and satisfying as many use cases as possible. In this position paper, we argue that the knowledge graph is a suitable candidate for this data model. We further describe current research and discuss some of the promises and challenges of this approach.

1. Introduction

Where a decade ago data scientists were still engineering features from their data, often creating a derivative of the original data in the process, they now prefer to feed their models the data in their raw form: data which still contains all relevant and irrelevant information, rather than having been reduced to features selected or engineered by data scientists. This shift can largely be attributed to the emergence of deep learning, which showed that we can build layered models of intermediate representations to sift out relevant features, and which allows us to dispense with manual feature engineering. For example, in the domain of image analysis, popular feature extractors like SIFT have given way to Convolutional Neural Networks [4, 18], which naturally consume raw images. These are used, for instance, in facial recognition models which build up layers of intermediate representations: from low-level features built on the raw pixels, like local edge detectors, to higher-level features, like specialized detectors for the eyes and the nose, up to the face of a specific person. Similarly, in audio analysis it is common to use models that consume audio data directly, and in Natural Language Processing it is possible to achieve state-of-the-art performance without explicit preprocessing steps such as POS tagging and parsing.

This is one of the strongest benefits of deep learning: we can directly feed the model the dataset as a whole, containing all relevant and irrelevant information, and trust the model to unpack it, to sift through it, and to construct whatever low-level and high-level features are relevant for the task at hand. Not only do we no longer need to choose which features might be relevant to the learning task (making ad-hoc decisions and adding, removing, and reshaping information in the process), we can also let the model surprise us: it may find features in our data that we would never have thought of ourselves. With feature engineering now part of the model itself, it becomes possible to learn directly from the data. This is called end-to-end learning (further explained in the text box below). However, most present end-to-end learning methods are domain-specific: they are tailored to images, to sound, or to language. When faced with heterogeneous knowledge, that is, information of different types and from different domains, we often find ourselves resorting to manual feature engineering.
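To make this concrete, consider a minimal convolutional network that consumes raw pixels directly and builds up intermediate representations layer by layer. The sketch below assumes PyTorch; the layer sizes and the number of classes are illustrative choices, not taken from any particular model in the literature.

```python
# Minimal sketch of end-to-end learning on raw images with a small CNN.
# Assumes PyTorch; layer sizes and class count are illustrative.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Early layers learn low-level features (edges, textures) from raw pixels...
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # ...later layers combine them into higher-level, task-specific features.
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x)                  # intermediate representations
        return self.classifier(h.flatten(1))  # task-specific prediction

# Raw 32x32 RGB images go in directly; no hand-engineered features are needed.
model = SmallCNN()
images = torch.randn(4, 3, 32, 32)            # a toy batch of raw images
logits = model(images)
print(logits.shape)                           # torch.Size([4, 10])
```

Note that such a network presumes a fixed grid of pixels as its input, which is precisely why it does not transfer directly to entities, relations, and attributes of mixed types.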
To avoid this, we require a machine learning model capable of directly consuming heterogeneous knowledge, and a data model capable of expressing such knowledge naturally and with minimal loss of information. In this paper, we argue that the knowledge graph is a suitable data model for this purpose and that, in order to achieve end-to-end learning on heterogeneous knowledge, we should a) adopt the knowledge graph as the default data model for this kind of knowledge, and b) develop end-to-end models that can directly consume these knowledge graphs.

Concretely, we will use the term heterogeneous knowledge to refer to entities (things), their relations, and their attributes. For instance, in a company database, we may find entities such as employees, departments, resources, and clients. Relations express which employees work together, which department each employee works for, and so on. Attributes can be simple strings, such as names and social security numbers, but also richer media like short biographies, photographs, promotional videos, or recorded interviews.

Of course, no data model fits all use cases, and knowledge graphs are no exception. Consider, for instance, a simple image classification task: it would be extremely inefficient to encode the individual pixels of all images as separate entities in a knowledge graph. We can, however, consider encoding the images themselves as entities, with the raw image data as their single attribute (e.g., as hex-encoded binary data). In this case, we would pay little overhead, but we would also gain nothing over the original simple list of images. However, as soon as more information becomes available (like geotags, author names, or camera specifications), it can easily be integrated into this knowledge graph.
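As a sketch of how the company example above could be expressed as a knowledge graph, the snippet below builds a handful of RDF triples with rdflib. The ex: namespace, the property names, and the hex-encoded photo attribute are illustrative assumptions rather than a prescribed schema.

```python
# A small sketch of heterogeneous knowledge as an RDF knowledge graph.
# Assumes rdflib; namespace and property names are illustrative.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/company#")

g = Graph()
g.bind("ex", EX)

alice = EX.alice   # an employee entity
sales = EX.sales   # a department entity

# Entities and their types.
g.add((alice, RDF.type, EX.Employee))
g.add((sales, RDF.type, EX.Department))

# A relation between two entities.
g.add((alice, EX.worksFor, sales))

# Attributes of different kinds: a plain string, a number, and raw binary
# data (a stand-in for a photograph, stored as a hex-encoded literal).
g.add((alice, EX.name, Literal("Alice Example")))
g.add((alice, EX.yearsEmployed, Literal(7, datatype=XSD.integer)))
photo_bytes = b"\x89PNG\r\n\x1a\n"  # placeholder for raw image bytes
g.add((alice, EX.photo, Literal(photo_bytes.hex(), datatype=XSD.hexBinary)))

print(g.serialize(format="turtle"))
```

Integrating additional information, such as geotags or camera specifications, then amounts to adding more triples, without restructuring what is already there.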