A Review of Relational Machine Learning for Knowledge Graphs

Abstract

Relational machine learning studies methods for the statistical analysis of relational, or graph-structured, data. In this report, we provide a review of how such statistical models can be “trained” on large knowledge graphs, and then used to predict new facts about the world (which is equivalent to predicting new edges in the graph). In particular, we discuss two different kinds of statistical relational models, both of which can scale to massive datasets.

The first is based on tensor factorization methods and related latent variable models. The second is based on mining observable patterns in the graph. We also show how to combine these latent and observable models to get improved modeling power at decreased computational cost. Finally, we discuss how such statistical models of graphs can be combined with text-based information extraction methods for automatically constructing knowledge graphs from the Web. In particular, we discuss Google’s Knowledge Vault project.

Introduction

Traditional machine learning algorithms take as input a feature vector, which represents an object in terms of numeric or categorical attributes. The main learning task is to learn a mapping from this feature vector to an output prediction of some form, such as class labels, a regression score, or an unsupervised cluster id or latent vector (embedding). In Statistical Relational Learning (SRL), the representation of an object can contain its relationships to other objects. Thus the data is in the form of a graph, consisting of nodes (entities) and labelled edges (relationships between entities). The main goals of SRL include prediction of missing edges, prediction of properties of nodes, and clustering of nodes based on their connectivity patterns. These tasks arise in many settings, such as the analysis of social networks and biological pathways. We refer the reader to the SRL literature for further information.

In this article, we review a variety of techniques from the SRL community and explain how they can be applied to large-scale knowledge graphs (KGs), i.e., graph-structured knowledge bases (KBs) that store factual information in the form of relationships between entities. Recently, a large number of knowledge graphs have been created, including YAGO, DBpedia, NELL, Freebase, and the Google Knowledge Graph. These graphs contain millions of nodes and billions of edges. This motivates our focus on scalable SRL techniques, which take time that is (at most) linear in the size of the graph. We can apply SRL methods to existing KGs to learn a model that can predict new facts (edges) given existing facts. We can then combine this approach with information extraction methods that extract “noisy” facts from the Web. For example, suppose an information extraction method returns a fact claiming that Barack Obama was born in Kenya, and suppose (for illustration purposes) that the true place of birth of Obama was not already stored in the knowledge graph. An SRL model can use related facts about Obama (such as his profession being US President) to infer that this new fact is unlikely to be true and should be discarded.
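
To make this concrete, here is a minimal Python sketch of how such a plausibility check might work: a prior probability from a trained SRL model is fused with the extractor's confidence for a candidate triple. The fusion rule, the probabilities, and the function names are illustrative assumptions, not the method of any specific system such as the Knowledge Vault.

```python
import math

def fuse(prior, extractor_conf):
    """Combine an SRL prior with an extractor confidence.

    Both inputs are probabilities in (0, 1). They are combined in log-odds
    space, a simple fusion rule chosen only for illustration.
    """
    logit = lambda p: math.log(p / (1.0 - p))
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    return sigmoid(logit(prior) + logit(extractor_conf))

# Hypothetical example: the extractor is fairly confident in the triple
# (BarackObama, bornIn, Kenya), but the SRL model assigns it a low prior
# because related facts (e.g., profession = US President) make it unlikely.
prior = 0.02            # assumed prior from a trained relational model
extractor_conf = 0.70   # assumed confidence of the text extractor
print(fuse(prior, extractor_conf))  # ~0.045 -> the candidate fact is likely discarded
```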

Objectives

This project aims to explore the potential of knowledge graphs in supporting adaptive learning in the area of databases. The specific objectives are described as follows:

  • Develop a modeling technique to represent knowledge graphs based on domain knowledge.
  • Develop a data mining technique to identify relevant knowledge and discover relationships between different pieces of knowledge.
  • Investigate how knowledge graphs modeled by human experts relate to knowledge graphs discovered from data mining/machine learning techniques.

Contributions

My work brings these two lines of knowledge graph construction together and makes the following contributions:

  • Construct a knowledge graph from human-selected tags.
  • Crawl tree-structured web pages to capture the relations between the facts represented by each page.
  • Store and represent the knowledge graph with a modeling and visualization tool.

Knowledge Graphs

In this section, we introduce knowledge graphs, and discuss how they are represented, constructed, and used.

Knowledge representation

Knowledge graphs model information in the form of entities and relationships between them. This kind of relational knowledge representation has a long history in logic and artificial intelligence, for example, in semantic networks and frames. More recently, it has been used in the Semantic Web community with the purpose of creating a “web of data” that is readable by machines. While this vision of the Semantic Web remains to be fully realized, parts of it have been achieved. In particular, the concept of linked data has gained traction, as it facilitates publishing and interlinking data on the Web in relational form using the W3C Resource Description Framework (RDF). In this article, we will loosely follow the RDF standard and represent facts in the form of binary relationships, in particular (subject, predicate, object) (SPO) triples, where subject and object are entities and predicate is the relation between them. The existence of a particular SPO triple indicates an existing fact, i.e., that the respective entities are in a relationship of the given type.

We can combine all the SPO triples together to form a multigraph, where nodes represent entities (all subjects and objects), and directed edges represent relationships. The direction of an edge indicates whether entities occur as subjects or objects, i.e., an edge points from the subject to the object. Different relations are represented via different types of edges (also called edge labels). Missing triples can be interpreted in two ways. Under the closed world assumption (CWA), a non-existing triple indicates a false relationship; for example, the fact that there is no starredIn edge from Leonard Nimoy to Star Wars is interpreted to mean that Nimoy definitely did not star in this movie. Under the open world assumption (OWA), a non-existing triple is interpreted as unknown, i.e., the corresponding relationship can be either true or false. Continuing with the above example, the missing edge is not interpreted to mean that Nimoy did not star in Star Wars. This more cautious approach is justified, since KGs are known to be very incomplete. For example, sometimes just the main actors in a movie are listed, not the complete cast. As another example, note that even the place of birth attribute, which you might think would be typically known, is missing for 71% of all people included in Freebase. RDF and the Semantic Web make the open world assumption. The local closed world assumption (LCWA) is also often used for training relational models.
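
As a minimal illustration, the following sketch (assuming the networkx library and a handful of hand-picked triples) builds such a multigraph from SPO triples and shows how a missing triple is read under the CWA versus the OWA.

```python
import networkx as nx

# A few illustrative SPO triples (assumed for the example).
triples = [
    ("LeonardNimoy", "starredIn", "StarTrek"),
    ("LeonardNimoy", "played", "Spock"),
    ("Spock", "characterIn", "StarTrek"),
    ("AlecGuinness", "starredIn", "StarWars"),
    ("ObiWanKenobi", "characterIn", "StarWars"),
]

# Build a directed multigraph: nodes are entities, labelled edges are relations.
kg = nx.MultiDiGraph()
for s, p, o in triples:
    kg.add_edge(s, o, key=p)

def triple_exists(s, p, o):
    """True if the triple (s, p, o) is stored in the graph."""
    return kg.has_edge(s, o, key=p)

print(triple_exists("LeonardNimoy", "starredIn", "StarTrek"))  # True
# The next triple is absent: under the CWA it would be read as definitely
# false, whereas under the OWA it is merely unknown.
print(triple_exists("LeonardNimoy", "starredIn", "StarWars"))  # False
```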

Knowledge base construction

Completeness, accuracy, and data quality are important parameters that determine the usefulness of knowledge bases and are influenced by the way knowledge bases are constructed. We can classify KB construction methods into four main groups:

  • In curated approaches, triples are created manually by a closed group of experts.
  • In collaborative approaches, triples are created manually by an open group of volunteers.
  • In automated semi-structured approaches, triples are extracted automatically from semi-structured text via hand-crafted rules, learned rules, or regular expressions (a minimal sketch follows this list).
  • In automated unstructured approaches, triples are extracted automatically from unstructured text via machine learning and natural language processing techniques.
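
The following is a minimal sketch of the automated semi-structured approach, using a made-up infobox snippet, hypothetical field-to-predicate rules, and regular expressions; real extractors are considerably more involved.

```python
import re

# Hypothetical infobox-style snippet; field names and values are invented
# purely to illustrate rule-based extraction from semi-structured text.
infobox = """
| name        = Leonard Nimoy
| birth_place = Boston, Massachusetts
| occupation  = Actor
"""

# Hand-crafted rules: map an infobox field to a KG predicate and extract
# its value with a regular expression.
rules = {
    "birth_place": "bornIn",
    "occupation": "hasOccupation",
}

subject = "LeonardNimoy"
triples = []
for field, predicate in rules.items():
    match = re.search(rf"\|\s*{field}\s*=\s*(.+)", infobox)
    if match:
        triples.append((subject, predicate, match.group(1).strip()))

print(triples)
# [('LeonardNimoy', 'bornIn', 'Boston, Massachusetts'),
#  ('LeonardNimoy', 'hasOccupation', 'Actor')]
```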

Construction of curated knowledge bases typically leads to highly accurate results, but this technique does not scale well due to its dependence on human experts. Collaborative knowledge base construction, which was used to build Wikipedia and Freebase, scales better but still has some limitations. For instance, as mentioned previously, the place of birth attribute is missing for 71% of all people included in Freebase, even though this is a mandatory property of the schema. Also, a recent study found that the growth of Wikipedia has been slowing down. Consequently, automatic knowledge base construction methods have been gaining more attention. Such methods can be grouped into two main approaches. The first approach exploits semi-structured data, such as Wikipedia infoboxes, which has led to large, highly accurate knowledge graphs such as YAGO and DBpedia. The accuracy (trustworthiness) of facts in such automatically created KGs is often still very high. For instance, the accuracy of YAGO2 has been estimated to be over 95% through manual inspection of sample facts, and the accuracy of Freebase was estimated to be 99%. However, semi-structured text still covers only a small fraction of the information stored on the Web, and completeness (or coverage) is another important aspect of KGs. Hence the second approach tries to “read the Web”, extracting facts from the natural language text of Web pages. Example projects in this category include NELL and the Knowledge Vault.

System Configuration

H/W System Configuration:

  • Processor: Pentium IV
  • Speed: 1 GHz
  • RAM: 512 MB (minimum)
  • Hard Disk: 20 GB
  • Keyboard: Standard keyboard
  • Mouse: Two- or three-button mouse
  • Monitor: LCD/LED monitor

S/W System Configuration:

  • Operating System: Windows XP/7
  • Programming Language: Java/J2EE
  • Software Version: JDK 1.7 or above
  • Database: MySQL

Uses of knowledge graphs

Knowledge graphs provide semantically structured information that is interpretable by computers — a property that is regarded as an important ingredient to build more intelligent machines. Consequently, knowledge graphs are already powering multiple “Big Data” applications in a variety of commercial and scientific domains. A prime example is the integration of Google’s Knowledge Graph, which currently stores 18 billion facts about 570 million entities, into the results of Google’s search engine. The Google Knowledge Graph is used to identify and disambiguate entities in text, to enrich search results with semantically structured summaries, and to provide links to related entities in exploratory search.

Enhancing search results with semantic information from knowledge graphs can be seen as an important step to transform text-based search engines into semantically aware question answering services. Another prominent example demonstrating the value of knowledge graphs is IBM’s question answering system Watson, which was able to beat human experts in the game of Jeopardy!. Among others, this system used YAGO, DBpedia, and Freebase as its sources of information. A further important use case is link prediction, i.e., predicting the existence of missing edges in the graph; in the context of knowledge graphs, link prediction is also referred to as knowledge graph completion. For example, suppose the characterIn edge from Obi-Wan Kenobi to Star Wars were missing; we might be able to predict this missing edge, based on the structural similarity between this part of the graph and the part involving Spock and Star Trek.
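
To illustrate how a latent variable model can score candidate edges for knowledge graph completion, here is a small sketch of a DistMult-style bilinear scoring function. The entities, the embedding dimension, and the random embeddings are placeholders; in practice the embeddings would be learned from the observed triples.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16  # embedding dimension (assumed)

entities = ["ObiWanKenobi", "Spock", "StarWars", "StarTrek"]
relations = ["characterIn"]

# In a real system these embeddings are learned by minimizing a ranking or
# cross-entropy loss over observed triples; here they are random placeholders.
E = {e: rng.normal(size=dim) for e in entities}
R = {r: rng.normal(size=dim) for r in relations}

def score(s, p, o):
    """DistMult score: sum_k e_s[k] * w_p[k] * e_o[k]."""
    return float(np.sum(E[s] * R[p] * E[o]))

def prob(s, p, o):
    """Squash the score into a pseudo-probability with a sigmoid."""
    return 1.0 / (1.0 + np.exp(-score(s, p, o)))

# Rank candidate objects for the query (ObiWanKenobi, characterIn, ?).
candidates = ["StarWars", "StarTrek"]
ranked = sorted(candidates,
                key=lambda o: score("ObiWanKenobi", "characterIn", o),
                reverse=True)
print(ranked)
```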

It has been shown that relational models that take the relationships of entities into account can significantly outperform non-relational machine learning methods for this task. In schema-based automated knowledge base construction, entity resolution can be used to match extracted surface names to entities stored in the knowledge graph. Link-based clustering extends feature-based clustering to a relational learning setting: entities are grouped not only by the similarity of their features but also by the similarity of their links. As in entity resolution, the similarity of entities can propagate through the knowledge graph, so that relational modeling can add important information for this task. In social network analysis, link-based clustering is also known as community detection.
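
A toy sketch of link-based clustering follows (hand-picked edges and scikit-learn's KMeans, both chosen only for illustration): each entity is represented by the set of entities it is linked to, so entities with similar link patterns fall into the same cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy graph: who links to whom (edges assumed for illustration).
entities = ["Nimoy", "Spock", "StarTrek", "Guinness", "ObiWan", "StarWars"]
edges = [
    ("Nimoy", "StarTrek"), ("Nimoy", "Spock"), ("Spock", "StarTrek"),
    ("Guinness", "StarWars"), ("Guinness", "ObiWan"), ("ObiWan", "StarWars"),
]

# Link-based features: row i marks which entities entity i is connected to.
index = {e: i for i, e in enumerate(entities)}
A = np.zeros((len(entities), len(entities)))
for u, v in edges:
    A[index[u], index[v]] = 1.0
    A[index[v], index[u]] = 1.0

# Cluster entities by the similarity of their links (2 communities assumed).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(A)
print(dict(zip(entities, labels)))
```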

Conclusion and Future Directions

In this report, we have looked at a possible symbiosis of machine learning and Semantic Web knowledge graphs from two angles:

  1. using machine learning for knowledge graphs, and
  2. using knowledge graphs in machine learning.

In both areas, a large body of work exists, and both are active and vital areas of research. Our recent approaches to creating knowledge graphs that are complementary to those based on Wikipedia, i.e., WebIsALOD and DBkWik, have shown that there is interesting potential in generating such knowledge graphs. Whenever a knowledge graph is created, extended, or mapped to existing ones, machine learning techniques can be used, either with a manually curated training set or by using knowledge already present in the knowledge graph to train models that add new knowledge or validate existing information. With respect to knowledge graph creation, there are still valuable sources on the Web that can be used.

The multitude of Wikis, as utilized by DBkWik, is just one direction; there are other sources, like structured annotations on Web pages or web tables, which can be utilized. Also for Wiki-based extractions, not all information is used to date, with the potential of tables, lists, and enumerations still being underexplored. While knowledge graphs are often developed in isolation, an interesting approach would be to use them as training data to improve each other, allowing cross-fertilization of knowledge graphs. For example, WebIsALOD requires training data for telling instances and categories apart, which could be gathered from DBpedia or YAGO. On the other hand, the latter are often incomplete in their type information, which could be mined using features from WebIsALOD. As discussed above, embedding methods are currently not usable for descriptive machine learning. Closing the gap between embeddings (which produce highly valuable features) and simple, but semantics-preserving, feature generation methods would help develop a new breed of descriptive machine learning methods. To that end, embeddings either need to be semantically enriched a posteriori, or trained in a fashion that allows for creating semantically meaningful embedding spaces.
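
As a rough, entirely hypothetical illustration of such cross-fertilization, one could train a simple classifier on entities whose types are known in one knowledge graph, using features derived from another (e.g., hypernym statements), and apply it to entities whose types are missing. The feature names, labels, and the plain logistic regression below are all assumptions made for the sake of the example, not any specific published approach.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical binary features per entity, e.g. derived from hypernym
# statements in another knowledge graph ("is a city", "is a person", ...).
X_train = np.array([
    [1, 0, 0],  # entity with hypernym "city"
    [1, 0, 1],
    [0, 1, 0],  # entity with hypernym "person"
    [0, 1, 1],
])
y_train = ["City", "City", "Person", "Person"]  # known types in the target KG

clf = LogisticRegression().fit(X_train, y_train)

# Entities in the target KG that lack a type; predict one from the features.
X_missing = np.array([[1, 0, 1], [0, 1, 0]])
print(clf.predict(X_missing))  # e.g. ['City' 'Person']
```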