DegreEmbed: Incorporating entity embedding into logic rule learning for knowledge graph reasoning

Li, Haotian; Liu, Hongri; Wang, Yao; Xin, Guodong; Wei, Yuliang

doi:10.3233/SW-233413

DegreEmbed: Incorporating entity embedding into logic rule learning for knowledge graph reasoning

Article type: Research Article

Authors: Li, Haotian | Liu, Hongri | Wang, Yao | Xin, Guodong | Wei, Yuliang^{; *}

Affiliations: School of Computer Science and Technology, Harbin Institute of Technology at Weihai, China

Correspondence: [*] Corresponding author. E-mail: wei.yl@hit.edu.cn.

Keywords: Knowledge graph reasoning, link prediction, logic rule mining, degree embedding, interpretability of model

DOI: 10.3233/SW-233413

Journal: Semantic Web, vol. 14, no. 6, pp. 1099-1119, 2023

Published: 13 December 2023

Get PDF

Abstract

Knowledge graphs (KGs), as structured representations of real world facts, are intelligent databases incorporating human knowledge that can help machine imitate the way of human problem solving. However, KGs are usually huge and there are inevitably missing facts in KGs, thus undermining applications such as question answering and recommender systems that are based on knowledge graph reasoning. Link prediction for knowledge graphs is the task aiming to complete missing facts by reasoning based on the existing knowledge. Two main streams of research are widely studied: one learns low-dimensional embeddings for entities and relations that can explore latent patterns, and the other gains good interpretability by mining logical rules. Unfortunately, the heterogeneity of modern KGs that involve entities and relations of various types is not well considered in the previous studies. In this paper, we propose DegreEmbed, a model that combines embedding-based learning and logic rule mining for inferring on KGs. Specifically, we study the problem of predicting missing links in heterogeneous KGs from the perspective of the degree of nodes. Experimentally, we demonstrate that our DegreEmbed model outperforms the state-of-the-art methods on real world datasets and the rules mined by our model are of high quality and interpretability.

1.Introduction

Recent years have witnessed the growing attraction of knowledge graphs in a variety of applications, such as dialogue systems [19,23], search engines [40] and domain-specific softwares [22,30]. Capable of incorporating large-scale human knowledge, KGs provide graph-structured representation of data that can be comprehended and examined by humans. Knowledge in KGs is stored in triple form (es,r,eo), with es and eo denoting subject and object entities and r a binary relation (a.k.a. predicate). For example, the fact that Mike is the nephew of Pete can be formed as (Mike,nephewOf,Pete). However, information incompleteness can be seen in most modern KGs, that is, missing links in the graph, e.g., the work of [10,17] shows that there are more than 66% of the person entities missing a birthplace in two open KGs Freebase [3] and DBpedia [1].

Predicting missing triples based on the existing facts is usually called link prediction as a subtask of knowledge graph completion (KGC) [14], and numerous models have been developed for solving such problems. One prominent direction in this line of research is representation learning methods that learn distributed embedding vectors for entities and relations, such as TransE [4] and ComplEx [34]. In this work, they are referred to as embedding-based methods. This kind of models are capable of capturing latent knowledge through low-dimensional vectors, e.g. we can classify male and female entities in a family KG by clustering their points at the semantic space. In spite of achieving high performance, these models suffer from non-transparency and can poorly be understood by humans, which is a common issue for most deep learning models. In addition, most embedding-based methods work in a transductive setting, where they require the entities in training and test data to overlap, hindering the way to generalize in some real-world situations.

Fig. 1.

An example of KG containing heterogeneous entities and relations: paper, person and institution. Entities in different colors mark their type. The existing links in the KG are presented as solid black lines, the missing one as dashed lines in red and the proper rule for inferring the link as blue lines.

Another popular approach is rule mining that discovers logical rules through mining co-occurrences of frequent patterns in KGs [5,11]. This paper studies the problem of learning first-order logical Horn clauses for knowledge graph reasoning (KGR). As illustrated in Fig. 1, there is a missing link (i.e. nephewOf) between the subject Mike and the object Steve, but we can complete the fact through a logic rule nephewOf(Mike, Pete) ∧ brotherOf(Pete, Steve) ⇒ nephewOf(Mike, Steve), meaning that if Mike is the nephew of Pete and Steve has a brother Pete, then we can infer that Mike is the nephew of Steve. Reasoning on KGs through Horn clauses has been previously studied in the area of Inductive Logic Programming [24]. One representative method, Neural LP [42], is the first fully differentiable neural system that successfully combines learning discrete rule structures as well as confidence scores in continuous space. Although learning logical rules equips a model with strong interpretability and the ability to generalize to similar tasks [28,45], these methods often focus only on the relations of which the rules are made up, while the intrinsic properties of the involved entities are not considered. For example, in the KG shown in Fig. 1, it is definitely wrong to infer by a rule containing a female-type relation path like sisterOf starting from Mike, because Mike is the nephew of Pete, which indirectly tells us he is a male. This sort of deficiency is more severe in heterogeneous KGs where there are entities and relations of different types mixing up. In these KGs, there might be multiple rules of no use, becoming inevitable noises for reasoning tasks. For instance, the rule read(Mike, paper2) ∧ readBy(paper2, Steve) ⇒ nephewOf(Mike, Steve) is obviously wrong in logic, which might decrease the performance and interpretability of ILP models.

In this paper, in order to bridge the gap between the two lines of research mentioned above, we propose DegreEmbed, a model of logic rule learning that integrates the inner attributes of entities by embedding nodes in the graph from the perspective of their degrees. DegreEmbed is not only interpretable to humans, but also able to mine relational properties of entities. We also evaluate our model on several knowledge graph datasets, and show that we are able to learn more accurately, and meanwhile, gain strong interpretability of mined rules.

Our main contributions are summarized below:

– We propose an original model based on logic rule learning to predict missing links in heterogeneous KGs. Specifically, a new technique for encoding entities, called degree embedding, is designed to capture hidden features through the relation type of edges incident to a node.
– Comparative experiments on knowledge graph completion task with five benchmark datasets prove that our DegreEmbed model outperforms baseline models. Besides, under the evaluation of a metric called Saturation, we show that our method is capable of mining meaningful logic rules from knowledge graphs.
– Visualizing learned entity embeddings, we demonstrate that clear features of entities can be obtained by our model, thus benefitting the prediction in heterogeneous settings. Moreover, we prove the necessity of each component of our model using ablation study.

This paper is structured as follows. We briefly introduce our related work and review preliminary definitions of knowledge graphs respectively in Section 2 and Section 3. Section 4 introduces our proposed DegreEmbed model based on logic rule learning for link prediction in heterogeneous KGs. We present the experimental results in Section 5 and conclude our work by pointing out the future direction.

2.Related work

Our work is first related to previous efforts on relational data mining, based on which, a large body of deep rule induction models have been developed for link prediction. Since our approach achieves a combination of logic rule learning and knowledge graph embedding, we conclude related work in this topic as well.

Relational data mining. The problem of learning relational rules has been traditionally addressed in the field of inductive logic programming (ILP) [24]. These methods often view the task of completing a missing triple as a query q(h,t) where they learn a probability as confidence score for each rule between the query entity and answer entity. Among these studies, Path-Ranking Algorithm (PRA) [18] investigated the framework of random walk inference, where they select a relational path under a set of constraints and perform maximum-likelihood classification. An RNN model was developed by [25] to compose the semantics of relations for arbitrary-length reasoning. Chain-Reasoning proposed by [8], enabling multi-hop reasoning through a neural attention mechanism, reveals logical rules across all relations and entities. Although ILP models are capable of mining interpretable rules for humans, these models typically take both positive and negative examples for training and suffer from a potentially large version space, which is a critical shortage since most modern KGs are huge and contain only positive instances.

Neural logic programming. In recent years, models borrowing the idea of logic rule learning in a deep manner have emerged as successful approaches for link prediction task. Extending the idea of TensorLog that tackles the problem of rule-based logic reasoning through sparse matrix multiplication, Neural LP [42] is the first end-to-end differentiable approach to simultaneously learn continuous parameters and discrete structure of rules. Some recent methods [29,38,43] have improved the framework done by Neural LP [42] in different manners. DRUM [29] introduces tensor approximation for optimization and reformulate Neural LP to support rules of varied lengths. Neural-Num-LP [38] extends Neural LP to learn numerical relations like age and weight with dynamic programming and cumulative sum operation. NLIL [43] proposes a multi-hop reasoning framework for general ILP problem through a divide-and-conquer strategy as well as decomposing the search space into three subspaces. However, the existing methods ignore the effects caused by entities of different types while reasoning over a specific relational path, thus witness a more obvious failure where heterogeneous entities and relations are involved in the KGs.

Representation learning. Capturing their semantic information by learning low-dimensional embeddings of entities and relations, also known as knowledge graph embedding, is a vital research issue in KGC, and we term those models as embedding-based models. Newly proposed methods, including RotatE [32], ConvE [9] and TuckER [2], predict missing links by learning embedding vectors from various perspectives of the problem. Specifically, the work of RotatE [32] focuses on inferring patterns such as symmetry and inversion, where they proposed a rotational model that rotates the relation from the subject to the object in the complex space as eo=es∘r where the ∘ denotes the element-wise Hadamard product. ConvE introduces a highly parameter efficient model, which uses 2D convolution over embeddings and multiple layers of nonlinear features to express semantic information. TuckER, inspired by the Tucker decomposition [35] that factorizes a tensor into a core tensor along with a set of matrices, is a linear model for link prediction that has good expressive power. Unfortunately, the biggest problem is that these sort of methods can hardly be comprehended by humans, but we relate to these methods for their ability to capture latent information of entities and relations through embedding.

We also notice that there are methods trying to establish a connection between learning logical rules and learning embedding vectors [37,44], where they augment the dataset by exploring new triplets from the existing ones in the KG using pre-defined logical rules to deal with the sparsity problem, which differs from our goal to consider entities of various types for learning in heterogeneous KGs.

3.Preliminaries

In this section, we introduce background concepts of logic rule learning for knowledge graph reasoning as well as the definition of topological structure of KGs.

3.1.Knowledge graph reasoning

Knowledge graph can be modeled as a collection of factual triples G={(es,r,eo)|es,eo∈E,r∈R}, with E, R representing the set of entities and binary relations (a.k.a. predicates) respectively in the knowledge graph, and (es,r,eo) the triple (subject,relation,object) in form of es→reo. During reasoning over KGs, each triple is usually presented in the form r(es,eo). The subgraph regarding a specific relation ri is described as a subset of G containing all triples with ri being the connection between the subject and object: G(ri)={(es,r,eo)|es,eo∈E,ri∈R,r=ri}.

Logic rule reasoning. We perform reasoning on KGs by learning a confidence score α∈[0,1] for a first-order logic rule in the form

(1)r1(x,z1)∧⋯∧rl(zl−1,y)⇒q(x,y):α,

r(x,y)⇒q(x,y) for short, with r1,…,rl,q∈R, z1,…,zl−1∈E, where r=⋀iri, is called a rule pattern. For example, the rule brotherOf(x,z)∧fatherOf(z,y)⇒uncleOf(x,y) intuitively states that if x is the brother of z and z is the father of y, then we can conclude that x is the uncle of y. All rule patterns of length l (l⩾2) can be formally defined as a set of relational tuples Hl={(r1,r2,…,rl)|ri∈R,1⩽i⩽l}=Rl, and the set of patterns no longer than L is denoted as HL=⋃l=2LHl. A rule path p is an instance of a certain pattern r via different sequences of entities, which can be denoted as p⊳r, e.g., (ra(x,z1),rb(z1,y)) and (ra(x,z2),rb(z2,y)) are different paths of the same pattern.

The link prediction task here is considered to contain a variety of queries, each of which is composed of a query body q∈R, an entity head h which the query is about, and an entity tail t that is the answer to the query such that (h,q,t)∈G. Finally we want to find the most possible logic rules h→r1⋯→r1t to predict the link q. Thus, given maximum length L, we assign a single confidence score (i.e. probability) to a set of rule paths p’s adhering to the same pattern r that connects h and t:

(2){pi(h,t)⇒q(h,t)|pi⊳r,r∈HL}:α

During inference, given an entity h, the unified score of the answer t can be computed by adding up the confidence scores of all rule paths that infer q(h,t), and the model will produce a ranked list of entities where higher the score implies higher the ranking.

3.2.Graph structure

Definition 1

Definition 1(Directed Labeled Multigraph).

A directed labeled multigraph G is a tuple G=(V,E), where V denotes the set of vertices, and E⊆V×V is a multiset of directed, labeled vertex pairs (i.e. edges) in the graph G.

Because of its graph structure, a knowledge graph can be regarded as a directed labeled multigraph [31]. In this paper, “graph” is used to refer to “directed labeled multigraph” for the sake of simplicity. G(r)=(V(r),E(r)) is the corresponding topological structure of G(r). m=|V| and n=|E| stand for the number of vertices and number of edges respectively for a graph G. Particularly in a KG, |E|=m and the total number of triplets (es,r,eo) equals the number of edges, i.e. |G|=n.

Formally, in a graph G=(V,E), the degree of a vertex v∈V is the number of edges incident to it. When it comes to directed graphs, in-degree and out-degree of a vertex v is usually distinguished, which are defined as

(3)deg+(v)=|{(u,v)|u∈V,(u,v)∈E}|(4)deg−(v)=|{(v,u)|u∈V,(v,u)∈E}|

But in this paper, we use “degree” to represent the edges incident to a specific node v for conciseness.

Fig. 2.

A KG example of family members and the relations between them.

Definition 2

Definition 2(Heterogeneous Graph [39]).

A graph G=(V,E) is heterogeneous when it consists a mapping function of node type ϕ:V→A and a mapping function of edge type ψ:E→R. A and R denote the sets of entity types and edge types (relations) in the corresponding KG.

As shown in Fig. 2, nodes of different types are marked in different colors, and edges are categorized by their relational labels.

4.Methodology

Capable of simultaneously learning representations and logical rules, Neural LP [42] is the first differentiable neural system for knowledge graph reasoning that combines structure learning and parameter learning. Our work follows the work of Neural LP and extensive studies based on it to consider the problem of reasoning in heterogeneous KGs, from the view of mining intrinsic properties of the entities in KGs.

4.1.Neural LP for logic reasoning

4.1.1.TensorLog

The work of TensorLog [6,7] successfully simulates the reasoning process using first-order logic rules by performing sparse matrix multiplication, based on which, Neural LP [42] proposed a fully differentiable reasoning system. In the following, we will first introduce the TensorLog operator. In a KG involving a set of entities E and a set of relations R, factual triplets w.r.t. the relation rk are restored in a binary matrix Mrk∈{0,1}|E|×|E|. Mrk, an adjacency matrix, is called a TensorLog operator meaning that (ei,rk,ej) is in the KG if and only if the (i,j)-th entry of Mrk is 1. Let vei∈{0,1}|E| be the one-hot encoded vector of entity ei, then s⊤=vei⊤Mr1Mr2Mr3 is the path features vector [43], where the j-th entry counts the number of unique rule paths following the pattern r1∧r2∧r3 from ei to ej [12].

For example, every KG entity e∈E in Fig. 2 is encoded into a 0–1 vector of length |E|=7. For every relation r∈R and every pair of entities ei,ej∈E, the TensorLog operator relevant to r is define as a sparse matrix Mr with its (i,j)-th element being 1 if (ei,r,ej)∈G. Considering the KG in Fig. 2, for the relation r=auntOf we have

And the rule sisterOf(X,Z)∧fatherOf(Z,Y)⇒auntOf(X,Y) can be simulated by performing the following sparse matrix multiplication:

By setting vx1=[1,0,0,0,0,0,0]⊤ as the one-hot vector of x1 and multiplying by vx1⊤ on the left, we obtain s⊤=vx1⊤·Mr′=[0,0,1,0,0,1,0]. The resultant s⊤ selects the row in Mr′ actually identified by x1. By operating right-hand side multiplication with vz1, we get the number of unique paths following the pattern sisterOf∧fatherOf from x1 to z1: s⊤·vz1=1.

4.1.2.Neural LP

Neural LP [42] inherits the idea of TensorLog. Given a query q(h,t), after L steps of reasoning, the score of the query induced through rule pattern rs of length L is computed as

(5)score(t|q,h,rs)=vh⊤∏l=1LMl·vt,

where Ml is the adjacency matrix of the relation used at the l-th hop.

The operators above are used to learn for query q by calculating the weighted sum of all possible patterns:

(6)∑sαs∏k∈βsMrk,

where s indexes over all potential patterns with maximum length of L, αs is the confidence score associated with the rule rs and βs is the ordered list of relations appearing in rs.

To summarize, we update the score function in Eq. (5) by finding an appropriate α in

(7)φ(t|q,h)=vh⊤∑sαs·(∏k∈βsMrk·vt),

and the optimization objective is

(8)maxαs∑(h,q,t)∈Gφ(t|q,h),

where αs is to be learned.

Whereas the searching space of learnable parameters is exponentially large, i.e. O(|R|L), direct optimization of Eq. (8) may fall in the dilemma of over-parameterization. Besides, it is difficult to apply gradient-based optimization. This is because each variable αs is bound with a specific rule pattern, and it is obviously a discrete work to enumerate rules. To overcome these defects, the parameter of rule rs can be reformulated by distributing the confidence to its inclusive relations at each hop, resulting in a differentiable score function:

(9)ϕL(t|q,h)=(vh⊤∏l=1L∑k=0|R|aklMrk)·vt,

where L is a hyperparameter denoting the maximum length of patterns and |R| is the number of relations in KG. Mr0 is an identity matrix I that enables the model to include all possible rule patterns of length L or smaller [29].

To perform training and prediction over the Neural LP framework, we should first construct a KG from a large subset of all triplets. Then we remove the edge (h,t) from the graph when facing the query (h,q,t), so that the score of t can get rid of the influence imposed from the head entity h directly through the edge (h,t) for the correctness of reasoning.

4.2.Our DegreEmbed model

In this section, we propose our DegreEmbed model based on Neural LP [42] as a combination of models relying on knowledge graph embedding and ILP models where the potential properties of individual entities are considered through a technique we call degree embedding. We discover that the attributes of nodes in a KG can make a difference via observation on their degrees. In Fig. 1, we notice that Mike is a male because he is a nephew of someone, hence it is incorrect indeed to reason by a rule containing a female-type relation starting from Mike. Also, the in-degree (i.e. studyIn) of entity THU proves its identity as a university. Besides, as illustrated in Section 4.1.1, all knowledge of a KG is stored in the relational matrices, which is our aim to reconstruct for harboring type information of entities. For a query q(h,t), the final score is a scalar obtained by Eq. (9), where the path feature vector is s⊤=vh⊤∏l=1L∑k=0|R|aklMrk, and vt selects the t-th element of s⊤ through matrix multiplication. In fact, the vector s⊤∈R|E| is a row of matrix ∏l=1L∑k=0|R|aklMrk, each value of which is the “influence” passed from head entity h to the regarding entity. As a result, we can consider the attributes of the entity ei by changing the i-th row of adjacency matrices from the perspective of the type of degrees of ei.

Fig. 3.

An illustration of computing the DegreEmbed operators for the KG shown in Fig. 2. Superscripts (+) and (−) of the labels of degree embedding vectors denote their in and out direction. All DegreEmbed operators are initialized to zero matrices.

For any entity e∈E, we collect the ones of unique types among all of its in and out degrees separately to form a d-dimensional degree feature vector, where d is the number of unique degrees. Then we project the vector onto a semantic space by looking up in a row-vector embedding matrix E|R|×m, and the result is number of d vectors arranged in a matrix M∈Rd×m, where m is the embedding dimension. The embedding vectors are input into a bidirectional LSTM [13] at different time steps. Finally, we perform attention operation on the hidden state of BiLSTM at the last time step to get the |R|-dimensional attention vector of e for 1⩽i⩽d:

(10)hi,hd−i+1′=BiLSTM(hi−1,hd−i′,M),

where h and h′ are the hidden states of the forward and backward path LSTMs, with the subscripts denoting their time step, and H, the actual embedding vector of entity e, is obtained by concatenating hd and h1′.

To compute the attention value on each relation imposed by entity e, we have

(11)ρe=softmax(WH+b)

Elements in ρe∈R|R| can be viewed as the weights for relations. At last, we replace the elements that are in the row identified by e and equal 1 in each adjacency matrix Mrk by the k-th value of ρe. By following the same procedure for the other entities in the KG, we construct a new set of relational matrices Br1,…,Br|R|, which are called DegreEmbed operators. The score function shown in Eq. (9) is updated accordingly as follows:

(12)ϕL′(t|q,h)=(vh⊤∏l=1L∑k=0|R|aklBrk)·vt,

where the Br1,…,BrR is our new DegreEmbed operators, and Br0 is still the identity matrix. The whole process to compute the operators makes it possible to incorporate the information of entities for rule learning models, where the degree feature vector ρe can be viewed as the identification of the entity e. Remarkably, the DegreEmbed operators can be pre-trained due to its belonging to the inner attribute of a KG, thus resulting in a model that can be easily deployed in similar tasks. An overview of computing the DegreEmbed operators is illustrated in Fig. 3

Finally, the confidence scores are learned over the bidirectional LSTM followed by the attention using Eqs (13) and (14) for the temporal dependency among several consecutive steps. The input in Eq. (13) is query embedding from another lookup table. For 1⩽i⩽L we have

(13)hi,hL−i+1′=BiLSTM(hi−1,hL−i′,input),(14)[ai,1,…,ai,|R|]=fθ([hi‖hL−i′]),

where [ai,1,…,ai,|R|] is the attention vector obtained by performing a linear transformation over concatenated forward and backward hidden states, followed by a softmax operator: fθ(H)=softmax(WH+b).

4.3.Optimization of the model

Loss construction. In general, this task of link prediction is treated as a classification of entities to build the loss. For each query q(h,t) in a KG, we first split the objective function Eq. (12) into two parts: target vector vt and prediction vector

(15)s⊤=vh⊤∏l=1L∑k=0|R|aklBrk,

and then our goal is to minimize the cross-entropy loss between vt and s⊤:

ℓq(h,t)=−∑i=1|E|{vt[i]·log(s[i])+(1−vt[i])·log(1−s[i])},

where i indexes elements in vector vt and s.

Low-rank approximation. It can be shown that the final confidences obtained by expanding ϕL′ are a rank one estimation of the confidence value tensor [29], and low-rank approximation is a popular method for tensor approximation. Hence we follow the work of [29] and rewrite Eq. (12) using rank of T approximation, as shown in Eq. (16).

(16)ΦL(t|q,h)=(vh⊤∑j=1T∏l=1L∑k=0|R|aj,klBrk)·vt.

More concretely, we update Eqs (13) and (14), as is shown in Eqs (17) and (18), by deploying number of T BiLSTMs of the same network structure, each of which can extract features from various dimensions.

(17)hi(j),hL−i+1′(j)=BiLSTMj(hi−1(j),hL−i′(j),input)(18)[ai,1(j),…,ai,|R|(j)]=fθ([hi′(j)‖hL−i′(j)]),

where the superscripts of the hidden states identify their bidirectional LSTM.

5.Experiment

In this section, we report the evaluation results of our model on a knowledge graph completion task, where we compare the effectiveness of our model against the state-of-the-art learning systems for link prediction. Meanwhile, as DegreEmbed takes advantage in the interpretability in contrast to embedding-based methods, we also examine the rules mined by DegreEmbed with the help of the indicator saturation, which assesses the quality of rules from the corresponding topological structure of a KG. We show that the top-scored rules mined by our method coincide with those of high saturation scores, which in turn reflect the interpretability of our model. To this end, we use ablation study to show how different components of our model contribute to its performance.

The knowledge graph completion task we use is a canonical one as described in [42]. When training the model, the query and head are part of incomplete triplets for training, and the goal is to find the most possible entity as the answer tail. For example, if nephewOf(Mike, Steve) is missing from the knowledge graph, our goal is to learn rules for reasoning over the existing KG and retrieve Steve when presented with the query nephewOf(Mike, ?).

5.1.Experiment setting

5.1.1.Datasets

To evaluate our method for learning logic rules in heterogeneous KGs, we select the following datasets for knowledge graph completion task:

– FB15K-237 [33], a more challenging version of FB15K [4] based on Freebase [3], a growing knowledge graph of general facts.
– WN18 [9], a subset of knowledge graph WordNet [20,21] constructed for a widely used dictionary.
– Medical Language System (UMLS) [16], from biomedicine, where the entities are biomedical concepts (e.g. organism, virus) and the relations consist of affects and analyzes, etc.
– Kinship [16], containing kinship relationships among members of a Central Australian native tribe.
– Family [16], containing individuals from multiple families that are biologically related.

Statistics about each dataset used in our experiments are presented in Table 1. All datasets are randomly split into 4 files: facts, train, valid and test, and the ratio is 6:2:1:1. The facts file contains a relatively large proportion of the triplets for constructing the KG. The train file is composed of query examples q(h,t). The valid and test files both contain queries q(h,t), in which the former is used for early stopping and the latter is for testing. Unlike the case of learning embeddings, our method does not necessarily require the entities in train, valid and test to overlap.

Table 1

Statistics of datasets

Dataset	# Relation	# Entity	# Triplets	# Facts	# Train	# Validation	# Test
FB15K-237	237	14541	310116	204087	68028	17535	20466
WN18	18	40943	151442	106088	35353	5000	5000
Family	12	3007	28356	17615	5868	2038	2835
Kinship	25	104	10686	6375	2112	1099	1100
UMLS	46	135	6529	4006	3009	569	633

5.1.2.Evaluation metrics

During training on the task of knowledge graph completion, for each triplet (h,q,t), two queries are designed as (h,q,?) and (?,q,t) with answers t and h for data augmentation. During evaluation, for each query, we manually remove the edge (h,t) from KG for the correctness of reasoning results and the score is computed for each entity, as well as the rank of the correct answer. For the computed ranks from all queries, we report the Mean Reciprocal Rank (MRR) and Hit@k under the filtered protocol [4]. MRR averages the reciprocal ranks of the answer entities and Hit@k computes the percentage of how many desired entities are ranked among top k.

When evaluating the interpretability of rules, we choose a set of indicators called macro, micro and comprehensive saturations that measure the probability of a rule pattern occurring in a certain relational subgraph G(r) from different angles. More specifically, these computational methods analyze the reasoning complexity from the inherent attributes of the graph structure G w.r.t. a KG G.

Definition 3

Definition 3(Macro Reasoning Saturation).

Given a query q∈R and the maximum length L of a rule pattern rl∈HL, the macro reasoning saturation of rl in relation to relation q, i.e. γqrl, is the percentage of triples (hi,q,tj) in subgraph G(q) such that rl(hi,tj)⇒q(hi,tj).

We can compute the macro reasoning saturation γqrl using the following equation:

(19)γqrl=|Url|nq,

with Url being the set Url={(h,q,t)|(h,q,t)∈G(q),rl(h,t)⇒q(h,t)} that collects the factual triplets in G(q) as the reasoning candidates of rule rl, and nq=|G(q)| being the number of edges (i.e. the number of triples) in G(q). We can reasonably say that the larger γqrl grows, the more likely rl can be as a proper inference of the query q. When γqrl equals 1, it means we can reason out every factual triple in G(q) through at least one rule path following the pattern rl.

Definition 4

Definition 4(Micro Reasoning Saturation).

Given the maximum length L of a rule pattern, we define the micro reasoning saturation of pattern rl∈HL as following. Firstly, for a specific triple tri=(h,q,t)∈G, i.e. δtrirl, is the percentage of the paths pli⊳rl such that rl(h,t)⇒q(h,t) as to all paths from h to t.

The equation to compute δtrirl is

(20)δtrirl=|Vrl||VL|

where Vrl={pli|pli⊳rl,rl(h,t)⇒q(h,t)}, VL={pkj|pkj⊳rk,∀rk∈HL,rk(h,t)⇒q(h,t)}. Vrl denotes the set of rule paths derived from the pattern rl that is able to infer the fact (h,q,t), and VL involves all the rules with their lengths no longer that L.

Then, we average δtrirl over all triples (h,q,t)∈G(q) and get the micro reasoning saturation of the pattern rl∈HL for query q:

(21)δqrl=1nq∑tri∈G(q)δtrirl

In Eqs (19) and (21), γqrl and δqrl assess how the probability to infer q following the pattern rl respectively from a macro and a micro perspective. The higher the two indicators are, the easier for models to gain the inference rl(h,t)⇒q(h,t). In order to obtain an overall result, we define the comprehensive reasoning saturation ηqrl by combining the two indicators through multiplication, as revealed in Eq. (22).

(22)ηqrl=γqrl×δqrl

We can imagine that the computation of comprehensive saturation on a certain logical rule rl to infer the relation q involves two procedures: (1) select the triplets (h,q,t) in subgraph G(q) that imply rl(h,t)⇒q(h,t) and (2) for each selected triplets, calculate the percentage of rule paths following the pattern rl within all possible paths that imply q(h,t).

We can take the relation q= auntOf and the rule rl= sisterOf ∧ fatherOf in Fig. 2 as an example to show the computation of saturations. In subgraph G(q), there are totally three triples (presented in red), thus nq=3. For the triple (x2,auntOf,z2), two rule paths can contribute to its inference: wifeOf(x2, z1) ∧ uncleOf(z2, z1) and sisterOf(x2, z3) ∧ fatherOf(z3, z1). In the same way, we can see there are one and two rule paths for (x1,auntOf,z1) and (x1,auntOf,z4) respectively. The rule rl= sisterOf ∧ fatherOf appears as an inference among all these three triples, therefore the macro saturation is γqrl=3/nq=100%. More detailed information can be extracted through computing the micro saturation. The rule rl takes a percentage of 50% among all paths for the triple (x2,auntOf,z1), while 100% and 50% for the other two triples. Thus, the micro saturation of rl for q is δqrl=(0.5+1+0.5)/nq=67%. Finally, we can compute the comprehensive saturation ηqrl=γqrl×δqrl=67%.

Table 2

Saturations of the Family dataset. γqpl, δqpl, ηqpl are macro, micro and comprehensive saturations. The results relating to a specific relation are sorted by the comprehensive saturation in descending order

We show a small subset of saturations computed from Family dataset in Table 2 for joint evaluation with logical rules mined by our model. More results can be obtained in App. C.

5.1.3.Comparison of algorithms

In experiment, the performance of our model is compared with that of the following algorithms:

– Rule learning algorithms. Since our model is based on neural logic programming, we choose Neural LP and a Neural LP-based method DRUM [29]. We also consider a probabilistic model called RNNLogic [27].
– Embedding-based algorithms. We choose several embedding-based algorithms for comparison of the expressive power of our model, including TransE [4], DistMult [41], ComplEx [34], TuckER [2] and RotatE [32].

The implementations of the above models we use are available at the links listed in App. A.

5.1.4.Model configuration

Our model is implemented using PyTorch [26] and the code will be publicly available. We use the same hyperparameter setting during evaluation on all datasets. The query and entity embedding have the dimension 128 and are both randomly initialized. The hidden state dimension of BiLSTM(s) for entity and degree embedding are also 128. As for optimization algorithm, we use mini-batch ADAM [15] with the batch size 128 and the learning rate initially set to 0.001. We also observe that the whole model tends to be more trainable if we use the normalization skill.

Note that Neural LP [42], DRUM [29] and our method all conform to a similar reasoning framework. Hence, to reach a fair comparison, we ensure the same hyperparameter configuration during experiments on these models, where the maximum rule length L is 2 and the rank T is 3 for DRUM and DegreEmbed, because Neural LP is not developed using the low-rank approximation method.

5.2.Results on KGC task

We compare our DegreEmbed to several baseline models on the KGC benchmark datasets as stated in the Section 5.1.1 and Section 5.1.3. Our results on the selected benchmark datasets are summarized in Table 3 and App. B.

Table 3

Knowledge graph completion performance comparison. Hit@k (H@k) is in %

	Family				Kinship				UMLS

	MRR	H@1	H@3	H@10	MRR	H@1	H@3	H@10	MRR	H@1	H@3	H@10
TransE	.34	7	53	86	.26	1	42	76	.57	28	84	96
DistMult	.58	39	71	91	.51	36	57	87	.73	63	81	90
ComplEx	.83	72	94	98	.61	44	71	92	.79	69	87	95
TuckER	.43	28	52	72	.60	46	70	86	.73	63	81	91
RotatE	.90	85	95	99	.65	50	76	93	.73	64	82	94
RNNLogic	.93	91	95	99	.64	50	73	93	.75	63	83	92
Neural LP	.91	86	95	99	.62	48	69	91	.75	62	86	92
DRUM	.94	90	98	99	.58	43	67	90	.80	66	94	97
DegreEmbed	.95	91	99	100	.70	57	79	94	.80	65	94	98

We notice that except that ComplEx [34] produces the best result among all methods on UMLS under the evaluation of Hit@1, all models are outperformed by DegreEmbed with a clear margin in Table 3, especially on the dataset Kinship where we can see about 10% improvement on some metrics. As expected, incorporating entity embedding enhances the expressive power of DegreEmbed and thus benefits to reasoning on heterogeneous KGs.

In Table 7, our model achieves state-of-the-art performance on WN18. It is intriguing that embedding-based methods provide better predictions on FB15K-237 dataset, with rule based methods, including RNNLogic, Neural LP, DRUM and ours, left behind. As pointed in [9], there are inverse relations from the training data present in the test data in FB15K, which is called the problem of test set leakage, resulting in the variant FB15K-237 where inverse relations are removed. No wonder that methods depending on logic rule learning fails on this dataset. Future work on more effective embedding representation of a node and its neighbor edges is likely to significantly advance the performance of link prediction models based on logic rules.

Notably, DegreEmbed not only is capable of producing state-of-the-art results on KGC task thanks to the degree embedding of entities, but also maintains the advantage of logic rule learning that enables our model to be interpretable to humans, which is of vital significance in current research of intelligent systems. We will show the experiment results on the interpretability of our DegreEmbed model later.

5.3.Interpretability of our model

To demonstrate the interpretability of our method, we first report the logical rules mined by our model and compare them with those by DRUM [29]. Then we visualize the embedding vectors learned through the proposed technique degree embedding to prove its expressive power.

5.3.1.Quality of mined rules

Apart from reaching state-of-the-art performance on KGC task which is largely thanks to the mechanism of entity embedding, our DegreEmbed, as a knowledge graph reasoning model based on logic rule mining, is of excellent interpretability as well. Our work follows the Neural LP [42] framework, which successfully combines structure learning and parameter learning to generate rules along with confidence scores.

In this section, we report evaluation results on explanations of our model where some of the rules learned by DegreEmbed and DRUM are shown. As for evaluation metrics, we use the indicator saturations to objectively assess the quality of mined rules in a computable manner. We conduct two separate KGC experiments for generating the logical rules where the only difference is whether the inverted queries are learned. For better visualization purposes, experiments are done on the Family dataset, while other datasets such as UMLS produce similar results.

We sort the rules by their normalized confidence scores, which are computed by dividing by the maximum confidence of rules for each relation, and show top rules mined by our DegreEmbed and DRUM without augmented queries respectively in Table 4 and Table 5. Saturations of rules according to specific relations are shown in Table 2. For more results of saturations, learned rules w.r.t. both pure and augmented queries, please refer to the Appendix.

Table 4

Top rules without reverse queries mined by DegreEmbed on the Family dataset

Table 5

Top rules without reverse queries mined by DRUM on the Family dataset

By referring to the results given by computing saturations, we can see the rules mined by our model solidly agree with the ones with high saturation level. Meanwhile, our model obviously gets rid of the noises rendered by the heterogeneousness of the dataset through blending entity attributes (e.g. gender of entities) into rule learning. The rules mined for predicting the relation brotherOf, such as brotherOf ∧ sisterOf and brotherOf ∧ brotherOf, all show up with a male-type relation at the first hop. However, there are logically incorrect rules mined by DRUM which are highlighted by red in Table 5. We think this is mainly because DRUM does not take entity attributes in to account. In this case, our DegreEmbed model is capable of learning meaningful rules, which indeed proves the interpretability of our model.

Fig. 4.

A t-SNE plot of the entity embedding of our trained model on Family dataset. Node colors denote their classes (i.e. degree feature vectors).

5.3.2.Learned entity embeddings

To explain the learned degree embedding, we visualize the embeddings vectors of some entities from the Family dataset. We use t-SNE [36] to project the embeddings to two-dimensional space and plot them in Fig. 4. In order to obtain the entity embeddings, we first train our DegreEmbed model on Family with the same hyperparameter settings mentioned in Section 5.1.4, and store the entire entity embedding matrix given by Eq. (10). Then, we classify the entities according to their degree feature vector proposed in Section 4.2 and choose top ten most populated clusters marked with various colors to plot in Fig. 4.

Note that, we use a logarithmic scale for the embedding plot to get better visualization results. In fact, the representation of entities exhibits localized clustering in the projected 2D space, which verifies the capability of our model to encode latent features of entities in heterogeneous KGs through their degrees.

5.4.Ablation study

Fig. 5.

Model performance on Family with the original entity embeddings replaced by pre-trained ones from embedding-based methods. Hit@k is in %. The number inside each bar indicates its Hit@k value.

Fig. 6.

Comparison among DegreEmbed variants with different ranks on three benchmark datasets.

To study the necessity of each component of our method, we gradually change the configuration of each component and observe how the model performance varies.

Degree embedding. In this work, degree embedding is proposed as a new technique of entity embedding for incorporating heterogeneous information in KGs. We successively replace this component with learned entity embeddings from five pre-trained embedding-based models listed in Section 5.1.3 on three datasets. We measure the Hit@1, Hit@3 and Hit@10 metrics and show the results on Family in Fig. 5. Results on another two datasets are placed in the Appendix. In summary, the original model using degree embedding to encode entities produces the best results among all variants. We hypothesis that this is due to the fact that many inner attributes of entities are lost in the embeddings of those variants while DegreEmbed can learn to utilize these features implicitly.

Low-rank approximation. Tensor approximation of rank T enables our model to learn latent features from various dimensions, as show Eqs (17) and (18). We conduct experiments on three datasets and show how model behavior differs with rank ranging from 1 to 4 in Fig. 6. Training curves in Fig. 6(a) imply that model may converge faster with lower training loss as rank goes up. However, Fig. 6(b) demonstrates that higher rank does not necessarily bring better test results. We conjecture that this is because the amount of learnable features of distinct dimensions varies from dataset to dataset, where the choice of rank matters a lot. An intriguing insight can be obtained by combining Fig. 6(a) and Fig. 6(b): training loss degrades as model rank increases while it barely contributes to results on test sets, which provides a view of over-fitting.

6.Conclusions

In this paper, a logic rule learning model called DegreEmbed has been proposed for reasoning more effectively in heterogeneous knowledge graphs, where there exist entities and relations of different types. Based on mining logic rules, DegreEmbed simultaneously leverages latent knowledge of entities by learning embedding vectors for them, where the degrees of the entities are closely observed. Experiment results show that our model benefits from the advantages of both embedding-based methods and rule learning systems, as one can see DegreEmbed outperforms the state-of-the-art models with a clear margin, and it produces high-quality rules with great interpretability. In the future, we would like to optimize the way of entity embedding to increase the expressive power of logic rule learning models for knowledge graph reasoning.

Acknowledgements

The work of this paper is supported by the “National Key R&D Program of China” (2020YFB2009502), “the Fundamental Research Funds for the Central Universities” (Grant No. HIT.NSRIF.2020098).

Appendices

Appendix A.

Appendix A.Algorithm URLs

Table 6

Available links to the models used in this work

Algorithm	Link
TransE, DistMult and ComplEx	https://github.com/Accenture/AmpliGraph
TuckER	https://github.com/ibalazevic/TuckER
RotatE	https://github.com/liyirui-git/KnowledgeGraphEmbedding_RotatE
RNNLogic	https://github.com/DeepGraphLearning/RNNLogic
Neural LP	https://github.com/fanyangxyz/Neural-LP
DRUM	https://github.com/alisadeghian/DRUM
DegreEmbed (ours)	https://github.com/lirt1231/DegreEmbed

Appendix B.

Appendix B.Results on FB15K-237 and WN18

Table 7

Knowledge graph completion results on FB15K-237 and WN18. Hit@k is in %

	FB15K-237				WN18

	MRR	Hit@1	Hit@3	Hit@10	MRR	Hit@1	Hit@3	Hit@10
TransE	.15	5	19	25	.36	4	63	81
DistMult	.25	17	28	42	.71	56	83	93
ComplEx	.26	17	29	44	.90	88	92	94
TuckER	.36	27	39	46	.94	93	94	95
RotatE	.34	24	38	53	.95	94	95	96
RNNLogic	.29	21	31	43	.94	93	94	96
Neural LP	.25	19	27	37	.94	93	94	95
DRUM	.25	19	28	38	.54	49	54	66
DegreEmbed	.25	19	27	38	.95	94	95	97

Appendix C.

Appendix C.Extension to Table 2: More saturations of Family

Table 8

Appendix D.

Appendix D.Extension to Table 4: Top rules obtained by our model

Table 9