Deep learning based searching approach for RDF graphs

The Internet is a remarkably complex technical system. Its rapid growth has also brought technical challenges, notably for information retrieval. Search engines retrieve requested information based on the provided keywords. Consequently, it is difficult to accurately find the required information without understanding the syntax and semantics of the content. Multiple approaches have been proposed to resolve this problem by employing semantic web and linked data techniques. Such approaches serialize the content using the Resource Description Framework (RDF) and execute queries using SPARQL. However, an exact match between the RDF content and the query structure is required. Although this improves on keyword-based search, it does not provide probabilistic reasoning to find the semantic relationship between the queries and their results. From this perspective, in this paper, we propose a deep learning-based approach for searching RDF graphs. The proposed approach treats document requests as a classification problem. First, we preprocess the RDF graphs to convert them into N-Triples format. Second, bag-of-words (BOW) and word2vec feature modeling techniques are combined for a novel deep representation of RDF graphs. The attention mechanism enables the proposed approach to understand the semantics between RDF graphs. Third, we train a convolutional neural network for the accurate retrieval of RDF graphs using the deep representation. We employ 10-fold cross-validation to evaluate the proposed approach. The results show that the proposed approach is accurate and surpasses the state-of-the-art. The average accuracy, precision, recall, and f-measure are up to 97.12%, 98.17%, 95.56%, and 96.85%, respectively.


Introduction
The digital age brings a set of challenges for the Web due to the abundance of information. In today's society, people capture, upload, store, and digitize almost every activity of daily life on the Web. Communication devices can now connect to the Internet independently and contain sensors that spread useful information without users' intervention. Consequently, the data is increasing daily and resulting in information overload. The need to search such data has driven the development of linked data and the Semantic Web. The Semantic Web considers the information flow of machine-processable metadata [1] and can relate data from distributed data sources to make data meaningful. This mash-up of data introduced the phrase Web 3.0. Building links between distributed data sources is essential to Web 3.0 and is achieved using RDF. RDF resources consist of RDF triples where each triple contains a subject s that has property p with value o [2,3]. Consequently, an RDF triple may be viewed as a representation of an atomic fact or a claim [4]. The analogous datasets may be defined as Linked Data [5], which can be summarized as being "simply about using the Web to create typed links between data from different sources". Linked data combine entities from different sources/locations so that they can be crawled as a single data space through their connecting links [5,6]. This idea motivates this study to access the required information from distributed sources and build links that help in searching. RDF triples allow entities to be queried and linked together. The existing studies use RDF and SPARQL to serialize the content and to execute the queries for searching, respectively.
RDF graphs are massive in size and crucial; therefore, it is not easy for an ordinary user to extract information from them. Linked data and SPARQL provide a significant improvement in search methods. However, reading and learning RDF data is subject to complexity criteria (similar triples by RDFS and OWL rules) and usability criteria (the human effort) [7]. For example, SPARQL queries require structural accuracy to extract RDF elements. Such queries do not allow statistical analysis to evaluate the query against the RDF content; e.g., the features of a basket may not be enough as input to identify the online shopping basket. Many approaches have been proposed for achieving this kind of searching from RDF data using linked data and SPARQL [8][9][10][11][12][13][14][15][16][17][18][19]. Notably, such approaches respond to queries with an exact match rather than estimating the similarity within the RDF content, which is the original motivation for the work in this paper. On the other hand, Hadi et al. [20] exploited a machine learning approach to search RDF graphs. Although their approach works on statistical estimation, it does not consider semantic relationships while searching RDF graphs and requires significant improvement.
To this end, a deep learning-based searching approach is proposed for RDF graphs. First, we reuse the history-data of DBpedia documents. Second, we preprocess the extracted RDF graphs using the W3C validation service. Third, we concatenate the BOW and word2vec feature modeling techniques to build a novel attention-based recurrent neural network deep representation of RDF graphs. The attention mechanism enables the proposed approach to understand the semantics between RDF graphs. Fourth, we train a convolutional neural network to predict the retrieval of RDF graphs using the deep representation. Finally, the proposed approach is evaluated using 10-fold cross-validation on the given dataset. The evaluation results show the accuracy of the proposed approach. The average accuracy, precision, recall, and f-measure are up to 97.12%, 98.17%, 95.56%, and 96.85%, respectively.
The main contributions of this study are as follows:
• An approach based on a convolutional neural network is proposed for searching RDF graphs. To the best of our knowledge, we are the first to exploit a deep learning algorithm in retrieval prediction of RDF graphs.
• Evaluation results of the proposed approach on the given dataset show that the proposed convolutional neural network-based approach is accurate and surpasses the state-of-the-art.
The rest of the paper is organized as follows: Section II presents the proposed approach. The evaluation process and results of the proposed approach are described in Section III. Section IV explains the threats. Section V and Section VI present the related work and conclusion, respectively.

Overview
Fig 1 illustrates an overview of the convolutional neural network-based searching approach for RDF graphs. The proposed approach performs RDF graph retrieval prediction as follows:
1. We reuse the history-data of RDF graphs as training data.
2. We apply the W3C validation service to RDF graphs for preprocessing.
3. We concatenate BOW and word2vec feature modeling techniques for a novel deep representation of RDF graphs.
4. A convolutional neural network is trained to anticipate the retrieval of RDF graphs. We pass the deep representation as input to the classifier, which predicts the retrieval of RDF graphs.
The following sections introduce the key steps of the proposed approach.

Illustrating example
We consider an example to demonstrate how the proposed approach anticipates the retrieval of RDF graphs. An excerpt of an RDF graph taken from DBpedia is presented in Fig 2. The details of how the proposed approach handles the illustrating example are given in the following sections.

Problem definition
An RDF graph d is a graph from a set of RDF graphs (D), which can be formalized as
$$d = \{t_1, t_2, \ldots, t_n\}$$
where $t_1, t_2, \ldots, t_n$ represent the n triples in an RDF graph, and each triple consists of a subject (s), a predicate (p), and an object (o). For the illustrating example,
$$d_e = \{t_{e_1}, t_{e_2}, \ldots, t_{e_n}\}$$
where $d_e$ is the complete example RDF graph represented in Fig 2, $t_{e_1}$ is the first triple of the example, $t_{e_2}$ is the second triple of the example, and $t_{e_n}$ is the last triple of the example. The proposed approach treats the search of RDF graphs as a classification problem and predicts whether an RDF graph will be retrieved or not. The retrieval anticipation of a new RDF graph d can be defined as a function f:
$$c = f(d), \quad c \in \{hit, miss\}, \quad d \in D$$
where c, f, d, and D are the predefined classification (hit or miss), the classification function of retrieval anticipation, an RDF graph, and a set of RDF graphs, respectively.

Preprocessing
We preprocess each RDF graph using the W3C Validation Service. We load each collected RDF graph using the Apache Jena API (http://jena.apache.org/) to validate its syntax. Then, the validated RDF graph is loaded into a model that transforms it from its serialization format into N-Triples format. The preprocessing of an RDF graph can be formalized as
$$d' = \{t'_1, t'_2, \ldots, t'_n\}$$
where $t'_1, t'_2, \ldots, t'_n$ are the n preprocessed triples of the RDF graph d.
For the motivating example presented in Section 2.2, the RDF graph $d_e$ after preprocessing can be formalized as
$$d'_e = \{t'_{e_1}, t'_{e_2}, \ldots, t'_{e_n}\}$$
where $t'_{e_1}, t'_{e_2}, \ldots, t'_{e_n}$ represent the n triples (separated with a dot (.)) of the preprocessed RDF graph, an excerpt of which is presented in Fig 3.
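To make the N-Triples form concrete, the following is a minimal sketch of reading preprocessed triples back into (subject, predicate, object) tuples. It is not the Jena-based pipeline described above: the regular expression is a simplified assumption that covers IRIs, blank nodes, and plain, language-tagged, or datatyped literals, and the names `TRIPLE_RE`, `parse_ntriples`, and `predicates` are hypothetical.

```python
import re

# Simplified pattern for one N-Triples statement (assumption: not the full grammar).
TRIPLE_RE = re.compile(
    r'^\s*(<[^>]*>|_:\S+)\s+'   # subject: IRI or blank node
    r'(<[^>]*>)\s+'             # predicate: IRI
    r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'  # object
    r'\s*\.\s*$'                # terminating dot
)


def parse_ntriples(text):
    """Return (subject, predicate, object) tuples, skipping lines that do not parse."""
    triples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no triple
        match = TRIPLE_RE.match(line)
        if match:
            triples.append(match.groups())
    return triples


def predicates(triples):
    """Collect the predicate of every triple, in order."""
    return [p for _, p, _ in triples]
```

Lines that fail the pattern are silently dropped here, which loosely mirrors the step of ignoring syntactically invalid triples; a production pipeline would rely on a real RDF parser instead.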

Deep feature representation
A BOW representation of each RDF graph provides a boolean (0 or 1) array or a term frequency array over all repository terms [21] (shown in Fig 4) and does not incorporate the semantic similarity among terms. Moreover, problems such as high dimensionality and sparse data are observed in the bag-of-n-words feature representation [22]. To this end, a neural network-based feature representation model (word2vec) has been proposed to learn the semantic relationship between terms (predicates in our case) [23]. However, word2vec only considers the semantics of individual terms rather than a sequence of words. Notably, a significant improvement is required to combine the sequence of terms, the syntax of terms, and the semantic relationship among terms. From this perspective, a deep representation of RDF graphs is proposed. Fig 5 illustrates an overview of the deep representation of RDF graphs. Long short-term memory (LSTM) cells [24] are exploited as memory units in the hidden layer to address the vanishing gradient problem [25]. LSTM cells can memorize the sequence of terms in both the forward and backward directions. The construction of the deep representation involves the extraction of a |U|-dimensional representation (BOW) using the predicate repository, the learning of a |V|-dimensional word2vec representation using the |U|-dimensional representation, and the learning of LSTM cells (deep representation) using the |S|-dimensional representation. This process returns the |D|-dimensional representation of the given RDF graph. The |D|-dimensional representation has a sequence network (recurrent neural network) that contains a hidden layer with n hidden units ($h = h_1, h_2, \ldots, h_n$). The recurrent neural network takes the |V|-dimensional representation ($y = y_1, y_2, \ldots, y_n$) as input and returns a |D|-dimensional representation ($z = z_1, z_2, \ldots, z_n$). Every $h_i$ transforms the previous state $s_{i-1}$ and a term $y_i$ into the next state $s_i$ and output word $z_i$.
Every hidden unit repeatedly performs the function f in the recurrent neural network:
$$(s_i, z_i) = f(s_{i-1}, y_i)$$
where each $s_i$ holds the information of the i-th term in h, whereas the output $z_n$ of $h_n$ represents the complete RDF graph. Additionally, an attention mechanism is employed to learn from the predicates of the RDF graph. An attention vector with the weighted summation of all outputs $z_i$ can be formalized as
$$a = \sum_{i=1}^{n} \alpha_i z_i$$
where $\alpha_i$ is the weight of each word $y_i$ that defines the importance of $y_i$ for classification. A bidirectional recurrent neural network learns the representation from the input word sequence in both directions (forward and backward). A complete deep representation of d can be formalized as
$$deep(d) = \overrightarrow{a} + \overleftarrow{a}$$
where + represents the concatenation of vectors.
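The attention-weighted summation and the concatenation of the two directions can be illustrated with a small pure-Python sketch. The text does not state how the weights are computed, so this sketch assumes a softmax over dot-product scores against a hypothetical `query` vector; the function names are likewise illustrative and do not come from the described Keras implementation.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(outputs, query):
    """Weighted sum of the per-term outputs z_i.

    Assumption: the weight of z_i is softmax(z_i . query); the source only says
    that each weight reflects the importance of the term for classification.
    """
    scores = [sum(z_k * q_k for z_k, q_k in zip(z, query)) for z in outputs]
    alphas = softmax(scores)
    dim = len(outputs[0])
    pooled = [sum(a * z[k] for a, z in zip(alphas, outputs)) for k in range(dim)]
    return pooled, alphas


def deep_representation(forward_outputs, backward_outputs, query):
    """Concatenate the attention vectors of the forward and backward passes."""
    forward_vec, _ = attention_pool(forward_outputs, query)
    backward_vec, _ = attention_pool(backward_outputs, query)
    return forward_vec + backward_vec  # list concatenation, not element-wise addition
```

Because the two attention vectors are concatenated rather than added, the resulting representation is twice as wide as a single direction's output.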
The hyperparameter settings of the proposed deep representation model are as follows: 300 LSTM units, a dropout probability of 0.2, a learning rate of 0.001, and a binary cross-entropy loss function with the Adam optimizer. We train the model for 100 epochs. The proposed model is implemented with the Python Keras library [26]. To the best of our knowledge, we are the first to apply deep representation to learn the RDF graph representation. We use the deep representation to train a convolutional neural network for the retrieval anticipation of RDF graphs. Fig 6 illustrates an overview of the deep learning classifier. We select the convolutional neural network for the following reasons: 1) its capability to learn deep semantic relationships among words [27]; and 2) it avoids the gradient problem of recurrent neural networks [28] by applying different filter sizes.

Deep learning classifier
To train the classifier, the deep representation is forwarded to a convolutional neural network that contains 3 convolutional layers with 128 filters, a kernel size of 1, tanh activation, and a binary cross-entropy loss function. Then, the output of the convolutional neural network is passed to a flatten layer [29] that returns a 1-dimensional vector. Finally, the dense layer connects the neurons between layers and the output layer returns the retrieval prediction of RDF graphs.
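The data flow through this classifier (three kernel-size-1 convolutions with tanh, a flatten step, and a single output unit) can be traced with the following pure-Python sketch. It is not the Keras model used in the paper: the weights are random and untrained, the sigmoid output unit is an assumption consistent with a binary cross-entropy loss, and all function names are hypothetical.

```python
import math
import random


def conv1d_k1(seq, weights, bias):
    """Kernel-size-1 convolution with tanh: every time step is mapped independently.

    seq: list of time steps, each a list of input channels.
    weights: one weight vector (over input channels) per filter.
    """
    return [
        [math.tanh(sum(x * w for x, w in zip(step, filt)) + b)
         for filt, b in zip(weights, bias)]
        for step in seq
    ]


def init_layer(rng, n_in, n_out):
    """Random, untrained parameters; for illustrating shapes only."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict_hit_probability(seq, n_filters=128, n_layers=3, seed=0):
    """Forward pass: stacked convolutions, flatten, then one sigmoid output unit."""
    rng = random.Random(seed)
    x = seq
    for _ in range(n_layers):
        weights, bias = init_layer(rng, len(x[0]), n_filters)
        x = conv1d_k1(x, weights, bias)
    flat = [v for step in x for v in step]  # flatten to a 1-dimensional vector
    w_out = [rng.uniform(-0.1, 0.1) for _ in flat]
    logit = sum(v * w for v, w in zip(flat, w_out))
    return 1.0 / (1.0 + math.exp(-logit))  # assumed sigmoid output in (0, 1)
```

With a kernel size of 1, each convolution acts as the same small dense layer applied at every position of the sequence, so the sequence length is preserved until the flatten step.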

Evaluation
This section defines the research questions to evaluate the proposed approach, explains how RDF graphs are collected, presents the metrics and evaluation process of the proposed approach, and discusses the results while answering the research questions.

Research questions
The proposed approach is evaluated by investigating the following research questions:
• RQ1: How accurate is the proposed approach in retrieval prediction of RDF graphs?
• RQ2: Does the proposed classifier outperform other machine/deep learning classifiers in retrieval prediction of RDF graphs?
• RQ3: Does the preprocessing of features influence the retrieval prediction of RDF graphs?
RQ1 examines the accuracy of the proposed approach. To this end, the proposed approach is compared with the state-of-the-art approaches: a graph-based retrieval of RDFs (GRSearch) [30] and a machine learning-based retrieval of RDFs (MLSearch) [20]. We also compare the proposed approach with two baseline algorithms, the random prediction algorithm and the zero-rule algorithm, to double-check the performance of the proposed approach.
RQ2 compares the performance of different machine/deep learning classifiers to reveal whether the proposed approach outperforms other machine/deep learning classifiers in retrieval prediction of RDF graphs.
RQ3 examines the influence of the features' preprocessing. To this end, we compute and compare the performance of the proposed approach with and without preprocessing.
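The two baseline algorithms named for RQ1 can be sketched as follows. The paper does not specify their implementation, so this sketch assumes the common definitions: zero-rule always predicts the majority training class, and random prediction samples from the labels observed in training (and therefore follows their class frequencies). The function names are hypothetical.

```python
import random
from collections import Counter


def zero_rule(train_labels, n_test):
    """Always predict the most frequent class seen in the training labels."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * n_test


def random_prediction(train_labels, n_test, seed=0):
    """Predict each test item by drawing one of the training labels at random."""
    rng = random.Random(seed)  # fixed seed keeps the baseline reproducible
    return [rng.choice(train_labels) for _ in range(n_test)]
```

Both baselines ignore the content of the RDF graphs entirely, which is what makes them useful as a sanity check for a learned classifier.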

Dataset
We collect the DBpedia dataset (https://wiki.dbpedia.org/data-set-30). The DBpedia 2016-10 release contains 13 billion pieces of information, of which 1.7 billion were extracted from the English edition of Wikipedia. We use only these 1.7 billion RDF triples (English version) to evaluate the proposed approach; however, we ignore all syntactically invalid triples, as mentioned in Section 2.4. To evaluate the proposed approach, we divide the data into four search categories:
• Triple-pattern requests with multiple responses; e.g., British actors and their birth regions.
• Extended triple-pattern requests with multiple responses; e.g., movies having award-winning feminist actors.
• Triple-pattern requests with zero responses; e.g., MIT graduates born in Steve Jobs's death place.
• Extended triple-pattern requests with zero responses; e.g., people who were influenced by Egyptian writers.

Process and metrics
Algorithm 1 shows the process to compute the best classifier (CNN) as mentioned in Section 2. Algorithm 1 consists of three parts. In the first part (Line 1), we set up cross-validation (sometimes called rotation estimation) [31] M on D. We divide D into ten segments denoted $M_i$ ($i = 1, 2, \ldots, 10$). For each i, we hold out the RDF graphs that belong to $M_i$ and mark them as the testing RDF graphs $Test_i$, and the remaining RDF graphs are marked as the training RDF graphs $Train_i$:
$$Test_i = M_i, \qquad Train_i = \bigcup_{j \in [1,10],\ j \neq i} M_j$$
In the second part (Lines 2-11), we apply the M-fold cross-validation and train/test the classifiers (MNB, LR, RF, SVM, LSTM, and CNN). For each iteration of cross-validation, we first separate the training dataset Train and the testing dataset Test (Line 3). Then, we train the classifiers with Train and evaluate each classifier with Test (Lines 4-10). In the last part (Lines 12-13), we compute and compare the metrics (accuracy, precision, recall, and F1) of each classifier, and return the best classifier.

Algorithm 1 Identification of Best Machine/Deep Learning Algorithm for the Proposed Approach

The selected metrics are commonly adopted for the performance evaluation of classification algorithms [32][33][34][35][36][37]. Therefore, we calculate the retrieval-related accuracy, precision, recall, and f-measure for the performance evaluation of the proposed approach on the given RDF graphs, which can be defined as
$$Acc = \frac{TP + TN}{TP + TN + FP + FN}$$
$$Pre = \frac{TP}{TP + FP}$$
$$Rec = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times Pre \times Rec}{Pre + Rec}$$
where Acc, Pre, Rec, and F1 represent the accuracy, precision, recall, and f-measure of the proposed approach in retrieval prediction of RDF graphs, respectively. TP represents the number of RDF graphs that the proposed approach predicts correctly as hit, TN represents the number of RDF graphs that the proposed approach predicts correctly as miss, FP represents the number of RDF graphs that the proposed approach predicts incorrectly as hit, and FN represents the number of RDF graphs that the proposed approach predicts incorrectly as miss.
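The fold construction and the four metrics can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' evaluation code: the interleaved fold assignment, the `"hit"` label string, and the function names are assumptions, and the zero-denominator guards are a convenience that the definitions do not prescribe.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs: fold i is held out for testing, the rest is training."""
    folds = [items[i::k] for i in range(k)]  # assumption: interleaved assignment to folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def retrieval_metrics(y_true, y_pred, positive="hit"):
    """Compute Acc, Pre, Rec, and F1 from true and predicted hit/miss labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    pre = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * pre * rec / (pre + rec) if (pre + rec) else 0.0
    return {"Acc": acc, "Pre": pre, "Rec": rec, "F1": f1}
```

In a full run, each classifier would be trained on `train` and scored on `test` for every fold, and the per-fold metrics would then be averaged before the classifiers are compared.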

Results
RQ1: Accuracy of the proposed approach. We answer RQ1 by comparing the proposed approach with the state-of-the-art approaches: MLSearch and GRSearch. We also compare the proposed approach with a random prediction algorithm and a zero-rule algorithm. We consider both algorithms because the proposed approach is the first to leverage deep learning algorithms for retrieval prediction of RDF graphs.
The evaluation results of the proposed approach and the baseline approaches are presented in Table 1. Approaches are presented in the first column of the table [...] 85%, 83.15%, and 67.72%), respectively. Table 2 shows the evaluation results of random prediction, zero-rule, and the proposed approach. Approaches are presented in the first column of the table [...] We present the accuracy distribution of 10-fold cross-validation for the proposed approach and baseline approaches in Fig 7. We compare the F1 distributions of each approach and plot one bean for each approach. Each short horizontal line within a bean illustrates the F1 on the i-th fold, whereas the long horizontal line illustrates the average F1. We observe that the proposed approach outperforms the baseline approach in each fold. Notably, the average F1 of the proposed approach is significantly larger than the best performance of the baseline approach. We also employ one-way ANOVA to confirm the significance of the difference between the proposed approach and the baseline approach. It examines whether the single factor (i.e., different approaches) is the only difference that leads to the difference in performance. Note that ANOVA is conducted independently on Acc, Pre, Rec, and F1. Table 3 presents the results of ANOVA, which show that $F > F_{crit}$ and P-value < alpha (alpha = 0.05) hold for each of Acc, Pre, Rec, and F1. This indicates that using a different approach (the single factor) makes a significant difference in performance. The preceding analysis concludes that the proposed approach is accurate in retrieval anticipation of RDF graphs.
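For readers unfamiliar with one-way ANOVA, the F statistic that is compared against the critical value can be computed as in the sketch below. The function name is hypothetical, and no values from Table 3 are reproduced; each group would hold the per-fold scores of one approach for a single metric.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation explained by the factor (which approach produced the score).
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Residual variation of the scores around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

The difference between groups is considered significant when the returned F exceeds the critical value of the F distribution for the two degrees of freedom at the chosen alpha; that critical value would come from an F table or a statistics library.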
• The CNN classifier surpasses all the other machine learning classifiers. It converts non-linearly classifiable and inter-dependent feature data into a higher-dimensional hyperplane if the classification of the data is not possible linearly.
• Although existing research [41] reports that the MNB classifier is effective in classification, it does not work well with the proposed approach on the given dataset. One possible reason is that the input predicates (features) to the classifier for training are inter-related, whereas the MNB classifier performs well if the features are independent [27,42]. The evaluation results of MNB on the given dataset are not as good as those of SVM, LR, and RF with the proposed approach.
• The performance results of LR and RF are very close to the SVM. A larger dataset may reveal that one of them is better than SVM. The preceding analysis concludes that CNN works better than the other classifiers with the proposed approach.
RQ3: Influence of features' preprocessing. Different RDF graphs may have similar predicates (features) or may have superlative/comparative words in predicates. Passing such data as features to a machine learning algorithm is an overhead. It reduces performance and increases the computational cost of machine learning algorithms.
We answer RQ3 by comparing the evaluation results of the proposed approach with and without features' preprocessing. The evaluation results are presented in Table 5. The preprocessing input settings are presented in the first column of the table. Columns 2-5 of the table present the performance results of Acc, Pre, Rec, and F1. The rows of the table present the performance of the proposed approach under the different preprocessing settings. The improvement in the performance of the proposed approach between the preprocessing settings is presented in the last row of the table.
From Table 5, we make the following observations:
• With preprocessing enabled, the proposed approach achieves a significant improvement in performance. The evaluation results suggest that the performance improvements in Acc, Pre, Rec, and F1 are up to 16.71%, 14.38%, 11.22%, and 12.82%, respectively.
• With preprocessing disabled, Rec decreases significantly from 95.56% to 85.92%. The decrease in Rec means that incorrect results are returned for the requested query. One possible reason for the decrease in performance is the presence of similar or superlative/comparative words in the predicates of the given triples.
The preceding analysis concludes that preprocessing of the features is essential to the proposed approach.

Threats to validity
Some elements may affect the performance of the proposed approach. The following are the threats to the validity of the proposed approach.
• The selection of evaluation metrics is the first threat to construct validity. We select the Acc, Pre, Rec, and F1 metrics for the evaluation of the proposed approach because they are the most adopted metrics [32][33][34][35][36][37] for the evaluation of classification problems.
• The use of NLTK for the preprocessing of the extracted features (as mentioned in Section 2.5) is a threat to construct validity. We select NLTK due to its performance and popularity [37]. The use of any other natural language processing library may affect the reported results of the proposed approach.
• The generalizability of the proposed approach is a threat to external validity. We focus on the RDF graphs from an open-source dataset (DBpedia) for the evaluation of the proposed approach. We cannot guarantee the results of the proposed approach with other datasets.

Related work
The WWW is an information space where RDF graphs and other web resources are identified by URLs, may be interlinked, and are accessible over the Internet. It is difficult to get the right URLs for given queries due to the information overload of the current digital era. To address this problem, Tim Berners-Lee introduced the semantic web, which provides a common framework and allows data to be shared and reused across applications. It considers semantics for searching rather than keyword matching and query responses. Linking data together from different resources is the key to the semantic web. Moreover, linking data is essential to connect and search data over the semantic web. Linked data rely on RDF graphs that contain data in RDF format. Many approaches have been proposed for the efficient search of RDF graphs. Such approaches mainly focus on classical RDF searching, e.g., keyword-based searching or graph-based searching. Tran et al. [43] introduced the idea of generating summary graphs for the original RDF graph to generate and rank candidate SPARQL queries. Then, Zhang et al. [44] proposed a solution based on this idea. Moreover, Yang et al. [45] proposed tree patterns to connect keywords specified by the users, where the tree patterns are ordered by their size relevance, and Zheng et al. [46] proposed a method to search semantically equivalent structure patterns. Finally, De Virgilio [47] proposed an RDF keyword-based query via tensor calculus and later extended it to a distributed environment via MapReduce [48].
Nagarajan et al. [49] presented an ontology-based multi-modal semantic information retrieval system. It is based on the idea of integrating domain knowledge and images, and it retrieves the required multi-modal information using a fuzzy rule set. It also provides image semantics by constructing visual words using probabilistic latent semantics. Other studies [50][51][52] also proposed formal and semantic visualization models based on the fuzzy rule set.
Nhuan et al. [53] proposed an approach that determines the degrees of equality between relations (properties) defined by different vocabularies. They consider the occurrences of matching pairs of RDF triples to find the intervals representing lower and upper levels of property equality. Consequently, they obtained a graph of similar properties where the interval-based strength of edges represents degrees of similarity between properties.
Jaafar et al. [54] proposed a fuzzy knowledge-based framework to realize a natural and visualized F-RDF retrieval operation that helps end users query and access Web data.
Gupta et al. [18] introduced a ranking function based on fuzzy logic to enhance information retrieval. The function is based on the computation of term-weighting schemes such as term frequency, inverse document frequency, and normalization. The state-of-the-art [15][16][17][18] has described the difficulties in understanding a semantic search engine. The motive behind these works is to propose an approach based on RDF and the automatic identification of content over the WWW.
In conclusion, researchers have proposed different approaches [8-19, 55, 56] for retrieving information using RDF; however, these approaches require significant improvement. Moreover, none of them employs machine learning classification algorithms to address this problem. Notably, the proposed approach differs from the existing approaches in that we are the first to apply a convolutional neural network to the retrieval of RDF graphs.

Conclusion
In this digital era, Web users share almost every moment of daily life on the Internet, which causes information overload. Consequently, it is difficult to accurately retrieve the required information without understanding the syntax and semantics of the content. To this end, in this paper, a deep learning-based approach for searching RDF graphs is proposed that treats RDF graph requests as a classification problem. The proposed approach applies a deep learning classifier on the given dataset for the retrieval anticipation of RDF graphs. The proposed approach introduces a new way to search RDF graphs and helps Web users in answering their queries. We perform 10-fold cross-validation for the evaluation of the proposed approach using the open-source RDF graphs of DBpedia. The evaluation results show that the proposed approach is accurate.
The broader impact of this study is to indicate that the triples in RDF graphs are a rich source of information for accurate retrieval prediction of RDF graphs. Our results motivate future research on the retrieval anticipation of RDF graphs. In future work, we want to investigate the retrieval prediction of RDF graphs using a deep learning approach with deeper hyperparameter settings. This will also help confirm the generalizability of the proposed approach.