Reducing vertices in property graphs

Graph databases are constantly growing, and, at the same time, some of their data is the same or similar. Our experience with the management of existing databases, especially the bigger ones, shows that certain vertices are replicated there numerous times. Eliminating repetitive or even very similar data speeds up access to database resources. We present a modification of this approach, where we similarly group together vertices of identical properties, but then additionally join together groups of data that are located in distant parts of a graph. The second part of our approach is non-trivial. We show that the search for a partition of a given graph where each member of the partition has only pairwise distant vertices is NP-hard. We indicate a group of heuristics that try to solve our difficult computational problems, and then we apply them to check the effectiveness of our approach.


Introduction and preliminaries
Graphs are a useful and understandable form of presenting various types of data in areas such as administration, social networks, biological sciences, media, and geography. Property graphs [1] are types of graphs that enable the construction of links, relations, and attributes of particular objects. Some elements of property graphs may be similar and share the same features. These elements can be merged together, and, thanks to that, the property graph becomes smaller, simpler, and easier to process and select the data from.
In this article, we present a way of reducing nodes of a property graph. We rely on the observation that, in the property graph, there may exist elements that have completely different properties, which are non-connectable with each other, and even have disjoint neighbors. The advantage of this approach is that merging different vertex contexts significantly reduces the chance that a query will involve different vertices that have been merged. This approach is subject to some errors resulting from merging distant vertices and quite random data, because of possible relationships. These errors can be eliminated by re-querying only within the merged entities, among which there is very little dependency because the vertices are distant in the original graph. Such a division speeds up the execution of queries, especially those of high complexity. To find this kind of vertex, we use graph vertex coloring methods [2,3].
Some initial work [4][5][6] has been done in Resource Description Framework (RDF) [7] and Semantic Web [8], and we are trying to move these ideas to the property graph world. In the article, we also present how to find distant vertices as well as how to find and merge similar ones. We propose an experimental method of dealing with data similarity problems in property graphs by searching for solutions known from addressing NP-complete graph problems. We also present results obtained with METIS [9] and ColPack [10] support. Our proposals help a user who is familiar with graph databases to access both RDF data and property graph data. Property graph databases often have better performance than native RDF graph stores [11][12][13], so it is important to enable interoperability between these two approaches.
The PG data model rests on the concept of creating directed and key/value-based graphs. It means that there is a tail and head to each edge and both vertices and edges can have properties associated with them.
Following [14,15], we provide a formal definition below.
Definition 1 (Property Graph). A Property Graph is a tuple $PG = \langle V, E, S, P, h_e, t_e, l_v, l_e, p_v, p_e \rangle$, where:

1. $V$ is a non-empty set of vertices,
2. $E$ is a multiset of edges, which are elements of $V \times V$,
3. $S$ is a non-empty set of character strings,
4. $P$ is the Cartesian product $S \times S$, where each member has the form $p = \langle k, v \rangle$ (property),
5. $h_e : E \to V$ is a function that yields the source of each edge (head),
6. $t_e : E \to V$ is a function that yields the target of each edge (tail),
7. $l_v : V \to S$ is a function mapping each vertex to a label,
8. $l_e : E \to S$ is a function mapping each edge to a label,
9. $p_v : V \to 2^P$ is a function that assigns vertices to their multiple properties, and
10. $p_e : E \to 2^P$ is a function that assigns edges to their multiple properties.

Note that $\langle V, E, h_e, t_e, l_e \rangle$ is an edge-labeled directed multigraph.
Example 1. The example in Fig 1 presents a property graph.
A collection of RDF triples intrinsically represents a labeled directed multigraph. The nodes are the subjects and objects of their triples. RDF is often referred to as a graph, where each $\langle s, p, o \rangle$ triple can be interpreted as an edge $s \xrightarrow{p} o$. Several syntax formats (called serializations) exist for writing down RDF graphs. One example of such a serialization is Yet Another RDF Serialization (YARS) [15], which prepares RDF data for exchange with property graph data stores.
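To make Definition 1 concrete, the following is a minimal sketch of a container that mirrors the tuple. The class name, field names, and the sample labels are assumptions made for illustration only; they do not come from YARS or from any graph database API.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    """Illustrative container for Definition 1; all names here are assumptions."""
    vertices: set = field(default_factory=set)        # V
    edges: list = field(default_factory=list)         # E; a list keeps parallel edges (multiset)
    vertex_label: dict = field(default_factory=dict)  # l_v : V -> S
    edge_label: dict = field(default_factory=dict)    # l_e : E -> S, keyed by edge index
    vertex_props: dict = field(default_factory=dict)  # p_v : V -> 2^P
    edge_props: dict = field(default_factory=dict)    # p_e : E -> 2^P, keyed by edge index

    def add_vertex(self, v, label, props=()):
        self.vertices.add(v)
        self.vertex_label[v] = label
        self.vertex_props[v] = set(props)  # each property is a (key, value) pair of strings

    def add_edge(self, source, target, label, props=()):
        if source not in self.vertices or target not in self.vertices:
            raise ValueError("both endpoints must be vertices")
        index = len(self.edges)
        self.edges.append((source, target))
        self.edge_label[index] = label
        self.edge_props[index] = set(props)
        return index

    def head(self, index):
        """h_e: the source of the edge, following the wording of Definition 1."""
        return self.edges[index][0]

    def tail(self, index):
        """t_e: the target of the edge."""
        return self.edges[index][1]


pg = PropertyGraph()
pg.add_vertex("a", "resource", {("value", "ex:a")})         # hypothetical label
pg.add_vertex("c", "resource", {("value", "foaf:Person")})  # hypothetical label
first = pg.add_edge("a", "c", "rdf:type")
second = pg.add_edge("a", "c", "rdf:type")  # a parallel edge is allowed in a multigraph
```

Edges are stored in a list rather than a set so that two edges with the same endpoints remain distinct, which is what the multiset in item 2 requires.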
The article is organized as follows. In Section 2, we motivate the need for vertex reduction in property graphs. Section 3 shows that reducing vertices in property graphs is NP-hard. In Section 4, we introduce the tested data sets and our experiments. Section 5 is devoted to related work. The paper ends with conclusions.

Motivating scenario
Let us suppose we have launched a Web crawler indexing Resource Description Framework in Attributes (RDFa) [18] data, which is a syntax that embeds RDF triples [7] in HTML and XML documents. Following [7], we provide the definition of an RDF triple below.
Definition 5 (RDF triple). Assume that $I$ is the set of all Internationalized Resource Identifiers (IRIs), $B$ is an (infinite) set of blank nodes, and $L$ is a set of literals. An RDF triple is defined as a triple $t = \langle s, p, o \rangle$ where $s \in I \cup B$ is called the subject, $p \in I$ is called the predicate, and $o \in I \cup B \cup L$ is called the object.
Property graph databases are vertex-centric, whereas RDF graph stores are edge-centric. As a result, RDF graph stores use edges, many of which are not critical to our queries, so we choose property graphs to store our data. The indexed subjects, predicates, and objects of RDF are saved in the graph database in the form of a property graph. Subjects and objects are represented as vertices, whereas predicates are edge labels. These elements can be written in YARS [15], which is a serialization for PG databases that is compatible with RDF. An exemplary property graph can be seen in Fig 2.
The first RDF triple consists of http://example.org/p#j (an IRI), http://www.w3.org/1999/02/22-rdf-syntax-ns#type (an IRI), and http://xmlns.com/foaf/0.1/Person (an IRI). The second RDF triple consists of http://example.org/p#j (an IRI), http://xmlns.com/foaf/0.1/name (an IRI), and John Smith (a literal). Note that subjects are deduplicated.
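The mapping from triples to a property graph described above can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, each term is stored in a single property with the key value, and the sketch does not reproduce the YARS syntax itself.

```python
def triples_to_property_graph(triples):
    """Turn (subject, predicate, object) triples into vertices and labeled edges.

    Assumption: each distinct term becomes one vertex that carries the
    single property ("value", term); predicates become edge labels.
    """
    vertex_of = {}  # term -> vertex id
    props = {}      # vertex id -> set of (key, value) properties
    edges = []      # (source id, target id, edge label)

    def vertex(term):
        if term not in vertex_of:
            vertex_of[term] = len(vertex_of)
            props[vertex_of[term]] = {("value", term)}
        return vertex_of[term]

    for s, p, o in triples:
        source = vertex(s)
        target = vertex(o)
        edges.append((source, target, p))
    return props, edges


triples = [
    ("http://example.org/p#j",
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://xmlns.com/foaf/0.1/Person"),
    ("http://example.org/p#j",
     "http://xmlns.com/foaf/0.1/name",
     "John Smith"),
]
props, edges = triples_to_property_graph(triples)
```

Because the subject of both triples is the same IRI, the sketch creates one vertex for it, which corresponds to the deduplicated subject mentioned above.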
Unfortunately, both subjects and objects may be repeated in different sources that are searched by a Web crawler. For the sake of efficiency, we cannot check for every RDF triple whether nodes of the same name already exist in the database. This is why all subjects and objects of RDF triples encountered in different time intervals are entered into the database. Fig 2 shows repeated vertices and their properties. Such a state of data is not desired, because it reduces the efficiency of data processing and substantially impedes selecting the data. This is why we would like vertices that represent the same or a similar thing to be represented by one node. A way of solving such problems is to remove unnecessary nodes whose properties are the same or similar and merge them into one node. Since a Web crawler indexes RDF triples, we assume that if vertices have the same properties, they are not connected by edges (see Fig 3). It is important to note that without this assumption we may encounter a loop in a modified graph, where nodes of the same property have been merged. Indeed, let us consider an edge that connects two vertices of the same property $p$ and the vertex $v_p$ that represents the set of all vertices with $p$ in the modified graph. Then the edge would be transformed into a loop that links $v_p$ to itself. Such loops are superfluous because of the algorithm that transforms RDF to property graphs. Under the assumption that there is no edge connecting two vertices with the same property, each set of vertices with a common property is independent. As a result, the collections of merged nodes of the same property determine a partition of the graph's node set into independent sets. Because, through merging, we want to obtain a minimum number of nodes, we also allow the merging of several node families, each defined by a common value, provided that no two vertices within the merged family are adjacent.
Elimination of vertices may take place because the data placed in the properties is similar, hence RDF subjects or objects represent the same family of RDF triple elements. Such minimization leads to the known NP-complete Graph Colorability problem (see GT4 [19]). Note that this problem is originally formulated for undirected graphs; however, it is as difficult as the problem of coloring a directed graph or even a multigraph. Moreover, subfamilies of nodes expressed by the partition of a graph into independent sets do not have to be property closed. To say that a partition is property closed means simply that every two vertices that have the same property have to belong to the same independent set. This property does not have to be kept even in the case of a graph of several vertices. Indeed, the graph presented in Fig 2 is 2 Fig 3).
Therefore, we cannot directly use any approximation algorithm for graph coloring to reduce the number of vertices in a property graph. To solve this problem in our approach, we first transform a given property graph, sticking together vertices that have the same property as in Fig 3. Then we color a graph determined by classes of properties, and finally we assign to each vertex the color of the class to which it belongs. We describe the transformation more formally in Section 3 and use it to show in Theorem 1 that our reduction problem is NP-hard. Additionally, in Theorem 2 we show that, based on this approach, we can obtain a reduced property graph that has the minimal number of vertices.
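The three steps above (merge vertices by property, color the graph of classes, assign each vertex the color of its class) can be sketched as follows. This is a hedged illustration, not the authors' tool: it assumes every vertex has exactly one property, uses a plain largest-degree-first greedy heuristic in place of ColPack, and the function name and sample data are hypothetical.

```python
def reduce_vertices(vertex_prop, edges):
    """Return a color for each vertex; vertices of equal color can be merged.

    vertex_prop: dict mapping a vertex to its single property (assumption).
    edges: iterable of (u, v) pairs; direction is ignored for coloring.
    """
    # Step 1: a class is identified by the property shared by its vertices.
    class_of = dict(vertex_prop)
    adjacency = {c: set() for c in set(class_of.values())}

    # Step 2: classes are adjacent when some edge joins their members.
    for u, v in edges:
        cu, cv = class_of[u], class_of[v]
        if cu == cv:
            raise ValueError("an edge joins two vertices with the same property")
        adjacency[cu].add(cv)
        adjacency[cv].add(cu)

    # Greedy heuristic: visit classes by decreasing degree, take the smallest free color.
    color = {}
    for c in sorted(adjacency, key=lambda c: (-len(adjacency[c]), str(c))):
        used = {color[n] for n in adjacency[c] if n in color}
        color[c] = next(k for k in range(len(adjacency) + 1) if k not in used)

    # Step 3: every vertex inherits the color of its class.
    return {v: color[class_of[v]] for v in vertex_prop}


vertex_prop = {1: "A", 2: "A", 3: "B", 4: "C", 5: "C"}
edges = [(1, 3), (2, 3), (3, 4)]
colors = reduce_vertices(vertex_prop, edges)
```

In this toy input the classes A and C are not adjacent, so the heuristic gives them the same color and the five input vertices collapse into two merged vertices. A greedy heuristic gives no optimality guarantee in general.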

Reducing vertices
In this section, we show that even when the search for minimal colorings is narrowed to partitions into independent sets that are closed with respect to a common property, the problem of minimal coloring is still NP-hard. To do this, we first introduce the necessary definitions and notation. Subsequently, we suggest a method enabling the use of known approximation algorithms in the search for a minimal graph coloring. For this purpose, we present a method of converting a property graph into an undirected graph whose minimal coloring uniquely determines a minimal property graph coloring.
To formulate our coloring problem, we need to fix the appropriate notation. Let $G = \langle V, E \rangle$ be a simple undirected graph with the vertex set $V$ and the edge set $E$, where each edge is a two-element subset of $V$. We assume that the considered undirected graphs are loopless, since a vertex that is incident to a loop could never be properly colored.
For simplicity, the phrase $\pi$ is a partition of $G$ means that $\pi$ is a partition of $V$ into mutually disjoint sets. Now we can formulate the Graph Colorability problem as the following decision problem:
Graph Colorability (GT4):
INSTANCE: An undirected graph $G = \langle V, E \rangle$, a positive integer $k \le |V|$.
QUESTION: Does a partition of $G$ into $k$ independent sets exist?
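A candidate answer to the Graph Colorability question is easy to verify, even though finding one is hard. The following sketch checks whether a given family of vertex sets is a partition of $G$ into independent sets; the function name and the sample graph are hypothetical.

```python
def is_partition_into_independent_sets(vertices, edges, parts):
    """Check that `parts` partitions `vertices` and that no edge lies inside a part."""
    parts = [set(p) for p in parts]
    vertices = set(vertices)
    # Partition check: non-empty, pairwise disjoint, and covering all vertices.
    if any(not p for p in parts):
        return False
    if sum(len(p) for p in parts) != len(vertices):
        return False
    if set().union(*parts) != vertices:
        return False
    # Independence check: no edge has both endpoints in the same part.
    return all(not set(e) <= p for e in edges for p in parts)


# A cycle on four vertices.
cycle_vertices = [1, 2, 3, 4]
cycle_edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
good = is_partition_into_independent_sets(cycle_vertices, cycle_edges, [{1, 3}, {2, 4}])
bad = is_partition_into_independent_sets(cycle_vertices, cycle_edges, [{1, 2}, {3, 4}])
```

The first partition is a proper coloring of the cycle with $k = 2$; the second is rejected because the edge between vertices 1 and 2 lies inside one part.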
Let $p$ be a property. We call $V_1$ the class of $p$ if each vertex of $V_1$ has property $p$ and each vertex that has property $p$ belongs to $V_1$, i.e., $u \in V_1 \Leftrightarrow p \in p_v(u)$ for every $u \in V$. Finally, we call $V_1$ property closed if, for every $p \in P$ whose class has a non-empty intersection with $V_1$, the class of $p$ is a subset of $V_1$.
Combined Property Graph (CPG):
QUESTION: Does a partition of $V$ into $k$ independent property closed sets in $PG$ exist?
It is important to note that there exists a partition of $V$ into mutually disjoint independent property closed sets since the property graph $PG$ is single. We show that a non-single property graph may not have such a partition. Let us consider the property graph presented in Fig 2 that contains vertices $a$, $b$, $c$. Denote by $V_b$ an arbitrary independent property closed subset of vertices that contains $b$. Suppose that $b$ has an additional property: value = ex:a (value is a key, and ex:a is a value of the property). As $a$ and $b$ have the property value = ex:a, we infer that $a \in V_b$. Similarly, $c \in V_b$ since value = foaf:Person is a property of $b$ and $c$. But vertices $a$ and $c$ are connected by an edge. This contradicts our assumption that $V_b$ is independent.
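The counterexample with vertices $a$, $b$, $c$ can be replayed mechanically. The sketch below computes the smallest property closed set containing $b$ and then tests it for independence. The function name is hypothetical, only the properties named in the text are modeled, and the direction of the edge between $a$ and $c$ is an assumption.

```python
def property_closure(start, props):
    """Smallest property closed vertex set that contains `start`."""
    closed = set(start)
    changed = True
    while changed:
        changed = False
        held = set().union(*(props[v] for v in closed))  # properties seen so far
        for v, ps in props.items():
            if v not in closed and ps & held:
                closed.add(v)  # v shares a property with the set, so it must join
                changed = True
    return closed


props = {
    "a": {("value", "ex:a")},
    "b": {("value", "foaf:Person"), ("value", "ex:a")},  # b has two properties: not single
    "c": {("value", "foaf:Person")},
}
edges = [("a", "c")]  # assumed direction; only adjacency matters here

V_b = property_closure({"b"}, props)
independent = not any(u in V_b and v in V_b for u, v in edges)
```

The closure pulls in both $a$ and $c$, and since they are adjacent the set cannot be independent, which is the contradiction described above.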
We can also weaken the assumption that $PG$ is single while preserving the existence of such a partition. Indeed, we can consider a family of property graphs where, whenever the classes of two properties have a common member, the class determined by one of them is a subset of the class of the other. However, we can easily transform such property graphs into single ones by removing every property whose class is a subset of the class of another property.
Theorem 1. CPG is NP-hard.
Proof. We transform Graph Colorability to CPG. Let $G = \langle V, E \rangle$ be an undirected graph and let $k \le |V|$ be a positive integer. Without loss of generality we can assume that $V = \{1, 2, \ldots, n\}$ where $n = |V|$ (note that we can consider an injective function from $V$ to $\{1, 2, \ldots, |V|\}$ that assigns a number to each vertex of a graph).
We describe a property graph as a tuple $PG = \langle V, E, S, P, h_e, t_e, l_v, l_e, p_v, p_e \rangle$ defined by where $v, u \in V$ and $e \in E$ (see Fig 4). Note that this translation can clearly be done in LOGSPACE.
The main idea of the proof uses the fact (1) that for every property there exists at most one vertex or one edge that has the property. Indeed, each property $p$ that is assigned to a vertex or an edge has the form $\langle x, x \rangle$. Additionally, $V$ and $E$ do not have common members, since each edge is a two-element subset of natural numbers and each vertex is a natural number. Based on this observation we show that there exists a partition of $G$ into $k$ independent sets if and only if there exists a partition of $PG$ into $k$ independent property closed sets.
Let us consider a partition $\pi$ of $G$ into $k$ independent sets. Consider $\pi' = \pi$ as a partition of $PG$ (this is possible since $PG$ and $G$ have the same vertex set). We show that $\pi'$ is a partition of $PG$ into $k$ independent property closed sets. To do this, first, we show that each member of $\pi'$ is an independent set of $PG$. Let us consider an arrow $a$ of $PG$ and suppose, contrary to our claim, that there exists $V_1 \in \pi'$ such that $h_e(a) \in V_1 \wedge t_e(a) \in V_1$. The arrow $a$ can be represented in the form $\langle v, u \rangle$ for some $v, u \in V$ where $\{v, u\} \in E$ and $v < u$. Then $h_e(a) = u$ and $t_e(a) = v$, which implies that $\{v, u\}$ is an edge that connects two members of $V_1$; but $V_1$ is an independent set of $G$ as a member of $\pi$, a contradiction. We are now in a position to show that each member of $\pi'$ is a property closed set. This is obvious, since by (1) no two vertices have the same property, and the proof of the first implication is complete.
Let us consider a partition $\pi_{PG}$ of $PG$ into $k$ independent property closed sets. Similarly, let us consider $\pi'_{PG} = \pi_{PG}$ as a partition of $G$. Suppose, contrary to our claim, that $\pi'_{PG}$ is not a partition of $G$ into independent sets. Then there exist $V_2 \in \pi'_{PG}$ and two vertices $v, u \in V_2$ such that $\{v, u\} \in E$. Without loss of generality we can assume that $v < u$. Then the arrow $\langle v, u \rangle$ of $PG$ connects two members of $V_2$, since $h_e(\langle v, u \rangle) = u$ and $t_e(\langle v, u \rangle) = v$. This contradicts the fact that $V_2$ is an independent set of $PG$, and the proof is complete.
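The translation used in the proof can be sketched in code. The sketch relies only on what the proof text states: every property has the form $\langle x, x \rangle$, every edge $\{v, u\}$ with $v < u$ becomes the arrow $\langle v, u \rangle$, and $h_e(\langle v, u \rangle) = u$, $t_e(\langle v, u \rangle) = v$. Everything else is an assumption: the function name is hypothetical, the string encoding of $x$ is one possible choice, and the labels $l_v$, $l_e$ are omitted because the text does not specify them.

```python
def colorability_to_cpg(n, edges):
    """Build a property graph from an undirected graph on vertices 1..n.

    Assumed reconstruction: vertex v carries the single property (str(v), str(v)),
    and each arrow carries a property built the same way from the arrow itself.
    """
    vertices = list(range(1, n + 1))
    vertex_props = {v: {(str(v), str(v))} for v in vertices}
    # Orient every undirected edge {v, u} as the arrow (v, u) with v < u.
    arrows = sorted({(min(a, b), max(a, b)) for a, b in edges})
    head = {(v, u): u for v, u in arrows}  # h_e(<v, u>) = u, as in the proof
    tail = {(v, u): v for v, u in arrows}  # t_e(<v, u>) = v, as in the proof
    edge_props = {a: {(str(a), str(a))} for a in arrows}
    return vertices, arrows, head, tail, vertex_props, edge_props


# A triangle: three vertices, three edges.
vertices, arrows, head, tail, vertex_props, edge_props = colorability_to_cpg(
    3, [(1, 2), (2, 3), (3, 1)]
)
all_props = [p for ps in list(vertex_props.values()) + list(edge_props.values()) for p in ps]
```

Since no two elements share a property, every vertex set is trivially property closed, which is why partitions of $G$ and partitions of $PG$ into independent sets coincide in the argument above.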
Let $PG = \langle V, E, S, P, h_e, t_e, l_v, l_e, p_v, p_e \rangle$ be a property graph. Let $\pi_{PG}$ be a set of non-empty subsets of $V$ such that $V'$ is a member of $\pi_{PG}$ if and only if there exists a property $p \in P$ for which $V'$ is the set of all vertices of $V$ that have property $p$. Obviously, if $PG$ is single, then $\pi_{PG}$ is a partition of $PG$. Moreover, if $PG$ is unique adjacency then, by definition, there are no two vertices that are connected by an arrow and belong to the same member of $\pi_{PG}$. Hence, each member of $\pi_{PG}$ is an independent set for every unique adjacency property graph $PG$.
For subsets $V_1$, $V_2$ of $V$ we use the notation $V_1 \sim_{PG} V_2$ if there exists an edge $e \in E$ that connects a member of $V_1$ with a member of $V_2$, or more precisely
$$V_1 \sim_{PG} V_2 \iff \exists\, e \in E : \big(h_e(e) \in V_1 \wedge t_e(e) \in V_2\big) \vee \big(h_e(e) \in V_2 \wedge t_e(e) \in V_1\big). \quad (2)$$
Now, we define a partition graph $G(PG, \pi)$ for an arbitrary partition $\pi$ of $V$ as follows. The set of vertices in $G(PG, \pi)$ equals $\pi$, and two vertices $V_1$, $V_2$ of $G(PG, \pi)$ are connected if $V_1 \ne V_2$ and $V_1 \sim_{PG} V_2$.
Let us consider a partition $\sigma$ of the set of vertices in $G(PG, \pi)$ and a member $V' \in \sigma$. Note that $\bigcup_{x \in V'} x$ is a subset of $V$, since each member of $\sigma$ is a set of non-empty subsets of $V$. In consequence,
$$\Pi_V(\sigma) = \Big\{ \bigcup_{x \in V'} x : V' \in \sigma \Big\} \quad (3)$$
is a set of non-empty subsets of $V$. Note that $\Pi_V(\sigma)$ is also a partition of $PG$. Indeed, $\Pi_V(\sigma)$ covers $V$, since $\pi$ covers $V$ and $\sigma$ covers $\pi$. Additionally, if two members of $\Pi_V(\sigma)$, e.g., $\bigcup_{x \in V'} x$ and $\bigcup_{x \in V''} x$ with $V', V'' \in \sigma$, have a common element, then $V'$ and $V''$ have a common member, hence $V' = V''$, and finally these two members of $\Pi_V(\sigma)$ are equal.
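The two constructions of this paragraph, the partition graph $G(PG, \pi)$ and the lifting $\Pi_V(\sigma)$, can be sketched as follows. The function names and the sample data are hypothetical; arrows are given as (head, tail) pairs of vertices, and blocks are stored as frozensets so that they can serve as graph vertices.

```python
def partition_graph(arrows, partition):
    """G(PG, pi): one vertex per block; two distinct blocks are adjacent
    when some arrow has its head in one block and its tail in the other."""
    blocks = [frozenset(b) for b in partition]
    block_of = {v: b for b in blocks for v in b}
    adjacency = {b: set() for b in blocks}
    for h, t in arrows:
        bh, bt = block_of[h], block_of[t]
        if bh != bt:
            adjacency[bh].add(bt)
            adjacency[bt].add(bh)
    return adjacency


def lift(sigma):
    """Pi_V(sigma): replace each member of sigma by the union of its blocks."""
    return [frozenset().union(*member) for member in sigma]


A, B, C = frozenset({1, 2}), frozenset({3}), frozenset({4, 5})
adjacency = partition_graph([(1, 3), (3, 4)], [A, B, C])

# A and C are not adjacent in the partition graph, so they may share a color.
sigma = [{A, C}, {B}]
lifted = lift(sigma)
```

Here the lifted partition has two members, so the five original vertices would be represented by two merged vertices.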
Similarly, if $\pi$ is property closed then each member $V'$ of $\pi$ that has a common element with a member $V''$ of $\pi_{PG}$ has to contain the whole of $V''$, i.e., $V'' \subseteq V'$. In consequence, we can assign a partition of $G(PG, \pi_{PG})$ to the partition $\pi$. This partition is defined as follows:
$$\Pi_{\pi_{PG}}(\pi) = \big\{ \{ V'' \in \pi_{PG} : V'' \subseteq V' \} : V' \in \pi \big\}. \quad (4)$$
We show that based on $G(PG, \pi_{PG})$ we can adapt each known approximation algorithm for graph coloring to the CPG problem.
Theorem 2. Let $PG$ be a single unique adjacency property graph. Then (i) for every partition $\pi$ of $PG$ into $k$ independent property closed sets, $\Pi_{\pi_{PG}}(\pi)$ is a partition of $G(PG, \pi_{PG})$ into $k$ independent sets, and (ii) for every partition $\pi$ of $G(PG, \pi_{PG})$ into $k$ independent sets, $\Pi_V(\pi)$ is a partition of $PG$ into $k$ independent property closed sets.
Proof.
(i) Let $\pi$ be a partition of $PG$ into $k$ independent property closed sets. Obviously, $\Pi_{\pi_{PG}}(\pi)$ is a partition of $G(PG, \pi_{PG})$ and has $k$ members, which is clear from (4). We show that each member of $\Pi_{\pi_{PG}}(\pi)$ is an independent set. Let us consider $a \in \Pi_{\pi_{PG}}(\pi)$ and suppose, contrary to our claim, that there exist $x, y \in a$ for which $\{x, y\}$ is an edge of $G(PG, \pi_{PG})$. From (2), there exists $e \in E$ such that $h_e(e) \in x \wedge t_e(e) \in y$ or $h_e(e) \in y \wedge t_e(e) \in x$. Without loss of generality we can assume that $h_e(e) \in x \wedge t_e(e) \in y$. Additionally, from (4), there exists $V' \in \pi$ satisfying $a = \{V'' \in \pi_{PG} : V'' \subseteq V'\}$. Hence $x, y \subseteq V'$, and finally $h_e(e), t_e(e) \in V'$. This contradicts the fact that $V'$, as a member of $\pi$, is independent.
(ii) Let $\pi$ be a partition of $G(PG, \pi_{PG})$ into $k$ independent sets. It is evident that $\Pi_V(\pi)$ is a partition of $PG$ and has $k$ members. Let us consider a member $a$ of $\Pi_V(\pi)$. Then there exists $V' \in \pi$ satisfying $a = \bigcup_{x \in V'} x$. We show that $a$ is an independent property closed set. Suppose first that $a$ is not independent. Then there exists an edge $e \in E$ such that $h_e(e) \in a \wedge t_e(e) \in a$, and in consequence there exist $x_1, x_2 \in V'$ with the following properties: $h_e(e) \in x_1$ and $t_e(e) \in x_2$. But then from (2), we obtain that $\{x_1, x_2\}$ is an edge of $G(PG, \pi_{PG})$, and in consequence $V'$ contains two connected elements $x_1, x_2$. This contradicts the fact that $V'$, as a member of $\pi$, is an independent set. Now we prove that $a$ is property closed. Let us consider $q \in P$ whose class has a non-empty intersection with $a$. To be specific, there exists $u \in a$ with $q \in p_v(u)$. Then there exists $x_1 \in V'$ such that $u \in x_1$. Moreover, $x_1 \subseteq \bigcup_{x \in V'} x = a$, and $x_1$, as a member of $V'$, is an element of $\pi_{PG}$; hence $x_1$ is the class of a property $q_1 \in P$ and $q_1 \in p_v(u)$. But since $PG$ is single, $u$ has exactly one property, therefore $q = q_1$ and in consequence the class of $q$, which is equal to $x_1$, is a subset of $a$. This concludes the proof.

Experiments and evaluation
In this section, we will present a description of data sets and experiments from three areas. The first group of experiments (Subsection 4.2) shows property graph coloring characteristics. In Subsection 4.3, we show how our proposal works on graph databases. The last group of experiments (Subsection 4.4) presents how property graph serializations deal with our solutions.

Data sets description and set-up
All experiments were executed on an Intel Core i7-4770K CPU @ 3.50GHz (4 cores, 8 threads), 8GB of RAM (clock speed: 1600 MHz), and an HDD with a reading speed of about 160 MB/s (measured with hdparm -t). We used Linux Mint 17. We gathered data sets from the Web in five ways: data crawled via a modified LDSpider [20], a subset of DBpedia [21], a subset of Wikidata [22], the W3C Public Mailing List Archives (https://lists.w3.org/), and data generated automatically using the Berlin SPARQL Benchmark (BSBM) [23].
The first data set ($DS_1$) was generated in LDSpider, which was extended with YARS support. The data set mainly concerns Friend of a Friend (FOAF) [24] information because we used FOAF URIs in the seed file. FOAF is a vocabulary that describes persons, their activities, and their relations with other people and objects. The next data set ($DS_2$) was created on the basis of DBpedia 3.0 [21], which contains data from different infoboxes in Polish. The third data set ($DS_3$) is a dump (access date: 2016-06-21) of the class hierarchy of Wikidata Properties [22]; Wikidata provides structured data for its Wikimedia sister projects including Wikipedia, Wikisource, and others. The fourth data set ($DS_4$) was found on the W3C Public Mailing List (https://goo.gl/9x2WAu). This data set was generated in cwm.py (a data processor and reasoner for the Semantic Web) and has data about class/property equivalences and other Web Ontology Language [25] metadata. The last data set ($DS_5$) was generated using the Berlin SPARQL Benchmark (BSBM) [23], which was extended with YARS support. $DS_2$, $DS_3$ and $DS_4$ are serialized RDF, so we had to transform them into YARS, which is an indirect PG serialization for RDF data. Our transformation tool was built in Python, and it is available at https://github.com/domel/yars. All of the considered data sets use YARS so that they can be compared under the same conditions. A description of the data sets is presented in Table 1. The first part of the table shows input file characteristics such as N-Triples size ($S^{b}_{yars}$), YARS size before removing the repeated nodes ($S^{b}_{yars}$), and YARS size after removing the repeated nodes ($S^{a}_{yars}$). Generation times are presented in the second part of the table.

Property graph coloring
Our tool for property graph coloring uses ColPack [10] and the METIS [9] file format. Our tool for METIS and other graph formats was built in C++ and it is available at https://github.com/domel/graph_syntax. Our main tool for PG coloring is available at https://github.com/domel/pg_color.
To reduce vertices in property graphs we have to take five steps.
In Table 2, we present the transformation from YARS into METIS and its characteristics. The first part of the table shows input file characteristics such as YARS size before removing the repeated nodes ($S^{b}_{yars}$), YARS size after removing the repeated nodes ($S^{a}_{yars}$), and METIS size ($S_{metis}$). In the second part of the table, we present the arithmetic mean time of transformation from 10 runs. The results show that the regularity of the graph has a strong influence on the transformation time, i.e., $DS_2$ is irregular in its structure and its transformation time is worse than that of $DS_5$, which is benchmark generated. The third part of the table shows output file characteristics such as the size of YARS with color metadata (${}^{o}S^{cm}_{yars}$), the regular YARS file size (${}^{o}S_{yars}$), and the sizes of YARS stream compressed with Lzip/DEFLATE [26], tANS (ZStandard implementation: http://facebook.github.io/zstd/) [27] and Brotli [28] (${}^{o}S^{lzip}_{yars}$, ${}^{o}S^{tans}_{yars}$ and ${}^{o}S^{br}_{yars}$). The last part shows reducing ratios referring to YARS with color metadata ($R_c$) and YARS after removing the repeated nodes ($R_a$). In addition, we show the ratio of YARS before removing the repeated nodes ($R_b$) and YARS after removing the repeated nodes with compression ($R^{lzip}_a$, $R^{tans}_a$ and $R^{br}_a$). Database servers [29][30][31] often can send messages to clients via HTTP [32] or RESTful web services [33]. Therefore, we decided to test our solution with compression to improve transfer speed and bandwidth utilization.
To test our approach we have to choose the minimum distance that has to be maintained between each pair of vertices merged together. At the same time, we have to choose the minimum distance between vertices of the same color. Below we present two cases: distance-1 and distance-2 [2].
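The difference between the two cases can be illustrated with a plain greedy distance-$d$ coloring: with $d = 1$ only adjacent vertices must differ in color, and with $d = 2$ vertices joined by a path of at most two edges must differ as well. This sketch is an assumption-laden illustration, not ColPack's algorithm; the function names and the sample path graph are hypothetical.

```python
from collections import deque


def within_distance(adj, source, d):
    """Vertices other than `source` whose distance from it is at most d (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == d:
            continue  # do not expand beyond the distance bound
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    del dist[source]
    return set(dist)


def greedy_distance_coloring(adj, d):
    """Give each vertex the smallest color unused within distance d."""
    color = {}
    for v in sorted(adj):
        used = {color[w] for w in within_distance(adj, v, d) if w in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


# An undirected path 1 - 2 - 3 - 4 given as adjacency sets.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
distance_1 = greedy_distance_coloring(path, 1)
distance_2 = greedy_distance_coloring(path, 2)
```

On this path the distance-1 coloring needs two colors and the distance-2 coloring needs three, so a larger minimum distance leaves more vertices after merging while keeping merged vertices further apart.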
In Table 3, we present distance-1 coloring characteristics, such as the unique vertex reducing ratio ($R_u$) of unique vertices to output vertices and the total vertex reducing ratio ($R_t$) of input vertices to output vertices. Figs 5 and 6 present $DS_1$ and $DS_4$ before and after our vertex reduction. To visualize these graphs we use the Yifan Hu algorithm (before) [34] and the Fruchterman-Reingold algorithm (after) [35].
In Table 4, we present distance-2 coloring characteristics, such as the unique vertex reducing ratio ($R_u$) and the total vertex reducing ratio ($R_t$), which show how many nodes of the same color are removed. Figs 7 and 8 present $DS_2$ and $DS_3$ before and after our vertex reduction. To visualize these graphs we use the Yifan Hu algorithm.
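The two ratios reported in Tables 3 and 4 follow directly from three vertex counts. The sketch below spells out the arithmetic; the function name and the counts are hypothetical and are not taken from the tables.

```python
def reducing_ratios(input_vertices, unique_vertices, output_vertices):
    """Return (R_u, R_t) as defined in the text.

    R_u: unique vertices divided by output vertices.
    R_t: input vertices divided by output vertices.
    """
    if output_vertices <= 0:
        raise ValueError("the reduced graph must contain at least one vertex")
    r_u = unique_vertices / output_vertices
    r_t = input_vertices / output_vertices
    return r_u, r_t


# Hypothetical counts chosen only to illustrate the computation.
r_u, r_t = reducing_ratios(input_vertices=1200, unique_vertices=600, output_vertices=150)
```

Since the unique vertices are a subset of the input vertices, $R_t$ is never smaller than $R_u$.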

Querying graph databases
In this subsection, we show our experiments on querying $DS_1$, $DS_2$, $DS_3$, $DS_4$ and $DS_5$ before and after our vertex reduction in common graph databases, such as Neo4j [30], Titan 1.0.0 [29], and OrientDB 2.2.14 [31]. All these property graph databases were running in Docker [36], which is a completely sandboxed virtual environment. The instructions to build images (i.e., dockerfiles) are available online. We focus mainly on the speed-up of querying, i.e., the ratio of time before to time after our reduction for every query: $Q_1$, $Q_2$, ..., $Q_8$. In Figs 9, 10, 11 and 12, we present queries in three different databases. OrientDB has been tested using Gremlin [37] and SQL via a console. In Neo4j, we used the Cypher Query Language [38], and, in Titan, we executed Gremlin queries. As expected, reducing the number of vertices generally sped up querying. Indeed, the average speed-up is 24.99 times. However, the speed-up depends significantly on the choice of graph database. This parameter is 2.73 in Neo4j, 38.78 in Titan, 48.52 in OrientDB via Gremlin, and 9.92 in OrientDB via a console. OrientDB via Gremlin yields the largest average speed-up; however, its ratio of time is greater than 1 for only 35% of queries. For comparison, the speed-up is greater than 1 for 75% of queries in Neo4j, 78.7% in OrientDB via a console, and 82.5% in Titan. We can also compare the ratio of time across graph databases for a particular query. Titan obtains the maximum speed-up based on our modification of property graphs in 52.5% of queries, OrientDB via a console in 25%, Neo4j in 25%, and OrientDB via Gremlin in 7.5%. Additionally, OrientDB via Gremlin obtains the minimum speed-up in 45% of queries, OrientDB via a console in 22.5%, Neo4j in 22.5%, and Titan in 10%.
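The speed-up measure and the overall average quoted above can be reproduced with a few lines. The per-database averages are the ones reported in the text; the function name is hypothetical.

```python
from statistics import mean


def speed_up(time_before, time_after):
    """Ratio of query time before the reduction to query time after it."""
    return time_before / time_after


# Average speed-ups per database, as reported in the text.
reported_average = {
    "Neo4j": 2.73,
    "Titan": 38.78,
    "OrientDB via Gremlin": 48.52,
    "OrientDB via console": 9.92,
}
overall = mean(list(reported_average.values()))
```

The unweighted mean of the four reported values is 24.9875, which agrees with the stated overall average of 24.99 after rounding. A value below 1 returned by speed_up means the query became slower after the reduction.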

Our queries can be divided into two types. The first one checks basic operations (loading data: $Q_1$, selecting all nodes with properties: $Q_2$, selecting all the edges with all properties: $Q_3$, removing data: $Q_4$). The second one checks finding the shortest path ($Q_5$: max depth 2, ..., $Q_8$: max depth 5). The analysis showed that our reduction of the number of vertices has different effects in each individual graph database that we considered. Indeed, the average speed-up reaches 96.51 times for the first type of queries in OrientDB via Gremlin, whereas the average speed-up for the second type of queries in this graph database is 0.53, i.e., the shortest paths are found about two times slower after the modification than prior to it. However, such situations occur only in the case of OrientDB via Gremlin. The best results are obtained in the case of Titan, where the average speed-up is 30.90 times for the first type of queries and 46.67 for the second one.
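The per-type averages can be checked against the per-database averages reported earlier. The sketch assumes that each of the eight queries contributes the same number of measurements, so that the overall average is the count-weighted mean of the two type averages; the function name is hypothetical.

```python
def overall_from_types(basic_avg, path_avg, n_basic=4, n_path=4):
    """Combine the averages of the two query types into one overall average.

    Assumption: the first type covers Q1..Q4 and the second Q5..Q8,
    with equally many measurements per query.
    """
    return (n_basic * basic_avg + n_path * path_avg) / (n_basic + n_path)


orientdb_gremlin = overall_from_types(96.51, 0.53)
titan = overall_from_types(30.90, 46.67)
```

Under this assumption the combined values are 48.52 for OrientDB via Gremlin and 38.785 for Titan, in line with the per-database averages of 48.52 and 38.78 quoted in the previous subsection.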

Serializations
In this subsection, we compare the efficiency of our vertex reduction in common graph serialization formats, such as GML [39], GraphML [40], and GEXF (https://gephi.org/gexf/format/). This is important because the format affects the efficiency of property graph data storage and transmission. This feature is useful for saving disk space, better use of database memory buffers, or faster communication between a database client and a server. Table 5 shows distance-1 characteristics of serializations before and after our vertex reduction. All results are better after our transformation. The best ratio belongs to GML ($DS_1$, $DS_2$, $DS_4$ and $DS_5$) and GEXF ($DS_3$).

In Table 6, we present distance-2 characteristics of serializations before and after our vertex reduction. Only two cases have slightly worse results (GML in $DS_1$ and $DS_3$). It is evident that additional information about the original graph (especially information about the edges) has a great impact in the case of a small reduction of nodes. The rest of the serializations give slightly better results than before our transformations.

Property graphs
In this section, we discuss property graph approaches and solutions. They can be divided into four groups: abstraction layer and formalization of property graphs, property graphs databases, multi-modal databases that support property graphs, and distributed processing frameworks.
The first group relates to proposals that formalize property graphs [29,[41][42][43]. In [41], Hartig proposes a formalization of the PG model and introduces transformations between PGs and RDF* [44]. Unfortunately, this model is not widely supported by graph stores. In [29], Jouili et al. suggested another definition of PG based on Blueprints (https://github.com/tinkerpop/blueprints/wiki). The PG definition is restrictive, because it assumes that labels must be unique. In that paper, the authors present a distributed graph database comparison framework. In [42], Schätzle et al. present a formalization of PG in the RDF context. Moreover, the paper introduces a SPARQL query processor for Hadoop called S2X. Unfortunately, this paper focuses on distributed storage and does not formalize property graphs in the graph database context. In [43], Batarfi et al. propose a formalization of attributed graphs, which is similar to PGs.
There are a few data stores in the Property Graph world [30,45]. Neo4j [30] is a native graph database purpose-built to leverage not only data but also its relationships. Titan [29] is another graph database that is distributed and transactional. Dex/Sparksee [45] is yet another graph database, which supports data constraints to guarantee the integrity of data and relationships among them.
The third group consists of multi-modal databases that support property graphs, and it can be divided into two subgroups: graph databases that support RDF and PG models [46,47], and databases that support documents and graphs [31,48]. Oracle database with the Oracle Spatial and Graph option [46] is an add-on database feature with advanced spatial capabilities enabling the development of complex geographic information systems. Bigdata/Blazegraph [47] is another graph database that supports RDF and PG models. It is an ultra-scalable and high-performance database that supports up to $50 \cdot 10^9$ edges on a single machine. The next subgroup comprises OrientDB [31] and ArangoDB [48]. The first solution supports different indexes and ACID transactions guaranteeing that all database transactions are processed reliably. In ArangoDB, documents are grouped into collections, which can be related to vertices or edges.
The last group comprises distributed processing frameworks that use property graphs [49,50]. GraphX [49], which is an API of Apache Spark for graphs and graph-parallel computation, is an example of property graph usage. It extends the Spark RDD abstraction by introducing the resilient distributed property graph. SGraph, a part of GraphLab [50], is another scalable graph data structure that derives from the property graph idea.

Graph coloring
Graph coloring has been known to be an NP-complete problem since the 1970s [19,51,52]. Since then, the difficulty of this problem has motivated heuristic algorithms that aim to use a near-optimal number of colors [53][54][55]. More recently, the approximability of minimum graph coloring has become better understood. Bellare et al. [56] show that minimum graph coloring cannot be approximated better than $|V|^{1/7-\varepsilon}$ for every $\varepsilon > 0$, unless P = NP. Moreover, Feige et al. [57] prove that if NP problems cannot be solved by a randomized algorithm in polynomial time, then minimum graph coloring cannot be approximated better than $O(|V|^{1-\varepsilon})$.
On the other hand, there are numerous heuristic algorithms for specialized graph coloring problems [58][59][60][61]. Qu et al. [58] propose a hybrid heuristic approach based on estimation of distribution algorithms. Their paper provides solutions of acceptable quality for a number of optimization problems and demonstrates the generality of the approach through experimental results for different variants of exam timetabling problems. FOO-PARTIALCOL is another approach, presented in [59]. This method is based on tabu search [62]. A solution consists of k disjoint stable sets and a set of uncolored vertices. Further approaches, based on the greedy algorithm [63], are introduced in [60,61]. In the first paper, the Iterated Greedy algorithm is shown to be effective on graphs with n vertices partitioned into k sets of as nearly equal size as possible. In the second paper, the authors present heuristic methods for coloring the vertices of a graph that rely on comparing the degrees and structures of graphs.
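To make the greedy idea concrete, the following is a minimal sketch of a degree-ordered greedy coloring. It is not the algorithm of [60] or [63]; the function name `greedy_coloring`, the adjacency-dictionary representation, and the 5-cycle example are assumptions chosen for illustration.

```python
def greedy_coloring(adj):
    """Color vertices greedily, visiting them in order of decreasing degree.

    adj maps each vertex to the set of its neighbors (undirected graph).
    Returns a dict vertex -> color, with colors numbered from 0.
    """
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    color = {}
    for v in order:
        # Colors already taken by neighbors that were colored earlier.
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:  # pick the smallest color not used by a neighbor
            c += 1
        color[v] = c
    return color


# Hypothetical example: a cycle on five vertices, which needs three colors.
cycle5 = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
coloring = greedy_coloring(cycle5)
```

A greedy coloring of this kind never uses more than one color beyond the maximum vertex degree, but it gives no guarantee of being close to the minimum number of colors in general.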
Several coloring algorithms arise in the context of computing derivative matrices, depending on whether the matrix is a Jacobian [64,65] or a Hessian [66,67]. In [64], the authors apply a formulation based on the column intersection graph. In [65], the authors propose a formulation based on the concept of a consistent row-column partition, in which the entire set of rows and columns is partitioned into two respective sets of groups. Vertices that remain uncolored at the end of the algorithm form an independent set in the graph and can be assigned a neutral color zero. Coleman et al. [66] propose a model that exploits symmetry. This model is called path coloring and requires that every pair of adjacent vertices get distinct colors, and that every path on four vertices uses at least three colors. McCormick [67] introduces a graph coloring model for the computation. The model uses the adjacency graph representation of the underlying symmetric matrix and requires that in every path u, v, w in the graph, vertices u, v, and w receive distinct colors.
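The condition that every path u, v, w receives three distinct colors can be checked directly. The sketch below is only an illustration of that condition, not code from [67]; the function name `is_distance2_coloring` and the three-vertex path used as an example are assumptions.

```python
from itertools import combinations


def is_distance2_coloring(adj, color):
    """Check that every path u, v, w in the graph uses three distinct colors.

    adj maps each vertex to the set of its neighbors (undirected graph);
    color maps each vertex to its color.
    """
    for v, neighbors in adj.items():
        # The middle vertex v must differ from each of its neighbors.
        if any(color[v] == color[u] for u in neighbors):
            return False
        # Any two neighbors u, w of v are the end points of a path u, v, w.
        for u, w in combinations(neighbors, 2):
            if color[u] == color[w]:
                return False
    return True


# Hypothetical example: the path a - b - c.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
proper_only = {"a": 0, "b": 1, "c": 0}  # proper, but a and c share a color
distance2 = {"a": 0, "b": 1, "c": 2}    # all three vertices differ
```

Here `is_distance2_coloring(path, proper_only)` is false although the coloring is proper in the usual sense, while `is_distance2_coloring(path, distance2)` is true.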
We distinguish probabilistic approaches [68,69], supervised machine learning approaches [71,72,75], and unsupervised machine learning approaches [70,73,74,76]. There are also solutions for duplication and similarity detection in graph data [77][78][79][80]. Dong et al. [77] present duplicate detection in a scenario where relationships between publications, persons, etc., form a graph. At each iteration, the first pair in the priority queue is retrieved, compared, and classified as nonduplicate or duplicate. The presented algorithm gradually enriches references by merging attribute values. Kalashnikov et al. [78] propose a domain-independent data cleaning approach for a graph of entities. The presented algorithm uses clustering techniques. Yin et al. [79] present linkage-based clustering, in which the similarity between two objects is measured based on the similarities between the objects linked to them. Bhattacharya et al. [80] propose another algorithm that evaluates the similarities of candidate pairs at each iterative step and selects the most similar pair at each iteration. The algorithm augments a general class of attribute similarity measures with relational similarity among the entities. Duplicates are merged together before the next iteration, so that in effect clusters of candidates are compared. This merge updates the reference graph and the priority queue.
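The iterative scheme described above (take the best pair from a priority queue, classify it, merge duplicates, update the queue) can be sketched as follows. This is a simplified illustration under our own assumptions and not the algorithm of [77] or [80]: it uses Jaccard similarity on attribute-value sets with a fixed threshold, ignores relational similarity, and the names `jaccard`, `merge_duplicates`, and the sample records are hypothetical.

```python
import heapq
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two attribute-value sets."""
    return len(a & b) / len(a | b) if a | b else 0.0


def merge_duplicates(records, threshold=0.5):
    """Repeatedly merge the most similar pair of clusters.

    records maps a record id to a set of attribute values. Returns a dict
    mapping each cluster (a tuple of record ids) to its merged values.
    """
    clusters = {(rid,): set(vals) for rid, vals in records.items()}
    # heapq is a min-heap, so similarities are stored negated.
    heap = [(-jaccard(clusters[a], clusters[b]), a, b)
            for a, b in combinations(sorted(clusters), 2)]
    heapq.heapify(heap)
    while heap:
        neg_sim, a, b = heapq.heappop(heap)
        if a not in clusters or b not in clusters:
            continue  # stale entry: one side was merged earlier
        if -neg_sim < threshold:
            break  # the best remaining pair is classified as nonduplicate
        merged = tuple(sorted(a + b))
        values = clusters.pop(a) | clusters.pop(b)  # enrich by merging values
        for other in clusters:
            heapq.heappush(heap, (-jaccard(values, clusters[other]), merged, other))
        clusters[merged] = values
    return clusters


# Hypothetical records: r1 and r2 share two of their four distinct values.
records = {
    "r1": {"J. Smith", "graph databases", "2015"},
    "r2": {"J. Smith", "graph databases", "2016"},
    "r3": {"A. Jones", "timetabling", "2009"},
}
result = merge_duplicates(records)
```

Because merged clusters are pushed back into the queue with their enriched value sets, later comparisons are effectively made between clusters rather than between individual records.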
On the other hand, we distinguish among approaches for deduplication: active learning methods [81,82], clustering methods [83], and graph algorithms [84]. Sarawagi et al. [81] show how machine learning techniques can be applied to the elimination of redundant data when training data are available. Georgescu et al. [82] propose an approach for deduplication whose main advantage is the use of crowdsourcing as a training and feedback mechanism. Culotta et al. [83] propose a conditional random field model of duplicate removal that captures relational dependencies, and then employ a relational partitioning algorithm to jointly deduplicate data. Zhou et al. [84] show how to reduce the number of slow synchronization operations needed in parallel graph search.

Conclusions and future work
Reducing vertices is an important issue in graph databases. In this article, we outlined a way of reducing the nodes of a property graph with the use of a graph vertex coloring method. We also presented related work on property graphs and graph coloring. Finally, we presented experiments that showed great potential for the presented approaches.
We expect the proposed approach to evolve and to include a wider set of coloring methods in its future versions. We showed that by reducing the number of vertices with one property, we can significantly increase the efficiency of working with graph databases that store RDF data, with an average speed-up of 24.99 times. However, the speed-up depends significantly on the choice of graph database and coloring method. Our initial research showed large differences in the speed-up, even when we focus on a single request in different graph databases.
Therefore, as part of future work, we will consider various graph coloring algorithms to find the ones that guarantee the best acceleration in the known graph databases. Furthermore, we will try to indicate which ones are best suited to each database. These studies will require not only coloring algorithms with the best approximation, but also heuristics that do not have constant approximation guarantees. Moreover, future work will focus on property graph partitions.