Abstract
The issue of data quality has emerged as a critical concern, as low-quality data can impede data sharing, diminish intrinsic value, and result in economic losses. Current research on data quality assessment primarily focuses on four dimensions: intrinsic, contextual, presentational, and accessibility quality, with intrinsic and presentational quality mainly centered on data content, and contextual quality reflecting data usage scenarios. However, existing approaches lack consideration for the behavior of data within specific application scenarios, which encompasses the degree of participation and support of data within a given scenario, offering valuable insights for optimizing resource deployment and business processes. In response, this paper proposes a data contribution assessment method based on maximal sequential patterns of behavior paradigms (DecentralDC). DecentralDC is composed of three steps: (1) mining the maximal sequential patterns of sharing and exchange behavior paradigms; (2) determining the weights of these paradigms; (3) calculating the contribution of sharing and exchange databases combined with data volume. To validate our approach, two sharing and exchange scenarios of different scales are established. The experimental results in two scenarios validate the effectiveness of our method and demonstrate a significant reduction in cumulative regret and regret rate in data pricing due to the introduction of data contribution. Specifically, compared to the most competitive baseline, the improvements of mean average precision in two scenarios are 6% and 8%. The code and simulation scenarios have been open-sourced and are available at https://github.com/seukgcode/DecentralDC.
Citation: Ke W, Liu Y, Wang J, Fang Z, Chi Z, Guo Y, et al. (2024) DecentralDC: Assessing data contribution under decentralized sharing and exchange blockchain. PLoS ONE 19(10): e0310747. https://doi.org/10.1371/journal.pone.0310747
Editor: Emanuele Crisostomi, Università di Pisa, ITALY
Received: February 18, 2024; Accepted: September 2, 2024; Published: October 24, 2024
Copyright: © 2024 Ke et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data for this study are publicly available from the GitHub repository (https://github.com/seukgcode/DecentralDC).
Funding: This work was supported by National Science Foundation of China (Grant Nos.62376057) and the Start-up Research Fund of Southeast University (RF1028623234). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have no competing interests.
Introduction
With the rapid development of artificial intelligence, the Internet of Things, and communication technology, enterprises, organizations, and individuals in various industries have accumulated substantial data resources and formed their own data assets [1–3]. DeMedeiros et al. [4] highlight the extensive use of AI and IoT in various sectors, impacting data accumulation and usage. Abiodun et al. [5] discuss the transformative role of IoT across industries, affecting how data resources are accumulated and managed. Tariq et al. [6] provide insights into the rapid development and implications of IoT technology on data generation and security.
These data assets are vital for social production and operational activities, driving technological advancements and industrial upgrades [7, 8]. Kato et al. [9] discuss the critical role of IoT in enhancing industrial and technological processes, aligning with the notion of data assets being crucial for development. Affia et al. [10] highlight the importance of IoT in healthcare, underlining the vital role of data in improving healthcare services. Mazhar et al. [11] emphasize how IoT, coupled with AI, is crucial for addressing operational challenges in various sectors. Currently, the issue of data quality [12, 13] has gained prominence. Low-quality data can significantly diminish the intrinsic value of data, obstruct data sharing and exchange, and even lead to substantial economic losses. Elouataoui et al. [14] detail the adverse impacts of poor data quality, including economic repercussions, and underscore the critical need for high-quality data for effective organizational decision-making. Cai et al. [15] highlight the difficulties in managing data quality due to the variety and volume of big data, which complicates integration and impacts economic outcomes. Merino et al. [16] emphasize how low-quality data in real-time systems can lead to inefficiencies and financial losses, stressing the importance of quality control. Wamba et al. [17] illustrate that poor data quality undermines firm performance by affecting decision-making and operational efficiency. In this light, the integration of blockchain technology, as extensively detailed in references [18–20], offers a formidable solution by enhancing the security and transparency of data transactions. For instance, Koshiry et al. [18] explore blockchain’s role in credential verification within the educational sector, ensuring the accuracy and reliability of academic records. Moreover, the knowledge graph and natural language processing technologies discussed in [19] can significantly enhance the management and retrieval of educational resources, proving essential for modern e-learning platforms. Furthermore, the concerns about data security in public networks, highlighted in [20], underscore the necessity of robust security measures, such as those provided by blockchain, to safeguard sensitive data from unauthorized access. Therefore, data quality assessment is critical to data utilization and the full development of data value.
Research on data quality commenced in the 1970s, yielding significant findings to date. Ehrlinger et al. [21] discuss the evolution and ongoing significance of data quality research, highlighting how the field has developed since the 1970s with a focus on both theoretical and practical applications. The initial step in data quality evaluation is defining the dimensions of assessment. In accordance with scholarly investigations [22–28] (Fig 1), the prevailing focus of attention rests upon four principal categories of quality dimensions: intrinsic quality, contextual quality, presentational quality, and accessibility quality. Taleb et al. [29] develop a framework that validates the critical importance of intrinsic, contextual, presentational, and accessibility dimensions for robust data quality management across various sectors. Intrinsic quality [30] focuses on the essential characteristics of data, which are the basic requirements of data quality, such as integrity, accuracy, and consistency. Contextual quality [31] is closely related to data usage scenarios and encompasses considerations like timeliness, task relevance, and applicability. The contextual quality [32] of the same data may differ across application scenarios. Presentational quality [27] reflects the degree of understandability and conciseness of data. Accessibility quality focuses on the extent to which data can be obtained, and sometimes it also considers factors such as confidentiality and security.
To sum up, intrinsic quality and presentational quality evaluate data quality only from the aspect of data content. Accessibility quality reflects the difficulty of data acquisition. Only contextual quality remains intricately connected to the specific data application scenario. Nevertheless, the primary emphasis continues to be placed on data content itself, particularly with regard to its alignment with the predetermined task, adherence to temporal requirements, and overall accessibility. In essence, the current data quality dimensions mainly focus on data content. However, in a specific application scenario, in addition to the content, the behavior between data and the application scenario (referred to as data behavior) should also be considered. Data behavior can reflect the degree of participation and support of data in the current scenario or task, as well as the role and value of the data. Wang et al. [33] explore how data quality varies according to its use and context, affecting operational efficiency and decision-making. We define this metric as a new contextual quality dimension, data contribution. Data contribution can often provide reference information and a basis for application system resource deployment optimization, business process optimization, and reasonable data pricing [34, 35]. Therefore, exploring how to scientifically and reasonably evaluate data contribution holds significant practical importance.
At present, there are few similar evaluation studies. Our previous work [36] proposes a business scenario-driven data contribution evaluation method based on data access behavior for a single microservice system running independently. The method considers the importance of business scenarios and the corresponding data access behavior information when evaluating the contribution. The importance weights of business scenarios are determined by the analytic hierarchy process, and the data access behavior information is obtained by a distributed tracing tool. The two are combined by formalizing data access behavior through behavior operators, so that the importance of business scenarios is expressed in terms of data contribution, and the maximum vector similarity is then used to transform the problem into a nonlinear programming problem. By solving this optimization problem, the contribution of the business-related data tables (databases) in the system to the upper business scenarios is evaluated.
Data contribution is a contextual quality dimension based on data behavior [37, 38]. At present, data, as a critical production factor and material, has been deeply integrated with all aspects of social life. On the one hand, data behavior provides support for upper business scenarios within the application system [39]. On the other hand, in order to give full play to the value of data and promote industrial collaboration and innovation, data behavior widely exists between different organizations and application systems to support data sharing and exchange [40]. In our first work, we studied the data contribution evaluation method based on data access behavior within a single application system, represented by the microservice system. Therefore, this paper studies the data contribution evaluation method based on data sharing and exchange behavior between different application systems.
The fundamental prerequisite for realizing the full potential of data through sharing and exchange is the assurance of data privacy and security. Departing from the traditional centralized sharing and exchange model, which requires storing shared data in third-party data centers or cloud platforms, the decentralized sharing and exchange model, based on blockchain technology, has gradually become the dominant paradigm. This shift is attributable to its inherent features of openness, tamper-proof design, auditability, and traceability [36]. Consequently, this decentralized approach has found extensive applications in diverse sectors, including government affairs, enterprises, and healthcare [41–44]. According to the blockchain industry survey in 2023 [45], 35% of domestic organizations have applied blockchain technology in the digital identity field, and 74% of the units are actively exploring the direction of Web 3.0 and digital assets. By the end of 2022, there had been over 1,500 blockchain application cases. The openchain network released by ANT GROUP includes 26 nodes, with more than 1 billion transactions on the chain [46]. Blockchain technology is significantly shaping the future of digital identity and asset management, with a growing number of applications and widespread interest in the potential of Web 3.0. Therefore, this paper chooses the decentralized sharing and exchange mode based on blockchain as the research scenario. The characteristics of this scenario can be summarized as “authorization on the chain, sharing and exchange off the chain” [47, 48]. Fig 2 shows the general process, which mainly includes the release of shared data on the chain, request initiation, authorization verification, data encryption and transmission, acquisition of calculation results, and the recording of sharing and exchange metadata on the chain. Specifically, the metadata of the data sharing and exchange behavior between the participants is recorded on the blockchain in timestamp order, usually including the timestamp, requester, provider and corresponding database, data digest hash, data volume, etc. It is worth noting that, in contrast to the completely open and untrusted scenarios of digital currencies such as Bitcoin and Ethereum, the identities of participants engaged in data sharing and exchange are mostly known. This implies that it is not a completely untrusted scenario, but rather involves data sharing and exchange within specific contexts, such as various government departments or subsidiaries within a corporate group. Consequently, these scenarios often utilize consortium blockchains [49] to facilitate data sharing and exchange among participants, and use traditional consensus algorithms such as Raft [50] and PBFT [51] to reach consensus. Elouataoui et al. [52] discuss the integration of advanced data sharing technologies like blockchain, which enhance data integrity and trust, indirectly supporting the use of these technologies in ensuring data quality and security.
In this paper, we consider two kinds of sharing and exchange behaviors. The first type involves the temporary sharing of data usage rights. For instance, in the open sharing scenario of government data, a provident fund management entity might request access to current tax and credit information of customers. The second type pertains to multi-party computing scenarios. For instance, in a corporate group setting, four subsidiaries may each contribute a portion of their own data to participate in a computational task. Upon completion of the task, each party receives its respective computation results. Furthermore, since sharing and exchange behaviors involving the same participating databases are linked to those specific databases in the assessment framework, our study treats all sharing and exchange behaviors involving the same databases as the same type of behavior paradigm. The shared data directory on the blockchain records information such as the databases available for sharing and the associated access permissions provided by various organizations. It encompasses all data sharing and exchange behavior paradigms that may occur among participants.
From the perspective of real-world scenarios, the data sharing and exchange behaviors among different application systems mentioned above are often driven by the collaborative business between them. Therefore, there typically exists a certain logical order and dependency relationship between the sharing and exchange behavior paradigms belonging to the same collaborative business. The logical order and dependency relationships reveal the varying degrees of influence and importance among different behavior paradigms. Consequently, they can affect the magnitude of the contributions of the associated databases. We aim to obtain these relationships from the metadata of sharing and exchange behavior on the blockchain as a basis for evaluating the contribution of each database. However, in a consortium blockchain system, the set of data sharing and exchange behavior paradigms is fixed and limited, and the same behavior paradigm usually belongs to multiple different collaborative businesses. At the same time, the metadata of sharing and exchange behaviors is recorded chronologically, and the concurrent execution of multiple collaborative businesses means that consecutive sharing and exchange behaviors within a short period may belong to different collaborative businesses. Moreover, the sharing and exchange behaviors of complex collaborative businesses usually involve different data-requesting parties. These characteristics make it challenging to mine the dependencies between sharing and exchange behavior paradigms. In addition, quantifying the contribution of each participating database according to the dependencies between behavior paradigms and the data sharing and exchange behavior information is also a problem to be solved in this paper.
To solve the above problems, we propose a data contribution evaluation method based on maximal sequential patterns of behavior paradigms (DecentralDC). The method is divided into three stages. Firstly, the data behavior metadata sequence recorded on the blockchain is preprocessed to obtain the behavior paradigm sequence set, followed by mining maximal sequential patterns of behavior paradigms. Secondly, a behavior paradigm dependency graph is constructed based on the extracted maximal sequential patterns. This dependency graph is a weighted directed graph, and the importance weight of each behavior paradigm can be obtained through the importance evaluation algorithm of graph nodes. Lastly, leveraging the importance weights of behavior paradigms and the data volume of sharing and exchange behaviors, the contribution of databases participating in sharing and exchange over a specific period can be quantified. Our main contributions can be summarized as follows:
- For the decentralized sharing and exchange scenario of multiple application systems based on blockchain, a data contribution evaluation method based on behavior paradigm maximal sequential patterns is proposed.
- The importance weight of the behavior paradigm is determined by the behavior paradigm dependency graph, and the data volume of sharing and exchange behavior is taken into account when quantifying the data contribution.
- Through experiments conducted in two scenarios of different scales, this paper verifies the effectiveness of the proposed method and the positive impact of introducing data contribution on data pricing. Specifically, in terms of mining behavior paradigm dependencies, compared to the baseline methods, the F1 scores of our method in the two scenarios are on average higher by 0.15 and 0.11, respectively. Regarding the influence on data pricing, with the introduction of contribution, our method shows decreases of 43.38% and 61.8% in cumulative regret and regret rate, respectively, in Scenario #1 compared to the baseline method. In Scenario #2, the corresponding decreases are 35.46% and 59.8%, respectively.
Related work
While related work on data contribution is relatively limited, we can draw insights from mature methods in the field of node importance assessment. Node importance evaluation is a critical topic in network analysis, aimed at identifying key nodes within a network. These methods are typically categorized into three types: iterative-based algorithms, path-based algorithms, and adjacency-based algorithms. Each category assesses node importance from a different perspective and with different techniques, so each has its own advantages and disadvantages. Table 1 compares the strengths and weaknesses of these methods.
Relative to traditional methods (iterative algorithms, path-based algorithms, and adjacency-based algorithms), our DecentralDC method offers unique advantages as follows:
- Regarding Iterative Algorithms: Our DecentralDC method resolves issues of high computational complexity and poor convergence by employing maximal sequential pattern mining, eliminating the need for repeated network traversals and significantly boosting efficiency.
- Regarding Path-Based Algorithms: Unlike traditional methods that rely on static paths and struggle with large-scale networks, DecentralDC uses behavioral pattern analysis to uncover complex and dynamic inter-node dependencies, accurately reflecting the real significance of nodes in dynamic network environments.
- Regarding Adjacency-Based Algorithms: By constructing behavior paradigm dependency graphs and utilizing maximal sequential pattern mining, our DecentralDC transcends the limitations of focusing solely on local neighbor information, enabling a comprehensive global analysis of node dependencies.
Through these enhancements, DecentralDC markedly improves both the accuracy of node importance assessments and the adaptability of networks. It proves particularly effective in data sharing and exchange scenarios, providing new avenues for network analysis research.
Preliminaries
This section introduces the prior knowledge needed in this paper, mainly related to the theory of sequential pattern mining and data pricing.
Sequential pattern mining
Sequential pattern mining [62] is an important research direction in the field of data mining, which aims to mine frequent subsequence patterns in a sequence database. It extends the original association rules [63] by taking time or space dimension information into account. It has been widely used in user behavior pattern analysis [64], DNA sequence analysis [65], traffic warning [66], etc. The basic concepts involved in sequential pattern mining are as follows.
Itemset. A set consisting of one or more items is called an itemset, which can be expressed as I = {i1, i2, …, in}. The items in the itemset cannot be repeated, have no order, and can usually be arranged in lexicographic order.
Sequence & sequence database. An ordered arrangement consisting of one or more itemsets is called a sequence, which can be expressed as S = <s1, s2, …, sn>, where si (1 ≤ i ≤ n) is an itemset; itemsets in a sequence are also called elements or transactions. The same elements can be repeated in a sequence. The sequence length is defined as the number of elements contained in the sequence; the sequence size is defined as the number of items contained in the sequence (counting repeated elements and items cumulatively). For example, the length of the sequence <{1, 4}, {3}, {1, 4}> is three and its size is five. A set consisting of one or more sequences is called a sequence database and is denoted as SDB = [S1, S2, …, Sn].
Subsequence. The sequence subseq = <sub1, sub2, …, subn> is a subsequence of another sequence seq = <s1, s2, …, sm> if there exist integers 1 ≤ j1 < j2 < … < jn ≤ m such that sub1 ⊆ sj1, sub2 ⊆ sj2, …, subn ⊆ sjn (denoted as subseq ⊆ seq).
Continuous subsequence. Given a sequence seq = <s1, s2, …, sn> and a subsequence c, c is a continuous subsequence of seq if any of the following holds: c is obtained by deleting items in s1 or sn; c is obtained by deleting one item of an element si (1 ≤ i ≤ n) containing at least two items in seq; or c is a continuous subsequence of c′, and c′ is a continuous subsequence of seq.
Support. The support of a sequence seq in a sequence database SDB is defined as the proportion of sequences containing seq in the whole sequence database and is denoted by supSDB(seq).
Frequent sequential pattern. For a given sequence database SDB and a specified minimum support threshold min_sup, all sequences in the set {seq | supSDB(seq) ≥ min_sup} are called frequent sequential patterns. If the size of a frequent sequential pattern is k, the sequence is called a k-frequent sequential pattern. Any subsequence of a frequent sequential pattern is also a frequent sequential pattern, and any supersequence of an infrequent sequence is not a frequent sequential pattern.
Maximal sequential pattern. A frequent sequential pattern is a maximal sequential pattern iff any supersequential pattern of the sequential pattern is not a frequent sequential pattern. Maximal sequential patterns are compact representations of frequent sequential pattern sets whose size is typically several orders of magnitude smaller than the set of frequent sequential patterns [67].
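To make the definitions above concrete, the following minimal Python sketch (not the mining algorithm used later in the paper) illustrates subsequence containment, support computation, and the filtering of a pattern set down to its maximal members; the toy sequence database is hypothetical.

def is_subsequence(sub, seq):
    """True if every itemset of `sub` is contained, in order, in `seq`."""
    j = 0
    for itemset in sub:
        while j < len(seq) and not set(itemset) <= set(seq[j]):
            j += 1
        if j == len(seq):
            return False
        j += 1
    return True

def support(pattern, sdb):
    """Fraction of sequences in the sequence database that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in sdb) / len(sdb)

def maximal(patterns):
    """Keep only patterns that are not subsequences of another pattern."""
    return [p for p in patterns
            if not any(p is not q and is_subsequence(p, q) for q in patterns)]

sdb = [[{1, 4}, {3}, {1, 4}], [{1}, {3}, {4}], [{1, 4}, {3}]]
print(support([{1, 4}, {3}], sdb))                    # 2/3 of the sequences contain it
print(maximal([[{1}], [{1}, {3}], [{1, 4}, {3}]]))    # only [{1, 4}, {3}] remains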
Data pricing
A data market is a centralized environment and platform where data transactions occur [68, 69]. A well-functioning data market necessitates the rational pricing of data. The stakeholders involved in data transactions typically include providers, consumers, and brokers. Data providers are individuals or organizations that offer data, and the quality and value of the data they provide determine the trading price and popularity of the data. Data consumers are individuals or entities that need data, selecting suitable data based on their requirements and paying the corresponding price. Data brokers are third-party intermediaries that provide data services, offering technical and service support and reasonable data pricing to promote transactions between data providers and consumers. In decentralized scenarios, the intermediary role of data brokers is eliminated: data providers and consumers engage in peer-to-peer (P2P) data transactions, and data pricing is carried out directly by the data providers.
Market value. Consider a general online data market where various consumers sequentially submit purchase requests over time. These requests can be regarded as a multi-round online transaction sequence [70]. In round t (t = 1, 2, …), the characteristics of the transaction data can be described by a feature vector ht ∈ Rk, where the actual market value of the data exists objectively but is unknown. However, the estimated market value vt can be defined as a function of the feature vector, i.e., vt = f(ht) + δt, where f: Rk → R represents the deterministic part mapping the feature vector ht to its value, and δt is a non-negative random variable, assumed to be i.i.d. drawn from a zero-mean distribution over the set of real numbers R [71–73]. For computational simplicity, we can hypothesize f to be a linear function and remove the influence of the randomness δt. Consequently, the market value can be simplified as vt = θ⊤ht, where θ is the weight vector of the linear model [74].
Cost of data. The cost of data, denoted as costt, consists of the expenses incurred in processes such as collection, processing, production, and tracking [68]. To simplify the calculation, the data costs can generally be regarded as the privacy compensation generated by the data provider when collecting data, and it can be computed using the feature vector ht of the data. Moreover, costt is typically not higher than the data’s posted price pt to ensure non-negative data utility. Only when the posted price pt exceeds costt can the data provider attain profits [75].
The target of data pricing. During online transactions, the regret Rt at round t [76] is defined as the gap between the data provider’s maximum expected profit and their actual profit: (1) Rt = (pt* − costt)1{pt* ≤ vt} − (pt − costt)1{pt ≤ vt}, where 1{pt ≤ vt} is an indicator function: when pt ≤ vt, its value is 1; otherwise, it is 0. (pt − costt)1{pt ≤ vt} is the data provider’s expected profit when the posted price is pt, and pt* is the optimal posted price that maximizes the profit at round t. If the best posted price is vt (i.e., pt* = vt), then the regret Rt becomes [77]: (2) Rt = (vt − costt) − (pt − costt)1{pt ≤ vt}.
Generally, the target of data pricing is to minimize the cumulative regret over the data transactions: (3) Regret = ∑t=1..T Rt, where Rt is the regret at round t and T is the total number of transaction rounds.
Principles of data transactions. Based on the objectives of data pricing, the following principles can be derived [78–80]:
If costt > vt, the transaction will not occur, resulting in zero regret.
If costt < vt and pt > vt, the regret is vt (in this scenario, the consumer rejects the posted price, indicating a high likelihood of non-occurrence of the transaction and causing substantial regret).
If costt < vt and pt < vt, the transaction will occur, resulting in a regret of vt − pt.
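As a concrete illustration, the short Python sketch below takes the three transaction principles literally (a toy reading of the rules above, not the paper’s pricing model) and returns the per-round regret for a few hypothetical cost, value, and posted-price combinations.

def round_regret(cost_t, v_t, p_t):
    if cost_t > v_t:          # principle 1: no worthwhile transaction, zero regret
        return 0.0
    if p_t > v_t:             # principle 2: consumer rejects the posted price
        return v_t
    return v_t - p_t          # principle 3: transaction occurs, but below market value

print(round_regret(cost_t=3.0, v_t=10.0, p_t=12.0))   # 10.0 (overpriced, sale lost)
print(round_regret(cost_t=3.0, v_t=10.0, p_t=8.0))    # 2.0  (sold, but under-priced)
print(round_regret(cost_t=12.0, v_t=10.0, p_t=8.0))   # 0.0  (cost exceeds value)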
Posted price. For the posted price in round t, the upper bound and the lower bound can be set to the upper and lower bounds of the estimated market value vt, i.e., p̄t = maxθ∈κt θ⊤ht and p̱t = minθ∈κt θ⊤ht, where κt is the knowledge set of θ in round t. There are two methods to choose the appropriate posted price based on p̄t and p̱t [77, 81]:
Exploitation price. The exploitation price refers to giving more weight to the known optimal price at each decision-making instance. It selects previously validated high-yield choices. Opting for the exploitation price as the posted price does not update the knowledge set of θ (i.e., κt+1 = κt). Exploitation pricing primarily emphasizes immediate returns and is typically determined as the lower bound p̱t of the estimated market value.
An effective pricing mechanism strives to strike a balance between selecting exploration prices and exploitation prices to achieve maximum revenue. When the difference between p̄t and p̱t exceeds a predefined threshold ϵ, the exploration price is chosen to update the knowledge set of the parameters θ. When p̄t − p̱t ≤ ϵ, the exploitation price is preferred to ensure successful transactions and attain higher economic gains.
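The following sketch shows one way such a switch could be implemented; the midpoint exploration price and the variable names p_low and p_high (standing for the bounds derived from the knowledge set κt) are illustrative assumptions rather than the paper’s exact rule.

def choose_posted_price(p_low, p_high, eps):
    """Return (posted_price, explore_flag) for the current round."""
    if p_high - p_low > eps:
        # Uncertainty about the market value is still large: post an exploratory
        # price so the accept/reject outcome refines the knowledge set.
        return (p_low + p_high) / 2.0, True
    # Uncertainty is small: exploit with the conservative lower bound,
    # which makes a successful transaction very likely.
    return p_low, False

print(choose_posted_price(p_low=6.0, p_high=9.0, eps=1.0))   # (7.5, True)
print(choose_posted_price(p_low=6.0, p_high=6.5, eps=1.0))   # (6.0, False)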
Problem definition
As shown in Fig 3, we are given a decentralized data sharing and exchange scenario FS = (P, Bx) based on a consortium blockchain, where P = (P1, P2, …, Pr) represents the r participants in the scenario, while Bx = (bx1, …, bxm) represents the sequence of m data sharing and exchange behavioral metadata entries recorded on the blockchain over a given timeframe. Each behavioral metadata entry captures details such as the requester, provider, source database, and data volume associated with the respective data sharing and exchange event. We denote the number of databases of the i-th party participating in sharing and exchange as share_numi, i.e., ∑i=1..r share_numi = n, where n is the total number of databases participating in sharing and exchange in the scenario. The task of this paper is to evaluate the contribution vector y = (y1, …, yn) of these n databases to the data sharing and exchange between different application systems based on the behavior information recorded in Bx.
Methodology
Based on the above task, we propose a data contribution evaluation method based on maximal sequential patterns of behavior paradigms, and the framework is illustrated in Fig 4. The method is divided into three stages. In the first stage, we preprocess the original sharing and exchange behavior metadata sequence to obtain the behavior paradigm sequence set, and then identify maximal sequential patterns using a maximal sequential pattern mining algorithm. In the second stage, the behavior paradigm dependency graph is constructed based on the maximal sequential patterns. Each node represents a behavior paradigm. Two behavior paradigms that form a continuous subsequence of length two in a maximal sequential pattern are connected by a directed edge in the dependency graph, and the weight of the edge is the sum of the supports of all maximal sequential patterns containing this continuous subsequence. The degree centrality of each node is then used as the importance weight of the corresponding behavior paradigm. In the third stage, for the metadata sequence of sharing and exchange behaviors over a period of time, the contribution of each database participating in sharing and exchange is calculated and normalized by combining the weights of its behavior paradigms and the data volume of the sharing and exchange behaviors.
As shown in Algorithm 1, the main function orchestrates the complete data contribution evaluation process from start to finish, ensuring adherence to the theoretical constructs outlined in the research. The function initializes by obtaining a sequence of behavior metadata FS, which is pre-processed and analyzed through several stages to derive the final contribution scores of the databases involved in data sharing and exchange scenarios.
Algorithm 1 Main Function for Calculating Contribution Scores
1: Input: FS—Sequence of sharing and exchange behavior metadata
2: Output: x—Contribution scores for databases
3: // Step 1: Extract maximal sequential patterns from metadata
4: MSP ← MineMaximalPatterns(FS)
5: // Step 2: Construct dependency graph and determine weights
6: W ← ConstructDependencyGraphAndCalculateWeights(MSP)
7: // Step 3: Compute contribution scores using weights and metadata
8: x ← ComputeContributionScores(FS, W)
9: // Output the calculated contribution scores
10: Output x
Mining behavior paradigm maximal sequential patterns
The core of our method involves mining the dependencies between behavior paradigms and determining the weights of these paradigms. Since these dependencies are embedded in the maximal sequential patterns of behavior paradigms, the first step is to mine these maximal sequential patterns.
Constructing the set of behavior paradigm sequences.
To carry out the subsequent sequential pattern mining task, it is necessary to preprocess the original data sharing and exchange behavior metadata sequences to construct the behavior paradigm sequence set. The overall process is shown in Fig 5. Initially, we use the data provider and the database it provides as the identifier to differentiate sharing and exchange behavior paradigms, and assign the same behavior paradigm ID to each category of behaviors. As illustrated in the example of Fig 5, the behavior paradigm with ID 6 represents the category of sharing and exchange behaviors with provider “P2” and participating database “P2-db-3”. Then, two interval thresholds (PART_THR_Lower and PART_THR_Upper) are set to segment the behavior paradigm ID sequence. The segmentation process involves scanning each behavior chronologically. If the time interval between two adjacent behaviors is less than PART_THR_Lower, their behavior paradigm IDs form an itemset. If the interval exceeds PART_THR_Upper, the behavior paradigm sequence is split between the two behaviors, resulting in two separate sequences. Otherwise, the current paradigm ID is added as an individual itemset to the preceding adjacent sequence. After segmentation, several extended sequences of behavior paradigms are obtained from the original dataset. Finally, the behavior paradigm sequences are processed with a sliding window whose step size equals the window length, forming the final behavior paradigm sequence set.
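A minimal Python sketch of this preprocessing step is given below; the record fields (timestamp, provider, database), the threshold values, and the window size are illustrative assumptions, not the parameters used in the experiments.

from collections import OrderedDict

PART_THR_LOWER, PART_THR_UPPER, WINDOW = 5, 60, 4   # illustrative seconds / itemsets

def to_paradigm_sequences(records):
    """records: list of (timestamp, provider, database), sorted chronologically."""
    paradigm_ids = OrderedDict()                      # (provider, database) -> paradigm ID
    def pid(provider, db):
        return paradigm_ids.setdefault((provider, db), len(paradigm_ids) + 1)

    sequences, current = [], []
    for i, (ts, provider, db) in enumerate(records):
        gap = ts - records[i - 1][0] if i else None
        item = pid(provider, db)
        if gap is None or gap > PART_THR_UPPER:       # split into a new sequence
            if current:
                sequences.append(current)
            current = [{item}]
        elif gap < PART_THR_LOWER:                    # merge into the previous itemset
            current[-1].add(item)
        else:                                         # same sequence, new itemset
            current.append({item})
    if current:
        sequences.append(current)

    # Non-overlapping sliding window (step size equals window length).
    windowed = []
    for seq in sequences:
        windowed += [seq[i:i + WINDOW] for i in range(0, len(seq), WINDOW)]
    return windowed

records = [(0, "P2", "P2-db-3"), (2, "P1", "P1-db-1"), (30, "P3", "P3-db-2"),
           (200, "P2", "P2-db-3")]
print(to_paradigm_sequences(records))   # [[{1, 2}, {3}], [{1}]]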
Mining maximal sequential patterns of behavior paradigms.
As mentioned above, frequent sequential pattern mining aims to uncover frequent subsequence patterns within extensive sequences, revealing associative relationships among different itemsets. In this paper, the frequent sequences of behavior paradigms contain the logical order and dependencies between behavior paradigms. Note that the dependencies in our study refer only to direct dependencies, excluding transitive and indirect dependencies. However, due to the property that all subsequence patterns of frequent sequential patterns are also frequent, many of the dependencies obtained by typical mining algorithms are redundant or inaccurate. Specifically, continuous subsequences within frequent sequences contain redundant dependencies, while non-continuous subsequences may include erroneous dependencies. Consider the frequent sequence seq in Fig 6, which is assumed to depict a collaborative business process. The highlighted red 2-subsequences are inaccurate if they do not correspond to a direct dependency in any other collaborative business.
As a variant of frequent sequential pattern mining, maximal sequential pattern mining [82–84] can effectively solve the above problems. This approach preserves only the longest sequential patterns meeting the support criterion, substantially mitigating the redundancy of frequent sequential patterns and the erroneous dependencies. Moreover, longer behavior paradigm sequences can capture more complete and richer business logic. In this study, we use the VMSP (Vertical mining of Maximal Sequential Patterns) algorithm [67] of Professor Philippe Fournier-Viger’s team to mine maximal sequential patterns of behavior paradigms. This algorithm, identified as the fastest among their proposed maximal sequential pattern mining algorithms, operates on a vertical mining approach.
This process is integral to identifying and extracting maximal sequential patterns, which are pivotal in understanding the underlying collaborative business in data sharing. These patterns provide the foundation for further analysis in weight calculation. The pseudocode is shown in Algorithm 2.
Algorithm 2 MineMaximalPatterns Function
1: function MineMaximalPatterns(FS)
2: Input: FS—Sequence of sharing and exchange metadata
3: Output: MSP—Maximal Sequential Patterns
4: P ← ∅ ▹ Initialize an empty set for behavior paradigms
5: for each bx in FS do
6: Bx_paradigm_id ← identifyParadigm(bx.provider, bx.database)
7: P ← P ∪ {Bx_paradigm_id} ▹ Collect unique paradigms
8: end for
9: MSP ← MAXIMALPATTERNMINING(FS, P) ▹ Apply pattern mining
10: return MSP
11: end function
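To make the distinction from ordinary frequent patterns concrete, the small Python example below (an illustrative pattern, not taken from the experiments) shows why only adjacent pairs of a maximal pattern are kept as direct dependencies: from <{4},{2},{7}> we retain (4,2) and (2,7) but not the indirect pair (4,7).

def direct_dependencies(maximal_pattern):
    """Adjacent itemset pairs (length-2 continuous subsequences) of a pattern."""
    return [(maximal_pattern[i], maximal_pattern[i + 1])
            for i in range(len(maximal_pattern) - 1)]

print(direct_dependencies([{4}, {2}, {7}]))   # [({4}, {2}), ({2}, {7})]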
Determining the weight of the sharing and exchange behavior paradigms
After obtaining the maximal sequential patterns of behavior paradigms, we aim to quantify the weight of each behavior paradigm through the dependency relationships contained in them. This is divided into two steps: constructing the behavior paradigm dependency graph and evaluating the importance weights of the dependency graph nodes.
Constructing behavior paradigm dependency graph.
For the mined maximal sequential patterns of behavior paradigms, we construct a dependency graph to quantify the dependence strength between behavior paradigms and to visualize these dependence relationships. In the dependency graph, each behavior paradigm serves as a node, and every continuous subsequence of length two seq = <a, b> contained in any maximal sequential pattern forms a directed edge e = <a, b>. The weight we of edge e is calculated as follows: (4) we = ∑i=1..n sup(i) · isContained(e, i), where n represents the number of mined maximal sequential patterns, sup(i) denotes the support of the i-th maximal sequential pattern MSPi, and the function isContained(e, i) takes the value 1 or 0, indicating whether the sequence seq corresponding to the directed edge e is a continuous subsequence of length two of MSPi.
Fig 7 shows an example of constructing the behavior paradigm dependency graph. For the edge <4, 2>, since <{4},{2}> is a continuous subsequence of length two in both seq1 and seq3, w42 is the sum of the supports of seq1 and seq3, resulting in a total of 0.29. Similarly, the weight w27 of edge <2, 7> is the sum of the supports of seq2 and seq3, equating to 0.24.
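The short sketch below computes Eq (4) for a toy pattern set; the individual supports (0.15, 0.10, 0.14) and the pattern contents are illustrative values chosen only so that the edge totals match the ones quoted for Fig 7.

from collections import defaultdict

patterns = {                            # maximal pattern -> support (illustrative)
    ((4,), (2,), (5,)): 0.15,           # "seq1": contains <{4},{2}> but not <{2},{7}>
    ((2,), (7,)): 0.10,                 # "seq2": contains <{2},{7}>
    ((4,), (2,), (7,)): 0.14,           # "seq3": contains both
}

weights = defaultdict(float)
for pattern, sup in patterns.items():
    for a, b in zip(pattern, pattern[1:]):   # length-2 continuous subsequences
        weights[(a, b)] += sup               # Eq (4): accumulate supports per edge

print(round(weights[((4,), (2,))], 2))       # 0.29 -> weight of edge <4,2>
print(round(weights[((2,), (7,))], 2))       # 0.24 -> weight of edge <2,7>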
Determining the weight of behavior paradigm based on degree centrality.
The behavior paradigm dependency graph represents the dependence relationship and degree of dependence between behavior paradigms. Based on this graph, the weight of each behavior paradigm can be calculated by complex network node importance evaluation algorithms.
In contrast to the limitations of general degree centrality [85], which predominantly considers local neighbor information, degree centrality (DC) in our method reflects both the number of sequential patterns in which the current behavior paradigm appears and information about its neighboring nodes. Hence, it serves as a suitable evaluation metric for determining the weight of a behavior paradigm. Specifically, for a node v in the dependency graph, its degree centrality DC(v) can be defined as: (5) DC(v) = DC_in(v) + DC_out(v), where DC_in(v) and DC_out(v) are the in-degree centrality and out-degree centrality of node v, respectively: (6) DC_in(v) = ∑i w<neighbori, v>, (7) DC_out(v) = ∑j w<v, neighborj>, where neighbori and neighborj refer to all nodes with edges directly arriving at v and all nodes directly reachable from v, respectively.
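A minimal sketch of this computation (assuming, as in Eqs (6)–(7) above, that the in- and out-degree centralities are weighted sums of the incident edge weights) is shown below, reusing the illustrative edge weights from the previous example.

from collections import defaultdict

edges = {(4, 2): 0.29, (2, 7): 0.24, (2, 5): 0.15}   # (from, to) -> edge weight

def degree_centrality(edges):
    dc = defaultdict(float)
    for (a, b), w in edges.items():
        dc[a] += w        # contributes to the out-degree centrality of a
        dc[b] += w        # contributes to the in-degree centrality of b
    return dict(dc)

print(degree_centrality(edges))
# approximately {4: 0.29, 2: 0.68, 7: 0.24, 5: 0.15} -> node 2 carries the largest weight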
Constructing the dependency graph involves mapping the relationships between different behavior paradigms as extracted from the maximal patterns. The weights calculated from this graph quantify the influence or importance of each paradigm in the context of data sharing. The pseudocode is shown in Algorithm 3.
Algorithm 3 ConstructDependencyGraphAndCalculateWeights Function
1: function ConstructDependencyGraphAndCalculateWeights(MSP)
2: Input: MSP—Maximal Sequential Patterns
3: Output: W—Weights of behavior paradigms
4: G ← (V, E) where V ← ∅, E ← ∅ ▹ Initialize graph with no vertices or edges
5: for each pattern p in MSP do
6: for each consecutive subsequence (a, b) in p do
7: if ¬∃ edge (a, b) in E then
8: E ← E ∪ {(a, b)} ▹ Add edge if it does not exist
9: end if
10: update weight (a, b) based on sum of supports containing (a, b)
11: end for
12: end for
13: W ← calculateNodeCentrality(G) ▹ Calculate weights based on node centrality
14: return W
15: end function
Quantifying each database contribution
For a specific sharing and exchange behavior between different participants, the behavior paradigm it belongs to identifies the specific databases providing data, and the metadata also records the amount of data supplied by these databases. Our method quantifies the contribution of participating databases by considering both the data volume associated with the behavior and the importance weight of the corresponding behavior paradigm. Specifically, for a given sharing and exchange behavior metadata sequence Bx = (bx1, …, bxm) over a period of time, the contribution yk of the k-th participating database in the scenario is calculated as: (8) yk = ∑i=1..m DC(i) · PF(dv(i, k)) · isInvolved(k, i), where DC(i) represents the weight of the behavior paradigm corresponding to bxi, and dv(i, k) denotes the volume of data provided by the k-th participating database in behavior bxi. The function PF(⋅) discretizes the data volume proportionally into integers within the interval [1, rhigh]: (9) PF(dv) = max(1, ⌊dv / dvmax · rhigh⌋), where ⌊⋅⌋ is the round-down operator, rhigh is a predefined integer greater than 1, and dvmax represents the maximum data volume in the training dataset. The function isInvolved(k, i) indicates whether the current behavior bxi involves the k-th participating database, with its value being either 0 or 1. Thus, the contribution vector y = (y1, …, yn) of the n databases participating in sharing and exchange can be obtained, and we take the normalized y as the final contribution of these databases.
The final step involves calculating the contribution scores for each database based on the weighted paradigms and the volume of data they handled. This score quantifies the impact or value of each database’s participation in the sharing and exchange. The pseudocode is shown in Algorithm 4.
Algorithm 4 ComputeContributionScores Function
1: function ComputeContributionScores(FS, W)
2: Input: FS—Sequence of sharing and exchange metadata, W—Weights of behavior paradigms
3: Output: x—Contribution scores for databases
4: x ← Vector initialized to 0 ▹ Start with zero contribution
5: for each bx in FS do
6: db ← bx.database
7: volume ← bx.data_volume
8: paradigm ← bx.paradigm_id
9: x[db] ← x[db] + W[paradigm] × volume ▹ Compute weighted contribution
10: end for
11: Normalize x ▹ Ensure scores are proportionate and comparable
12: return x
13: end function
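A runnable Python sketch corresponding to Algorithm 4, together with one possible form of the PF discretization in Eq (9), is given below; the field names, the values of rhigh and dvmax, and the sum-to-one normalization are assumptions for illustration only.

from math import floor

R_HIGH, DV_MAX = 10, 1_000_000        # assumed discretization range and training maximum

def pf(dv):
    """Proportionally discretize a data volume into an integer in [1, R_HIGH]."""
    return max(1, min(R_HIGH, floor(dv / DV_MAX * R_HIGH)))

def contribution_scores(behaviors, weights):
    """behaviors: list of {paradigm_id, volumes={database: data_volume}} records."""
    scores = {}
    for bx in behaviors:
        w = weights[bx["paradigm_id"]]                 # weight of the behavior paradigm
        for db, volume in bx["volumes"].items():       # every database involved in bx
            scores[db] = scores.get(db, 0.0) + w * pf(volume)
    total = sum(scores.values()) or 1.0
    return {db: s / total for db, s in scores.items()} # normalized contribution vector

behaviors = [
    {"paradigm_id": 6, "volumes": {"P2-db-3": 400_000}},                     # usage-right sharing
    {"paradigm_id": 2, "volumes": {"P1-db-1": 150_000, "P3-db-2": 900_000}}, # multi-party computing
]
weights = {6: 0.24, 2: 0.29}
print(contribution_scores(behaviors, weights))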
Theoretical analysis
The evaluation of data contribution in decentralized sharing and exchange scenarios necessitates a robust theoretical foundation to accurately assess the influence and importance of different data behaviors. This section delves into the theoretical underpinnings that guide our proposed DecentralDC method.
Maximal sequential pattern mining and data contribution
Maximal Sequential Pattern Mining (MSPM) is used in data mining to identify subsequences in a dataset that occur frequently and are not contained in any larger pattern. This is critical for evaluating data contribution because it reveals key behavioral patterns in the data that play a significant role in data sharing and exchange.
To achieve maximal sequential pattern mining, we first define the concept of support, which is the frequency of a sequence in the dataset. Given a minimum support threshold minsup, a sequence X is considered frequent if its frequency in the dataset is not less than minsup: (10) sup(X) = |{T ∈ D : X ⊆ T}| / |D| ≥ minsup, where D is the set of all transactions (sequences) in the dataset, and T is a single transaction. Next, to identify maximal sequential patterns, we further check whether these frequent sequences are contained in larger frequent sequences. Only those frequent sequences that are not contained in any larger frequent sequence are considered maximal sequential patterns: (11) MSP = {X | sup(X) ≥ minsup and ∄ Y ⊃ X with sup(Y) ≥ minsup}.
This process ensures that only the most representative and informative patterns are retained, avoiding redundancy. In data contribution evaluation, this method effectively identifies the behavior patterns that contribute most to data sharing and exchange. Through maximal sequential pattern mining, we can extract the most informative patterns from the dataset. These patterns represent the most common and important behaviors in the data. Therefore, identifying these patterns helps us understand the overall dynamics of the data and evaluate the contribution of individual data behaviors. This method ensures the accuracy and efficiency of the analysis, focusing only on the most critical behavior patterns and avoiding the processing of redundant data.
Based on the above theory, we can apply maximal sequential pattern mining to data contribution evaluation. First, preprocess the sharing and exchange behavior data on the blockchain to obtain behavior paradigms and behavior paradigm sets. Then, use the maximal sequential pattern mining algorithm to extract maximal sequential patterns from the behavior paradigm set. The specific steps include:
- (1) Identifying sharing and exchange behavior paradigms based on data providers and databases.
- (2) Assigning behavior paradigm IDs based on timestamps, set time interval thresholds, and divide behavior paradigm ID sequences.
- (3) Generating the final behavior paradigm subsequence set using a sliding window.
Construction of probabilistic models and dependency graphs
Probabilistic graphical models (PGMs) use a combination of nodes and directed edges to clearly represent the conditional dependencies between data items. Quantifying these dependencies is crucial for evaluating data contribution. Bayesian networks, as a typical type of probabilistic graphical model, can be represented by the joint probability distribution as follows: (12) P(X1, X2, …, Xn) = ∏i=1..n P(Xi | Parents(Xi)).
In our method, a Bayesian network is constructed to represent the dependencies between data behaviors and sharing behaviors. By calculating conditional probabilities, we can quantify the strength of these dependencies: (13) P(Xj | Xi) = P(Xi, Xj) / P(Xi).
By constructing a Bayesian network, we can clarify the dependencies between data behaviors and quantify the strength of these dependencies. This enables us to identify and measure the influence of specific behaviors on other behaviors, thereby accurately evaluating the contribution of individual data behaviors. The use of Bayesian networks ensures a precise representation of dependencies and provides a systematic method to handle uncertainties and complex dependencies.
In the contribution evaluation process, frequent behavior patterns are converted into dependency graphs, where nodes represent these behavior patterns and edges represent the dependencies. The weights of the edges are determined by the conditional probabilities associated with these dependencies. To construct the dependency graph, each behavior pattern is represented as a node within the graph. Dependency relationships are illustrated as edges between nodes, where any consecutive subsequence of length two within a frequent behavior pattern is depicted as a directed edge. The weight of an edge is calculated based on the support of the maximal sequential patterns that include this consecutive subsequence: (14) we = ∑i=1..n sup(i) · isContained(e, i), where n is the number of mined maximal sequential patterns, sup(i) is the support of the i-th maximal sequential pattern MSPi, and isContained(e, i) indicates whether the subsequence corresponding to edge e is contained in MSPi.
Evaluating node influence in the graph
Degree centrality (DC) is a metric used to measure the importance of a node. The degree centrality of a node v can be expressed as the sum of its in-degree and out-degree centralities: (15) DC(v) = in_DC(v) + out_DC(v).
In the evaluation of data contribution, by calculating the degree centrality of each node, we can quantify its importance in data sharing and exchange.
Additionally, path-based centrality analysis calculates the total weight of the paths from one node to all other nodes, assessing each node’s importance and influence in the network: (16) PC(v) = ∑u≠v w(p*(v, u)), where p*(v, u) denotes the most probable path from v to u and w(⋅) is its weight.
By using degree centrality and path-based centrality analysis, we can identify the most influential nodes in the data network. These nodes are key in data sharing and exchange, and understanding their importance helps in optimizing data management resources and improving overall network efficiency. Degree centrality and path-based centrality analysis provide a systematic approach to quantifying node importance, ensuring accurate and consistent results.
The first step in contribution evaluation involves calculating the degree centrality of each node in the dependency graph of behavior patterns. This calculation helps to quantify the importance weight of each behavior pattern based on its connections within the graph. The second step is to calculate the path-based centrality by determining the weight sum of the paths from one node to all other nodes. This evaluation assesses the node’s propagation capability and influence within the network, providing a measure of its overall impact. The degree centrality is computed as follows: (17) in_DC(v) = ∑i w<neighbori, v>, (18) out_DC(v) = ∑j w<v, neighborj>, where in_DC(v) and out_DC(v) denote the in-degree centrality and out-degree centrality, respectively. Path-based centrality calculation involves determining the sum of the weights of the most probable paths from each node to all other nodes within the graph. This method quantifies the importance of nodes by evaluating the path weights. By calculating these path weights, the importance of each node can be assessed based on the cumulative influence it exerts through its connections to other nodes. This approach provides a comprehensive measure of node centrality within the network.
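The sketch below illustrates one possible reading of this path-based centrality: for each node, the best (maximum-product) path weight to every other reachable node is summed. The graph, the max-product interpretation of “most probable path”, and the weights are all illustrative assumptions.

from collections import defaultdict

edges = {(4, 2): 0.29, (2, 7): 0.24, (2, 5): 0.15}   # illustrative dependency graph

adjacency = defaultdict(list)
for (a, b), w in edges.items():
    adjacency[a].append((b, w))

def best_path_weights(src):
    """Maximum-product path weight from src to every reachable node."""
    best = {}
    stack = [(src, 1.0)]
    while stack:
        node, w = stack.pop()
        for nxt, edge_w in adjacency[node]:
            if w * edge_w > best.get(nxt, 0.0):
                best[nxt] = w * edge_w
                stack.append((nxt, w * edge_w))
    return best

def path_centrality(nodes):
    """Sum of best path weights from each node to all other reachable nodes."""
    return {v: round(sum(best_path_weights(v).values()), 4) for v in nodes}

print(path_centrality([4, 2, 7, 5]))
# node 4 reaches 2, 7 and 5: 0.29 + 0.29*0.24 + 0.29*0.15 ≈ 0.4031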
Experiments
In this section, we introduce the experimental datasets and parameter settings, the results and analysis of the main experiment and analysis experiments.
Experimental datasets and parameter settings
This section provides a detailed description of the experimental dataset construction method and the parameter settings.
Scenario settings.
To visually illustrate the specific application of the proposed protocol, we have detailed two core scenarios: data usage right sharing and multi-party computing. These scenarios are widely applicable across various industries such as healthcare [86], real estate [87], finance [88], transportation [89], education [90], and municipal services [91]. We have provided detailed case studies to further validate the authenticity and broad applicability of these scenarios. In the study of these two scenarios, we aim to demonstrate the most critical and common application patterns in modern data exchange and processing. The first scenario, data usage right sharing, represents the need for data access and permission management between government and enterprises, highlighting the importance of data privacy and security, which is crucial in a regulation-heavy modern society. The second scenario, multi-party computation, showcases how complex data analysis tasks can be performed while maintaining data privacy, which is particularly prevalent in finance, healthcare, and multinational corporations. By testing our methods in these two typical scenarios, we ensure the broad applicability and practical value of our research results in the real world.
- Scenario #1: Data Usage Right Sharing
This scenario involves the open sharing of government data. For example, in a government data open-sharing scenario, a public fund management unit may request other government data such as tax and credit information of the current customer. Such sharing is temporary and mainly revolves around the temporary granting of data access and usage rights between government departments. This sharing model often involves handling sensitive data, making privacy and data security the primary concerns. This mode allows data owners to maintain ownership while authorizing other users to use the data for a specified period. It is particularly applicable where time-limited access to sensitive data is necessary, under stringent compliance and security measures, such as temporary access to medical records or personal financial information.
1. Medical Data Sharing [86]: A healthcare provider in the United States employs blockchain technology to temporarily share patient health data with authorized research institutions for cardiovascular research. This system processes approximately one million authorized health record accesses annually, ensuring data security and compliance while supporting medical research needs.
2. Academic Record Management [91]: A university in the United States uses a blockchain platform to manage and certify academic achievements and qualifications, allowing educational institutions and employers to verify students’ academic records while ensuring data integrity and security. Since 2018, this system has verified over 50,000 academic records, demonstrating the effectiveness of this sharing mode in enhancing the security and efficiency of academic data management.
- Scenario #2: Multi-Party Computation
This scenario describes collaborative data analysis tasks among multiple subsidiaries within a group, such as jointly developing new drugs or conducting risk assessments. Multi-party computation technology is used to ensure that data is processed without decryption to protect the data privacy of all parties. Each subsidiary provides its data portion, and after completing the computational task, each party retrieves its respective results. This scenario requires efficient data collaboration mechanisms and data integration capabilities. This mode supports multiple participants in performing computational tasks while ensuring data privacy. It is suited for cross-institutional collaboration, such as supply chain management or financial market analysis, allowing stakeholders to make informed decisions without disclosing sensitive data, thereby protecting their data privacy.
1. Financial Risk Assessment [88]: An international banking consortium uses multi-party computing technology for transaction anomaly detection to prevent and identify fraudulent activities. Members of the consortium analyze transaction patterns collectively without sharing specific customer data. Approximately 500,000 transactions are analyzed daily, ensuring customer data privacy while enhancing the ability to monitor fraudulent activities.
2. Smart Traffic Management [89]: In Singapore, blockchain technology is used to manage and optimize the city’s traffic system, ensuring transparent data sharing and secure management. Through real-time data analysis, participants efficiently carry out urban planning and traffic management, processing over two million data transactions annually. This application showcases multi-party computing’s role in improving efficiency and response times in smart city and traffic management initiatives.
- Applications of Data Contribution Assessment and Systems
Data contribution assessment plays a crucial role in data sharing, directly affecting data pricing and revenue distribution, and motivating higher-quality data sharing.
1. Educational Data Analysis [90]: A university employs a blockchain platform to evaluate the contributions of teachers and students to educational resources, which then informs their incentives. This system allows the educational institution to distribute resources and rewards more equitably, encouraging more active participation in educational activities. Each semester, the platform processes over 10,000 records of educational activities. This case reflects how assessing individual data contributions can optimize the allocation and utilization of educational resources.
2. Waste Management [92]: In the UK, cities like London are using blockchain technology to enhance waste management processes. Blockchain helps in tracking the lifecycle of waste, ensuring proper disposal, and promoting recycling. Citizens who correctly segregate waste are rewarded with tokens that can be used for public services. This system not only improves waste management efficiency but also encourages public participation in environmental conservation efforts.
Scenario comparison.
As shown in Table 2, we compare the core differences between data usage right sharing and multi-party computation in terms of data providers and participants, data scale, interaction frequency, and behavior paradigm quantity.
Experimental parameter settings.
To validate our method in real sharing and exchange scenarios, we construct two simulation scenarios with different scales involving multiple participants on a consortium blockchain. The sharing and exchange metadata sequences recorded in the blockchain, denoted as Bx = (bx1, …, bxm), serve as our experimental datasets. The specific construction steps are as follows.
- Step 1. Set the number and name of participants, the total number of databases owned by each participant, and the directory of shared resources (including the number, name, metadata and access permissions of databases for each participant).
- Step 2. Define data sharing and exchange behavior paradigms that can occur between the parties in the consortium blockchain based on the shared resource directory. Two types of behavior paradigms are defined: data usage right sharing and multi-party computing. The data format and examples of behavior paradigms are shown in Table 3. The example of data usage right sharing indicates that P1, P2 and P4 can apply to use the data in the second database P3-db-2 of participant P3, and the example of multi-party computing illustrates that P1-db-1 and P3-db-2 can perform corresponding multi-party computing tasks.
- Step 3. Define a certain number of multi-party collaborative businesses, each composed of the behavior paradigms defined in Step 2 arranged in a particular order. We also ensure that there are no containment relationships between the behavior paradigm sequences of different businesses, and that each behavior paradigm sequence within a business contains no duplicate behavior paradigms.
- Step 4. Simulate the concurrent execution of the collaborative businesses defined in Step 3 through multithreading. Each thread executes a specific collaborative business and records the corresponding sharing and exchange behavior metadata onto the blockchain. To emulate varying levels of concurrency in real-world scenarios, the execution flow of each thread is designed as follows. Starting from time ts, the thread decides with a certain probability ps whether to execute the collaborative business; if it does not, it decides again after a short interval. Once the collaborative business is initiated, all sharing and exchange behaviors belonging to it are executed sequentially, each with a certain probability pe. After completing the collaborative business, the thread repeats the above process after a relatively long interval. The metadata of each completed sharing and exchange behavior is recorded on the blockchain. Finally, when all threads finish, the recorded metadata of all behaviors is sorted chronologically, resulting in the dataset Bx = (bx1, …, bxm). The format and examples of bxi are illustrated in Table 4. The behavior sequence number serves as the unique identifier for the behavior, while the data volume field contains discretized values ranging from integers 1 to 10. For the behavior with sequence number 151 in Table 4, the data volume field contains multiple elements, indicating a multi-party computing behavior.
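To make Step 4 concrete, the following minimal Python sketch mimics the described thread logic; the probability values for ps and pe, the interval lengths, the toy business definitions, and the in-memory ledger standing in for the blockchain are illustrative assumptions rather than the actual open-sourced implementation.

```python
import random
import threading
import time

def run_business(business, ledger, lock, p_s=0.3, p_e=0.5,
                 short_interval=0.01, long_interval=0.05, rounds=20):
    """Repeatedly execute one collaborative business, as described in Step 4."""
    for _ in range(rounds):
        while random.random() > p_s:            # start the business with probability p_s
            time.sleep(short_interval)          # otherwise decide again after a short interval
        for behavior in business:               # behaviors of the business, in order
            while random.random() > p_e:        # each behavior executes with probability p_e
                time.sleep(short_interval)
            with lock:                          # record behavior metadata "on chain"
                ledger.append({"behavior": behavior,
                               "timestamp": time.time(),
                               "volume": random.randint(1, 10)})
        time.sleep(long_interval)               # relatively long wait before repeating

ledger, lock = [], threading.Lock()
businesses = [["share(P3-db-2 -> P1)", "mpc(P1-db-1, P3-db-2)"],   # hypothetical businesses
              ["share(P2-db-1 -> P4)"]]
threads = [threading.Thread(target=run_business, args=(b, ledger, lock)) for b in businesses]
for t in threads:
    t.start()
for t in threads:
    t.join()
ledger.sort(key=lambda r: r["timestamp"])       # chronological dataset Bx = (bx1, ..., bxm)
```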
To verify the efficacy of our method in the above two scenarios, our experiments include both a small-scale and a large-scale scenario. Specific parameter settings are detailed in Table 5. The function randint(a, b) in Table 5 represents the random generation of integers within a closed interval [a, b]. For Scenario #1 (data usage right sharing), considering that the data volume involved is usually small but involves sensitive information, a high level of privacy protection and a low interaction frequency are set. This setting helps simulate the characteristics of temporary data access and permission management among government departments. For Scenario #2 (multi-party computation), due to the involvement of complex data analysis and large data volumes, a larger dataset size and more frequent data exchanges are set. This setting better simulates multi-party computation scenarios common in industries such as finance and healthcare, which require efficient data processing and strong data privacy protection. For each experimental scenario, we set corresponding hyper-parameters to ensure that the experiments accurately simulate real-world data sharing and exchange behaviors.
To further illustrate the dataset features, the specific descriptions and explanations of the above data items are shown in Table 6.
Experimental data description.
Our experimental dataset covers two main scenarios: data usage right sharing and multi-party computation. The dataset scale and complexity vary based on the characteristics of these scenarios. In Scenario #1, there are fewer participants and simpler data exchanges, resulting in a smaller dataset size, reflecting the characteristics of actual government data processing. In Scenario #2, the complexity of collaborative tasks among multiple subsidiaries results in a significantly larger and more complex dataset to meet the highly dynamic and intricate data processing needs. Details of data description are illustrated in Table 7.
In the experiment, to ensure comprehensive learning of the dependency and importance weight of the sharing and exchange behavior paradigms, and to accurately evaluate data contribution, we divided the dataset Bx into a training set and a test set at a ratio of 4:1. Specifically, 80% of the data is used in the training phase. Through this training data, we can establish and optimize the dependency relationship model of the sharing and exchange behavior paradigms and calculate the importance weight of each paradigm. The remaining 20% of the data is reserved for the testing phase. Using this test data, we can validate the effectiveness of the model and ultimately output the data contribution evaluation results. This approach ensures consistency and reliability in model training and evaluation across different datasets. Details are illustrated in Table 8.
Main results
To further verify our method, we assess the contribution of each database in different scenarios through dedicated data contribution assessment experiments and compare our DecentralDC method with other baseline methods. By setting predetermined contribution rankings for different databases and controlling their occurrence probabilities during experimental data generation, we update the two datasets so that they embody certain data sharing and exchange behavior paradigms. Initially, we establish two different data sharing and exchange scenarios in a simulation environment and assign predetermined contribution rankings to the databases in each scenario. Databases with higher contributions have a higher probability of appearing during experimental data generation, while those with lower contributions have a lower probability. We then generate experimental data based on these preset rankings and occurrence probabilities, covering the roles and frequencies of the various databases in specific sharing and exchange behaviors. By analyzing the experimental results, we can validate the consistency between the preset database contribution rankings and the actually observed data usage. This method not only helps us understand the behavioral impact of different databases in a decentralized sharing and exchange model but also tests the effectiveness and accuracy of our data contribution assessment method in practical applications.
Experimental purpose and scenario definition.
This experiment aims to build a dataset with a certain data sharing behavior pattern by presetting different database contribution rankings and controlling their occurrence probabilities during data generation. The core objective is to validate the rationality of the database contribution rankings and the accuracy of the assessment method.
We define two different data sharing and exchange scenarios, each simulating a specific type of data exchange pattern: Scenario #1 simulates data sharing between government departments, while Scenario #2 simulates data exchanges between medical institutions. Based on prior data usage and contribution analysis, we assign a predetermined contribution ranking to the participating databases. Databases with higher contributions have a higher occurrence probability during data generation, indicating more frequent participation in data exchange activities. A probabilistic model is designed to adjust the occurrence frequencies of databases within the generated dataset according to their predefined contribution levels. This method ensures that databases with higher contributions are more active in the simulation, thus mimicking real-world behavioral patterns. Following the settings described above, a simulation program is employed to generate data encompassing the exchange activities of all participating databases, including timestamps, the databases involved, data items, and their access frequencies. Each scenario is generated independently to maintain the consistency and isolation of the scenario settings.
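As a concrete illustration of this probabilistic generation step, the short Python sketch below draws participating databases with probability proportional to their preset contribution; the database names and contribution values are hypothetical.

```python
import random

# Preset contribution levels (illustrative values): higher contribution means a
# higher probability of appearing in the generated exchange behaviors.
preset_contribution = {"P1-db-1": 0.9, "P2-db-1": 0.6, "P3-db-2": 0.3, "P4-db-1": 0.1}

def sample_databases(contribution, k):
    """Draw k databases, weighted by their preset contribution levels."""
    dbs = list(contribution)
    weights = [contribution[db] for db in dbs]
    return random.choices(dbs, weights=weights, k=k)

# Databases with higher preset contributions dominate the generated activities.
print(sample_databases(preset_contribution, k=10))
```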
Baselines and metrics.
In this experiment, we compare several algorithms to validate the performance differences between our DecentralDC method and existing baseline methods. The following algorithms are compared:
- Iterative Algorithms: including PageRank and Katz Centrality, these algorithms iteratively compute the importance of nodes, suitable for assessing node centrality in networks.
- Path-based Algorithms: including Betweenness Centrality and Closeness Centrality, these methods evaluate the influence of nodes based on their positions within paths.
- Adjacency-based Algorithms: including Degree Centrality and Eigenvector Centrality, which primarily assess importance based on the direct connections of nodes. This evaluation does not involve pre-processing through maximal behavior pattern mining.
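For reference, these baselines can be reproduced with standard graph tooling. The sketch below computes all six centralities with networkx on a toy directed dependency graph; the graph edges are illustrative and do not correspond to the actual behavior paradigm graph.

```python
import networkx as nx

# Toy directed dependency graph over databases (illustrative edges only).
G = nx.DiGraph([("P1-db-1", "P3-db-2"), ("P3-db-2", "P2-db-1"),
                ("P2-db-1", "P1-db-1"), ("P1-db-1", "P4-db-1")])

baselines = {
    "PageRank":     nx.pagerank(G),
    "Katz":         nx.katz_centrality(G, alpha=0.05),
    "Betweenness":  nx.betweenness_centrality(G),
    "Closeness":    nx.closeness_centrality(G),
    "Degree":       nx.degree_centrality(G),
    "Eigenvector":  nx.eigenvector_centrality(G, max_iter=1000),
}
for name, scores in baselines.items():
    ranking = sorted(scores, key=scores.get, reverse=True)   # node ranking per baseline
    print(f"{name}: {ranking}")
```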
For evaluation metrics, we utilize MAP, Precision@10, Recall@10, F1@10, and ACC@10 to comprehensively assess the effectiveness of the different algorithms.

\( \mathrm{MAP} = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \frac{1}{m_q} \sum_{k=1}^{m_q} \mathrm{Precision}(R_{q,k}) \)  (19)

Here, |Q| = 1, so the formula calculates the average precision for a single query. mq represents the number of relevant items for that query, and Precision(Rq,k) is the precision at each cut-off k, up to mq.

\( \mathrm{Precision@10} = \frac{|\mathrm{Rel} \cap \mathrm{Top}_{10}|}{10} \)  (20)

where Rel denotes the set of relevant items and Top10 the set of top 10 retrieved items. This measures the proportion of relevant items among the top 10 items retrieved, reflecting the accuracy of the retrieval process for the top results.

\( \mathrm{Recall@10} = \frac{|\mathrm{Rel} \cap \mathrm{Top}_{10}|}{|\mathrm{Rel}|} \)  (21)

This evaluates the proportion of the total relevant items that have been retrieved within the top 10 results, providing insight into the coverage of the retrieval system.

\( \mathrm{F1@10} = \frac{2 \cdot \mathrm{Precision@10} \cdot \mathrm{Recall@10}}{\mathrm{Precision@10} + \mathrm{Recall@10}} \)  (22)

The F1 score at 10 combines precision and recall into a single metric by calculating their harmonic mean, providing a balance between precision and recall for the top 10 results.

\( \mathrm{ACC@10} = \frac{|\{\text{correctly predicted relevant items in } \mathrm{Top}_{10}\}|}{10} \)  (23)

This metric measures the ratio of correctly predicted relevant items in the top 10 results, providing a straightforward indicator of accuracy at this cut-off.
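The following sketch shows one way to compute these metrics for a single ranking, following the descriptions above; the ranking, the ground-truth relevant set, and the simplified reading of ACC@10 (taken here as the fraction of correct items among the top 10) are assumptions for illustration.

```python
def average_precision(ranked, relevant):
    """AP for a single query (|Q| = 1 in Eq. 19): average of the precision
    values taken at the rank of each relevant item that is retrieved."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def top_k_metrics(ranked, relevant, k=10):
    """Precision@k, Recall@k, F1@k, and a simplified ACC@k for one ranking."""
    top_k = ranked[:k]
    true_pos = len(set(top_k) & relevant)
    precision = true_pos / k
    recall = true_pos / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    acc = true_pos / k   # simplified reading of ACC@k; an assumption, not the paper's exact definition
    return precision, recall, f1, acc

ranked = ["db7", "db2", "db5", "db1", "db9", "db3", "db8", "db4", "db6", "db0"]
relevant = {"db7", "db2", "db5", "db1", "db9", "db3", "db8", "db4"}   # hypothetical ground truth
print(average_precision(ranked, relevant))
print(top_k_metrics(ranked, relevant, k=10))
```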
Main experimental results and analysis.
The main results of Scenario #1 are shown in Table 9. In the scenario of data usage rights sharing (Scenario #1), DecentralDC demonstrates significant advantages. It achieves a MAP of 0.85, indicating its effectiveness in accurately identifying key nodes through an integrated analysis of the nodes' network positions and data interaction behaviors. The Precision@10 is 0.84, suggesting that DecentralDC provides the most relevant entries among the top 10 results, which is crucial for quickly accessing key information in a data-sharing environment. The Recall@10 is 0.83, confirming its effectiveness in covering all relevant data nodes and ensuring that no key data is missed. The F1@10 score is also 0.83, showing a good balance between precision and recall, allowing for accurate node identification while minimizing false positives. The ACC@10 is 0.85, indicating that the large majority of the top 10 retrieved results are correct and reflecting the efficiency and reliability of DecentralDC in practical applications. The comparative analysis with baseline methods in Scenario #1 is as follows.
- PageRank: In the data usage rights sharing scenario, PageRank scored a MAP of 0.72 and Precision@10 of 0.70. PageRank primarily assesses node importance based on the number of incoming links, which may not adequately reflect the true influence of nodes in dynamically changing data usage scenarios. In contrast, DecentralDC not only considers the number of connections but also analyzes their actual roles and efficiency in data usage, thus providing a more accurate assessment of node influence.
- Katz Centrality: For Katz Centrality, the MAP was 0.75 and Precision@10 was 0.73. This method emphasizes the influence of indirect connections between all nodes but may not fully capture the rapidly changing patterns of data flow. DecentralDC surpasses Katz Centrality’s limitations by analyzing the directions and frequency of data flows, offering a more dynamic and precise analysis of node importance.
- Betweenness Centrality: Betweenness Centrality achieved a MAP of 0.68 and Precision@10 of 0.66. It identifies nodes at central positions within the network, which control information flows. However, in data sharing, nodes serving merely as conduits in information flow may not engage directly in data usage. DecentralDC identifies not only key nodes in the information flow but also analyzes their roles in data processing and decision-making, thus providing deeper insights into data usage rights sharing.
- Closeness Centrality: Closeness Centrality recorded a MAP of 0.70 and Precision@10 of 0.68. While this metric emphasizes the proximity of nodes to all others within the network, suggesting their efficiency in information transmission, it may not reflect the nodes’ actual control over data usage. DecentralDC provides a more comprehensive evaluation by analyzing not only the structural position of nodes within the network but also their roles in data generation, processing, and usage, thereby offering a more precise assessment of influence.
- Degree Centrality: In this scenario, Degree Centrality achieves a MAP of 0.77 and Precision@10 of 0.75. This method measures the importance of nodes based on their direct connections, reflecting their activity within the network. Although Degree Centrality can effectively identify actively interacting nodes, it overlooks the specifics and quality of these interactions. DecentralDC, by delving deeper into nodes' behavior patterns and analyzing how they control and use data, provides a more comprehensive understanding of node influence.
- Eigenvector Centrality: Eigenvector Centrality performs well in the data usage rights sharing scenario with a MAP of 0.79 and Precision@10 of 0.78. This method suggests that a node's importance is influenced not only by its direct connections but also by the importance of its connected nodes. While Eigenvector Centrality effectively captures the cumulative effect of influence within the network, it primarily relies on static network structures. DecentralDC, by analyzing specific behaviors of nodes in data usage, such as requests, processing, and distribution, reveals the real-time and dynamic importance of nodes.
The main results of Scenario #2 are shown in Table 10. In multi-party computation environments (Scenario #2), DecentralDC exhibits significant performance advantages. It achieves a MAP value of 0.80, demonstrating excellent identification of key participating nodes. The Precision@10 is 0.79, indicating its ability to accurately recognize nodes that contribute most significantly to computations. The Recall@10 is 0.78, proving its effectiveness in covering all essential data nodes, which is crucial for ensuring the completeness and accuracy of computation results. The F1 score at 10 is also 0.78, showing a good balance between precision and recall, which ensures that while the correct nodes are identified, the relevance and completeness of the results are maximized. The ACC@10 is 0.80, confirming DecentralDC's efficiency and reliability in practical applications, especially in complex scenarios involving multi-party data integration and processing. The comparative analysis with baseline methods in Scenario #2 is as follows.
- PageRank: Although PageRank performs well in data usage rights sharing, its effectiveness significantly drops in multi-party computation scenarios (MAP 0.65). This is because PageRank mainly assesses static importance based on links, while multi-party computation requires identifying nodes dynamically contributing to the computation process. DecentralDC provides a more comprehensive importance assessment by evaluating not just static links but also the dynamic interactions and data contributions between nodes.
- Katz Centrality: Although Katz Centrality performs relatively well in data usage rights sharing, its MAP drops to 0.68 in multi-party computations. It does not adequately handle the dynamic data flow and contributions between nodes. DecentralDC is more effective in capturing real-time data interactions and processing capabilities, allowing for more adaptable evaluations of node importance in multi-party computations.
- Betweenness Centrality: Despite its advantages in identifying key nodes in data flows, Betweenness Centrality’s importance evaluation capability decreases in multi-party computations (MAP 0.62), as merely being a conduit in data flow does not equate to contributing to computation outcomes. DecentralDC surpasses Betweenness Centrality by analyzing how nodes influence final computation outputs, which is particularly crucial in multi-party computations.
- Closeness Centrality: Exhibiting a MAP of 0.64, Closeness Centrality shows limitations in multi-party computations. This method, which evaluates the average distance of a node to all others in the network, assumes that closer nodes have a greater impact on data. However, multi-party computations require more than quick data access; they need nodes that significantly contribute to data processing. DecentralDC provides a more accurate evaluation of node importance by analyzing nodes’ actual data processing capabilities and contributions, making it particularly suitable for multi-party computations.
- Degree Centrality: With a MAP of 0.70 in multi-party computations, Degree Centrality’s primary limitation is its focus solely on direct connections of nodes. In multi-party computations, a node’s influence is determined not just by its connections but by how it processes and contributes data. DecentralDC offers deeper insights by analyzing the behavioral patterns of nodes in data processing and exchange, enabling the identification of nodes that are crucial in data processing but may not be prominent in traditional measures.
- Eigenvector Centrality: Eigenvector Centrality achieves a MAP of 0.72 in multi-party computations, the highest among the baselines but still below DecentralDC's 0.80. This method considers the importance of a node based on the importance of its connected nodes, attempting to capture the cumulative effect of influence within the network. While theoretically effective in reflecting the distribution of influence in a network, Eigenvector Centrality lacks detailed analysis of how nodes actively participate in data processing and multi-party computations. DecentralDC provides a more comprehensive evaluation of influence by considering nodes' data interactions and processing capabilities, which are essential for optimizing the entire computation process in multi-party environments.
Here are the results of data contribution obtained through our DecentralDC, as shown in Table 11. For the 189 databases in Scenario #2, Table 11 shows the top ten databases in contribution. Given the theory of our method, in the process of sharing and exchange, the database with high importance weight of the behavior paradigm and a large amount of data will have a higher contribution.
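As a rough illustration of how paradigm weights and data volume combine into a contribution score, the sketch below sums weight times discretized volume per database over all recorded behaviors and then normalizes; the paradigm weights, behavior records, and exact aggregation rule are illustrative assumptions rather than the precise DecentralDC computation.

```python
from collections import defaultdict

def database_contribution(behaviors, paradigm_weight):
    """Sum paradigm weight * data volume per database, then normalize to sum to 1."""
    score = defaultdict(float)
    for b in behaviors:
        for db, volume in zip(b["databases"], b["volumes"]):
            score[db] += paradigm_weight[b["paradigm"]] * volume
    total = sum(score.values()) or 1.0
    return {db: s / total for db, s in score.items()}

behaviors = [   # hypothetical on-chain behavior records
    {"paradigm": "usage-right-sharing", "databases": ["P3-db-2"], "volumes": [7]},
    {"paradigm": "multi-party-computing", "databases": ["P1-db-1", "P3-db-2"], "volumes": [4, 9]},
]
weights = {"usage-right-sharing": 0.35, "multi-party-computing": 0.65}   # illustrative paradigm weights
print(database_contribution(behaviors, weights))
```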
Maximal sequential pattern mining experiments
This section verifies the effectiveness of the proposed method for maximal sequential pattern mining. For the behavior paradigm dependency mining step, the comparison between the maximal sequential pattern mining method and the other two methods is shown in Table 12, with the support of the three methods set to 11% and 1.7% in the two scenarios, respectively. The lower support setting in Scenario #2 compared to Scenario #1 is mainly due to the increased parallel execution of collaborative businesses. Consequently, more behaviors are generated within the same duration, allowing a greater number of behavior paradigm sequences to be obtained through interval thresholds and sliding windows, which ultimately permits a lower support.
Baseline methods.
To verify the effect of our method in mining dependencies between sharing and exchange behavior paradigms, two baseline methods are selected for comparison.
- CM-SPADE [93]: This method mines all frequent sequential patterns from the behavior paradigm sequence set. CM-SPADE is a highly efficient sequential pattern mining algorithm proposed by Professor Philippe Fournier-Viger's team in 2014.
- Two-Frequent Continuous Subsequence Mining (t-FCSM): This method is designed to mine Two-frequent continuous subsequence patterns with length 2 from the behavior paradigm sequence set.
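A compact sketch of the t-FCSM baseline is given below: it counts adjacent (contiguous) behavior paradigm pairs once per sequence and keeps those whose relative support reaches a threshold; the toy sequences and threshold are illustrative.

```python
from collections import Counter

def t_fcsm(sequences, min_support):
    """Mine length-2 contiguous subsequences whose relative support >= min_support."""
    pair_counts = Counter()
    for seq in sequences:
        pair_counts.update(set(zip(seq, seq[1:])))   # adjacent pairs, counted once per sequence
    n = len(sequences)
    return {pair: count / n for pair, count in pair_counts.items()
            if count / n >= min_support}

sequences = [["bp1", "bp2", "bp3"], ["bp1", "bp2", "bp4"], ["bp2", "bp3", "bp1"]]
print(t_fcsm(sequences, min_support=0.5))   # e.g. {('bp1','bp2'): 0.67, ('bp2','bp3'): 0.67}
```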
We refer to the evaluation indicators of machine learning classification tasks, as detailed in Table 13. Note that in the evaluation process, only the presence or absence of dependencies is considered, while the strength of dependencies, namely the weight of edges in the dependency graph is not taken into account.
For each scenario, the F1 value with the best mining effect is marked in bold, and the next best is marked by underlining. According to the experimental results, we can draw the following conclusions.
First, in both scenarios, the recall of the CM-SPADE and VMSP methods is 1, indicating that both methods mine all the correct dependencies. However, the CM-SPADE mining algorithm exhibits lower precision. This can be attributed to its retention of a large number of subsequences, from which a considerable portion of the derived dependencies, i.e., two-frequent continuous subsequences of length 2, are erroneous. Second, in both scenarios, the t-FCSM method achieves a precision of 1, with recalls of 90.69% and 88.75%. This indicates that all dependencies mined by t-FCSM are correct, but some are omitted. These omissions stem from its exclusive focus on mining sequential patterns within adjacent elements of the behavior paradigm sequence. In our experimental setting, where the probability of the next behavior belonging to the same collaborative business is 0.5, dependencies may be overlooked if two adjacent behavior paradigms in the collaborative business process do not correspond to adjacent elements in the sequence. Moreover, t-FCSM can only mine short binary dependencies, which constrains its ability to express business logic. It is therefore unsuitable for scenarios in which a collaborative business encompasses numerous behavior paradigms.
Time spent experiments
To measure time performance, we compare our method against the CM-SPADE and t-FCSM baselines, since the main time cost arises from mining frequent sequences. The experiments are conducted on a high-performance computer equipped with an Intel Core i9-10900K CPU (10 cores, 20 threads), 64GB RAM, and an NVIDIA GeForce RTX 3080 graphics card. This hardware configuration ensures that we can fairly evaluate the performance of the various algorithms when processing large datasets.
The results of the time spent experiments are shown in Table 14. CM-SPADE and t-FCSM are general sequential pattern mining methods rather than maximal sequential pattern mining methods; therefore, after mining the general sequential patterns, a post-processing step is required to obtain the maximal sequential patterns. In our analysis of the VMSP, CM-SPADE, and t-FCSM algorithms across different data scales, we identified key quantitative results that highlight the strengths and limitations of each algorithm. In the small-scale scenario, VMSP's training time is 29% faster than CM-SPADE and 16% faster than t-FCSM. In the large-scale scenario, VMSP remains the fastest, with training times 17% faster than CM-SPADE and 12% faster than t-FCSM. These results clearly indicate VMSP's efficiency advantages.
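The post-processing step mentioned above can be realized by discarding every frequent pattern contained in a longer frequent pattern. The sketch below shows a straightforward quadratic filter, sufficient for illustration but not the optimized procedure used in the experiments.

```python
def is_subsequence(small, big):
    """True if `small` occurs in `big` as a (not necessarily contiguous) subsequence."""
    it = iter(big)
    return all(any(x == y for y in it) for x in small)

def keep_maximal(patterns):
    """Keep only patterns that are not subsequences of another frequent pattern."""
    return [p for p in patterns
            if not any(p != q and is_subsequence(p, q) for q in patterns)]

frequent = [("bp1", "bp2"), ("bp2", "bp3"), ("bp1", "bp2", "bp3")]
print(keep_maximal(frequent))   # [('bp1', 'bp2', 'bp3')]
```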
Furthermore, we conducted an in-depth analysis of the algorithms’ principles. VMSP’s main advantages stem from its efficient data structures and optimized parallel computing strategies, which significantly reduce unnecessary data scans and computations when handling large datasets, thereby enhancing processing speed. In contrast, CM-SPADE, although faster in small-scale data processing, lacks memory usage efficiency and scalability when dealing with large datasets. The t-FCSM algorithm, while having specific advantages in time-series data analysis, exhibits weaker capabilities in processing complex datasets, particularly at larger scales, where efficiency and performance significantly decline.
These analytical findings reflect various factors to consider when selecting suitable algorithms, including data scale, algorithm efficiency, and specific requirements of application scenarios. They also illustrate how to choose the appropriate algorithm to optimize performance based on different business needs in practical applications. This thorough performance analysis and principled explanation help us better understand the applicability and potential application value of each algorithm.
Impact of support
In this section, we analyze the impact of the support on mining behavior paradigm dependencies in our method.
The range of support values is shown in Table 15, where support is an important parameter for measuring the frequency of behavior patterns. Here, the support gradually increases from 0.07 to 0.17 for Scenario #1, and from 0.008 to 0.023 for Scenario #2. The purpose is to observe whether the model can more accurately identify frequent behavior patterns as the support increases. A lower support allows less common behavior patterns to be considered important, while a higher support emphasizes more common and consistent behavior patterns. Such settings help find the optimal support threshold to balance the model's generalization ability and the risk of overfitting.
As shown in Tables 16 and 17, in Scenario #1, precision and F1-Score initially rise and then fall as the support increases, while recall remains at 100% at lower support levels. When the support is 0.11, Precision reaches its peak value of 94.37%, and F1-Score also achieves its highest value of 97.10%. As the support continues to increase, both Precision and F1-Score decline, possibly because higher support thresholds prevent longer sequence patterns from being mined, so their subsequences are retained and may introduce erroneous dependencies, affecting the accuracy. At a support level of 0.13, Precision and F1-Score begin to drop significantly, although Recall remains high, indicating that at higher support levels the model may miss some longer patterns, whose retained subsequences lead to decreased Precision.
In Scenario #2, the impact of support on each metric is similar to that in Scenario #1. Precision and F1-Score reach their highest values at a support level of 0.017, at 95.57% and 97.73%, respectively. Compared to Scenario #1, Scenario #2 achieves higher Precision and F1-Score even at lower support levels (e.g., 0.008), indicating that in a more complex data environment, our method can effectively extract meaningful patterns at lower support levels. This phenomenon suggests that the dataset in Scenario #2 may contain more frequent patterns, allowing effective mining even at lower support thresholds.
To further intuitively observe the trends in the results, we visualize them as shown in Figs 8 and 9. Similar experimental results in both scenarios show that as the support increases, the precision and F1 score exhibit an upward trend followed by a subsequent decline. This indicates that a low support lowers the threshold for frequent sequence patterns, causing more noise sequences to be classified as frequent sequence patterns and introducing incorrect dependencies. Conversely, high support thresholds make it difficult to mine longer sequence patterns, so subsequences that belong to these longer patterns are not filtered out and may contain erroneous dependencies. Furthermore, appropriate supports ensure that all correct dependencies among behavior paradigms are mined. However, after the support exceeds a certain threshold, the recall starts to decline. The main reason may be that many frequent sequences cannot meet the new support and become infrequent sequences. These discarded sequences may contain some unique correct behavior paradigm dependencies, and the loss of these relationships leads to the decline of recall.
(Figs 8 and 9) The horizontal axis represents the magnitude of support, while the vertical axis represents the values of the indicators.
Overall, in both scenarios, Precision and F1-Score perform best at moderate support levels (0.11 for Scenario #1 and 0.017 for Scenario #2), while Recall maintains a high level across most support levels. This demonstrates that our mining method can effectively extract significant patterns in different scale scenarios, ensuring high accuracy and comprehensiveness. Particularly, in lower support situations, the performance in Scenario #2 significantly outperforms Scenario #1, further proving the advantage of our method in handling complex data environments.
Impact of sliding window size
In this section, we analyze the impact of sliding window size on mining behavior paradigm dependencies in our method.
The size of the sliding window directly affects the range within which the model observes behavior patterns, as shown in Table 18. Smaller windows may not capture enough contextual information, leading to incomplete pattern recognition; larger windows may include too much irrelevant information, reducing the accuracy of recognition. By varying the window size from 5 to 10 for Scenario #1 and from 6 to 11 for Scenario #2, the experiment aims to assess the impact of different window sizes on model performance and find a balance point at which the model can most effectively recognize and utilize the temporal dependencies in the behavioral data. In addition, the support for the two scenarios is set to 11% and 1.7%, respectively.
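To make the role of the window size concrete, the sketch below cuts a chronologically ordered behavior stream into candidate behavior paradigm sequences with a fixed-size sliding window; the interval threshold used in the full method is omitted, and the stream and window size are illustrative.

```python
def sliding_window_sequences(behavior_stream, window_size):
    """Segment the ordered behavior stream into overlapping candidate sequences."""
    if len(behavior_stream) <= window_size:
        return [list(behavior_stream)]
    return [list(behavior_stream[i:i + window_size])
            for i in range(len(behavior_stream) - window_size + 1)]

stream = ["bp1", "bp4", "bp2", "bp3", "bp1", "bp5", "bp2", "bp6"]
for sequence in sliding_window_sequences(stream, window_size=5):
    print(sequence)
```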
Experimental results are shown in Tables 19 and 20. In Scenario #1, we observe that as the window size gradually increases, the model's precision improves significantly from 76% to 94.37%. This change indicates that the model can more accurately identify target entities or features with larger windows, reducing misidentification. In particular, when the window size reaches 8, the model achieves a precision of 94.37%, demonstrating that a larger contextual window provides the model with more information, enabling it to better understand and process complex patterns in the data. Concurrently, the recall reaches 100% at a window size of 7 and maintains this level in subsequent tests. This result shows that when the window size reaches 7 or above, the model captures all relevant instances and misses no positive data, which is important in applications where ensuring no loss of information is critical, especially in scenarios requiring highly accurate information retrieval. Additionally, the F1 score increases from 80.85% to 97.10%; this leap further verifies that the model's overall effectiveness is significantly enhanced with larger windows. This improvement is not just an enhancement of a single metric but the result of combined improvements in precision and recall, clearly demonstrating the significant impact of window size on model performance.
In Scenario #2, the model's performance shows a similar improvement trend. The precision gradually increases from 88.17% to 95.57%. Similar to Scenario #1, this growth reflects the model's enhanced capability to grasp data features in larger data windows, effectively distinguishing between positive and negative samples and thereby reducing misjudgments. The recall is also particularly noteworthy, reaching 100% at a window size of 8, indicating that the model can retrieve all relevant instances completely. Such a high recall is particularly crucial in many applications, such as medical or legal document searches, where missing critical information could lead to severe consequences. The increase in the F1 score from 86.85% to 97.73% further confirms that with larger window settings, the overall performance of the model is significantly boosted. This enhancement is due to the synchronous improvements in precision and recall, ensuring the model's robustness and reliability when handling complex and variable data.
Visualization results of the two scenarios are shown in Figs 10 and 11. In both scenarios, changing the sliding window size while keeping the support constant yields a similar effect: precision and F1 initially increase and then decrease as the window size grows, while recall gradually declines once the sliding window size falls below a certain threshold. The decrease in precision for smaller windows is possibly because a small window can only mine shorter sequence patterns, i.e., subsequences contained in longer sequences cannot be filtered out, and these short sequences may contain incorrect behavior paradigm dependencies. The decrease in precision as the window size increases may be because larger windows introduce more noise sequence fragments, thereby mining incorrect behavior paradigm dependencies. Additionally, the main reason that a smaller window reduces recall may be that it splits behavior paradigms belonging to the same long collaborative business into different sequences, resulting in the loss of the dependencies between these behavior paradigms.
(Figs 10 and 11) The horizontal axis represents the size of the sliding window, while the vertical axis represents the values of the indicators.
Through detailed analysis of Scenarios #1 and #2, it is clear that window size plays a key role in enhancing model performance, and the proposed method demonstrates good adaptability and effectiveness across different types of data scenarios.
Impact of data contribution in data pricing
Data contribution, as a data quality dimension based on data behavior, can impact the calculation of data costs from the perspective of data behavior, thus influencing data pricing. In order to analyze the impact of introducing data contribution on data pricing, we choose the following two cost calculation methods for comparison.
Comparison methods.
- Baseline
Following the noisy linear query settings in [71, 77], data costs are solely evaluated in terms of privacy compensation. Specifically, in round t, a privacy leakage mechanism based on differential privacy and a tanh-based privacy compensation function are employed to generate the privacy compensation of the data provider. Subsequently, the privacy compensation is used to generate a k-dimensional feature vector \(h_t\), which is then L2-normalized, i.e., \( \|L_2(h_t)\|_2 = 1 \). The cost is the sum of the components of \(L_2(h_t)\), i.e., \( cost_t = \sum_{i=1}^{k} L_2(h_t)_i \). The pricing algorithm employs the ellipsoid-based data pricing (EBDP) mechanism [77].
- Privacy-based Pricing Incorporating Data Contribution (PPIC)
To ensure that the proportions of data contribution and privacy compensation in the cost are roughly equal, we consider adding the contribution of the current trading data's database to each component of the L2-normalized privacy compensation. Furthermore, for convergence, the data cost needs to be L2-normalized overall [77], i.e., \( cost_t = \sum_{i=1}^{k} L_2\big(L_2(h_t) + \alpha \cdot x_{provider} \cdot \mathbf{1}\big)_i \), where \(\alpha\) represents the weight to adjust the proportion between data contribution and privacy compensation (chosen as 0.1, 0.5, 1.0, 1.5, and 2.0 in the experiments, denoted as PPIC_alpha0.1, PPIC_alpha0.5, PPIC_alpha1.0, PPIC_alpha1.5, and PPIC_alpha2.0 respectively), \(x_{provider}\) stands for the contribution of the database storing the trading data, and \(\mathbf{1}\) is a k-dimensional vector with all components equal to 1.
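A direct transcription of the baseline and PPIC costs into code is shown below; the feature vector, contribution value, and weight are placeholder numbers, and numpy is used only for the L2 normalization.

```python
import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v)

def baseline_cost(h_t):
    """Baseline: sum of the components of the L2-normalized privacy compensation."""
    return l2_normalize(h_t).sum()

def ppic_cost(h_t, x_provider, alpha):
    """PPIC: add alpha * contribution to every component, re-normalize, then sum."""
    adjusted = l2_normalize(l2_normalize(h_t) + alpha * x_provider * np.ones_like(h_t))
    return adjusted.sum()

h_t = np.array([0.3, 0.8, 0.1, 0.5])                 # illustrative k-dimensional features
print(baseline_cost(h_t), ppic_cost(h_t, x_provider=0.04, alpha=1.0))
```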
To evaluate the pricing performance, we simulate 100,000 trading rounds and utilize cumulative regret and regret ratio as evaluation metrics, as detailed in Table 21.
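For clarity, the sketch below computes cumulative regret and the regret ratio for a posted-price trace under a common convention assumed here (Table 21 gives the exact definitions): a posted price p_t is accepted only if it does not exceed the market value v_t, the round regret is v_t minus the realized revenue, and the regret ratio divides cumulative regret by cumulative market value.

```python
def regret_metrics(posted_prices, market_values):
    """Cumulative regret and regret ratio for a sequence of posted-price rounds."""
    cumulative_regret, cumulative_value = 0.0, 0.0
    for p_t, v_t in zip(posted_prices, market_values):
        revenue = p_t if p_t <= v_t else 0.0      # assumed acceptance rule
        cumulative_regret += v_t - revenue
        cumulative_value += v_t
    return cumulative_regret, cumulative_regret / cumulative_value

print(regret_metrics(posted_prices=[0.4, 0.9, 0.7], market_values=[0.5, 0.8, 0.7]))
```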
Parameter settings.
As shown in Table 22, in the experiments on the impact of data contribution in data pricing, the hyperparameter α is used to adjust the proportion between data contribution and privacy compensation in the data pricing model. Specifically, the value of α determines the relative weight of data contribution in the total cost of each transaction. Experimental results show that increasing the α value leads to a greater impact of data contribution on the pricing strategy, thereby reducing cumulative regret and regret ratio. This indicates that incorporating data contribution into data pricing can effectively optimize the pricing strategy.
The simulation rounds (Rounds) are divided into two categories based on the type of experiment: cumulative regret experiments and regret ratio experiments. For cumulative regret experiments, the number of rounds ranges from 100 to 100000, aimed at evaluating the pricing mechanism’s performance over varying transaction frequencies. For regret ratio experiments, the rounds range from 1 to 100000, focusing on assessing the efficiency of the pricing strategy at different stages of the simulation. These settings ensure a comprehensive analysis of the data pricing mechanism under various conditions, validating the effectiveness and stability of the proposed model.
Experimental results.
In this section, we analyze the impact of introducing data contribution in data pricing under the two scale scenarios. The experimental results are shown in Figs 12–15.
These four tables provide crucial data comparing the effectiveness of the different methods (EBDP and PPIC with varying α values) over multiple rounds in the two scenarios. The cumulative regret tables (Tables 23 and 24) give a total measure of regret over time, while the regret ratio tables (Tables 25 and 26) provide a comparative measure of performance efficiency at different points during the experiment. These data are vital for understanding the performance and efficiency of each method in the given scenarios.
We can observe from the results of the two scenarios that as the number of trading rounds t increases, the cumulative regret gradually rises and then stabilizes. This is because data providers need to publish exploration prices and update the range of θ over more transactions, making the posted price converge towards the true market value of the data. Meanwhile, the regret ratio exhibits an overall decreasing trend, which can be attributed to the fact that as the trading round t increases, the gap between the posted price \(p_t\) and the market value \(v_t\) gradually diminishes, resulting in a slower increase in cumulative regret compared to cumulative market value. It is worth noting that around 100 trading rounds, the regret ratio shows a short-term increasing trend, due to pronounced changes in the pricing mechanism when switching between exploration and exploitation prices.
In Scenario #1, PPIC significantly reduces cumulative regret compared to the baseline. Furthermore, as the weight of data contribution increases, cumulative regret experiences a decrease. This is primarily attributed to the additional cost information introduced by data contribution, facilitating the update of the range of θ during the pricing process. It narrows the gap between the posted price and the market value, thereby reducing regret. As α increases, the proportion of data contribution in the cost gradually rises, and with the increase of t, data providers are more inclined to determine the price based on the data contribution, aiming to achieve maximum returns. When the contribution weight α is set to 1.0, 1.5, and 2.0 respectively, the cumulative regret of PPIC is reduced by 10.93%, 23.34%, and 43.38% compared to the baseline. Additionally, with the increase in the contribution weight, the final regret ratio continues to decrease. Notably, in the baseline, the final regret ratio is 0.1336, while PPIC with a contribution weight α of 2.0 achieves a final regret ratio of 0.0513, representing a 61.8% reduction. Despite data providers having a good estimation of the data value after a sufficient number of trading rounds, the impact of contribution on the posted price remains significant, demonstrating the effectiveness of incorporating data contribution.
The conclusions from Scenario #2 are similar to those in Scenario #1. Based on the experiment results related to cumulative regret, when the contribution weight α is set to 1.0, 1.5, and 2.0 respectively, the cumulative regret of PPIC is reduced by 11.03%, 23.50%, and 35.46% compared to the baseline. Regarding the experiment results related to the regret ratio, the final regret ratio of baseline is 0.1273, whereas, with the α of 2.0, the final regret ratio is 0.0516, representing a reduction of 59.8%.
Limitation and discussion
This study acknowledges several limitations that warrant further investigation. While our method for assessing data contribution in decentralized sharing scenarios has shown promising results, the reliance on specific blockchain architectures may affect the generalizability of our findings. Additionally, the computational intensity of maximal sequential pattern mining could pose challenges in larger, more dynamic datasets.
Limitation
- Sensitivity to Hyperparameters
Our method relies on multiple critical hyperparameter settings, such as the support threshold for frequent sequence mining and the window size for behavioral pattern extraction. These hyperparameters may require different configurations across various datasets or application contexts to ensure effectiveness and accuracy. A primary issue with hyperparameter settings is their high sensitivity; different parameter values can significantly impact the final assessment outcomes. For instance, varying the support threshold from 5% to 15% led to a 20% decrease in recall, as valuable patterns below the new threshold were excluded. Similarly, the setting of the window size needs to balance capturing sufficient behavioral information and avoiding redundant data. Optimizing these hyperparameters typically necessitates extensive experimentation and tuning, which is not only time-consuming but may also not guarantee a globally optimal solution. Additionally, varying data characteristics and application demands might lead to significant discrepancies in optimal hyperparameter settings, increasing complexity and uncertainty in practical applications.
- Dependency on Sequence Length
Our approach shows a high dependency on the length of sequences used in mining behavioral patterns. Short sequences might not adequately capture contextual information about user behaviors or data exchanges, leading to inaccuracies in the evaluation results. Conversely, long sequences may include irrelevant information and noise, increasing the complexity of data processing and computational resource consumption. The choice of sequence length substantially affects the algorithm's performance and the accuracy of the results. For shorter sequences, our method might require the integration of more contextual information or additional data features to compensate for insufficient data. In experiments, reducing window size from 10 to 5 time units caused a drop in F1 score from 0.75 to 0.55, reflecting the loss of contextual continuity. For longer sequences, it becomes crucial to develop more efficient data processing and compression techniques to mitigate computational loads and enhance processing speeds. Moreover, different application scenarios and data types might have varying requirements for sequence length, demanding that our method adapt flexibly in real-world applications.
- Computational Complexity and Resource Demands
The computational complexity and resource requirements of our method escalate when dealing with large-scale datasets. The processes of mining maximal sequence patterns and constructing behavioral paradigm graphs require substantial computational resources and memory, which could become a bottleneck when handling vast amounts of data. As data scales increase, the algorithm’s time and space complexities grow exponentially, potentially leading to prolonged processing times and memory exhaustion. During the evaluation of a dataset comprising over 10 million transactions, computational time increased tenfold as the volume of data doubled, underscoring the scalability challenges. To address these issues, we need to develop more efficient algorithms and data structures to optimize computation, reducing unnecessary computational steps. Furthermore, leveraging distributed computing and parallel processing techniques can significantly enhance the scalability and processing capabilities of our method. However, this also implies that sufficient computational resources and technical support are essential in practical applications. Thus, reducing computational complexity and resource demands while ensuring algorithmic accuracy remains a critical research direction for us.
Discussion
Despite our method’s demonstrated efficacy and accuracy in experimental settings, the aforementioned limitations indicate several areas for improvement and optimization. To overcome these limitations, we plan to conduct in-depth research in the following areas.
- Optimizing Hyperparameter Tuning Methods
We will explore automated hyperparameter tuning techniques, such as those based on Bayesian optimization or evolutionary algorithms, to reduce manual tuning efforts and enhance tuning efficiency.
- Improving Sequence Processing Techniques
For sequences of varying lengths, we aim to develop more adaptive sequence processing and compression technologies to ensure that critical behavioral information is captured while minimizing noise and redundancy.
- Enhancing Computational Efficiency and Scalability
By introducing more efficient algorithms and data structures, combined with distributed computing and parallel processing, we plan to boost the processing power and scalability of our method, ensuring it remains effective even on large-scale datasets. Through these improvements, we believe we can further enhance the practical utility and application value of our method, providing more effective and reliable solutions for assessing data contribution in decentralized sharing and exchange scenarios.
Conclusion
This work focuses on the assessment of data contribution in decentralized sharing and exchange consortium blockchains. Utilizing metadata sequences of data sharing and exchange behaviors recorded on the consortium blockchain, we propose a data contribution assessment method based on the maximal sequential patterns of sharing and exchange behavior paradigms. This method aims to explore dependencies among sharing and exchange behavior paradigms and determine the weights of these paradigms. The maximal sequential pattern mining algorithm helps filter out redundant or potentially erroneous behavior paradigm dependencies, enhancing the accuracy of the discovered dependencies and their weights. Finally, combining the data volume of data behaviors allows for the calculation of the contribution of each participating database in the consortium blockchain. The effectiveness of the proposed method and the positive role of data contribution in data pricing are demonstrated through experiments conducted in two different scale scenarios. Specifically, compared to the most competitive baseline, the improvements of mean average precision in the two scenarios are 6% and 8%.
Future work
This article introduces and defines a new context quality metric, data contribution, and explores methods to assess this metric in scenarios of data sharing and exchange. We have proposed a method for assessing data contribution based on sharing and exchange behavior paradigms and maximal sequence patterns. This method has been empirically validated on open-source systems and simulated scenarios, demonstrating its capability to address data contribution assessment issues in respective contexts to some extent. However, there remain several areas for improvement and refinement in our study, and future work could continue exploring the following aspects.
- Improvements in Data Contribution Assessment Methods.
Our current method only considers direct dependencies in mining relationships. Future work could incorporate indirect and transitive dependencies to more accurately mine the dependencies and their strengths between different behavioral paradigms. The research primarily focuses on data sharing scenarios driven by collaborative business between different systems, considering only temporary sharing of data usage rights and multiparty data behavior types. Further improvements should integrate characteristics of non-collaborative business-driven sharing scenarios, such as data trading and data services, to extend the applicability of the current method.
- Introduction of Adaptive Hyperparameter Optimization Techniques.
To address the deficiencies in hyperparameter settings of the current method, future work could involve adaptive hyperparameter optimization techniques, such as Bayesian optimization or genetic algorithms, to automatically adjust key parameters like support thresholds, enhancing the method's applicability and robustness.
- Reducing Dependency on Sequence Length.
In order to reduce the dependency on the length of data sequences, future approaches could explore combining sequence compression techniques or graph-based pattern mining methods to enhance the efficiency and accuracy when handling long sequences, adapting to a more diverse range of data scenarios. Through these improvements and extensions, the method proposed in this article is expected to play a greater role in a broader range of practical applications, providing a more comprehensive and precise solution for assessing data contribution.
References
- 1. Foroni D, Lissandrini M, Velegrakis Y. Estimating the extent of the effects of Data Quality through Observations. In: ICDE; 2021.
- 2. Cao P, Duan G, Tu J, Jiang Q, Yang X, Li C. Blockchain-Based Process Quality Data Sharing Platform for Aviation Suppliers. IEEE Access. 2023;11:19007–19023.
- 3. Chongzhao L, Huang H. A Study on Influencing Factors of Local Government Data Sharing in China. Chinese Public Administration. 2019.
- 4. DeMedeiros K, Hendawi A, Alvarez M. A Survey of AI-Based Anomaly Detection in IoT and Sensor Networks. Sensors. 2023;23:1352. pmid:36772393
- 5. Abiodun O, Omolara O, Alawida M, Alkhawaldeh R, Arshad H. A Review on the Security of the Internet of Things: Challenges and Solutions. Wireless Personal Communications. 2021;119:1–35.
- 6. Tariq U, Ahmed I, Bashir A, Shaukat Dar K. A Critical Cybersecurity Analysis and Future Research Directions for the Internet of Things: A Comprehensive Review. Sensors. 2023;23:4117. pmid:37112457
- 7. Elouataoui W, Mendili SE, Gahi Y. Quality Anomaly Detection Using Predictive Techniques: An Extensive Big Data Quality Framework for Reliable Data Analysis. IEEE Access. 2023;11:103306–103318.
- 8. Mou S, Zhang H. Mining algorithm of accumulation sequence of unbalanced data based on probability matrix decomposition. PLOS ONE. 2023;18(7). pmid:37418443
- 9. Kato G, Yamongan J, Manao J, Arcega R, Espino R, Capili R, et al. In: Emerging Technologies in the Philippines: Internet of Things (IoT); 2022. p. 300–308.
- 10. Affia AA, Finch H, Jung W, Samori I, Potter L, Palmer XL. IoT Health Devices: Exploring Security Risks in the Connected Landscape. IoT. 2023;4:150–182.
- 11. Tehseen M, Bux TD, Al ST, Yasin GY, Inayatul H, Ullah I, et al. Analysis of IoT Security Challenges and Its Solutions Using Artificial Intelligence. Brain Sciences. 2023;13:683.
- 12. Gao F, Song S, Wang J. Time Series Data Cleaning under Multi-Speed Constraints. Int J Softw Informatics. 2021;11:29–54.
- 13. Fuentes Cabrera JG, Pérez Vicente HA, Maldonado S, Velasco J. Combination of unsupervised discretization methods for credit risk. PLOS ONE. 2023;. pmid:38011207
- 14. Elouataoui W, Alaoui IE, Gahi Y. In: Data Quality in the Era of Big Data: A Global Review. Springer International Publishing; 2022. p. 1–25.
- 15. Kothapalli M. The Challenges of Data Quality and Data Quality Assessment in the Big Data; 2023.
- 16. Merino J, Xie X, Parlikad A, Lewis I, McFarlane D. Impact of data quality in real-time big data systems. In: CEUR Workshop Proceedings. vol. 2716. CEUR-WS.org; 2020. Available from: https://doi.org/10.17863/CAM.59426.
- 17. Fosso Wamba S, Gunasekaran A, Akter S, Ren S, Dubey R, Childe S. Big data analytics and firm performance: Effect of dynamic capabilities. Journal of Business Research. 2016;70.
- 18. El Koshiry A, Eliwa E, Abd El-Hafeez T, Shams MY. Unlocking the power of blockchain in education: An overview of innovations and outcomes. Blockchain: Research and Applications. 2023;4(4):100165.
- 19. Badawy A, Fisteus JA, Mahmoud TM, El-Hafeez TA. Topic Extraction and Interactive Knowledge Graphs for Learning Resources. Sustainability. 2021;14(1):1–21.
- 20. Lotfy A, Zaki A, Abd El-Hafeez T, Mahmoud T. In: Privacy issues of public Wi-Fi networks; 2021. p. 656–665.
- 21. Barateiro J, Galhardas H. A Survey of Data Quality Tools. Datenbank-Spektrum. 2005;14:15–21.
- 22. Ballou DP, Pazer HL. Modeling Data and Process Quality in Multi-Input, Multi-Output Information Systems. Management Science. 1985;31:150–162.
- 23. Wang RY, Strong DM. Beyond Accuracy: What Data Quality Means to Data Consumers. J Manag Inf Syst. 1996;12:5–33.
- 24. Lahti L, Huovari J, Kainu M, Biecek P. Retrieval and Analysis of Eurostat Open Data with the eurostat Package. R J. 2017;9:385.
- 25. Álvarez-Martínez MT, Lopez-Cobo M. WIOD SAMs adjusted with Eurostat data for the EU-27. Economic Systems Research. 2018;30:521–544.
- 26. Cichy C, Rass S. An Overview of Data Quality Frameworks. IEEE Access. 2019;7:24634–24648.
- 27. Jain A, Patel H, Nagalapatti L, Gupta N, Mehta S, Guttula SC, et al. Overview and Importance of Data Quality for Machine Learning Tasks. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020.
- 28. Croft R, Babar MA, Kholoosi MM. Data Quality for Software Vulnerability Datasets. 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). 2023; p. 121–133.
- 29. Taleb I, Serhani M, Bouhaddioui C, Dssouli R. Big data quality framework: a holistic approach to continuous quality management. Journal of Big Data. 2021;8.
- 30. Figuerêdo JSL, Calumby RT. Unsupervised query-adaptive implicit subtopic discovery for diverse image retrieval based on intrinsic cluster quality. Multim Tools Appl. 2022;81(30):42991–43011.
- 31. Wang Y, Chen X, He B, Sun L. Contextual Interaction for Argument Post Quality Assessment. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; 2023.
- 32. Deshpande D, Sourati Z, Ilievski F, Morstatter F. Contextualizing Argument Quality Assessment with Relevant Knowledge. CoRR. 2023;abs/2305.12280.
- 33. Wang J, Liu Y, Li P, Lin Z, Sindakis S, Aggarwal S. Overview of Data Quality: Examining the Dimensions, Antecedents, and Impacts of Data Quality. Journal of the Knowledge Economy. 2023;15:1–20.
- 34. Miao X, Gao Y, Chen L, Peng H, Yin J, Li Q. Towards Query Pricing on Incomplete Data. IEEE Transactions on Knowledge and Data Engineering. 2022;34:4024–4036.
- 35. Cai H, Yang Y, Fan W, Xiao F, Zhu Y. Towards Correlated Data Trading for High-Dimensional Private Data. IEEE Transactions on Parallel and Distributed Systems. 2023;34:1047–1059.
- 36. Liu Y, Yu Z, Yuan X, Ke W, Fang Z, Du T, et al. Assessing Database Contribution via Distributed Tracing for Microservice Systems. Applied Sciences. 2022;.
- 37. Huang C, Zhang H, Liu X. Incentivizing Data Contribution in Cross-Silo Federated Learning. CoRR. 2022;abs/2203.03885.
- 38. Adams K, Spadea F, Flynn C, Seneviratne O. Assessing Scientific Contributions in Data Sharing Spaces. CoRR. 2023;abs/2303.10476.
- 39. Lv H. E-commerce consumer behavior analysis based on big data. J Comput Methods Sci Eng. 2023;23(2):651–661.
- 40. Wang C, Chi C, Yao L, Liew AW, Shen H. Interdependence analysis on heterogeneous data via behavior interior dimensions. Knowl Based Syst. 2023;279:110893.
- 41. Shahnaz A, Qamar U, Khalid A. Using Blockchain for Electronic Health Records. IEEE Access. 2019;7:147782–147795.
- 42. Chen L, Lee WK, Chang C, Choo KKR, Zhang N. Blockchain based searchable encryption for electronic health record sharing. Future Gener Comput Syst. 2019;95:420–429.
- 43. Yu Y, Ding Y, Zhao Y, Li Y, Zhao Y, Du X, et al. LRCoin: Leakage-Resilient Cryptocurrency Based on Bitcoin for Data Trading in IoT. IEEE Internet of Things Journal. 2018;6:4702–4710.
- 44. Yang Z, Yang K, Lei L, Zheng K, Leung VCM. Blockchain-Based Decentralized Trust Management in Vehicular Networks. IEEE Internet of Things Journal. 2019;6:1495–1505.
- 45. BBAA. Annual Report on Blockchain Development in China 2023; 2023. Available from: https://13115299.s21i.faiusr.com/61/1/ABUIABA9GAAg3JLppAYoi9_a8AY.pdf.
- 46. CAICT. Blockchain White Book 2023; 2023. Available from: http://www.caict.ac.cn/kxyj/qwfb/bps/202312/P020231207518702725959.pdf.
- 47. Huang C, Xue L, Liu D, Shen XS, Zhuang W, Sun R, et al. Blockchain-Assisted Transparent Cross-Domain Authorization and Authentication for Smart City. IEEE Internet of Things Journal. 2022;9:17194–17209.
- 48. Cai T, Chen W, Psannis KE, Goudos SK, Yu Y, Zheng Z, et al. Scalable On-Chain and Off-Chain Blockchain for Sharing Economy in Large-Scale Wireless Networks. IEEE Wireless Communications. 2022;29:32–38.
- 49. Hao Y, Piao C, Zhao Y, Jiang X. Privacy Preserving Government Data Sharing Based on Hyperledger Blockchain. In: IEEE International Conference on e-Business Engineering; 2019.
- 50. Ongaro D, Ousterhout J. In Search of an Understandable Consensus Algorithm. In: 2014 USENIX Annual Technical Conference (USENIX ATC 14). Philadelphia, PA; 2014. p. 305–319.
- 51. Castro M, Liskov B. Practical Byzantine fault tolerance. In: USENIX Symposium on Operating Systems Design and Implementation; 1999.
- 52. Elouataoui W, El Alaoui I, el Mendili S, Youssef G. An Advanced Big Data Quality Framework Based on Weighted Metrics. Big Data and Cognitive Computing. 2022;13.
- 53. Liu Y, Wang B, Liu X. Comprehensive assessment of cable-stayed bridge based on Pagerank algorithm. Advances in Bridge Engineering. 2023;4.
- 54. Yao Y, Cheng T, Li X, He Y, Yang F, Li T, et al. Link prediction based on the mutual information with high-order clustering structure of nodes in complex networks. Physica A: Statistical Mechanics and its Applications. 2022;610:128428.
- 55. Weng T, Zhou X, Fang Y, Tan L, Li K. Finding Top-k Important Edges on Bipartite Graphs: Ego-betweenness Centrality-based Approaches; 2023. p. 2415–2428.
- 56. Ma C, Fang Y, Cheng R, Lakshmanan L, Zhang W, Lin X. Efficient Algorithms for Densest Subgraph Discovery on Large Directed Graphs; 2020. p. 1051–1066.
- 57. Sun Z, Wu B, Chen Y, Ye Y. Learning From the Future: Light Cone Modeling for Sequential Recommendation. IEEE Transactions on Cybernetics. 2023;53(8):5358–5371. pmid:36417718
- 58. Elmezain M, Othman E, Ibrahim H. Temporal Degree-Degree and Closeness-Closeness: A New Centrality Metrics for Social Network Analysis. Mathematics. 2021;9:2850.
- 59. Jarumaneeroj P, Ramudhin A, Lawton J. A connectivity-based approach to evaluating port importance in the global container shipping network. Maritime Economics & Logistics. 2022;25.
- 60. Kim Yk, Go Mh, Lee K. Influence Through Cyber Capacity Building: Network Analysis of Assistance, Cooperation, and Agreements Among ASEAN Plus Three Countries. Berlin, Heidelberg: Springer-Verlag; 2023. Available from: https://doi.org/10.1007/978-3-031-25659-2_24.
- 61. Che-Castaldo J, Cousin R, Daryanto S, Deng G, Feng ML, Gupta R, et al. Critical Risk Indicators (CRIs) for the electric power grid: A survey and discussion of interconnected effects; 2021.
- 62. Agrawal R, Srikant R. Mining sequential patterns. In: Proceedings of the Eleventh International Conference on Data Engineering; 1995. p. 3–14.
- 63. Agrawal R, Srikant R. Fast Algorithms for Mining Association Rules; 1998.
- 64. Qu S, Li K, Fan Z, Wu S, Liu X, Huang Z. Behavior Pattern based Performance Evaluation in MOOCs; 2021.
- 65. Wu Y, Zhu C, Li Y, Guo L, Wu X. NetNCSP: Nonoverlapping closed sequential pattern mining. Knowledge-Based Systems. 2020;196:105812. pmid:32292248
- 66. Gao J, Sun Y, Liu W, Yang S. Predicting Traffic Congestions with Global Signatures Discovered by Frequent Pattern Mining. In: 2016 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData); 2016. p. 554–560.
- 67. Fournier-Viger P, Wu CW, Gomariz A, Tseng VS. VMSP: Efficient Vertical Mining of Maximal Sequential Patterns. In: Canadian Conference on AI; 2014.
- 68. Pei J. A Survey on Data Pricing: From Economics to Data Science. IEEE Transactions on Knowledge and Data Engineering. 2022;34(10):4586–4608.
- 69. Azcoitia SA, Laoutaris N. A Survey of Data Marketplaces and Their Business Models. SIGMOD Rec. 2022;51:18–29.
- 70. Xiao M, Li M, Zhang JJ. Locally Differentially Private Personal Data Markets Using Contextual Dynamic Pricing Mechanism. IEEE Transactions on Dependable and Secure Computing. 2023;20:5043–5055.
- 71. Li C, Li DY, Miklau G, Suciu D. A theory of pricing private data. ACM Transactions on Database Systems (TODS). 2014;39(4):1–28.
- 72. Mao J, Leme R, Schneider J. Contextual pricing for lipschitz buyers. Advances in Neural Information Processing Systems. 2018;31.
- 73. Ye P, Qian J, Chen J, Wu Ch, Zhou Y, De Mars S, et al. Customized regression model for airbnb dynamic pricing. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining; 2018. p. 932–940.
- 74. Sun J, Wu D. Ellipsoid Pricing Based Context-feature Mechanisms for Noisy Sensing Tasks. IEEE Internet of Things Journal. 2023;.
- 75. Niu C, Zheng Z, Wu F, Tang S, Gao X, Chen G. Unlocking the value of privacy: Trading aggregate statistics over private correlated data. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; 2018. p. 2031–2040.
- 76. Xu J, Wang YX. Logarithmic regret in feature-based dynamic pricing. Advances in Neural Information Processing Systems. 2021;34:13898–13910.
- 77. Niu C, Zheng Z, Wu F, Tang S, Chen G. Online pricing with reserve price constraint for personal data markets. IEEE Transactions on Knowledge and Data Engineering. 2020;34(4):1928–1943.
- 78. Amin K, Rostamizadeh A, Syed U. Repeated contextual auctions with strategic buyers. Advances in Neural Information Processing Systems. 2014;27.
- 79. Shah V, Johari R, Blanchet J. Semi-parametric dynamic contextual pricing. Advances in Neural Information Processing Systems. 2019;32.
- 80. Luo Y, Sun WW, Liu Y. Contextual Dynamic Pricing with Unknown Noise: Explore-then-UCB Strategy and Improved Regrets. Advances in Neural Information Processing Systems. 2022;35:37445–37457.
- 81. Cai H, Ye F, Yang Y, Zhu Y, Li J, Xiao F. Online pricing and trading of private data in correlated queries. IEEE Transactions on Parallel and Distributed Systems. 2021;33(3):569–585.
- 82. Fournier-Viger P, Gomariz A, Šebek M, Hlosta M. VGEN: Fast Vertical Mining of Sequential Generator Patterns. In: International Conference on Data Warehousing and Knowledge Discovery; 2014.
- 83. Fournier-Viger P, Wu CW, Tseng VS. Mining Maximal Sequential Patterns without Candidate Maintenance. In: International Conference on Advanced Data Mining and Applications; 2013.
- 84. Lin NP, Hao WH, Chen HJ, Chueh HE, Chang CI. Fast mining maximal sequential patterns; 2007.
- 85. Bonacich P. Factoring and weighting approaches to status scores and clique identification. Journal of Mathematical Sociology. 1972;2:113–120.
- 86. Mettler MM. Blockchain technology in healthcare: The revolution starts here. In: 2016 IEEE 18th International Conference on e-Health Networking, Applications and Services (Healthcom); 2016. p. 1–3.
- 87. Saari A, Junnila S, Vimpari J. Blockchain’s Grand Promise for the Real Estate Sector: A Systematic Review. Applied Sciences. 2022;12:11940.
- 88. Zhang D, Han X, Deng C. Review on the research and practice of deep learning and reinforcement learning in smart grids. CSEE Journal of Power and Energy Systems. 2018;4(3):362–370.
- 89. White G. Future Applications of Blockchain in Business and Management: A Delphi study. Strategic Change. 2017;.
- 90. Chen G, Xu B, Lu M, Chen NS. Exploring blockchain technology and its potential applications for education. Smart Learning Environments. 2018;5.
- 91. Sharples M, Domingue J. The Blockchain and Kudos: A Distributed System for Educational Record, Reputation and Reward. vol. 9891; 2016. p. 490–496.
- 92. Ahmad RW, Salah K, Jayaraman R, Yaqoob I, Omar M. Blockchain for Waste Management in Smart Cities: A Survey. IEEE Access. 2021;9:131520–131541.
- 93. Fournier-Viger P, Gomariz A, Campos M, Thomas R. Fast Vertical Mining of Sequential Patterns Using Co-occurrence Information. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining; 2014.