
A knowledge graph construction method for compliance review of water conservancy project reports

  • Zelin Ding ,

    Contributed equally to this work with: Zelin Ding, Zhefei Fan, Yuanfeng Hao

    Roles Conceptualization, Funding acquisition, Methodology, Resources

    Affiliation North China University of Water Resources and Electric Power, Zhengzhou City, Henan Province, China

  • Zhefei Fan ,

    Contributed equally to this work with: Zelin Ding, Zhefei Fan, Yuanfeng Hao

    Roles Investigation, Methodology, Project administration, Resources, Software

    2450365632@qq.com

    Affiliation North China University of Water Resources and Electric Power, Zhengzhou City, Henan Province, China

  • Yuanfeng Hao ,

    Contributed equally to this work with: Zelin Ding, Zhefei Fan, Yuanfeng Hao

    Roles Methodology

    Affiliation Henan Water & Power Engineering Consulting CO., Ltd, Zhengzhou City, China

  • Tao Wang ,

    Roles Conceptualization, Project administration

    ‡ TW, XD and XZ also contributed equally to this work.

    Affiliation North China University of Water Resources and Electric Power, Zhengzhou City, Henan Province, China

  • Xin Du ,

    Roles Investigation

    ‡ TW, XD and XZ also contributed equally to this work.

    Affiliation North China University of Water Resources and Electric Power, Zhengzhou City, Henan Province, China

  • Xinhang Zhang

    Roles Conceptualization

    ‡ TW, XD and XZ also contributed equally to this work.

    Affiliation North China University of Water Resources and Electric Power, Zhengzhou City, Henan Province, China

Abstract

To break through the efficiency and accuracy bottlenecks of the manual mode in the compliance review of water conservancy project reports and promote the digital transformation of “Smart Water Conservancy”, this paper proposes a knowledge graph construction method for the compliance review of water conservancy project reports. Firstly, based on natural language processing technology, the BERT-BiLSTM-CRF model is used for named entity recognition to accurately locate key entities such as engineering parameters and normative clauses. Secondly, a context-free grammar (CFG) is used to parse the logical relationships between entities, and the normative clauses are transformed into “head entity + relationship + tail entity” triples through a semantic label system, achieving a structured expression of knowledge in the water conservancy field. Finally, the Neo4j graph database is used to store the knowledge graph, and the Py2neo toolkit is used to complete the efficient import and dynamic update of triple data. The research takes the actual review of water conservancy project reports as a case to verify the feasibility of the method. The case study shows that this method effectively improves the efficiency and accuracy of the compliance review of water conservancy project reports, providing technical support and a practical reference for the digital transformation of water conservancy projects, and is of great significance for promoting the intelligent development of the water conservancy industry.

Introduction

Against the backdrop of the accelerated promotion of the “Smart Water Conservancy” strategy and the national digital transformation, the compliance review of water conservancy engineering reports, as the core link of engineering construction quality control, still relies on the traditional mode of manually comparing normative provisions word by word, and faces prominent problems such as low efficiency, strong subjectivity, and difficulty in accurately capturing complex logical relationships [1]. With the expansion of water conservancy projects and the growing complexity of regulatory systems, traditional review models can no longer meet the needs of intelligent and precise management, and there is an urgent need to introduce artificial intelligence technology to break through these bottlenecks [2].

In recent years, with the rapid development of artificial intelligence in the field of natural language, technologies such as knowledge graphs for processing unstructured, correlated data have been widely applied [3]. In 2012, Google released a large-scale knowledge graph, which sparked a wave of knowledge graph research and application; various knowledge graphs such as DBpedia, YAGO, NELL, and ArnetMiner emerged one after another [4]. As research advanced, knowledge graphs in the deep learning era began to incorporate neural networks, such as Bordes et al.’s structured embedding of knowledge bases and Socher et al.’s Neural Tensor Network (NTN), which embed entities and relation words for node representation and triple judgment, achieving significant results in knowledge graph completion tasks [5]. More recently, thanks to advances in natural language processing, pre-trained models such as BERT have improved text understanding and retrieval capabilities, making it possible to understand and reason over raw text; for example, Chen et al.’s DrQA directly extracts answers to questions from text [6]. In the field of hydraulic engineering, research on knowledge graph construction is continually deepening. Although existing problems have been identified and research frameworks have been proposed, there is relatively little direct application of knowledge graph technology to the compliance review of hydraulic engineering reports. The knowledge graph, as a semantic modeling technology based on a graph structure, integrates fragmented normative knowledge into a structured network through the “entity-relationship-attribute” triple system, providing a new path for solving the above problems.

Therefore, this article first preprocesses the report data, and then uses natural language processing techniques to identify entities with the BERT-BiLSTM-CRF model. Combined with context-free grammar parsing, the article proposes seven semantic role labels to organize the logical structure of the text, converts the rules into knowledge graph triples, and finally constructs a review knowledge graph using Neo4j and Py2neo. This research innovates the knowledge extraction and application mode, realizes intelligent retrieval of normative clauses and automatic comparison of drawings, effectively improves review efficiency, and promotes the intelligent development of water conservancy engineering [7].

Research method

Semantic syntactic joint parsing method

In the field of deep learning, the BERT-BiLSTM-CRF model is a widely used architecture in natural language processing, performing well in tasks such as text classification and named entity recognition. Its network structure is divided into three modules: BERT, the Bidirectional Long Short-Term Memory network (BiLSTM), and the Conditional Random Field (CRF). The working principle is to extract word vectors as basic features through a BERT pre-trained model, input them into the BiLSTM for deep feature extraction, and then decode and predict the optimal annotation sequence through the CRF module [8]. An example of sample division for water conservancy standards is shown in Fig 1.

Fig 1. Example of Sample Division for Water Conservancy Standards.

https://doi.org/10.1371/journal.pone.0339575.g001

  1. (1) BERT adopts a Transformer architecture based on self-attention, whose encoder achieves deep representation learning of input sequences through self-attention layers. In the calculation, the embedding vector of each token is mapped to a query matrix Q, key matrix K, and value matrix V, and the correlation weights between tokens are calculated using the attention function shown in formula (1). To prevent numerical instability in the attention scores as the embedding dimension dk increases, a dimension scaling factor is introduced. The attention weights are then normalized using the Softmax function and weighted with the value matrix to obtain token representations that integrate global contextual information.
\[ \mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{1} \]

Building on the self-attention mechanism, BERT further employs a multi-head attention mechanism. The calculation formulas are as follows.

\[ \mathrm{head}_i=\mathrm{Attention}(QW_i^{Q},\,KW_i^{K},\,VW_i^{V}) \tag{2} \]
\[ \mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_h)W^{O} \tag{3} \]
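The scaled dot-product and multi-head attention computations of formulas (1)-(3) can be sketched in NumPy. The toy dimensions (5 tokens, embedding size 8, 2 heads) and random weights below are illustrative only, not BERT's actual parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax, as used to normalize attention scores
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # formula (1): scale by sqrt(d_k) to keep the scores numerically stable
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head(X, Wq, Wk, Wv, Wo):
    # formulas (2)-(3): project into h heads, attend, then concatenate
    heads = [attention(X @ wq, X @ wk, X @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                       # 5 tokens, embedding dim 8
Wq = [rng.normal(size=(8, 4)) for _ in range(2)]  # 2 heads, head dim 4
Wk = [rng.normal(size=(8, 4)) for _ in range(2)]
Wv = [rng.normal(size=(8, 4)) for _ in range(2)]
Wo = rng.normal(size=(8, 8))
out = multi_head(X, Wq, Wk, Wv, Wo)
print(out.shape)  # one context-aware vector per token
```

Each row of the output fuses information from all tokens, which is what gives BERT its global contextual representation.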
  2. (2) LSTM (Long Short-Term Memory network) is an improvement on the traditional RNN that alleviates the gradient vanishing and explosion problems of RNN training, uses gating to achieve long-term memory, and avoids forgetting important features. Based on the RNN, it introduces three gating mechanisms for each hidden-layer neuron, selectively remembering and forgetting feature information, as shown in formulas (4)-(9). However, an LSTM can only capture unidirectional sequence dependencies. Therefore, researchers proposed BiLSTM, which consists of two independent LSTM layers: the forward and backward LSTMs process the sequence in positive and reverse order, respectively, to capture contextual information from both directions.
\[ f_t=\sigma\left(W_f\cdot[h_{t-1},x_t]+b_f\right) \tag{4} \]
\[ i_t=\sigma\left(W_i\cdot[h_{t-1},x_t]+b_i\right) \tag{5} \]
\[ \tilde{C}_t=\tanh\left(W_C\cdot[h_{t-1},x_t]+b_C\right) \tag{6} \]
\[ C_t=f_t\odot C_{t-1}+i_t\odot\tilde{C}_t \tag{7} \]
\[ o_t=\sigma\left(W_o\cdot[h_{t-1},x_t]+b_o\right) \tag{8} \]
\[ h_t=o_t\odot\tanh(C_t) \tag{9} \]
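One LSTM time step can be sketched in NumPy as below. Stacking the four gate projections into a single weight matrix W, and the toy dimensions, are my own simplifications for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W maps [h_{t-1}; x_t] to the four gate pre-activations (forget, input,
    # candidate, output), mirroring the standard LSTM gating equations
    z = np.concatenate([h_prev, x_t]) @ W + b
    H = h_prev.shape[0]
    f = sigmoid(z[0:H])        # forget gate
    i = sigmoid(z[H:2 * H])    # input gate
    g = np.tanh(z[2 * H:3 * H])  # candidate cell state
    o = sigmoid(z[3 * H:4 * H])  # output gate
    c = f * c_prev + i * g     # new cell state: selective remember/forget
    h = o * np.tanh(c)         # new hidden state
    return h, c

rng = np.random.default_rng(1)
D, H = 4, 3                    # input dim and hidden dim (illustrative)
W = rng.normal(size=(H + D, 4 * H))
b = np.zeros(4 * H)
h = np.zeros(H)
c = np.zeros(H)
for x_t in rng.normal(size=(6, D)):  # run one direction over 6 time steps
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape)
```

A BiLSTM simply runs a second pass over the reversed sequence with separate weights and concatenates the two hidden states at each position.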
  3. (3) In sequence annotation tasks, although BiLSTM can capture long-distance text dependencies, it predicts token labels independently and cannot explicitly model label transition constraints. The Conditional Random Field (CRF) introduces a label transition probability matrix and utilizes the dependency between adjacent labels to output the globally optimal annotation sequence, which compensates for this shortcoming of BiLSTM. After receiving the output scores of the BiLSTM, the CRF generates the most probable prediction sequence that satisfies the transition constraints. For a sequence X = (x1, x2, ..., xn), let the BiLSTM output score matrix P be of size n × k (n is the number of tokens, k is the number of labels), where Pij is the score of the j-th label for the i-th token. For a predicted sequence Y = (y1, y2, ..., yn), the score function is:
\[ s(X,Y)=\sum_{i=0}^{n}A_{y_i,y_{i+1}}+\sum_{i=1}^{n}P_{i,y_i} \tag{10} \]

Here A is the transition score matrix, where Aij is the score for transitioning from label i to label j; A has dimension (k + 2) × (k + 2) because start and end labels are added to the label set. The probability of generating the predicted sequence Y is:

\[ P(Y\mid X)=\frac{e^{s(X,Y)}}{\sum_{\tilde{Y}\in Y_X}e^{s(X,\tilde{Y})}} \tag{11} \]

Taking the logarithm of both sides of equation (11) yields the log-likelihood of the predicted sequence:

\[ \log P(Y\mid X)=s(X,Y)-\log\sum_{\tilde{Y}\in Y_X}e^{s(X,\tilde{Y})} \tag{12} \]

In the formula, Y represents the real annotation sequence and YX represents the set of all possible annotation sequences. The output sequence with the highest score is obtained by decoding:

\[ Y^{*}=\arg\max_{\tilde{Y}\in Y_X}s(X,\tilde{Y}) \tag{13} \]

After completing dynamic semantic annotation and entity extraction, the text has been transformed into a sequence of entities with clear semantic labels, but the logical relationships between entities are still hidden in the sequence structure and have not yet formed a structured logical expression. To further reveal the hierarchical relationships and semantic constraints between entities, a syntax tree must be constructed to transform the linear text into a tree-like logical structure, so that the multi-level “object-attribute-constraint” relationships can be explicitly presented. Context-free grammar (CFG) is a formal language description method that uses formal rules to define valid strings. Each production rule has a single non-terminal symbol on its left-hand side, which ensures the context independence of syntactic derivation; CFG is widely used in fields such as natural language processing [9]. A CFG is defined as a quadruple G = (V, T, P, S), where V and T are the sets of non-terminal and terminal symbols respectively, P is the set of production rules, and S is the start symbol. Non-terminal symbols represent abstract components, while terminal symbols are concrete instances. Production rules derive strings from non-terminal symbols, and a syntax tree visually displays the derivation process: it consists of a root (the start symbol), interior nodes (non-terminal symbols), and leaf nodes (terminal symbols). When designing a grammar, precedence should be clearly defined, and hierarchical grouping can reflect precedence relationships. An example of a CFG syntax tree is shown in Fig 2.

Fig 2. Example of a CFG syntax tree.

https://doi.org/10.1371/journal.pone.0339575.g002
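As a toy illustration of the quadruple G = (V, T, P, S), the grammar below encodes a simplified clause skeleton with a naive top-down recognizer. The terminal tokens are hypothetical placeholders, not the paper's actual grammar:

```python
# G = (V, T, P, S): non-terminals, terminals, productions, start symbol
V = {"S", "OBJ", "PROP", "CMP", "RPROP"}
T = {"drainage_pipe", "inner_diameter", "greater_than", "0.2m"}
P = {
    "S":     [["OBJ", "PROP", "CMP", "RPROP"]],
    "OBJ":   [["drainage_pipe"]],
    "PROP":  [["inner_diameter"]],
    "CMP":   [["greater_than"]],
    "RPROP": [["0.2m"]],
}
S = "S"

def derives(symbols, tokens):
    # can the sentential form `symbols` derive exactly the token list?
    if not symbols:
        return not tokens
    head, rest = symbols[0], symbols[1:]
    if head in T:  # terminal: must match the next token
        return bool(tokens) and tokens[0] == head and derives(rest, tokens[1:])
    # non-terminal: try each production rule (context independence)
    return any(derives(rhs + rest, tokens) for rhs in P[head])

print(derives([S], ["drainage_pipe", "inner_diameter", "greater_than", "0.2m"]))
```

The recursion trace of a successful derivation corresponds exactly to the syntax tree: S at the root, the non-terminals as interior nodes, and the matched tokens as leaves.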

Rule extraction integration

The language used to describe standard articles mainly comprises conditional sentences and compound sentences. Conditional sentences occur in both simple and compound forms; compound sentences consist of multiple simple sentences, among which main-subordinate compound sentences are most common. These sentence structures all contain four elements: the inspected object, its attribute, a comparison operation, and a requirement limit. They can be marked with the semantic tags obj, prop, cmp, and rprop, parsed into the structure “the attribute (prop) of the object (obj) should satisfy the requirement (rprop) through the relationship (cmp)”, and represented as a tree.

There are two ways to extend complex specification statements:

  1. (1) Compound structure. For example, “The bottom elevation of the beam-slab of the service bridge should be 0.5m higher than the highest flood level”. Here, a tag sobj can be added to indicate that “beam-slab” is the upper-level object of “obj”. Moreover, sobj can be reused, where the sobj on the right is the sub-object of the sobj on the left, as shown in Fig 3 (a). In addition, rprop may itself be a lower-level attribute of an object. For instance, in “The top elevation of the sluice should not be lower than the top elevation of the flood dike”, the “top elevation” indicating the required limit value is an attribute of “flood dike”. Therefore, a tag robj can be added to mark “flood dike” as the upper-level object of rprop, as shown in Fig 3 (b).
  2. (2) Conditional structure. As shown in Fig 3 (c), in the sentence “If the loess foundation has no filter layer, the seepage path coefficient shall not be less than 4”, “has no filter layer” does not indicate a required limit value. Instead, it means the subsequent review is applied only when this condition is met, similar to the “If” part of an If-Then structure. Therefore, a tag arprop can be added to mark such elements representing “If”-type preconditions. Meanwhile, as with rprop, robj can be used to mark the upper-level object of arprop.

In summary, this article proposes seven semantic tags to represent the different semantic roles of words (or phrases) in normative provisions. Among them, obj, sobj, and prop represent the elements to be checked in the drawing; obj can be a child node of sobj and the parent node of prop. cmp is the comparison or existence relationship between the element to be checked (prop) and the required conditions rprop/arprop. rprop and arprop represent two types of requirement conditions: rprop is a constraint requirement applied to prop, while arprop is a precondition applied to prop. The explanations of each semantic tag are shown in Table 1.

Each semantic tag can represent an entity, and the relationship between entities can be defined to form a triple structure of “head entity+relationship+tail entity”. It should be noted that cmp itself has the meaning of “relationship”, so it can be used as the relationship between prop and rprop/arprop to form the “prop+cmp+rprop” structure. Furthermore, based on the meanings of other semantic tags, three other basic relationships can be defined, namely inclusion, existence, and condition. The basic relationship types and examples are shown in Table 2.
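The conversion of tagged elements into “head entity + relationship + tail entity” triples can be sketched as follows. The relation names includes/has_attribute/condition are my placeholders for the paper's inclusion, existence, and condition relations, and the tagged values are illustrative:

```python
# tagged elements from the NER stage for: "The inner diameter of the
# drainage pipe should be greater than 0.2 m" (values are illustrative)
tagged = {"obj": "drainage pipe", "prop": "inner diameter",
          "cmp": "greater than", "rprop": "0.2 m"}

def to_triples(t):
    triples = []
    if "sobj" in t:                                        # inclusion relation
        triples.append((t["sobj"], "includes", t["obj"]))
    triples.append((t["obj"], "has_attribute", t["prop"])) # existence relation
    triples.append((t["prop"], t["cmp"], t["rprop"]))      # cmp is itself the relation
    if "arprop" in t:                                      # condition relation
        triples.append((t["prop"], "condition", t["arprop"]))
    return triples

for h, r, t_ in to_triples(tagged):
    print(f"({h}) -[{r}]-> ({t_})")
```

Note that cmp needs no separate relation name: because it already carries relational meaning, it becomes the edge label between prop and rprop directly.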

In summary, the following method has been developed to extract review rules from normative provisions: first, semantic annotation of the normative provisions is performed using the trained BERT-BiLSTM-CRF named entity recognition model; next, the CFG analysis method is used to parse the annotated normative clauses, with the parsing results represented as a tree structure, achieving a formal expression of the clause sentences; finally, the tree-structured statements are mapped through matching to the corresponding nodes and relationships in the knowledge graph, as shown in Fig 4.

Fig 4. Schematic diagram of knowledge extraction method for review.

https://doi.org/10.1371/journal.pone.0339575.g004
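The final mapping of triples into Neo4j can be sketched by generating Cypher MERGE statements, which Py2neo's graph.run() (or the Neo4j browser) could then execute. The Entity node label and name property are assumptions for illustration:

```python
def cypher_merge(head, rel, tail):
    # MERGE is idempotent: re-importing the same triple creates no duplicates,
    # which supports the dynamic updating of the graph
    rel_type = rel.upper().replace(" ", "_")  # Cypher relationship types
    return (f'MERGE (h:Entity {{name: "{head}"}}) '
            f'MERGE (t:Entity {{name: "{tail}"}}) '
            f'MERGE (h)-[:{rel_type}]->(t)')

triples = [("drainage pipe", "has_attribute", "inner diameter"),
           ("inner diameter", "greater than", "0.2 m")]
for h, r, t in triples:
    print(cypher_merge(h, r, t))
```

With Py2neo this would be wrapped as roughly `Graph("bolt://localhost:7687", auth=(...)).run(stmt)` per statement; connection details depend on the local Neo4j deployment.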

Recognition model training

Data preprocessing

The specifications described in Table 3 are selected as the training data for the semantic annotation model. Before training, the specification files must be preprocessed into text data [10,11], which mainly involves the following steps:

Table 3. Upper and lower limits of safety heightening values for water gates (Unit: m).

https://doi.org/10.1371/journal.pone.0339575.t003

  1. (1) Remove content unrelated to the article and only retain the section containing the review rules. For example, deleting contents such as table of contents, preface, and numbering from the specifications.
  2. (2) For some review articles presented in table form, the content of the table should be organized and converted into corresponding simple sentences. For example, in Table 3, it can be converted into rule texts with simple sentence structures such as “The safe increase value of the normal water level when the third level water gate blocks water is not less than 0.4m” and “The safe increase value of the highest water level when the fourth and fifth level water gates block water is not less than 0.2m”.
  3. (3) Break compound-sentence clauses down into simple clauses. For example, the provision “The minimum thickness of concrete or reinforced concrete pavement should be greater than 0.4m, and the permanent seam spacing along the water direction can be 8-20m” can first be divided into “The minimum thickness of concrete or reinforced concrete pavement should be greater than 0.4m” and “The permanent seam spacing along the water direction of concrete or reinforced concrete pavement can be 8-20m”, and then further divided into “The minimum thickness of concrete pavement should be greater than 0.4m”, “The minimum thickness of reinforced concrete pavement should be greater than 0.4m”, and “The permanent seam spacing along the water direction of pavement can be 8-20m”. When preprocessing normative texts, the simpler the split provisions are, the easier the subsequent knowledge extraction becomes.
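The decomposition step above can be sketched with hand-written heuristics; the two regular expressions below only cover this particular sentence pattern, so they are an illustration rather than a general splitter:

```python
import itertools
import re

def split_compound(clause):
    # split a compound provision on ", and " into simple clauses (heuristic)
    return [c.strip() for c in re.split(r",\s*and\s+", clause)]

def expand_or(clause):
    # duplicate a clause for each alternative in an "X or Y pavement" subject
    m = re.search(r"(\w+) or ([\w ]+?) pavement", clause)
    if not m:
        return [clause]
    return [clause.replace(m.group(0), f"{alt} pavement")
            for alt in (m.group(1), m.group(2))]

text = ("The minimum thickness of concrete or reinforced concrete pavement "
        "should be greater than 0.4m, and the permanent seam spacing along "
        "the water direction can be used between 8-20m")
simple = [s for c in split_compound(text) for s in expand_or(c)]
for s in simple:
    print(s)
```

Real provisions would need more robust syntactic splitting, but the principle holds: every compound provision becomes several independently checkable simple rules.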

Semantic annotation

Semantic annotation refers to the process of assigning semantic labels, manually or automatically, to the preprocessed water conservancy engineering specification data, clarifying semantic information such as entities, attributes, relationships, and logical conditions in the text. Named entity recognition methods based on deep learning can essentially be treated as a sequence labeling problem. Sequence labeling assigns a label to each element in a given text sequence based on its semantic and grammatical features, with the aim of constructing the corpus required for model training. Based on the analysis of the sluice engineering specifications using the seven types of semantic labels defined in the previous sections, this study adopts the BIO annotation scheme (for an entity type X, B-X marks the beginning of the entity, I-X marks the middle or end of the entity, and O marks non-entity tokens) to perform semantic annotation on the specification provisions. Fig 5 shows an example of BIO annotation, and a text annotation example with the corresponding JSON text is presented in Fig 6.

Fig 6. Text annotation example and corresponding JSON text.

https://doi.org/10.1371/journal.pone.0339575.g006
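The BIO scheme can be sketched by converting token-level entity spans to tags. The tokenization and spans below are illustrative, not taken from the paper's annotated corpus:

```python
def bio_tags(tokens, entities):
    # entities: list of (start_token, end_token_exclusive, label) spans
    tags = ["O"] * len(tokens)
    for s, e, label in entities:
        tags[s] = f"B-{label}"              # entity beginning
        for i in range(s + 1, e):
            tags[i] = f"I-{label}"          # entity middle/end
    return tags

# illustrative tokenization of a simplified provision
tokens = ["drainage", "pipe", "inner", "diameter", "greater", "than", "0.2m"]
spans = [(0, 2, "obj"), (2, 4, "prop"), (4, 6, "cmp"), (6, 7, "rprop")]
print(list(zip(tokens, bio_tags(tokens, spans))))
```

Each (token, tag) pair is one line of the training corpus, which is exactly the input format the BERT-BiLSTM-CRF sequence labeler consumes.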

The 820 annotated statements are randomly divided, with stratification, in an 8:2 ratio: 80% form the training set (Train) with 656 statements, and 20% form the validation set (Valid) with 164 statements. The training set is used to train and update the deep learning model, while the validation set is used to test model performance. It should be noted that when randomly dividing the statements, the training/validation ratio of each label should be kept as close as possible to 8:2. Table 4 shows the number and proportion of each BIO semantic label in the randomly partitioned training and validation sets. It can be seen that the training/validation ratio of each label is close to 8:2, which basically meets the requirements of model training.
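Label-stratified splitting can be sketched in pure Python; the toy samples and the 8:2 ratio follow the text, everything else (grouping key, seed) is illustrative:

```python
import random
from collections import defaultdict

def stratified_split(samples, ratio=0.8, seed=42):
    # group by label, shuffle each group, and cut at `ratio`, so every
    # label keeps a train/valid proportion close to 8:2
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))
    rng = random.Random(seed)
    train, valid = [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = round(len(group) * ratio)
        train += group[:cut]
        valid += group[cut:]
    return train, valid

samples = [(f"clause {i}", lab)
           for lab in ("obj", "prop", "rprop") for i in range(10)]
train, valid = stratified_split(samples)
print(len(train), len(valid))
```

Stratifying per label is what keeps rare tags such as arprop represented in both splits, which a purely random split would not guarantee.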

Training results

Semantic annotation based on deep learning requires setting the learning rate and batch size. The learning rate is a key hyperparameter in machine learning that controls the step size of model parameter updates; it determines the magnitude of parameter adjustment based on the gradient of the loss function in each iteration, and a suitable learning rate balances convergence speed and stability, forming an important foundation for efficient model training. The batch size is the number of training samples used for each parameter update. In practice, batch size and learning rate often need to be tuned jointly to optimize training effectiveness and efficiency. For the task of semantic annotation of normative knowledge, the BERT-BiLSTM-CRF named entity recognition model is adopted, with parameter settings shown in Table 5. Based on previous experimental studies [12-14], the two hyperparameters, learning rate and batch size, were selected for ablation experiments. The evaluation criteria are the same as above, and the experiments run on a 64-bit Windows system with an NVIDIA GeForce RTX 4090 GPU. The hyperparameter settings of the experiments and the running results of the model on the validation set are shown in Table 6.

Table 5. Named entity recognition model parameter settings.

https://doi.org/10.1371/journal.pone.0339575.t005

Table 6. Hyperparameter settings and results of the ablation experiments.

https://doi.org/10.1371/journal.pone.0339575.t006

After testing, the model showed the best performance after 15 training epochs, with the optimal hyperparameter combination being “learning rate = 3e-5” and “batch size = 4”. With this parameter combination, the semantic annotation results of the model on the validation set are shown in Table 7, and the corresponding confusion matrix is shown in Fig 7.

Table 7. Semantic annotation results of the model on the validation set.

https://doi.org/10.1371/journal.pone.0339575.t007

Fig 7. Confusion matrix of semantic annotation results on verification set.

https://doi.org/10.1371/journal.pone.0339575.g007

Based on the above results, the overall precision, recall, and F1-score of the model are 86.6%, 87.2%, and 86.8%, respectively. Among the tags, only cmp, rprop, and robj achieved F1-scores above 90%, while the F1-scores of the other semantic tags were relatively low, ranging from 77.6% to 89.4%. This result indicates that deep-learning-based semantic annotation can be applied to large-scale annotation of long and complex sentences and can obtain relatively accurate results. In addition, the small dataset used in this study may not fully exploit the capacity of deep learning. Overall, deep-learning-based semantic annotation currently achieves satisfactory results and has considerable potential for further improvement.
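The per-label metrics reported above can be derived directly from a confusion matrix as follows; the 3-label matrix values here are illustrative, not the paper's actual counts:

```python
import numpy as np

def per_label_metrics(cm):
    # cm[i, j]: count of gold label i predicted as label j
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)          # correct / all predicted as label
    recall = tp / cm.sum(axis=1)             # correct / all gold of label
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# illustrative 3-label confusion matrix
cm = np.array([[90,  5,  5],
               [ 4, 80,  6],
               [ 6, 10, 84]])
p, r, f1 = per_label_metrics(cm)
print(np.round(f1, 3))
```

For span-level NER evaluation (where a prediction counts only if both boundary and label match), a dedicated library such as seqeval is typically used instead of token-level counts.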

Model verification experiments and innovation demonstration

To clarify the innovation and superiority of the proposed BERT-BiLSTM-CRF model in the semantic annotation task of water conservancy specifications, this section designs baseline comparison experiments and ablation experiments based on the annotated dataset (656 samples in the training set and 164 samples in the test set) constructed in the previous sections and the determined optimal hyperparameters (learning rate = 3e-5, batch size = 4), verifying the innovation of the model from two dimensions: “horizontal model comparison” and “vertical module contribution”.

Basic experimental settings

(1) Dataset and evaluation metrics.

This study adopts the annotated dataset and BIO annotation scheme from the previous sections, focusing on the recognition performance of the 7 core semantic labels (sobj, obj, prop, cmp, rprop, robj, arprop). The evaluation metrics include Precision, Recall, and F1-score, supplemented by the accuracy of hydraulic professional entity recognition (the recognition accuracy of domain-specific entities such as “sluice crest elevation” and “seepage path coefficient”) to ensure that the evaluation dimensions meet the needs of water conservancy engineering scenarios.

(2) Experimental environment and parameter benchmark.

The experimental environment is consistent with that in the previous sections (64-bit Windows system, NVIDIA GeForce RTX 4090 graphics card). The basic model parameters adopt the optimal configuration in Table 5 (LSTM-size = 128, epoch = 15, max_seq_length = 128, dropout_rate = 0.5), only adjusting the model structure to construct baseline models and ablation groups to ensure the fairness and comparability of the experiments.

Baseline model comparison experiments

Three types of representative models are selected as baselines to verify the innovation of the proposed model’s “BERT pre-training + BiLSTM bidirectional modeling + CRF label constraint” combination. Performance tests of these baseline models are conducted on the test set, with the results shown in Table 8:

Table 8. Performance Comparison between the Proposed Model and Baseline Models.

https://doi.org/10.1371/journal.pone.0339575.t008

The F1-score of the proposed model is 9.2 percentage points higher than that of BERT (Baseline 1). This indicates that the bidirectional gating mechanism of BiLSTM can effectively capture the long-distance dependencies between “entity-attribute-constraint” in water conservancy specifications (e.g., the association between “normal water storage level safety margin” and “0.4m”), while the label transition constraint of the CRF layer reduces the mislabeling rate of “arprop (prerequisite condition) and rprop (core constraint)” from 17.8% to 6.9%.

The F1-score of the proposed model is 5.7 percentage points higher than that of BERT-LSTM (Baseline 2), confirming that BiLSTM is better suited to the semantic logic of complex sentences in water conservancy specifications (e.g., “If the loess foundation has no filter layer, the seepage path coefficient shall not be less than 4”). It can simultaneously cover the association between the preceding condition and the subsequent constraint, and the recognition accuracy of the conditional label arprop is increased by 7.2 percentage points.

The F1-score of the proposed model is 7.4 percentage points higher than that of BiLSTM-CRF (Baseline 3), reflecting the advantage of BERT pre-trained semantic representation. The contextual information learned from massive texts can accurately recognize professional terms such as “seepage path coefficient” and “filter layer”, avoiding the ambiguity of entity boundaries caused by randomly initialized word vectors (e.g., the mis-splitting rate of “traffic bridge beam bottom elevation” is reduced from 11.5% to 2.8%).

Core module ablation experiments

To verify the necessity and collaborative value of the three core modules (BERT pre-training layer, BiLSTM bidirectional layer, and CRF layer) in the proposed model, three ablation groups are designed. The contribution of each module is quantified by “removing core modules and observing performance changes”.

The results of the ablation experiments are shown in Table 9:

Table 9. Results of Core Module Ablation Experiments for the Proposed Model.

https://doi.org/10.1371/journal.pone.0339575.t009

The F1-score of Ablation 1 (Without BERT) decreases by 7.7 percentage points, and the average accuracy of key labels decreases by 8.3 percentage points. This indicates that the BERT pre-training layer provides an accurate semantic foundation for the recognition of hydraulic professional terms, which is the core to solving the problem of “domain entity ambiguity”.

The F1-score of Ablation 2 (Without BiLSTM) decreases by 5.7 percentage points, and the average accuracy of key labels decreases by 5.9 percentage points. This confirms that the bidirectional modeling of BiLSTM can effectively capture cross-phrase semantic dependencies in water conservancy specifications (e.g., the conditional association between “Level 3 sluice water retaining” and “safety margin”).

The F1-score of Ablation 3 (Without CRF) decreases by 4.1 percentage points, and the average accuracy of key labels decreases by 4.7 percentage points. This reflects that the label transition constraint of the CRF layer ensures the logical consistency of semantic labels, reducing mislabeling between “rprop and robj” and “arprop and obj”.

The above results show that the three core modules of the proposed model are not simply superimposed, but are innovatively adapted to the characteristics of water conservancy specifications (many professional terms, many long sentences, and strong logical constraints). The three modules collaboratively achieve “accurate semantic representation, comprehensive sequence modeling, and logical label constraint”, collectively supporting the model’s performance to be superior to baseline models with single modules or simplified structures.

Experimental conclusions and innovation summary

Through baseline comparison and ablation experiments, the innovations of the proposed BERT-BiLSTM-CRF model are clarified as follows:

  1. Structural Innovation: A collaborative architecture of “pre-trained semantic representation + bidirectional sequence modeling + global label constraint” is proposed, which specifically solves three pain points in the semantic annotation of water conservancy specifications: ambiguity in professional entity recognition, insufficient capture of long-sentence dependencies, and confusion in label logic.
  2. Performance Advantage: The model achieves a comprehensive F1-score of 0.868 on the test set, which is 5.7 to 9.2 percentage points higher than that of traditional baseline models. Moreover, the accuracy of hydraulic professional entity recognition reaches 84.9%, which meets the practical needs of water conservancy engineering scenarios.
  3. Application Value: The high-precision annotation results of the model (e.g., accurate distinction of 7 types of semantic labels) provide high-quality data input for syntax parsing and triple conversion in “Review Rule Analysis”, directly supporting the effectiveness of the construction of the compliance review knowledge graph.

Review rule storage

Analysis of review rules

Based on the semantic annotation model trained in the previous section, this study automates the semantic annotation of standard clauses for water gates, thereby achieving the extraction of various entities. To further analyze the semantic relationships between entities, context free grammar is used to parse normative clauses, and based on this, a domain knowledge representation model is constructed. After analysis, the articles can be divided into five types: relational constraints, attribute numerical constraints, attribute proportion constraints, attribute nesting constraints, and conditional constraints.

  1. (1) The existence relationship constraint clauses reflect the existence relationship between the examination objects, and use affirmation or negation to constrain the relationship between the subject examination object and the object examination object. For example, “a cushion layer should be set up under the bottom protection”, “under the bottom protection” and “cushion layer” are the subject and object respectively, “setting” indicates the relationship, and “should” determines the connection between the two. From the perspective of auditing, computers can understand that there should be a “cushion layer” under the “bottom protection”, otherwise it is not compliant. The corresponding knowledge graph triplet transformation is shown in Fig 8.
  2. (2) Attribute value constraint clauses mainly reflect a comparative relationship on an attribute value of the object under review, constraining the object and its attribute with affirmation or negation. For example, in "the inner diameter of the drainage pipe is greater than 0.2m", "drainage pipe" is the object under review, "inner diameter" is its attribute, "0.2m" is the reference value, and "greater than" expresses the comparison. When the value of "inner diameter" exceeds "0.2m", the requirement is met; otherwise the design is non-compliant. The knowledge graph triple transformation is shown in Fig 9.
  3. (3) Attribute proportion constraint clauses constrain a pending attribute through a proportional relationship. For example, in "the length of the straight river section should not be less than 5 times the water surface width at the sluice entrance", "straight river section" is the review object, "length" is its attribute, "not less than" is the comparison, and "5 times the water surface width" is the reference. In computer-understandable terms, the clause compares the "length" attribute of the "straight river section" with the "water surface width": only when both the "not less than" and "5 times" conditions hold does the design meet the specification. The corresponding knowledge graph triple transformation is shown in Fig 10.
  4. (4) Attribute nesting constraint clauses reflect a nested structure between attributes, which is then examined through a comparative relationship. Taking "the beam bottom elevation of the traffic bridge should be 0.5m higher than the highest flood level" as an example, "beam bottom elevation of the traffic bridge" contains three nested levels: "beam" is a component of the "traffic bridge", and "bottom elevation" is an attribute of the "beam". "Higher than" is the comparative relationship and "0.5m" is the judgment condition, and both must be satisfied simultaneously. In computer-understandable terms, the "bottom elevation" attribute of the "beam" of the "traffic bridge" must be at least "0.5m" above the "highest flood level" to meet the specification. The corresponding knowledge graph triple transformation is shown in Fig 11.
  5. (5) Conditional constraint clauses express, through logical structure and affirmative/negative forms, the premise that a subordinate clause imposes on the main clause. Taking "when the level-3 water gate blocks water, the safe increase value of the normal water level is not less than 0.4m" as an example, in the subordinate clause "level-3 water gate" consists of an object and an attribute and "blocking water" is the state condition; in the main clause, "safe increase value of the normal water level" consists of an object and an attribute, and "not less than" is the comparative relationship. This type of clause requires first evaluating the subordinate condition: when the "water gate" satisfies both "level 3" and "blocking water", it is then checked whether the "safe increase value" of the "normal water level" is "not less than" "0.4m". If so, the design complies with the specification; otherwise the review does not pass. The knowledge graph triple transformation is shown in Fig 12.
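To make the "head entity + relationship + tail entity" conversion concrete, the sketch below parses an attribute value constraint of the form discussed in (2) into triples. The regular-expression pattern and relation names are illustrative assumptions, not the study's CFG productions:

```python
import re

def extract_value_triples(clause: str):
    """Parse a clause of the form
    'the <attribute> of the <object> is <comparator> <value>'
    into (head, relation, tail) triples. Illustrative pattern only."""
    m = re.match(
        r"the (?P<attr>[\w ]+?) of the (?P<obj>[\w ]+?) is "
        r"(?P<cmp>greater than|less than|not less than) (?P<val>[\d.]+\s*\w+)",
        clause,
    )
    if not m:
        return []
    return [
        # Link the review object to its attribute, then the attribute
        # to the reference value via the comparison relation.
        (m["obj"], "has_attribute", m["attr"]),
        (m["attr"], m["cmp"].replace(" ", "_"), m["val"]),
    ]

print(extract_value_triples("the inner diameter of the drainage pipe is greater than 0.2m"))
# [('drainage pipe', 'has_attribute', 'inner diameter'), ('inner diameter', 'greater_than', '0.2m')]
```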

Knowledge graph generation

There are three main storage solutions for knowledge graphs: relational databases, RDF triple stores, and native graph databases. Although relational databases can store from tens of millions to billions of triples, they are inefficient for deep correlation queries and limited in relationship expression [15]. RDF stores have semantics close to the graph model and good visualization, but suffer from high storage overhead, insufficient design flexibility, and high query complexity [16]. Native graph databases offer high storage efficiency, convenient modeling, and strong relational expressiveness, though their data capacity is comparatively limited [17,18]. Given the moderate data scale of the water conservancy review knowledge graph, the need for efficient relationship expression, and the frequent rule updates required, this study chooses a native graph database as the storage solution [19].

Existing native graph databases include Neo4j, JanusGraph, and others. Because Neo4j has an independent storage engine, does not rely on external systems, and performs well in graph traversal and relational queries, this study uses it to store the water gate specification review knowledge [20–22]. To achieve structured, visualizable storage and retrieval, the Py2neo toolkit is used to batch-convert the parsed triples into Cypher statements and import them into Neo4j. The import process is as follows.

  1. (1) Entity import. Read the standardized knowledge triple data, attach the appropriate semantic labels, and import entity names such as water gate, water barrier, and normal storage level. Fig 13 shows the entity nodes imported for part of the normative knowledge.
  2. (2) Relationship import. Read the standardized knowledge triple data, where each row contains a head entity, a relationship, and a tail entity. Add relationship labels and types, and use CREATE statements to import all relationships. Fig 14 shows the relationships between entities imported from the review clauses.
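Steps (1) and (2) can be sketched as small functions that render a triple into Cypher statements; the node label, relation name, and connection credentials below are assumptions for illustration, not the study's actual schema:

```python
def entity_cypher(name: str, label: str) -> str:
    """Step (1): MERGE an entity node carrying a semantic label (label assumed)."""
    return f'MERGE (:{label} {{name: "{name}"}})'

def relation_cypher(head: str, rel: str, tail: str) -> str:
    """Step (2): CREATE a relationship between two existing entity nodes."""
    return (
        f'MATCH (h {{name: "{head}"}}), (t {{name: "{tail}"}}) '
        f'CREATE (h)-[:{rel}]->(t)'
    )

statements = [
    entity_cypher("bottom protection", "ReviewObject"),
    entity_cypher("cushion layer", "ReviewObject"),
    relation_cypher("bottom protection", "SHOULD_SET", "cushion layer"),
]
# With Py2neo these strings would be executed against a running Neo4j, e.g.:
#   from py2neo import Graph
#   graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))  # assumed credentials
#   for s in statements:
#       graph.run(s)
print(statements[0])
```

MERGE is used for entities so that repeated imports do not duplicate nodes, while relationships use CREATE as described in step (2).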

After the above steps, a preliminary knowledge graph for compliance review of water gate drawings is constructed. Because of the large number of entities and relationships, only a partial view of the graph is presented, as shown in Fig 15.

Conclusion

This study conducts in-depth research on the knowledge graph construction method for compliance review of water conservancy engineering drawings [23–25]. Through technological breakthroughs in key links such as data preprocessing and semantic annotation, the following core achievements have been formed:

  1. (1) A systematic process for data cleaning, semantic label extraction, and structured transformation has been designed to address the multi-source heterogeneous data of water conservancy engineering specifications. Regular expressions remove text noise, core semantic labels are extracted through rule matching and natural language processing techniques, and the data are converted into JSON structures and syntax trees, laying a standardized data foundation for knowledge graph construction.
  2. (2) An annotation system covering both basic and extended semantic labels has been constructed, comprehensively capturing the core review elements and complex semantic relationships of water conservancy engineering specifications. A semi-automatic annotation method combines automated pre-annotation with manual verification, and annotation quality is ensured through cross-validation; the customized annotation tool developed in this study significantly improves annotation efficiency and accuracy.
  3. (3) The collaborative optimization of data preprocessing and semantic annotation effectively solves the structuring problem of compliance review data for water conservancy engineering reports, providing high-quality data support for entity extraction, relationship construction, and rule inference in the knowledge graph, and laying a theoretical and technical foundation for automated, intelligent compliance review. The resulting knowledge graph can be applied to the review of water conservancy engineering reports such as sluice drawings: through the efficient storage and query capabilities of the Neo4j database, review rules can be quickly retrieved and intelligently inferred, greatly improving review efficiency and accuracy. This effectively promotes the digital and intelligent development of water conservancy engineering construction, and provides important technical support and practical reference for the implementation of the "smart water conservancy" strategy.

Supporting information

References

  1. Luo D. Research on the impact of artificial intelligence on labor employment [D]. Southwest University of Finance and Economics; 2023.
  2. Angeruma W, Siriguleng W, Sichentu S. A review of research on knowledge graph completion. Computer Science and Exploration. 2025:1–23.
  3. Sun J, Wang J, Ding G. Spectrum knowledge graph: an intelligent engine for future spectrum management. Journal of Communications. 2021;42(05):1–12.
  4. Cao R, Liu L, Yu Y. A review of research on large language models integrating knowledge graphs. Computer Application Research. 2024:1–14.
  5. Bordes A, Usunier N, García-Durán A. Translating embeddings for modeling multi-relational data. In: Advances in Neural Information Processing Systems. Lake Tahoe, NV, USA: Curran Associates; 2013. p. 2787–95.
  6. Chen D, Fisch A, Weston J. Reading Wikipedia to answer open-domain questions. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics; 2017. p. 1870–9.
  7. Zhejiang Water Conservancy Standardization Association. The first national water conservancy standard large-scale model has been released [EB/OL]. (2025-01-10).
  8. Jinfa L, Can X, Zhang K, et al. Research on emerging technology recognition integrating deep learning and patent maps from a dynamic perspective. Intelligence Theory and Practice. 1–14 [2025-05-08].
  9. Xiaoyang. Research on compliance review of 3D bridge engineering construction design based on knowledge model [D]. Xi'an University of Technology; 2023. https://doi.org/10.27398/d.cnki.gxalu.2023.000386
  10. Qian F, Cheng J, Xia R. Preliminary exploration on the construction ideas, framework and application scenarios of water conservancy large-scale models. China Water Resources. 2024;(09):9–19.
  11. Lin J, Chen K, Zheng Z. Key technologies and applications of intelligent interpretation of building engineering standard specifications. Engineering Mechanics. 2025;42(02):1–14.
  12. Wang J. Automated compliance review and design optimization of BIM models based on NLP and knowledge graph [D]. Southeast University; 2023.
  13. Chen Y. Research on automatic compliance review of structural models based on knowledge graph [D]. Chongqing University; 2023.
  14. Zhou Y. Automatic rule interpretation method for design review [D]. Tsinghua University; 2022.
  15. Guo H, Zhou Y, Ye X. Automatic mapping from IFC data model to relational database model. Journal of Tsinghua University (Natural Science Edition). 2021;61(02):152–60.
  16. Hu H, Xuefeng X, Qitao Z. Emotion triplet extraction method using multi-feature weighted graph convolutional network. Journal of Hunan University (Natural Science Edition). 2024;51(12):165–75.
  17. Zhang J, Huang X, Gui M. Construction and application of water conservancy knowledge graph for digital twin engineering. People's Yellow River. 2024;46(04):121–4.
  18. Yang Y, Chen S, Zhu Y, Liu X, Pan S, Wang X. Intelligent question answering for water conservancy project inspection driven by knowledge graph and large language model collaboration. LHB. 2024;110(1).
  19. Chen Y, Lu G, Wang K, Chen S, Duan C. Knowledge graph for safety management standards of water conservancy construction engineering. Automation in Construction. 2024;168:105873.
  20. Zhang L, Hu K, Ma X, Sun X. Combining semantic and structural features for reasoning on patent knowledge graphs. Applied Sciences. 2024;14(15):6807.
  21. Jiang S, Yang J, Xie J, Xu X, Dou Y, Jing L. Product innovation design approach driven by implicit relationship completion via patent knowledge graph. Advanced Engineering Informatics. 2024;61:102530.
  22. Liu Y, Tian J, Liu X, Tao T, Ren Z, Wang X, et al. Research on a knowledge graph embedding method based on improved convolutional neural networks for hydraulic engineering. Electronics. 2023;12(14):3099.
  23. Liu X, Lu H, Li H. Intelligent generation method of emergency plan for hydraulic engineering based on knowledge graph – take the South-to-North Water Diversion Project as an example. LHB. 2022;108(1).
  24. Yang Y, Zhu Y, Jian P. Application of knowledge graph in water conservancy education resource organization under the background of big data. Electronics. 2022;11(23):3913.
  25. Haiyu R, Jianping L, Jian W. A review of research on intelligent question answering systems based on large language models. Computer Engineering and Applications. 1–24 [2020-03-30].