Abstract
Hydrological prediction in ungauged basins often relies on the parameter transplant method, which incurs high labor costs due to its dependence on expert input. To address these issues, we propose a novel hydrological prediction model named STH-Trans, which leverages multiple spatiotemporal views to enhance its predictive capabilities. Firstly, we utilize existing geographic and topographic indicators to identify and select watersheds that exhibit similarities. Subsequently, we establish an initial regression model using the TrAdaBoost algorithm based on the hydrologic data from the selected watershed stations. Finally, we refine the initial model by incorporating multiple spatiotemporal views, employing semi-supervised learning to create the STH-Trans model. The results of our experiments underscore the efficiency of the STH-Trans model in predicting runoff for ungauged basins. This innovation leads to a substantial increase in model accuracy ranging from 7.9% to 30% compared to various conventional methods. The model not only offers data support for water resource management, flood mitigation, and disaster relief efforts, but also provides decision support for hydrologists.
Citation: Zhao Q, Zhu Y, Shi Y, Li R, Zheng X, Zhou X (2025) Hydrological prediction in ungauged basins based on spatiotemporal characteristics. PLoS ONE 20(1): e0313535. https://doi.org/10.1371/journal.pone.0313535
Editor: Kyungrock Paik, Korea University, KOREA, REPUBLIC OF
Received: February 21, 2024; Accepted: October 27, 2024; Published: January 10, 2025
Copyright: © 2025 Zhao et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data presented in this study cannot be made public due to the confidentiality of the data provided by the relevant hydrological department. The data are available on request from the Hydrology bureau of Jiangxi Province, Huangsheng Bund International Building A, No. 1499 Yanjiangnan Avenue, Xihu District, Nanchang City, Jiangxi Province, China. (http://www.jxssw.gov.cn).
Funding: This work has been partially supported by the Talent Startup project of Nanjing Institute of Technology (No.YKJ202117), the Jiangsu Provincial Department of Education’s University Philosophy and Social Science Research Project (No.2023SJYB0434), the Natural Science Foundation of Jiangsu Province (Grants No. BK20210928) and the Talent Startup project of Nanjing Institute of Technology(No.YKJ202317). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
In recent years, the advancement of the big data era and machine learning has significantly improved the accuracy of hydrological prediction. However, while most forecasting models excel in predicting watershed hydrology using rich monitoring station data, accurate hydrological forecasting for ungauged areas remains challenging. Despite the relatively complete network of current hydrological stations, remote watersheds still lack monitoring stations or have only limited data from newly established stations (less than six months of observations) [1]. Floods in these remote and underserved areas pose a significant threat to residents’ lives and property, particularly during continuous rainstorms.
For ungauged basins with sparse data, traditional hydrological models cannot be effectively trained or used for prediction. A data-driven hydrological model relies on a large amount of data, from which rules are extracted and then used for prediction. Therefore, the lack of extensive historical data makes data-driven models ineffective for ungauged areas. In hydrology, the usual method is to find the nearest basin stations for parameter transplantation and to build the correlation through modeling [2]. These methods usually require further process understanding, estimation methods, and comprehensive knowledge across processes and regions [3]. At present, hydrological prediction in ungauged areas faces the following problems [4]:
1. Low universality. The existing methods not only require a large number of parameter calculations [5], but can also only perform parameter migration for specific watersheds [6]. They cannot be applied to multiple watersheds and therefore lack universality.
2. Strong professional dependence. The migration of hydrological parameters depends on the physical hydrological process [7], so establishing the model requires professional knowledge and methods as well as comprehensive experience across processes and regions.
3. Low prediction accuracy [8]. The method of transplanting the parameters of the nearest station to the target station depends on the similarity between watersheds. If the hydrological conditions of the two basins are not sufficiently similar, the transplanted parameters need to be gradually corrected according to the actual situation.
To solve the above problems, this paper proposes a hydrological prediction method for ungauged basins that is based on transfer learning and is adjusted using multiple spatiotemporal views, as shown in Fig 1. The proposed model not only improves the accuracy of hydrological transfer, but also reduces reliance on expert knowledge and improves the efficiency and universality of hydrological prediction.
The specific framework process is as follows:
- Multi-Source Data Input: The bottom layer of the framework is the basic input of the model. The input data includes hydrological data, meteorological data, geographic location information, and terrain attributes.
- Calculate Watershed Similarity: This is to determine the domain dataset. First, calculate the geographical distance between each hydrological station and the target point, and then select the stations within a certain distance to calculate the geometric similarity, terrain similarity and geographical location relationship for comprehensive judgment.
- Determine Domain Dataset: Based on domain similarity, determine the target domain dataset, source domain dataset, and auxiliary domain dataset to participate in training.
- Initial Model Establishment: The model adapts the transfer learning algorithm to the regression task. At the end of the algorithm, the regression result is obtained by taking the median of the weighted base learners.
- Multiple Spatiotemporal Views Adjustment: Train three regressors starting from the time, spatial, and attribute views. Continuously adjust the average and difference values of any two regressors to obtain and output the final predicted value.
2 Related works
Predictions in Ungauged Basins (PUB), as the second International Hydrological Decade, was officially launched in 2003. Hrachowitz M et al. [9] reviewed the work done under the six science themes of the PUB Decade and outlined the challenges ahead for the hydrological sciences community. Tim et al. [10] took an ungauged area in southern Cambodia as an example, coupled a rainfall-runoff model with irrigation reservoirs, drove the model with remote sensing data, and used professional knowledge to constrain the parameters of the resulting hydrological prediction model. Kong et al. [11] proposed a physical-process-driven distributed model, the TOPKAPI model, to forecast floods in areas without data through parameter transplantation. Lance et al. [12] used an artificial neural network to predict runoff in ungauged areas, taking time-lagged records of precipitation and temperature as the model input. Their experimental results show that hourly flow prediction is better than daily flow prediction. Tara et al. [13] proposed a multi-model approach that uses four regional models (two data-driven models and two hydrological models) for continuous daily flow estimation and combines the four models to improve performance. Yu et al. [14] proposed an improved hydrological prediction model based on an empirical model, a radial basis function and an autoregressive model for forward prediction of runoff series. Oruche R et al. [15] studied transfer learning through fine-tuning and parameter transfer for better generalization of streamflow prediction in data-sparse regions. Chu KyungSu et al. [16] used the support vector machine, random forest and XGBoost methods. The threshold rainfall of the ungauged watersheds was calculated using XGBoost and verified against past rainfall events and damage cases. Rasheed Zimeena et al. [17] presented a prototype ML-based framework for flood warning and flood peak prediction, building on the recent success of machine learning models in streamflow prediction. Choi Jeonghyeon et al. [18] incorporated vegetation modules into hydrological models to effectively improve accuracy. Demirel et al. [19] found that adding spatial pattern evaluation to the traditional temporal evaluation of hydrological models can assist in identifying optimal parameter sets. Swilla Livingstone et al. [20] calibrated and verified a model for runoff prediction in an ungauged basin using flow data from another river. The sub-catchment geospatial characteristics were delineated from the Digital Elevation Model of the study area in ArcGIS to help improve the model’s performance. Giudicianni C. [21] discussed the impact of different rainfall events on ungauged basins and showed that basins are influenced by terrain, area, and rainfall.
In summary, machine learning methods provide new ideas for hydrological prediction in ungauged basins. Spatial patterns, including vegetation coverage, watershed area and others, can also improve the accuracy of hydrological prediction in ungauged basins. Although more and more studies use machine learning to solve the PUB problem, some problems remain, including low prediction accuracy and weak universality. Moreover, many small and medium-sized rivers still have no stations or historical data. How to combine machine learning, transfer learning, and spatial features in the era of big data to predict regions with sparse hydrological data, and to provide a model with good prediction accuracy while reducing the dependence on expert knowledge, has become a complex problem.
3 Domain dataset selection based on spatial similarity
3.1 Basic definition
Transfer learning [22–25] is to mine new knowledge on the basis of existing knowledge and find the relationship between existing knowledge and new knowledge. The basic definitions are as follows:
Domain (D): D = {X, P(X)}. It is composed of data characteristics and data distribution. Generally speaking, it can be understood as a specific field at a certain time.
Task (T): T = {Y, f(⋅)}. It is composed of the objective function and the learning result. Generally speaking, it can be understood as something to do.
Source Domain (DS): DS = {(xs1, ys1), ⋯, (xsn, ysn)}, xsi ∈ XS, ysi ∈ YS. A domain with existing knowledge. Modeling requires the use of knowledge from the Source Domain.
Target Domain (DT): DT = {(xT1, yT1), ⋯, (xTn, yTn)}, xTi ∈ XT, yTi ∈ YT. A domain without knowledge and requiring knowledge learning tasks.
Transfer learning is based on source domain knowledge [26]. By reducing the data distribution difference between source domain and target domain, knowledge transfer is achieved.
The purpose of hydrological forecasting is to predict flood events, reduce the impact of disasters, and support reasonable reservoir operation, so this paper forecasts runoff. Many factors affect runoff. In addition to hydrological factors, runoff is also affected by geographical, topographic and meteorological factors, such as watershed area, watershed length, elevation, slope and rainfall.
The Transfer learning model needs source domain and target domain for training. Therefore, it is very important to select the appropriate source basin data for the model training. In addition, for ungauged basins, we need to find a basin that can replace the training of the target basin. This paper selects and marks the basin that is most similar to the target basin, and defines it as an auxiliary basin to replace the target basin for training. The specific data set is defined as follows:
Target Basin(t): the area without hydrological data, to which knowledge needs to be transferred.
Source Basin(s): a basin with hydrological data, which is used to establish a model and assist the target basin in prediction.
Auxiliary Basin(h): h ∈ s, the basin most similar to the target basin.
The rainfall data of t is Pt, the rainfall data of h is Ph, the rainfall data of s is Ps, the runoff data of h is Qh, the runoff data of s is Qs, the geographic data of t is Gt, the geographic data of h is Gh, the geographic data of s is Gs, the terrain data of t is Tt, the terrain data of h is Th and the terrain data of s is Ts.
3.2 Selection of source basin and auxiliary basin based on spatial similarity
Similarity is compared by calculating the distance between the features of different things. For the same hydrological process, if the sizes of the hydrology-related elements at each watershed point are proportional and their change trends are similar, then the two hydrological phenomena are similar [27]. In a dynamic process, two dynamic phenomena are similar if geometric similarity, kinematic similarity and dynamic similarity are generally met. In the physical process of hydrology, geometric similarity considers the shape information, basin area, basin length and so on. Kinematic similarity considers the runoff direction and velocity of the river. For dynamic similarity, the gravity and pressure at each location are considered, including the elevation and slope of the basin [28, 29].
Based on the geographical location characteristics, topographic and geomorphic characteristics of each station, this section constructs three hydrological spatial relationship maps: Geometric Similarity Graph, Topographic Index Graph, and Geographic Location Graph. There is no hydrological data record in ungauged areas, such as runoff and water level. So we do not consider hydrological time series factors when constructing hydrological similarity map.
When the two rivers are geographically close, the hydrological process will also be similar to some extent because of the rainfall process and geographical conditions [30]. Therefore, it is necessary to consider the distance of the measuring station in the basin that is similar to the target basin. The distance formula is as follows:
(1)
Where R is the radius of the earth, atan2(x, y) is a function that returns the arctangent of the specified x and y coordinate values.
The calculation formula of a is as follows:
(2)
Where Δθ is the latitude difference between two geographical locations, αi and αj are the longitudes of the two geographical locations, and Δθ is the longitude difference. All angles must be expressed in radians rather than in decimal degrees of latitude or longitude.
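The description of Eqs (1) and (2) (an Earth radius R, an atan2 term, and an intermediate quantity a built from latitude and longitude differences) matches the standard haversine great-circle distance. The sketch below assumes that form; the function name, the argument names and the Earth radius value are illustrative choices, not taken from the paper. Note that the standard formula uses the two latitudes in its cosine terms.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; the paper does not state a value


def haversine_km(lat1_deg, lon1_deg, lat2_deg, lon2_deg, radius_km=EARTH_RADIUS_KM):
    """Great-circle distance between two points given in decimal degrees."""
    # Convert to radians first, as the text requires.
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1_deg, lon1_deg, lat2_deg, lon2_deg))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Intermediate quantity a of the standard haversine formula.
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    # max() guards against a tiny negative value caused by rounding.
    return 2 * radius_km * math.atan2(math.sqrt(a), math.sqrt(max(0.0, 1.0 - a)))
```

A call such as `haversine_km(28.26, 114.78, lat, lon)` would then give the horizontal distance from the example target station to a candidate station at hypothetical coordinates `(lat, lon)`.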
3.2.1 Geometric similarity graph.
The hydrological basin shape information includes basin area, basin length, etc. Based on the shape information, a geometric similarity graph is constructed. The similarity relationship is calculated by the weighted difference distance of the shape attribute values between two stations. The formula of Geometric Similarity Graph is as follows:
(3)
Where GSa,t is the geometric similarity distance between point a and t, where a is the surrounding basin, point t is the target basin, n is the number of attributes in the shape information, δm is the weighting coefficient, am is the mth attribute value in the a watershed shape information. tm is the mth attribute value in the t watershed shape information. The smaller GSa,t value is, the more similar it is to the target watershed.
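As an illustration of the weighted difference distance described above, the following sketch assumes that Eq (3) is a weighted sum of absolute differences over normalized shape attributes. The exact form of the equation is not reproduced here, so both this form and the function name are assumptions.

```python
def geometric_distance(a_attrs, t_attrs, weights):
    """Weighted difference distance between the shape attributes of a
    surrounding basin (a_attrs) and the target basin (t_attrs).

    Assumed form: sum over m of delta_m * |a_m - t_m|. Smaller is more similar.
    """
    if not (len(a_attrs) == len(t_attrs) == len(weights)):
        raise ValueError("attribute and weight vectors must have the same length")
    return sum(w * abs(a - t) for w, a, t in zip(weights, a_attrs, t_attrs))
```

Attributes such as basin area and basin length would need to be normalized to a common scale before being passed in, otherwise the attribute with the largest units dominates the distance.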
3.2.2 Topographic index graph.
Affected by the topography of the basin, hydrological phenomena show different hydrological processes. The topography not only affects the runoff direction, but also relates to the specific geographical distribution of the cumulative water volume. The topographic index is the cumulative trend of runoff at a certain point in the basin. Watershed points with similar topographic indexes have similar hydrological processes in theory. The formula of the Topographic Index Graph is as follows [31]:
(4)
Where α represents the accumulated water volume at any point in the basin, tanβ represents the runoff trend caused by gravity.
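The symbols α and tanβ suggest the widely used TOPMODEL-style topographic index, ln(α / tanβ). The sketch below assumes that form for Eq (4); the function and parameter names are illustrative.

```python
import math


def topographic_index(alpha, slope_rad):
    """Topographic index ln(alpha / tan(beta)) at one point.

    alpha: accumulated upslope contribution at the point (must be positive).
    slope_rad: local slope angle beta in radians (must give a positive tangent).
    """
    tan_beta = math.tan(slope_rad)
    if alpha <= 0 or tan_beta <= 0:
        raise ValueError("alpha and tan(beta) must be positive")
    return math.log(alpha / tan_beta)
```

Under this form, points that collect more water (larger α) or lie on flatter ground (smaller tanβ) receive a higher index.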
3.2.3 Geographic location graph.
The basin is affected by gravity and inertial force. The different velocity of water flow leads to different situations of rapid rise and fall of flow. According to the elevation difference and gradient difference between the two watersheds, the Geographic Location Graph is constructed. The formula is as follows:
(5)
Where GLa,t is the combination of elevation difference and gradient difference between a and t, a is the surrounding basin, t is the target basin, and ϑ1 and ϑ2 are the coefficients of the normalized altitude difference and the normalized average gradient difference between basin a and basin t, respectively. The smaller the GLa,t value is, the more similar basin a is to the target watershed.
For the selection of the source basin, first filter the stations according to the horizontal distance, and then sort them according to the geometric similarity relationship, terrain index and geographical location. We can obtain three sorted station tables. Select the top n% stations in the three tables. If the selected stations rank in the top n% in all three tables, the data of this station will be placed in the source basin data table.
For the selection of auxiliary basin, they are sorted according to the difference value of topographic index. Three basins with small topographic index difference are selected, and one of the most similar basins is selected as the auxiliary basin according to the geometric relationship and geographical location relationship. The runoff information from the auxiliary basin stations will be used to establish the initial hydrological prediction model.
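The source-basin selection rule described above (keep a station only if it ranks in the top n% of all three sorted tables) can be sketched as follows. The sketch assumes that each table is a dictionary mapping a station ID to a distance where smaller means more similar, that all three tables cover the same stations, and that the horizontal-distance filter has already been applied; the function names are hypothetical.

```python
import math


def top_fraction(scores, fraction):
    """Station IDs in the best `fraction` of a {station: distance} table."""
    k = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get)  # ascending: most similar first
    return set(ranked[:k])


def select_source_stations(geometric, topographic, location, fraction):
    """Stations ranked in the top `fraction` of all three similarity tables."""
    return (
        top_fraction(geometric, fraction)
        & top_fraction(topographic, fraction)
        & top_fraction(location, fraction)
    )
```

Using a set intersection makes the rule strict: a station that is very close on two criteria but poor on the third is excluded.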
4 STH-Trans model
4.1 H-Trans model based on transfer learning
There is no available hydrological station in the ungauged area, so the required information cannot be effectively mined. Therefore, based on the idea of transfer learning, this section takes the runoff information of the auxiliary basin as the training data set of the target basin. The TrAdaBoost algorithm [32], a widely used sample-based transfer learning method, extends the idea of the AdaBoost algorithm. It filters the source domain data, removes data inconsistent with the target domain sample distribution, and updates the weight vector through the Boosting method, increasing the weight of source domain training data that is useful for the target domain and reducing the weight of invalid data with inconsistent distribution.
Aiming at the specific problems in the hydrological field, a hydrological prediction model based on transfer learning is proposed. The model performs regression training based on the idea of the TrAdaBoost algorithm [33]. The model adjusts the weight vector based on the prediction error, and then obtains the regression results by taking the median or weighted average of the weighted base learners [34]. When the output Yi of sample Xi is incorrect in the regression process, its actual error relative to the real value of sample Xi may be arbitrarily large. Therefore, the error calculation in this algorithm needs to be normalized so that its range is [0, 1], and then the cumulative error is used to calculate the redistributed weight.
This paper selects M5 model tree [35] as the base learner. The M5 model tree algorithm replaces constants with linear regression functions at leaf nodes and combines piecewise linear models to handle nonlinear problems. Compared with traditional linear regression algorithms, it can automatically segment the input space and accurately predict nonlinear data. Compared with regression trees, it has faster computational efficiency and higher prediction accuracy. Compared with neural network methods, it has interpretability [36]. The steps are as follows:
- Dataset input: Set the training dataset as D. The first n samples of training set D are the source basin sample data, and the last m samples are the target basin sample data.
- Parameter setting: Set the base learner to the M5 model tree [37], set the basic parameters, and initialize the weight vector W1.
- Training model: Train the base learner ht with training data sets and weight vectors.
Calculate the maximum error on the target basin sample set in the training set.
Calculate the relative error on the target basin sample set in the training set, where yi is the real data, ht is the base learner, and xi is the input data.
Calculate the error rate εt of the learner ht on the target basin sample set. When εt > 0.5, reset it to 0.5.
Update the weight vector, where Zt is the normalization factor.
- Model output: Build the final regressor, where h′(x) is the median of βtht(x).
When εt is greater than 0.5, reset εt to 0.5. Each round of iteration adjusts the weight vector according to εt. The sample with a large εt will reduce the influence factor in the next iteration. After multiple iterations, the data with smaller error rate εt of the target domain will be given higher weight, while the weight of others will be reduced.
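The steps above can be sketched as a TrAdaBoost-style boosting loop for regression. This is a minimal illustration under stated assumptions, not the authors' implementation: a weighted linear fit stands in for the M5 model tree, the initial weights are assumed uniform, the source and target update factors follow the common TrAdaBoost convention, and the final output is a weighted median with vote ln(1/βt). All function names are hypothetical.

```python
import numpy as np


def fit_weighted_linear(X, y, w):
    """Weighted least-squares linear model (stand-in for an M5 model tree)."""
    Xb = np.column_stack([X, np.ones(len(X))])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(Xb * sw[:, None], y * sw, rcond=None)
    return lambda Xn: np.column_stack([Xn, np.ones(len(Xn))]) @ coef


def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half of the total."""
    order = np.argsort(values)
    cum = np.cumsum(weights[order])
    return values[order][np.searchsorted(cum, 0.5 * cum[-1])]


def fit_h_trans(X_src, y_src, X_tgt, y_tgt, n_rounds=10):
    """Return a predict function; the first n rows are source, the last m target."""
    n = len(X_src)
    X = np.vstack([X_src, X_tgt])
    y = np.concatenate([y_src, y_tgt])
    w = np.full(len(y), 1.0 / len(y))  # assumed uniform initial weights
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n) / n_rounds))
    learners, votes = [], []
    for _ in range(n_rounds):
        w = w / w.sum()  # normalization factor Z_t
        h = fit_weighted_linear(X, y, w)
        abs_err = np.abs(y - h(X))
        max_err = abs_err[n:].max()  # maximum error on the target samples
        learners.append(h)
        if max_err < 1e-12:  # already exact on the target samples
            votes.append(1.0)
            break
        rel_err = np.clip(abs_err / max_err, 0.0, 1.0)  # normalized to [0, 1]
        eps = np.sum(w[n:] * rel_err[n:]) / np.sum(w[n:])
        eps = min(max(eps, 1e-10), 0.5)  # reset to 0.5 when larger
        beta_t = eps / (1.0 - eps)
        votes.append(np.log(1.0 / beta_t) + 1e-12)
        w[:n] *= beta_src ** rel_err[:n]   # down-weight poorly fitting source samples
        w[n:] *= beta_t ** (-rel_err[n:])  # up-weight hard target samples
    votes = np.asarray(votes)

    def predict(X_new):
        preds = np.array([h(X_new) for h in learners])  # shape (rounds, samples)
        return np.array([weighted_median(preds[:, j], votes) for j in range(preds.shape[1])])

    return predict
```

In the setting of this paper, the rows labelled "target" would come from the auxiliary basin, which stands in for the ungauged target basin during training.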
4.2 STH-Trans model based on multiple spatiotemporal views
The hydrological data distribution of the source basin may be very different from that of the target basin. Since there are no stations in the area, some existing geographic data and calculated rainfall data of the target basin cannot fully estimate its hydrological status. Therefore, this section improves H-Trans by using disagreement-based semi-supervised learning [37–39], combining the existing rainfall data and geographic terrain data of the target basin to adjust the H-Trans model.
4.2.1 Multiple spatiotemporal view.
The formation of hydrological phenomena is not only reflected in the scope of time and space, but also related to the geographical attributes of hydrological entities. When the spatial attributes between hydrological entities are more similar, the change trend of hydrological process will be similar. This paper adjusts the transfer model from three aspects: time view, space view, and attribute view. The three views are described below.
1. Time View
Hydrologic data mostly change with time. In the hydrological process, the hydrological situation will be affected not only by its own hydrological factors, but also by meteorological factors. Meteorological elements promote the whole hydrological process of the basin, which makes the different basins have different confluence characteristics. Rainfall is an important factor of hydrological process change, and it will cause obvious changes in runoff and water level. Therefore, the rainfall in the meteorological data is selected as the time series feature of the time view to be added to the training. The dataset based on Time View is defined as follows:
(6)
Where Q is the runoff of the measuring station, and P is the area rainfall data corresponding to the measuring station.
2. Space View
Whether the hydrological phenomena between the two points are similar is related to the distance between the two points. The river channel distance cannot be calculated in the ungauged area, so we choose the horizontal distance as the measurement. The longitude and latitude are selected as the spatial coordinates to calculate the Euclidean Distance, and the altitude is selected as the geographic data for collaborative assistance. The dataset based on Spatial View is defined as follows:
(7)
Where G(lt) = {d(lt, ls), g(lt, ls)}, lt is the measuring station of the target basin, ls is the measuring station of the source basin, d(lt, ls) is the horizontal distance between the target basin and the source basin, g(lt, ls) is the elevation difference between the target basin and the source basin, G(lt) is the spatial distance feature of the target watershed.
3. Attribute View
The topography, landform, soil vegetation and other conditions of a basin will have a great impact on the generation and change of runoff in the basin. These indicators are decisive parameters for hydrological forecasting. Select drainage area, drainage length, slope, vegetation coverage, terrain and other data in geographic attributes as attribute data [21, 40]. The dataset based on Attribute View is defined as follows:
(8)
Where T(lt) = {cs(t(lt), t(ls)), t′(lt)}, lt is the measuring station of the target basin, ls is the measuring station of the source basin, t(l) =< l.a1, l.a2, l.a3, l.a4 >, l.a1 is the normalized drainage area, l.a2 is the normalized drainage basin length, l.a3 is the normalized slope, l.a4 is the normalized vegetation coverage, cs(t(lt), t(ls)) is the cosine similarity of two vectors, t′(l) is the terrain index, T(lt) is the attribute feature of the target basin.
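The attribute feature T(lt) defined above pairs the cosine similarity of the two normalized attribute vectors with the terrain index of the target basin. A minimal sketch, with hypothetical function names and with the four attributes assumed to be already normalized:

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity cs(u, v) of two equal-length attribute vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def attribute_feature(target_attrs, source_attrs, target_terrain_index):
    """T(lt) = {cs(t(lt), t(ls)), t'(lt)} as a tuple.

    Each attribute vector is <area, length, slope, vegetation coverage>.
    """
    return (cosine_similarity(target_attrs, source_attrs), target_terrain_index)
```

Cosine similarity compares the direction of the attribute vectors rather than their magnitude, so two basins with proportionally similar attributes score close to 1.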
4.2.2 STH-Trans model establishment.
This paper proposes a transfer learning algorithm based on the adjustment of multiple spatiotemporal views. Drawing on the idea of Tri-Training [41–45], starting from the three characteristic views of time, space and attribute, we use three regression functions based on these three views to determine how to select appropriate unlabeled samples for labeling, as shown in Fig 2.
Where t is the target watershed, s is the source watershed, h is the auxiliary watershed, Q is the runoff dataset, P is the rainfall dataset, G is the spatial dataset, and T is the attribute dataset.
The model first selects appropriate training data sets of source domain and target domain based on the idea of transfer learning and similarity analysis. Then start with the Time View, Space View and Attribute View, train three regressors based on these three views, and constantly adjust them through the difference value to finally get the final estimate.
The specific steps of model construction are as follows:
- Initial model construction: The TrAdaBoost regression model is constructed according to the algorithm in Section 4.1, with the M5 model tree as the base learner.
- Model adjustment:
- (a). Use the data based on Time View, Space View and Attribute View and the hydrological data of source domain and auxiliary domain to establish three regressors for the target domain respectively R1, R2, R3.
- (b). Calculate the average and difference of the estimated values of any two regressors and put them into the empty sample set L.
- (c). Sort all samples in sample set L, and divide all sorted samples into b sub sample sets from top to bottom.
- (d). In each sub-sample set, select the a% of samples with the smallest difference and use them, in place of the original sample set, in the regressor that did not participate in the calculation.
- Model output: After the conditions are met or the iteration is completed, the average regression of the three regressors is output as the final result.
In the model adjustment step, if the difference between the estimated values of the other two regression functions is small, the average estimate they provide receives higher confidence. However, the size of the difference cannot be measured against a fixed value. For example, if the estimated values of sample A and sample B are 10 and 100 respectively, sample A will obviously obtain a smaller difference more easily. Therefore, we sort the averages of the estimated values in each iteration and then partition the sorted samples from top to bottom. The samples with the smallest difference in each partition are added to the next round of calculation until the iteration is completed.
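The partitioned selection described above can be sketched as follows. Given the estimates of two regressors for the same unlabeled samples, the sketch sorts the samples by their average estimate, splits them into b partitions, and keeps the a% with the smallest disagreement in each partition as pseudo-labeled samples for the third regressor. The function name, the parameter names and the rounding rule for partition sizes are assumptions.

```python
import math


def select_confident_samples(pred_a, pred_b, n_bins, keep_fraction):
    """Return (sample_index, pseudo_label) pairs chosen per partition.

    pred_a, pred_b: estimates of the two regressors for the same samples.
    n_bins: number of partitions b; keep_fraction: share a kept per partition.
    """
    # (average estimate, absolute difference, original index) for every sample
    records = [
        (0.5 * (pa + pb), abs(pa - pb), i)
        for i, (pa, pb) in enumerate(zip(pred_a, pred_b))
    ]
    if not records:
        return []
    records.sort(key=lambda r: r[0])  # sort by the average estimate
    bin_size = math.ceil(len(records) / n_bins)
    selected = []
    for start in range(0, len(records), bin_size):
        chunk = records[start:start + bin_size]
        k = max(1, math.ceil(len(chunk) * keep_fraction))
        # Within a partition the estimates have similar magnitude, so the
        # absolute difference is comparable across its samples.
        for mean, _diff, idx in sorted(chunk, key=lambda r: r[1])[:k]:
            selected.append((idx, mean))
    return selected
```

Partitioning before ranking by difference is what keeps large-runoff samples from being systematically excluded, which is the issue raised by the example of samples A and B.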
5 Experimental analysis
5.1 Experimental dataset selection
The experiment uses data of hydrological stations in Jiangxi Province, China (see Fig 3). To verify the validity of the model, the experiment uses the basin with hydrological data as the ungauged area for prediction. All hydrological data of this station are only used for final comparison, not as training data in the experiment. Taking Shi Shi Station in Jiangxi Province (station code: 62312550, east longitude: 114.78, north latitude: 28.26) as an example, the station is located in Putian Village, Zhajin Town, Xiushui County, Jiangxi Province, and belongs to Xiushui River System. The drainage area of the station is 2807 km2, accounting for 73.3% of the Wanzai River basin area, belonging to the subtropical humid climate. The river section is basically straight and the channel is approximately trapezoidal. The width of the medium to high water channel is about 160 m to 230 m, and there are no tributaries upstream. The riverbed is composed of fine sand on the left bank, pebbles on the right bank, and some rocks. There are three sandbars located 400 m upstream of the section, growing weeds and trees. In some sections of the river below 306 m downstream, a floodplain appears on the left bank above 53.50 m water level. The water surface width has increased from over 200 m to about 300 m, and there is a left bend about 500 m downstream, with medium to high water playing a controlling role [46].
(The figure was generated by Python with basemap toolkit. The map source and license are accessible from https://github.com/matplotlib/basemap.).
5.1.1 Multiple spatiotemporal view.
Select the source basin and auxiliary basin corresponding to the target basin. First, calculate the geographic distance according to the longitude and latitude coordinates, and then calculate the terrain similarity for the stations within that distance. Because the hydrological data are measured at hydrological stations, the longitude and latitude coordinates used here are those of the stations. The red triangles in Fig 3 mark all stations within 100 km of the target basin (marked by the black triangle).
The distance from the target basin (Shi Shi as the example) to all stations within 100 km is shown in Fig 4. In the figure, the orange bars represent stations with a distance of less than 15 km.
Next, we establish a geometric similarity graph, and calculate the geometric distance according to the shape information, watershed area, length, etc. The smaller the distance, the more similar it is. The geometric distances from target watershed to all selected stations are shown in Fig 5.
Then, we calculate the terrain index through ArcGIS. The terrain index of the target basin is 20.96. The closer the topographic index is to that of the target station, the more similar it is to the target basin. The terrain index is shown in Table 1 and the topographic index distance is shown in Fig 6.
In the figure, the orange bars represent stations with terrain index differences less than 1. The station with the smallest difference in terrain index is Shang Gao.
Finally, we establish a geographical location graph. Fig 7 shows the comprehensive geographic distance between the target watershed and all measurement stations. The smaller the value, the closer it is to the target basin.
In the figure, the orange bars represent stations with comprehensive geographic distances less than 1. Shang Gao is the station with the smallest difference.
Through comprehensive comparison, and after removing a large number of stations with missing data, 23 stations with small terrain index differences, small geometric distances, and small comprehensive geographic distances are selected. Among them, Shang Gao, Bai Zhushan and Ling Jiang are the basins with the smallest difference in topographic index. Considering its geometric distance and comprehensive geographic distance, Shang Gao, which has a relatively small gap, is selected as the auxiliary basin and put directly into the model in place of the target basin to help with migration and prediction. The other stations are put into the model as the source basins for transfer learning (see the red triangles shown in Fig 8).
(The figure was generated by Python with basemap toolkit. The map source and license are accessible from https://github.com/matplotlib/basemap.).
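The screening described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Station` fields, the function name `rank_candidates`, and the ranking by a plain sum of the three gaps are hypothetical, since the text describes the final choice only as a comprehensive comparison. The thresholds of 1 mirror the orange-bar cut-offs in Fig 6 and Fig 7, and 20.96 is the terrain index of the target basin.

```python
from dataclasses import dataclass


@dataclass
class Station:
    name: str
    terrain_index: float        # e.g. computed in ArcGIS
    geometric_distance: float   # from the geometric similarity graph
    geographic_distance: float  # comprehensive geographic distance


def rank_candidates(stations, target_terrain_index=20.96,
                    max_index_diff=1.0, max_geo_distance=1.0):
    """Filter stations by the two thresholds, then rank the survivors.

    The ranking score (sum of the three gaps) is an assumption made for
    illustration; it is not taken from the paper.
    """
    def index_gap(s):
        return abs(s.terrain_index - target_terrain_index)

    kept = [s for s in stations
            if index_gap(s) < max_index_diff
            and s.geographic_distance < max_geo_distance]
    return sorted(
        kept,
        key=lambda s: index_gap(s) + s.geometric_distance + s.geographic_distance,
    )
```

Under this sketch, the first station in the returned list would play the role of the auxiliary basin and the remaining stations would serve as source basins.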
Fig 9 shows the daily flows of Shang Gao, the selected auxiliary basin, and Shi Shi in 2008. Although the runoff magnitudes differ, the runoff trends of the two watersheds are similar. It is therefore feasible to use the auxiliary basin to support prediction in the ungauged basin.
5.1.2 Selection of spatiotemporal data.
Time view dataset. Five years of normalized [47] runoff data, from 2005 to 2009, from the source basins and the auxiliary basin are selected for the experiment, giving 269,334 runoff records in total. In addition, the rainfall of the target basin is obtained by interpolating rainfall from the surrounding rainfall stations. The average rainfall over the preceding six hours and the rainfall in the current period, for both the source basins and the target basin, are also input as time features. Because hourly rainfall over a long period is difficult to display, we show the general distribution of annual rainfall in Fig 10. Rainfall is higher in the hilly regions than in the flat plains, which highlights the risk of flash floods.
(The figure was generated by Python with basemap toolkit. The map source and license are accessible from https://github.com/matplotlib/basemap.).
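The time-view inputs can be illustrated with a short sketch. It assumes that the six-hour feature means the mean rainfall over the six hours before the current hour and that normalization is a min-max rescaling; the normalization used in [47] is not specified here, so both choices, as well as the function names, are assumptions.

```python
import numpy as np


def min_max_normalize(x):
    """Rescale a series to [0, 1]; a constant series maps to zeros."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        return np.zeros_like(x)
    return (x - lo) / (hi - lo)


def rainfall_time_features(rain, window=6):
    """Return one row per hour t >= window: [mean of rain[t-window:t], rain[t]].

    The first `window` hours are dropped because they lack a full history.
    """
    rain = np.asarray(rain, dtype=float)
    if rain.size <= window:
        raise ValueError("series must be longer than the window")
    csum = np.concatenate(([0.0], np.cumsum(rain)))  # csum[k] = sum(rain[:k])
    prev_mean = (csum[window:-1] - csum[:-window - 1]) / window
    current = rain[window:]
    return np.column_stack([prev_mean, current])
```

For an hourly series of eight values, `rainfall_time_features` returns two rows, one for each hour that has a full six-hour history.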
Spatial view dataset. Spatial features are represented by spatial coordinates. The experiment includes 22 source basins, 1 auxiliary basin, and 1 target basin. Spatial features include the longitude and latitude, elevation, and elevation difference of these basins, as shown in Table 2. The digital elevation map is shown in Fig 11.
(The red triangles refer to all stations within 100 km from the target basin (indicated as the black triangle). The figure was generated by Python with basemap toolkit. The map source and license are accessible from https://github.com/matplotlib/basemap.).
Attribute view dataset. Attribute feature data are derived from the spatial data using software such as ArcGIS, and include terrain and geomorphic data such as watershed area (square kilometers), watershed length (kilometers), and vegetation coverage.
The vegetation coverage is shown in Fig 12. Vegetation coverage and specific geographic attributes such as watershed area and length are shown in Table 3.
(The red triangles refer to all stations within 100 km from the target basin (indicated as the black triangle). The figure was generated by Python with basemap toolkit. The map source and license are accessible from https://github.com/matplotlib/basemap.).
5.2 Comparative model and parameter setting
In this experiment, common prediction evaluation indexes and flood prediction evaluation indexes are selected, including the Root Mean Square Error (RMSE), the Coefficient of Determination (R2), the Kling-Gupta Efficiency (KGE), the Nash-Sutcliffe Efficiency (NSE), and the Peak Flow Error. The experiment first compares models trained on differently selected source and auxiliary watersheds, and then compares models adjusted under different views. The models are described as follows:
Model based on geometric similarity (Trans): The geometric similarity graph is used to select the source basin and auxiliary basin data, without optimization and adjustment by multiple spatiotemporal views.
Model based on geometric similarity and spatiotemporal views (ST-Trans): Following the model in [48], the model is adjusted using the Tri-training concept. The geometric similarity graph is used to select the source basin and auxiliary basin data, and the model is optimized and adjusted by multiple spatiotemporal views.
Model based on spatial similarity (H-Trans): The source basin and auxiliary basin data are comprehensively selected using the multiple relational graphs of spatial similarity, without optimization and adjustment by multiple spatiotemporal views.
Model based on spatial similarity and time view (TH-Trans): The source basin and auxiliary basin data are comprehensively selected using the multiple relational graphs of spatial similarity, and the time series data are added for adjustment and optimization.
Model based on spatiotemporal view adjustment (CoH-Trans): Following the model in [36], the model is adjusted using the Co-training concept. The source basin and auxiliary basin data are comprehensively selected using the multiple relational graphs of spatial similarity, and the time series and spatial features are added for adjustment and optimization based on the collaborative training method.
Model based on multiple spatiotemporal views adjustment (STH-Trans): The source basin and auxiliary basin data are comprehensively selected using the multiple relational graphs of spatial similarity, and adjustment and optimization are carried out based on the multiple spatiotemporal views proposed in this paper.
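The goodness-of-fit indexes used to compare these models can be written compactly. The sketch below uses the standard textbook definitions, which is an assumption: the KGE follows the common 2009 formulation, and R2 is taken as the squared Pearson correlation. The exact variants used in the experiments are not stated in this section, and the Peak Flow Error is omitted here.

```python
import numpy as np


def _as_arrays(obs, sim):
    return np.asarray(obs, dtype=float), np.asarray(sim, dtype=float)


def rmse(obs, sim):
    """Root mean square error between observed and simulated runoff."""
    obs, sim = _as_arrays(obs, sim)
    return float(np.sqrt(np.mean((sim - obs) ** 2)))


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 matches the observed mean."""
    obs, sim = _as_arrays(obs, sim)
    return float(1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2))


def r2(obs, sim):
    """Squared Pearson correlation (assumed definition of R2)."""
    obs, sim = _as_arrays(obs, sim)
    return float(np.corrcoef(obs, sim)[0, 1] ** 2)


def kge(obs, sim):
    """Kling-Gupta efficiency from correlation, variability ratio, and bias ratio."""
    obs, sim = _as_arrays(obs, sim)
    r = np.corrcoef(obs, sim)[0, 1]
    alpha = sim.std() / obs.std()    # variability ratio
    beta = sim.mean() / obs.mean()   # bias ratio
    return float(1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2))
```

A simulation that is shifted by a constant offset keeps R2 at 1 but lowers NSE and KGE, which is one reason to report several indexes side by side.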
Experimental environment: operating system Windows 10, processor Intel i7-9700, 2.4 GHz, memory 32 GB, GPU GTX 1050 Ti.
Experimental parameters: the base learner is the M5 model tree, the initial sample weights are set to equal values, and the maximum number of iterations is 50. To ensure the model’s effectiveness, the parameters a and b must be set experimentally. We first fix a = 3 and vary b over [1, 5]. When b is 1, the unlabeled samples are not divided. The experimental results are shown in Fig 13.
It can be seen that the RMSE decreases whatever the value of b, especially in the early iterations, which indirectly supports the effectiveness of the proposed model. When b = 1, the RMSE declines but does not reach the optimal performance. When b = 2, the model still needs several iterations to achieve better results. When b reaches 3, the model quickly achieves the optimal effect, so the experimental setting is b = 3.
We then search for the optimal value of a, fixing b = 3 and varying a over [1, 5]. The results are shown in Fig 14.
It can be seen from the figure that the model does not fully converge when a = 1. In the other cases, the model converges after a period of time and obtains better results. The performance with a = 2 is slightly worse than with a = 3. From a = 3 onward, the model performance is similar and the optimal effect is reached quickly, so the experimental setting is a = 3.
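The one-parameter-at-a-time search described above can be sketched as follows. The callable `final_rmse(a, b)` is hypothetical and stands for training the model with the given setting and returning its RMSE. The selection rule (take the smallest value whose RMSE is within a tolerance of the best) is an assumption that mimics the reasoning in the text, where the smallest setting that already reaches the best performance is preferred.

```python
def smallest_good_value(candidates, score, tolerance=1e-3):
    """Return the smallest candidate whose score is within `tolerance` of the best."""
    scores = {c: score(c) for c in candidates}
    best = min(scores.values())
    return min(c for c, s in scores.items() if s - best <= tolerance)


def tune_a_b(final_rmse, a_init=3, grid=range(1, 6), tolerance=1e-3):
    """Coordinate-wise search: tune b with a fixed, then tune a with b fixed."""
    b = smallest_good_value(grid, lambda b_: final_rmse(a_init, b_), tolerance)
    a = smallest_good_value(grid, lambda a_: final_rmse(a_, b), tolerance)
    return a, b
```

This search evaluates ten settings instead of the twenty-five of a full grid, at the cost of possibly missing interactions between a and b.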
5.3 Experimental results and comparative analysis
The experiment establishes an initial prediction model, and the prediction results are shown in Fig 15. The runoff trend predicted by the initial model is basically consistent with the observations, rising before the flood peak arrives and falling after it. The purpose of hydrological prediction is to forecast incoming floods and to support water resources scheduling. The initial model achieves this goal, but its accuracy still needs to be improved. The overestimated flood peak may be due to the higher runoff of the auxiliary basin and the surrounding source basins. The earlier comparison of daily flows (Fig 9) also shows that the runoff of Shang Gao is generally higher than that of Shi Shi, which may lead to the higher predicted runoff.
In physical processes, geometric similarity is the premise of dynamic similarity and kinematic similarity. We therefore also selected the source basins and auxiliary basin based on geometric similarity alone for comparative experiments.
It can be seen from the figure that the choice of dataset affects the model results. The Trans model uses the source basin and auxiliary basin datasets selected based on geometric similarity. Its auxiliary basin is Niu Toushan, which has the smallest geometric distance to Shi Shi and is also close to it in horizontal distance. The H-Trans model uses the source basin and auxiliary basin data selected by combining geometric similarity, topographic index, and geographic location. Its auxiliary basin is Shang Gao. Because Niu Toushan is a less suitable auxiliary basin than Shang Gao, the model trained on the basins selected for Trans performs worse than the H-Trans model. Fig 16 shows that the training data selected by the method proposed in this paper gives better training results and can predict higher flood peaks. Hydrological prediction in ungauged areas mainly serves flood prediction, and peak flow prediction is important for flood warning and emergency rescue. Therefore, the selection of appropriate datasets for the model is crucial.
Then we train the models with multiple spatiotemporal view adjustment on the different training data of the source basins and auxiliary basin. The experimental results are as follows (Fig 17):
It can be seen in the figure that the training dataset selected by the method proposed in this paper gives better results. Although the model trained on the source and auxiliary basin datasets selected by geometric similarity improves after multi-view adjustment, its flood peak prediction is still problematic. At the same time, the model adjusted by multiple spatiotemporal views is better than the general transfer learning model, and its accuracy is significantly improved. The STH-Trans model not only incorporates spatial features, but is also improved through multiple spatiotemporal views.
The hydrological prediction models under different views are compared. In the hydrological process, rainfall has a significant influence. The experiment first adds rainfall to train the model, adjusting the initial model based on a single time series feature. The results are shown in Fig 18:
The figure shows that the TH-Trans model, adjusted with the time series feature, is more accurate than the initial H-Trans model. Although the geographic position in the spatial view and the topography and geomorphology in the attribute view are closely linked to the formation of hydrological phenomena, they have no direct causal relationship with it, so the spatial view and attribute view are not trained and compared separately.
Next, we compare the models based on different views. The experimental results are shown in Fig 19. The model based on the dual spatiotemporal view is adjusted using the time series view and the spatial geographic location view, and is trained with the collaborative training method.
It can be seen from the figure that the CoH-Trans model based on the dual spatiotemporal view performs better than the H-Trans model. The STH-Trans model based on multiple spatiotemporal views adds an attribute view to the CoH-Trans model, which fits better and makes the local prediction of the flood peak more accurate.
Table 4 compares the prediction results of the transfer models described above. The ST-Trans and STH-Trans models, which are adjusted by multiple spatiotemporal views, are much better than the unadjusted models Trans and H-Trans, with smaller root mean square error and higher accuracy. The auxiliary basin and source basins selected by this method clearly help hydrological transfer. The STH-Trans model based on multiple views has better NSE and KGE, higher R2, and lower RMSE than TH-Trans and CoH-Trans. This may be because the STH-Trans model considers factors such as geographic location and terrain, making predictions more accurate.
One of the purposes of hydrological forecasting is to predict floods, so a model should be able to predict both the arrival time and the peak of a flood. In addition to comparing the overall prediction results, we therefore compare the predictions of extreme flood events, especially the accuracy of flood peak prediction. The flooding process with the highest peak value in the prediction sequence is selected for comparison, as shown in Table 5.
It can be seen from the table that the ST-Trans model, whose dataset is selected by geometric similarity, fails to reach the flood peak value, whereas the selection method for the source basins and auxiliary basin in this paper predicts the peak value better.
The adjusted models perform better than the initial model, with higher precision, lower root mean square error, and smaller peak error. Among the adjusted models, the CoH-Trans model based on a dual spatiotemporal view improves precision and reduces error compared with the TH-Trans model based on a single time series view. The STH-Trans model performs best. It not only improves the accuracy by 30% compared with the H-Trans model, 9.5% compared with the TH-Trans model, and 7.9% compared with the CoH-Trans model, but also greatly reduces the root mean square error. The STH-Trans model under multiple spatiotemporal views also has the smallest flood peak error. It can be seen from the figure that the H-Trans initial model and the TH-Trans model lag in peak time, while the STH-Trans model predicts the peak time accurately or even slightly early. The purpose of flood forecasting is to give early warning, and predicting the peak in advance leaves more preparation time for flood control and risk mitigation. Compared with the single-view and dual spatiotemporal view models, the STH-Trans model considers not only geographic location but also geographic attribute factors, including some terrain factors, which makes its predictions more accurate.
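The two flood-peak quantities discussed above, the peak error and the lag in peak time, can be computed as in the following sketch. The definitions are assumptions for illustration: the peak error is taken as the relative difference between the simulated and observed maxima of one flood event, and the lag is the difference between their time indices (positive when the simulated peak arrives late). The formula behind Table 5 is not given in this section.

```python
import numpy as np


def peak_errors(obs, sim):
    """Return (relative peak error, peak time lag in time steps) for one event."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    t_obs = int(np.argmax(obs))   # time step of the observed peak
    t_sim = int(np.argmax(sim))   # time step of the simulated peak
    relative_error = (sim[t_sim] - obs[t_obs]) / obs[t_obs]
    return float(relative_error), t_sim - t_obs
```

A negative lag corresponds to a peak predicted ahead of time, which is the behavior the text describes as leaving preparation time for flood control.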
6 Conclusion
This paper proposes the STH-Trans model for small and medium-sized ungauged basins for which geographic terrain data are available. The model integrates transfer learning and semi-supervised learning, taking into account the spatial characteristics of the target basin and the spatiotemporal characteristics of the surrounding basins. The innovations and advantages of this work are as follows.
- (1) The STH-Trans model employs transfer learning to obtain initial results by training on the most similar base dataset. Compared with parameter transplantation from neighboring areas, this method compares the geographic terrain of the watersheds and finds the most suitable watershed for transfer.
- (2) The STH-Trans model refines and improves the experimental results based on geographic characteristics, basin specific characteristics, surrounding watershed characteristics, and rainfall, improving the accuracy of model prediction.
- (3) The STH-Trans model reduces the dependence on expert knowledge and addresses data gaps often encountered in traditional data-driven models. This results in enhanced prediction accuracy and a more versatile application across different scenarios.
Moreover, most ungauged basins are in remote and underdeveloped areas, and our research results can provide flow prediction for these basins. The model can provide data support for water resource management as well as decision-making support for flood prevention and control.
However, the model in this paper still has some limitations.
- (1) The watershed tested in this model has a relatively flat terrain and is located in a plain area, belonging to a subtropical humid climate. Therefore, the effectiveness of the model in arid areas still needs to be considered.
- (2) Although the model in this article is aimed at ungauged basins, it still requires a large amount of other data, including geographic terrain data, rainfall data, as well as hydrological and terrain data of surrounding watersheds.
- (3) The model does not analyze in depth the mechanisms behind hydrological function. It focuses on data analysis rather than physical processes, and its interpretability is weak. This is a common problem with machine learning models. However, process understanding is necessary when using machine learning-based models for hydrologic predictions [49, 50].
In the future, in terms of models, we will consider incorporating temporal similarity and constructing a hydrological spatiotemporal similarity to further refine domain selection and improve prediction accuracy. In terms of application, we will consider studying semi-arid and semi-humid regions as well as arid regions. We also expect to conduct interpretability analysis of the machine learning models by combining geographic terrain with hydrological and physical mechanisms, in order to improve the interpretability of the model.
References
- 1. Valimba P, Mahe G. Estimating Flood Magnitudes of Ungauged Urban Msimbazi River Catchment in Dar es Salaam, Tanzania. College of Engineering and Technology, University of Dar es Salaam. 2020;(1):59–71.
- 2. R A, M BD, A P, G D, R RL. Streamflow prediction in ungauged basins: analysis of regionalization methods in a hydrologically heterogeneous region of Mexico. Hydrolog Sci J. 2019;64:1297–1311.
- 3. Levin SB, Farmer WH. Evaluation of Uncertainty Intervals for Daily, Statistically Derived Streamflow Estimates at Ungaged Basins Across the Continental U.S. Water. 2020;12(5):1390.
- 4. Guo Y, Zhang Y, Zhang L, Wang Z. Regionalization of hydrological modeling for predicting streamflow in ungauged catchments: A comprehensive review. Wiley Interdisciplinary Reviews: Water. 2020;.
- 5. Salinas JL, Laaha G, Rogger M, Parajka J, Viglione A, Sivapalan M, et al. Comparative assessment of predictions in ungauged basins—Part 2: Flood and low flow studies. Hydrology and Earth System Sciences. 2013;17:2637–2652.
- 6. Parajka J, Viglione A, Rogger M, Salinas JL, Sivapalan M, Blöschl G. Comparative assessment of predictions in ungauged basins—Part 1: Runoff-hydrograph studies. Hydrology and Earth System Sciences. 2013;17:1783–1795.
- 7. Zelelew Alfredsen. Transferability of hydrological model parameter spaces in the estimation of runoff in ungauged catchments. Hydrological Sciences Journal. 2014;59:1470–1490.
- 8. Blöschl G, Bierkens MFP, Chambel A. Twenty-three unsolved problems in hydrology (UPH)—a community perspective. Hydrolog Sci J. 2019;64:1141–1158.
- 9. Hrachowitz M, Savenije HHG, Blöschl G, Mcdonnell JJ, Cudennec C. A decade of Predictions in Ungauged Basins (PUB)—a review. Hydrological Sciences Journal/Journal des Sciences Hydrologiques. 2013;58(6):1198–1255.
- 10. Emmerik TV, Mulder G, Eilander D, Piet M, Savenije H. Predicting the ungauged basin: model validation and realism assessment. Frontiers in Earth Science. 2015;3(62).
- 11. Kong X, Li Z, Liu Z. Flood Prediction in Ungauged Basins by Physical-Based TOPKAPI Model. Advances in Meteorology. 2019;2019(4):1–16.
- 12. Besaw LE, Rizzo DM, Bierman PR, Hackett WR. Advances in ungauged streamflow prediction using artificial neural networks. Journal of Hydrology. 2010;386(1-4):27–37.
- 13. Razavi T, Coulibaly P. Streamflow Prediction in Ungauged Basins: Review of Regionalization Methods. Journal of Hydrologic Engineering. 2013;18(8):958–975.
- 14. Yu Y, Zhang H, Singh V. Forward Prediction of Runoff Data in Data-Scarce Basins with an Improved Ensemble Empirical Mode Decomposition (EEMD) Model. Multidisciplinary Digital Publishing Institute. 2018;10(4):388.
- 15. Oruche R, Egede L, Baker T, O’Donncha F. Transfer learning to improve streamflow forecasts in data sparse regions. 2021;.
- 16. Chu K, Oh C, Choi J, Kim B. Estimation of Threshold Rainfall in Ungauged Areas Using Machine Learning. Water. 2022;14(859–859).
- 17. Rasheed Z, Aravamudan A, Sefidmazgi AG, Anagnostopoulos GC, Nikolopoulos EI. Advancing flood warning procedures in ungauged basins with machine learning. Journal of Hydrology. 2022;609:127736.
- 18. Jeonghyeon C, Ungtae K, Sangdan K. Ecohydrologic model with satellite-based data for predicting streamflow in ungauged basins. The Science of the total environment. 2023;903:166617.
- 19. Demirel MC, Koch J, Rakovec O, Kumar R, Mai J, Müller S, et al. Tradeoffs Between Temporal and Spatial Pattern Calibration and Their Impacts on Robustness and Transferability of Hydrologic Model Parameters to Ungauged Basins. Water Resources Research. 2024;60.
- 20. Livingstone S, Zacharia K, Mwajuma L. Calibration and verification of a hydrological SWMM model for the ungauged Kinyerezi River catchment in Dar es Salaam, Tanzania. Modeling Earth Systems and Environment. 2024;10:2803–2818.
- 21. C G, I DC, A DN, R G. Variance-based Global Sensitivity Analysis of Surface Runoff Parameters for Hydrological Modeling of a Real Peri-urban Ungauged Basin. Water Resources Management. 2024;38:3007–3022.
- 22. Bais AS. IMPROVE A LEARNER: A SURVEY OF TRANSFER LEARNING. Journal of Fundamental and Applied Sciences. 2018;10:136–147.
- 23. Pan W. A survey of transfer learning for collaborative recommendation with auxiliary data. Neurocomputing. 2016;177(C):447–453.
- 24. Dev K, Ashraf Z, Muhuri PK, Kumar S. Deep autoencoder based domain adaptation for transfer learning. Multimedia tools and applications. 2022;81:21–27. pmid:35310888
- 25. Zhang X, Chen Z, Gao J, Huang W, Li P, Zhang J. A two-stage deep transfer learning model and its application for medical image processing in Traditional Chinese Medicine. Knowledge-Based Systems. 2022;239.
- 26. Man CK, Quddus M, Theofilatos A. Transfer learning for spatio-temporal transferability of real-time crash prediction models. Accident Analysis and Prevention. 2022;165:106511. pmid:34894483
- 27. Zhao W, Liu X, Jing J, Xi R. Re-LSTM: A long short-term memory network text similarity algorithm based on weighted word embedding. Connection Science. 2022;34:2652–2670.
- 28. Agheli B, Adabitabar FM, H G. Similarity measure for Pythagorean fuzzy sets and application on multiple criteria decision making. Journal of Statistics and Management Systems. 2022;25:749–769.
- 29. Jiang X, Huang Y, Zhang F. Study on Spatial Geometric Similarity Based on Conformal Geometric Algebra. International Journal of Environmental Research and Public Health. 2022;19:10807. pmid:36078512
- 30. Zhu B, Kan G, He X. Hydrologic Similarity and Parameter Transplantation of Non data Nested Watershed. Journal of China Institute of Water Resources and Hydropower Research. 2020;18:223–231.
- 31. Beven KJ, J M. A physically based, variable contributing area model of basin hydrology. Hydrological Sciences Journal. 1979;24:43–69.
- 32. Liu X, Wang G, Cai Z, Zhang H. A MultiBoosting Based Transfer Learning Algorithm. Journal of Advanced Computational Intelligence and Intelligent Informatics. 2015;19(381–388).
- 33. Fu Z, Li C, Yao W. Landslide susceptibility assessment through TrAdaBoost transfer learning models using two landslide inventories. Catena. 2023;222.
- 34. Zhai G, Sun Z, Wang G, Li P, Liang Q, Zhang M. Instance-based transfer learning method for locating loose particles inside aerospace equipment. Measurement. 2023;221.
- 35. Bu L. Application of M5 Model Tree in Passive Remote Sensing of Thin Ice Cloud Microphysical Properties in Terahertz Region. Remote Sensing. 2021;13.
- 36. Lv M, Li Y, Chen L, Chen T. Air Quality Estimation by Exploiting Terrain Features and Multi-View Transfer Semi-Supervised Regression. Information Sciences. 2019;483:82–95.
- 37. Li , Wang H, Li Y, Xiao Y, Hu G, Zhao P, et al. Learning adaptive criteria weights for active semi-supervised learning. Information Sciences: An International Journal. 2021;561:286–303.
- 38. Yang W, Xia K, Li T, Xie M, Song F. A Multi-Strategy Marine Predator Algorithm and Its Application in Joint Regularization Semi-Supervised ELM. Mathematics. 2021;9(291).
- 39. Shi S, Nie F, Wang R, Li X. Semi-supervised learning based on intra-view heterogeneity and inter-view compatibility for image classification. Neurocomputing. 2022;488:248–260.
- 40. Andersen FH. Hydrological modeling in a semi-arid area using remote sensing data. AGU Fall Meeting Abstracts. 2008;.
- 41. Bin Y, Yang Y, Shen F, Xu X. Combining multi-representation for multimedia event detection using co-training. Neurocomputing. 2016; p. 11–18.
- 42. Rothpletz-Puglia P, Kaganda J. Developing Nutrition Research Capacity at the Tanzania Food and Nutrition Centre through Collaborative Training and Technical Assistance. Journal of Nutrition Education & Behavior. 2017;49(7):34.
- 43. Huang P, Wang H, Jin Y. Offline data-driven evolutionary optimization based on tri-training. Swarm and Evolutionary Computation. 2020;60.
- 44. Wang L, Yao Y, Wang K, Daniel AC, Zhao G, Lai F. A Novel Surrogate-Assisted Multi-objective Optimization Method for Well Control Parameters Based on Tri-Training. Natural resources research. 2021;(6):30.
- 45. Liu Q, Liu S, Wang G, Xia S. Social Relationship Prediction across Networks Using Tri-training BP Neural Networks. Neurocomputing. 2020;401.
- 46. Xiao-bin G. Analysis of reducing discharge measurement times of Shishi hydrology station. Jiangxi Hydraulic Science & Technology. 2015;41:138–143.
- 47. Zhao Q, Cui S, Zhu Y, Li R, Zhou X. A Novel Online Hydrological Data Quality Control Approach Based on Adaptive Differential Evolution. Mathematics. 2024;12(12):1821–1821.
- 48. Song X, Liu H, Feng Y, Xiong J, Luo Y. A Co-training Algorithm by Predicting Confidence of Unlabeled Neighbors. International Journal of Artificial Intelligence Tools: Architectures, Languages, Algorithms. 2022;.
- 49. Prashant I, Akshay K, Basudev B. Value of process understanding in the era of machine learning: A case for recession flow prediction. Journal of Hydrology. 2023;626(PB).
- 50. S NG, Frederik K, Keefe SA, S PC, Daniel K, M FJ, et al. What Role Does Hydrological Science Play in the Age of Machine Learning? Water Resources Research. 2021;57(3).