Abstract
The global climate crisis is creating increasingly complex rainfall patterns, leading to a rising demand for data-driven artificial intelligence (AI) in short-term weather forecasting. However, the black-box nature of AI models remains a critical obstacle to their integration into existing forecasting operations. This study addresses this issue by developing an explainable AI framework that extracts precipitation mechanisms from the model’s internal activation patterns as it predicts future rainfall intensity. The primary objective of this study is to enable semi-automatic knowledge discovery of the weather mechanisms embedded in the nonlinear AI model through an unsupervised concept explanation method. A key challenge is the inherent fuzziness and complexity of precipitation systems. To address this, we propose a probabilistic multi-label self-supervised clustering approach within the explainable framework. Our algorithm refines an insufficient embedding space into perceptually meaningful representations. It improves clustering performance over existing baselines, increasing the Silhouette Coefficient, which measures intra-cluster similarity and inter-cluster dissimilarity, by 0.5358. We extract and characterize primary meteorological mechanisms through comprehensive case studies: convectional, frontal, orographic, and cyclonic precipitations. These findings are further validated by a user study involving forecasters at the Korea Meteorological Administration. We assess the distinguishability of the extracted rainfall patterns through a user survey on their homogeneity. The results indicate comparable accuracies between existing human-annotated label-based examples (80%) and the unsupervised model-based ones (92%).
Furthermore, the proposed method can effectively distinguish between polar low and typhoon cases, successfully capturing their different mechanisms even though their cyclonic shapes are analogous. Our structured methodology can provide a pathway for detecting extreme weather events, such as heavy rainfall and isolated thunderstorms, in near real time, thereby supporting operational forecasting and posthoc analysis tasks.
Citation: Kim S, Choi J, Lee S, Choi J (2025) Unsupervised concept discovery for deep weather forecast models with high-resolution radar data. PLOS Clim 4(9): e0000633. https://doi.org/10.1371/journal.pclm.0000633
Editor: Juan A. Añel, Universidade de Vigo, SPAIN
Received: April 26, 2025; Accepted: August 1, 2025; Published: September 18, 2025
Copyright: © 2025 Kim et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The radar composite data is developed by the KMA Weather Radar Center (WRC). It has been retrieved from the NIMS and is publicly available at the Korean National Climate Data Center (NCDC, Korean: https://data.kma.go.kr/data/rmt/rmtList.do?code=11pgmNo=62, English: https://data.kma.go.kr/resources/html/en/aowdp.html, last access: 1 February 2023). The human annotated label dataset is publicly available in the figshare repository [46, 47] with the identifier [DOI:10.6084/m9.figshare.27993743.v2].
Funding: This work was supported by the following funding sources: The Artificial Intelligence Graduate School Program (KAIST) funded by the Institute of Information & Communications Technology Planning & Evaluation (IITP) and the Ministry of Science and ICT (MSIT) under grant number RS-2019-II190075, received by authors [SK, JhC, SL, and JsC] (https://www.iitp.kr/en/main.it). The project titled “Development of Artificial Intelligence Technology for Personalized Plug-and-Play Explanation and Verification of Explanation” funded by IITP and MSIT under grant number RS-2022-II220984, received by authors [SK, JhC, SL, and JsC] (https://www.iitp.kr/en/main.it). The project “Developing Intelligent Assistant Technology and Its Application for Weather Forecasting Process” funded by the Korea Meteorological Administration (KMA) and the National Institute of Meteorological Sciences (NIMS) under grant number KMA2021-00123, received by authors [SK, JhC, SL, and JsC] (http://www.nims.go.kr/AE/MA/main.jsp). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Due to increasingly chaotic weather conditions caused by the global climate crisis, the demand for faster and more accurate weather predictions is growing. Traditional weather forecasting relies on numerical weather prediction (NWP), which makes predictions by solving theory-based partial differential equations. Despite its theoretical soundness, NWP has several limitations. First, NWP is computationally expensive, often requiring extensive hypercomputing facilities for timely predictions. While relaxing assumptions through methods such as quasi-geostrophic approximations [1] or hydrostatic and anelastic approximations [2–4] can reduce the computation cost, it prevents the full utilization of available data. Data-driven machine learning models can circumvent these problems. For instance, the deep neural network (DNN) model from [5] predicts hundreds of weather variables globally for the next 10 days in one minute, while the model from [6] makes hourly predictions in 1.5 seconds. Furthermore, these models learn nonlinear weather patterns directly from data [7], fully utilizing all available information. The increasing availability of high-quality data [8,9] makes this aspect of DNNs even more attractive (Fig 1).
Conceptual examples are provided for convectional, frontal, cyclonic, and orographic precipitation.
Despite these strengths, DNNs have a critical weakness preventing their integration into operational weather forecasting: they are black boxes. Since a DNN comprises complex interconnections of numerous neurons, it is challenging to understand its exact decision-making process. This opaqueness limits the trustworthiness of a DNN’s predictions and, therefore, its worth to operational forecasters. This gap in trust is a ubiquitous problem across multiple domains, leading to the development of explainable AI techniques to aid users in interpreting the behavior of AI models [10].
Applying explainable algorithms to weather forecasting models is challenging due to the inherent ambiguity of weather systems, as shown in Fig 2; different precipitation systems can co-occur in a single region. These systems often exhibit fuzzy edges, making it difficult to establish clear boundaries between neighboring systems. The systems can also be sparse, consisting of small patches of observed rainfall that are difficult to identify without preprocessing. Furthermore, as shown in Fig 3, real-world precipitation mechanisms exhibit ambiguous and entangled semantics. Consequently, explanations for weather forecasting models inherently require probabilistic approaches rather than deterministic methods to represent the degree of ambiguity and entanglement. This study considers individual precipitation systems independently by applying a domain-tailored instance segmentation to separate the distinct rainfall mechanisms within a single data instance, enabling analysis at the rainfall-system level.
An example image with two precipitation cases: (1) frontal and (2) convective system.
One popular branch of explanation methods is example-based explanations, which use samples from the familiar input space as explanations, making the results understandable even to lay users. In particular, example-based concept explanations offer examples illustrating human-comprehensible concepts captured by a model [12]. Such concepts in weather forecast models may include rainfall intensity, shape, or rainfall mechanisms (high-level semantics). Defining the term concept is challenging, even for human experts [13]. Concepts can be extracted by analyzing the internal vector space of a model even if it is not explicitly designed to capture the chosen concepts [12]. Several studies involve posthoc analysis and human annotation of extracted concepts to assign meaningful interpretations [14]. While there are several desiderata for concepts, this paper focuses on (1) possessing inherent meaning and (2) enabling distinction between or within concepts. The first desideratum is typically assessed through human expert evaluation [14–17], while the second can be measured using a Silhouette Coefficient score [18]. In this study, we conduct a user survey with expert forecasters to assess the first desideratum. We inherently address the second desideratum by employing a soft Silhouette coefficient-based deep clustering algorithm.
Example-based explanations can have several drawbacks, such as the insufficient representational power of given feature spaces, the risk of trivial clustering outcomes, and the gap between human perception and the generated examples. First, the distance measurement in the concept analysis is sensitive to the chosen feature space [19,20]. Although Euclidean distance in the feature space is known to reflect human perceptual distance when tested on benchmark datasets [21–23], the representational spaces are often insufficient in real-world applications due to constraints such as scarcity of data [24]. The efficient training properties of DNNs exacerbate the problem, since a model may settle on an incomplete representation space that minimizes the objective function while concentrating on only the most critical information in the data; for example, unsupervised clustering in such a feature space often yields trivial clusters that focus solely on color while ignoring object shape [25,26]. To this end, several studies suggest building refined manifolds on top of the feature space of the original model to achieve more meaningful representations [24]. Building on these ideas, this paper enhances the representational space of the target model under a metric learning scheme.
The second limitation is the gap between human understanding and the generated examples. Given a set of analogous examples an algorithm generates, a user may not necessarily identify their similarities when the samples are analytically similar (e.g., developing or dissipating rainfall) but intuitively different (e.g., having identical colors or intensities) [27,28]. To address this cognitive gap, we designed the questionnaire in a comparative format to measure the relative accuracy with which users can assess the homogeneity of the generated explanations versus human annotations. Based on the discussion above, example-based explanations can provide a useful understanding of a model’s decision-making process if its shortcomings can be addressed. Therefore, this study proposes an example-based concept explanation framework (See Fig 1) with the following components to enhance the interpretability and reliability of data-driven precipitation forecast models.
Recent research explores AI-based vector space analysis of weather and climate patterns. We discuss their limitations and highlight our contributions. [29] and [30] take a similar approach, using clustering algorithms to analyze heavy rainfall during the summer season in South Korea. These studies typically perform clustering on input data for their analysis; however, input-level clustering often captures visual similarities rather than semantic concepts. [31] distinguishes typhoon structures using a specific internal layer of a typhoon detection model. To our knowledge, concept-analysis algorithms that extract diverse precipitation mechanisms from the representation space of trained weather forecasting DNNs have yet to be explored. One issue with using this vector space is that it may be too entangled for meaningful analysis, which is critical for knowledge discovery. Hence, we address this problem by transforming the original space into a more meaningful, disentangled space through a multi-label deep clustering method. We will discuss this process in the next section.
To identify ambiguous and co-occurring meteorological concepts, we need to perform multi-label deep clustering to capture relationships among the multiple labels. Regular clustering methods [32,33] use an objective function that optimizes intra-class similarity while reducing inter-class similarity. Directly measuring this property, the Silhouette Coefficient score [18] is a popular evaluation metric for clustering tasks, and Soft-Silhouette clustering [34] uses the metric as a contrastive-learning objective for optimizing clusters. However, these studies are often centered around single-label multi-class clustering. We instead adapt Soft-Silhouette [34] to multi-label clustering based on the theoretical foundation of Binary Relevance [35], which decomposes a classifier with k classes into k independent binary classifiers. Specifically, we modify the final activation function of the target model from a single multi-class label to several binary labels to enable probabilistic multi-label classification. One benefit of using the silhouette coefficient for optimization is that its innate contrastive learning scheme naturally avoids trivial clustering solutions, a problem often associated with cross-entropy loss [36]. Our contributions are threefold:
- Semi-Automatic Extraction of Rainfall Mechanisms from an AI Model: We extract 24 concepts and characterize them as identifiable weather patterns, including cyclonic, convectional, frontal, and orographic precipitations, which are verified to be distinguishable by forecasters.
- Probabilistic Deep Clustering for Ambiguous Rainfall Mechanisms: To explain ambiguous weather patterns with multiple rainfall mechanisms, we extend previous algorithms to support probabilistic multi-label clustering by incorporating binary sigmoid loss.
- Self-Supervised Learning for Enhanced Representation Spaces: We employ self-supervised learning techniques to generate perceptually more meaningful representations on top of the manifold of the target model, addressing the issue of insufficient or biased representations.
Materials and methods
This section introduces our unsupervised concept discovery framework which consists of four parts (Fig 4): (1) data preprocessing using instance segmentation tailored for meteorological data (see Section Instance Segmentation for Rainfall Systems), (2) self-supervised learning-based representation space refinement for meaningful initial cluster centroids (see Section Self-Supervised Learning for Refining Feature Spaces), (3) clustering in the feature space for pseudo-labels (see Section Multi-Label Deep Clustering for Co-Occurring Rainfall Systems), (4) unsupervised concept activation vector (CAV) extraction and evaluation of alignment with domain knowledge (see Section Concept Activation Vector Localization).
(1) The input data is preprocessed by applying the watershed instance segmentation method on the activation vector space of the bottleneck layer of the given trained precipitation forecast model. (2) A masked autoencoder is trained for self-supervised learning on the previously processed activations to refine a meaningful representation space (z_pre) from the suboptimal vector space. (3) Pseudo-labels representing weather patterns are created by performing the proposed multi-label deep clustering on top of the refined representational space, finally obtaining the revised representational space z_soft. (4) Concept vectors (CAVs) are extracted from the final vector space.
Model and datasets
The target model is an unpublished variant of convolution-based DeepRaNE [37] provided by the National Institute of Meteorological Sciences (NIMS) in South Korea. The model consists of a denoising autoencoder and a U-Net. The input and target output are precipitation intensities derived from radar reflectivity observations. Each input instance consists of seven high-resolution radar observations at 10-minute intervals, spanning from 60 minutes prior (T-60) to the reference time (T), plus the 1-hour cumulative average, concatenated channel-wise with temporal information (month, day, hour) and spatial coordinates (longitude, latitude) following an early fusion approach in multimodal learning [38], which enhances the spatial and temporal embeddings. The training data for concept extraction spans 10-minute intervals from 2018 to 2021, inclusive. The data used for the explanation module is extracted from the activation vectors of the bottleneck layer in the U-Net module. Detailed descriptions of the model and radar preprocessing are provided in Sections A.2 and A.3 in S1 Appendix, respectively.
Instance segmentation for rainfall systems
To address the ambiguity and individuality of precipitation systems in radar images, we use the watershed image segmentation algorithm [39] (which gradually expands regions from local minima (markers) until they converge at local maxima (watershed boundaries)), which considers both pixel intensity and spatial distance, as shown in step (1) of Fig 4, along with tailored pre- and post-processing techniques. In the pre-processing step, regions with precipitation rates below 0.1 mm hr-1 are masked as non-precipitation areas [40], and the input is binarized to delineate individual echo cells. Consequently, the watershed algorithm operates on binary input and adjacency information only, which reduces to Voronoi segmentation incorporating the boundary masks, thereby avoiding fine-grained over-segmentation. In the post-processing step, directly adjacent segments are merged into a single system. Each segment in the input space is then mapped into a corresponding region in the vector space of the target layer. This region is subsequently cropped and resized into a uniform shape, which improves computational efficiency through dimensionality reduction while preserving distinct rainfall systems. We filter out channels that are inactive throughout the training dataset to further reduce dimensionality, retaining 280 out of the original 1,024 channels. The resulting feature vectors, z_i ∈ R^(280×9×9), are utilized in all downstream tasks. For implementation details and reproducibility, refer to Section B.1 in S1 Appendix.
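The pre- and post-processing around the watershed step can be illustrated with a minimal sketch. This is a simplification, not the paper's implementation: with binary input and adjacency information only, region growing reduces to labeling connected echo cells, so we approximate it here with connected-component labeling; the function name, the dilation-based merging of adjacent segments, and the toy threshold handling are our own illustrative choices.

```python
import numpy as np
from scipy import ndimage

def segment_rain_systems(rain, thresh=0.1, merge_iter=1):
    """Toy sketch of rainfall-system segmentation.

    rain: 2-D array of precipitation rates (mm/hr).
    Pixels below `thresh` are masked as non-precipitation; a small binary
    dilation merges directly adjacent echo cells into one system before
    connected-component labeling, mimicking the post-processing merge step.
    """
    mask = rain >= thresh                            # mask non-precipitation areas
    merged = ndimage.binary_dilation(mask, iterations=merge_iter)
    labels, n_systems = ndimage.label(merged)        # one label per rainfall system
    labels = labels * mask                           # restrict labels to rainy pixels
    return labels, n_systems
```

Each labeled segment would then be cropped from the activation map and resized to a uniform shape for the downstream clustering.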
Self-Supervised learning for refining feature spaces
As discussed previously, two potential issues exist with the direct application of clustering-based concept extraction on the feature manifold of a deep weather forecast model. First, the embedding space may be insufficiently trained due to limited training data relative to the model’s size, leading to suboptimal clusters with incomplete disentanglement and poorly estimated centroids. Second, weather patterns are often ambiguous, requiring multiple concept labels to be assigned probabilistically rather than deterministically. We address the first problem by employing a self-supervised learning scheme consisting of a masked autoencoder (MAE) [41] with mean-squared error (MSE) loss to generate the refined representational activation vector space z_pre (step (2) of Fig 4). The MAE randomly masks a significant portion of the input, and the model solves the task of reconstructing the missing patches from the remaining visible elements. This simple yet effective approach enhances the representation space by forcing the model to infer absent content solely from the visible patches.
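The MAE training signal can be sketched as follows. This is an illustration only: the learned encoder-decoder is replaced by a trivial mean-filling placeholder, and the masking ratio, patch shapes, and function names are our assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mae_step(patches, mask_ratio=0.75):
    """One illustrative MAE objective evaluation: hide most patches,
    'reconstruct' them, and score MSE only on the hidden patches.
    The mean-filling prediction is a hypothetical stand-in for the model."""
    n = patches.shape[0]
    n_masked = int(round(mask_ratio * n))
    masked_idx = rng.choice(n, size=n_masked, replace=False)
    visible = np.delete(patches, masked_idx, axis=0)
    # placeholder "decoder": predict every hidden patch as the mean visible patch
    prediction = np.broadcast_to(visible.mean(axis=0),
                                 (n_masked,) + patches.shape[1:])
    return np.mean((prediction - patches[masked_idx]) ** 2)

patches = rng.normal(size=(16, 9, 9))   # e.g. patches of a 9x9 activation map
loss = mae_step(patches)
```

In the actual framework, minimizing this reconstruction loss through a learned encoder-decoder is what refines the representation space.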
Multi-Label deep clustering for co-occurring rainfall systems
To address the second issue, we introduce a multi-label deep clustering method. We first transform the class assignment process of the traditional deep clustering method into a multi-label classification problem, which converts each class into an independent binary label via a sigmoid function. We then perform a modified three-stage deep clustering from [42] (step (3) of Fig 4): (i) refined embedding via the self-supervised MAE, (ii) pseudo-class assignment via k-means clustering to generate initial clusters, and (iii) multi-label soft class assignment with cluster adjustment. The last stage is performed using the soft-silhouette loss [34] (https://github.com/gvardakas/Soft-Silhouette). This loss function extends the Silhouette Coefficient score [17,18] by minimizing intra-cluster distance and maximizing inter-cluster distance. Specifically, the average soft Silhouette score S(·) is defined as the expected conditional Silhouette value s_Ck (Eq. 2) weighted by the i-th sample’s k-th cluster assignment probability P_Ck. The conditional silhouette value s_Ck(z_i) is computed from two factors: a_Ck(z_i), the distance between z_i and Ck weighted by the expected probability of all other points z_j (j ≠ i) belonging to Ck, and b_Ck(z_i), the expected distance of z_i from the closest cluster C_l (l ≠ k) to Ck. P_Ck is computed using a radial basis function (RBF) kernel (Eq. 5) to account for the mean and variance when computing distances between data points and clusters [35]. The RBF distance features are scaled by the temperature factor τ [43] to mitigate the overconfidence problem before computing the class assignment probabilities via the final activation function.
The overall loss consists of the soft Silhouette, entropy, and MSE loss. The entropy loss is a regularizer to prevent dominance by a specific cluster, while MSE loss is added to preserve the representational power of the learned embeddings [34]. These three loss functions refine the trained feature space, using the pseudo-class centroids as anchors to enhance clustering effectiveness:
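As a sketch, assuming the standard Soft-Silhouette convention in which the silhouette score and the cluster-balance entropy are maximized while the reconstruction error is minimized, the combined objective can be written as:

```latex
\min_{w,\theta,r}\;
\frac{1}{N}\sum_{i=1}^{N}\bigl\lVert x_i - g_\theta\!\bigl(f_w(x_i)\bigr)\bigr\rVert_2^2
\;-\;\lambda_1\, S\bigl(h_r(f_w(X))\bigr)
\;-\;\lambda_2\, H\bigl(h_r(f_w(X))\bigr)
```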
where fw and gθ are the encoder and decoder functions, respectively. The function hr(·) denotes the clustering function, which outputs the probability of assigning sample zi = fw(xi) to the k-th cluster in the feature space. λ1 and λ2 are the scaling factors for the soft silhouette and entropy losses, respectively. The model parameters w,θ, and r are trained by gradient descent.
We apply the Gaussian RBF kernel k(d) = exp(−d² / (2σ²)), whose input domain is [0, ∞) and whose output range is (0, 1]; the parameter σ is also trainable. We set the temperature factor to 0.25 and shift the kernel output by −0.5 so that the final sigmoid activation covers the range (0, 1). The final refined vectors z_soft ∈ R^280 and the clusters are obtained through iterative optimization.
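As an illustrative sketch (numpy only; the function names, the simplified per-point weighting, and all numeric values are ours, not the paper's implementation), the RBF-based soft assignment and a soft Silhouette score can be computed as follows:

```python
import numpy as np

def soft_assignments(Z, centroids, sigma=1.0, tau=0.25):
    """Probabilistic multi-label cluster assignment (sketch).
    The Gaussian RBF kernel maps squared distances into (0, 1]; subtracting
    0.5 and dividing by the temperature tau centers the logits before the
    sigmoid, yielding independent per-cluster probabilities (Binary Relevance)."""
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (n, k)
    kernel = np.exp(-d2 / (2.0 * sigma ** 2))                    # RBF output in (0, 1]
    return 1.0 / (1.0 + np.exp(-(kernel - 0.5) / tau))           # sigmoid activation

def soft_silhouette(Z, P, eps=1e-9):
    """Simplified soft Silhouette: distances weighted by assignment
    probabilities; higher means tighter, better-separated clusters."""
    n, k = P.shape
    D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1))  # (n, n)
    a = np.empty((n, k))
    for i in range(n):
        w = np.delete(P, i, axis=0)          # membership weights, excluding self
        d = np.delete(D[i], i)
        a[i] = (w * d[:, None]).sum(0) / (w.sum(0) + eps)
    s = np.empty(n)
    for i in range(n):
        c = np.argmax(P[i])                  # dominant cluster of sample i
        b = np.min(np.delete(a[i], c))       # expected distance to nearest other cluster
        s[i] = (b - a[i, c]) / (max(a[i, c], b) + eps)
    return float(np.average(s, weights=P.max(axis=1)))
```

Optimizing a differentiable version of this score jointly with the entropy and MSE terms is what adjusts the clusters in stage (iii).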
Concept activation vector localization
We then extract CAVs from the refined vector space, using the cluster results as pseudo labels to train individual linear probers, as introduced in the introduction section. Unlike previous approaches, the output clusters are not entirely exclusive due to the multi-label clustering. We use cluster labels with a probability above the 0.5 threshold as pseudo labels for each concept. Following previous literature, clusters with fewer than 50 samples can be omitted as a post-processing step [14]. To obtain CAVs and concept probers, support vector machine linear classifiers are trained in a one-vs-all setting. As an additional technique, an L1 regularizer can produce sparse and efficient CAVs. Platt’s sigmoid calibration method [44] can alleviate the overconfidence issue of the output logits, and an ensemble mechanism that averages the coefficients from k models through k-fold cross-validation can mitigate classifier overfitting. We provide the implementation details in Section B.2 in S1 Appendix for reproducibility.
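A minimal sketch of this CAV extraction, using scikit-learn as a stand-in for the paper's implementation (the function name, regularization strength, and toy data are our assumptions), combines an L1-regularized one-vs-all linear SVM with Platt's sigmoid calibration:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

rng = np.random.default_rng(0)

def extract_cav(Z, pseudo_labels, concept, C=1.0):
    """Train a one-vs-all L1 linear SVM prober for one concept; the
    normalized weight vector serves as the CAV, and a Platt-calibrated
    copy provides well-behaved concept probabilities."""
    y = (pseudo_labels == concept).astype(int)       # one-vs-all binary target
    svm = LinearSVC(penalty="l1", dual=False, C=C)   # L1 yields a sparse CAV
    svm.fit(Z, y)
    cav = svm.coef_.ravel() / (np.linalg.norm(svm.coef_) + 1e-12)
    prober = CalibratedClassifierCV(LinearSVC(penalty="l1", dual=False, C=C),
                                    method="sigmoid", cv=3)  # Platt scaling
    prober.fit(Z, y)
    return cav, prober

# toy embeddings: concept 1 is shifted along the first feature direction
Z = rng.normal(size=(120, 16))
labels = np.repeat([0, 1], 60)
Z[labels == 1, 0] += 3.0
cav, prober = extract_cav(Z, labels, concept=1)
```

The k-fold coefficient-averaging ensemble mentioned above would simply average the `cav` vectors obtained from the k cross-validation fits.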
Results and discussion
Extracted concepts based on semi-automatic clustering
Figs 5 and 6 showcase examples of output clusters generated via deep clustering in the refined representation space z_soft. The concept labels are derived through posthoc analysis and statistical analysis on a human-annotated label dataset [11]. Cluster 4 captures the east-coast-rainfall concept, which exhibits patterns along the eastern coast of the Korean Peninsula that are often influenced by orographic lifting. Cluster 8 corresponds to the convectional rainfall concept, which forms localized intense shower patterns due to strong updrafts within cumulonimbus clouds. Cluster 12 is associated with the typhoon concept, characterized by cyclonic patterns accompanied by heavy rainfall; this mechanism involves a tropical cyclone with a central eye and strong spiral rain bands. Cluster 20 corresponds to the stationary-front concept, which forms elongated, thick, linear rainfall bands. Cluster 23 is associated with the lake-effect snowfall concept, which exhibits a distinctive wave-like pattern over a broad area; this mechanism occurs when cold air moves over a warm sea surface, resulting in prolonged snow showers downwind. Section C.4 in S1 Appendix provides additional output examples and details of the posthoc analysis.
The date time is in UTC. A. Example instances of Cluster 0: upper-low-jet-coupling. B. Example instances of Cluster 1: upper-low-jet-coupling. C. Example instances of Cluster 2: upper-low-jet-coupling. D. Example instances of Cluster 3: changma. E. Example instances of Cluster 4: upper-low-jet-coupling and east-coast-rainfall. F. Example instances of Cluster 6: changma, typhoon, and convectional. G. Example instances of Cluster 7: typhoon and low-level-jet. H. Example instances of Cluster 8: convectional. I. Example instances of Cluster 10: stationary-front. J. Example instances of Cluster 11: upper-low-jet-coupling.
The date time is in UTC. (Cont.). A. Example instances of Cluster 12: typhoon. B. Example instances of Cluster 14: convectional. C. Example instances of Cluster 15: drizzle. D. Example instances of Cluster 16: changma. E. Example instances of Cluster 18: typhoon and north-pacific-high-edge. F. Example instances of Cluster 19: typhoon. G. Example instances of Cluster 20: convectional and stationary-front. H. Example instances of Cluster 21: typhoon. I. Example instances of Cluster 22: changma. J. Example instances of Cluster 23: lake-effect-snowfall.
Evaluation of representation space for concept extraction
To verify that the refined vector space z_soft provides a high-quality representation space for concept analysis, we compare the performance of Automatic Concept Extraction (ACE) [14], which uses k-means clustering on a specific layer l’s embedding vector ϕl(x), with our clustering method applied to three embedding vector spaces: z, z_pre, and z_soft. Due to computational costs and memory constraints, we cannot use the original ACE implementation on ϕ(x) ∈ R^(1024×45×36); instead, we use z as an alternative. We measure the discrete Silhouette Coefficient score, which ranges from -1 to 1, where a near-zero value indicates poor cluster separation. As shown in Table 1, both ACE and our framework trained on z achieve near-zero scores (-0.0291 and -0.0039, respectively), suggesting a high overlap of clusters. In contrast, our framework achieves a score of 0.3441 with z_pre and 0.5067 with z_soft, showing a significant improvement in clustering performance. The results indicate that self-supervised learning substantially improves the disentanglement of the representational space, while the multi-label soft silhouette coefficient further enhances this separation. The intuition behind the improved performance is that, as shown in Fig 7, the initial manifold z is too intricate for direct concept extraction. The modified space z_pre obtained through self-supervised learning is less entangled, and the final optimized space z_soft is even more disentangled, making it more amenable to clustering.
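The discrete Silhouette Coefficient used for this comparison can be computed with scikit-learn. The sketch below uses synthetic stand-ins for an entangled and a refined embedding space; the data and resulting scores are illustrative only, not the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Hypothetical embeddings: an unstructured (entangled) space vs. a refined
# space with three well-separated modes along every coordinate.
entangled = rng.normal(size=(300, 8))
refined = np.vstack([rng.normal(c, 0.3, size=(100, 8)) for c in (-2.0, 0.0, 2.0)])

scores = {}
for name, Z in [("entangled", entangled), ("refined", refined)]:
    clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
    scores[name] = silhouette_score(Z, clusters)  # discrete Silhouette Coefficient
```

A near-zero score on the entangled space and a high score on the refined one mirrors the qualitative pattern reported in Table 1.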
(c) Initial clustering on the self-supervised refined embedding vectors z_pre. (d) Multi-label clustering on the final embedding vectors z_soft. The same color represents samples within the same cluster, and the black markers represent the centroids of individual clusters. The final result showed redundancy in rainfall concepts; we therefore tested for statistically significant clusters, removed clusters 5, 14, 21, 23, and 28, and merged clusters 1 and 15. Detailed experimental settings are provided in Section C.2 in S1 Appendix for replicability.
To evaluate the effectiveness of the proposed feature space, we conduct a nearest neighbor analysis using the Euclidean distance in z and z_soft, comparing the labels of the top three nearest neighbors, as shown in Fig 8. The results indicate that z_soft reflects more meaningful relationships, with the conceptual distance depending not only on shape- or intensity-oriented information but also on high-level semantic mechanisms such as stationary-front, lake-effect-snowfall, and east-coast-rainfall. This observation suggests that our probabilistic multi-label approach effectively aligns the feature representation with domain knowledge, allowing the model to discern subtle similarities in weather mechanisms beyond simple visual patterns. Given the meaningful conceptual disentanglement in z_soft, we use the clusters constructed in this space, which represents the vector space in step (4) of Fig 4, as pseudo labels in subsequent procedures.
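The nearest neighbor retrieval behind this analysis is straightforward; a minimal sketch (the function name and k are ours) is:

```python
import numpy as np

def top_k_neighbors(Z, query_idx, k=3):
    """Return the indices of the k nearest samples (Euclidean distance)
    to the query embedding, excluding the query itself."""
    d = np.linalg.norm(Z - Z[query_idx], axis=1)
    d[query_idx] = np.inf            # exclude the trivial self-match
    return np.argsort(d)[:k]
```

Running this in both z and z_soft and comparing the concept labels of the retrieved neighbors reproduces the comparison shown in Fig 8.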
Case study of polar low vs. typhoon
Polar lows and typhoons are precipitation mechanisms with similar cyclonic patterns and intensities. However, they are distinct weather systems in that the former occurs primarily in winter. A good weather concept explanation should be able to distinguish between the two systems despite their visual similarities. We examine whether the CAVs from our framework can capture this difference by performing a case study measuring the predictive probability of the typhoon prober (the linear classifier of Concept 19) for several polar low and typhoon cases. We use instances over the East Sea of Korea from May 20–21, 2021 for polar low cases, and Typhoon Mitag and Typhoon Soulik for typhoon cases. We use Mitag and Soulik as their radar patterns resemble polar lows during the dissipation phase of their life cycle.
As shown in Fig 9, the typhoon prober assigns high probabilities to typhoon cases but low probabilities to polar low cases, demonstrating the model’s ability to distinguish between cyclonic patterns in winter (polar low) and summer (typhoon) rather than merely detecting the rotational shapes. This result illustrates the effectiveness of CAVs in differentiating mechanically distinct but visually similar phenomena. Detailed experimental settings and additional results for the North Pacific High edge versus frontal system comparison are provided in Section C.1 of S1 Appendix.
The score refers to the probabilistic score from probe 19. The middle image is from the Advanced Very High Resolution Radiometer (AVHRR) CH 01, observed by the Meteorological Operational satellite (METOP-1) on 2021-05-21 at 01:58 (UTC).
Allocation of meaning of concept vectors
We heuristically set the number of clusters to 30 to slightly exceed the number of annotated labels in the benchmark dataset (16) [11], allowing for redundancy in identifying concepts. Following previous research [14], we merge statistically insignificant cluster pairs based on a cluster-pair t-test (p < 0.01) and remove clusters with fewer than 50 samples; this post-processing results in 24 final clusters. Example instances from different clusters are reported in Figs 5 and 6. The annotated concept label descriptions are also provided in S1 Appendix. The results indicate an alignment between the attributes of the concepts and human perception.
Survey with domain experts
To evaluate the degree of alignment of the extracted concepts with meteorological domain knowledge, we conduct a survey with domain experts using structured questionnaires and interviews. The questions involve identifying samples extracted from a homogeneous concept cluster among a set of random samples. To address cognitive biases in user surveys, where intuitive differences in shape and intensity tend to dominate over analytical reasoning about the underlying mechanisms [27,28], we design the questionnaires to be contrastive, comparing two label types in the same format. The labels consist of (1) human-annotated concept labels from open-source data [11] and (2) pseudo-labels obtained from the concept extraction method. We ask five questions for each category, presented in random order, and then conduct a comparative analysis of the results. The questionnaire is administered online and includes animations to visualize the temporal progression of rainfall mechanisms. The user survey was conducted on June 24, 2024, with five Korea Meteorological Administration (KMA) forecasters. The left side of Fig 10 illustrates example questions from the questionnaire. According to Fig 10, the forecasters achieved an average accuracy of 80% for the annotated labels and 92% for the model-extracted concepts, indicating that humans recognize the extracted concepts at a level comparable to human-annotated ones. Additional participant interviews and feedback are provided in S1 Appendix.
Conclusion
This study proposes an unsupervised example-based concept explanation framework for a given precipitation forecast model based on high-resolution radar data, contributing to the understanding of precipitation patterns and meteorological processes such as convectional, frontal, orographic, and cyclonic precipitation. The framework provides a probabilistic representation of simultaneously co-occurring meteorological mechanisms and helps address the common challenges of manual concept annotation in the weather domain. We perform extensive analyses of the proposed algorithm, measuring clustering performance and the alignment between extracted concepts and weather domain knowledge both quantitatively and qualitatively. Our experiments show that the framework can identify the key precipitation mechanisms captured by the given rainfall forecast model and distinguish between visually similar yet mechanistically distinct systems such as polar lows and typhoons. These results suggest that the extracted concepts encapsulate not only the visual similarity of precipitation systems but also higher-level semantic information.
There are several future directions for this research. First, given the model-agnostic nature of the proposed framework, it may be extended to other state-of-the-art weather forecast models, whose feature spaces embed richer rainfall patterns and may therefore reveal more complex concepts. Second, the method may be applied to multivariable models that take additional inputs such as temperature, pressure, wind direction, and wind speed, potentially discovering more complex and diverse weather mechanisms than those extractable from radar data alone.
Supporting information
S1 Appendix. Detailed research method and additional experimental results.
This appendix provides the model and data description; the implementation details of instance segmentation and concept vector extraction; additional experimental results on the concept prober, manifold analysis, and forecasters' interviews; and more examples of clustering outputs. Samples of the annotated concept-label dataset [11] are also showcased.
https://doi.org/10.1371/journal.pclm.0000633.s001
(PDF)
Acknowledgments
We sincerely thank Dr. Hyesook Lee, Sunyoung Kim, and Junsang Park of the AI Meteorological Research Division at the Korean National Institute of Meteorological Sciences (NIMS) for their valuable advice on practical use throughout this study. We also thank the Forecast Research Department of NIMS for providing weather forecast posthoc analysis data that facilitated our research, and the forecasters of the Daejeon Regional Office of Meteorology, Korea Meteorological Administration (KMA), for participating in the user survey.
References
- 1. Kalnay E. Atmospheric modeling, data assimilation and predictability. Cambridge University Press; 2003.
- 2. Pu Z, Kalnay E. Numerical weather prediction basics: models, numerical methods, and data assimilation. Handbook of hydrometeorological ensemble forecasting. Springer Berlin Heidelberg; 2018; p. 1–31.
- 3. Laloyaux P, Kurth T, Dueben PD, Hall D. Deep learning to estimate model biases in an operational NWP assimilation system. J Adv Model Earth Syst. 2022;14(6).
- 4. Frnda J, Durica M, Rozhon J, Vojtekova M, Nedoma J, Martinek R. ECMWF short-term prediction accuracy improvement by deep learning. Sci Rep. 2022;12(1):7898. pmid:35551266
- 5. Lam R, Sanchez-Gonzalez A, Willson M, Wirnsberger P, Fortunato M, Alet F, et al. Learning skillful medium-range global weather forecasting. Science. 2023;382(6677):1416–21. pmid:37962497
- 6. Bi K, Xie L, Zhang H, Chen X, Gu X, Tian Q. Accurate medium-range global weather forecasting with 3D neural networks. Nature. 2023;619(7970):533–8. pmid:37407823
- 7. Thuemmel J, Karlbauer M, Otte S, Zarfl C, Martius G, Ludwig N. Inductive biases in deep learning models for weather prediction. arXiv preprint arXiv:230404664. 2023.
- 8. Willett DS, Brannock J, Dissen J, Keown P, Szura K, Brown OB, et al. NOAA open data dissemination: Petabyte-scale Earth system data in the cloud. Sci Adv. 2023;9(38):eadh0032. pmid:37729405
- 9. Korea Meteorological Administration. Haneulsarang. 2022. Available from: https://www.kma.go.kr/download_01/kma_202002
- 10. Longo L, Brcic M, Cabitza F, Choi J, Confalonieri R, Del Ser J, et al. Explainable Artificial Intelligence (XAI) 2.0: a manifesto of open challenges and interdisciplinary research directions. Information Fusion. 2024;106:102301.
- 11. Kim S, Choi J, Lee S, Choi J. Example-based concept analysis framework for deep weather forecast models. 2024. Available from: https://figshare.com/articles/dataset/Example-Based_Concept_Analysis_Framework_for_Deep_Weather_Forecast_Models/27993743
- 12. Molnar C. Interpretable machine learning. Lulu.com; 2020.
- 13. Genone J, Lombrozo T. Concept possession, experimental semantics, and hybrid theories of reference. Philosophical Psychol. 2012;25(5):717–42.
- 14. Ghorbani A, Wexler J, Zou JY, Kim B. Towards automatic concept-based explanations. Adv Neural Inf Process Syst. 2019;32.
- 15. Schut L, Tomasev N, McGrath T, Hassabis D, Paquet U, Kim B. Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero. arXiv preprint arXiv:231016410. 2023.
- 16. Kim B, Wattenberg M, Gilmer J, Cai C, Wexler J, Viegas F, et al. Interpretability beyond feature attribution: quantitative testing with concept activation vectors (TCAV). In: International Conference on Machine Learning. PMLR; 2018. p. 2668–2677.
- 17. Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis. Wiley Series in Probability and Statistics. Wiley; 2009. Available from: https://books.google.co.kr/books?id=YeFQHiikNo0C
- 18. Rousseeuw PJ. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987;20:53–65.
- 19. Poché A, Hervier L, Bakkay MC. Natural example-based explainability: a survey. In: World Conference on eXplainable Artificial Intelligence. Springer; 2023. p. 24–47.
- 20. Hanawa K, Yokoi S, Hara S, Inui K. Evaluation of similarity-based explanations. arXiv preprint arXiv:200604528. 2020.
- 21. Gatys L, Ecker AS, Bethge M. Texture synthesis using convolutional neural networks. Adv Neural Inf Process Syst. 2015;28.
- 22. Johnson J, Alahi A, Fei-Fei L. Perceptual losses for real-time style transfer and super-resolution. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14. Springer; 2016. p. 694–711.
- 23. Zhang R, Isola P, Efros AA, Shechtman E, Wang O. The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. p. 586–595.
- 24. Esser P, Rombach R, Ommer B. A disentangling invertible interpretation network for explaining latent representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020. p. 9223–9232.
- 25. Caron M, Bojanowski P, Joulin A, Douze M. Deep clustering for unsupervised learning of visual features. In: Proceedings of the European Conference on Computer Vision (ECCV); 2018. p. 132–149.
- 26. Ji X, Henriques JF, Vedaldi A. Invariant information clustering for unsupervised image classification and segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2019. p. 9865–9874.
- 27. Liao QV, Varshney KR. Human-centered explainable AI (XAI): from algorithms to user experiences. arXiv preprint arXiv:211010790. 2021.
- 28. Kahneman D. Thinking, fast and slow. Macmillan; 2011.
- 29. Park C, Son S-W, Kim J, Chang E-C, Kim J-H, Jo E, et al. Diverse synoptic weather patterns of warm-season heavy rainfall events in South Korea. Monthly Weather Rev. 2021;149(11):3875–93.
- 30. Jo E, Park C, Son S-W, Roh J-W, Lee G-W, Lee Y-H. Classification of localized heavy rainfall events in South Korea. Asia-Pacific J Atmos Sci. 2019;56(1):77–88.
- 31. Sprague C, Wendoloski EB, Guch I. Interpretable AI for deep learning-based meteorological applications. 2019.
- 32. Sun Y, Cheng C, Zhang Y, Zhang C, Zheng L, Wang Z, et al. Circle loss: A unified perspective of pair similarity optimization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020. p. 6398–6407.
- 33. Caron M, Misra I, Mairal J, Goyal P, Bojanowski P, Joulin A. Unsupervised learning of visual features by contrasting cluster assignments. Adv Neural Inf Process Syst. 2020;33:9912–24.
- 34. Vardakas G, Papakostas I, Likas A. Deep Clustering Using the Soft Silhouette Score: Towards Compact and Well-Separated Clusters. arXiv preprint arXiv:240200608. 2024.
- 35. Zhang M-L, Li Y-K, Liu X-Y, Geng X. Binary relevance for multi-label learning: an overview. Front Comput Sci. 2018;12(2):191–202.
- 36. Wu X, Zhang S, Zhou Q, Yang Z, Zhao C, Latecki LJ. Entropy minimization versus diversity maximization for domain adaptation. IEEE Trans Neural Netw Learn Syst. 2023;34(6):2896–907. pmid:34520373
- 37. Ko J, Lee K, Hwang H, Oh S-G, Son S-W, Shin K. Effective training strategies for deep-learning-based precipitation nowcasting and estimation. Comput Geosci. 2022;161:105072.
- 38. Atrey PK, Hossain MA, El Saddik A, Kankanhalli MS. Multimodal fusion for multimedia analysis: a survey. Multimedia Syst. 2010;16(6):345–79.
- 39. Beucher S. Use of watersheds in contour detection. In: Proc. Int. Workshop on Image Processing, Sept. 1979; 1979. p. 17–21.
- 40. Rodwell MJ, Richardson DS, Hewson TD, Haiden T. A new equitable score suitable for verifying precipitation in numerical weather prediction. Quart J Royal Meteoro Soc. 2010;136(650):1344–63.
- 41. Han S, Park S, Park S, Kim S, Cha M. Mitigating embedding and class assignment mismatch in unsupervised image classification. In: European Conference on Computer Vision. Springer; 2020. p. 768–784.
- 42. Guo C, Pleiss G, Sun Y, Weinberger KQ. On calibration of modern neural networks. In: International Conference on Machine Learning. PMLR; 2017. p. 1321–1330.
- 43. He K, Chen X, Xie S, Li Y, Dollár P, Girshick R. Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022. p. 16000–16009.
- 44. Platt J, et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv Large Margin Classifiers. 1999;10(3):61–74.
- 45. Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Machine Learning Res. 2008;9(11).
- 46. Kim S, Choi J, Lee S, Choi J. Example-Based concept analysis framework for deep weather forecast models. Artif Intell Earth Syst. 2025;4(3).
- 47. Kim S, Choi J, Lee S, Choi J. Unsupervised concept discovery for deep weather forecast models with high-resolution radar data. Figshare. 2025.