Abstract
Accurate and efficient pavement condition assessment is essential for maintaining roadway safety and optimizing maintenance investments. However, conventional assessment methods such as manual visual inspections and specialized sensing equipment are often time-consuming, expensive, and difficult to scale across large networks. Recent advancements in generative artificial intelligence (GAI) have introduced new opportunities for automating visual interpretation tasks using street-level imagery. This study evaluates the performance of seven multimodal large language models (MLLMs) for road surface condition assessment, including three proprietary models (Gemini 2.5 Pro, OpenAI o1, and GPT-4o) and four open-source models (Gemma 3, Llama 3.2, LLaVA v1.6 Mistral, and LLaVA v1.6 Vicuna). The models were tested across four task categories relevant to pavement management: distress and feature identification, spatial pattern recognition, severity evaluation, and maintenance interval estimation. Model performance was assessed across five dimensions: response rate, response correctness, consistency, multimodal errors, and overall computational intensity and cost. Results indicate that MLLMs can interpret street-level imagery and generate task-relevant outputs in a cost-effective manner. Among the evaluated models, we recommend GPT-4o as the preferred option, as it balances responsiveness, accuracy, and computational cost.
Citation: Xu C, Shu L, Dao A, Cui Y (2026) Multimodal generative AI for automated pavement condition assessment: Benchmarking model performance. PLoS One 21(2): e0340380. https://doi.org/10.1371/journal.pone.0340380
Editor: Junghwan Kim, Virginia Polytechnic Institute and State University, UNITED STATES OF AMERICA
Received: August 28, 2025; Accepted: December 18, 2025; Published: February 12, 2026
Copyright: © 2026 Xu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: We will share the data in a public GitHub repository as soon as this paper is accepted.
Funding: The project is funded by the Great Lakes–Northern Forest (GLNF) Cooperative Ecosystem Studies Unit (CESU), award number W912HZ-19-SOI-0002. The funder had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The content of this publication does not necessarily reflect the views or policies of the funder.
Competing interests: No authors have competing interests.
Introduction
Road condition assessment is a key component of transportation infrastructure management. It provides information necessary to evaluate the current state of roadways, schedule maintenance activities, and allocate resources efficiently. Accurate assessments can contribute to improved safety, reduced travel disruption, and lower vehicle operating costs [1,2]. Over time, road condition monitoring also supports infrastructure asset management by informing long-term planning, helping prioritize maintenance needs, and reducing overall life-cycle costs [3–5]. Beyond structural and safety benefits, effective pavement management also reduces fuel consumption, tire wear, and mechanical damage caused by surface irregularities, offering both economic and environmental advantages [6].
Road condition assessment methods range from structural evaluations that measure subsurface integrity using techniques such as falling weight deflectometer testing [7,8] and ground-penetrating radar [9,10], to surface-level inspections, which analyze visible pavement conditions such as cracking, rutting, raveling, and potholes [11–13]. This study focuses on road surface assessment, which evaluates road condition based on the physical characteristics of the pavement surface.
Conventional road surface assessments have traditionally relied on manual inspection techniques, which are often time-consuming, labor-intensive, and subject to observer variability. In response, recent developments in computer vision and artificial intelligence have enabled the adoption of automated, image-based methods [14–16]. Deep learning [14,17] and other machine learning techniques [18,19] are widely applied in pavement condition assessment to enhance accuracy, consistency, and operational efficiency [18,20,21]. Developing these models requires large volumes of high-quality, accurately labeled training data [22,23]. Acquiring such data is often challenging and time-consuming due to the need for extensive manual annotation [24]. In addition, most models are designed for a single, task-specific function [25]. Since road surface assessment involves multiple stages, including defect detection, severity evaluation, and maintenance prioritization, separate models are typically required to address each component of the workflow.
More recently, advances in generative artificial intelligence (GAI), particularly multimodal models, have gained attention for their potential to move beyond traditional support functions and autonomously generate task-relevant outputs with minimal contextual input. Unlike conventional machine learning models that require extensive training datasets and pre-defined rules, GAI models can respond adaptively to a range of inputs and generate task-relevant outputs from limited contextual information [26]. Leveraging these capabilities, recent studies have begun to apply multimodal large language models (MLLMs) in urban research, particularly using street-view imagery to analyze built environments, urban form, and neighborhood conditions [27–29].
However, the application of MLLMs to surface-level road condition assessment using street-view imagery remains largely unexplored. Pavement evaluation typically involves detecting surface distress, classifying severity, and recommending maintenance actions, which have traditionally required multiple specialized models. With their multimodal understanding and generative capabilities, MLLMs have the potential to integrate these tasks within a single framework and streamline the assessment process. Before MLLMs can be reliably adopted for pavement evaluation using street-view images, it is necessary to benchmark their current capabilities, assess their performance across key road condition assessment tasks, and determine the degree of fine-tuning and adaptation required for dependable application in pavement condition assessment. Beyond establishing baseline benchmarks, this study expands prior work by comparing the performance, reliability, and computational efficiency of multiple MLLMs. Although earlier studies have explored the use of individual MLLMs with street-view imagery, they primarily focused on demonstrating the potential of a single model. To address this gap, the present study evaluates the capability of multiple MLLMs for surface-level road condition assessment and examines performance variations across key tasks. Both proprietary and open-source models are included to capture a comprehensive view of current technological capacity, as they differ in architecture, training data, and accessibility. Evaluating both types is important because it enables an assessment of whether open-source alternatives can achieve performance comparable to commercial systems, which has implications for cost-effectiveness and the feasibility of large-scale adoption by public agencies and researchers. The analysis also accounts for computational cost, providing insight into the trade-offs between performance and efficiency.
Building on this framework, this study examines the potential of MLLMs for assessing road surface conditions based on the physical state of pavement. The evaluation focuses on four key tasks relevant to pavement management workflows: (1) surface distress and feature identification (e.g., cracks, potholes, and patches), (2) spatial pattern recognition (e.g., whether defects are isolated, uniformly distributed, or concentrated along edges or drainage areas), (3) severity evaluation (estimating the extent or seriousness of observed defects), and (4) maintenance interval estimation (inferring the timing or urgency of potential maintenance needs). These tasks were selected for their direct relevance to operational decision-making within pavement condition assessment frameworks.
Surface distress identification facilitates the initial detection and classification of visible pavement defects [30]. Spatial pattern recognition supports the identification of localized deterioration, helping to inform targeted maintenance interventions and resource allocation [31]. Severity evaluation translates observed pavement defects into standardized condition ratings that guide maintenance prioritization and investment decisions [32,33]. Finally, maintenance interval estimation involves predicting the timing of future repairs based on the current condition of the pavement, supporting long-term maintenance planning [31].
Model performance is assessed across five performance dimensions: response rate, response correctness, consistency, multimodal errors, and overall computational intensity and cost.
Literature review
Traditionally, road surface condition assessments have relied on manual inspections, structured surveys, and standardized scoring systems based on the physical condition of the pavement [34,35]. While these methods have served as the foundation for pavement and road management, they are labor-intensive, time-consuming, and susceptible to human error [36]. In response to these limitations, scholars and industry practitioners have increasingly advocated for the adoption of automated, sensor-based approaches. These methods improve the objectivity of assessments, reduce reliance on periodic manual inspections, and enable continuous, real-time monitoring of road conditions [35,37,38]. Technologies such as accelerometers, GPS, cameras, and LiDAR are frequently used to capture high-resolution images of surface defects [39–41].
Building on these technological advances, machine learning [18,42] and deep learning techniques [19,43–45] have been widely applied to road surface image analysis. These computational approaches automate condition assessment, reducing the time and cost associated with traditional manual inspections [17,20]. Beyond improving efficiency, they enhance diagnostic accuracy, streamline analytical workflows, and contribute to more effective road infrastructure management [46].
However, these models are not fully autonomous. They typically integrate human expertise into the decision-making process [47,48] and are unable to make independent judgments about condition severity or maintenance needs. Their effectiveness also depends heavily on large, high-quality annotated datasets, which are often computationally intensive to process [49,50] and may be limited or inconsistently available across different contexts [51].
Recent advancements in GAI have opened new avenues for automating assessment processes across various domains [52,53]. Although MLLMs have not yet reached the point of full automation, they demonstrate versatility across tasks [54]. Traditional machine learning approaches for surface-level road condition assessment are typically designed and trained for a single, well-defined objective such as defect detection [55–57], defect classification [58–60], or surface segmentation [61,62]. In contrast, MLLMs can address multiple tasks within a unified framework by combining visual perception with linguistic reasoning [54]. In addition, when baseline performance is insufficient, these models can be fine-tuned to improve task-specific accuracy [63].
MLLMs appear to have technical capabilities that may support road surface condition assessment using street-level images. These models integrate computer vision functions [64] that allow them to detect, classify, and quantify road defects such as cracks, potholes, and other surface irregularities. In addition to visual processing, they are designed to generate descriptive textual outputs from image content [65], which may allow for context-aware evaluations that combine standardized scoring frameworks with natural language reporting. Furthermore, MLLMs may have the potential to interpret road conditions autonomously and produce preliminary maintenance recommendations, thereby reducing the need for expert intervention. In fact, scholars have utilized street-view images to evaluate built environments [29], streetscape quality [66], walkability [28,67], and neighborhood livability [66], demonstrating that MLLMs can interpret complex visual cues associated with urban design and infrastructure. However, these capabilities have not yet been empirically validated within the domain of pavement assessment.
To address this gap, this study evaluates the performance of MLLMs across four tasks relevant to pavement condition assessment. The selection of tasks reflects common practices in surface road condition assessment: (1) surface distress and feature identification [21,34,68,69] (2) spatial pattern recognition (i.e., crack positioning) [70,71] (3) severity evaluation [32,72,73], and (4) maintenance interval estimation [32,74].
Among the selected models, three are proprietary. Gemini 2.5 Pro, developed by Google DeepMind and released in March 2025, incorporates advanced multimodal reasoning capabilities [75]. OpenAI o1, released in December 2024, is a reasoning-focused model that applies extended step-by-step inference to vision-language tasks [76], whereas GPT-4o, OpenAI’s flagship multimodal model released in May 2024, offers enhanced capabilities in visual perception, language understanding, and rapid multimodal generation [77]. The remaining models are open source. Gemma 3, released by Google DeepMind in March 2025, is a publicly accessible and compact model tailored for ease of fine-tuning and adaptation in academic and research settings [78]. Llama 3.2, developed by Meta and released in September 2024, extends the Llama series with improved contextual reasoning and instruction-following [79]. LLaVA v1.6 Mistral, developed by researchers at UW-Madison and Microsoft and released in January 2024, integrates the Mistral backbone for visual-language alignment in multimodal reasoning [80]. LLaVA v1.6 Vicuna, from the same release, builds on the Vicuna model and focuses on instruction-following in visual question answering [81].
Methods
Data
This study used a multimodal dataset comprising street-level images and textual prompts as model inputs, along with benchmark references for performance evaluation.
Multimodal input.
The input to the MLLMs consisted of street-level road images and textual prompts designed to reflect real-world road condition assessment tasks. The images were sourced from Google Street View and selected based on Pavement Condition Index (PCI) records, as detailed in the section on benchmark data. To ensure temporal alignment between pavement condition data and Street View imagery, only images captured within one year of a PCI assessment were included. PCI records prior to 2014 were excluded due to the lack of reliably dated images. The dataset size is relatively small primarily because real-world PCI data are scarce and currently available only from the City of San Francisco. The available dataset contains PCI values and geographic coordinates but does not include street-level imagery, requiring the retrieval of corresponding Google Street View images for each road segment. Furthermore, many segments in San Francisco have perfect PCI scores (i.e., PCI = 100), resulting in limited representation of visible pavement defects. To mitigate potential sampling bias, a random subset of those records was selected to balance the distribution of pavement conditions. This process resulted in a dataset of 261 road segment images.
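For illustration, the selection logic above can be expressed as a short filtering routine. The sketch below is not the authors’ code: the file name, column names, and sample size are hypothetical stand-ins for a San Francisco PCI export joined with Street View capture dates.

```python
import pandas as pd

# Hypothetical export of San Francisco PCI records joined with Street View capture dates.
pci = pd.read_csv("sf_pci_records.csv",
                  parse_dates=["assessment_date", "streetview_date"])

# Exclude PCI records prior to 2014, for which reliably dated imagery is unavailable.
pci = pci[pci["assessment_date"].dt.year >= 2014]

# Require temporal alignment: the image must be captured within one year of the PCI assessment.
gap = (pci["streetview_date"] - pci["assessment_date"]).abs()
pci = pci[gap <= pd.Timedelta(days=365)]

# Down-sample the over-represented perfect-condition segments (PCI = 100) to balance
# the distribution of pavement conditions; the sample size here is purely illustrative.
perfect = pci[pci["pci"] == 100].sample(n=50, random_state=42)
defective = pci[pci["pci"] < 100]
dataset = pd.concat([defective, perfect]).reset_index(drop=True)
```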
Each image was paired with a prompt designed to instruct the model to perform a specific infrastructure-related task. A total of 39 unique prompts were developed, as detailed in S1 Table. These prompts request the identification of observable features (e.g., presence, type, and quantity of cracks or potholes), descriptions of spatial patterns (e.g., clustering near intersections or drainage), assessments of condition severity, and judgments on the urgency of potential maintenance actions. Prompts also ask for explanatory reasoning or justifications. The prompts were written in natural language and structured as multi-part instructions to encourage detailed responses.
Benchmark data.
To evaluate model outputs, this study used three types of benchmark data: PCI, estimated road maintenance intervals, and manually annotated labels for observable surface features and spatial patterns.
PCI records were obtained from the City of San Francisco’s open data portal in February 2025 [82]. The dataset includes street-level information such as PCI assessment dates, numerical scores ranging from 0 to 100, severity categories, road functional classifications, and geographic coordinates. These records were also used to guide image selection, as described in the Multimodal input section.
Road maintenance intervals were estimated by reviewing sequential Google Street View images for visible signs of surface repair. Segments showing repairs within one year of the initial image were labeled as short-term maintenance. Repairs that occurred between one and three years later were labeled as long-term. Segments with no visible repairs within three years were classified as having no maintenance. These labels provide a basis for evaluating the models’ ability to estimate repair urgency. While the use of Google Street View imagery may introduce some uncertainty in identifying the exact timing or type of repair, this approach remains a practical solution given the absence of datasets that contain both PCI records and detailed maintenance histories. To minimize potential bias, we aligned image selection with PCI assessment years and verified repairs at the same locations to ensure consistency.
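The labeling rule described above amounts to a simple decision function. The sketch below assumes a hypothetical years_to_repair value derived from comparing sequential Street View captures, with None denoting that no repair was visible within the three-year review window.

```python
def maintenance_label(years_to_repair):
    """Map the time until a visible repair to the maintenance-interval benchmark label."""
    if years_to_repair is None or years_to_repair > 3:
        return "no maintenance"     # no visible repair within three years
    if years_to_repair <= 1:
        return "short-term"         # repair observed within one year of the initial image
    return "long-term"              # repair observed between one and three years later
```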
In addition, manual annotations were conducted by members of the research team to support the evaluation of the models’ identification capabilities. Each image was independently reviewed by annotators, who answered a series of structured questions corresponding to the model prompts. A total of 29 questions were answered for each image. These questions focused on the presence and type of pavement defects, their spatial distribution, and relevant environmental or contextual conditions. Specifically, annotators assessed (1) surface features, such as whether cracks or potholes were present, and if so, the general type of cracking (transverse, longitudinal, or alligator); (2) spatial distribution patterns, indicating whether defects appeared isolated, spread across the pavement surface, or concentrated along joints, edges, or drainage areas; and (3) environmental and contextual conditions, including visible intersections, drainage features, patched or sealed areas, signs of resurfacing, and evidence of poor drainage or standing water that could exacerbate pavement deterioration. In cases where annotators disagreed, the discrepancies were discussed collectively until a consensus was reached. These annotations serve as benchmarks for evaluating the accuracy of model responses to prompts related to identification tasks.
Assessment.
This study evaluated the performance of seven MLLMs: Gemini 2.5 Pro, OpenAI o1, GPT-4o, Gemma 3, Llama 3.2, LLaVA v1.6 Mistral, and LLaVA v1.6 Vicuna. Fig 1 shows the general workflow of the MLLM evaluation process. These models were tested across five performance dimensions to assess their ability to complete four categories of road condition assessment tasks of varying complexity. Each image was prompted once, as the 39 prompts were designed primarily as closed-ended questions with well-defined terms and clearly worded instructions (e.g., yes/no, multiple choice, or numerical rating), which limit open-ended variability in model outputs. The output from each model consisted of text-based responses generated in direct answer to the prompt questions, which were then compared with human-annotated responses to evaluate correctness, consistency, and reasoning quality. All models were configured with a temperature of 0.3 and a top-p of 1.0, while other parameters were kept at their default settings to ensure consistency across platforms. All evaluations were conducted on a high-performance computing cluster equipped with NVIDIA V100 GPUs (32 GB memory) and Intel Xeon Gold 6148 CPUs (2.40 GHz).
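As an illustration of the evaluation setup, the sketch below shows how one of the proprietary models might be queried with the generation settings reported above (temperature = 0.3, top-p = 1.0) using the OpenAI Python SDK. It is not the authors’ pipeline; the model name, prompt handling, and image encoding are assumptions for demonstration only.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_model(image_path: str, prompt: str, model: str = "gpt-4o") -> str:
    """Send one street-level image and one prompt to a vision-capable chat model."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model,
        temperature=0.3,   # generation settings kept consistent across platforms
        top_p=1.0,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```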
Task design.
Four task categories, which represent a progression from low-level visual recognition to higher-level condition diagnosis and intervention planning in surface road assessment, were designed to evaluate model capabilities: (1) Pavement surface distress and feature identification, which addresses basic defect detection critical for maintenance prioritization; (2) Spatial pattern recognition, which evaluates whether deterioration is localized or widespread and thus supports infrastructure-level planning; (3) Pavement condition description and severity evaluation, which provides higher-level diagnostic insights into pavement health and deterioration risks; and (4) Maintenance interval estimation, which directly informs scheduling of interventions and long-term resource planning. Table 1 summarizes these four task categories, detailing their associated task descriptions and assessment focus.
Assessment dimension.
This study assessed MLLMs’ performance across five dimensions, each corresponding to an aspect of model performance: response rate, response correctness, consistency, multimodal errors, and overall computational intensity and cost.
Response rate was used to assess model responsiveness. It was defined as the proportion of prompts for which the model generated a valid and interpretable output. A higher response rate indicates that the model can engage with a wide range of task instructions, while a lower rate suggests difficulty in interpreting or addressing certain prompt types. In this study, the response rate was calculated programmatically as the percentage of images that yielded a valid response for each question. Response rates were first computed for each of the 39 prompts for each model and then aggregated to represent overall responsiveness at the model level.
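A minimal sketch of this computation is shown below, assuming each model’s outputs are stored per prompt as one raw string per image, with None or an empty string marking responses that could not be interpreted as valid.

```python
def prompt_response_rate(outputs):
    """Percentage of images with a valid (non-empty) response for a single prompt."""
    valid = [o for o in outputs if o is not None and o.strip()]
    return 100.0 * len(valid) / len(outputs)

def model_response_rate(responses_by_prompt):
    """Aggregate per-prompt rates (39 prompts) into a model-level response rate."""
    rates = [prompt_response_rate(outs) for outs in responses_by_prompt.values()]
    return sum(rates) / len(rates)
```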
Response correctness was used to evaluate the factual accuracy of model outputs by comparing them against established benchmark references, including PCI scores, maintenance interval estimates, and manually annotated surface features. Standard classification metrics were calculated: accuracy, precision, recall, and F1 score. These metrics quantify the degree to which model predictions align with ground-truth annotations and serve as objective measures of task performance.
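These metrics could be computed as sketched below with scikit-learn, assuming model outputs have already been parsed into discrete labels aligned with the benchmark annotations; the macro averaging shown here is an assumption rather than a setting documented in the paper.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def correctness_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for one task, given benchmark and predicted labels."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall":    recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1":        f1_score(y_true, y_pred, average="macro", zero_division=0),
    }
```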
Consistency serves as a measure of inter-model agreement. It is computed using Cohen’s Kappa, which evaluates the level of agreement between different models when responding to the same prompt, while adjusting for chance-level concordance. All model responses, including uncertain or incomplete ones such as “Not sure”, are included in the calculation to reflect real-world deployment conditions.
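A pairwise agreement matrix of this kind (cf. Fig 4) can be assembled as sketched below, assuming each model’s categorical responses, including “Not sure”, are aligned across the same prompts and images.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def agreement_matrix(answers):
    """Cohen's Kappa for every model pair; `answers[model]` is an aligned list of responses."""
    return {
        (m1, m2): cohen_kappa_score(answers[m1], answers[m2])
        for m1, m2 in combinations(answers, 2)
    }
```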
Multimodal errors were evaluated to identify failures in the model’s ability to process, integrate, or reason across visual and textual information, specifically within the context of street-view imagery. This aspect of the analysis focused on the model’s reasoning and interpretive behavior, aiming to explore why incorrect or fabricated content was generated. Errors were examined qualitatively to identify recurring patterns and potential underlying causes. As part of this evaluation, hallucinations were included as a specific subtype of multimodal error. In this study, hallucinations were defined as responses that introduced fabricated, unsupported, or irrelevant content not grounded in the input image or prompt. A response was classified as a hallucination when the described feature or condition could not be verified in the corresponding image.
Overall computational intensity and cost are used to evaluate efficiency. These are quantified by recording each model’s token usage and response generation time. Statistical significance in model differences is tested using the Kruskal-Wallis test, followed by pairwise Wilcoxon rank-sum tests with Bonferroni correction. To complement these tests, effect sizes were calculated to assess the magnitude and practical significance of the observed differences. For pairwise comparisons, rank-biserial correlation (r) was used, and for overall group differences, eta squared (η²) was computed based on the Kruskal–Wallis statistic. For proprietary models, token usage is further translated into estimated monetary cost based on publicly available pricing, enabling comparison of cost-efficiency.
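The sketch below outlines these tests for token usage, assuming per-image token counts are stored per model. Eta squared is derived from the Kruskal–Wallis H statistic as η² = (H − k + 1)/(n − k), and rank-biserial correlation from the Mann–Whitney U statistic as r = 1 − 2U/(n₁n₂).

```python
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

def compare_token_usage(usage):
    """Kruskal-Wallis test with eta squared, plus Bonferroni-corrected pairwise rank-sum tests."""
    groups = list(usage.values())
    H, p = kruskal(*groups)
    n, k = sum(len(g) for g in groups), len(groups)
    eta_sq = (H - k + 1) / (n - k)

    pairs = list(combinations(usage, 2))
    pairwise = []
    for m1, m2 in pairs:
        U, p_pair = mannwhitneyu(usage[m1], usage[m2], alternative="two-sided")
        r = 1 - 2 * U / (len(usage[m1]) * len(usage[m2]))   # rank-biserial correlation
        p_adj = min(p_pair * len(pairs), 1.0)                # Bonferroni correction
        pairwise.append((m1, m2, p_adj, r))
    return H, p, eta_sq, pairwise
```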
Results
Response rate
The evaluated MLLMs demonstrated the ability to produce outputs in response to domain-specific prompts related to road surface condition assessment. As shown in Fig 2, the overall average response rate, calculated by aggregating model outputs across four task categories, was 63.65%, indicating a moderate capacity to produce interpretable outputs. Both the highest and lowest response rates were observed among the proprietary models. Gemini 2.5 Pro achieved the highest rate at 76.01%, demonstrating strong responsiveness to task prompts. In contrast, OpenAI o1 and GPT-4o recorded the lowest rates, at 51.24% and 51.67%, respectively. Among the open-source models, LLaVA v1.6 Mistral, Gemma 3, and LLaVA v1.6 Vicuna achieved relatively high response rates of 70.85%, 69.44%, and 66.58%, respectively. Llama 3.2 had the lowest response rate within this group, at 59.73%, slightly below the overall average.
Table 2 reports model response rates by the four task categories. Overall, MLLMs responded well to prompts focused on pavement surface distress and feature identification (71.82%), pavement condition description and severity evaluation (77.86%), and maintenance interval estimation (94.86%). However, the average response rate dropped significantly for spatial pattern recognition tasks (47.06%). These findings suggest that current MLLMs are generally capable of responding to tasks involving object identification and text-based reasoning, such as describing pavement distress and estimating maintenance needs. However, they struggle with tasks that require interpreting visual information related to spatial distribution patterns.
Pavement surface distress and feature identification exhibited varied response rates across models. Gemini 2.5 Pro (88.30%), LLaVA v1.6 Mistral (81.52%), and Gemma 3 (78.72%) showed higher overall response rates for pavement surface distress and feature identification, whereas OpenAI o1 (52.05%) and GPT-4o (53.49%) performed lower. Reviewing the subtask-level results in Table 3, most models responded well to general questions, such as identifying whether cracks are present, with response rates close to or at 100% (e.g., 99.62% for OpenAI o1 and 100.00% for GPT-4o). However, their responsiveness dropped markedly for fine-grained classification tasks. For example, in the subtask of identifying transverse cracks, the response rates of OpenAI o1 and GPT-4o dropped to 42.91% and 44.06%, respectively. These findings suggest that the lower overall responsiveness of models like OpenAI o1 and GPT-4o is largely due to their limited performance on fine-grained tasks involving specific distress types.
Spatial pattern recognition had the lowest response rates among all tasks, with GPT-4o performing the weakest at just 28.98% and Gemini 2.5 Pro achieving the highest rate among the proprietary models at 57.14%. These results indicate that most models struggle with solving tasks related to the spatial distribution, patterns, or layout of road damage. Open-source models generally outperformed proprietary models in spatial pattern recognition. LLaVA v1.6 Mistral (59.24%), LLaVA v1.6 Vicuna (58.48%), and Gemma 3 (48.61%) achieved higher response rates than GPT-4o (28.98%) and OpenAI o1 (29.58%), suggesting that open-source architectures may be better suited for spatial reasoning tasks. Nevertheless, all models showed markedly lower response rates in spatial pattern recognition than in other tasks, underscoring the persistent difficulty of capturing spatial context and relational features in road surface imagery.

Pavement condition description and severity evaluation presented strong response rates across models, with an overall average of 77.86%. Most models were capable of generating concise textual descriptions of the road environment and estimating damage severity. Gemini 2.5 Pro (89.87%) and Gemma 3 (89.66%) achieved the highest response rates in this task category, while Llama 3.2 recorded the lowest at 62.50%.
Maintenance interval estimation had the highest response rates among all task categories, with an overall average of 94.86%. Gemma 3 reached 100%, followed by GPT-4o (98.47%), OpenAI o1 (96.93%), and Gemini 2.5 Pro (93.49%). The remaining models also performed well, each with response rates above 85%. These results indicate that most MLLMs are able to generate outputs when prompted to estimate appropriate maintenance timing based on road condition inputs.
Response correctness
Response correctness varied across models, with proprietary models generally outperforming their open-source counterparts. Fig 3 compares the response correctness of the seven MLLMs across four metrics: (a) accuracy, (b) precision, (c) recall, and (d) F1 score; in each panel, the red dashed line marks the mean score for that metric. GPT-4o and OpenAI o1 (proprietary models) achieved the highest scores across all metrics, indicating stronger performance in generating accurate, relevant, and complete outputs for road condition assessment tasks. In contrast, open-source models consistently underperformed, particularly in precision and recall, suggesting a higher likelihood of producing false positives and omitting relevant features.
As shown in Fig 3(a), GPT-4o achieved the highest accuracy at 67.80%, followed closely by OpenAI o1 at 67.24%. LLaVA v1.6 Vicuna was the only open-source model to exceed the group average of 61.49%, achieving an accuracy of 63.20%. Other models such as Gemini 2.5 Pro (58.85%), Gemma 3 (58.09%), and LLaVA v1.6 Mistral (55.37%) fell below the average, indicating more frequent generation of incorrect outputs. These results suggest that GPT-based models are more effective at producing correct responses across road condition assessment tasks.
In terms of precision, shown in Fig 3(b), OpenAI o1 led with 50.34%, followed by GPT-4o at 45.29%. Although these values were the highest among the evaluated models, they indicate that even the top-performing models occasionally labeled incorrect outputs as correct. Precision values for Gemma 3, LLaVA v1.6 Mistral, and LLaVA v1.6 Vicuna ranged between 32.69% and 43.25%, suggesting a greater tendency to produce false positives. Gemini 2.5 Pro recorded a precision of 40.46%, which was below the group mean of 41.64%.
Recall scores, displayed in Fig 3(c), followed a similar pattern to accuracy and precision. OpenAI o1 (60.19%), GPT-4o (58.55%), and Gemini 2.5 Pro (56.00%) all outperformed the group average of 51.96%, indicating they were better at correctly identifying relevant information for surface road condition assessment tasks. In contrast, all open-source models had lower recall scores, ranging from 44.49% to 49.84%, suggesting they were more likely to miss important features during the assessment.
Fig 3(d) presents F1 scores, which balance precision and recall, providing an overall measure of response correctness. OpenAI o1 and GPT-4o again led this metric, with scores of 63.53% and 63.25%, respectively. Gemma 3 achieved a moderately strong score of 59.47%, while Gemini 2.5 Pro lagged behind at 54.21% due to lower precision. The remaining open-source models, including LLaVA v1.6 Mistral (50.31%), LLaVA v1.6 Vicuna (55.16%), and Llama 3.2 (53.34%), all scored below the group mean of 57.03%. These results confirm that the GPT-based proprietary models are better equipped to deliver accurate responses for the assessment tasks.
Table 4 presents response correctness metrics across the four task categories. Results show that models performed better on visual recognition tasks, such as pavement surface distress and feature identification as well as spatial pattern recognition, but showed weaker performance on interpretive reasoning tasks, including pavement condition evaluation and maintenance interval estimation.
Among the visual recognition tasks, the highest performance was observed in pavement surface distress and feature identification. GPT-4o (Accuracy: 78.99%, F1: 69.50%) and OpenAI o1 (Accuracy: 77.55%, F1: 67.72%) demonstrated strong capabilities in detecting visible surface conditions such as cracks and potholes. Similarly, in spatial pattern recognition, both models maintained relatively high performance, with OpenAI o1 achieving an F1 score of 68.53% and GPT-4o scoring 66.55%.
In contrast, interpretive reasoning tasks are more challenging, as they require models to infer severity levels or predict future repair actions based on complex and often ambiguous visual and contextual cues. Consequently, model performance in these tasks remains low. All models recorded F1 scores below 25% for pavement condition evaluation and below 35% for maintenance interval estimation, highlighting their limited capability in producing accurate and consistent predictions.
Consistency
Fig 4 shows that proprietary models tend to produce more consistent outputs, while open-source models generate more variable and less predictable responses. Specifically, proprietary models exhibited stronger inter-model agreement than open-source models. The highest pairwise agreement was observed between GPT-4o and OpenAI o1 (0.621), indicating a high level of consistency in their outputs, likely due to similar underlying architectures or training strategies. Gemini 2.5 Pro also showed moderate agreement with GPT-4o (0.181) and OpenAI o1 (0.196), further supporting the coherence within proprietary models. In contrast, open-source models demonstrated low or near-zero agreement both among themselves and with proprietary models. For example, LLaVA v1.6 Vicuna showed almost no agreement with other models, with values ranging from 0.002 to 0.029, and even a slightly negative agreement with LLaVA v1.6 Mistral (−0.028). Similarly, LLaVA v1.6 Mistral and Llama 3.2 showed weak inter-model alignment, suggesting inconsistent output behavior.
Fig 5 presents inter-model agreements across the four task categories. Each heatmap panel shows agreement on a road condition assessment task: (a) pavement surface distress and feature identification, (b) spatial pattern recognition, (c) road condition description and severity evaluation, and (d) maintenance interval estimation. GPT-4o and OpenAI o1 showed high agreement in all four task categories, with particularly strong alignment in pavement surface distress identification and damage pattern recognition. This suggests that GPT-based models are more consistent in visual recognition tasks. Gemini 2.5 Pro showed moderate agreement with both GPT-4o and OpenAI o1 across all tasks, indicating relatively stable response behavior and internal consistency among proprietary models. In contrast, open-source models showed low agreement both among themselves and with proprietary models across all task categories. For example, LLaVA v1.6 Vicuna and Mistral frequently yielded near-zero or even negative agreement scores, indicating that they often produced divergent outputs when processing the same input. Such inconsistency suggests that open-source MLLMs may be less dependable for road condition assessment tasks, particularly those that require consistent recognition and interpretation of visual and contextual information.
Multimodal errors
Pavement surface distress and feature identification.
MLLMs often struggle to identify specific types of surface distress when performing fine-grained classification. While most models reliably detected the general presence of cracks or potholes, they frequently misidentified the specific type of distress. In several instances, models hallucinated crack types, for example labeling transverse cracks as longitudinal or falsely detecting alligator cracks in unrelated patterns. These hallucinations commonly occurred when distress features were subtle, partially obscured, or visually similar, making it difficult for the model to apply precise visual reasoning. This pattern suggests that although object detection is often accurate at a general level, the step of assigning a detailed label is prone to error. Such misclassifications introduce risks when outputs are used to inform infrastructure repair decisions, particularly in workflows where the type of surface distress directly affects prioritization or treatment strategy.
Spatial pattern recognition.
MLLMs often encounter difficulty when interpreting how surface damage is distributed across roadways, particularly in tasks that require recognizing spatial layouts or structural alignment. Unlike general damage detection, spatial pattern recognition demands a more nuanced understanding of how cracks or potholes relate to features such as joints, edges, or intersections. In these cases, models frequently misclassified or hallucinated spatial arrangements. These errors were especially common when features were irregular, subtle, or embedded in complex visual contexts. Rather than relying solely on observed visual cues, models may infer complexity where none exists, leading to exaggerated or inaccurate spatial descriptions. This poses a challenge for applications that require accurate diagnosis of localized deterioration, such as planning targeted maintenance interventions.
Pavement severity assessment.
Discrepancies were observed between MLLM-generated severity assessments and PCI scores provided by the City of San Francisco. These differences do not necessarily indicate model error, as the PCI scores are based on human evaluations that may introduce subjectivity. As shown in S2 Table, in some cases, MLLM assessments more accurately reflect the visible surface conditions captured in the street view images. This suggests that PCI labels may not always provide the most up-to-date or visually consistent representation of roadway quality, and that MLLMs can offer a valid alternative based on visual evidence. At the same time, it is important to note that in this study, the prompt requested only a PCI value without requiring justification. Without an explicit requirement for reasoning, the model may generate estimated values based on assumptions or unsupported visual cues.
Pavement maintenance interval estimation.
Multimodal errors in responses to maintenance interval estimation occurred when models relied excessively on visible surface distress, leading them to suggest short-term repairs for roads that were not actually scheduled for maintenance. This may have happened because important contextual factors such as long-term transportation plans, budget limitations, and traffic management needs were not considered, as shown in S3 Table. Although the road conditions appeared to be deteriorated, real-world maintenance decisions often prioritize these broader considerations, leading to delayed repairs. This behavior could represent a form of hallucination in which the model inferred unwarranted maintenance needs based solely on surface appearance, generating recommendations that were not grounded in actual maintenance schedules or operational priorities.
Overall computational intensity and cost
Fig 6 compares the processing time (left) and token count (right) across all evaluated MLLMs. Red lines indicate median values, annotated above each box. Proprietary models exhibited longer processing times per street image. OpenAI o1 and GPT-4o had median durations of 67.25 and 52.08 seconds, respectively, while Gemini 2.5 Pro was faster at 40.79 seconds. In contrast, open-source models showed consistently shorter processing times with minimal variability.
Token usage varied by model. Among the proprietary models, Gemini 2.5 Pro had the highest median token usage (3,445), followed by GPT-4o (3,304) and OpenAI o1 (3,196). Interestingly, with input prompts and image embeddings kept consistent across cases, the near-constant token usage of open-source models suggests a fixed-length output behavior. This pattern suggests that open-source models may rely more heavily on templated responses, while proprietary models adjust output length dynamically based on input content.
A Kruskal-Wallis test revealed significant differences in token usage among the three proprietary models (p = 0.0066), with a small effect size (η² = 0.010). Post hoc pairwise Wilcoxon rank-sum tests with Bonferroni correction showed that Gemini 2.5 Pro used a significantly different number of tokens compared to both GPT-4o (adjusted p = 0.027, r = 0.11) and OpenAI o1 (adjusted p = 0.018, r = 0.12), while no significant difference was found between GPT-4o and OpenAI o1 (adjusted p = 1.000, r = 0.04). Despite statistical significance, the small effect sizes suggest that these differences in token usage may have limited practical relevance on their own.
Translating token usage into monetary cost, analyzing one street image with 39 prompts related to road surface condition assessment costs approximately $0.02 using Gemini 2.5 Pro, $0.04 using GPT-4o, and $0.12 using OpenAI o1. Although Gemini 2.5 Pro typically consumes more tokens, its lower per-token rate results in a lower overall cost compared to OpenAI o1. The per-token cost of OpenAI o1 is approximately seven times higher than that of Gemini 2.5 Pro and at least 3.75 times higher than that of GPT-4o. Consequently, OpenAI o1 incurs the highest cost per image, while Gemini 2.5 Pro and GPT-4o offer more cost-efficient alternatives.
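For illustration, the per-image cost can be approximated by multiplying token usage by a blended per-token rate, as sketched below. The rates are back-calculated from the reported per-image costs and median token counts; they are not the providers’ published prices, which distinguish input and output tokens and change over time.

```python
# Hypothetical blended USD rates per token, back-calculated from the reported figures.
PRICE_PER_TOKEN = {
    "Gemini 2.5 Pro": 0.02 / 3445,
    "GPT-4o":         0.04 / 3304,
    "OpenAI o1":      0.12 / 3196,
}

def cost_per_image(model: str, tokens: int) -> float:
    """Approximate cost (USD) of assessing one street image with all 39 prompts."""
    return tokens * PRICE_PER_TOKEN[model]
```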
Discussion, limitations and future work
The potential and current limitations of existing MLLMs in road surface condition assessment
The study assessed the performance of both proprietary and open-source MLLMs on four tasks relevant to surface road condition assessment. Results indicate that these models can analyze street-level imagery and generate relevant, structured responses that align with task objectives. This confirmed the potential of MLLMs to support surface road condition assessment workflows.
Beyond performance, MLLMs are characterized by their user-friendly design and low operational costs, which support their applicability in pavement condition assessment. The user-friendliness of MLLMs is reflected in two aspects. First, they allow users to interact with the model using natural language prompts [83], removing the need for specialized programming expertise. Second, MLLMs can produce structured assessment outputs directly from raw street-level images, without requiring extensive data preprocessing or formatting. This streamlined input process reduces the technical burden typically associated with automated assessment tools and facilitates more efficient deployment in real-world pavement evaluation workflows.
In terms of cost, MLLMs are also relatively affordable. The cost of analysis ranges from $0.02 to $0.12 per image. In contrast, conventional road condition assessments can be significantly more expensive, with costs starting at approximately $3,772 for evaluating a roadway of at least one mile in length [84]. Assuming one image is captured every 25 feet, or approximately 212 images per mile, an MLLM-based assessment would cost between $4.24 and $25.44 for a one-mile segment.
However, these advantages in usability and affordability are accompanied by several limitations. MLLM performance is not consistent across all task types and tends to vary with task complexity. The models perform well on general tasks that require minimal contextual understanding, such as identifying the presence of cracks or potholes. Their performance weakens as tasks require more detailed or specific classifications. For example, the average response rate drops from 99.07% when identifying general cracks to 77.28% when identifying transverse cracks. Accuracy also declines, with the average F1 score decreasing from 0.62 for object identification to 0.20 for maintenance interval estimation.
Another factor affecting model performance is the clarity and completeness of the input prompt. Models often produce inaccurate responses when the input lacks sufficient context. For instance, in this study, the prompt only asked the models to recommend maintenance intervals based on street-view images, without providing any additional background information. As a result, the models focused exclusively on visible pavement conditions and overlooked non-visual determinants of maintenance needs. Incorporating these contextual factors into model prompts or fine-tuning datasets could improve both the reliability and practical utility of MLLM-generated outputs.
Furthermore, generative models may produce probabilistic outputs, meaning that repeated runs do not necessarily yield identical results unless generation parameters are fixed. This stochastic variability parallels observer inconsistency in manual inspections and can introduce uncertainty into the evaluation process. Standardizing model parameters and employing well-structured prompts can mitigate this randomness, thereby enhancing the reproducibility and reliability of MLLM outputs.
A further limitation lies in the explainability of MLLM outputs. Although the models can generate decisions related to surface road condition assessment, their reasoning processes remain opaque. For instance, when asked to provide a PCI based on an image, the models typically return only a numerical value without clarifying how it was derived. While MLLMs can produce descriptive explanations, these are not always grounded in verifiable causal relationships and may reflect post-hoc rationalizations rather than genuine interpretive reasoning. Prompting models to explicitly state the rationale behind their decisions could improve transparency and facilitate more trustworthy applications in pavement assessment.
Finally, although GAI can reduce reliance on manual inspections by automating visual assessments, human oversight remains necessary to validate cases and ensure alignment with engineering standards. While GAI using street-view imagery can assist with tasks related to surface condition assessment, effective pavement management requires additional data beyond surface imagery, such as structural integrity and traffic loading. Therefore, GAI should be viewed as a decision-support toolkit that enhances human judgment rather than a complete replacement for professional expertise and field validation.
Pathways to improving MLLM performance
Enhancing the effectiveness of MLLMs in road surface condition assessment will require improvements in several areas. One key area is the need for additional domain-specific training data to support fine-tuning and improve model performance [85]. Many core assessment tasks, such as identifying specific types of pavement cracks or determining what types of repairs have been made to the road (e.g., patched areas, resurfacing), require detailed recognition capabilities that current models often lack. Compared to general object detection, the response rate for these fine-grained tasks is significantly lower, indicating that MLLMs need further tuning to recognize subtle distinctions in road surface conditions. Similarly, tasks such as pavement severity estimation and maintenance interval prediction depend on contextual variables beyond the visible defect itself. These tasks require reasoning based on factors such as budget, environmental exposure, or traffic patterns, which are typically not part of the model’s default knowledge. Fine-tuning with annotated datasets that include these features is necessary to improve model performance on these complex tasks.
In parallel, knowledge distillation can serve as a complementary strategy [86]. By transferring knowledge from high-performing proprietary models to open-source models, distillation can enhance performance without the computational cost of training from scratch. This is particularly useful for tasks like spatial pattern recognition, where response rates may be low, but the accuracy of responses, when provided, is relatively high. This suggests that models possess some capacity for spatial reasoning, but they are not consistently activated. Distillation can help encode this reasoning more effectively, enabling smaller models to perform better in both classification and spatial interpretation tasks.
At the same time, prompt engineering plays an important role in improving model outputs, particularly for context-dependent tasks [87]. For instance, when asking when a road should be repaired, the model must consider information not visible in the image, such as usage intensity, historical repair records, or surrounding infrastructure. Designing prompts that incorporate this contextual information and guide the model to consider relevant variables can significantly improve response quality. This approach is especially valuable in situations where retraining or fine-tuning is not feasible, as it allows users to optimize model performance through carefully structured input design.
Proprietary vs. open-source models: trade-offs and implications for future development
This study compared the performance of proprietary and open-source MLLMs across multiple dimensions, including responsiveness, accuracy, internal agreement, and cost. While both categories of models demonstrate similar levels of responsiveness, proprietary models generally achieve higher accuracy and show stronger internal agreement, producing similar outputs when presented with the same or slightly modified prompts. These advantages make proprietary models highly reliable for operational deployment; however, their use involves financial costs and limited transparency. Most proprietary models cannot be fine-tuned or adapted for domain-specific applications, and their token-based pricing structures can substantially increase computational expenses, especially in large-scale analyses.
In contrast, open-source models are freely available and offer greater flexibility for customization and domain adaptation. Although they typically perform slightly worse than proprietary models in accuracy and consistency, their transparency, accessibility, and modifiability make them valuable for research and development. Open-source models can be fine-tuned on task-specific datasets, enabling iterative improvement without recurring costs. This balance between performance and accessibility suggests that open-source models are a practical foundation for future pavement assessment applications, particularly in settings where reproducibility and budget constraints are critical.
Among the open-source models evaluated, Gemma 3 is recommended as the base model for fine-tuning. It demonstrates high response rates across all tasks, including a perfect score (100%) for maintenance interval estimation. Gemma 3 also achieves the highest F1 score among all open-source models, exceeding even that of Gemini 2.5 Pro, which indicates strong overall reliability and balanced performance. Furthermore, Gemma 3 shows higher consistency with proprietary models compared to other open-source models. In contrast, the relatively low consistency observed among the LLaVA and Llama models suggests that their responses may be more variable or randomly generated. Overall, Gemma 3 achieved the best performance among the open-source models across all tasks, suggesting it is a strong candidate for fine-tuning and domain-specific adaptation. For applications involving knowledge distillation, GPT-4o is the most suitable proprietary model. It achieves high accuracy and internal consistency, comparable to OpenAI o1, but at a lower operational cost.
Conclusion
In this study, we evaluated the performance of seven MLLMs, including three proprietary models (GPT-4o, OpenAI o1, Gemini 2.5 Pro) and four open-source models (LLaVA v1.6 Mistral, LLaVA v1.6 Vicuna, Llama 3.2, Gemma 3), across four task categories: (1) pavement surface distress and feature identification, (2) spatial pattern recognition, (3) pavement condition severity evaluation, and (4) pavement maintenance interval estimation. The models were assessed across five dimensions: response rate, accuracy, consistency, multimodal errors, and computational cost.
The results empirically verified the transformative potential of MLLMs, which lies in their strong responsiveness and low cost. On average, the models achieved a 63.65% response rate, and the cost of conducting an assessment based on a single image ranged from approximately $0.02 to $0.12, depending on the model used. While accuracy requires further optimization, this work establishes a benchmark proving that with domain-specific refinement, MLLMs can transition from assistive tools to scalable, cost-effective solutions for pavement monitoring and management.
Supporting information
S1 Table. List of prompts used for model evaluation.
https://doi.org/10.1371/journal.pone.0340380.s001
(DOCX)
S2 Table. Cases illustrating discrepancies between MLLM-generated severity assessments and PCI-based condition labels.
https://doi.org/10.1371/journal.pone.0340380.s002
(DOCX)
S3 Table. Comparison between maintenance intervals estimated by MLLMs and actual repair intervals.
https://doi.org/10.1371/journal.pone.0340380.s003
(DOCX)
References
- 1. Li Y, Huang J. Safety Impact of Pavement Conditions. Transportation Research Record: Journal of the Transportation Research Board. 2014;2455(1):77–88.
- 2. Verma D, Singh V. Road maintenance effect in reducing road accident. International Journal for Scientific Research & Development. 2015;3(1):303–7.
- 3. Wiegmann J, Yelchuru B. Resource allocation logic framework to meet highway asset preservation. Transportation Research Board. 2012.
- 4. Elmansouri O, Alossta A, Badi I. Pavement condition assessment using pavement condition index and multi-criteria decision-making model. Mechatron Intell Transp Syst. 2022;1:57–68.
- 5. Ezzati S, Palma CD, Bettinger P, Eriksson LO, Awasthi A. An integrated multi-criteria decision analysis and optimization modeling approach to spatially operational road repair decisions. Can J For Res. 2021;51(3):465–83.
- 6. Burningham S, Stankevich N. Why road maintenance is important and how to get it done. 2005.
- 7. Elbagalati O, Elseifi M, Gaspard K, Zhang Z. Development of the pavement structural health index based on falling weight deflectometer testing. International Journal of Pavement Engineering. 2016;19(1):1–8.
- 8. Jiang X, Gabrielson J, Huang B, Bai Y, Polaczyk P, Zhang M, et al. Evaluation of inverted pavement by structural condition indicators from falling weight deflectometer. Construction and Building Materials. 2022;319:125991.
- 9. Huang YH. Pavement analysis and design. Upper Saddle River, NJ: Pearson/Prentice Hall. 2004.
- 10. Liu Z, Yang Q, Gu X. Assessment of Pavement Structural Conditions and Remaining Life Combining Accelerated Pavement Testing and Ground-Penetrating Radar. Remote Sensing. 2023;15(18):4620.
- 11. Yao Y, Tung S-TE, Glisic B. Crack detection and characterization techniques-An overview. Struct Control Health Monit. 2014;21(12):1387–413.
- 12. Arianto T, Suprapto M. Pavement condition assessment using IRI from Roadroid and surface distress index method on national road in Sumenep Regency. In: IOP Conference Series: Materials Science and Engineering, 2018.
- 13. Yang CH, Kim JG, Shin SP. Road Hazard Assessment Using Pothole and Traffic Data in South Korea. Journal of Advanced Transportation. 2021;2021:1–10.
- 14. Merkle N, Henry C, Azimi SM, Kurz F. Road condition assessment from aerial imagery using deep learning. ISPRS Ann Photogramm Remote Sens Spatial Inf Sci. 2022;V-2–2022:283–9.
- 15. Lopes Amaral Loures L, Rezazadeh Azar E. Condition Assessment of Unpaved Roads Using Low-Cost Computer Vision–Based Solutions. J Transp Eng, Part B: Pavements. 2023;149(1).
- 16. Moussa G, Hussain K. A new technique for automatic detection and parameters estimation of pavement crack. In: 2011.
- 17. Aslan OD, Gultepe E, Ramaji IJ, Kermanshachi S. Using Artificial Intelligence for Automating Pavement Condition Assessment. In: International Conference on Smart Infrastructure and Construction 2019 (ICSIC), 2019. 337–41. https://doi.org/10.1680/icsic.64669.337
- 18. Sholevar N, Golroo A, Esfahani SR. Machine learning techniques for pavement condition evaluation. Automation in Construction. 2022;136:104190.
- 19. Eisenbach M, Stricker R, Sesselmann M, Seichter D, Gross H. Enhancing the quality of visual road condition assessment by deep learning. In: 2019.
- 20. Satheesan D, Talib M, Li S, Yuan A. An Automated Method for Pavement Surface Distress Evaluation. Int Arch Photogramm Remote Sens Spatial Inf Sci. 2024;XLVIII-M-4–2024:47–53.
- 21. Qureshi WS, Hassan SI, McKeever S, Power D, Mulry B, Feighan K, et al. An Exploration of Recent Intelligent Image Analysis Techniques for Visual Pavement Surface Condition Assessment. Sensors (Basel). 2022;22(22):9019. pmid:36433612
- 22. Ding J, Li X. An Approach for Validating Quality of Datasets for Machine Learning. In: 2018 IEEE International Conference on Big Data (Big Data), 2018. https://doi.org/10.1109/bigdata.2018.8622640
- 23. Gharawi AA, Alsubhi J, Ramaswamy L. Impact of Labeling Noise on Machine Learning: A Cost-aware Empirical Study. In: 2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA), 2022. 936–9. https://doi.org/10.1109/icmla55696.2022.00156
- 24. Whang SE, Lee J-G. Data collection and quality challenges for deep learning. Proc VLDB Endow. 2020;13(12):3429–32.
- 25. Drosatos G, Efraimidis PS, Arampatzis A. Federated and Transfer Learning Applications. MDPI. 2023:11722.
- 26. Bandi A, Adapa PVSR, Kuchi YEVPK. The Power of Generative AI: A Review of Requirements, Models, Input–Output Formats, Evaluation Metrics, and Challenges. Future Internet. 2023;15(8):260.
- 27. Lu K, Zhao X, Li M, Wang Z, Zhou Y, Wang J. StreetSenser: a novel approach to sensing street view via a fine-tuned multimodal large language model. International Journal of Geographical Information Science. 2025:1–29.
- 28. Blečić I, Saiu V, Trunfio A. Enhancing urban walkability assessment with multimodal large language models. In: International Conference on Computational Science and Its Applications, 2024.
- 29. Jang KM, Kim J. Multimodal Large Language Models as Built Environment Auditing Tools. The Professional Geographer. 2024;77(1):84–90.
- 30. Ragnoli A, De Blasiis MR, Di Benedetto A. Pavement Distress Detection Methods: A Review. Infrastructures. 2018;3(4):58.
- 31. Mishalani RG, Koutsopoulos HN. Role of spatial dimension in infrastructure condition assessment and deterioration modeling. Transportation Research Record. 1995;1508.
- 32. Karim FMA, Rubasi KAH, Saleh AA. The Road Pavement Condition Index (PCI) Evaluation and Maintenance: A Case Study of Yemen. Organization, Technology and Management in Construction: an International Journal. 2016;8(1):1446–55.
- 33. Putra DA, Suprapto M. Assessment of the road based on PCI and IRI roadroid measurement. MATEC Web of Conferences. 2018.
- 34. Loprencipe G, Pantuso A. A Specified Procedure for Distress Identification and Assessment for Urban Road Surfaces Based on PCI. Coatings. 2017;7(5):65.
- 35. Cafiso S, D’Agostino C, Delfino E, Montella A. From manual to automatic pavement distress detection and classification. In: 2017 5th IEEE International Conference on Models and Technologies for Intelligent Transportation Systems (MT-ITS), 2017. 433–8. https://doi.org/10.1109/mtits.2017.8005711
- 36. Schnebele E, Tanyu BF, Cervone G, Waters N. Review of remote sensing methodologies for pavement management and assessment. Eur Transp Res Rev. 2015;7(2).
- 37. Bou-Saab G, Nlenanya I, Alhasan A. Correlating visual–windshield inspection pavement condition to distresses from automated surveys using classification trees. In: 2019.
- 38. Ibragimov E, Kim Y, Lee JH, Cho J, Lee J-J. Automated Pavement Condition Index Assessment with Deep Learning and Image Analysis: An End-to-End Approach. Sensors (Basel). 2024;24(7):2333. pmid:38610545
- 39. Radopoulou SC, Brilakis I, Doycheva K, Koch C. A framework for automated pavement condition monitoring. In: Construction Research Congress 2016, 2016.
- 40. Abdel Raheem M, El-Melegy M. Drive-By Road Condition Assessment Using Internet of Things Technology. In: 2019 International Conference on Advances in the Emerging Computing Technologies (AECT), 2020. 1–6. https://doi.org/10.1109/aect47998.2020.9194190
- 41. Jang J, Smyth AW, Yang Y, Cavalcanti D. Road surface condition monitoring via multiple sensor-equipped vehicles. In: 2015 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2015. 43–4. https://doi.org/10.1109/infcomw.2015.7179334
- 42. Marcelino P, Lurdes Antunes M de, Fortunato E. Comprehensive performance indicators for road pavement condition assessment. Structure and Infrastructure Engineering. 2018;14(11):1433–45.
- 43. Majidifard H, Adu-Gyamfi Y, Buttlar WG. Deep machine learning approach to develop a new asphalt pavement condition index. Construction and Building Materials. 2020;247:118513.
- 44. Huang Y-T, Jahanshahi MR, Shen F, Mondal TG. Deep Learning–Based Autonomous Road Condition Assessment Leveraging Inexpensive RGB and Depth Sensors and Heterogeneous Data Fusion: Pothole Detection and Quantification. J Transp Eng, Part B: Pavements. 2023;149(2).
- 45. Hasanaath AA, Moinuddeen A, Mohammad N, Khan MA, Hussain AA. Continuous and Realtime Road Condition Assessment Using Deep Learning. In: 2022 International Conference on Connected Systems & Intelligence (CSI), 2022. 1–7. https://doi.org/10.1109/csi54720.2022.9924135
- 46. Jagatheesaperumal SK, Bibri SE, Ganesan S, Jeyaraman P. Artificial Intelligence for road quality assessment in smart cities: a machine learning approach to acoustic data analysis. Comput Urban Sci. 2023;3(1).
- 47. Martorell S, Sánchez A, Villamizar M, Clemente G. Maintenance modelling and optimization integrating strategies and human resources: Theory and case study. Proceedings of the Institution of Mechanical Engineers, Part O: Journal of Risk and Reliability. 2008;222(3):347–57.
- 48. Verma AK, Srividya A, Ramesh P, Deshpande A, Sadiq R. Expert knowledge base in integrated maintenance models for engineering plants. Recent developments and new direction in soft-computing foundations and applications: Selected papers from the 4th World Conference on Soft Computing, May 25-27, 2014, Berkeley. Springer. 2016.
- 49. Tomescu D, Heiman A, Badescu A. An Automatic Remote Monitoring System for Large Networks. In: 2019 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC), 2019. 71–3. https://doi.org/10.1109/cse/euc.2019.00023
- 50. Štefančič M, Jerič D, Štefančič M, Čebokli P. Dealing with large amount of data from automated pest monitoring system. Massendatenmanagement in der Agrar- und Ernährungswirtschaft – Erhebung – Verarbeitung – Nutzung. Gesellschaft für Informatik eV. 2013.
- 51. Eisenbach M, Stricker R, Seichter D, Amende K, Debes K, Sesselmann M, et al. How to get pavement distress detection ready for deep learning? A systematic approach. In: 2017 International Joint Conference on Neural Networks (IJCNN), 2017. 2039–47. https://doi.org/10.1109/ijcnn.2017.7966101
- 52. Pearce J, Chiavaroli N. Rethinking assessment in response to generative artificial intelligence. Med Educ. 2023;57(10):889–91. pmid:37042389
- 53. Nguyen Thanh B, Vo DTH, Nguyen Nhat M, Pham TTT, Thai Trung H, Ha Xuan S. Race with the machines: Assessing the capability of generative AI in solving authentic assessments. AJET. 2023;39(5):59–81.
- 54. Chen Z, Dai J, Lai Z, Liu Z, Lu L, Lu T, et al. VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks. In: Advances in Neural Information Processing Systems 37, 2024. 69925–75. https://doi.org/10.52202/079017-2235
- 55. Song H, Baek K, Byun Y. Pothole detection using machine learning. Advanced Science and Technology. 2018:151–5.
- 56. Egaji OA, Evans G, Griffiths MG, Islas G. Real-time machine learning-based approach for pothole detection. Expert Systems with Applications. 2021;184:115562.
- 57. Anaissi A, Khoa NLD, Rakotoarivelo T, Alamdari MM, Wang Y. Smart pothole detection system using vehicle-mounted sensors and machine learning. J Civil Struct Health Monit. 2019;9(1):91–102.
- 58. Hoang N-D, Nguyen Q-L. A novel method for asphalt pavement crack classification based on image processing and machine learning. Engineering with Computers. 2018;35(2):487–98.
- 59. Nguyen TH, Nguyen TL, Sidorov DN, Dreglea AI. Machine learning algorithms application to road defects classification. IDT. 2018;12(1):59–66.
- 60. Inkoom S, Sobanjo J, Barbu A, Niu X. Prediction of the crack condition of highway pavements using machine learning models. Structure and Infrastructure Engineering. 2019;15(7):940–53.
- 61. Abderrahim NYQ, Abderrahim S, Rida A. Road Segmentation using U-Net architecture. In: 2020 IEEE International conference of Moroccan Geomatics (Morgeo), 2020. https://doi.org/10.1109/morgeo49228.2020.9121887
- 62. Munteanu A, Selea T, Neagul M. Deep learning techniques applied for road segmentation. In: 2019.
- 63. Zhang S, Mu H, Liu T. Improving Accuracy and Generalizability via Multi-Modal Large Language Models Collaboration. In: 2024 International Joint Conference on Neural Networks (IJCNN), 2024. 1–8. https://doi.org/10.1109/ijcnn60899.2024.10651475
- 64. Ashqar HI, Alhadidi TI, Elhenawy M, Khanfar NO. Leveraging Multimodal Large Language Models (MLLMs) for Enhanced Object Detection and Scene Understanding in Thermal Images for Autonomous Driving Systems. Automation. 2024;5(4):508–26.
- 65. Tan W, Ding C, Jiang J, Wang F, Zhan Y, Tao D. Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID. In: 2024.
- 66. Zhou Q, Zhang J, Zhu Z. Evaluating Urban Visual Attractiveness Perception Using Multimodal Large Language Model and Street View Images. Buildings. 2025;15(16):2970.
- 67. Ki D, Lee H, Park K, Ha J, Lee S. Measuring nuanced walkability: Leveraging ChatGPT’s vision reasoning with multisource spatial data. Computers, Environment and Urban Systems. 2025;121:102319.
- 68. Stricker R, Aganian D, Sesselmann M, Seichter D, Engelhardt M, Spielhofer R, et al. Road Surface Segmentation - Pixel-Perfect Distress and Object Detection for Road Assessment. In: 2021 IEEE 17th International Conference on Automation Science and Engineering (CASE), 2021. 1789–96. https://doi.org/10.1109/case49439.2021.9551591
- 69. Wang KC, Gong W. Automated pavement distress survey: a review and a new direction. In: 2002.
- 70. Mukhlisin M, Khiyon KN. The Effects of Cracking on Slope Stability. Journal of the Geological Society of India. 2018;91(6):704–10.
- 71. Grantab R, Shenoy VB. Location- and Orientation-Dependent Progressive Crack Propagation in Cylindrical Graphite Electrode Particles. J Electrochem Soc. 2011;158(8):A948.
- 72. Suryoto, Siswoyo DP, Setyawan A. The Evaluation of Functional Performance of National Roadway using Three Types of Pavement Assessments Methods. Procedia Engineering. 2017;171:1435–42.
- 73. Kumar AV. Pavement surface condition assessment: a-state-of-the-art research review and future perspective. Innov Infrastruct Solut. 2024;9(12).
- 74. Svenson K. Estimated Lifetimes of Road Pavements in Sweden Using Time-to-Event Analysis. J Transp Eng. 2014;140(11).
- 75. Kavukcuoglu K. Gemini 2.5: Our most intelligent AI model. 2025. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/
- 76. OpenAI. Introducing OpenAI o1. 2024. https://openai.com/o1/
- 77. OpenAI. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/
- 78. Farabet C, Warkentin T. Introducing Gemma 3: The most capable model you can run on a single GPU or TPU. 2025. https://blog.google/technology/developers/gemma-3/
- 79. Meta. Llama 3.2: Revolutionizing edge AI and vision with open, customizable models. 2024. https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/
- 80. Liu H. llava-v1.6-mistral-7b. 2024. https://huggingface.co/liuhaotian/llava-v1.6-mistral-7b
- 81. Liu H. llava-v1.6-vicuna-7b. 2024. https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b
- 82. City and County of San Francisco. Streets Data – Pavement Condition Index (PCI) Scores. https://data.sfgov.org/City-Infrastructure/Streets-Data-Pavement-Condition-Index-PCI-Scores/5aye-4rtt
- 83. Kuang J, Shen Y, Xie J, Luo H, Xu Z, Li R, et al. Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey. ACM Comput Surv. 2025;57(8):1–36.
- 84. Pavement L. Pavement analysis services request for services attachment A. 2020.
- 85. Zhou X, He J, Ke Y, Zhu G, Gutierrez Basulto V, Pan J. An Empirical Study on Parameter-Efficient Fine-Tuning for MultiModal Large Language Models. In: Findings of the Association for Computational Linguistics ACL 2024, 2024. 10057–84. https://doi.org/10.18653/v1/2024.findings-acl.598
- 86. Cai Y, Zhang J, He H, He X, Tong A, Gan Z. LLaVA-KD: A framework of distilling multimodal large language models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025.
- 87. Chen B, Zhang Z, Langrené N, Zhu S. Unleashing the potential of prompt engineering for large language models. Patterns (N Y). 2025;6(6):101260. pmid:40575123