
Urban walkability through different lenses: A comparative study of GPT-4o and human perceptions

  • Musab Wedyan,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing

    Affiliation School of Planning, Design and Construction, Michigan State University, East Lansing, Michigan, United States of America

  • Yu-Chen Yeh,

    Roles Conceptualization, Data curation, Methodology

    Affiliation Department of Horticulture and Landscape Architecture, National Taiwan University, Taipei City, Taiwan

  • Fatemeh Saeidi-Rizi,

    Roles Conceptualization, Investigation, Methodology, Resources, Supervision, Writing – review & editing

    saeidiri@msu.edu

    Affiliation School of Planning, Design and Construction, Michigan State University, East Lansing, Michigan, United States of America

  • Tai-Quan Peng,

    Roles Formal analysis, Methodology

    Affiliation Department of Communication, College of Communication Arts and Sciences, Michigan State University, East Lansing, Michigan, United States of America

  • Chun-Yen Chang

    Roles Methodology

    Affiliation Department of Horticulture and Landscape Architecture, National Taiwan University, Taipei City, Taiwan

Abstract

Urban environments significantly shape our well-being, behavior, and overall quality of life. Assessing urban environments, particularly walkability, has traditionally relied on computer vision and machine learning algorithms. However, these approaches often fail to capture the subjective and emotional dimensions of walkability, due to their limited ability to integrate human-centered perceptions and contextual understanding. Recently, large language models (LLMs) have gained traction for their ability to process and analyze unstructured data. With the increasing reliance on LLMs in urban studies, it is essential to critically evaluate their potential to accurately capture human perceptions of walkability and contribute to the design of more pedestrian-friendly environments. A critical question therefore arises: can LLMs, such as GPT-4o, accurately reflect human perceptions of urban environments? This study addresses this question by comparing GPT-4o's evaluations of visual urban scenes with human perceptions, specifically in the context of urban walkability. The research involved human participants and GPT-4o evaluating street-level images based on key dimensions of walkability, including overall walkability, feasibility, accessibility, safety, comfort, and liveliness. To analyze the data, text mining techniques were employed, examining keyword frequency, coherence scores, and similarity indices between participant- and GPT-4o-generated responses. The findings revealed that GPT-4o and participants aligned in their evaluations of overall walkability, feasibility, accessibility, and safety. In contrast, notable differences emerged in the assessment of comfort and liveliness. Human participants demonstrated broader thematic diversity and addressed a wider range of topics, whereas GPT-4o produced more focused and cohesive responses, particularly in relation to comfort and safety.
In addition, similarity scores between GPT-4o and the responses of participants indicated a moderate level of alignment between GPT-4o’s reasoning and human judgments. The study concludes that human input remains essential for fully capturing human-centered evaluations of walkability. Furthermore, it underscores the importance of refining LLMs to better align with human perceptions in future walkability studies.

1. Introduction

Walkability has garnered considerable attention across various disciplines such as urban planning, public health, and transportation [1–3]. The quality of the walking environment is also recognized as a crucial component in enhancing community development [4], improving the human experience at historical sites [5], and reducing carbon emissions [6,7].

Previous studies have constructed models to measure perceived walkability using, for example, panoramic street view images and virtual reality [8–10]. Additionally, machine learning techniques, such as ResNet, have been employed to objectively quantify walkability based on pedestrian visual perception [11]. Researchers have also applied deep learning algorithms to create walkability indices from micro- and macro-level urban features [12–14]. Collectively, the extensive application of street view imagery and deep learning algorithms has enabled the development of methods to assess pedestrian walkability.

Recently, alongside computer vision techniques in urban studies, large language models (LLMs) have become increasingly capable of performing a wide range of tasks, including text completion, sentiment analysis [15,16], and cross-language translation [17]. LLMs have also found applications in social science research, where they simulate human responses to survey questions on attitudes and behaviors [18,19]. The release of ChatGPT at the end of 2022 brought global attention [20–22]. Building on this momentum, the newly introduced GPT-4o model, with its multimodal capabilities, has further expanded the possibilities. For example, it has been applied to medical data [23–26], fake news detection [27], education [28,29], business [30,31], agriculture [32], and social science [33]. These studies show that LLMs have been applied across many domains. However, although generative methods in the field of walkability are expected to grow [34] and their use in urban tasks is expanding [35], the application of LLMs in the urban domain remains limited.

Overall, previous research has extensively utilized street view imagery and computer vision techniques to assess the physical attributes of walkable environments. However, the performance of LLMs in the urban walkability field remains unexplored. Addressing this gap, we aimed to explore the alignment between human perceptions of walkability and those of GPT-4o as a representative LLM. We address the following question: how accurately do these models capture real-world human experiences of visual appeal? We examined the capabilities of GPT-4o in assessing the visual perception of walkability in urban areas by having it evaluate overall walkability, feasibility, accessibility, safety, comfort, and liveliness. By comparing paired images, we assessed its ratings, text responses, and sentiment scores against those of human participants. Our findings highlight the limitations of GPT-4o in accurately perceiving urban environments and point to opportunities for refining LLMs to better align with human perspectives.

2. Literature review

2.1. Walkability perception

Walkability is increasingly acknowledged as a key element in promoting healthy communities [36], as well as enhancing social interaction and economic vitality within neighborhoods [37]. Walkability is typically characterized as the degree to which a built environment is accessible and appealing to individuals [38], whether they walk out of necessity, preference, or social engagement [39]. It also refers to individuals’ perceptions of a street as a suitable place for walking [40]. As a subjective measure, walkability reflects the perceived quality of the environment and is shaped by personal assessments of its suitability for walking, making it challenging to quantify and assess objectively [41]. Collectively, these elements shape what is often referred to as perceived walkability [42,43].

The investigation of perceived walkability through subjective assessments has emerged as an effective approach to deepening our comprehension of the walking environment [44,45]. Among the various factors influencing perceived walkability, visual variety has emerged as a critical determinant of pedestrian satisfaction: it captures the richness and diversity of urban design elements that engage and attract pedestrians, enhancing the overall appeal of a space [46]. Drawing on Maslow’s hierarchy of needs, perceived walkability has been proposed to consist of five dimensions: feasibility, accessibility, safety, comfort, and pleasurability [46]. Other researchers evaluate perceived walkability through four key dimensions: comfort, safety, utility, and appeal [46,47], and further studies have described visual variety using terms such as imageability, complexity, transparency [40], and positive sensory experiences [48].

According to [46], these five indicators, together with an overall visual walkability measure, form the six categories of Visual Walkability Perception, providing a comprehensive framework for determining whether an environment visually supports walking. The visual walkability indicator provides an overall assessment of whether a location visually supports walking. Feasibility refers to factors that encourage walking, influenced by land use types and the diversity of available facilities. Accessibility addresses visible obstacles, such as dead-end streets or restricted access areas. Safety assesses a street’s security based on crime, traffic accidents, and visual cues like graffiti, litter, and neglected buildings. Comfort examines how the street environment enhances the pedestrian experience, factoring in elements like street furniture, sidewalk width, urban design features, and accessibility facilities. Finally, pleasurability assesses the appeal of public spaces, reflecting how diverse, lively, enjoyable, and interesting they are for walking.

Yet walkability remains inherently subjective and challenging to quantify. A recent review of trends in walkability research concluded that methods for measuring walkability have shifted dramatically over time [49]. Early studies largely relied on measurement-based methods such as GIS-based assessments and physical and image audits [50–56], as well as mixed-method approaches [57]. Recent years, however, have seen a growing emphasis on micro-level, street-based evaluations and street view imagery (SVI) [58,59], with applications including measuring psychological greenery and visual crowdedness [60] and object importance [61].

Although CNN-based analysis of SVI has been effective in identifying physical features [62], it comes with notable limitations. SVI and CNNs do not inherently integrate textual opinions or subjective feedback from participants, which are crucial for understanding perceived walkability attributes such as safety, aesthetics, and comfort [63,64]. Research has also highlighted the importance of textual opinions, such as those gathered from social media or surveys, to complement SVI in capturing the emotional and subjective dimensions of walkability [65,66]. This lack of interpretability makes it challenging for urban planners and policymakers to grasp the “reasoning” behind an algorithm’s assessment.

Unlike CNNs, which focus on recognizing visual features, LLMs can analyze both text and images, allowing them to interpret the broader context surrounding an environment [35]. By leveraging both visual and textual data, LLMs bridge the gap between physical environment analysis and subjective user experiences. For example, LLMs can integrate data from computer vision analyses of street view imagery with textual user feedback, enabling a comprehensive understanding of urban spaces [67]. This multimodal approach not only addresses the limitations of single-modality methods but also provides richer, more actionable insights for creating walkable cities.

2.2. ChatGPT advancements and uses

Recently, GPT-4o has showcased extraordinary multimodal capabilities in decoding visual and textual content [68]. It has shown precision in functions such as detecting and identifying visual elements [69]. Another study explored the performance of GPT-4o in image classification by integrating images with textual descriptions, demonstrating that such a combination can significantly improve classification accuracy [70]. In addition, GPT-4o can identify and rank the perceived risk in traffic scenarios shown in images to some extent, although its evaluations do not always align with human judgment [71].

ChatGPT has been attracting attention from individuals across different backgrounds, such as healthcare and academia [20–22,72]. In the medical field, GPT-4 has played roles in enhancing radiology report assessments [73], conducting reviews on the digital twin concept [74], undertaking medical writing tasks [75], aiding in licensing exams [76] and medical education [77,78], and visualizing internal body structures for diagnosis and research. Recent overviews have also summarized GPT’s applications across mathematics [79], physics [80–82], and communication [83].

Despite the wide use of LLMs in different domains, there is a significant gap in understanding how these models perceive and evaluate outdoor environments compared to human perspectives. While previous studies have extensively explored human perceptions of urban environments through traditional methodologies such as surveys, mixed-methods approaches, and deep learning, it is important and methodologically desirable to compare GPT-4o's perceptions of outdoor environments, in terms of visual walkability, with human perceptions. This research is the first to investigate the alignment or divergence between LLM-based evaluations and human perceptions of walkability. This study seeks to evaluate the potential of LLMs as reliable tools for assessing environmental factors influencing human experiences, such as walkability, an area which has not been systematically compared in prior research. Based on this, the research addresses the following questions: How do the keyword frequencies in GPT-4o-generated descriptions of paired images compare to those in human-generated descriptions? In what ways do the sentiment scores of GPT-4o-generated descriptions differ from those of human-generated descriptions? Additionally, how do the coherence and similarity indices of GPT-4o-generated responses compare to those of human responses?

3. Methodology

Fig 1 presents a detailed overview of the research process across three distinct phases. The first phase involves the evaluation of paired images by both human participants and GPT-4o. The second phase consists of two scenarios: one where both GPT-4o and human participants choose the first image as having a higher rating, and another where both select the second image. After aligning the responses based on the selected image, the third phase synthesizes the findings by analyzing response coherence and keywords to identify themes more highly rated by either GPT-4o or human participants. This phase also compares sentiment scores across responses, performs LDA topic modeling, and analyzes keyword frequencies.

3.1. Data collection

3.1.1. Image selection and questionnaire structure.

Images were collected from Lansing, East Lansing, and Williamston, Michigan, using a horizontally held iPhone 14 for consistency. The selected images aim to represent multiple perspectives of the urban environment, encompassing a broad spectrum of development in the chosen areas to illustrate the diversity present in city landscapes; the assessment of this variation was based on the authors' subjective judgment. The selection covered areas with varying levels of greenery, pavement conditions, population density, vehicle and pedestrian flow, and spatial openness or constriction, including both tree-lined streets and concrete-dominated areas, as well as crowded streets and sparsely populated spaces. These diverse images provided a holistic view of walkability as influenced by both natural elements and urban infrastructure, capturing how these factors shape perceptions of walkability for both human participants and GPT-4o.

Each image was assigned a unique identifier ranging from 1 to 106 to facilitate randomization and organization. The randomization process was conducted using a Python script, which paired the images to create sets for comparison. This approach was employed to avoid bias in the selection process and to ensure a fair representation of diverse urban environments. As a result, some images were excluded from the final survey: images 25, 32, 58, 59, 72, 84, 88, 96, and 102 were not selected during randomization. This resulted in a final selection of 48 unique image pairs included in the survey. The finalized image set was designed to provide a comprehensive range of urban conditions, capturing both natural and built features that influence walkability perceptions. The systematic pairing ensured that participants evaluated images reflecting real-world variability in urban design, allowing for robust comparisons of human and GPT-4o perceptions of walkability. Table 1 summarizes the number of respondents and image pairs included in the survey and subsequent analysis in the “image numbers” column.
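The authors' pairing script is not published; the following is a minimal sketch, assuming a stdlib shuffle-and-pair scheme, of how such randomization could be done. The function name, fixed seed, and counts are illustrative, not the study's actual code.

```python
import random

def pair_images(image_ids, n_pairs, seed=42):
    """Shuffle image identifiers and draw non-overlapping pairs.

    Identifiers that are never drawn end up excluded from the survey,
    mirroring how some image numbers dropped out during randomization.
    """
    rng = random.Random(seed)  # fixed seed so the pairing is reproducible
    ids = list(image_ids)
    rng.shuffle(ids)
    # Consecutive shuffled ids form a pair; each image appears at most once.
    return [(ids[2 * i], ids[2 * i + 1]) for i in range(n_pairs)]

# Illustrative run: 106 identifiers paired into 48 sets (96 images used).
pairs = pair_images(range(1, 107), n_pairs=48)
used = {img for pair in pairs for img in pair}
print(len(pairs), len(used))
```

Drawing pairs from a single shuffle guarantees no image appears twice, which is one plausible way to obtain the non-overlapping pairs described above.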

Table 1. Number of respondents and image pairs used in different groups of the human survey.

https://doi.org/10.1371/journal.pone.0322078.t001

Participants in the main survey compared pairs of images of urban settings based on six aspects of walkability: overall walkability (ease and attractiveness of walking), feasibility (practicality of walking based on individual and environmental factors), accessibility (how well the area accommodates diverse abilities), safety (perceived security), comfort (pedestrian comfort level), and liveliness (vibrancy of the area). Before conducting the primary survey, a pilot study was carried out to assess the survey design and the definitions of walkability. Participants understood the six key aspects, but minor adjustments were made to improve clarity and ensure consistent interpretation, and the survey was shortened for the main study due to its length. The final survey, consisting of 12 question sets, took about 15–20 minutes to complete, and participants were instructed to respond without using AI tools to ensure authentic responses. S1 Table shows the responses of the human participants for the paired images.

This study was conducted in compliance with the ethical standards set forth by the Institutional Review Board (IRB) of Michigan State University (MSU). Approval was granted under the protocol MSU Study ID: STUDY00010749, covering the period from May 13, 2024, to July 14, 2024. All participants provided written informed consent before participating. The consent process adhered to the guidelines approved by the MSU IRB to ensure participants were fully informed about the objectives of the study, the procedures, and their rights, including the option to withdraw at any time. After obtaining IRB approval, minor modifications were made to the study protocol: the initial survey was shortened based on feedback from the pilot study to improve participant engagement. These modifications did not alter the core research objectives and remained within the scope of the original IRB approval.

Participants were randomly selected through an online survey platform distributed to a wide audience to ensure diversity in responses. The recruitment process did not target specific age groups, professions, or cultural backgrounds, allowing for a broad participant pool. This random selection approach helps mitigate recruitment biases and enhances the generalizability of the findings. While specific demographic targeting was not employed, the survey yielded 174 responses from individuals with diverse characteristics, including a range of ages (from 18 to over 65 years), professions (e.g., students, healthcare workers, educators, and urban planning professionals), and cultural backgrounds, as detailed in section 4.1.

3.1.2. ChatGPT prompting.

Various strategies have been employed to enhance the output of GPT-4o, such as the use of composite images [84], comparing images in pairs [85], or employing multimodal cooperation [86]. Another technique is converting visual information into text using prompts like “What’s in this image?”. This method shows significant potential, especially when processing large volumes of images that appear in a temporal sequence [87]. However, minor variations in prompts can lead to inconsistent outputs [88]. To address this, methods such as self-consistency or bootstrapping re-prompt the model multiple times with different text permutations and aggregate the results; the aggregated output typically achieves higher accuracy than a single prompt [89–91].

In our study, we ensured consistency by utilizing the same pairs of images in both the surveys and the GPT-4o web interface. For example, the image pair (44, 1) was submitted to GPT-4o 15 times, and the model was prompted each time to provide its evaluation of overall walkability. These image pairs were uploaded as composite images into GPT-4o, and self-consistency techniques were used to improve the reliability of the model’s output. By aligning the image pairs used in both the surveys and the GPT-4o web interface, we ensured that the results were directly comparable. The individual images had a size of 4023 × 3024 pixels. GPT-4o treats the image on the left as the first image and the image on the right as the second. We prompted GPT-4o between 15 July and 1 August 2024. Each prompt was issued in a new temporary chat window in GPT-4o. The use of temporary chat windows was crucial to ensure that each prompt was processed independently, avoiding any potential carryover effects or contextual memory from previous interactions. This approach minimized bias and ensured that the model’s output for each prompt was generated without influence from earlier conversations, thereby enhancing the consistency and reliability of the results. In each prompt, we asked GPT-4o to rate the images from 1 to 10 and describe each walkability perception. For example, when asking about overall walkability, we wrote the following prompt: “How do you rate the Walkability of this environment for both photos from 1–10? Based on the photo you rated higher, why do you think it is more Walkable? (Please explain your opinion in at least 20 words). Overall Walkability: This measures the ease and appeal of walking around the area shown in the image.”
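How the 15 repeated evaluations are combined is not specified in code form; the sketch below illustrates one plausible self-consistency aggregation (majority vote on the chosen image, ratings averaged over agreeing runs). The run tuple format, function name, and numbers are assumptions for illustration only.

```python
from collections import Counter
from statistics import mean

def aggregate_runs(runs):
    """Aggregate repeated GPT-4o evaluations of one image pair.

    Each run is (chosen_image, rating_img1, rating_img2). The majority
    choice across runs is kept, and ratings are averaged over the runs
    that agree with that choice -- a simple self-consistency scheme.
    """
    votes = Counter(choice for choice, _, _ in runs)
    winner, n_votes = votes.most_common(1)[0]
    agreeing = [(r1, r2) for choice, r1, r2 in runs if choice == winner]
    return {
        "chosen_image": winner,
        "agreement": n_votes / len(runs),  # fraction of runs voting for winner
        "mean_rating_img1": mean(r1 for r1, _ in agreeing),
        "mean_rating_img2": mean(r2 for _, r2 in agreeing),
    }

# Illustrative: 15 runs where image 2 is chosen 12 times out of 15.
runs = [(2, 6, 8)] * 12 + [(1, 8, 7)] * 3
result = aggregate_runs(runs)
print(result["chosen_image"], round(result["agreement"], 2))  # 2 0.8
```

Averaging only the runs that agree with the majority choice keeps the aggregated ratings consistent with the reported image selection; averaging across all runs would be an equally defensible variant.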

4. Results

4.1. Demographic variables of respondents in the survey

The survey included 174 participants, with a balanced gender representation: 47% female, 50% male, and 2% preferring not to disclose their gender. The sample primarily consisted of younger adults, with 34% aged 18–25 and 33% aged 26–35, while participation decreased in older age groups (15% aged 36–45, 7% aged 46–55, 6% aged 56–65, and 1% over 65). Geographically, the majority of participants resided in urban areas (87%) compared to rural areas (12%). Most respondents were from the United States (61%), followed by Taiwan (15%), with smaller contributions from Jordan, Canada, China, Germany, and other countries (each 4% or less). Walking habits varied among participants: 22% reported walking daily, 27% walked four to five times a week, and 25% walked two to three times a week, while 12% walked once a week and 12% rarely walked. In terms of weekly walking duration, 52% walked for 30 minutes to 1 hour, 39% for less than 30 minutes, and 8% for 1–2 hours, while only 0.5% walked for more than 2 hours.

To maintain the integrity and consistency of the analysis, responses were filtered according to the six specified dimensions of walkability: overall walkability, feasibility, accessibility, safety, comfort, and liveliness. Exclusions were made based on predetermined criteria: responses were deemed “Equal” if participants assigned the same ratings to both images, and “Invalid” if they were incomplete, lacked significant differentiation, or were inconsistent for particular images. Notably, this exclusion process was conducted at the level of individual responses for specific images, rather than dismissing entire participants, thereby preserving valid responses for other perceptions from the same individuals. Following this filtering, the final counts of analyzed responses were 168 for overall walkability, 164 for feasibility, 162 for accessibility, 161 for safety, 165 for comfort, and 162 for liveliness. This selective methodology maximized the contributions of participants, facilitating a thorough comparison of human and GPT-4o responses across all perceptions. The total number of responses from participants for each perception included in the analysis is shown in Table 2.

Table 2. Number of responses of GPT-4o and participants’ responses for each perception.

https://doi.org/10.1371/journal.pone.0322078.t002

4.2. Consistency of GPT-4o responses

GPT-4o’s responses displayed a clear pattern: for 41 of the 48 image pairs, it consistently selected the same image (either the first or the second) across all prompts, while in the remaining 7 pairs it alternated between the two images. To determine the number of chosen responses for analysis, we calculated the similarity index for the generated responses out of the 15 responses. S2 Table shows which image GPT-4o ranked higher and the number of chosen responses out of 15. For example, under the feasibility category, the second image was consistently chosen for image pairs (60, 56), (65, 68), and (23, 16).

The similarity index was calculated by comparing responses, with the first response serving as the baseline reference. Pairwise comparisons were conducted across all 15 responses, starting with the first response compared to the second, followed by the first compared to the third, and so forth, until all responses had been evaluated. This method enabled an assessment of the cumulative consistency of the responses as additional outputs were generated. The same approach was applied uniformly to all responses, up to the final response (response number 15). (S1–S6 Fig) in the supporting information present the similarity index for each perception of walkability relative to the number of generated responses by GPT-4o. Although some fluctuation in the similarity index was observed, the values remained within a relatively narrow range. Based on this stability, 15 responses were selected for further analysis.
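The baseline-comparison scheme described above can be sketched as follows. The study's exact similarity metric for this step is not given in code form, so difflib's character-level ratio serves here as a stdlib stand-in; the function name and example responses are illustrative.

```python
from difflib import SequenceMatcher

def cumulative_consistency(responses):
    """Similarity of every later response to the first (baseline) response.

    Mirrors the scheme of comparing response 1 vs 2, then 1 vs 3, and so
    on; SequenceMatcher.ratio stands in for the study's actual metric.
    """
    baseline = responses[0]
    return [SequenceMatcher(None, baseline, r).ratio() for r in responses[1:]]

# Illustrative GPT-4o outputs for one image pair (not real study data).
runs = [
    "the second image is more walkable due to its wide sidewalk",
    "the second image is more walkable thanks to a wide sidewalk",
    "image two wins because the sidewalk is wider and better maintained",
]
scores = cumulative_consistency(runs)
print([round(s, 2) for s in scores])
```

Plotting these scores against the response index yields exactly the kind of curve shown in S1–S6 Figs: if the values settle into a narrow band, additional prompts add little information, which motivates stopping at 15 responses.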

4.3. Alignment between GPT-4o and human responses

To assess the alignment between the responses from GPT-4o and those from human participants, we matched the responses when GPT-4o and participants chose either image 1 or image 2 in each pair. When the first image was chosen, 21 image pair sets were selected; when the second image was chosen, 29 pairs were selected. The number of participants based on the chosen images is illustrated in Table 2. The results highlighted notable differences in image selection across perceptual categories. In terms of overall walkability, feasibility, accessibility, and safety, both participants and GPT-4o tended to select the second image, with participants choosing it 57%, 55%, 52%, and 50% of the time, respectively, while GPT-4o showed comparable preferences at 62%, 59%, 56%, and 62%. However, a key divergence was observed in the comfort perception, where participants significantly favored the first image (62%), while GPT-4o leaned towards the second image (53%). A similar discrepancy emerged in the liveliness perception. Additionally, participant responses displayed more variability, with some instances of “equal” or “invalid” responses, the latter referring to cases where participants provided identical answers across multiple questions.

4.4. Comparing GPT-4o and human ratings

We compared the ratings between GPT-4o and human participants for two sets of images: Image 1 and Image 2. Independent samples t-tests were conducted to evaluate whether there were significant differences between the ratings assigned by humans and the GPT-4o model. For Image 1, participants gave a mean rating of M = 7.80 (SD = 0.60), while the GPT-4o model assigned a mean rating of M = 7.42 (SD = 0.70). The analysis did not reveal a statistically significant difference between the two, t(N) = 1.91, p = .063, suggesting that the GPT-4o model’s ratings were generally similar to those of human participants for this set of images. For Image 2, participants rated the images with M = 7.65 (SD = 0.81), whereas the GPT-4o model provided ratings with M = 7.76 (SD = 0.47). The t-test indicated no significant difference between the two groups, t(N) = -0.64, p = .526, implying that the ratings assigned by the GPT-4o model were not substantially different from those of the human participants in this case.
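The statistic reported above can be reproduced with a short stdlib sketch of the pooled-variance independent-samples t test; in practice `scipy.stats.ttest_ind` computes the same statistic along with a p-value. The toy rating lists below are illustrative, not the study's data.

```python
import math
from statistics import mean, stdev

def independent_t(x, y):
    """Independent-samples t statistic with pooled variance,
    the test used to compare human and GPT-4o rating distributions."""
    nx, ny = len(x), len(y)
    # Pooled variance combines both samples, weighted by degrees of freedom.
    sp2 = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / nx + 1 / ny))

# Toy ratings on the study's 1-10 scale (illustrative, not the real data).
human_ratings = [8, 9, 10]
gpt_ratings = [7, 8, 9]
t = independent_t(human_ratings, gpt_ratings)
print(round(t, 4))  # 1.2247
```

A |t| this small on three observations per group is far from any conventional significance threshold, which matches the paper's qualitative conclusion that the two rating distributions are statistically indistinguishable.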

4.5. Content analysis

4.5.1. Similarity index between GPT-4o and participants’ responses.

We assessed the alignment between the textual responses of GPT-4o and human participants by examining the average similarity index to compare their reasoning and descriptive alignment. This was applied to all responses regardless of the perception. Text preprocessing steps, such as tokenization and stop-word removal, were applied before vectorization to ensure meaningful comparisons. This approach follows methodologies established in previous studies on text similarity [92].

The textual data was then vectorized, and cosine similarity was used to measure alignment between GPT-4o and human responses. Cosine similarity is a widely accepted metric for quantifying textual similarity by comparing vectorized representations of text, as discussed in [93, 94]. The threshold value for cosine similarity varies depending on the context, dataset, and application. It is often empirically determined or adaptively set to meet the needs of specific tasks, such as clustering, text classification, or similarity searches. Typical thresholds for high similarity range from 0.7 to 0.9, with values closer to 1.0 indicating stronger similarity, particularly in normalized datasets [95,96]. In the context of this study, a score above 0.4 is generally interpreted as moderate alignment, while scores closer to 0.6 or higher suggest stronger alignment.
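The pipeline just described (tokenize, remove stop words, vectorize, compute cosine similarity) can be sketched with the standard library alone. The stop-word list, example sentences, and bag-of-words vectorizer below are illustrative assumptions; the study's actual preprocessing may differ (e.g., TF-IDF weighting).

```python
import math
import re
from collections import Counter

# A tiny illustrative stop-word list; the study's actual list is not published.
STOPWORDS = {"the", "a", "an", "and", "is", "are", "of", "to", "in", "it", "this"}

def preprocess(text):
    """Lowercase, tokenize, and drop stop words before vectorization."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

gpt = preprocess("The wide sidewalk and the greenery make this street comfortable.")
human = preprocess("Wide sidewalk, lots of greenery, a comfortable street to walk.")
score = cosine_similarity(gpt, human)
print(round(score, 3))  # 0.772
```

On the study's scale, a score like this would fall in the moderate-to-strong alignment band (above 0.6), since the two toy sentences share most of their content words despite different phrasing.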

The findings revealed a moderate degree of alignment, with an average similarity score of 0.4575 when the first image was chosen and 0.4615 for the second image. While this indicates that GPT-4o is capable of partially mimicking human decision-making processes, the results suggest that there are notable differences in how the two approach visual tasks, particularly in terms of depth and nuance.

4.5.2. Topic modeling.

We compared the responses from GPT-4o and human participants by looking into the topic modeling results and coherence scores across six categories of perception. Fig 2 shows the number of topics when the first image was selected by both groups. Human participants identified nine topics relating to overall walkability perception, with a coherence score of 0.358. In comparison, GPT-4o generated three topics, but with a slightly lower coherence score of 0.317. Regarding feasibility perception, humans produced five topics with a coherence score of 0.360, whereas GPT-4o identified only two topics, which resulted in a coherence score of 0.283. When it comes to accessibility perception, human participants identified eight distinct topics, achieving a coherence score of 0.392, while GPT-4o generated four topics with a score of 0.351. In terms of safety perception, both GPT-4o and human participants had similar outcomes. Human participants generated four topics with a coherence score of 0.377, whereas GPT-4o identified three topics and attained a slightly lower coherence score of 0.368. Interestingly, GPT-4o excelled in the comfort perception. Although it identified fewer topics (three), it achieved a higher coherence score of 0.378 compared to the four topics and a 0.356 score produced by human participants. For liveliness perception, human participants identified two topics with a notably higher coherence score of 0.458, while GPT-4o generated five topics, but with a lower score of 0.317.

Fig 3 shows the number of topics when the second image was chosen. Both GPT-4o and human participants identified three topics related to walkability, but human responses were more cohesive, achieving a coherence score of 0.363 compared to GPT-4o’s 0.332. In assessing feasibility, humans identified five topics, resulting in a higher coherence score of 0.398, while GPT-4o found three topics with a coherence score of 0.348. For accessibility, human participants were able to identify nine topics with a coherence score of 0.363, while GPT-4o only generated four topics, resulting in a much lower score of 0.277. This suggests that human responses captured a broader spectrum of accessibility-related themes in a more cohesive manner. For safety, GPT-4o’s coherence score was considerably lower (0.285) compared to the human participants’ score of 0.387, indicating that human responses were more coherent and comprehensive. In the comfort category, humans identified ten topics with a coherence score of 0.355, whereas GPT-4o identified seven topics with a slightly lower coherence score of 0.333. Finally, for liveliness, human participants identified ten topics with a coherence score of 0.391, whereas GPT-4o only identified two topics, resulting in a coherence score of 0.297.
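Coherence scores like those reported above quantify how often a topic’s top words co-occur in the underlying responses. As a self-contained illustration, the sketch below computes the simple UMass coherence variant in pure Python; the mini-corpus and topic words are invented examples, and coherence values produced by topic-modeling libraries (such as the commonly used C_v measure) are computed differently, so these numbers are not comparable to the scores reported in this study.

```python
import math

def umass_coherence(topic_words, documents):
    """UMass coherence: sum of log((D(w_i, w_j) + 1) / D(w_j)) over word pairs,
    where D counts how many documents contain the given word(s)."""
    doc_sets = [set(doc.lower().split()) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            w_i, w_j = topic_words[i], topic_words[j]
            score += math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j))
    return score

# Invented mini-corpus of walkability responses and one candidate topic
docs = [
    "wide sidewalk with shade trees",
    "trees give shade along the sidewalk",
    "busy street traffic noise",
]
score = umass_coherence(["sidewalk", "shade", "trees"], docs)  # higher = more coherent
```

A topic whose words rarely co-occur, such as `["sidewalk", "traffic"]` over the same corpus, scores lower, which is the intuition behind comparing coherence across the human and GPT-4o topic sets.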

4.5.3. Top keywords.

When analyzing responses to the first image, notable differences in the top keywords between human participants and GPT-4o reveal distinct approaches to perceiving outdoor spaces. Human responses frequently included experiential and descriptive words like “trees,” “shade,” “people,” and “comfortable,” reflecting a focus on the sensory experience and aesthetic quality of the environment. Humans often described how the space made them feel and how specific natural elements contributed to comfort and liveliness. Words such as “traffic” and “obstacles” further indicated that humans were concerned with practical aspects of safety. In contrast, GPT-4o focused on more structured and functional terms like “pedestrian,” “path,” “mobility,” and “amenities.” This language suggests that GPT-4o approached the image from a more technical perspective, emphasizing the design and infrastructure of the outdoor space, such as the presence of sidewalks, accessibility features, and pedestrian pathways. GPT-4o’s responses centered on how the space functions for movement and public use, with less attention to subjective feelings or sensory details.

In the second image, the differences between human and GPT-4o responses became even more pronounced. Humans continued to focus on the visual and sensory aspects of the environment, with frequent use of terms like “green,” “grass,” “quiet,” and “lively.” These keywords highlight the participants’ attention to natural features and the ambience of the space, indicating a more holistic perception that integrates the physical appearance of the area with its emotional impact. Human participants evaluated the space based on how peaceful or active it seemed, using language that suggested an assessment of overall vibrancy and aesthetic appeal. GPT-4o’s responses, on the other hand, were dominated by keywords such as “urban,” “area,” “street,” and “mobility,” once again emphasizing the functional design of the space. While humans highlighted specific natural features and the emotional atmosphere, GPT-4o remained more objective, concentrating on infrastructure and the practical usability of the environment.
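The keyword comparisons above rest on simple frequency counts over the pooled responses of each group. A minimal sketch of that step follows; the stop-word list and sample responses are illustrative, not the study’s actual data.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "and", "is", "of", "to", "in", "it", "from"}

def top_keywords(responses, k=5):
    # Pool all responses, tokenize, drop stop words, return the k most frequent terms
    counts = Counter()
    for text in responses:
        counts.update(
            w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS
        )
    return [word for word, _ in counts.most_common(k)]

responses = [
    "trees give shade and the street feels comfortable",
    "shade from trees made walking comfortable",
    "trees line the street",
]
keywords = top_keywords(responses, k=3)  # most frequent first: "trees" leads
```

Running the same function separately over human and GPT-4o response pools yields the two keyword lists whose contrast is discussed above.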

5. Discussion

Prior research has shown that machine learning and computer vision methods are effective in analyzing image datasets, including those from Google Street View, to predict various factors such as scene complexity, safety, and socioeconomic conditions [64,97,98]. The emergence of vision-language models, such as ChatGPT [99], PaLM [100], and LLaMA [101], offers new opportunities for evaluating images in more comprehensive ways. Despite their promise, the extent to which these models align with human perception, particularly in urban walkability assessments, remains insufficiently explored. Building on this foundation, our study explored GPT-4o’s potential as a tool for assessing visual walkability in urban environments by comparing its evaluations with those of human participants. Our findings reveal a strong alignment between GPT-4o and human participants in assessing overall walkability, feasibility, accessibility, and safety. However, notable differences emerged in the assessment of comfort and liveliness, where human participants provided more thematically diverse insights, while GPT-4o’s responses were more structured and cohesive, particularly regarding comfort and safety. Similarity scores suggest a moderate level of alignment between GPT-4o’s reasoning and human judgments, highlighting its potential to complement human evaluations of urban walkability with a systematic and focused approach. These findings contribute to ongoing discussions on AI’s role in urban design and suggest promising avenues for integrating AI-driven assessments into urban planning processes.

First, the results in Table 2 indicated that GPT-4o and human participants matched each other in their choice of images for perceptions such as general walkability, feasibility, accessibility, and safety. However, human responses included “equal” answers for some image pairs, reflecting a nuance in human decision-making that GPT-4o did not replicate. This suggests that even in areas where human preferences can be mimicked, GPT-4o’s decision-making may not possess the same complexity and flexibility as human judgment on most subjective perceptions.

Second, while the similarity scores (0.4575 for the first image and 0.4615 for the second) indicated a moderate level of agreement, they also suggest that GPT-4o’s responses do not perfectly mirror human judgments. The slight difference in scores shows that GPT-4o might align more closely with human participants when evaluating certain images, but its reasoning does not fully capture the nuances of human perception. This moderate alignment points to GPT-4o’s potential to assist in tasks that require subjective judgment, such as urban design and walkability assessments, but it also highlights the need for caution when relying solely on AI models for decisions that involve complex human-centered evaluations.

Third, the comparison between GPT-4o and human participants shows that humans consistently identified a wider variety of topics across most perception categories, such as walkability, feasibility, and liveliness, although their responses varied in cohesiveness. In contrast, GPT-4o generated fewer topics but delivered more cohesive and focused responses, especially regarding comfort and safety. This indicates that while humans capture a broader range of themes, GPT-4o provides more structured and streamlined interpretations based on the perception being assessed. Notably, GPT-4o had difficulty with more complex and abstract perceptions like liveliness and accessibility, where it identified fewer topics and had lower coherence scores. On the other hand, human participants showed a more consistent ability to recognize a wide array of factors across different perceptions and images. Therefore, we argue that humans excel in identifying thematic diversity, while GPT-4o is skilled at organizing and simplifying more straightforward perceptions.

Fourth, regarding the top keywords, humans emphasized specific natural features and the emotional atmosphere, while GPT-4o remained more objective, concentrating on infrastructure and the practical usability of the environment. This aligns with earlier research on human perceptions of walkability in urban spaces, which emphasized the importance of physical infrastructure such as sidewalks and road conditions in shaping walkability perceptions [40,102]. In contrast, GPT-4o’s broader focus on the usability of space presents a different approach, one that is more grounded in specific infrastructural details. This divergence mirrors the findings of [60], who argued that while visual elements are essential, the subjective nature of walkability makes it challenging to capture the full scope of human experience through automated tools alone. Our study reinforces this notion by demonstrating that while GPT-4o can provide a consistent overview of environmental quality, it often misses the context-specific insights that human evaluators offer. Integrating human perspectives is therefore needed to fully capture the depth of contextual and experiential details and to understand human emotions in urban spaces [103]. Further supporting the need for a balanced approach, combining GPT-4o with human judgment allows urban planners to benefit from the strengths of both, leading to more effective and responsive urban design solutions.

Overall, this study demonstrates the practical potential of LLMs, such as GPT-4o, to complement human-centered urban planning by providing structured, scalable assessments of walkability. While GPT-4o showed notable alignment with human evaluations in aspects such as overall walkability and feasibility, its applications can extend beyond simple evaluations to play a more dynamic role in urban design processes. For instance, LLMs could be employed in automated urban audits, analyzing street-level imagery to identify infrastructure gaps such as the absence of pedestrian crossings, narrow sidewalks, or insufficient greenery. This capability could save significant time and resources, particularly for large-scale urban projects. Another promising application lies in scenario simulations, where planners could upload mock-up designs or proposed changes to urban spaces and receive AI-driven feedback on how these alterations might influence walkability indicators such as comfort, accessibility, or safety. Additionally, LLMs could enhance public engagement by acting as intermediaries in community participation initiatives, translating technical urban design elements into more accessible language for residents, helping stakeholders better understand proposed plans and prioritize elements that align with public sentiment.

Despite these promising applications, the limitations of GPT-4o must be acknowledged, particularly in addressing cultural and personal factors that heavily influence human judgments of walkability. Cultural norms shape perceptions of aspects such as safety, liveliness, and accessibility differently across regions. For instance, a vibrant urban space in one cultural context might be perceived as chaotic or unsafe in another. Similarly, personal preferences, including mobility needs or aesthetic values, add layers of subjectivity that GPT-4o struggles to capture without explicit input. These limitations stem from GPT-4o’s reliance on text and image data, which, while powerful, cannot fully account for experiential or emotional connections to urban spaces. To address these subjective differences, future studies should incorporate more diverse datasets that reflect the cultural and geographic variability of urban spaces. Additionally, GPT-4o’s capabilities could be enhanced through multimodal data integration, including pedestrian movement patterns, audio cues, and climate data, to better simulate human sensory experiences and contextualize walkability evaluations. However, even with these advancements, human-centered design decisions require nuanced, context-specific insights that LLMs cannot replicate. LLMs should therefore be seen as a tool to augment human expertise, particularly in subjective and culturally sensitive areas of urban planning, rather than as a replacement for human judgment. Incorporating studies that compare different urban typologies, such as high-density versus low-density areas, could provide a new understanding of the role of built-environment characteristics. Furthermore, exploring the application of LLM-driven tools in collaborative design processes with stakeholders, including urban planners, architects, and community members, could bridge the gap between data-driven models and practical implementation.
In addition, the results emphasize GPT-4o’s proficiency in delivering structured and coherent assessments; however, they also reveal certain limitations, particularly regarding its ability to grasp more abstract and subjective metrics such as liveliness and accessibility. These perceptions are intricately linked to experiential and contextual elements that may not be entirely quantifiable with the existing LLM models and methodologies utilized in this research. Recognizing these limitations is essential to maintain realistic expectations for LLM applications within urban studies. Additionally, the images used in this study were taken in urban areas of Michigan, which may limit the applicability of the findings to other regions with distinct urban characteristics. The reliance on specific image types and the exclusion of certain human responses, such as biased or equal ratings, may have restricted the depth of the analysis. Future research should address these constraints to deepen our understanding of the role of LLMs in urban planning.

6. Conclusion

In conclusion, our results showed that while GPT-4o can help measure people’s perceptions of urban walkability, it cannot fully replicate the depth of human perception. The integration of LLMs into urban planning should be approached with caution, ensuring that these tools are used to augment, rather than replace, human perceptions. By refining LLM algorithms and incorporating human feedback, there is potential to develop more effective and responsive tools for urban analysis, ultimately leading to better design. However, this study has several limitations: the relatively small and demographically homogeneous sample may not fully capture the diversity of urban walkability perceptions across different populations. Our main conclusion, therefore, is that humans should retain final authority in decision-making. Rather than seeking to replace urban planners, future research should concentrate on creating customized LLM solutions for urban studies.

Supporting information

S1 Table. Human participants’ responses to paired images.

https://doi.org/10.1371/journal.pone.0322078.s001

(DOCX)

S2 Table. GPT-4V responses to paired images.

https://doi.org/10.1371/journal.pone.0322078.s002

(DOCX)

S1 Fig. Similarity Index of GPT-4V Responses Regarding Overall Walkability Perception.

https://doi.org/10.1371/journal.pone.0322078.s003

(TIF)

S2 Fig. Similarity Index of GPT-4V Responses Regarding Feasibility Perception.

https://doi.org/10.1371/journal.pone.0322078.s004

(TIF)

S3 Fig. Similarity Index of GPT-4V Responses Regarding Accessibility Perception.

https://doi.org/10.1371/journal.pone.0322078.s005

(TIF)

S4 Fig. Similarity Index of GPT-4V Responses Regarding Safety Perception.

https://doi.org/10.1371/journal.pone.0322078.s006

(TIF)

S5 Fig. Similarity Index of GPT-4V Responses Regarding Comfort Perception.

https://doi.org/10.1371/journal.pone.0322078.s007

(TIF)

S6 Fig. Similarity Index of GPT-4V Responses Regarding Liveliness Perception.

https://doi.org/10.1371/journal.pone.0322078.s008

(TIF)

Acknowledgments

This study was enriched by collaborative discussions with members of the HealthScape Lab at Michigan State University. We are also deeply thankful to Tai-Quan Peng and Chun-Yen Chang for their insightful feedback and contributions during the review of the manuscript. We gratefully acknowledge Michigan State University for offering the infrastructure and resources necessary to carry out this research. Our heartfelt thanks also go to the editorial team and the anonymous reviewers for their valuable comments and suggestions on the earlier draft of this paper.

References

  1. 1. Wedyan M, Saeidi-Rizi F. Assessing the impact of walkability indicators on health outcomes using machine learning algorithms: A case study of Michigan. Travel Behaviour and Society. 2025;39:100983.
  2. 2. Saelens BE, Handy SL. Built environment correlates of walking: A review. Med Sci Sports Exerc. 2008;40(7 Suppl):S550 pmid:18562973
  3. 3. Giles-Corti B, et al. Encouraging walking for transport and physical activity in children and adolescents: How important is the built environment? Sports Medicine. 2009;39: 995–1009.
  4. 4. Pivo G, Fisher JD. The walkability premium in commercial real estate investments. Real Estate Economics. 2011;39(2):185–219.
  5. 5. Maniei H, et al. The influence of urban design performance on walkability in cultural heritage sites of Isfahan, Iran. Land. 2024;13(9):1523.
  6. 6. Marshall JD, Brauer M, Frank LD. Healthy neighborhoods: Walkability and air pollution. Environ Health Perspect. 2009;117(11):1752–1759. pmid:20049128
  7. 7. Fonseca F, Ribeiro PJG, Conticelli E, Jabbari M, Papageorgiou G, Tondelli S, et al. Built environment attributes and their influence on walkability. Int J Sustain Transp. 2022;16(7):660–679.
  8. 8. Li Y, Yabuki N, Fukuda T. Measuring visual walkability perception using panoramic street view images, virtual reality, and deep learning. Sustain Cities Soc. 2022;86: 104140.
  9. 9. Huang G, Yu Y, Lyu M, Sun D, Zeng Q, Bart D. Using Google Street View panoramas to investigate the influence of urban coastal street environment on visual walkability. Environ Res Commun. 2023;5(6):065017.
  10. 10. Vo DC, Kim J. Exploring perceived walkability in one-way commercial streets: An application of 360° immersive videos. PloS One. 2024;19(12): e0315828.
  11. 11. Biswas G, Roy TK. Measuring objective walkability from pedestrian-level visual perception using machine learning and GSV in Khulna, Bangladesh. Geomatics and Environmental Engineering. 2023;17(6).
  12. 12. Ki D. A novel walkability index using google street view and deep learning. Sustainable Cities and Society. 2023;99:104896.
  13. 13. Li Y, Yabuki N, Fukuda T. Integrating GIS, deep learning, and environmental sensors for multicriteria evaluation of urban street walkability. Landsc Urban Plan. 2023;230:104603.
  14. 14. Kang Y, Kim J, Park J, Lee J. Assessment of perceived and physical walkability using street view images and deep learning technology. ISPRS Int J Geo-Inf. 2023;12(5):186.
  15. 15. Devlin J, et al. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018.
  16. 16. Sun C, Huang L, Qiu X. Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. arXiv preprint arXiv:1903.09588. 2019.
  17. 17. Zhu J, et al. Incorporating bert into neural machine translation. arXiv preprint arXiv:2002.06823. 2020.
  18. 18. Milička J, Marklová A, VanSlambrouck K, Pospíšilová E, Šimsová J, Harvan S, et al. Large language models are able to downplay their cognitive abilities to fit the persona they simulate. PLoS One. 2024;19(3):e0298522. pmid:38478522
  19. 19. Gorenz D, Schwarz N. How funny is ChatGPT? A comparison of human-and AI-produced jokes. 2024
  20. 20. Lund BD, Wang T. Chatting about ChatGPT: How may AI and GPT impact academia and libraries?. Library Hi Tech News. 2023;40(3):26–29.
  21. 21. Biswas SS. Role of Chat GPT in public health. Ann Biomed Eng. 2023;51(5):868–869. pmid:36920578
  22. 22. Sallam M. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. Healthcare. 2023. MDPI.
  23. 23. Brin D, et al. Assessing GPT-4 multimodal performance in radiological image analysis. Eur Radiol. 2024:1–7
  24. 24. Waisberg E, Ong J, Masalkhi M, Zaman N, Sarker P, Lee AG, et al. GPT-4 and medical image analysis: strengths, weaknesses and future directions. Journal of Medical Artificial Intelligence. 2023;6:29–29.
  25. 25. Korngiebel DM, Mooney SD. Considering the possibilities and pitfalls of Generative Pre-trained Transformer 3 (GPT-3) in healthcare delivery. NPJ Digit Med. 2021;4(1):93. pmid:34083689
  26. 26. Li Y, Li J, He J, Tao C. AE-GPT: Using large language models to extract adverse events from surveillance reports – A use case with influenza vaccine adverse events. PLoS One. 2024;19(3):e0300919. pmid:38512919
  27. 27. Wang J, Zhu Z, Liu C, Li R, Wu X. LLM-Enhanced multimodal detection of fake news. PloS One. 2024;19(10):e0312240. pmid:39446867
  28. 28. Zhang J, et al. Graph-to-tree learning for solving math word problems. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.
  29. 29. Kasneci E, Sessler K, Küchemann S, Bannert M, Dementieva D, Fischer F, et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences. 2023;103:102274.
  30. 30. Frederico GF. ChatGPT in supply chains: initial evidence of applications and potential research agenda. Logistics. 2023;7(2):26.
  31. 31. Mich L, Garigliano R. ChatGPT for e-Tourism: a technological perspective. Information Technology & Tourism. 2023;25(1): 1–12.
  32. 32. Biswas S. Importance of chat GPT in agriculture: According to chat GPT. Available from: SSRN 4405391. 2023.
  33. 33. Lee S, et al. Can large language models estimate public opinion about global warming? An empirical assessment of algorithmic fidelity and bias. PLOS Climate. 2024;3(8):e0000429.
  34. 34. Yang J, Fricker P, Jung A. From intangible to tangible: The role of big data and machine learning in walkability studies. Computers, Environment and Urban Systems. 2024;109:102087.
  35. 35. Feng J, et al. CityBench: Evaluating the capabilities of large language model as world model. arXiv preprint arXiv:2406.13945. 2024.
  36. 36. Braveman P, Gottlieb L. The social determinants of health: It’s time to consider the causes of the causes. Public Health Rep. 2014;129(2):19–31. pmid:24385661
  37. 37. Duncan DT, Aldstadt J, Whalen J, Melly SJ, Gortmaker SL. Validation of Walk Score® for estimating neighborhood walkability: an analysis of four US metropolitan areas. Int J Environ Res Public Health. 2011;8(11):4160–4179. pmid:22163200
  38. 38. Abley S, Hill E. Designing living streets – a guide to creating lively, walkable neighbourhoods. 2005.
  39. 39. Cerin E, et al. Destinations that matter: associations with walking for transport. Health & Place. 2007;13(3): 713–724.
  40. 40. Ewing R, Handy S. Measuring the unmeasurable: urban design qualities related to walkability. J Urban Des. 2009;14(1):65–84.
  41. 41. Wang H, Yang Y. Neighbourhood walkability: a review and bibliometric analysis. Cities. 2019;93:43–61.
  42. 42. Wang W, et al. Exploring determinants of pedestrians’ satisfaction with sidewalk environments: case study in Korea. J Urban Plan Dev. 2012;138(2): 166–172.
  43. 43. Lee E, Dean J. Perceptions of walkability and determinants of walking behaviour among urban seniors in Toronto, Canada. J Transp Health. 2018;9: 309–320.
  44. 44. Gan Z, Yang M, Zeng Q, Timmermans HJP. Associations between built environment, perceived walkability/bikeability and metro transfer patterns. Transp Res A: Policy Pract. 2021;153:171–187.
  45. 45. Koohsari MJ, McCormack GR, Shibata A, Ishii K, Yasunaga A, Nakaya T, et al. The relationship between walk score® and perceived walkability in ultrahigh density areas. Prev Med Rep. 2021;23:101393. pmid:34123713
  46. 46. Alfonzo MA. To walk or not to walk? The hierarchy of walking needs. Environ Behav. 2005. 37(6): 808–836.
  47. 47. Speck J. Walkable city: how downtown can save America, One Step At A Time. 2013: Macmillan.
  48. 48. Gehl J. Cities for People. 2013: Island press.
  49. 49. Hasan MM, Oh J-S, Kwigizile V. Exploring the trend of walkability measures by applying hierarchical clustering technique. J Transp Health. 2021;22:101241.
  50. 50. Park S, Deakin E, Lee JS. Perception-based walkability index to test impact of microlevel walkability on sustainable mode choice decisions. Transp Res Rec. 2014. 2464(1): 126–134.
  51. 51. Kelly CE, et al. A comparison of three methods for assessing the walkability of the pedestrian environment. J Transp Geography. 2011. 19(6): 1500–1508.
  52. 52. Hasan MM, Oh J-S, Kwigizile V. Exploring the relationship between human walking trajectories and physical-visual environmental: an application of artificial intelligence and spatial analysis. 2021.
  53. 53. Frank LD, et al. The development of a walkability index: application to the neighborhood quality of life study. British Journal of Sports Medicine. 2010;44(13): 924–933.
  54. 54. Gu P, et al. Using open source data to measure street walkability and bikeability in China: a case of four cities. Transportation research record. 2018;2672(31) 63–75.
  55. 55. Lu Y. The association of urban greenness and walking behavior: using google street view and deep learning techniques to estimate residents’ exposure to urban greenness. Int J Environ Res Public Health. 2018;15(8):1576. pmid:30044417
  56. 56. Lu Y. Using Google Street View to investigate the association between street greenery and physical activity. Landsc Urban Plan. 2019;191: 103435.
  57. 57. Xue F, et al. Personalized walkability assessment for pedestrian paths: An as-built BIM approach using ubiquitous augmented reality (AR) smartphone and deep transfer learning. in Proceedings of the 23rd International Symposium on the Advancement of Construction Management and Real Estate, Guiyang, China. 2018.
  58. 58. Biljecki F, Ito K. Street view imagery in urban analytics and GIS: a review. Landsc Urban Plan. 2021;215:104217.
  59. 59. Wedyan M, Saeidi-Rizi F. Assessing the impact of urban environments on mental health and perception using deep learning: a review and text mining analysis. J Urban Health. 2024;101(2):327–343. pmid:38466494
  60. 60. Zhou H, He S, Cai Y, Wang M, Su S. Social inequalities in neighborhood visual walkability: Using street view imagery and deep learning technologies to facilitate healthy city planning. Sustain Cities Soc. 2019;50:101605.
  61. 61. Wu C, Peng N, Ma X, Li S, Rao J. Assessing multiscale visual appearance characteristics of neighbourhoods using geographically weighted principal component analysis in Shenzhen, China. Comput Environ Urban Syst. 2020;84:101547.
  62. 62. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
  63. 63. Dai L, Zheng C, Dong Z, Yao Y, Wang R, Zhang X, et al. Analyzing the correlation between visual space and residents’ psychology in Wuhan, China using street-view images and deep-learning technique. City Environ Interac. 2021;11:100069.
  64. 64. Zhang F, et al. Measuring human perceptions of a large-scale urban region using machine learning. Landsc Urban Plan. 2018;180:148–160.
  65. 65. Song J, et al. The effect of eye-level street greenness exposure on walking satisfaction: the mediating role of noise and PM2. 5. Urban Forestry & Urban Greening. 2022. 77: 127752.
  66. 66. Tang Y, et al. Exploring the impact of built environment attributes on social followings using social media data and deep learning. ISPRS Int J Geo-Inf. 2022;11(6):325.
  67. 67. Feng J, et al. CityGPT: empowering urban spatial cognition of large language models. arXiv preprint arXiv:2406.13948. 2024.
  68. 68. OpenAI R. Gpt-4 technical report. arxiv 2303.08774. View in Article. 2023;2:13.
  69. 69. Johnson O, Mohammed Alyasiri O, Akhtom D, Johnson OE. Image analysis through the lens of ChatGPT-4. Journal of Applied Artificial Intelligence. 2023;4(2).
  70. 70. Ding N, et al. Can large pre-trained models help vision models on perception tasks? arXiv preprint arXiv:2306.00693. 2023.
  71. 71. Driessen T, et al. Putting ChatGPT Vision (GPT-4V) to the test: risk perception in traffic images. 2023.
  72. 72. Dale R. GPT-3: What’s it good for? Nat Lang Eng. 2021;27(1):113–118.
  73. 73. Lecler A, Duron L, Soyer P. Revolutionizing radiology with GPT-based models: Current applications, future possibilities and limitations of ChatGPT. Diagn Interv Imaging. 2023;104(6):269–274. pmid:36858933
  74. 74. Aydın Ö, Karaarslan E. OpenAI ChatGPT generated literature review: digital twin in healthcare. Available at SSRN 4308687. 2022.
  75. 75. Kitamura FC. ChatGPT is shaping the future of medical writing but still requires human judgment. Radiological Society of North America. 2023:e230171.
76. Gilson A, et al. How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Medical Education. 2023;9(1):e45312.
77. Kung TH, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digital Health. 2023;2(2):e0000198.
78. Hu M, et al. Advancing medical imaging with language models: a journey from n-grams to ChatGPT. arXiv preprint arXiv:2304.04920. 2023.
79. Lu P, et al. MathVista: evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255. 2023.
80. Lehnert K. AI insights into theoretical physics and the Swampland program: a journey through the cosmos with ChatGPT. arXiv preprint arXiv:2301.08155. 2023.
81. Kortemeyer G. Could an artificial-intelligence agent pass an introductory physics course? Phys Rev Phys Educ Res. 2023;19(1):010132.
82. West CG. AI and the FCI: can ChatGPT project an understanding of introductory physics? arXiv preprint arXiv:2303.01067. 2023.
83. Guo S, et al. Semantic communications with ordered importance using ChatGPT. arXiv preprint arXiv:2302.07142. 2023.
84. Li Y, et al. A comprehensive evaluation of GPT-4V on knowledge-intensive visual question answering. arXiv preprint arXiv:2311.07536. 2023.
85. Zhang X, et al. GPT-4V(ision) as a generalist evaluator for vision-language tasks. arXiv preprint arXiv:2311.01361. 2023.
86. Ye Q, et al. mPLUG-Owl2: revolutionizing multi-modal large language model with modality collaboration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
87. Liu Y, et al. Rec-GPT4V: multimodal recommendation with large vision-language models. arXiv preprint arXiv:2402.08670. 2024.
88. Huang J, et al. GPT-4V takes the wheel: evaluating promise and challenges for pedestrian behavior prediction. arXiv preprint arXiv:2311.14786. 2023.
89. Tabone W, de Winter J. Using ChatGPT for human-computer interaction research: a primer. R Soc Open Sci. 2023;10(9):231053. pmid:37711151
90. Tang R, et al. Found in the middle: permutation self-consistency improves listwise ranking in large language models. arXiv preprint arXiv:2310.07712. 2023.
91. Wang X, et al. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171. 2022.
92. Mikolov T. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013.
93. Lahitani AR, Permanasari AE, Setiawan NA. Cosine similarity to determine similarity measure: study case in online essay assessment. In: 2016 4th International Conference on Cyber and IT Service Management. IEEE; 2016.
94. Vijaymeena M, Kavitha K. A survey on similarity measures in text mining. Machine Learning and Applications: An International Journal. 2016;3(2):19–28.
95. Xia P, Zhang L, Li F. Learning similarity with cosine similarity ensemble. Information Sciences. 2015;307:39–52.
96. Kryszkiewicz M. Determining cosine similarity neighborhoods by means of the Euclidean distance. In: Rough Sets and Intelligent Systems - Professor Zdzisław Pawlak in Memoriam: Volume 2. 2013:323–345.
97. Dubey A, et al. Deep learning the city: quantifying urban perception at a global scale. In: Computer Vision - ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I. Springer; 2016.
98. Fan Z, Zhang F, Loo BPY, Ratti C. Urban visual intelligence: uncovering hidden city profiles with street view images. Proc Natl Acad Sci U S A. 2023;120(27):e2220417120. pmid:37364096
99. Bahrini A, et al. ChatGPT: applications, opportunities, and threats. In: 2023 Systems and Information Engineering Design Symposium (SIEDS). IEEE; 2023.
100. Chowdhery A, et al. PaLM: scaling language modeling with Pathways. Journal of Machine Learning Research. 2023;24(240):1–113.
101. Touvron H, et al. LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971. 2023.
102. Southworth M. Designing the walkable city. J Urban Plann Dev. 2005;131(4):246–257.
103. Malekzadeh M, et al. Urban visual appeal according to ChatGPT: contrasting AI and human insights. arXiv preprint arXiv:2407.14268. 2024.