Evaluating a large language model’s ability to solve programming exercises from an introductory bioinformatics course

  • Stephen R. Piccolo,

    Roles Conceptualization, Formal analysis, Investigation, Resources, Software, Visualization, Writing – original draft, Writing – review & editing

    stephen_piccolo@byu.edu

    Affiliation Department of Biology, Brigham Young University, Provo, Utah, United States of America

  • Paul Denny,

    Roles Conceptualization, Writing – original draft

    Affiliation School of Computer Science, The University of Auckland, Auckland, New Zealand

  • Andrew Luxton-Reilly,

    Roles Conceptualization, Writing – original draft

    Affiliation School of Computer Science, The University of Auckland, Auckland, New Zealand

  • Samuel H. Payne,

    Roles Resources, Writing – original draft, Writing – review & editing

    Affiliation Department of Biology, Brigham Young University, Provo, Utah, United States of America

  • Perry G. Ridge

    Roles Resources, Writing – review & editing

    Affiliation Department of Biology, Brigham Young University, Provo, Utah, United States of America

Abstract

Computer programming is a fundamental tool for life scientists, allowing them to carry out essential research tasks. However, despite various educational efforts, learning to write code can be a challenging endeavor for students and researchers in life-sciences disciplines. Recent advances in artificial intelligence have made it possible to translate human-language prompts to functional code, raising questions about whether these technologies can aid (or replace) life scientists’ efforts to write code. Using 184 programming exercises from an introductory-bioinformatics course, we evaluated the extent to which one such tool—OpenAI’s ChatGPT—could successfully complete programming tasks. ChatGPT solved 139 (75.5%) of the exercises on its first attempt. For the remaining exercises, we provided natural-language feedback to the model, prompting it to try different approaches. Within 7 or fewer attempts, ChatGPT solved 179 (97.3%) of the exercises. These findings have implications for life-sciences education and research. Instructors may need to adapt their pedagogical approaches and assessment techniques to account for these new capabilities that are available to the general public. For some programming tasks, researchers may be able to work in collaboration with machine-learning models to produce functional code.

Author summary

Life scientists frequently write computer code when doing research. Computer programming can aid researchers in performing tasks that are not supported by existing tools. Programming can also help researchers to implement analytical logic in a way that documents their steps and thus enables others to repeat those steps. Many educational resources are available to teach computer programming, but this skill remains challenging for many researchers and students to master. Artificial-intelligence tools like OpenAI’s ChatGPT are able to interpret human-language requests to generate code. Accordingly, we evaluated the extent to which this technology might be used to perform programming tasks described by humans. To evaluate ChatGPT, we used requirements specified for 184 programming exercises taught in an introductory bioinformatics course at the undergraduate level. Within 7 or fewer attempts, ChatGPT solved 179 (97.3%) of the exercises. These findings suggest that some educators may need to reconsider how they evaluate students’ programming abilities, and researchers might be able to collaborate with such tools in research settings.

Introduction

For decades, the life-sciences community has called for researchers to gain a greater awareness of “computing in all its forms” [1]. This need is now greater than ever. A 2016 survey of principal investigators from diverse biology disciplines revealed that almost 90% of researchers had a current or impending need to use computational methods [2]. Computers can help researchers formalize scientific processes [3], accelerate research progress [4], improve job prospects, and even learn biological concepts [5]. These opportunities have motivated the creation of interdisciplinary training programs, courses, workshops, and tutorials to teach computing skills in a life-sciences context [2,6–12]. In some circumstances, it is sufficient for researchers to understand computing concepts and learn to use existing tools; in others, learning to write computer code is invaluable [13]. A 2011 survey of scientists from many disciplines (other than computer science) found that researchers spent 35% of their time, on average, writing code [14]. Computer programming makes it possible to complete tasks not supported by existing tools, interface with software libraries, adapt algorithms based on custom needs, tidy data, and more [15–17]. In these applied scenarios, computer programs are often small [13] and used for only one project.

Scripting languages are well suited to such tasks because researchers can focus on high-level needs and worry less about memory management, code efficiency, and other technical details [15]. Python, a scripting language, has gained much acceptance among scientists [14] and programming educators [18], perhaps due to its relatively simple syntax [19] and the availability of libraries supporting common tasks [20–24]. However, learning to program is a daunting challenge for many researchers. Decades of research have sought to characterize common errors and identify effective ways for novices to learn programming skills [25–30]; much remains to be discovered.

Recent advances in artificial intelligence have shown promise for converting natural-language descriptions of programming tasks to functional code [31,32]. The first such large language models (LLMs) fine-tuned to generate code that captured widespread interest were OpenAI’s Codex and DeepMind’s AlphaCode [33,34]. These models were trained on millions of code examples, representing diverse programming tasks. In November 2022, OpenAI released ChatGPT, which uses an LLM fine-tuned with human feedback to generate natural dialogue-based text and code [35]. Researchers have speculated about whether such models may be able to aid researchers, or even replace their efforts, on basic programming tasks. For more complicated projects, LLMs might be able to assist in writing or debugging portions of the code. If successful in these settings, LLMs could reduce life scientists’ time spent on programming, leaving more time for other research tasks.

We undertook a study to assess the extent to which an LLM can solve basic computer-programming tasks. By understanding LLMs’ current capabilities and limits [36], we sought to gain perspective on their potential usefulness in life-sciences education and research. We used ChatGPT because it 1) was released recently, 2) is accessible with a Web browser, 3) can interact with users in a conversational manner, and 4) has garnered considerable attention among academics, industry competitors, and the general public [37–43]. By January 2023, ChatGPT had over 100 million active users [44].

We evaluated and documented ChatGPT’s effectiveness on Python-programming exercises from an introductory-bioinformatics course taught primarily to undergraduates. We evaluated how well ChatGPT could interpret the prompts and respond to human feedback to generate functional Python code. Here we describe quantitative and qualitative aspects of ChatGPT’s performance, describe ways that ChatGPT could aid life scientists in research, and discuss implications for teaching and assessing students’ programming capabilities in an educational context.

Methods

Programming exercises

Since 2012, Brigham Young University has offered an introductory bioinformatics course, Introduction to Bioinformatics. The course was designed for novice programmers who have an interest in biology. One learning outcome for the course is that “students will be able to create computer scripts in the Python programming language to manipulate biological data stored in diverse file formats.” To facilitate skill development, the instructors created Python programming exercises, which serve as formative assessments. We used online datasets, articles, and tools to create the exercises [45–54]. Six exercises in the second assignment were derived from an online course [55]. To our knowledge, none of the other exercises were in the public domain at the time of our experiment; thus, it is unlikely that they were used to train the LLM before our testing. We used the “Jan 30 Version Free Research Preview” version of ChatGPT, which used version 3.5 of the Generative Pre-trained Transformer (GPT) model.

The exercises are organized into 19 assignments, each designed to teach a particular concept. Students complete the assignments and exercises in a defined sequence. The assignments cover 1) relatively simple tasks like declaring and using variables, performing mathematical calculations, and writing conditional statements; 2) medium-difficulty tasks like working with strings, lists, loops, dictionaries, and files; and 3) more advanced tasks like writing regular expressions, manipulating tabular data, and creating data visualizations. Other assignments give students practice with techniques that they learned in previous assignments. At specific times throughout the course, students complete additional programming exercises as summative assessments (exams), culminating in an end-of-course summative assessment. We excluded the summative assessments from this study.

The programming exercises are delivered via CodeBuddy, a Web-based application that acts as an automated grader (https://github.com/srp33/CodeBuddy). For each exercise, students receive a prompt describing the problem’s context and requirements. The prompt sometimes includes basic code that students can use as a starting point. Each exercise has at least one test, including inputs and expected outputs. When applicable, the inputs consist of data file(s) provided within the prompt. The expected outputs may be text based (n = 179) or image based (n = 5). To generate the expected outputs, the instructor provides a solution; CodeBuddy executes the code and stores the output. The output of the student’s code must match the expected output exactly. Students can make multiple attempts, as needed, without penalty. In many cases, instructors provide test(s) for which the inputs and/or expected outputs are hidden; this helps to prevent students from writing code that does not address the stated requirements. We excluded these tests from the study to maintain consistency with what students see; we manually verified whether ChatGPT-generated code met the requirements.
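The exact-match rule described above can be illustrated with a minimal Python sketch. This is not CodeBuddy's implementation: the function names `run_solution` and `passes_test` are hypothetical, and the sketch assumes that each submission is a standalone script whose only output is text printed to standard output.

```python
import contextlib
import io


def run_solution(code: str) -> str:
    """Execute a script and return everything it printed to standard output.

    Illustration only: exec() on untrusted code is unsafe, so a real grader
    would run submissions in an isolated environment.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {"__name__": "__main__"})
    return buffer.getvalue()


def passes_test(student_code: str, instructor_code: str) -> bool:
    """Pass only when the student's output equals the instructor's exactly."""
    expected = run_solution(instructor_code)  # stored as the expected output
    actual = run_solution(student_code)
    return actual == expected


# Hypothetical submissions for an exercise whose expected output is "5".
instructor = "print(5)"
exact = "print(2 + 3)"
verbose = "print('Number of worms in the last box:', 5)"

print(passes_test(exact, instructor))    # True
print(passes_test(verbose, instructor))  # False
```

Under this rule, a submission with correct logic still fails when it adds a label or changes the formatting of the output.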

We used openpyxl (version 3.1.0) [56] to create a spreadsheet with information about each exercise. One column contains the prompt for each exercise, including the instructions and a summary of each test. Fig 1 shows an example of how the prompts were structured. For image-based tests, we did not include the expected outputs because ChatGPT does not accept images as input. To make the prompts more understandable to ChatGPT—and to mimic what students or researchers might do—we added natural-language transitions between each section of the prompt. Other columns in the spreadsheet include the instructors’ solutions and flags indicating whether each exercise was biology oriented. Many exercise prompts provide biology-based scenarios such that a basic understanding of biology concepts is helpful when interpreting the prompts.

Fig 1. Example prompt for a programming exercise delivered to ChatGPT.

https://doi.org/10.1371/journal.pcbi.1011511.g001

Evaluation approach

We initiated a conversation with ChatGPT for each assignment. For one exercise at a time, we copied the prompt into ChatGPT’s Web-based interface. To assess functional correctness, we copied ChatGPT’s generated code into CodeBuddy. If the code did not pass all the tests, we continued the conversation with ChatGPT. In these interactions, we took the stance of a naive programmer who wishes to obtain functional code but is not necessarily able to provide detailed feedback about the code itself. We allowed ChatGPT a maximum of ten attempts per exercise. As we interacted with ChatGPT, we used the spreadsheet to record the dates of the interactions, ChatGPT’s generated code (its final attempt), the number of passed tests, the number of attempts made by ChatGPT, and comments describing our interactions. When our interactions with ChatGPT suggested that a prompt lacked clarity, we slightly modified the prompt and updated the spreadsheet accordingly.
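The evaluation procedure above can be summarized as a loop. The following is a minimal sketch under stated assumptions, not the authors' tooling: in the study a person pasted prompts into the ChatGPT Web interface and wrote the feedback by hand, whereas here `generate`, `passes_tests`, and `feedback_for` are hypothetical callables standing in for the model, the automated grader, and the human, respectively.

```python
from typing import Callable, Optional, Tuple

MAX_ATTEMPTS = 10  # attempt limit stated in the text


def evaluate_exercise(
    prompt: str,
    generate: Callable[[str], str],
    passes_tests: Callable[[str], bool],
    feedback_for: Callable[[str], str],
    max_attempts: int = MAX_ATTEMPTS,
) -> Tuple[bool, int, Optional[str]]:
    """Return (solved, attempts_used, final_code) for one exercise."""
    message = prompt
    code = None
    for attempt in range(1, max_attempts + 1):
        code = generate(message)       # ask the model for code
        if passes_tests(code):         # check it with the grader
            return True, attempt, code
        message = feedback_for(code)   # otherwise continue the conversation
    return False, max_attempts, code


# Stand-in "model" for demonstration only: it succeeds on its third reply.
replies = iter(["print('five')", "print(5.0)", "print(5)"])
solved, attempts, final_code = evaluate_exercise(
    prompt="Print the number of worms in the last box.",
    generate=lambda message: next(replies),
    passes_tests=lambda code: code == "print(5)",
    feedback_for=lambda code: "The output did not match the expected output.",
)
print(solved, attempts, final_code)  # True 3 print(5)
```

Recording the attempt count and the final code for each exercise mirrors the columns that the text describes in the tracking spreadsheet.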

After completing our evaluation of ChatGPT, Google released Bard, a Web-based application that uses an LLM to generate text and code. We tested 66 of the 184 exercises using Bard (version: 2023.06.07). These exercises consisted of the first and last exercises in each assignment and all exercises in the following assignments: “01—Declaring and Converting Variables,” “07—Problem Solving,” “13—Advanced Functions & Additional Practice,” and “19—Additional Practice.”

When executing the Python code, we used version 3.8 of Python. To generate the manuscript figures and perform statistical analyses, we used the R statistical software (version 4.0.2) and the tidyverse packages (version 1.3.2) [57,58]. All statistical tests were two sided.

Results

After our filtering steps, 184 Python programming exercises were available for testing. ChatGPT successfully solved 139 (75.5%) of the exercises on its first attempt. When it was unsuccessful on the first attempt, we engaged in a dialog with ChatGPT, allowing up to 10 interactions. Table 1 summarizes these interactions. In 26 instances, we indicated that the code had resulted in a runtime error, and we provided the error message to ChatGPT. More commonly, the generated code’s output did not match the expected output, either due to a logic error (n = 44) or a simple formatting issue (n = 17). In many cases—typically after three or more interactions for a given exercise—we restated the original prompt (n = 30), a modified version of the prompt (n = 11), or simply asked the model to try again. Rarely (n = 7), we provided a suggestion about the code itself (e.g., to change a function name or to use a different parameter). Still, we never provided code with our feedback.

Table 1. Summary of interactions between the human user and ChatGPT.

For the exercises that ChatGPT failed to solve on the first attempt, we categorized the subsequent interactions between the human user and the model. This table indicates the frequency of each interaction type.

https://doi.org/10.1371/journal.pcbi.1011511.t001

Of the 45 exercises that did not pass on the first attempt, ChatGPT solved 27 within 1 or 2 subsequent attempts. Within 7 or fewer attempts total, ChatGPT solved 179 (97.3%) of the exercises (Fig 2). As summarized in Table 2, the five unsolved exercises are delivered in the middle or end of the course; each requires students to combine multiple types of programming skills. One of these is the course’s final exercise; the instructors’ solution uses 61 lines of code, nearly twice as many as any other solution. For the remaining exercises that failed, ChatGPT came close to passing the tests. Its solutions resulted in logic errors or runtime errors or produced outputs that did not match the expected outputs exactly.

Fig 2. Number of ChatGPT iterations per exercise.

For each exercise prompt, we gave ChatGPT up to 10 attempts at generating a code solution that successfully passed the tests. The counts above each bar represent the number of exercises that required a particular number of attempts.

https://doi.org/10.1371/journal.pcbi.1011511.g002

Table 2. Summary of exercises that ChatGPT did not solve.

ChatGPT failed to solve 5 of the exercises within 10 attempts. This table summarizes characteristics of these exercises and provides a brief summary of complications that ChatGPT faced when attempting to solve them.

https://doi.org/10.1371/journal.pcbi.1011511.t002

We used statistics to understand more about the scenarios in which ChatGPT either succeeded or failed. First, we used the length of the instructors’ solutions as an indicator for difficulty level. After removing comments (inline descriptions of how the code works) and blank lines, we compared the number of lines of code between the exercises that ChatGPT solved and those that it did not (Fig 3). The median for passing solutions was 6, and the median for non-passing solutions was 7; this difference was not statistically significant (Mann-Whitney U p-value: 0.2836). The lengths of the instructors’ solutions were significantly (positively) correlated with the lengths of ChatGPT’s solutions (Fig 4), both for the number of characters (Spearman’s rho = 0.89; p-value < 0.001) and the number of lines (Spearman’s rho = 0.83; p-value < 0.001). Another indicator of difficulty level is the exercise-prompt length. For passing solutions, the median was 2019 characters, while the median was 9115 for non-passing solutions. Although this difference was not statistically significant (Mann-Whitney U p-value: 0.1021), it is consistent with a recent study of computer science exercises [32].

Fig 3. Lines of Python code per instructor solution.

Course instructors provided a solution for each exercise. This plot illustrates the number of lines of code for each solution, after removing comment lines.

https://doi.org/10.1371/journal.pcbi.1011511.g003

Fig 4. Comparison of code-solution lengths for instructor solutions versus ChatGPT solutions.

This illustrates the relationship between A) the number of characters or B) the number of lines of code, for each exercise, after removing comment lines. The dashed, red line is the identity line.

https://doi.org/10.1371/journal.pcbi.1011511.g004

The number of attempts provides additional insight into ChatGPT’s capabilities but should be cautiously interpreted because ChatGPT exhibits stochasticity. Whether ChatGPT provides a correct answer on the first or a later attempt, eventual success shows that its probabilistic model can aid users. However, a smaller number of attempts might suggest an ability to formulate a valid response more readily, thus requiring less time by the user. The number of attempts per exercise was significantly correlated with the length of the instructors’ solution (rho = 0.234; p = 2.2e-16) and the length of the prompt (rho = 0.31; p = 2.2e-16). These correlations held, whether or not we considered the five exercises that ChatGPT failed to solve.
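For readers unfamiliar with rank correlation, the sketch below shows how Spearman's rho can be computed: rank each variable (averaging the ranks of ties) and take the Pearson correlation of the ranks. The study's analyses were run in R; this standard-library Python version is only an illustration, and the data values are made up rather than taken from the study.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: lines in the instructor's solution vs. attempts needed.
solution_lines = [3, 5, 6, 8, 12, 20]
attempts_needed = [1, 1, 2, 1, 3, 5]
print(spearman_rho(solution_lines, attempts_needed))
```

Because the attempt counts are small integers with many ties, averaging tied ranks matters for data of this kind.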

Of the 184 prompts, 98 (53.3%) were framed in a biological context. Of the five exercises that ChatGPT did not solve, four were framed in a biological context (Fisher’s exact test p-value = 0.37). The median length (characters) of biology-oriented prompts was 3203, whereas the median was 1437 for the remaining prompts (Mann-Whitney U p-value = 1.6e-09). In the course, we frequently use biological data (e.g., genome sequences, medical observations, narrative text) to teach analysis skills and make the exercises more authentic. We included these data so that ChatGPT could evaluate the files’ structure. For 24 exercises, the prompt size exceeded the maximum allowed by ChatGPT. After we truncated the data to the first few lines, ChatGPT was successful at solving all of these exercises. On four other occasions, we shortened parts of the prompt as we interacted with ChatGPT to attempt to provide clarity. For example, we shortened the descriptions of how the code would be tested. ChatGPT eventually solved two of these four exercises.
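The Fisher's exact test reported above can be reproduced from the counts in the text. The sketch below is a standard-library Python illustration (the study itself used R); the 2x2 table is derived from the reported totals: 4 of the 5 unsolved exercises and 94 of the 179 solved exercises were biology oriented. The function name `fisher_exact_two_sided` is our own.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(k: int) -> float:
        # Hypergeometric probability that the top-left cell equals k.
        return comb(row1, k) * comb(row2, col1 - k) / comb(total, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = prob(a)
    tolerance = 1e-7  # relative slack for floating-point ties
    return sum(
        p
        for p in (prob(k) for k in range(lowest, highest + 1))
        if p <= observed * (1 + tolerance)
    )


# Rows: unsolved, solved. Columns: biology oriented, not biology oriented.
p_value = fisher_exact_two_sided(4, 1, 94, 85)
print(round(p_value, 2))  # 0.37
```

With only five unsolved exercises, the test has little power, so the non-significant result does not rule out an effect of biological framing.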

We note additional challenges that ChatGPT faced when interpreting the programming prompts. On 17 exercises, ChatGPT used correct logic but produced outputs that were different from the expected outputs (for example, “Number of worms in the last box: 5” instead of “5”). Eventually, ChatGPT solved all of these exercises. On 25 exercises, ChatGPT generated code that produced logic errors; it eventually solved 20 of these exercises. On 10 exercises, ChatGPT generated code that produced runtime errors (exceptions); it eventually solved 8 of these exercises. On two exercises, ChatGPT generated passing code that did not directly address the prompt. For example, in one case, the prompt called for using a regular expression (text-based pattern matching), but ChatGPT used iteration logic instead; we marked these exercises as passing because the automatic grader did not verify which type of logic they used. On five occasions, we noted parts of the prompt that may have been ambiguous. We clarified these prompts; subsequently, ChatGPT solved four of these exercises.
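The regular-expression case illustrates a limit of grading by output alone. In the hypothetical task below (not one of the course exercises), a regular expression and a plain loop both report the start positions of a motif in a DNA sequence; because their outputs are identical, an output-matching grader cannot tell which technique was used. The sequence, motif, and function names are invented for illustration.

```python
import re
from typing import List


def positions_with_regex(sequence: str, motif: str) -> List[int]:
    """Find motif start positions using a regular expression."""
    return [match.start() for match in re.finditer(re.escape(motif), sequence)]


def positions_with_loop(sequence: str, motif: str) -> List[int]:
    """Find motif start positions by checking every window of the sequence."""
    width = len(motif)
    return [
        i
        for i in range(len(sequence) - width + 1)
        if sequence[i:i + width] == motif
    ]


sequence = "CCATGAAATGTTTAGATGC"  # made-up DNA sequence
motif = "ATG"

print(positions_with_regex(sequence, motif))  # [2, 7, 15]
print(positions_with_loop(sequence, motif))   # [2, 7, 15]
```

The two functions agree here because "ATG" cannot overlap itself; for self-overlapping motifs, `re.finditer` skips overlapping matches and the outputs could differ. Verifying that a submission uses a required technique would need a check on the code itself, not only on its output.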

In using ChatGPT to solve these programming problems, we observed several practical issues that may impact the value offered by ChatGPT to researchers and students. For 13 exercises that did not pass on the first attempt, we asked ChatGPT to try an alternative approach, and/or we re-delivered the original prompt. In these cases, we sought to take advantage of its stochastic nature, perhaps resulting in code that used a considerably different strategy. ChatGPT eventually passed 11 of these 13 exercises. For 41 (22.3%) exercises in total, ChatGPT used at least one programming technique that would have been unfamiliar to most students in the course. Many of these techniques are never taught in the course, whereas others are introduced in later units. Finally, following an unsuccessful first attempt for six exercises, ChatGPT generated code that did not address the original prompt. Conceivably, the model “forgot” earlier parts of the conversation or was “distracted” by subsequent inputs. Eventually, it solved all of these exercises.

Although the focus of this study was not to compare LLMs, we wished to approximate how well our findings would generalize to another LLM. We evaluated Google Bard’s code-generation ability for 66 of the Python exercises. Bard solved 33 (50.0%) within one attempt and 45 (68.2%) within 10 attempts (Fig 5).

Fig 5. Number of Bard iterations per exercise.

For 66 exercise prompts, we gave Google Bard up to 10 attempts at generating a code solution that successfully passed the tests. The counts above each bar represent the number of exercises that required a particular number of attempts.

https://doi.org/10.1371/journal.pcbi.1011511.g005

Discussion

These findings demonstrate that modern LLMs can solve many basic Python programming tasks, often in a biological context. On educational assessments requiring basic programming skills, students might seek help from LLMs when it is available. Additionally, in some settings, researchers might be able to rely on LLMs’ abilities to translate natural-language descriptions to code. Researchers have already begun to explore this capability in practice [59]. We anticipate that as the models evolve, students and researchers will increasingly author programming prompts in addition to code.

Anecdotally, we have found that authoring programming prompts is not always easy. During our evaluations, communicating with the model was cognitively taxing at times. In addition, these conversations were sometimes awkward. Although ChatGPT can retain a memory of previous interactions, its default response was to provide a solution; in many cases, it might have been more helpful for ChatGPT to request additional information or clarification regarding the problem. It was often more effective for us to restate the original prompt than engage in a back-and-forth dialog. On a positive note, ChatGPT was exceptionally effective at determining which parts of a given prompt were most informative; for example, it seemed to identify relevant aspects of biological context and ignore extraneous details. Currently, LLMs do not execute code; thus, they often cannot predict the output of code [60]. This is one area in which human feedback remains critical.

For 60+ years, researchers have been working to automate program synthesis [61,62]. Recent efforts have focused on training neural networks on large code repositories [33,34,63,64]. Our results show that ChatGPT represents a considerable advance compared to prior models. Chen, et al. [33] evaluated Codex’s ability to solve short- to medium-length programming exercises (median solution = 5.5 lines of code). When delivering the prompts, they used “docstrings” (structured descriptions of functions). Codex was successful for 28.8% of these exercises in a single attempt; when making 100 attempts per exercise, it solved 77.5% of the exercises [33]. In an additional study, Austin, et al. used a different set of exercises that were either mathematical or focused on core programming skills [60]. In contrast to Chen, et al., they used natural-language prompts (one or a few sentences). The solutions had a median length of 5 lines. Using various LLMs, they solved as many as 83.8% of the mathematical problems and 60% of the remaining problems (within 100 attempts). For a subset of the problems, they provided human-language feedback to the models (up to four interactions); maximum accuracy was 65%. In an educational context, Finnie-Ansley, et al. showed that Codex could solve 82.6% of programming exercises from an introductory computer-science course within 10 attempts and that the model would have ranked among the top quartile of students in the course [31]. Finally, in a similar approach to ours, Denny, et al. used Copilot (a development environment plug-in powered by OpenAI’s Codex model) to solve 166 programming exercises designed for novice computer science students [65]. For problems that initially failed, they observed similar improvements in the model’s performance through natural-language modifications to the prompts. However, only 80% of the problems were ultimately solved. 
Given that the problems they analyzed were also designed for novices, the superior performance we observed may be due to improvements in the models themselves in the intervening six months.

Aside from the use of ChatGPT, our work differs from prior evaluations in scope and context. Previous studies used exercises that evaluated the models’ abilities to solve mathematical problems or to use core programming skills like processing lists, processing strings, or evaluating integer sequences. Our exercises required similar techniques and higher-level tasks like parsing data files, writing data files, creating graphics, and using external Python packages. Furthermore, more than half of our exercises were framed in a biological context. LLMs may be most helpful for routine tasks that appear frequently in training sets and only need to be modified for a particular purpose; however, our results show that LLMs can be used in new and diverse contexts as well.

Our findings have important implications for education. Unless LLMs demonstrate an ability to replace all human programming efforts, it will remain necessary for students (and others) to gain programming skills [66]. In our course, preventing students from using LLMs on formative assessments (homework) would be impossible. However, summative assessments (exams) are the primary way we determine students’ final grades; these assessments are invigilated, and students cannot access the Internet. Therefore, we retain confidence in the validity of grades determined under secure assessment conditions. Extensive practice is a critical part of learning to write code [67,68]. Thus, if students rely on LLMs to generate answers to formative assessments without first devising their own solutions, they may be more likely to perform poorly on summative assessments. Indeed, over-reliance by novices was a key risk identified by Chen et al. when releasing the Codex model [33]. With the ease of use and wide availability of tools like Copilot and ChatGPT, novices may quickly learn to rely on auto-suggested solutions without thinking about the computational steps involved—or reading problem statements carefully. Furthermore, if students copy and paste code without understanding it—as has been observed for an online forum [69]—they may underperform on summative assessments. One way that instructors could counter this behavior is to generate student-specific questions about their code. Lehtinen, et al. used an LLM to generate multiple-choice questions about code that students had submitted in an introductory-programming course [70]. Students who struggled to answer these questions were more likely to perform poorly or drop the course. LLMs also provide opportunities to make learning processes more efficient. For example, to aid students on programming exercises, the instructor could allow access to an LLM, which could act as an intelligent tutor. 
When a student struggled to complete a given exercise, the tutor could ingest the student’s code and the exercise requirements and offer suggestions [71–73]. Doing so may reduce the need for instructors or teaching assistants to provide help. Finally, instructors might be able to use LLMs when creating new exercises to evaluate whether their prompts are clear.

It remains essential to have a human in the loop to evaluate the outputs of LLMs. Students and researchers must be competent at code comprehension and code evaluation. LLMs often produce code that does not meet the stated requirements; additionally, edge cases may not be specified as part of prompts. As a (simple) validation, we compared ChatGPT’s output against the expected outputs that we had defined before beginning our study. This approach aligns with the educational context that we considered (an introductory course). However, in subsequent courses and the “real world,” other types of validation would be necessary. Educators may need to shift pedagogical practice toward ensuring that students can understand code that has been generated, evaluating whether generated code meets specifications, debugging generated code, adapting code to different library versions, etc. Fig 6 provides recommendations on how to use LLMs effectively in an educational or research context.

We deliberately chose to allow ChatGPT up to 10 attempts to solve each exercise. Firstly, this criterion aligns with our pedagogical approach. The exercises we tested are formative. Accordingly, failing, receiving feedback, and re-attempting are part of the learning process [67,68,74]. Secondly, allowing multiple attempts reflects how biologists could use LLMs in research. If LLM-generated code does not function correctly on the first attempt, the researcher could ask the model to revise or generate a new solution. Thirdly, allowing multiple attempts per exercise is consistent with what others have reported [31,60].

Our study has several limitations. We applied one particular version of one LLM to all 184 exercises. We applied a second LLM (Google Bard) to a subset of the exercises. We do not know how our findings would generalize to other models or versions; however, the performance of LLM-based code generators will likely continue to improve as model sizes increase. The programming exercises we evaluated do not necessarily represent skills that would be taught in other introductory bioinformatics courses or used broadly in bioinformatics research. We used the Python programming language; our findings might not generalize to other languages. Future studies can shed additional light on how LLMs might be helpful for bioinformatics education and research.

Another limitation is that our evaluation process was subjective. When the initially generated solutions did not pass, the human user judged which types of feedback would be most helpful in each interaction. Other users would have interacted differently with the models. Furthermore, the human user was not a student but an instructor with 25 years of programming experience and 15 years of Python experience. In our attempt to mimic novice programmers, we rarely suggested tweaks to the code (Table 1); perhaps students would have described problematic aspects of generated code more (or less) frequently. Additionally, students might have provided more (or less) context to the models about runtime errors that occurred.

In this study, we provide evidence that dialog-based LLMs, such as ChatGPT, can aid in solving basic programming exercises, with or without biological relevance. However, despite generally excellent performance, much remains to be learned about the extent to which these models can replace human programming efforts. In an authentic research setting, where an auto-grader cannot provide instant feedback on the correctness of model-generated code, there is a risk that relying on the models' outputs may produce erroneous results. Nevertheless, our findings have important implications for educators and researchers who seek to incorporate programming skills into their work. With the help of machine-learning models, instructors may be able to provide more personalized and efficient feedback to students, and researchers might be able to accelerate their work.

Acknowledgments

Brandon Pickett, Justin Miller, Corinne Sexton, Ashlyn Powell, and Eric Upton-Rowley contributed to programming exercises that were used in this study.

References

  1. Beynon RJ. CABIOS editorial. Bioinformatics. 1985 Jan;1(1):1–1.
  2. Barone L, Williams J, Micklos D. Unmet needs for analyzing biological big data: A survey of 704 NSF principal investigators. PLOS Computational Biology. 2017 Oct;13(10):e1005755. pmid:29049281
  3. Guzdial M. Teaching computing for everyone. J Comput Sci Coll. 2006 Apr;21(4):6.
  4. Baker M. Scientific computing: Code alert. Nature. 2017;541(7638):563–5.
  5. Guzdial M. Computational thinking and using programming to learn. In: Learner-centered design of computing education: Research on computing for everyone. Cham: Springer International Publishing; 2016. p. 37–51.
  6. Zatz MM. Bioinformatics training in the USA. Brief Bioinform. 2002;3(4):353–60. pmid:12511064
  7. Kulkarni-Kale U, Sawant S, Chavan V. Bioinformatics education in India. Brief Bioinform. 2010;11(6):616–25. pmid:20705754
  8. Welch L, Lewitter F, Schwartz R, Brooksbank C, Radivojac P, Gaeta B, et al. Bioinformatics Curriculum Guidelines: Toward a Definition of Core Competencies. PLOS Computational Biology. 2014 Mar;10(3):e1003496. pmid:24603430
  9. Williams JJ, Teal TK. A vision for collaborative training infrastructure for bioinformatics. Ann N Y Acad Sci. 2017;1387(1):54–60. pmid:27603332
  10. Mulder N, Schwartz R, Brazas MD, Brooksbank C, Gaeta B, Morgan SL, et al. The development and application of bioinformatics core competencies to improve bioinformatics training and education. PLOS Computational Biology. 2018 Feb;14(2):e1005772. pmid:29390004
  11. Shaffer JG, Mather FJ, Wele M, Li J, Tangara CO, Kassogue Y, et al. Expanding Research Capacity in Sub-Saharan Africa Through Informatics, Bioinformatics, and Data Science Training Programs in Mali. Front Genet. 2019;10.
  12. Attwood TK, Blackford S, Brazas MD, Davies A, Schneider MV. A global perspective on evolving bioinformatics and data science training needs. Brief Bioinform. 2019;20(2):398–404. pmid:28968751
  13. Sayres MAW, Hauser C, Sierk M, Robic S, Rosenwald AG, Smith TM, et al. Bioinformatics core competencies for undergraduate life sciences education. PLOS ONE. 2018 Jun;13(6):e0196878. pmid:29870542
  14. Prabhu P, Jablin TB, Raman A, Zhang Y, Huang J, Kim H, et al. A survey of the practice of computational science. In: State Pract Rep. New York, NY, USA: Association for Computing Machinery; 2011. p. 1–2. (SC ‘11).
  15. Ekmekci B, McAnany CE, Mura C. An Introduction to Programming for Bioscientists: A Python-Based Primer. PLOS Computational Biology. 2016 Jun;12(6):e1004867. pmid:27271528
  16. Wickham H. Tidy Data. J Stat Softw. 2014;59(10).
  17. Dasu T, Johnson T. Exploratory data mining and data cleaning. John Wiley & Sons; 2003.
  18. Simon, Mason R, Crick T, Davenport JH, Murphy E. Language Choice in Introductory Programming Courses at Australasian and UK Universities. In: Proc 49th ACM Tech Symp Comput Sci Educ. New York, NY, USA: Association for Computing Machinery; 2018. p. 852–7. (SIGCSE ‘18).
  19. Fourment M, Gillings MR. A comparison of common programming languages used in bioinformatics. BMC Bioinformatics. 2008 Feb;9(1):82. pmid:18251993
  20. Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25(11):1422–3. pmid:19304878
  21. McKinney W. Data Structures for Statistical Computing in Python. In: Proc 9th Python Sci Conf. 2010. p. 6.
  22. Walt S van der, Colbert SC, Varoquaux G. The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science & Engineering. 2011 Mar;13(2):22–30.
  23. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
  24. Géron A. Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. 1st ed. O’Reilly Media, Inc.; 2017.
  25. Perkins DN, Hancock C, Hobbs R, Martin F, Simmons R. Conditions of learning in novice programmers. J Educ Comput Res. 1986;2(1):37–55.
  26. Kelleher C, Pausch R. Lowering the barriers to programming: A taxonomy of programming environments and languages for novice programmers. ACM Comput Surv. 2005 Jun;37(2):83–137.
  27. Lahtinen E, Ala-Mutka K, Järvinen HM. A study of the difficulties of novice programmers. SIGCSE Bull. 2005 Jun;37(3):14–8.
  28. Luxton-Reilly A, Albluwi I, Becker BA, Giannakos M, Kumar AN, Ott L, et al. Introductory programming: A systematic literature review. In: Proc Companion 23rd Annu ACM Conf Innov Technol Comput Sci Educ. 2018. p. 55–106.
  29. Smith R, Rixner S. The Error Landscape: Characterizing the Mistakes of Novice Programmers. In: Proc 50th ACM Tech Symp Comput Sci Educ. New York, NY, USA: Association for Computing Machinery; 2019. p. 538–44. (SIGCSE ‘19).
  30. Becker BA, Denny P, Pettit R, Bouchard D, Bouvier DJ, Harrington B, et al. Compiler Error Messages Considered Unhelpful: The Landscape of Text-Based Programming Error Message Research. In: Proc Work Group Rep Innov Technol Comput Sci Educ. New York, NY, USA: Association for Computing Machinery; 2019. p. 177–210. (ITiCSE-WGR ‘19).
  31. Finnie-Ansley J, Denny P, Becker BA, Luxton-Reilly A, Prather J. The Robots Are Coming: Exploring the Implications of OpenAI Codex on Introductory Programming. In: Proc 24th Australas Comput Educ Conf. New York, NY, USA: Association for Computing Machinery; 2022. p. 10–9. (ACE ‘22).
  32. Finnie-Ansley J, Denny P, Luxton-Reilly A, Santos EA, Prather J, Becker BA. My AI Wants to Know if This Will Be on the Exam: Testing OpenAI’s Codex on CS2 Programming Exercises. In: Proc 25th Australas Comput Educ Conf. New York, NY, USA: Association for Computing Machinery; 2023. p. 97–104. (ACE ‘23).
  33. Chen M, Tworek J, Jun H, Yuan Q, Pinto HP de O, Kaplan J, et al. Evaluating Large Language Models Trained on Code [Internet]. arXiv; 2021 [cited 2023 Feb 17]. Available from: https://arxiv.org/abs/2107.03374
  34. Li Y, Choi D, Chung J, Kushman N, Schrittwieser J, Leblond R, et al. Competition-level code generation with AlphaCode. Science. 2022 Dec;378(6624):1092–7. pmid:36480631
  35. ChatGPT: Optimizing Language Models for Dialogue. OpenAI. [Cited 2023 September 19]. Available from https://openai.com/blog/chatgpt.
  36. Hendler J. Understanding the limits of AI coding. Science. 2023;379(6632):548–8. pmid:36758097
  37. van Dis EAM, Bollen J, Zuidema W, van Rooij R, Bockting CL. ChatGPT: Five priorities for research. Nature. 2023 Feb;614(7947):224–6. pmid:36737653
  38. Thorp HH. ChatGPT is fun, but not an author. Science. 2023 Jan;379(6630):313–3. pmid:36701446
  39. Kung TH, Cheatham M, Medenilla A, Sillos C, Leon LD, Elepaño C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health. 2023 Feb;2(2):e0000198. pmid:36812645
  40. Jiao W, Wang W, Huang J, Wang X, Tu Z. Is ChatGPT A Good Translator? A Preliminary Study. 2023 Jan; Available from https://arxiv.org/abs/2301.08745.
  41. Liebrenz M, Schleifer R, Buadze A, Bhugra D, Smith A. Generating scholarly content with ChatGPT: Ethical challenges for medical publishing. The Lancet Digital Health. 2023 Feb;0(0). pmid:36754725
  42. Jalil S, Rafi S, LaToza TD, Moran K, Lam W. ChatGPT and Software Testing Education: Promises & Perils [Internet]. arXiv; 2023 [cited 2023 Feb 17]. Available from: https://arxiv.org/abs/2302.03287
  43. Elias J. Google is asking employees to test potential ChatGPT competitors, including a chatbot called ‘Apprentice Bard’. CNBC. https://www.cnbc.com/2023/01/31/google-testing-chatgpt-like-chatbot-apprentice-bard-with-employees.html; 2023.
  44. Hu K. ChatGPT sets record for fastest-growing user base—analyst note. Reuters. 2023 Feb; Available from https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01.
  45. Arel-Bundock V. Rdatasets: A collection of datasets originally distributed in various R packages. 2023.
  46. DiCiccio TJ, Efron B. Bootstrap confidence intervals. Stat Sci. 1996;11(3):189–228.
  47. Gire SK, Goba A, Andersen KG, Sealfon RS, Park DJ, Kanneh L, et al. Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science. 2014;345(6202):1369–72.
  48. UniProt Consortium. UniProt: A worldwide hub of protein knowledge. Nucleic Acids Res. 2019;47(D1):D506–15. pmid:30395287
  49. Random Name Generator Generated full names. Random Lists. Available from https://www.randomlists.com/random-names.
  50. Ellis MJ, Gillette M, Carr SA, Paulovich AG, Smith RD, Rodland KK, et al. Connecting genomic alterations to cancer biology with proteomics: The NCI Clinical Proteomic Tumor Analysis Consortium. Cancer Discov. 2013;3(10):1108–12. pmid:24124232
  51. Edwards NJ, Oberti M, Thangudu RR, Cai S, McGarvey PB, Jacob S, et al. The CPTAC data portal: A resource for cancer proteomics research. J Proteome Res. 2015;14(6):2707–13. pmid:25873244
  52. Lindgren CM, Adams DW, Kimball B, Boekweg H, Tayler S, Pugh SL, et al. Simplified and unified access to cancer proteogenomic data. J Proteome Res. 2021;20(4):1902–10. pmid:33560848
  53. Huang LS, Mathew B, Li H, Zhao Y, Ma SF, Noth I, et al. The mitochondrial cardiolipin remodeling enzyme lysocardiolipin acyltransferase is a novel target in pulmonary fibrosis. Am J Respir Crit Care Med. 2014;189(11):1402–15. pmid:24779708
  54. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: Archive for functional genomics data sets - update. Nucleic Acids Research. 2012 Nov;41(D1):D991–5. pmid:23193258
  55. White E. Programming for Biologists. Available from http://www.programmingforbiologists.org.
  56. Hunt J, Hunt J. Working with excel files. Adv Guide Python 3 Program. 2019;249–55.
  57. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2020.
  58. Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, et al. Welcome to the tidyverse. J Open Source Softw. 2019;4(43):1686.
  59. Owens B. How Nature readers are using ChatGPT. Nature. 2023 Feb; pmid:36807343
  60. Austin J, Odena A, Nye M, Bosma M, Michalewski H, Dohan D, et al. Program Synthesis with Large Language Models [Internet]. arXiv; 2021 [cited 2023 Feb 17]. Available from: https://arxiv.org/abs/2108.07732
  61. Simon HA. Experiments with a Heuristic Compiler. J ACM. 1963 Oct;10(4):493–506.
  62. Manna Z, Waldinger RJ. Toward automatic program synthesis. Commun ACM. 1971 Mar;14(3):151–65.
  63. Feng Z, Guo D, Tang D, Duan N, Feng X, Gong M, et al. Codebert: A pre-trained model for programming and natural languages. ArXiv Prepr ArXiv200208155 [Internet]. 2020; Available from: https://arxiv.org/abs/2002.08155
  64. Clement CB, Drain D, Timcheck J, Svyatkovskiy A, Sundaresan N. PyMT5: Multi-mode translation of natural language and Python code with transformers. ArXiv Prepr ArXiv201003150 [Internet]. 2020; Available from: https://arxiv.org/abs/2010.03150
  65. Denny P, Kumar V, Giacaman N. Conversing with Copilot: Exploring Prompt Engineering for Solving CS1 Problems Using Natural Language. In: Proc 54th ACM Tech Symp Comput Sci Educ V 1. New York, NY, USA: Association for Computing Machinery; 2023. p. 1136–42. (SIGCSE 2023).
  66. Yellin DM. The Premature Obituary of Programming. Commun ACM. 2023 Jan;66(2):41–4.
  67. Robins AV, Margulieux LE, Morrison BB. Cognitive sciences for computing education. Camb Handb Comput Educ Res. 2019;231–75.
  68. Denny P, Luxton-Reilly A, Craig M, Petersen A. Improving complex task performance using a sequence of simple practice tasks. In: Proc 23rd Annu ACM Conf Innov Technol Comput Sci Educ. New York, NY, USA: Association for Computing Machinery; 2018. p. 4–9. (ITiCSE 2018).
  69. López-Nores M, Blanco-Fernández Y, Bravo-Torres JF, Pazos-Arias JJ, Gil-Solla A, Ramos-Cabrer M. Experiences from placing Stack Overflow at the core of an intermediate programming course. Comput Appl Eng Educ. 2019;27(3):698–707.
  70. Lehtinen T, Haaranen L, Leinonen J. Automated Questionnaires About Students’ JavaScript Programs: Towards Gauging Novice Programming Processes. In: Proc 25th Australas Comput Educ Conf. New York, NY, USA: Association for Computing Machinery; 2023. p. 49–58. (ACE ‘23).
  71. Crow T, Luxton-Reilly A, Wuensche B. Intelligent tutoring systems for programming education: A systematic review. In: Proc 20th Australas Comput Educ Conf. 2018. p. 53–62.
  72. MacNeil S, Tran A, Hellas A, Kim J, Sarsa S, Denny P, et al. Experiences from Using Code Explanations Generated by Large Language Models in a Web Software Development E-Book. In: Proc 54th ACM Tech Symp Comput Sci Educ V 1. New York, NY, USA: Association for Computing Machinery; 2023. p. 931–7. (SIGCSE 2023).
  73. Sarsa S, Denny P, Hellas A, Leinonen J. Automatic Generation of Programming Exercises and Code Explanations Using Large Language Models. In: Proc 2022 ACM Conf Int Comput Educ Res—Vol 1. New York, NY, USA: Association for Computing Machinery; 2022. p. 27–43. (ICER ‘22; vol. 1).
  74. Shute VJ. Focus on formative feedback. Rev Educ Res. 2008;78(1):153–89.