Small changes, big gains: A curriculum-wide study of teaching practices and student learning in undergraduate biology

A growing body of evidence has shown that active learning has a considerable advantage over traditional lecture for student learning in undergraduate STEM classes, but there have been few large-scale studies to identify the specific types of activities that have the greatest impact on learning. We therefore undertook a large-scale, curriculum-wide study to investigate the effects of time spent on a variety of classroom activities on learning gains. We quantified classroom practices and related these to student learning, assessed using diagnostic tests written by over 3700 students, across 31 undergraduate biology classes at a research-intensive university in the Pacific Northwest. The most significant positive predictor of learning gains was the use of group work, supporting the findings of previous studies. Strikingly, we found that the addition of worksheets as an active learning tool for in-class group activities had the strongest impact on diagnostic test scores. This particular low-tech activity promotes student collaboration, develops problem solving skills, and can be used to inform the instructor about what students are struggling with, thus providing opportunities for valuable and timely feedback. Overall, our results indicate that group activities with low barriers to entry, such as worksheets, can result in significant learning gains in undergraduate science.


Introduction
It is well-established that active, student-centered classrooms in undergraduate STEM education improve student outcomes compared to traditional lecture. These positive effects of active learning have been documented within individual courses; active approaches in high-enrollment introductory courses show improvements in student learning, engagement, attendance, attitudes, and retention in a course or program [1][2][3][4][5][6]. This pattern is remarkably consistent despite considerable variability in approaches to active learning and the magnitude of the impact of these approaches across courses. Freeman et al.'s [7] meta-analysis of over 200 published STEM studies indicates that active learning improved student performance and decreased failure rates irrespective of discipline, class size, course level, and instructor experience. However, the degree to which different active learning methods relate to variation in student success remains an open question. In this study, we explore the effectiveness of different active learning tools by examining how they relate to student success. In practice, "active learning" in STEM education encompasses a wide variety of approaches that include collaborative learning, team-based learning, think-pair-share, and peer instruction [8]. While these techniques may include different tools (e.g., personal response systems such as i>Clickers, paper worksheets), most include a considerable amount of group work. When implementing activities from the literature, instructors often adapt in-class approaches to suit their own classrooms and teaching style. Thus, to understand the impact of varied classroom practices on student outcomes, it is essential to understand the variety of activities that occur in a typical lecture period in real classrooms.
With few exceptions [2,4], most of the STEM literature that focuses on the use of active learning approaches is based upon instructor self-reports, qualitative surveys, or indirect observations of active learning. These classroom measures are usually not generalizable and can be inaccurate [2,9,10,11]. Recent classroom observation tools such as the Classroom Observation Protocol for Undergraduate STEM (COPUS) [12] and the Practical Observation Rubric To Assess Active Learning (PORTAAL) [13] have been developed to provide a systematic and quantifiable estimate of the diversity of classroom practices and use of class time. These tools allow for a quantitative measure of the time spent on different activities in the classroom and objective comparisons across classes within or among courses. For higher-level comparisons among courses, data can be clustered to represent broad instructional styles across a continuum of approaches ranging from instructor-centred to student-centred [14].
Assessment of the success of different active learning approaches requires quantifiable and comparable measures not only of classroom activities, but of student learning as well. Multiple-choice conceptual inventory tests, or diagnostic instruments, are broadly used tools that allow for objective measurements of student thinking that are independent of course-specific quizzes or examinations [15][16][17]. Because of their informative power and ease of implementation, an abundance of rigorously evaluated conceptual inventory tests are readily available in the published literature (e.g., [15,16]). When combined with direct classroom observation, change in student performance on these inventories can provide robust evidence for the impact of various active learning practices.
In this work, we aim to characterize instructional styles in use across a range of biology courses at a large research institution, and to investigate the relationships between student learning and specific teaching practices without experimental manipulation of class activities. Student learning outcomes are measured by comparing pre- and post-course performance on conceptual inventory tests that align with the core concepts for each course. Given the large body of literature indicating that active learning enhances student performance, we predict that classes employing a student-centred approach will exhibit relatively higher student scores on concept inventory tests compared to classes that use traditional lecture. In addition, we examine the effect of different types of specific active learning practices on student performance. By accurately documenting the range of approaches used across classrooms, we can investigate which classroom activities contribute to improved student performance.

Cohort
This study focused on the Biology program at the University of British Columbia, a large, research-intensive university in Vancouver, British Columbia, Canada. All instructors teaching biology courses without a laboratory component during Fall 2014 and Winter 2015 were contacted to take part in the study; a total of 31 class sections participated, with an average of 211 registered students per class. Class sizes ranged from numbers in the teens in fourth-year courses to over 300 in first-year courses. We chose to focus on courses without an integrated laboratory component so that we could assess student learning in primarily classroom environments. The 31 classes without a laboratory component that we studied here represent approximately 40% of all courses that are offered by the biology program in a given year. As an incentive to participate, all instructors were offered the opportunity to see the aggregate data from their course (COPUS observations and student performance on concept inventories). The courses involved in the study were largely lower division (first- and second-year) courses; six of these courses were required for biology majors (out of a total of seven required courses). Many of these lower division courses were run as multiple sections, each taught by a different instructor. Each instructor-unique course section was analyzed independently because each instructor offered a different approach to teaching. Students in the participating classes were asked for consent to use their data; only data from students who wrote both the pre- and post-test and gave consent were included in our analyses. No additional incentives were given to students. A breakdown of the courses and number of participants is shown in Table 1. This work was performed under approval from the UBC Behavioural Research Ethics Board, H14-02293.
The percentage of total students registered reflects the numbers in the courses that we surveyed, rather than all courses in the curriculum.

Compiling conceptual tests
We quantified student learning as a change in score on a concept inventory test. Seventeen different conceptual inventory tests were administered, corresponding with 17 different courses in the study. These conceptual tests varied in length (6 to 23 questions) and were composed of a combination of questions that were either based on previously used test questions created by the researchers, sourced from validated inventories, or modified slightly from the validated questions. A complete breakdown of question sources and calculations of discrimination indices are available in S1 Table and S1 Fig. Tests were compiled collaboratively with instructors to match the central course content and learning objectives. Pre-tests were administered before any exposure to the content in that class, and post-tests were run at the end of the semester (most during the last week of class). Researchers administered both the pre- and post-tests in person during class time. For each course, questions were presented at the start and end of the semester in one of two ways: 1) they appeared on lecture slides and students answered them on hard copy bubble answer sheets or voted on the answers with i>Clicker personal response systems; or 2) students were given a paper copy of the test and filled in their answers on a bubble sheet. The approach taken to deliver the questions and the method used to answer was consistent within each class for the pre- and post-tests. All matched scores for concept inventory tests are available in S1 File.

COPUS observations
The COPUS protocol [12] was used to gather in-class observational data because it allows for live collection of quantitative data. A group of seven researchers, including six post-doctoral teaching fellows and an undergraduate student, conducted the classroom observations. All of the observers were trained to use COPUS prior to the study. Seven "practice" classes were attended and scored by more than one observer, and the intra-class correlation for the total number of each COPUS category was calculated for these observations using the ICC package in R [18]. This approach determines the degree to which values within a category are in agreement between observers; values for intra-class correlations vary between 0 and 1, and estimates greater than 0.75 are considered to be excellent inter-rater agreement [19]. The intra-class correlation for each class observed by more than one researcher ranged from 0.86 to 0.98, with a mean of 0.93. Data from the practice observations were not used in further analysis; raw data are included in S2 File. Class observation data were collected for a "typical week" of the course, consistent with the approach used by Lund et al. [14]. This included approximately 150 minutes of class time that occurred in either three 50-minute classes or two 80-minute classes. The data were collected during weeks eight to ten of the 13-week semester. We avoided any irregular class sessions such as midterms. In total, 98 COPUS observations were made. Data from a particular section were averaged to obtain one value across observed sessions, such that each section of a particular course had only one set of values. This approach was used to reflect the time spent on different activities for courses in which classes on different days might have different structures (e.g., introduction of a topic on Tuesday, worksheets or other activities to promote understanding on Thursday). Raw and summarized COPUS data are available in S2 File.
To assess the between-class variation within a course section, we examined pairwise correlation coefficients for the frequency of different class activities across the set of observations within a particular course. The average pairwise correlation between classes within the same section was generally very strong; in 24 of 31 sections observed, correlation coefficients were greater than 0.7, and 6 of the remaining 7 sections had correlation coefficients between 0.5 and 0.7, representing strong to moderate relationships. No sections had negative correlations between class periods. Thus, the data we used in subsequent analyses should accurately capture the total duration of different classroom activities.
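The between-class consistency check described above can be illustrated with a short sketch. This is not the analysis code used in the study (which was carried out in R); the function names and the interval counts below are hypothetical, and the sketch assumes that each observed session is summarized as a vector of counts of 2-minute intervals per COPUS code.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_pairwise_correlation(sessions):
    """Average Pearson r over every pair of observed sessions in one section."""
    pairs = list(combinations(sessions, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Hypothetical data: counts of 2-minute intervals in which each of five
# COPUS codes was recorded, for three observed sessions of one section.
sessions = [
    [20, 5, 3, 8, 2],
    [22, 4, 2, 9, 1],
    [18, 7, 4, 6, 3],
]

section_r = mean_pairwise_correlation(sessions)
print(f"mean pairwise r = {section_r:.2f}")
```

A section whose sessions follow a similar pattern of activities yields a value close to 1; sections with very different class structures on different days would yield lower values.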

Classroom characterization
We first created broad categories of each of the 31 course sections using the methodology of Lund et al. [14]. These categories are 'Mostly Lecture', which encompasses both traditional and Socratic lecturing, 'Emergence of Group Work', which includes practices ranging from limited to extensive peer instruction, and 'Extensive Group Work', which involves student-centered peer instruction and group work. COPUS code abbreviations used in this study, including definitions from Smith et al. [12], can be found in Table 2. Activities that occurred on average 5% of the time or less in the observed classes were removed from subsequent analyses. Instructor administration (I-Adm) was not a variable of interest for our study, and thus was also removed from our analyses. Following Lund et al. [14], we eliminated redundancy by including only the student component of any student-instructor pair of variables that were very tightly and significantly correlated. This occurred for S-Q and I-AnQ (r = 0.96, p < 0.001), I-PQ and S-AnQ (r = 0.92, p < 0.001), and S-CG and I-CQ (r = 0.95, p < 0.001; see S2 Table for full correlation matrix). Student listening occurred relatively frequently (on average in 85% of two-minute time intervals), and was positively correlated with instructor lecturing (r = 0.76, p < 0.001) and negatively correlated with group work (worksheets: r = -0.58, p < 0.001, other group work: r = -0.53, p = 0.004) and instructors moving in groups (r = -0.77, p < 0.001). Because we retained the other variables, we excluded student listening from our analyses to minimize spurious results. Thus, we retained the following five student codes for our analyses: S-CG, S-WG, S-OG, S-AnQ, and S-Q.
Table 2 note: Codes that begin with S or I are 'students doing' and 'instructor doing' codes, respectively. All definitions are from [12], with the exception of student group work (S-GW). Mean percentages of 2-minute intervals were calculated using means per class section. Activities in bold were retained for further analysis.
Student group work variables (S-CG, S-WG, and S-OG) were re-coded to a 'Student group work' (S-GW) variable [14], to reflect the amount of time spent on group work, regardless of type, and to account for positive correlations among the three types of group work. For each time interval, we counted whether any group work occurred and included it only once to avoid double-counting group work. This approach reduced the number of student variables to three (S-GW, S-AnQ, and S-Q). The four instructor codes that were used in our analyses were I-Lec, I-RtW, I-FUp, and I-MG.
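The re-coding of the three group work codes into S-GW amounts to taking, for each 2-minute interval, the logical "or" of the three codes. The sketch below illustrates this under the assumption that each code is stored as one true/false flag per interval; the function names and the ten-interval example are hypothetical and are not taken from the study data.

```python
def recode_group_work(s_cg, s_wg, s_og):
    """Return one S-GW flag per interval: True if any group work code occurred."""
    if not (len(s_cg) == len(s_wg) == len(s_og)):
        raise ValueError("all codes must cover the same intervals")
    return [bool(cg or wg or og) for cg, wg, og in zip(s_cg, s_wg, s_og)]


def percent_of_intervals(flags):
    """Percentage of 2-minute intervals in which a code was recorded."""
    return 100.0 * sum(flags) / len(flags)


# Hypothetical 20-minute observation (ten 2-minute intervals).
T, F = True, False
s_cg = [T, T, F, F, F, F, F, F, F, F]  # clicker question group discussion
s_wg = [F, T, T, F, F, F, F, F, F, F]  # worksheet group work
s_og = [F, F, F, F, F, T, F, F, F, F]  # other group work

s_gw = recode_group_work(s_cg, s_wg, s_og)
combined = percent_of_intervals(s_gw)
naive = sum(percent_of_intervals(c) for c in (s_cg, s_wg, s_og))
print(combined, naive)
```

In this example, simply summing the three codes would report group work in 50% of intervals, whereas the combined S-GW variable reports 40%, because the second interval contains two types of group work but is counted only once.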

Student performance
Because our data were based on different concept inventories across a variety of courses, we used a meta-analytic approach such that different outcomes could be standardized prior to comparison. Each of the 31 sections that we observed was considered a single 'study', and thus each section was treated as an independent data point in our analyses. To assess changes in student performance between the pre- and post-tests, we calculated an effect size for each section. We recognize that inherent differences among the concept inventories, which were required given the diversity of classes in this study, may affect comparisons among classes; the standardization to effect sizes was done to dampen this effect, but it cannot remove it entirely. The effect size of the difference between pre- and post-test scores and its standard error within each class section were calculated using the standardized mean gain following Lipsey and Wilson [20]. Equations and full descriptions of these calculations can be found in the S3 File.
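For readers without access to S3 File, the following sketch shows one commonly cited form of the standardized mean gain and its standard error. It is an assumption that this form matches the calculations in S3 File exactly: the sketch takes the effect size as the mean gain divided by the pooled pre/post standard deviation, and the standard error as sqrt(2(1 - r)/n + ES^2/(2n)), where r is the correlation between matched pre- and post-test scores and n is the number of matched students. The function names and the scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation between matched pre- and post-test scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def standardized_mean_gain(pre, post):
    """Return (effect size, standard error) for matched pre/post scores.

    Assumed form (not confirmed against S3 File):
      ES = (mean_post - mean_pre) / sqrt((sd_pre^2 + sd_post^2) / 2)
      SE = sqrt(2 * (1 - r) / n + ES^2 / (2 * n))
    """
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("need at least two matched pre/post scores")
    n = len(pre)
    pooled_sd = sqrt((stdev(pre) ** 2 + stdev(post) ** 2) / 2)
    es = (mean(post) - mean(pre)) / pooled_sd
    # Clamp r to [-1, 1] to guard against floating-point overshoot.
    r = max(-1.0, min(1.0, pearson_r(pre, post)))
    se = sqrt(2 * (1 - r) / n + es ** 2 / (2 * n))
    return es, se


# Hypothetical matched scores for four students in one section.
es, se = standardized_mean_gain([2, 4, 6, 8], [4, 6, 8, 10])
print(f"ES = {es:.3f}, SE = {se:.3f}")
```

Because the gain is expressed in standard deviation units, sections that used inventories of different lengths and difficulty can be placed on a common scale, which is the purpose of the standardization described above.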

Statistical analyses
Each course section was treated as a single datum in our analyses, resulting in 31 individual data points. In all cases, the dependent variable in our analyses was the effect size for the standardized mean gain in a particular course section. We compared sets of generalized linear models to assess how well different predictors of student performance fit the observed patterns in learning gains. We used Akaike's Information Criterion, corrected for sample size (AICc), to rank the models. A particular model was considered the single 'best' model if it had the lowest AICc value, and differed from the next best model by a value of 2 or more. All analyses were carried out in R 3.4.4 [21]. We first examined the relationship between the broad categories for 'Instructional style' outlined in Lund et al. [14] and the standardized mean gain within each course section. While these are very broad categorizations of classroom activity, we have included this analysis to examine the general trends in approaches to teaching. To add more detail to this approach, we then determined which class activities were the best predictors of learning gains, and assessed the effect of the duration of the seven COPUS variables outlined above on effect sizes. The three-way and all possible two-way interactions among S-GW, I-FUp, and I-MG were included to account for the potential interactive effects of instructor and student activities that occur during implementation of active learning approaches. All 304 possible models given the seven independent variables and specified interactions were compared using the dredge function in the MuMIn package in R [22]. Predictor variables were standardized prior to model comparison. Because there was more than one 'best' model, we used a model averaging approach to extract regression coefficients from the top models (those within 2 AICc units of the best model) to determine the magnitude and direction of the most consistent predictors of learning gains [23].
As a part of this approach, we tested for collinearity between additive independent predictors by examining the variance inflation factors (vif), an indication of the severity of collinearity between predictor variables. Because I-Lec had a vif value larger than five and could influence the outcome of model averaging to yield spurious results [24], we ran the analyses with and without this variable included and they yielded the same results (i.e., I-Lec was dropped from the best models when included). Finally, we used the same generalized linear modeling framework to identify whether particular types of group work are more effective than others by comparing learning gains between course sections that used these approaches to those that did not. In this case, we compared two models for each type of group work: one that included presence/absence of the activity as a predictor of learning gains, and one that included only an intercept (i.e., a nonzero effect size with no influence of the activity on learning gains).
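The figure of 304 candidate models can be reproduced by counting every subset of the model terms in which an interaction appears only alongside all of its lower-order terms. The sketch below does this count and also shows the small-sample correction used in AICc. It is an illustration in Python rather than the R code used in the study; the marginality rule is an assumption about how the candidate set was built (it matches the documented default behaviour of dredge), and the helper names and example AICc values are hypothetical.

```python
from itertools import combinations, product

INTERACTING = ("S-GW", "I-FUp", "I-MG")        # main effects with interactions
OTHER = ("S-AnQ", "S-Q", "I-Lec", "I-RtW")     # main effects only


def all_terms():
    """Seven main effects, three two-way interactions, one three-way interaction."""
    mains = [frozenset([m]) for m in INTERACTING + OTHER]
    two_way = [frozenset(p) for p in combinations(INTERACTING, 2)]
    three_way = [frozenset(INTERACTING)]
    return mains + two_way + three_way


def respects_marginality(terms):
    """True if every interaction is accompanied by all of its lower-order terms."""
    for term in terms:
        for size in range(1, len(term)):
            for sub in combinations(sorted(term), size):
                if frozenset(sub) not in terms:
                    return False
    return True


def count_candidate_models():
    """Count term subsets (including the intercept-only model) that are valid."""
    terms = all_terms()
    count = 0
    for mask in product([False, True], repeat=len(terms)):
        chosen = {t for t, keep in zip(terms, mask) if keep}
        if respects_marginality(chosen):
            count += 1
    return count


def aicc(log_likelihood, k, n):
    """AIC with the small-sample correction: AIC + 2k(k + 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("sample size too small for this many parameters")
    return -2.0 * log_likelihood + 2.0 * k + (2.0 * k * (k + 1)) / (n - k - 1)


def top_model_set(aicc_by_model, threshold=2.0):
    """Names of models whose AICc is within `threshold` of the lowest AICc."""
    best = min(aicc_by_model.values())
    return sorted(m for m, v in aicc_by_model.items() if v - best <= threshold)


print(count_candidate_models())
print(top_model_set({"m1": 10.0, "m2": 11.5, "m3": 13.0}))
```

The count factors as 16 (any subset of the four main-effect-only variables) times 19 (the valid combinations of S-GW, I-FUp, I-MG and their interactions), which gives 304.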

Instructional styles and student performance
Of the 31 course sections observed here, five were classified as "Mostly Lecture," 17 as "Emergence of Group Work," and nine as "Extensive Group Work". For all classroom profiles, effect sizes were positive, indicating that student performance improved between the pre- and post-test for all categories (Fig 1). However, the effect of instructional style on student learning gains was relatively weak; comparison of the model containing instructional style as a predictor and one that did not (an 'intercept only' model) revealed that these two models were equivalent (ΔAICc = 0.18). This result reflects the large variance within instructional styles: although the mean effect size for "Extensive Group Work" was 1.8 times higher than that for "Mostly Lecture", the 95% confidence intervals for these values overlap (Fig 1).

COPUS categories and student performance
The frequency and occurrence of all COPUS categories are shown in Table 2. Of the seven variables used in our analysis, six were retained in the top models: S-GW, S-AnQ, S-Q, I-RtW, I-FUp and I-MG (Table 3). Both S-GW and I-MG were retained in all five of the top models (those within 2 AICc units of the best model), and their coefficients differed from zero for the averaged model as well as in each individual model (Table 3). These parameters influenced learning gains in opposite directions; S-GW had a positive relationship with effect size, while the effect of I-MG was negative (Table 3; Fig 2). Coefficients for the other four variables (S-AnQ, S-Q, I-RtW and I-FUp) were not consistently different from zero in the averaged model (Table 3). Thus, we did not consider these four variables as reliable predictors of learning gains.

Types of group work and student performance
To further examine the positive effect of group work on student learning, we split this variable into its component parts: i>Clicker questions for which students discuss their answers, worksheets, and 'other types' of activities that primarily involved an instructor showing a slide or overhead with a question and having the students answer it together in groups. We compared student performance between sections in which a particular activity did or did not occur. Most sections had clicker questions (n = 26) and 'other types' of group work (n = 26); however, fewer than half of the observed sections used worksheets (n = 10; Table 2). Of the three types of group work, only the presence of worksheets had a clear effect on student performance when compared to an "intercept only" model ignoring worksheets (ΔAICc = 12.91; Fig 3). i>Clicker questions had a very weak effect on learning gains (ΔAICc = 0.44; Fig 3).
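The presence/absence comparison can be sketched as a contrast between two ordinary least-squares models with Gaussian errors: one with a single overall mean (intercept only) and one with separate means for sections with and without the activity. This is a simplified, unweighted illustration and is an assumption about the general form of the comparison, not a reproduction of the models fitted in the study. The effect sizes below are hypothetical, and ΔAICc is taken here as the intercept-only AICc minus the AICc of the presence/absence model, so that positive values favour the model that includes the activity.

```python
from math import log, pi


def rss_about_mean(values):
    """Residual sum of squares around the mean of `values`."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def gaussian_aicc(residual_ss, n, n_coef):
    """AICc of a least-squares fit with Gaussian errors.

    k counts the regression coefficients plus the residual variance.
    """
    k = n_coef + 1
    log_lik = -0.5 * n * (log(2 * pi) + log(residual_ss / n) + 1)
    return -2 * log_lik + 2 * k + (2 * k * (k + 1)) / (n - k - 1)


def delta_aicc_presence(with_activity, without_activity):
    """Intercept-only AICc minus presence/absence AICc (positive favours presence)."""
    everything = list(with_activity) + list(without_activity)
    n = len(everything)
    null_model = gaussian_aicc(rss_about_mean(everything), n, n_coef=1)
    presence_model = gaussian_aicc(
        rss_about_mean(with_activity) + rss_about_mean(without_activity),
        n,
        n_coef=2,
    )
    return null_model - presence_model


# Hypothetical section-level effect sizes.
with_worksheets = [1.2, 1.4, 1.3, 1.5]
without_worksheets = [0.5, 0.6, 0.4, 0.7, 0.5, 0.6]

delta = delta_aicc_presence(with_worksheets, without_worksheets)
print(f"delta AICc = {delta:.2f}")
```

A difference well above 2 indicates that including the activity improves the fit by more than the penalty for the extra parameter, whereas a difference near zero indicates that the two models are effectively equivalent.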

Discussion
By combining direct, non-interventional classroom observations with quantitative assessments of learning gains across the Biology curriculum at a large university, we confirm the well-established positive effects of active learning on student conceptual understanding (e.g., [4,6]). Strikingly, we found that using in-class worksheets, a simple intervention with a low barrier to entry, resulted in significant increases in student scores. Thus, by using observations of classroom practices in conjunction with course-specific concept diagnostics, we were able to specify which types of student-centred activities support and promote conceptual learning.

Student learning and classroom structure
Student performance was higher in classes that were characterized as consisting of Extensive Group Work, compared to the other two instructional styles (Mostly Lecture, Emergence of Group Work), although there was considerable variation within each category. The Extensive Group Work category is largely defined by class periods in which the instructor lectures for only half of the allotted time or less, while the majority of classroom activity involves student group work and follow-up discussions mediated by the instructor [14]. By contrast, the 'Emergence of Group Work' category features more than half of the class time spent on lecture, and less than 25% of the time on group work [14]. Our result that the highest learning gains occurred in classes with a considerable amount of group work is consistent with findings from Prather et al. [4], where classes that spent 25% of time or more on student centred teaching practices tended to have the highest learning gains. In addition, a recent investigation of the influence of moderate versus high use of student-centered classroom approaches in Introductory Biology indicated that an extensive amount of student activity, driven mainly by a difference in the frequency of group work, improved student performance and attitudes about the topic [6]. Our results, coupled with other studies, suggest that class time investment in group work will result in higher learning gains. While a broad categorization of instructional styles allows for a general characterization of the classroom, COPUS data on specific student and instructor actions can provide greater resolution regarding the types of activities that may be most beneficial for conceptual understanding. Indeed, given the uncertainty surrounding the comparison among instructional styles, the best positive predictor of student performance was the time allocated to student group work. 
Group work is often used to typify situations that would be considered 'active learning', and several meta-analyses have indicated that active learning practices in general enhance student learning [3,4,7]. In addition, collaborative classrooms dedicate a large portion of time-on-task to student discussion, which allows for engagement with the course material through explanation and discussion that can maximize student learning [13,25,26]. In our averaged model, a 10% increase in group work time (five minutes in a 50-minute class) correlates with a 0.30 increase in effect size, or roughly a 3% improvement in student performance (almost one letter grade, depending on the institution), holding other variables constant. In a broader perspective, simply giving five minutes of time for group work has a similar impact to the use of a "researcher-developed" or "specialized" intervention [27]. Further highlighting the potential impact of such a low-barrier intervention, studies with effect sizes greater than 0.20 are noted to be of interest in educational policy decisions [7]. Our study is particularly relevant in this context, as current policy looks towards more diversity and inclusion in STEM, and active-learning approaches are known to close the achievement gap for students from disadvantaged backgrounds and minorities [2,7,28].
Surprisingly, our best models indicated that "Instructors moving in groups" (I-MG) was a negative predictor of student learning. In our study, the "instructor" included both teaching assistants and the lecturer. There are several reasons why class time spent on discussions with individual groups might negatively affect learning. First, the nature of these interactions, and impact on the group(s) directly participating, are unknown; our broad-scale observational approach does not capture the nuances of the interactions between instructor and student groups. For example, during short discussions, the nature of student-instructor interactions has been shown to alter the quality of student interactions, particularly if reasoning is provided by the instructor rather than allowing the students to express their rationale [29,30]. In addition, in large-enrollment classes such as the majority of those assessed here, instructors are only able to interact with a relatively small number of the groups within the classroom; this is further influenced by the tiered layout of large classes, such that instructors do not always have the ability to access the entire room effectively. This has the potential to reduce the engagement level of students in groups that are not targeted, which might reduce learning gains for the class as a whole. Further directed study to address interactions between instructors and groups is needed to verify and resolve this negative effect.
The four other variables that were retained in the best models had much lower explanatory power. Two of these variables, Student Asks a Question (S-Q) and Instructor Follow-up (I-FUp), were retained in only one model and did not have coefficients that differed from zero. Both instructor real-time writing (I-RtW) and student answering a question posed by the instructor (S-AnQ) were retained in more than one of the best models; however, for each variable the confidence interval around the estimated coefficient excluded zero in only one model, and both had confidence intervals that overlapped zero in the averaged model. This suggests that their influence on student learning is not strong compared to the other predictors in the models. The amount of time spent on real-time writing by the instructor (I-RtW) was a weakly positive indicator of student learning; this may be attributed to the fact that real-time writing was often observed during follow-up after a group activity. Additionally, real-time writing may result in a decrease in the pace of the classroom, allowing students more time to synthesize information and take additional hand-written notes, a behavior that increases gains on conceptual questions compared to using a laptop ([31], but see [32] for a replicate study reporting a non-significant effect). By contrast, student answering a question in front of the whole class (S-AnQ) was a weakly negative predictor in our models. This effect may be attributed to a decrease in student engagement during this activity, particularly in large, acoustically poor lecture halls.
Students may disengage if the discussion between the individual student and the instructor lasts too long; indeed, there may be an optimal duration of this activity that allows all students to remain engaged, or, there may be other instructional approaches to avoid this issue entirely while still eliciting student responses (e.g., having Teaching/Learning Assistants summarize student responses to the whole class). Data on these aspects of classroom interactions are lacking.

Components of group work: Worksheets and peer instruction
Time spent on group work emerged as an important predictor of learning gains and therefore we further investigated any possible effects of specific sub-types of group work. Strikingly, worksheets had a strong effect on increasing student performance, despite the variability in worksheet styles and practices implemented across courses. This finding promotes a simple and accessible method to support student learning. Worksheets do not necessarily require large time investments by the instructor for development or feedback; their construction can be relatively straightforward such as using questions based on previous tests or problem sets, and they need not be handed in for grading. Furthermore, they do not require special technology for implementation or student engagement. The benefit of even the simplest of worksheets is that they encourage students to articulate, evaluate, and reflect on a written answer. The simultaneous or sequential combination of peer discussion and writing has been shown to enhance student understanding and retention of complex concepts in STEM education [33][34][35], and this type of effect may explain the learning gains associated with worksheets in our study. Furthermore, this type of activity has been implemented in STEM classrooms [36][37][38][39], and assessment of student learning indicates that worksheets increase conceptual understanding when used as part of an active learning curriculum [36,39]. In the courses that we observed, many of the worksheets were guided activities that were embedded within the lectures or were 'case-study' approaches that required students to examine multiple aspects of a particular problem. While the use of worksheets resulted in higher gains compared to student response systems in our study, most of the courses we assessed used that technology, limiting our ability to detect an effect. 
The effectiveness of student response systems for student learning has been documented previously (e.g., [39,40]), and the adoption of this tool at our institution was widespread for this reason.

Variation in the impact of active learning approaches
While classroom practices accounted for approximately 45% of the variability in student performance, our approach did not capture several important aspects of student learning. First, this study examined only in-class activities. Students spend a non-trivial amount of time on class preparation, homework, and studying. From a national survey, this university's student population has a self-reported average of 19 study hours weekly [41]; this value includes students from all faculties and is likely an underestimate for STEM students [42]. Further, individual student characteristics and perceptions impact their practices and experiences, even within an active-learning classroom [2,43,44,45]. Even in classes where the amount of assigned work is equivalent, time-on-task outside the classroom can vary across students and may be affected by various factors relating to course structure, allotment of grades, and instructor characteristics. Studies that further investigate these questions should include the amount, and the type, of work that students undertake outside of class, as well as instructor expectations of their work. Second, the COPUS observational tool only captures the amount of time spent in a particular classroom practice. It does not distinguish between different implementations of the same practice, which can significantly impact the effectiveness of any classroom approach [46]. This can include student accountability (such as whether or not participation marks are allocated, or if clicker questions are graded, or if worksheets are handed in [36,47]), content-specific and content-independent instructor cues during peer instruction [2,13,48,49], and Bloom's level and scaffolding of the worksheet/clicker questions [37]. Third, our study did not target comparisons between courses in upper and lower years, and thus did not capture differences in learning as students move through the curriculum and mature as learners.
It is important to note that in this study the courses with the highest effect sizes were from the first and second years of the program. Finally, our analysis does not take into account temporal spacing in a classroom, such as the order of content, practice, feedback, the length of particular group work sessions, or the distinction between individual and group work. These variables are likely to be very important for student learning, as seen in other studies [50][51][52][53]. Tools that allow for the analysis and flow-of-time visualization of COPUS data are sorely needed to support research that will investigate the impact of how class time evolves on student outcomes.

Implications for teaching
Given the large number of variables that can impact student learning, it is indeed notable that student performance increased with the simple inclusion of more group work. The finding that student performance can be predicted by in-class time underscores the importance of the structure and use of in-class time to facilitate achievement of learning goals. The frequent calls for changes to STEM education cannot, and should not, be ignored; however, the process of changing one's approach to using class time is not trivial. Our results indicate that even a relatively short duration of group work can lead to increases in student learning. As an instructional tool, we would suggest that educators consider the use of structured worksheets as a way to increase student-centered use of class-time. Using this easy-to-implement, low-technology teaching practice will encourage collaboration, problem solving, and can be used to inform the instructor about what students are struggling with, providing opportunities for valuable and timely feedback.