Fig 1.

Data model for all models presented in this paper.

Fig 2.

Model evaluation flowchart.

Data are split into training and testing sets and evaluated with two separate models: logistic regression fit by maximum likelihood and gradient boosted trees (xgboost). Missing data for the xgboost model are not mean-imputed; instead, the model relies on the built-in missing-value handling within xgboost [10].
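The split-and-impute step of this flowchart can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the helper names (`train_test_split_rows`, `column_means`, `mean_impute`), the 80/20 split, and the toy rows are all hypothetical, and computing the column means from the training rows only is a common convention that the caption does not specify. The tree-model branch simply keeps the `None` values, standing in for data passed to xgboost without imputation.

```python
import random

def train_test_split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and split the rows into training and testing sets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(round(len(rows) * test_fraction)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

def column_means(rows):
    """Mean of each column, ignoring missing (None) entries."""
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return means

def mean_impute(rows, means):
    """Replace each missing entry with the supplied mean of its column."""
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

# Hypothetical toy rows: [cumulative credit hours, cumulative GPA].
rows = [[120.0, 3.5], [90.0, None], [None, 2.8], [110.0, 3.0], [60.0, 2.0],
        [130.0, None], [100.0, 3.9], [None, 3.2], [75.0, 2.5], [115.0, 3.7]]

train, test = train_test_split_rows(rows)

# Logistic regression branch: impute with means taken from the training rows.
means = column_means(train)
train_lr = mean_impute(train, means)
test_lr = mean_impute(test, means)

# Tree-model branch: missing values are left in place, not imputed.
train_tree, test_tree = train, test
```

Taking the means from the training rows alone keeps information from the test set out of the fitted model, which is why the sketch imputes the test rows with the training means.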

Fig 3.

Out-of-sample F1 scores for all models per semester.

Xgboost achieves a higher F1 score, and thus a better balance of precision and recall, than logistic regression for the bulk of semesters. Error bars are the bootstrapped standard deviation of the F1 score.
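A bootstrapped standard deviation of an F1 score, as used for the error bars, can be sketched like this. The function names, the number of resamples, and the toy labels are assumptions made for illustration; the caption does not state how many resamples were drawn or which data were resampled.

```python
import random
import statistics

def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def bootstrap_f1_std(y_true, y_pred, n_resamples=1000, seed=0):
    """Resample (label, prediction) pairs with replacement and return
    the standard deviation of the resulting F1 scores."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    return statistics.stdev(scores)

# Hypothetical test-set labels and predictions for one semester.
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0]

point_estimate = f1_score(y_true, y_pred)   # plotted value
error_bar = bootstrap_f1_std(y_true, y_pred)  # plotted error bar
```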

Fig 4.

Average feature importance for all graduating semesters for the xgboost model.

Feature importances are weighted by the F1 scores per semester. Enrollment factors and the cumulative average grade are more likely to predict when a student graduates than other factors.
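One way to read "weighted by the F1 scores per semester" is as an F1-weighted average of the per-semester importances. The sketch below assumes that reading; the function name, the feature names, and every number are invented for illustration, and dividing by the sum of the F1 scores is an assumption (an unnormalized sum would differ only by a constant factor).

```python
def f1_weighted_importance(importances, f1_scores):
    """Average per-semester feature importances, weighting each semester
    by its F1 score and normalizing by the total weight."""
    total_weight = sum(f1_scores.values())
    features = next(iter(importances.values())).keys()
    return {
        feat: sum(f1_scores[s] * importances[s][feat] for s in importances)
        / total_weight
        for feat in features
    }

# Hypothetical importances per graduating semester (each row sums to 1).
importances = {
    8: {"credit_hours": 0.6, "gpa": 0.3, "changed_major": 0.1},
    9: {"credit_hours": 0.5, "gpa": 0.3, "changed_major": 0.2},
    10: {"credit_hours": 0.4, "gpa": 0.2, "changed_major": 0.4},
}
# Hypothetical out-of-sample F1 score for each semester's model.
f1_scores = {8: 0.8, 9: 0.6, 10: 0.4}

averaged = f1_weighted_importance(importances, f1_scores)
ranked = sorted(averaged, key=averaged.get, reverse=True)
```

Weighting this way lets semesters whose models predict well contribute more to the average than semesters whose models predict poorly.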

Fig 5.

Xgboost feature importances for predicting if a student will graduate in the next semester.

Feature importances are weighted by the F1 score calculated from test data. The row-wise sum of these features would produce Fig 4. By far the most important feature for graduating “on time” (within 8 semesters) is having enough credit hours. Student preparation is important for graduating “early” (in fewer than 8 semesters). Students who change majors are more likely to graduate later, so this feature becomes more important in later graduating semesters.

Fig 6.

The partial dependence per semester for the cumulative credit hours a student has obtained.

The stronger the partial dependence, the more that value of cumulative credit hours contributes to the predicted probability that a student will graduate in the given semester. The partial dependence has been weighted by the per-semester F1 score. Having a total number of credit hours close to 120 is highly predictive of graduating for students who graduate after 8 to 10 semesters of enrollment. Outside of this window, the impact of the total number of credit hours diminishes. This is likely because students accumulate additional credit hours that do not count towards a degree, such as when they change their major.
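Partial dependence can be sketched generically: fix one feature at a grid value for every student, average the model's predictions, and repeat across the grid. The code below is an illustration under stated assumptions. `toy_model` is a hypothetical stand-in for a fitted classifier, the rows and the F1 value are invented, the curve is the plain (uncentered) average, and applying the F1 weight by simple multiplication is an assumption about how the weighting was done.

```python
import math

def toy_model(row):
    """Hypothetical stand-in for a fitted classifier: returns P(graduate)."""
    credit_hours, gpa = row
    score = 0.1 * (credit_hours - 120.0) + 1.0 * (gpa - 3.0)
    return 1.0 / (1.0 + math.exp(-score))

def partial_dependence(model, rows, feature_index, grid):
    """For each grid value, overwrite one feature in every row and
    average the model's predictions over all rows."""
    curve = []
    for value in grid:
        preds = []
        for row in rows:
            modified = list(row)
            modified[feature_index] = value
            preds.append(model(modified))
        curve.append(sum(preds) / len(preds))
    return curve

# Hypothetical students: (cumulative credit hours, cumulative GPA).
rows = [(90.0, 2.5), (110.0, 3.0), (120.0, 3.4), (135.0, 3.8)]
grid = [60.0, 90.0, 120.0, 150.0]   # credit-hour values to probe
semester_f1 = 0.7                   # hypothetical per-semester F1 weight

curve = partial_dependence(toy_model, rows, 0, grid)
weighted_curve = [semester_f1 * v for v in curve]
```

Averaging over all rows marginalizes out the other features, so the curve isolates how the prediction responds to credit hours alone.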

Fig 7.

The partial dependence per semester for the cumulative average GPA a student has obtained.

The stronger the partial dependence, the more the cumulative average GPA contributes to the predicted probability that a student will graduate in the given semester. The partial dependence has been weighted by the per-semester F1 score. A GPA far above average contributes strongly to the prediction for students who graduate in semester 9 or 10. This is likely because these students never fail a course and are highly committed to their chosen major.

Fig 8.

Students typically graduate within the window of 8 to 12 semesters.

The bars represent the number of students across all cohort years who graduate in the following semester. The green line represents the cumulative fraction of students who have graduated.
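The relationship between the bars and the cumulative line can be sketched as a running sum divided by a total. The counts below are invented, and using the total number of graduates as the denominator is an assumption; the caption does not say whether the fraction is relative to graduates or to all enrolled students.

```python
from itertools import accumulate

# Hypothetical number of students graduating after each semester of enrollment.
graduates_per_semester = {6: 10, 7: 40, 8: 500, 9: 200, 10: 150, 11: 60, 12: 40}

semesters = sorted(graduates_per_semester)
counts = [graduates_per_semester[s] for s in semesters]  # bar heights
total = sum(counts)

# Running total of graduates divided by the overall total (the line).
cumulative_fraction = [c / total for c in accumulate(counts)]
```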

Fig 9.

Recall scores for sub-populations within the data.

The xgboost model consistently achieves higher recall than logistic regression for women, underrepresented minorities, and students with missing data. In Fig 3, there is a large disparity between the logistic regression model and the xgboost model for semesters 5-7. This may be due to the imputation method used with the logistic regression model and the correlation between graduating and missing data during those semesters. However, the gaps in model performance for women, minorities, and students with missing data in later semesters, where there is no such correlation, are likely not due to imputation and support preferring the xgboost model over the logistic regression model.
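Recall for a sub-population is computed by restricting the test set to that group before counting true positives and false negatives. The sketch below assumes that approach; the record layout, the `gender` key, and the toy labels are hypothetical.

```python
def recall(y_true, y_pred):
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else float("nan")

def recall_by_group(records, group_key):
    """Group records by one attribute and compute recall within each group."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[group_key], []).append(rec)
    return {
        g: recall([r["y_true"] for r in recs], [r["y_pred"] for r in recs])
        for g, recs in groups.items()
    }

# Hypothetical test records: y_true = 1 means the student graduated.
records = [
    {"gender": "F", "y_true": 1, "y_pred": 1},
    {"gender": "F", "y_true": 1, "y_pred": 0},
    {"gender": "F", "y_true": 0, "y_pred": 0},
    {"gender": "M", "y_true": 1, "y_pred": 1},
    {"gender": "M", "y_true": 1, "y_pred": 1},
    {"gender": "M", "y_true": 0, "y_pred": 1},
]

by_gender = recall_by_group(records, "gender")
```

Running the same function on each model's predictions gives the per-group comparison; a gap between groups under one model but not the other points to where that model misses graduates.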

Fig 10.

A comparison of a subset of features for students who graduate versus those who do not based on first year enrolled.

Since 1992, the student population has become more racially diverse, students have entered with increasing high school GPAs and math placement scores, and the time to degree among students who graduate has decreased. Students who graduate typically take more credit hours than those who do not, are typically better prepared as measured by high school GPA and math placement score, and are less diverse than the total university population.