State-level population estimates of sexual minority adolescents in the United States: A predictive modeling study

Johannes O. Ferstad; Maria Aslam; Li Yan Wang; Katherine Henaghan; Jiayi Zhao; Jingjing Li; Joshua A. Salomon

doi:10.1371/journal.pone.0304175

Abstract

Purpose

The Youth Risk Behavior Survey (YRBS) among high school students includes standard questions about sexual identity and sex of sexual contacts, but these questions are not consistently included in every state that conducts the survey. This study aimed to develop and apply a method to predict state-level proportions of high school students identifying as lesbian, gay, or bisexual (LGB) or reporting any same-sex sexual contacts in those states that did not include these questions in their 2017 YRBS.

Methods

We used state-level high school YRBS data from 2013, 2015, and 2017. We defined two primary outcomes relating to self-reported LGB identity and reported same-sex sexual contacts. We developed machine learning models to predict the two outcomes based on other YRBS variables and compared different modeling approaches. We used a leave-one-out cross-validation approach and report results from the best-performing models.

Results

Modern ensemble models outperformed traditional linear models at predicting state-level proportions for the two outcomes, and we identified prediction methods that performed well across different years and prediction tasks. Predicted proportions of respondents reporting LGB identity in states that did not include direct measurement ranged between 9.4% and 12.9%. Predicted proportions of respondents reporting any same-sex contacts, where not directly observed, ranged between 7.0% and 10.4%.

Conclusion

Comparable population estimates of sexual minority adolescents can raise awareness among state policy makers and the public about what proportion of youth may be exposed to disparate health risks and outcomes associated with sexual minority status. This information can help decision makers in public health and education agencies design, implement and evaluate community and school interventions to improve the health of LGB youth.

Citation: Ferstad JO, Aslam M, Wang LY, Henaghan K, Zhao J, Li J, et al. (2024) State-level population estimates of sexual minority adolescents in the United States: A predictive modeling study. PLoS ONE 19(6): e0304175. https://doi.org/10.1371/journal.pone.0304175

Editor: Leona Cilar Budler, University of Maribor, SLOVENIA

Received: July 11, 2023; Accepted: May 8, 2024; Published: June 27, 2024

This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Data Availability: YRBS data used in these analyses were obtained from the Centers for Disease Control and Prevention, with the following exceptions: additional permissions were requested from Ohio Department of Health, the Connecticut Department of Health, the Maryland Department of Health, and the Massachusetts Department of Elementary and Secondary Education, supported by the Centers for Disease Control and Prevention (CDC). Since data acquisition for this paper, the set of state datasets available from CDC has increased. Data for all except two states are now available for download from CDC: https://www.cdc.gov/healthyyouth/data/yrbs/data.htm. The two exceptions are Ohio and Massachusetts. Data from Ohio may be requested from the Ohio Department of Health (https://odh.ohio.gov/know-our-programs/youth-risk-behavior-survey). Data from Massachusetts may be requested from the Massachusetts Department of Elementary and Secondary Education (https://www.doe.mass.edu/sfs/yrbs/).

Funding: JOF, KH, JZ and JAS were funded by the Centers for Disease Control and Prevention (cooperative agreement U38-PS004646). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Population-based data on adolescent health behaviors and experiences are essential for planning and evaluating health promotion programs, tracking progress toward health goals, and understanding the effectiveness of interventions to mitigate health risk behaviors [1]. An important objective in the monitoring of adolescent health is to understand and reduce health disparities [2]. Sexual minority youth, including those identifying as lesbian, gay, or bisexual (LGB) and those reporting having same-sex partners, face stigma and discrimination that can place them at higher risks for negative health outcomes [3–5]. For example, compared to their heterosexual peers, LGB youth are more likely to report having felt sad or hopeless, been bullied at school, been forced to have sex, used illicit drugs, misused prescription opioids, or seriously considered suicide [6, 7]. Studies have shown that LGB youth are more likely than their heterosexual peers to report a range of sexual risks, such as having sexual intercourse before age 13, having multiple sexual partners, or having sex without a condom [2, 6, 8, 9]. These experiences can in turn increase risks of mental health problems, human immunodeficiency virus (HIV) infection, other sexually transmitted diseases, and pregnancy. For example, young gay and bisexual males have disproportionately high rates of HIV (69% of all new HIV diagnoses in 2019 were among gay and bisexual men, and those aged 13–34 accounted for 65% of all cases among gay and bisexual men), syphilis, and other sexually transmitted infections (STI) [10, 11]; adolescent bisexual females are more likely to have ever been pregnant than their heterosexual peers [12, 13].

Estimation of the population size of LGB youth is critical for understanding the scope of disparities at a population level and developing effective interventions to address the overall health needs of the LGB population. For example, population estimates of adolescent sexual minority males can not only help public health practitioners determine the burden of HIV and other STI in this population, but also help guide public health policies, programmatic efforts, and resources to effectively prevent and mitigate these infection risks, such as allowing minors to consent to pre-exposure prophylaxis services, educating health care providers about service delivery, and promoting inclusive sexual health education and gay-straight alliances in schools.

The Youth Risk Behavior Surveillance System, which includes surveys among high school students in most US states, includes standard questions about sexual identity and sex of sexual contacts, but these questions have not been included in every state survey [2]. To address the information gaps resulting from omission of these questions from some surveys, we developed a new predictive model using machine learning methods and state Youth Risk Behavior Survey (YRBS) data. We predicted survey responses to questions about LGB identity and having sex with any same-sex partners in order to produce aggregate-level estimates for states that do not have this information directly available from surveys.

Materials and methods

Data

We used state high school YRBS data from 2013, 2015, and 2017 [2, 6, 14]. These surveys are conducted by state education and health agencies, designed in reference to a standard set of questions, with allowance for state agencies to add or delete questions depending on their programmatic or policy needs, following guidance from the Centers for Disease Control and Prevention (CDC). We obtained state-level datasets either from CDC or directly from individual states. We only included those surveys that provided weighted data to produce representative samples of the high school students in each state. Of the 50 states, three (Minnesota, Oregon, and Washington) did not conduct a YRBS in any of the three included study years. Table 1 and S1 Table provide summaries of the data used in our analysis. The full list of survey questions and variables included in the dataset can be found in the YRBS Combined Datasets User’s Guide [15].

Table 1. Data used to predict proportions of students in grades 9–12 reporting lesbian, gay, or bisexual identity and proportions reporting any same-sex sexual contacts in the United States in 2017.

https://doi.org/10.1371/journal.pone.0304175.t001

Measures

We defined two primary outcome variables corresponding to YRBS questionnaire items as follows:

Self-reported LGB identity.

This outcome was coded based on individual responses to Q67 in the 2017 Combined YRBS Dataset: Which of the following best describes you? A. Heterosexual (straight) B. Gay or lesbian C. Bisexual D. Not sure. We coded the binary minority sexual identity outcome (“LGB Identity”) as 1 if the recorded answer was “Gay or Lesbian” or “Bisexual”. All other responses, including missing responses, were coded as 0.

Reported (any) same-sex sexual contact.

This outcome was coded based on individual responses to the sex question and Q66 in the 2017 Combined YRBS Dataset. The sex question asks What is your sex? A. Female B. Male. Q66 asks During your life, with whom have you had sexual contact? A. I have never had sexual contact B. Females C. Males D. Females and males. The following respondents were assigned a 1 for the binary sexual contacts outcome (“Any Same-Sex Contact”): (1) respondents answering “Females and males” to Q66, (2) respondents responding “Male” to the sex question and “Males” to Q66, and (3) respondents responding “Female” to the sex question and “Females” to Q66. All other responses, including missing responses, were coded as a 0.
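
The two coding rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the study code: it assumes responses are stored as the text labels quoted above (the Combined Dataset may store numeric codes instead) and that missing responses are represented as `None`. The function names are hypothetical.

```python
from typing import Optional


def code_lgb_identity(q67: Optional[str]) -> int:
    """Return 1 for "Gay or lesbian" or "Bisexual"; 0 otherwise, including missing."""
    return int(q67 in {"Gay or lesbian", "Bisexual"})


def code_same_sex_contact(sex: Optional[str], q66: Optional[str]) -> int:
    """Return 1 if the respondent reports any same-sex sexual contact; 0 otherwise."""
    if q66 == "Females and males":
        return 1
    if sex == "Male" and q66 == "Males":
        return 1
    if sex == "Female" and q66 == "Females":
        return 1
    return 0  # all other responses, including missing ones
```

Note that under both rules a missing response is coded 0 rather than excluded, so item non-response lowers the coded proportion.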

Our main analyses focused on predictions for the two primary outcomes for both sexes combined. Because some public health risks and programs vary for adolescent males and females (for example, adolescent sexual minority males are identified as a population with elevated risk of acquiring HIV and other STI), we also produced sex-stratified predictions of proportions reporting any same-sex sexual contacts (see S2 Table for sex-stratified data summary). Independent variables for prediction included responses to all other survey questions in YRBS. Missing responses on the independent variables were replaced with the modal response in the full dataset (across all the included states), and an additional covariate was created for each question comprising an indicator for whether the response was missing for a particular respondent.
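
The handling of missing predictor values described above (modal imputation plus a missingness indicator) might look as follows for a single question. This is a sketch under stated assumptions: respondents are represented as dictionaries pooled across all states, missing answers are `None`, and the function name and the `_missing` suffix are hypothetical.

```python
from collections import Counter


def impute_with_mode(rows, question):
    """Fill missing answers to `question` with the modal answer and add a missingness flag.

    `rows` is a list of dicts, one per respondent, pooled across all states.
    Assumes at least one respondent answered the question.
    Returns new dicts; the input rows are not modified.
    """
    observed = [r[question] for r in rows if r.get(question) is not None]
    mode = Counter(observed).most_common(1)[0][0]  # most frequent observed answer
    out = []
    for r in rows:
        missing = r.get(question) is None
        new_row = dict(r)
        new_row[question] = mode if missing else r[question]
        new_row[question + "_missing"] = int(missing)  # indicator covariate
        out.append(new_row)
    return out
```

Keeping the indicator alongside the imputed value lets a model learn from the fact that a question was skipped, which the imputed value alone would hide.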

Analysis

Overview.

The overall aim of the analysis was to develop a predictive model for proportions of high school students reporting LGB identity or reporting any same-sex sexual contacts, in order to impute these proportions in states that did not include these items on their 2017 YRBS questionnaires. We trained separate predictive models for each of the two outcome variables, examining a range of alternative modeling approaches, and we evaluated the predictions from alternative approaches using a leave-one-out cross-validation strategy. Based on model selection criteria, we identified the best-performing model for each of the two primary outcomes, and summarized results as observed versus modeled proportions of high school students reporting LGB identity or same-sex sexual contacts. The following sections elaborate on each of these steps, with a high-level overview depicted in Fig 1.

Fig 1. Summary of study approach to training, prediction, and evaluation.

Abbreviations: YRBS, Youth Risk Behavior Survey; LGB, lesbian, gay, or bisexual.

https://doi.org/10.1371/journal.pone.0304175.g001

Predicting individual-level YRBS responses.

To predict state-level proportions of sexual minority youth, we first trained predictive models for each of the two primary outcome variables at the individual level. We compared the performance of the following predictive modeling approaches: ordinary least squares (OLS), logistic regression, least absolute shrinkage and selection operator (LASSO) penalized linear regression, LASSO penalized logistic regression, ridge penalized linear regression, ridge penalized logistic regression, random forest (RF), and gradient boosted regression trees (GBRT).

Hyperparameter tuning was done to minimize mean squared error in n-fold cross-validation, with each state’s data allocated to only one of the folds such that no state’s data was included in both the training and validation datasets while evaluating hyperparameters. This ensured that we found a set of hyperparameters that performed well when predicting responses in a different state. For the penalized regression models, we tuned the lambda hyperparameter, which penalizes the sum of absolute (LASSO) or squared (ridge) coefficients in the fitted model. A higher penalty will force the model to ignore predictors that do not improve the fit of the model but may also reduce the quality of the predictions if the penalty is too high. An optimal lambda value helps the model pick a subset of predictors that leads to the best final predictions.
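
The fold construction described above, in which each state falls in exactly one fold, can be sketched as follows. This is an illustrative assumption about how such folds could be built, not the study code; the round-robin assignment and the function name are hypothetical.

```python
def state_grouped_folds(states, n_folds):
    """Yield (train_idx, val_idx) pairs in which no state appears on both sides.

    `states` holds one state label per respondent. Each distinct state is
    assigned to exactly one fold (round-robin over the sorted state labels).
    """
    unique_states = sorted(set(states))
    fold_of = {s: i % n_folds for i, s in enumerate(unique_states)}
    for fold in range(n_folds):
        val_idx = [i for i, s in enumerate(states) if fold_of[s] == fold]
        train_idx = [i for i, s in enumerate(states) if fold_of[s] != fold]
        yield train_idx, val_idx
```

Splitting by state rather than by respondent matters here because the target use is prediction in a state the model has never seen. Random respondent-level folds would let the model validate on states it was trained on and so overstate performance. scikit-learn's `GroupKFold` offers the same kind of grouped split.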

For ensemble models (RF and GBRT), we tuned maximum depth, number of trees, learning rate (of the boosted trees), and the number of features to sample for each split. Tuning these hyperparameters helps identify a good trade-off between the quality of predictions from each tree against the quality of the average prediction in the ensemble of trees. For each of the two outcome variables, we compared the performance of models that included the other outcome variable as a predictor against models that did not include the other outcome variable.
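
A tuning grid over the hyperparameters named above could be enumerated as in the sketch below. The candidate values are hypothetical placeholders, since the grid actually searched is not reported here, and the key names are illustrative rather than tied to any particular library.

```python
from itertools import product

# Hypothetical candidate values; the source does not report the grid searched.
rf_grid = {
    "max_depth": [4, 8, 16],
    "n_trees": [100, 500],
    "features_per_split": [0.1, 0.3],
}
# The learning rate applies to the boosted trees only.
gbrt_grid = dict(rf_grid, learning_rate=[0.01, 0.1])


def expand_grid(grid):
    """Return every combination of candidate values as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]
```

Each combination returned by `expand_grid` would then be scored by mean squared error on state-grouped validation folds, and the best-scoring combination kept.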

Generating state-level proportions and prediction intervals.

After generating predictions of the two outcome variables for individual YRBS respondents, we created state-level proportion estimates using the survey weights from YRBS. Next, we generated an error distribution around the point estimate for a given state by first calculating the error (residual) of the point estimates for all other states and calculating the mean and standard deviation of these errors. Then, we defined a Student’s t distribution of errors with mean equal to the average residual of the other states, standard deviation equal to the standard deviation of the errors from the other states, and degrees of freedom equal to the number of states represented in the distribution, minus one. This method of defining an error distribution is commonly used to generate prediction intervals for neural networks and performed adequately for our use case based on the coverage of our generated prediction intervals [16].
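
The aggregation step in the first sentence above, from individual predictions to a state-level proportion, can be sketched as a survey-weighted mean. The sketch assumes each respondent's prediction is a number between 0 and 1 and that the YRBS survey weight is available per respondent; the function name is hypothetical.

```python
def weighted_proportion(predictions, weights):
    """Survey-weighted mean of per-respondent predictions for one state."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("survey weights must sum to a positive value")
    return sum(p * w for p, w in zip(predictions, weights)) / total_weight
```

For example, two respondents with predictions 0.2 and 0.1 and weights 1 and 3 give (0.2 × 1 + 0.1 × 3) / 4 = 0.125.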

We generated 95% state prediction intervals with lower bound set at the 2.5th percentile of the error distribution, and upper bound at the 97.5th percentile of the error distribution. We generated separate prediction intervals for three different types of prediction tasks that were used across different subsets of states: (1) prediction based on YRBS data from the same year including the other primary outcome as a predictor, (2) prediction based on YRBS data from the same year without the other outcome variable as a predictor, and (3) prediction based on YRBS data from a previous year without the other outcome variable as a predictor. Thus, a state with 2015 and 2017 YRBS data that included answers to both focal questions would have three prediction intervals for each focal question, one for each type of prediction task.
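
Written out, the interval construction described above takes roughly the following form. The notation is introduced here for illustration and does not appear in the study: \hat{p}_s is the predicted proportion for state s, e_j = p_j - \hat{p}_j is the residual for another state j with observed proportion p_j (the sign convention is an assumption), n is the number of other states contributing residuals, and t_{\nu,q} is the q quantile of a standard Student's t distribution with \nu degrees of freedom. The use of the sample standard deviation with an n - 1 denominator is also an assumption.

```latex
\begin{align}
\bar{e} &= \frac{1}{n} \sum_{j \neq s} e_j, \\
s_e &= \sqrt{\frac{1}{n-1} \sum_{j \neq s} \left(e_j - \bar{e}\right)^2}, \\
\nu &= n - 1, \\
\mathrm{PI}_{95\%}(s) &= \left[\, \hat{p}_s + \bar{e} + t_{\nu,\,0.025}\, s_e,\;\;
                             \hat{p}_s + \bar{e} + t_{\nu,\,0.975}\, s_e \,\right].
\end{align}
```

Because t_{\nu,0.025} = -t_{\nu,0.975}, the interval is symmetric around the bias-corrected point estimate \hat{p}_s + \bar{e}.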

Comparative evaluation of predictive models, model selection, and final state-level estimates.

We evaluated the predictive models using a leave-one-out cross validation approach at the state level. For each state, we trained the individual-level model on data from all other states, and then used the available data from the held-out state to generate a prediction interval for the primary outcome variable (Fig 1). We repeated this for all the states that had observed data for each outcome. Our evaluation metrics were: (1) the intraclass correlation coefficient (ICC) between our point estimates and those observed in the YRBS data, and (2) prediction interval coverage, expressed as the fraction of observed state proportions contained within our generated prediction intervals.
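
The leave-one-state-out loop and the coverage metric described above can be sketched as follows. This is an illustration under stated assumptions, with hypothetical function names. The ICC is omitted because the specific ICC form used is not stated here.

```python
def leave_one_state_out(states):
    """Yield (held_out_state, training_states) for each distinct state."""
    unique_states = sorted(set(states))
    for held_out in unique_states:
        yield held_out, [s for s in unique_states if s != held_out]


def interval_coverage(observed, intervals):
    """Fraction of observed state proportions that fall inside their prediction interval."""
    hits = sum(lo <= obs <= hi for obs, (lo, hi) in zip(observed, intervals))
    return hits / len(observed)
```

For a well-calibrated 95% prediction interval, the coverage computed this way should be close to 0.95 across held-out states.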

To generate the final set of state-level estimates, we selected the predictive model for each outcome that had the highest intraclass correlation coefficient. Using the selected model, we predicted survey responses in states in which there were no survey data available for one or both questions. We summarized results as estimated proportions of high school students reporting LGB identity or any same-sex sexual contacts, including both directly observed survey estimates where available and our model-predicted estimates otherwise. All of the code used to generate our predictions is available online [17].

Ethics statement

The study was reviewed by the Stanford University Institutional Review Board, which determined that the research does not involve human subjects as defined in 45 CFR 46.102(f) or 21 CFR 50.3(g).

Results

Model-fitting results

The training dataset included 382,251 responses for the survey item on reported LGB identity, and 320,410 responses for the survey item on sex of sexual contacts. Many states only had training data for one or two years, and the number of questions asked in each state varied (Table 1).

When predicting state-level proportions, the RF and GBRT models had the highest ICCs on both outcome variables, while OLS and logistic regression had the lowest. Penalized linear and logistic models had ICCs that fell between these extremes (Tables 2, 3 and S3, S4 Tables). As expected, ICCs were higher when including the other outcome variable as a predictor. We used the RF over GBRT to predict outcomes in states without outcome data as the RF ICC values were slightly higher in nearly every case.

Table 2. Evaluation results from eight algorithms predicting the proportion of students in grades 9–12 reporting lesbian, gay, or bisexual identity in 2017.

https://doi.org/10.1371/journal.pone.0304175.t002

Table 3. Evaluation results from eight algorithms predicting the proportions of students in grades 9–12 reporting any same-sex sexual contacts in 2017.

https://doi.org/10.1371/journal.pone.0304175.t003

S5 Table reports the top 20 predictors across the two full RF models based on variable importance (permutation) scores. S1 Fig shows how the performance of the different prediction algorithms related to the number of predictors available. The predictive performance of the tree-based algorithms surpassed that of the other algorithms as the number of predictors available increased.

For states with observed proportions of respondents reporting LGB identity, the mean absolute error (MAE) of out-of-sample predictions was 0.76 percentage points (pp) when predicting using data from the same year and including the answer to the other focal question (same-sex sexual contacts), 1.05pp when predicting using data from the same year but without the other focal question, and 0.85pp when predicting using data from a previous year without the other focal question. Analogous MAEs for the same-sex sexual contact proportions were 0.61pp, 0.88pp, and 0.77pp, respectively.
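
The MAE in percentage points reported above can be computed from observed and predicted state proportions as in this sketch. The function name is hypothetical and the inputs are assumed to be proportions on a 0 to 1 scale.

```python
def mean_absolute_error_pp(observed, predicted):
    """Mean absolute error between proportions, expressed in percentage points."""
    errors = [abs(o - p) for o, p in zip(observed, predicted)]
    return 100 * sum(errors) / len(errors)
```

With illustrative inputs, observed proportions of 0.10 and 0.12 against predictions of 0.11 and 0.10 give absolute errors of 1pp and 2pp, for an MAE of 1.5pp.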

Coverage of the out-of-sample prediction intervals for the RF models ranged between 91% and 96% depending on the outcome and prediction dataset. For every state, the observed proportion of students reporting LGB identity was greater than the observed proportion reporting any same-sex sexual contact. We verified that the predictions preserved this relationship.

State-level estimates of proportions reporting LGB identity or same-sex sexual contacts

Figs 2 and 3 show observed and predicted proportions by state using the RF models. S6 and S7 Tables report state-level observed intervals from the YRBS and our prediction intervals. Observed proportions of respondents reporting LGB identity ranged between 8.4% and 13.4%, while the predicted proportions across states without observed proportions, and across different prediction tasks, ranged between 9.4% and 12.9%. Observed proportions of respondents reporting any same-sex sexual contacts ranged between 5.3% and 10.9%, while the predicted proportions where not observed ranged between 7.0% and 10.4%. Results for sex-stratified models of proportions reporting any same-sex sexual contacts are reported in S8 and S9 Tables. Predicted proportions for males where not observed ranged between 3.8% and 8.0%, and for females between 9.7% and 13.3%.

Fig 2. Observed (Panel A) and predicted (Panel B) proportions of students in grades 9–12 reporting lesbian, gay, or bisexual identity in 2017.

https://doi.org/10.1371/journal.pone.0304175.g002

Fig 3. Observed (Panel A) and predicted (Panel B) proportions of students in grades 9–12 reporting any same-sex sexual contacts in 2017.

https://doi.org/10.1371/journal.pone.0304175.g003

Discussion

Using a machine learning approach and recent state YRBS data, we generated population estimates of sexual minority adolescents for 47 states. Because of the high salience of these estimates for public health monitoring and program planning, our goal was to supplement data that are available for only a subset of states with predicted estimates for states that do not have direct observations on these measures but do have YRBS data that can inform these predictions.

Comparing across different possible methods for predicting responses to survey items on reported LGB status and sex of sex partners, we found that tree-based ensemble methods commonly used in machine learning applications consistently outperformed traditional linear regression-based models for our prediction tasks. As noted in other areas of research, it is important to benchmark the performance of machine learning models against simpler alternatives [18], and in this application, the additional complexity produced substantial improvements in predictive performance. As in previous studies [6, 7, 19, 20], some of the variables that were most highly correlated with the primary outcome variables were related to suicidal ideation, mental health, substance use, and violence victimization (S5 Table).

Our finding that tree-based ensemble models outperformed other machine learning models is consistent with previous studies focusing on similar binary classification tasks [21, 22]. Analogizing to regression-based approaches, this finding suggests that the relationships between responses to individual YRBS questions and the reported LGB identity and same-sex sexual contacts questions are better modeled as a set of complex non-linear interactions than as a traditional penalized linear model. We found that linear methods designed for high-dimensional data (LASSO, ridge) outperformed the un-penalized OLS and logistic models but did not perform as well as the tree-based ensemble models, suggesting linear penalization is not sufficient to excel at our prediction task.

Limitations

Our study has several limitations. First, while we modelled responses to items on the YRBS questionnaire, we did not attempt to characterize or adjust for the accuracy of these responses. Previous studies have indicated that willingness to report LGB status may vary systematically across individuals [23]. Second, there are known concerns about the representativeness of surveys relating to sexual minorities [24], which may be exacerbated by non-response at either the item or survey levels. Third, we assumed state-level errors in predicted proportions were distributed identically across states due to our relatively limited numbers of observation units, which may result in mischaracterization of the uncertainty intervals around state-specific estimates in our study.

Another set of limitations derives from the nature of the prediction task. Predicting the size of LGB youth populations from survey data is difficult because reported LGB identity is not highly correlated with the responses to any individual YRBS question. Even when predicting reported LGB identity with the responses to all other YRBS questions and their interactions, no individual was assigned a very high predicted probability of reporting LGB status. Without such strong correlations, prediction will always be difficult, and will result in substantial prediction errors and wide uncertainty intervals, as in our results. It is worth noting, however, that our measurement goal was to develop aggregate-level predictions that may be used by public health decision-makers and not to impute individual-level responses. Among other considerations, this distinction has an important ethical dimension in that public-use research datasets from surveys must meet high standards for protection of privacy and confidentiality of respondents.

We also recognize the complexity of measuring sexual orientation; while we focused on predicting responses to two particular survey measures that operationalize identity-based or behavior-based dimensions of sexual orientation, there is a rich discussion in the literature on the challenges of survey measurement in this domain that is beyond the scope of our study [25, 26]. Finally, we acknowledge that the predicted population sizes of LGB minority youth in this study provide only a starting point for further analyses that can inform public health programs and priorities. An important extension of this work will be to examine possibilities for estimates that are disaggregated below state-level, for example by race and ethnicity, to enable deeper examination of disparities.

Conclusion

Accurate estimates of LGB population are necessary for informed policy making. Predicted LGB population estimates in this study can raise awareness among state policy makers and the public about what proportion of youth may be exposed to disparate health risks associated with sexual minority status. These estimates support development and implementation of policies that are inclusive and address the health needs of LGB individuals. Understanding the prevalence of sexual minority identities among adolescents can help in advocating for anti-discrimination policies and practices that promote equality and inclusivity, which in turn can have positive effects on the health and well-being of LGB individuals. Estimates of sexual minority youth populations also supply denominators for calculating HIV and STI prevalence and incidence among this group, provide parameter estimates for modeling the impact of various prevention programs, and help public health practitioners develop and implement evidence-based interventions to effectively reduce the burden of infections, likelihood of experiencing violence, and adverse mental health outcomes. Ultimately, the results from this study can help decision makers in public health and education agencies understand the need for community and school interventions to improve the overall health of LGB youth, for example, all-inclusive sex education, positive youth development programs, and gay-straight alliances.

Supporting information

S1 Table. Youth Risk Behavior Survey data availability by state and year.

https://doi.org/10.1371/journal.pone.0304175.s001

(PDF)

S2 Table. Data used to predict sex-stratified proportions of students in grades 9–12 reporting any same-sex sexual contacts in the United States in 2017.

https://doi.org/10.1371/journal.pone.0304175.s002

(PDF)

S3 Table. Evaluation results from eight algorithms predicting the proportions of male students in grades 9–12 reporting any same-sex sexual contacts in 2017.

https://doi.org/10.1371/journal.pone.0304175.s003

(PDF)

S4 Table. Evaluation results from eight algorithms predicting the proportions of female students in grades 9–12 reporting any same-sex sexual contacts in 2017.

https://doi.org/10.1371/journal.pone.0304175.s004

(PDF)

S5 Table. Top 20 predictors based on variable importance scores predicting reported lesbian, gay, or bisexual identity and reporting any same-sex sexual contacts among students in grades 9–12 using YRBS data from 2013–2017.

https://doi.org/10.1371/journal.pone.0304175.s005

(PDF)

S6 Table. Observed and predicted proportions of students in grades 9–12 reporting lesbian, gay, or bisexual identity in 2017, by state and prediction data.

https://doi.org/10.1371/journal.pone.0304175.s006

(PDF)

S7 Table. Observed and predicted proportions of students in grades 9–12 reporting any same-sex sexual contacts in 2017, by state and prediction data.

https://doi.org/10.1371/journal.pone.0304175.s007

(PDF)

S8 Table. Observed and predicted proportions of male students in grades 9–12 reporting any same-sex sexual contacts in 2017, by state and prediction data.

https://doi.org/10.1371/journal.pone.0304175.s008

(PDF)

S9 Table. Observed and predicted proportions of female students in grades 9–12 reporting any same-sex sexual contacts in 2017, by state and prediction data.

https://doi.org/10.1371/journal.pone.0304175.s009

(PDF)

S1 Fig. Prediction loss by number of predictors and prediction algorithm.

https://doi.org/10.1371/journal.pone.0304175.s010

(PDF)

Acknowledgments

YRBS data used in these analyses were obtained from the Centers for Disease Control and Prevention, with the following exceptions: additional permissions were requested from the Connecticut Department of Public Health, Georgia Department of Public Health, Indiana Department of Health, Maryland Department of Health and Mental Hygiene, Massachusetts Department of Elementary and Secondary Education, New Mexico Department of Health, Ohio Department of Health, Texas Department of State Health Services, and the Vermont Department of Health, supported by the Centers for Disease Control and Prevention (CDC).

Disclaimer: The findings and conclusions in this article are those of the authors and do not necessarily represent the views of the CDC.

Use of YRBS data obtained from the Connecticut Department of Public Health, Georgia Department of Public Health, Indiana Department of Health, Maryland Department of Health and Mental Hygiene, Massachusetts Department of Elementary and Secondary Education, New Mexico Department of Health, Ohio Department of Health, Texas Department of State Health Services, and the Vermont Department of Health, supported by CDC, does not imply that the Departments or CDC agree or disagree with the analyses, interpretations or conclusions in this publication.

References

1. Newby H, Marsh AD, Moller AB, Adebayo E, Azzopardi PS, Carvajal L, et al. A scoping review of adolescent health indicators. J Adolesc Health. 2021;69(3):365–74. pmid:34272169
2. Kann L, McManus T, Harris WA, Shanklin SL, Flint KH, Queen B, et al. Youth Risk Behavior Surveillance—United States, 2017. MMWR Surveill Summ. 2018;67(8):1–114. pmid:29902162
3. Meyer IH, Frost DM. Minority stress and the health of sexual minorities. In: Patterson CJ, D’Augelli AR, editors. Handbook of psychology and sexual orientation. New York: Oxford University Press; 2013. pp. 252–66.
4. Herek GM. Sexual stigma and sexual prejudice in the United States: A conceptual framework. In: Hope DA, editor. Contemporary perspectives on lesbian, gay, and bisexual identities. New York: Springer Science + Business Media; 2009. pp. 65–111.
5. Mustanski B, Van Wagenen A, Birkett M, Eyster S, Corliss HL. Identifying sexual orientation health disparities in adolescents: analysis of pooled data from the Youth Risk Behavior Survey, 2005 and 2007. Am J Public Health. 2014;104(2):211–7. pmid:24328640
6. Kann L, Olsen EO, McManus T, Harris WA, Shanklin SL, Flint KH, et al. Sexual identity, sex of sexual contacts, and health-related behaviors among students in grades 9–12—United States and selected sites, 2015. MMWR Surveill Summ. 2016;65(9):1–202. pmid:27513843
7. Johns MM, Lowry R, Rasberry CN, Dunville R, Robin L, Pampati S, et al. Violence victimization, substance use, and suicide risk among sexual minority high school students—United States, 2015–2017. MMWR Morb Mortal Wkly Rep. 2018;67(43):1211–5. pmid:30383738
8. Rasberry CN, Lowry R, Johns M, Robin L, Dunville R, Pampati S, et al. Sexual risk behavior differences among sexual minority high school students—United States, 2015 and 2017. MMWR Morb Mortal Wkly Rep. 2018;67(36):1007–11. pmid:30212446
9. Underwood JM, Brener N, Thornton J, Harris WA, Bryan LN, Shanklin SL, et al. Overview and methods for the Youth Risk Behavior Surveillance System—United States, 2019. MMWR Suppl. 2020;69(1):1–10. pmid:32817611
10. Centers for Disease Control and Prevention. HIV Surveillance Report, 2019; vol. 32; 2021.
11. Centers for Disease Control and Prevention. Sexually Transmitted Disease Surveillance, 2021; 2023.
12. Charlton BM, Corliss HL, Missmer SA, Rosario M, Spiegelman D, Austin SB. Sexual orientation differences in teen pregnancy and hormonal contraceptive use: an examination across 2 generations. Am J Obstet Gynecol. 2013;209(3):204.e1–8. pmid:23796650
13. Everett BG, Turner B, Hughes TL, Veldhuis CB, Paschen-Wolff M, Phillips G 2nd. Sexual orientation disparities in pregnancy risk behaviors and pregnancy among sexually active teenage girls: updates from the Youth Risk Behavior Survey. LGBT Health. 2019;6(7):342–9. pmid:31618165
14. Kann L, Kinchen S, Shanklin SL, Flint KH, Kawkins J, Harris WA, et al. Youth risk behavior surveillance—United States, 2013. MMWR Suppl. 2014;63(4):1–168.
15. Centers for Disease Control and Prevention. Youth Risk Behavior Surveillance System (YRBSS) 2017 National, State, and District Combined Dataset User’s Guide 2017. 2018 [cited 2024 May 17]. Available from: https://www.cdc.gov/healthyyouth/data/yrbs/pdf/2017/2017_yrbs_sadc_documentation.pdf.
16. Chryssolouris G, Lee M, Ramsey A. Confidence interval prediction for neural network models. IEEE Trans Neural Netw. 1996;7(1):229–32. pmid:18255575
17. Prevention Policy Modeling Lab; 2024 [cited 2024 May 17]. Available from: https://github.com/PPML.
18. Salganik MJ, Lundberg I, Kindel AT, Ahearn CE, Al-Ghoneim K, Almaatouq A, et al. Measuring the predictability of life outcomes with a scientific mass collaboration. Proc Natl Acad Sci U S A. 2020;117(15):8398–403. pmid:32229555
19. Johns MM, Lowry R, Haderxhanaj LT, Rasberry CN, Robin L, Scales L, et al. Trends in violence victimization and suicide risk by sexual identity among high school students—Youth Risk Behavior Survey, United States, 2015–2019. MMWR Suppl. 2020;69(1):19–27. pmid:32817596
20. Lowry R, Johns MM, Gordon AR, Austin SB, Robin LE, Kann LK. Nonconforming gender expression and associated mental distress and substance use among high school students. JAMA Pediatr. 2018;172(11):1020–8. pmid:30264092
21. Caruana R, Niculescu-Mizil A. An empirical comparison of supervised learning algorithms. Proceedings of the 23rd international conference on machine learning. New York: Association for Computing Machinery; 2006. pp. 161–8.
22. Zhang C, Liu C, Zhang X, Almpanidis G. An up-to-date comparison of state-of-the-art classification algorithms. Expert Systems with Applications. 2017;82:128–50.
23. Cimpian JR, Timmer JD, Birkett MA, Marro RL, Turner BC, Phillips GL 2nd. Bias from potentially mischievous responders on large-scale estimates of lesbian, gay, bisexual, or questioning (LGBQ)-heterosexual youth health disparities. Am J Public Health. 2018;108(S4):S258–S65. pmid:30383423
24. Rendina HJ, Talan AJ, Tavella NF, Matos JL, Jimenez RH, Jones SS, et al. Leveraging technology to blend large-scale epidemiologic surveillance with social and behavioral science methods: successes, challenges, and lessons learned implementing the UNITE longitudinal cohort study of HIV risk factors among sexual minority men in the United States. Am J Epidemiol. 2021;190(4):681–95. pmid:33057684
25. Saewyc EM, Bauer GR, Skay CL, Bearinger LH, Resnick MD, Reis E, et al. Measuring sexual orientation in adolescent health surveys: evaluation of eight school-based surveys. J Adolesc Health. 2004;35(4):345.e1–15. pmid:15830439
26. National Academies of Sciences, Engineering, and Medicine. Measuring sex, gender identity, and sexual orientation. Bates N, Chin M, Becker T, editors. Washington, DC: The National Academies Press; 2022.
