Initial validity and reliability testing of the SGBA-5

Andrew Putman; Adam Cole; Shilpa Dogra

doi:10.1371/journal.pone.0323834

Abstract

Background

A growing body of research indicates that sex (biological) and gender (sociocultural) influence health through a variety of distinct mechanisms. Sex- and Gender-Based Analysis (SGBA) techniques could examine these influences; however, there is a lack of nuanced and easily implementable measurement tools for health research. To address this gap, we created the Sex- and Gender-Based Analysis Tool – 5 item (SGBA-5).

Objectives

This research aims to assess the validity and reliability of the SGBA-5 for use in health sciences research where sex or gender are not primary variables of interest.

Methods

A Delphi consensus study was conducted with Canadian researchers (n = 14). The Delphi experts rated the validity of each SGBA-5 item on a 5-point Likert scale each round, receiving summary statistics of other experts’ responses after the first round. A conservative threshold for consensus agreement (75% rating an item 4+ of 5) was used given the novelty of this scale’s items. Reliability was assessed through a two-armed test-retest study. The university student arm (n = 89) was conducted in-person (on paper), and the older adult arm (n = 71) was conducted online (digitally).

Results

The Delphi study ended after three rounds; experts reached consensus agreement on the validity of the biological sex item of the SGBA-5 (93%) and consensus non-agreement on each of the gendered aspect of health items (identity: 64%, expression: 64%, roles: 50%, relations: 57%). Both the student arm (sex item: , gendered items: ) and the older adult arm (sex item: , gendered items: ) of the test-retest study indicated that all items were reliable.

Conclusions

The novel SGBA-5 tool demonstrated reliability across all scale items and validity of the biological sex item. The gendered aspects of health items may be valid. Future research can further develop the SGBA-5 as a tool for use in health research.

Citation: Putman A, Cole A, Dogra S (2025) Initial validity and reliability testing of the SGBA-5. PLoS One 20(5): e0323834. https://doi.org/10.1371/journal.pone.0323834

Editor: Pasyodun Koralage Buddhika Mahesh, Ministry of Health, Sri Lanka, SRI LANKA

Received: April 1, 2024; Accepted: April 16, 2025; Published: May 16, 2025

Copyright: © 2025 Putman et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Data cannot be shared publicly because of the privacy concern of the potential for participant re-identification from disaggregated anonymized data. Data are available from the authors upon approval from the Ontario Tech Research Ethics Board (contact via researchethics@ontariotechu.ca) to release data to researchers who meet the criteria for access to confidential data.

Funding: The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

In health sciences research, Sex- and Gender-Based Analysis (SGBA) is an umbrella term for the collection of research methods and analytic techniques that can provide insight into how sex and gender interact with health. In this context, sex is defined as biological characteristics in living beings that relate to sexual reproduction, and gender is defined as the socio-cultural expectations, roles, expressions, and identities that are associated with women, men, and gender diverse people [1]. Many leading organizations - including the Canadian tri-council funding agencies, the National Institutes of Health (US), the European Association of Science Editors, and the World Health Organization - have encouraged and prioritized systematic inclusion of SGBA. In fact, all of them have created specific policies promoting the inclusion of SGBA in health research [2–9].

Despite these collective efforts, the integration of SGBA into health research has not been universal. For example, in 2009 the Canadian Institutes of Health Research (CIHR) began implementation of their 10-year SGBA action plan; this plan required all grant applicants to complete SGBA training modules prior to submitting their applications [2,10,11]. Haverfield and Tannenbaum (2021) analyzed more than 39,000 grant applications to the CIHR from 2011 to 2019 and found that while the inclusion of sex-based analysis increased from 22% of grant applications in 2011 to 83% in 2019, the inclusion of gender-based analysis went from 12% in 2011 to 33% in 2019 [10]. The relatively small increase in the inclusion of gender-based analysis in grant applications suggests that researchers are finding it harder to integrate gender-based analysis into their study designs than sex-based analysis. It was further noted that despite completing the required SGBA training modules, a portion of both the grant applicants and the grant evaluators demonstrated a lack of comprehension of core SGBA principles, such as conflating sex and gender [10]. This gap between the goal of SGBA implementation by funding agencies and research entities, and the integration by researchers in individual studies and grant applications, is influenced by many factors, including the limited number of valid and reliable tools that can be used for SGBA.

Several measurement tools have been created over the past 50 years that aim to assess the effects of sex and/or gender in a health context. However, there are several limitations to these measurement tools. Common limitations are that these tools can be lengthy, invasive, overly reliant on stereotypes, offensive, demeaning, or based on outdated conceptions of sex and gender [6,12,13]. A more detailed analysis of strengths and limitations of existing tools is available elsewhere [14]. More specifically, there is currently a lack of valid and reliable tools that are concise while allowing for differentiation between biological sex and gender as a social determinant of health in research where sex or gender are not the primary focus. For example, in 2022 the National Academy of Sciences (US) recommended that researchers use a two-step method of nominal categorical responses where participants report their sex assigned at birth and gender identity separately [15]. The recommendation of the two-step method is supported by evidence of its validity and reliability for use in population-level and census questionnaires [15–17]. Unfortunately, these kinds of categorical indicator variables can only provide meaningful differentiation between sex-based and gender-based effects in studies that have very large sample sizes, an inherent limitation of using disaggregation analyses. Disaggregation-based analyses do not allow for the level of detailed insight that is possible from more complex conceptualizations of sex and gender – which require more intricate scale measures or qualitative work [18]. Thus, while the use of nominal categories and disaggregated analysis in SGBA can provide some insight into differences between sex and/or gender categorizations in large, population-scale surveys, it cannot provide the level of granular detail needed for investigation of the multidimensional aspects and within-category variation that are inherent in social constructs like gender [6].

To address this gap, we created the Sex- and Gender-Based Analysis Tool – 5 Item (SGBA-5). This tool is proposed as one way to conduct SGBA in health research studies based on current evidence of how sex and gender influence health but is certainly not the only way to do so. There are a multitude of ways to model and conceptualize sex and gender, and how they impact health; the SGBA-5 is one of many such tools that are needed to address the breadth of potential SGBA implementations in the health sciences [2,6,12]. Thus, the purpose of this study was to assess the validity and reliability of the SGBA-5 for use in health sciences research where sex or gender are not primary variables of interest.

Methods

The first iterations of the SGBA-5 were developed alongside a thorough literature review and small-group feedback. The more formal assessment of the SGBA-5 began with a Delphi expert consensus study on the SGBA-5’s validity, and then continued with a test-retest study to assess its reliability. These steps and their relation to the steps in creation of a novel measurement tool (scale) are visualized in Fig 1 [18–20].

Fig 1. Visual representation of the cyclical process of novel scale development.

This diagram shows the steps involved in novel scale creation, beginning with Item Development and then progressing to an ongoing cycle of Scale Development and Scale Assessment. This cycle of Scale Development and Scale Assessment continues alongside use of the scale to ensure the scale’s validity and reliability as well as to assess the suitability of the scale’s use in different contexts. The diagram also indicates the steps of novel scale development associated with the initial design and testing of the SGBA-5 described in this paper.

https://doi.org/10.1371/journal.pone.0323834.g001

Initial design and development

As is typical with the development of a scale that does not have direct comparator scales to draw upon, potential items for inclusion in the SGBA-5 were drawn from extensive reviews of the peer-reviewed and grey literature relating to biological sex, gendered aspects of health, measurement of multidimensional health impacts, continuous scale creation, and scale validation methodologies [8,12,15,18,19,21–29]. One of the key takeaways from this earlier investigation was that at present, while there is theoretical backing for a variety of ways in which to operationalize measurement of aspects of gender that may influence health [2,6,12], there is not a similar theoretical background for the measurement of different aspects of biological sex in general health research.

The SGBA-5 consists of a categorical option for biological sex (male/female/intersex) and four gendered aspect of health constructs measured on a visual analogue scale that depicts a feminine-masculine continuum. The wording and type of measurement for each item that is included in the SGBA-5 is presented in Table 1 and is included in S1 File. The categorical option of biological sex was selected as the authors could not find a theoretical foundation on which to base a more complex understanding of biological sex that is both appropriate for use in a general population and answerable on a questionnaire without a priori knowledge of specific biological test results. A short list of potential gendered aspects of health to include in the SGBA-5 was derived from a selection of the existing academic literature, white papers, and SGBA policies that were identified in the initial background literature review [2,6,8,12,13,26]. The four gender constructs included in the SGBA-5 were chosen because evidence was found in the literature to support their proposed pathway of health impact (gender identity [30,31], gender expression [32,33], gender roles [34,35], and gendered relations [36,37]). Institutionalized gender as a gendered aspect of health was considered but not included in the SGBA-5 as institutional-level impacts are best assessed at a legislative or community level rather than on an individual level [6,12].

Table 1. Description of the items included in the SGBA-5.

https://doi.org/10.1371/journal.pone.0323834.t001

As the CIHR Institute of Gender and Health notes in its definitions of sex and gender in the context of health research, “[gender] is not confined to a binary (girl/woman, boy/man) nor is it static; it exists along a continuum and can change over time” [1]. To attempt to capture more of this variation than nominal items alone could provide, feminine – masculine analogue (continuous) scales were used to represent the four gendered aspects of health constructs that were included in the SGBA-5; they are not assumed to have true zero [0] values. We imposed this interpretation constraint on the SGBA-5’s analogue measures with the aim of mitigating potential over-interpretation of differences between groups or individuals who complete the SGBA-5 as a part of a health study where sex or gender are not primary variables of interest. Despite the limitations of assuming that the analogue measures do not have a true zero value, these measures can still provide more information on group and individual-level variation than would be possible from a nominal categorization item alone [18].

We aimed to design a tool for within-study analysis of the impacts that sex or gender have on the study’s primary outcomes. The SGBA-5 is not designed to ‘determine’ any individual participant’s sex or gender, nor is it designed to be sensitive enough for research where sex or gender are the primary variable[s] in the study. The SGBA-5 is not a replacement for dedicated work with marginalized sex and gender communities (e.g., trans & nonbinary individuals, intersex persons, etc.); however, the SGBA-5 is designed so that it can be completed by persons of sex and gender minorities as part of studies where sex or gender are not primary variables. The SGBA-5 is designed with the intention to be integrated into clinical trials and research studies, enabling researchers to facilitate the inclusion of multiple dimensions of SGBA into their studies. Completing the SGBA-5 requires one to two minutes, and thus should not be onerous for the participant or researcher to use. Additional information on scale methodology rationale, appropriate application, and interpretation of the SGBA-5 is also included in S1 File.

After developing the first full version of the SGBA-5, we informally presented it to health sciences researchers at a Faculty seminar (n ≈ 30), and at a multi-university journal club meeting (n ≈ 25) to obtain feedback on the face validity of the SGBA-5. The five selected item topics and their respective methods of measurement remained consistent throughout the small-group feedback stage of development, but the presentation and phrasing of those items were updated and iterated upon throughout.

Delphi expert consensus

More rigorous initial evaluation of the SGBA-5’s suitability for use in health research involved a Delphi expert consensus of Canadian health researchers to assess the content validity of the SGBA-5 for within-sample SGBA. Generally, a Delphi study of content validity consists of a minimum of three rounds in which experts independently evaluate each proposed scale item and score it using a Likert scale or pass/fail rating [19,39]. These evaluations are communicated to the Delphi researchers who pool and assess the expert feedback. From the second round onward, the researchers typically provide the Delphi experts with descriptive statistics and/or qualitative summaries from the previous round’s anonymized expert feedback [40,41]. This anonymized feedback provides experts with an overview of the other Delphi experts’ opinions which they can use to inform their evaluation of that round. A Delphi study is halted once the experts reach a stable consensus, or if a pre-determined maximum number of rounds has been reached (not used in this study), or if the between-round differences in the Delphi experts’ ratings drop below a predetermined threshold (used in this study) [39–41].

The purpose of this Delphi Expert Consensus study was to receive feedback on the construct validity of the SGBA-5’s scale items from a sample of Canadian health sciences researchers (this study’s Delphi experts and the most likely initial users of the SGBA-5). This process provided evidence that each item measures what it is proposed to measure (an item’s content validity) [18,19]. In accordance with the threshold of evidence used to initially select the items for inclusion in the SGBA-5, it was decided a priori that any major changes (e.g., adding a new item, switching from continuous to ordinal measures, etc.) resulting from the Delphi experts’ feedback must reflect evidence in the current health sciences literature.

Participants

Beginning January 10th, 2023, the authors (AP and SD) contacted health science Deans at institutions across Canada and requested that they recommend 1–2 potential participants. Specifically, we asked them to identify researchers who had expertise in conducting health research studies with human participants, and who had been involved in CIHR-funded research in the past (to ensure familiarity with the CIHR sex and gender definitions in health research). By the end of participant recruitment on February 14th, 2023, 32 researchers had been recommended to the authors as potential experts for the Delphi, 17 of whom consented to participate in the study. Fourteen experts (82%) participated in all three rounds. The Delphi experts represented institutions spanning seven provinces and one territory. All participants provided written consent prior to participation in the study. The Delphi study was reviewed by and conducted in compliance with the regulations of the Ontario Tech University Research Ethics Board (REB # 17153).

Procedure

In the first round, participants were emailed a link to a survey in which they were presented with each of the items from the SGBA-5 as well as the instruction page that would be provided to researchers administering the SGBA-5. Participants were asked to score each scale item from the SGBA-5 on a 1–5 Likert scale, with a score of 1 being “This question is not a valid measure of [scale item] for SGBA in health research”, and a score of 5 being “This question is a valid measure of [scale item] for SGBA in health research”. The participants were also able to provide optional written feedback on the construct validity of individual scale items or of the SGBA-5 as a whole.

In the second and third rounds, participants were asked to conduct the same rating exercise and were also provided with optional supplementary documentation that further explained the rationale for the SGBA-5, as well as a short Question & Answer-style summary of the SGBA-5’s creation, and appropriate use cases. In these rounds, participants were shown the median and interquartile ranges of the Likert scores from the previous round when rating each scale item. Furthermore, minor adjustments were made to the SGBA-5’s formatting or phrasing between rounds based on comments from the previous round.

Statistical analysis

The threshold for consensus agreement was set a priori to 75%. That is, the expert rating of each item from the SGBA-5 had to be: 1) rated at least 4 out of 5 on the Likert scale, and 2) have a test demonstrating inter-round answer stability at after a minimum of three rounds. Alternatively, the Delphi study would also be halted if the between-round difference in the coefficient of variance () was , which would be an indication that the researchers have reached a stable non-agreement consensus. These thresholds are in line with Delphi Expert Consensus best practices and intentionally use more conservative target measures to avoid overestimation of expert consensus [18,19,39–41].

The statistical tests for consensus () and overall summary statistics (median, IQR) from and between each round were calculated after the completion of each round in the Delphi study.
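The consensus rule and between-round summaries described above can be illustrated with a minimal Python sketch. The 75% threshold and the "4 or higher out of 5" cutoff come from the text; the function names, the example ratings, and the use of the sample standard deviation in the coefficient of variation are assumptions made for illustration only. The stability threshold itself is not reproduced because its value is not recoverable from the text, and the original analysis was run in R rather than Python.

```python
from statistics import mean, median, quantiles, stdev

CONSENSUS_THRESHOLD = 0.75  # share of experts required (stated in the text)
VALID_RATING_CUTOFF = 4     # a rating of 4 or 5 counts as endorsing validity


def proportion_endorsing(ratings, cutoff=VALID_RATING_CUTOFF):
    """Share of experts who rated the item at or above the cutoff."""
    return sum(r >= cutoff for r in ratings) / len(ratings)


def reaches_consensus(ratings, threshold=CONSENSUS_THRESHOLD):
    """True when the endorsing share meets the consensus threshold."""
    return proportion_endorsing(ratings) >= threshold


def coefficient_of_variation(ratings):
    """Sample standard deviation divided by the mean (assumed definition)."""
    return stdev(ratings) / mean(ratings)


def round_summary(ratings):
    """Median and interquartile range, as fed back to experts each round."""
    q1, _, q3 = quantiles(ratings, n=4, method="inclusive")
    return {"median": median(ratings), "iqr": (q1, q3)}


# Hypothetical ratings from 14 experts for one item in two successive rounds.
item_round2 = [5, 5, 4, 4, 4, 5, 4, 3, 5, 4, 4, 5, 4, 2]
item_round3 = [5, 5, 4, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 2]

share = proportion_endorsing(item_round3)
cv_change = abs(coefficient_of_variation(item_round3)
                - coefficient_of_variation(item_round2))

print(f"Endorsing share: {share:.0%}, consensus: {reaches_consensus(item_round3)}")
print(f"Round summary: {round_summary(item_round3)}")
print(f"Between-round change in CV: {cv_change:.3f}")
```

With 14 experts, each expert contributes about 7 percentage points, so an item needs at least 11 ratings of 4 or 5 to clear the 75% threshold.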

Test-retest study

A test-retest component aimed to assess the reliability of each scale item in the SGBA-5 in two populations.

Participants

The test-retest reliability of the SGBA-5 was assessed in two separate populations, university students and older adults. These two populations were recruited and assessed separately in an in-person student arm, and a virtually administered older adult arm.

Participants in the student arm were recruited from the student participant pools of the Kinesiology and Psychology programs at Ontario Tech University between September 4th, 2023, and November 13th, 2023. Inclusion was limited to those who were a current student at Ontario Tech University, those able to come to two in-person sessions, and those able to communicate in English. The older adult arm was recruited between September 11th, 2023, and February 2nd, 2024 through Ontario Tech University’s Age-Friendly Campus email newsletter and through Facebook advertisements targeting older adults in the Durham Region. Participants were eligible for the older adult arm if they were 55 years of age or older, and were capable of completing, or had access to assistance in completing, the SGBA-5 via a web-hosted survey (the URL was emailed to participants after they indicated to the research team that they were interested in participating). All participants in both arms provided written consent prior to participation in the study. The test-retest study was approved by and conducted in compliance with the regulations of the Ontario Tech University Research Ethics Board (REB # 17477).

Procedure

Eligible participants completed the SGBA-5 twice, at least two weeks apart; they also completed a demographic questionnaire prior to completing the SGBA-5 for the first time. For the students, the SGBA-5 was completed on paper in-person; for the older adults, the SGBA-5 was completed online.

Statistical analysis

The primary test statistics used to evaluate the test-retest reliability in each sample were Cohen’s kappa (κ) coefficient of agreement for the categorical sex variable and the intraclass correlation coefficient of agreement (ICC) for the four gendered aspect of health continuum variables at . The thresholds for determining appropriate reliability of the SGBA-5 for use in research were set a priori as and [18,42–45]. P-values are reported but were not used as a threshold to determine scale item reliability because the magnitude of the κ and ICC coefficients (how similar a participant’s scores are between the test and retest) more directly assesses a measurement tool’s reliability than calculating the probability that that tool’s results could be due to chance [18,44]. Secondary reliability analyses of the tool were conducted to quantify the standard error of measurement (SE_M) for each of the four gendered aspect of health continuums, and sensitivity analyses were conducted using the sample’s demographic variables. SE_M results are presented as percentages of the feminine to masculine continuum used to quantify the four gendered aspects of health addressed in this tool. These SE_M percentages reflect the minimum difference (measured in % of the full continuum) between two observations of the same individual that would be needed to find a difference over time that is unlikely to be explained entirely by error [44]. Since this validity and reliability analysis is not assessing whether there are meaningful changes in participants’ experiences of these gendered aspects over time, the SE_M percentages are instead presented to help demonstrate the typical amount of variation that could be expected in an individual participant’s answers to the gendered aspect of health scale items. This allows for more confidence in identifying when there are meaningful differences between participants, i.e., if the difference between the responses of participant 1 and participant 2 is larger than the SE_M of that scale item, then we can suggest that there is likely a true difference between participants for that scale item.
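As an illustration of the three reliability quantities named above, the following Python sketch computes Cohen's kappa for a categorical item, an agreement-type intraclass correlation for a continuous item, and a standard error of measurement. Several choices here are assumptions rather than details confirmed by the text: the ICC form (two-way, absolute agreement, single measurement), the formula SE_M = SD × sqrt(1 − ICC), the function names, and all example data. The authors' own analysis used the R package irr, so this is a sketch of the underlying calculations, not a reproduction of their code.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_kappa(test, retest):
    """Chance-corrected agreement between two categorical ratings."""
    n = len(test)
    observed = sum(a == b for a, b in zip(test, retest)) / n
    categories = set(test) | set(retest)
    expected = sum((test.count(c) / n) * (retest.count(c) / n)
                   for c in categories)
    # Undefined (division by zero) if everyone gives one identical category.
    return (observed - expected) / (1 - expected)


def icc_agreement(pairs):
    """ICC for absolute agreement, single measurement (assumed form).

    `pairs` holds one row per participant: (test score, retest score).
    """
    n, k = len(pairs), len(pairs[0])
    grand = mean(x for row in pairs for x in row)
    row_means = [mean(row) for row in pairs]
    col_means = [mean(row[j] for row in pairs) for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in pairs for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def standard_error_of_measurement(pairs, icc):
    """SE_M = SD * sqrt(1 - ICC), using the SD of all observations."""
    sd = stdev(x for row in pairs for x in row)
    return sd * sqrt(max(0.0, 1 - icc))


# Hypothetical data: sex item (categorical) and one gendered item scored
# as a percentage (0-100) of the feminine-masculine continuum.
sex_test = ["female", "male", "female", "male", "female", "intersex"]
sex_retest = ["female", "male", "female", "male", "female", "intersex"]
identity_pairs = [(12, 15), (80, 76), (45, 50), (90, 88), (30, 27), (60, 66)]

kappa = cohens_kappa(sex_test, sex_retest)
icc = icc_agreement(identity_pairs)
sem = standard_error_of_measurement(identity_pairs, icc)
print(f"kappa = {kappa:.2f}, ICC = {icc:.2f}, SE_M = {sem:.1f}% of continuum")
```

Because the example scores are already expressed as percentages of the continuum, the resulting SE_M can be read directly as the percentage-of-continuum value described in the text.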

All statistical analyses were conducted in the R statistical programming language (version 4.3.1: Beagle Scouts) along with the packages tidyverse (version 2.0.0), irr (version 0.84.1), tableone (version 0.13.2), and ggpubr (version 0.6.0) [46–50].

Sample size

The target minimum sample size of 62 participants per arm completing the first (test) component of the study was derived by first plotting predicted minimum sample sizes using the Intraclass Transformation and Optimal Design Approximation methods through a range of predicted reliability coefficients, confidence intervals, and minimum acceptable reliability coefficients [18,51,52]. Then, the more conservative estimate (n = 42) was identified for a predicted reliability coefficient of 0.85 (CI from 0.80 to 0.90). This number was rounded up to 50 to mitigate potential overestimation of the predicted reliability and then, after accounting for a potential dropout rate of 25%, the target minimum sample size for the test-retest study was set as n = 62 [18].