Abstract
Disparities in child development between groups of children arise early and reflect social inequities in early environments, geography, and other factors. To track and address these disparities, valid and reliable tools are needed that can be implemented at scale and across populations. However, no population-level measures of child development appropriate for children from birth to age five years have been developed and validated in the United States to date. The Kidsights Measurement Tool (KMT) is a parent-report, population-level tool for children birth to age five years intended to track group-level differences in developmental status across normative aspects of children’s motor, cognitive, language, and social/emotional development. This study reports on validation of the KMT as a feasible tool that can be implemented in large-scale surveys to track disparities in early childhood development. Using a sample of N = 5,001 initial parent reports on children residing in Nebraska and across the United States, we find strong evidence that the KMT can detect disparities in child development from birth to age five, as indicated by expected criterion associations with parent education and mental health, as well as child’s race and ethnicity. In addition, we found that the KMT is strongly associated with gold-standard direct observation instruments (i.e., the Bayley Scales of Infant Development and the Woodcock-Johnson) that measure similar developmental constructs, both concurrently and 12–24 months later. Finally, the KMT exhibits strong reliability even after controlling for age, and we find no evidence that measurement noninvariance threatens valid inferences about group differences. Taken together, our findings indicate that a parent-report measure can generate valid and useful estimates for tracking disparities in early child development at the population level.
Citation: Waldman MR, Hepworth K, Johnson J, Tourek KM, Jones KJ, Garcia YE, et al. (2025) Validation of the Kidsights Measurement Tool: A parent-reported instrument to track children’s development at the population level. PLoS One 20(6): e0324082. https://doi.org/10.1371/journal.pone.0324082
Editor: Randall Waechter, Caribbean Center for Child Neurodevelopment, GRENADA
Received: December 13, 2024; Accepted: April 20, 2025; Published: June 26, 2025
Copyright: © 2025 Waldman et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data underlying the results presented in the study are available from https://osf.io/gnhds/files/osfstorage
Funding: Support for this research was provided by the Buffett Early Childhood Fund, Imaginable Futures, Overdeck Family Foundation, Pritzker Children’s Initiative, and Valhalla Foundation.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Early child development lays the groundwork for lifelong health and wellbeing. Supportive early environments that promote healthy child development are associated with later adult health and well-being, including lower incidence of diseases, greater earnings, and lower risk of incarceration [1–4]. A substantial body of research outlines the biological and social mechanisms by which early development influences later health and wellbeing [5–8]. As such, investments in early childhood development yield notable long-term returns [9], as supporting healthy development early in life may be more cost-effective than mitigating the consequences of early stress and deprivation in adulthood.
Because early child development has such significant implications for later development, the impacts of early stress are visible in group-level disparities in health, learning, and wellbeing throughout the life course [10]. Disparities in cognitive, language, and social/emotional development outcomes in young children by socio-economic status, race/ethnicity, and geography have been extensively documented over several decades in the United States [11–13], emerging in the first year of life and persisting over time [14–16]. These disparities have been shown to be attributable to differences in environmental supports, including access to quality childcare, economic resources, and neighborhood and community supports [15,16]. In sum, children’s development is strongly influenced by the social and economic context of their families and communities, and, when these contexts are not supportive, the result is persistent group-level disparities in child development outcomes.
Population-level tracking of early child development in the United States: Too little, too late
“Population-level tracking” refers to measuring and drawing conclusions about group-level characteristics, as opposed to drawing conclusions about individual children. Validity of claims regarding group-level characteristics depends upon a representative sampling approach as well as tools that are valid in measuring group characteristics, sufficiently reliable for this intended use, and scalable so that the data can be collected quickly and cheaply to meet the large sample size requirements of population-level tracking. The purpose of this paper is to provide psychometric evidence that the Kidsights Measurement Tool (KMT) is valid, reliable, and scalable for future use in large sample studies that incorporate representative sampling. As we outline below, to date, there have been few attempts to create population-level estimates of child development, despite the need for insight into early disparities across large populations of young children.
In the United States, population-level tracking of children’s development in the first five years is limited in scope and relies on a small and narrow set of indicators, especially for children birth to age three. Infant mortality, for example, is an important indicator of early inequities [17] but does not provide insight into normative child development. Yet, children from lower-income families, families with less formal education, and/or families who represent diverse racial and ethnic backgrounds are up to half a year behind their more advantaged peers in cognitive, language, and social/emotional development by the time formal schooling begins [18,19]. Given the substantial size of these disparities at the start of school, more population-level data are needed on children’s developmental outcomes earlier in life, when interventions can address emerging inequities. The lack of data is especially problematic given the rapid pace of early development and the opportunity to support healthy development through cost-effective programs such as home visiting and support for quality childcare [20].
The importance of population-level tracking of child development is underscored by US state and federal agendas, which identify indicators of early child development as key indicators of population-level health and wellbeing. The United States Centers for Disease Control and Prevention’s (CDC) Healthy People 2030, for example, states an objective of increasing the national proportion of children who are developmentally on track and ready to start school [21]. But progress toward this objective is not currently reported, as the CDC has designated the Healthy People 2030 goal regarding children’s developmental status as “presently lacking reliable baseline data” [21].
The landscape of routinely collected population-level child development data for children five years and younger in the United States is limited to the National Survey of Children’s Health (NSCH). The NSCH is administered by the Health Resources and Services Administration using a representative sampling design and includes a parent-reported National Outcomes Measure on preschool-aged children’s development to generate population-level estimates, called Healthy and Ready to Learn (HRTL) [22,23]. The percentage of children who are HRTL is estimated based on a set of 27 parent-report items to index child development across motor development, health, early learning skills, self-regulation, and social/emotional development for children between three and six years. Results from national samples estimate that only 64.3% of children were designated HRTL. Significant disparities were found by—among other factors—the parent’s mental health status, the quality of the home learning environment, and the child’s neighborhood [23]. However, HRTL does not estimate child development for children under age three years.
The lack of workable tools for children birth to age five is a primary barrier to obtaining population-level data on child development, which is the focus of the present study. Beyond HRTL, three additional parent-report, population-level tools have been developed for use globally: the Global Scales for Early Development (GSED) for children birth to age three [24]; the Caregiver-Reported Early Development Index (CREDI) for children birth to age three [25]; and the Early Child Development Index (ECDI) for children two to five years [26]. However, none is at present routinely incorporated into household surveys in the US for population-level tracking.
In sum, despite policy goals of collecting population data on infant, toddler, and preschooler wellbeing and development, data on these constructs beginning at birth and extending through the start of school are largely absent from statewide and national data systems in the United States [27]. Indeed, the HRTL measure administered as part of the NSCH represents the lone population-level measure of child development but is limited to children aged 3–5 years. Population data on children’s development for the nation’s youngest children are currently not collected in government-led data collection efforts such as the NSCH. One barrier to the collection of population-level data beginning at birth is the lack of measures that can be incorporated into routine household surveys and are appropriate for children beginning at birth and extending to school entry. Overcoming this barrier is the focus of the present study.
Developing measurement tools for measuring child development at the population level
Population-based tools of child development should reflect the holistic nature of early development, addressing language, cognitive, motor, and social/emotional development, and therefore must balance developmental breadth with feasibility for use at scale. The considerable time and cost of direct assessments of child development, such as the Bayley Scales of Infant Development, preclude their use at a population level, leading to reliance on parent-report tools (e.g., the CREDI, GSED, and ECDI) and requiring innovation in scale development to create accurate and reliable parent-report instruments. Further, even though the tools must be feasible for use at scale, it is equally important to address the validity of population-level instruments of child development. Recommendations from the American Psychological Association’s [28] Standards for Educational and Psychological Testing (henceforth, Standards) define types of validity evidence: (1) test content, (2) relations with other variables (including criterion variables and scores from concurrent instruments), (3) the sensitivity of conclusions to threats from measurement non-invariance or item misfit, and (4) score precision as indicated by errors of measurement (i.e., reliability). McCoy et al. [25] reported acceptable psychometric properties and criterion validity of the CREDI across high-, middle-, and low-income countries. Ghandour et al. [19] similarly reported acceptable psychometric properties and criterion validity for the HRTL. Validation evidence for the GSED and ECDI is still in development. However, two types of validity evidence – concurrent validity with observational measures and predictive validity demonstrating associations between parent-report and observational measures over time – have not yet been frequently reported for population-based measures of child development for children birth to age five years.
The purpose of the present study was to evaluate the validity evidence for a new parent-report measure of child development for children birth to age five living in the United States, called the Kidsights Measurement Tool (KMT). When incorporated into studies designed for representative sampling, the KMT is intended to provide population-based estimates of children’s development from birth through age five, thereby addressing the present gap in holistic indicators of child development with a measure that is feasible to scale. We hypothesized that the KMT would show acceptable psychometric properties. Using the Standards of test reliability and validity, we assessed the KMT’s validity using a large sample (N = 5,001), including predictive validity with direct assessments administered by trained observers 12–24 months later. We also jointly addressed criterion validity and documentation of early disparities with known correlates of children’s development, including parent education, parent anxiety and depression, the family’s socio-economic status, and exposure to experiences that may be traumatic and/or adverse to healthy development.
Materials and methods
Participants
Participants were adults responsible for the child’s care who reported on their child’s development; they identified themselves as biological, foster, or adoptive parents, or other relatives, of children aged birth to five years residing in Nebraska and throughout the United States (referred to as “parents”).
We collected data on N = 5,001 children in three phases (see Table 1) between October 1, 2020 and July 31, 2023. In the first phase (October 2020–February 2021), we obtained n = 977 parent responses in the Lincoln and Omaha, Nebraska, USA metropolitan areas to be used as pilot data; we followed up with n = 70 willing Phase 1 participants approximately 16 months (range: 12–24 months) after their initial survey response. In Phase 2 (August 2022–March 2023), we collected validity evidence from n = 2,428 parents throughout Nebraska, USA. Lastly, in Phase 3 (June–July 2023), we obtained item response data from n = 1,596 parents throughout the United States through a limited and opportunistic partnership with the Stanford Center for Early Childhood’s RAPID survey, in which caregivers who were part of RAPID’s ongoing panel study were given the option to complete the KMT.
Table 2 reports demographic information on participants with reference to US national population estimates from the NSCH. Apart from the follow-up sample in Phase 1, we pooled the data from all three phases to calibrate the scale and assess validity evidence. In this pooled sample, 59.7% of the children whose parents reported on their development were identified as white, non-Hispanic, compared to the national estimate of 41.9%, and the mean age was 35.5 months. Data on the children’s sex were collected in Phase 1 and Phase 2, but RAPID opted not to collect the child’s sex in Phase 3. In total, across Phases 1 and 2, 53.3% of parents reported their child as male, and 97.6% of respondents reported that they were the biological, foster, or adoptive parent of the child.
In the Phase 1 follow-up sample, 58.6% of children were identified as white, non-Hispanic. All respondents (100%) identified as the biological, foster, or adoptive parent, and 42.9% of parents reported their child as male.
Procedures
In all phases, data were collected using the Qualtrics [30] online survey platform (Qualtrics, Provo, UT). We recruited parents through healthcare and childcare providers, parenting support programs, and social media posts. Parents received a link to an online questionnaire that included several questions on family demographics, development, and health. Respondents could complete the survey using their mobile phone, tablet, or computer; the survey took 20–30 minutes to complete. We offered parents a gift card for completing the survey. All protocols were reviewed by the Institutional Review Board of the University of Nebraska Medical Center and the Research Ethics Review Committee of the World Health Organization (ID ERC.0003514). To obtain informed consent, parents were given a written description of the study at the start, along with the option to end their participation at any point. Acknowledgment of providing informed consent was documented in the Qualtrics platform. As part of a screening protocol, we excluded observations if (a) metadata information resulted in a “likely fraudulent” score from the rIP package [30] and the IP Hub database (https://iphub.info/), (b) the caregiver failed to accurately confirm the child’s birthdate, or (c) scores were more than 5 SD above or below the mean on the CREDI or ECDI (see below for a description of each).
For the Phase 1 follow-up, parents who agreed to be recontacted received an email with a link to a short survey that included information about the study activities, specifically the in-person appointment needed for the child direct assessment and the additional survey, and that asked for information to help schedule the in-person appointment. The assessment team scheduled appointments based on the family’s preference for date and location. The child direct assessments, specifically the Bayley Scales of Infant and Toddler Development, Fourth Edition (Bayley-4) [31] for children up to 42 months or the Woodcock-Johnson IV Early Cognitive and Academic Development (WJ IV ECAD) [32] for children from 43 to 72 months, were conducted at the university, in the home, or at the child’s childcare center when approved by the childcare provider. Parents received a link to the online survey including the KMT immediately after the child direct assessment appointment.
Measures
Kidsights measurement tool.
The KMT was constructed by first forming a candidate item bank that adopted items from four previously validated instruments, each measuring normative aspects of children’s development (i.e., skills or behaviors that children acquire or exhibit as they age when undergoing healthy development) between 0–5 years. These four instruments included (1) the Global Scale of Early Development Short Form (GSED-SF) [33], (2) the Caregiver Reported Early Development Instruments Long Form (CREDI-LF) [34], (3) the Early Childhood Development Index (ECDI2030) [35], and (4) Healthy and Ready to Learn (HRTL) [22]. We included only items that measured normative aspects of children’s development and excluded items from these instruments that measure constructs such as problem behaviors or other indicators of psychosocial difficulties.
This process resulted in a candidate item bank of 223 items, with 79 items unique to the GSED-SF, 23 items unique to the CREDI-LF, 7 items unique to the ECDI2030, and 49 items unique to the HRTL. Of the 223 items, 49 were shared across two or more of the four contributing instruments (42 items were common between the GSED-SF and CREDI-LF; 7 items were common between the GSED-SF, CREDI-LF, and ECDI2030). The 223 candidate items measure motor, cognition, language, and/or social/emotional constructs according to the published literature and existing documentation for the four instruments. Specifically, 71 items represented fine or gross motor development constructs [33,34] or physical development [19]. Additionally, 82 items represented cognitive or language development [33–35] or early learning skills [19]. Lastly, 70 items measure social/emotional development [19,33–35], including normative aspects of children’s self-regulation [19]. The 223 candidate items were then screened and selected based on their psychometric properties (see S1 Table for details). The result was a final set of 197 items spanning development from birth to age 5 years; each child is administered only 39 age-appropriate items, on average. Item stems and response options are available in S3 Table.
Concurrent and predictive measures.
The Bayley-4 or the WJ IV ECAD were administered to the Phase 1 follow-up subsample (N = 70). Specifically, the WJ IV ECAD was used for children from 43−60 months at follow-up (n = 33), and the Bayley-4 was used for children up to 42 months (n = 37).
Bayley-4. The Bayley-4 is validated to measure child development up to 42 months. The instrument is divided into items that capture development in the cognitive, language, motor, social/emotional, and adaptive behavior domains through direct administration of activities, observation of the child, and questions to the caregiver [31]. Scores are provided at the domain level and the subtest level. For Bayley-4 training, assessors were required to complete a 12-hour online training hosted by the measure publisher. After completing the training, assessors submitted video recordings of administrations of the Bayley-4 or scheduled in-person observations with the research team’s Bayley-4 supervisor. The measure supervisor has several years of experience administering the Bayley Scales of Infant Development and evaluated each assessor for correct administration of items and scoring in order to certify each assessor as reliable on the Bayley-4.
WJ IV ECAD. The WJ IV ECAD is a measure of intellectual ability, academic skills, and language, specifically oral expression, for children 30 months to 6 years old [32]. Administration of the WJ IV ECAD yields a General Intellectual Ability score, an Expressive Language score, and scores for each test [19]. Although there are 10 tests in the WJ IV ECAD, only 7 were administered for the study. Assessors followed a similar training and reliability process for the WJ IV ECAD as for the Bayley-4. They reviewed the WJ IV ECAD kit materials and submitted a video of their administration of the WJ IV ECAD. The research team’s WJ-ECAD supervisor has experience administering the measure and reviewed the recordings for correct administration and scoring before certifying the assessors as reliable on the measure.
Caregiver-reported instruments: The candidate KMT item pool included all items from the GSED-SF, CREDI-LF, ECDI2030, and HRTL as described in Ghandour et al. [19]. As a result, in administering the KMT, we effectively administered these four caregiver instruments concurrently. Scores from the GSED-SF are termed “D-scores” [36] and were calculated using the dscore R package [37]. The CREDI Long Form results in an overall score of child development as well as subscale scores for motor, cognitive, language, and social/emotional development. We calculated CREDI scores using the credi R package (https://github.com/marcus-waldman/credi). ECDI2030 scores were calculated using the R syntax file provided by UNICEF (2023). HRTL scores were calculated by replicating the four-factor solution reported in Ghandour et al. [19] using the lavaan R package [38] and extracting factor scores. The four factors include a physical/motor factor, an early learning factor, a social/emotional factor, and a self-regulation factor.
Family- and caregiver-level criterion measures.
Socioeconomic measures: In the 2020 survey, we asked parents to report their 2019 household income in United States Dollars (USD). Likewise, in the 2022 survey, we asked parents to report their 2021 household income. To place household income on the same scale across years, we adjusted for inflation by converting all values to 1999 USD.
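The conversion described above amounts to a simple price-index deflation. The sketch below illustrates the idea in Python; the CPI values and function name are placeholders, since the source does not specify which price index the authors used.

```python
# Illustrative CPI deflation to 1999 USD. The CPI values below are
# placeholder annual averages, NOT the price index the authors used
# (the source does not specify one).
CPI = {1999: 166.6, 2019: 255.7, 2021: 271.0}

def to_1999_usd(income, year):
    """Deflate a nominal household income reported for `year` to 1999 dollars."""
    return income * CPI[1999] / CPI[year]

# A $75,000 income reported for 2019, under these placeholder values:
print(round(to_1999_usd(75_000, 2019)))  # -> 48866
```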
Parents could select from nine options in reporting their educational attainment: 1) 8th grade or less; 2) 9th–12th grade, no diploma; 3) High School Graduate or GED Completed; 4) Completed a vocational, trade, or business school program; 5) Some College Credit, but No Degree; 6) Associate Degree (AA, AS); 7) Bachelor’s Degree (BA, BS, AB); 8) Master’s Degree (MA, MS, MSW, MBA); 9) Doctorate (PhD, EdD) or Professional Degree (MD, DDS, DVM, JD). In the present study, we collapsed this information into five categories: 1) no high school (HS) diploma; 2) HS diploma; 3) some college or an Associate’s degree (i.e., AA/AS); 4) Bachelor’s degree (i.e., BA/BS); and 5) Master’s degree or higher. We dummy coded caregiver educational attainment using parents with only a high school education as the reference group.
Anxiety and depressive symptoms: We administered the Patient Health Questionnaire 2-item (PHQ-2) [39] and the Generalized Anxiety Disorder 2-item (GAD-2) [40] to obtain caregiver self-reports of depressive and anxiety symptoms. Parents reported whether, over the last two weeks, they (1) had little pleasure or interest in doing things (i.e., indicator 1 of PHQ-2), (2) were feeling down, depressed, or hopeless (i.e., indicator 2 of PHQ-2), (3) were feeling nervous, anxious, or on edge (i.e., indicator 1 of GAD-2), and (4) were not able to stop or control worrying (indicator 2 of GAD-2). Parents responded using a four-point Likert scale (i.e., “0-Not at all”; “1-Several days”; “2-More than half the days”; “3-Nearly every day”). For analysis, we created a depression and anxiety symptom total score by summing all four PHQ/GAD-2 items.
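As a concrete illustration, the combined symptom total described above can be computed by summing the four 0–3 item responses into a 0–12 score. This is a sketch, not the authors' code, and the complete-case handling of missing items is our assumption.

```python
# Sketch of the combined PHQ-2/GAD-2 symptom total described above.
# Responses use the 4-point scale: 0 = Not at all ... 3 = Nearly every day.
# Leaving the total missing when any item is missing is our assumption,
# not a documented scoring rule.
def phq_gad_total(responses):
    """Sum the two PHQ-2 and two GAD-2 items into a 0-12 total score."""
    if any(r is None for r in responses):
        return None  # leave the total missing (handled later by imputation)
    assert len(responses) == 4 and all(0 <= r <= 3 for r in responses)
    return sum(responses)

print(phq_gad_total([1, 0, 2, 3]))     # -> 6
print(phq_gad_total([0, None, 1, 2]))  # -> None
```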
Child’s race and ethnicity.
Parents could select from up to 15 racial categories and one of five ethnicity categories. Racial category response options included: 1) American Indian or Alaska Native; 2) Asian Indian; 3) Black or African American; 4) Chinese; 5) Filipino; 6) Guamanian or Chamorro; 7) Japanese; 8) Korean; 9) Native Hawaiian; 10) other Asian; 11) other Pacific Islander; 12) Samoan; 13) Vietnamese; 14) White; or 15) Some other race. Ethnicity response options included 1) No, not of Hispanic, Latino, or Spanish origin; 2) Yes, Mexican, Mexican American, Chicano; 3) Yes, Puerto Rican; 4) Yes, Cuban; 5) Yes, another Hispanic, Latino, or Spanish origin. For analysis, we combined race and ethnicity into four major categories. These included: 1) White, non-Hispanic, 2) Black or African American, non-Hispanic, 3) Other (including two or more races), non-Hispanic, and 4) Hispanic.
Scaling and scoring procedures.
We fit the graded-response IRT model in (1)–(2) to the polytomous data:

$$P(Y_{ij} \geq k \mid \theta_i) = \frac{\exp\left[\alpha_j\left(\theta_i - \beta_{jk}\right)\right]}{1 + \exp\left[\alpha_j\left(\theta_i - \beta_{jk}\right)\right]} \tag{1}$$

$$\theta_i = \mathbf{x}_i^{\top}\boldsymbol{\gamma} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \tag{2}$$

where $i$ indexes a child (with a latent score [i.e., ability] of $\theta_i$), $j$ indexes an item (with $K_j$ response options), and $k$ indexes one of the response options for the item. Model parameters in (1)–(2) include $\alpha_j$ (the item discrimination value), $\beta_{jk}$ (the difficulty value associated with response option $k$ for item $j$), and a vector of latent regression coefficients, $\boldsymbol{\gamma}$. We fit the model using maximum marginal likelihood estimation (also referred to as full information maximum likelihood); maximum likelihood estimators are gold-standard approaches to treating missing data [41]. All model fitting occurred using the mirt R package [42]. We calculated scores from the KMT (“Kidsights scores”) by summarizing the posterior distribution using the expected-a-posteriori (EAP) point estimate.
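For intuition, EAP scoring under a graded-response model can be sketched numerically: evaluate the posterior of the latent ability on a grid under a standard-normal prior and take its mean. The item parameters below are hypothetical, and the actual calibration and scoring used the mirt R package; this Python sketch only illustrates the mechanics.

```python
import numpy as np

# Numerical sketch of EAP scoring under a graded-response model with a
# standard-normal prior. Item parameters are hypothetical; the authors'
# calibration and scoring used the mirt R package.
def grm_category_probs(theta, a, b):
    """P(Y = k | theta) for one item with discrimination `a` and ordered
    difficulty thresholds `b` (length K-1), via cumulative logistics."""
    cum = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b, dtype=float))))
    cum = np.concatenate(([1.0], cum, [0.0]))  # P(Y >= 0) = 1, P(Y >= K) = 0
    return cum[:-1] - cum[1:]                  # adjacent differences

def eap_score(responses, items, grid=None):
    """Expected-a-posteriori estimate of theta from 0-indexed responses."""
    if grid is None:
        grid = np.linspace(-4, 4, 161)
    log_post = -0.5 * grid**2                  # log N(0,1) prior, up to a constant
    for y, item in zip(responses, items):
        probs = np.array([grm_category_probs(t, item["a"], item["b"])[y]
                          for t in grid])
        log_post += np.log(probs + 1e-300)     # guard against log(0)
    post = np.exp(log_post - log_post.max())
    return float(np.sum(grid * post) / post.sum())

items = [{"a": 1.5, "b": [-1.0, 0.5]}, {"a": 2.0, "b": [0.0]}]  # hypothetical
print(eap_score([2, 1], items))  # endorsing higher categories -> positive theta
```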
Analytic plan
We followed the recommendations from the APA Standards in collecting validity evidence. Evidence came from analyzing (1) test content, (2) relations with other variables (including criterion variables and scores from concurrent instruments), (3) the sensitivity of conclusions to threats from measurement non-invariance, and (4) score precision as indicated by errors of measurement (i.e., reliability).
Evidence based on test content.
Using the domain assignments provided by the originating instruments (CREDI, GSED, HRTL, and ECDI, as described above), we assigned each item to one of three domains: (1) Motor/physical development, (2) Cognition or language development, or (3) Social/emotional development. To assess content coverage, we calculated the (average) domain composition of the administered items within yearly age categories. Evidence based on test content was evaluated by two subject matter experts to ensure adequate representation of items by developmental domain. See Fig 1 for a summary of item domains by age.
Kidsights Measurement Tool domain representation by children’s age. Notes: Cog. – Cognitive development; Lang. – Language development; Phys. – Physical development; Socemot. – Social/emotional development.
Evidence based on other variables
In line with the Standards, we collected evidence that Kidsights scores correlate with other variables in the expected magnitude and direction. This includes: (a) convergent validity evidence with scores from concurrent instruments that measure equivalent (or highly similar) constructs as those directly intended to be measured by the KMT; (b) discriminant validity evidence with scores from concurrent measures measuring constructs not directly intended to be measured using the KMT; and (c) association of Kidsights scores with criterion variables known to be predictive of child development.
Convergent validity evidence.
Convergent validity evidence came from calculating part correlations (i.e., correlations after adjusting for the child’s age) of Kidsights scores with (a) the Bayley-4 Cognition, Receptive and Expressive Communication, and Gross and Fine Motor domain and subtest level scores for children 42 months and younger; and (b) the WJ IV ECAD General Intellectual Ability–Early Development scores, Expressive Language cluster scores, as well as the Verbal Analogies, Sentence Repetition, and Rapid Picture Naming test activities for children 43–60 months. Convergent validity evidence from caregiver-reported instruments came from part correlations with: (a) D-scores, (b) CREDI scores (overall scores, as well as motor, language, cognition, and social/emotional subscores), (c) ECDI2030 scores, and (d) motor/physical, early learning, and social/emotional factor scores from the HRTL. We are aware of no published ideal or minimum threshold for a correlation to establish convergent validity evidence. Because a correlation of approximately .70 represents 50% of the variance explained, we used this value as an ideal threshold, and we used a correlation of .50 as a minimum threshold (i.e., 25% of the variance explained).
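The age-adjusted part correlations used throughout these analyses can be illustrated on fabricated data. In the sketch below, the comparison score is residualized on age before correlating with the Kidsights score; this is one common construction of a part (semipartial) correlation and not necessarily the authors' exact specification.

```python
import numpy as np

# Sketch of an age-adjusted part (semipartial) correlation on fabricated
# data: the comparison score is residualized on age, then correlated with
# the Kidsights score. One common construction; the authors' exact
# specification may differ.
def part_correlation(x, y, z):
    """Correlate x with the residual of y after regressing y on z."""
    z1 = np.column_stack([np.ones_like(z), z])
    beta, *_ = np.linalg.lstsq(z1, y, rcond=None)
    return float(np.corrcoef(x, y - z1 @ beta)[0, 1])

rng = np.random.default_rng(0)
age = rng.uniform(0, 60, 500)                        # age in months
kidsights = age + rng.normal(0, 5, 500)              # toy age-driven score
bayley = age + 0.5 * kidsights + rng.normal(0, 5, 500)
raw = float(np.corrcoef(kidsights, bayley)[0, 1])
adj = part_correlation(kidsights, bayley, age)
print(raw, adj)  # the raw correlation is inflated by shared age variance
```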
Discriminant validity evidence.
Discriminant validity evidence came from part correlations (i.e., correlations after controlling for children’s age) with scores from other instruments that reflect aspects of children’s behavior not directly tied to normative aspects of children’s motor, cognitive, language, or social/emotional development. This included scores from the HRTL self-regulation factor and psychosocial problem scores from the GSED-PF. Because these scores reflect constructs not intended to be directly measured by the KMT, evidence comes from correlations smaller in magnitude than the convergent thresholds and in the expected direction (i.e., negative part correlations with psychosocial problem scores, and positive part correlations with self-regulation).
Predictive validity evidence.
We assessed predictive validity evidence by studying part correlations of Kidsights scores at Time 1 with Bayley-4 and WJ IV ECAD scores obtained 12–24 months (M = 16 months) later at Time 2. We took positive and statistically significant correlations as evidence that Kidsights scores predict future development and learning.
Criterion associations.
We fit multiple regression models to control for age and to assess associations with the following criterion variables: (a) family income, (b) parent’s educational attainment (reference category: parents with a bachelor’s degree), (c) the child’s race and ethnicity (reference category: white, non-Hispanic children), and (d) parent’s depressive and anxiety symptoms. In fitting the regression models, family income was log-transformed to offset the heavy right skew of that variable.
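To make the income transformation concrete, the sketch below fits an age-adjusted regression with log-transformed income on fabricated data. The authors fit their models in R; everything here, including the coefficient values, is illustrative.

```python
import numpy as np

# Minimal sketch of an age-adjusted criterion regression with
# log-transformed income, on fabricated data (the authors fit their
# models in R; all numbers here are illustrative).
rng = np.random.default_rng(1)
n = 200
age = rng.uniform(0, 60, n)                         # months
income = np.exp(rng.normal(10.5, 0.8, n))           # heavy right skew
score = 0.9 * age + 2.0 * np.log(income) + rng.normal(0, 4, n)

# Design matrix: intercept, age control, and log-income criterion.
X = np.column_stack([np.ones(n), age, np.log(income)])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
print(beta)  # [intercept, age slope, log-income slope]
```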
Criterion validity evidence includes positive associations between Kidsights scores and family income and greater levels of parent’s educational attainment. We expect to find a negative association with parent’s depressive and anxiety symptoms measured through a combined GAD/PHQ-2 total score. Lastly, we expect to find that historically disadvantaged groups (i.e., non-white and Hispanic children in the United States, children from lower levels of caregiver education, and those from lower family income households) will exhibit lower average Kidsights scores compared to peers of the same age.
As a result of the planned missingness for the GAD/PHQ-2 items in Phase 1 of the study (see above), missingness was concentrated in those items, leaving 10.8% of the sample without a GAD/PHQ-2 combined total score. Additionally, 7.2% of parents did not report an income. All other criterion variables contained no missingness. In response, we employed multiple imputation with 10 imputed datasets using the mice (van Buuren et al., 2015) R package to handle missingness on the criterion variables. For all models, we evaluated the evidence from criterion associations by pooling results using Rubin's rules, conducting pairwise t-tests, and interpreting the magnitude and substantive size of the coefficients.
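Pooling by Rubin's rules combines the within- and between-imputation variability. A minimal sketch (pure Python, with hypothetical numbers; the study used the mice R package for this step):

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Pool a coefficient across m imputed datasets via Rubin's rules:
    pooled estimate = mean of estimates; total variance = within-imputation
    variance + (1 + 1/m) * between-imputation variance."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)        # pooled point estimate
    w = statistics.fmean(variances)            # within-imputation variance
    b = statistics.variance(estimates)         # between-imputation variance
    t = w + (1 + 1 / m) * b                    # total variance
    return q_bar, math.sqrt(t)

# Hypothetical: one coefficient estimated in each of 10 imputed datasets.
ests = [0.038, 0.041, 0.040, 0.043, 0.039, 0.042, 0.040, 0.038, 0.041, 0.040]
ses = [0.007] * 10
est, se = pool_rubin(ests, [s ** 2 for s in ses])
print(round(est, 4), round(se, 4))
```

The pooled standard error is slightly larger than the per-dataset standard errors because it incorporates the uncertainty due to imputation.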
Measurement noninvariance.
We evaluated whether items exhibited measurement noninvariance and conducted sensitivity analyses to determine whether ignoring noninvariance threatened valid inferences. Using the alignment method [43], we tested for DIF across education groups, across race and ethnicity groups, and between the Nebraska sample (i.e., Phases 1 [initial sample] and 2 of the present study) and the national sample (Phase 3). We completed all invariance testing using Mplus Version 8.10 [44].
Because we are aware of no accepted standard on the maximum percentage of items that may exhibit noninvariance before valid inference is threatened, we conducted a sensitivity analysis in which we allowed all DIF items to be freely estimated when evaluating criterion associations with race and income, as well as group differences between the Nebraska (Phases 1 and 2) and the national samples. We compared associations and mean differences obtained by ignoring DIF (i.e., assuming invariance across all items) to those obtained when item parameters were freely estimated for items exhibiting DIF. We employed generalized linear hypothesis tests to evaluate whether ignoring DIF results in significantly different conclusions about group differences.
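Generalized linear hypothesis tests of this kind take a standard Wald form; as a sketch (the specific contrast matrices used in the study are not reproduced here), testing the null hypothesis that a set of linear constraints on the coefficients holds, H0: Lβ = c, uses

```latex
W = \left(\mathbf{L}\hat{\boldsymbol{\beta}} - \mathbf{c}\right)^{\top}
    \left[\mathbf{L}\,\widehat{\operatorname{Var}}(\hat{\boldsymbol{\beta}})\,\mathbf{L}^{\top}\right]^{-1}
    \left(\mathbf{L}\hat{\boldsymbol{\beta}} - \mathbf{c}\right)
    \;\overset{H_0}{\sim}\; \chi^{2}_{q},
```

where the coefficient vector stacks the group-difference estimates from the DIF-ignoring and DIF-adjusted specifications, L encodes their equality, and q is the number of constraints (matching the reported degrees of freedom: 4 for education, 3 for race and ethnicity, 1 for geography).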
Reliability and errors of measurement.
We assessed the precision of scores in two ways. First, we calculated the marginal reliability statistic proposed by Thissen and Wainer [45] using the standard errors of the EAP estimates. Although this is a valid measure of reliability, as a marginal statistic it overestimates reliability when a child's score is only to be compared to scores from same-age peers. To evaluate the precision of scores conditional on a child's age, we fit a generalized additive model for location, scale, and shape [46] to estimate age-conditional variances in EAP scores. Using the within-age variance of EAP scores and the conditional standard error of measurement (CSEM), we then calculated an expected conditional reliability value for each child, defined as the proportion of within-age score variance not attributable to measurement error, with the CSEM evaluated at the child's EAP score. We evaluated the average of these conditional reliability values, in addition to evaluating the expected conditional reliability at each age. We used .80 as a cutoff for the minimal reliability at each age. Such a reliability value may be below traditional guidelines for individual assessments (i.e., when conclusions are drawn regarding individuals). However, population measures like the KMT are intended to produce statistics that aggregate across individual scores, thereby washing out measurement error across individual scores.
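The two reliability quantities described above can be sketched as follows. The study's analyses were run in R, so this Python sketch with hypothetical EAP scores and standard errors is illustrative only, and the marginal formula shown is one common EAP-based formulation (true-score variance over total variance).

```python
import statistics

def marginal_reliability(eap, se):
    """Marginal reliability from EAP scores and their standard errors:
    estimated true-score variance divided by total variance."""
    err = statistics.fmean(s ** 2 for s in se)   # mean error variance
    sig = statistics.variance(eap)               # variance of EAP scores
    return sig / (sig + err)

def conditional_reliability(csem_sq, var_within_age):
    """Expected conditional reliability for one child: the share of
    within-age score variance not attributable to measurement error."""
    return 1 - csem_sq / var_within_age

# Hypothetical EAP scores and SEs for illustration only.
eap = [-1.5, -0.5, 0.5, 1.5]
se = [0.5, 0.5, 0.5, 0.5]
print(round(marginal_reliability(eap, se), 2))
print(conditional_reliability(0.2, 1.0))
```

Because within-age variance is smaller than the marginal variance across all ages, the conditional reliability is generally lower than the marginal statistic, which is why the study evaluates both.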
Results
Demographic information for our sample is reported in Table 2. In total, 53.3% of children in the sample were identified as male. Moreover, the majority of sampled children were identified as White, non-Hispanic (59.7%), with a parent reporting having attained a bachelor's degree or higher (51.3%).
Evidence based on test content
Motor and physical development items represented the majority (53%) of items assigned in the first year of life, but these items represent only approximately 20% of administered items for children aged 2 years and older. In contrast, items identified as measuring cognitive or language development represent less than a quarter (23%) of items assigned to newborns, but these items represent a majority beginning at age 3 (peaking at 52%). The percentage of items reflecting social/emotional development varies between 24% (for newborns) and 34% (for two-year-olds) (see Fig 1).
Evidence based on other variables
Convergent validity evidence.
Part correlations of Kidsights scores with Bayley-4 and WJ IV ECAD scores administered concurrently (i.e., each administered in the Phase 1 follow-up) are presented in Tables 3 and 4, respectively. For the Bayley-4, correlation coefficients were greater than or equal to the minimally acceptable threshold (see Table 3). Kidsights scores correlated with Bayley-4 Expressive Communication scores and with Bayley-4 Fine Motor scores. For the WJ IV ECAD, the correlations between Kidsights scores and the General Intellectual Ability–Early Development, Sentence Repetition, Expressive Language, Verbal Analogies, and Rapid Picture Naming scores all met the minimally acceptable threshold (see Table 4). However, the part correlation between Kidsights scores and WJ IV ECAD Picture Vocabulary scores was below the minimally acceptable threshold. Kidsights scores were not significantly correlated with Memory for Names, Sound Blending, or Visual Closure scores from the WJ IV ECAD.
Table 4 provides convergent validity evidence with scores derived from previously validated caregiver-report measures. Kidsights scores were highly correlated with D-scores and CREDI-LF Overall scores. Correlation coefficients were also greater than the ideal threshold value for all CREDI subscores, ECDI2030 scores, and NOM early learning factor scores. Correlation coefficients were greater than the minimally acceptable threshold value of r = .50 for HRTL physical development factor scores. Only the correlation with HRTL social/emotional factor scores was below the minimum threshold value we set.
Discriminant validity evidence.
As can be seen in Table 5, part correlations of Kidsights scores with HRTL self-regulation scores (r = .38, p < .001) and GSED-PF psychosocial problems scores (r = −.20, p < .001) provide supportive evidence of discriminant validity. The magnitude of each correlation was less than the magnitudes of the correlations used as convergent validity evidence. These correlations were also below the maximum threshold value we set and in the expected directions (i.e., a positive correlation with HRTL self-regulation scores and a negative correlation with GSED-PF psychosocial problem scores).
Predictive validity evidence.
Correlations between Kidsights scores at Phase 1 and Bayley-4 or WJ IV ECAD scores obtained 12−18 months later at follow-up provide predictive validity evidence (see Tables 3 and 4). Kidsights scores were found to predict all Bayley-4 scores at Time 2; each correlation coefficient was positive, significant, and substantively large. Similarly, Kidsights scores predicted WJ IV ECAD GIA Early Development scores, Sentence Repetition scores, Expressive Language scores, Verbal Analogies scores, and Picture Vocabulary scores. However, we failed to find that Kidsights scores predicted WJ IV ECAD Memory for Names scores, Sound Blending scores, or Visual Closure scores.
Criterion associations.
Our results found evidence of associations between KMT scores and contextual factors, documenting early disparities. As expected, Kidsights scores were positively associated with family income (Est. = 0.040, SE = 0.007, p < .001, ES = .09). Moreover, compared to peers whose parents reported a bachelor's degree, children of parents with less educational attainment exhibited lower average Kidsights scores (No high school diploma: Est. = −0.340, SE = 0.092, p < .001, ES = −0.34; High school diploma: Est. = −0.268, SE = 0.053, p < .001, ES = −0.27; Some college or associate's degree: Est. = −0.081, SE = 0.039, p < .001, ES = −0.08). Children of parents with a master's degree or higher exhibited greater average scores than children of parents with a bachelor's degree, but this difference was not statistically significant (Est. = 0.051, SE = 0.039, ES = 0.05).
Kidsights scores were negatively associated with parent's anxiety and depressive symptoms (Est. = −0.298, SE = 0.039, p < .001, ES = −0.128), and non-white or Hispanic children exhibited lower average Kidsights scores compared to their white, non-Hispanic peers (Black or African-American, non-Hispanic: Est. = −0.137, SE = 0.058, p = .019, ES = −0.137; Other, non-Hispanic: Est. = −0.113, SE = 0.048, p = .018, ES = −0.113; Hispanic: Est. = −0.237, SE = 0.042, p < .001, ES = −0.237).
Measurement noninvariance.
Kidsights scores met acceptable standards to assert measurement invariance. Only 8 items (4.1%) exhibited DIF across education groups, 15 items (7.1%) exhibited DIF across racial and ethnic groups, and 4 items (2.0%) exhibited DIF between the Nebraska sample (i.e., Phases 1 and 2 of the study) and the national sample (i.e., Phase 3; see S2 Table for individual items). We found no evidence that allowing for approximate measurement invariance by freely estimating DIF items across groups resulted in different conclusions about group differences (Education: df = 4, p = .987; Race and ethnicity: df = 3, p = .983; Geography: df = 1, p = .598).
Reliability and errors of measurement.
Reliability and errors of measurement estimates provide supportive evidence that measurement error does not unduly threaten population-level inferences. As expected, the marginal reliability was found to be very strong. Controlling for variation due to children's age, the average conditional reliability estimate remained high. Moreover, the expected conditional reliability was greater than the minimum threshold value of .80 at every age except for newborns, for whom reliability was weakest at approximately .77 at birth (see Fig 2).
Marginal reliability of scores at child's age (N = 5,001).
Discussion
Population-based measurement of early child development has the potential to generate new insight into the emergence of disparities in the earliest years of life. Because tracking and addressing early disparities is a cornerstone of effective public health systems, such data are critical. However, documenting these disparities requires feasible and reliable measurement tools designed to detect group-level differences in child development. This study is important because it introduces a new measure that can reliably and feasibly measure child development at the population level through an online parent-report survey. Our study documents multiple types of validity for the KMT and reports on its ability to detect population-level disparities. We demonstrate its feasibility for use at scale and document how KMT addresses the psychometric requirements detailed by the APA Standards for its intended use as a population-based measurement tool. Further, our findings indicate disparities in child development due to socio-economic status and caregiver mental health. Below we outline our findings in greater depth.
First, we found that KMT’s results address key aspects of validity. KMT’s test content demonstrates concordance with key domains of child development including cognitive, language, motor and social/emotional development. Although the proportion of items representing different domains varies by age, overall, the balance of domains represents relevant domains at each age, which was our goal. The process of selecting items was aided considerably by the reliance on existing scales of child development including the CREDI, GSED, and HRTL, each of which was developed through careful identification of developmental milestones in early childhood [19,24,25].
We found that Kidsights scores showed expected predictive and concurrent associations with scores from both direct observations and other parent-report measures designed for use at the population level, the CREDI, HRTL and ECDI. Beginning with the sample of 70 children assessed at two time points, we found evidence of predictive validity. Kidsights scores strongly predicted all Bayley-4 subscores observed 12−18 months later, meeting or exceeding the minimum acceptable threshold for association. Kidsights scores were also related to several of the WJ IV ECAD scores obtained 12−18 months after initial scores were collected. We found similar evidence of convergent validity when the KMT was administered concurrently with the Bayley-4 and the WJ IV ECAD, as well as associations with previously validated parent reported measures (i.e., GSED-SF, CREDI, and HRTL). While associations were largely confirmed by our data, we did not find that all subscales of the WJ IV ECAD were associated with Kidsights scores from a concurrent administration of the KMT. Specifically, the WJ IV ECAD subscales of memory for names, sound blending, and visual closure did not show concurrent associations with KMT scores, while picture vocabulary, verbal analogies, sentence repetition and rapid picture naming did. While these subscales are intended to index general intellectual ability [32], it is not clear from our study why some were related to the KMT and some were not, although some of the subscales were very difficult for children in our sample so the lack of positive association may have been due to overall low scores. Future work with KMT and indices of academic achievement, including following children later into the primary school years, may help clarify these findings. At the same time, the KMT is intended to be a holistic measure of child development rather than indexing specific skills, and thus may show greater sensitivity to some aspects of later academic achievement than others.
Importantly, our study offers insight on group-level early disparities in child development. Our results provide evidence towards criterion validity by demonstrating sensitivity to factors previously documented to influence child development and documenting early disparities in child development. Children whose parents reported greater depressive and anxious symptomology had lower Kidsights scores, all of which is consistent with existing research on threats to early child development [47–49]. We also found positive associations between caregiver education and KMT scores, again aligned with existing findings on sources of disparity in child development demonstrating that parental education is associated with more positive developmental outcomes [50,51]. Documentation of these impacts across a large, diverse population of parents, using a relatively brief and feasible tool, creates an opportunity to respond to government calls for more extensive monitoring of child development at the population level, as outlined above.
Our results also identify areas for further inquiry. We found smaller concurrent associations between Kidsights scores and HRTL subscales focused on social/emotional and self-regulatory development than with the HRTL early learning and physical development subscales. While the KMT contains indicators of social/emotional development, the majority of items are focused on cognitive, language and physical development. Further, HRTL social/emotional and self-regulatory subscales include indicators of problem behaviors as well as normative development, whereas the KMT is focused exclusively on indexing social/emotional and regulatory competence. Going forward, we will continue to test the KMT's sensitivity to early social and emotional development, including through the development of a new parent-report tool focused specifically on manifestations of early psychosocial stress [52].
There are limitations to our study that reflect challenges to widespread collection of population-level data on child development. Collecting a fully representative sample of young children requires random sampling, which we were not able to do. While our sample was aligned with the underlying population on key indicators such as overall income, parents in our sample, for example, reported less positive child health status and fewer child ACEs. To fulfill the vision of population-level tracking of early child development outlined by the Centers for Disease Control and others, investments in state and community infrastructure to obtain representative samples of parents are needed, as well as a concerted effort to expand data collection to include children birth to age three in the National Survey of Children's Health.
As one step forward towards this long-term vision, our results offer evidence that the KMT is a reliable and valid tool to measure holistic child development at the population level for children birth to age five. Identifying and tracking early disparities has significance for ensuring lifelong health and wellbeing, particularly for populations of young children who are vulnerable to high levels of stress due to the quality of their environments. Once available as an open-source tool, the KMT will open the door to more extensive population-level monitoring of young children's development beginning in the earliest years of life within large-scale programs or at the community, state or national levels, when addressing disparities is most cost effective.
Supporting information
S1 File. Validation of the Kidsights Measurement Tool: A Parent-Reported Instrument to Track Children’s Development at the Population Level.
https://doi.org/10.1371/journal.pone.0324082.s001
(DOCX)
References
- 1. Campbell F, Conti G, Heckman JJ, Moon SH, Pinto R, Pungello E, et al. Early childhood investments substantially boost adult health. Science. 2014;343(6178):1478–85. pmid:24675955
- 2. Cohen S, Janicki-Deverts D, Chen E, Matthews KA. Childhood socioeconomic status and adult health. Ann N Y Acad Sci. 2010;1186:37–55. pmid:20201867
- 3. Reynolds AJ, Temple JA, Ou S-R, Robertson DL, Mersky JP, Topitzes JW, et al. Effects of a school-based, early childhood intervention on adult health and well-being: a 19-year follow-up of low-income families. Arch Pediatr Adolesc Med. 2007;161(8):730–9. pmid:17679653
- 4. Walker BH, Brown DC, Walker CS, Stubbs-Richardson M, Oliveros AD, Buttross S. Childhood adversity associated with poorer health: Evidence from the U.S. National Survey of Children’s Health. Child Abuse Negl. 2022;134:105871. pmid:36095924
- 5. Feder A, Fred-Torres S, Southwick SM, Charney DS. The Biology of Human Resilience: Opportunities for Enhancing Resilience Across the Life Span. Biol Psychiatry. 2019;86(6):443–53. pmid:31466561
- 6. Gillman MW. Developmental origins of health and disease. N Engl J Med. 2005;353.
- 7. Garner A, Yogman M. Committee on Psychosocial Aspects of Child and Family Health. Pediatrics. 2021;148.
- 8. Shonkoff JP, Slopen N, Williams DR. Early Childhood Adversity, Toxic Stress, and the Impacts of Racism on the Foundations of Health. Annu Rev Public Health. 2021;42:115–34. pmid:33497247
- 9. Heckman JJ. Skill formation and the economics of investing in disadvantaged children. Science. 2006;312(5782):1900–2. pmid:16809525
- 10. Marmot M. Social determinants of health inequalities. Lancet. 2005;365(9464):1099–104. pmid:15781105
- 11. McLoyd VC. Socioeconomic disadvantage and child development. Am Psychol. 1998;53(2):185–204. pmid:9491747
- 12. National Research Council. From neurons to neighborhoods: The science of early childhood development. Washington, DC: National Academies of Medicine. 2000.
- 13. Walker SP, Wachs TD, Grantham-McGregor S, Black MM, Nelson CA, Huffman SL, et al. Inequality in early childhood: risk and protective factors for early child development. Lancet. 2011;378(9799):1325–38. pmid:21944375
- 14. Halle T, Forry N, Hair E, Perper K, Wandner L, Wessel J. Disparities in early learning and development: lessons from the Early Childhood Longitudinal Study–Birth Cohort (ECLS-B). Washington, DC: Child Trends. 2009.
- 15. Burchinal M, McCartney K, Steinberg L, Crosnoe R, Friedman SL, McLoyd V, et al. Examining the Black-White achievement gap among low-income children using the NICHD study of early child care and youth development. Child Dev. 2011;82(5):1404–20. pmid:21790543
- 16. Duncan GJ, Magnuson KA. Can family socioeconomic resources account for racial and ethnic test score gaps?. Future Child. 2005;35–54.
- 17. Dodge KA. Presidential address: forging a developmental science mission to improve population outcomes and eliminate disparities for young children. Child Dev. 2022;93(2):313–25. pmid:35257369
- 18. Friedman-Krauss A, Barnett S. Access to high-quality early education and racial equity. National Institute for Early Education Research. 2020.
- 19. Ghandour RM, Moore KA, Murphy K, Bethell C, Jones JR, Harwood R, et al. School readiness among US children: Development of a pilot measure. Child Indic Res. 2019;12: 1389–411.
- 20. Jeong J, Franchett EE, Ramos de Oliveira CV, Rehmani K, Yousafzai AK. Parenting interventions to promote early child development in the first three years of life: A global systematic review and meta-analysis. PLoS Med. 2021;18(5):e1003602. pmid:33970913
- 21. Centers for Disease Control and Prevention. Healthy People 2030: Objectives and data. https://health.gov/healthypeople/objectives-and-data/browse-objectives/children/increase-proportion-children-who-are-developmentally-ready-school-emc-d01. 2023.
- 22. Ghandour RM, Hirai AH, Moore KA, Robinson LR, Kaminski JW, Murphy K, et al. Healthy and Ready to Learn: Prevalence and Correlates of School Readiness among United States Preschoolers. Acad Pediatr. 2021;21(5):818–29. pmid:33667721
- 23. Ghandour RM, Hirai AH, Moore KA, Paschall K, LaForett DR, Reddington E, et al. School readiness among United States children: Results from the 2022 National Survey of Children's Health. Acad Pediatr. 2024.
- 24. Cavallera V, Lancaster G, Gladstone M, Black MM, McCray G, Nizar A, et al. Protocol for validation of the global scales for early development (GSED) for children under 3 years of age in seven countries. BMJ Open. 2023;13(1):e062562. pmid:36693690
- 25. McCoy DC, Waldman M, CREDI Field Team, Fink G. Measuring early childhood development at a global scale: Evidence from the Caregiver-Reported Early Development Instruments. Early Child Res Q. 2018;45:58–68.
- 26. Halpin P, Castro EF, Petrowski N, Cappa C. Monitoring Early Childhood Development at the Population Level: The ECDI2030. 2023. Available: https://europepmc.org/article/ppr/ppr621445
- 27. Ryberg R, Wiggins L, Moore KA, Daily S, Piña G, Klin A. Measuring state-level infant and toddler well-being in the United States: Gaps in data lead to gaps in understanding. Child Indic Res. 2022;15(3):1063–102. pmid:35924079
- 28. American Educational Research Association, American Psychological Association, National Council on Measurement in Education, others. Standards for educational and psychological testing. Washington, DC: American Educational Research Association; 2014.
- 29. Child and Adolescent Health Measurement Initiative. 2016-2020 National Survey of Children's Health, SPSS Indicator dataset. Data Resource Center for Child and Adolescent Health supported by Cooperative Agreement U59MC27866 from the U.S. Department of Health and Human Services, Health Resources and Services Administration (HRSA), Maternal and Child Health Bureau (MCHB). 2022.
- 30. Waggoner P, Kennedy R, Clifford S. Detecting fraud in online surveys by tracing, scoring, and visualizing ip addresses. JOSS. 2019;4(37):1285.
- 31. Bayley N, Aylward GP. Bayley Scales of Infant and Toddler Development (4th ed.) Technical Manual. Bloomington, MN: NCS Pearson. 2019.
- 32. Schrank FA, Wendling BJ, Flanagan DP, McDonough EM. The Woodcock–Johnson IV Tests of Early Cognitive and Academic Development. Contemporary Intellectual Assessment: Theories, Tests, and Issues. 2018. p. 283–301.
- 33. McCray G, McCoy D, Kariger P, Janus M, Black MM, Chang SM, et al. The creation of the Global Scales for Early Development (GSED) for children aged 0-3 years: combining subject matter expert judgements with big data. BMJ Glob Health. 2023;8(1):e009827. pmid:36650017
- 34. Waldman M, McCoy DC, Seiden J, Cuartas J, Fink G. Validation of motor, cognitive, language, and socio-emotional subscales using the Caregiver Reported Early Development Instruments: An application of multidimensional item factor analysis. Int J Behav Dev. 2021;45(4):368–77.
- 35. Cappa C, Petrowski N, De Castro EF, Geisen E, LeBaron P, Allen-Leigh B, et al. Identifying and Minimizing Errors in the Measurement of Early Childhood Development: Lessons Learned from the Cognitive Testing of the ECDI2030. Int J Environ Res Public Health. 2021;18(22):12181. pmid:34831937
- 36. Weber AM, Rubio-Codina M, Walker SP, van Buuren S, Eekhout I, Grantham-McGregor SM, et al. The D-score: a metric for interpreting the early development of infants and toddlers across global settings. BMJ Glob Health. 2019;4(6):e001724. pmid:31803508
- 37. van Buuren S, Eekhout I. Child Development with the D-score. Boca Raton, FL: CRC Press; 2023. https://doi.org/10.1201/9781003216315
- 38. Rosseel Y. lavaan: An R Package for Structural Equation Modeling. J Stat Soft. 2012;48(2).
- 39. Kroenke K, Spitzer RL, Williams JB. The PHQ-9: validity of a brief depression severity measure. J Gen Intern Med. 2001;16(9):606–13. pmid:11556941
- 40. Kroenke K, Spitzer RL, Williams JBW, Monahan PO, Löwe B. Anxiety disorders in primary care: prevalence, impairment, comorbidity, and detection. Ann Intern Med. 2007;146(5):317–25. pmid:17339617
- 41. Schafer JL, Graham JW. Missing data: our view of the state of the art. Psychol Methods. 2002;7(2):147–77. pmid:12090408
- 42. Chalmers RP. Mirt: A multidimensional item response theory package for the R environment. J Stat Softw. 2012;48:1–29.
- 43. Asparouhov T, Muthén B. Multiple-Group Factor Analysis Alignment. Structural Equation Modeling: A Multidisciplinary J. 2014;21(4):495–508.
- 44. Muthén LK, Muthén BO. Mplus User's Guide. 2017.
- 45. Thissen D, Wainer H. Test Scoring. New York: Routledge; 2001.
- 46. Rigby RA, Stasinopoulos DM. Generalized additive models for location, scale and shape (with discussion). Appl Stat. 2005;54(3):507–54.
- 47. Jimenez ME, Wade R Jr, Lin Y, Morrow LM, Reichman NE. Adverse experiences in early childhood and kindergarten outcomes. Pediatrics. 2016;137(2):e20151839. pmid:26768347
- 48. Petterson SM, Albers AB. Effects of poverty and maternal depression on early child development. Child Dev. 2001;72(6):1794–813. pmid:11768146
- 49. Treat AE, Sheffield-Morris A, Williamson AC, Hays-Grudo J. Adverse childhood experiences and young children’s social and emotional development: the role of maternal depression, self-efficacy, and social support. Early Child Dev Care. 2020;190:2422–36.
- 50. Dickson M, Gregg P, Robinson H. Early, late or never? when does parental education impact child outcomes?. Econ J (London). 2016;126:F184–231. pmid:28736454
- 51. Jeong J, McCoy DC, Fink G. Pathways between paternal and maternal education, caregivers’ support for learning, and early child development in 44 low-and middle-income countries. Early Child Res Q. 2017;41:136–48.
- 52. Waldman MR, Raikes A, Hepworth K, Black MM, Cavallera V, Dua T, et al. Psychometrics of psychosocial behavior items under age 6 years: evidence from Nebraska, USA. Infant Ment Health J. 2024;45: 56–78.