Performance of Dutch Children on the Bayley III: A Comparison Study of US and Dutch Norms

Background: The Bayley Scales of Infant and Toddler Development-third edition (Bayley-III) are frequently used to assess early child development worldwide. However, the original standardization only included US children, and it is still unclear whether these norms are adequate for use in other populations. Recently, norms for the Dutch version of the Bayley-III (the Bayley-III-NL) were constructed. Scores based on Dutch and US norms were compared to study the need for population-specific norms.
Methods: Scaled scores based on Dutch and US norms were compared for 1912 children between 14 days and 42 months 14 days. Next, the proportions of children scoring < -1 SD and < -2 SD based on the two norms were compared, to identify over- or under-referral for developmental delay resulting from non-population-based norms.
Results: Scaled scores based on Dutch norms fluctuated around values based on US norms on all subtests. The extent of the deviations differed across ages and subtests. Differences in means were significant across all five subtests (p < .01), with small to large effect sizes (ηp²) ranging from .03 to .26. Using the US instead of Dutch norms resulted in over-referral regarding gross motor skills, and under-referral regarding cognitive, receptive communication, expressive communication, and fine motor skills.
Conclusions: The Dutch norms differ from the US norms for all subtests and these differences are clinically relevant. Population-specific norms are needed to identify children with low scores for referral and intervention, and to facilitate international comparisons of population data.

Introduction met the criteria for this group if they had special needs (e.g., clinical indications for low vision or motor functioning), or when parents reported presence of risk factors for development (e.g., prematurity, Down Syndrome).
To ensure representativeness for the Dutch population, the composition of the norm sample was guided by the four background demographics that were also used for the compilation of the US sample: child's gender, parental education, parental ethnicity, and geographical region. The target percentages regarding these demographics are presented in Table 1. These are based on population characteristics as provided by the Dutch Central Bureau of Statistics (CBS) [12]. The categories used by the CBS to distinguish educational levels, differences in ethnicity, and geographical region fit the Dutch system, and therefore these were also used for the norm sample and in this study. To optimize the representativeness of the sample, the norm sample was also weighted per age group in order to reach the target percentages as described in Table 1, in terms of gender, geographical region, parental ethnicity, and mother's educational level.
The weighting values were between zero and two, which is in concordance with the guidelines of the Commissie Testaangelegenheden Nederland (COTAN) [13].
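The weighting step can be illustrated with a minimal sketch. It assumes simple post-stratification on a single variable within one age group (weight = target share divided by observed share); the source weighted on four demographics and does not describe the exact algorithm, so the function name `poststratification_weights` and the data below are hypothetical.

```python
from collections import Counter

def poststratification_weights(categories, targets):
    """Return one weight per case: target share / observed share of its category."""
    n = len(categories)
    observed = {cat: count / n for cat, count in Counter(categories).items()}
    return [targets[cat] / observed[cat] for cat in categories]

# Hypothetical age group (not study data): 60% girls observed, 50% targeted.
gender = ["girl"] * 6 + ["boy"] * 4
weights = poststratification_weights(gender, {"girl": 0.5, "boy": 0.5})

# Over-represented cases receive weights below 1, under-represented cases above 1.
print(sorted(set(round(w, 3) for w in weights)))
```

In this toy example every weight lies between zero and two, the range mentioned above, and the weights sum to the original sample size.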
For the current study, we included children for whom results were available for all subtests of the Bayley-III. This concerns 1912 (97.9%) of the children in the norm sample. Children at risk or showing developmental delay were included (12.6%). In Table 1, characteristics of the participants are presented for different age groups. The sample was considered to be representative for the Dutch population as proportions of children in relation to the background characteristics did not deviate more than 5 percent from the target percentages.

Measurements
Background questionnaire. Mothers completed a background questionnaire containing 26 questions about family background and child characteristics, such as date of birth, health of the child, ethnicity of children and parents, and family composition.
Bayley-III-NL. The Bayley-III-NL is the translated and slightly adapted version of the Bayley Scales of Infant and Toddler Development-third edition (Bayley-III; [1,9,10,11]). It is an individually administered instrument that measures the developmental level of children between 16 days and 42 months 15 days old. For administration purposes, this age range is divided into 17 age groups, just as in the original US version. The Dutch version consists of five subtests: Cognition (91 items), Receptive Communication (49 items), Expressive Communication (46 items), Fine Motor (66 items), and Gross Motor (72 items).
In the Dutch version, a few adaptations were necessary in order to fit the Dutch culture and specifically the Dutch language. Changes were kept to a minimum to maintain international comparability of the Bayley-III-NL. First, five pictures in the material were adapted to Dutch culture (one in the Cognition subtest, three in the Receptive Communication subtest, and one in the Expressive Communication subtest). For example, an American football was changed into a soccer ball, which is more common in the Netherlands. Second, two items of the Expressive Communication subtest were deleted as these do not fit Dutch language development: item "Uses Verb + ing" and item "Uses present progressive form" [10,11]. Third, because Dutch children in the pilot study showed slower development in Gross Motor skills than US children, the starting point for this subtest was set one age group younger than in the US version. In addition, the reversal rule was made stricter: the first five instead of three items had to be scored positively, or else the items of a younger age group had to be assessed [11].
The Bayley-III-NL norms were constructed by means of continuous norming techniques using the weighted sample [9]. The construction of the Dutch norms was based on age calculated in days, as this seems to reflect the development of young children most precisely [9]; for the US norms, age groups varying from two weeks to three months were used [1]. As in the original US version, the standardized scores of the subtests of the Bayley-III-NL range from 1 to 19, with a mean of 10 and a standard deviation of 3. The Bayley-III-NL is a reliable and valid instrument with good psychometric characteristics in Dutch children. The reliabilities of the five subtests were assessed using Guttman's Lambda 2 and varied from .82 to .92 [2].
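To make the scaled-score metric concrete, the sketch below maps a z-score onto the 1 to 19 scale (mean 10, SD 3). It assumes a plain linear transformation with rounding and clipping; the function name `z_to_scaled` is hypothetical, and the actual Bayley-III-NL norm tables were derived by continuous norming, not by this formula.

```python
def z_to_scaled(z: float) -> int:
    """Map a z-score to a scaled score with mean 10 and SD 3, clipped to 1..19."""
    scaled = round(10 + 3 * z)  # linear transformation: an illustrative assumption
    return max(1, min(19, scaled))

# Show the scaled scores that correspond to -2, -1, 0 and +1 SD.
for z in (-2.0, -1.0, 0.0, 1.0):
    print(z, z_to_scaled(z))
```

Under this mapping, -1 SD corresponds to a scaled score of 7 and -2 SD to a scaled score of 4, which are the cutoffs used in the analyses of low scores.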

Procedure
Examiners were experienced clinicians or pedagogy students in the final year of their bachelor or master education, and all were trained to be reliable in their test administration. All examiners also scored a filmed Bayley-III-NL assessment and had to reach an inter-rater consensus rate of at least 80% per subtest to pass their training. Their scores were compared to the scores of the trainer, and the average kappa for all administered items over all subtests was .77 (SD = .05).
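The two agreement measures mentioned above can be sketched as follows. The source does not state which kappa variant was computed, so Cohen's kappa for two raters is assumed here; the function names and the item scores (1 = credit, 0 = no credit) are hypothetical.

```python
from collections import Counter

def percent_agreement(examiner, trainer):
    """Share of items on which examiner and trainer gave the same score."""
    return sum(a == b for a, b in zip(examiner, trainer)) / len(examiner)

def cohens_kappa(examiner, trainer):
    """Chance-corrected agreement between two raters (Cohen's kappa)."""
    n = len(examiner)
    p_observed = percent_agreement(examiner, trainer)
    examiner_counts, trainer_counts = Counter(examiner), Counter(trainer)
    # Agreement expected by chance, from each rater's marginal proportions.
    p_expected = sum(examiner_counts[k] * trainer_counts[k] for k in examiner_counts) / n ** 2
    if p_expected == 1:
        return 1.0  # both raters used a single identical category throughout
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical item scores for one filmed assessment (not study data).
examiner = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
trainer = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
agreement = percent_agreement(examiner, trainer)
kappa = cohens_kappa(examiner, trainer)
print(agreement, round(kappa, 2))
```

Kappa is lower than raw agreement because it discounts the matches that two raters would produce by chance alone.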
Two weeks prior to the planned Bayley-III-NL assessment, mothers received questionnaires and an informed consent form by mail, which they were asked to complete at home. During the visit for the Bayley-III-NL assessment, the questionnaires and informed consent form were collected. Next, a trained test leader administered the Bayley-III-NL in the presence of a maximum of two primary caregivers. Locations within an acceptable travel distance for the parents and their child were selected, and the rooms were free from distracting stimuli. Because young infants are only awake and alert for short periods of time and traveling to a lab could be too fatiguing, children up to six months of age were tested at home. Depending on the age of the child and the preference of the caregiver(s) and the child, the child sat on the lap of one of the caregivers, or independently on a chair, during the parts of the assessment for which sitting at a table was required. The Utrecht University Medical Center's Medical Ethical Committee approved this study.

Data analysis
Differences between the scaled scores based on Dutch and US norms were calculated for all children on all subtests. A one-sample multivariate analysis of variance (MANOVA) was used to test whether the mean difference scores over all subtests for the sample as a whole were equal to zero, and to control for inflation of Type I error. When this MANOVA indicated significant differences between the scaled scores based on Dutch and US norms, we referred to the univariate results to see for which subtests these significant differences were found. As the mean differences might be age dependent, the same MANOVA including all five subtests was performed in the next step for each age group separately. Effect sizes (partial eta squared, ηp²) of these results were evaluated and interpreted according to Cohen [14], with .06 or less indicating a small effect, .07-.13 a medium effect, and .14 or higher a large effect. Finally, the proportions of children with low scores, scoring < -1 SD (i.e., a scaled score < 7) and < -2 SD (i.e., a scaled score < 4) based on Dutch norms and US norms, were compared by means of McNemar analyses. Analyses were performed using SPSS 20.0.
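The comparison of paired proportions can be sketched as below. This is a minimal illustration that assumes the exact (binomial) form of the McNemar test; the source does not state which variant SPSS applied, and the function names and scores are hypothetical.

```python
from math import comb

def discordant_counts(dutch_scores, us_scores, cutoff=7):
    """Count children classified as low (< cutoff) by only one of the two norm sets."""
    pairs = list(zip(dutch_scores, us_scores))
    only_dutch = sum(d < cutoff and u >= cutoff for d, u in pairs)
    only_us = sum(u < cutoff and d >= cutoff for d, u in pairs)
    return only_dutch, only_us

def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical scaled scores for 12 children (not data from the study).
dutch = [5, 6, 6, 5, 6, 8, 9, 10, 6, 4, 6, 6]
us = [7, 8, 7, 7, 8, 8, 9, 10, 6, 4, 7, 5]
b, c = discordant_counts(dutch, us)
p_value = mcnemar_exact_p(b, c)
print(b, c, p_value)
```

Only the discordant pairs carry information in this test: children who are low under both norm sets, or under neither, do not affect the p-value.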

Results
Results of the overall MANOVA over all age groups revealed that for all subtests, the mean differences between the scaled scores based on the Dutch and US norms deviated significantly from 0, with a large effect size, F(5, 1907) = 449.99, p < .01, ηp² = .54, indicating a significant difference between the Dutch norms and the US norms. Next, the univariate results (see Table 2) showed significant differences for all subtests, with large effect sizes for the Cognition, Fine Motor, and Gross Motor subtests of .15, .16, and .25 respectively, and small effect sizes for the Receptive Communication and Expressive Communication subtests of .06 and .05 respectively. The mean differences presented in Table 2 also provide information on the size of the standard deviation in relation to the effect sizes. The graphs in Fig 1 indicate that the extent of the deviations fluctuated across age groups for all subtests. Table 3 presents the mean difference between the scaled scores based on the Dutch and the US norms per age group for all subtests. The smallest mean difference of .01 was found for Expressive Communication for age group B (1 month 16 days-2 months 15 days). The largest mean difference of 3.18 was found for Gross Motor skills for age group G (6 months 16 days-8 months 30 days), which equals more than 1 SD based on the Bayley-III-NL scaled scores.
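As a consistency check on the figures above, the multivariate effect size can be recovered from the F statistic and its degrees of freedom. This assumes the standard relation between F and partial eta squared for a test with a single multivariate root, which is an assumption about how the value was derived and not a statement from the source. The second line expresses the largest mean difference in units of the scaled-score standard deviation of 3.

```latex
\[
\eta_p^2 = \frac{F \cdot df_1}{F \cdot df_1 + df_2}
         = \frac{449.99 \times 5}{449.99 \times 5 + 1907}
         = \frac{2249.95}{4156.95} \approx .54
\]
\[
\frac{3.18}{3} \approx 1.06 \ \text{SD}
\]
```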
As Fig 1 shows that the size of the differences between the scores based on the Dutch and the US norms differed per age group, the results were also analyzed with MANOVAs including all subtests for each Bayley-III-NL age group separately. The second column in Table 4 displays the effect sizes regarding the multivariate analyses in which all five subtests were included. Large effect sizes were found for the differences between the scaled scores based on the US and Dutch norms for all age groups, but not consistently for specific subtests or for specific age groups (Table 4). For Cognition, effect sizes were generally large for all age groups. For the Receptive Communication subtest, effect sizes were generally large with the exception of four age groups. Regarding the Expressive Communication subtest, most effect sizes were large for children older than 6 months 15 days, whereas small to moderate effect sizes were found for the age groups between 1 month 15 days and 6 months 15 days, which represents the period of preverbal development. Regarding the Fine Motor subtest, differences between the US and Dutch norms were largest in children between 1 month 15 days and 8 months 30 days old. The effect sizes for the older age groups, from 9 months to 42 months 15 days old, fluctuated from small to large. For the Gross Motor subtest, most effect sizes were large for the age groups from 1 month 15 days to 28 months 15 days. For the older age groups, only moderate to small effect sizes were found. For most of these small effect sizes, 0 falls within the confidence interval, indicating that no significant difference exists between the scaled scores based on the Dutch norms and the US norms.
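The labels "small", "medium", and "large" used in this section follow the thresholds given under Data analysis. A minimal sketch of that labeling rule, applied to the univariate effect sizes reported for the whole sample, is shown below; the function name `effect_size_label` is hypothetical.

```python
def effect_size_label(eta_p2: float) -> str:
    """Label a partial eta squared value: <= .06 small, .07-.13 medium, >= .14 large."""
    value = round(eta_p2, 2)  # the thresholds are stated to two decimals
    if value <= 0.06:
        return "small"
    if value <= 0.13:
        return "medium"
    return "large"

# Univariate effect sizes reported in the text for the overall analysis.
reported = {
    "Cognition": 0.15,
    "Receptive Communication": 0.06,
    "Expressive Communication": 0.05,
    "Fine Motor": 0.16,
    "Gross Motor": 0.25,
}
labels = {subtest: effect_size_label(value) for subtest, value in reported.items()}
print(labels)
```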
Using a scaled score of 7 as a cutoff point, McNemar analyses showed that for all subtests, except Receptive Communication, significantly different proportions of children with low scores were found using Dutch and US norms (Table 5). When using the US norms instead of the Dutch norms, fewer children scored below -1 or -2 SD in Cognition, Fine Motor, and Expressive Communication, and more children did so in Gross Motor functioning. Regarding the Receptive Communication subtest, a similar proportion of children scoring below -1 SD, but fewer children scoring below -2 SD, were identified when using the US norms. In addition, McNemar analyses were performed for 4 age groups (see Table 5). For all age groups, the proportions of children scoring below -1 SD using Dutch and US norms differed significantly for most subtests. The difference between the proportions of children who scored below -2 SD using Dutch and US norms was significant for only a few subtests for the youngest three age groups and for most subtests for the oldest age group. In general, more children scored below -1 and -2 SD using the Dutch norms than using the US norms.
Note (Table 2). The mean difference is calculated as the scaled score based on Dutch norms minus the scaled score based on US norms. Mean differences < 0 indicate that the scaled score based on the US norms was higher than the scaled score based on the Dutch norms. Mean differences > 0 indicate that the scaled score based on the US norms was lower than the scaled score based on the Dutch norms. doi:10.1371/journal.pone.0132871.t002

Discussion
For a large group of Dutch children, significant differences were found between their scores based on the Dutch norms and their scores based on the US norms, on all subtests of the Bayley-III-NL. Overall, effect sizes of the differences between the scores based on the Dutch and US norms were large. Analyses concerning the proportions of children with low scores that may indicate a developmental delay (i.e., below -1 SD and -2 SD) showed that these differences are clinically important. Regarding the Cognition, Fine Motor, and Expressive Communication subtests, under-referral might have resulted from the US norms, as fewer children would have been identified with a developmental delay compared to the Dutch norms. The reverse was found for the Gross Motor subtest: the use of US norms would have resulted in over-referral, as more children would have been identified with a developmental delay compared to the Dutch norms. These results regarding over- and under-referral are to some extent age-dependent. The largest difference between Dutch and US scores was found for the Gross Motor subtest: for children of approximately nine months old, the difference was one standard deviation, and the mean of the Dutch children resembled that of seven-month-old US children. These findings illustrate important differences in functioning and developmental levels of children in two Western populations. These results are in accordance with earlier studies from different countries that compared the Bayley results for children to the US norms [4,5,6,7,8].
In relation to the findings of this study, it is important to realize that some adaptations were needed to make the Bayley-III appropriate for the Dutch population. Besides translation, some changes were made to the Communication subtests in accordance with Dutch culture, as described earlier. A previous study evaluated whether the original item sequence of the Bayley-III, in which the items increase in difficulty, would be adequate for assessment of the development of Dutch children, and it was concluded that the same item sequence could be applied [11]. Thus, the difficulty of the adapted items also fit the pattern of increasing item difficulty, which indicates that the adaptations did not make the items too easy or too difficult for the children in relation to age. It is therefore unlikely that the changes caused the differences between the scores based on the Dutch and the US norms. Furthermore, only small to moderate differences between Dutch and US norm scores were found for the oldest age group on the Expressive Communication subtest, in which two items were deleted. In addition, another adaptation was made to the starting point and reversal rule of the Gross Motor subtest, as described under Methods. For some children, this could have resulted in the administration of more items, and accordingly more mistakes made by these children. As a result, the use of US norms could have led to lower scores at all ages in comparison to the Dutch norms. However, for several age groups the Dutch norms were higher than or comparable to the US norms. Therefore, it seems unlikely that the adaptations that needed to be made for the Dutch version of the Bayley scales solely explain the differences between the US and Dutch norm scores. An important explanation for the differences between the norm scores concerns the composition of the Dutch and US populations that underlie the samples used for the norm construction.
For the development of the Bayley-III-NL and its norms, the norming and validation procedure as used in the US was replicated. However, due to cultural differences and differences in the composition of the populations, a perfect replication was not always possible. Both norming samples had a composition representative of the population based on the same background characteristics concerning gender, parental education, ethnicity, and geographical region. The categories used to distinguish educational levels, differences in ethnicity, and geographical region in the Netherlands were based upon the distinctions used by the CBS [12] and thus fit the Dutch system. However, concerning ethnicity in the US, 40% of the population was from a White background, 14% African American, 20% Hispanic, 4% Asian, and 1% was coded as Other [1,15]. For the Dutch population, different categories were used: 75% originally Dutch (White Caucasian) versus 25% non-Dutch parents [12]. Previous studies showed that developmental trajectories of children with different ethnic backgrounds, even within the same country, were significantly different for motor skills [16] and language skills [17]. Therefore, the difference in composition of the Dutch and US norming samples regarding ethnicity of the parents might have contributed to the differences between the norms.
Another important factor related to developmental outcome is maternal educational level. In the US, educational level was measured in years, and 42% of the parents had a low, 30% a medium, and 28% a high level of education [15]. In the Netherlands, 16% of mothers between 25 and 45 years of age had a low, 40% a medium, and 44% a high level of education [12]. For the Bayley-III-NL, analyses regarding the association between mother's educational level and the scaled scores of the Dutch norm sample revealed that, with increasing age, the scaled scores of children of mothers with a low, medium, and high level of education differed significantly: children of highly educated mothers generally had higher scores on the subtests Cognition and Receptive Communication compared to children of lower educated mothers [9]. This is in concordance with earlier studies, which showed that children of parents with a lower SES, including a lower educational level, had poorer language skills and poorer executive functioning skills than children of parents with a higher SES [18]. Thus, the difference in composition of the Dutch and US norming samples regarding educational levels might also have contributed to the differences between the norms.
For the youngest age groups, between 15 days and 10 months of age, a swing in the scores based on the Dutch norms was found regarding Cognition and Fine Motor skills. This might be explained by the fact that the norms of the Bayley-III-NL were created based on a weighted sample and on age calculated in days, whereas the results of this study are based on an unweighted sample and the means are presented for age groups. However, more research is needed on the relation between this swing in scores for the young children and the Bayley-III-NL, and on why this is seen specifically in two of the subtests.
It is concluded that outside the US, the use of population-specific norms instead of the US norms is preferable. However, it is costly and time-consuming to create such norms. When population-specific norms are unavailable, a matched control group of children from the same population, assessed at the same ages as the studied group, should be used. Using such matched control groups as a reference may be more reliable when the norms applied come from a country whose culture and population composition are more similar than those of the US. When data from a matched control group are not available, using norms of a more similar country might be a better alternative than using the US norms. However, caution is still needed when interpreting the results.

Conclusion
The current study shows the importance of population-specific norms for the interpretation of developmental test results. Therefore, in the Netherlands, the Dutch population-specific norms should be used for all subtests and all ages.