Where are the vulnerable children? Identification and comparison of clusters of young children with health and developmental vulnerabilities across Queensland

This study aimed to better understand the vulnerability of children in their first year of school, aged between 5 years 5 months and 6 years 6 months, based on five health and development domains. Identification of subgroups of children within these domains can lead to more targeted policies to reduce these vulnerabilities. The focus of this study was to determine clusters of geographical regions with high and low proportions of vulnerable children in Queensland, Australia. This was achieved by carrying out a K-means analysis on data from the Australian Early Development Census and the Australian Bureau of Statistics. The clusters were then compared with respect to their geographic locations and risk factor profiles. The results are made publicly available via an interactive dashboard application developed in R Shiny.


Background
Internationally, there is an increasing focus on population health among integrated care organisations and health systems [1,2]. The goal of population health methods is to enhance the overall health of a group of people. In order to do this, it is critical to recognise the requirements of various groups within the population [3,4,5]. An important group in the population is children. Healthy child development improves human capabilities by allowing children to mature and participate in economic, social, and civic life [6]. Child development includes the biological, psychological, and emotional [...] as a baseline in 2009, children falling below the 10th percentile in a domain, taking into account the age differences, are categorised as 'developmentally vulnerable'. The AEDC reveals that the proportion of children who are developmentally vulnerable, within each developmental domain, varies considerably between geographical regions across Australia. This variation exists across the smallest geographic areas defined by the AEDC, which are referred to as local communities and are often equivalent to suburbs [12]. To address inequalities in developmental vulnerabilities, further insight is needed into the factors that contribute to such variation [13].
One method to understand the variation is cluster analysis [14]. Cluster analysis is a mechanism for grouping (clustering) a set of objects (e.g., local communities) in such a way that objects within a group (cluster) are more similar (e.g., in terms of developmental vulnerability) to one another than to those in other groups (clusters) [15]. There are many clustering methods: model-based versus fully empirical, parametric versus non-parametric, probabilistic versus non-probabilistic, hierarchical versus partition-based, and supervised versus unsupervised [16]. There are also many computational methods for clustering, including the Expectation-Maximisation (EM) algorithm [17] and a variety of simulation-based algorithms, such as Markov chain
Monte Carlo (MCMC) [18]. A well-established, simple, non-probabilistic, unsupervised partitioning method, which is employed in this study, is K-means clustering, where K denotes the number of clusters (see the K-means clustering section) [19,20]. Common strategies for choosing the value of K include the elbow method [21], gap statistic [22], silhouette coefficient [23], and canopy method [24]. In this study, the silhouette coefficient (see the Cluster evaluation section) is used to determine the number of clusters. While the elbow method is easy to implement and the calculations required are simple, the silhouette coefficient allows evaluations of clusters on multiple criteria, and hence it is more likely that the optimal number of clusters can be determined [25].
Publicly accessible data in the population AEDC domain are frequently aggregated within geographical areas [26]. In Australia, these geographical areas are typically the statistical areas defined in the Australian Statistical Geography Standard (ASGS). In the ASGS, Statistical Areas Level 1 (SA1) are the smallest defined geographical areas and aggregate to form Statistical Areas Level 2 (SA2). There are four levels of aggregation of statistical areas, SA1 through SA4. Where person-level information is available, it is not uncommon for data on the exact location of individuals to be missing. Even if exact location data are available, privacy and confidentiality concerns prevent publication of person-level information. Hence this study uses data aggregated at the SA2 level [26].
In this study, we aimed to identify and characterise regions in Queensland in terms of high and low vulnerability across five domains of health and development, for children in their first year of full-time school. We used K-means clustering to identify regions, such that within a cluster of SA2s making up a region, children have similar vulnerabilities for a given AEDC domain. In characterising the regions (clusters of SA2s), we consider the factors from the AEDC that are publicly available at the SA2 level:
attendance at preschool, Indigenous status, mother's language, country of birth, socioeconomic status, and remoteness status.
In addition to the abridged results presented in this paper, we developed a web application to make the complete set of results accessible and more easily digestible. The web application has an intuitive interface that allows users to interactively explore child development vulnerability across the AEDC domains and across the SA2 areas of Queensland. We used the Shiny package for R [27] to develop the web application, since the data were also analysed using R statistical software [28]. The results of this research will support targeted early intervention strategies which can allow children to reach their maximum developmental potential.

Methods
Case study and sources of data
Child development vulnerability data were obtained from the 2018 Australian Early Development Census (AEDC). The AEDC is conducted every three years and collects data on children in their first year of full-time school. The AEDC most recently took place in 2021, but the most recent data available are for the 2018 census. The census is performed by classroom teachers in the child's first year of full-time schooling across Australian Government and non-Government schools, and data are collected with the agreement of parents [12]. The data provided on a child by their teacher, based on the teacher's knowledge and observations of the child, are used to assign the child a score (0 to 10) for each AEDC developmental domain. For each domain, the child is then classified as vulnerable if their score is in the lowest 10% of scores for that domain using the cutoffs established as a baseline in 2009. Approximately 65,000 children (98.1% of eligible children) across 1,414 Queensland Government, Catholic and Independent schools participated in the 2018 AEDC collection. The data were available as aggregated counts at the SA2 level. Among the 528 SA2s that make up Queensland, there was an average of 123 children per SA2, with a standard deviation of 100 [12].
All five AEDC domains of health and developmental vulnerability were considered in this study: physical health and well-being (Physical), social competence (Social), emotional maturity (Emotional), language and cognitive skills-school based (Language), and communication skills and general knowledge (Communication). We also considered two additional AEDC indicators of vulnerability: vulnerable in one or more domains (Vuln 1), and vulnerable in two or more domains (Vuln 2). Due to the aggregated nature of the available data, we focused on the proportion of vulnerable children within each SA2. The following data were also obtained from the 2018 AEDC for each SA2: proportion of children who attended
pre-school (Preschool), proportion who identified as Indigenous (Indigenous), proportion with English as the mother's language (English), proportion with Australia as country of birth (Australia). Further data for 2018 were obtained from the Australian Bureau of Statistics (ABS) for each SA2, including: Index of Relative Socio-economic Disadvantage (IRSD) for the SA2 (1 to 10), and Remoteness (Major City, Inner Regional, Outer Regional, Remote, Very Remote). The IRSD is coded from 1 (lowest) to 10 (highest) [29]; a low score suggests that the area, in general, is at a disadvantage, e.g., many low-income households, many people without qualifications or with low-skill occupations. In 2018, there were 294 major city, 113 inner regional, 96 outer regional, 11 remote and 14 very remote SA2s in Queensland.
Between 3% and 6% of values were missing across the variables in the dataset. Proportions (e.g., Preschool, Indigenous) that were missing for an SA2 were imputed using the average of the proportions from the neighbouring SA2s. For categorical data, i.e., IRSD and Remoteness, the missing value was imputed using the most frequent category among the neighbouring SA2s. Missing values for two islands could not be imputed, as these regions have no contiguous neighbours. As a result, the analysis carried out in this study was restricted to the remaining 526 SA2 areas.
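The neighbour-based imputation described above can be illustrated with a minimal Python sketch. This is not the authors' R code: the function name `impute_from_neighbours`, the adjacency dictionary and the toy values are all hypothetical, and the sketch assumes that only observed (not previously imputed) neighbour values are used.

```python
from collections import Counter
from statistics import mean


def impute_from_neighbours(values, neighbours, categorical=False):
    """Fill missing (None) entries using observed values from adjacent areas.

    values: dict mapping area id -> observed value, or None if missing.
    neighbours: dict mapping area id -> list of contiguous area ids.
    Areas with no observed neighbours are left as None.
    """
    filled = dict(values)
    for area, value in values.items():
        if value is not None:
            continue
        observed = [values[n] for n in neighbours.get(area, [])
                    if values.get(n) is not None]
        if not observed:
            continue  # e.g. an island with no contiguous neighbours
        if categorical:
            # Most frequent category among the neighbours.
            filled[area] = Counter(observed).most_common(1)[0][0]
        else:
            # Average of the neighbouring proportions.
            filled[area] = mean(observed)
    return filled


# Hypothetical toy data: four areas, where "D" is an island.
preschool = {"A": 0.80, "B": None, "C": 0.90, "D": None}
irsd = {"A": 3, "B": None, "C": 3, "D": None}
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": []}

preschool_filled = impute_from_neighbours(preschool, adjacency)
irsd_filled = impute_from_neighbours(irsd, adjacency, categorical=True)
```

In this toy example, area "B" receives the average of its two neighbours, while the island "D" stays missing and would be dropped from the analysis, mirroring the two islands excluded in the study.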

Clustering method
This section details the clustering method used to investigate the data clusters. All statistical analyses were conducted using R statistical software version 4.1.3 [28]. The analyses for the K-means algorithm were carried out using the mclust [30] and factoextra [31] packages, and the shiny package in R was used to develop the interactive dashboard [27].

K-means clustering
The K-means clustering method is a popular unsupervised machine learning technique that is extensively utilised due to its simplicity and fast convergence. The K-means algorithm is a basic partitioning approach that utilises a distance metric for partitioning observations into clusters. The number of clusters, K, is determined beforehand. The centre of a cluster is known as the cluster centroid. Every data point is allocated to a cluster such that within a cluster the summed distance between the centroid and data points is minimised, and between clusters the summed distance between cluster centroids is maximised. Some distance metrics include Euclidean distance, Manhattan distance, cosine distance, Minkowski distance and correlation distance [32]. In this study, the Euclidean distance was adopted. The chosen value of K directly influences both the convergence of the algorithm and the inferences. In this study, we considered a range of plausible values of K and chose the value that gave the best fit, as determined by the silhouette coefficient (see the Cluster evaluation section).
The algorithm proceeds as follows.
1) Define the number of clusters K.
2) Randomly select K data points as the cluster centroids.
3) Assign data points to the closest cluster centroid.
4) Recompute the cluster centroids.
5) Repeat steps 3) and 4) until either the centroids do not change or the maximum number of iterations is reached [33].
In this paper we apply the K-means algorithm to the proportion of vulnerable children in an SA2 for each of the five AEDC domains and two indicators.
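The five steps above can be sketched in a few lines of Python. This is an illustrative implementation of the basic (Lloyd-style) procedure with Euclidean distance, not the R routine used in the study; the function name `kmeans`, the fixed random seed and the toy proportions are assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means with Euclidean distance.

    points: list of equal-length tuples of floats.
    Returns (centroids, labels).
    """
    rng = random.Random(seed)
    # Step 2: pick K data points at random as the starting centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Step 4: recompute each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        # Step 5: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical proportions of vulnerable children in six areas.
points = [(0.04,), (0.05,), (0.06,), (0.24,), (0.25,), (0.26,)]
centroids, labels = kmeans(points, k=2)
```

On this toy data the algorithm separates the three low-proportion areas from the three high-proportion areas. Because the starting centroids are random, real analyses usually repeat the procedure from several starts and keep the best solution.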

Cluster evaluation
Internal and relative validation are two popular ways of evaluating a cluster analysis. Internal validation uses two fundamental principles to validate clusters: cohesion and separation. Cohesion measures the average distance between items within a cluster, while separation measures the average distance from a cluster to the adjacent cluster. In relative validation, clusters are confirmed by altering the clustering algorithm's parameters, such as the number of clusters K, to optimise a given measure of fit. In this study, we adopt the silhouette method for cluster evaluation [23], which combines cohesion and separation. Cohesion represents the similarity between an item and the cluster to which it belongs; separation represents its dissimilarity to the other clusters. These comparisons may be quantified using the silhouette coefficient, which ranges from −1 to 1, with a value near 1 suggesting a good match between the item and its cluster. In general, silhouette width scores less than 0.2 or greater than 0.9 are problematic; silhouette width scores of 0.5 are good, and silhouette width scores between 0.7 and 0.9 are preferable [34]. The silhouette coefficient is given as:
$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} $$
where s(i) is the silhouette coefficient of data point i; a(i) is the average distance between i and all the other data points in the cluster to which i belongs, and represents the intra-cluster dissimilarity of i; and b(i) is the minimum, over all clusters to which i does not belong, of the average distance from i to the data points in that cluster, and represents the inter-cluster dissimilarity of i.
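As a worked illustration of the definitions of a(i), b(i) and s(i), the following Python sketch computes the silhouette coefficient for every point under Euclidean distance. It is a hedged example rather than the routine used in the study: the function name `silhouette_values` and the toy data are hypothetical, and scoring singleton clusters as 0 is a common convention assumed here.

```python
import math


def silhouette_values(points, labels):
    """Return s(i) for every point, using Euclidean distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own or len(clusters) < 2:
            scores.append(0.0)  # assumed convention for singletons / one cluster
            continue
        # a(i): mean distance to the other members of i's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Hypothetical proportions with a sensible and a poor grouping.
points = [(0.04,), (0.05,), (0.06,), (0.24,), (0.25,), (0.26,)]
good = silhouette_values(points, [0, 0, 0, 1, 1, 1])
poor = silhouette_values(points, [0, 1, 0, 1, 0, 1])
avg_good = sum(good) / len(good)
avg_poor = sum(poor) / len(poor)
```

The average of s(i) over all points is the quantity typically compared across candidate values of K; here the sensible grouping scores much higher than the interleaved one.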

R Shiny
Shiny is an R web application framework that allows the development of interactive web applications. This package makes it easy to create websites that interact with R without prior knowledge of web programming or other scripting languages. To create the Shiny application, we uploaded the data set, built the clustering algorithm in R, and then used these two files to create the Shiny application. A brief summary of Shiny and a description of the main components used to implement the application are provided in the Appendix. The program allows user involvement and generates interactive visualisations such as maps with panning and zooming capabilities. One disadvantage of Shiny is that applications created with it can only be deployed online using the Shiny web server. It is noted, however, that although Shiny currently has a relatively limited feature set, this will likely expand, given the product's popularity [35]. It is important to note that the application requires access to the internet.

Results
The developmentally vulnerable proportions were analysed on the log scale, due to skewness in the proportions, and converted back to their original scale in reporting the results. The number of clusters was evaluated for K = {3, 4, 5, ..., 12}, separately for each of the five AEDC domains and the two composite domain indicators (Vuln 1, Vuln 2). The optimal number of clusters for each domain was chosen to be four after validating the clusters internally using silhouette scores. The silhouette plots for the clusters in the five domains and two indicators are displayed in Figure 1.
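One way to picture the log-scale analysis and the reporting on the original scale is the short Python sketch below. It is an assumption-laden illustration, not the authors' procedure: the paper does not state how cluster summaries were back-transformed, so the sketch simply exponentiates the mean of the logged proportions (which gives a geometric mean), assumes all proportions are strictly positive, and uses the hypothetical function name `summarise_on_original_scale`. Clusters are then named in ascending order of vulnerability, in the spirit of the C1 to C4 labelling used in this paper.

```python
import math


def summarise_on_original_scale(proportions, labels):
    """Average log-proportions per cluster, back-transform, and rank clusters.

    Returns a dict mapping each original label to (name, back-transformed mean),
    where names run from "C1" (lowest vulnerability) upwards.
    Assumes every proportion is strictly positive.
    """
    logs = {}
    for p, lab in zip(proportions, labels):
        logs.setdefault(lab, []).append(math.log(p))
    # exp(mean(log p)) is the geometric mean on the original scale.
    means = {lab: math.exp(sum(v) / len(v)) for lab, v in logs.items()}
    ranked = sorted(means, key=means.get)
    return {lab: (f"C{rank}", means[lab])
            for rank, lab in enumerate(ranked, start=1)}


# Hypothetical proportions and arbitrary cluster labels from a clustering run.
summary = summarise_on_original_scale([0.05, 0.04, 0.25, 0.20], [2, 2, 0, 0])
```

Here the cluster originally labelled 2 becomes C1 and the cluster labelled 0 becomes C2, each with its summary expressed as a proportion rather than a log-proportion.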
Summary statistics for each cluster (size, mean, variance, range) for the five AEDC domains and two indicators, with the associated demographic factors, are given in Appendix A. These results are visualised in the R Shiny application, accessed at https://waladraidi.shinyapps.io/Shiny_2_6_2022/. Figures 2 and 3 illustrate the application interface. The interface includes two tabs. For the first tab (Figure 2), which shows the K-means cluster summary, the user can select the type of development vulnerability and the cluster of interest. The clusters, labelled C1, C2, C3 and C4, correspond to vulnerability level ordered from lowest vulnerability (C1) to highest vulnerability (C4). Furthermore, this first tab shows the associated characteristics related to the demographic factors for each cluster and the location of the SA2 areas on the map. The second tab (Figure 3) shows a map of the distribution of the clusters (regions of differing vulnerability) for a given development vulnerability. The user can choose the type of AEDC domain from the five domains and two indicators, and can zoom in on the Queensland map to view finer details for each region. This provides an interactive visual summary of vulnerability across the regions of Queensland, and comparison of the vulnerabilities across the five AEDC domains and two indicators. An example map is given in Figure 3. A selection of outputs from the app, comparing C4 (highest vulnerability) to C1 (lowest vulnerability), is provided in Appendix B.
Averaging over the five AEDC domains, the SA2s that make up C4 have an average of 25% of children identified as vulnerable, compared to 5% for C1 (Table 1); this discrepancy is also observed within each AEDC domain (Table A8). Compared to the least vulnerable cluster, C1, the most vulnerable cluster, C4, has higher proportions of children who do not have English as their mother language, who are Indigenous and who did not attend preschool (Tables 1 and A8).
The one exception to this is that, for the SA2s making up the most Emotionally vulnerable cluster, C4, there is a higher proportion of children who do have English as their mother language compared to the least Emotionally vulnerable cluster, C1 (Table A8). The SA2s belonging to C4 are located in far north and north west Queensland, and a small number can be found in the coastal areas of Queensland (Table A9). In contrast, the SA2s belonging to C1 can be found in the south east of the state. This region contains the majority of the children of Queensland and the capital city, Brisbane (Table A9). Across all AEDC domains, there is a much higher proportion of children residing in SA2s belonging to C4 with a low IRSD score (greater socioeconomic disadvantage) compared to C1.
In comparing Vuln 1 and Vuln 2, unsurprisingly, the proportion of children who are vulnerable on two or more (2+) domains is lower than the proportion vulnerable on one or more (1+) domains for both C4 and C1. For C4, the proportion of children who do not have English as their primary language is higher for 2+ vulnerabilities compared to 1+ vulnerabilities. In comparison, the proportion of children who identified as Indigenous is lower for 2+ vulnerabilities compared to 1+ vulnerabilities. The proportion who did not attend preschool is about the same for 1+ and 2+ vulnerabilities. The SA2s belonging to C4 for Vuln 1 are located in the same geographic areas of Queensland as Vuln 2 and additionally in the southeast and central coast.
For the SA2s in C4, there is a higher proportion of children residing in SA2s with a low IRSD score (greater socio-economic disadvantage) for Vuln 1 compared to Vuln 2.
Across the five domains, the most vulnerable cluster (C4) is smallest in the physical health domain, with around 30 SA2 areas, and largest in the communication skills domain, with 46 SA2 areas. In addition, there was a notably higher proportion of Indigenous children in the Physical domain in comparison with the rest of the AEDC domains. The proportion of children born in Australia was greater than 85% across all clusters and all domains, with differences between the clusters of no more than around 5%.

Discussion
In this study, the K-means algorithm was applied to investigate commonalities in statistical areas across Queensland, Australia, with respect to children's vulnerability based on five AEDC domains and two indicators. Four clusters were identified for each of these domains, and demographic profiles were developed for each cluster. In addition to presenting summary statistics in tabular form, an R Shiny app was developed to visualise and summarise the results of the analyses.
This app enables users to engage with the results interactively. For example, health managers can use the app to identify regions with high proportions of developmentally vulnerable children and develop more targeted services for these areas. The study can therefore help government and individuals to identify regions of high vulnerability and improve services for these population groups.
The clustering analyses reveal a strong relationship between vulnerability in the AEDC domains and socio-economic and remoteness factors. We found that SA2s with the lowest proportion of vulnerable children typically had larger proportions of children who attended pre-school and whose primary language is English. However, there was substantial spatial variation in the results. The communication skills domain (Communication) was found to have the largest cluster size for the most vulnerable SA2s (C4) compared to the other domains.
In contrast, the language and cognitive skills domain (Language) had the largest cluster size for the least vulnerable SA2s (C1) compared to other domains. SA2s in this latter group were typically characterised by higher socio-economic status, a lower proportion of Indigenous children and a higher proportion of pre-school attendance. In this case study, the data are analysed at the SA2 level of aggregation. Therefore, care must be taken in making inferences at another level of aggregation or about individuals due to biases such as Simpson's paradox [36]. The clustering of these SA2-level child developmental vulnerability data offers a comprehensive breakdown of the factors impacting child health development across Queensland. This breakdown of vulnerabilities at the statistical area level allows for improved region-based analysis and policy development. Table C1 provides a summary comparing the most vulnerable cluster (C4) with the other clusters for each AEDC domain.
For the Physical domain, the cluster sizes for the least (C1) to most (C4) proportions of vulnerable children were 126, 219, 151 and 30 SA2 areas, respectively. For the most physically vulnerable cluster, there were roughly similar percentages of children from inner cities and regional areas (23-30%), in comparison to the other clusters where most children were from inner cities (76%); there was a notably higher percentage of children from very remote areas (17%) compared to the other clusters (< 2%); 94% were in the lowest four rungs of socioeconomic disadvantage (most disadvantaged), which is substantially more than the other clusters (14%-61%); there was a lower percentage of children (15% less) with English as their first language; fewer children went to preschool (2-13% less); and there was a higher proportion of Indigenous children (19-30%). These SA2 areas in C4 were located in the north of Queensland and a small number were also identified in the central coast and south east of
Queensland.
For the Social domain, the cluster sizes for the least (C1) to most (C4) proportions of vulnerable children were 168, 205, 111 and 42 SA2 areas, respectively. For the most vulnerable cluster in this domain, there were higher percentages of children from inner cities (45%), in comparison to the other clusters; there was a higher percentage of children from very remote areas (7%) compared to the other clusters (< 3%); 71% were in the lowest four rungs of socioeconomic disadvantage (most disadvantaged), which is substantially more than the other clusters (18%-51%); there was a lower percentage of children (8% less) with English as their first language; fewer children went to preschool (3-10% less); and there was a higher proportion of Indigenous children (9-18%). These SA2 areas in C4 were located in the north west of Queensland and a small number were also identified in central Queensland.
For the Emotional domain, the cluster sizes for the least (C1) to most (C4) proportions of vulnerable children were 113, 194, 180 and 39 SA2 areas, respectively. For the most vulnerable cluster in this domain, there were high percentages of children from inner cities and regional areas (20-51%), in comparison to the other clusters where most children were from inner cities; there was a notably higher percentage of children from very remote areas (8%) compared to the other clusters (< 3%); 71% were in the bottom four rungs of socioeconomic disadvantage (most disadvantaged), which is substantially more than the other clusters (26%-38%); there was a lower percentage of children (5% less) with English as their first language; fewer children went to preschool (4-9%); and there was a higher proportion of Indigenous children (9-17%). These SA2 areas in C4 were located in the far north of Queensland.
For the Language domain, the cluster sizes for the least (C1) to most (C4) proportions of vulnerable children were 162, 194, 133,
and 37 SA2 areas, respectively. For the most vulnerable cluster in this domain, there were low percentages of children from inner cities and regional areas (0-18%), in comparison to the other clusters where most children were from regional areas; there was a notably higher percentage of children from very remote areas (36%) compared to the other clusters (< 8%); 100% were in the bottom four rungs of socioeconomic disadvantage (most disadvantaged), which is substantially more than the other clusters (10%-78%); there was a notably lower percentage of children (23-29% less) with English as their first language; fewer children went to preschool (2-4%); and there was a notably higher proportion of Indigenous children (42-62%). These SA2 areas in C4 were located in the far north of Queensland.
For the Communication domain, the cluster sizes for the least (C1) to most (C4) proportions of vulnerable children were 152, 195, 133, and 46 SA2 areas, respectively. For the most vulnerable cluster in this domain, there were higher percentages of children from inner cities and regional areas (27-40%), in comparison to the other clusters where most children were from inner cities; there was a notably higher percentage of children from very remote areas (11%) compared to the other clusters (< 7%); 86% were in the bottom four rungs of socioeconomic disadvantage (most disadvantaged), which is substantially more than the other clusters (13%-67%); there was a notably higher percentage of children (12-23%) with English as their first language; fewer children went to preschool (4-13%); and there was a notably higher proportion of Indigenous children (8-42%). These SA2 areas in C4 were located in the north west of Queensland and a small number were also identified in the south west and coastal areas of Queensland.
For the Vuln 1 indicator, the cluster sizes for the least (C1) to most (C4) proportions of vulnerable children were 101, 181, 173, and 71 SA2 areas, respectively. For the
most vulnerable cluster in this indicator, there were higher percentages of children from inner cities and regional areas (23-35%), in comparison to the other clusters where most children were from inner cities; there was a notably higher percentage of children from very remote areas (18%) compared to the other clusters (< 3%); 97% were in the bottom four rungs of socioeconomic disadvantage (most disadvantaged), which is substantially more than the other clusters (13%-53%); there was a notably higher percentage of children (11-18%) with English as their first language; fewer children went to preschool (1-12%); and there was a notably higher proportion of Indigenous children (14-26%). These SA2 areas in C4 were located in the far north of Queensland and a small number were also identified in the south east and central coast of Queensland.
For the Vuln 2 indicator, the cluster sizes for the least (C1) to most (C4) proportions of vulnerable children were 162, 207, 117, and 40 SA2 areas, respectively. For the most vulnerable cluster in this indicator, there were higher percentages of children from inner cities and regional areas (25-35%), in comparison to the other clusters where most children were from inner cities; there was a notably higher percentage of children from very remote areas (13%) compared to the other clusters (< 4%); 94% were in the bottom four rungs of socioeconomic disadvantage (most disadvantaged), which is substantially more than the other clusters (16%-46%); there was a notably higher percentage of children (15-18%) with English as their first language; fewer children went to preschool (1-10%); and there was a higher proportion of Indigenous children (6-12%). The SA2 areas in the most vulnerable Vuln 2 cluster can be found in the far north of Queensland and a small number were also identified in the south east. Compared to Vuln 1, fewer of these SA2s are in the central coast and south east. There was also a slightly higher percentage of children (3%) with English as their first
language; fewer children went to preschool (2%); and there was a higher proportion of Indigenous children (8-14%).

Figure 1: Silhouette plots for clusters across the five AEDC domains and two indicators. The x-axis shows the clusters, and the height of each cluster is the silhouette width score for the cluster. The dotted line is the average silhouette width score across the four clusters.

Figure 2: Example of K-means clustering results displayed in the web interface for C1 of the physical health development domain. The dashboard shows box plots for the proportions of the Australia, English, Indigenous and Preschool variables, pie charts for the percentages of remoteness and IRSD, and the location of C1 on the Queensland map.

Figure 3: Map of the four clusters obtained for the physical health AEDC domain. The clusters are ordered from C1 (green, least vulnerable) to C4 (red, most vulnerable).

Table A8: Comparison of C4 (most vulnerable) to C1 (least vulnerable) for the five domains of development and two indicators.

Table A9: Geographic locations of the most vulnerable cluster (C4) and the least vulnerable cluster (C1).

Table C1: Summary comparing the most vulnerable cluster (C4) with the other clusters in each AEDC domain.