A machine learning approach to predict ethnicity using personal name and census location in Canada

Background Canada is an ethnically diverse country, yet the lack of ethnicity information in many large databases impedes effective population research and interventions. Automated ethnicity classification using machine learning has shown potential to address this data gap, but its performance in Canada is largely unknown. This study developed and evaluated a large-scale machine learning framework to predict ethnicity using a novel set of name and census location features.
Methods Using census 1901, multiclass and binary classification machine learning pipelines were developed. The 13 ethnic categories examined were Aboriginal (First Nations, Métis, Inuit, and all-combined), Chinese, English, French, Irish, Italian, Japanese, Russian, Scottish, and others. Machine learning algorithms included regularized logistic regression, C-support vector, and naïve Bayes classifiers. Name features consisted of the entire name string, substrings, double-metaphones, and various name-entity patterns, while location features consisted of the entire location string and substrings of province, district, and subdistrict. Predictive performance metrics included sensitivity, specificity, positive predictive value, negative predictive value, F1, Area Under the Curve for Receiver Operating Characteristic curve, and accuracy.
Results The census had 4,812,958 unique individuals. For multiclass classification, the highest performance achieved was 76% F1 and 91% accuracy. For binary classifications for Chinese, French, Italian, Japanese, Russian, and others, the F1 ranged 68–95% (median 87%). The lower performance for English, Irish, and Scottish (F1 ranged 63–67%) was likely due to their shared cultural and linguistic heritage. Adding census location features to the name-based models strongly improved the prediction in Aboriginal classification (F1 increased from 50% to 84%).
Conclusions The automated machine learning approach using only name and census location features can predict the ethnicity of Canadians with varying performance by specific ethnic categories.


Introduction
Ethnicity and race are cornerstones of individuals' sense of self-identity, social belonging, and shared experiences that influence one's health beliefs, behaviours, and outcomes [1]. Ethnicity and race are socially defined constructs that are complex and multilayered. While they are sometimes used interchangeably, they are two different but related concepts. The term "race" suggests a biological basis for socially constructed categories, implying that in-group members share greater genetic homogeneity than out-group members [2]. However, in reality, the degree of additional genetic similarity shared among members of the same race is largely negligible and biologically inconsequential compared to the total genetic makeup shared between individuals from different races [3]. The term "ethnicity" generally refers to a wide range of socially constructed categories whose in-group members tend to share a common culture, language, heritage, or national origin. While race is often characterized by a person's physical attributes such as body height, hair texture, facial features, and skin color, ethnicity is a person's subjective affinity towards the ethnic group that he or she most identifies with [4]. Since ethnicity is more widely used than race in Canada and it is conceptualized more narrowly for research and surveillance purposes [2], it is the focus of this research.
With great ethnic diversity, Canada is challenged by pervasive and persistent public health and social issues that disproportionately affect specific ethnic groups, including health inequality and racial discrimination towards Aboriginals and visible minorities [5,6]. Specifically, Aboriginals experience significantly higher disease prevalence and worse clinical outcomes for a large number of acute and chronic conditions [7][8][9], yet they are also burdened by lower access to and awareness of available health resources compared to non-Aboriginals in Canada [10][11][12]. As a result, ethnicity (and race) is considered a key social determinant of health in Canada [13]. Despite its public health significance and recognition, ethnicity data is absent in the vast majority of disease registries, patient-care administrative data, vital statistics data, and major health surveys across Canada [5,6,14]. This ethnicity data gap impedes the pursuit and attainment of ethnically specific health evidence that is essential for the understanding, development, and monitoring of effective health policies and programs [5,6].
Implementing large-scale, systematic data collection on ethnicity in health data across Canada may encounter many practical, political, and legal challenges [6,15]. While the general public and health researchers and practitioners largely agree that collecting ethnicity for research is ethical and desirable, the net benefit may not be realized by all individuals [6]. For example, directly asking questions about ethnicity and race has evoked anxieties about racism and racist classification in some patients, especially those who have personally experienced racism in the healthcare system. Some interviewers have raised concerns about whether the process of asking about ethnic and racial identities may steer the respondents to believe that inequality is endemic in the healthcare system [6]. An alternative approach that is timelier, less costly, and indirect in acquiring ethnicity information needs to be explored in Canada. One such alternative is to automate the prediction of ethnicity using commonly collected variables such as personal names, which, unlike ethnicity, are recorded in most databases by default. Personal names are typically recorded either as an unstructured string containing the entire name or in a more structured format with specific name entities (such as first, middle, and last name). Since many naming practices are influenced by cultural, religious, and familial traditions that intersect with ethnicity, individuals' first, middle, last, and full names carry a degree of predictive quality for the associated ethnicity. Mateos [16] conducted a systematic review which identified 13 representative studies that developed name-based ethnicity classification methodologies. The sensitivity and positive predictive value (PPV) of these studies ranged between 67–95% and 70–96%, respectively.
These studies included a number of common methodological processes and research components: 1) a name reference list is independently developed or sourced from another study or domain experts; 2) a separate target population (or testing data) is manually or automatically classified into different ethnic categories; and 3) the performance metrics of the method are evaluated against the previously known ethnicity ("gold standards") in the target population [16]. The process of predicting or classifying ethnicity for the target population was done either manually or automatically. The manual approach typically involved domain experts examining a particular name and judging its most likely ethnicity, based on their expertise in linguistic and ethnocultural history. In contrast, the automated approach did not rely on human judgment. Instead, it was based on programming the computer to automatically detect and utilize signals from names in the form of rule-based patterns (such as regular expressions), statistical patterns (such as machine learning, ML), or a combination of both. Mateos [16] found nine studies that employed automated, as opposed to manual, methods to predict the ethnicity of the target population. ML frameworks have continuously been explored in this area [17][18][19][20]. Ambekar et al. [17] combined decision tree and Hidden Markov Model (HMM) to conduct classification on a taxonomy with 13 ethnic categories. Treeratpituk and Giles [15] examined both alphabetical and phonetic sequences in names to improve predictions. Fiscella and Fremont [21] and Imai and Khanna [22] have found that combining name and residence location further improved the performance.

PLOS ONE
Most published studies to date are non-Canadian, with ML models trained on databases from outside Canada that were not optimized for the Canadian population. Aboriginal populations must be included in this line of work as they are vastly marginalized and disadvantaged. In many ML publications, a formal procedure of feature selection and hyperparameter optimization may not be conducted or reported, thus reducing research reproducibility. Past Canadian studies using automated ML approaches to predict ethnicity are scarce and outdated, examine only a few ethnic groups, and do not include Aboriginals and their subgroups as distinct categories [16]. To our knowledge, no studies have combined textual and phonetic name features with location features to predict ethnicity. The primary objective of this study is to determine whether the ethnicity data gap in Canada may be addressed using a novel automated ML approach. This is achieved by conducting and formally evaluating the predictive performance of a large-scale ML framework utilizing a novel name and location feature set for ethnicity prediction. The secondary objectives of this study are to provide a detailed description of this ML framework and to share its codebase to support future research.

Data source and analytic framework
The Canadian census contains all the required person-level variable fields, including self-reported full name, census location, sex, and ethnicity. Under the Statistics Act of 2005, the release of personal records in past censuses is restricted for 92 years after their respective years of collection [23]. However, individual records from censuses prior to 1906 were publicly accessible at the National Archives of Canada. Under these constraints, we chose census 1901 as the data source to build our ML framework (Fig 1). Features were extracted from the text recorded in the name and census location variable fields. The census was split randomly into an 80% training set and a 20% test set. A development (dev) set was created by randomly sampling 12.5% of the training set (Fig 1). Using only the dev set, the feature selection and hyperparameter optimization steps were conducted to finalize the name and census location feature sets and to automatically optimize the ML algorithms, respectively. Finally, the final feature sets and optimized hyperparameter values were used to train the ML algorithms on the training set. The trained ML algorithms were then applied to predict the ethnicity of individuals in the test set. The agreement and discrepancy between the ethnicity recorded in the census (as "gold standards") and our predicted ethnicity were quantified based on a predefined set of evaluation metrics.
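The nested split described above can be sketched in a few lines of Python. This is a minimal illustration, not the published pipeline: the function names, the fixed seed, and the choice to round partial records up are assumptions made here, and the dev set is treated as a random subset of the training set, as the text suggests.

```python
import math
import random


def split_sizes(n_records, test_fraction=0.20, dev_fraction=0.125):
    """Return (n_train, n_test, n_dev) for an 80/20 split with a nested dev set."""
    n_test = math.ceil(n_records * test_fraction)  # assumption: round up
    n_train = n_records - n_test
    n_dev = math.ceil(n_train * dev_fraction)      # dev is sampled from training
    return n_train, n_test, n_dev


def split_indices(n_records, seed=0):
    """Shuffle record indices and cut them into train, test, and dev lists."""
    n_train, n_test, n_dev = split_sizes(n_records)
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    test = indices[:n_test]
    train = indices[n_test:]
    dev = train[:n_dev]  # a random subset of the (already shuffled) training set
    return train, test, dev


if __name__ == "__main__":
    # Sizes implied by the number of unique individuals reported in the Results.
    print(split_sizes(4_812_958))
```

Under these assumptions, the sizes computed for 4,812,958 records agree with the set sizes reported in the Results section.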

Variables and initial data processing
ML models containing only sex or only dummy variables (with randomly generated numeric or text strings) served as benchmarks, as they possessed no predictive quality for ethnicity. Incomplete (on name, census location, or ethnicity information) or duplicate records were removed. Name and ethnicity variables were in unstructured text format and thus required further data cleaning and standardization. Personal titles, numbers, single letters, and punctuation marks were stripped from names. Alternative spellings and misspellings of ethnicity labels were standardized and recategorized into one of the following: Aboriginal (Ab), English (En), Chinese (Ch), French (Fr), Irish (Ir), Italian (It), Japanese (Jp), Russian (Ru), Scottish (Sc), and others. These ethnic categories were selected either due to their large representation in the Canadian population (such as En, Sc, Fr, Ir, and It) or their public health and socioeconomic significance as ethnic minorities (such as Ab, Ch, and Jp), based on literature search and research team discussions. Some Aboriginals received a secondary label if they could be further categorized as First Nations (Ab-Fn), Métis (Ab-Mé), or Inuit (Ab-In). The (census) location variable contained structured text of predefined census location information, including province/territory, district, and subdistrict, to which the respondents were assigned by Statistics Canada based on their residential addresses [24]. At the time, the major province/territory-level geographic boundaries included British Columbia, Manitoba, New Brunswick, Nova Scotia, Ontario, Prince Edward Island, Quebec, as well as Yukon Territory and Northwest Territories, and the District of Keewatin.
One notable difference between this geographic categorization and the present day is that Alberta and Saskatchewan were not considered separate provinces but individual districts within the "Yukon Territory and Northwest Territories, and the District of Keewatin" region in census 1901 [24]. Nunavut was still part of the Northwest Territories until April 1, 1999 [25], and Newfoundland and Labrador was not a Canadian province until its confederation on March 31, 1949 [26]. Census geographic boundaries did change over time due to major political events and population growth. However, boundary revisions rarely occurred and were only done when necessary, in order to maintain high comparability between censuses [27]. Thus, we believe the census location from census 1901 has retained useful and representative information, and the predictive quality of the underlying ML methodology remains highly relevant to the present day.

Feature engineering
The feature extraction process used to create individual and grouped name features is described in Table 1, using "Wing Sun Lee" as an example. A technical challenge in using the name variable in census 1901 was that it contained unstructured text, as opposed to the more recent censuses that divided the name into "given name" and "family name" at data collection. We labeled the first name entity ("Wing") as "First name", the last name entity ("Lee") as "Last name", and any remaining text string as "Middle name". While we recognize that the most accurate split of the Chinese name "Wing Sun Lee" would assign "Wing Sun" as first name and "Lee" as last name, this name splitting would require the foreknowledge that the name belongs to a Chinese person.

6-letter substrings | n/a | None | None | None
Phonetic name features
Phonemes using double-metaphone algorithms [28] | "ANK", "FNK", "SN", "L" | n/a | n/a | n/a
Numeric name features
Number of name entity | 3 ("wing", "sun", "lee") | n/a | n/a | n/a
Total character length | 4 ("wing") + 3 ("sun") + 3 ("lee") = 10 | n/a | n/a | n/a
Average character length by name entity | (4 + 3 + 3)/3 = 3.3 | n/a | n/a | n/a
Number of vowels | 4 ("i", "u", "e", "e") | n/a | n/a | n/a
Vowel-to-length ratio | 4/10 = 0.4 | n/a | n/a | n/a
n/a, not applicable.
https://doi.org/10.1371/journal.pone.0241239.t001

Despite these exceptions, our selected name splitting method should be generalizable to most individuals in most databases where ethnicity information is unavailable. The key aspect was that all the textual information of the name was captured across the span of the extracted name features as a whole, even though a small amount of information would be lost due to incorrect placement into the "First name", "Middle name", and "Last name" predictor slots for a small number of individuals.
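To make the positional name split and the numeric name features concrete, the following is a minimal Python sketch using the "Wing Sun Lee" example. The function names are hypothetical, single-entity names are handled by an assumed convention (empty middle and last name), and the substring feature is assumed here to mean contiguous character n-grams within each name entity; the phonetic (double-metaphone) features are omitted because they require a third-party library.

```python
VOWELS = set("aeiou")


def split_name(full_name):
    """Split a name by position into (first, middle, last), lowercased."""
    entities = full_name.lower().split()
    if not entities:
        raise ValueError("empty name")
    first = entities[0]
    last = entities[-1] if len(entities) > 1 else ""  # assumed convention
    middle = " ".join(entities[1:-1])
    return first, middle, last


def char_ngrams(entity, n):
    """Contiguous n-letter substrings of one name entity (assumed definition)."""
    return [entity[i:i + n] for i in range(len(entity) - n + 1)]


def numeric_name_features(full_name):
    """Numeric name features as listed in Table 1."""
    entities = full_name.lower().split()
    if not entities:
        raise ValueError("empty name")
    total_length = sum(len(e) for e in entities)
    n_vowels = sum(ch in VOWELS for e in entities for ch in e)
    return {
        "n_entities": len(entities),
        "total_length": total_length,
        "avg_length": total_length / len(entities),
        "n_vowels": n_vowels,
        "vowel_ratio": n_vowels / total_length,
    }


if __name__ == "__main__":
    print(split_name("Wing Sun Lee"))
    print(numeric_name_features("Wing Sun Lee"))
```

Because "wing", "sun", and "lee" are all shorter than six letters, `char_ngrams(entity, 6)` returns an empty list for each, which is consistent with the "None" entries for 6-letter substrings in Table 1.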
For the location features, in addition to the existing full location string values, the corresponding province/territory, district, and subdistrict were extracted into three separate features. For example, the original location string of "H, L'Assomption, Quebec, Canada" corresponds to "subdistrict number, district, province/territory, country". The text was processed by removing the non-informative "Canada" and lowercasing into "h, l'assomption, quebec" (as "Full location string"). It was then further broken into individual features as "quebec" (as "Province/Territory"), "l'assomption" (as "District"), and "h" (as "Subdistrict"). "All location features" included all four location features "Full location string", "Province/Territory", "District", and "Subdistrict".
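The location processing above can be sketched as follows. This is an illustrative sketch only: the function name and dictionary keys are hypothetical, and it assumes every location string has exactly the "subdistrict, district, province/territory, country" layout of the example.

```python
def parse_location(raw):
    """Turn a raw census location string into the four location features."""
    parts = [part.strip().lower() for part in raw.split(",")]
    if parts and parts[-1] == "canada":
        parts = parts[:-1]  # drop the non-informative country field
    if len(parts) != 3:
        raise ValueError(f"unexpected location layout: {raw!r}")
    subdistrict, district, province = parts
    return {
        "full_location": ", ".join(parts),
        "province": province,
        "district": district,
        "subdistrict": subdistrict,
    }


if __name__ == "__main__":
    print(parse_location("H, L'Assomption, Quebec, Canada"))
```

Records that do not follow the assumed four-field layout raise an error in this sketch rather than being silently misparsed; how such records were handled in the original pipeline is not stated in the text.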

Machine learning pipelines
Our ML framework consisted of two pipelines (Fig 1). The fundamental difference between them was that in the multiclass classification pipeline, the predicted ethnicity was one of the 10 ethnicity labels, whereas in the binary classification pipeline, in each iteration, the original ethnicities were recategorized into a binary label (e.g., Ab or non-Ab). The multiclass classification pipeline involved the following steps: feature transformation, feature set selection, hyperparameter optimization, and final training and testing (Fig 1). Both feature set selection and hyperparameter optimization were based on k-fold cross validation (CV) conducted within the dev set. The k-fold CV proceeded sequentially as follows: shuffling the data randomly, splitting the data into k groups, and iterating the training/testing k times. In each iteration, one of the k groups was taken as the test set and the remaining groups as the training set; the ML model was fitted on the training set and evaluated on the test set. Feature set selection followed the a-priori criteria set by our research team. In order to tune the ML algorithms prior to the final training and testing, the hyperparameter optimization step automatically selected the hyperparameter values corresponding to the highest F1-score (F1) for each ML algorithm. This approach was chosen since 1) the ML algorithms would be expected to learn from our data more effectively, and 2) the lack of previously published Canadian studies precluded us from choosing the final feature sets and hyperparameter values in a more manual fashion.
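The k-fold CV procedure and the binary relabeling step can be illustrated with a short, library-free Python sketch. The function names and the fixed seed are assumptions for illustration; in practice an established implementation (for example, the one in scikit-learn) would normally be used instead.

```python
import random


def k_fold_indices(n_records, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)        # step 1: shuffle the data
    folds = [indices[i::k] for i in range(k)]   # step 2: split into k groups
    for held_out in range(k):                   # step 3: iterate k times
        test_idx = folds[held_out]
        train_idx = [i for j, fold in enumerate(folds)
                     if j != held_out for i in fold]
        yield train_idx, test_idx


def to_binary_labels(labels, positive):
    """Recode multiclass labels as 1 (positive class) or 0 (all others)."""
    return [1 if label == positive else 0 for label in labels]


if __name__ == "__main__":
    for train_idx, test_idx in k_fold_indices(10, k=5):
        print(sorted(test_idx), len(train_idx))
    print(to_binary_labels(["Ab", "En", "Ab", "Fr"], positive="Ab"))
```

Each record appears in exactly one held-out group, so every record is used for evaluation once and for fitting k-1 times.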
For feature transformation, all numeric features were scaled to zero mean and unit variance. Categorical features containing a single string value per individual were encoded as one-hot numeric arrays. Categorical features containing multiple string values (such as 1- to 6-letter substrings and double-metaphones) per individual were converted to matrices of token counts, known as count vectorization. For feature set selection, five-fold CV within the dev set with regularized logistic regression (LR) classifiers was done. An a-priori decision was made to derive two final feature sets: 1) "All name features" and 2) "All name and location features". A-priori feature inclusion criteria and steps were used to confirm the inclusion of individual name features. "Basic name features", "Name substring features", "Numeric name features", and "Phonetic name features" would each be included in the final "All name features" only if its F1 was greater than the F1 of the benchmarks by at least 10% (dummy feature as denominator). "All location features" were added to the "All name features" to create the "All name and location features". As a confirmation, it was expected that the "All name and location features" would outperform the "All name features" by at least 10% in F1 ("All name features" as denominator). Regardless of whether this confirmation was actually observed in the data, the "All name and location features" would proceed in both ML pipelines, since it was expected that the addition of location features would at least boost the predictive performance for some of the individual ethnic categories in binary classifications. We assumed that creating an individualized feature set for each binary ethnic category would not significantly improve the predictive performance. Thus, the final feature sets obtained from the multiclass classification pipeline would also be applied to the binary classification pipeline (Fig 1).
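The three feature transformations named at the start of this paragraph can be illustrated with a small, library-free Python sketch. The function names are hypothetical and the implementations are simplified stand-ins for the equivalent scikit-learn transformers; scaling uses the population standard deviation, which is an assumption about the exact variant used.

```python
from collections import Counter
from statistics import mean, pstdev


def standardize(values):
    """Scale numeric values to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Encode one string per individual as a 0/1 indicator row."""
    vocabulary = sorted(set(values))
    rows = [[1 if v == term else 0 for term in vocabulary] for v in values]
    return vocabulary, rows


def count_vectorize(token_lists):
    """Encode several tokens per individual as a row of token counts."""
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        rows.append([counts[t] for t in vocabulary])
    return vocabulary, rows


if __name__ == "__main__":
    print(standardize([3, 4, 5]))
    print(one_hot(["quebec", "ontario", "quebec"]))
    print(count_vectorize([["le", "ee"], ["le", "le"]]))
```

One-hot encoding suits features such as province, where each individual has exactly one value, while count vectorization suits substrings and phonemes, where each individual contributes a variable number of tokens.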
The ML hyperparameter optimization was done only in the dev set to determine hyperparameter values that maximized F1. ML classifiers including the regularized LR, C-support vector (SVC), naïve Bayes (NB), decision trees (DT), and random forest (RF) were implemented. For each ML algorithm, a randomized search using 5-fold CV repeated three times was run to find the hyperparameter values with the highest F1 (Fig 1). The final feature sets and optimized hyperparameters were then used in the final training and testing step. Training was done in the training set, and the trained models were applied to predict individuals' ethnicity in the unseen (or hold-out) test dataset. The evaluation metrics on predictive performance consisted of accuracy, sensitivity, specificity, PPV, negative predictive value (NPV), F1, Area Under the Curve for Receiver Operating Characteristic curve (AUC-ROC), and average PPV [29]. AUC-ROC and average PPV are indicators that summarize how well the ML classifiers perform over a range of thresholds for decision boundary [30].
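For reference, the threshold-dependent metrics listed above follow directly from the four cells of a binary confusion matrix. The sketch below uses their standard definitions; the function names and the example counts are hypothetical, and it assumes every denominator is non-zero.

```python
def f1_from(ppv, sensitivity):
    """F1 is the harmonic mean of PPV (precision) and sensitivity (recall)."""
    return 2 * ppv * sensitivity / (ppv + sensitivity)


def binary_metrics(tp, fp, tn, fn):
    """Compute the reported metrics from true/false positive/negative counts."""
    sensitivity = tp / (tp + fn)   # share of true members correctly identified
    specificity = tn / (tn + fp)   # share of non-members correctly identified
    ppv = tp / (tp + fp)           # share of positive predictions that are right
    npv = tn / (tn + fn)           # share of negative predictions that are right
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1_from(ppv, sensitivity),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(binary_metrics(tp=80, fp=20, tn=880, fn=20))
    # Sensitivity and PPV reported for Chinese with "All name features".
    print(round(f1_from(ppv=0.96, sensitivity=0.88), 2))
```

As a consistency check, the sensitivity (0.88) and PPV (0.96) reported in the Discussion for the Chinese binary classifier give an F1 of about 0.92 under this definition, matching the reported value.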
All data cleaning and preprocessing, descriptive analysis, ML pipelines, and visualization were carried out in Python 3.6.5 and related scientific libraries (such as pandas, NumPy, scikit-learn, statsmodels, and Metaphone). The codebase was made publicly available as a GitHub repository at https://github.com/kaionwong/ethnicity-ml-prediction.

Results
Census 1901 initially contained 5,079,210 records. After removing missing and duplicated records, the total number of unique individuals was 4,812,958 (94.8%). Of these, 20% were randomly extracted as the test set (N = 962,592). The remaining 80% became the training set (N = 3,850,366). A random 12.5% of the training set was selected to form the dev set (N = 481,296). The breakdown by ethnicity is illustrated in Fig 2. Considerable class imbalance was observed in some of the minority ethnic categories (e.g., Inuit) with low occurrence frequencies. Table 2 shows that all the name and location features outperformed the benchmarks by at least 10% in F1, fulfilling the a-priori criteria for inclusion. The "All name features" outperformed all the individual name feature subsets by 7% to 275% in F1. The "All name and location features" outperformed the "All name features" and "All location features" by 20% and 76%, respectively. The content of the "All name features" and "All name and location features" was confirmed, as previously described, to be the final feature sets for both ML pipelines. Table 3 describes the predictive performance for multiclass classification. The DT and RF classifiers consistently and vastly underperformed compared to LR, SVC, and NB classifiers, thus their results were omitted. Overall, LR classifiers tended to perform marginally better than SVC, while both were superior to NB. As expected, the "All name and location features" achieved the best overall performance. Improvement by adding the location features for specific ethnicity labels varied widely between 2% (Ch and Fr) and 34% (Ab) in F1 for the LR classifiers.
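The a-priori inclusion criterion is a relative comparison against a named denominator, which differs from a difference between two percentages. The sketch below shows the relative form; the function names are hypothetical, the F1 values in the example are illustrative rather than taken from Table 2, and whether each percentage in this section was computed exactly this way is an assumption based on the denominators named in the Methods.

```python
def relative_gain(f1_new, f1_reference):
    """Relative improvement in F1, with the reference F1 as denominator."""
    return (f1_new - f1_reference) / f1_reference


def meets_inclusion_criterion(f1_feature, f1_benchmark, threshold=0.10):
    """True if a feature set beats the benchmark F1 by at least the threshold."""
    return relative_gain(f1_feature, f1_benchmark) >= threshold


if __name__ == "__main__":
    # Illustrative values only: a feature set at F1 = 0.60 against a
    # benchmark at F1 = 0.50 is a 20% relative gain (10 points absolute).
    print(round(relative_gain(0.60, 0.50), 2))
    print(meets_inclusion_criterion(0.60, 0.50))
```

For comparison, an F1 increase from 50% to 84% corresponds to 34 percentage points in absolute terms but a 68% gain in relative terms, so it matters which form is being quoted.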

Multiclass classification
The multiclass confusion matrix for "All name and location features" with LR is shown in Table 4. The multiclass confusion matrix for "All name features" with LR is shown in Table 5.

Binary class classification
The binary classification predictive performance is described in Table 6 for LR. Similar tables for SVC and NB are presented in Tables 7 and 8, respectively. The addition of location features [...]

Numeric name features only = number of name entities, total character length, total character length by name entity, number of vowels, and vowel-to-length ratio. All name features = all name-derived features. All location features = all location-derived features including processed location text string, province/territory, district, and sub-district features. All name and location features = "All name features" and "All location features".
(a) The two chosen final feature sets "All name features" and "All name and location features" were then passed down to the subsequent steps for both multiclass and binary classification pipelines.

Discussion
To the best of our knowledge, we conducted one of the most extensive ML studies in Canada within the domain of ethnicity prediction using name and location information. Overall, we employed a two-pipeline approach to demonstrate our classifiers' performance for a wider range of potential applications. The multiclass classifiers achieved 76% F1 and 91% accuracy.
The confusion matrices showed that most frequently-misclassified labels for English, Irish, and Scottish individuals occurred among themselves. This is to be expected as these groups historically shared a large degree of cultural and linguistic heritage. To improve the performance, regrouping them all under one generic "British" category may be considered, as shown in Ambekar et al. [17].
Aboriginals were most frequently misclassified as En, Ir, Sc, and Fr. The Indian Act was first introduced in 1876 (prior to the 1901 census) as a consolidation of previous colonial mandates and regulations that aimed to eradicate First Nations culture in favour of assimilation into Euro-Canadian societies [31]. Thus, the misclassifications among Aboriginals are likely a result of the Indian Act's naming policies, which unjustly forced Aboriginals to adopt new European names. Despite this inherent challenge, there was a large improvement in performance from adding location features onto the name feature sets. This aligned with the phenomenon that certain geographic regions in Canada were highly populated by Aboriginal populations [32]. For binary classifications, the sensitivity and F1 increased by 44 percentage points (36% to 80%) and 34 percentage points (50% to 84%) for "Aboriginal" predictions, and by 43 percentage points (35% to 78%) and 32 percentage points (49% to 81%) for "First Nations" predictions, respectively. Métis and Inuit binary classifications resulted in mediocre and poor predictions (65% and 31% F1, respectively), which is not unexpected given the interracial marriage with Europeans among Métis [33] and the small amount of training data available for Inuit in the census. Excluding the Ab, Ab-Fn, Ab-Mé, Ab-In, En, Ir, and Sc, the F1 ranged from 70% to 95% (median: 87%) in binary classification for the remaining six ethnic categories (Ch, Fr, It, Jp, Ru, Others). As mentioned in the Materials and Methods section, our name splitting method likely misplaced the first and last names from the correct name feature labels for a portion of Chinese individuals if their last names were recorded as the first (position) name entity.
However, the consistently high predictive performance for Chinese (i.e., 0.88 sensitivity, 0.96 PPV, and 0.92 F1 in binary classification with the "All name features" set) supports our assumption that the information lost is negligible as long as all the name entities are considered and further extracted somewhere within the span of all the name features. In Canada, the trends and degrees of urbanisation and migration differ by ethnic groups. From 1961 to 2006, the percent of Aboriginals living in urban areas (as opposed to Indian reserves and rural areas) increased from 13% to 53% [34]. More specifically from 1981 to 2006, it increased from 40% to 53% for Aboriginals and from 75% to 81% for non-Aboriginals. However, residential segregation by ethnicity or ethnicity-associated factors remains. For example, residential aggregation of visible minorities, recent immigrants, and Aboriginals exists and is found to be a key associative factor underpinning the increase of concentrated urban poverty in various regions [35]. In addition to being highly populated in Canadian territories (Northwest Territories, Yukon, and Nunavut), Aboriginals have become more concentrated in a number of Prairie cities in recent times, particularly Saskatoon, Regina, and Winnipeg. Families of visible minority accounted for up to 78% of the low-income families residing in high poverty neighborhoods in 2001, double the level in 1981 [35]. Despite the changes in geographic boundaries between censuses and the dynamic nature of residential mobility and migration by Canadians, these findings suggest that discernible patterns of geographic segregation by specific ethnic groups are likely to persist. This serves as indirect evidence that our demonstrated ML method utilizing name and location features should remain relevant and applicable if it is applied to modern datasets.
The feature selection step has shown that alphabetic name features carry more predictive quality than phonetic name features, which aligns with the findings of Treeratpituk and Giles [15]. Similar to Fiscella and Fremont [21] and Imai and Khanna [22], we found that combining name and location features improved the predictive performance.
In terms of potential real-life applications, the trained ML classifiers using "All name features" can be applied to the personal name field of other databases to generate predicted ethnicity at the individual level. However, databases that also contain respondents' residential addresses will not be able to use our trained ML classifiers with "All name and location features" directly, since they are trained with the older (1901) census boundaries. An additional step is needed to standardize and remap the location information before using both name and location information from applied databases. For example, one option is to convert the location information in both census 1901 and the applied database to Global Positioning System (GPS) coordinates. The individuals in the applied database will receive an approximated census location, via mapping the GPS coordinates of their addresses to the nearest corresponding census location, which can then be used directly by our trained ML classifiers. Available remapping tools exist that enable the geographic conversion of residential addresses to corresponding census location information [38].
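The nearest-location remapping suggested above could look like the following Python sketch. It is a rough illustration under strong assumptions: each 1901 census location is represented by a single reference coordinate, the location names and coordinates in the example are placeholders rather than historical data, and great-circle (haversine) distance is used as the notion of "nearest". None of this is part of the published codebase.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_census_location(lat, lon, reference_points):
    """Return the census location whose reference point is closest to (lat, lon)."""
    return min(
        reference_points,
        key=lambda name: haversine_km(lat, lon, *reference_points[name]),
    )


if __name__ == "__main__":
    # Placeholder names and coordinates, for illustration only.
    reference_points = {
        "district_a": (45.8, -73.4),
        "district_b": (49.9, -97.1),
    }
    print(nearest_census_location(45.5, -73.6, reference_points))
```

A single reference point per location is a coarse approximation; a production remapping would more plausibly use boundary polygons or one of the existing remapping tools cited above.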

Limitations
This study used the 1901 Canadian census, which dates back about four generations (assuming 30 years per generation). Studies have shown that distinctive naming practices in different ethnic groups often persist over a long period of time, even after immigration to another geographic location with a different cultural and social environment [39][40][41]. As mentioned earlier, despite continuous urbanisation and migration, distinctive residential segregation patterns exist among different ethnic groups. As a result, we believe the underlying ML methodology is applicable and generalizable to more recent times. Nonetheless, we encourage future studies with similar research interests to access a more recent data source and evaluate the approach on it, to strengthen the temporal representativeness of the ML models. We also recommend that future studies expand on the current ethnic categories, as well as examine regrouping ethnicity labels into hierarchical structures that are relevant in the Canadian context.

Conclusions
This is the first comprehensive Canadian study to show that a wide range of ethnic categories can be accurately predicted using an ML framework that learns from relatively simple and widely collected personal name and location information. There are many potential public health applications (e.g., disease and risk factor surveillance, effectiveness of intervention, and patterns in health service utilization and related costs) in which adding the ethnicity dimension will greatly multiply the value, utility, and relevance of the existing information. Widespread implementation of ethnicity classifiers will help generate ethnically specific health evidence that together will fill many critical knowledge gaps that currently impede effective health program and policy development in Canada.