Abstract
Background
We aimed to evaluate NamSor’s performance in predicting the country of origin and ethnicity of individuals based on their first/last names.
Methods
We retrieved the name and country of affiliation of all authors of PubMed publications in 2021, affiliated with universities in the twenty-two countries whose researchers authored ≥1,000 medical publications and whose percentage of migrants was <2.5% (N = 88,699). We estimated with NamSor their most likely "continent of origin" (Asia/Africa/Europe), "country of origin" and "ethnicity". We also examined two other variables that we created: “continent#2” ("Europe" replaced by "Europe/America/Oceania") and “country#2” ("Spain" replaced by “Spain/Hispanic American country” and "Portugal" replaced by "Portugal/Brazil"). Using "country of affiliation" as a proxy for "country of origin", we calculated for these five variables the proportion of misclassifications (= errorCodedWithoutNA) and the proportion of non-classifications (= naCoded). We repeated the analyses with a subsample consisting of all results with inference accuracy ≥50%.
Results
For the full sample and the subsample, errorCodedWithoutNA was 16.0% and 12.6% for “continent”, 6.3% and 3.3% for “continent#2”, 27.3% and 19.5% for “country”, 19.7% and 11.4% for “country#2”, and 20.2% and 14.8% for “ethnicity”; naCoded was zero and 18.0% for all variables, except for “ethnicity” (zero and 10.7%).
Conclusion
NamSor is accurate in determining the continent of origin, especially when using the modified variable (continent#2) and/or restricting the analysis to names with accuracy ≥50%. The risk of misclassification is higher with country of origin or ethnicity, but decreases, as with continent of origin, when using the modified variable (country#2) and/or the subsample.
Citation: Sebo P (2023) How well does NamSor perform in predicting the country of origin and ethnicity of individuals based on their first and last names? PLoS ONE 18(11): e0294562. https://doi.org/10.1371/journal.pone.0294562
Editor: Difang Huang, The University of Hong Kong, HONG KONG
Received: August 3, 2023; Accepted: November 2, 2023; Published: November 16, 2023
Copyright: © 2023 Paul Sebo. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All data are fully available without restriction. S3 Table shows the study data (first/last name, country of affiliation, country of origin as determined by NamSor, and accuracy of NamSor's determination of country of origin) for a random selection of 1,077 researchers. S4 and S5 Tables show the study data of all researchers with inference accuracy ≥70%, for Japan, a country whose names were well recognized by NamSor, and Kenya, a country whose names were less well recognized, respectively. The data associated with this article are available in the Open Science Framework (DOI 10.17605/OSF.IO/7KEWG) https://doi.org/10.17605/OSF.IO/7KEWG.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Individuals are regularly discriminated against, for example because of their gender, sexual orientation, religion, or social or ethnic origin. The research world mirrors society and is not immune to such discriminatory behavior. Studies of discrimination in research have mainly focused on gender inequalities, and numerous publications have highlighted the major obstacles faced by women throughout their careers [1–6]. As a result, programs have been launched in many countries to increase the representation of women in key academic positions and improve their career prospects [7].
However, discrimination can relate to social categories other than gender. According to several recent publications, researchers' origin also appears to be a ground for discrimination. Researchers from low- and middle-income countries (LMICs), for example, were found to be underrepresented as authors of articles [8–10] or as members of editorial boards [11].
To save time and resources, researchers can rely on NamSor, an online onomastic tool that infers origin from first and last names. NamSor combines three main advantages that are valuable to researchers: it is fast, cost-effective, and can be applied retroactively to large datasets. The methodology used by the algorithm to determine the most likely origin of individuals is relatively opaque to non-specialists, but likely relies on large databases combining names with cultural, ethnic and linguistic backgrounds.
NamSor can be particularly useful in bibliometric studies to determine country of origin or ethnicity when this variable is not available, whether to explore cross-cultural differences in research, inequalities in publications or citations of scientific articles related to the origin of researchers, reviewers or editors, or more broadly for any study including origin as a variable of interest. Indeed, such studies usually require large datasets and self-determination of country of origin or ethnicity is often not possible.
The tool was already used in several studies to estimate the origin of individuals. In a study comparing the number of citations, a proxy for scientific impact and relevance, for 13,000 articles published between 2015 and 2019 in fourteen high-impact general medical journals, we found that articles by first/last authors with African names were cited less often than other articles [10]. In another study evaluating ethnic and gender disparities in 442 prize presentation sessions at two prestigious surgical conferences in the UK over a 21-year period, the authors showed that almost half of the presenters (48%) were white men, followed by Asian men (25%) [12]. By contrast, there was only one black woman, one black man, and sixteen Asian women during these twenty-one years.
NamSor can help determine both the gender and the origin of individuals. Its performance is high for gender inference, as demonstrated recently in a study comparing several gender detection tools [13], but, to our knowledge, there are no published data on the accuracy of this tool for determining the origin of individuals.
Using a database of scientific publications (PubMed) that includes authors’ names and affiliations, the objective of this study was to evaluate the performance of NamSor in estimating the origin of individuals. Given the progress made in data mining techniques, we hypothesized that its performance would be relatively high but would vary according to individuals’ countries of origin.
Methods
Selection of publications and their authors
We used data from SCImago Journal & Country Rank to retrieve all countries whose researchers authored at least 1,000 scientific publications in 2020 in the field of medicine. SCImago Journal & Country Rank is a publicly available portal that includes scientific indicators for journals and countries developed from information in the Scopus® database [14]. Citation data are from over 34,000 titles and over 5,000 international publishers. Seventy-five countries met the inclusion criterion for the study, as shown in Table 1 (country #1: USA with 277,130 publications, country #75: Cuba with 1,059 publications). We also used data from International Migrant Stock 2020, available on the United Nations Population Division portal, to obtain the percentage of migrants by country in 2020. Data on estimates of the number (or "stock") of international migrants are presented as a percentage of the total population, by age, sex, and country of destination, and are based on national statistics, in most cases obtained from population censuses [15]. We selected the 22 countries for which this proportion was below 2.5 percent (Table 1). We restricted the study to these countries only in order to obtain names of researchers that were as homogeneous as possible and representative of the selected countries. The proportion of migrants for these countries ranged from zero for Cuba to 2.2 percent for Japan and Poland.
Then, using PyMed [16], a Python library that gives access to PubMed, we extracted all publications in 2021 with at least one author affiliated with a university or research institute located in the selected countries (N = 118,897). S1 Table shows the Python program used for authors with affiliations in China. The same procedure was followed for the other countries of affiliation.
We obtained a csv file in which the variable ’authors’ had the following form (example for a publication authored by three researchers):
[{'lastname': 'x', 'firstname': 'x', 'initials': 'x', 'affiliation': 'x'}, {'lastname': 'y', 'firstname': 'y', 'initials': 'y', 'affiliation': 'y'}, {'lastname': 'z', 'firstname': 'z', 'initials': 'z', 'affiliation': 'z'}]
Using Stata, we created the variable ’author1’ (i.e., data for first authors only) and the variable ’country1’ (i.e., country of affiliation for first authors only). As the Python program retrieved all articles with at least one author affiliated with one of the countries selected for the study, we removed the publications for which the affiliation to the selected countries did not concern the first author. To do this, we used regular expressions (‘regexm’ in Stata) to extract the country of affiliation of the first author of each article. For missing data (i.e., publications for which the author’s country of affiliation was missing), we added a manual search using the information provided by PubMed (city or university name). Then, all publications for which the country of affiliation of the first author was not one of those selected for the study were removed from the database. The study database contained data for 88,699 publications. Since the authors of these publications were all affiliated with countries with relatively homogeneous populations, we used their country of affiliation as a proxy for their country of origin.
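The parsing steps described above were performed in Stata; a minimal Python sketch of the same logic is shown below. The helper name is hypothetical, the country list is an illustrative subset of the 22 selected countries, and the field layout follows the PyMed export shown above.

```python
import ast
import re

# Illustrative subset of the 22 countries selected for the study
STUDY_COUNTRIES = ["China", "Japan", "Poland", "Brazil", "Mexico", "Cuba"]
COUNTRY_RE = re.compile("|".join(STUDY_COUNTRIES))

def first_author_country(authors_field: str):
    """Parse the 'authors' field exported by PyMed and return the first
    author's name together with the study country matched in their
    affiliation (None when no study country is found)."""
    authors = ast.literal_eval(authors_field)  # list of author dicts
    first = authors[0]
    affiliation = first.get("affiliation") or ""
    match = COUNTRY_RE.search(affiliation)
    return first["firstname"], first["lastname"], match.group(0) if match else None
```

For example, an 'authors' field of `[{'lastname': 'Wang', 'firstname': 'Li', 'initials': 'L', 'affiliation': 'Peking University, Beijing, China'}]` yields `('Li', 'Wang', 'China')`; publications returning `None` for the country would then be removed or resolved manually, as in the study.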
The database included authors with a single affiliation (N = 68,133) and authors with multiple affiliations (N = 20,566). This second group of researchers could possibly include authors with affiliations in several countries (e.g. USA and China). For these researchers, the country of affiliation used was the one that was part of the countries selected for the study (China and not the USA in the case above). We compared the results of the study using the full sample and the subsample consisting only of authors with a single affiliation to assess if the procedure followed for authors with multiple affiliations was appropriate. We again used regular expressions to identify the two groups of researchers. For a researcher with at least two affiliations, these affiliations were separated in the csv file created with the Python program by the newline character ‘\n’ or a semicolon.
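This handling of multiple affiliations can be sketched as follows (hypothetical helper names; the separator convention is the one described above, and the country set is an illustrative subset):

```python
import re

STUDY_COUNTRIES = {"China", "Japan", "Poland", "Kenya"}  # illustrative subset

def split_affiliations(affiliation_field: str):
    """Affiliations are separated by a newline or a semicolon in the csv file."""
    return [part.strip() for part in re.split(r"[\n;]", affiliation_field) if part.strip()]

def retained_country(affiliation_field: str):
    """For authors with affiliations in several countries, keep the one
    belonging to the countries selected for the study."""
    for part in split_affiliations(affiliation_field):
        for country in STUDY_COUNTRIES:
            if country in part:
                return country
    return None
```

An author affiliated with both a US and a Chinese institution would thus be assigned China, mirroring the rule applied in the study.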
NamSor Applied Onomastics
The authors’ names were classified with NamSor Applied Onomastics, a name recognition software [17]. The software recognizes the linguistic or cultural origin of each name and assigns a gender (male or female) and/or an onomastic class (e.g., China or India). As the estimation is probabilistic, the software also provides a probability for the inference (‘probabilityCalibrated’) ranging from zero to one, with ’one’ meaning 100% accuracy. The way this parameter is calculated is described elsewhere [18].
NamSor operates on the principles of linguistic analysis, probabilistic modeling, and machine learning. By leveraging a vast and diverse database of names from around the world, NamSor identifies patterns in names associated with specific regions and ethnicities. It harnesses linguistic attributes, including phonetics and morphology, to enhance its recognition capabilities. In addition, machine learning techniques empower NamSor to continually improve its accuracy through learning from its database. However, the tool has certain potential limitations. First, names can be highly diverse and may not always accurately reflect an individual’s true origin. For example, people may have names from different cultures due to migration, intercultural marriages, or other factors. Second, the algorithm’s performance can vary significantly by country and ethnicity. It may work very well for some regions but less effectively for others. Third, in cases where a person has a first name from one region and a last name from another, the tool may not perform optimally. Finally, NamSor may not be well-suited for areas with highly diverse populations, such as multicultural urban centers or regions with extensive international immigration. In such places, the algorithm’s accuracy may be challenged.
Names can be classified by NamSor in three different ways: by continent of origin (only three continents: Asia, Africa or Europe), country of origin (e.g., China or India) or ethnicity (e.g., Chinese or Indian). The origin of a name refers to the country or continent where an individual was born, or where the individual’s parents or ancestors came from. According to NamSor, the most likely country of origin for the name “Keith Haring” is the United Kingdom with a probability of 48% (i.e., probabilityCalibrated = 0.48). An ethnicity (or an ethnic group) is a group of individuals who identify with each other on the basis of shared attributes distinguishing them from other groups, such as common sets of cultural traditions, ancestry, language and religion. According to NamSor, the most likely ethnicity for the name “Keith Haring” is German with a probability of 61% (i.e., probabilityCalibrated = 0.61). NamSor can also classify names according to race, a classification that we did not use in our study. This categorization includes six classes and is based on the taxonomy used for the US census: White, Black or African American, American Indian or Alaska Native, Asian, Hispanic or Latino, and Native Hawaiian or other Pacific Islander. According to NamSor, "Keith Haring" is most likely "White" (probability = 73%).
We created two other variables: continent#2 ("Europe" replaced by "Europe, America or Oceania") and country#2 ("Spain" replaced by “Spain or Hispanic American country” and "Portugal" replaced by "Portugal or Brazil"). We added these variables because a preliminary analysis of our data showed that a majority of researchers with Hispanic or Portuguese names who were affiliated with universities or research institutes in Brazil, Mexico or Cuba were considered to be from either Spain or Portugal.
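The recoding can be expressed as correctness rules. Below is a sketch assuming, as in the study, that the country of affiliation stands in for the country of origin; the list of Hispanic American countries is an illustrative subset.

```python
HISPANIC_AMERICAN = {"Mexico", "Cuba", "Colombia", "Argentina", "Chile", "Peru"}  # illustrative

def correct_continent2(predicted: str, actual: str) -> bool:
    """continent#2: a 'Europe' prediction also counts as correct when the
    actual continent is America or Oceania."""
    if predicted == "Europe":
        return actual in {"Europe", "America", "Oceania"}
    return predicted == actual

def correct_country2(predicted: str, actual: str) -> bool:
    """country#2: 'Spain' also counts for Hispanic American countries,
    and 'Portugal' also counts for Brazil."""
    if predicted == "Spain":
        return actual == "Spain" or actual in HISPANIC_AMERICAN
    if predicted == "Portugal":
        return actual in {"Portugal", "Brazil"}
    return predicted == actual
```

Under country#2, a Brazilian researcher predicted as Portuguese is thus counted as correctly classified, whereas under the unmodified variable this would be a misclassification.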
Performance analysis
We evaluated NamSor’s performance by computing three efficiency metrics. These metrics refer to the confusion matrix that contains three components, with ’c’ corresponding to correct classifications, ’i’ to misclassifications (i.e., a wrong continent, country or ethnicity assigned to a name) and ’u’ to non-classifications (i.e., no continent, country or ethnicity assigned). In this study, we considered the “actual country of origin” of the researchers to be their country of affiliation, as extracted from the database listing the authors of publications. The “predicted country of origin” was the country determined by NamSor using the researchers’ first and last names. These definitions also apply to continent of origin and ethnicity.
errorCoded = (i + u) / (c + i + u)
errorCodedWithoutNA = (i) / (c + i)
naCoded = (u) / (c + i + u)
The three performance metrics computed in the study can be interpreted as follows: errorCoded estimates the proportion of misclassifications and non-classifications (this measure therefore penalizes both types of errors equally), errorCodedWithoutNA measures the proportion of misclassifications excluding non-classifications, and naCoded measures the proportion of non-classifications. The same metrics were computed in several recent studies, including some conducted by our research team, to estimate the performance of gender detection tools [13, 18–20]. These tools determine the gender of individuals based on their names.
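The three metrics can be computed directly from the confusion-matrix counts; a minimal sketch (the function name is illustrative):

```python
def performance_metrics(c: int, i: int, u: int) -> dict:
    """Compute the three metrics from confusion-matrix counts:
    c = correct classifications, i = misclassifications,
    u = non-classifications (no origin assigned)."""
    total = c + i + u
    return {
        "errorCoded": (i + u) / total,        # penalizes both error types equally
        "errorCodedWithoutNA": i / (c + i),   # misclassifications, NAs excluded
        "naCoded": u / total,                 # non-classifications only
    }
```

For instance, with 80 correct classifications, 15 misclassifications and 5 non-classifications, errorCoded is 0.20, errorCodedWithoutNA is 15/95 ≈ 0.158, and naCoded is 0.05.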
We repeated the analyses by removing all results with inference accuracy <40% (i.e., probabilityCalibrated <0.4), <50% (i.e., probabilityCalibrated <0.5), <60% (i.e., probabilityCalibrated <0.6) and <70% (i.e., probabilityCalibrated <0.7), respectively. All assignments made with an accuracy level below the selected threshold value were considered as non-classifications. These sensitivity analyses were conducted to determine how the proportion of misclassifications and the proportion of non-classifications changed as the accuracy threshold increased or decreased.
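These sensitivity analyses amount to recoding low-confidence predictions as non-classifications before recomputing the metrics; a sketch under the same convention (the tuple layout is illustrative):

```python
def apply_threshold(predictions, threshold):
    """Each prediction is a (name, predicted_origin, probabilityCalibrated)
    tuple; predictions below the threshold become non-classifications (None)."""
    return [
        (name, origin if prob >= threshold else None, prob)
        for name, origin, prob in predictions
    ]
```

Raising the threshold therefore trades misclassifications for non-classifications, which is exactly the pattern examined in the Results.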
Estimating the proportion of foreign researchers in the study sample
We used the misclassification proportion for "country#2", restricting the analysis to only those researchers for whom the country of origin was determined with an inference accuracy ≥70%, as an indicator of the proportion of foreign researchers in the study sample. Indeed, the level of misclassification can be considered as an indirect measure of the maximum proportion of foreign researchers according to the following formula: "proportion of misclassification with inference accuracy ≥70%" = "proportion of misclassification due to NamSor error" + "proportion of misclassification due to foreign researchers". Even including in the calculation only those researchers for whom their country of origin was determined with an accuracy ≥70%, the proportion of foreign researchers should in fact be lower than the proportion of misclassification for “country#2”, since NamSor is not 100% accurate. As researchers are expected to be more mobile than the general population, we hypothesize that by limiting the countries of affiliation to only those countries with a migrant stock <2.5% the proportion of foreign researchers would be at most 5% in the study. We performed all analyses with STATA version 15.1 (College Station, TX, USA).
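The bound used above follows directly from the stated decomposition; a one-line sketch with the study's observed value (5.0% for "country#2" at ≥70% accuracy):

```python
def foreign_share_upper_bound(misclassification: float, namsor_error: float = 0.0) -> float:
    """misclassification = namsor_error + foreign_share, so for any
    non-negative NamSor error the foreign share is at most the observed
    misclassification proportion."""
    return misclassification - namsor_error

# Observed misclassification for "country#2" at >=70% accuracy
bound = foreign_share_upper_bound(0.05)  # NamSor error unknown, assumed zero
```

Since NamSor's own error rate is unknown but non-negative, the true proportion of foreign researchers can only be lower than this bound.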
Results
The main results of the study are presented in Tables 2–4, for the full sample, for authors with a single affiliation, and for the four subsamples including only names for which the inference accuracy was, respectively, ≥40%, ≥50%, ≥60%, and ≥70%. Table 2 shows for each of the twenty-two selected countries the number of researchers whose name origin was correctly classified by NamSor. These data are then summarized in Table 3 (confusion matrices) and Table 4 (performance metrics).
In addition, S2 Table shows for each country of affiliation the countries of origin estimated by NamSor. These countries, five per country of affiliation, are ranked in the table by the number of inferences. S3 Table lists for each country of affiliation the first and last names of a random selection of researchers according to their country of origin, as estimated by NamSor. To build this table we used ’listsome’, a Stata module to list a random sample of observations. Finally, S4 and S5 Tables show the first name, last name and country of origin of all researchers with inference accuracy ≥70%, for respectively Japan, a country whose names were well recognized by NamSor, and Kenya, a country whose names were less recognized by NamSor.
All results obtained in the study were similar for the full sample and the subsample consisting only of authors with a single affiliation. As shown in Table 2, the proportion of correct classifications varied widely by country and was higher for “continent of origin” than for “country of origin” and “ethnicity”. For some countries, such as Poland, Pakistan and Vietnam, most names were correctly identified. Other names were poorly recognized, for example Nepalese or Tanzanian names, and others, mainly Latin American names, were not recognized at all by NamSor. No Brazilian, Mexican, Filipino or Cuban names were correctly identified: Brazilian names were mostly considered Portuguese, while Mexican or Cuban names were mostly considered Spanish (S2 Table).
S3 Table shows that NamSor could also be wrong with some names when the first name suggested a different country of origin than the surname. For example, Karol Deutsch is a researcher affiliated with a university in Poland. Although Karol is a common first name in Poland, NamSor identified this researcher as being of German origin, probably because of his surname (’Deutsch’). Similarly, although Erika Marie Bascos’ country of affiliation is the Philippines, NamSor considered this researcher to be of French origin, probably because of her first name (’Erika Marie’). Finally, NamSor was of course mistaken with foreign researchers, or more broadly with researchers whose names suggest a country of origin different from their country of affiliation. For example, Muhammad Bilal and Abdullah Al Mamun were (correctly) identified by NamSor as being of Pakistani and Bangladeshi origin respectively, but these counted as misclassifications because these researchers are affiliated with Chinese universities.
The use of the two modified variables (continent#2 and country#2) increased the proportion of correct classifications for all countries. In addition, when restricting the analyses to subsamples, NamSor’s performance tended to increase gradually as the accuracy threshold increased. For example, for "country of origin", the proportion of correct classifications for Japan was 85.7% for the full sample (and 86.7% for authors with a single affiliation), 86.3% for a threshold value of 40%, 89.1% for a threshold value of 50%, 90.1% for a threshold value of 60%, and 91.1% for a threshold value of 70%. Likewise, the number of non-classifications gradually increased with the accuracy threshold. For example, for the same variable (country of origin) and the same country (Japan), the number of names classified by NamSor fell from 6,362 for the full sample to 6,308 with a cut-off value of 40%, 6,032 with a cut-off value of 50%, 5,905 with a cut-off value of 60%, and 5,732 with a cut-off value of 70%.
As shown in the confusion matrices (Table 3), there was a decrease in the number of correct classifications as the threshold value for inference accuracy increased, due to a greater increase in the number of non-classifications relative to the decrease in the number of misclassifications. For example, for “country of origin”, the number of correct classifications was 64,499 for the full sample, 63,901 with a threshold value of 40%, 58,608 with a threshold value of 50%, 55,149 with a threshold value of 60%, and 50,679 with a threshold value of 70%.
Table 4 (accuracy metrics) confirms the results of the previous table. The proportion of misclassifications and non-classifications (i.e., errorCoded) was lowest for the full sample and for authors with a single affiliation, and increased gradually as the threshold value increased. With a cut-off value of 40%, errorCoded increased only slightly compared to the full sample because the number and proportion of non-classifications (= naCoded) was low: 2,259 (2.6%) for “continent of origin” and “country of origin”, and 6,028 (6.8%) for “ethnicity”. Above 60%, errorCoded reached or exceeded 25% for “continent of origin” and “country of origin” and 15% for “ethnicity”. Using a cut-off value of 50% was probably the strategy that provided the best compromise between “proportion of correct classifications” and “proportion of non-classifications”. For the full sample, the subsample consisting only of authors with a single affiliation, and the subsample with inference accuracy ≥50%, the proportion of misclassifications (= errorCodedWithoutNA) was, respectively, 16.0%, 15.6% and 12.6% for “continent of origin”, 6.3%, 5.7% and 3.3% for “continent#2”, 27.3%, 26.5% and 19.5% for “country of origin”, 19.7%, 18.6% and 11.4% for “country#2”, and 20.2%, 18.8% and 14.8% for “ethnicity”.
As expected, the total proportion of foreign researchers in the study sample can be estimated to be less than 5%, since the proportion of names that were misclassified for "country#2", by including in the analysis only those researchers for whom the country of origin was determined with ≥70% accuracy, was 5.0% (Table 3).
Discussion
Main findings
In this cross-sectional study, we examined the performance of NamSor in determining the origin of individuals based on their first and last names. To this end, we used a database of researchers whose country of affiliation was known. We limited the analysis to researchers affiliated with low-immigration countries (i.e., countries with a migrant stock <2.5%) and considered the country of origin of these researchers to be their country of affiliation.
All results obtained in the study were similar for the full sample and the subsample consisting only of authors with a single affiliation. We found NamSor to be accurate in determining the continent of origin, especially when using the modified variable (continent#2) and restricting the analysis to names with an inference accuracy ≥50%. For continent#2, the proportion of misclassifications (i.e., errorCodedWithoutNA) was only 6.3% for the full sample, 5.7% for authors with a single affiliation, and 3.3% for the subsample with inference accuracy ≥50%. The risk of misclassification was higher with country of origin or ethnicity, but it also decreased when using the modified variable (country#2) and the subsample.
Comparison with existing literature
Several authors have used NamSor in the past to estimate the origin of individuals in their research, both in medicine [10, 12] and in other disciplines [21, 22], but our study is, to our knowledge, the first to have evaluated its performance. We previously evaluated NamSor’s performance in determining the gender of individuals from their first and last names and showed that the tool was accurate in the majority of cases [13]. However, we found that NamSor was much less efficient for some countries, for example for Chinese names [18]. We also found that the use of the accuracy parameter (’probabilityCalibrated’) was not useful for improving the performance of NamSor for gender estimation [23].
The results we obtained in the current study were quite different. Asian names were in general relatively well recognized by NamSor. For example, 76% of the names of researchers affiliated with universities or research institutes in China were correctly classified for “country of origin” (and even 85% for “ethnicity”). These figures were 86% and 85%, respectively, for Japan. The results were similar for authors with a single affiliation (China: 76% for “country of origin” and 84% for “ethnicity”; Japan: 87% for both variables). Furthermore, the use of the accuracy parameter greatly improved the performance of the tool for origin estimation. The best compromise between improving NamSor’s performance and increasing the number of non-classifications was obtained with a threshold value of 50%. With a threshold value of 40%, too few queries were considered as non-classifications (2.6% for “continent of origin” and “country of origin”, and 6.8% for “ethnicity”) to make a noticeable change in performance metrics. For example, for “continent of origin” and “country of origin”, errorCodedWithoutNA decreased only from 16.0% to 15.4% and from 27.3% to 26.1%, respectively, while these proportions decreased to 12.6% and 19.5% for a threshold value of 50%.
As expected, using “continent of origin” yielded more accurate assignments than either “country of origin” or “ethnicity”. This is a logical finding since “continent of origin” consisted of only three categories, far fewer than the other two variables. For example, if authors with Chinese names were considered to be of Japanese origin, the continent of origin (i.e., Asia) would have been correctly estimated, unlike country of origin or ethnicity. However, if researchers using NamSor needed more precision for their study than simply assigning a continent of origin, the use of “ethnicity” would a priori allow more accurate assignments than “country of origin”. For example, for the total sample and for authors with a single affiliation, errorCodedWithoutNA was respectively 20.2% and 18.8% for “ethnicity” and 27.3% and 26.5% for “country of origin”. This difference persisted with the various subsamples.
As expected, it was the joint use of “continent#2” or “country#2” and the various subsamples with threshold values of 50% or more that really improved the performance of NamSor. For “continent#2” and a cut-off value of 50%, the proportion of misclassifications was only 2.6% in our study (vs., for “continent of origin”, 16.0% for the total sample and 15.6% for authors with a single affiliation). For “country#2” and the same cut-off value of 50%, this proportion was 9.3% (vs., for “country of origin”, 27.3% for the total sample and 26.5% for authors with a single affiliation). “Continent#2” led to more accurate assignments than “continent of origin”, as many researchers with Spanish or Portuguese names were actually affiliated with universities or research institutes in Latin America. For the same reason, replacing “country of origin” by “country#2” (i.e., "Spain" by "Spain or Hispanic American country", and "Portugal" by "Portugal or Brazil") was also useful for improving NamSor’s performance.
Anglo-Saxon countries (i.e., UK, USA, Canada, Australia and New Zealand) were excluded from the study, as the proportion of migrants was too high in these countries. However, it is likely that if they were included we would observe misclassifications for the same reason as for names of Spanish or Portuguese origin. It would therefore make sense to use a third variable (country#3) that would add a modification to "country#2", replacing "UK", "USA", "Canada", "Australia" and "New Zealand" with "UK or USA or Canada or Australia or New Zealand".
In recent years, there has been growing recognition of the impact of artificial intelligence and other mechanisms on gender equality and biases across various domains. Three papers by Bao & Huang shed light on this topic. The first paper explored how artificial intelligence (AI) can contribute to creating a gender-neutral learning environment, reducing gender disparities in education [24]. The study compared the results of students in the game of Go with human teachers vs. AI trainers. With human teachers, boys consistently had a higher winning rate than girls, whereas the use of AI trainers led to improvements in the performance of both male and female students. The two other papers highlighted the importance of addressing gender-specific favoritism in scientific recruitment processes, particularly in prestigious scientific committees [25, 26]. Greater female representation in these committees could lead to innovative approaches and managerial effectiveness in shaping research resource allocation and public projects. Bao & Huang discussed the underrepresentation of women in top scientific positions and the need to reform scientific election procedures to foster gender balance, illustrating how gender-specific biases can impact career success and exacerbate gender disparities. These three papers underscore the broader context within which our study of NamSor’s ability to predict individuals’ country of origin and ethnicity takes place. The accurate name-based classification offered by NamSor serves as a critical tool in addressing biases, improving diversity, and promoting equity in research and various decision-making processes.
Implications for practice and research
The performance of NamSor in determining the origin of individuals was probably underestimated in our study, as it was based on the assumption that all researchers affiliated with universities or research institutes in a given country were from that country. This assumption is not entirely correct, as the countries of affiliation could be included in the study up to a threshold in the proportion of migrants of 2.5%. In addition, researchers are a priori more mobile than the general population and the proportion of foreign researchers is expected to be above the 2.5% threshold for a number of countries. However, this proportion was probably less than 5% in the study since the proportion of misclassification for “country#2”, which includes misclassification related to foreign researchers, was 5.0% using the sample consisting of all names for which the determination was made with at least 70% accuracy.
Because our estimate of NamSor’s performance is rather conservative, using this tool in research following the procedure proposed in the study is probably safe. However, finer determinations of the origin of individuals, at the level of country rather than continent, could also be an option. In order to demonstrate the validity of this strategy, further studies would be needed, which could rely for example on self-identification or the expertise of linguists or onomatologists to assess the performance of NamSor for a large number of countries. Unfortunately, such studies are often difficult to conduct with the large number of participants needed to represent a wide variety of names. Future studies could also compare NamSor to other similar tools that estimate the origin of individuals based on their names, for example NamePrism, a name-based nationality and ethnicity classification tool, and ethnicolr, a name-based race and ethnicity classification tool.
NamSor’s potential applications extend to various stakeholders, including researchers, institutions, and policymakers. Researchers can use NamSor to evaluate and address issues related to discrimination based on individuals’ origin in academic collaborations, funding decisions, and research recognition. By leveraging NamSor’s predictions, research institutions and funding agencies can foster a more diverse and inclusive academic environment. For example, NamSor could facilitate targeted initiatives aimed at addressing the underrepresentation of certain groups in research, such as allocating resources to support underrepresented researchers or encouraging collaborations that bridge diverse backgrounds. The tool can also serve as a valuable resource for ensuring equity in research evaluations, such as grant allocations and award nominations. Furthermore, policymakers can use NamSor’s capabilities to inform and shape evidence-based policies that promote diversity, inclusivity, and global collaboration in research. By recognizing and addressing disparities, decision-makers can take steps to enhance the international research landscape. Therefore, NamSor’s role in improving the accuracy of origin predictions has wide-reaching implications, contributing to a more equitable and inclusive research community.
Limitations
Despite its large sample size, our study has two main limitations. First, we restricted the study to twenty-two countries spread over four continents (Europe, Asia, Africa and America). As the performance of NamSor varies depending on the country examined, our results are not necessarily generalizable to other countries. We therefore recommend some changes in the variables derived from NamSor’s output. We recommend the use of "continent#2" (i.e., "Europe" replaced by "Europe/America/Oceania") instead of "continent of origin", and the use of "country#3" (i.e., "UK", "USA", "Canada", "Australia" and "New Zealand" replaced in country#2 by "UK or USA or Canada or Australia or New Zealand") instead of "country of origin" or “country#2”. Second, as already stated above, we considered the country of origin of the researchers to be their country of affiliation. Although we restricted the study to countries with less than 2.5% migrants to obtain the most homogeneous populations possible, with names representative of the selected countries, there were inevitably foreign researchers in these countries. The results of our study therefore probably underestimate the real performance of NamSor. It would have been better to determine the actual origin of the researchers by self-identification, linguistic analysis, or consultation of experts in onomastics. Furthermore, exploring multiple proxies for “country of origin”, combining various sources of information, could offer a more robust approach to ascertaining the true origin of individuals. It is essential to recognize that the complexity of individuals’ origin, identity, and mobility necessitates a multifaceted approach to validation.
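The recoding behind the modified variables described above can be sketched in a few lines. The group labels and the list of Hispanic American countries below are assumptions for illustration; they are not part of NamSor's own output:

```python
# Hypothetical country groups used to build the modified variables;
# the Hispanic American list is illustrative, not exhaustive.
HISPANIC_AMERICA = {"Mexico", "Argentina", "Colombia", "Chile", "Peru"}
ANGLOPHONE = {"UK", "USA", "Canada", "Australia", "New Zealand"}

def to_country2(country):
    """country#2: merge Spain with Hispanic American countries,
    and Portugal with Brazil."""
    if country == "Spain" or country in HISPANIC_AMERICA:
        return "Spain/Hispanic America"
    if country in ("Portugal", "Brazil"):
        return "Portugal/Brazil"
    return country

def to_country3(country):
    """country#3: additionally merge the five Anglophone countries
    into a single category."""
    country = to_country2(country)
    if country in ANGLOPHONE:
        return "UK/USA/Canada/Australia/New Zealand"
    return country

print(to_country2("Brazil"))   # grouped with Portugal
print(to_country3("Canada"))   # grouped with the Anglophone bloc
```

Error rates are then recomputed on the recoded predictions, so that, for example, a Brazilian name classified as Portuguese no longer counts as a misclassification.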
Conclusion
NamSor is accurate in determining the continent of origin of individuals from their first and last names, especially when using the modified variable (i.e., continent#2) and/or restricting the analysis to names with inference accuracy ≥50%. The risk of misclassification is higher with country of origin or ethnicity, but decreases, as with continent of origin, when using the modified variable (i.e., country#2) and/or the subsample. Further research would be useful, as the performance of NamSor was probably underestimated in our study due to the relatively high mobility of researchers. Future investigations could also compare NamSor with other name-based classification algorithms, such as NamePrism and ethnicolr, or explore avenues for enhancing the accuracy of NamSor’s predictions through advanced machine learning techniques and more extensive name databases. Such endeavors are essential to advance the understanding of origin determination techniques and their applications across various fields.
Supporting information
S1 Table. Python program to retrieve all PubMed articles published in 2021 with at least one author affiliated with a university or research institute in China (adapted from https://github.com/gijswobben/pymed).
The same procedure was followed for the other countries of affiliation.
https://doi.org/10.1371/journal.pone.0294562.s001
(DOCX)
S2 Table. Number and proportion of researchers by country of origin (the top five countries of origin, ranked by number of inferences, are shown for each country of affiliation).
Data are presented for the full sample and for two subsamples including only names for which the accuracy of inference was, respectively, ≥50% and ≥70%.
https://doi.org/10.1371/journal.pone.0294562.s002
(DOCX)
S3 Table. First and last names of a random selection of researchers (sorted by country of affiliation and country of origin).
https://doi.org/10.1371/journal.pone.0294562.s003
(DOCX)
S4 Table. First name, last name and country of origin of all researchers with inference accuracy ≥70% for a country whose names were well recognized by NamSor (i.e., Japan).
https://doi.org/10.1371/journal.pone.0294562.s004
(XLS)
S5 Table. First name, last name and country of origin of all researchers with inference accuracy ≥70% for a country whose names were relatively poorly recognized by NamSor (i.e., Kenya).
https://doi.org/10.1371/journal.pone.0294562.s005
(XLS)
References
- 1. Safdar B, Naveed S, Chaudhary AMD, Saboor S, Zeshan M, Khosa F. Gender Disparity in Grants and Awards at the National Institute of Health. Cureus. 2021;13: e14644. pmid:34046277
- 2. Richter KP, Clark L, Wick JA, Cruvinel E, Durham D, Shaw P, et al. Women Physicians and Promotion in Academic Medicine. N Engl J Med. 2020;383: 2148–2157. pmid:33252871
- 3. Sebo P, Clair C. Gender gap in authorship: a study of 44,000 articles published in 100 high-impact general medical journals. Eur J Intern Med. 2021; S0953–6205(21)00313–7. pmid:34598855
- 4. Sebo P, Maisonneuve H, Fournier JP. Gender gap in research: a bibliometric study of published articles in primary health care and general internal medicine. Fam Pract. 2020;37: 325–331. pmid:31935279
- 5. Sebo P, Clair C. Gender Inequalities in Citations of Articles Published in High-Impact General Medical Journals: a Cross-Sectional Study. J Gen Intern Med. 2022. pmid:35794309
- 6. Sebo P, de Lucia S, Vernaz N. Gender gap in medical research: a bibliometric study in Swiss university hospitals. Scientometrics. 2020 [cited 12 Dec 2020].
- 7. Gender equality in research and innovation. In: European Commission [Internet]. [cited 20 Mar 2022]. Available: https://ec.europa.eu/info/research-and-innovation/strategy/strategy-2020-2024/democracy-and-rights/gender-equality-research-and-innovation_en
- 8. Merriman R, Galizia I, Tanaka S, Sheffel A, Buse K, Hawkes S. The gender and geography of publishing: a review of sex/gender reporting and author representation in leading general medical and global health journals. BMJ Glob Health. 2021;6: e005672. pmid:33986001
- 9. Busse CE, Anderson EW, Endale T, Smith YR, Kaniecki M, Shannon C, et al. Strengthening research capacity: a systematic review of manuscript writing and publishing interventions for researchers in low-income and middle-income countries. BMJ Glob Health. 2022;7: e008059. pmid:35165096
- 10. Sebo P. Publication and citation inequalities faced by African researchers. Eur J Intern Med. 2022; S0953–6205(22)00292–8. pmid:35985953
- 11. Nafade V, Sen P, Pai M. Global health journals need to address equity, diversity and inclusion. BMJ Glob Health. 2019;4: e002018. pmid:31750004
- 12. Seehra JK, Lewis-Lloyd C, Koh A, Theophilidou E, Daliya P, Adiamah A, et al. Publication Rates, Ethnic and Sex Disparities in UK and Ireland Surgical Research Prize Presentations: An Analysis of Data From the Moynihan and Patey Prizes From 2000 to 2020. World J Surg. 2021;45: 3266–3277. pmid:34383090
- 13. Sebo P. Performance of gender detection tools: a comparative study of name-to-gender inference services. J Med Libr Assoc JMLA. 2021;109: 414–421. pmid:34629970
- 14. SJR—International Science Ranking. [cited 14 May 2021]. Available: https://www.scimagojr.com/countryrank.php?year=2019
- 15. International Migrant Stock | Population Division. [cited 17 Apr 2022]. Available: https://www.un.org/development/desa/pd/content/international-migrant-stock
- 16. gijswobben/pymed. In: GitHub [Internet]. [cited 4 Feb 2021]. Available: https://github.com/gijswobben/pymed
- 17. Namsor: name checker for gender, origin and ethnicity classification. [cited 17 Apr 2022]. Available: https://namsor.app/
- 18. Sebo P. How accurate are gender detection tools in predicting the gender for Chinese names? A study with 20,000 given names in Pinyin format. J Med Libr Assoc JMLA. 2022;110: 205–211. pmid:35440899
- 19. Santamaría L, Mihaljević H. Comparison and benchmark of name-to-gender inference services. PeerJ Comput Sci. 2018;4: e156. pmid:33816809
- 20. Sebo P. Using genderize.io to infer the gender of first names: how to improve the accuracy of the inference. J Med Libr Assoc JMLA. 2021;109: 609–612. pmid:34858090
- 21. Nagle F, Teodoridis F. Jack of all trades and master of knowledge: The role of diversification in new distant knowledge integration. Strateg Manag J. 2020;41: 55–85.
- 22. de Rassenfosse G, Hosseini R. Discrimination against foreigners in the U.S. patent system. J Int Bus Policy. 2020;3: 349–366.
- 23. Sebo P. Are Accuracy Parameters Useful for Improving the Performance of Gender Detection Tools? A Comparative Study with Western and Chinese Names. J Gen Intern Med. 2022. pmid:35292910
- 24. Bao Z, Huang D. Can Artificial Intelligence Improve Gender Equality? Evidence from a Natural Experiment. Rochester, NY; 2022.
- 25. Bao Z, Huang D. Gender-specific favoritism in science. J Econ Behav Organ. 2023 [cited 14 Oct 2023].
- 26. Bao Z, Huang D. Reform scientific elections to improve gender equality. Nat Hum Behav. 2022;6: 478–479. pmid:35273356
- 27. Sebo P. NamSor’s performance in predicting the country of origin and ethnicity of 90,000 researchers based on their first and last names. Preprint at https://doi.org/10.21203/rs.3.rs-1565759/v3