Application of big-data for epidemiological studies of refractive error

Purpose To examine whether data sourced from electronic medical records (EMR) and a large industrial spectacle lens manufacturing database can estimate refractive error distribution within large populations as an alternative to typical population surveys of refractive error.
Subjects A total of 555,528 patient visits from 28 Irish primary care optometry practices between the years 1980 and 2019 and 141,547,436 spectacle lens sales records from an international European lens manufacturer between the years 1998 and 2016.
Methods Anonymized EMR data included demographic, refractive and visual acuity values. Anonymized spectacle lens data included refractive data. Spectacle lens data was separated into lenses containing an addition (ADD) and those without an addition (SV). The proportions of refractive errors from the EMR data and ADD lenses were compared to published results from the European Eye Epidemiology (E3) Consortium and the Gutenberg Health Study (GHS).
Results Age and gender matched proportions of refractive error were comparable in the E3 data and the EMR data, with no significant difference in the overall refractive error distribution (χ² = 527, p = 0.29, DoF = 510). EMR data provided a closer match to the E3 refractive error distribution by age than the ADD lens data. The ADD lens data, however, provided a closer approximation to the E3 data for total myopia prevalence than the GHS data, up to age 64.
Conclusions The prevalence of refractive error within a population can be estimated using EMR data in the absence of population surveys. Industry derived sales data can also provide insights on the epidemiology of refractive errors in a population over certain age ranges. EMR and industrial data may therefore provide a fast and cost-effective surrogate measure of refractive error distribution that can be used for future health service planning purposes.

healthcare in recent years is of specific interest. Data such as electronic medical records (EMR) and industrial manufacturing or sales records represent a potentially valuable source of secondary data, i.e. data used for a purpose that is different from that for which it was originally collected. The scale of such data is often far larger than conventional research datasets and it is now commonly referred to as Big Data. Big Data is now recognized as an important resource for scientific research, allowing conclusions to be drawn that would otherwise be impossible using traditional scientific techniques [15,16].
In the field of eyecare, several studies have demonstrated the usefulness of EMR data for determining disease epidemiology [17,18] and treatment outcomes [19,20]. The application of such approaches to myopia genetics research has shown strong correlation with the results obtained using conventional epidemiological research methodologies [21,22]. National [23,24] and private insurance claims records have also been used to determine the epidemiology of several ocular diseases, as have hospital records [25]. Big Data sources of this type can be used as an alternative form of epidemiological data, particularly in the absence of conventional epidemiological studies. Datasets such as national insurance claims records can be generalised to an entire population while EMR and hospital record data are useful when considering specific population cohorts.
The potential of Big Data as a tool to monitor population trends in refractive error has received little attention. Optometric EMR data provides an obvious example of a rich source of data on refractive error that has yet to be exploited for this purpose. Another novel, but less obvious, source of data is the manufacturing and sales records of companies involved in the supply of optical appliances such as spectacle and contact lenses. This data source is much more limited in terms of the information available, but the ubiquity of these optical appliances indicates such data may still elicit useful insights on refractive error epidemiology.
This study was designed, therefore, to examine whether optometric EMR data or spectacle lens data can provide estimates of refractive error distribution that are comparable to traditional population surveys.

Methods
Anonymized EMR data was gathered from 28 Irish optometry practices. The data was extracted remotely through the EMR provider following provision of explicit consent from the data (practice) owners during the period of May 2018 to June 2019 for all 28 practices. This study was approved by the TU Dublin Research Ethics and Integrity Committee (REC-18-124) and adheres to the tenets of the Declaration of Helsinki. Patient level consent was not required due to the nature of the anonymization of the data. The data extracted comprised all practice records since first use up to the date of extraction for each practice. The EMR provider removed any personally identifying data and anonymized the data prior to delivery so that the anonymization could not be reversed by the researchers. The data was analysed using the R programming language (R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/). At the time of extraction, a new unique identifying number was generated for each subject within the EMR data, allowing their data to be tracked across multiple visits. The data available for each subject included demographic, refractive, visual acuity, binocular vision, contact lens, ocular health and clinical management data. For this analysis only demographic, refractive and visual acuity data were considered, with most refractions having been performed as non-cycloplegic subjective refractions.
Anonymized patient spectacle lens sales data was provided by a major European manufacturer. This comprised lenses that had been manufactured and dispatched after an order was received from a practitioner with the majority of lenses for delivery within Europe. The data was collated into histogram data using the SQLite database engine (Hipp, Wyrick & Company, Inc., Charlotte, North Carolina, USA) and analysed using the R statistical programming language. The data provided included the spherical power, cylindrical power and axis of the spectacle prescription. The lens design, diameter, laterality (prescribed for right or left eye) and date of manufacture were also included. For lens designs with an addition, this was also specified. The presence of an addition allowed the lenses to be separated into two groups, the single vision (SV) lens group and the addition (ADD) lens group. The data was validated for missing and malformed data fields and any lenses with incomplete or invalid data were excluded. The spherical equivalent power was calculated for each lens.
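The grouping and spherical equivalent steps described above can be sketched in a few lines of Python. This is an illustration only, not the SQLite and R pipeline used in the study: the `Lens` record and its field names are hypothetical, and the spherical equivalent is taken to be the conventional sphere plus half the cylinder, which the text does not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lens:
    # Hypothetical record layout; field names are assumptions for illustration.
    sphere: float              # spherical power in dioptres (D)
    cylinder: float            # cylindrical power in dioptres (D)
    addition: Optional[float]  # reading addition in D, None when absent


def spherical_equivalent(lens: Lens) -> float:
    """Conventional spherical equivalent: sphere plus half the cylinder."""
    return lens.sphere + lens.cylinder / 2.0


def lens_group(lens: Lens) -> str:
    """Return 'ADD' when the lens carries a reading addition, otherwise 'SV'."""
    has_addition = lens.addition is not None and lens.addition > 0
    return "ADD" if has_addition else "SV"


# Toy lenses, invented for illustration only.
lenses = [
    Lens(sphere=-2.00, cylinder=-0.50, addition=None),
    Lens(sphere=+1.00, cylinder=-2.00, addition=2.00),
]
for lens in lenses:
    print(lens_group(lens), spherical_equivalent(lens))
```

The second toy lens has mixed astigmatism, so its spherical equivalent is zero even though the lens has power in both meridians.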
Data from the E3 study was extracted by digitizing the published results using Plot Digitiser [26]. Data from the GHS study [27], a population based observational study, was also digitized as an additional comparison. The GHS was chosen as an additional comparison as it took place in Germany, had a similar age range and was one of the component studies of the E3 study. In addition, Germany was the largest contributor to the spectacle lens data.
Myopia was defined according to the International Myopia standards [28], with a spherical equivalent (SE) refractive error of ≤ -0.50 D being considered myopic, and ≤ -6.00 D considered highly myopic. Hyperopia was defined as ≥ +0.75 D and emmetropia defined as > -0.50 D and < +0.75 D. For comparison with the E3 study, analysis was also performed using the myopia definition used in that study, i.e. ≤ -0.75 D.
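These thresholds translate directly into a small classifier. The sketch below is illustrative rather than the authors' code. The function name and the returned flags are hypothetical. High myopia is treated as a subset of myopia, and the label for values between the E3 threshold and -0.50 D is an assumption, since the text does not define emmetropia under the E3 definition.

```python
def classify_refraction(se: float, myopia_threshold: float = -0.50) -> dict:
    """Flag refractive error categories for a spherical equivalent in dioptres.

    myopia_threshold defaults to the -0.50 D definition; pass -0.75 for the
    E3 definition. High myopia is a subset of myopia.
    """
    return {
        "myopia": se <= myopia_threshold,
        "high_myopia": se <= -6.00,
        "hyperopia": se >= 0.75,
        # Assumption: anything strictly between the two thresholds counts
        # as emmetropia, whichever myopia definition is in use.
        "emmetropia": myopia_threshold < se < 0.75,
    }


print(classify_refraction(-0.50))         # myopic under the -0.50 D definition
print(classify_refraction(-0.50, -0.75))  # not myopic under the E3 definition
```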
The E3 study, a meta-analysis on refractive error prevalence in Europe, was chosen as a comparative study for several reasons. Firstly, the manufacturer database reflected almost exclusively European lens sales. Secondly, as the spectacle lens data comprised a substantial proportion of reading addition lenses typically used by older presbyopic adults [29] (age ≥ 40-45 typically) [30], the adult age profile of the E3 consortium (age 25-89 years) was deemed suitable, and it was assumed that the datasets could be comparable. These age assumptions were also validated using the EMR data. With this more detailed optometric data, both the age and spectacle correction data were available, allowing determination of the age distribution of patients with single vision and reading addition spectacles. The relationship between age and reading addition was determined by fitting a logistic function to the age and right eye reading addition found in the EMR data using the 'drc' extension package for R [31]. A logistic function was also created to determine the number of individuals requiring a reading addition at each age from 1 to 100 years old within the EMR data. The base R predict function was then used to generate 95% prediction intervals for both logistic models. Probability density functions were generated for each reading addition value to determine the distribution of age associated with that reading addition. The ADD lens group then had an estimated age assigned for each spectacle lens based on the reading addition value for that lens using the probabilities generated from the EMR data.
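The final step, assigning an estimated age to each ADD lens from the age distribution seen in the EMR data for the same addition, can be sketched as below. This is a simplification under stated assumptions: it uses raw empirical age frequencies rather than fitted probability density functions, and the function names and toy records are hypothetical. It does not reproduce the logistic fits made with the 'drc' package.

```python
import random
from collections import Counter, defaultdict


def build_age_distributions(emr_records):
    """Map each reading addition value to a Counter of the ages observed with it."""
    distributions = defaultdict(Counter)
    for age, addition in emr_records:
        distributions[addition][age] += 1
    return distributions


def sample_age(addition, distributions, rng):
    """Draw an age with probability proportional to its frequency in the EMR data."""
    counts = distributions.get(addition)
    if not counts:
        return None  # this addition value was never observed in the EMR data
    ages = list(counts.keys())
    weights = list(counts.values())
    return rng.choices(ages, weights=weights, k=1)[0]


# Toy (age, right eye addition) pairs, invented for illustration only.
emr_records = [(45, 1.00), (47, 1.00), (47, 1.00), (55, 2.00), (60, 2.00), (62, 2.50)]
distributions = build_age_distributions(emr_records)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
for lens_addition in (1.00, 2.00, 2.50, 3.00):
    print(lens_addition, sample_age(lens_addition, distributions, rng))
```

Sampling from the distribution, rather than assigning the single most likely age, keeps the spread of ages associated with each addition value in the resulting ADD lens population.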
The EMR data was randomly sampled to provide an age and gender matched population for comparison with the E3 population. The ADD lens data was also age matched with the E3 population using the estimated age for each lens. From the age matched EMR and ADD lens data, the proportion of myopia, high myopia and hyperopia present was calculated in 5-year age brackets to allow comparison with the E3 and GHS data.
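A minimal sketch of the matching and bracketing steps follows. It assumes that matching means drawing a random sample from each age bracket and gender stratum to a target count taken from the reference population. The record layout, the function names and the bracket boundaries (multiples of five) are assumptions for illustration, not details given in the text.

```python
import random
from collections import defaultdict


def age_bracket(age, width=5):
    """Return the inclusive (lower, upper) bounds of the bracket containing age."""
    lower = (age // width) * width
    return (lower, lower + width - 1)


def matched_sample(records, target_counts, rng):
    """Randomly draw records per (age bracket, gender) stratum up to a target count."""
    strata = defaultdict(list)
    for record in records:
        strata[(age_bracket(record["age"]), record["gender"])].append(record)
    sample = []
    for key, wanted in target_counts.items():
        pool = strata.get(key, [])
        sample.extend(rng.sample(pool, min(wanted, len(pool))))
    return sample


def myopia_proportion_by_bracket(sample, threshold=-0.50):
    """Proportion of records at or below the myopia threshold in each age bracket."""
    totals = defaultdict(int)
    myopes = defaultdict(int)
    for record in sample:
        bracket = age_bracket(record["age"])
        totals[bracket] += 1
        if record["se"] <= threshold:
            myopes[bracket] += 1
    return {bracket: myopes[bracket] / totals[bracket] for bracket in totals}


# Toy records and target counts, invented for illustration only.
records = [
    {"age": 46, "gender": "F", "se": -1.00},
    {"age": 48, "gender": "F", "se": 0.25},
    {"age": 47, "gender": "M", "se": -0.50},
    {"age": 52, "gender": "F", "se": 1.00},
]
targets = {((45, 49), "F"): 2, ((45, 49), "M"): 1, ((50, 54), "F"): 1}
sample = matched_sample(records, targets, random.Random(0))
print(myopia_proportion_by_bracket(sample))
```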

Spectacle lens dispensing and EMR refractive error distribution
The spectacle lens dataset comprised 141,547,436 lenses from the manufacturer sales records ranging from the year 1998 to 2016. The EMR dataset included 555,528 patient visits ranging from the year 1980 to 2019. Records with incomplete or missing data were excluded from both datasets and only years with complete data were included in the analysis (Fig 1). In total, 134,280,063 spectacle lenses were included, comprising 84,561,994 SV lenses and 49,709,191 ADD lenses. The final EMR dataset was composed of 524,868 patient visits.
Over 97% of spectacle lenses were for delivery within Europe, with Germany accounting for the largest proportion (≈48%) of all lenses delivered. The EMR data included 244,002 unique patients representing 5.1% of the population of the Republic of Ireland [32]. The gender distribution of EMR patient visits was 51.3% female, 34.9% male and not recorded in 13.8% of records. The 28 optometric practices were located all across the Republic of Ireland, representing both rural and urban populations.
The distribution of refractive error within the EMR data and spectacle lens data are presented in Fig 2, including the complete datasets and also segregated according to lens type (SV or ADD lens). Table 1 summarises the descriptive statistics for each distribution.
All distributions demonstrate the classic negatively skewed leptokurtic curve found in most studies of refractive error, with the majority of observations centred close to emmetropia. The only exception to this pattern was the SV spectacle lenses, which were found to have a bimodal distribution with a significant notch apparent at zero spherical equivalent. Table 2 shows the relationship between age and the likelihood of prescribing a reading addition in the form of a contingency table. A summary of the distributions and their statistical relationship is given in Table 3.

Estimating age using reading addition
The relationship between age and the power of the addition given in glasses for the EMR data is shown in Fig 4. This relationship could be accurately fitted to a logistic function with nonlinear regression (estimate = 2.2 D, t = 818.94, p < 0.001). The residual standard error found was 7.56 years. Fig 4 also shows the 95% prediction limits for estimating age if only the spectacle add power is known, as is the case with lens dispensing data. A logistic function was also fitted to the relationship between the probability of being prescribed a reading addition and age (estimate = 42.29 years, t = 653.73, p < 0.001). The residual standard error was 1.73%. This allows estimation of the proportion of individuals at each age likely to require a reading addition (Fig 5). These relationships were then used to infer ages for the ADD lens data. This allowed the generation of sub-populations of a given age for comparison with the EMR, E3 and GHS data. Using these two functions to determine age ranges and by generating probability density functions for each value of reading addition in the EMR data, the level of myopia, hyperopia and astigmatism was calculated for age groups from ≤45 years to ≥80 years for the ADD lens data.
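As a rough guide to what a residual standard error of 7.56 years implies, the width of a 95% prediction interval for an estimated age can be approximated as below. This is a back-of-envelope sketch, not the intervals generated with the R predict function: it takes the reported 7.56 years at face value, assumes normally distributed residuals and ignores uncertainty in the fitted parameters. The predicted age of 60 years is a hypothetical input.

```python
def approx_prediction_interval(predicted_age, residual_se=7.56, z=1.96):
    """Approximate 95% prediction interval: prediction plus or minus z times the residual SE."""
    half_width = z * residual_se
    return (predicted_age - half_width, predicted_age + half_width)


# Hypothetical predicted age of 60 years for some reading addition value.
low, high = approx_prediction_interval(60.0)
print(round(low, 1), round(high, 1))  # 45.2 74.8
```

Under these assumptions an individual lens can only be placed to within roughly ±15 years, which is why the ages are more informative for sub-populations than for single lenses.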

Comparison with E3
The distributions of spherical equivalent refraction in the E3 study and the age matched EMR data were closely matched (χ² = 527, p = 0.29, DoF = 510) with both being negatively skewed leptokurtic distributions (Fig 6).
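The reported p-value can be sanity-checked from the test statistic and degrees of freedom alone. The sketch below uses the Wilson-Hilferty normal approximation to the chi-square distribution, chosen here only because it needs nothing beyond the Python standard library. It is not the routine used in the study, and how the two distributions were binned to give 510 degrees of freedom is not stated in the text. The helper names are hypothetical.

```python
import math


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic for paired observed and expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def chi_square_p_value_approx(statistic, dof):
    """Upper-tail p-value using the Wilson-Hilferty normal approximation."""
    shrink = 2.0 / (9.0 * dof)
    z = ((statistic / dof) ** (1.0 / 3.0) - (1.0 - shrink)) / math.sqrt(shrink)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Statistic and degrees of freedom as reported in the text.
print(round(chi_square_p_value_approx(527, 510), 2))  # 0.29
```

A statistic of 527 on 510 degrees of freedom sits about half a standard deviation above its expected value, consistent with the non-significant result reported.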

PLOS ONE
Age-matched comparison of the level of myopia, hyperopia and astigmatism for EMR relative to E3 data revealed broadly similar distributions across the refractive error types, albeit that the distribution of myopia was lower and hyperopia higher in the EMR data relative to the E3 data (Table 4). The ADD lens data distributions of myopia, hyperopia and astigmatism were all higher but also similar to the age matched E3 data (Table 5).
The E3 reported levels of myopia, hyperopia and high myopia across various age groups were compared to the EMR, ADD lenses and GHS data across the same age groups (Figs 7-9). These figures show the EMR data is the closest match to the E3 data. Confidence intervals for the EMR data overlapped the confidence intervals for the E3 data at 7 age points for myopic refractions (Fig 7), 6 age points for hyperopic refractions (Fig 8) and 12 age points for highly myopic refractions (Fig 9). The ADD lens data, however, provides a closer approximation to the E3 data for total myopia compared to the GHS data, particularly up to age 64 (Fig 7).
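The overlap check at each age point can be illustrated as follows. The text does not state how the confidence intervals were computed, so the Wilson score interval used here is an assumption. The counts are invented and the function names are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a proportion (assumed CI method)."""
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (centre - half, centre + half)


def intervals_overlap(a, b):
    """True when two closed intervals (low, high) share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical myope counts for one age bracket in two datasets.
emr_ci = wilson_interval(successes=310, n=1000)
reference_ci = wilson_interval(successes=330, n=1200)
print(intervals_overlap(emr_ci, reference_ci))
```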

Discussion
Our results indicate that EMR data provides a close approximation to refractive error prevalence values found as part of the E3 study. Age related variation in the proportions of myopes and hyperopes is similar across the EMR and E3 data. Both the EMR and E3 datasets demonstrated high levels of myopia in younger age groups (Fig 7), which supports the findings of other studies demonstrating an increase in myopia prevalence in more recent generations [5,6]. Although the EMR data falls outside the E3 confidence intervals at some points for both the myopia and hyperopia comparisons, this is also true of the GHS data, which was a component study of the E3 dataset, with the EMR data providing a closer match to the E3 than the GHS data. As the confidence intervals indicate the likely position of the mean of the study population, some fluctuation is expected when comparing different study populations.
It was possible to estimate the likely recipient age for every spectacle lens prescription containing a reading addition by using the EMR data. This was achieved based on the observation that a significant majority of EMR patient visits below the age of 40 years were not prescribed an addition while the majority of patient visits above the age of 50 years were prescribed an addition. Along with the presence of an addition, the power of the reading addition was also found to provide a means of estimating a patient's age. These inferences allowed an estimated age to be associated with each spectacle lens containing an addition within the spectacle lens sales dataset. The combination of disparate data sources to provide greater insight is a hallmark of Big Data analysis [33], and in this case allowed a deeper understanding of the usefulness of the spectacle lens sales data as a source of epidemiological data of refractive error.
Having accurate and current information on the prevalence of refractive error is vital to allow health services to plan for the increasing need for optical correction and the increased burden due to the ocular comorbidities [3,34-37] associated with increasing refractive error. Myopia is of particular concern as it is estimated that up to 49.8% of the global population will be myopic by 2050 and 9.8% of those will be highly myopic [4]. The combination of high myopia and increasing age has been found to be a risk factor for vision impairment and blindness [38]. A recent meta-analysis found a significantly increased risk of myopic macular degeneration and retinal detachment in high myopes with reduced visual acuity and worse treatment outcomes in eyes with these conditions [39]. Assessing any change to the prevalence of high myopia within a population is the area of most concern when considering the ocular comorbidities associated with refractive error. EMR data contains refractive error information and patient demographics including age, which can help to determine the population risk of vision impairment. The EMR data provides a good match to the E3 study for high myopia (Fig 9) and as such may be an invaluable method to determine the ongoing risk of vision impairment. While conventional epidemiological studies remain the gold standard, they have some disadvantages. The most reliable studies have large sample sizes allowing their results to be generalized to the entire population. Such sample sizes require significant investment and time to conduct the study, which perhaps explains the relative lack of epidemiological studies of refractive error and significant lack of longitudinal studies of refractive error. This paucity of data also contributes to uncertainty with regards to future projections of myopia prevalence [4].
Where such data is not available, EMR or industrial data may have a useful role as these are increasingly being collected as a matter of routine and can be collected with greater ease and at more regular intervals. It is important to acknowledge that all epidemiological studies suffer from various forms of bias. For example, it is well established that most cross sectional studies suffer from volunteer bias, with volunteers usually from higher socio-economic backgrounds with a higher level of education [40]. Longitudinal studies frequently suffer from loss to follow up, which may induce a bias in the profile of the remaining study population. It is important, therefore, when designing an epidemiological survey of refractive error to attempt to minimise these biases. Big data studies on refractive error will not suffer from the same biases as the data was not collected for the purpose of determining the population burden of refractive error. This type of epidemiological study will, however, have a different set of biases which need to be considered. A frequent criticism of the secondary use of EMR data concerns the lack of access to healthcare of some population cohorts [41] due to a lack of health insurance. As this EMR data has come from a jurisdiction with free access to eyecare which is widely availed of, this should not create a significant bias in our data [42,43]. Less frequent replacement of spectacle lenses by those from lower socio-economic backgrounds may present a more significant issue with regards to the spectacle lens dispensing data. Measurement error can exist as a bias in any epidemiological study but may be well controlled in small studies through standardization of equipment and procedures. In a Big Data study of this nature, this is not possible.
Nevertheless, error rates of subjective refraction in adults are typically low at between 1% and 2%, indicating the vast majority of refractions should be accurate to within ± 0.50 D of the correct refraction [44,45].
There are several limitations to this study that must be considered. In relation to spectacle lens data, demographic information of the individuals purchasing the spectacle lenses is not typically available in industrial datasets. Geographic information is likely to be available, however, which can provide some useful information. Using the EMR data to infer the age of a cohort of the spectacle lens users enhances the usefulness of this data, but the overall lack of demographic information means that further conclusions on subpopulations cannot be drawn. In this study, the spectacle lens data was supplied by one manufacturer. Economic factors and market penetration may have an effect on the background of the consumer choosing lenses from this manufacturer. Industrial data could be biased, for example, to particular socio-economic, ethnic or other demographic subgroups for reasons such as product cost, geographic location and other factors specific to individual manufacturers. Higher educational attainment is associated with both socio-economic status and myopia [6], for example, so the possibility that the oversampling of individuals from particular backgrounds within individual datasets might influence population estimates of refractive error needs to be considered.
Undersampling of emmetropic patients is a more significant issue for the spectacle lens data as these represent spectacle lens sales. This will tend to produce an apparent increased proportion of hyperopic and myopic refractive errors, especially for younger subjects, as observed in this study. It is unlikely that emmetropic patients are purchasing spectacle lenses in significant numbers. This is particularly evident when considering the SV lenses in Fig 3. The notch apparent at zero dioptric power represents the reduction in purchasing of spectacle lenses by this group. It might be expected that the number of zero power lenses would be smaller than was observed, but there are plausible reasons to explain this. In cases of anisometropia one eye may have a zero-power lens when the fellow eye needs correction. In addition, the computation of spherical equivalent may result in zero spherical equivalent power for lenses prescribed to patients with mixed astigmatism. The lack of emmetropes represented within the spectacle lens sales data presents a problem and may explain the poorer match to the E3 study relative to EMR data. This implies that such data may be more representative of the distribution of refractive error within a population above a certain threshold of refractive error. The greatest risks of visual impairment are associated with high levels of myopia [39], and also high levels of hyperopia [3], both categories likely to seek optical correction. Further analysis and modelling may remove the limitation associated with the undersampling of emmetropes and allow the determination of the risk of vision impairment in those using spectacle lenses to correct higher refractive errors.
There are fewer limitations applicable to the EMR data due to the increased demographic detail captured in this data. Undersampling of emmetropic patients is likely to be less problematic for the EMR data, which includes refraction data found as part of a patient's eye examination. Emmetropic patients are still likely to attend routine eye examinations for the purposes of screening for common ocular pathologies such as glaucoma and cataract [46], although some undersampling of young emmetropic patients may still have occurred. Importantly, EMR data is likely to be highly representative of the older population given the almost universal need for optical correction as presbyopia begins to manifest as a problem, even for emmetropes and low hyperopes who did not previously need correction. This is particularly the case in most countries in Europe where subsidised eye examinations are accessible to the majority of the population [47]. The close match of the EMR and E3 data observed herein suggests that the EMR is representative of the population at large.
In this EMR dataset, it was not possible to tell what type of refraction had been performed to reach the refractive error prescribed. Cycloplegic refraction is performed to avoid the errors in refraction that can be induced by accommodation in children and the use of cycloplegia is considered the most appropriate method to assess refractive error for research purposes [48]. Although it is unknown how many of these refractions have been performed with the aid of cycloplegia, a significant number of epidemiological surveys on refractive error have been carried out without the use of cycloplegia [7]. It has been found that accommodation mostly affects the determination of refractive error in children and has little impact on adults [49,50], particularly older adults [51]. The technique of refraction used, therefore, should have little impact on the primarily adult dataset used herein.

Conclusion
The prevalence of refractive error within a population can be estimated using EMR data in the absence of population surveys. Results from EMR data also allow age to be inferred from the addition in a spectacle lens. Industry derived sales can then be used to provide insights on the epidemiology of refractive errors in a population over certain age ranges. EMR and industrial data may therefore provide a fast and cost-effective surrogate measure of refractive error distribution that can be used for future health service planning purposes.