Determining occupational accidents baseline ratios by considering a synthetic population: The case of Spain

Jordi Olivella Nadal; Gema Calleja Sanz; Ignacio Fuentes Ribas; Pedro Rodriguez Mondelo

doi:10.1371/journal.pone.0294707

Abstract

In most countries, a government agency or collaborating organization gathers information on occupational accidents. Comparisons based on a single factor such as autonomous community, activity sector or others, often leads to contradictory conclusions. The use of this information for comparison is not immediate because the different characteristics considered give place to different possible comparisons. The elaboration of a single baseline for each set of characteristics is addressed. The method proposed comes from the data available in Spain but could be applied to other cases. The method consists of: (1) selecting factors–those selected are age, sex, autonomous community and activity; (2) the generation of a synthetic population based on data from a survey and general proportions by applying the Optimal Representative Sample Weighting (rsw); and (3) the prediction of the accidents ratio for each set of characteristic by using a XGBoost decision trees ensemble. The results confirm the appropriateness of the method.

Citation: Olivella Nadal J, Calleja Sanz G, Fuentes Ribas I, Rodriguez Mondelo P (2023) Determining occupational accidents baseline ratios by considering a synthetic population: The case of Spain. PLoS ONE 18(11): e0294707. https://doi.org/10.1371/journal.pone.0294707

Editor: Santiago Gascón, Universidad de Zaragoza, SPAIN

Received: September 27, 2022; Accepted: November 3, 2023; Published: November 22, 2023

Copyright: © 2023 Olivella Nadal et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Data cannot be shared publicly because of it has been provided by the Ministerio de Trabajo y Economía Social. However, any researcher can ask the Ministerio the data for verification or further research. The institution that provides the data and contact information are included in the text (Secretaría General Técnica. Subdirección General de Estadística y Análisis Sociolaboral, estadistica@meyss.es).

Funding: The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

1. Introduction

Past accidents provide a strong indicator of future risk. The ultimate objective of Occupational Health and Safety (OHS) is to ensure safe and healthy working conditions by preventing injuries in a workplace. Therefore, we need to identify the risks to address and assess the importance of their possible outcomes and probability. The priorities adopted will be determined to a good extent by the analysis of the accidents occurred in the past.

More specifically, the assessment of the volume of accidents in certain industry, activity or organisation relies, to a good extent, on the details regarding the accidents both within the same context and in different scenarios. When assessing accident risk for a certain facility, the accidents occurred in this facility are the main source of information. However, often the data available is not enough to obtain significant conclusions. Even when it is, data from other facilities or workers’ groups is also useful and commonly considered. In fact, after calculating an incidence ratio of accidents, it is very common to compare it with other accident ratios. For example, assume that a company accidents ratio is 2,000 –in this paper, we consider as accidents ratio the number of accidents per 100,000 workers and year. If the industry average ratio is 1,000, this is definitely something to be worried about. These comparisons are useful to identify if the situation in an industry or a company is due to general circumstances or can be attributed to particular factors.

In most countries, a government agency or collaborating organization gathers information on accidents and provides information including aspects such as personal characteristics of the workers and the company and some details on the accident itself [1]. Based on this information, summary data and accidents ratios are usually made publicly available and can be used as baseline for comparison. However, the information provided is often descriptive and does not immediately translate to reliable comparisons, as it is described in the following example.

Let us consider the case of a young population (16–34 years), female, working in the machinery industry in a specific autonomous community and country–common characteristics to consider when analysing general accident ratios. Let us also assume that the accident ratio of this group is, for example, 4,679. How can we know if this ratio is high, low or average? We can certainly take into consideration the average of the industry, for example, but other characteristics could also be taken into account for comparison, as presented in Table 1 This example corresponds to real data of the case presented in this paper, but exact details are not included because the objective of presenting the example is only to clarify the situation that justifies this research.

Download:

Table 1. Example characteristics/accidents ratio.

https://doi.org/10.1371/journal.pone.0294707.t001

There is no clear reason to select one or the other of the averages. A possibility would be to take a value somewhere in the middle but, but it is not that straight forward. We can see, also in Table 1, that the average ratio for a narrowed group, 16-34/female value, is much closer to female average value than to 16–34 average value. Therefore, to consider an average of different values is not a reasonable solution. For a more useful comparison, we would need baselines globally reflecting the factors present in the information at a country level. Obtaining these baselines is the general objective of this paper.

1.1 The case of Spain

The analysis proposed will be applied to the specific case of Spain. Although the problem addressed is general, the information available depends on the specific situation and, thus, some aspects of the solution can differ. This specificity in the data available depends, firstly, on how the information is gained and stored, and, secondly, on the access of the analyst to this information.

In the case of Spain, detailed information on the accidents occurred is gathered through a centralised system of record [2]. In line with other European countries, detailed data from the accidents reports is a mandatory requirement and included in a central register [1]. Microdata of this register are made available to researchers under request to Secretaría General Técnica. Subdirección General de Estadística y Análisis Sociolaboral (estadistica@meyss.es). Additionally, detailed information on workers working at a certain moment is also gathered by public authorities through the registers on workers affiliated to the national Social Security system [3]. In this case, only a summarised version of the information is available.

1.2 General approach

When dealing with personal data, often data is available but it is not possible to use it due to privacy concerns. This is a very strong barrier for research on human health. The analysis of occupational accidents can be considered as one of the areas of human health research and is certainly impacted by this privacy concerns. A usual approach is the anonymisation of data, in other words, a transformation that makes it impossible to identify specific information on certain person or event. The anonymising of data can make it less detailed and, in consequence, can cause loss of information. An alternative approach is the generation of synthetic data.

Synthetic data is data that differs from reality while maintaining the same behaviour. We can consider two main types of synthetic data:

Synthetic data prepared by someone having the real data. In this case, the real data cannot be used because of privacy concerns. However, the organisation with access can generate synthetic data with the same distribution that can be widely spread. The results of the analysis performed is expected to be the same as if obtained by using the real data.
Synthetic data when complete real data is not available. Often the data available is only partial because it has not been registered or because it is not accessible to the researcher.

The case addressed here belongs to the second possible situation. Naturally, the method to apply to generate the synthetic data and the quality of the results depends on the partial data available. In the Spain’s case, a complementary source of information can be accessed to partly compensate the lack of detailed data on affiliated workers by generating a synthetic population. This source is EPA (Encuesta de Población Activa), a wide and nationwide survey on job situation performed systematically [4]. The microdata of this survey is publicly available. Mixing EPA data and summary data on affiliated workers will make it possible to generate a synthetic population representative of the population of workers in Spain.

The enhancement of data from a survey based on population summary data to obtain health phenomena ratios can be found in literature [5, 6]. At a more specific level, data from the EPA survey has been previously used to calculate accident ratios [7]. In these works, since the analysis was performed at an aggregated level, the generation of a synthetic population was not necessary.

By considering the data on accidents and the synthetic population of workers, we can determine occupational accidents baselines that make useful comparisons possible. To do so, we will establish the occupational accidents that we can expect under certain circumstances reflected in general statistics. This approach, using predictions to define accidents occurrence baselines, has been previously applied to road traffic accidents [8].

Technically, we will be making a prediction. However, the objective is not exactly this. In the example presented in Table 1, the features to be considered are age, sex, autonomous community, and activity. We cannot expect to predict precisely the occupational accidents ratio of a workplace only based on these parameters. Our aim is to establish an expected accidents ratio by taking into account the features considered–that is to say, we want to define baselines for comparison. The baselines should be useful both at regional, industry or company level.

The problem mentioned consists on predicting a figure and, as such, it is a regression model. In this paper, a decision-trees ensemble approach is considered. The method proposed is applied to data from Spain for the year 2019. This year has been selected because the years, 2020 and 2021, were too strongly influenced by the COVID pandemic. A regular year was considered more appropriated to check the functioning of the method proposed.

Literature on occupational accidents is wide and multifaceted [9]. Data on real accidents occurrence is one of the main ways to identify risk, and has been frequently used [10]. In particular, analyses at regional level have been addressed by considering, for example, obtaining homogeneous and comparable measures [11] and the ratios at country and activity level [12]. On the other hand, the accident occurrence in Spain at country level has also been considered, for example, to determine the factor’s influence [13], evolution over time [7], and the influence of the economic cycle [14]. However, to the best knowledge of the authors, the approach presented here is a novel contribution to literature both from the point of view of the generation of a synthetic population and for obtaining risk baselines.

The continuation of the paper is structured as following. Section 2 presents the proposed methodology, section 3 is devoted to the application of the method in the specific case of Spain, and discussion and conclusions are covered in section 4.

2. Problem definition and methodology

2.1 Problem definition

We consider a situation where the information available includes:

Disaggregated data on occupational accidents that took place in a certain period and area, including information on some characteristics of the accident, such as the worker involved, the activity they were performing, the activity of the company and the geographical area.
The total number of workers working in the period and area considered.
For a list of characteristics of workers and the activity they perform, the proportion of workers in the period and area considered with each of the possible values of these features–e.g., if we consider the feature ‘industry’ we know the proportion of workers in each industry.
A sample of the workers working in the period and area considered, with information on some demographics and the activity the workers perform. The sample was obtained from a survey.

It is possible that the data on demographics, activity and accidents of the different sources used are not exactly the same. However, we can expect some variables to be totally or partially common. The analyses performed refers to characteristics that are present in the different sources available–or can be obtained from the information they provide. Based on the information mentioned, our objective is to define baselines for comparison of occupational accidents ratios. To do so, and as an intermediary objective, a synthetic population will be generated.

2.2 Generation of a synthetic population

Since detailed data on workers working in the period under analysis is not available, we opt for generating a population with the same or similar characteristics–i.e., a synthetic population. To do so, in the problem defined we assume that we have a sample of the population, obtained from a survey, and the total number of workers. If we were confident that the sample reflects precisely the characteristics of the population the generation of the synthetic population would be immediate–in effect, we would divide the number of instances of the population by the number of instances of the sample and we would increase the number of instances of each specific type by this same proportion. However, most of the times samples obtained from surveys are biased because the trend to answer surveys of different groups of people is not the same.

The situation is identical to the problem addressed by raking or RIM weighting, a sample-weighting algorithm [15]. This problem corresponds to surveys with a biased sample of the population. In raking, weights are calculated for each instance in order to adjust the influence of the answers of each type of respondent to their representation in the population. Raking is considered the standard sample-weighting method [16]. Let us assume that the sample is made up of a 60% of men and a 40% of women. Let us also assume that we know that the proportion of men and women is 50%. We could adjust by multiplying the answers of each man by 0.8 (50/60) and the answers of each woman by 1.25 (50/40). The figures 0.8 and 1.2 are the weights. However, real situations are not that simple.

Let us consider a two-feature case. A simple example of this type of situation is presented in Tables 2 and 3. Table 2 shows the proportions of the variables sex and age in a population, while Table 3 shows some instances and their corresponding sex and age. A total of 8 instances of a sample are presented. In this case, the calculation of the weights is not as direct as in the case with a single instance, as we will present bellow. Now, we will explain how we propose to use the result of this problem to generate a synthetic population.

Download:

Table 2. Sample-weighting example: Population proportions.

https://doi.org/10.1371/journal.pone.0294707.t002

Download:

Table 3. Sample-weighting example: Instances and results.

https://doi.org/10.1371/journal.pone.0294707.t003

Assume that the weights presented in Table 3 (column Weight), are the result of the sample-weighting problem with this data–the sum of the proportions of the different kinds of instances verify the proportions of Table 2, as we can see in the column Percentage. If we were analysing a survey, we would multiply the answer of each person by their corresponding weight to obtain the global results. In this case, we are only interested in the composition of the population and not in weighting answers. Therefore, we will use the weights in a different way. Let us assume now that the population size is 1,000. If we multiply the sum of weights of each kind of instances by 1,000, we obtain a set of numbers of instances in the population satisfying the proportions established–as presented in Table 2, column Synthetic.

As mentioned, the problem addressed corresponds to the classical raking or RIM weighting method. However, this classical method is not fully appropriate in this case. The raking method does not give place to a single possible solution and does not take into account all the information available. To address this limitation, different solutions have been proposed [17]. Here we opt for Optimal Representative Sample Weighting (rsw) [18], a completely different solution method. In this method, conditions are established not only in relation to the characteristics of the population but also in relation to the weights. The objective is that the weights are as similar as possible and, thus, the characteristics of the sample are respected as much as possible.

In its most general sense, the method consists of a loss function measuring the fitting to some conditions of the population and a regularisation term, which includes condition on the weights. The equation that follows has to be minimised subject to where f^des are values reflecting the characteristics of the population, w the weights and F a function transforming the weights into the values f to be compared with f^des.

The strict Optimal Representative Sample Weighting method imposes that the proportions are fulfilled. To do so, the loss function is defined as: and the condition on weights can typically be the maximisation of the entropy of the weights where N is the number of elements of the sample.

In the specific case analysed here, we consider the following variables:

Download:

https://doi.org/10.1371/journal.pone.0294707.t004

Then, we define the function F as the matrix: and the desired values as

We will use the weights of the instances of the sample to generate a synthetic population. To do so, we have to define C, the set of possible combinations of values of the characteristics where is the value of the characteristic i (i = 1..I) of the possible combination l (l = 1..L), and

The variable m_l (l = 1..L), which represents the number of elements to include in the synthetic population with characteristics l, will be calculated with the formula: where N is the number of instances in the population.

2.3 Baseline determination

To determine the baselines, we define two additional variables,

Download:

https://doi.org/10.1371/journal.pone.0294707.t005

To calculate the ratio of accidents we make the following calculations–that is to say, we divide the number of accidents with values of characteristics c_ij by the number of instances of the synthetic population with the same values of characteristics:

Then, we can define a prediction model with c_il, the value of the characteristics, as explanatory variables, and r_l, ratio of accidents, as explained variable. We will predict the ratios of accidents based on the characteristics considered. These predictions can be used as baselines with which to compare real ratios.

To perform the prediction, a method based on decision trees seems appropriate in order to capture the complex relations between the different variables that could exist. Given the potentially high number of variables and possible values, a decision tree ensemble algorithm is applied and, in particular, the XGBoost, which has been proved very efficient in predicting accidents risk based on context variables [19–21]. The appropriated parameters of the algorithm are obtained through a wide test of parameters with different training/test sets. A linear-based forecast will be performed for comparison.

Finally, the forecast for each combination of characteristics is performed by using the parameters selected and by considering the average of the results obtained from different training/test sets, to prevent a bias due to the training/test set used. The training set includes all the instances available. However, to prevent overfitting, the instance which ratio is being predicted and the most similar ones are excluded.

3. Application of the method to the case

3.1 Selection of variables and data transformation

The analysis performed takes into account three sources of information on Spanish occupational accidents and workers for the year 2019. The data, source, kind of data and number of registers are presented in Table 4. Only features present in the three sources can be used. The variables selected are age, sex, activity and autonomous community. Other features in relation to Social Security regimes and self-employment condition could also be taken into account. However, it has been judged that they depend largely on legal regulations and could give way to doubtful conclusions.

Download:

Table 4. Data sources.

https://doi.org/10.1371/journal.pone.0294707.t006

The aggregation of the data is not the same in the three sources mentioned. The least aggregated measure had necessarily to be taken. In particular, the EPA uses a 10-blocks activity classification (Table 5), which is also used here. Additionally, the data on the ages of the affiliated workers is presented in blocks. For this reason, the variable reflecting ages has to be in blocks and, thus, categorical. The other variables are categorical by nature. Finally, to consider a limited number of possible values of the variables maintaining the most relevant information has been considered preferable for the application of the methods proposed. Because of this, three age blocks are considered and Autonomous Community is preferred to provinces. Finally, the age data is aggregated by dividing the most common working age range (20 to 64 years) into three blocks. Additionally, ages under 20 years and over 64 years are included in the first and third blocks obtained, respectively. The variables and blocks are showed in Table 6.

Download:

Table 5. Activity classification based on CNAE-2009, used by EPA.

https://doi.org/10.1371/journal.pone.0294707.t007

Download:

Table 6. Variables selected.

https://doi.org/10.1371/journal.pone.0294707.t008

The data from the EPA survey needs some transformations in order to obtain the selected data, as presented in Table 7. The field TRAREM is used to release the cases of people not working during the past week. In effect, as mentioned before, the survey includes people that are working when they are being surveyed. Additionally, the field PROEST is blank when the worker works outside Spain, and was used to release these cases. The remaining registers made counting the number of workers corresponding to each age, sex, activity and Autonomous Community block surveyed by EPA in 2019, possible.

Download:

Table 7. Transformation of data from EPA 2019.

https://doi.org/10.1371/journal.pone.0294707.t009

The CNAE-2009 activity codes are used by the Spanish Instituto Nacional de Estadística. However, until the third level of classification the CNAE-2009 classification follows NACE Rev. 2. NACE is the “statistical classification of economic activities in the European Community” and is the subject of legislation at the European Union [22]. Additionally, the two autonomous cities, Ceuta and Melilla, have been considered together due to the similar geographic characteristics.

The summary data on affiliated workers also requires some processing. The specific tables on workers affiliated to the national Social Security system considered [4] are shown in Table 8. The summary data on two variables, Age and Sex, is showed in Table 9 to illustrate the kind of information used.

Download:

Table 8. Tables with data on affiliation considered.

https://doi.org/10.1371/journal.pone.0294707.t010

Download:

Table 9. Summary data on age and sex.

https://doi.org/10.1371/journal.pone.0294707.t011

On the other hand, the information on accidents also requires some modifications. The fields considered are shown in Table 10. The name and description of each filed is included, altogether with the D field number. Delta is a software system used in most of the Spanish autonomous communities to register occupational accidents data.

Download:

Table 10. Fields of the data on accidents considered.

https://doi.org/10.1371/journal.pone.0294707.t012

The field “Type of accident” makes it possible to identify relapses, which are discarded. Additionally, following the criteria of the official Spanish reports [23], road accidents when travelling to or from the workplace (“in itinere”) are discarded. The field “Place of accident” makes that visible. Finally, the fields on economic activity, province and age need to be transformed into the common selected variables indicated in Table 6.

3.2 Synthetic population generation

Following the methodology proposed in subsection 2.2, a synthetic population was generated based on the data selected and appropriately transformed. The rsw method is implemented by using the library developed by the same authors proposing the method [24]. The hyperparameters, shown in Table 11, are standard. Although the maximum number of iterations established was 10⁶, the calculation process ended at iteration 7,800 because the value of the parameters of control were reached. The entropy of the solution is 12.204, whereas the entropy of the uniform distribution is 12.357 –thus, original values are highly respected.

Download:

Table 11. rsw method hyperparameters.

https://doi.org/10.1371/journal.pone.0294707.t013

3.3 Baselines determination

The next step is the determination of baselines in accordance with the characteristics of the population considered and the ratio of accidents, which are determined by following the procedure presented in subsection 2.3. A numerical test was performed to establish the most appropriated hyperparameters. The fix and the selected parameters are shown in Table 12. A total of 100 tests with different train and test subsets were performed for each possible combination of parameters. The train and test subsets were randomly generated with a 33% testing size.

Download:

Table 12. XGBoost hyperparameters.

https://doi.org/10.1371/journal.pone.0294707.t014

The main criteria for selecting a set of hyperparameters was the minimum score on the test data of 10 tries. The 100 hyperparameter sets with the best minimum score have been chosen. To select one of them, a little difference between the average of accuracies of the 10 training sets and the average of accuracies of the corresponding test sets was considered. A low depth of the trees and a relatively low number of estimators have also been prioritised. All this criteria are intended to prevent overfitting. The training set average score, the test set average score and the minimum test set score were respectively 0.8425, 0.8022 and 0.7564 –the selected set is shown in Table 12.

Next, the baselines were calculated. For each instance, the training set used is the whole set, excluding the instances from the same Autonomous Community and with the same activity. If we didn’t do it this way, the forecasting could be more influenced by the values of the most similar instances than for the factors considered. Our objective here is not to predict the accidents ratio, but to know its expected value given a set of factors. The coefficient of determination of the baselines obtained in relation to the ratios was 0.7560. A linear regression was also performed for comparison purposes and the coefficient of determination was 0.6243.

4. Discussion of results and conclusions

The performance of the procedure proposed has been evaluated by using the scores corresponding to the specific methods used. Favourable results were obtained, as indicated in sections 3.2 and 3.3. As a result of the analysis performed, some intermediate results and an outcome, the baselines, have been obtained. A list of combinations of features has been randomly selected in order to be used as an example. The features selected, the intermediate results and the outcomes obtained are shown in Table 13.

Download:

Table 13. Results example (1).

https://doi.org/10.1371/journal.pone.0294707.t015

It is clear from these examples that, in some cases, the composition of the synthetic population is clearly different from the composition of the EPA sample. In effect, while the instance (b) of the table has 1,108 instances in EPA and 18,082 in the synthetic population, the instance (c) has 79 cases in EPA and 13,815 in the synthetic population. This is not surprising since rsw, the method applied, gives full priority to the fitting with the population summary data over the fitting with the sample data.

When reviewing the data results, we may be tempted to compare the accident ratios with the baselines and draw conclusions from it. Nonetheless, we think that this would not be correct. The figure on the number of accidents is obtained from real data and it is exact. The accident ratio, however, is estimated. The rsw method is thought to obtain the best possible fit between the synthetic population and the population itself. The conditions imposed by the method, however, do not guarantee good individual estimations, as showed in the following example.

In the example presented in Table 14, instance (a), we can see that there is a big difference between accident ratio and baseline. As already established, we cannot directly deduce anything from it. First, we notice that the baseline corresponds with what we could expect. In effect, instances with different age blocks (b and c) and autonomous communities (d) have similar baselines, but when considering same age, sex and all the activities in the same autonomous community (e), the baseline is lower. The remarkable figure here is the accident ratio. The high accident ratio can be explained because of the many accidents occurring, but also because of a bad estimation of the number of workers. The number of workers is deduced from the general information on affiliation and from the EPA poll. If the number of young women working in this autonomous community and activity is much higher than expected from the general data and very few of them answered the survey, their number can be underestimated. Conversely, the baseline reflects the information of the whole ensemble of the information available and is a trustworthy reference.

Download:

Table 14. Results example (&2).

https://doi.org/10.1371/journal.pone.0294707.t016

It has to be mentioned that the results obtained refers to sets of instances with specific values of age, sex, autonomous community and activity variables, but can be used to perform a less detailed analysis. The instance (e) from Table 14, for example, refers to workers with age in block E1, female, working in Aragon and performing any activity. The accident ratio was obtained by adding the accidents and workers corresponding to the different activities and dividing them. Additionally, the baseline was calculated in the same way by taking the numbers of accidents corresponding to the baseline of each activity.

Some limitations of this research work have to be mentioned. It has not been possible to test the homogeneity of the blocks of activity used, which correspond to the EPA survey. The availability of more detailed information would make it possible to confirm the appropriateness of the blocks or to adopt different ones. Additionally, the application of the proposed method to a single country and year could also be seen as a limitation.

Future research could overcome the two limitations mentioned. Additionally, data from several years could be analysed and the time trend influence be considered. Different criteria for aggregating data could also be considered and assessed. On the other hand, the application of the method to the analysis of the occupational accidents in other countries would require in most cases an adaptation of the model. The work could also be extended in other directions. In particular, more summary data on the population could be considered. Other factors, such as being self-employed, could also be included in the analysis. Other prediction algorithms or ensembles could be considered as well. Finally, the availability of the microdata of the population would dramatically increase the trustworthiness of the results.

References

1. Jacinto C, Aspinwall E. A survey on occupational accidents’ reporting and registration systems in the European Union. Safety Science. 2004 Dec 1;42(10):933–60.
- View Article
- Google Scholar
2. Ministerio de Trabajo y Economía Social. Estadística de Accidentes de Trabajo (ATR). Informe metodológico estandarizado. 2021. Available from: https://www.mites.gob.es/estadisticas/eat/Informe_Metodologico_Estadarizado_ATR.pdf
- View Article
- Google Scholar
3. Ministerio de Trabajo y Economía Social. Afiliación de Trabajadores al Sistema de la Seguridad Social [Internet]. 2022 [cited 2022 Sep 4]. Available from: https://www.mites.gob.es/es/estadisticas/mercado_trabajo/afi/welcome.htm
- View Article
- Google Scholar
4. Instituto Nacional de Estadística. INEbase. Encuesta de población activa [Internet]. Instituto Nacional de Estadística. 2022 [cited 2022 Sep 4]. Available from: https://www.ine.es/dyngs/INEbase/operacion.htm?c=Estadistica_C&cid=1254736176918&menu=resultados&secc=1254736195129&idp=1254735976595#!tabs-1254736030639
5. Turkiewicz A, Nilsson PM, Kiadaliri A. Probabilistic Quantification of Bias to Combine the Strengths of Population-Based Register Data and Clinical Cohorts-Studying Mortality in Osteoarthritis. Am J Epidemiol. 2020 Dec 1;189(12):1590–9. pmid:32639513
- View Article
- PubMed/NCBI
- Google Scholar
6. Bonander C, Nilsson A, Björk J, Blomberg A, Engström G, Jernberg T, et al. The value of combining individual and small area sociodemographic data for assessing and handling selective participation in cohort studies: Evidence from the Swedish CardioPulmonary bioImage Study. PLoS One. 2022 Mar 1;17(3 March 2022). pmid:35259202
- View Article
- PubMed/NCBI
- Google Scholar
7. García Mainar I, Montuenga Gómez V. Causas de los accidentes de trabajo en España: análisis longitudinal con datos de panel. Gac Sanit. 2009 May 1;23(3):174–8.
- View Article
- Google Scholar
8. Himes S, Gayah V, Gooch J, Read S. Estimating Baseline Numbers for Safety Measure Target Setting in Virginia. https://doi.org/101177/0361198120920632 [Internet]. 2020 Jun 2 [cited 2022 Sep 15];2674(8):523–35. Available from: https://journals.sagepub.com/doi/abs/10.1177/0361198120920632
- View Article
- Google Scholar
9. Jilcha K, Kitaw D. A literature review on global occupational safety and health practice & accidents severity. International Journal for Quality Research. 2016;10(2):279–310.
- View Article
- Google Scholar
10. Marhavilas PK, Koulouriotis D, Gemeni V. Risk analysis and assessment methodologies in the work sites: On a review, classification and comparative study of the scientific literature of the period 2000–2009. J Loss Prev Process Ind. 2011 Sep 1;24(5):477–523.
- View Article
- Google Scholar
11. Hämäläinen P, Takala J, Saarela KL. Global estimates of occupational accidents. Saf Sci. 2006 Feb 1;44(2):137–56.
- View Article
- Google Scholar
12. Hämäläinen P, Leena Saarela K, Takala J. Global trend according to estimated number of occupational accidents and fatal work-related diseases at region and country level. J Safety Res. 2009 Jan 1;40(2):125–39. pmid:19433205
- View Article
- PubMed/NCBI
- Google Scholar
13. Gallego V, Sánchez A, Martón I, Martorell S. Analysis of occupational accidents in Spain using shrinkage regression methods. Saf Sci. 2021 Jan 1;133:105000.
- View Article
- Google Scholar
14. Fernández-Muñiz B, Montes-Peón JM, Vázquez-Ordás CJ. Occupational accidents and the economic cycle in Spain 1994–2014. Saf Sci. 2018 Jul 1;106:273–84.
- View Article
- Google Scholar
15. Sharot T. Weighting survey results. Vol. 28, Journal of the Market Research Society. 1986.
16. Mercer A, Lau A, Kennedy C. How different weighting methods work [Internet]. Pew Research Center Methods. 2018 [cited 2022 Aug 30]. Available from: https://www.pewresearch.org/methods/2018/01/26/how-different-weighting-methods-work/
- View Article
- Google Scholar
17. Kolenikov S. Calibrating Survey Data using Iterative Proportional Fitting (Raking). The Stata Journal: Promoting communications on statistics and Stata. 2014;14(1).
- View Article
- Google Scholar
18. Barratt S, Angeris G, Boyd S. Optimal representative sample weighting. Stat Comput. 2021 Feb 1;31(2).
- View Article
- Google Scholar
19. Krasniqi D, Bardet JM, Rynkiewicz J. Parametric and XGBoost Hurdle Model for estimating accident frequency [Internet]. 2022. Available from: https://hal.archives-ouvertes.fr/hal-03739838
- View Article
- Google Scholar
20. Ma J, Ding Y, Cheng JCP, Tan Y, Gan VJL, Zhang J. Analyzing the Leading Causes of Traffic Fatalities Using XGBoost and Grid-Based Analysis: A City Management Perspective. IEEE Access. 2019;7:148059–72.
- View Article
- Google Scholar
21. Fissler T, Lorentzen C, Mayer M. Model Comparison and Calibration Assessment: User Guide for Consistent Scoring Functions in Machine Learning and Actuarial Practice. 2022 Feb 25; Available from: http://arxiv.org/abs/2202.12780
- View Article
- Google Scholar
22. European Commission (2008) NACE Rev. 2 –Statistical classification of economic activites in the European Community
- View Article
- Google Scholar
23. Instituto Nacional de Seguridad y Salud en el Trabajo. Informe anual de accidentes de trabajo en España 2019. 2020.
- View Article
- Google Scholar
24. Barratt S, Angueris G, Boyd S. rsw. Optimal representative sample weighting (rsw) in Python [Internet]. Github; 2020 [cited 2022 Aug 28]. Available from: https://github.com/cvxgrp/rsw.
- View Article
- Google Scholar

[ref1] 1. Jacinto C, Aspinwall E. A survey on occupational accidents’ reporting and registration systems in the European Union. Safety Science. 2004 Dec 1;42(10):933–60.
View Article
Google Scholar

[2] View Article

[3] Google Scholar

[ref2] 2. Ministerio de Trabajo y Economía Social. Estadística de Accidentes de Trabajo (ATR). Informe metodológico estandarizado. 2021. Available from: https://www.mites.gob.es/estadisticas/eat/Informe_Metodologico_Estadarizado_ATR.pdf
View Article
Google Scholar

[5] View Article

[6] Google Scholar

[ref3] 3. Ministerio de Trabajo y Economía Social. Afiliación de Trabajadores al Sistema de la Seguridad Social [Internet]. 2022 [cited 2022 Sep 4]. Available from: https://www.mites.gob.es/es/estadisticas/mercado_trabajo/afi/welcome.htm
View Article
Google Scholar

[8] View Article

[9] Google Scholar

[ref4] 4. Instituto Nacional de Estadística. INEbase. Encuesta de población activa [Internet]. Instituto Nacional de Estadística. 2022 [cited 2022 Sep 4]. Available from: https://www.ine.es/dyngs/INEbase/operacion.htm?c=Estadistica_C&cid=1254736176918&menu=resultados&secc=1254736195129&idp=1254735976595#!tabs-1254736030639

[ref5] 5. Turkiewicz A, Nilsson PM, Kiadaliri A. Probabilistic Quantification of Bias to Combine the Strengths of Population-Based Register Data and Clinical Cohorts-Studying Mortality in Osteoarthritis. Am J Epidemiol. 2020 Dec 1;189(12):1590–9. pmid:32639513
View Article
PubMed/NCBI
Google Scholar

[12] View Article

[13] PubMed/NCBI

[14] Google Scholar

[ref6] 6. Bonander C, Nilsson A, Björk J, Blomberg A, Engström G, Jernberg T, et al. The value of combining individual and small area sociodemographic data for assessing and handling selective participation in cohort studies: Evidence from the Swedish CardioPulmonary bioImage Study. PLoS One. 2022 Mar 1;17(3 March 2022). pmid:35259202
View Article
PubMed/NCBI
Google Scholar

[16] View Article

[17] PubMed/NCBI

[18] Google Scholar

[ref7] 7. García Mainar I, Montuenga Gómez V. Causas de los accidentes de trabajo en España: análisis longitudinal con datos de panel. Gac Sanit. 2009 May 1;23(3):174–8.
View Article
Google Scholar

[20] View Article

[21] Google Scholar

[ref8] 8. Himes S, Gayah V, Gooch J, Read S. Estimating Baseline Numbers for Safety Measure Target Setting in Virginia. https://doi.org/101177/0361198120920632 [Internet]. 2020 Jun 2 [cited 2022 Sep 15];2674(8):523–35. Available from: https://journals.sagepub.com/doi/abs/10.1177/0361198120920632
View Article
Google Scholar

[23] View Article

[24] Google Scholar

[ref9] 9. Jilcha K, Kitaw D. A literature review on global occupational safety and health practice & accidents severity. International Journal for Quality Research. 2016;10(2):279–310.
View Article
Google Scholar

[26] View Article

[27] Google Scholar

[ref10] 10. Marhavilas PK, Koulouriotis D, Gemeni V. Risk analysis and assessment methodologies in the work sites: On a review, classification and comparative study of the scientific literature of the period 2000–2009. J Loss Prev Process Ind. 2011 Sep 1;24(5):477–523.
View Article
Google Scholar

[29] View Article

[30] Google Scholar

[ref11] 11. Hämäläinen P, Takala J, Saarela KL. Global estimates of occupational accidents. Saf Sci. 2006 Feb 1;44(2):137–56.
View Article
Google Scholar

[32] View Article

[33] Google Scholar

[ref12] 12. Hämäläinen P, Leena Saarela K, Takala J. Global trend according to estimated number of occupational accidents and fatal work-related diseases at region and country level. J Safety Res. 2009 Jan 1;40(2):125–39. pmid:19433205
View Article
PubMed/NCBI
Google Scholar

[35] View Article

[36] PubMed/NCBI

[37] Google Scholar

[ref13] 13. Gallego V, Sánchez A, Martón I, Martorell S. Analysis of occupational accidents in Spain using shrinkage regression methods. Saf Sci. 2021 Jan 1;133:105000.
View Article
Google Scholar

[39] View Article

[40] Google Scholar

[ref14] 14. Fernández-Muñiz B, Montes-Peón JM, Vázquez-Ordás CJ. Occupational accidents and the economic cycle in Spain 1994–2014. Saf Sci. 2018 Jul 1;106:273–84.
View Article
Google Scholar

[42] View Article

[43] Google Scholar

[ref15] 15. Sharot T. Weighting survey results. Vol. 28, Journal of the Market Research Society. 1986.

[ref16] 16. Mercer A, Lau A, Kennedy C. How different weighting methods work [Internet]. Pew Research Center Methods. 2018 [cited 2022 Aug 30]. Available from: https://www.pewresearch.org/methods/2018/01/26/how-different-weighting-methods-work/
View Article
Google Scholar

[46] View Article

[47] Google Scholar

[ref17] 17. Kolenikov S. Calibrating Survey Data using Iterative Proportional Fitting (Raking). The Stata Journal: Promoting communications on statistics and Stata. 2014;14(1).
View Article
Google Scholar

[49] View Article

[50] Google Scholar

[ref18] 18. Barratt S, Angeris G, Boyd S. Optimal representative sample weighting. Stat Comput. 2021 Feb 1;31(2).
View Article
Google Scholar

[52] View Article

[53] Google Scholar

[ref19] 19. Krasniqi D, Bardet JM, Rynkiewicz J. Parametric and XGBoost Hurdle Model for estimating accident frequency [Internet]. 2022. Available from: https://hal.archives-ouvertes.fr/hal-03739838
View Article
Google Scholar

[55] View Article

[56] Google Scholar

[ref20] 20. Ma J, Ding Y, Cheng JCP, Tan Y, Gan VJL, Zhang J. Analyzing the Leading Causes of Traffic Fatalities Using XGBoost and Grid-Based Analysis: A City Management Perspective. IEEE Access. 2019;7:148059–72.
View Article
Google Scholar

[58] View Article

[59] Google Scholar

[ref21] 21. Fissler T, Lorentzen C, Mayer M. Model Comparison and Calibration Assessment: User Guide for Consistent Scoring Functions in Machine Learning and Actuarial Practice. 2022 Feb 25; Available from: http://arxiv.org/abs/2202.12780
View Article
Google Scholar

[61] View Article

[62] Google Scholar

[ref22] 22. European Commission (2008) NACE Rev. 2 –Statistical classification of economic activites in the European Community
View Article
Google Scholar

[64] View Article

[65] Google Scholar

[ref23] 23. Instituto Nacional de Seguridad y Salud en el Trabajo. Informe anual de accidentes de trabajo en España 2019. 2020.
View Article
Google Scholar

[67] View Article

[68] Google Scholar

[ref24] 24. Barratt S, Angueris G, Boyd S. rsw. Optimal representative sample weighting (rsw) in Python [Internet]. Github; 2020 [cited 2022 Aug 28]. Available from: https://github.com/cvxgrp/rsw.
View Article
Google Scholar

[70] View Article

[71] Google Scholar

Figures

Abstract

1. Introduction

1.1 The case of Spain

1.2 General approach

2. Problem definition and methodology

2.1 Problem definition

2.2 Generation of a synthetic population

2.3 Baseline determination

3. Application of the method to the case

3.1 Selection of variables and data transformation

3.2 Synthetic population generation

3.3 Baselines determination

4. Discussion of results and conclusions

References