QSAR modelling of a large imbalanced aryl hydrocarbon activation dataset by rational and random sampling and screening of 80,086 REACH pre-registered and/or registered substances

doi:10.1371/journal.pone.0213848

Fig 1.

An overview of the workflow.

Pink box: structure curation and preparation of the test sets and of the datasets used for training set construction. Light blue box: selection of training set inactives and model building. Dark blue box: prediction of the external validation test set, the cross-validation sets and the REACH set with the four models. Green box: inter-model comparison of predictive performance in the external validations and of coverage of the REACH set.

Table 1.

Overview of the datasets and their distributions of active and inactive experimental results.

Table 2.

The results from the 10 times 20% out LPDM cross-validations of the three modelling approaches applied to the 2:1 training set (within the structural and probability AD).

Table 3.

The results from the two times five-fold DTU Food cross-validation procedure of the cocktail models with different active-to-inactive ratios.

Table 4.

The results from the external validation of the models including model AD sizes for the test set.

Fig 2.

The most significant activity and inactivity structural features occurring in the Rational-final model.

(A) Structural features alerting for activity in the Rational-final model. (B) Structural features alerting for inactivity in the Rational-final model. The selection of activity features was based on a ranking by the formula |0.2 - x̄| ∙ χ², where x̄ is the mean activity of all training set structures containing the feature. The selection of inactivity features was done by significance (χ²) among the ‘pure’ inactivity features, i.e. those appearing only in inactive substances. In both cases χ² denotes the chi-square test of independence with one degree of freedom and Yates’ correction.
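
The ranking described in this caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 2x2 contingency layout and the coding of activity as 1 (active) or 0 (inactive) are assumptions made here for illustration, and the Yates-corrected statistic uses the standard textbook formula for a 2x2 table.

```python
def yates_chi2(a, b, c, d):
    """Yates-corrected chi-square statistic for a 2x2 table (1 degree of freedom).

    Assumed layout (hypothetical, for illustration):
        a = actives containing the feature
        b = inactives containing the feature
        c = actives lacking the feature
        d = inactives lacking the feature
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column total is zero, so the statistic is undefined; return 0.
        return 0.0
    # Continuity correction, floored at zero so it cannot inflate the statistic.
    diff = max(0.0, abs(a * d - b * c) - n / 2)
    return n * diff ** 2 / denom


def activity_feature_score(a, b, c, d, reference=0.2):
    """Score |reference - mean activity| * chi-square for one structural feature.

    Mean activity is taken as the fraction of actives among the training set
    structures containing the feature (assumes 1/0 activity coding).
    """
    mean_activity = a / (a + b)
    return abs(reference - mean_activity) * yates_chi2(a, b, c, d)


# Example: a feature found in 10 actives and no inactives,
# with 10 inactives and no actives lacking it.
score = activity_feature_score(10, 0, 0, 10)
```

Features with higher scores would rank higher. A feature whose mean activity sits close to the 0.2 reference value scores near zero regardless of its χ² value.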

Fig 3.

The most significant activity and inactivity structural features occurring in the Random-final model.

(A) Structural features alerting for activity in the Random-final model. (B) Structural features alerting for inactivity in the Random-final model. The selection of activity features was based on a ranking by the formula |0.2 - x̄| ∙ χ², where x̄ is the mean activity of all training set structures containing the feature. The selection of inactivity features was done by significance (χ²) among the ‘pure’ inactivity features, i.e. those appearing only in inactive substances. In both cases χ² denotes the chi-square test of independence with one degree of freedom and Yates’ correction.

Fig 4.

Performance of QSAR2:1, QSAR3:1, QSAR4:1 and QSAR4:1R vs. REACH coverage.

Performance is described by sensitivity (A), specificity (B) and balanced accuracy (C). The rational selection approach is shown by yellow markers: a diamond for QSAR2:1, a triangle for QSAR3:1 and a square for QSAR4:1. The blue circle shows the random selection approach, QSAR4:1-R.
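
For reference, the three performance measures named in this caption can be computed from confusion-matrix counts as in the sketch below. The definitions are the standard ones; the function name and the example counts are hypothetical and are not taken from the paper's results.

```python
def performance(tp, fn, tn, fp):
    """Return (sensitivity, specificity, balanced accuracy) from confusion counts.

    tp = actives predicted active,     fn = actives predicted inactive,
    tn = inactives predicted inactive, fp = inactives predicted active.
    Only predictions inside a model's applicability domain would be counted.
    """
    sensitivity = tp / (tp + fn)   # fraction of true actives recovered
    specificity = tn / (tn + fp)   # fraction of true inactives recovered
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, balanced_accuracy


# Hypothetical counts, chosen only to illustrate the calculation.
sens, spec, bal_acc = performance(tp=8, fn=2, tn=90, fp=10)
```

Balanced accuracy weights actives and inactives equally, which matters for an imbalanced dataset where overall accuracy would be dominated by the majority (inactive) class.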

Table 5.

Number of substances covered (% of screened REACH substances), number of predicted actives (% of covered) and number of predicted inactives (% of covered) from predicting the REACH set of 80,086 substances.
