Visualising statistical models using dynamic nomograms

Translational Statistics proposes to promote the use of Statistics within research and improve the communication of statistical findings in an accurate and accessible manner to diverse audiences. When statistical models become more complex, it becomes harder to evaluate the role of explanatory variables on the response. For example, the interpretation and communication of the effect of predictors in regression models where interactions or smoothing splines are included can be challenging. Informative graphical representations of statistical models play a critical translational role; static nomograms are one such useful tool to visualise statistical models. In this paper, we propose the use of dynamic nomogram as a translational tool which can accommodate models of increased complexity. In theory, all models appearing in the literature could be accompanied by the corresponding dynamic nomogram to translate models in an informative manner. The R package presented will facilitate this communication for a variety of linear and non-linear models.


Introduction
Translational Medicine, within biomedical and public health research domains, is defined as the convergence of basic and clinical research with the aim to transfer knowledge on the benefits and risks of therapies. The concept of 'Translational Statistics' was proposed [1] to facilitate the integration of biostatistics within clinical research in order to enhance communication of statistical research findings in an accurate and accessible manner to diverse audiences (e.g. policy makers, patients and the media).
The use of appropriate graphics is central to all areas of statistical research. For example, the graphical representation of data is necessary to provide a way of assessing at least parts of any assumed statistical model before engaging in formal analysis and to aid in the presentation and understanding of results and conclusions. Nomograms have been used to visualise statistical models, playing the role of a graphical 'predict' function, facilitating the calculation of a point estimate of the response variable for a particular set of values of the explanatory variables. In this paper, we propose the use of dynamic nomograms as a visualisation and translational tool to further aid the communication of the results of a statistical analysis to a non-statistical a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 audience. We introduce the R package DynNom for generating dynamic nomograms (as Shiny objects) for a variety of linear and non-linear models. In theory, most models presented in the literature could have an accompanying web address to direct the reader to the corresponding dynamic nomogram allowing them to 'interact' with the model to gain insight into the effect of each explanatory variable on the primary response of interest.

Nomograms
The field of nomography was invented by the French mathematician Maurice d'Ocagne in 1880 to provide engineers with fast graphical calculation tools for complicated formulas to a practical level of precision [2,3]. An early use of nomograms is attributed to [4], where they were employed in mathematics for calculating elementary arithmetic, quadratic equations and trigonometrical functions. One of the first nomograms in statistics was used to calculate the coefficient of variation, interval estimates for a mean and the comparison of two means and variances [5]. Other examples of nomograms in this domain are the chi-square test nomogram (Fig 1), Altman's nomogram to calculate sample size or power [6] and Fagan's nomogram [7] for the applications of Bayes's theorem. Fagan's nomogram, widely used in the context of diagnostic tests, was recently reproduced as the Held's nomogram for the calculation of post-test probabilities [8,9].
Nomograms have been widely used in applications in science, engineering, the military, clinical research and epidemiology; see [4,11]. The use of nomograms in clinical decision making includes prediction of prostate cancer severity [12,13], and the Partin nomogram (known as Partin Table) [14] which has been used mainly as a prognostic tool over the last two decades for the prediction of pathological stage in prostate cancer.
More recently, nomograms have been used to visualise statistical models. A nomogram generated from a model plays the role of a graphical 'predict' function facilitating the calculation of a point estimate of the response variable for a particular set of values of the explanatory variables [15,16]. It consists of rulers for each predictor (X) along with two rulers to convert the total points to the desired risk scale. The length of the i th predictor ruler provides a visual representation of the relative effect sizes β) of each explanatory variable in the model. It is defined in the Eq 1, which is calculated as a scaled version of the proportion of the predictor's contribution divided by the maximum predictor contribution: Software is available to create nomograms for statistical models in SAS [17], Stata [18], Python [19] and as online tools for constructing simple JAVA-based interactive nomograms [20] as well as the rms [21] and hdnom [22] packages in R.
The rms package in R [21] includes the nomogram function to generate nomograms from a fitted statistical model. For example, the code below creates a nomogram from a logistic regression model used to model the relationships between age, gender and passenger class on surviving the Titanic disaster (data are included in the PASWR package [23]).
R> data(titanic3) R> t.data <-datadist(titanic3) R> options(datadist = "t.data") R> fit <-lrm(survived~age + pclass + sex, data = titanic3) R> plot(nomogram(fit, fun = function(x)plogis(x))) The resulting nomogram is given in Fig 2. The horizontal line at the top labelled 'points' allows the effect size of each variable to be assessed, including direct visualisation of the potential variation in the effect. The points measure the contribution of the values/levels of the explanatory variable of interest on the 'linear predictor' scale. The total points are then mapped from the 'linear predictor' to obtain the 'predicted value', which in this case is a probability. It is clear from the graph on the top that all three explanatory variables make similar More recent developments in this area include the coloured-based nomogram in the VRPM package [24] proposed by Belle and Caster in 2015 [25]. Fig 3 represents a colour-based nomogram for the main effect model fitted to the Titanic data. The survival probability is calculated by summing up the contribution from each covariate/factor when matched to the colour legend. The effect of gender relative to age is again evident. The accuracy of any prediction depends on the users' ability to match colours accordingly.
When modelling a binary outcome (e.g. using logistic regression), modelling the log odds is mathematically attractive, summarising a treatment effect as an odds ratio may be misleading [26,27]. A nomogram is helpful as the predicted response is given as a probability.
It has been argued [28] that a summary quoting the underlying probabilities is more informative than one based on ratios of odds or probabilities. Odds and probability are often (mistakenly) used interchangeably where large and small odds ratios are interpreted as meaningful without reference to the underline probability of the event of interest. Similarly, in a logistic regression model the importance of an explanatory variable is often assessed (incorrectly) by the magnitude of the regression coefficient while ignoring the scale of the variable (e.g. hours, minutes, days) or by the p-value while ignoring the sample size. Given that the response variable of interest is a binary outcome, the effect of each variable on the response may be more easily translated if represented on a probabilistic scale.
Nomograms achieve this and provide an attractive and useful graphical summary of a logistic regression model with the ability to make predictions as probabilities. Static Nomograms can become cumbersome to use as models become more complex (e.g. higher order interactions) or the inclusion of smoothers [29], and when visualising uncertainty (e.g. interval estimates). One solution is to create dynamic nomograms as interactive applications to visualise statistical models such that effect sizes can be assessed graphically, by visualising the predicted value and corresponding confidence interval automatically.

Dynamic nomograms
The emergence of the rpanel [30] and Shiny packages [31] for R allowed the creation of interactive, user-friendly graphical interfaces to be displayed either in a standalone window or delivered as a webpage.
The DynNom package [32] is built to generate dynamic nomograms as a Shiny application for a variety of statistical models to allow a reader to interact with the model in a user-friendly manner. The package can be install from the CRAN repository using the following command: R> install.packages("DynNom") For example, the R code below will create a dynamic nomogram, using a simple function, for a logistic regression model of the Titanic data.
R> fit <-glm(survived~(age + pclass + sex)^3, titanic3, family = "binomial") R> DynNom(fit) The dynamic nomogram displays sliders for covariates (bounded by their observed ranges) and drop-down boxes for factors. The predict function maps the appropriate inverse link function (η = g(μ)) from the generalised linear model object to generate predicted values (on the scale of the response variable) and corresponding 95% confidence intervals (see Fig 4). The predicted value and corresponding confidence interval are plotted and presented in the tooltip labels and the 'Numerical Summary' tab. Further, a formatted model output summary is displayed in the 'Model Summary' tab.
Predicted quantities are accompanied by measures of uncertainty [33] in the form of confidence intervals. The confidence interval for the mean response will quantify the uncertainty around the estimation of the mean response for a given set of explanatory variables. In ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi where x is a vector of observed predictors,Ẑ i ¼ x T �b is the linear equation estimates, u 1À a 2 indicates the corresponding critical value and S denotes the covariance matrix. The confidence interval is estimated in the linear form; however, for a clear interpretation, it is best to apply the inverse link transformation g −1 (.) (see Table 1) to the confidence limits in formula (2) [34].
Two main functions of the package are DynNom and DNbuilder functions. The former extracts the model class from the fitted model to build a dynamic nomogram as a shiny app which computes and displays the point and interval estimate using Eq 2 while the latter generates the files necessary to deploy the app.
The DynNom package supports model objects created by the lm, glm and coxph functions in R. It also supports ols, Glm, lrm and cph models in the rms package which allow the inclusion of smoothing splines as well as gam functions from mgcv [35] and gam packages [36]. When applied to a Cox proportional hazards model object, the estimated survival function will be displayed with alpha-blending (i.e. thickness of the line reflex the number of subjects over time) to represent the underlying uncertainty of the estimates. Table 2 summarises all the model objects supported by DynNom which cover a wide variety of the most common models fitted depending on the type of response variable of interest.

Examples
For illustration, dynamic nomograms for a Poisson response, a linear regression incorporating smoothing splines and a Cox proportional hazards model for time-to-event data are presented in the following sections. Examples involving Gamma regression and generalized additive model objects are not presented for brevity as the same procedure applies.

Modelling a count response
The 'crabs' dataset [37] (available in the glm2 package [38]) is used as an example of visualising a Poisson regression model. This dataset comprises 173 female horseshoe crabs, each of which had at least one male crab attached to it in their nest. This study aims to investigate factors affecting the number of male partners in addition to the female's primary partner (called satellites). The explanatory variables include the width of the female (Width) and a binary factor indicating whether the female colouring is dark or not (Dark). A Poisson regression model has been used previously [37] to model the relationship between the 'Width' and 'Dark' on the number of partners. The summary of the fitted Poisson model is given in Table 3, where the effect of Width and Dark are identified as significant (at the 5% level). The accompanying dynamic nomogram (Fig 5) can be created as follows: R> fit2 <-glm(Satellites~Width + Dark, family = poisson(log), data = crabs) R> DynNom(fit2) The estimates displayed in Fig 5 are for the predicted (mean) response for scenarios covering the two levels of crab colour and three values of Width. The joint effects of 'Width' and

Modelling a continuous response with a smoothing spline
Interpreting the role explanatory variables play can be difficult in a model when the functional form of an explanatory is modelled as a smooth function. For example, the 'ragweed' dataset, introduced in [39] is an example where a regression model can be used for a continuous response incorporating a smoothing spline (data is available in the SemiPar package [40]). The aim is to forecast the ragweed pollen level as a function of the temperature, rain, wind speed forecast for the following day and pollen season day. The corresponding boxplot and the scatterplot (Fig 6) illustrate the marginal relationships between the explanatory variables and the (transformed) response, the square root of ragweed level. The nonlinear relationship between the day in pollen season and the mean response is modelled using a restricted cubic spline function, with temperature, and wind speed as (linear) predictors and rain presence as a factor. The following code fits the model using the ols function in the rms package (Table 4) and creates the dynamic nomogram shown in Fig 7. R> ragweed$sqrtragweed <-sqrt(ragweed$ragweed) R> dd <-datadist(ragweed) R> options(datadist = "dd") R> fit3 <-ols(sqrtragweed~rain + temperature + wind.speed + rcs(day.in.seas, 5), data = ragweed) R> DynNom(fit3) The need for a non-linear function for the day in season variable appears apparent where the effect increases for up to 25 days in the season and then decreases from that time point onwards. Interpreting the effect of this covariate using the model summary alone, while adjusting for rain, temperature and wind speed, is less clear.
The effect of 'days in season' is arguably more accessible using a dynamic nomogram ( Fig  7) by choosing values of this covariate ranging from low to high and setting the remaining explanatory variables at their mode (for factors) and means (for covariates).

Modelling a time-to-event response
Classical graphical summaries for time-to-event data include plots of the estimated survival or hazard functions. When modelling the effect of covariates and factors on time-to-event data, the Cox proportional hazards model is a popular choice as the underlying survival distribution does not need to be specified. The interpretation of the effect of explanatory variables on the 'risk' of the event of interest is often made using hazard ratios or by using a plot of the estimated survival function conditioning on the set of explanatory variables of interest. For example, the lung cancer dataset in the survival package [41] includes the time to death for 228 advanced lung cancer patients where gender, age, weight loss and daily activity performance scores (such as ECOG) were recorded as potentially useful explanatory variables. The code needed to fit a Cox proportional hazards model and the corresponding dynamic nomogram to investigate the effect of these explanatory variables on time to death is displayed below. R> model <-coxph(Surv(time, status)~age + wt.loss + ph.ecog + sex, data = lung) R> DynNom(model) The summary of the fitted Cox proportional hazards model is given in Table 5, where the effect of each explanatory variable is given on the log scale and as a hazard ratio.
The dynamic nomogram (Fig 8) presents the estimated survival function for a given set of values of the explanatory values which avoid the use of hazard ratios and interpretation of effects that act multiplicatively. Rather than clutter the graph by including the corresponding interval estimates for each survival function, the degree of uncertainty in the estimate is highlighted graphically by adjusting the colour transparency (alpha-blending) to represent the number of subjects at risk over time. If required the predicted survival at any given follow-up time point can be generated using the 'Predicted Survival' tab.

Publishing dynamic nomograms
In order to make a dynamic nomogram accessible to a wider audience additional functions (i.e. DNbuilder family functions) are provided to generate the files necessary (i.e. ui.r, server.r and global.r) to publish the dynamic nomogram as illustrated below: R> library(PASWR) R> library(DynNom) R> data(titanic3) R> fit <-glm(survived~(age + pclass + sex)^3, titanic3, family = "binomial") R> DNbuilder(fit) The corresponding files can then be uploaded to a server running Shiny (e.g. https:// www.shinyapps.io/) which facilitates the deployment of Shiny applications. For example, the dynamic nomogram built for the analysis of the Titanic data described above is published at https://dynnom.shinyapps.io/titanic/. Examples of the use of the DynNom package (and the Visualising statistical models using dynamic nomograms DNbuilder function in particular) to generate linked dynamic nomograms to statistical models published in research articles includes [42][43][44][45][46][47]. All models presented in the literature could in theory and practice have an accompanying web address to direct the reader to the corresponding dynamic nomogram allowing them to interact with the model and identify the exact nature of the model fitted. The 'Model Summary' tab provides the (native) model summary from R allowing an examination of the precise model fitted and accompanying model summaries (e.g. R 2 , AIC etc.), which makes it clear as to the actual model (and its parametrisation) fitted.

Conclusion
As inferential statistical methods become more computational, the models arising are increasingly complex and are often hard to interpret and translate, in particular for non-statisticians. Dynamic nomograms are one such translational tool to help address this issue. They provide a visual representation of a statistical model where the importance of each explanatory variable, in terms of their effect size, becomes more evident through the assessment of the change in the predicted value. This is particularly useful for models containing modifiable risk factors.
The ability to accompany a model summary in a journal with a link to the corresponding dynamic nomogram has the potential to make models more translatable, and analyses more transparent in terms of fulfilling the recommendations of making all research understandable and reproducible.