
Prediction of coronary artery bypass graft outcomes using a single surgical note: An artificial intelligence-based prediction model study

  • John Del Gaizo,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Writing – original draft, Writing – review & editing

    Affiliation Division of Cardiothoracic Surgery, Department of Surgery, Medical University of South Carolina, Charleston, South Carolina, United States of America

  • Curry Sherard,

    Roles Formal analysis, Project administration, Validation, Writing – original draft, Writing – review & editing

    Affiliation College of Medicine, Medical University of South Carolina, Charleston, South Carolina, United States of America

  • Khaled Shorbaji,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration

    Affiliation Division of Cardiothoracic Surgery, Department of Surgery, Medical University of South Carolina, Charleston, South Carolina, United States of America

  • Brett Welch,

    Roles Project administration, Resources, Software

    Affiliation Division of Cardiothoracic Surgery, Department of Surgery, Medical University of South Carolina, Charleston, South Carolina, United States of America

  • Roshan Mathi,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology

    Affiliations Division of Cardiothoracic Surgery, Department of Surgery, Medical University of South Carolina, Charleston, South Carolina, United States of America, College of Medicine, Medical University of South Carolina, Charleston, South Carolina, United States of America

  • Arman Kilic

    Roles Conceptualization, Investigation, Methodology, Project administration, Resources, Supervision, Writing – review & editing

    kilica@musc.edu

    Affiliation Division of Cardiothoracic Surgery, Department of Surgery, Medical University of South Carolina, Charleston, South Carolina, United States of America

Abstract

Background

Healthcare providers currently calculate the risk of the composite outcome of morbidity or mortality associated with coronary artery bypass grafting (CABG) surgery through manual input of variables into a logistic regression-based risk calculator. This study indicates that automated artificial intelligence (AI)-based techniques can instead calculate this risk. Specifically, we present novel numerical embedding techniques that enable natural language processing (NLP) models to achieve higher performance than the risk calculator using a single preoperative surgical note.

Methods

The most recent preoperative surgical consult notes of 1,738 patients who received an isolated CABG from July 1, 2014 to November 1, 2022 at a single institution were analyzed. The primary outcome was the Society of Thoracic Surgeons defined composite outcome of morbidity or mortality (MM). We tested three numerical-embedding techniques on the widely used TextCNN classification model: 1a) Basic embedding, treat numbers as word tokens; 1b) Basic embedding with a dataloader that Replaces out-of-context (ROOC) numbers with a tag, where context is defined as within a number of tokens of specified keywords; 2) ScaleNum, an embedding technique that scales in-context numbers via a learned sigmoid-linear-log function; and 3) AttnToNum, a ScaleNum-derivative that updates the ScaleNum embeddings via multi-headed attention applied to local context. Predictive performance was measured via area under the receiver operating characteristic curve (AUC) on holdout sets from 10 random-split experiments. For eXplainable-AI (X-AI), we calculate SHapley Additive exPlanation (SHAP) values at an ngram resolution (SHAP-N). While the analyses focus on TextCNN, we execute an analogous performance pipeline with a long short-term memory (LSTM) model to test if the numerical embedding advantage is robust to model architecture.

Results

A total of 567 (32.6%) patients had MM following CABG. The embedding performances with the TextCNN architecture are as follows: 1a) Basic, mean AUC 0.788 [95% CI (confidence interval): 0.768–0.809]; 1b) ROOC, 0.801 [CI: 0.788–0.815]; 2) ScaleNum, 0.808 [CI: 0.785–0.821]; and 3) AttnToNum, 0.821 [CI: 0.806–0.834]. The LSTM architecture produced a similar trend. Permutation tests indicate that AttnToNum outperforms the other embedding techniques, although the difference versus ScaleNum is not statistically significant (p-value of 0.07). SHAP-N analyses indicate that the model learns to associate low blood urea nitrogen (BUN) and creatinine values with survival. A correlation analysis of the attention-updated numerical embeddings indicates that AttnToNum learns to incorporate both number magnitude and local context to derive semantic similarities.

Conclusion

This research presents novel contributions that are both quantitative and clinical. Quantitatively, we contribute two new embedding techniques: AttnToNum and ScaleNum. Both can embed strictly positive and bounded numerical values, and both surpass basic embeddings in predictive performance. The results suggest AttnToNum outperforms ScaleNum. With regard to clinical research, we show that AI methods can predict outcomes after CABG using a single preoperative note at a performance that matches or surpasses the current risk calculator. These findings reveal the potential role of NLP in automated registry reporting and quality improvement.

Introduction

Artificial intelligence (AI) in healthcare has demonstrated utility in predictive analytics, imaging interpretation, data extraction, and reducing workload inefficiencies [1–4], including AI applications specific to cardiac surgery [2, 3, 5–16]. Important drawbacks of AI include the need for large datasets, the potential for error especially in high-risk situations, the possibility of privacy violations and abuse, a lack of trust in and understanding of AI among physicians and patients [1, 4, 5, 7, 17], and the need for use-case transparency for patient consent.

The Society of Thoracic Surgeons (STS) risk models have served as a credible gold standard for cardiac surgical quality reporting for decades [4, 18, 19]. The STS risk calculator, which is a logistic regression model, achieves an AUC performance of approximately 0.76, with a 95% CI of (0.73, 0.79) [20]. Since some of the STS risk calculator inputs are manually extracted from unstructured data, these models require significant manual data extraction and entry. An automated AI would allow providers to spend more time on clinical duties.

Recent studies that utilize structured EHR data to predict coronary artery bypass grafting (CABG)-associated risk achieve similar [21] or superior [22, 23] performance to the STS risk calculator and identify high-risk predictors [21, 24]. We hypothesize that neural network models can complement these analyses by extracting information from unstructured data, such as patient notes. Whether more condensed data inputs, such as a single clinical note, can serve to develop well-performing risk models has yet to be explored [4, 10–16].

To answer this question, we evaluated several embedding techniques with convolutional neural network (CNN) and LSTM models for predicting post-CABG outcomes using a single preoperative surgical consult note, including a novel attention-based technique to embed numerical tokens. Our results indicate that low parameter-count neural networks can achieve superior predictive performance to the current risk calculator standard, given utilization of the novel numerical embeddings and calibration of model output to a desired class-separation threshold.

Materials and methods

Data

Isolated CABG procedures performed from July 1, 2014 to November 1, 2022 at a single center were identified. The analyzed notes were inpatient consult notes containing a “history of present illness” section, all entered within 30 days of CABG. If multiple notes existed for a patient, only the note most recent to CABG was included. According to a sample of our data, about 5.4% of the patient population is on-pump. This study was deemed exempt from review by the Institutional Review Board (IRB Pro00122587), the committee that oversees research ethics at the Medical University of South Carolina.

The notes occasionally listed symptoms in a bulleted format, but often did not. Important metrics recorded in the notes included creatinine and blood urea nitrogen (BUN) values, commonly in the format of “BUN 24.0 (H)” or “creatinine .7 (L)”. The healthcare professional transcribed low (L) for the BUN value if it was less than or equal to 20 and high (H) otherwise. The analogous cutoff for creatinine was 1.0.

Primary outcome

The primary outcome was the STS-defined composite outcome of operative mortality or major morbidity (MM). Operative mortality is defined by the STS as mortality occurring within 30 days of surgery or as an inpatient during the index hospitalization. Major morbidities included postoperative stroke, acute renal failure, cardiac reoperation, deep sternal wound infection, and prolonged ventilation.

Predictive performance experiments

We ran 10 experiments of our machine learning pipeline. Each experiment left a random 15% (261 samples) of the data for a holdout set, and then split the remaining notes 85%/15% (1,254/222) for training and validation. The splits were stratified to ensure a similar ratio of MM subjects between the training, validation, and holdout sets. For each experiment, the model was fit on the 1,254 training samples. The same 10 split sets were employed to compare different embedding techniques and models to ensure consistent comparisons.
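As a point of reference, the stratified splitting procedure can be sketched with scikit-learn as follows; the variable names and placeholder data are ours, and the original pipeline's seeds and tooling may differ.

```python
# Minimal sketch of the 15% holdout / 85%-15% train-validation stratified splits
# described above. `notes` and `labels` are placeholder data, not the study dataset.
from sklearn.model_selection import train_test_split

notes = [f"note {i}" for i in range(100)]                # placeholder documents
labels = [1 if i % 3 == 0 else 0 for i in range(100)]    # placeholder MM outcomes

def make_split(notes, labels, seed):
    # 15% stratified holdout
    x_rest, x_hold, y_rest, y_hold = train_test_split(
        notes, labels, test_size=0.15, stratify=labels, random_state=seed)
    # split the remaining 85% into 85% train / 15% validation, again stratified
    x_train, x_val, y_train, y_val = train_test_split(
        x_rest, y_rest, test_size=0.15, stratify=y_rest, random_state=seed)
    return (x_train, y_train), (x_val, y_val), (x_hold, y_hold)

splits = [make_split(notes, labels, seed) for seed in range(10)]   # 10 experiments
```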

For each experiment, we calibrate the validation-set predicted probabilities of MM to the observed MM rate via a linear regression to obtain a calibration slope and intercept. Specifically, the validation-set subjects are grouped according to MM prediction, binned at 0.1 resolution, and the mean MM rate per bin serves as the dependent variable. The fit linear regression model is then applied to calibrate the holdout-set predictions. Performance was measured by AUC on the holdout sets for each of the 10 experiments. Standard textual preprocessing was performed, such as regular expressions to remove dates.
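The binned linear calibration step can be sketched as follows; how empty bins and out-of-range calibrated values are handled here is an assumption, not taken from the original code.

```python
# Sketch of the calibration step: bin validation-set predicted probabilities at 0.1
# resolution, regress the observed MM rate on the bin-mean prediction, then apply the
# fitted slope/intercept to holdout predictions.
import numpy as np

def fit_calibration(val_pred, val_true, bin_width=0.1):
    val_pred = np.asarray(val_pred, dtype=float)
    val_true = np.asarray(val_true, dtype=float)
    bins = np.floor(val_pred / bin_width).astype(int)
    xs, ys = [], []
    for b in np.unique(bins):
        mask = bins == b
        xs.append(val_pred[mask].mean())   # mean predicted probability in the bin
        ys.append(val_true[mask].mean())   # observed MM rate in the bin
    slope, intercept = np.polyfit(xs, ys, deg=1)   # simple linear regression
    return slope, intercept

def calibrate(pred, slope, intercept):
    # clipping to [0, 1] is an assumption for well-formed probabilities
    return np.clip(slope * np.asarray(pred) + intercept, 0.0, 1.0)
```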

Convolutional neural networks

Our base CNN architecture is TextCNN [25]. Previous research indicates that TextCNN shows comparable classification performance on small medical datasets to much larger models such as BERT [10, 26], but with orders of magnitude less memory, thereby enabling faster prototyping and experimentation.

The hyperparameters are as follows: learning rate of 3.5 × 10⁻⁴, batch size of 64, 44 kernels per convolutional filter layer, embedding dimension of 50, 4 filter layers of sizes (1, 2, 3, 5), 100 epochs, 0.5 dropout applied to the final fully connected layer, and a length cutoff of 3,000 tokens with 0 padding for shorter notes. This results in a total of 176 (44 × 4) convolutional filters.
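A PyTorch-style sketch of a TextCNN with these hyperparameters is shown below; the class and attribute names are ours, and the original implementation may differ in details such as weight initialization and padding handling.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Sketch of the TextCNN configuration described above (Kim, 2014)."""
    def __init__(self, vocab_size, embed_dim=50, n_kernels=44,
                 kernel_sizes=(1, 2, 3, 5), dropout=0.5, n_classes=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, n_kernels, k) for k in kernel_sizes)
        self.dropout = nn.Dropout(dropout)
        # 44 kernels x 4 filter sizes = 176 inputs to the classification layer
        self.fc = nn.Linear(n_kernels * len(kernel_sizes), n_classes)

    def forward(self, tokens):                      # tokens: (batch, 3000) int ids
        x = self.embed(tokens).transpose(1, 2)      # (batch, embed_dim, seq_len)
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        features = torch.cat(pooled, dim=1)         # (batch, 176), sample-wide max pool
        return self.fc(self.dropout(features))      # logit for MM
```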

Long Short-Term Memory (LSTM) neural networks

The base LSTM architecture consists of the same architecture as the presented TextCNN, except the convolutional layer is replaced with a bi-directional LSTM (Bi-LSTM) layer. We refer to this model as TextLSTM. Similar to TextCNN, a max pool is applied temporally before the classification layer. However, the max pool is applied across the bi-LSTM’s cell output states instead of convolutional filter activations.

We configure the TextLSTM architecture to have the same layer dimensions as TextCNN, except the convolutional layer is replaced with a Bi-LSTM. The LSTM hidden dimension is 88, which means the output dimension of the Bi-LSTM is 176 by 3000 (sequence length), and this is reduced to 176 after the max pool; dropout is 0.0; and the number of epochs is 50 instead of 100 as we found the LSTM model overfits more readily and that dropout does not provide regularization. Note that the hidden dimension size of 88 was chosen so that the classification layer would be of the same dimension as TextCNN’s classification layer, 176.
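Under the same assumptions, the TextLSTM variant can be sketched as follows, replacing the convolutional layer with a Bi-LSTM (hidden dimension 88) followed by a temporal max pool; the naming is ours.

```python
import torch
import torch.nn as nn

class TextLSTM(nn.Module):
    """Sketch of TextLSTM: Bi-LSTM (hidden 88 -> 176 outputs) + temporal max pool."""
    def __init__(self, vocab_size, embed_dim=50, hidden=88, n_classes=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)   # 176-dim, matching TextCNN

    def forward(self, tokens):                       # tokens: (batch, seq_len)
        x = self.embed(tokens)                       # (batch, seq_len, embed_dim)
        out, _ = self.bilstm(x)                      # (batch, seq_len, 176) cell outputs
        features = out.max(dim=1).values             # max pool over time -> (batch, 176)
        return self.fc(features)
```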

Embedding numerical information

Encoding numerical information is a challenging task in natural language processing [27]. Even advanced models treat numbers as text and encode them in a similar manner. This means that closely related numbers (e.g., 1.7 and 1.8) appear as 2 separate tokens, where similarity in the model output results from similarity in the semantic space. This poses a challenge for models trained on small datasets, which may have few or no training samples containing a holdout sample’s numerical value.

As a first step, we experiment with replacing uninformative numbers with a tag. Basic exploratory analyses of the dataset indicated that blood urea nitrogen (BUN) and creatinine values listed in the notes strongly correlated with MM. Due to the way physicians transcribe notes, BUN/creatinine values are listed in close proximity to the key terms “BUN” or “creatinine”. A common format is “[BUN|creatinine] [numerical value] ([H|L])”, where (H) means the physician labeled the value as high and (L) as low. We define context as tokens within 2 tokens to the right, or 1 token to the left, of either “BUN” or “creatinine”. These are hyperparameters that can be adjusted.

Out-of-context (OOC) numbers are numerical tokens that fall outside this window (more than 2 tokens to the right or 1 token to the left of a keyword). OOC numbers are replaced with the token “_INUM_” or “_lgnum_” if they are less than or greater than 1,000, respectively (Fig 1). This substitution removes sporadic numerical tokens, thereby reducing the embedding matrix dimensionality while retaining the BUN and creatinine signal.
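A minimal sketch of this ROOC preprocessing step is shown below, assuming whitespace tokenization; the keyword window follows the description above, while the tokenizer and the handling of values exactly equal to 1,000 are assumptions.

```python
# Sketch of the ROOC (Replace Out-Of-Context numbers) step. A numeric token is
# "in context" if it sits within 2 tokens to the right, or 1 token to the left,
# of the keywords "BUN" / "creatinine"; all other numbers are replaced with the
# tags used in the text.
import re

KEYWORDS = {"bun", "creatinine"}

def is_number(tok):
    return re.fullmatch(r"\d+(\.\d+)?", tok) is not None

def replace_ooc_numbers(tokens, right=2, left=1):
    keyword_idx = [i for i, t in enumerate(tokens) if t.lower() in KEYWORDS]
    out = []
    for i, tok in enumerate(tokens):
        if not is_number(tok):
            out.append(tok)
            continue
        in_context = any(k < i <= k + right or i < k <= i + left for k in keyword_idx)
        if in_context:
            out.append(tok)                      # keep BUN/creatinine values
        else:
            out.append("_INUM_" if float(tok) < 1000 else "_lgnum_")
    return out

print(replace_ooc_numbers("BP 140 over 90 , BUN 24.0 ( H )".split()))
# ['BP', '_INUM_', 'over', '_INUM_', ',', 'BUN', '24.0', '(', 'H', ')']
```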

Fig 1. Embedding pipeline for the example text “Cl 93.0 (L) 03/04/18 creatinine 5.2 (H) 03/05/18”, for the TextCNN architecture model with AttnToNum embeddings, 10th run.

The numerical values correspond to the first dimension of the dimension-50 embeddings.

https://doi.org/10.1371/journal.pone.0300796.g001

We ran experiments with three embedding techniques: 1a) Basic embedding, treat numbers as word tokens; 1b) Basic embedding with a dataloader that Replaces Out-Of-Context numbers with a tag (ROOC); 2) ScaleNum, an embedding technique that scales in-context numbers via a sigmoid of a learned linear-log function; 3) AttnToNum, a ScaleNum-derivative that updates the ScaleNum embeddings via multi-headed attention applied to local context (Fig 1).

ScaleNum

ScaleNum is an embedding technique we developed for numbers that appear within the context of the tokens “BUN” or “creatinine”. ScaleNum scales the number to a multi-dimensional vector via a function, g(x). g(x) first clamps the number between 1 and 1,000, then applies the log function, a linear layer from dimension 1 to the embedding dimension (50), and a sigmoid:

y = g(x) = σ(a log(clamp(x, 1, 1000)) + b)

where y, a, and b are vectors of size embedding dimension.

As BUN and creatinine values are frequently greater than 1 and always less than 1,000, the clamp removes little numerical information while ensuring stability if a number falls outside this range. Note that g(x) is equivalent to x^a e^b / (x^a e^b + 1), since e^(a log(x) + b) = x^a e^b, where x^a e^b is computed element-wise. The model learns multiple g(x) transformations, and then adds the resultant embeddings together. For this research, the model learns 5 g(x) transformations.
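A PyTorch-style sketch of ScaleNum under these definitions is shown below; the module and parameter names are ours.

```python
import torch
import torch.nn as nn

class ScaleNum(nn.Module):
    """Sketch of ScaleNum: g(x) = sigmoid(a * log(clamp(x, 1, 1000)) + b),
    with several learned g(x) transformations summed into one embedding."""
    def __init__(self, embed_dim=50, n_transforms=5):
        super().__init__()
        # each transform is a 1 -> embed_dim linear layer (its own a and b vectors)
        self.transforms = nn.ModuleList(
            nn.Linear(1, embed_dim) for _ in range(n_transforms))

    def forward(self, values):                       # values: (batch,) float tensor
        x = torch.log(values.clamp(1.0, 1000.0)).unsqueeze(-1)     # (batch, 1)
        return sum(torch.sigmoid(t(x)) for t in self.transforms)   # (batch, embed_dim)

emb = ScaleNum()
print(emb(torch.tensor([0.7, 24.0, 5.2])).shape)     # torch.Size([3, 50])
```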

AttnToNum

AttnToNum employs multi-headed self-attention between ScaleNum’s embeddings and the token embeddings within context to generate context-aware number embeddings (Fig 1). This research uses 5 attention heads, with a head dimension of 50/5 = 10.
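A minimal PyTorch-style sketch of this attention update is shown below; the exact query/key/value arrangement and the use of nn.MultiheadAttention are assumptions based on the description above.

```python
import torch
import torch.nn as nn

class AttnToNum(nn.Module):
    """Sketch of AttnToNum: the ScaleNum embedding of a number attends to the
    embeddings of its local-context tokens (5 heads -> head dimension 50/5 = 10)."""
    def __init__(self, embed_dim=50, n_heads=5):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

    def forward(self, num_embed, context_embeds):
        # num_embed: (batch, 1, embed_dim)        ScaleNum embedding of the number
        # context_embeds: (batch, ctx_len, embed_dim) embeddings of nearby tokens
        updated, weights = self.attn(num_embed, context_embeds, context_embeds)
        return updated, weights   # context-aware number embedding + attention weights

layer = AttnToNum()
num = torch.randn(1, 1, 50)
ctx = torch.randn(1, 4, 50)        # e.g. "_date_", "creatinine", "(", "H" (illustrative)
out, w = layer(num, ctx)
print(out.shape, w.shape)          # torch.Size([1, 1, 50]) torch.Size([1, 1, 4])
```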

Embedding-comparison permutation tests

Permutation tests were used to assess whether a given numeric-embedding technique statistically significantly improves performance. The 4 techniques are Basic, Basic with Replace Out-Of-Context numbers (ROOC), ScaleNum, and AttnToNum, for a total of 6 pairwise comparisons: (1) AttnToNum vs ScaleNum, (2) AttnToNum vs ROOC, (3) AttnToNum vs Basic, (4) ScaleNum vs ROOC, (5) ScaleNum vs Basic, and (6) ROOC vs Basic. For each comparison, a length-10 AUC gap vector, gv, is calculated as the difference between a given model’s holdout AUC values, A, and a reference model’s holdout AUC values, T (gv = A − T).

The sum of the elements of gv is stored as the true gap value:

t = Σ_{i=1}^{10} gv_i

For each of 35,000 iterations, we generate a length-10 vector, s_n, composed of random samples from the 2 values [1, −1], and take the dot product of s_n with gv to obtain a mock gap value, m_n. This is equivalent to randomly flipping the sign of each element of gv and taking the sum:

m_n = s_n · gv = Σ_{i=1}^{10} s_{n,i} gv_i

The true gap, t, is compared against the 35,000 mock gaps, m_n, to obtain 1-sided and 2-sided p-values:

p_1-sided = (1/35,000) Σ_{n=1}^{35,000} 1[m_n ≥ t],    p_2-sided = (1/35,000) Σ_{n=1}^{35,000} 1[|m_n| ≥ |t|]
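A short sketch of this sign-flip permutation test is shown below; the p-value convention (proportion of mock gaps at least as extreme as the true gap) mirrors the equations above.

```python
# Sketch of the paired sign-flip permutation test: the true AUC gap is the sum of
# per-experiment differences; mock gaps randomly flip the sign of each difference.
import numpy as np

def permutation_test(auc_a, auc_b, n_iter=35_000, seed=0):
    rng = np.random.default_rng(seed)
    gv = np.asarray(auc_a) - np.asarray(auc_b)           # length-10 AUC gap vector
    t = gv.sum()                                         # true gap
    signs = rng.choice([1.0, -1.0], size=(n_iter, gv.size))
    mock = signs @ gv                                    # 35,000 mock gaps
    p_one_sided = np.mean(mock >= t)
    p_two_sided = np.mean(np.abs(mock) >= abs(t))
    return t, p_one_sided, p_two_sided
```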

Embedding correlations

To test the hypothesis that the model learns to incorporate both number magnitude and local context to derive a numerical semantic space, we calculate the correlations between the pre-convolution embeddings of the numerical tokens for a series of BUN and creatinine values. For example: the correlation between the numerical embeddings for “BUN 10.0 (L)” and “creatinine 4.0 (H)”. The pipeline to generate the embedding correlations is as follows (a minimal sketch follows the list):

  1. Generate 2 phrases, each in the format of “[BUN|creatinine] [value] ([L|H])”. If the BUN value is higher than 20, it is labeled as high (H) and low (L) otherwise. For creatinine, this cutoff is 1.0.
  2. Pass each phrase to the TextCNN model’s embedding layers to obtain the pre-attention and post-attention embeddings for each phrase. The specific TextCNN instance is selected based on the experiment with the best performance.
  3. Extract the embedding vector that corresponds to the number token ([value]) for each phrase, and then calculate the correlation coefficient between the two vectors.
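A minimal sketch of steps 2–3 is shown below; embed_phrase stands in for the trained model's embedding layers (pre- or post-attention) and is a hypothetical interface, not the actual code.

```python
# Extract the embedding vector at the number token's position for two phrases and
# compute their Pearson correlation. `embed_phrase(tokens)` is assumed to return a
# (seq_len, embed_dim) array of pre-convolution embeddings.
import numpy as np

def number_embedding_correlation(embed_phrase, phrase_a, phrase_b, num_pos=1):
    vec_a = embed_phrase(phrase_a.split())[num_pos]   # e.g. "BUN 10.0 (L)" -> row for "10.0"
    vec_b = embed_phrase(phrase_b.split())[num_pos]
    return np.corrcoef(vec_a, vec_b)[0, 1]            # Pearson correlation coefficient
```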

ngram importance with SHAP-N

We employ an ngram-resolution version of SHapley Additive exPlanations (SHAP) analysis to identify individual ngram contributions to sample predictions. We term this analysis SHAP-Ngram resolution (SHAP-N). These contributions can be directly calculated from the TextCNN architecture.

Via the SHAP approach, the aim is to find the simplified, sample-specific, additive model that equals the original model’s output for a given sample, s:

logit_s = ϕ_0 + Σ_{f=1}^{F} ϕ_{f,s}  (1)

where F is the number of features.

Each ϕ_{f,s} equals the relative logit contribution associated with feature f for sample s, where the relative contribution is defined as the logit contribution for s from feature f, minus the mean logit contribution for feature f in the dataset:

ϕ_{f,s} = l_{f,s} − (1/S) Σ_{s′=1}^{S} l_{f,s′}  (2)

where l_{f,s} denotes the logit contribution from feature f for sample s, and S is the number of samples.

Finally, ϕ_0 equals the sum of the mean logit contributions. Note that ϕ_0 is sample-independent:

ϕ_0 = Σ_{f=1}^{F} (1/S) Σ_{s=1}^{S} l_{f,s}  (3)

In the TextCNN architecture, each convolutional filter contributes only one activation per sample due to the sample-wide max pool after the convolutional layer, and this activation is multiplied by a single coefficient in the fully connected layer. These scaled activations, one per convolutional filter, are additively combined to create a logit [25]. Therefore, the logit for a sample can be calculated as:

logit_s = w_0 + Σ_{f=1}^{F} w_f a_{s,f}  (4)

where F is the number of filters, w_0 is the bias of the fully connected layer, w_f is the fully connected layer coefficient for filter f, and a_{s,f} is the max-pooled activation of filter f for sample s; each filter has one and only one logit contribution per sample.

Eq 4 can be reformulated as Eq 1 by subtracting the mean value for each filter:

logit_s = ϕ_0 + Σ_{f=1}^{F} ϕ_{f,s}  (5)

where ϕ_0 = w_0 + Σ_{f=1}^{F} w_f μ_f, ϕ_{f,s} = w_f (a_{s,f} − μ_f), and μ_f is the mean activation of filter f across the dataset. Therefore, the SHAP value for each filter, f, and sample, s, combination equals w_f (a_{s,f} − μ_f).

The ϕ_{f,s} values directly indicate the importance of each passing ngram if each filter selects a distinct ngram. However, 2 common scenarios violate this assumption: (1) the same ngram passes multiple max pools, and (2) sub-ngrams of a passing ngram pass other max pools. For both scenarios, we add the SHAP values together: ϕ_{n,s} = Σ_{f ∈ F_n} ϕ_{f,s}, where n represents the ngram and F_n represents the set of filters whose passing ngrams equal n or are sub-ngrams of n. If n passes only one max pool and no sub-ngrams of n pass other filters, then ϕ_{n,s} = ϕ_{f,s}. Therefore, Eq 5 can be reformulated as follows, where we define the importance value for ngram n and subject s as ϕ_{n,s}, and N is the number of ngrams:

logit_s = ϕ_0 + Σ_{n=1}^{N} ϕ_{n,s}  (6)

Finally, Zhao et al [28] set-union the overlapping ngrams and sum the associated ϕ_{n,s} values to create a new set of N′ features:

logit_s = ϕ_0 + Σ_{n′=1}^{N′} ϕ_{n′,s}  (7)

However, we found that on our dataset this leads to a smoothing effect over long overlapping spans that is hard to interpret. Therefore, we leave the resolution at the ngram level (Eq 6) and refer to the importances, ϕ_{n,s}, as SHAP-N (SHAP-Ngram) values.

We calculate the final SHAP-N for an ngram, ϕ_n, as its average SHAP-N value across all the samples in the dataset (holdout or validation):

ϕ_n = (1/S) Σ_{s=1}^{S} ϕ_{n,s}  (8)

We present the top SHAP-N values for the best-performing experiment, as well as the SHAP-N values for ngrams that contain or consist of a BUN/creatinine token followed by a number.
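A simplified sketch of the per-sample SHAP-N computation is shown below; it reuses the hypothetical TextCNN sketch from the Convolutional neural networks subsection, assumes mu holds the dataset-mean max-pooled activation per filter, and omits the sub-ngram merging described above for brevity.

```python
# Per-filter SHAP values phi_{f,s} = w_f * (a_{s,f} - mu_f) for a TextCNN, aggregated
# into SHAP-N by summing over filters that passed the same ngram for this sample.
import torch
from collections import defaultdict

def shap_n_for_sample(model, tokens, id_to_token, mu):
    # model: TextCNN sketch above; tokens: 1-D LongTensor of token ids for one note;
    # id_to_token: list mapping ids back to strings; mu: (176,) mean activations.
    phi = defaultdict(float)
    with torch.no_grad():
        x = model.embed(tokens.unsqueeze(0)).transpose(1, 2)   # (1, embed_dim, seq_len)
        w = model.fc.weight.squeeze(0)                         # (176,) logit weights
        f_offset = 0
        for conv in model.convs:
            acts = torch.relu(conv(x)).squeeze(0)              # (n_kernels, L_out)
            a, pos = acts.max(dim=1)                           # max-pool value + location
            k = conv.kernel_size[0]
            for j in range(acts.size(0)):
                f = f_offset + j
                ngram = " ".join(id_to_token[int(t)] for t in tokens[pos[j]:pos[j] + k])
                phi[ngram] += float(w[f] * (a[j] - mu[f]))     # w_f * (a_{s,f} - mu_f)
            f_offset += acts.size(0)
    return dict(phi)                                           # ngram -> SHAP-N value
```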

Results

A total of 1,738 patients who underwent isolated CABG during the study period had a preoperative surgical consult note within 30 days of surgery with a “history of present illness” section and therefore met inclusion criteria. Of the 1,738 subjects, 567 (32.6%) had the STS-defined outcome of operative mortality or major morbidity. Of the 567 MM patients, 515 had a major morbidity only, 13 had only an operative mortality without major morbidity, and 39 experienced both major morbidity and mortality.

Predictive performance experiments

For the TextCNN architecture, the classification performances are as follows (Fig 2):

  1a) Basic: mean AUC of 0.788 ± 1.23 × 10⁻³ (95% confidence interval (CI): 0.768–0.809)
  1b) ROOC: 0.801 ± 5.74 × 10⁻⁴ (CI: 0.788–0.815)
  2) ScaleNum: 0.808 ± 8.34 × 10⁻⁴ (CI: 0.785–0.821)
  3) AttnToNum: 0.821 ± 6.30 × 10⁻⁴ (CI: 0.806–0.834).

The holdout AUCs for the 10 experiments with TextCNN are shown in Table 1.

Table 1. TextCNN Holdout AUCs.

Holdout AUCs for TextCNN Experiments.

https://doi.org/10.1371/journal.pone.0300796.t001

The TextLSTM architecture performances are as follows:

  1a) Basic: mean AUC of 0.771 ± 3.27 × 10⁻⁴ (CI: 0.770–0.791)
  1b) ROOC: 0.798 ± 5.59 × 10⁻⁴ (CI: 0.783–0.811)
  2) ScaleNum: 0.817 ± 6.10 × 10⁻⁴ (CI: 0.798–0.829)
  3) AttnToNum: 0.825 ± 8.06 × 10⁻⁴ (CI: 0.809–0.843).

The holdout AUCs for the 10 experiments with TextLSTM are shown in Table 2.

Table 2. TextLSTM holdout AUCs.

Holdout AUCs for TextLSTM experiments.

https://doi.org/10.1371/journal.pone.0300796.t002

The parameter counts for the TextCNN models are shown in Table 3. TextLSTM has the same parameter counts for most layers, except “Post-Embed Layers”, which has 98,737 parameters instead of 24,553, resulting in a total parameter count increase of 74,184 (98,737 − 24,553).

Table 3. Parameter counts.

The parameter counts for the TextCNN layers. TextLSTM has the same parameter counts, except “Post-Embed Layers” has 98,737 parameters instead of 24,553 for a difference of 74,184 parameters.

https://doi.org/10.1371/journal.pone.0300796.t003

For TextCNN with AttnToNum embeddings, the mean calibration slope across the 10 experiment runs is 0.930±0.079, and the mean calibration intercept is 0.056±0.035.

For TextLSTM with AttnToNum embeddings, the mean calibration slope across the 10 experiment runs is 1.007±0.090, and the mean calibration intercept is 0.011±0.042.

Embedding-comparison permutation tests

For TextCNN and TextLSTM, the permutation tests produced nearly identical 2-sided p-values as shown in Table 4.

Table 4. Permutation tests.

Permutation tests with TextCNN and TextLSTM.

https://doi.org/10.1371/journal.pone.0300796.t004

Embedding correlations

Pre-attention (ScaleNum) and post-attention (AttnToNum) embedding correlations are shown in Fig 3. The embeddings are obtained from the AttnToNum embedding layer in experiment 10’s TextCNN.

Fig 3. BUN & creatinine embedding correlations.

The bottom subplot of BUN vs creatinine values extracted from the text reveals why some embeddings are nearly identical, such as “creatinine 4.0 (H)” & “creatinine 8.0 (H)”.

https://doi.org/10.1371/journal.pone.0300796.g003

As an illustrative example, Table 5 utilizes the same TextCNN instance to obtain correlations between number embeddings for similar phrases in each stage of the AttnToNum embedding.

Table 5. Embedding stage and correlation.

Embedding correlations between extracted numbers in each step of the AttnToNum pipeline: i.e., correlations between the embeddings for 3.4 and 3.8, and between 25.0 and 3.8. Note that 3.4 is a low BUN value, 3.8 is a high creatinine value, and 25.0 is a high BUN value.

https://doi.org/10.1371/journal.pone.0300796.t005

ngram importance with SHAP-N

The top 10 features of importance for the best performing experiment for AttnToNum with TextCNN, experiment 10, are shown in Table 6.

Table 6. Features of importance.

Top 10 features of importance for experiment 10, the best performing experiment for AttnToNum with TextCNN. The “_date_” tag replaced dates.

https://doi.org/10.1371/journal.pone.0300796.t006

Fig 4 shows a plot of SHAP-N values for different numerical values of BUN and creatinine. The SHAP-N values are associated with trigrams that follow the pattern of “[BUN|creatinine] [value] ([L|H])”. The specific model instance is the TextCNN model trained during experiment 10.

Fig 4.

Top to bottom: Experiment 10, SHAP-N values associated with (a) creatinine and (b) BUN values extracted from ngrams (e.g., “BUN 1.1 (L)”). Note: because the embeddings are context-based, the values will not always be the same for each ngram. The presented SHAP values correspond to trigrams that follow the format of “[BUN|creatinine] [value] ([L|H])”.

https://doi.org/10.1371/journal.pone.0300796.g004

Attention visualizations

We present attention weights for each of the 5 attention heads for the best-performing model, TextCNN with AttnToNum embeddings from experiment 10, for the phrases: “_date_ creatinine 6.3 (H)”, “_date_ creatinine 6.3 (L)”, “_date_ BUN 6.3 (H)”, and “_date_ BUN 6.3 (L)” (Table 7).

Table 7. Attention weights.

The attention weights for TextCNN (AttnToNum embeddings) of the 10th experiment for phrases of the template: “_date_ (BUN|creatinine) 6.3 (H|L).” The 2nd head focuses on the number’s magnitude if the physician labeled the test as high, (H). The 4th and 5th heads focus on physician label and _date_, respectively. The 3rd head focuses on test label if the label is low, (L). The 1st head focuses on metric if the metric is creatinine.

https://doi.org/10.1371/journal.pone.0300796.t007

Discussion

This study demonstrates the ability of artificial intelligence (AI) to predict the MM outcome after CABG using a small sample of the unstructured data in the EHR, a preoperative surgical consult note. The presented techniques surpass the baseline Society of Thoracic Surgeons (STS) model, which is among the most credible traditional risk assessment models [4, 18, 19]. According to a recent meta-analysis, the STS risk calculator achieves an AUC performance of 0.76, with a 95% CI of (0.73, 0.79) [20]. In addition, some of the STS risk calculator inputs are manually extracted from unstructured data. An automated AI enables providers to spend more time on clinical duties.

Several recent analyses identify CABG-associated risk from structured EHR data [21–24]. Two of these analyses achieve similar or superior performance to our analysis, with AUCs of 0.811 [22] and 0.831 [23] for post-CABG mortality and bleeding, respectively. These previous results, together with our study’s performance, indicate that there is relevant signal in both structured and unstructured EHR data; we hypothesize that a model that combines both will surpass models trained on either modality alone.

Our pipeline embeds numerical magnitude. In comparison, standard tokenization treats two numbers of similar magnitude as two unrelated tokens. If the sample size is large, this may not be an issue, since the model can learn the numerical semantic space from training-set encounters with similar contexts or with the exact numbers. Small datasets do not provide as thorough a coverage of this semantic space, so the model may neither learn to treat similar-magnitude numbers in a similar manner nor encounter as many numbers during training. ScaleNum provides this numerical scaling capability of treating similar-magnitude numbers in a similar manner.

Furthermore, context is important: “BUN 12.0” and “creatinine 12.0” carry nearly opposite meanings. We therefore developed AttnToNum, a localized attention technique that updates the numerical embeddings by incorporating local context. The correlations between post-attention embeddings confirm that context is learned, as shown in Table 5 (Embedding Stage and Correlation) and Fig 3. Both Table 5 and Fig 3 indicate that similar-magnitude numbers can lead to post-attention embedding vectors with a negative correlation, even if the pre-attention correlation is close to 1 due to the similar magnitude. As a specific example, Table 5 shows that while ScaleNum outputs nearly identical numerical embeddings for 3.4 and 3.8 in the phrases “BUN 3.4 (L)” and “creatinine 3.8 (H)”, AttnToNum outputs embeddings that capture the opposite meanings of these numbers (3.4 is a low BUN value, but a high creatinine value) with embedding vectors in nearly opposite directions (correlation of -0.876). Table 7 (Attention Weights) indicates that the multi-headed attention applies different heads to learn context and magnitude. The 2nd attention head learns to focus on number magnitude if the label is high (H), while the other heads focus on context. If the label is low (L), the attention heads do not focus on magnitude, suggesting that low BUN and creatinine values will have the same or similar embedding vectors as other low, but different, BUN and creatinine values. As further confirmation, Fig 4 shows that the model (1) associates high BUN/creatinine values with MM; and (2) can interpret which number magnitudes are relatively high depending on context.

In terms of performance, both AttnToNum and ScaleNum statistically significantly outperform basic tokenization. ScaleNum does not statistically significantly outperform Replacement of Out-of-Context tokens (ROOC), and AttnToNum does not statistically significantly outperform ScaleNum. However, the mean AUC, confidence intervals, and simulation 2-sided p-values still indicate a performance difference (Fig 2). Therefore, this embedding technique may benefit the medical research community as a baseline model to replace standard tokenization under certain scenarios.

Despite the extra layers, AttnToNum and ScaleNum have approximately 15% fewer parameters than Basic, see Table 3 (Parameter Counts). This results from the decrease in unique terms: numbers are not tokenized for AttnToNum and ScaleNum. The reduction in terms leads to fewer rows in the parameter-heavy embedding layer matrix. Most of this reduction results from the removal of out-of-context numbers, as shown by the much lower parameter count for ROOC than for Basic.

A slight increase in model complexity (~600 parameters, a 0.1% increase) with the inclusion of number embedding enables the model to capture numeric information by learning parameters for a scaling function that outputs similar embeddings for similar-magnitude numbers. The learned scaling function, g(x) = σ(a log(x) + b) = x^a e^b / (x^a e^b + 1), is similar to a sigmoid in that it compresses numbers to a value between 0 and 1. However, in this case the sigmoid is applied to a linear function of log(x). The learned parameters adjust the rate of change, a, and where the function starts increasing, b. Therefore, different parameters capture meaning from numbers at different scales; this multi-scale information is made available to downstream layers through the summation of the 5 g(x) vectors into a single vector.

A more significant increase in model complexity via an attention layer (~10,000 parameters, 2%) also led to a performance increase. We hypothesize that the attention mechanism with local context informs the model which scale of the numerical embedding is relevant.

Despite the ability of LSTM models to capture long-range context, TextCNN and TextLSTM achieved similar performance. We hypothesize that this results from a property of the notes: related concepts are located close together, which nullifies the long-range memory capabilities of the LSTM. For example, symptoms, findings, and similar items are often listed in a bulleted format.

Finally, the SHAP-N values for numerical BUN and creatinine ngrams also cohered with clinical knowledge: both high BUN and high creatinine values are associated with MM.

In summary, this study represents promising evidence that AI-based automated approaches can be capable of predicting CABG risk and may eventually replace traditional risk models. We contribute 2 new embedding techniques for numerical values, AttnToNum and ScaleNum, and straightforward steps to calculate TextCNN SHAP values for ngrams.

Limitations

We did not test transformer models, a class of models which includes BERT and other large language models (LLM). Such comparative experiments would likely reveal key insights, but the scope of this paper is already wide and the different techniques and prompts to train these advanced models can form a large body of analyses. Therefore, we aim to analyze possible performance benefits of including our numerical embeddings with LLMs in future research.

Our numerical embedding operates only on strictly positive numbers due to the log function. Another implementation could overcome this issue, for example by clamping the minimum value and adding an offset, or by using a scaling function other than log.

The hardcoded context hyperparameters that specify the number of left and right tokens correspond to an a priori injection of domain knowledge into the model. Ideally, these would not be hyperparameters, but attention weights learned from the data. For small clinical datasets, this a priori guidance can boost performance. For large datasets with higher-dimension embeddings, it may be possible to train more robust attention transformations.

Reproducibility

Due to the sensitive protected health information (PHI) in the data, we instead release the code to train models on a simulated dataset sampled from corpora provided by the Python NLTK library [29]. The code can be found here.

Acknowledgments

This research was performed as part of the employment of the authors at the Medical University of South Carolina.

References

  1. Etienne H., et al. Artificial intelligence in thoracic surgery: past, present, perspective and limits. European Respiratory Review, 2020. 29: p. 200010. pmid:32817112
  2. Nedadur R., Wang B., and Yanagawa B. The cardiac surgeon’s guide to artificial intelligence. Current Opinion in Cardiology, 2021. 36. pmid:34397469
  3. Dias R., Shah J., and Zenati M. Artificial intelligence in cardiothoracic surgery. Minerva Cardioangiologica, 2020. pmid:32989966
  4. Mestres C.A., Quintana E., and Pereda D. Will artificial intelligence help us in predicting outcomes in cardiac surgery? Journal of Cardiac Surgery, 2022. 37. pmid:36001760
  5. Khalsa R.K., et al. Artificial intelligence and cardiac surgery during COVID-19 era. Journal of Cardiac Surgery, 2021. 36: p. 1729–1733. pmid:33567126
  6. Montisci A., et al. Big Data in cardiac surgery: real world and perspectives. Journal of Cardiothoracic Surgery, 2022. 17. pmid:36309702
  7. Naruka V., et al. Machine learning and artificial intelligence in cardiac transplantation: A systematic review. Artificial Organs, 2022. 46. pmid:35719121
  8. Kilic A. Artificial Intelligence and Machine Learning in Cardiovascular Health Care. The Annals of Thoracic Surgery, 2020. 109. pmid:31706869
  9. Kanwar M.K., Kilic A., and Mehra M.R. Machine learning, artificial intelligence and mechanical circulatory support: A primer for clinicians. The Journal of Heart and Lung Transplantation, 2021. 40. pmid:33775520
  10. Han S., et al. Classifying social determinants of health from unstructured electronic health records using deep learning-based natural language processing. Journal of Biomedical Informatics, 2022. 127. pmid:35007754
  11. Kilic A., et al. Supplementing Existing Societal Risk Models for Surgical Aortic Valve Replacement With Machine Learning for Improved Prediction. Journal of the American Heart Association, 2021. pmid:34658259
  12. Kalisnik J.M., et al. Artificial intelligence-based early detection of acute kidney injury after cardiac surgery. European Journal of Cardio-Thoracic Surgery: official journal of the European Association for Cardio-thoracic Surgery, 2022. 62,5. pmid:35521994
  13. Kilic A., et al. Performance of a Machine Learning Algorithm in Predicting Outcomes of Aortic Valve Replacement. The Annals of Thoracic Surgery, 2021. 111,2.
  14. Fliegenschmidt J., et al. Artificial intelligence predicts delirium following cardiac surgery: A case study. Journal of Clinical Anesthesia, 2021. 75. pmid:34333447
  15. Mahayni A.A., et al. Electrocardiography-Based Artificial Intelligence Algorithm Aids in Prediction of Long-term Mortality After Cardiac Surgery. Mayo Clinic Proceedings, 2021. 96,12. pmid:34863396
  16. Kilic A., et al. Predictive Utility of a Machine Learning Algorithm in Estimating Mortality Risk in Cardiac Surgery. The Annals of Thoracic Surgery, 2019. 109(6). pmid:31706872
  17. Chen M. and Decary M. Artificial intelligence in healthcare: An essential guide for health leaders. Healthcare Management Forum, 2020. 33. pmid:31550922
  18. Sullivan P.G., Wallach J.D., and Ioannidis J.P.A. Meta-Analysis Comparing Established Risk Prediction Models (EuroSCORE II, STS Score, and ACEF Score) for Perioperative Mortality During Cardiac Surgery. The American Journal of Cardiology, 2016. 118,10. pmid:27687052
  19. Jin R., et al. Calibration Factors for STS Risk Model Predictions: Why, How and When They Are Used. The Annals of Thoracic Surgery, 2022.
  20. Goncharov M., et al. Mortality risk prediction in high-risk patients undergoing coronary artery bypass grafting: Are traditional risk scores accurate? PLoS One, 2021.
  21. Nomali M., et al. Risk factors of in-hospital mortality for isolated on-pump coronary artery bypass graft surgery in the northeast of Iran from 2007 to 2016. Irish Journal of Medical Science, 2023. pmid:36763195
  22. Khalaji A., et al. Machine learning algorithms for predicting mortality after coronary artery bypass grafting. Frontiers in Cardiovascular Medicine, 2022. 9. pmid:36093147
  23. Gao Y., et al. Machine learning algorithms to predict major bleeding after isolated coronary artery bypass grafting. Frontiers in Cardiovascular Medicine, 2022. 9. pmid:35966564
  24. Sattartabar B., et al. Sex and age difference in risk factor distribution, trend, and long-term outcome of patients undergoing isolated coronary artery bypass graft surgery. BMC Cardiovascular Disorders, 2021. 21. pmid:34556032
  25. Kim Y. Convolutional Neural Networks for Sentence Classification. arXiv e-prints, 2014.
  26. Lu H., Ehwerhemuepha L., and Rakovski C. A comparative study on deep learning models for text classification of unstructured medical notes with various levels of class imbalance. BMC Medical Research Methodology, 2022. 22,1 181.
  27. Thawani A., et al. Representing Numbers in NLP: a Survey and a Vision. arXiv e-prints, 2021: p. arXiv:2103.13136.
  28. Zhao W., et al. SHAP values for Explaining CNN-based Text Classification Models. arXiv e-prints, 2020.
  29. Bird S., Loper E., and Klein E. Natural Language Processing with Python. 2009: O’Reilly Media Inc.