Can we predict when to start renal replacement therapy in patients with chronic kidney disease using 6 months of clinical data?

Purpose We aimed to develop a model of chronic kidney disease (CKD) progression for predicting the probability and time to progression from various CKD stages to renal replacement therapy (RRT), using 6 months of clinical data variables routinely measured at healthcare centers.
Methods Data were derived from the electronic medical records of Ajou University Hospital, Suwon, South Korea from October 1997 to September 2012. We included patients who were diagnosed with CKD (estimated glomerular filtration rate [eGFR] < 60 mL·min⁻¹·1.73 m⁻² for ≥ 3 months) and followed up for at least 6 months. The study population was randomly divided into training and test sets.
Results We identified 4,509 patients who met reasonable diagnostic criteria. Patients were randomly divided into 2 groups, and after excluding patients with missing data, the training and test sets included 1,625 and 1,618 patients, respectively. The integral mean had the greatest explanatory power (R² = 0.404) among the 8 modified values. Ten variables (age, sex, diabetes mellitus [DM], polycystic kidney disease [PKD], serum albumin, serum hemoglobin, serum phosphorus, serum potassium, eGFR calculated by the Chronic Kidney Disease Epidemiology Collaboration [CKD-EPI] equation, and urinary protein) were included in the final risk prediction model for CKD stage 3 (R² = 0.330). Ten variables (age, sex, DM, GN, PKD, serum hemoglobin, serum blood urea nitrogen [BUN], serum calcium, eGFR calculated by the Modification of Diet in Renal Disease [MDRD] equation, and urinary protein) were included in the final risk prediction model for CKD stage 4 (R² = 0.386). Four variables (serum hemoglobin, serum BUN, eGFR calculated by the MDRD equation, and urinary protein) were included in the final risk prediction model for CKD stage 5 (R² = 0.321).
Conclusion We created a prediction model according to CKD stages by using integral means. Based on the results of the Brier score (BS) and Harrell's C statistic, we consider that our model has significant explanatory power to predict the probability and interval time to the initiation of RRT.


Introduction
The incidences of chronic kidney disease (CKD) and end-stage renal disease (ESRD) have been increasing rapidly [1]. The overall prevalence of CKD was found to be 8.2% in South Korea according to a study published in 2016 [2], and most patients with CKD have concerns about starting dialysis or undergoing transplantation. However, accurate prediction of disease progression and of the timing of renal replacement therapy (RRT) remains problematic because there is no accepted predictive tool for CKD progression that is both effective and precise. In clinical practice, physicians commonly judge a patient's future disease progression from a few recent measurements of glomerular filtration rate (GFR) or serum creatinine.
Therefore, physicians have difficulty in deciding which patients will ultimately progress to kidney failure and when they will require RRT. Identifying patients at risk of CKD progression may facilitate more optimal nephrology care. In the present study, we aimed to develop a model of CKD progression for predicting the probability and time to progression from CKD to RRT, using 6 months of clinical data variables routinely measured at healthcare centers. Such a model could provide more precise predictions than the commonly used Kidney Disease: Improving Global Outcomes (KDIGO) CKD stages, which are based on eGFR and albuminuria.

Data source
The data were derived from the electronic medical record (EMR) database at Ajou University Hospital, Suwon, South Korea, from October 1997 to September 2012. This database contains information on patients and medical records, and includes data from all medical departments in the hospital. We extracted the data without personal identification to ensure patient confidentiality. The study was approved by the institutional review board of Ajou University Hospital.

Study population
Study set. We included patients who were diagnosed with CKD and followed up for at least 6 months. The diagnostic criterion for CKD is an estimated glomerular filtration rate (eGFR) < 60 mL·min⁻¹·1.73 m⁻² for ≥ 3 months [3]. We calculated eGFR with both the Modification of Diet in Renal Disease (MDRD) study equation and the Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) equation, and included patients if either equation yielded an eGFR < 60 mL·min⁻¹·1.73 m⁻². We excluded patients who were < 19 years old and those who had undergone RRT within 6 months of the study.
Training set and test set. We randomly divided the final study population into a training set and a test set for the verification process (Fig 1).
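The "either equation" inclusion rule can be sketched as follows. This is an illustration only: it uses the widely published IDMS-traceable 4-variable MDRD equation (constant 175) and the 2009 CKD-EPI creatinine equation, and the text does not state which MDRD variant was applied. The function names are hypothetical, and the sketch checks a single creatinine value rather than persistence for ≥ 3 months.

```python
def egfr_mdrd(scr_mg_dl, age, female, black=False):
    """4-variable MDRD study equation (IDMS-traceable form, assumed here)."""
    egfr = 175.0 * scr_mg_dl ** -1.154 * age ** -0.203
    if female:
        egfr *= 0.742
    if black:
        egfr *= 1.212
    return egfr


def egfr_ckd_epi(scr_mg_dl, age, female, black=False):
    """CKD-EPI 2009 creatinine equation."""
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    ratio = scr_mg_dl / kappa
    egfr = (141.0 * min(ratio, 1.0) ** alpha
            * max(ratio, 1.0) ** -1.209 * 0.993 ** age)
    if female:
        egfr *= 1.018
    if black:
        egfr *= 1.159
    return egfr


def meets_egfr_criterion(scr_mg_dl, age, female, threshold=60.0):
    """True if either equation gives an eGFR below the threshold."""
    return (egfr_mdrd(scr_mg_dl, age, female) < threshold
            or egfr_ckd_epi(scr_mg_dl, age, female) < threshold)


# Hypothetical patient: 60-year-old woman with serum creatinine 2.0 mg/dL
print(meets_egfr_criterion(2.0, 60, female=True))
```

Because the two equations can disagree near the threshold, an "either" rule admits slightly more patients than requiring both equations to fall below 60.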

Observation period and study period
The observation period was defined as the interval from the initial day of observation to the day of initiation of RRT or the day of censoring. The initial day is the first day on which the eGFR decreased to < 60 mL·min⁻¹·1.73 m⁻². RRT included hemodialysis, peritoneal dialysis, and renal transplantation. The initial RRT point was defined as the first day of hemodialysis, the day of catheter insertion for peritoneal dialysis, or the day of surgery for renal transplantation. If we could not identify a renal replacement event, we regarded the last follow-up date as the last observation day. The study period refers to the 180 days from the initial day of observation.

Variables
The variables were as follows: demographic variables, including age and sex; comorbid conditions, including diabetes mellitus (DM), hypertension (HTN), glomerulonephritis (GN), systemic lupus erythematosus (SLE), and polycystic kidney disease (PKD); and laboratory variables, including levels of blood urea nitrogen (BUN), hemoglobin, serum creatinine, serum calcium, serum phosphate, serum albumin, serum bicarbonate, urinary creatinine, urinary protein, and urinary blood, as well as eGFR by the MDRD equation and eGFR by the CKD-EPI equation. We excluded urine albumin, urine hemoglobin, and urine creatinine levels as variables because they were not measured in more than 50% of the patients.
Data regarding laboratory examination and comorbidity variables were collected throughout the study period. For missing values, we included data for 30 days before the initial day of observation and 30 days after completion of the study period. Urinary protein by dipstick was reported semi-quantitatively as trace, 1+, 2+, 3+, or 4+, corresponding to albumin levels of 10, 30, 100, 300, or 1000 mg/dL, respectively. Urinary protein level was coded as 5 dummy variables (trace, 1+, 2+, 3+, 4+), with a negative result as the reference category. Criteria for the 5 comorbidities are described as follows.
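The dummy coding of the dipstick result can be illustrated with a short sketch. The function name, the column names, and the input strings are hypothetical; only the five levels and the negative reference category come from the text.

```python
PROTEIN_LEVELS = ["trace", "1+", "2+", "3+", "4+"]


def encode_urine_protein(result):
    """Return 5 dummy variables; a negative dipstick is the all-zero reference."""
    result = result.strip().lower()
    if result != "negative" and result not in PROTEIN_LEVELS:
        raise ValueError(f"unknown dipstick result: {result!r}")
    # Exactly one indicator is 1 for a positive result, none for negative
    return {f"protein_{level}": int(result == level) for level in PROTEIN_LEVELS}


print(encode_urine_protein("2+"))
print(encode_urine_protein("negative"))
```

With this coding, each coefficient in a regression model expresses the change in risk for that dipstick level relative to a negative result.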

Statistical analysis
Development of representative value. We developed 8 "modified values" that were potentially associated with CKD for 6 months and chose the "representative value" that demonstrated the greatest efficiency in a multivariate Cox proportional hazards regression model.
The modified values were: value at baseline, value at the end of the study period, minimum value, maximum value, ratio of the minimum to maximum values, slope of the minimum to maximum values, integral means, and slope of initial to integral means (details as follows).
(n = number of values, i = order, a = day of value recording, b = value on that day)

8. The slope of initial to integral means:

$$
\frac{\text{integral mean} - \text{value at baseline}}{\text{Day(maximum value)} - \text{Day(minimum value)} + 90\ \text{days}}
$$

Values were excluded if 50% of the cases had missing data. Urinary protein (categorical variable) was only available at baseline.
Model development. Multivariate Cox proportional hazards regression was used for model development. We created a prediction model according to CKD stages [4]. The probability of the patient not undergoing RRT at time t (years) is as follows [5].
The linear predictor of this model (the coefficient-weighted sum of the selected variables) is defined as the risk index (RI): an increased value indicates a greater probability of RRT. We selected variables using clinical guidance and backward elimination (Wald) methods. Variables that did not contribute to the explanatory power of the RRT predictive model were removed until all remaining variables were significantly related to RRT (p < 0.05).
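For orientation, the relationship between the RI and the probability of remaining free of RRT can be sketched with the textbook Cox form S(t) = S0(t)^exp(RI). This is the standard expression for a proportional hazards model and is assumed here, since the displayed equation is not reproduced in the text. The baseline survival value and the function name are hypothetical.

```python
import math


def survival_probability(baseline_survival, risk_index):
    """Probability of not undergoing RRT by time t: S0(t) ** exp(RI)."""
    if not 0.0 <= baseline_survival <= 1.0:
        raise ValueError("baseline survival must lie in [0, 1]")
    return baseline_survival ** math.exp(risk_index)


# Hypothetical baseline survival of 0.90 at some time t
for ri in (-1.0, 0.0, 1.0):
    print(ri, round(survival_probability(0.90, ri), 3))
```

A larger RI raises the exponent and therefore lowers the RRT-free probability, which matches the statement that an increased RI indicates a greater probability of RRT.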
Evaluation of model performance. To evaluate the expected prediction error of the training model, we calculated the Brier score (BS) [6] and Harrell's C statistic [7]. The BS is the mean squared difference between the observed outcome and the predicted probability. The higher the BS, the higher the expected error. A BS > 33% indicates predictions at a random level, and a BS close to 0% indicates near-perfect prediction.
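A minimal sketch of the Brier score for a single time horizon is given below. It is the plain, unweighted form and ignores censoring, so it is simpler than the weighted score used in survival analysis; the function name and the example numbers are hypothetical.

```python
def brier_score(predicted_probs, outcomes):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    if len(predicted_probs) != len(outcomes) or len(outcomes) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - y) ** 2 for p, y in zip(predicted_probs, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical predictions of RRT by a fixed horizon for four patients
print(brier_score([0.9, 0.2, 0.7, 0.1], [1, 0, 1, 0]))
```

For reference, a prediction drawn uniformly at random from [0, 1] has an expected squared error of 1/3 against any binary outcome, which is consistent with the 33% threshold, whereas a constant prediction of 0.5 always scores 0.25.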
Harrell's C statistic is a common and well-validated measure of discrimination. The higher the C statistic, the better the model can discriminate between subjects who experience the outcome of interest and subjects who do not. The C statistic provides an overall measure of predictive accuracy.
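The idea behind the C statistic can be shown with a simple pairwise count. The sketch below is a simplified version that skips pairs with tied observation times and is not the R implementation used in the study; the function name and the toy data are hypothetical.

```python
def harrell_c(times, events, risk_scores):
    """Share of comparable pairs in which the earlier event has the higher risk score."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had the event strictly before
            # subject j's last observed time (tied times are skipped here).
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: days observed, event flag (1 = RRT, 0 = censored), model risk score
print(harrell_c([100, 200, 300, 400], [1, 1, 0, 1], [2.1, 1.4, 0.3, 0.9]))
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking of patients by time to event.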
Software. We collected EMR data from Microsoft SQL Server 2012 and used PASW Statistics (18.0.0) (SPSS Inc., Chicago, IL, USA) for selecting representative values. The multivariate Cox proportional hazards regression model, BS, and Harrell's C statistic were analysed using R (version 3.4.3).

Patient selection
We identified 4,509 patients who met reasonable diagnostic criteria. Patients were randomly divided into 2 groups, and after the exclusion of patients with missing values, the training and test sets included 1,625 and 1,618 patients, respectively (Table 1).

Prediction model outcome
Representative values. We developed a multivariate Cox proportional hazards regression model with 8 modified values. We included 2,225 patients in the training set and considered all collected variables. All 8 modified values were significant, but the integral mean had the greatest explanatory power (R² = 0.404), except for the end value […] (Table 3). The risk is greater in patients who are female or elderly and in those who have DM. The higher the levels of serum albumin and eGFR (CKD-EPI), the lower the risk; the higher the levels of serum phosphorus and urine protein, the higher the risk. The model for CKD stage 4 that included 10 selected variables (age, sex, […] Table 4). The risk is greater in patients who are female and in those who have PKD. Table 5 shows the model for CKD stage 5, which had a risk predictive power of approximately 32%. The model for CKD stage 5 included 4 selected variables: serum hemoglobin, serum BUN, eGFR (MDRD), and urinary protein.

$$
\begin{aligned}
\ldots \times \mathrm{eGFR}(\text{CKD-EPI}) &+ 5.968 - 0.424\,(\text{if female}) + 0.527\,(\text{if DM is present}) + 1.068\,(\text{if PKD is present}) \\
&+ 1.085\,(\text{if urine protein} = 1+) + 1.253\,(\text{if urine protein} = 2+) \\
&+ 1.289\,(\text{if urine protein} = 3+) + 1.520\,(\text{if urine protein} = 4+)
\end{aligned}
$$

The RI of CKD stage 4 patients can be defined as follows.

Test set
Brier score. To evaluate the expected prediction error of the training set model, we calculated the weighted BS, which applies weights to account for censored data. The BS remains < 0.33 for approximately 5,000 days in the models for CKD stages 3 and 5, and for approximately 4,000 days in the model for CKD stage 4. Thus, the prediction model gives a marginal predictive result up to approximately 4,000-5,000 days (S1-S3 Figs).
Example cases of prediction model application. We analysed 2 cases in which the observation period was approximately 5 years, using the risk prediction model in the test set. Fig 2A shows the graph for the probability of the event for a 56-year-old female patient who experienced progression to RRT after 5 years. The probability of the event was > 80% at 3 years and > 95% at 5 years. Fig 2B shows the probability of the event in a 58-year-old male patient who did not experience progression to RRT after 5 years. The probability of an event was < 20% at 10 years.

Discussion
CKD is asymptomatic in the early stages, but symptoms appear in the later stages, accompanied by complications such as cardiovascular disease, anemia, infection, cognitive impairment, and impaired physical function [8][9][10][11]. The KDIGO clinical practice guideline suggested a prognostic classification system for CKD divided on the basis of 6 categories of GFR, 3 categories of albuminuria stage, and cause of disease. Based on these findings, KDIGO devised 3 broad risk categories based upon the likelihood of developing future kidney and cardiovascular complications [12]. However, eGFR assessment and ascertainment of albuminuria may not be sufficient for risk prediction in the clinic.
We considered many variables cited in previous articles that could affect renal function, including age, sex, laboratory findings, and comorbidities, to develop a risk prediction model. These included variables such as young age, male sex, African-American ethnicity, DM, HTN, obesity, urine protein, serum albumin, anemia, lipidemia, smoking, and cardiovascular disease [13]. In the Reduction of Endpoints in NIDDM with the Angiotensin II Antagonist Losartan (RENAAL) study, albuminuria, hypoalbuminemia, increased serum creatinine, and decreased hemoglobin were the risk factors associated with ESRD in patients with type 2 DM and nephropathy [14]. We collected data on the above variables, and identified data that were not measured in > 50% of the patients. Our study was performed retrospectively in order to identify missing variables that could significantly affect RRT.
We identified variables that were associated with RRT through the clinical guidance and backward elimination (Wald) methods. From a clinical point of view, models based on referral eGFR are more useful than an overall model. Predictions for a patient with an eGFR of 60 would probably only be interesting to the patient, while predictions for a patient with an eGFR of 15 are critical for dialysis preparation. We therefore performed separate analyses according to CKD stage: 1) CKD stage 3: age, sex, DM, PKD, levels of serum albumin, serum hemoglobin, serum phosphate, and serum potassium, eGFR, and urinary protein. 2) CKD stage 4: age, sex, DM, GN, PKD, levels of serum hemoglobin, serum BUN, and serum calcium, eGFR, and urinary protein. 3) CKD stage 5: levels of serum hemoglobin and serum BUN, eGFR, and urinary protein. The results were similar to those of previous studies. First, one study reported that the risk of progression to ESRD was decreased among older patients with CKD stage 3 (hazard ratio [HR], 0.75; 95% confidence interval, 0.63-0.89 for each 10-year increase in age) [15]. Second, another study showed that male patients with CKD stages 4 and 5 had a shorter time to RRT than did female patients [16]. Third, DM is thought to be rapidly becoming the most common cause of ESRD and is also associated with an increasing risk of ESRD [17]. In the African American Study of Kidney Disease and Hypertension (AASK) trial, the change in urinary protein level from baseline to 6 months predicted progression to RRT [18]. In the RENAAL study, baseline hemoglobin was an important independent variable for prediction of ESRD among diabetic patients [19]. Moreover, HTN has been found to be predictive of ESRD risk in several large population-based studies [17,20]. However, the presence of HTN was not an independent predictor of kidney failure events in the present study.
The RENAAL study showed similar findings, a result likely due to the fact that blood pressure was well controlled in the study patients [14].
To identify representative values that show renal function change over 6 months, we considered 8 modified values and developed a multivariate Cox proportional hazards regression model. The integral mean contains the time and the value in order to obtain sufficient power to explain the change in data over 6 months. The end value had the highest R², but the number of patients was inadequate to evaluate the model. We will compare the integral mean and end value in a larger dataset in a further study.
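As a rough illustration of how a summary that "contains the time and the value" can differ from a simple average, the sketch below computes an integral mean as the trapezoidal area under the value-versus-day curve divided by the elapsed time. This trapezoidal definition is an assumption made for illustration and may not match the formula used in the study; the function name and the sample eGFR readings are hypothetical.

```python
def integral_mean(days, values):
    """Time-weighted mean: trapezoidal area under the curve / elapsed days.

    Assumes days are sorted in strictly increasing order.
    """
    if len(days) != len(values) or len(days) == 0:
        raise ValueError("days and values must be non-empty and of equal length")
    if len(days) == 1:
        return float(values[0])
    area = 0.0
    for i in range(1, len(days)):
        # Average of two consecutive readings times the days between them
        area += (values[i] + values[i - 1]) / 2.0 * (days[i] - days[i - 1])
    return area / (days[-1] - days[0])


# Hypothetical eGFR readings: an early drop that then stays low for 170 days
days = [0, 10, 180]
egfr = [60.0, 30.0, 30.0]
print(integral_mean(days, egfr))   # weighted toward the long low period
print(sum(egfr) / len(egfr))       # simple mean ignores spacing between visits
```

When measurements are unevenly spaced, the time-weighted value gives more weight to levels that persisted longer, which a simple average of the recorded values does not.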
Finally, we developed the renal prediction model with several variables using integral means from continuous variables. To evaluate prediction error, we calculated the BS and Harrell's C statistic. From the results of the BS and Harrell's C statistic, we consider that our model has sufficient explanatory power to predict renal progression.
Our analysis has two strengths. First, we divided patients into 2 groups, the training set and the test set, which allowed us to calculate the BS and Harrell's C statistic in order to confirm the accuracy of the model. Second, the prediction equation includes only variables that are routinely available in the nephrology clinic, for convenience of use. Local healthcare facilities can collect laboratory data easily and integrate the risk prediction tool into decision-making for patients who require further evaluation or preparation for RRT.
Our analysis has several limitations. First, the study was performed retrospectively, and therefore some data were insufficient, including blood pressure measurements, which were an important predictor in previous studies [17,20]. We considered many variables from previous studies while developing the risk prediction tool, but insufficient data were available for evaluation from the EMR. Second, patients with missing data were excluded. Because data are usually selectively missing, this causes selection bias. Third, all of our study subjects were Asian, in particular Korean, which limits the applicability of the results to Western countries. Fourth, there is no standard procedure for determining the initiation of RRT; therefore, initiation of therapy may reflect personal opinions, and patients' economic, social, and environmental factors may also affect the timing. However, the selection of the test set and the training set from the same hospital in the present study meant that the prediction error was reduced because the characteristics of patients in the training set and test set were similar. Fifth, our study lacks renal diagnoses. Because we collected data from the EMR rather than through detailed chart review, we were not able to present the primary cause of ESRD. Instead, we considered comorbidities with high prevalence as the primary cause of ESRD in Korea [21].
Many studies have identified a wide range of risk factors for the progression of CKD. Although many studies have identified similar risk factors, there has not been sufficient research on risk prediction models for RRT. To develop accurate and easy-to-use models, further large prospective studies are required. Our predictive model for CKD may have sufficient power to predict RRT, as shown in 2 cases in the present study. However, there are also cases that did not fit the model. If data were collected from a greater number of patients with greater accuracy, a more precise model could be developed. The development of the representative value seems very complicated for a prediction tool in a clinical setting. Simplifying this model could be achieved through cooperation among nephrologists and statisticians.
In summary, a model was developed and validated to predict the risk for ESRD. This model uses commonly available clinical variables and may provide more precise predictions than the commonly used KDIGO CKD stages, which are based on eGFR and albuminuria.