Machine learning model for predicting severity prognosis in patients infected with COVID-19: Study protocol from COVID-AI Brasil

The new coronavirus, named SARS-CoV-2, is a single-stranded RNA betacoronavirus, initially identified in Wuhan (Hubei province, China) and currently spreading across six continents, causing considerable harm to patients, with no specific tools so far to provide prognostic outcomes. The aim of this study is therefore to evaluate possible findings on chest CT of patients with signs and symptoms of respiratory syndromes and positive epidemiological factors for COVID-19 infection, and to correlate them with the course of the disease. We also expect to develop a machine learning algorithm for this purpose, based on pulmonary segmentation, that can predict possible prognostic factors with more accurate results. Our alternative hypothesis is that a machine learning model based on clinical, radiological and epidemiological data will be able to predict the severity prognosis of patients infected with COVID-19. We will perform a multicenter retrospective longitudinal study to obtain a large number of cases in a short period of time, for better study validation. Our convenience sample (at least 20 cases for each outcome) will be collected in each center considering the inclusion and exclusion criteria. We will evaluate patients who entered the hospital with clinical signs and symptoms of acute respiratory syndrome from March to May 2020. We will include individuals with signs and symptoms of acute respiratory syndrome, with a positive epidemiological history for COVID-19, who have undergone a chest computed tomography. We will assess the chest CT of these patients and correlate the findings with the course of the disease. Primary outcomes: 1) Time to hospital discharge; 2) Length of stay in the ICU; 3) Orotracheal intubation; 4) Development of Acute Respiratory Distress Syndrome.
Secondary outcomes: 1) Sepsis; 2) Hypotension or cardiocirculatory dysfunction requiring the prescription of vasopressors or inotropes; 3) Coagulopathy; 4) Acute myocardial infarction; 5) Acute renal insufficiency; 6) Death. We will use the AUC and F1-score of these algorithms as the main metrics, and we hope to identify algorithms capable of generalizing their results for each specified primary and secondary outcome.

This is a Registered Report and may have an associated publication; please check the article page on the journal site for any related articles.

Introduction
The first coronaviruses discovered were responsible for respiratory and intestinal infections, the vast majority of which had a self-limited course and led mostly to common cold symptoms. However, they can occasionally cause serious infections in at-risk groups (patients with heart disease, diabetes, among others), in the elderly, and also in children. Before the beginning of the current pandemic, two highly pathogenic coronavirus species (SARS and MERS) had been described and were responsible for outbreaks of severe acute respiratory syndromes [1]. The new coronavirus, SARS-CoV-2, was recognized as the causative agent of a pneumonia that can lead to severe acute respiratory syndrome, the resulting disease being named COVID-19. One of its main challenges is its rapid transmission and, in some cases, progression to severe pulmonary conditions, which have demanded from health systems a care and combat strategy never seen before anywhere in the world [1][2][3].
In Brazil, exponential growth is expected, which is why drastic measures to control population circulation and to prevent transmission need to be implemented. Owing to the speed of transmission, in most countries, including Brazil, early-stage preventive measures were not implemented, causing an explosion of symptomatic cases, many of them severe, with prolonged demand on tertiary health services.
Given this scenario, the inevitable emergence of a large contingent of critically ill patients with COVID-19, with different prognoses, made it crucial to search for early diagnostic mechanisms for better screening and treatment adequacy in each case.
In this context, even before confirmation of the infection, screening for patients with respiratory symptoms is carried out through clinical analysis and imaging tests such as Chest Computed Tomography (CT).
Chest CT is considered the best imaging method for the assessment of COVID-19, since conventional radiography has low sensitivity, notably in early stages. Typical findings described in the literature include ground-glass opacities (GGO) with a predominantly peripheral distribution, associated with septal thickening and consolidations, usually affecting multiple lobes, although these findings can also be found in other viral pneumonias [5][6][7].
In this way, machine learning applied to diagnostic imaging can enable the development of tools that can standardize the diagnosis and provide potential findings suggestive of the presence of the disease, its severity, and therefore its prognosis.
Since the beginning of the pandemic, given its emergency, several studies have become available attempting to predict the worst outcomes. The main possible risk factors were evaluated in retrospective studies [7][8][9]. Most of these studies point to advanced age, obesity and other comorbidities (diabetes, severe asthma and other respiratory diseases, heart, liver, neurological and kidney diseases, and autoimmune diseases) as the main drivers of a worse outcome [8,9]. Machine learning has been used by a growing number of studies in this scenario and in other health-related fields, ranging from helping with diagnosis to providing more robust evidence for resource allocation [10,11]. Adly et al. [11] advocated that further studies are needed to show the full potential of this tool in clinical practice. The major advantage of using machine learning is that we can combine different variables (demographic and clinical data, laboratory assays and imaging) at a large scale, with reduced rates of misdiagnosis, while providing fruitful insights into several aspects of the disease. In our protocol, we will extend the current state of the art by 1) acquiring data from hundreds of patients hospitalized in nine different private and public institutions in Brazil, 2) removing human error and the high interrater variability in the evaluation of chest CTs, and 3) predicting the probability of different outcomes (time to hospital discharge; length of stay in the ICU; orotracheal intubation due to acute respiratory failure; development of acute respiratory distress syndrome; and also the secondary outcomes described later in this protocol).

PLOS ONE
AI is a broad term in computer science that encompasses a wide variety of subfields and techniques, and is used to describe software that performs tasks associated with the human mind, such as "learning" and "problem solving" [9]. Currently, there is rapid evolution in the capacity of AI solutions, mainly due to the increase in available computing technologies, leading to the emergence of tools with performance comparable to that of humans for complex tasks such as translation, image classification, object detection and recognition, and voice control in automotive vehicles [12][13][14][15][16].
Recently, this method of data science has been used in diagnostic imaging through applications in different clinical contexts, such as the detection of radiological findings, intelligent scheduling, reduction of the time to acquire medical images, improvement of the technique and automated decision for clinical support, among others [12,17].
As the performance of these algorithms is dependent on data used for their training, there is a genuine risk that their execution will be unsatisfactory when applied to data from other institutions [16]. Therefore, for the adoption of these applications, it is necessary to validate their results, through a careful analysis of effectiveness, quality, and safety [18].
The purpose of this study is to validate the results of AI algorithms in the stratification of pulmonary changes on chest CT, using multicenter data and scientific methodology, in order to correlate pulmonary involvement with the clinical outcome of patients.

Hypothesis
The machine learning model based on clinical, radiological and epidemiological data will be able to identify imaging and clinical factors correlated with disease severity, and possibly estimate prognosis (hospitalization needs, ICU admission, orotracheal intubation), especially in pandemic times.

Main objective
Develop a model for stratification of pulmonary changes using AI algorithms. The model will receive a chest CT as input and will output a stratification of the lung parenchyma, discerning lung regions with different densities.

Secondary objectives
• To evaluate possible changes in chest CT, through a score, that suggest a worse prognosis in patients with COVID-19, and to identify patterns correlated with worse clinical courses, in order to guide, in the prospective unfolding of the study, the evaluation of prognostic markers arising from automated analyses of chest CT and to contribute to prioritizing treatment according to severity (orotracheal intubation, hospitalization).
• Organize a database of medical images and their respective anonymised reports for the CT modality, covering different morpho-functional changes, in patients with acute respiratory syndromes.
• Evaluate the performance of AI algorithms in this data for tasks such as classification, segmentation, image registration and interpretation of reports.
• Evaluate the impact of the use of these models on medical practice of imaging professionals.

Materials and methods
We will conduct a retrospective longitudinal multicenter study (9 institutions).

Study population
Patients meeting the following characteristics will be considered eligible for the study: 1. Signs and symptoms of acute respiratory syndrome; 2. Positive epidemiological history for COVID-19, which may include recent contact (last 14 days) with a confirmed or suspected case, a recent trip (last 14 days) to a high-incidence location, or presentation of symptoms after the start of the community transmission phase of SARS-CoV-2 (after 3/20/2020) at the date of hospitalization;
3. Having undergone, when symptomatic, a chest computed tomography.

Exclusion criteria
1. Presence of neoplastic lung lesions (primary or metastatic), manifested as nodules, masses, consolidations, septal thickening (lymphangitic carcinomatosis) or pleural thickening; 2. Chest CT with movement, acquisition or reconstruction artefacts that make it impossible to apply the segmentation algorithms; 3. CT exams with low-quality pulmonary segmentation or slice thickness greater than 3.0 mm.

Composition of study outcomes
The outcomes of this study were defined as follows:

Primary outcomes
1. Time to hospital discharge (length of stay, LOS), defined as the period (in days) between the date of admission and the date of discharge (or death); 2. Length of stay in the ICU (ICU LOS), defined as the period (in days) elapsed between admission to and discharge (or death) from the ICU; 3. Orotracheal intubation due to acute respiratory failure; 4. Development of acute respiratory distress syndrome [19].
Secondary outcomes
1. Sepsis; 2. Hypotension or cardiocirculatory dysfunction requiring the prescription of vasopressors or inotropes; 3. Coagulopathy; 4. Acute myocardial infarction; 5. Acute renal insufficiency; 6. Death.
We will use the AUC and F1-score of these algorithms as the main metrics, and we hope to identify algorithms capable of generalizing their results for each specified primary and secondary outcome. Fig 1 shows the study flow according to the protocol.

Sample size calculation
In this study, one of the main questions is whether chest CT is a useful predictor of severity and worse evolution of COVID-19 pneumonia. A recent study using a similar methodology [21] showed a difference of approximately 17.3 points in the mean chest CT score between the "survival group" and the "mortality group" among hospitalized patients infected with COVID-19. In that study, the pooled standard deviation of the groups was approximately 17.5, resulting in an effect size of 0.99. In our study, we would like to know how many patients with a "favourable clinical outcome" and with more "severe outcomes" we would need. As we expect more patients in the "favourable clinical outcome" group, we made our sample size calculation considering a 3:1 ratio. If the true difference between group means in our study is 17.3 CT score points, we will need to study at least 15 cases in the "severe outcome" group and 45 control subjects in the "favourable outcome" group to be able to reject the null hypothesis that the population means of both groups are equal with a power of 0.9. The Type I error probability associated with this test of the null hypothesis is 0.05.
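Under these assumptions, the calculation can be reproduced with the standard normal-approximation formula for a two-sided two-sample comparison with unequal group sizes. This is a sketch, not the exact t-based computation a dedicated power package would run (which can return marginally larger numbers):

```python
from math import ceil
from statistics import NormalDist

def min_group_sizes(effect_size: float, alpha: float = 0.05,
                    power: float = 0.9, ratio: float = 3.0) -> tuple[int, int]:
    """Normal-approximation sample sizes for a two-sided two-sample t-test
    with unequal group sizes (n2 = ratio * n1)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)            # ~1.28 for power = 0.9
    n1 = (1 + 1 / ratio) * ((z_alpha + z_beta) / effect_size) ** 2
    n1 = ceil(n1)
    return n1, ceil(ratio * n1)

# Effect size 0.99 (17.3-point mean difference / 17.5 pooled SD), 3:1 ratio
print(min_group_sizes(0.99))  # → (15, 45): severe group, favourable group
```

With a 1:1 ratio the same formula asks for roughly 22 subjects per group, which illustrates why the unequal allocation keeps the rarer "severe outcome" group small.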

Initial clinical evaluation
• Electronic collection and data management: We will develop all data collection forms in the REDCap system (Vanderbilt University, Tennessee, United States), which is hosted on the servers of the participating centers. We will use three main functionalities of this system: a) electronic data collection; b) data management; c) study flow management.
• Demographics and clinical history of patients: We will survey the history of each participant by consulting the electronic database of each participating center and by reviewing the medical records. We will collect the following information:
• Patient history: comorbidities (systemic arterial hypertension, diabetes mellitus, asthma, chronic obstructive pulmonary disease, heart disease, dialysis or renal failure, neoplasms, neutropenia, AIDS), in addition to smoking status and use of medications (notably ACE inhibitors and angiotensin receptor blockers).
We will also analyse data such as length of stay in the ICU and in the hospital unit, as well as the type of treatment during hospitalization (high-flow oxygen, noninvasive ventilation (NIV), orotracheal intubation/invasive ventilation, orotracheal intubation + extracorporeal membrane oxygenation (ECMO), PaO2/FiO2, prone positioning, haemodialysis).
• Laboratory evaluation: The following blood tests (complete blood count, creatinine, C-reactive protein, D-dimer, IL-6, INR, troponin, proBNP, bicarbonate, lactate) will be taken from the medical records, and identification of IgM and IgG antibodies will be considered for confirmation of SARS-CoV-2 infection, in addition to nasal and oral swabs. Patients positive for IgG antibodies since the beginning of the study will already be considered outcome patients, for whom history data will be thoroughly collected from the medical records, especially regarding respiratory symptoms.
• Evaluation by chest CT: Eligible chest CT scans should be performed with volumetric acquisition in a multidetector device, without the use of iodinated contrast medium.
After complete analysis of the inclusion criteria, we will classify the exam into one of these standards [22]:

1. Typical COVID-19 pattern
• Multiple GGO (associated or not with consolidations, septal thickening and "reversed halo" opacities), of peripheral location, some with a rounded shape, predominating in the lower fields.

2. Indeterminate COVID-19 pattern
• GGO (associated or not with consolidations) of sparse bilateral ill-defined limits, without peripheral predominance and without nodular configuration.

3. Atypical COVID-19 pattern
• Isolated segmental or lobar consolidation not associated with ground glass, centrilobular micronodules (including the "tree-in-bud" appearance), cavitation, or septal thickening associated with pleural effusion.

4. Negative exam
• Absence of infectious lung changes.
The extent of pulmonary involvement will be calculated by a tomographic index, where each lung is divided into three regions: upper (above the carina), middle (between the carina and the inferior pulmonary veins) and lower (below the inferior pulmonary veins). For each region (a total of 6 per patient), a score will be assigned according to the amount of affected parenchyma: 0 when the entire region is normal, 1 when the involvement is less than 50%, and 2 when it is greater than 50%. The index therefore ranges from 0 to 12.
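The index above can be sketched as a simple scoring function. Note one assumption: the protocol does not specify how exactly 50% involvement is scored, so here it is assigned a score of 2:

```python
def region_score(affected_fraction: float) -> int:
    """Score one lung region: 0 = normal, 1 = <50% affected, 2 = >=50% affected
    (the exact-50% case is an assumption; the protocol leaves it unspecified)."""
    if affected_fraction <= 0.0:
        return 0
    return 1 if affected_fraction < 0.5 else 2

def ct_severity_index(affected_fractions: list[float]) -> int:
    """Sum the scores of the six lung regions (upper/middle/lower of each lung).
    The result ranges from 0 (normal) to 12 (diffuse involvement)."""
    if len(affected_fractions) != 6:
        raise ValueError("expected one affected fraction per lung region (6 total)")
    return sum(region_score(f) for f in affected_fractions)

# Example: one lung diffusely involved (>50% in all three regions),
# the other with mild involvement in its lower region only
print(ct_severity_index([0.8, 0.6, 0.7, 0.0, 0.0, 0.2]))  # → 7
```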
A database of medical images acquired at each partner center and their respective chest computed tomography reports will be assembled retrospectively, with all data anonymised.

Algorithm development and AI data evaluation
Data set: The data set will consist of Chest CT performed in laboratories belonging to the Coordinator Center network and partner institutions. Radiologists will annotate part of this set of exams, and each annotation will be composed of a polygon that segments the lung in each slice of the computed tomography. Each training instance will then consist of a slice of a computed tomography exam (entry to the model) and the polygon that separates the lung (expected response from the model). It should be noted that the construction of the algorithm will be performed at the Coordinator Center.
The training of the segmentation model will occur in a supervised manner, using the annotated set of data, where each instance of this set must have a polygon that delimits the region referring to the lung in the exam.
We will use the segmentation model to calculate the lung volumes and percentages of GGO and consolidations in the non-annotated Chest CT. This information, along with clinical and laboratorial data, will provide data to develop a machine learning model that predicts the primary and secondary outcomes.
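As an illustration of how the segmentation output could be turned into those percentages, the sketch below classifies lung voxels by density. The Hounsfield-unit windows are assumptions chosen for illustration, not thresholds defined by this protocol:

```python
import numpy as np

# Hypothetical HU windows: common heuristics for ground-glass opacity and
# consolidation, NOT values specified by the protocol.
GGO_HU = (-750, -300)
CONSOLIDATION_HU = (-300, 50)

def opacity_fractions(ct_hu: np.ndarray, lung_mask: np.ndarray) -> dict[str, float]:
    """Given a CT volume in Hounsfield units and a binary lung segmentation
    mask, return the fraction of lung voxels falling in each density range."""
    lung = ct_hu[lung_mask.astype(bool)]
    if lung.size == 0:
        raise ValueError("empty lung mask")
    ggo = np.count_nonzero((lung >= GGO_HU[0]) & (lung < GGO_HU[1]))
    cons = np.count_nonzero((lung >= CONSOLIDATION_HU[0]) & (lung < CONSOLIDATION_HU[1]))
    return {"ggo": ggo / lung.size, "consolidation": cons / lung.size}

# Toy 'volume': healthy lung (-800 HU) with one slab of ground-glass density
vol = np.full((4, 4, 4), -800.0)
vol[0] = -500.0
mask = np.ones_like(vol, dtype=bool)
print(opacity_fractions(vol, mask))  # → {'ggo': 0.25, 'consolidation': 0.0}
```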
For all models, we will only consider patients that either died or were discharged. For primary and secondary outcomes analysis, we will exclude the unknown cases.
We will compare different classification models, such as LightGBM and CatBoost. CatBoost is a model from the family of gradient-boosted trees. Although powerful, gradient-boosted trees tend to overfit the training data when fed too many features, which is why we restricted the number of features studied in this protocol. We will also restrict the depth of the trees to three, fix the learning rate at 0.01, and set the early-stopping patience at 50 rounds.
The primary metric is the area under the receiver operating characteristic curve (AUC-ROC).
For each model, we will separate 20% of the considered patients to build our test set. On the other 80%, we will run a 5-fold cross-validation, and all of our results will be reported as a mean and standard deviation on the holdout patients.
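A minimal sketch of this evaluation scheme, using scikit-learn's GradientBoostingClassifier as a stand-in for LightGBM/CatBoost and synthetic data in place of the clinical feature table, with the hyperparameter constraints described above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for the clinical + CT feature table
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# 20% of patients held out as the test set
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 5-fold cross-validation on the remaining 80%; each fold's model is scored
# on the common holdout, so the result carries a mean and standard deviation
aucs = []
for train_idx, _ in StratifiedKFold(n_splits=5, shuffle=True,
                                    random_state=0).split(X_dev, y_dev):
    model = GradientBoostingClassifier(
        max_depth=3,            # tree depth restricted to three
        learning_rate=0.01,     # fixed learning rate
        n_estimators=500,
        n_iter_no_change=50,    # early-stopping patience
        validation_fraction=0.1,
        random_state=0)
    model.fit(X_dev[train_idx], y_dev[train_idx])
    aucs.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

print(f"AUC-ROC: {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")
```

Swapping in LightGBM or CatBoost would only change the model constructor; the split, cross-validation, and AUC reporting stay the same.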
These algorithms will be evaluated using various performance metrics (sensitivity, specificity, positive and negative predictive values, and F1-score) to determine the generalization of results in our population and the performance in terms of effectiveness, safety and quality. We will calculate F1 and accuracy for all thresholds and select the best threshold.
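The threshold selection step can be sketched as a sweep over the predicted probabilities, keeping the cutoff that maximizes the metric. The helper below is hypothetical and shown for F1 only; the same loop works for accuracy:

```python
import numpy as np
from sklearn.metrics import f1_score

def best_f1_threshold(y_true: np.ndarray, y_prob: np.ndarray) -> tuple[float, float]:
    """Sweep candidate thresholds over the predicted probabilities and
    return the threshold that maximizes F1, together with that F1."""
    thresholds = np.unique(y_prob)
    scores = [f1_score(y_true, y_prob >= t) for t in thresholds]
    i = int(np.argmax(scores))
    return float(thresholds[i]), float(scores[i])

# Toy labels and predicted probabilities
y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.3, 0.55, 0.5, 0.7, 0.9])
print(best_f1_threshold(y_true, y_prob))  # → (0.5, 0.857...)
```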

Missing data
We propose the hypothesis that a machine learning model based on clinical, radiological and epidemiological data will be able to identify imaging and clinical factors correlated with disease severity, and possibly estimate prognosis (hospitalization needs, ICU admission, orotracheal intubation), which is especially needed in pandemic times. The statistical analysis will consider the intention-to-treat population. Given the design, we expect that some patients were lost to follow-up. Furthermore, we expect a high number of missing data points, considering that this is a trial conducted with retrospective data collection among hospitalized and discharged patients. We will apply the following strategies to prevent missing data and maintain the accuracy of the study: • Use of a data collection tool that is straightforward and allows the collection of sufficient data to address the research question.
• Research centers' data collection monitoring throughout the study.
• Data-monitoring committee to monitor missing data.
• Choose researchers and sites with a good rate of patients with complete follow-up.
We will handle missing data with modern imputation methods such as multiple imputation (each missing value will be replaced by a simulated value). The advantage of these methods is that they minimize the loss of precision and power; estimates have almost optimal statistical properties, and results are mostly consistent, therefore generally unbiased in large samples with almost perfect efficiency. We will also perform sensitivity analyses to ensure robustness of the primary result, testing with and without adjustment for centers.
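A minimal sketch of model-based imputation using scikit-learn's IterativeImputer (chained-equations style). Full multiple imputation would repeat this with several random draws (e.g. `sample_posterior=True` and different seeds) and pool the estimates:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy feature table: rows are patients; columns could be age, CT score, a lab value
X = np.array([
    [52.0, 4.0, 1.2],
    [67.0, np.nan, 2.1],   # missing CT score
    [45.0, 2.0, np.nan],   # missing lab value
    [71.0, 9.0, 3.0],
    [58.0, 5.0, 1.8],
])

# Each missing value is replaced by a value predicted from the other features
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
print(np.isnan(X_filled).any())  # → False
```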

Risks
Imaging results will not be identified by patient name, so it will not be possible to identify the patient; the data will remain with the researchers involved in the project, and access by third parties (followers, employers, superiors) will not be allowed, guaranteeing protection against any type of discrimination and/or stigmatization. It is noteworthy that patients with the same conditions have very similar patterns of findings. Even so, despite all precautions, there is always a remote potential risk of accidental breach of anonymity (loss of confidentiality), although identification of a patient by imaging findings alone is practically impossible. In addition, if any patient's data is leaked, all appropriate measures will be taken and no effort will be spared to minimize any damage that may occur.

Benefits
The validation of AI algorithms makes it possible to identify solutions with potential for future use, if approved by the regulatory agencies, in a vast number of conditions. Especially at this pandemic moment, when the identification of prognostic factors can be decisive for the best conduct and therapeutic strategy, not only for patients but for the entire health system, this study is very valuable and promising.

Data analysis methodology
Statistical analysis. We will use the "R" program (version R-3.6.3 for Windows) to perform statistical analyses. Continuous variables will be described by the mean and standard deviation or by the median and interquartile range (IQR) and compared with the t-test, the paired-samples t-test, or the Mann-Whitney U or Wilcoxon test; categorical variables will be expressed as numbers (%), and the CT scores will be compared using the χ2 test or Fisher's exact test, as appropriate. We assume a two-sided α of less than 0.05 as statistically significant. For sample size calculations, a beta error of less than 10% will be used (power ≥ 90%).
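For illustration, the group comparison of ordinal CT scores could look like the following sketch (synthetic scores, not study data; the protocol performs this analysis in R, but the test is the same):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical CT severity indices (0-12 scale) for the two outcome groups,
# sized per the sample calculation (45 favourable, 15 severe)
favourable = rng.integers(0, 6, size=45)
severe = rng.integers(5, 13, size=15)

# Ordinal scores: compare the two distributions with the Mann-Whitney U test
u_stat, p_value = stats.mannwhitneyu(severe, favourable, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.4g}")  # p < 0.05 → reject equality at the 5% level
```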
We will use performance metrics for AI algorithms, such as accuracy, sensitivity, specificity, positive and negative predictive values and F1-Score.

Expected results
To determine possible findings that predict severity and prognosis using AI algorithms based on pulmonary segmentation.