Clinical outcome measures and scoring systems used in prospective studies of port wine stains: A systematic review

Background Valid and reliable outcome measures are needed to determine and compare treatment results of port wine stain (PWS) studies. Moreover, uniformity in outcome measures is crucial to enable inter-study comparisons and meta-analyses. This study aimed to assess the heterogeneity in reported PWS outcome measures by mapping the (clinical) outcome measures currently used in prospective PWS studies. Methods OVID MEDLINE, OVID Embase, and CENTRAL were searched for prospective PWS studies published from 2005 to May 2020. Interventional studies with a clinical efficacy assessment were included. Two reviewers independently evaluated methodological quality using a modified Downs and Black checklist. Results In total, 85 studies comprising 3,310 patients were included in which 94 clinician/observer-reported clinical efficacy assessments had been performed using 46 different scoring systems. Eighty-one studies employed a global assessment of PWS appearance/improvement, of which 82% were expressed as percentage improvement and categorized in 26 different scoring systems. A wide variety of other global and multi-item scoring systems was identified. As a result of outcome heterogeneity and insufficient data reporting, only 44% of studies could be directly compared. A minority of studies included patient-reported or objective outcomes. Thirteen studies of good quality were found. Conclusion Clinical PWS outcomes are highly heterogeneous, which hampers study comparisons and meta-analyses. Consensus-based development of a core outcome set would benefit future research and clinical practice, especially considering the lack of high-quality trials.



Introduction
Port wine stains (PWS) are congenital vascular malformations that occur in approximately 0.3-0.9% of infants [1][2][3]. Lesions initially present as flat, red-to-pink patches and darken and thicken with age [4]. PWS are most frequently located on the face and neck and can cause functional impairment [5], skin and soft tissue hypertrophy [5], and glaucoma [6], as well as substantial psychosocial morbidity [7][8][9]. Despite therapeutic developments, complete PWS resolution remains rare [10][11][12]. Valid, relevant, and reliable outcome measures are required to accurately gauge treatment effects and compare treatment protocols and therapeutic interventions. Moreover, standardization of outcome measures is imperative for enabling comparisons between studies. Increased awareness of the importance of (high-quality) outcome measures has led to a rise in outcome measure research, especially in dermatology [13] (in particular for psoriasis [14], atopic dermatitis [15], and more recently vitiligo [16]).

Appraisal of study quality
To assess the methodological quality at the study level, a critical appraisal was performed independently by two authors (IvR, SC) using a modified version of the Downs and Black checklist [22][23][24]. This validated checklist consists of 27 questions regarding reporting, internal validity, external validity, and power and has been used for both randomized and non-randomized controlled studies. Additional details of this analysis are described in the supplement (S3 Appendix). In addition to the factors covered by the Downs and Black checklist, we assessed several aspects related to outcome assessment: 1) the number of outcome assessors; 2) their professional background; 3) whether outcomes were assessed based on photographs; and, if so, 4) whether an attempt was made to standardize these photographs.

Data analysis
Outcome measures were classified according to: domain [25], assessor (clinician-, observer-, parent-, or patient-reported), qualitative vs. quantitative, relative (i.e., scoring systems with a single measurement that constitutes the difference in pre-treatment and post-treatment appearance) vs. static (i.e., scoring systems that require repeated pre- and post-treatment measurements with the same scoring tool to calculate a (change) score), and global (i.e., a single-item assessment) vs. multi-item (i.e., separate assessment of two or more PWS characteristics, such as PWS border, texture, and color) measures. The data were presented using descriptive statistics (frequencies and proportions).
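As an illustration of this classification scheme, the sketch below tabulates frequencies and proportions along one classification axis. It is not the authors' analysis code: the class name `OutcomeMeasure`, the helper `describe`, and the example records are hypothetical, and the 'domain' axis [25] is omitted for brevity.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class OutcomeMeasure:
    """One outcome measure, classified along the axes described in the text."""
    assessor: str   # "clinician", "observer", "parent", or "patient"
    data_type: str  # "qualitative" or "quantitative"
    timing: str     # "relative" (single before/after comparison) or "static" (repeated scores)
    scope: str      # "global" (single item) or "multi-item"


def describe(measures, attribute):
    """Return {category: (frequency, proportion)} for one classification axis."""
    counts = Counter(getattr(m, attribute) for m in measures)
    total = sum(counts.values())
    return {category: (n, n / total) for category, n in counts.items()}


# Hypothetical example records; these are NOT data from the review.
measures = [
    OutcomeMeasure("clinician", "quantitative", "relative", "global"),
    OutcomeMeasure("clinician", "qualitative", "relative", "global"),
    OutcomeMeasure("patient", "qualitative", "relative", "global"),
    OutcomeMeasure("observer", "quantitative", "static", "multi-item"),
]

summary = describe(measures, "timing")
for category, (n, proportion) in summary.items():
    print(f"{category}: N = {n} ({proportion:.1%})")
```

The same helper can be called with `"assessor"`, `"data_type"`, or `"scope"` to summarize the other axes.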

Study outcomes
An overview of all study outcomes is presented in Fig 3. Only studies with a clinical efficacy assessment were included in our analysis. PWS were assessed predominantly with relative (rather than static) measures for both patient- and clinician/observer-reported data (Fig 3). Clinician/observer-reported satisfaction was included in 2 studies (2.4%). Patient-reported outcomes were measured in 32.9% (N = 28/85) of studies, which included satisfaction, PWS improvement, and treatment preference (Section Patient-reported and parent-reported outcome measures).
In addition to clinical assessment, 36 studies (42.4%) used objective instruments to measure treatment efficacy or quantify other factors, such as dermal blood flow reduction. Histological samples to assess photothermally-induced changes and epidermal damage were collected in 8 studies (9.4%).
Of all studies, 77.6% (N = 66/85) systematically collected data on the presence of adverse or side effects. In one study a 4-point scoring system was used to score "safety", i.e., the occurrence and degree of hypopigmentation or hyperpigmentation and hypertrophic or atrophic scarring [75]. Another study classified the degree of crusting based on a 3-point scoring system ('thick', 'thin', or 'none') [32].

Observer/clinician-reported outcome measures and scoring systems
Inasmuch as several studies employed 2 forms of clinical efficacy assessment, a total of 94 observer/clinician-reported clinical efficacy assessments were performed. The scoring systems were not specified in 3 studies. In the remaining studies, 46 different scoring systems were employed (Table 1). Most studies (N = 79/85) used a relative measure as the primary outcome. For relative measures, a global assessment of PWS improvement (also referred to in studies as 'blanching', 'lightening', and 'clearance') was the most prevalent. In a majority of studies (N = 66/81) the global assessment was reported quantitatively as a percentage improvement and was categorized into subgroups (usually quartiles, which were supplemented by additional strata of 0% (N = 23/63) and/or 100% clearance (N = 5/63) in some studies). Alternatively, qualitative scoring systems were used that varied from 2 to 5 grades (Table 1). A multi-item assessment using relative scoring systems was used in 2 studies.
In a few studies with relative measures as the primary outcome, a secondary efficacy outcome was included in the form of another relative (and global) measure (N = 5/85) or a static measure (N = 4/85). Static measures included the Patient and Observer Scar Assessment Scale (POSAS), a 10-point scoring system for 'redness' or 'cosmetic appearance', the (decrease in) scores on a Munsell color chart, and multi-item assessment of skin color and texture. Two studies also utilized dermoscopy-derived outcomes (using an unspecified scale or the intraoperative observation of vascular rupture; not included in the analyses).
Although most studies used a classification based on percentage improvement, the differences in the number of subgroups and subgroup ranges (shown in Fig 4) complicated study comparison. Numerous studies also used inconsistent and contradictory (mathematical) statements to describe subgroups. In total, 26 different scoring systems based on percentage blanching (or percentage 'improvement', 'lightening', or 'clearance') were identified in 65 studies. The data of at most 57.4% of efficacy assessments (N = 54/94) could theoretically be converted into one common, simplified classification of quartile percentages (i.e., 0-25%, 26-50%, 51-75%, and 76-100%). However, many studies failed to report the scores of each category or only reported mean scores for the entire cohort, which precluded actual pooling of the data. Consequently, the data of at most 43.5% of studies (N = 37/85) could be pooled into one uniform scoring system.
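The conversion into a common quartile classification can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the review: the function names, the rule that a category is convertible only if it lies entirely within one quartile, and the example category counts are all hypothetical.

```python
# Common, simplified classification of quartile percentages from the text.
QUARTILES = [(0, 25), (26, 50), (51, 75), (76, 100)]


def quartile_of(lower, upper):
    """Return the index (0-3) of the quartile that fully contains [lower, upper], else None."""
    for index, (q_lower, q_upper) in enumerate(QUARTILES):
        if q_lower <= lower and upper <= q_upper:
            return index
    return None


def pool_into_quartiles(categories):
    """Pool per-category patient counts into the four quartiles.

    categories: list of ((lower_percent, upper_percent), n_patients).
    Returns a list of four counts, or None if any category straddles a
    quartile boundary (assumed here to make the system non-convertible).
    """
    pooled = [0, 0, 0, 0]
    for (lower, upper), n_patients in categories:
        index = quartile_of(lower, upper)
        if index is None:
            return None
        pooled[index] += n_patients
    return pooled


# Hypothetical six-category system: quartiles plus separate 0% and 100% strata.
six_category = [((0, 0), 2), ((1, 25), 5), ((26, 50), 8),
                ((51, 75), 10), ((76, 99), 4), ((100, 100), 1)]
# Hypothetical tertile system, which cannot be mapped onto quartiles.
tertiles = [((0, 33), 6), ((34, 66), 9), ((67, 100), 5)]

print(pool_into_quartiles(six_category))  # [7, 8, 10, 5]
print(pool_into_quartiles(tertiles))      # None
```

Pooling of this kind requires per-category frequencies; a study that reports only a cohort mean cannot be entered into `pool_into_quartiles`, which mirrors the drop from 54/94 convertible assessments to 37/85 poolable studies.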
Observer/clinician-reported satisfaction was included as a secondary outcome in two studies (using a 0-4 scoring system or an 'ineffective', 'moderate', 'good', or 'excellent' score).

Patient-reported and parent-reported outcome measures
Satisfaction with treatment (N = 14/85) was the most commonly included patient-reported outcome and was measured using 1 of 9 different scoring systems (Fig 3 and Table 2). Patient or parent-reported PWS improvement was included in 13 studies (15.3%) using 1 of 11 different scoring systems (similar to scoring systems used for clinician/observer-reported assessment). Ten studies assessed patient-reported pain. Patient-preferred treatment (for patients that underwent 2 or more forms of treatment) was included in 6% of within-

Objective measures using optical instruments and digital image analysis
Non-invasive, objective assessment using optical instruments or digital image analysis techniques was used in 42.4% of studies (N = 36/85). The techniques employed were colorimetry
'Excellent' (color is close to normal skin color and no scar formation), 'good' (marked blanching, thicker lesion becomes flat, no scar formation), 'fair' (partial blanching, thicker lesion becomes moderately flat), 'poor' (slight blanching, thicker lesion becomes slightly flat) or 'no change'

Relative measures
Skin color, skin texture, and overall clinical outcome were assessed separately on a 1-4 scoring system (1 = no signs of skin change associated with PWS, 4 = significant change in skin associated with PWS). Change in overall outcome was converted to a percentage improvement.

(1.2)
Efficacy, purpura and homogeneity were each assessed and classified into 'better with' or 'better without' the study intervention

Scoring system not specified 3 (3.5)
Both global (single-item) and multi-item (several, individually scored characteristics) PWS scoring systems are shown. These were divided into qualitative vs. quantitative, and relative (a single measurement score that compares pre- and post-treatment) vs. static (the difference between repeated pre- and post-treatment scores) measures. Table 3.

Methodological quality of prospective PWS studies
The mean (± SD) Downs and Black risk of bias checklist score was 15.3 ± 4.0 (18.0 ± 3.3 for controlled and 12.4 ± 2.4 for uncontrolled studies; S1 Checklist). Studies were of good (N = 13), fair (N = 31), and poor (N = 41) quality. No excellent studies were found. The mean score per item for controlled and uncontrolled studies is presented in Fig 5. The items related to 'external validity' scored particularly poorly. In controlled studies, items 9, 11-14, 24, 26, and 27 were most frequently not satisfied. In uncontrolled studies, most points were lost in items 7, 9, 11-13, and 26. In the studies of good quality, a 'percentage improvement' scale was most frequently used as the primary outcome (92.3%, N = 12/13).
Table 1 were stratified according to their categories.

Parent-reported PWS improvement:
Change in size, overall satisfaction with the results, change in color, wish to continue therapy (score 1-5)

Δa* is the change in redness (a*) using the L*a*b* color system as determined by the Commission Internationale de l'Eclairage (CIE). ΔE refers to the color difference according to the CIE76 formula [110], usually in comparison to normal (contralateral) skin. PLOS ONE, https://doi.org/10.1371/journal.pone.0235657.t003

Discussion
Our systematic analysis revealed considerable heterogeneity in clinical outcome measures and scoring systems in prospective PWS studies. Most studies used a global (observer/clinician-reported) efficacy assessment with percentage improvement as primary outcome measure. Due to the variations in scoring systems and score conversions (e.g., reporting only the mean improvement for the entire patient cohort), only 44% of studies (N = 37) had a clinical outcome that could be included in one common, simplified scoring system to enable inter-study comparative analysis. Other scoring systems included multi-item assessment of PWS and the difference in repeated pre- and post-treatment appearance scores. Almost half of all studies additionally based treatment efficacy outcomes on an objective measurement, such as colorimetry. Nevertheless, even in these studies there is a wide range of different outcome measures. Patient-reported outcomes were included in a minority of studies and included pain, PWS improvement, and satisfaction.
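For readers unfamiliar with the colorimetric outcomes, the CIE76 color difference is the Euclidean distance between two colors in L*a*b* space, and Δa* is the change along the red-green axis. The sketch below illustrates both; the function names and the L*a*b* readings are hypothetical and are not taken from any included study.

```python
import math


def delta_e_cie76(lab_1, lab_2):
    """CIE76 color difference: Euclidean distance between two (L*, a*, b*) triples."""
    return math.sqrt(sum((c1 - c2) ** 2 for c1, c2 in zip(lab_1, lab_2)))


def delta_a(lab_before, lab_after):
    """Change in redness (a*) from the pre-treatment to the post-treatment reading."""
    return lab_after[1] - lab_before[1]


# Hypothetical colorimeter readings as (L*, a*, b*); illustrative values only.
normal_skin = (66.0, 15.0, 16.0)  # contralateral reference site
pws_before = (60.0, 18.0, 14.0)   # lesion before treatment
pws_after = (64.0, 16.0, 15.0)    # lesion after treatment

print(delta_e_cie76(pws_before, normal_skin))  # 7.0
print(delta_e_cie76(pws_after, normal_skin))   # smaller value: lesion is closer to normal skin
print(delta_a(pws_before, pws_after))          # -2.0 (less red)
```

A decrease in ΔE relative to normal skin indicates that the lesion color has moved toward the reference site, but, as discussed in this section, such values do not necessarily agree with visual assessment.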
In the past two decades, good outcome measures for clinical studies, i.e., those that are valid, consistent, accurate, reproducible, and error-free, have increasingly been recognized as essential elements for evidence-based clinical decision making. As a result, study end-points have come under increasing scrutiny [111]. Concurrently, efforts have been made to standardize trial outcomes and thereby enable meta-analysis and other forms of data pooling, e.g., by developing an agreed minimum set of outcomes known as a 'core outcome set' (facilitated by the 'Core Outcome Measures in Effectiveness Trials' (COMET) Initiative [112]) and, in dermatology, the establishment of the International Dermatology Outcome Measures (IDEOM) initiative [113,114] and the Cochrane Skin Group-Core Outcome Set Initiative (CSG-COUSIN) [115]. By mapping the outcome measures currently in use, this review could aid in the development of a core outcome set for PWS.
In 1992, Pickering et al. reviewed the methods used to assess the response of PWS to laser treatment and found substantial variability [116]. The authors pointed to the subjective nature of visual assessment and advocated the use of noninvasive optical instruments, such as colorimetry, reflectance spectrophotometry, and Doppler flowmetry to objectively quantify PWS improvement. This has since been reiterated several times by others [117][118][119][120]. Meanwhile, the objective scoring methods have been expanded and now include digital image analysis [90,118,121], reflectance confocal microscopy [122], optical coherence tomography [123,124], depth measurement videomicroscopy [125], laser speckle contrast imaging [63,126], and spatial frequency domain imaging [127]. Interestingly, these tools do not always correlate with visual assessment [81,128], underscoring potential flaws in subjective assessment tools. On the other hand, the final goal of treating PWS is to reduce the visibility and noticeability of the lesion rather than to change objective measures, such as dermal blood volume or blood flow, so this should be reflected in study outcomes. Moreover, most of these (optical) instruments are costly and not clinically available. In an altogether different approach, Lanigan proposed to use differences in pre- and post-treatment patient morbidity or satisfaction as study outcomes [129], which closely aligns with the increasing recognition of the importance of patient-reported outcomes [130]. In our sample, patient-reported outcomes were included in only 32.9% of studies, none of which involved measures of quality of life or functioning. The contrast between objective assessment and a patient-based subjective approach raises an important question as to what is most important clinically: the actual degree of PWS blanching or the patients' perception of therapeutic efficacy (or perhaps even patient-perceived PWS-related life impact).
Regardless of the abovementioned objective assessment methods, (subjective) clinical efficacy assessment remains the most popular approach in prospective PWS studies as evidenced by the results of our systematic analysis (i.e., our study was limited to studies with a clinical assessment but only 20 studies were excluded for not complying with this criterion). Despite past efforts to standardize PWS study outcome measures, clinical scoring systems have remained highly variable.
Our review also showed that the quality of prospective PWS studies is generally poor inasmuch as no excellent and only thirteen good-quality studies were found. Low Downs and Black scores were mostly attributable to incomplete disclosure of the patient recruitment process, patients lost to follow-up, and the randomization process, and to the lack of patient blinding. It is likely that scores have been influenced to some extent by poor reporting (rather than poor study practice and design), which could be aggravated by the word limits imposed by dermatological journals. Also, the scores for uncontrolled studies are curbed to some extent by the inclusion of studies whose primary goal was not to evaluate intervention effect (N = 12/42, section Study Characteristics).
Our systematic analysis was limited to prospective studies. However, there are no indications that retrospective studies perform any better in terms of outcome homogeneity. Another limitation is the fact that only studies with a clinical assessment were included. This means that some outcomes, particularly technical outcomes, may have been missed. The authors considered the assessment period of over 15 years sufficiently representative of current practice.
Consensus on the best outcome measures for PWS studies is lacking, which makes it impossible to compare trial results and perform meta-analyses. The absence of a standardized scoring system and paucity of high-quality PWS studies consequently limit (the quality of) the evidence available to clinicians to optimize treatment. As such, this problem may have contributed to the lack of improvement in treatment outcomes over the last three decades [10]. Thus, the PWS field would benefit from a single, simple, and easy-to-use clinical assessment protocol, which can preferably be applied both in clinical trials and clinical practice. Inasmuch as it is currently unclear which clinical outcome measure is superior in regard to its measurement properties (i.e., validity, reliability, and responsiveness), we are currently performing a systematic review on the measurement properties of PWS outcome measures. Ideally, a future Delphi study would be organized among a large and relevant international group of stakeholders (including patients and healthcare professionals) to achieve consensus on what outcome constructs should be measured and reported in all PWS treatment research and, subsequently, which instruments should be used to measure these outcomes. If needed, new outcome instruments should be developed, including patient-reported outcomes. Accordingly, another essential part of the process of establishing such a core outcome set is the evaluation of the measurement properties of selected, previously established outcome instruments. Pending these developments, we advise PWS studies to (at least) include a physician-reported score of PWS percentage improvement compatible with quartiles (i.e., the most prevalent scoring system at the top in Fig 4) until consensus on this topic is reached and to report the frequency of each individual outcome category.
The methodological quality of prospective PWS studies could be further improved by consistent use of blinded, independent, and experienced evaluators, sufficient follow-up time, and treatment randomization with inclusion of control groups (e.g., a split-face study design). Moreover, it is imperative that photographs used for clinical assessment are taken under standardized conditions (i.e., using identical camera settings, patient positioning, and lighting). Because of the considerable effects on erythema, patients should also be comfortable and remain sufficiently long (e.g., > 30 minutes) in a temperature-controlled room in order to achieve equilibrium conditions before photographs or other measurements are taken. Study quality would also benefit from systematic collection of data on adverse effects and inclusion of (validated) patient-reported outcomes.

Conclusions
Outcome measures used in prospective PWS studies are highly heterogeneous, making studies incomparable and hampering evidence-based clinical decision making. The results of this systematic analysis underscore the need for reliable, consensus-based, standardized outcome measures.