An evidence-based methodology for systematic evaluation of clinical outcome assessment measures for traumatic brain injury

Introduction
The high failure rate of clinical trials in traumatic brain injury (TBI) may be attributable, in part, to the use of untested or insensitive measurement instruments. Of more than 1,000 clinical outcome assessment measures (COAs) for TBI, few have been systematically vetted to determine their performance within specific "contexts of use" (COUs). As described in guidance issued by the U.S. Food and Drug Administration (FDA), the COU specifies the population of interest and the purpose for which the COA will be employed. COAs are commonly used for screening, diagnostic categorization, outcome prediction, and establishing treatment effectiveness. COA selection typically relies on expert consensus; there is no established methodology to match the appropriateness of a particular COA to a specific COU. We developed and pilot-tested the Evidence-Based Clinical Outcome assessment Platform (EB-COP) to systematically and transparently evaluate the suitability of TBI COAs for specific purposes.

Methods and findings
Following a review of existing literature and published guidelines on psychometric standards for COAs, we developed a 6-step, semi-automated, evidence-based assessment platform to grade COA performance for six specific purposes: diagnosis, symptom detection, prognosis, natural history, subgroup stratification and treatment effectiveness. Mandatory quality indicators (QIs) were identified for each purpose using a modified Delphi consensus-building process. The EB-COP framework was incorporated into a Qualtrics software platform and pilot-tested on the Glasgow Outcome Scale-Extended (GOSE), the most widely used COA in TBI clinical studies.

Conclusion
The EB-COP provides a systematic methodology for conducting more precise, evidence-based assessment of COAs by evaluating performance within specific COUs. The EB-COP platform was shown to be feasible when applied to a TBI COA frequently used to detect treatment effects and can be modified to address other populations and COUs. Additional testing and validation of the EB-COP are warranted.


Introduction
Among the more than 1,000 clinical outcome assessment measures (COAs) currently used for traumatic brain injury (TBI), few have been systematically evaluated to determine their performance within specific "contexts of use" (COUs). As described by the U.S. Food and Drug Administration (FDA), the COU specifies the population of interest and the purpose for which the COA will be employed. COAs are commonly used for screening, diagnostic categorization, outcome prediction and establishing treatment effectiveness. Despite the pivotal role that outcome assessment plays in research, COA selection typically relies on expert consensus. There is currently no methodology designed to determine the appropriateness of a particular COA within a specific COU. To address this gap, we developed and pilot-tested the Evidence-Based Clinical Outcome Assessment Platform (EB-COP) to efficiently and transparently evaluate the suitability of TBI COAs for specific purposes of use. Development of the EB-COP was informed by the FDA's Roadmap to Patient-Focused Outcome Measurement in Clinical Trials,[1] the American Academy of Neurology's (AAN's) well-established Clinical Practice Guideline Process Manual,[2] the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN)[3] and other literature describing standards for measurement development.
The framework of the EB-COP is built around the six distinct "Purposes of Use" (PoUs) shown below:

1. Accurate diagnosis of TBI
2. Detection of TBI sequelae
3. Stratification of TBI subpopulations
4. Prediction of TBI outcome
5. Identification of natural history changes
6. Detection of treatment effects

The EB-COP evaluates the strength of COAs using distinct sets of quality indicators (QIs) that are specific to each purpose of use. This feature helps the user determine the contexts within which the COA should and should not be used.
This Manual of Operating Procedures (MOP) guides the EB-COP user through the COA review process, which relies on the on-line Qualtrics Survey Software System. The Qualtrics platform streamlines the process by efficiently navigating the EB-COP user through the review process. Using built-in survey logic, the software populates the relevant criteria for the user to evaluate and generates recommendations based on the user's responses.
See page 1 for a list of the key terms, their abbreviations and definitions. This MOP describes each of the six EB-COP Steps. The core components of the six-step COA evaluation process are depicted in Figure 1. The links to the EB-COP platform are below (Steps III and V are completed outside Qualtrics):

Step I: https://aan.co1.qualtrics.com/jfe6/form/SV_6VDB0Jpx4Oepr2R
Step II: https://aan.co1.qualtrics.com/jfe/form/SV_eb4bYQjmlUet1T7
Step IV-A: https://aan.co1.qualtrics.com/jfe/form/SV_29T0wYcIDmIfLM1
Step IV-B-D: https://aan.co1.qualtrics.com/jfe/form/SV_5chr9XxF1GHl3JX
Step VI: https://aan.co1.qualtrics.com/jfe/form/SV_8BOFAW5atR9jNqd

Step I: Specify Population, Purpose of Use, Concept of Interest, and COA (P-P-C-C)
(What is the intended Context of Use, and which COA has been selected for review (i.e., P-P-C-C)?)

Step I serves to frame the evidence question that will direct the literature search and review toward the determination of whether the COA under investigation is 'fit for purpose'. This step is akin to specifying the PICO framework in other systematic review processes. In order to frame the evidence question, the WHO (POPULATION), WHY (PoU), WHAT (COI), and HOW (COA/type/mode) must be specified. In particular, specification of the Population and PoU results in the specification of the COU, which the FDA defines as the 'boundaries' within which the COA is qualified for use. Specifying the COU also defines the boundaries of this evidentiary review.
The Validation Framework (presented below) lists the values (i.e., parameters and options offered in the drop-down menus) currently supported by the EB-COP, but the user is free to make specifications beyond those supported (e.g., via free-text boxes in the Qualtrics web-based version).
You will be asked to provide additional specifications with respect to the P-P-C-C that may be relevant to the parameters of the Review (e.g., Population: +/-presence of a biomarker, or COI: disability due to TBI vs overall trauma).
Based on the P-P-C-C parameters selected above, the evidence question will be presented in the web-based version of Qualtrics. You will be asked to provide an email address. You will receive a copy of your evidence question, a record of your other specifications, and a link to the next step, Step II, at the email address that you provide.
You may elect to proceed to Step II or end the session and complete Step II at a later time.

SUB-DOMAIN (items in Drop-down Menu)

Population

Age: Specify the relevant age range, e.g., if adults, ">18," "18-65," etc.
<<allow for free text entry of language(s) and/or cultural adaptation required>>

Additional Specifications (optional)
Additional specifications with respect to the P-P-C-C that may be relevant to the parameters of the Review, e.g., Population: +/- presence of a biomarker, or COI: disability due to TBI vs overall trauma <<allow for free text entry of additional specifications>>

Step II: Assess COA Fundamentals
(Given the Context of Use, is the COA relevant, content-appropriate and feasible to administer?)

Step II serves to determine whether the COA under investigation is appropriate in content and feasibility for measuring the desired COI in the specified population, given the specified PoU. In Step II, you will assess the 8 fundamental Quality Indicators (QIs) listed at the bottom of this page. These QIs have been selected via a modified Delphi approach as the most essential to a COA for your pre-specified P (Population), P (PoU), and C (COI) (i.e., the evidence question formulated in Step I).
These 8 QIs have been identified as the most essential based on models of outcome measure development, evaluation and regulatory qualification (e.g., by the FDA) to ensure that the COA is appropriate in content and purpose. You will investigate the development of the COA; its content and face validity; and its feasibility and applicability to your intended context of use. To do this, you will review the items making up the COA, its user manual, and early papers describing its development and administration. In this step, if any of the 8 fundamental QIs is not met based on pre-defined criteria, the user will be advised to discontinue the review and select a different COA, unless otherwise indicated. Some questions aim to gather additional information regarding the COA in preparation for subsequent Steps in the EB-COP Review. Descriptions/definitions are provided for each QI, which is operationalized via the associated "judgment criteria".
At the start of Step II, you will be asked to provide your email address. You will also be asked to re-enter or paste the COA of interest and the P-P-C-C question developed in Step I. (You may refer to the email that you received after completing Step I.) These will be displayed at the top and bottom of the page, respectively, as you advance through answering questions surrounding each of the 8 QIs' judgment criteria, which are defined starting on the next page. You will also be presented with a concluding prompt that offers the opportunity to note any other evidence that the COA, as framed in terms of the P-P-C-C parameters specified in Step I, fails to meet the fundamental QIs. At the end of Step II, a summary of your responses will be displayed. You will receive a copy of this summary and further information at the email address provided at the beginning of Step II.

If yes, please specify whether the COA has been developed (i) specifically for TBI, (ii) for another population, but subsequently studied in TBI, or (iii) as a generic instrument that has also been studied in TBI:
Fundamental Quality Indicator Judgment Criteria

3) Specification of Intended Concept of Interest (COI)
Clear specification and justification of the COI(s) the COA is intended to measure. Evidence of the relevance of the COI(s) to the Population in which it is being measured should also be provided. For example, in measuring degree of global disability in patients with severe TBI with the GOS-E, it should be evident that it is intended to measure disability that results from the TBI and not from other factors, such as coexisting orthopedic injuries.

A. Does the COA specify the concept of interest (COI) it is intended to measure? <<Yes/No>>
Guidance: The purpose of this first prompt is to determine if the COA developers specified the intended Concept of Interest (COI), as this is a 'hallmark' of good study design for instrument development.

B. Does the intended concept of interest match the COI specified in the evidence question in STEP I? <<Yes/No>>
Guidance: The purpose of this follow-up prompt is to determine if the developers' intended COI corresponds with the COI specified when building the evidence question in Step I.

If no, describe difference or discrepancy between the two (optional):
Fundamental Quality Indicator Judgment Criteria

4) Specification of Intended Purpose(s) of Use (PoU(s))
Clear specification and justification of the purpose for which the COA has been developed, which may include one or more of the six targeted PoUs: (1)

A. Does the COA specify its intended purpose(s) of use (PoUs)? <<Yes/No>>
Guidance: The purpose of this first prompt is to determine if the COA developers specified the intended Purpose of Use (PoU), as this is a 'hallmark' of good study design for instrument development.

B. Does the intended purpose of use match the PoU specified in the evidence question in
Step I? <<Yes/No>>

Guidance: The purpose of this follow-up prompt is to determine if the developers' intended PoU corresponds with the PoU specified when building the evidence question in
Step I.

If no, describe difference or discrepancy between the two (optional):
Fundamental Quality Indicator Judgment Criteria

5) Content Validity
Non-statistical assessment of the degree to which the COA represents all aspects of the COI it is intended to measure. It includes consideration of the development and selection of items, domains and corresponding response options and of their appropriateness and breadth to the COA's intended measurement concept, population, and use. It is typically determined by reviewing the systematic, qualitative studies, e.g., patient/caregiver focus groups and/or expert panels, which informed the development of the conceptual framework that underlies the COA and/or the derivation of the items that comprise the COA. Inclusion of the appropriate stakeholders (e.g. content experts, clinicians, and patients) in the development of the COA is also evaluated.

If yes, please describe this evidence and your rationale:
Step III: Perform Systematic Literature Search
(Which relevant, high-quality studies have assessed the COA in/for the intended Context of Use?)

In Step III, you will perform a comprehensive search of the world literature to identify studies that have evaluated the relevant psychometric properties, i.e., quality indicators, of your COA.
Step III represents the first step in identifying the relevant literature for evaluating the COA's ability to measure the concept of interest (COI) in a particular TBI population (P) for a particular Purpose of Use (PoU), as specified in the evidence question. It involves developing a search strategy that is both sensitive and specific, and searching multiple databases. For guidance on how to conduct the systematic literature review, please consult the 2017 edition of the American Academy of Neurology Clinical Practice Guideline Process Manual:[2] https://www.aan.com/siteassets/home-page/policy-and-guidelines/guidelines/aboutguidelines/17guidelineprocman_pg.pdf

1. Develop a search strategy that consists of collections of terms for the following characteristics: Quality Indicators (QIs or measurement properties)*
NOTE: The PoU is not explicitly included but is represented by the profile of QIs that need to be identified as 'adequate' to achieve a full recommendation (in Step VI).
2. Search at least two databases (as recommended by the AAN), which may include OVID MEDLINE, OVID EMBASE and PsycINFO, EBSCO CINAHL and SCOPUS.
3. Limit articles by excluding dissertations, book chapters, conference proceedings and case studies.
Step IV: Assess the Relevance and Methodological Quality of the Studies Investigating the COA
(Which relevant, high-quality studies have assessed the COA in/for the intended Context of Use?)

In Step IV, you will use a multi-staged approach to review the articles identified in Step III (Systematic Literature Search). Studies eligible for inclusion must be of high quality according to data quality standards similar to those outlined in the AAN's Clinical Practice Guidelines[2] and the COSMIN checklist,[3] among others.
The approach to filtering studies is strategically ordered to enable a top-down review that assesses increasingly granular aspects of each study. Sub-steps A-D, described below, begin with a general screen of abstracts followed by a review of full-text articles with increasing granularity. Studies meeting inclusion criteria will be those evaluating the quality indicators upon which the COA will be graded in the final step (Step VI).
Two independent reviewers are required to complete Step IV-A and Steps IV-B-D iteratively for each article identified in the literature search (Step III). Each time, you will be asked to enter your name or unique initials/identification as well as the article's identifying information (primary author, publication year, and citation). It is important to make sure that you and your co-reviewer identify articles in the survey using the same format, i.e., that you both paste the complete citation in the same way.
Questions regarding inclusion/exclusion are organized into sub-steps (following which you will have a chance to review your responses). You may enter any comments or notes in the space provided on each page. Studies will be recommended for exclusion from the review if they do not meet all the criteria presented in each sub-step, and you will be redirected to the beginning of the form to start reviewing the next article.
The final decision to exclude the abstract (Step IV-A) or full-text article (Steps IV-B-D) will be made following comparison and reconciliation with your co-reviewer. The EB-COP administrator will email you the results once the other reviewer has finished and the two reviews have been compiled. If you and your co-reviewer have discrepancies regarding whether to bring an article forward, these should be resolved prior to proceeding to Step IV-B and Step V, respectively. If the two reviewers are unable to reconcile their responses, a third reviewer may be needed to break ties.

Step IV-A: Abstract Review: Assess All Articles' Relevance to the P-P-C-C and Sample Size
In Step IV-A, you will answer a short series of questions about each abstract that was retrieved. The aim is to broadly characterize the article's relevance to the P-P-C-C evidence question established in Step I. This will determine which articles will be brought forward for full-text review. You will assess whether the article (A) examines a TBI population in general, (B) addresses the chosen age group, (C) has an adequate sample size, and (D) investigates a psychometric property or scoring characteristic of the COA. Should your age criteria change, you will be able to refer back to the second question (B). In the third question (C), you must assess whether there are at least 25 appropriately aged participants with TBI in the starting sample (independent of the control group size, if one exists). This question is important when considering the power and generalizability of the study. The threshold of 25 was chosen so that even with up to 20% loss to follow-up, the final sample size would still be at least 20. The fourth question (D) excludes studies that simply use the COA, assuming it is valid and reliable, without specifically examining its properties, per COSMIN guidelines.[3] You may elect to complete the form again for another abstract or end the session and finish at a later time. When you have indicated that abstract review is complete, you will be asked for your email address.
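The arithmetic behind the sample-size criterion in question C can be made explicit in a short check. This is an illustrative sketch only (the function name is hypothetical and it is not part of the Qualtrics forms); a floor of 25 starting participants with a final sample of at least 20 corresponds to a 20% attrition allowance:

```python
def meets_sample_size_floor(n_tbi_start: int,
                            min_final_n: int = 20,
                            max_attrition: float = 0.20) -> bool:
    """True if the starting TBI sample (controls excluded) would still leave
    at least `min_final_n` participants after worst-case attrition.
    With min_final_n=20 and 20% attrition, a start of 25 is required."""
    return n_tbi_start * (1 - max_attrition) >= min_final_n

# A starting sample of 25 passes; 24 would not survive 20% attrition.
ok_25 = meets_sample_size_floor(25)   # True
ok_24 = meets_sample_size_floor(24)   # False
```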
(If the response to any of these questions is "No", the article will be excluded from review, pending consensus between the two independent reviewers. If the answer is unclear, the response "Not Stated" should be chosen, and the article may still be included.)

Step IV-B: Full-text Review: Confirm Relevance to P-P-C-C and Sample Size
Steps IV-B-D evaluate the relevance, sample size, generalizability, and methodological quality of studies screened for inclusion after reconciliation of Step IV-A. In the online form, you will be asked to select the PoU from your P-P-C-C question in Step I. Ensure you have the correct PoU at the beginning of the form, as this cannot be changed in later sub-steps. If appropriate, you will extract relevant information about the study characteristics as well as PoU-specific psychometric properties (i.e., quality indicators) from the full text. After reconciliation of the full-text articles identified by you and your co-reviewer in this process, the studies deemed relevant and high quality will be brought forward to subsequent steps (V and VI), where they will be used for the grading and development of recommendations for your COA. Note that when using this online form, your progress is cached. Should you need to restart the survey, please clear your browser's cookies or open a private browsing window.
(If the response is "No" or "Not Stated" to any of these questions, the article will be excluded from review, pending consensus between the two independent reviewers.)

Step IV-C: Full-text Review: Confirm Article's Generalizability to P-P-C-C and Overall Methodological Quality
(If the response is "No" or "Not Stated" to any of questions A-D or "Yes" to question E, the article will be excluded from review, pending consensus between the two independent reviewers.)

A. Was the TBI sample recruited via a random or consecutive (

E. Is there evidence to suggest that the COA was not administered in accord with the guidelines for the administration described in the COA instructions or manual? <<Yes/No>>
Step IV-D: Full-text Review: Assess the Methodological Quality of the Article at the Level of the QI

For each study, you will enter basic information, including: the sample size of the Population of interest represented in the study (not including controls), age (mean, median, range), the frequency of males in the sample, other relevant characteristics (e.g., ethnicity, language), the specific COI the study is measuring, and the mode of COA administration (e.g., in person, over the phone, by proxy). You will also indicate the severity of TBI, chronicity, and study setting (emergency department, intensive care unit, outpatient clinic, sports, home care, community residential, military, not specified, other). You will then be presented with a list of QIs and asked to choose the ones evaluated by the article under review, specific to your PoU. If multiple QIs were evaluated by the study, select all that apply (this can be done by holding Control and clicking more than one).
You will be prompted to answer questions surrounding QI-specific Methodological Quality Standards (MQS), which are presented below. Your responses will determine whether the article is of sufficiently high quality. Articles must meet all of the MQS for mandatory QIs to be included in data analysis and synthesis in subsequent steps (V and VI). When mandatory criteria are not met for a QI under review, you will be notified that the article's methodological quality is insufficient. Refer to the color-coded table above for the QIs deemed mandatory (per a modified Delphi consensus-building process) for each PoU. This is an iterative process: when you have indicated that all QIs addressed by the article have been evaluated, you will be asked whether you would like to review another article. You may elect to complete the form again for another article or end the session and finish at a later time. Following this sub-step, your responses for Steps IV-B-D will be reconciled with your co-reviewer's.

(User will be asked to select the PoU-specific QIs under investigation in the study, and will only be shown forms for the PoU-specific QIs selected.)

RELIABILITY
Involves assessing the reproducibility or stability of the COA at a particular point in time or within a relatively short time interval (cross-sectional), and/or over a longer period of time (longitudinal). Various types exist, including internal consistency, test-retest reliability, inter-rater reliability, intra-rater reliability and alternate/parallel-forms reliability.

Reliability: Internal Consistency
For multi-item COAs, the extent to which the items making up the COA scale or subscale correlate with the other items making up the scale or subscale and with the total score. It is typically evaluated by Cronbach's alpha, Kuder-Richardson formula 20 (KR-20), average inter-item and item-total correlations, split-half reliability coefficient or the Spearman-Brown formula. It can also be evaluated via Rasch analysis (i.e., Person Separation Index (PSI)). In general, evaluation of internal consistency is only meaningful in COAs that are unidimensional (i.e., expected to measure only one construct). NB: Internal consistency and unidimensionality are not the same thing. Unidimensionality is considered separately as a QI in the EB-COP.
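For illustration, Cronbach's alpha, the most common of the statistics named above, can be computed directly from an item-score matrix. This is a generic sketch, not part of the EB-COP forms; the function name and sample data are hypothetical, and numpy is assumed to be available:

```python
import numpy as np

def cronbach_alpha(items) -> float:
    """Cronbach's alpha for an (n_respondents x k_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # sample variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 5 respondents x 3 ordinal items
scores = np.array([[2, 3, 3],
                   [4, 4, 5],
                   [1, 2, 2],
                   [3, 3, 4],
                   [5, 5, 5]])
alpha = cronbach_alpha(scores)  # high alpha: the items co-vary strongly
```

Alpha near 1 indicates strong inter-item consistency; note, per the NB above, that a high alpha alone does not establish unidimensionality.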

If yes, specify the hypothesis(es):
B. Was internal consistency tested between the total score and the subscale scores, and/or between the subscale scores? <<YES/NO>>

J. Summary and Comments (optional):
(Use this space to summarize the study findings and add comments that may be helpful for reconciling discrepancies and analyzing the evidence downstream.)

Reliability: Test-Retest Reliability (longitudinal)
The extent to which the COA remains stable or consistent over repeated administrations delivered over a longer time interval during which change may occur, but is tested on individuals who are clinically stable. It can be calculated using the intraclass correlation coefficient (ICC), (weighted) Cohen's Kappa or Pearson's and Spearman's rank correlation coefficients, depending on data type and distribution.
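A minimal sketch of one common ICC form for test-retest data, the two-way random-effects, absolute-agreement, single-measurement coefficient ICC(2,1) of Shrout and Fleiss, is shown below. The function name and data are illustrative only and are not part of the EB-COP platform:

```python
import numpy as np

def icc_2_1(ratings) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single
    measurement. `ratings` is an (n_subjects x k_occasions) matrix."""
    Y = np.asarray(ratings, dtype=float)
    n, k = Y.shape
    grand = Y.mean()
    ss_rows = k * ((Y.mean(axis=1) - grand) ** 2).sum()    # between subjects
    ss_cols = n * ((Y.mean(axis=0) - grand) ** 2).sum()    # between occasions
    ss_err = ((Y - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Hypothetical test-retest data: 5 clinically stable participants, 2 occasions
ratings = np.array([[4, 4], [2, 3], [5, 5], [3, 3], [1, 2]])
icc = icc_2_1(ratings)
```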

Specify statistic(s) result (including 95% confidence intervals, if available):
H. Was there any other evidence to suggest that there were flaws in the study design, methods or statistical analysis for this QI, rendering the findings unreliable? <<YES/NO>>

I. Summary and Comments (optional):
(Use this space to summarize the study findings and add comments that may be helpful for reconciling discrepancies and analyzing the evidence downstream.)

Reliability: Inter-Rater Reliability (longitudinal)
The extent to which administrations of a COA by independent raters agree when administered on multiple occasions over a longer period of time in which change may have occurred, but is tested in individuals who are clinically stable. In particular, assessment of longitudinal inter-rater reliability helps to monitor 'rater drift', or changes in rater behavior (e.g., instruction, cueing/feedback and interpretation) over administrations and time. It can be calculated using the intraclass correlation coefficient (ICC), (weighted) Cohen's Kappa or Pearson's and Spearman's rank correlation coefficients, depending on data type and distribution. Longitudinal models, such as autoregressive, trait-state, growth curve and Rasch models, are used when change may have occurred across the sample tested but the rank order is expected to be preserved.
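Of the statistics listed, weighted Cohen's kappa is a natural choice for ordinal COA scores. The sketch below is a generic implementation for two raters (the function name and data are hypothetical, not part of the EB-COP platform):

```python
import numpy as np

def weighted_kappa(r1, r2, n_categories: int, weights: str = "linear") -> float:
    """Weighted Cohen's kappa for two raters scoring on an ordinal
    scale coded 0 .. n_categories - 1."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    n = n_categories
    obs = np.zeros((n, n))                  # observed joint proportions
    for a, b in zip(r1, r2):
        obs[a, b] += 1
    obs /= len(r1)
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))  # chance expectation
    i, j = np.indices((n, n))
    if weights == "linear":
        w = np.abs(i - j) / (n - 1)         # penalty grows with distance
    else:
        w = ((i - j) / (n - 1)) ** 2        # quadratic weights
    return 1 - (w * obs).sum() / (w * exp).sum()

# Hypothetical example: two raters scoring 5 patients on a 4-level scale
kappa = weighted_kappa([0, 1, 2, 2, 3], [0, 1, 2, 3, 3], n_categories=4)
```

Linear weights penalize disagreements in proportion to their distance on the scale; quadratic weights penalize large disagreements more heavily.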

A. Were at least two, independent administrations of the COA available? <<YES/NO>>
(Here, to qualify as 'independent', the raters should have been blinded/masked to the others' results.)

Specify statistic(s) result (including 95% confidence intervals, if available):
I. Was there any other evidence to suggest that there were flaws in the study design, methods or statistical analysis for this QI, rendering the findings unreliable? <<YES/NO>>

J. Summary and Comments (optional):
(Use this space to summarize the study findings and add comments that may be helpful for reconciling discrepancies and analyzing the evidence downstream.)

Quality Indicator (QI) QI-Specific MQS Prompts
Reliability: Intra-Rater Reliability (cross-sectional)
The extent to which independent administrations of a COA by the same rater agree when measured at a particular time. It aims to assess whether the COA is amenable to consistent administration by the same rater. It is typically calculated using the intraclass correlation coefficient (ICC), (weighted) Cohen's Kappa or Pearson's and Spearman's rank correlation coefficients, depending on data type and distribution. It is often difficult to separate it from test-retest reliability.
A. Were at least two, independent administrations of the COA by the same rater available? <<YES/NO>> B. Were at least two scores available for ≥80% of sample? <<YES/NO>>

Specify statistic(s) result (including 95% confidence intervals, if available):
I. Was there any other evidence to suggest that there were flaws in the study design, methods or statistical analysis for this QI, rendering the findings unreliable? <<YES/NO>>

J. Summary and Comments (optional):
(Use this space to summarize the study findings and add comments that may be helpful for reconciling discrepancies and analyzing the evidence downstream.)

Reliability: Intra-Rater Reliability (longitudinal)
The extent to which independent administrations by the same rater of a COA agree when measured over a longer period of time when change may have occurred, but is tested in individuals who are clinically stable. It aims to assess whether the COA is amenable to consistent administrations by the same rater. It is typically calculated using intraclass correlation coefficient (ICC), (weighted) Cohen's Kappa or Pearson's and Spearman's rank correlation coefficients, depending on data type and distribution.
Longitudinal models, such as autoregressive, trait-state, growth curve and Rasch models, may also be used when change may have occurred across the sample tested but the rank order is expected to be preserved.

Specify statistic(s) result (including 95% confidence intervals, if available):
I. Was there any other evidence to suggest that there were flaws in the study design, methods or statistical analysis for this QI, rendering the findings unreliable? <<YES/NO>>

J. Summary and Comments (optional):
(Use this space to summarize the study findings and add comments that may be helpful for reconciling discrepancies and analyzing the evidence downstream.)

CRITERION VALIDITY
An external measure of validity that aims to evaluate the relationship between the COA and a 'gold' standard. Requires specification of a 'gold' standard, which may include a tool or measurement that is widely accepted as being the best available in the field. Two forms exist: concurrent and predictive validity (see below).

Quality Indicator (QI) QI-Specific MQS Prompts

Criterion Validity: Concurrent Validity
The extent to which the COA correlates with a field-accepted 'gold' or criterion standard that is intended to measure the same or similar construct and administered at the same time. Involves consideration and justification of the 'gold' standard. It is typically measured using the correlation coefficient (e.g. Pearson or Spearman's) or the Area Under the Curve (AUC) via Receiver Operating Characteristics (ROC) analysis, upon dichotomization of the 'gold' standard.
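As an illustration of the correlational approach, Spearman's rank correlation between COA scores and a criterion standard can be computed as below. This is a generic sketch (function name and data are hypothetical, and the AUC/ROC alternative is not shown):

```python
import numpy as np

def spearman_rho(x, y) -> float:
    """Spearman rank correlation: Pearson correlation of the
    rank-transformed scores (ties receive their average rank)."""
    def ranks(v):
        v = np.asarray(v, dtype=float)
        r = np.empty_like(v)
        r[v.argsort()] = np.arange(1, len(v) + 1)
        for val in np.unique(v):            # average ranks over ties
            r[v == val] = r[v == val].mean()
        return r
    rx, ry = ranks(x), ranks(y)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))

# Hypothetical example: COA scores vs. a criterion ('gold') standard
rho = spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```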

B. Was the criterion standard employed adequately justified or, in your best judgment, reasonable? <<YES/NO>>
If no, please specify your reasoning:

Criterion Validity: Predictive Validity
The extent to which the COA correlates with a field-accepted 'gold' or criterion standard that is intended to measure the same or similar construct and administered at some time in the future. In other words, the extent to which the COA is able to predict the score on this 'gold' standard in the future. For example, high school grade point average (GPA) is known to have strong predictive validity because it has been found to correlate strongly with students' GPA in college. Evaluation of predictive validity involves consideration and justification of the future 'gold' standard. It is typically measured using a correlation coefficient (e.g., Pearson's or Spearman's) or the Area Under the Curve (AUC) via Receiver Operating Characteristic (ROC) analysis, upon dichotomization of the 'gold' standard.

CONSTRUCT VALIDITY
A broad property of measurement instruments that aims to verify that a COA measures the (often intangible) phenomenon, or construct, that it is intended to measure, and in the manner that it is expected to measure that construct. Both internal (or 'strong') and external (or 'weak') measures of construct validity exist. Internal construct validity is also referred to as "structural" or "factorial" validity. Assessment of internal construct validity refers to the formal verification of certain pre-specified hypotheses (i.e., a model) regarding the behavior of the COA's composite items and resulting scores, using robust, multivariate statistical approaches. Confirmation of internal construct validity has become more prevalent with the application of 'modern' approaches to outcome measure development, such as Item Response Theory (IRT) and Rasch analysis. The elements of internal construct validity described in the following QIs are considered critical to COAs and, as such, represent assumptions that underlie IRT/Rasch modeling approaches and that must be confirmed in the development or assessment of a psychometrically sound COA. While they are most efficiently captured under an IRT model, these properties may also be evaluated using classical test theory (CTT) approaches.

Generally, if the exact type of construct validity being reported is not specified, it is most likely convergent validity (a type of external construct validity described below, along with divergent/discriminant validity). Traditionally, measures of external construct validity have been the most commonly reported measures of construct validity due to their prevalence in outcome measure development using Classical Test Theory. The FDA defines construct validity as "evidence that relationships among items, domains and concepts conform to a priori hypotheses concerning logical relationships that should exist with measures of related concepts or scores produced in similar or diverse patient groups."
This form of construct validity has also been referred to as "weak" validity because it is established by correlating it with some external measure.

Unidimensionality (via non-IRT-based approaches)
Evidence that a COA that results in a single (composite/total) score represents a single latent construct, namely the single construct that it is proposed to measure. Failure to show unidimensionality suggests that more than one construct is being captured by the COA, complicating the interpretation of the total score. Unidimensionality is assumed when testing the internal consistency or correlation between the items making up that construct. It is typically assessed via factor analytic techniques, such as confirmatory and exploratory factor analysis, or via IRT/Rasch analysis. It is one of the main assumptions underlying IRT/Rasch models. If IRT/Rasch approaches were applied, complete the MQS prompts for the QI Internal Construct Validity (IRT assumptions), below. May also be referred to as "structural validity".

E. Were the statistics appropriate to the distribution/type of data produced by the COA? <<YES/NO/I don't know>>
(e.g., for non-IRT-based approaches: exploratory or confirmatory factor analysis, principal component analysis)

Specify statistic(s) result:
F. Was there any other evidence to suggest that there were flaws in the study design, methods or statistical analysis for this QI, rendering the findings unreliable? <<YES/NO>>

G. Summary and Comments (optional):
(Use this space to summarize the study findings and add comments that may be helpful for reconciling discrepancies and analyzing the evidence downstream.)
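As a concrete illustration of the non-IRT-based approaches named in prompt E, the sketch below (not part of the EB-COP platform; the simulated data and the eigenvalue-ratio threshold are assumptions for demonstration only) screens for a dominant single factor by inspecting the eigenvalues of the inter-item correlation matrix, the quantity that exploratory factor analysis and principal component analysis operate on.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 500 respondents whose 6 item scores all load on ONE latent trait.
n, k = 500, 6
trait = rng.normal(size=n)
items = 0.8 * trait[:, None] + rng.normal(scale=0.6, size=(n, k))

R = np.corrcoef(items, rowvar=False)            # inter-item correlation matrix
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]  # largest first

# A dominant first eigenvalue is consistent with a single construct;
# a flat eigenvalue profile would instead suggest multidimensionality.
ratio = eigvals[0] / eigvals[1]
print(f"eigenvalues: {np.round(eigvals, 2)}  first/second ratio: {ratio:.1f}")
```

In practice the full eigenvalue profile would be examined (not just the first ratio), and a confirmatory factor model would be fit against a pre-specified structure.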

Unidimensionality
Evidence that a COA that results in a single (composite/total) score represents a single latent construct, namely the single construct that it is proposed to measure. Failure to show unidimensionality suggests that more than one construct is being captured by the COA, complicating the interpretation of the total score. Unidimensionality is assumed when testing the internal consistency or correlation between the items making up that construct. It is typically assessed via factor analytic techniques, such as confirmatory and exploratory factor analysis, or via IRT/Rasch analysis. It is one of the main assumptions underlying IRT/Rasch models.

Monotonicity (or Scalability)
Evidence that, as one's score on a COA increases, the score on any single item within the COA increases or at least remains stable, representing an S-shaped function for dichotomous responses. For polytomous items, the "category structure" or "threshold ordering" is determined.
A. Was the single latent trait or construct that the COA is intended to measure adequately described? <<YES/NO>>

Specify statistic(s) used:
Specify the number of items identified as "mis-fitting" the model:
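To make monotonicity concrete, the sketch below (illustrative only; the simulated dichotomous items and the quartile grouping are assumptions, not EB-COP requirements) performs a simple non-parametric check: for each item, mean item scores are compared across ordered rest-score groups, and any decrease counts as a violation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate 400 respondents answering 5 dichotomous items driven by one trait.
n, k = 400, 5
trait = rng.normal(size=n)
difficulty = np.linspace(-1, 1, k)
prob = 1 / (1 + np.exp(-2 * (trait[:, None] - difficulty)))
items = (rng.random((n, k)) < prob).astype(int)

def monotonicity_violations(items, item_idx, n_groups=4):
    """Count decreases in mean item score across ordered rest-score groups."""
    rest = items.sum(axis=1) - items[:, item_idx]   # total minus the item
    order = np.argsort(rest, kind="stable")         # low -> high rest score
    groups = np.array_split(order, n_groups)
    means = [items[g, item_idx].mean() for g in groups]
    return sum(means[i] > means[i + 1] for i in range(n_groups - 1))

violations = [monotonicity_violations(items, j) for j in range(k)]
print("violations per item:", violations)
```

Dedicated procedures (e.g., Mokken scale analysis) formalize this idea with scalability coefficients; the quartile comparison above only conveys the intuition.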

Linearity
Evidence that the COA score is linear throughout its range, such that a 1-point change is equivalent regardless of where along the score scale it occurs. Linearity assumes monotonicity, which is an assumption of the IRT model.

Invariant Item Ordering
In a COA, there will be items that are easier and items that are harder to achieve, complete or endorse. Invariant Item Ordering refers to evidence that the order of difficulty of the items does not change across different patients or respondents. For example, in the GOS-E, if Invariant Item Ordering is met, then achieving "Independence in the Home" (item 2) should be easier than achieving "Independence outside of the Home" for everyone assessed by the GOS-E. Also referred to as "item fit", "differential item functioning" (DIF) or "nonintersection of item response curves".
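The sketch below (illustrative only; the simulated items, difficulties and median split are assumptions for demonstration) shows the basic check: item "difficulty", here the proportion endorsing each item, should follow the same rank order in different patient subgroups.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate 1,000 respondents answering 5 dichotomous items whose
# difficulties are widely spaced (easiest item first).
n = 1000
difficulty = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
trait = rng.normal(size=n)
prob = 1 / (1 + np.exp(-2 * (trait[:, None] - difficulty)))
items = (rng.random((n, difficulty.size)) < prob).astype(int)

# Split the sample by trait level; under invariant item ordering the
# rank order of item endorsement rates is the same in both subgroups.
low = trait < np.median(trait)
order_low = np.argsort(-items[low].mean(axis=0))    # easiest item first
order_high = np.argsort(-items[~low].mean(axis=0))

print("low-trait order:", order_low, "high-trait order:", order_high)
```

Formal tests of invariant item ordering (e.g., within Mokken or Rasch analysis) additionally quantify how far item response curves depart from non-intersection.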

Local Independence
Evidence that the responses to any given item in a COA depend only on the severity or level of the trait or construct being tested and not on the responses to previous items in the COA. In other words, the only significant source of correlation between any 2 or more items in a COA should be the underlying construct being measured.
H. Were the statistics appropriate to the distribution/type of data produced by the two assessments? <<YES/NO/I don't know>>

RESPONSIVENESS
Responsiveness refers to the ability of the COA to detect true change (i.e., change that has actually occurred) over time. It is an essential property of evaluative COAs. The assessment of responsiveness requires a priori specification of the expected magnitude and/or direction of effect, or comparison to 'known' groups. Internal and external responsiveness (described below) are the two ways in which it can be assessed.

Internal Responsiveness
The extent to which a COA score changes over a pre-specified time frame due to treatment effects or natural history change.
It is usually measured within the context of the study, e.g., a repeated-measures design of a treatment/intervention previously shown to be efficacious, or well-established natural history changes. It involves statistics that are based on the distribution of the data generated by the study and can therefore be strongly influenced by the study design (e.g., sample size). Common measures of internal responsiveness include Cohen's effect size, the standardized response mean and the paired t-test.
A. Was the COA employed in a longitudinal design with at least two independent measurements? <<YES/NO>>
B. Was the time-frame between the two measurements (e.g., before and after the intervention) specified? <<YES/NO>>

Specify time-frame:
C. Were the two measurements of the COA available in ≥80% of total sample? <<YES/NO>>

If yes, briefly describe the intervention/event:
E. Was sufficient prior evidence for the effectiveness of the intervention or anticipated natural history change during the pre-specified time-frame presented? <<YES/NO>>

If yes, describe the evidence:
F. Was any evidence presented to suggest that a proportion of the sample showed change (improvement or deterioration) in the intended construct during the study interval (i.e., some indication that the sample was not stable during that time)? <<YES/NO>>
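The three distribution-based statistics named for this QI can be computed directly from paired scores. The sketch below uses made-up before/after COA scores (illustrative only, not GOSE data): Cohen's effect size (mean change divided by the baseline SD), the standardized response mean (mean change divided by the SD of the change scores), and the paired t statistic.

```python
from math import sqrt
from statistics import mean, stdev

baseline = [3, 4, 2, 5, 3, 4, 3, 2, 4, 3]   # hypothetical COA scores, time 1
follow_up = [5, 5, 4, 6, 4, 6, 4, 3, 5, 5]  # same patients, time 2

change = [f - b for b, f in zip(baseline, follow_up)]
n = len(change)

effect_size = mean(change) / stdev(baseline)         # Cohen's effect size
srm = mean(change) / stdev(change)                   # standardized response mean
t_paired = mean(change) / (stdev(change) / sqrt(n))  # paired t statistic

print(f"effect size={effect_size:.2f}  SRM={srm:.2f}  t={t_paired:.2f}")
```

Because all three are driven by the sample's score distribution, the same true change can yield very different values across studies, which is why the EB-COP prompts probe the design behind them.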

Minimal (Statistically) Important Difference (MID)
A statistical or distribution-based estimate of the smallest change in an individual's COA score that needs to be achieved or observed to ensure that it is beyond measurement error. Several approaches to calculating the MID exist, including the minimum detectable change (MDC), which is derived from the standard error of measurement (SEM), the Bland-Altman Plot Limits of Agreement (LoA) and the reliable change index (RCI). The SEM is an absolute measure of reliability and estimates the expected variation in observed scores due to measurement error. MIDs are typically determined in the context of reliability or responsiveness studies.

Specify MID identified:
If no, specify why:
G. Was there any other evidence to suggest that there were flaws in the study design, methods or statistical analysis for this QI, rendering the findings unreliable? <<YES/NO>>

If yes, describe this evidence and where it appears in the text (e.g., page and line number):
H. Summary and Comments (optional):
(Use this space to summarize the study findings and add comments that may be helpful for reconciling discrepancies and analyzing the evidence downstream.)
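The SEM-based route to a distribution-based MID can be shown with two lines of arithmetic. The sketch below uses illustrative numbers (the SD and reliability values are assumptions for demonstration, not published GOSE parameters): the SEM follows from the sample SD and a reliability estimate, and the minimum detectable change at 95% confidence (MDC95) follows from the SEM.

```python
from math import sqrt

sd_baseline = 1.2    # hypothetical SD of COA scores in the sample
reliability = 0.85   # hypothetical test-retest reliability (e.g., an ICC)

sem = sd_baseline * sqrt(1 - reliability)  # standard error of measurement
mdc95 = 1.96 * sqrt(2) * sem               # smallest change beyond measurement error

print(f"SEM={sem:.2f}  MDC95={mdc95:.2f}")
```

An observed individual change smaller than the MDC95 cannot be distinguished from measurement error; the Bland-Altman LoA and the RCI mentioned above are alternative routes to the same kind of threshold.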

External Responsiveness
The extent to which changes in a COA over a specified time frame relate to corresponding changes in an independent, external measure ('gold' or 'peer') of the same or similar construct, or another relevant measure of health status. It is typically evaluated by calculating the correlation between the change in the COA and the change in the reference standard (a.k.a. the 'correlational approach'), or by calculating the Area Under the Curve (AUC) via Receiver Operating Characteristic (ROC) analysis against a dichotomized (i.e., changed versus not changed) or binomial reference standard (e.g., a patient's self-reported transition score, improved vs. not improved, following an intervention). It is considered the more rigorous of the two types of responsiveness because it compares against external measures of change, and it may form the basis for the identification of the minimum clinically important difference (MCID, described below). External Responsiveness is also known as "longitudinal construct validity".

Minimum Clinically Important Difference (MCID)
An anchor-based estimate of the smallest change in an individual's COA score that has been determined to be meaningful to the patient and/or clinician. It is typically determined by comparing the change in the COA score to a question or survey (i.e., an anchor) regarding the patient's perceived health/status at a subsequent time point relative to baseline. The MCID may be determined in the context of an external responsiveness study.

At the start of Step VI, you will be asked to provide your email, re-enter or paste the COA of interest and the P-P-C-C question, and select the PoU from a drop-down menu. The form will populate with the appropriate QIs to be graded based on the chosen PoU. In Step VI, you are asked to synthesize the Evidence Summary Table from the previous step in order to rate each QI as 'adequate' (i.e., meeting the threshold), 'inadequate' (i.e., not meeting the threshold), or 'not determined' (i.e., not studied in a relevant, high-quality study).
It is possible that multiple articles in your review have assessed the same QI. In the event of conflicting findings, judge the methodological quality of the studies when scoring. If the findings are equivocal and the strength of the study design is equivalent (e.g., both Class II), rate the outcome as "not determined." You may provide optional comments in the online form. Your responses for the mandatory QIs will determine the grade and recommendation produced, based on the criteria below. This automated result will be displayed, and the information will also be sent to the email address provided.