Treatment Effects of Removable Functional Appliances in Pre-Pubertal and Pubertal Class II Patients: A Systematic Review and Meta-Analysis of Controlled Studies

Background Treatment effects of removable functional appliances in Class II malocclusion patients according to the pre-pubertal or pubertal growth phase have yet to be clarified. Objectives To assess and compare the skeletal and dentoalveolar effects of removable functional appliances in Class II malocclusion treatment between pre-pubertal and pubertal patients. Search methods Literature survey using the Medline, SCOPUS, LILACS and SciELO databases and the Cochrane Library from inception to May 31, 2015. A manual search was also performed. Selection criteria Randomised controlled trials (RCTs) or controlled clinical trials with a matched untreated control group. No restrictions were set regarding the type of removable appliance, provided it was used alone. Data collection and analysis For the meta-analysis, cephalometric parameters of supplementary mandibular growth were the main outcomes, with other cephalometric parameters considered as secondary outcomes. Risk of bias in individual studies and across studies was evaluated, along with sensitivity analysis for low quality studies. Mean differences and 95% confidence intervals for annualised changes were computed according to a random-effects model. Differences between pre-pubertal and pubertal patients were assessed by subgroup analyses. GRADE assessment was performed for the main outcomes. Results Twelve articles (but only 3 RCTs) were included, accounting for 8 pre-pubertal and 7 pubertal groups. Overall supplementary total mandibular length and mandibular ramus height were 0.95 mm (0.38, 1.51) and 0.00 mm (-0.52, 0.53) for pre-pubertal patients and 2.91 mm (2.04, 3.79) and 2.18 mm (1.51, 2.86) for pubertal patients, respectively. The subgroup difference was significant for both parameters (p<0.001). No maxillary growth restraint or increase in facial divergence was seen in either subgroup. The GRADE assessment was low for the pre-pubertal patients, and generally moderate for the pubertal patients.
Conclusions Taking into account the limited quality and heterogeneity of the included studies, functional treatment by removable appliances may be effective in treating Class II malocclusion with clinically relevant skeletal effects if performed during the pubertal growth phase.


Introduction
The mandibular condyles, including their cartilage, have a primary role in the development and growth of the oro-facial complex. In this regard, deficient growth of the condyles may result in mandibular retrognathia, also referred to as skeletal Class II malocclusion. Interestingly, animal studies have shown that forward mandibular displacement enhances condylar growth, resulting in significant changes in the morphology of the mandible [1], [2]. Such induced condylar growth has been shown to be characterized by a thickening of the chondrogenic, proliferative, and hypertrophic layers of condylar cartilage on the posterior aspect of the condyle, thus yielding an increase in total mandibular length [1], [2].
According to this biological evidence, an orthopaedic approach to treat skeletal Class II malocclusion in growing subjects is based on forward positioning of the mandible [3]. For this purpose, several removable or fixed appliances have been developed [3]. However, reviews reported very limited [4][5][6], partial [7] or relevant [8], [9] effectiveness of such treatment in terms of additional mandibular growth, i.e. correction of skeletal Class II malocclusion. The reason for this apparently inconsistent evidence might reside in the different interventions performed [8], [9], in the large variation in individual responsiveness to functional treatment [10], or in the timing, i.e. pre-pubertal or pubertal growth phase [11], during which treatment is performed. Indeed, growth does not occur at a constant rate and children of the same chronological age might not have equivalent skeletal maturity or growth potential [11]. Interestingly, while previous reviews focused mainly on the appliance type [7], [12], none has focused on the timing of intervention, although this issue was raised years ago [8]. The only exception is a recent meta-analysis [13] on fixed appliances that reported significant skeletal effects for pubertal patients and not for post-pubertal ones.
An ethical issue also affects the clinical trials evaluating the effectiveness of functional treatment for skeletal Class II malocclusion. Indeed, the ethical concern of leaving subjects with relevant malocclusions without orthodontic treatment during the pubertal growth phase or after has limited the execution of randomized clinical trials (RCTs) at this stage of development.
Therefore, reviews including exclusively RCTs [4], [5], might have been focused mostly on prepubertal subjects, leaving the potential effects of treatment on pubertal patients excluded from the analysis. For this reason, the consideration of controlled clinical trials (CCTs) with reasonable methodological quality has been advocated [14]. Moreover, it has been reported that whenever RCTs are not available for meta-analysis, CCTs or observational studies may be used with essentially similar outcomes [15].
Whether the efficiency of functional treatment for skeletal Class II malocclusion is critically dependent on the timing of intervention has still not been clarified, especially for removable appliances. Yet, this information would have relevant clinical implications in terms of treatment planning. Therefore, the aim of the present review and meta-analysis of RCTs and CCTs was to assess the short-term skeletal (mainly supplementary mandibular growth) and dentoalveolar effects of removable functional appliances for the treatment of Class II malocclusion during the pre-pubertal or pubertal growth phase, as compared to matched untreated controls.

Search strategy
The present meta-analysis followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [16] (S1 PRISMA Checklist), used a previous systematic review as a template [13], and was registered at the PROSPERO database (http://www.crd.york.ac.uk/PROSPERO). Articles were identified through a literature survey of the following databases: i) PubMed, ii) SCOPUS, iii) Latin American and Caribbean Health Sciences (LILACS), iv) Scientific Electronic Library Online (SciELO), and v) The Cochrane Library. The survey covered the period from inception to the last access on May 31, 2015, with no language restrictions. The search algorithms used in each database have been published previously [13] and are reported in Table 1. Finally, a manual search was also performed by screening the references within the studies examined and the titles of the papers published over the last 15 years in the following major journals: i) American Journal of Orthodontics and Dentofacial Orthopedics, ii) European Journal of Orthodontics, iii) Journal of Orofacial Orthopedics, iv) Korean Journal of Orthodontics, v) Orthodontics and Craniofacial Research, vi) Progress in Orthodontics, vii) The Angle Orthodontist, and viii) World Journal of Orthodontics. The eligibility assessment was performed independently by two blinded authors (GP and JP). The intra-examiner reliability in the study selection process was assessed through the Cohen κ test, assuming a threshold value of 0.61 [17]. Conflicts were resolved by discussion of each article until consensus was reached. An attempt was made to contact the corresponding Authors of the included studies to retrieve any missing information or to clarify specific items.
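As an illustration of the agreement statistic mentioned above, the following is a minimal sketch of Cohen's κ for two sets of include/exclude decisions. The decision lists are hypothetical and the function name `cohen_kappa` is an assumption introduced here; only the 0.61 threshold comes from the text.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two sets of categorical decisions on the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equally long, non-empty rating lists")
    n = len(ratings_a)
    # Observed proportion of agreement
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from the marginal frequencies
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    if expected == 1.0:
        return 1.0  # both sets use one identical category throughout
    return (observed - expected) / (1 - expected)

# Hypothetical decisions for ten screened articles (not data from the review)
rater_1 = ["in", "in", "out", "out", "out", "in", "out", "out", "in", "out"]
rater_2 = ["in", "in", "out", "out", "in", "in", "out", "out", "in", "out"]

kappa = cohen_kappa(rater_1, rater_2)
meets_threshold = kappa >= 0.61  # threshold value stated in the text
```

With these made-up decisions, observed agreement is 0.9 and chance agreement is 0.5, so κ is 0.8 and the threshold is met.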

Eligibility criteria
The studies retrieved had to be RCTs or either prospective or retrospective CCTs. They had to include healthy patients treated during the pre-pubertal or pubertal growth phases. These studies had to investigate the skeletal and dentoalveolar effects, with no restriction as to the type of parameters collected, as long as at least one of the main outcomes (see below) was included. Also, no restrictions were set regarding the type of removable appliance, provided it was used alone without any additional therapy (fixed, extra-oral traction, etc.), the treatment length, or the cephalometric analysis used. Studies were excluded if a reliable indicator of growth phase (hand-and-wrist maturation [HWM] method or cervical vertebral maturation [CVM] method) was not used. Further inclusion and exclusion criteria are listed in detail in Table 2.

Data items
The following data were extracted independently by two authors (GP and JP): study design, prospective or retrospective enrolment of the treated group, sample size, gender distribution, age, type of functional appliance used, Class II description, indicators of skeletal maturity and distribution of subjects according to growth phase, prognostic or other features, cephalometric magnification factor, full treatment and observational duration, mandibular advancement for treated patients and when treatment was stopped. Regarding the treatment effects, the following items were also collected: success rate (as defined in different studies), skeletal, dentoalveolar and soft tissue effects, and Authors' conclusions on the growth phase and treatment efficiency. Forms used for data extraction were mostly pre-defined at the protocol stage by two authors (GP and LC).
Table 2. Inclusion and exclusion criteria used in the present review.

Assessment of risk of bias in individual studies
No single approach in assessing methodological soundness may be appropriate to all systematic reviews [18]. Therefore, risk of bias in individual studies was assessed according to the Cochrane Collaboration's Tool [19] and a slightly modified Downs and Black tool [20] for randomised and non-randomised trials, respectively. The items included in the Cochrane Collaboration's Tool [19] are defined as: sequence generation, allocation concealment, blinding, incomplete outcome data (i.e., drop-out information or cephalometric magnification), selective outcome reporting (i.e., relevant cephalometric parameters), and other risks of bias. In particular, the 'other bias' domain included a set of prespecified entries defined as: i) inclusion of Class II patients relying on overjet alone, which cannot account for a true skeletal Class II malocclusion [21]; ii) lack of analysis of other potentially relevant diagnostic/prognostic features, such as facial divergence, maxillary protrusion, or condylar angle [10].
The original Downs and Black tool is calculated by rating each study across a variety of domains including reporting (10 items), external validity (3 items), internal validity-bias (7 items), internal validity-confounding (6 items), and power (1 item), with a maximum score of 32 [20]. In the present review, only minor adaptations were made to suit studies dealing with functional treatment for Class II malocclusion. These were as follows: i) two items were added in the reporting section: 'Were inclusion and exclusion criteria clearly stated?' (yes, 1 point; no or unclear, 0 points) and 'Is the Class II malocclusion fully described?' (fully described including skeletal parameters, or at least reporting a full molar Class II, 1 point; no, 0 points); ii) the original item #14 'Was an attempt made to blind study subjects to the intervention they have received?' was removed as it is not applicable; iii) the original item #20 'Were the main outcome measures accurate (reliable and repeatable)?' was used to derive 2 items for the reliability of the skeletal maturation staging and cephalometric measurements (yes, 1 point; no or unclear, 0 points); iv) the last item on power was simplified as follows: 'Prior estimate of sample size' (yes, 1 point; no or unclear, 0 points). The maximum score for this modified Downs and Black tool is thus 29. Evaluation was performed without blinding by two Authors (GP and JP) and conflicts were resolved by discussion. A third Author (LC) was consulted if necessary.

Assessment of risk of bias across studies
Heterogeneity was assessed using the χ²-based Q-statistic method, with significance set at p<0.1. However, because of the moderate insensitivity of the Q statistic [22], the I² index was also reported, with 50% taken as the threshold value associated with substantial heterogeneity among the studies [23]. In particular, the I² index describes the percentage of total variation across studies due to heterogeneity rather than chance. The τ² was also calculated for the heterogeneity assessment. The Review Manager software version 5.2.6 (http://www.cochrane.org) was used for the assessment of heterogeneity. Moreover, the Egger test [24] and the Begg and Mazumdar rank correlation test [25] were employed to assess publication bias, with significance set at p<0.1 to compensate for a possible lack of power [26]. Calculations were performed using the Comprehensive Meta-Analysis software version 2.0 (Biostat Inc., Englewood, NJ, USA).
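The three heterogeneity quantities named above can be sketched as follows. This is an illustration only: the DerSimonian-Laird moment estimator for τ² is an assumption about the estimator, the input values are hypothetical, and the p value of the Q statistic (a χ² test with k-1 degrees of freedom) is not computed here.

```python
def heterogeneity(effects, variances):
    """Return Cochran's Q, the I^2 index (%) and a DerSimonian-Laird tau^2.

    effects   : per-study mean differences
    variances : per-study variances of those mean differences
    """
    weights = [1.0 / v for v in variances]          # inverse-variance weights
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2: share of total variation attributed to heterogeneity, floored at 0
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    # Method-of-moments between-study variance, floored at 0
    c = sum_w - sum(w * w for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    return q, i2, tau2

# Hypothetical annualised mean differences (mm) and their variances
q_stat, i2_pct, tau2 = heterogeneity([1.0, 2.0, 3.0], [0.25, 0.25, 0.25])
substantial = i2_pct >= 50  # 50% is the reference value quoted in the text
```

For these made-up inputs Q is 8 on 2 degrees of freedom, which gives I² = 75% and τ² = 0.75.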

Primary and secondary outcomes
For the meta-analysis, primary outcomes included those cephalometric parameters related to mandibular growth, and expressed as supplementary growth in comparison to the untreated controls. They were: 1) total mandibular length, 2) mandibular ramus height, 3) composite mandibular length (according to Pancherz Analysis) [27], and 4) mandibular base (according to Pancherz Analysis) [27]. Secondary outcomes, again as supplementary changes in comparison to the untreated controls, were: 1) SNA, 2) SNB and 3) ANB angles, 4) maxillary base (according to Pancherz Analysis) [27], 5) total facial divergence, and 6) mandibular incisor proclination (relative to the mandibular plane). Although the measures of total mandibular length, mandibular ramus height, facial divergence, and lower incisor proclination differed slightly among the studies, these were combined in the overall effects according to the concept that the differences in the intra-group changes would be poorly sensitive to the absolute measures from which they are derived.

Summary measures and synthesis of results
The mean difference was used for statistical pooling of data and results were expressed as mean and 95% confidence intervals (CIs). Moreover, 90% prediction intervals were also calculated as previously reported [28]. Subgroup analyses were performed whenever possible according to the growth phase, pre-pubertal or pubertal, during which treatment was performed. Moreover, to account for the heterogeneity of the treatments, i.e. differences among the appliances used, treatment length, and cephalometric analysis, a random-effects model was used for calculations of all the overall effects [29]. No studies including two or more treated groups compared to a single control group were retrieved. Finally, these analyses were reported according to the different subgroups of pre-pubertal and pubertal subjects and shown through forest plots. Treatment duration differed notably among the retrieved studies; therefore, when not already reported in the articles, annualised changes for all the parameters were calculated and used for meta-analysis. Furthermore, whenever necessary and possible, the magnification for linear parameters was set at 0%. The Review Manager software was used for meta-analysis (S1 Table).
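The synthesis steps described above (annualising changes, forming treated-minus-control mean differences, random-effects pooling, and prediction intervals) can be sketched as below. Several points are assumptions made for illustration: the linear scaling of both mean and SD by treatment duration, the DerSimonian-Laird estimator, the prediction interval formula pooled ± t(k-2) × sqrt(τ² + SE²), and all study values, which are hypothetical.

```python
import math

def annualise(mean_change, sd_change, years):
    """Scale a change and its SD to one year (assumes a linear change over time)."""
    return mean_change / years, sd_change / years

def mean_difference(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Treated-minus-control mean difference and its variance."""
    return mean_t - mean_c, sd_t ** 2 / n_t + sd_c ** 2 / n_c

def pool_random(effects, variances):
    """DerSimonian-Laird random-effects pooled estimate, its SE, and tau^2."""
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sw
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    c = sw - sum(wi * wi for wi in w) / sw
    tau2 = max(0.0, (q - (len(effects) - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * y for wi, y in zip(w_star, effects)) / sum(w_star)
    return pooled, math.sqrt(1.0 / sum(w_star)), tau2

# Hypothetical studies: (treated mean, SD, n, control mean, SD, n, years)
studies = [
    (4.5, 1.5, 25, 3.0, 1.5, 25, 1.5),
    (5.0, 2.0, 20, 2.0, 2.0, 20, 1.0),
    (6.0, 2.0, 30, 2.0, 2.0, 30, 2.0),
]
effects, variances = [], []
for m_t, sd_t, n_t, m_c, sd_c, n_c, years in studies:
    a_t = annualise(m_t, sd_t, years)
    a_c = annualise(m_c, sd_c, years)
    md, var = mean_difference(a_t[0], a_t[1], n_t, a_c[0], a_c[1], n_c)
    effects.append(md)
    variances.append(var)

pooled, se, tau2 = pool_random(effects, variances)
ci_low, ci_high = pooled - 1.96 * se, pooled + 1.96 * se   # 95% CI

# 90% prediction interval; the t quantile must be supplied by the caller.
# 6.314 is the 0.95 quantile of Student's t with k - 2 = 1 degree of freedom.
t_crit = 6.314
half = t_crit * math.sqrt(tau2 + se ** 2)
pi_low, pi_high = pooled - half, pooled + half
```

The t quantile is passed in by hand because the Python standard library has no Student's t distribution; with more studies the degrees of freedom, and hence the quantile, change.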

Additional analysis
As for the main analyses, all the additional analyses were performed according to the pre-pubertal and pubertal subgroups. Robustness of the meta-analysis for each outcome was assessed by sensitivity analysis, carried out with the Comprehensive Meta-Analysis software by eliminating studies one by one; differences in estimations above 0.5 mm (for linear outcomes) or 0.5° (for angular outcomes) were considered clinically relevant. Moreover, the overall quality of evidence for each of the primary outcomes, according to the pre-pubertal and pubertal subgroups, was evaluated following the Grades of Recommendation, Assessment, Development, and Evaluation (GRADE) guidelines using the GRADE profiler software version 3.6.1 (www.gradeworkinggroup.org) [30]. The GRADE assesses the quality of evidence as high, moderate, low and very low based on eight different domains as follows: i) risk of bias, ii) inconsistency, iii) indirectness, iv) imprecision, v) publication bias, vi) large effect, vii) plausible confounding that would change effect, and viii) dose response gradient [31]. Although the GRADE has been developed for RCTs, CCTs were also entered in the profiler software as randomised studies, provided that they were downgraded by 1 point in the 'risk of bias' domain. All the other GRADE domains were filled according to the published recommendations [30], with the exception of the 'large effect' domain score, which was determined from data regarding the differential growth increment in untreated Class II and Class I subjects [32]. In particular, the mean annualised changes for the cephalometric measurements in the pre-pubertal and pubertal subjects were derived from this reported growth study [32]. Subsequently, 1 mm was added to account for the cephalometric method error, as this value may be considered representative of linear cephalometric measurement error.
Therefore, rounding up slightly, the large effects were set at 1.5 mm/year for all the primary outcomes for pre-pubertal patients, and at 2.5 mm/year for total and composite mandibular length (Pancherz analysis), and at 2.0 mm/year and 1.5 mm/year for the mandibular ramus height and mandibular base (Pancherz analysis), respectively, in pubertal patients. A very large effect was set by adding 1 mm to each threshold. Moreover, due to the lack of reporting for the composite mandibular length and mandibular base (Pancherz analysis), the total mandibular length and Pogonion to Nasion perpendicular [32], respectively, were used instead to define the dimension of the effect.
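The thresholds stated above can be written as a small lookup, sketched here for illustration. The threshold values are those given in the text; the function name, the labels, and the choice to compare the pooled point estimate with the threshold using "greater than or equal" are assumptions, since the text does not state how the comparison was operationalised.

```python
# Large-effect thresholds in mm/year, as stated in the text
LARGE_EFFECT_MM_PER_YEAR = {
    "pre-pubertal": {
        "total mandibular length": 1.5,
        "mandibular ramus height": 1.5,
        "composite mandibular length": 1.5,
        "mandibular base": 1.5,
    },
    "pubertal": {
        "total mandibular length": 2.5,
        "composite mandibular length": 2.5,
        "mandibular ramus height": 2.0,
        "mandibular base": 1.5,
    },
}
VERY_LARGE_EXTRA_MM = 1.0  # "very large" adds 1 mm to each threshold

def effect_size_label(subgroup, outcome, annualised_effect_mm):
    """Classify a pooled annualised effect (assumed: point estimate, >= rule)."""
    threshold = LARGE_EFFECT_MM_PER_YEAR[subgroup][outcome]
    if annualised_effect_mm >= threshold + VERY_LARGE_EXTRA_MM:
        return "very large"
    if annualised_effect_mm >= threshold:
        return "large"
    return "not large"

# Pooled estimates for total mandibular length quoted in the abstract
label_pubertal = effect_size_label("pubertal", "total mandibular length", 2.91)
label_prepubertal = effect_size_label("pre-pubertal", "total mandibular length", 0.95)
```

Under these assumptions, the pubertal estimate of 2.91 mm/year is labelled "large" (above 2.5 but below 3.5), while the pre-pubertal estimate of 0.95 mm/year is not.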

Study search
The results of the electronic and manual searches are summarised in Fig 1. According to the electronic search, a total of 2,458 articles were retrieved. Among these, 12 studies [33][34][35][36][37][38][39][40][41][42][43][44] were judged to be relevant to the present review. However, 2 articles were clearly derived from the same study sample, reporting either the results for soft tissues and the SNA, SNB and ANB angles [41] or other dentoskeletal effects [44], and may be considered as a single study. Full details of the studies excluded at the full-text analysis are reported in Table 3. Four studies could not be retrieved through internet search, through the local library facility, or after having contacted the Authors (Table 4). Finally, 1 study [40] included in the qualitative synthesis was not included in the meta-analysis according to the risk of bias and sensitivity analyses (see below).
The mean treatment duration in the pubertal subjects ranged from 1 year [37] to 1.8 years [35], with appliance wear ranging from at least 14 hours per day [42] to full time [34], [37], [40]. Two studies [35], [39] on pubertal subjects did not report the mean appliance wear time. In 1 study [40] including both pre-pubertal and pubertal subjects, treatment lasted for 1 year, although post-treatment measurements were performed after an additional year of retention. Generally, a single mandibular advancement to an incisor end-to-end relationship was performed for overjet up to 7-10 mm; otherwise, a 2-step procedure was followed [33], [34], [36], [37], [40], [42]. Mandibular advancement by 70% of the maximum protrusive path was used in 1 study [41], [44]. Furthermore, a stepwise advancement of less than 3 mm was performed in one study [43]. Other studies did not report the amount of mandibular advancement. [...] [42], [43], or a normal overjet was achieved in a mandibular retruded position [36], [41], [44]. In 1 RCT [33] treatment was performed for at least 15 months and continued if clinical objectives were not achieved. The rest of the studies did not report when treatment was stopped.
On the contrary, 1 study [40] including pre-pubertal subjects reported no significant soft tissue changes.
The risk of bias in individual studies is reported in Tables 7 and 8 for the RCTs and CCTs, respectively. Briefly, 2 RCTs [33], [36] had an unclear bias with regard to the diagnosis of Class II malocclusion based on the overjet alone, while the last RCT [42] did not show significant risk of bias. Regarding the CCTs, the overall scores ranged from 12 [40] to 24 [41], [43], [44]. Only 1 study had an overall score below the threshold and was thus judged as affected by significant risk of bias [40], two studies [37], [39]

Sensitivity analysis
Sensitivity analysis is detailed in Table 9. Generally, overall effects proved to be robust, except with respect to the study with the highest risk of bias [40], which showed a relevant effect at the sensitivity analysis. Regarding the pubertal subgroup, the overall estimates (for all studies) of total mandibular length and mandibular ramus height differed by about 0.8 mm from the corresponding values obtained without this study [40]. Similarly, clinically relevant differences were seen for the ANB angle and facial divergence when the same study [40] was removed. Of note, the estimate for mandibular incisor proclination also changed when an RCT [42] was removed. Given the results of the sensitivity analysis combined with the overall risk of bias, 1 CCT [40] was excluded from the meta-analyses and GRADE assessment reported below.
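The one-by-one elimination procedure can be sketched as follows. This is an illustration under stated assumptions: a DerSimonian-Laird random-effects pooled estimate stands in for the software's calculation, the study labels and values are hypothetical, and only the 0.5 mm (or 0.5°) relevance threshold comes from the text.

```python
def dl_pool(effects, variances):
    """DerSimonian-Laird random-effects pooled estimate."""
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sw
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    c = sw - sum(wi * wi for wi in w) / sw
    tau2 = max(0.0, (q - (len(effects) - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]
    return sum(wi * y for wi, y in zip(w_star, effects)) / sum(w_star)

def leave_one_out(labels, effects, variances, threshold=0.5):
    """Re-pool after dropping each study; flag shifts larger than the threshold."""
    overall = dl_pool(effects, variances)
    rows = []
    for i, label in enumerate(labels):
        reduced = dl_pool(effects[:i] + effects[i + 1:],
                          variances[:i] + variances[i + 1:])
        rows.append((label, reduced, abs(reduced - overall) > threshold))
    return overall, rows

# Hypothetical annualised mean differences (mm); study D is an outlier
labels = ["A", "B", "C", "D"]
overall, rows = leave_one_out(labels, [2.0, 2.0, 2.0, 6.0], [0.25] * 4)
influential = [label for label, _, flagged in rows if flagged]
```

In this made-up example the overall estimate is 3.0 mm; removing study D lowers it to 2.0 mm, a shift above the threshold, whereas removing any other study changes it by about 0.33 mm.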

Risk of bias among studies
Heterogeneity at the subgroup level was generally low, with I² values between 0% and 56% for all the primary outcomes (Figs 2-5). On the contrary, substantial heterogeneity was seen for the SNA, SNB and ANB angles, with I² values up to 88% (ANB angle, pubertal subgroup), as shown in Figs 6-8. The maxillary base (Pancherz Analysis) and facial divergence showed no or acceptable heterogeneity, with I² values equal to 0% in both subgroups (Fig 9) or not exceeding 55% (Fig 10), respectively. Finally, lower incisor proclination also showed acceptable heterogeneity, with I² values not exceeding 47% in both subgroups (Fig 11). Results of the publication bias analyses are shown in Table 10. Generally, non-significant p values were seen for all the parameters in both subgroups. Exceptions were seen for the SNB and ANB angles, which yielded significant publication bias according to the Egger test in the pubertal subgroup (p = 0.020 and p = 0.056, respectively), for the ANB angle in the pre-pubertal subgroup (p = 0.055), and for the facial divergence in the pre-pubertal subgroup (p = 0.089).

Meta-analysis for the primary outcomes
The cephalometric measurements used in each study and pooled herein for the meta-analysis are reported in Table 11. Detailed results of the meta-analysis for the primary outcomes are shown in Figs 2-5. Overall effects are expressed as mean (95% confidence interval), with 90% prediction intervals summarised in Table 12. For the total mandibular length, no study made use of the Articulare as the endpoint. The overall annualised changes were 0.95 mm (0.38, 1.51) and 2.91 mm (2.04, 3.79) in the pre-pubertal and pubertal subgroups, respectively. The difference between the subgroups was significant at p<0.01 (Fig 2). The prediction intervals of the annualised changes ranged from -0.30 to 2.20 mm and from 1.04 to 4.78 mm in the pre-pubertal and pubertal subgroups, respectively. Regarding the mandibular ramus height, the overall annualised change in pre-pubertal patients was 0.00 mm (-0.52, 0.53), while in pubertal patients the overall annualised change was 2.18 mm (1.51, 2.86). The difference between the subgroups was significant at p<0.01 (Fig 3). The prediction intervals of the annualised changes ranged from -1.69 to 1.69 mm and from 1.17 to 3.19 mm in the pre-pubertal and pubertal subgroups, respectively. For the composite mandibular length, the overall annualised change in pre-pubertal patients was 0.94 mm (0.25, 1.63), while in pubertal patients the overall annualised change was 2.10 mm (1.02, 3.18). The difference between the subgroups was not significant, even though the p value was close to significance at 0.08 (Fig 4). The prediction intervals of the annualised changes ranged from -1.28 to 3.16 mm and from -0.78 to 4.98 mm in the pre-pubertal and pubertal subgroups, respectively.
Regarding the mandibular base (Pancherz Analysis), the overall annualised change in pre-pubertal patients was 1.01 mm (0.21, 1.80), while in pubertal patients the overall annualised change was 1.63 mm (0.98, 2.28), without significant differences between subgroups (p = 0.24; Fig 5). The prediction intervals of the annualised changes ranged from -2.47 to 4.49 mm and from 0.26 to 3.00 mm in the pre-pubertal and pubertal subgroups, respectively.
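As a rough cross-check of the subgroup comparisons reported above, the following sketch recovers standard errors from the published 95% confidence intervals and applies a two-sided z-test to the difference between two subgroup estimates. This is an approximation introduced here, not necessarily the software's own routine; with two subgroups, z² corresponds to a χ² statistic with 1 degree of freedom. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def se_from_ci(lower, upper, level=0.95):
    """Standard error implied by a symmetric normal-theory confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (upper - lower) / (2 * z)

def subgroup_difference(est_a, ci_a, est_b, ci_b):
    """Two-sided z-test for the difference between two pooled subgroup estimates."""
    se_a, se_b = se_from_ci(*ci_a), se_from_ci(*ci_b)
    z = (est_b - est_a) / math.sqrt(se_a ** 2 + se_b ** 2)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Total mandibular length: pre-pubertal vs pubertal (values from the text)
z_len, p_len = subgroup_difference(0.95, (0.38, 1.51), 2.91, (2.04, 3.79))

# Mandibular base (Pancherz Analysis): pre-pubertal vs pubertal
z_base, p_base = subgroup_difference(1.01, (0.21, 1.80), 1.63, (0.98, 2.28))
```

Under this approximation the total mandibular length comparison gives a p value well below 0.01 and the mandibular base comparison gives a p value of about 0.24, in line with the values reported above.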

Meta-analysis for the secondary outcomes
The cephalometric measurements used in each study and pooled herein for the meta-analysis are reported in Table 11. Detailed results of the meta-analysis for the secondary outcomes are shown in Figs 6-11, with 90% prediction intervals summarised in Table 12. Overall effects are expressed as mean (95% confidence interval). For the SNA angle, the overall annualised change in pre-pubertal patients was -0.02° (-0.29, 0.25), while in pubertal patients the overall annualised change was -0.05° (-1.02, 0.08); the difference between the two subgroups was not significant at p = 0.15, and the I² values were 0% and 56% for the pre-pubertal and pubertal subgroups, respectively (Fig 6). The prediction intervals of the annualised changes ranged from -0.89° to 0.85° and from -3.35° to 2.41° in the pre-pubertal and pubertal subgroups, respectively. Regarding the SNB angle, the overall annualised change was 0.56° (0.11, 1.01) in pre-pubertal patients and 1.00° (0.60, 1.39) in pubertal patients, with no significant difference (p = 0.15) between the subgroups, and the I² values were 72% and 0% for the pre-pubertal and pubertal subgroups, respectively (Fig 7). The prediction intervals of the annualised changes ranged from -2.06° to 3.18° and from -0.27° to 2.27° in the pre-pubertal and pubertal subgroups, respectively. For the ANB angle, the overall annualised change in pre-pubertal patients was -0.73° (-0.95, -0.50), while in pubertal patients the overall annualised change was -2.14° (-3.09, -1.18). The difference between the subgroups was significant at p<0.01, and the I² values were 0% and 88% for the pre-pubertal and pubertal subgroups, respectively (Fig 8). [...] For the maxillary base (Pancherz Analysis), the difference between the subgroups was not significant at p = 0.66, and the I² values were 0% for both subgroups (Fig 9). The prediction intervals of the annualised changes ranged from -1.75 to 0.51 mm and from -1.00 to 0.02 mm in the pre-pubertal and pubertal subgroups, respectively.
For the facial divergence, the overall annualised change in pre-pubertal patients was 0.27° (-0.25, 0.79), while in pubertal patients the overall annualised change was 0.80° (0.34, 1.26). The difference between the subgroups was not significant at p = 0.14, and the I² values were 55% and 0% for the pre-pubertal and pubertal subgroups, respectively (Fig 10). The prediction intervals of the annualised changes ranged from -1.10° to 1.64° and from -0.25° to 1.35° in the pre-pubertal and pubertal subgroups, respectively. Finally, for the mandibular incisor proclination, the overall annualised change was 1.37° (0.38, 2.36) in pre-pubertal patients and 0.79° (-0.66, 2.25) in pubertal patients. The difference between the subgroups was not significant at p = 0.52, and the I² values were 0% and 47% for the pre-pubertal and pubertal subgroups, respectively (Fig 11). The prediction interval of the annualised changes was not derivable for the pre-pubertal patients, while for the pubertal patients it ranged from -6.49° to 8.07°.

GRADE Assessment
The GRADE assessment for each of the primary outcomes, with detailed information, is shown in Table 13. For the pre-pubertal patients, the quality of evidence was low for all the outcomes. For the pubertal patients, the overall quality ranged from low (composite mandibular length) to moderate (all the other outcomes). Reasons for downgrading were related to the items 'risk of bias' (use of CCTs, historical controls, and other bias as stated above) and 'imprecision' (according to the heterogeneity seen), and to the inclusion of small studies. No downgrading was applied for inconsistency, indirectness or publication bias (according to the results of the analyses reported above). Finally, the upgrading mainly responsible for the higher quality seen in the pubertal subgroup compared with the pre-pubertal one was due to the size of the treatment effect for total mandibular length, mandibular ramus height and mandibular base, which reached a 'large effect'.

Discussion
The present review allowed the comparison of the effects of functional treatment of skeletal Class II malocclusion by removable appliances between pre-pubertal and pubertal patients. Study designs and main results at the skeletal, dentoalveolar and soft tissue levels were reviewed. Moreover, cephalometric parameters, mainly regarding mandibular growth, were meta-analysed. Overall, taking into account relevant individual variations, the present results demonstrate clinically relevant skeletal effects in terms of additional mandibular growth only if treatment is performed during the pubertal growth phase. In spite of the large number of studies initially retrieved (Fig 1), most of them analysed in full-text were excluded because they did not consider a reliable indicator of skeletal maturity, or because they lacked of an untreated Class II control group (Table 3). Interestingly, a relevant RCT on pubertal subjects [42] was missed in one of the most recent meta-analyses [6].
Even though different treatment modalities were followed in the included studies (and after removal of a low quality investigation [40], heterogeneity among the studies was acceptable with I 2 mostly below 50% for the primary outcomes and some secondary outcomes (Figs 2-5) with consistent results. On the contrary, SNA, SNB and ANB angles showed substantial heterogeneity. Of note, heterogeneity seen herein at the subgroup level for the main outcomes was generally below those reported in other similar investigations [12] where the growth phase was not considered as a clustering factor. Therefore, the different growth phase may explain part of the heterogeneity (and apparent inconsistency of the results) previously reported.
Herein, clinically relevant effects in terms of additional mandibular elongation was see for the pubertal patients of 2.91 mm/year (Fig 2). Similar clinically relevant results were seen herein for the additional increment of the mandibular ramus height (Fig 3). However, different removable appliances may have different modus operandi requiring differential treatment duration. A previous meta-analysis [4] reported no significant effects of functional treatment in Class II patients. This meta-analysis used standardised mean differences (obtained merging several parameters) for the estimation of the overall effects. However, while standardised mean differences may give an indication of the variability among individuals, they do not describe Variable the magnitude of the effect. Further meta-analyses reported some skeletal effects for functional treatment of Class II malocclusion by the use of the Functional Regulator-2 [12] and Twin-Block [7], even though the Authors were not conclusive in terms of treatment efficiency. On the contrary, the results of the present study on pubertal patients may be compared with those from a recent meta-analysis [13] on fixed functional appliance where the mean additional mandibular (total length) growth as compared to matched untreated subjects was about 2 mm. Even though this previous meta-analysis did not report annualised changes it might be hypothesised that, irrespective of the fixed or removable appliance used, skeletal effects are dependent on the growth phase (pubertal) during which treatment is performed.
A noteworthy individual variation in terms of treatment responsiveness was also seen in pubertal patients, particularly for the annualised total mandibular length increment (prediction intervals, Table 12). In this regard, individual variations have been previously reported in pubertal Class II patients treated by functional appliances, with the condylar angle as one of the prognostic features [10]. Interestingly, none of the studies included herein classified patients according to this prognostic feature (Table 5). However, the present meta-analysis cannot discriminate whether such individual variation in treatment effects was due to the different treatment protocols, patient compliance or individual biological responsiveness. In spite of these aspects, the present results would be consistent with previous findings reporting insulin-like growth factor 1 among the key factors promoting chondrogenesis of the condylar cartilage [45], the serum levels of which would be significantly greater in pubertal as compared to pre-pubertal subjects, as determined through the CVM method [46], [47]. While a relevant 'headgear' effect has been reported for the fixed functional appliances used during the pubertal growth phase [13], herein, irrespective of the growth phase of the patients, only a limited maxillary growth restraint (Figs 6 and 9, Table 12) was seen. Taking also into account previous findings [7], it may be hypothesised that removable and fixed functional appliances have different effects on the maxillary bone.
An increase in facial divergence was not seen herein (Fig 10, Table 12), while a slightly greater (although not significant) mandibular incisor proclination was seen for pre-pubertal patients (Fig 11, Table 12). However, this proclination appears to be of limited clinical relevance in either pre-pubertal or pubertal patients. On the contrary, increases in both these parameters have been reported earlier for the Twin-Block treatment [7]. The individual management of the dentition, i.e. extrusion of mandibular teeth, during treatment may explain at least part of this apparent inconsistency.

Limitations of the review
The current investigation on the effects of functional treatment of Class II malocclusion is inherently hampered by some factors. In spite of the use of annualised changes, observational terms may include not only the effective functional treatment, but also variable periods of retention or of further management of the dentition. Therefore, skeletal changes might not occur uniformly during the entire observational term, skewing the analysis of treatment outcomes [5]. The studies included were mostly CCTs, and in 5 studies the treated group was enrolled retrospectively [34], [35], [39], [40], [43] (Table 5). Unavoidably, heterogeneity of the selected studies was mainly seen in the treatment duration, type of appliance used (even though they all share the mechanism of forward posturing of the mandible), or severity of malocclusion (Table 5). Moreover, 2 studies [33], [36] used overjet as the only diagnostic criterion for Class II malocclusion, even though in 1 study [33] most of the patients likely had a skeletal Class II malocclusion according to a mean ± SD ANB angle of 6.3° ± 2.0°. One study [40] was judged to be affected by a significant risk of bias (Table 8) and had to be excluded from the meta-analysis. Some of the included studies had small sample sizes [35], [40], and in 2 studies [37], [39] cephalometric magnification was not declared or retrieved (even though the linear measurements used herein were not reported in those investigations, the rest of the data were set at 0% magnification). Moreover, similar skeletal outcomes were defined slightly differently at the cephalometric recording (see above and Table 11). Finally, an analysis of the potential responsiveness to treatment according to gender or other prognostic factors was not feasible, and this review has focused on short-term effects.
The GRADE quality of evidence assessment was moderate for several main outcomes (Table 13) mainly due to the large effect assigned to these outcomes according to the re-establishment of normal growth in Class II patients [32]. However, studies with an improved level of quality are necessary, with regard to prospective enrolment, full description of Class II features, adequate statistical analysis, and other related information. Even RCTs should rely on skeletal assessment of Class II malocclusion instead of using the overjet, which is more indicative of prominent upper frontal teeth and not always associated with a genuine Class II skeletal pattern [21].

Clinical implications
Within the limitations and heterogeneity of the included studies it appears that, irrespective of the specific type of appliance used and the protocol followed, functional treatment with removable appliances would be valid in correcting skeletal Class II malocclusion. However, the effects behind the correction would be related to treatment timing. Skeletal corrections, including mainly mandibular elongation with minimal or no maxillary growth restraint, may be achieved if treatment is performed during the pubertal rather than pre-pubertal growth phase. All the radiographical methods used in the included studies, based on either the HWM [33], [37], [39], [41], [43], [44] or the CVM method [34], [35], [36], [38], [40], [42], have been shown to be related to the mandibular growth spurt and stature height [48], [49], [50]. Moreover, the CVM method has been shown to be repeatable to a satisfactory level when executed by trained operators [51]. Finally, a simplified third finger maturation method (derived from the full HWM) and the CVM method have shown a good degree of correlation and diagnostic agreement, suggesting a combined use according to the available radiographical record [52]. This would be particularly useful when skeletal maturation has to be followed longitudinally in pre-pubertal patients until the beginning of the pubertal growth phase. However, a pure skeletal effect would not be expected even during puberty, as some dentoalveolar effects are also present, even though mandibular incisor proclination consequent to functional treatment would be limited, with minimal clinical implications, especially for pubertal patients. Similarly, an increase in facial divergence was minimal or absent in both pre-pubertal and pubertal patients. Even though further evidence is needed, the use of a reliable indicator of skeletal maturity, either HWM or CVM, may be recommended in routine clinical practice so that treatment can be performed during the pubertal growth phase.

Conclusions
Taking into account the still limited quality of the reported studies, and their heterogeneity in terms of study designs, treatment protocols and appliances used, the following conclusions may be drawn:
• Functional treatment by removable appliances may be effective in correcting Class II malocclusion with relevant skeletal effects if performed during the pubertal growth phase. Skeletal effects of functional treatment were seen at the mandibular level and consist mainly in mandibular elongation and increase in ramus height, although dentoalveolar effects were detected even in pubertal patients.
• However, both the increases in total mandibular length and in ramus height showed a noteworthy individual variation in treatment responsiveness in pubertal patients.
• Irrespective of the growth phase, no or very minimal effects were seen in terms of maxillary growth restraint or increase in facial divergence.
• Further high quality RCTs with proper inclusion criteria for skeletal Class II malocclusion are needed to fully elucidate the role of growth phase in the efficiency of functional treatment with removable appliances.

Supporting Information
S1 PRISMA Checklist. (DOC)
S1 Table. Main data set underlying the meta-analysis as RevMan file format. (RM5)