A PRISMA systematic review of adolescent gender dysphoria literature: 2) mental health

It is unclear whether the literature on adolescent gender dysphoria (GD) provides sufficient evidence to inform clinical decision making adequately. In the second of a series of three papers, we sought to review published evidence systematically regarding the extent and nature of mental health problems recorded in adolescents presenting for clinical intervention for GD. Having searched PROSPERO and the Cochrane library for existing systematic reviews (and finding none), we searched Ovid Medline 1946 –October week 4 2020, Embase 1947–present (updated daily), CINAHL 1983–2020, and PsycInfo 1914–2020. The final search was carried out on the 2nd November 2020 using a core strategy including search terms for ‘adolescence’ and ‘gender dysphoria’ which was adapted according to the structure of each database. Papers were excluded if they did not clearly report on clinically-likely gender dysphoria, if they were focused on adult populations, if they did not include original data (epidemiological, clinical, or survey) on adolescents (aged at least 12 and under 18 years), or if they were not peer-reviewed journal publications. From 6202 potentially relevant articles (post deduplication), 32 papers from 11 countries representing between 3000 and 4000 participants were included in our final sample. Most studies were observational cohort studies, usually using retrospective record review (21). A few compared cohorts to normative or population datasets; most (27) were published in the past 5 years. There was significant overlap of study samples (accounted for in our quantitative synthesis). All papers were rated by two reviewers using the Crowe Critical Appraisal Tool v1·4 (CCAT). The CCAT quality ratings ranged from 45% to 96%, with a mean of 81%. More than a third of the included studies emerged from two treatment centres: there was considerable sample overlap and it is unclear how representative these are of the adolescent GD community more broadly. 
Adolescents presenting for GD intervention experience a high rate of mental health problems, but study findings were diverse. Researchers and clinicians need to work together to improve the quality of assessment and research, not least in making studies more inclusive and ensuring long-term follow-up regardless of treatment uptake. Whole population studies using administrative datasets reporting on GD / gender non-conformity may be necessary, along with inter-disciplinary research evaluating the lived experience of adolescents with GD.

Introduction
This is the second of three papers examining the literature on adolescent Gender Dysphoria (GD) (see Thompson et al [1] for paper 1, paper 3 in preparation). Some sections of the introductory and methodological text, and reference to methodological limitations, are necessarily repeated across all three papers. The definitions and terminology used in paper 1 were also used in the present paper [1].
Gender Dysphoria (GD) is a categorical diagnosis in the Fifth Edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) [2]. It is also used as a general descriptive term referring to a person's discontent with assigned gender. In recent years, GD diagnoses have been increasingly made in child and adolescent services [3-5]. There has been a parallel increase in demand for gender transition interventions, particularly among natal females [3-5]. Current clinical guidance for gender transition in adolescence follows the so-called 'Dutch model', where intervention is staged in accordance with a young person's age and stage of pubertal development [6,7]. The age at which a stage of intervention will be deemed appropriate is based partly on how reversible it is. The first stage, puberty suppression (prevention of the development of secondary sexual characteristics), is reversible (although not without risks to health and wellbeing) [8]; the second stage, cross-sex hormone treatment, is reversible to some extent (although there is a lack of evidence regarding its longer term impact) [8]; and the third stage, surgical intervention, is irreversible. Consideration of a young person's mental health status is an important component of the assessment process. A young person must be deemed suitably competent to make treatment decisions, as well as suitably distressed to warrant intervention in the first place. There is a need, therefore, to have a good understanding of mental health profiles within this population.
There is now a substantial literature showing that adolescents with GD experience poor mental wellbeing in comparison to their peers [9,10], including suicidal thoughts and behaviour [11,12], and some evidence that mental wellbeing improves for those taking part in intervention programs [6,13,14]. There is inherent bias in some of these studies, however: for example, several papers rely on survey data with no record of natal sex, others acknowledge a lack of socioeconomic representation in their samples, and systematic reviews on the subject have not focused on adolescence [15,16]. Therefore, questions remain about the place of GD within the broader context of a young person's mental health. There is evidence that most prepubertal children with GD desist once they reach puberty [17], whereas adults are more likely to persist [18]. There is some indication that adolescents are unlikely to desist, but there is a lack of relevant recent follow-up studies [19,20]. There is an explicit lack of evidence on adolescent-onset GD, so the interplay between GD and other mental health (MH) factors in this phenomenon is not well understood [21].
Intense international debate regarding a number of issues relating to GD in adolescence is ongoing, especially within Europe and North America where the main research active treatment centres are based [21]. One recent high profile legal case (Bell vs Tavistock [22]) attracted considerable attention from people and organisations with a range of strongly-held views both in favour of and against the ruling, illustrating the acknowledged lack of good quality evidence regarding treatment comorbidities and outcomes to inform service design [23,24]. Services are sometimes left having to make unilateral decisions against national guidance, e.g., the Karolinska University Hospital in Stockholm, Sweden, changing their policy to limit puberty suppression to the context of clinical studies [25].
literature tells us about gender dysphoria in adolescence. We broke this down into seven specific questions (see below). Paper 1 [1] addressed questions 1-3c (italicised), Paper 2 (the current paper) addresses question 4 (bold text), and Paper 3 will address questions 3d and 5-7 (plain text).
1. What is the prevalence of GD in adolescence?
2. What are the proportions of natal males / females with GD in adolescence (a) and has this changed over time (b)?
3. What is the pattern of age at (a) onset (b) referral (c) assessment (d) treatment?
4. What is the pattern of mental health problems in this population?
5. What treatments have been used to address GD in adolescence?
6. What outcomes are associated with treatment/s for GD in adolescence?
7. What are the long-term outcomes for all (treated or otherwise) in this population?
The present paper focuses on question 4. We have addressed questions 1, 2, 3a, 3b, and 3c in our first paper [ref], and questions 3d, and 5-7 in a final paper (in preparation). The methodology below includes the searches conducted for the whole review.
We set out to include any paper offering primary data in response to any of these questions.

Protocol and registration
The systematic review protocol was submitted to PROSPERO on 28th November 2019, and registered on 17th March 2020 (registration number CRD42020162047). An update was uploaded on 2nd February 2021 to include specific detail on age criteria and clinical verification of condition. The review has been prepared according to PRISMA 2020 [26] guidelines (see S1 Checklist).
Eligibility criteria. The volume of non-peer-reviewed literature in initial searches proved so great that we took the decision to only include peer-reviewed journal papers featuring original research data. This decision was made subsequent to initial PROSPERO registration, but prior to full text screening. Complete inclusion criteria were:
• Focused on gender dysphoria or transgenderism;
• Includes data on adolescents (aged 12-17 years inclusive);
• Includes original data (not review paper or opinion piece);
• Peer-reviewed publication (not theses or conference proceedings);
• In English language.

Information sources
We searched PROSPERO and the Cochrane library for existing systematic reviews. We searched Ovid Medline 1946-October week 4 2020, Embase 1947-present (updated daily), CINAHL 1983-2020, and PsycInfo 1914-2020. After selecting the final sample of articles, the first author used their reference lists as a secondary data source.

Search
The final search was carried out on 2nd November 2020 using a core strategy which was adapted according to the structure of each database. The core strategy included search terms for 'adolescence' and 'gender dysphoria'. This was kept deliberately broad in order to ensure any studies on the subject could be screened for eligibility. The specific search strategy employed in EMBASE is given below, and represents the format followed with the others. The specific search strategies employed in each database are detailed in Table 1.
EMBASE search
Papers were excluded if they:
• Included only case studies or selected case series;
• Pertained to conditions other than GD (e.g., disorders of sexual development or HIV);
• Did not include clinically-identified GD (e.g., survey where participants self-identify, with no clinical contact);
• Pertained to populations other than those with GD (e.g., LGBTQ more broadly);
• Pertained to populations including or restricted to those aged 18 years or older. This included papers where adolescents and adults were included in the same sample, but adolescents were not separately reported (in many cases age range was not reported and so a 'balance of probabilities' assessment had to be made based on the reported mean age);
• Pertained to populations restricted to those aged under 12 years of age. This included papers where adolescents and children were included in the same sample, but the majority of participants were clearly under 12 (based on mean or median age);
• Featured practitioners, not patients, as participants;
• Referred only to conference proceedings;
• Were written in a non-European language (e.g., Turkish);
• Could not be obtained (including due to being published in non-English language journals, or in theses).
Following initial full text screening, all remaining papers were assessed by a second reviewer to reduce the risk of inclusion bias. Where reviewers reached a different conclusion, discussion took place to reach consensus. If agreement could not be reached, a third reviewer was consulted, and discussion used to reach consensus amongst all three reviewers.
Data extracted from eligible papers were tabulated and used in the qualitative synthesis. Given the limited number of specialist treatment centres globally, we assessed how many of the included papers featured the same or overlapping samples.
Papers included in the sub-sample for the present analysis contained some indication of mental health (MH) status at assessment / pre-intervention / baseline for adolescents experiencing (clinically likely) GD. Although some papers reported on bullying or school victimisation, we have omitted these findings as, whilst they may represent a risk of MH problems, they are not symptoms in and of themselves.

Quality assessment
All papers were rated by two reviewers using the Crowe Critical Appraisal Tool v1.4 (CCAT [29]). CCAT is suitable for a range of methodological approaches, assessing papers in terms of eight categories: Preliminaries (overall clarity and quality); Introduction; Design; Sampling; Data collection; Ethical matters; Results; Discussion. Each category is rated out of 5 and all eight categories summed to give a total out of 40 (converted to a percentage). In the present review, each paper was then assigned to one of five categories, based on the average rating of the reviewers, where a rating of 0-20% was coded 1 (poorest quality), and 81-100% coded 5 (highest quality). Inter-rater reliability was shown to be very good (k = 0.92, SE = 0.05).
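The scoring scheme described above can be sketched in a few lines of Python. This is an illustration only, not the reviewers' actual procedure: the function names and example ratings are hypothetical, and the three intermediate bands (21-40%, 41-60%, 61-80%) are assumed to be equal-width, since the text states only the lowest and highest bands.

```python
import math

# The eight CCAT categories named in the text, each rated 0-5.
CCAT_CATEGORIES = (
    "Preliminaries", "Introduction", "Design", "Sampling",
    "Data collection", "Ethical matters", "Results", "Discussion",
)


def ccat_percentage(ratings):
    """Convert eight category ratings (0-5 each) to a percentage of 40."""
    if len(ratings) != len(CCAT_CATEGORIES):
        raise ValueError("expected one rating per CCAT category")
    if any(not 0 <= r <= 5 for r in ratings):
        raise ValueError("each rating must lie between 0 and 5")
    return 100 * sum(ratings) / 40


def quality_band(percent):
    """Map a percentage to a band from 1 (poorest) to 5 (highest).

    Assumption: equal 20-point bands, so 0-20 -> 1, ..., 81-100 -> 5;
    fractional values just above a boundary fall into the higher band.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percentage must lie between 0 and 100")
    return max(1, math.ceil(percent / 20))


def paper_band(reviewer_a, reviewer_b):
    """Average two reviewers' percentages, then assign the band."""
    mean = (ccat_percentage(reviewer_a) + ccat_percentage(reviewer_b)) / 2
    return mean, quality_band(mean)


# Hypothetical ratings from two reviewers for a single paper.
mean_pct, band = paper_band([5, 5, 4, 4, 4, 3, 4, 4], [4, 4, 4, 4, 4, 4, 4, 4])
print(f"{mean_pct:.2f}% -> band {band}")
```

Under these assumptions, the lowest and highest ratings reported in this review (45% and 96%) would fall into bands 3 and 5 respectively.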

Data collection process
Data were extracted from the papers using the CCAT form (https://conchra.com.au/wp-content/uploads/2015/12/CCAT-form-v1.4.pdf) by two reviewers per paper and compiled by the first author (LT). Once compiled, instances of overlap between papers (i.e., if the same sample was described in two papers) were identified and tabulated, and the final sample for each question defined.

Number of studies included, retained and excluded
The PRISMA diagram in Fig 1 provides details of the screening and exclusion process. The searches returned 8655 results, reduced to 6202 following de-duplication. Titles and abstracts were screened by one reviewer (LT): 4659 records were excluded after initial screening and a further 699 on second stage title / abstract screening. This left 553 eligible for full text screening. An initial screening (LT) of full texts reduced the number of records to 155. Forty-seven papers were included in the final dataset, of which 32 included data for the present paper. Full characteristics of included studies are provided in Table 2.
Most papers (n = 25) had a focused research question pertaining to MH; others set out to measure MH status as part of the assessment process, while others included any available MH data as part of their characterisation of samples. Four papers were interested explicitly in autism symptoms, and one focused on eating disorder symptoms.
Clusters of samples came from the same regions, specifically in the Netherlands (n = 6), the UK (n = 6), and Canada (n = 5). The USA had the highest number of samples (n = 8), with the remainder from the following countries: Belgium (n = 3), Finland (n = 2), Germany (n = 2), Italy (n = 1), Switzerland (n = 1), Australia (n = 1), and Turkey (n = 1) (note two papers together described six samples, hence the total is 36). The Netherlands data all pertained to the same centre and research group. All six of the UK samples came from the same Gender Identity Development Service (GIDS: Tavistock & Portman NHS Trust) in London, three of the Canadian papers came from the same Transgender Youth Clinic in Toronto, three of the US samples, all three Belgian samples, both Finnish samples, and both German samples each came from the same centres. Accordingly, not all 36 samples are necessarily mutually exclusive. Overlapping samples were not always acknowledged, and so where overlap has / may have occurred (based on location, setting, age and date variables) this has been noted and has been taken into account in any analysis. We generally aimed to include the largest sample / sample covering the widest date range, to optimise representativeness, and sought to avoid 'double counting' where samples clearly overlapped. Fig 2 provides a graphical representation of likely overlap between samples. Based on the reported information, in total we estimate between 3000 and 4000 adolescents assessed at specialist centres for GD between 1980 and 2020 were included in the 32 papers.
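The sample counts quoted above can be cross-checked with a short tally. The sketch below uses only the per-country figures given in the text; the variable names are hypothetical, and the second check assumes that every paper other than the two multi-sample papers contributed exactly one sample.

```python
# Samples per country, as listed in the text.
samples_by_country = {
    "USA": 8, "Netherlands": 6, "UK": 6, "Canada": 5, "Belgium": 3,
    "Finland": 2, "Germany": 2, "Italy": 1, "Switzerland": 1,
    "Australia": 1, "Turkey": 1,
}

n_countries = len(samples_by_country)
n_samples = sum(samples_by_country.values())

# 32 papers, two of which together described six samples.
# Assumption: each of the other 30 papers described one sample.
n_papers = 32
multi_sample_papers = 2
samples_in_multi_sample_papers = 6
expected_samples = (n_papers - multi_sample_papers) + samples_in_multi_sample_papers

print(n_countries, n_samples, expected_samples)
```

Both routes give 36 samples across 11 countries, consistent with the totals reported.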
Most studies were observational cohort studies, usually using retrospective record review (n = 21). A few studies included a comparison group / groups from another GD clinic (n = 2), or from a non-GD population (n = 5). All but one paper was published within the past ten years (2011 or later) and all but five in the past five years (2016 or later). Only two papers explicitly included data from before 2000 (a further six may have included pre-2000 data but did not report dates). All but two papers included both natal male (NM) and natal female (NF) participants (Tack et al (2016) NF only; Tack et al (2017) NM only). All studies reported the proportion of NM and NF participants in their sample (see Table 2 and Fig 2).
The means of assessing mental health status varied across the studies, with self-report (n = 18) and parent-report (n = 17) measures being by far the most common. Participant-reported psychiatric history at clinic intake (n = 7) or history from medical notes (n = 7) was also used, and measures requiring clinician assessment were reported in a further seven. In 17/36 of the samples more than one means of assessing MH status was used, meaning that the remaining 19 samples relied on a single type of measure. Twenty samples were reported to have met clinical diagnostic criteria for GD / GID, usually using one of the DSM manuals (2/20 did not state which criteria were applied). The remaining 16 samples did not report whether participants met diagnostic criteria, but were included on the basis of being established patients within a specialist treatment centre, either in active assessment or treatment (n = 13), or were the result of secondary data mining where authors used ICD 9/10 codes and appropriate keywords to establish likely GD (n = 3). A substantial group of papers narrowly missed inclusion criteria, mostly on the age criterion and some on the clinically likely GD criterion, and were not included in the final sample of reviewed papers. We documented characteristics of all studies excluded at the final full text screen in Table 5.
Table footnotes: Under assessment at clinic: beyond referral stage. MH status codes: 1) parent-report measure; 2) self-report measure; 3) psychiatric / MH history reported at intake assessment; 4) psychiatric / MH history from records; 5) clinician assessment / rating.

Overall findings based on included studies
What is the pattern of mental health problems in this population? Here we present a narrative overview of the findings based on the types of measures employed. Full details of data are in Table 3.
a) Participant-reported or medical record-recorded psychiatric history. Previous or concurrent MH diagnoses were common among the included papers. In their large-scale interrogation of administrative datasets, Becerra-Culqui et al (2018) reported 71% NM and 74% NF had a MH diagnosis 'ever before index date' and 59% NM and 65% NF in the '6 months before index date'. Depressive disorders were most common, followed by anxiety disorders and attention deficit disorders. The authors noted a particularly increased prevalence rate in psychoses for both NM and NF compared to reference females; for autism spectrum disorders (ASDs) in NM and schizophrenia spectrum disorders in NF compared to reference females; and for suicidal ideation and self-harm in both NM and NF compared to reference males.
Among clinic samples, prevalence of MH problems at baseline assessment ranged from 22% (Tack et al, 2017) to 78% (Sorbara et al, 2020). In a similar profile to that reported in the large population analysis from Becerra-Culqui et al (2018), mood disorders were most common (depression ranging from 30% to 78%; anxiety from 21% to 63%) and neurodevelopmental disorders were also relatively common (attention deficit/hyperactivity disorder (ADHD) from 6% to 16%; ASD from 2% to 26%). Eating disorders / difficulties were noted in three papers (12% to 15%). Suicidal ideation (12% to 74%) and self-harm (21% to 55%) were also reported, as well as psychotic symptoms (12-13%) and bipolar disorder (5%). Post-traumatic stress disorder (PTSD) was recorded in one study (23%). Between 31% and 37% had been previously or were currently being prescribed psychoactive medication.
b) Clinician-rated assessment. Four included studies utilised the Children's Global Assessment Scale (CGAS [62]): a clinician-rated assessment where a single rating (0-100) is given based on a range of information gathered on the young person, including assessments and interviews with parents, children, and school staff; higher scores indicate better global functioning. GD adolescents did not show severe impairment on this scale, scoring either in the 'minor impairment' (71-80), 'some problems' (61-70) (2019) reported that looked after and adopted subgroups scored in a more impaired range than young people living with their birth families, but the samples were small (n = 5 each) in the two former groups, and the differences were not statistically significant. Costa et al (2015) reported GD adolescents had significantly poorer scores than the (no psychiatric symptoms) comparison group, and that NM had more negative mean ratings than NF (although means were both in the same clinical category of 'some noticeable problems').
NM seemed to score more poorly than NF in de Vries et al (2011b) but there was no statistical comparison reported.
Kuper et al (2020) found clinician-reported depressive symptoms (QIDS [63]) were in the 'mild' range, congruent with the self-report version (see below). Scores were significantly higher in NF than NM participants, but authors noted the small effect size (Cohen's f = 0.7) for including gender in the regression model.
c) Self-report measures. Most studies (n = 18) used some form of self-report measure. The Youth Self Report (YSR [64]) from the Achenbach System of Empirically Based Assessment (ASEBA [65]) was commonly used as a means of assessing self-reported MH symptoms. Scores on the Total problem scale ranged from 45.7 to 67.1 across included samples, and between 15% and 55% scored 'within the clinical range'. The clinical range was defined in some papers (e.g., T score >63 in Fisher et al, 2017) but not all. Some papers compared their sample to population norms or non-GD samples. In their 2016 paper, de Vries et al reported 41% of participants scoring within the clinical range for Total problems: much higher than in non-referred boys (9%) and girls (8%); Becker-Hebly et al (2020) found YSR scores were significantly worse than published German population norms in adolescents at baseline assessment; and Fisher et al (2017) found their Italian sample of 46 GD adolescents scored higher than non-referred adolescents on Total and Internalising problems (adjusted for age and BMI). Authors observed Internalising problems as being a greater issue than Externalising problems, with some exceptions by natal sex: (2018) found the proportion of their samples falling into the clinical range on the YSR varied significantly between geographical location. Between 15% (Switzerland) and 46% (UK) scored in the clinical range on the YSR (see Table 3). The UK scored most poorly and had the most prevalent emotional and behavioural problems, followed by Belgium and Switzerland.
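To make the 'clinical range' criterion concrete, the sketch below computes the proportion of a sample scoring above a T-score threshold. It assumes the T score >63 cut-off cited for Fisher et al (2017); other papers may have used different thresholds. The function name and the scores are hypothetical and are not data from any included study.

```python
def proportion_in_clinical_range(t_scores, threshold=63):
    """Return the share of T scores strictly above the threshold."""
    if not t_scores:
        raise ValueError("at least one T score is required")
    n_clinical = sum(1 for t in t_scores if t > threshold)
    return n_clinical / len(t_scores)


# Hypothetical Total problem T scores for ten adolescents.
hypothetical_scores = [48, 52, 55, 58, 60, 63, 64, 66, 70, 72]
share = proportion_in_clinical_range(hypothetical_scores)
print(f"{share:.0%} scored within the clinical range")
```

Because the cut-off is a parameter, the same sample can yield a different 'clinical range' proportion under a different threshold, which is one reason the undefined thresholds noted above limit comparison across papers.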
Moyer et al (2019) reported on simple MH screening tools (PHQ-9 [66]; GAD-7 [67]) used with a sample of adolescents assessed at a paediatric endocrinology unit in the US. At initial consultation, almost half of participants had clinically significant depression symptoms, and over a third had thoughts of death or self-harm (which was significantly higher in non-binary individuals). Almost three quarters of the sample had clinically significant anxiety problems at initial consultation. There were no other differences by gender identity.
In relation to mood disorders, severity of depressive symptoms from self-report measures ranged from average / mild to severe across the samples. Kuper et al (2019a) found about 40% of their sample scored in the moderate to severe range for depression and Chiniara et al (2018) found anxiety levels (as measured by MASC-2 [68]) to be 'significant' in 44% of NF and 30% NM (p = 0.02).
Other concepts measured within our sample include: self-concept (Piers-Harris 2 [69] (Edwards-Leeper, 2017)) which was found to be in the average range; suicide risk (MAST [70] (Fisher, 2017)) which was found to be elevated; body image (BIS [71] (Kuper et al 2019a, 2019b, 2020); BUT [72] (Fisher, 2017)), which was found to be higher than non-referred participants by Fisher et al (2017), but with no comparative data in the Kuper papers; general family functioning (MFAD [73]), where Levitan and colleagues found a third scored above threshold for 'problematic'; and finally quality of life, where scores were worse than population norms on the mental health dimension of the Kidscreen-27 [39,74] and Mahfouda, Panos et al (2019) reported Peds-QL [75] scores to be significantly lower in those GD participants with indicated ASD (compared to the group for whom ASD was not indicated).
d) Parent-report measures. Six studies reported on the Child Behavior Checklist (CBCL [76]), the parent-report analogue of the YSR (see above). Reported mean Total problem scores ranged from 44.3 to 64.5 across included samples and between 31% and 55% of samples scored 'within the clinical range', a narrower range than for the YSR. As with the YSR, the clinical range was defined in some papers (e.g., T score >63 in de Vries et al 2011b or score >90th percentile in de Graaf (2018)) but not all. One paper compared their sample to a non-GD sample: de Vries et al (2016) reported 55% of participants scored within the clinical range for Total problems: much higher than in a non-referred sample (9%).
As with the YSR, authors observed Internalising problems as being a greater issue than Externalising problems in general, with some exceptions by natal sex: de Graaf et al (2018) found NF scored significantly higher than NM on Externalising problems and there were significantly more NF scoring in the clinical range on Total score; de Vries et al (2011b) also found significantly higher Externalising problem scores in NF compared to NM. Two papers were able to compare across regions: Canadian and UK participants had higher scores and were more likely to have scores in the clinical range than Dutch participants (de Vries et al, 2016; de Graaf et al 2018).
Two studies utilised the DISC-IV [77] to assess psychopathology: Cohen-Kettenis et al (2002) were able to make a diagnosis in 4 of 11 assessed: 1 specific phobia (NF), 1 transient tic disorder (NM), 1 oppositional defiant disorder (NF), 1 overanxious disorder (NM); and de Vries (2011b) reported 32.4% had one or more DSM-IV diagnosis, with anxiety and mood disorders being most prevalent. They also reported that NM were more likely to have two or more diagnoses, mood disorder, or social anxiety disorder than NF (see Table 3).
Akgul et al (2018) reported a GD sample scored higher (more poorly) than a control group on all three domains (metacognitive, behavioural regulation, global executive composite) of the Behavior Rating Inventory of Executive Function (BRIEF [78]).
Three papers used the Social Responsiveness Scale (SRS-2 [79]) to indicate possible ASD, but reported the outcomes in disparate ways: Akgul et al (2018) reported their GD sample scored more highly than controls, and a significantly greater percentage of the GD sample were 'in the clinical range' (but the threshold being used is not clear); Mahfouda, Panos et al (2019) found 22.1% of their sample to be in the 'severe' clinical range (indicating likely ASD); and Russell et al (2020) reported scores in their London sample to be within the mild (T = 60-65) and normal (T ≤ 59) range for NM and NF respectively. Mahfouda, Panos et al (2019) found no differences in SRS-2 score by natal sex, and Akgul et al (2018) reported NF scored significantly higher than NM on SRS social and SRS ADHD-like subscales within the GD group. The Children's Social Behavior Questionnaire (CSBQ [80]) is similar to the SRS-2 in aiming to measure ASD symptoms. Van der Miesen et al (2018) found a GD sample had significantly higher scores on all subscales than a typically developing comparison group, but significantly lower scores than an ASD comparison group.

Temporal precedence
None of the papers reviewed allows a robust assessment of the place of GD within the context of other MH problems (i.e., 'which came first?'). Becerra-Culqui et al [54] were able to show that there was a high degree of MH problems on records prior to first GD-related presentation, and other studies asked about psychiatric / MH history [30-33,35,37,38,61], but none reported prospective concurrent measurement of GD and other MH problems.

Quality assessment
The CCAT quality ratings ranged from 45% to 96%, with a mean of 81%. All but one paper achieved an overall rating of 4 (good) or 5 (very good), with strengths and weaknesses within certain discrete categories; most papers achieved good ratings in the 'preliminaries' and 'introduction' categories, whereas the 'ethics' and 'discussion' categories were most likely to include lower ratings: 15 papers in each category achieved ratings below 4 (8 of these in both categories). Only one paper was rated as 3 (moderate quality) overall: Cohen-Kettenis & Van Goozen (2002) obtained low ratings across most categories, due to unclear sampling and diagnostic information, lack of information to permit replication, and conclusions which are not supported by the findings. Of the remainder, 14 were rated as high quality (4), and 17 as very high quality (5) (see Table 4). There was no relationship between the year of publication and quality rating (r = 0.3).

Discussion
This systematic review synthesises the current evidence regarding the nature and prevalence of mental health (MH) problems in adolescents presenting for assessment for gender dysphoria (GD). We identified 32 papers published primarily in the past ten years, showing that MH problems, especially mood disorders / symptoms, are relatively common in the adolescent GD population. The few papers that drew on large samples or compared with normative data or non-GD comparison samples found a marked difference in prevalence rates of psychiatric diagnoses or assessment scores within an established clinical range. Comorbidity with neurodevelopmental conditions, especially ASD, was also noted. Whilst the overall picture is one of increased levels of MH problems compared to the general population, the prevalence and nature of MH problems varied across studies. Some differences were noted by geographic region, for example, with those from the Netherlands generally scoring more positively than those in other regions, and the UK and Finland faring poorly [37,42,43]. There is a range of likely reasons for this, the most commonly cited being that the culture in the Netherlands is much more open and supportive regarding gender fluidity and there is simply less stigma [42,43]. Other potential reasons, such as inequitable access to appropriate specialist services, have been ruled out by some authors as the countries included in comparison studies (Netherlands, UK, Canada) have universal health care [43]. However, it would be useful to look at factors such as waiting times for specialist care: one paper mentioned similar waiting times in the Netherlands and the UK [58] based on information available on the centre websites, but this has not been systematically investigated and reported, and it would seem that, at least in the UK, waiting times have recently increased considerably due to greater demand [81].
In other regions, initial assessment may take place in non-specialist services, which may be more accessible and therefore may increase the likelihood of being seen before considerable mental distress develops.
Stage of presentation for treatment is also relevant and varied across studies. The point at which young people are being assessed for admission to intervention services is likely to be an acutely stressful time; some studies found that adolescents with poorer mental wellbeing were those presenting at a more advanced stage of puberty [54], and in systems / regions where  waiting times are necessarily longer, it is likely that secondary sexual characteristics have already begun to develop, and distress increased. Another distinguishing factor between regions is the underlying level of mental wellbeing in the population. Reports show that adolescents in some regions, such as the Nordic countries and the Netherlands, show better levels of mental wellbeing than those in the UK [82,83], for example. There is acknowledged inequality in service provision and accessibility in the UK, and in general thresholds have gone up so that young people are waiting months to years for initial assessment for a mental health concern (by which point problems have become severe) [84], and this will apply to other regions. It is likely, therefore, that some young people presenting for GD intervention have already experienced years of untreated psychological distress which may or may not have preceded their GD. In the US, there is inherent inequality and significant inter-state variation in the health system, and insurance denial has been noted as a barrier to timely treatment [14,61]. All of these contextual factors, and more, need to be properly characterised and considered in describing adolescent GD populations.
Some differences were also noted between sexes, but the findings were not consistent. In some papers, NF participants scored more poorly than NM [33,40,48], but the reverse was true in others, and some studies found no differences by natal sex. In some studies, the usual pattern of NF scoring more highly on internalising problems and NM more highly on externalising was reversed, suggesting psychopathology in these young people aligned more with their identified gender than their assigned gender [42,43,53]. But again, this pattern was not consistently observed (e.g., Kuper et al, 2019b [58]).
The finding in many studies that NF have poorer mental wellbeing, along with the very rapid increase in NFs presenting for treatment (see paper 1, ref), is notable and requires careful monitoring. The aetiology of GD is not fully understood, and the implications of this demographic change are important. Most papers attribute the increase in young people presenting for treatment to cultural shifts in acceptance of gender fluidity and greater availability of services. Whilst these factors are no doubt important, they alone probably do not explain the dramatic increase in NF presentations: there remains the possibility, not apparently explored in this literature, that modern sociocultural pressures associated with womanhood / femininity are influencing this generation's propensity to seek treatment.
Despite the relatively large number of papers in recent years on this subject, there are questions which it is not possible to answer with the given methods. Most papers were cross-sectional in nature, and those that were longitudinal tended to follow participants through their treatment journey (which will be the focus of paper 3). Samples were often relatively small, and there is an acknowledged lack of representation, with little if any racial diversity among study participants, and socioeconomic status rarely measured / presented. It is also noteworthy that almost all the participants in the included studies are characterised as gender binary: other identities remain under-represented. Diversity in methods of measuring and reporting mental health status limits the ability to synthesise findings and draw over-arching conclusions.
The reality is that young people presenting for GD intervention do so within a highly complex context which studies need to better characterise if we are to fully understand the underlying drivers and improve responses to GD and MH comorbidity. We understand very little about the development of MH problems prior to presentation at GD services, and so do not have a clear understanding of the place of GD within the broader context of young people's MH. Simple developmental screening throughout childhood, such as that in place in some Nordic countries, and the ability to link clinical records to population datasets, would allow us to more clearly address issues of temporal precedence and complex interaction of contextual variables. This may be especially true in relation to the co-occurrence of neurodevelopmental disorders. Whilst studies have found an increased representation of autistic features and ASD diagnoses within the adolescent GD population, the interplay of these characteristics cannot be explored without better longitudinal, population-based studies. Indeed, problems in family functioning, peer relationships, and social relationships more broadly were noted in several of the included studies in the present sample [40,42,43]. To what extent GD may be a reaction to a range of preceding psychosocial stressors for a sub-sample of those presenting for services needs to be understood.
Some studies were keen to present an optimistic picture [56], and it is worth noting that in many studies the majority of participants were not experiencing MH problems. Most of the literature in this review comes from a medical standpoint, which tends towards a deficit model of characterisation as the focus is on identifying and treating illness. Some qualitative studies have addressed more positive accounts and the importance of taking a non-pathologising stance [85], but this research tends to focus on adults' experiences. None in the present sample focused on characterising resilience factors, and this will be an important direction for future research.

Strengths and limitations
A strength of this review is the broad search strategy and thorough hand screening process applied. There are also methodological limitations which need to be considered. The broad initial search criteria led to the need for some narrowing of criteria following initial screening (but prior to full-text screening). The addition of parameters regarding type of publication, upper age of participants, and the clinical verification of GD naturally narrowed the pool of papers and therefore may have meant papers with important findings have been excluded (for example, if a paper included an upper age limit of 21 even though the majority were younger than 18). We endeavoured to record all papers that only narrowly missed inclusion on the age criterion (Table 5), but literature that was excluded on the basis of type (i.e., conference proceedings and grey literature) was not included at this stage and so the potential contribution of this body of work cannot be quantified or assessed. Similarly, we excluded papers only reporting qualitative findings as these would not have been able to provide the type of data needed for our research questions, but we acknowledge the potential loss of richer lived experience information in so doing. We opted to use a quality assessment tool for studies of diverse designs (CCAT). This allowed all papers to be rated using the same system, but also involved reviewers having to make subjective ratings rather than apply a strictly quantifiable checklist. This may have led to issues with quality, such as over-statement of the significance of findings, not being sufficiently prominent.
Although we were able to include 32 papers from a range of countries in this review, over a third arose from two well-established treatment centres: those in Amsterdam and London. The Amsterdam team has led the way in developing assessment and treatment protocols for GD and provides a wealth of data over a long period (since 1996 within the included papers), and the London GIDS is a hub for the whole of the United Kingdom, now dealing with hundreds of referrals per year. This presents the advantage of being able to observe the adolescent GD population over a long period of time, assessed using the same or similar tools, and within a relatively stable social context. It is not clear, however, what proportion of young people experiencing GD have access to these national specialist centres and how many may be accessing private facilities or self-medicating with hormones obtained via other routes: we do not know how representative these samples are. Another disadvantage is that most of the papers included in this review are likely to include data from the same samples of participants, also limiting generalisability. The overlap between samples was rarely overtly stated, and there is a risk that readers may give greater weight to collective findings than is warranted.

Conclusion
Adolescents presenting at specialist centres for support / treatment in relation to GD are highly likely to have other MH problems (including neurodivergence). If we want to develop a full understanding of individuals' needs, the quality and quantity of the scientific literature need to improve, as does its representation of different global populations. It will be important to assess and record MH, including GD, in whole populations and to understand the complex contexts of young people's lived experiences. Clinical assessment needs to take a comprehensive approach and include specialists not only in GD but also in other relevant specialisms, such as neurodevelopmental disorders and eating disorders. The long-term outcomes for young people presenting to GD services, regardless of treatment decisions, need to be systematically recorded in an inclusive and representative way, including the use of qualitative methods to ensure young people's voices are not lost.