Analysis of survey on menstrual disorder among teenagers using Gaussian copula model with graphical lasso prior

A high prevalence of menstrual disturbance has been reported among teenage girls, and research shows that there are delays in diagnosis of endometriosis among young girls. Using data from the Menstrual Disorder of Teenagers Survey (administered in 2005 and 2016), we propose a Gaussian copula model with graphical lasso prior to identify cohort differences in menstrual characteristics and to predict endometriosis. The model includes random effects to account for clustering by school, and we use the extended rank likelihood copula model to handle variables of mixed-type. The graphical lasso prior shrinks the elements in the precision matrix of a Gaussian distribution to encourage a sparse graphical structure, where the level of shrinkage is adaptable based on the strength of the conditional associations among questions in the survey. Applying our proposed model to the menstrual disorder data set, we found that menstrual disturbance was more pronouncedly reported over a decade, and we found some empirical differences between those girls with higher risk of developing endometriosis and the general population.


Introduction
Menstrual disturbance, which causes physical discomfort and psychological interruption, is being reported more frequently among teenage girls [1]. It may have an adverse impact on school studies, social relationships, physical and mental health [2,3]. Early investigation of menstrual disorder symptoms may assist teenagers with early diagnosis of some severe diseases, such as endometriosis and polycystic ovarian syndrome. There is an urgent need for a better understanding of the prevalence of menstrual disturbance and variation in experiences of menstruation among teenagers.
Studies using questionnaires to investigate menstrual disorders among adolescent girls had been conducted in multiple study cohorts in different counties [2,[4][5][6]. They are important resources to study menstrual problems and uncover factors that cause abnormal menstrual symptoms during the pubertal development of teenage girls. The Menstrual Disorder of Teenagers (MDOT) survey was conducted in 2005 and 2016 to collect data on the menstrual pat-terns of teenage girls [7]. Both surveys were conducted in the Australian Capital Territory (ACT) using the same questionnaire. The two cohorts of participants were 15-19 years old teenage girls from 4 senior high schools in 2005 and 3 senior high schools in 2016. Consent forms were signed by all the eligible girls before participating in the surveys. The data collection was approved by ACT Health Human Research Ethics Committee (ETHLR. 15.133) and the ACT Government Education Directorate. Using the collected data to conduct analysis was approved by the Human Research Ethics Committee at the Australian National University (Protocol 2017/284). The quality of the data was maximized by the careful design of the questionnaire, getting support from participating schools, and allocating time to fill in the questionnaires during class. The consistency of the data from the two cohorts was guaranteed by using the same questionnaire and following the same data collection procedure from 2005 in 2016.
A sample questionnaire is attached in the S1 File. Information on the following was collected: personal information, typical menstruation characteristics, menstrual symptoms, interference with life, menstrual experiences, and knowledge and diagnosis of some menstrual diseases. Using the MDOT data set, which contains holistic information from more than 100 questions on menstruation, a range of scientific questions can be answered. In this paper, we will provide answers to the following questions using our proposed model in Section 3.
Which menstrual characteristics are changing over time? At the population level, it is of interest to note any pattern of changes of menstrual characteristics, and the MDOT survey provides a unique opportunity to answer this question.
What are the relationships between questions in the survey, i.e. typical menstrual characteristics, menstrual symptoms, and interference with life activities? Previous studies have shown that various menstrual symptoms and menstrual experiences are often interrelated with other each, and understanding how the health indicators emerge together is clinically important.
What variables are related to indicators of problematic periods, for example, pain and school absence? This regression-type analysis can reveal the relationship between a set of predictors and the responses of interest.
Can the MDOT questionnaire identify girls with a higher risk of developing endometriosis? Screening for endometriosis without relying on surgery is highly desirable among teenagers. The MDOT questionnaire covers a wide spectrum of questions and can be potentially used to develop models to evaluate individual risk.
Exploring relationships between questions is of essential importance, however there are statistical challenges to answer these clinical questions compatibly. First of all, the variables are of mixed-type (continuous, binary, ordinal), and there are constraints on their distributions. For example, the age of the senior high school participating girls was between 15 and 19; a list of statements related to periods were binary as true or false, and some life interference questions were asked using an 11-point Likert scale from 0 (no interference) to 10 (major interference). Secondly, there are a large number of questions in the questionnaire, all of which need to be analysed jointly to understand their relationships. Calculating pair-wise correlations (Pearson/ Kendall's tau/Spearman's rho) can be misleading, because the relationship between two variables may be mediated by other variables. Moreover, partial correlation which computes the linear relationship between two continuous variables after controlling for other variables is not appropriate for mixed-type variables. Furthermore, some highly correlated variables may cause multicollinearity when fitting a statistical model, making the results unstable and difficult to interpret. There are methods to reduce dimensionality via variable selection or shrink-age, for example, stepwise regression and lasso [8], but these methods do not directly account for missing values in predictors, which leads to our third challenge. There are moderate amounts of missing data in each variable, resulting in only a small proportion of girls with complete answers to every question. This is not unexpected because of the sheer number of questions in the questionnaire and the teenagers may have been unfamiliar with some terms or did not understand some questions, and so left them blank. Some missingness was caused by a misunderstanding of questions, for example, many participants answered the cycle length question which should typically lie between 21 and 45 days as the number of bleeding days, so we treated any answers outside the proper range as missing values. Fourthly, the data set is multilevel, consisting of two cohorts of girls from seven schools, which should be modeled explicitly. Lastly, there were only 7 cases of self-reported diagnosis of endometriosis in cohort 2, and that question was not collected in cohort 1. Predicting girls with higher risk of developing endometriosis from this study is challenging, given the sparse occurrence in the MDOT data set. Those questions in the MDOT data set motivate us to develop a new modeling strategy to handle all the challenges simultaneously.
Gaussian graphical models, also known as Gaussian Markov random fields [9], are often used to describe the non-causal relationships among variables in multivariate data sets. Consider a data set with p random variables y 1 , . . ., y p . A graphical model can be represented as . ., p} denote the vertices associated with the variables, and E âŠ‚ {(i, j) 2 V × V: i < j} denotes the edges between a pair of variables. In a Gaussian graphical model, the random variables y = (y 1 , . . ., y p ) follow a p-variate Gaussian distribution y * N p (μ, O − 1 ), where μ is the mean vector and O is the precision matrix, which is the inverse of the covariance matrix. The information on conditional independence (dependence) is embedded in the precision matrix O, where the element ω ij = 0 indicates independence between variables i and j conditional on the other variables, and graphically, it corresponds to an absence of the edge between variables i and j.
However, one of the limitations of using the Gaussian graphical model is that the variables are assumed to follow Gaussian distributions. Even after some transformations, the variables can depart from Gaussianity and finding such transformations is usually impossible for discrete variables. Copula models have proven to be very powerful for modeling variables of different types and shapes, when there is an underlying dependence among them [10]. In a copula model, the joint distribution F of variables y 1 , . . ., y p is decomposed into the marginal distributions F 1 , . . ., F p which capture the different distributions of individual variables and a copula function C which describes their dependence, such that F(y 1 , . . ., y p ) = C(F 1 (y 1 ), . . ., F p (y p )). In the Gaussian copula model, C(u 1 , . . ., u p |Γ) = F p (F −1 (u 1 ), . . ., F −1 (u p )|Γ), where u l = F l (y l ), F p is the cumulative distribution function of the p-variate normal distribution with mean zero and correlation matrix Γ (= O − 1 ). Equivalently, the latent variable is defined as z l = F −1 (F l (y l )), and z = (z 1 , . . ., z p ) jointly follow a Gaussian distribution z * N p (0, O − 1 ). Notice that with the Gaussian copula model, the statement of the conditional independence is made on the latent data z which are monotonically transformed from the original data y.
To make the Gaussian copula model generalizable to any ordered variable, we will further couple the Gaussian copula model with the extended rank likelihood method [11], so that inference on the association parameters are purely based on the rank of data without explicitly modeling the marginal distribution functions F l (to be discussed in detail in Section 2.1). To handle the large number of variables in the MDOT data set, we also employ shrinkage methods to estimate a sparse Gaussian graphical model. This can be achieved by imposing a penalty on the O matrix when maximizing the log likelihood in a frequentist model [12,13], or by putting a shrinkage prior on O in a Bayesian model [14][15][16][17].
In this paper, we propose a Gaussian copula model with graphical lasso prior to analyse the MDOT data set, to tackle all the challenges mentioned above. Built upon the extended rank likelihood Gaussian copula model [11], our model is able to capture the dependence between all the variables of mixed-type in a unified framework. The adaptive graphical lasso shrinkage prior [17] is applied to the precision matrix in the graphical model to learn a parsimonious structure of the graph. The school effects are modeled by random effects from which the cohort differences can be derived. Moreover, a predictive score of endometriosis is obtained from the posterior predictive distribution for each person. The statistical inference is carried out under a Bayesian framework where the missing data are integrated out so that the extra variability due to missingness is incorporated in the posterior. The computational algorithm is adapted from the block Gibbs sampling algorithm in the Gaussian graphical model with lasso shrinkage prior [17] and the copula model for mixed-type data in multilevel data sets [18].
The paper is structured as follows. In Section 2, we review the extended rank likelihood method and describe our proposed model. In Section 3 we apply our model to the MDOT data set, and provide our answers to the clinical research questions (a)-(d) mentioned above. Section 4 concludes the paper with some extensions. The computational algorithm is described in detail in Appendix A, and a simulation study on choosing hyper-parameter values is described in Appendix B.

Gaussian copula model with graphical lasso prior
We consider multilevel data sets, with l = 1, . . ., p variables, c = 1, . . ., m clusters and i = 1, . . ., n c units within cluster c. Our proposed model is where F l is the distribution function of variable y l , the vector z ic is the latent variable of unit i in cluster c, corresponding to the observed data vector y ic . The random effect b c is a vector of length p for cluster c which follows a p-variate Gaussian distribution with unconstructed covariance matrix C. Model (1) specifies a Gaussian copula model on the latent variables z, where the precision matrix O contains information of conditional independence between variables after adjustment for the cluster effects.
To obtain a sparse graphical structure, we put the adaptive graphical lasso prior on the precision matrix O as follows where λ = {λ ij , i, j = 1, . . ., p}, and DE(�), Exp(�) and Gamma(�) are probability distribution functions of the double exponential, exponential and Gamma distributions respectively. The shrinkage parameter λ ij is associated with the element ω ij in the O matrix, and O is restricted to positive definite matrics M + . The two hyper-parameters s and t further control λ ij . The adaptive graphical lasso prior (2) allows for different shrinkage effects on each unique elements in O, so that it overcomes some limitations of the double exponential prior which tends to overshrink large parameters but under-shrink small ones [19]. We will discuss the sensitivity to the choice of hyper-parameters in Section 3.5 and Appendix B. We assume a semi-conjugate Inverse Wishart prior for the covariance matrix of random effects: p(C) = Inv Wishart(ν, Λ).

The extended rank likelihood of Gaussian copula
In the Gaussian copula model, specifying each of the marginal distributions F l is labour intensive and the distributions of variables in real data sets may not be accurately represented without a large number of parameters. Some authors suggested transforming the variables using the empirical distributionF l to obtain pseudo data and avoid the parametric estimation of marginal distributions [20]. However, this only applies to continuous variables. To link the discrete variables with continuous latent variables, Hoff provided a simple way of analysing the correlation among variables with meaningful ordering (continuous, binary and ordered categorical variables), via the extended rank likelihood [11]. This makes use of the fact that the order of the underlying latent variable is consistent with the observed data, and inference about the association parameters can be drawn from the 'rank-based' latent variables through a simple parametric form. Therefore, there is no need to specify the marginal distributions F l directly when making inference on O, since we know F −1 (F(�)) is a monotone transformation, the ordering of the data is the same as the ordering of the latent variables z. Specifically, observing y = (y 1 , . . ., y N ) tells us that z = (z 1 , . . ., z N ) must lie in the set: fz 2 R N�p : maxfz hl : y hl < y nl g < z nl < minfz hl : y hl > y nl gg, where the index n = 1, . . ., N runs through all the observations. Let 'D' denote the set of all possible z which are consistent with the ordering of y. Then the event 'z 2 D' can be treated as the observed event upon which inference of O is made, i.e. p(z 2 D|O).
Computationally, any latent variables z associated with observed data are updated from truncated Gaussian distributions subject to their neighboring data points (see details in Step 8 in Appendix A, whereas the latent variables associated with missing data are updated from Gaussian distributions without truncations. The model is able to account for data Missing At Random (MAR), and the uncertainty due to missing values is incorporated in the posterior distribution.

Related works
A copula-based imputation model was proposed by fusing the extended rank likelihood for ordered variables and a multinomial probit model for nominal variables, and added random effects to the latent variables in multilevel data sets [18]. Our model (1) is the same but does not include nominal variables. As for the priors, the semi-conjugate Inverse Wishart priors were put on both C and Γ = O −1 matrices in [18] so that no formal shrinkage effects through regularization priors were imposed.
For multivariate analysis with a moderate to large number of variables, there are some proposed models for learning a sparse Gaussian graphical model using shrinkage methods. This can be achieved by imposing a L 1 penalty on the O matrix in the log likelihood function where S is the sample covariance matrix, and this optimiazation problem was solved by the interior point method in [12]. Notice that by assuming a common shrinkage parameter λ in the graphical lasso prior (2), its maximum a posteriori is the same as the maximum likelihood solution of model (3) when ρ = λ/N. The block Gibbs sampler used in our sampling algorithm follows closely the block coordinate decent method of [13] when solving (3). Putting a different prior, the G-Wishart prior, on the precision matrix O, [16] and [14] proposed a Bayesian Gaussian copula graphical model with the extended rank likelihood, but without random effects in the model. They decomposed the posterior distribution of the graphical model as with a uniform prior on the graph G, and a G-Wishart prior on the O matrix given a graph structure p(O|G). The G-Wishart prior is conjugate to a Gaussian graphical model that allows for the absence of some edges, which is equivalent to have some elements equal to 0 in the O matrix. When the graph is fully connected, the G-Wishart prior reduces to the Wishart prior. The difference between the graphical lasso prior in our model and the G-Wishart prior in their model is that the former has zero probability to shrink w ij to 0 exactly, therefore some post processing decisions should be made to decide a graph structure. In other words, the inference of their model was done in two steps by selecting G and making inference on O whereas the posterior decomposition in our model was , such that the graphical structure G was implied by the O matrix implicitly. [16] applied their model to a functional disability data set which contained 16 binary variables of activities of daily living, to analyse the interactions among them, whereas [21] performed an application to a dupuytren disease data set, aiming to model the relationships between the severity of the disease in the 10 fingers and potential lifestyle risk factors.

Data application to the MDOT data set
In this section we apply our model to the MDOT data set, to answer the four clinical questions listed in the introduction using the collected information from the MCMC. The data set was coded in the following way before fitting the model. We derived the variable BMI from weight and height, and the number of bleeding days from the heaviness of bleeding question. We treated the answers to the cycle length question for fewer than 21 days or longer than 45 days as missing values. As the response options available for the multiple choice questions on menstrual symptoms were not mutually exclusive (Section 3 in the MDOT questionnaire), and because we were more interested in the appearance of symptoms before and during periods, we dichotomized the choices 'just before a period' and 'at the time of period' as 1 and others as 0. For questions regarding interference with life, statements of period and knowledge of menstrual diseases, 'do not know' or 'NA' answers were treated as missing values. Finally, diagnosis of polycystic ovarian syndrome/polycystic ovaries, pelvic inflammatory disease and endometriosis were only asked in the second cohort but not in the first cohort. Nevertheless we still allowed the three questions to enter into our model by treating the answers in the first cohort as missing values, so that only answers to these three questions from cohort 2 provided information on associations with other variables. We excluded 29 girls who answered that they did not have any periods so most of the questions were not applicable to them. The final data set contained 1039 girls from 4 schools in the first cohort and 1041 girls from 3 schools in the second cohort. Two schools participated in both cohorts, and we assumed that the intrinsic characteristics of the schools had changed over the 11-year interval; that is, we assumed responses from a school in cohort 1 were independent of responses from the same school in cohort 2. Due to these characteristics, we included 7 random effects and not 5. There were p = 105 questions entered into the Gaussian copula graphical model, including 5 continuous variables, 19 ordinal variables and 81 binary variables.
We ran the MCMC for 20,000 iterations, saving every 10 th scans and discarding the first half of the samples as the burn-in period. The hyper-parameters for C were chosen to be ν = p + 2, Λ = I p , and for the shrinkage parameters λ ij to be s = 10 −2 and t = 10 −4 . We performed a sensitivity analysis on choosing t which will be described in Section 3.5. We note that the model was insensitive to the settings of other hyper-parameters. For the MCMC diagnostics, we examined the autocorrelations and traceplots of the parameters as well as checking for the mixing and convergence by the potential scale reduction factors [22] from three independent chains with different starting values. There were 4.37% and 0.348% of the parameters in the O and C matrices with their individual potential scale reduction factors exceeding 1.05. Given the sheer number of parameters in the model, the results suggest convergence.
As an alternative approach, we also considered the Bayesian Gaussian copula graphical model of [21] who decomposed the graph as Eq (4) via the extended rank likelihood method with a G-Wishart prior. They used a trans-dimensional birth and death process to select the graph structure, by adding or removing edges in the MCMC. However their approach does not allow explicitly for adding clustering effects. They implemented their algorithm in the 'BDgraph' package in R [23], and we used the function 'bdgraph' to fit model to the MDOT data set, with 20,000 iterations, and assuming non-informative priors.

Cohort differences
Recall that b cl denotes the random effect for school cÂ (c = 1, . . ., 7) in question lÂ (l = 1, . . ., to be the first cohort effect for question l, and sim- to be the second cohort effect. We computed the mean and the 95% highest density region of the difference b Results are shown in the 'GCM_Lasso 95% credible interval' column in Table 1. If the 95% highest density region of the cohort difference b dif,l did not contain 0, heuristically, it suggests that the cohort difference of question l was important. Results for the alternative method, the Gaussian copula graphical model with G-Wishart prior, are reported in the 'BDgraph' columns. To provide an answer to the cohort difference without modeling school effects explicitly, we added the 'cohort' variable as the 106 th variable, to indicate which cohort the respondents came from. Then we calculated the regression type coefficients of the 'cohort' variable on each of the other 105 variables as responses. This is similar to a regression problem in that we treated 'cohort' as a predictor and see if it was significantly associated with the response. The regression coefficient of the predictor 'cohort' on the l th response variable can be derived from the covariance matrix Γ as G 106;l G À 1 106;106 by the conditional distribution of a multivariate Gaussian distribution, because z icl jz ic;106 � Nðb cl þ G 106;l G À 1 106;106 ðz ic;106 À b c;106 Þ; G l;106 G À 1 106;106 G 106;l Þ, or from the precision matrix O as À O 106;l O À 1 ll using the identity O 106;l ¼ À O ll G À 1 106;106 G 106;l . The posterior means of the coefficients are reported in the 'BDgraph coefficient' column in Table 1. The estimated posterior edge inclusion probabilities between the 'cohort' variable and the other 105 variables are reported in the 'BDgraph edge' column of Table 1. When the edge inclusion probability was 0, the corresponding coefficient was 0 as well, which means that the cohort difference was insignificant. The number of significant differences implied by our proposed model and the 'BDgraph' method were 45 and 55 respectively. Many questions reached the same conclusion by both methods. For example, 27% of girls suffered from stabbing pelvic pain in 2005, and the number rose to 41.6% in 2016. The 95% credible interval of b dif ;l ¼ � b ð1Þl À � b ð2Þl was (-0.304,-0.112), which means that the mean random effect for cohort 1 was significantly lower than cohort 2, suggesting more girls in 2016 reported a positive answer to this question. In the 'BDgraph' method, the edge inclusion probability of the 'cohort' variable and the 'stabbing pelvic pain' question was 1, and the coefficient was 0.219, indicating a higher likelihood of having this symptom in the second cohort. Both models provided evidence that menstrual disturbance was generally getting worse or more reported over the decade. More girls suffered from Table 1  diarrhoea, pain before or when passing wind, bleeding from the bottom, and thrush before or during their periods. More girls also reported higher interference with life (completing school work, feeling tired), increased worry about their periods, and more pimples. Sanitary use habits also changed, as girls were more likely to use pads than tampons. Moreover, the psychological related questions indicated a higher level of stress and depression, as 23.4% felt grumpy all the time and 62.4% got teary before or during a period in the second cohort compared with 10.3% and 54.8% respectively in the first cohort, and almost a half of the girls felt overwhelmed (49.4%) and wanted to withdraw or hide (48.8%) before or during their period in 2016, increasing by 15.4% and 21.3% respectively compared with 11 years ago. The two Bayesian approaches further confirmed these significant cohort differences. The score of interference of sexual activity dropped significantly from 3.69 to 2.46 (on a 11-point Likert scale), which was evidenced by the fact that the credible interval was above 0 (0.214,0.437) in our proposed model and the coefficient was negative (-0.169) with edge inclusion probability equal to 1 in the 'BDgraph' method. Some population characteristics were quite consistent though, as the average age was around 17.2 and BMI was around 21.4 in both cohorts. However some of the menstrual characteristics had changed significantly, for example, the age of menarche decreased by 3 months and bleeding time increased by 0.25 days on average. The earlier age at puberty may be explained by socio-economic changes and better diet, among other reasons.

. Comparison of cohort difference in each question by 1) mean responses in the two cohorts, 2) our proposed model (GCM_Lasso), 3) Gaussian copula graphical model with G-Wishart prior (BDgraph
Overall our results demonstrate a worsening in symptoms and an urgent need to investigate menstrual problems among teenage girls, through various channels like GP consultations, school and family education. However, we acknowledge that the differences between cohorts could be the result of the advancement in medical technology, enhancement in knowledge, as well as more self-awareness. Further investigations that control for mediating factors are needed.

Conditional associations among variables of MDOT
Graphical models are a powerful tool to describe associations among variables. In our proposed graphical model (1), each element in the O matrix describes the association between a pair of variables on the latent scale, conditional on the other variables and the school effects. That is, it depicts the genuine associations between two variables after controlling for any known source of information. As the graphical lasso prior places zero probability on the absence of edges, we applied a post-processing procedure after running the MCMC to select those edges with stronger associations. In the application to the MDOT data set, we plotted the top 3% of the parameters in the upper diagonal matrix of O based on the means for each marginal posterior distribution. Note that we only plot 3% edges for a simple visualization of the most important edges, nevertheless the conditional associations between each pair of variables were still encoded in the precision matrix from the MCMC samples. In Fig 1, we removed those isolated questions, and marked the questions from different sections using different colors. Positive relationships were denoted by red edges and negative relationships by blue. The graph was plotted using the 'igraph' package in R [24]. The data entry coding of variables along with the actual questions in the questionnaire were listed as 'variable name' and 'question' respectively in Table 1. The variable coding rules generally follow a 'page(p)section(s)question(q)' format. For example, the variable 'p3s4a' represents the first question (a) of section 4 in page 3 in the MDOT survey, which is 'periods have interfered with attending school'. Because we recoded the data in section 3 of the questionnaire, those questions were denoted as 'Sec3_' from 'a' to 'z'.
We empirically grouped the 105 variables into several clusters in Fig 1. Overall, the segmentation was reasonably clear, although some clusters appeared to merge with each other.
Questions from the same section of the questionnaire tended to cluster together, because they asked about similar domains of menstruation and girls may provide more consistent answers to questions closer to each other in the survey. It is also interesting to see that some questions that were not asked in a consecutive order or were far apart in the questionnaire were still clustered together. For example, the period regularity and clots questions on page 1 sat together with a list of girls' perception questions of their periods on pages 4 and 5. Based on Fig 1, we grouped the non-isolated variables into the following clusters with a communal name in each cluster.
• Perception: regular period (p1q2), clots (p1q5), usually have period every month (p4s5a), never missed a period (p4s5b), have problems with my periods (p4s5c), my periods seem pretty normal (p4s5d), my period are normal most of the time (p5s5a), sometimes think there is something wrong with my periods (p5s5b); sure there is something wrong with my periods (p5s5c).
Some variables had relatively weaker conditional relationships with all the other variables, so they formed their own isolated clusters. They included some menstrual characteristics: total bleeding days, cycle length, age of menarche, and personal characteristics: age and BMI.
Knowing the grouping of variables can help clinicians better understand the interactions among menstrual symptoms and menstrual experiences. Another potential use is to reduce the length of the questionnaire by condensing the questions from one cluster into a single question. At the time the MDOT survey was conducted, it would take at least 20 minutes to complete the 6-page questionnaire. Fortunately the participating schools allocated school time for girls to finish their questionnaires so that the completion rates were quite high, otherwise the quality of a lengthy questionnaire cannot be guaranteed. It would be beneficial to design a shorter version of the MDOT questionnaire, without losing important information.

Conditional associations with pain severity
It is reported that 93% of adolescents experienced pain during menstruation [7] and pain is treated as a typical menstrual disturbance indicator. In the MDOT questionnaire, the pain severity variable (p2q12), was measured on an 11-point Likert scale, and we would like to know which variables in the questionnaire were associated with pain. In this case, the latent variable associated with pain severity was treated as the response, and we examined the conditional associations of the other 104 predictors with the response. Specifically, if the pain severity variable was the l th variable, the coefficients of the other variables (a vector of length 104) on pain were Γ À 1 À l;À l Γ À l;l (similar to the formula showed in Section 3.1), and the 95% credible intervals were obtained from the highest density regions in the MCMC samples. Table 2 shows the variables associated with pain severity whose 95% credible intervals of the conditional association parameters did not contain 0. Not surprisingly, only 9 variables were found to be significantly related to pain variable. This may be due to controlling for a large number of variables but also the shrinkage effect induced by the graphical lasso prior. It is found that a worsening of symptoms over the past 12 months (p1q11), taking medication (p2q13), having pelvic pain aching (Sec3i), scoring higher on interference with life because of pain (p4s4k), more frequent interference with life (p4s4r), never missing a period (p4s5b), having problems with periods (p4s5c), having an ultrasound to look for causes of pain (p5s5p), and getting teary before or during a period (p5s5z) were related to a higher score of pain. The conditional associations of other key variables, for example, school absence, can be analysed similarly.

Prediction of endometriosis
Another useful application of our proposed model is for prediction of endometriosis using the posterior predictive distribution. This is a very challenging task because the disease is rarely diagnosed among teenagers, resulting in a very small number of self-reported cases of endometriosis in the MDOT data set. However, research has found that the disease is not that rare among teenage girls and there is a misconception that girls at a younger age are unlikely to develop the disease so that they are seldom referred to specialists (less than 10% in our study population) or undergo an operation (1%) to look for causes of abnormal menstrual symptoms. It is reported that there have been up to 9 years delay in diagnosis of endometriosis among teenagers compared with women aged > 30 [25,26]. Therefore, early intervention of endometriosis is crucial for general well-being, to minimize disruption of life activities and reduce adverse effects on fertility. Clinicians could potentially follow up with those girls who show a higher risk of developing endometriosis.
In our data set, the question about diagnosis of endometriosis (p6s7k) was only asked in the second cohort where there were 7 cases reported out of 1041 girls. Among the 7 cases we excluded one girl as she stated that she did not have any periods, so she was not considered as part of our study population. We computed a predictive score of endometriosis for each person from the posterior predictive samples. We used the variable index e to denote the endometriosis variable, and the steps to derive the predictive score y � i;c;e for girl i in school c are described as follows: 1. Sample an index s from the saved MCMC samples; 2. Extract the latent variables and parameters z s , b s , Γ s with index s; In step 3 above, the formula to calculate the predictive latent variable z s;� ice is the same as step 8 of the computational algorithm in Appendix A, but without truncation. This is because when drawing posterior predictive samples, we treated the endometriosis variable as 'missing' so that z s;� ice did not have to be constrained by its neighboring y values. In step 4 above, we compared the posterior predictive samples z s;� ice with those z s ice from the MCMC samples whose corresponding y ice = 1. The smallest z s ice jy ice ¼ 1 among the 6 girls was used as the benchmark to determine if y s;� ice equaled 1 or 0. The sampling algorithm for the extended rank likelihood method ensures that the latent variable z s ice of the diagnosed girls must be greater than those who were not diagnosed. If the predictive latent variable z s;� ice exceeded the benchmark, then we set the predictive endometriosis variables y s;� ice ¼ 1, and 0 otherwise. Steps 1-4 were repeated 500 times to produce the posterior predictive samples of the binary variables y s;� ice ; s ¼ 1; . . . ; 500 for each girl, and the predictive probability of endometriosis � y � ice was taken to be the average value of those y s;� ice . To see if our model was a good fit to the data, and if our predictive scores were able to reflect the self-reported diagnosis of endometriosis, we examined the predictive scores of endometriosis � y � ice for each girl in our study population, along with the predictive scores for the 6 girls with endometriosis marked in solid triangles in Fig 2. The 6 scores were 0.228,0.468,0.272,0.338,0.562,0.422 respectively. They indeed scored higher than the majority of individuals, suggesting reasonable predictive performance of our model. We further compared the conditional distributions in some questions of the girls with higher predictive scores in the top 10% quantile (in total 208 girls) with unconditional distributions in our study population (that is, everyone in the sample). We selected those questions that were indicative of menstrual disturbance and possible endometriosis or were expected to show differences between girls who bear higher risk of endometriosis and the general population. The results are summarized in Table 3. Firstly, 65% of girls with higher scores suffered from severe pain (scored 7-10 in the 11-point pain severity variable) which was about the same percentage who had mild pain (scored 0-6) in the general population. Only 2% of girls missed school for every period in the whole population but the number was predicted as high as 8.3% in the girls with higher scores, and nearly half of them had experienced school absence because of their periods. About two fifths of girls (40.9%) with higher predictive scores reported they had regular periods compared with two thirds (64.3%) in the whole sample. In addition, two fifths (38.7%) believed there was something wrong with their periods in the higher risk girls compared with merely 11.2% as a whole. Finally, more girls with higher scores had sought advice from their GP or specialist, or had blood tests, an ultrasound or operation to look for causes of their period pain than the general population.
Here we proposed a way to predict endometriosis using the MDOT questionnaire, and the above analyses indicates some evidence of success for predicting endometriosis and to show differences between those girls with a higher risk of developing endometriosis and the study population. Practically, to predict the likelihood of developing endometriosis for a new girl, we can ask her to fill in the MDOT questionnaire and include her in the model by treating the endometriosis question as a 'missing value', and then use the procedure described above to compute her predictive score. This is essentially what has been achieved in our application to the MDOT data set to predict the probability of having endometriosis for girls in cohort 1 without their answers to that question. However, we need to be very cautious of any conclusions made, because the small number of positive cases in the data set may lead to large uncertainty in the predictive score. Moreover, a self-reported diagnosis need not necessarily be the same as a pathological diagnosis, especially when only 23.2% and 31.3% of girls had heard of endometriosis in 2005 and 2016 respectively.

Sensitivity analysis on the MDOT data set
In this section we discuss the issue of choosing hyper-parameters s and t which control the shrinkage parameter λ ij . In model (1), we put semi-conjugate priors on the covariance matrix C of the random effects, and the hyperparameters were chosen to be weakly informative as ν = p + 2 and Λ = I p . For the adaptive graphical lasso prior (2) of the precision matrix O, the choice of the shrinkage parameter λ ij is adapted from ω ij , and is also controlled by the hyper-parameters s and t. From step 4 of the sampling algorithm in Appendix A, the fully conditional distribution of λ ij followed a gamma distribution: λ ij * Gamma(shape = 1 + s, rate = |ω ij | + t), so that the conditional mean of λ ij is 1þs jo ij jþt . A larger value of t leads to a smaller λ ij , thus penalizing ω ij less, whereas decreasing t towards 0 will push the penalty towards infinity. [17] showed that the hyper-parameter t needs to be chosen small relative to |ω ij | to appreciate the adaptiveness. The shrinkage effects of varying the hyper-parameter t were demonstrated in a simulation experiment in Appendix B.
With the real data application on the MDOT data set, we set s = 0.01 to represent a weak prior degrees of freedom of each individual ω ij , and conducted a sensitivity analysis by choosing between three t values: t = 10 −4 , 10 −5 , 10 −6 . The results of cohort differences (Section 3.1) and conditional associations with pain severity (Section 3.3) did not change much. However, the graphs with top 3% of the edges plotted (Section 3.2) varied slightly. For example, some clusters merged or split, and some variables entered or dropped out in the graph compared with Fig 1. We also examined the posterior predictive scores for endometriosis as in Section 3.4, and results showed that the scores of 6 and 5 self-reported cases (out of 6) sat above the 90% quantile in the study population when t = 10 −5 and 10 −6 respectively. Overall speaking, the model was not very sensitive to the t values within a moderate range. The model with t = 10 −4 achieved an adequate amount of shrinkage and good performance for the prediction of endometriosis, whereas setting t = 10 −6 seemed to over shrink the conditional associations and setting t > 10 −4 did not fully resolve the multicollinearity problem. Therefore we chose to report the results from the model with t = 10 −4 in Sections 3.1-3.4.

Conclusions
Motivated by the MDOT data set, we proposed a Gaussian copula model with graphical lasso prior to learn the conditional independence (dependence) among a large number of variables. The copula model with the extended rank likelihood is able to accommodate ordered variables in the joint model by mapping the observed data onto latent variables which are assumed to follow a multivariate Gaussian distribution. Due to the relatively large number of questions and strong associations between some variables, we put a shrinkage prior on the precision matrix for the latent variables, to encourage sparsity in the conditional associations. The approach is implemented in an R package called "GCMlasso", which is available from github. com/jialiwang1211/GCMlasso.
The school effects are conveniently modeled by the random effects from which the cohort difference can be derived. This approach can be used with similar repeated cross-sectional survey data where the same questions are asked to different sample of individuals at each time and the questions are then compared over time.
Compared with the G-Wishart prior, the graphical lasso prior performs graphical structure selection and precision matrix estimation simultaneously, although some post-processing decisions are needed to determine the 'significant' edges. With the graphical lasso prior, we can obtain the conditional associations between all the questions, and look at the associations between any two variables no matter how big or small. When applied to the MDOT data set, we used an ad-hoc approach by selecting the top 3% of largest elements in the precision matrix to draw a graph. We can vary the quantile level to plot a sparser or denser graph, nevertheless the full information of conditional independence is encoded in the precision matrices from MCMC.
In our application to the MDOT data set, we treated those 'do not know', 'NA' answers and the blank cells as missing data, and assumed they were missing at random. However in some questions, 'NA' may mean a negative answer, or another category. For example, an 'NA' answer to the True or False question 'I have had a blood test for my period pain' was more likely to mean having no tests; an 'NA' answer to the Yes or No question 'have your period symptoms worsened over the past 12 months' probably neither means worsened nor improved, but having no period symptoms at all. An extension of our model would be treating 'NA' as a separate category, and use a multinominal probit model for nominal variables [27]. With shrinkage methods, it is desirable to include or exclude the entire nominal variable instead of selecting an individual category. This can be achieved by using methods similar to group lasso where the penalty is applied to a matrix norm of each nominal variable [28,29]. Moreover, some questions were regarded as sensitive, such as sexual activities and family history diseases, and the missing at random assumption can be violated because the probability of missing a record may depend on the record value itself. If the missingness was assumed to be not missing at random, some extra modeling efforts of the missing process will be needed.
Using the MDOT data set, we showed how our model was fitted in a single MCMC run but could be used for answering a range of research questions by extracting information from the MCMC. We found that menstrual disturbance was more pronouncedly reported over a decade and some menstrual patterns were shifted dramatically, therefore there is an urgent need to investigate the pathological reasons for the changes. Furthermore, some questions were strongly correlated even after controlling for other information, and it is worthwhile looking at the interactions among a group of variables instead of individual questions that lead to some menstrual symptoms. Lastly, our prediction of endometriosis showed that there were empirical differences between girls with higher risk of endometriosis and the general study population. Developing a non-invasive screening tool for predicting endometriosis is of clinical interest.

Appendix A: Sampling algorithm
The sampling algorithm is adapted directly from [17], who proposed a block Gibbs sampler by data augmentation to update the O matrix column-wise with the graphical Lasso prior, and from [18] which provided the formulas to sample from the extended rank likelihood Gaussian copula model with random effects. The steps of the algorithm for our proposed model (1) are outlined below.
1. Sample the random effects b c , c = 1, . . ., m b c � NððC À 1 þ n c OÞ À 1 O P n c i¼1 z i ; ðC À 1 þ n c OÞ À 1 Þ, where n c is the number of observations in cluster c; 2. Sample the covariance matrixΨ for the random effects C � Inverse Wishartðn þ m; L þ P m c¼1 b c b T c Þ; 3. Partition the precision matrix O, scaling matrix Y, and empirical covariance matrix S, and move each column of these matrices to the last column in turn, to facilitate updating them column-wise. For example, if O is a p × p matrix, then O 11 is a (p − 1) × (p − 1) matrix, ω 12 and o T 21 are column vectors of length p − 1, and ω 22 is a scalar. Elements in the matrices S and Y are defined similarly. The scaling matrix Y is introduced as an auxiliary variable to facilitate sampling the precision matrix in step 6. 4. Sample the shrinkage parameters λ ij , i, j = 1, . . ., p λ ij * Gamma(shape = 1 + s, rate = |ω ij | + t); 5. Sample the elements τ ij in the scaling matrix Y, i, j = 1, . . ., p t ij � 1=Inverse Gaussianðl ij =o ij ; l 2 ij Þ; 6. Sample the O matrix column-wise using a data augmentation algorithm. This step is repeated p times by rearranging each column to be the last column in the matrices in step 3 in turn, and sample the parameters. The parameter λ 22 is defined as the shrinkage parameter associated with ω 22 . Sample the auxiliary variables γ * Gamma(shape = N/2 + 1, rate = (s 22  G ½gh� ¼G ½gh� = ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ðG ½gg� þC ½gg� ÞðG ½gg� þC ½gg� Þ q , C ½gh� ¼C ½gh� = ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ðG ½gg� þC ½gg� ÞðG ½gg� þC ½gg� Þ q ; g; h ¼ 1; :::; p; 8. Sample the latent variables z icl , c = 1, . . ., m, i = 1, . . ., n c , l = 1, . . ., p z icl � TNðb cl þ G l;À l G À 1 À l;À l ðz ic;À l À b c;À l Þ; G l;À l G À 1 À l;À l G À l;l Þ, which is a truncated normal distribution with lower bound as lb = max(z hl : y hl < y icl ) and upper bound as ub = min(z hl : y hl > y icl ), h = 1. . ., N. We use z ic,−l to denote the vector of z without variable l, Γ l,−l denotes the l th row of Γ matrix without the l th column, and Γ −l,−l denotes Γ matrix without the l th row and the l th column.

Appendix B: Simulation study on choosing hyper-parameter
We designed a simple simulation study to illustrate the effect of choosing a different hyperparameter t on parameter estimation. The number of clusters was 20, and the cluster sizes were unbalanced, and generated from a Poisson distribution with mean 50. We generated 100 replicate data sets with the true parameters as The hyper-parameter s was fixed as 0.01, and t varied to be 10 −1 , 10 −3 and 10 −5 respectively. We calculated the 95% highest density region of each parameter from the MCMC in each data set to see if the true parameter was included in the region, and then computed the coverage over the 100 data sets. The procedure was repeated with each of the three t values.
The coverage rates of the unique elements in the C and O matrices are reported in Table 4. The coverage rates were generally close to the nominal 95% level in all the parameters in the C matrix for the random effects regardless of different t values. This is because we did not shrink on the C matrix. For the O matrix, it is noticed that larger parameters were insensitive to t values whereas smaller parameters suffered from under coverage with small t values. To see the different shrinkage effects of t more clearly, we show the trace plots of two selected parameters in O with the three t values in Fig 3. In the upper panel the true parameter ω 12 = −0.918 was relatively large, and the MCMC samples were centered around the true parameter with all the different t values. In the lower panel when the true parameter ω 13 = 0.067 was relatively small, the model with t = 10 −1 achieved 96% coverage rate and did not show much shrinkage effect, however the models with t = 10 −3 and t = 10 −5 shrunk ω 13 towards 0 and their coverage rates were down to 89% and 63% respectively. Although the coverage rates were lower than 95% for small parameters when t was small, it is desirable to Supporting information S1 File. (PDF)