A semi-supervised learning approach for automated 3D cephalometric landmark identification using computed tomography

Identification of the 3D cephalometric landmarks that serve as a proxy for the shape of the human skull is the fundamental step in cephalometric analysis. Because manual landmarking from 3D computed tomography (CT) images is a cumbersome task even for trained experts, an automatic 3D landmark detection system is in great demand. Recently, automatic landmarking of 2D cephalograms using deep learning (DL) has achieved great success, but 3D landmarking for more than 80 landmarks has not yet reached a satisfactory level, owing to factors that hinder machine learning such as the high dimensionality of the input data and the limited amount of training data caused by ethical restrictions on the use of medical data. This paper presents a semi-supervised DL method for 3D landmarking that takes advantage of an anonymized landmark dataset whose paired CT data have been removed. The proposed method first detects a small number of easy-to-find reference landmarks and then uses them to provide a rough estimate of all landmarks, utilizing the low-dimensional representation learned by a variational autoencoder (VAE). The anonymized landmark dataset is used to train the VAE. Finally, coarse-to-fine detection is applied to the small bounding box provided by the rough estimate, using separate strategies suited to the mandible and the cranium. For mandibular landmarks, a patch-based 3D CNN is applied to the segmented image of the mandible (separated from the maxilla) in order to capture the 3D morphological features associated with the landmarks. Six landmarks around the condyle are detected all at once rather than one by one, because they are closely related to each other. For cranial landmarks, we again use the VAE-based latent representation for more accurate annotation. In our experiments, the proposed method achieved a mean detection error of 2.88 mm for 90 landmarks using only 15 paired training data.


Introduction
Cephalometric analysis is commonly used by dentists, orthodontists, and oral and maxillofacial surgeons to provide morphometric guidelines for diagnosis, surgical planning, growth analysis, and treatment planning by analyzing dental and skeletal relationships in the craniofacial complex [1]. It is based on cephalometric landmarks, which serve as a proxy for the skull morphological data pertaining to craniofacial characteristics [2]. Conventional cephalometric analysis uses two-dimensional (2D) cephalometric radiographs (lateral and frontal radiographs), which have drawbacks including geometric distortions, superimpositions, and dependence on correct head positioning [3]. Owing to recent advances in image processing techniques and the need for accurate craniofacial analysis, a three-dimensional (3D) approach based on 3D computed tomography (CT) images is gaining preference over the conventional 2D techniques [4][5][6].
Recently, many studies have been conducted on automated cephalometric landmark identification, which aims to find the landmarks and enable immediate cephalometric analysis, because manual landmarking and cephalometric analysis are labor-intensive and cumbersome tasks even for trained experts. Owing to recent advances in deep learning techniques, automated annotation of 2D cephalometric landmarks may now be used in clinical applications [7,8]. In contrast, automated 3D cephalometric tracing (for 90 landmarks) cannot yet be utilized in clinical applications, where the required average error is commonly designated to be less than 2 mm [9][10][11][12][13]. The high dimensionality of the input data (e.g., 512 × 512 × 512) and the limited number of training data are the main factors that hinder the training of deep learning networks to learn the 3D landmark positional vectors from 3D CT data. Moreover, owing to the current legal and ethical restrictions on medical data, it is very difficult to utilize CT data from patients.
To overcome the above-mentioned learning problems caused by the high input dimensionality and the shortage of training data, the method proposed in this study adopts semi-supervised learning that takes advantage of a large anonymized landmark dataset (without the corresponding CT data) that has been accumulated through surgical planning and treatment evaluation. We use this landmark dataset to obtain a low-dimensional representation, reducing the dimension of the total landmark vector (270 = 90 × 3) to only 9 latent variables via a variational autoencoder (VAE) [14]. For training the VAE, a normalized landmark dataset is used so that skull shape variations are learned efficiently while unnecessary scaling factors are ignored. With this dimensionality reduction, the positions of all 90 landmarks can be roughly estimated from a small number of easy-to-find reference landmarks (10 landmarks), which can be accurately and reliably identified via a simple deep learning method [11].
The rough estimation of all landmarks is used to provide a small 3D bounding box for each landmark in the 3D CT image. Following this, we apply convolutional neural networks (CNNs) to these small bounding boxes to enable the accurate placement of landmarks. Our fine detection strategy is divided into two parts: the mandible and the cranium. It is desirable to accurately capture the morphological variability of the mandible, because the shape of the mandible can be affected by a variety of factors, including masticatory occlusal force, muscular force, functional activities such as breathing and swallowing, and age [15]. Noting that landmarks on the mandible represent morphological features of the 3D mandibular surface geometry, we apply a 3D CNN to a segmented image of the mandible (separated from the cranium). We follow a recent study [16] for the segmentation method that separates the mandible from the cranium.
Because several landmarks around the condyle are closely related to each other, it is better to detect these landmarks all at once. For the landmarks on the midsagittal plane, it is better to further reduce the input dimensionality by using a partially integrated 2D image of the midsagittal plane. For the remaining landmarks lying on the cranium, whose rigidity limits shape variability, we again use the anonymized landmark dataset to obtain a more accurate latent representation of all cranial landmarks. The proposed approach achieved a mean detection error of 2.88 mm for 90 landmarks, which nearly meets the clinically acceptable precision standard. It should be emphasized that this accuracy was achieved using a very small amount of paired training data.

Method
We begin by introducing the following notation. Five easy-to-find reference landmarks (CFM, Bregma, Na, and Po (L/R)) are used as the basis for constructing a coordinate system that determines the midsagittal and axial planes, and they are utilized for data normalization (the method for obtaining these five reference landmarks is described in Section 2.1).
• $x$ denotes a 3D CT image, viewed as the array $\{x(\mathbf{v}) : \mathbf{v} = (v_1, v_2, v_3),\ 1 \le v_j \le 512 \text{ for } j = 1, 2, 3\}$. Here, we set $v_1$ as the normal direction of the midsagittal plane.
• $x_b$ denotes the binarized CT image of $x$ (i.e., the skull segmentation of the CT image), defined by
$$x_b(\mathbf{v}) = \begin{cases} 1 & \text{if } x(\mathbf{v}) \ge \rho, \\ 0 & \text{otherwise,} \end{cases} \qquad (1)$$
where $\rho$ is a thresholding value. In our experiments, the value of $\rho$ was consistently chosen as $\rho = 500$ HU, which is known to be an effective choice for thresholding-based bone segmentation [17].
• $x_{mid}$ denotes a partially integrated 2D image of $x_b$ in the normal direction of the midsagittal plane, defined by
$$x_{mid}(v_2, v_3) = \sum_{v_1 = a}^{b} x_b(v_1, v_2, v_3), \qquad (2)$$
where $[a, b]$ determines the truncated volume of $x_b$.
• $R_{cr} \in \mathbb{R}^{138\, (= 46 \times 3)}$ and $R_{md} \in \mathbb{R}^{132\, (= 44 \times 3)}$ denote the concatenated vectors of the 46 cranial and 44 mandibular 3D landmarks, respectively. The entire landmark vector $R \in \mathbb{R}^{270\, (= 90 \times 3)}$ is defined by $R := [R_{cr}; R_{md}]$. See S1 Table for more detailed information on the landmarks.
• $R^{\sharp}_{cr} \in \mathbb{R}^{24\, (= 8 \times 3)}$ denotes the concatenated vector of the landmarks (Bregma, CFM, Na, ANS, Or (L/R), and Po (L/R)) in the cranium, and $R^{\sharp}_{md} \in \mathbb{R}^{6\, (= 2 \times 3)}$ denotes the concatenated vector of the landmarks (MF (L/R)) in the mandible. The reference landmark vector $R^{\sharp} \in \mathbb{R}^{30\, (= 10 \times 3)}$ is defined by $R^{\sharp} := [R^{\sharp}_{cr}; R^{\sharp}_{md}]$.

We mention here some details of the reference landmarks. Bregma is the point of junction of the coronal and sagittal sutures of the skull. (i) CFM, (ii) Na, (iii) ANS, (iv) Or, (v) Po, and (vi) MF are the abbreviations of (i) the center of the foramen magnum, (ii) nasion, (iii) anterior nasal spine, (iv) orbitale, (v) porion, and (vi) mental foramen, which are defined by (i) the center of the opening for the spinal cord, (ii) the center of the midline bony depression between the eyes, where the frontal and two nasal bones meet, just below the glabella, (iii) the projection formed by the fusion of the two maxillary bones at the intermaxillary suture, (iv) the lowest point on the inferior margin of the orbit, (v) the most superior point of the external auditory meatus, and (vi) the opening on the anterior surface of the mandible through which the mental nerve exits.

The 3D cephalometric landmarking aims to develop a function $f : x \mapsto R$ that maps a 3D CT image $x$ to all landmarks $R$. Deep learning techniques can be used to learn the landmark detection map $f$. Unfortunately, owing to the legal and ethical restrictions on medical data, only a few paired data are available. This severe shortage of paired data makes it difficult to obtain an accurate and reliable map $f : x \mapsto R$ in the following supervised learning framework:
$$f = \underset{\mathrm{Net}}{\arg\min} \sum_{i=1}^{N_p} \big\| \mathrm{Net}(x^{(i)}) - R^{(i)} \big\|^2, \qquad (3)$$
where $N_p$ is the small number of paired training data, $\{(x^{(i)}, R^{(i)}) : i = 1, \cdots, N_p\}$ is a paired dataset, $\mathrm{Net}$ is a deep learning network, and $\| \cdot \|$ is the standard Euclidean norm. In our study, only 15 paired data are available (i.e., $N_p = 15$). Even with a certain amount of paired data, the learning process (3) of the direct detection map $f$ can be difficult because the dimension of the input image is very large (greater than $10^8$).
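As a concrete illustration, the binarization (1) and partial integration (2) can be sketched in a few lines of numpy. The toy CT volume, its size, and the slab bounds below are illustrative stand-ins, not real data:

```python
import numpy as np

# Sketch of Eq. (1) (threshold-based bone segmentation at rho = 500 HU)
# and Eq. (2) (summing the binarized volume over a slab [a, b] along v1,
# the midsagittal normal). The toy CT volume below is illustrative.
def binarize(ct: np.ndarray, rho: float = 500.0) -> np.ndarray:
    """x_b: 1 where the CT value (in HU) is at least rho, 0 elsewhere."""
    return (ct >= rho).astype(np.uint8)

def partial_integration(xb: np.ndarray, a: int, b: int) -> np.ndarray:
    """x_mid(v2, v3): sum of x_b(v1, v2, v3) over v1 in [a, b]."""
    return xb[a:b + 1].sum(axis=0)

ct = np.full((6, 4, 4), -1000.0)   # air everywhere ...
ct[2:5, 1, 1] = 1200.0             # ... except a small bony column
xb = binarize(ct)
xmid = partial_integration(xb, a=2, b=4)
```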
The proposed method addresses this problem by taking advantage of a semi-supervised learning framework that permits the utilization of $N_l$ anonymized landmark data $\{R^{(N_p + i)}\}_{i=1}^{N_l}$ whose corresponding CT data are not provided. In this research, 229 anonymized data (i.e., $N_l = 229$) are utilized.
As shown in Fig 2, the proposed method comprises the following three main steps: (i) To obtain the easy-to-find reference landmarks $R^{\sharp}$, we apply a CNN to 2D illuminated images generated from a binarized CT image $x_b$ and normalize the output with respect to the cranial volume. (ii) A rough estimate of all landmarks $R$ is obtained using the partial knowledge $R^{\sharp}$ and a VAE-based low-dimensional representation of $R$. (iii) Using this estimate, coarse-to-fine detection of $R$ is conducted, wherein separate strategies are utilized for the mandibular and cranial landmarks. For the mandibular landmarks, the landmarks are accurately identified

PLOS ONE
by applying 3D patch-based CNNs to capture the morphological features of the 3D surface geometry associated with the landmarks, wherein an input patch is extracted based on the coarse estimate. For the cranial landmarks, we first detect three landmarks lying on the midsagittal plane by applying a 2D CNN whose input is a 2D patch extracted from the partially integrated image $x_{mid}$ based on the coarse estimate. Utilizing the three finely detected landmarks and the cranial reference landmarks $R^{\sharp}_{cr}$ as partial information about $R_{cr}$, the remaining cranial landmarks are finely annotated via a VAE-based local-to-global estimation in the same manner as in the previous step.
Each of these steps is described in detail as follows.

Detection of easy-to-find reference landmarks and uniform scaling for skull normalization with respect to the cranial volume
The first step of the proposed method is to find the 10 reference landmarks $R^{\sharp}$ from a given $x$. Initially, the CT image $x$ is converted into a binarized image $x_b$ by (1). From $x_b$, 2D illuminated images are generated by manipulating various lighting and viewing directions (see Fig 1). By applying VGGNet [18] to these illuminated images, the reference landmarks $R^{\sharp}$ are accurately and automatically identified. This detection method is based on the one presented in a recent study [11]. Using the identified reference landmarks, data normalization is conducted for efficient feature learning of skull shape variations in the subsequent steps. The landmark vector $R^{\sharp}$ is normalized by a uniform scaling with respect to the cranial volume, where the cranial volume is defined as the product of the distance between the $v_1$-coordinates of Po (L) and Po (R) (cranial length), the distance between the $v_2$-coordinates of Po (L) and Na (depth), and the distance between the $v_3$-coordinates of CFM and Bregma (height). This normalization minimizes the positional dependence of the landmarks on the translation, rotation, and overall size of the skull; therefore, shape information of the skull (regarding facial deformities) can be effectively learned in the subsequent VAE-based steps. From here on, all landmark vectors are denoted as normalized vectors (e.g., $R$ and $R^{\sharp}$ are normalized vectors of the total and reference landmarks).
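The uniform-scaling normalization can be sketched as follows. The landmark coordinates below are toy values chosen only to make the arithmetic easy to check, and the helper names are illustrative:

```python
import numpy as np

# Sketch of the uniform-scaling normalization: the cranial "volume" is
# the product of length (v1: Po(L)-Po(R)), depth (v2: Po(L)-Na), and
# height (v3: CFM-Bregma), and all landmarks are rescaled so that this
# volume matches a fixed target. Coordinates below are toy values.
def cranial_volume(po_l, po_r, na, cfm, bregma):
    length = abs(po_l[0] - po_r[0])     # v1-coordinate difference
    depth = abs(po_l[1] - na[1])        # v2-coordinate difference
    height = abs(bregma[2] - cfm[2])    # v3-coordinate difference
    return length * depth * height

def normalize(landmarks: np.ndarray, volume: float, target_volume: float):
    """Uniformly scale all landmarks so the cranial volume equals target."""
    s = (target_volume / volume) ** (1.0 / 3.0)
    return landmarks * s

po_l, po_r = np.array([8.0, 0, 0]), np.array([-8.0, 0, 0])
na, cfm, bregma = np.array([0, 10.0, 0]), np.array([0, 0, 0]), np.array([0, 0, 12.0])
v = cranial_volume(po_l, po_r, na, cfm, bregma)       # 16 * 10 * 12 = 1920
R = np.stack([po_l, po_r, na, cfm, bregma])
R_norm = normalize(R, v, target_volume=1920.0 * 8)    # scale factor 2
```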

Rough estimation of all landmarks from reference landmarks using VAE-based low dimensional representation
This subsection describes a method for roughly estimating all landmarks $R$ from the reference landmarks $R^{\sharp}$ annotated in the previous step. Based on the method in [13], we build a bridge connecting $R^{\sharp}$ and $R$ by taking advantage of a low-dimensional representation of $R$ learned by the variational autoencoder (VAE) [14].
The rough estimate obtained from $R^{\sharp}$, denoted by $R^*$, is given by
$$R^* = D \circ F(R^{\sharp}),$$
where $D \circ F : R^{\sharp} \mapsto R^*$ is a local-to-global landmark estimation map, as described in Fig 3. The map is constructed as follows. First, we train a VAE consisting of an encoder ($E$) and a decoder ($D$) to learn a low-dimensional representation of $R$. Afterwards, we train a nonlinear map $F$ that provides $F(R^{\sharp}) \approx E(R)$, so that $D \circ F(R^{\sharp}) \approx R$.
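The data flow of the local-to-global map $D \circ F$ can be illustrated with a deliberately linear toy version: fit $F$ by least squares so that $F(R^{\sharp})$ matches the latent code, then reconstruct all landmarks with a frozen "decoder". Everything here (matrices, sizes, the subsampling index set) is synthetic; in the paper, $F$ and $D$ are trained neural networks:

```python
import numpy as np

# Linear toy illustration of R_sharp -> z -> R*: generate latent codes z,
# decode them to full landmark vectors with a frozen random "decoder",
# take 30 coordinates as the reference-landmark vector R_sharp, fit a
# linear F by least squares against z, and reconstruct R* = D(F(R_sharp)).
rng = np.random.default_rng(1)
n, d = 200, 9
Z = rng.standard_normal((n, d))                  # latent codes z = E(R)
D_dec = rng.standard_normal((d, 270))            # frozen toy "decoder"
R_all = Z @ D_dec                                # full landmark vectors
S_ub = np.arange(30)                             # toy subsampling: 30 coords
R_sharp = R_all[:, S_ub]
F, *_ = np.linalg.lstsq(R_sharp, Z, rcond=None)  # fit F on (R_sharp, z) pairs
R_star = (R_sharp @ F) @ D_dec                   # rough estimate D(F(R_sharp))
err = np.abs(R_star - R_all).max()
```

In this linear setting the reconstruction is essentially exact; the point is only to show how partial knowledge is lifted to the latent space and decoded back to all landmarks.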
Specifically, the VAE learns an encoder $E : R \mapsto z$ and a decoder $D : z \mapsto R$, where $z \in \mathbb{R}^d$ is a $d$-dimensional latent variable. The maps $E$ and $D$ learn landmark patterns by leveraging the dataset $\{R^{(i)}\}_{i=1}^{N_t}$, which consists of the unpaired landmark dataset as well as the paired dataset. The training is achieved in the following energy minimization sense:
$$(E, D) = \underset{\mathrm{VAE}}{\arg\min} \sum_{i=1}^{N_t} \Big[ \big\| R^{(i)} - \mathrm{VAE}(R^{(i)}) \big\|^2 + \mathrm{KL}\big( \mathcal{N}(\mu^{(i)}, \Sigma^{(i)}) \,\big\|\, \mathcal{N}(0, I) \big) \Big],$$
where $N_t = N_p + N_l$ is the total number of training landmark data, $\mathrm{VAE}$ is a class of functions in the form of a given VAE network, and $\mathcal{N}(\mu^{(i)}, \Sigma^{(i)})$ is a $d$-dimensional normal distribution with mean $\mu^{(i)}$ and diagonal covariance matrix $\Sigma^{(i)} = \mathrm{diag}\big((\sigma^{(i)})^2\big)$. Here, $\mu^{(i)}$ and $\sigma^{(i)}$ are the mean and standard deviation vectors obtained in the interim of the encoding process of the $i$-th training datum $R^{(i)}$ (i.e., $E(R^{(i)})$).
The encoder $E$ can be expressed in the following nondeterministic form:
$$z = \mu + \sigma \odot \eta_{noise},$$
where $\eta_{noise}$ is a noise sampled from $\mathcal{N}(0, I)$, $\odot$ is the Hadamard (element-wise) product, and the vectors $\mu$ and $\sigma$ are given by
$$\mu = E_4^{m}\, \mathrm{ReLU}\big(E_3\, \mathrm{ReLU}(E_2\, \mathrm{ReLU}(E_1 R))\big), \qquad \sigma = E_4^{s}\, \mathrm{ReLU}\big(E_3\, \mathrm{ReLU}(E_2\, \mathrm{ReLU}(E_1 R))\big).$$
Here, the matrices $\{E_1, E_2, E_3, E_4^{m}, E_4^{s}\}$ represent fully-connected layers, and ReLU is the element-wise activation function defined by ReLU(t) = max(t, 0). The decoder $D$ is the reverse process of the encoder $E$, which can be represented by
$$D(z) = D_4\, \mathrm{ReLU}\big(D_3\, \mathrm{ReLU}(D_2\, \mathrm{ReLU}(D_1 z))\big),$$
where the matrices $\{D_1, D_2, D_3, D_4\}$ represent fully-connected layers. The detailed network architecture is described in the red box of Fig 3.

After pretraining the VAE, the nonlinear map $F : R^{\sharp} \mapsto z$ is learned, which connects the reference landmarks $R^{\sharp}$ with the latent variable $z = E(R)$. The architecture of the map $F$ is a fully-connected neural network, as described in Fig 3. The training is achieved in the following minimization sense:
$$F = \arg\min \sum_{i=1}^{N_t} \big\| F(R^{\sharp(i)}) - E(R^{(i)}) \big\|^2. \qquad (10)$$
We remark that $F$ relies on the pretrained VAE, whose encoder ($E$) is used only for training, while the decoder ($D$) is a component of the local-to-global detection map. The resultant map $D \circ F$ estimates all landmarks $R$ from the partial knowledge $R^{\sharp}$, based on the landmark patterns learned by the VAE.
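The encoder/decoder structure above can be sketched in numpy. The hidden widths (128, 64, 32) and the random weight matrices are illustrative stand-ins, not the trained network; only the shapes (270-dimensional input, 9-dimensional latent space) follow the paper:

```python
import numpy as np

# Toy sketch of the VAE structure: three shared fully-connected ReLU
# layers, two linear heads for mu and sigma, the reparameterization
# z = mu + sigma * noise, and a mirrored decoder. Weights are random
# stand-ins; hidden widths are assumed for illustration.
rng = np.random.default_rng(0)
relu = lambda t: np.maximum(t, 0.0)

E1 = rng.standard_normal((128, 270)) * 0.05   # shared encoder layers
E2 = rng.standard_normal((64, 128)) * 0.05
E3 = rng.standard_normal((32, 64)) * 0.05
E_mu = rng.standard_normal((9, 32)) * 0.05    # E_4^m head (latent mean)
E_sd = rng.standard_normal((9, 32)) * 0.05    # E_4^s head (latent std)
D1 = rng.standard_normal((32, 9)) * 0.05      # decoder mirrors the encoder
D2 = rng.standard_normal((64, 32)) * 0.05
D3 = rng.standard_normal((128, 64)) * 0.05
D4 = rng.standard_normal((270, 128)) * 0.05

def encode(R):
    h = relu(E3 @ relu(E2 @ relu(E1 @ R)))
    mu, sigma = E_mu @ h, np.abs(E_sd @ h)    # sigma kept non-negative
    z = mu + sigma * rng.standard_normal(9)   # reparameterization trick
    return z, mu, sigma

def decode(z):
    return D4 @ relu(D3 @ relu(D2 @ relu(D1 @ z)))

R = rng.standard_normal(270)   # stand-in normalized landmark vector
z, mu, sigma = encode(R)
R_hat = decode(z)
```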

Coarse-to-fine detection
This subsection explains the coarse-to-fine detection performed using the initial estimate $R^*$. We put the final touches on $R^*$ by utilizing the CT image data. The coarse-to-fine detection is based on strategies suited to the landmark locations (i.e., on the mandible or the cranium). The details are explained in the following subsections.

Mandible-cranium segmentation.
In the binarized skull image $x_b$, we segment the mandible and the cranium separately using the connected component labeling (CCL) technique [19]. Among the connected components generated by the CCL method, the largest component is the cranium and the second largest is the mandible. The segmented cranium and mandible images are denoted by $x_b^{cr}$ and $x_b^{md}$ (as shown in Fig 2). Using these images and the rough estimate $R^*$, the following fine detection processes are conducted.
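The segmentation step can be sketched with a minimal 6-connected component labeling pass. The breadth-first implementation and the toy volume below are illustrative only; a production system would use an optimized CCL routine:

```python
import numpy as np
from collections import deque

# Minimal 6-connected CCL sketch: label each bone component in a
# binarized volume, then take the largest component as the "cranium"
# and the second largest as the "mandible". Toy volume is illustrative.
def label_components(xb: np.ndarray):
    labels = np.zeros_like(xb, dtype=np.int32)
    sizes, cur = {}, 0
    for idx in zip(*np.nonzero(xb)):
        if labels[idx]:
            continue
        cur += 1
        labels[idx] = cur
        q, n = deque([idx]), 0
        while q:
            p = q.popleft()
            n += 1
            for d in range(3):              # 6-connectivity: +/-1 per axis
                for s in (-1, 1):
                    nb = list(p)
                    nb[d] += s
                    nb = tuple(nb)
                    if all(0 <= nb[k] < xb.shape[k] for k in range(3)) \
                            and xb[nb] and not labels[nb]:
                        labels[nb] = cur
                        q.append(nb)
        sizes[cur] = n
    return labels, sizes

xb = np.zeros((5, 5, 5), dtype=np.uint8)
xb[0:2, :, :] = 1            # "cranium": largest component (50 voxels)
xb[4, 0:2, 0:3] = 1          # "mandible": second largest (6 voxels)
labels, sizes = label_components(xb)
ranked = sorted(sizes, key=sizes.get, reverse=True)
cranium_mask = labels == ranked[0]
mandible_mask = labels == ranked[1]
```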

Detection of mandibular landmarks.
For the landmarks on the mandible, which is articulated to the skull, a patch-based 3D CNN is applied to capture the morphological variability of the 3D mandibular surface geometry associated with the landmarks.
Let $R_j^* \in \mathbb{R}^3$ be the roughly estimated position of the landmark with index $j$ in $R^*$ (see S1 Table for the details of the landmark indices). For each mandibular landmark (i.e., $j \in \{49, \cdots, 90\}$), we extract a 3D image patch $(x_b^{md})_{(\eta, R_j^*)}$, defined as a cube with edge length $\eta$ centered at $R_j^*$. Using a 3D CNN, we obtain a map $f_j^{md} : (x_b^{md})_{(\eta, R_j^*)} \mapsto R_j$. The 3D CNN is trained with the dataset
$$\Big\{ \Big( \big(x_b^{md,(i)}\big)_{(\eta,\, R_j^{*(i)})},\ R_j^{(i)} \Big) \Big\}_{i=1}^{N_p} \qquad (11)$$
in the following sense:
$$f_j^{md} = \arg\min \sum_{i=1}^{N_p} \Big\| f\Big(\big(x_b^{md,(i)}\big)_{(\eta,\, R_j^{*(i)})}\Big) - R_j^{(i)} \Big\|^2.$$
In practice, data augmentation using translation and horizontal flipping is applied. The architecture of the 3D CNN is a modified version of VGGNet [18], which is described in Fig 4.

In practice, several landmarks are identified as a group at once. We simultaneously identify six landmarks on the condyle (COR, MCP, LCP, Cp, Ct-in, and Ct-out) that are positionally related to one another, as well as landmarks with bilaterality (e.g., the left/right mandibular foramen) that are associated with the symmetric structure of the mandible. For this group detection, we construct a 3D CNN that produces a concatenated vector of all landmark positions in the same group from one 3D image patch. Here, COR is the abbreviation of coronoid point, defined as the anterior process of the superior border of the ramus of the mandible. (i) MCP, (ii) LCP, (iii) Cp, (iv) Ct-in, and (v) Ct-out are the abbreviations of (i) medial condylar point, (ii) lateral condylar point, (iii) posterior condylar point, (iv) medial temporal condylar point, and (v) lateral temporal condylar point.
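Extracting the cubic input patch around a coarse estimate can be sketched as follows; the clamping behavior at the volume boundary and all names are illustrative assumptions:

```python
import numpy as np

# Sketch of extracting a cubic 3D patch of edge length eta centered at a
# coarse landmark estimate R*_j, as input to a patch-based 3D CNN.
# Clamping keeps the cube fully inside the volume (an assumed policy).
def extract_patch(volume: np.ndarray, center, eta: int) -> np.ndarray:
    """Return an eta^3 patch of `volume` centered (after clamping) at `center`."""
    half = eta // 2
    starts = [int(np.clip(c - half, 0, volume.shape[k] - eta))
              for k, c in enumerate(center)]
    sl = tuple(slice(s, s + eta) for s in starts)
    return volume[sl]

vol = np.arange(16 ** 3, dtype=np.float32).reshape(16, 16, 16)
patch = extract_patch(vol, center=(8, 8, 8), eta=4)
edge_patch = extract_patch(vol, center=(0, 0, 0), eta=4)  # clamped to corner
```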

Detection of cranial landmarks.
Landmarks on the cranium, which demonstrates rigidity, have less variability between subjects. According to [13], cranial landmarks have smaller variance than mandibular landmarks under the normalization presented in Section 2.1. Moreover, our empirical experiment shown in Fig 7 demonstrates that the rough local-to-global estimation achieved using the VAE-based low-dimensional representation provides more accurate annotations for cranial landmarks. Therefore, we again utilize a VAE-based low-dimensional representation in the same manner as in Section 2.2, using only the cranial landmarks $R_{cr}$. To increase the detection accuracy, we enrich the partial knowledge of $R_{cr}$ by accurately detecting three additional cranial landmarks lying near the midsagittal plane (MxDML, Od, and PNS). Here, (i) MxDML, (ii) Od, and (iii) PNS are the abbreviations of (i) maxillary dental midline, (ii) odontoid process, and (iii) posterior nasal spine, which are defined by (i) the midsagittal line and point of the maxillary central incisors, usually defined by the junctional line and point of the right and left incisal edges and medial surfaces of the maxillary central incisors, (ii) a protuberance (process or projection) of the axis (second cervical vertebra), and (iii) a part of the horizontal plate of the palatine bone of the skull. The overall process is illustrated in Fig 5.

First, we compute a partially integrated image $x_{mid}$ from $x_b^{cr}$ using (2), so that the center of the truncated volume of $x_b^{cr}$ lies on the midsagittal plane. Next, a 2D patch $(x_{mid})_{(\eta, R_j^*)}$ is extracted, defined as a square with edge length $\eta$ centered at $R_j^*|_{v_2, v_3}$. Here, $R_j^*|_{v_2, v_3} \in \mathbb{R}^2$ is the vector obtained by eliminating the $v_1$ component of $R_j^*$, and $j \in \{24, 25, 26\}$. Using a 2D CNN, we learn a function $f_j^{cr}$ that infers an accurate position of the landmark $R_j$ in the $v_2$- and $v_3$-coordinates from $(x_{mid})_{(\eta, R_j^*)}$. The landmark position in the $v_1$-coordinate is determined by the location of the midsagittal plane.
In a similar manner as in (11), the following training dataset is generated:
$$\Big\{ \Big( \big(x_{mid}^{(i)}\big)_{(\eta,\, R_j^{*(i)})},\ R_j^{(i)}|_{v_2, v_3} \Big) \Big\}_{i=1}^{N_p}.$$
With this training dataset, the 2D CNN is trained as follows:
$$f_j^{cr} = \arg\min \sum_{i=1}^{N_p} \Big\| f\Big(\big(x_{mid}^{(i)}\big)_{(\eta,\, R_j^{*(i)})}\Big) - R_j^{(i)}|_{v_2, v_3} \Big\|^2.$$
Here, data augmentation using translation is applied. The architecture of the 2D CNN is modified from VGGNet [18]. The remaining cranial landmarks are then estimated by
$$R_{cr} = D^{cr} \circ F^{cr}(R^{\sharp}_{mid}), \qquad (15)$$
where $F^{cr} : R^{\sharp}_{mid} \mapsto z^{cr}$ is a nonlinear map and $D^{cr} : z^{cr} \mapsto R_{cr}$ is the decoder of a VAE. Here, $z^{cr} \in \mathbb{R}^{d_{cr}}$ is a $d_{cr}$-dimensional latent variable given by $z^{cr} = E^{cr}(R_{cr})$, where $E^{cr} : R_{cr} \mapsto z^{cr}$ is the encoder of the VAE. The maps $(E^{cr}, D^{cr})$ and $F^{cr}$ are trained in the same manner as the method presented in Section 2.2, using the cranial landmarks $R_{cr}$. The detailed architectures of $(E^{cr}, D^{cr})$ and $F^{cr}$ are illustrated in Fig 5.

Dataset and experimental settings
Our experiment used a dataset containing 24 paired data (multi-detector CT images and landmark data) and 229 anonymized landmark data. This dataset was provided by Yonsei University, Seoul, Korea. The paired dataset was obtained from normal Korean adult volunteers (9 males and 15 females; 24.22 ± 2.91 years old) with skeletal class I occlusion, and its use was approved by the local ethics committee of the Dental College Hospital, Yonsei University (IRB number: 2-2009-0026). Informed consent was obtained from each subject. Among the 24 paired data, we used 15 pairs for training (i.e., N_p = 15) and 9 pairs for testing. The anonymized landmark dataset with 3D landmark coordinates was acquired in Excel format from 229 anonymized subjects with dentofacial deformities and malocclusions (i.e., N_l = 229). Manual landmarking for both datasets was performed by one of the authors (S.-H. Lee), an expert in 3D cephalometry with more than 20 years of experience.

Our deep learning method was implemented with PyTorch [20] on a computer system with 4 GPUs (GeForce RTX 1080 Ti), two Intel(R) Xeon(R) E5-2630 v4 CPUs, and 128 GB of DDR4 RAM. In the training process, the Adam optimizer [21], an effective adaptive gradient descent method, was consistently adopted. Optimal values of all learning parameters (epoch and learning rate) were empirically selected via cross-validation. For the image-based methods, 15-fold cross-validation was applied, where the 15 paired training data were split into one fold for validation and the rest for training. For training the VAE parts, 5-fold cross-validation was applied to the 244 training data (229 unpaired and 15 paired). Fold values were selected empirically, depending on the amount of available training data.
The nonlinear map $F$ in (10) was trained with the dataset $\{(S_{ub}(R^{(i)}), E(R^{(i)}))\}_{i=1}^{N_t}$, where $S_{ub}$ denotes the subsampling operator that extracts the 10 reference landmarks from $R$; this dataset was used in our implementation. Likewise, the map $F^{cr}$ in (15) was trained in the same manner.
For quantitative evaluation, we used the mean detection error (MDE), computed as follows. Let $\{R_{est}^{(i)}\}_{i=1}^{N_{eval}}$ be a set of $N_{eval}$ landmark outputs to be evaluated. The MDE is computed by
$$\mathrm{MDE} = \frac{1}{N_{eval}} \sum_{i=1}^{N_{eval}} \big\| R_{est}^{(i)} - R_{label}^{(i)} \big\|,$$
where $R_{label}^{(i)}$ is the ground truth corresponding to $R_{est}^{(i)}$ and $\| R_{est} - R_{label} \|$ is defined by
$$\| R_{est} - R_{label} \| = \frac{1}{N_{lmk}} \sum_{j=1}^{N_{lmk}} \big\| R_{est}^{j} - R_{label}^{j} \big\|. \qquad (17)$$
Here, $R_{est}^{j}$ is the vector corresponding to the $j$-th landmark in $R_{est}$, $N_{lmk}$ is the number of landmarks contained in $R_{est}$ (e.g., $N_{lmk} = 90$ for the entire set of global landmarks), and $\| R_{est}^{j} - R_{label}^{j} \|$ is given by
$$\big\| R_{est}^{j} - R_{label}^{j} \big\| = \sqrt{ \sum_{k=1}^{3} \big( R_{est}^{j}|_{v_k} - R_{label}^{j}|_{v_k} \big)^2 },$$
where $R^{j}|_{v_k}$ denotes the $v_k$-coordinate of $R^{j}$.
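The MDE computation can be sketched directly; the arrays below are toy values chosen so the result is easy to verify by hand:

```python
import numpy as np

# Sketch of the mean detection error (MDE): the Euclidean distance per
# landmark, averaged over landmarks, then averaged over evaluation
# subjects. The toy arrays below are illustrative, not real results.
def mde(est: np.ndarray, label: np.ndarray) -> float:
    """est, label: (N_eval, N_lmk, 3) arrays of landmark coordinates (mm)."""
    per_landmark = np.linalg.norm(est - label, axis=-1)  # (N_eval, N_lmk)
    per_subject = per_landmark.mean(axis=-1)             # (N_eval,)
    return float(per_subject.mean())

label = np.zeros((2, 3, 3))
est = label.copy()
est[:, 0, 0] = 3.0   # first landmark off by 3 mm (in v1) for both subjects
print(mde(est, label))  # -> 1.0 (3 mm error on 1 of 3 landmarks)
```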

Results of reference landmark detection
The detection of the 10 reference landmarks ($R^{\sharp}$) provided very accurate and robust results (see Table 1 and Fig 6). These results nearly meet clinical requirements, given that intra-observer repeatability has a precision of less than 1 mm and the overall median inter-observer precision is approximately 2 mm in 3D landmarking systems [22]. Using the reference landmarks, we normalized the landmark data via uniform scaling, fixing the cranial volume of each subject to the average cranial volume of the training dataset.

Results of VAE
For the local-to-global detection, the VAE was trained with 45000 epochs, a full batch size, and a learning rate of 0.001. Here, the full batch size indicates that our dataset was not divided into several batches during training. These learning parameters were chosen empirically by comparing the validation errors obtained from VAEs trained with varying parameters.
To investigate the effect of the latent space dimension, we trained VAEs with varying latent space sizes. The latent dimension should be as small as possible compared to the dimension of the reference landmark vector ($\mathbb{R}^{30}$) as well as that of the global landmark vector ($\mathbb{R}^{270}$). Taking this into account, the latent dimension (9) and the number of epochs (45000) were chosen as empirically optimal values based on the validation error. Table 2 shows the variation of the averaged test error with the number of epochs and the latent space dimension. The error tendency for the test set was almost the same as that for the validation set.

The averaged representation errors of the VAE for the 9 test data were 2.89 mm, 3.11 mm, and 3.06 mm for the cranial, mandibular, and all landmarks, respectively.

Results of the initial local-to-global detection
The nonlinear map F was trained with 5400 epochs, a full batch-size, and a learning rate of 0.0001. For each landmark, Fig 7 shows the performance evaluation achieved using 9 test data with respect to the averaged error in the sense of (17). The mean detection error was 3.27 mm for the cranial landmarks, 3.90 mm for the mandibular landmarks, and 3.59 mm for all landmarks. The error of the cranial landmark estimation was much smaller than that of the mandibular landmark estimation.
For the reference landmarks, the detection outputs of Section 2.1 were selected as the estimation results instead of the VAE outputs. This is because the 2D CNN is specially designed to detect the reference landmarks, which are placed on points with morphologically distinct features, whereas the VAE-based estimation focuses on capturing global landmark patterns within an acceptable tolerance rather than on accurately detecting specific landmarks. The same applies to the 2D CNN-based cranial landmark detection (in Section 2.3.3).

Mandibular landmark detection.
For fine detection of the mandibular landmarks, 3D image patches were extracted with a size of 80 × 80 × 80 voxels (≈ 4 × 4 × 4 cm³). To generate the training data in (11), the patch center location was varied to cover two times the maximum error of the initial estimate $R^*$ over the training data. Using 20000 epochs, a full batch size, and a learning rate of 0.0005, nine 3D CNNs were trained. Figs 6 and 8(b) show the quantitative and qualitative results of the 3D CNNs. The mean distance error decreased to 2.68 mm from the initial detection error of 3.90 mm. The proposed method achieved an error range of 1 to 4 mm for the detection of most landmarks. In addition, as shown in Fig 10(b), the proposed method significantly reduced the mean and variance of the error for the test subjects compared to the initial detection.

Cranial landmark detection.
To generate the partially integrated image $x_{mid}$, we set the interval for the truncated volume to ±7.5 mm from the midsagittal plane along the $v_1$ direction. Next, 2D image patches were cropped to a size of 80 × 80 pixels (≈ 4 × 4 cm²). For training the 2D CNNs, we used 5000 epochs, a full batch size, and a learning rate of 0.0001.
In Fig 9 and Table 3, qualitative and quantitative evaluations of the 2D CNN-based detection of three cranial landmarks on the midsagittal plane are provided. The detection achieved relatively accurate annotation on the three target landmarks.
For the estimation of all cranial landmarks, the VAE $(E^{cr}, D^{cr})$ was trained with 80000 epochs, a full batch size, and a learning rate of 0.001. The map $F^{cr}$ was trained with 23000 epochs, a full batch size, and a learning rate of 0.0001. The latent dimension was empirically set to 15. Figs 6 and 8(a) show the final cranial landmark estimation results in quantitative and qualitative forms, respectively. The mean detection error for all cranial landmarks was 3.08 mm, down from the initial estimation error of 3.27 mm (Fig 10(a)). The error for most cranial landmarks fell within the range of 1 to 4 mm.
In terms of all landmarks, as described in Fig 10(c), our proposed method achieved an error of 2.88 mm (Fig 6), which is much lower than the initial detection error of 3.59 mm (Fig 7).

Discussion and conclusion
This article proposes a fully automatic landmarking system for 3D cephalometry in 3D CT. The proposed method provides accurate and reliable identification of cephalometric landmarks that can be used in subsequent clinical studies, such as the development of morphometric guidelines for diagnosis, surgical planning, and the treatment of craniofacial diseases. The proposed semi-supervised method is designed to use a large anonymized landmark dataset to address the severe shortage of training CT data. Currently, only 24 CT data pairs are available owing to the legal and ethical restrictions on medical data, while approximately 200 anonymized landmark data are available.
The proposed method is based on the benchmark model [13], which yields a 3.63 mm error for the annotation of 90 landmarks. This model motivated the backbone structure of the coarse estimation step. The proposed method reduces the average detection error from 3.63 mm to 2.88 mm by employing coarse-to-fine detection, where appropriate strategies for the mandibular and cranial landmarks account for their different properties. We expect the detection accuracy to improve further as the amount of available training data increases.
While it may be possible to directly learn a map from the partial knowledge to the global landmarks, the use of a VAE is an effective approach for obtaining a meaningful latent representation of skull morphology while learning the local-to-global estimation map. Human skull morphology follows certain patterns, and the positions of landmarks are closely interrelated. A previous study [13] provided empirical evidence that a VAE can learn a low-dimensional representation that is strongly associated with the factors determining facial skeletal morphology. It is also a well-known advantage of VAEs that the learned latent space is dense and smooth [13,14,23]. Hence, the VAE-based local-to-global estimation map is expected not only to provide the connection between partial and global landmarks but also to inherit these merits of the VAE latent representation.
Landmarks on the cranium have smaller variability between subjects than those on the mandible, owing to the rigidity of the cranium; therefore, cranial landmarks are relatively well suited to estimation based on common patterns found across the training dataset. Moreover, 3D CNN-based fine annotation for all landmarks would require high computational memory and power owing to the increased use of 3D networks. Hence, the VAE-based approach can be regarded as an effective strategy for finely detecting cranial landmarks with a sufficient level of accuracy. Meanwhile, the positional estimation of the summit of the cranium (SC) obtained from the relations learned by the VAE exhibited the lowest accuracy (see Fig 6). This appears to be because the SC only weakly depends on the positions of the other landmarks. A rigorous factor analysis using the VAE may be undertaken in future research.
The proposed method has the potential to relieve experts' heavy workloads by introducing automated cephalometric landmarking with high accuracy. In clinical practice, our method allows all 3D landmarks to be estimated from partial information obtained from 3D CT data. Although the error level of some landmarks does not meet the requirement for clinical applications (less than 2 mm), the proposed method may still aid clinicians in determining landmark positions, thereby improving their working processes.
Recently, as concerns about radiation doses have increased, there have been attempts to use dental cone-beam CT for cephalometric analysis instead of conventional multi-detector CT, because cone-beam CT delivers a much lower radiation dose. The investigation of an automated 3D landmarking system for cone-beam CT will therefore be a topic of our future research.