
Unsupervised anomaly appraisal of cleft faces using a StyleGAN2-based model adaptation technique

  • Abdullah Hayajneh ,

    Contributed equally to this work with: Abdullah Hayajneh, Mohammad Shaqfeh, Erchin Serpedin, Mitchell A. Stotland

    Roles Methodology, Software, Visualization, Writing – original draft, Writing – review & editing

    a.hayajneh@tamu.edu

    Affiliation Electrical and Computer Engineering Department, Texas A&M University, College Station, TX, United States of America

  • Mohammad Shaqfeh ,

    Contributed equally to this work with: Abdullah Hayajneh, Mohammad Shaqfeh, Erchin Serpedin, Mitchell A. Stotland

    Roles Methodology, Writing – review & editing

    Affiliation Electrical and Computer Engineering Program, Texas A&M University, Doha, Qatar

  • Erchin Serpedin ,

    Contributed equally to this work with: Abdullah Hayajneh, Mohammad Shaqfeh, Erchin Serpedin, Mitchell A. Stotland

    Roles Supervision, Writing – review & editing

    Affiliation Electrical and Computer Engineering Department, Texas A&M University, College Station, TX, United States of America

  • Mitchell A. Stotland

    Contributed equally to this work with: Abdullah Hayajneh, Mohammad Shaqfeh, Erchin Serpedin, Mitchell A. Stotland

    Roles Project administration, Supervision, Writing – review & editing

    Affiliation Division of Plastic, Craniofacial and Hand Surgery, Sidra Medicine, and Weill Cornell Medical College, Doha, Qatar

Abstract

A novel machine learning framework that is able to consistently detect, localize, and measure the severity of human congenital cleft lip anomalies is introduced. The ultimate goal is to fill an important clinical void: to provide an objective and clinically feasible method of gauging baseline facial deformity and the change obtained through reconstructive surgical intervention. The proposed method first employs the StyleGAN2 generative adversarial network with model adaptation to produce a normalized transformation of 125 faces, and then uses a pixel-wise subtraction approach to assess the difference between all baseline images and their normalized counterparts (a proxy for severity of deformity). The pipeline of the proposed framework consists of the following steps: image preprocessing, face normalization, color transformation, heat-map generation, morphological erosion, and abnormality scoring. Heatmaps that finely discern anatomic anomalies visually corroborate the generated scores. The proposed framework is validated through computer simulations as well as by comparison of machine-generated versus human ratings of facial images. The anomaly scores yielded by the proposed computer model correlate closely with human ratings, with a calculated Pearson’s r score of 0.89. The proposed pixel-wise measurement technique is shown to more closely mirror human ratings of cleft faces than two other existing, state-of-the-art image quality metrics (Learned Perceptual Image Patch Similarity and Structural Similarity Index). The proposed model may represent a new standard for objective, automated, and real-time clinical measurement of faces affected by congenital cleft deformity.

1 Introduction

Cleft lip with or without associated cleft palate (CL +/- CP) is one of the most common major congenital anomalies. The National Birth Defects Prevention Network 2017 Congenital Malformations Surveillance Report, which studied the United States birth cohort between 2010–2014, reported CL +/- CP prevalence at 1 in 1000 live births [1]. Surgical management may involve a series of reconstructive interventions through childhood and adolescence, and sometimes beyond. Unfortunately, current state-of-the-art surgery cannot erase all vestiges of the facial deformity, thus leaving some affected patients with a lifelong psychosocial burden [2, 3]. Cleft lip deformity manifests itself across a spectrum of severity from “incomplete unilateral” to “complete bilateral” depending on the extent of congenital separation of the central philtral region and the two adjacent lateral lip elements on either side. It would be highly beneficial for treating surgeons to have a universally available, objective, and user-friendly method to measure cleft lip severity. This would facilitate (i) impartial self-assessment of clinical outcome on the part of the surgeon, (ii) meaningful discussion with patient/parent, (iii) outcome comparison between different surgical techniques and surgeons, (iv) enhanced surgical planning and education, (v) research on factors associated with cleft severity, and (vi) explicit characterization of the clinical needs and benefits of surgery for third-party payers. While a number of different methods for facial measurement exist, most have been used for research purposes only, and suffer from some combination of being technique- or technology-dependent, susceptible to human bias, or simply infeasible for use in a real-time clinical setting [4–27].

An ideal facial anomaly detection and measurement system should demonstrate (i) fine sensitivity: an ability to discern the types of subtle facial irregularities that matter to patients, (ii) order-preservation: reliably ranking anomaly severity within sets of images with a hierarchy corresponding to human appraisal, (iii) indifference to extraneous variation: tolerance to variations in age, gender, race, pose, lighting, etc., and (iv) comprehensibility: revealing an interpretable process of distinguishing subtle zones of facial aberration, e.g., through the provision of corresponding heat detection maps.

Training an effective anomaly detector, if using a supervised training model, is necessarily challenged by the scarcity of anomalous data. Moreover, it is fair to assert that no facial norm (or population average) can be used as a standard against which all faces of different age, gender, ethnicity, and severity of deformity can be compared. Thus, we have previously introduced the concept of comparing any given facial deformity to its counterpart normal transformation obtained using the StyleGAN facial generator [28, 29]. Herein, we extend the prior work by developing a model that provides more realistic facial normalization (employing StyleGAN2 [30]) with better computational performance and stability, along with a more precise and streamlined system for anomaly detection and measurement using a pixel-wise subtraction method.

2 Related work

Previous descriptions of anomaly detectors intended for medical application have targeted structural outliers in images such as chest CT scans [31], chest radiographs [32], and mammograms [33]. Those using machine learning methods for anomaly detection have depended on explicit modeling of the normal data distribution after first transforming it into a feature space. Newly introduced samples were then classified as anomalous if located outside the established boundaries of the normal distribution, as illustrated in Fig 1.

Fig 1. Example of a dataset of 1000 samples drawn from a Gaussian distribution, with the normal boundary arbitrarily established.

Anomaly detection methods identify outliers as anomalies and calculate their distance from the center of the distribution.

https://doi.org/10.1371/journal.pone.0288228.g001

Erfani et al. used Support Vector Machines (SVMs) trained on one category of samples, represented by features produced by a trained deep belief network model [34]. Using various types of non-medical, low-dimensionality data from the UCI Machine Learning Repository [35], each individual dataset (< 600 features/set) was transformed into an alternative feature space where its contents could be better discriminated, allowing for clearer separation between normal and anomalous samples. However, it is unclear whether this type of SVM model would scale feasibly to higher-dimensionality data such as 1024 × 1024 × 3 pixel images. A different approach employed a Random Forest classifier trained on a publicly available dataset consisting of 384 optical coherence tomography images to detect macular degeneration in an elderly population. Using their model to process the 384 samples (115 normal images and 269 images with expert-confirmed macular degeneration), a 96% accuracy was reported [36]. Despite its success, this Random Forest-based method is a supervised learning approach, which may not be applicable to the detection of rare anomalies within more complex images, such as those depicting the human face. Seeböck et al. used a variational autoencoder with a one-class SVM model to identify anomalous regions in a retinal imaging dataset with 86.6% accuracy [37]. This represented a good example of unsupervised anomaly detection; however, this method employed a complex architecture which may not execute efficiently when processing high-dimensionality images.

The emergence of Generative Adversarial Networks (GANs) offers an alternative approach to the design of anomaly detection and localization systems [38]. A GAN can be adapted to create a replica version of an anomaly-containing image—except with the anomalous portion corrected. This presents an opportunity for a more customized anomaly measurement by allowing for comparison of any raw image to its GAN-transformed counterpart, rather than to an arbitrarily determined population norm. In 2017, the AnoGAN framework introduced the application of this concept for the detection of anomalies within retinal tomograms [39]. The described model also demonstrated an ability to localize and severity-score anomalies within an image. AnoGAN was trained from scratch using one million patches to generate new and unique replicas of normal retinal tomographic images. An AnoGAN-generated image most similar to any raw retinal tomogram under consideration was obtained by finding an optimized latent vector corresponding to the required generated sample. The loss function employed for this optimization used both the generator and the discriminator of the GAN. Then, a heatmap was generated by defining the residual image as the pixel-wise difference between the generated image and the real one. The scoring system was built based on the loss function used during the latent mapping operation.

Boyaci et al. [29] reported the first utilization of a GAN for the assessment of facial anomalies, in which the original StyleGAN facial generator [28] was used to normalize faces depicting a variety of different deformities. The loss function pertaining to the optimization in their method employed two components focused on the similarity and averageness of an image. The similarity component attempted to preserve all normal features within an abnormal facial image, while the averageness component functioned to transform the face as close as possible to the “average face”. Following normalization of a raw facial image portraying deformity, a separate neural network trained on human ratings of normal and abnormal facial images was able to generate a “distance-from-normality score” for any raw image. The normalization procedure for this system was not stable in the sense that convergence was inconsistent, thus requiring adoption of the average result of nine re-executed normalization operations as the final output. The computational overhead of this model is substantial, requiring up to 8 minutes to generate a score for an individual image on an ordinary GPU-equipped laptop.

The current method described here offers a number of important advantages over our previous work [29], including reliable convergence within an average time of 136 seconds. This is because only a single normalization operation is required due to the use of a tuned similarity loss function that allows for more optimization flexibility (see Methods section 3.2.1). The new system is now supported with a model adaptation step that provides more identity-preserving features of the original face that might not otherwise appear without this function. The instability issue of the previous model has now also been resolved by implementing StyleGAN2’s latent optimization method with acceptable normalization output. Finally, an important aspect of the new method is that the facial scoring pipeline employs simple image processing and comparison algorithms. This makes the proposed method more practical to execute on any ordinary computational device.

The main contributions of this paper are the following:

  • a novel StyleGAN2-based model adaptation algorithm that incorporates additional identity-preserving facial features to fine-tune normalized cleft faces beyond what is achievable with the standard StyleGAN2 projection-based algorithm.
  • a heat map, yielded by an image processing pipeline, to compare any input raw image with its normalized counterpart.
  • a heat map-based facial scoring system to evaluate and sort samples according to their proximity to normality.

The rest of the paper is structured as follows. Section 3, Methods, briefly defines the mathematical notation required to model the facial anomaly scoring problem and presents the overall research methodology; the image preprocessing, image normalization, color transformation, morphological erosion, heat map generation, and anomaly scoring steps are described in detail, and the evaluation criteria and aspects pertaining to data collection and human ratings are also presented. Section 4, Results, reports on the performance of the proposed method and compares it to existing metrics. Section 5, Discussion, summarizes the contributions of this work and outlines the advantages and limitations of the proposed pipeline. Section 6, Conclusion, proposes several future research directions to overcome the observed limitations.

3 Materials and methods

This study was approved by the Institutional Review Boards (IRBs) of Sidra Medicine and Texas A&M University. The proposed research model aims to measure the difference D between an abnormal facial image and its normalized counterpart to create a score that describes the severity of the input anomaly. For the purposes of this investigation, we obtained 61 facial images of pediatric patients with various types of cleft lip deformity from the clinical practice of the senior author M.A.S. (each with IRB-approved signed consent), as well as facial images of 64 normal children generated by StyleGAN2. The individuals in this manuscript have given written informed consent (as outlined in PLOS consent form) to publish these case details.

The overall study methodology is summarized in Fig 2.

Fig 2. In order to develop an automated facial scoring system, several steps were required: image preprocessing, face normalization, color transformation, heat map calculation, morphological erosion and anomaly scoring.

https://doi.org/10.1371/journal.pone.0288228.g002

Let xorg ∈ ℝ^(n×m×c) denote the image containing the human face with cleft anomaly, with n, m and c representing its height, width and number of color channels, respectively. The main goal is to assess the deviation of xorg from normality by calculating a scalar index S ∈ ℝ. Index S is obtained by building a relationship between xorg and a difference map computed against its normalized counterpart xnorm, an image having the same dimensions as xorg and containing the same face but with the anomaly region suppressed and replaced with normal facial features (as illustrated in Fig 3).

Fig 3. The objective of the proposed method is to obtain a machine score that aligns closely with a human rating by utilizing a difference map generated by comparing the original image with its normalized counterpart.

https://doi.org/10.1371/journal.pone.0288228.g003

3.1 Image preprocessing

In the first preprocessing step, the face inside the image xorg was detected and localized using the Haar classifier [40], a conventional face detection model. The area ratio of face-to-background needed to align with the FFHQ [28] dataset upon which StyleGAN2 was trained (60% face to 40% background). If the ratio was small, an appropriate area of background was removed; if the ratio was large, the background was enlarged by blurring it and placing 8 horizontally/vertically flipped replicates of the image around it. This helped to match the shape of the input image with the StyleGAN2 generator input shape (1024 × 1024 pixels). To create a blurred background around the face, a mask image is generated by separating the foreground from the background pixels of the original face image using the previously detected face. Afterwards, a Gaussian smoothing filter [41] is applied to the mask to smooth the transition between the foreground and the background areas. Then the range of values of the mask image is converted from 0–255 to 0–1 and the mask is multiplied with the original face image.
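The mask-based blend between the sharp face and the blurred background described above can be sketched as follows. This is an illustrative example only; the function name and the list-of-lists grayscale image representation are ours, not the implementation used in this study:

```python
def blend_with_blurred_background(image, blurred, mask):
    """Per-pixel soft blend: out = mask*image + (1-mask)*blurred.

    image, blurred: 2-D lists (grayscale) of equal size.
    mask: 2-D list of floats in [0, 1] (1 = face, 0 = background).
    A Gaussian-smoothed mask yields a gradual face-to-background transition.
    """
    return [
        [m * p + (1.0 - m) * b
         for m, p, b in zip(mrow, prow, brow)]
        for mrow, prow, brow in zip(mask, image, blurred)
    ]
```

For RGB images the same blend would simply be applied to each channel independently.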

The next preprocessing step is to detect and correct the orientation of the face inside the image. This is conducted by detecting the eyes in the face by using 68 landmarks of the face. The goal is to secure horizontal alignment of the eyes inside the image. Next, the distance between the eyes is measured and the whole image is scaled up or down so that the distance between the centers of the eyes is equal to 100 pixels, a condition which ensures consistency with the StyleGAN2 pretrained model. Finally, 1024 × 1024 pixels are cropped around the face location. Fig 4 illustrates the overall preprocessing steps.
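The rotation and rescaling step can be illustrated with a short sketch that derives the correction angle and scale factor from the two detected eye centers. The function and coordinate conventions are hypothetical (the actual pipeline operates on the 68 facial landmarks):

```python
import math

def eye_alignment(left_eye, right_eye, target_dist=100.0):
    """Return the rotation angle (degrees) needed to level the eyes and
    the uniform scale factor that makes the inter-eye distance equal to
    target_dist pixels (100 px for the StyleGAN2 pretrained model)."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = math.degrees(math.atan2(dy, dx))  # 0 when the eyes are level
    dist = math.hypot(dx, dy)                 # current inter-eye distance
    scale = target_dist / dist
    return angle, scale
```

The returned angle and scale would then be fed to an affine warp (e.g., a rotation/scaling matrix about the eye midpoint) before the final 1024 × 1024 crop.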

Fig 4. Applying different preprocessing steps to generate consistent images with the StyleGAN2 pretrained model.

This includes adjusting the background, scale, and rotation/flip orientation.

https://doi.org/10.1371/journal.pone.0288228.g004

3.2 Image normalization

After obtaining a well-aligned face image xorg through the preprocessing step, a normalized version xnorm of the original face was created by utilizing the StyleGAN2 face generator which produces unique and high quality faces. The overall proposed normalization algorithm is illustrated in Fig 5. It consists of two successive operations: Latent Optimization and StyleGAN2 Adaptation.

Fig 5. The proposed normalization algorithm searches for the latent vector most closely matching the input face.

The latent vector is then frozen and the StyleGAN2 model is optimized to reconstruct more facial details without anomalies. G refers to the StyleGAN2 generator network.

https://doi.org/10.1371/journal.pone.0288228.g005

3.2.1 StyleGAN2 latent optimization—Face inversion.

For the sake of consistency in the overall pipeline, this step used the standard face inversion method proposed in the original StyleGAN2 paper, which is described in Algorithm 1. This algorithm optimizes the intermediate latent vector W as well as the 18 noise maps ni, i ∈ {0, 1, …, 17}, present in the StyleGAN2 architecture, to help find the latent vector that best encodes the facial features present in xorg.

StyleGAN2 latent optimization starts by generating 10,000 random latent vectors Z, transforming them using the mapping network (see the S1 Appendix for more details about the StyleGAN2 architecture) to produce 10,000 intermediate latent vectors W, and averaging them to get an estimate of the mean intermediate latent vector μ. The algorithm optimizes μ to get an initial estimate of the latent vector corresponding to the generated face closest to the input image. The next step involves calculation of the similarity loss between the original image and the estimated image corresponding to the latent vector using the Learned Perceptual Image Patch Similarity (LPIPS) [42] distance measure. The LPIPS loss function takes the following mathematical expression:

L_LPIPS(x, x0) = Σ_l (1 / (W_l H_l)) Σ_(h,w) ‖ w_l ⊙ (ŷ^l_(h,w) − ŷ^l_(0,h,w)) ‖²₂,   (1)

where l stands for the layer index, w_l, W_l and H_l denote the learned weights, width and height of layer l, respectively, ŷ^l and ŷ^l_0 are the unit-normalized deep feature activations of the two images at layer l, and ⊙ represents the Hadamard product (element-wise multiplication).

To facilitate the latent vector search, the inversion algorithm additionally optimizes 18 randomly generated noise maps ni of resolutions ri × ri, ri ∈ {1024, 512, …, 8}, used in the face synthesis operation. As a consequence of optimizing the noise maps, some facial details may sneak into the noise maps during the synthesis operation. Therefore, a regularization term scaled by α is added to the overall loss, which depends on the original noise maps as well as on downscaled versions of the noise maps ni,j, j > 0. These downscaled versions are not part of the optimization. The regularization term measures the amount of randomness present in the noise maps and ensures that the noise maps remain random.

Algorithm 1 Face Inversion. Input: Face image xorg and StyleGAN2 generator model G. Output: Latent vector w and set of noise maps ni, i ∈ {0, 1, …, 17}. Parameter: α = 10^5

z = {z1, …, z10000} ← random latent vectors sampled from the prior

w ← mean of the mapped intermediate vectors of z1, …, z10000

while not converged do

  L ← L_LPIPS(xorg, G(w, n0, …, n17)) + α · L_noise(n0, …, n17)

  for i in (0, 1, …, 17) do

    ni ← ni − η ∇ni L

  end for

  w ← w − η ∇w L

end while

return w, n0, n1, …, n17

Here η denotes the learning rate and L_noise is the noise-map regularization term described above.
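A toy one-dimensional analogue may clarify the inversion loop: a "latent" w is gradient-descended until a stand-in generator g reproduces the target observation. The generator g, its derivative dg, and all constants here are illustrative stand-ins, not the StyleGAN2 machinery:

```python
def invert(g, dg, x_target, w0=0.0, lr=0.05, steps=200):
    """Toy 1-D analogue of face inversion: gradient-descend the
    'latent' w so that g(w) matches the target observation,
    minimizing the squared reconstruction error (g(w) - x_target)**2."""
    w = w0
    for _ in range(steps):
        err = g(w) - x_target        # residual of the current synthesis
        w -= lr * 2.0 * err * dg(w)  # gradient of the squared error w.r.t. w
    return w

g = lambda w: 3.0 * w + 1.0   # stand-in "generator"
dg = lambda w: 3.0            # its derivative
w_hat = invert(g, dg, x_target=10.0)   # ideal latent: g(3.0) = 10.0
```

The real algorithm replaces the squared error with the LPIPS loss plus the noise regularizer, and the analytic derivative with automatic differentiation through the generator.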

3.2.2 StyleGAN2 pretrained model adaptation.

The additional training of the StyleGAN2 model is called model adaptation, a step in which the generator's internal weights are updated and forced to represent the remaining details of the input face and to improve the face image. A custom adaptation method is described in Algorithm 2. The core of this algorithm is to incrementally adjust the StyleGAN2 weights W while freezing the initial estimate of the latent vector w obtained from the previous step. The initial guess of the latent vector corresponds to the best normalized face xnorm. This adds more identity-preserving details to the generated face except those related to the anomaly. This step continues for an appropriate number of training iterations and stops when most of the details are represented in the generated face. To consistently represent more details of the face, a carefully designed loss function was used during the adaptation step, as shown in the following equation:

L_adapt(xorg, xnorm) = L_LPIPS(xorg, xnorm) + λ · L_MSE(xorg, xnorm),   (2)

where L_MSE denotes the pixel-wise mean squared error and λ balances the two terms.

Algorithm 2 Pretrained model adaptation. Input: Face image xorg, its corresponding closest latent vector w and a StyleGAN2 generator G. Output: Adapted generator G′ and normalized face xnorm

G′ ← G

W ← weights(G′)

while not converged do

  xnorm ← G′(w)

  L ← L_adapt(xorg, xnorm) as given in Eq (2)

  W ← W − η ∇W L

end while

return G′, xnorm

The above loss function seeks a compromise between reconstructing semantic details of the face, captured by the Learned Perceptual Image Patch Similarity (LPIPS) loss, and the details of image pixels present in both the original and generated faces, captured by the pixel-wise mean squared error. Both of these metrics are valid distance measures. This loss function is used to incrementally update the weights of StyleGAN2 to get the newly adapted generator G′, which produces the closest possible face xnorm to the original one xorg. An example of the complete image normalization pipeline is illustrated in Fig 6.
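The structure of this compromise can be sketched as follows. This is a minimal illustration assuming an LPIPS-plus-weighted-MSE combination; the perceptual term is passed in as a stub, since a true LPIPS evaluation requires a pretrained feature network:

```python
def mse(a, b):
    """Mean squared pixel error between two flat pixel lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def adaptation_loss(x_org, x_norm, perceptual, lam=1.0):
    """Weighted compromise between a perceptual (LPIPS-like) term and a
    pixel-wise MSE term. The exact combination and the weight lam are
    assumptions for illustration, not the paper's stated hyperparameters."""
    return perceptual(x_org, x_norm) + lam * mse(x_org, x_norm)
```

In the actual pipeline this scalar would be backpropagated through the generator weights W at each adaptation iteration.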

Fig 6. Facial transformation by applying the face inversion and model adaptation algorithms sequentially.

The face with the green border is taken as the best normalized version of xorg, showing distinct features of the face without the anomaly. Running more adaptation iterations would reverse the normalization.

https://doi.org/10.1371/journal.pone.0288228.g006

3.3 Color transformation

To ensure better discrimination between noise and face information, xorg and xnorm were transformed from the Red-Green-Blue (RGB) color model to YCbCr (see Fig 7). Conversion from RGB to the YCbCr color model (after gamma correction) was obtained using these transformations:

Y′ = 0.299 R′ + 0.587 G′ + 0.114 B′
CB = 128 − 0.168736 R′ − 0.331264 G′ + 0.5 B′   (3)
CR = 128 + 0.5 R′ − 0.418688 G′ − 0.081312 B′

where R′, G′ and B′ are the three gamma-corrected color channels of the input image, and Y′, CB and CR are the intensity (luma), blue-difference and red-difference chroma components [43], respectively.

Fig 7. Transforming the facial image from RGB to YCbCr.

Note that most of the variation is in the Y component. Cb and Cr exhibit negligible variation.

https://doi.org/10.1371/journal.pone.0288228.g007

The RGB color model describes pixel information by the amount of each of the three primary colors it contains. The range of each color value is not linearly related to the intensity. Moreover, the RGB color system exhibits considerable redundancy across its three channels, with no separation between color and intensity. YCbCr presents the advantage of separating the intensity values from chromaticity. YCbCr is widely used for skin color detection due to its ability to better represent color images with uneven illumination [44], which helps in separating color from illumination information.
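For reference, the full-range BT.601 (JPEG-style) conversion can be written out as a small per-pixel function; whether the study used exactly this coefficient set is an assumption:

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 RGB -> YCbCr conversion (JPEG convention).
    Inputs are gamma-corrected channel values in [0, 255]; Y carries
    intensity while Cb/Cr carry chromaticity, centered at 128."""
    y  =  0.299    * r + 0.587    * g + 0.114    * b
    cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128.0
    cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128.0
    return y, cb, cr
```

Note that for any achromatic pixel (r = g = b) the chroma channels collapse to 128, which is why most of the variation in Fig 7 concentrates in the Y component.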

3.4 Heatmap generation via pixel-wise squared error

The Pixelwise Subtraction Error (PSE) measures the difference in color intensity between the two images xorg and xnorm as follows:

PSE(xorg, xnorm) = (xorg − xnorm) ⊙ (xorg − xnorm),   (4)

where ⊙ denotes the Hadamard product [45].

In our framework, we used the PSE similarity measure by considering the squared difference between each pixel in the original image and the corresponding pixel in the normalized version. The pixel-based distance measure included both the real anomalous difference and the difference that was caused by the change in illumination.
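The PSE heatmap reduces to an element-wise squared difference, as in the following sketch (list-of-lists grayscale channels, an illustrative convention):

```python
def pse_heatmap(x_org, x_norm):
    """Pixel-wise squared error heatmap between two equally sized
    single-channel images: (x_org - x_norm) squared element-wise."""
    return [
        [(p - q) ** 2 for p, q in zip(row_o, row_n)]
        for row_o, row_n in zip(x_org, x_norm)
    ]
```

Identical pixels contribute zero, so only regions where normalization changed the face, anomaly or illumination, light up in the map.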

3.5 Morphological erosion for noise reduction

The normalized version of the face xnorm is not an exact copy of the original xorg, as it may include subtle changes in illumination (alluded to above) or in textural detail. These fine differences may or may not be detectable by human appraisal; to address these artifacts, we applied morphological erosion [46] to both xorg and xnorm to reduce the effect that noise might have on our generated anomaly score.

The essence of the erosion process is to shrink object boundaries and enlarge the size of holes. This is done by finding the minimum value of the neighborhood pixels at a specific location of the image. Let F(j, k) represent the pixel value of the grayscale image F under process at location (j, k). The erosion process over a 3 × 3 pixel neighborhood is implemented by means of this transformation:

G(j, k) = min { F(j + p, k + q) : p, q ∈ {−1, 0, 1} }   (5)

This process is repeated around the image portion that is subject to analysis. We applied the erosion process on all the Y, Cb and Cr image channels as we assumed that the noise was contained in all the channels but with higher magnitude in the Y channel.
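A minimal 3 × 3 grayscale erosion, applied independently to each channel, can be sketched as follows. The border handling here (taking the minimum over only the available neighbors) is our own choice for the sketch:

```python
def erode3x3(img):
    """Grayscale morphological erosion: each output pixel is the minimum
    of its 3x3 neighborhood, which shrinks bright speckle noise while
    preserving larger uniform regions."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for j in range(h):
        for k in range(w):
            out[j][k] = min(
                img[j + p][k + q]
                for p in (-1, 0, 1) for q in (-1, 0, 1)
                if 0 <= j + p < h and 0 <= k + q < w
            )
    return out
```

An isolated bright pixel, the kind of noise residue targeted here, is wiped out entirely, while a bright region larger than the structuring element merely loses a one-pixel rim.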

3.6 Anomaly score calculation and comparison

For the purposes of assigning cleft severity scores to all experimental images, we undertook both the construction of a machine model and the collection of human ratings for comparison. One hundred and forty-five human raters were recruited; each was randomly shown one of ten image slideshows containing 15 cleft-affected children’s faces and 15 StyleGAN2-generated normal children’s faces. A total of 240 unique images were used. The order of the 30 images within each slideshow was randomized. Each image was displayed for only 3 seconds so as to discourage deliberation and elicit more instinctive human responses. Subjects were requested to rate each image on a 1 (least normal) to 7 (most normal) scale. We report here only on images that received a minimum of 20 individual ratings (125 out of 240). The average number of human ratings per image under consideration was 25.

In order to generate the machine scores, the proposed framework was applied to the 125 raw samples under study. During the normalization step, 450 face inversion and 50 adaptation iterations were performed to grossly obtain the desired facial details. The post-processed abnormality heatmap from the previous step was filtered to remove the unwanted parts of the image using a mask image M. Next, all the remaining pixels were summed up and divided by their total count N. This quotient represented an overall anomaly score indicating how much energy was contained in the heatmap; a higher quotient corresponded to a more severe facial anomaly. So as to enhance the sensitivity of the system to subtle facial anomalies, and to align the direction of severity scoring with the human ratings, we then calculated the negative log transformation of the quotient using the following equation:

S(xorg, xnorm) = −log( (1/N) Σ(j,k) M(j, k) · H(j, k) ),   (6)

where xorg and xnorm are the face pair under evaluation and H is their post-processed abnormality heatmap. All scores derived from this calculation were then linearly scaled to facilitate comparable graph visualization. Note that for the evaluation phase, scores were similarly calculated for all 3 types of heatmaps under comparison (LPIPS, Structural Similarity Index Measure (SSIM) [47], and PSE) using the same operations.
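The masked-mean-plus-negative-log scoring of Eq (6) can be sketched as follows (the binary mask and list-of-lists heatmap are illustrative conventions of ours):

```python
import math

def anomaly_score(heatmap, mask):
    """Masked mean heatmap energy followed by a negative log, so that
    lower heatmap energy (a more normal face) maps to a higher score."""
    vals = [
        h for hrow, mrow in zip(heatmap, mask)
        for h, m in zip(hrow, mrow) if m          # keep masked-in pixels only
    ]
    mean_energy = sum(vals) / len(vals)           # (1/N) * sum of retained pixels
    return -math.log(mean_energy)
```

The subsequent linear rescaling for plotting is omitted here, as it does not affect the ordering of the scores.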

The relationship between our machine-generated scores of the 125 facial images and the corresponding human ratings of the same images was assessed using Pearson’s correlation:

r = Σi (xi − x̄)(yi − ȳ) / ( √(Σi (xi − x̄)²) · √(Σi (yi − ȳ)²) ),   (7)

where the xi’s are the machine scores, the yi’s are the human scores, and x̄ and ȳ denote the means of the machine and human scores, respectively.

3.7 Computing resources

We conducted the analysis in Python 3.6 using the PyTorch, OpenCV, SciPy and scikit-image libraries on an Intel i7-10751H CPU at 2.6 GHz with an Nvidia GeForce 2080 Super (Max-Q design) GPU.

4 Results

Representative examples of the proposed PSE and alternative heatmap generation methods are displayed in Fig 8. The original image, its normalized counterpart, and the difference heatmap are shown for each cleft image. The oral and nasal areas are highlighted by masking. Abnormal regions under consideration are represented in the heatmaps by “hotter” pixels. Reviewing the three versions of heatmaps, the PSE method appears to light up anomalous cleft regions with more refined localization than the two peer techniques.

Fig 8. Comparison of various heatmaps generated using LPIPS, SSIM, and the proposed PSE method.

The PSE approach demonstrates a more localized anomaly signal.

https://doi.org/10.1371/journal.pone.0288228.g008

Tables 1 and 2 show the Pearson correlation between machine and human scores for the 61 real cleft faces alone and for the whole set of 125 experimental facial images, respectively. Each table shows the correlation coefficient under different combinations of image processing and heatmap generation methods. The correlation with and without the model adaptation step of image normalization is shown. Also, the correlations obtained after masking everything except the oral/nasal region are compared with the correlations considering the entire face excluding the eyes. Heatmaps and associated severity scores generated by the PSE method showed 88.7% and 82.8% correlations with human ratings for the oral/nasal region and the entire face, respectively. The reported correlations for the PSE-based scores were achieved when applying both the color transformation and the erosion operations to the PSE heatmaps. In the case of the LPIPS-based scores, the color transformation step improves the correlation (from 79.4% to 83.5%) without applying the erosion step. Conversely, the SSIM-based scores show the worst performance, with correlation degradation when applying the YCbCr transformation exclusively (from 81.5% to 68.6%) and good improvement when adding the erosion operation only (from 81.5% to 84.9%). In general, the results show the superiority of the simple PSE over the LPIPS- and SSIM-based scoring methods in almost all scenarios. Additionally, LPIPS works well with the color transformation, while the SSIM-based scores improve with the erosion operation.

Table 1. Pearson correlation coefficient between the human and machine scores of the 61 cleft faces under analysis, utilizing different combinations of heatmap generation and image processing approaches.

The best combination is highlighted in bold below.

https://doi.org/10.1371/journal.pone.0288228.t001

Table 2. Pearson correlation coefficient between the human and machine scores of the 125 StyleGAN2-generated and real cleft faces under analysis, utilizing different combinations of heatmap generation and image processing approaches.

The best combination is highlighted in bold.

https://doi.org/10.1371/journal.pone.0288228.t002

To visually demonstrate the effectiveness of the proposed scoring method compared to others, Fig 9 shows three plots of human versus machine scores for the 125 faces under study, corresponding to the LPIPS-, SSIM- and PSE-based scoring methods, respectively. The PSE-based scores clearly exhibit better consistency with human ratings than the LPIPS- and SSIM-based scores.

Fig 9. The Pearson correlation between human and machine scores of the 125 faces under study using different algorithmic approaches.

The optimal heatmap and image processing combination for each of the 3 approaches is depicted here (as reflected in bold in Table 2). Note the tighter alignment of human/machine scoring with the PSE method.

https://doi.org/10.1371/journal.pone.0288228.g009

The average computation time required to evaluate the score of one sample, utilizing different combinations of heatmap generation and image processing approaches is shown in Table 3. An overhead of approximately 135 seconds is added for image normalization. The proposed method presents lower overall computational overhead relative to the approach in [29], which requires 9 minutes to score one image.

Table 3. Average evaluation time per sample in seconds for each combination of operations under analysis, when xorg and xnorm are provided.

Another 135 seconds have to be added for each normalization step with adaptation to obtain xnorm, while 123 seconds have to be added when using the standard StyleGAN2 normalization procedure.

https://doi.org/10.1371/journal.pone.0288228.t003

5 Discussion

The notable performance of the proposed model can be attributed to its multi-step methodology, with each step contributing to its effectiveness. Image preprocessing was required to reconcile the input cleft images with the framework of the StyleGAN2 architecture. The normalization function reported here reprises our previously described innovation built on StyleGAN, but with important enhancements, including (i) the use of a tuned similarity loss function available within the updated StyleGAN2, allowing for more optimization flexibility and resolving the prior instability issue, and (ii) a novel model adaptation technique that retains more identity-preserving features of the input image. Finally, the facial scoring procedure described here employs simple image processing and comparison algorithms, making the overall system practical for use on any ordinary computational device.

Three key features of our method were (i) leveraging the StyleGAN2 facial generator [48] to produce high-quality normalized face images, (ii) complementing the anomaly score generated by the computer model with visual heatmaps that localize the facial abnormality and make the machine scores explainable, and (iii) incorporating conventional image processing techniques to improve the quality of the generated heatmaps and reduce their noise, which is needed to obtain more reliable machine scores of the anomalies. An important question for a given face image with a cleft anomaly was: how can StyleGAN2 be used to find a matching normal face that removes the cleft mouth anomaly while preserving the identity of the face? This was achieved by the face inversion operation, which finds the vector in the latent space of a pretrained StyleGAN2 face generator model G that transforms into the normal face image most closely resembling the abnormal face under evaluation. It should be noted that the StyleGAN2 model was trained using normal face images only, and hence it is only capable of producing normal face images. This step is called face inversion because the process finds a latent vector given a face image, instead of producing a face image given a latent vector, which is the primary use of StyleGAN2.

The dimensionality of the StyleGAN2 latent vector is 512, a design parameter chosen by the creators of StyleGAN2; we retain this value to ensure consistency of the proposed model with the StyleGAN2 model. A 512-dimensional latent vector is considered sufficient to compress the information that distinguishes an individual face from other human faces. This choice also depends on the number of samples used to train the model, to ensure convergence of the learning process. A higher-dimensional latent vector could achieve better normalization of the original images; however, this would require creating a new deep learning model with more training samples and a more complex training process.

Different approaches have been proposed in the literature to find the closest latent representation of a face in StyleGAN2. These approaches can be divided into two main categories. The first category comprises optimization-based methods, where a latent code is directly optimized for a fixed sample [30, 49–51]; the second category comprises encoder-based methods, where a separate encoder network is built and trained to predict the latent code for the sample [52–54]. Encoder-based methods have the advantage of producing the corresponding latent vector in a single forward pass through the encoder, which significantly reduces the computational overhead compared to optimization-based methods. On the other hand, optimization-based methods can produce latent vectors associated with better-quality images and closer normalization to the input image. This is important for our work, since our method needs to detect subtle changes in facial images. Additionally, a hybrid method lying between the two categories was proposed in [55]. The hybrid method first estimates the latent code using a well-trained encoder to shorten the latent optimization, and then employs an iterative optimization algorithm to refine the latent code so that it better represents the semantics of the image in the latent space.

The inversion method chosen in our work belongs to the category of optimization-based methods and is considered a standard inversion method. In general, this method iteratively minimizes a similarity loss between the input face image and a generated face by adjusting an initially random latent vector via back-propagation [56]. The utilized loss function is based on the LPIPS loss. LPIPS calculates the averaged difference between deep features extracted from the two images using feature extractor networks such as a pretrained VGG [57], AlexNet [58], or SqueezeNet [59].
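As a toy illustration of optimization-based inversion (not the actual StyleGAN2/LPIPS pipeline), the sketch below inverts a hypothetical linear generator under an L2 loss, where the back-propagated gradient with respect to the latent vector can be written in closed form:

```python
import numpy as np

# Illustrative sketch of optimization-based inversion (toy setup).
# The real pipeline optimizes a 512-D StyleGAN2 latent against an
# LPIPS loss; here a linear "generator" G(w) = A @ w and an L2 loss
# stand in so the gradient step can be written explicitly.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 8))      # fixed ("pretrained") generator weights
w_true = rng.standard_normal(8)       # latent that produced the target image
target = A @ w_true                   # the "input face" to invert

w = np.zeros(8)                       # initial latent vector
lr = 0.005
losses = []
for _ in range(500):
    residual = A @ w - target         # generated minus target
    losses.append(float(residual @ residual))
    grad = 2.0 * A.T @ residual       # gradient of the L2 loss w.r.t. the latent
    w -= lr * grad                    # gradient-descent latent update

print(losses[0], losses[-1])          # loss shrinks as the latent is recovered
```

In the actual method, the analytic gradient is replaced by automatic differentiation through the generator and the LPIPS feature extractor.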

After estimating the normalized face during the inversion step, we observed that the obtained estimate still lacks some identity-preserving details of the abnormal face. This may be because the training dataset of StyleGAN2, which consists of 70,000 images of human faces, does not completely capture all variations among individual human faces. Another reason is that the face inversion step involves a highly nonlinear optimization problem, and the numerical algorithm solving it may fail to explore the entire latent space. This prompted an enhanced normalization that applies additional training of the pretrained StyleGAN2 model in a Model Adaptation step. Several works have proposed adapting new datasets into a pretrained StyleGAN2 model [30, 60–62], most of them designed to change the domain of the generated objects. Our approach builds on these so that a small update can be applied to the generator to reproduce the required details of the original face. This small update does not alter the remaining latent space or its corresponding generated images. Also, we do not utilize the edited generator for any task other than refining the target normalized face.

Selecting the number of adaptation iterations of the proposed algorithm depends on the amount of missing detail that should be recovered. If the input face is very close to normal, then fewer adaptation iterations are required, as the majority of the facial details will already be present after the first inversion step. We found that 50 adaptation iterations were sufficient to recover most of the missing normal details of the generated face. If the adaptation proceeds beyond this number, abnormal parts of the input face are gradually reconstructed, which harms the normalization result, as shown in Fig 10. Nevertheless, this observation can be useful for other applications outside the scope of this work, such as simulating the recovery state of a cleft-affected patient.

Fig 10. If we keep adapting the StyleGAN2 generator for more than 50 iterations, the model starts reconstructing the abnormal details of the original face.

Here the cleft anomaly gradually appears in the generated face.

https://doi.org/10.1371/journal.pone.0288228.g010

The bottom parts of Tables 1 and 2 show the correlation between the machine predictions and the human scores when the proposed StyleGAN2 model adaptation is not applied during the normalization phase. The correlation is clearly worse, except for SSIM, because many facial details in the original image were not represented in the normalized face. This demonstrates the importance of applying extra adaptation steps to the StyleGAN2 facial generator model.

One interesting idea to explore is to count the number of adaptation iterations needed to fully reconstruct the anomalous parts and take that number as a measure of anomaly severity. The strong correlation between the number of iterations and the degree of anomaly severity could be employed to generate an alternative severity index for facial deformities, a research direction that we plan to address in a future companion paper.

One final comment on the normalization step is that it has a major limitation: the eyes are reconstructed inaccurately, even with state-of-the-art deep generative models. There could be several reasons for this difficulty, including the fact that the eyes capture reflections of the environment in front of the photographed person. Such reflections are difficult for deep models, including the StyleGAN2 generator, to reconstruct accurately. For our main purpose of scoring cleft abnormalities, this limitation can be mitigated by masking the eye region.

Following the normalization step, heatmap generation was essential for obtaining the anomaly score. It was important for the heatmap to highlight the anomalous parts rather than the raw pixel difference. In general, almost all similarity measures rely on heatmaps to calculate the similarity score, which makes it possible to test different methods in terms of their ability to relate the heatmaps to consistent scoring systems. Generally, the heatmap can be created by comparing the original image with the normalized one by means of a similarity distance measure. Several similarity measures can be used to generate the heatmap, and they can be divided into two main categories: shallow and deep measures. The first category uses an explicit definition of the difference between the two images.

The Structural Similarity Index Measure (SSIM) is a popular shallow similarity measure that compares the difference in quality between two images in terms of color and contrast as well as structural differences, and it generates a heatmap based on these difference features. The SSIM between windows x and y of size N × N in xorg and xnorm, respectively, is expressed by the following equation:

SSIM(x, y) = ((2 μx μy + c1)(2 σxy + c2)) / ((μx² + μy² + c1)(σx² + σy² + c2)),   (8)

where μx and μy are the means of x and y, respectively, σxy is the covariance of x and y, σx² and σy² are the variances of x and y, and c1 and c2 are two stabilization factors.
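Eq (8) can be implemented directly. The sketch below computes SSIM for a single pair of windows, using the standard stabilization constants c1 = (0.01 L)² and c2 = (0.03 L)², where L is the data range; these constants are an assumption, since the paper does not state its choices:

```python
import numpy as np

def ssim_window(x, y, data_range=255.0):
    """SSIM of Eq. (8) for two equally-sized image windows (sketch)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (0.01 * data_range) ** 2            # standard stabilization constants
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2)
    return num / den

# Identical windows score 1; dissimilar windows score lower.
w = np.arange(64, dtype=np.float64).reshape(8, 8)
print(ssim_window(w, w))                 # ≈ 1.0 for identical windows
print(ssim_window(w, 255.0 - w) < 1.0)   # True
```

The full SSIM heatmap is obtained by sliding such windows over xorg and xnorm.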

An example of a deep comparison method is LPIPS. Deep similarity measures rely on an implicit definition of the difference between two compared images, as they employ features extracted by deep neural networks pretrained on datasets of diverse object types, i.e., they act as object feature extractors.

We found that, despite their simplicity, the PSE heatmaps are more intuitive and more consistent with the generated scores than the other types of heatmaps when supported with additional image processing techniques. In contrast, heatmaps generated by LPIPS and SSIM do not greatly improve even when supported by additional image processing techniques. Another important feature of the PSE heatmap is that it highlights structural changes as well as the large contiguous blobs of cleft abnormalities. Therefore, the PSE heatmap turns out to be closer to human intuition than the other generated heatmaps.

The color transformation step improved heatmap generation by producing an anomaly difference map between the abnormal face and its normalized counterpart that highlights the abnormality location while neglecting other sorts of differences, such as small color changes due to lighting. A cleft abnormality causes mainly a geometrical (i.e., structural) change in the image rather than a color change, and geometrical information is captured more precisely by the illumination intensity than by the color information. This required representing the image in a color system that separates color information from intensity. We found that representing the images xorg and xnorm in the YCbCr color system during the scoring process improves the correlation between human and machine ratings. People's appraisals of facial abnormality are more sensitive to structural changes in the image than to color. Thus, we transform the image from the RGB into the YCbCr color space to better capture the luminance information. Interestingly, color transformation also improves the correlation for LPIPS (from 79.4% to 83.5%), unlike for SSIM heatmaps; this is because the highlighted anomaly regions did not depend on the color information.
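A common RGB-to-YCbCr mapping uses the ITU-R BT.601 coefficients; the exact variant used in the pipeline is an assumption here, and the sketch below is one standard choice:

```python
import numpy as np

# RGB -> YCbCr conversion (ITU-R BT.601 full-range coefficients; the
# exact variant used in the paper is not specified, so this is one
# common convention).
def rgb_to_ycbcr(rgb):
    rgb = rgb.astype(np.float64)
    m = np.array([[ 0.299,     0.587,     0.114   ],   # Y  (luminance)
                  [-0.168736, -0.331264,  0.5     ],   # Cb (blue chroma)
                  [ 0.5,      -0.418688, -0.081312]])  # Cr (red chroma)
    ycbcr = rgb @ m.T
    ycbcr[..., 1:] += 128.0            # center the chroma channels
    return ycbcr

# The Y channel alone carries the structural (intensity) information
# compared between x_org and x_norm during scoring.
img = np.full((2, 2, 3), [255, 0, 0], dtype=np.uint8)   # pure red patch
y = rgb_to_ycbcr(img)[..., 0]
print(y)   # ≈ 76.245 everywhere (luminance of pure red)
```

Separating luminance from chroma lets the scoring step ignore small color shifts caused by lighting while retaining structural differences.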

The abnormality score generated by the computer model is based on summing the energy in the generated heatmap of the difference between an input original face image and its normalized version. However, the generated heatmap will always contain unavoidable noise on top of the main signal (i.e., the abnormality difference). This noise has two main sources. The first is the quality of the input image, which depends on the acquisition device (camera) used to capture it. The second is the normalization step, since the StyleGAN2 generator adds a certain level of structural artifacts when producing a face [63].

It is obvious that the noise in the generated heatmap is distributed all over the face image, while the abnormality signal is concentrated in the oral and nasal region only. Therefore, in order to enhance the signal-to-noise ratio, we found that applying a mask so that the heatmap energy is summed (and averaged) within the oral/nasal region only yields an abnormality score that correlates more closely with human ratings.

Furthermore, the heatmap morphological erosion step was useful for reducing the noise introduced by the normalization step. Comparing the final correlation between human scores and the PSE-based scores with and without erosion confirms that this additional image post-processing is very useful and integrates well with the PSE-based heatmaps.
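A minimal sketch of grayscale morphological erosion with a 3 × 3 structuring element (the element size is an illustrative assumption) shows how isolated noise pixels are suppressed while contiguous anomaly blobs survive:

```python
import numpy as np

# Minimal 3x3 grayscale morphological erosion (sketch): each pixel is
# replaced by the minimum over its neighborhood, so thin, isolated
# noise specks in the heatmap vanish while large contiguous anomaly
# blobs are preserved.
def erode3x3(heatmap):
    padded = np.pad(heatmap, 1, mode="edge")
    stacked = np.stack([padded[i:i + heatmap.shape[0],
                               j:j + heatmap.shape[1]]
                        for i in range(3) for j in range(3)])
    return stacked.min(axis=0)

hm = np.zeros((7, 7))
hm[3, 3] = 1.0          # single-pixel "noise" spike
hm[0:3, 0:3] = 1.0      # 3x3 contiguous "anomaly" blob
eroded = erode3x3(hm)
print(eroded[3, 3])     # 0.0 -> isolated spike removed
print(eroded[1, 1])     # 1.0 -> core of the blob survives
```

Production code would typically use a library routine (e.g., a grayscale erosion from an image processing package) rather than this hand-rolled loop.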

Another important step in the abnormality scoring was taking the log transformation of the PSE-based scores. This was important to mimic the nonlinear nature of human appraisal of abnormalities. We posit that human judgment of facial deformities is very sensitive to small anomalies, and that this sensitivity gradually diminishes as the abnormality becomes larger and more obvious. In other words, human appraisal of abnormality does not scale linearly with the abnormality area.
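Putting the pixel-wise subtraction, masking, energy averaging, and log transformation together, a hedged sketch of the PSE-based scoring on synthetic arrays might look as follows (the epsilon, mask geometry, and log base are illustrative assumptions, not the paper's exact choices):

```python
import numpy as np

# End-to-end PSE scoring sketch on synthetic luminance arrays. Exact
# constants (mask shape, epsilon, log base) are illustrative
# assumptions rather than the paper's settings.
def pse_score(y_org, y_norm, mask, eps=1e-8):
    diff = (y_org - y_norm) ** 2            # pixel-wise squared-error heatmap
    energy = diff[mask].mean()              # average energy in the oral/nasal mask
    return np.log(energy + eps)             # log transform for perceptual scaling

rng = np.random.default_rng(1)
y_norm = rng.uniform(0, 255, (64, 64))      # stand-in for the normalized face
mask = np.zeros((64, 64), dtype=bool)
mask[40:60, 20:44] = True                   # stand-in for the oral/nasal region

mild   = y_norm.copy(); mild[45:50, 28:36]   += 40.0   # small deviation
severe = y_norm.copy(); severe[40:58, 22:42] += 90.0   # large deviation
print(pse_score(mild, y_norm, mask) < pse_score(severe, y_norm, mask))  # True
```

The log compresses large heatmap energies, so the score grows quickly for small anomalies and saturates for large ones, matching the claimed nonlinearity of human appraisal.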

6 Conclusions

In this work, we proposed a method to normalize and then score anomalous faces in a manner similar to human judgment. Different combinations of face normalization, model adaptation, image processing, and similarity measures were tested. The proposed method demonstrated an improvement in computation speed and in correlation with human scoring. This study also showed that localizing the abnormality regions in a patient's face can be accomplished using the power of Generative Adversarial Networks (GANs). Even for facial details with no representation in the pretrained GAN model, we were able to embed them into the StyleGAN2 model using the classical backpropagation algorithm. This helped us preserve the identity of the patient's face in the generated image while removing the abnormality regions and replacing them with normal facial attributes. Therefore, we were able to generate a heatmap that highlights the possible abnormal facial parts using the pixel-wise difference between the original abnormal face and the normalized one.

It is worth noting that the severity index calculation relies on pixel-wise subtraction, which measures the difference in intensity between the original face and its normalized counterpart. This means that some semantic image differences may go undetected, which may affect the overall scoring in some cases. Also, the evaluation was done using 125 human-rated facial images; this number can be increased to confirm the performance improvements on more diverse cases. Furthermore, a large number of synthetic cleft faces with varying levels of severity could be obtained by building a cleft face generator from the available real cleft faces. The number of adaptation steps is another parameter that can be optimized in future work to generate a new anomaly index for facial deformities. Finally, new deep learning models can be leveraged in the future to improve the performance of the anomaly detection problem introduced in this work [64–70].

Supporting information

S1 Appendix. The StyleGAN2 generator architecture.

This appendix provides a general overview about the StyleGAN2 architecture and its chosen design parameters.

https://doi.org/10.1371/journal.pone.0288228.s001

(PDF)

References

1. Mai CT, Isenburg JL, Canfield MA, Meyer RE, Correa A, Alverson CJ, et al. National population-based estimates for major birth defects, 2010–2014. Birth Defects Research. 2019;111(18):1420–1435. pmid:31580536
2. Demir T, Karacetin G, Baghaki S, Aydin Y. Psychiatric assessment of children with nonsyndromic cleft lip and palate. General Hospital Psychiatry. 2011;33(6):594–603. pmid:21816483
3. Hunt O, Burden D, Hepper P, Johnston C. The psychosocial effects of cleft lip and palate: a systematic review. European Journal of Orthodontics. 2005;27(3):274–285. pmid:15947228
4. Zaidel DW, Aarde SM, Baig K. Appearance of symmetry, beauty, and health in human faces. Brain and Cognition. 2005;57(3):261–263. pmid:15780460
5. Gunes H, Piccardi M. Assessing facial beauty through proportion analysis by image processing and supervised learning. International Journal of Human-Computer Studies. 2006;64(12):1184–1199.
6. Rhodes G, Proffitt F, Grady JM, Sumich A. Facial symmetry and the perception of beauty. Psychonomic Bulletin & Review. 1998;5:659–669.
7. Rhodes G, Yoshikawa S, Clark A, Lee K, McKay R, Akamatsu S. Attractiveness of facial averageness and symmetry in non-Western cultures: In search of biologically based standards of beauty. Perception. 2001;30(5):611–625. pmid:11430245
8. Mosmuller DG, Mennes LM, Prahl C, Kramer GJ, Disse MA, Van Couwelaar GM, et al. The development of the cleft aesthetic rating scale: a new rating scale for the assessment of nasolabial appearance in complete unilateral cleft lip and palate patients. The Cleft Palate-Craniofacial Journal. 2017;54(5):555–561. pmid:27537493
9. Jack RE, Schyns PG. The human face as a dynamic tool for social communication. Current Biology. 2015;25(14):R621–R634. pmid:26196493
10. Adolphs R. Perception and emotion: How we recognize facial expressions. Current Directions in Psychological Science. 2006;15(5):222–226.
11. Hassin R, Trope Y. Facing faces: studies on the cognitive aspects of physiognomy. Journal of Personality and Social Psychology. 2000;78(5):837. pmid:10821193
12. Albright L, Kenny DA, Malloy TE. Consensus in personality judgments at zero acquaintance. Journal of Personality and Social Psychology. 1988;55(3):387. pmid:3171912
13. Wong BJ, Karimi K, Devcic Z, McLaren CE, Chen WP. Evolving attractive faces using morphing technology and a genetic algorithm: a new approach to determining ideal facial aesthetics. The Laryngoscope. 2008;118(6):962–974. pmid:18401273
14. Ishii LE. Moving toward objective measurement of facial deformities: exploring a third domain of social perception. JAMA Facial Plastic Surgery. 2015;17(3):189–190. pmid:25790227
15. Ishii L, Dey J, Boahene KD, Byrne PJ, Ishii M. The social distraction of facial paralysis: objective measurement of social attention using eye-tracking. The Laryngoscope. 2016;126(2):334–339. pmid:26608714
16. Ishii L, Carey J, Byrne P, Zee DS, Ishii M. Measuring attentional bias to peripheral facial deformities. The Laryngoscope. 2009;119(3):459–465. pmid:19235748
17. Boonipat T, Brazile TL, Darwish OA, Montana P, Fleming KK, Stotland MA. Measuring visual attention to faces with cleft deformity. Journal of Plastic, Reconstructive & Aesthetic Surgery. 2019;72(6):982–989. pmid:30598394
18. Parmar DN, Mehta BB. Face recognition methods & applications. arXiv preprint arXiv:1403.0485. 2014.
19. Farkas LG, Katic MJ, Forrest CR. International anthropometric study of facial morphology in various ethnic groups/races. Journal of Craniofacial Surgery. 2005;16(4):615–646. pmid:16077306
20. Sinko K, Jagsch R, Prechtl V, Watzinger F, Hollmann K, Baumann A. Evaluation of esthetic, functional, and quality-of-life outcome in adult cleft lip and palate patients. The Cleft Palate-Craniofacial Journal. 2005;42(4):355–361. pmid:16001915
21. Carruthers J, Flynn TC, Geister TL, Görtelmeyer R, Hardas B, Himmrich S, et al. Validated assessment scales for the mid face. Dermatologic Surgery. 2012;38(2 Pt 2):320–332. pmid:22316188
22. Edler R, Rahim MA, Wertheim D, Greenhill D. The use of facial anthropometrics in aesthetic assessment. The Cleft Palate-Craniofacial Journal. 2010;47(1):48–57. pmid:20078203
23. Mercan E, Oestreich M, Fisher DM, Allori AC, Beals SP, Samson TD, et al. Objective assessment of the unilateral cleft lip nasal deformity using 3D stereophotogrammetry: severity and outcome. Plastic and Reconstructive Surgery. 2018;141(4):547e. pmid:29257001
24. Tse RW, Oh E, Gruss JS, Hopper RA, Birgfeld CB. Crowdsourcing as a novel method to evaluate aesthetic outcomes of treatment for unilateral cleft lip. Plastic and Reconstructive Surgery. 2016;138(4):864–874. pmid:27673519
25. Rhee JS, McMullin BT. Outcome measures in facial plastic surgery: patient-reported and clinical efficacy measures. Archives of Facial Plastic Surgery. 2008;10(3):194–207. pmid:18490547
26. Klassen AF, Cano SJ, Scott A, Snell L, Pusic AL. Measuring patient-reported outcomes in facial aesthetic patients: development of the FACE-Q. Facial Plastic Surgery. 2010;26(04):303–309. pmid:20665408
27. Meyer-Marcotty P, Gerdes AB, Stellzig-Eisenhauer A, Alpers GW. Visual face perception of adults with unilateral cleft lip and palate in comparison to controls—an eye-tracking study. The Cleft Palate-Craniofacial Journal. 2011;48(2):210–216. pmid:20536370
28. Karras T, Laine S, Aila T. A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2019. p. 4401–4410.
29. Boyaci O, Serpedin E, Stotland MA. Personalized quantification of facial normality: a machine learning approach. Scientific Reports. 2020;10(1):1–19. pmid:33288815
30. Karras T, Aittala M, Hellsten J, Laine S, Lehtinen J, Aila T. Training generative adversarial networks with limited data. Advances in Neural Information Processing Systems. 2020;33:12104–12114.
31. Jiao H, Xu Z, Wu L, Cheng Z, Ji X, Zhong H, et al. Detection of airway anomalies in pediatric patients with cardiovascular anomalies with low dose prospective ECG-gated dual-source CT. PLoS ONE. 2013;8(12):e82826. pmid:24324836
32. Nakao T, Hanaoka S, Nomura Y, Murata M, Takenaga T, Miki S, et al. Unsupervised deep anomaly detection in chest radiographs. Journal of Digital Imaging. 2021;34(2):418–427. pmid:33555397
33. Lim SK, Loo Y, Tran NT, Cheung NM, Roig G, Elovici Y. DOPING: Generative data augmentation for unsupervised anomaly detection with GAN. In: 2018 IEEE International Conference on Data Mining (ICDM). IEEE; 2018. p. 1122–1127.
34. Erfani SM, Rajasegarar S, Karunasekera S, Leckie C. High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning. Pattern Recognition. 2016;58:121–134.
35. Asuncion A, Newman D. UCI machine learning repository; 2007.
36. Venhuizen FG, van Ginneken B, Bloemen B, van Grinsven MJ, Philipsen R, Hoyng C, et al. Automated age-related macular degeneration classification in OCT using unsupervised feature learning. In: Medical Imaging 2015: Computer-Aided Diagnosis. vol. 9414. SPIE; 2015. p. 391–397.
37. Seeböck P, Waldstein S, Klimscha S, Gerendas BS, Donner R, Schlegl T, et al. Identifying and categorizing anomalies in retinal imaging data. arXiv preprint arXiv:1612.00686. 2016.
38. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial networks. Communications of the ACM. 2020;63(11):139–144.
39. Schlegl T, Seeböck P, Waldstein SM, Schmidt-Erfurth U, Langs G. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In: International Conference on Information Processing in Medical Imaging. Springer; 2017. p. 146–157.
40. Viola P, Jones M. Rapid object detection using a boosted cascade of simple features. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001. vol. 1. IEEE; 2001. p. I–I.
41. Davies ER. Computer and Machine Vision: Theory, Algorithms, Practicalities. Academic Press; 2012.
42. Zhang R, Isola P, Efros AA, Shechtman E, Wang O. The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. p. 586–595.
43. Koschan A, Abidi M. Digital Color Image Processing. John Wiley & Sons; 2008.
44. Chai D, Bouzerdoum A. A Bayesian approach to skin color classification in YCbCr color space. In: 2000 TENCON Proceedings. Intelligent Systems and Technologies for the New Millennium (Cat. No. 00CH37119). vol. 2. IEEE; 2000. p. 421–424.
45. Horn RA. The Hadamard product. In: Proc. Symp. Appl. Math. vol. 40; 1990. p. 87–169.
46. Serra J, Soille P. Mathematical Morphology and Its Applications to Image Processing. vol. 2. Springer Science & Business Media; 2012.
47. Wang Z, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing. 2004;13(4):600–612. pmid:15376593
48. Karras T, Laine S, Aittala M, Hellsten J, Lehtinen J, Aila T. Analyzing and improving the image quality of StyleGAN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020. p. 8110–8119.
49. Abdal R, Qin Y, Wonka P. Image2StyleGAN: How to embed images into the StyleGAN latent space? In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2019. p. 4432–4441.
50. Abdal R, Qin Y, Wonka P. Image2StyleGAN++: How to edit the embedded images? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020. p. 8296–8305.
51. Lipton ZC, Tripathi S. Precise recovery of latent vectors from generative adversarial networks. arXiv preprint arXiv:1702.04782. 2017.
52. Luo J, Xu Y, Tang C, Lv J. Learning inverse mapping by autoencoder based generative adversarial nets. In: International Conference on Neural Information Processing. Springer; 2017. p. 207–216.
53. Guan S, Tai Y, Ni B, Zhu F, Huang F, Yang X. Collaborative learning for faster StyleGAN embedding. arXiv preprint arXiv:2007.01758. 2020.
54. Perarnau G, Van De Weijer J, Raducanu B, Álvarez JM. Invertible conditional GANs for image editing. arXiv preprint arXiv:1611.06355. 2016.
55. Zhu J, Shen Y, Zhao D, Zhou B. In-domain GAN inversion for real image editing. In: European Conference on Computer Vision. Springer; 2020. p. 592–608.
56. Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature. 1986;323(6088):533–536.
57. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. 2014.
58. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Communications of the ACM. 2017;60(6):84–90.
59. Iandola FN, Han S, Moskewicz MW, Ashraf K, Dally WJ, Keutzer K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360. 2016.
60. Wang Y, Wu C, Herranz L, van de Weijer J, Gonzalez-Garcia A, Raducanu B. Transferring GANs: generating images from limited data. In: Proceedings of the European Conference on Computer Vision (ECCV); 2018. p. 218–234.
61. Mo S, Cho M, Shin J. Freeze the discriminator: a simple baseline for fine-tuning GANs. arXiv preprint arXiv:2002.10964. 2020.
62. Noguchi A, Harada T. Image generation from small datasets via batch statistics adaptation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2019. p. 2750–2758.
63. Tan W, Wen B, Chen C, Zeng Z, Yang X. Systematic analysis of circular artifacts for StyleGAN. In: 2021 IEEE International Conference on Image Processing (ICIP). IEEE; 2021. p. 3902–3906.
64. Zhou P, Xie L, Ni B, Tian Q. CIPS-3D: A 3D-aware generator of GANs based on conditionally-independent pixel synthesis. arXiv preprint arXiv:2110.09788. 2021.
65. Zhao L, Zhang Z, Chen T, Metaxas D, Zhang H. Improved transformer for high-resolution GANs. Advances in Neural Information Processing Systems. 2021;34:18367–18380.
66. Lin CH, Chang CC, Chen YS, Juan DC, Wei W, Chen HT. COCO-GAN: Generation by parts via conditional coordinating. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2019. p. 4512–4521.
67. Skorokhodov I, Ignatyev S, Elhoseiny M. Adversarial generation of continuous images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021. p. 10753–10764.
68. Karnewar A, Wang O. MSG-GAN: Multi-scale gradients for generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020. p. 7799–7808.
69. Ruan D, Song X, Gühmann C, Yan J. Collaborative optimization of CNN and GAN for bearing fault diagnosis under unbalanced datasets. Lubricants. 2021;9(10):105.
70. Ruan D, Chen X, Gühmann C, Yan J. Improvement of generative adversarial network and its application in bearing fault diagnosis: A review. Lubricants. 2023;11(2):74.