Restoration and content analysis of ancient manuscripts via color space based segmentation

Ancient manuscripts are a rich source of history and civilization. Unfortunately, these documents are often affected by different age and storage related degradation which impinge on their readability and information contents. In this paper, we propose a document restoration method that removes the unwanted interfering degradation patterns from color ancient manuscripts. We exploit different color spaces to highlight the spectral differences in various layers of information usually present in these documents. At each image pixel, the spectral representations of all color spaces are stacked to form a feature vector. PCA is applied to the whole data cube to eliminate correlation of the color planes and enhance separation among the patterns. The reduced data cube, along with the pixel spatial information, is used to perform a pixel based segmentation, where each cluster represents a class of pixels that share similar color properties in the decorrelated color spaces. The interfering, unwanted classes can thus be removed by inpainting their pixels with the background texture. Assuming Gaussian distributions for the various classes, a Gaussian Mixture Model (GMM) is estimated through the Expectation Maximization (EM) algorithm from the data, and then used to find appropriate labels for each pixel. In order to preserve the original appearance of the document and reproduce the background texture, the detected degraded pixels are replaced based on Gaussian conditional simulation, according to the surrounding context. Experiments are shown on manuscripts affected by different kinds of degradations, including manuscripts from the DIBCO 2018 and 2019 publicaly available dataset. We observe that the use of a few PCA dominant components accelerates the clustering process and provides a more accurate segmentation.

Introduction Ancient, historical manuscripts consist of a number of different patterns, or layers of information. Besides the main text and the paper texture, they may also contain other informative features, such as annotations, miniatures, stamps, or non-informative interference's due to damages caused by storage or other manhandling with the passage of time. Mostly, these damages appear as spots of humidity and molds, or ink seeped from the reverse side, impairing the main text.
An important aim of digital image processing techniques applied to the available digital copies of ancient manuscripts is to provide the scholars with versions that can help them in reading and interpretation of the main contents. Therefore, the main theme of virtual restoration is to put back the manuscripts to their original appearance, by eliminating only the degradation without destroying the other informative features. In this sense, the plurality of manuscript content should be analyzed and discriminated, in such a way to preserve or even highlight the useful patterns and remove the extra useless patterns that can disturb or even make impossible the interpretation or reading of document contents.
Another goal of digital manuscript restoration is to prepare them for computer based automatic word spotting and/or character recognition. In this case, binarization is usually performed as a first step to extract the foreground text against all other features considered as complex background or noise. A wide interest exists about degraded document binarization [1], and a variety of methods have been proposed. Among those, local and adaptive thresholding, or recurrent, convolutional or deep neural networks have been shown to cope, to some extent, with degradations such as uneven illumination, image contrast variation, changes in stroke width and connection, faded ink of faint characters [2][3][4][5]. To improve binarization, noise filtering and contrast enhancement are also often applied as a pre-processing step [6].
Thus, in general, binarization is also seen as a way to restore degraded documents. Such an approach is however far away from the concept of virtual restoration, where the main focus is restore the aesthetic look of the document.
The bleed-through distortion is perhaps an exception in this panorama, as its removal has raised a certain interest individually and outside the binarization context. Indeed, strong bleed-through that often occur in historical manuscripts cannot be normally removed by binarization alone. They mainly fail due to the significant overlap of bleed-through with the foreground text and the wide variation of its extent and intensity. Methods specifically designed for bleed-through reduction have been proposed, e.g., in [7], where a recursive unsupervised segmentation is suggested; and in [8], where a conditional random field (CRF) using intensity distribution as prior is presented.
A peculiarity of the bleed-through removal problem is that extra information can be exploited, for example, the verso side information of the manuscript. This allows for the design of algorithms that are able to selectively remove the unwanted interference alone, leaving the rest of the manuscript content unaltered. Examples of such techniques can be found in [9], where a classification is performed by segmenting the recto-verso joint histogram with the aid of ground truths. Also, in [10], where a dual-layer Markov Random Field (MRF) prior is combined with a data term derived from user-labeled pixels, and in [11], where correlated component analysis is used to separate the information layers. In [12], the fidelity of the restored manuscript to the original one is further improved by inpainting the bleed-through pixels, detected by the modelbased method in [13], through sparse image representation and dictionary learning.
The performance of the recto-verso based methods depends, however, on the alignment of the two images. Accurate registration is difficult to achieve in this specific case due to document skews, different image resolutions, or wrapped pages when scanning books. Thus, dedicated recto-verso registration algorithms have to be designed. Some joint registration and restoration methods have been proposed in [14][15][16][17].
Although the problem of removing severe bleed-through while preserving the useful content is not yet fully resolved, some of the methods mentioned above can be used for virtual restoration of complex manuscripts.
In this paper, we extend the concept of virtual restoration of a manuscript to the removal of degradation patterns of any type, such as spots, bleed-through, non-uniform illuminations, etc. The proposed method uses single-sided RGB manuscripts, thus avoiding the need for recto-verso alignment, and adopt the approach of performing an analysis of their content, that is of individually detecting and locating their constituent patterns. Virtual restoration is then obtained by inpainting the undesired interfering content with the background texture. It is straightforward to see that such an approach can also serve to facilitate the execution of other tasks besides virtual restoration, for instance, document binarization or text extraction, and geometrical and logical page layout analysis.
The remainder of the paper is organized as follows. the next section briefly presents image segmentation using color information and GMM for data clustering. Then, we present the proposed color based segmentation method and discuss the Gaussian texture inpainting in the next section. A set of experimental results and their qualitative evaluation is presented to validate our claim. Concluding remarks are given in the last section.

Background
In ancient manuscripts acquired in RGB modality, there is strong spectral correlation in patterns associated to same components and high color contrast among different components. This correlation and contrast in spectral information made them perceptually well distinguishable. In fact, some distinctive components, such as initials and stamps, have originally colored in such a way that contrast with the main text or writings. Moreover, the spectral or color homogeneity in components (i.e., text, background, spots, bleed-through, graphical elements etc.) is a distinctive feature of manuscripts with respect to natural scenes. Hence, it is reasonable to assume that clusters of pixels having similar spectral responses might correspond to the same components, and that a color clustering technique can be able to segment the manuscript in its constituent components.

Color space based segmentation
Color space based segmentation is a critical problem in image analysis and computer vision, with a wide variety of applications. The clustering in these applications depends on the used color space and the homogeneity criterion is based on features derived from the spectral components. These color features, derived from different color spaces, contain significant amount of discriminative information. By definition, a color space is a tool to visualize, create, and specify the color [18]. The choice of the color space for image segmentation is delicate: several color spaces, such as RGB, HSI, and CIELuv have been used, but none of them is optimally suited in the same way for all kinds of images and applications [19].
The most commonly used color space is the RGB color space. Other color spaces are usually derived from the RGB color space by means of either linear or nonlinear transformations [20]. However, several studies have shown that the RGB representation is not optimal for segmentation purposes. Indeed, it has some limitations in conveying the information, since its R, G and B channels are highly correlated, and it fails to mimic the color perception of the human visual system, where the color is better represented in terms of hue, saturation, and intensity [20]. The HSI space, obtained from RGB, overcomes this problem, but both RGB and HSI are not perceptually uniform, meaning that the difference in color perceived by the human eye is not represented by similar distance in those spaces. The CIE color spaces (CIELuv, CIELab) introduce a uniform metric for assessing perceptual differences among colors using the Euclidean distance.
The performances of different color spaces for segmentation purposes have been widely compared in the literature, concluding that no single color space performs in the same way for all kinds of images or segmentation methods [19]. In [21] the CMY color space is suggested to perform at best for segmentation based on clustering. A comparison of ten color spaces for skin detection and segmentation is presented in [22], concluding that the HSV color space produces the best results. Similarly, in [23], the HSV color space is recommended for crop image segmentation. Based on spectral color analysis, in [20] the color space is selected according to the task at hand. A color quantization clustering method, using eigenvalues of the color covariance matrix, is suggested in [24]. Recently, different combinations of color spaces have been evaluated in [25] for segmentation tasks using different databases. They concluded that a combination of different color spaces can be used to improve segmentation results and certain classes of objects are better represented by different color spaces.

Gaussian mixture model
In the last decade, mixture models using probability density functions (pdf) of pixel attributes emerged as a powerful tool for data clustering [26]. Assuming that each cluster is derived by a single pdf, these methods automatically provide data clusters on the basis of the mixture components that generate them [27]. In particular, Gaussian mixture models (GMM) are very flexible in capturing the distribution of different feature spaces, and have been specifically suggested for color image segmentation [28]. Furthermore, the mixture parameters with Gaussian components can be efficiently estimated through maximum likelihood estimation using the well established expectation maximization (EM) algorithm. For better results, these methods may also incorporate pixel location information to account for spatial smoothness, with the assumption that nearby pixels belong to the same cluster [29]. In mixture models, image segmentation is framed as a hidden variable problem, where the hidden class labels determine which component generates a specific image pixel.

Our contribution
Based on the discussion above, in this paper we propose to perform color image segmentation through GMMs in conjunction with multiple color space representations, to analyze the content of ancient degraded manuscripts. Unlike the binary restoration, the main focus is to restore the aesthetic look of the manuscript, which has its own importance in processing ancient documents. We thus combine three different color spaces information to create a feature space that is able to capture all the necessary information to discriminate the classes (foreground text, background medium, and a number of other different information/degradation patterns), by highlighting the differences, even slight, in their spectral responses. More specifically, we associate each pixel with its representations in the RGB, CIELUV and CIELAB color spaces, along with its spatial location in the image. The spatial smoothness constraint enforced by pixel spatial information is particularly suited to describe the homogeneity of color usually observed in typical manuscript patterns mentioned above.
In case of our feature space, the color vector coordinates can be considered as a realization of a multivariate Gaussian law and the color image as a realization of a random field, with each random variable described by a GMM. A GMM based clustering can be used for pixel based classification. In order to improve and speed up GMM clustering, we perform principal component analysis (PCA) of the initial data space, to decorrelate it and to reduce its dimension without loosing information. In [30], it has been shown that PCA is particularly beneficial for K-means clustering by eliminating associations between data and improves the quality of segmentation. The PCA data reduction allows to speed up the training process of the GMM model, with a consequent significant reduction of the computational time. In our specific case, we have experimented that a number of 3 dominant PCA components are sufficient to capture the essential information and organize it in a more coherent way. The GMM is estimated through the EM algorithm based on the reduced feature space. Each pixel is labeled as belonging to the class associated to the Gaussian component that returns the highest probability for the feature vector of that pixel.
After segmentation, a virtually restored image of the manuscript with all its informative content is generated by selectively replacing the detected degradation pixels with with an appropriate fill-in pixels that reproduce the textured background. The inpainting is performed via Gaussian conditional simulation [31], accounting for the surrounding context to preserve the aesthetic look of the manuscript.
Experimental results show that the proposed method can be satisfactory used to remove the interference commonly found in ancient manuscripts, and to extract typical salient features. We also test the method also on some images from the DIBCO 2018 data set [32].

The proposed method
Each pixel i in a color image can be defined by a vector , for the RGB color space, or N > >3 in multispectral or hyperspectral images. For an N-color image with m rows and n columns there will be M = m × n such vectors, or equivalently N spectral planes of size m × n. The mean vector of all image vectors can be calculated as: The covariance matrix of D can be approximated as To extract the decorrelated spectral bands, one can use PCA. The PCA is based on the eigenvalue decomposition of the covariance matrix, given as: The linear transformation defined by y i = V T D i (i = 1, 2, . . ., M) is the PCA pixel vector, and all these pixel vectors form the PCA bands of the input image. Let the eigenvalues and eigenvectors be arranged in descending order so that λ 1 > λ 2 > . . . > λ N , thus the first K (usually K �N) rows of the matrix V T , namely the first K eigenvectors, can be used to find an approximation of the input image. Such PCA bands have the highest contrast or variance in the first band and the lowest contrast or variance in the last band. Therefore, the first K PCA bands contain the majority of information residing in the input images and can be used for more effective and accurate analyses because the number of image bands and the amount of image noise involved are reduced.
In the proposed method, we first transform the original R, G and B features of the color manuscript image I into a hybrid color space, including RGB, CIELab and CIELuv. In the color feature set, we retain only one L channel, since it is same in the CIELab and CIELuv color spaces, thus obtaining the eight distinct color features R, G, B, L, U, V, A and B. All the color bands are stacked together to form a multispectral image cube χ. Each pixel in this cube is a feature vector containing eight color values, from three color spaces. Let be the i th pixel vector in our data cube, where d F represents the scalar value observed in the F color space. The constructed feature cube is used to find the dominant PCA bands. These PCA bands are mutually orthogonal and their covariance matrix takes the form The PCA transformation of the original feature cube χ remove the correlation among the bands and facilitate the classification by identifying the optimum linear combination of the spectral bands accounting for the variation of pixel values. We select the first 3 principal components (PC) and pass these PCs along with the pixel spatial location information to the pixel classification step, as a feature space S.

Pixel segmentation
Consider the extracted PCA image I K , K = 1, 2, 3 as an unlabeled random sample, with P pixels, I = (i 1 , i 2 , . . ., I p ), drawn from an independent and identical distribution (i.i.d). In this paper, unlabeled sample means that for any pixel i p 2 I K the true sub-population to which i p belongs is not known. Our main goal is to make accurate statistical inferences on properties of the subpopulations by using only the information of the unlabeled observed image I K .
In general, the classification problem consists of finding C classes C 1 , C 2 , . . ., C L of pixels whose feature vectors satisfy a given similarity criterion, such that every i p , p = 1, 2, .., P, belongs to one of these classes and no i p belongs to two classes at the same time, that is S N n¼1 C n ¼ I and C l T C j = ; 8 l 6 ¼ j. One popular way to perform this classification is to model the feature space S as a realization of a random field, each random variable described by a Gaussian Mixture Model (GMM) [33]. More precisely, each i p 2 S is supposed to be drawn from a Gaussian mixture of multivariate normal distributions given as where Θ = {(α j , θ j )} j = 1,2, . . ., L , G j ðijy j Þ is a multivariate Gaussian probability density function with parameters θ j = (μ j , Λ j ) representing the mean vector and the co-variance matrix, respectively, and α j 2 is the mixing weights, satisfying P L j¼1 a j ¼ 1. Assuming that each image pixel is actually drawn from one of the L Gaussian components in the mixture of Eq (1), then the classification problem can be reformulated as that of finding the parameters of the L Gaussian densities of Eq (1) that best describe L homogeneous partitions of the pixels. In other words, for each i.i.d observation i p , generated according to Eq (1), we assume a hidden class label c l , l 2 [1, 2, . . ., L], indicating the density distribution in the mixture that generates it, and estimate as the maximizer of the following distribution given by the Bayes rule and representing the posterior probability that i p belongs to class C j . To estimate the unknown parameter set Θ, i.e. θ j = (μ j , Λ j ) and class probability p(c j ) = α j , j = 1, 2, . . ., L, associated with each cluster, along with probabilities (2) for each observation, we use the EM algorithm. The strength of EM algorithms is based on the concept of incomplete data and complete data, which allows to simplify parameter estimation through Maximum Likelihood. In our setting, the feature space S is considered as incomplete data, whereas (S, p(c j |i p ), j = 1, 2, . . ., L, p = 1, 2, . . ., P) represents the associated complete data. The likelihood of the observed incomplete data is marginal over the hidden class labels of the joint distribution of the complete data.
EM is an iterative algorithm that at each iteration alternates between two steps, until convergence. In the first step, the E-step, by virtue of the Jensen inequality, the log likelihood of the incomplete data is equivalently substituted with the marginal over the hidden class labels of the log joint distribution of complete data. This expectation is computed conditioned on the observed data S and the current estimates of the parameters Θ. In the M-step, the expectation is maximized to obtain an update of the parameters.
In this specific application, the E-step, given the current parameters Θ itr , returns an estimate of the posterior distribution of Eq 2, in the following form where p(i p |c j ) itr is the j−th multivariate Gaussian distribution with parameters y itr j computed for i p , given as pði p jc j Þ itr ¼ 1 ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi In the M-step, based on the estimated p(c j |i p ) itr , the parameters are updated as follows The stopping criterion for the EM algorithm is set to either a maximum number of iterations or to the difference between two successive iterations, i.e., ||θ itr+1 −θ itr || � τ, where τ is a small constant.
EM based methods require the initialization of the parameters. In many cases, a random initialization is suggested but, depending on data, highly time consuming. in the proposed method, we used the K−means+ + algorithm [34] to estimate the initial values of the GMM parameters θ. The number of clusters L is set in the start and optimally adjusted according to the pixel density, i.e., clusters with small number of pixels are merged with the nearest cluster. For large size images, an optimized version of the EM algorithm presented in [35] is used to speed up the clustering process.
After optimizing the parameters of the GMM and determining the posterior probabilities p (c|i) of Eq (2), the maximizers of such probabilities are used to assign a label c l , l 2 [1, 2, . . ., L] to each pixel i p in the RGB image I.
With successful classification, the user can decide what class of pixels (clean) to keep unaltered in the original RGB image I, and what class of pixels (degradation) to discard, i.e. to inpaint with the background class.

Gaussian texture inpainting
After classification, the pixels belonging to the degradation class are removed and replaced by a befitting value simulating the background. The degraded pixels are treated as missing or corrupted image regions, and replaced with pertinent values that are consistent with the known non-corrupted surrounding regions. In traditional ancient document restoration methods, the noisy pixels are either replaced by a constant value or a random pixel value is assigned from the neighborhood, like in [8]. Unfortunately, in case of ancient documents with textured background, these generic fill-in creates visible artifacts and garbles the original aesthetic look of the document.
Keeping this in mind, we perform texture inpainting using Gaussian conditional simulation [31], inferring the surrounding context to preserve the natural appearance of the document. This Gaussian conditional inpainting is especially efficient to inpaint textured content, as is the case in many ancient document backgrounds. The main idea here is to fill the missing region with a conditional sample of a Gaussian model. For our application, a Gaussian texture model is estimated using an exemplar background mask. This exemplar mask is obtained randomly from the background class, as extracted in the previous section. The Gaussian model is then conditionally sampled to gradually replace the degraded pixels.
Let I 2 R 2 is the input image to inpaint and M 2 I represents the indices of missing pixels, i.e., pixels values of I are known except the missing region M. Let z represent the outer border of the missing region M, with width w pixels as shown in Fig 1. The goal here is to estimate the pixel values in M, given a conditioning set z. The Gaussian model is then conditionally sampled to gradually replace the labelled bleed-through pixels, using the available neighboring pixels on the border. The conditional sample is obtained by combining an innovation component derived from an independent realization of the Gaussian model and a kriging component derived from the conditioning values [31]. The latter extends the longrange correlations and the former adds texture details, in a way that preserves the global covariance of the model. A detailed description of the Gaussian conditional inpainting can be found in [31]. Although designed to model microtextures, this algorithm is able to fill both small and large holes, whatever the regularity of the boundary. On the other hand, typical document backgrounds, representing the motif of the paper fiber, seem to fit well a homogeneous microtexture model.

Experimental results
A set of experimental results on color ancient document images are presented to evaluate the performance of the proposed method. We compare our results with a recently published bleed-through removal method presented in [8], which is one of the most impairing degradations in ancient manuscripts. For this comparison, we use images from the well-known database of ancient documents presented in [36] and available online. This database contains 25 pairs of recto-verso images of ancient manuscripts affected by different levels of bleed-through, along with manually created ground truth binary images of the foreground text. It is worth mentioning that, while this database mainly focuses on bleed-through effects, our method can be used to remove also other document degradations, such as stains, folding marks, etc.
For the proposed method, the transformation from RGB to CIELab and CIELuv color spaces is performed using MATLAB built-in functions. We used the first 3 principal components after PCA reduction. The clustering algorithm is initiated with four classes, L = 4, to account for up to four different colors in the manuscript. The GMM is initialized by using the K-means function available in MATLAB. We use five iterations for the EM algorithm to find the clusters, and the Gaussian texture inpainting is initiated with pixels from the background class. The presented visual results for the method in [8] are provided by the authors.
Quantitative evaluation of ancient document restoration is a non trivial task. Generally, the efficacy is evaluated qualitatively, as in most cases the original clean image is not available. Fig 2 shows a visual comparison of images restored by [8] and the proposed method. Note that, although our results are in color, we show them in grayscale for a fair comparison with the grayscale results of [8]. Since we are interested in preserving the original look of the document, the efficiency of the inpainting reproducing the paper texture in the background has to be evaluated as an index of the restoration quality. As can be seen, the proposed method (Fig 2  (c)) produces comparatively better results in this respect. It efficiently removes the bleedthrough degradation, leaving intact the foreground text, and reproducing well the background texture. The method proposed in [8] (Fig 2(b)) produces comparative results, but some strokes of the foreground text are missing and some bleed-through strokes are still visible. They also replaced the removed bleed-through pixels with a random pixel which destroy the background texture.   of clusters from four to three, according to the number of different patterns in the images. For the first and third column, the use of four classes helped in discriminating the four different information layers of foreground text, background, degradation, and the annotations (stamps).
An example of restored color images from the DIBCO-2019 dataset, publicly available at [37] is presented in Fig 4. It is interesting to note how the proposed method cope with different kind of document degradation, including bleed-through, stain and see-through. In addition, the background texture in all different cases are preserved to restored the aesthetic look of the document.

Computational complexity
Computational complexity is an important aspect to reflect the amount of time and memory required for an algorithm to execute as function of input data size. The most common way to represent the computational complexity of an algorithm is the big O notation. This provides an upper bound on the worst-case running time of an algorithm and estimate the complexity in terms of the input data size. The dominant and most repeatedly used operations in an algorithm are used to calculate the running time. The computation time of the proposed algorithm mainly depends on the repetitions of EM algorithm for the GMM parameters estimation and the Gaussian texture inapianting step. Generally, the computational complexity of EM algorithm for GMM parameter estimation is considered as OðTK 2n Þ, where T is the number of iterations, K is the mixture components, and n represent number of data points. For color images, as in our case, the computational complexity will be OðTK 2nD Þ, where D is the data dimension. As discussed in the experimental section, we used four classes (K = 4) and five iterations (T = 5) of EM algorithm to find the clusters. We used color images and their transformed version CIELab and CIELuv, resulting in a hybrid color space with eight distinct color features (D = 8). The total computational complexity of the proposed algorithm is Oð5 � 4 16n Þ, for a n pixels image.
In practice, on a normal desktop computer (core i5, 4 GB RAM, Windows 10), the proposed algorithm will take around 1.4 to 2 minutes for a color image of size 256 × 256 and 512 × 512, using the MATLAB-2018 implementation.

Conclusion
This paper presents a degradation removal method for color ancient document images. A feature space is created using the spectral information from different color space representations of the input color image. The PCA is used to obtained an uncorrelated reduced feature space that is then passed to a GMM based classification method, along with the pixel spatial information. The pixels are classified into foreground, background, and noise, i.e. the unwanted, interfering degradation. A texture inpainting method based on Gaussian conditional simulation is then used to replace the detected degraded pixels. A set of color manuscripts, effected by different kinds of degradation, is presented to validate the proposed method. The method can be generalized to perform a selective extraction of the different layers of information usually present in the documents (stamps, pencil annotations, decorations, etc.), in order to facilitate page layout analysis. Future studies will regard the robustness of our multiple color space representation against a higher number of patterns present in the manuscripts.