Automated color detection in orchids using color labels and deep learning

The color of particular parts of a flower is often employed as one of the features to differentiate between flower types. Thus, color is also used in flower-image classification. Color labels, such as ‘green’, ‘red’, and ‘yellow’, are used by taxonomists and lay people alike to describe the color of plants. Flower image datasets usually consist only of images and do not contain flower descriptions. In this research, we have built a flower-image dataset, focused on orchid species, which consists of human-friendly textual descriptions of features of specific flowers, on the one hand, and digital photographs indicating what a flower looks like, on the other hand. Using this dataset, a new automated color detection model was developed. It is the first research of its kind using color labels and deep learning for color detection in flower recognition. As deep learning often excels in pattern recognition in digital images, we applied transfer learning with various amounts of unfreezing of layers with five different neural network architectures (VGG16, Inception, Resnet50, Xception, Nasnet) to determine which architecture and which scheme of transfer learning performs best. In addition, various color scheme scenarios were tested, including the use of primary and secondary color together, and the effectiveness of dealing with multi-class classification using multi-class, combined binary, and, finally, ensemble classifiers was studied. The best overall performance was achieved by the ensemble classifier. The results show that the proposed method can detect the color of flower and labellum very well without having to perform image segmentation. The result of this study can act as a foundation for the development of an image-based plant recognition system that is able to offer an explanation of a provided classification.


Introduction
Identifying a plant is not an easy task, not even for the expert. There are many features of plants that play a role in this task. Color is often used as one of the more important features in flower recognition using image processing [1]. This feature also appears in descriptions used in identification keys, i.e., structured features to identify a species [2]. In this paper we investigate the role color can play in identification, where digital photographs in conjunction with descriptions of orchids are used as the experimental domain. Although color is just one of the features deployed in orchid identification, it is not known how hard it is to detect the colors of an orchid in an image. This is the research question which this paper aims to answer. Our future plan is to build an explainable flower identification computer-based system in which color is one of the features used to support the explanation of why a particular species is most likely.
The colors of some parts of an orchid (the sepals, the petals, and the lip, as shown in Fig 1) are of particular value when identifying a plant. In this research, the sepals and petals together are called the flower, while the lip, which acts as a landing platform for insects, is called the labellum. It may be expected that the difficulty of color detection is not the same for the flower and labellum, because of differences in shape and size. Automatic color detection is not straightforward, as both flower and labellum may contain multiple colors, e.g. red and green, which increases the size of the search space, while only a limited number of photographic images of orchids, of varying image quality, is available. In addition, the flowers pictured in the images are normally surrounded by other plants, trees, grass, etc., sometimes making even the detection of the flowers in the image a challenge. As the flowers are not photographed in a standardized fashion, it does not appear easy to develop a segmentation algorithm that is able to distinguish the flowers from the background. Of course, the photographers often did their best to obtain a good image (one in which most of the characteristic details of the flower can be distinguished) and a clear image of the flowers. However, we noticed that not all photographs are of good quality and even the human eye sometimes has difficulty finding the flowers in the image.
The context of the present research is that of orchid databases (e.g., https://gobotany.nativeplanttrust.org/), nowadays widely available and accessible online. They contain systematic descriptions of orchids, identified by species name and classified in genera, in terms of specific features (color of the flowers and labellum, shape and texture of the leaves, geographical location where they are found, etc.), complemented by multiple digital photographic images of the plant. As existing image-based flower recognition systems offer black-box approaches, which only output the species name [3][4][5], we believe that the way a taxonomist identifies an orchid, i.e., in terms of its features, is a valuable source of inspiration for designing a plant species identification computer-based system.
In the present paper, a novel approach to automated color detection of flowers in images is proposed; all of the published studies in image-based flower recognition have used color moments and color histograms for extracting color features from images [6]. In contrast, in our research color labels are used, i.e., names of colors used to describe objects, in our case flowers. Taxonomists have used color labels, together with other flower features, to describe the characteristics of flowers for centuries [7]. Thus, in the present paper there are two novel contributions:
• Firstly, we have built a new format for a flower-image database, which not only consists of flower images, but in addition of descriptions of the flower characteristics.
• Secondly, we propose a new approach to extract color features from flower images using color labels and deep learning without making use of image segmentation.
To achieve a reliable color detection system, we conducted experiments with five different deep learning architectures, with transfer learning and varying amounts of unfreezing of neural network layers, and various classifier methods (multi-class classifiers, combined binary classifiers, and ensemble classifiers), using different color scheme scenarios. Our experiments show that the ensemble classifier outperforms other classifiers.
Color is widely used as one of the features in object recognition and various methods are employed in identifying color in images. Popular methods are color moments and the color histogram. Color moments measure the color similarity between images by using mean, standard deviation, and skewness [8], whereas the color histogram is based on the computation of the frequency with which colors occur in an image. To compute a color histogram, several color spaces can be chosen, such as the RGB color space, the HSV color space, the CIE L*a*b* color space, etc. [9].
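To make the two classical descriptors concrete, the following minimal Python sketch computes the three color moments and a coarse histogram for a single color channel. It is an illustration only: the function names, the choice of four bins, the signed cube root used for skewness, and the toy pixel values are assumptions and are not taken from the cited works.

```python
def color_moments(channel):
    """Return (mean, standard deviation, skewness) of one color channel."""
    n = len(channel)
    mean = sum(channel) / n
    std = (sum((v - mean) ** 2 for v in channel) / n) ** 0.5
    third = sum((v - mean) ** 3 for v in channel) / n
    # Assumption: skewness taken as the signed cube root of the third central moment.
    skew = (abs(third) ** (1 / 3)) * (1 if third >= 0 else -1)
    return mean, std, skew


def color_histogram(channel, bins=4, max_value=256):
    """Count how many channel values fall into each of `bins` equal-width bins."""
    width = max_value / bins
    hist = [0] * bins
    for v in channel:
        hist[min(int(v // width), bins - 1)] += 1
    return hist


# Toy RGB pixels (hypothetical values): three reddish pixels and one green pixel.
pixels = [(255, 0, 0), (250, 10, 5), (0, 255, 0), (245, 5, 0)]
red_channel = [p[0] for p in pixels]

red_moments = color_moments(red_channel)
red_hist = color_histogram(red_channel)
```

Both descriptors summarize pixel statistics over the whole image (or a segmented region), which is why they do not by themselves yield a human-readable color name.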
Besides color moments and the color histogram, another currently popular method is based on color labels. There are two popular methods available to assign linguistic color labels to image pixels, using either chip-based color names or real-world image-based color names. Chip-based color names are obtained by mapping RGB values to the color names of a labeled set of color chips. This works well when the image is taken under conditions of ideal lighting. Several chip-based color naming references have been built from different datasets [10][11][12]. Alternatively, real-world image color names are used, where color names are learned from objects in real-world images. Van der Weijer et al. [13] have done research employing this method. They used Probabilistic Latent Semantic Analysis (PLSA), a generative model introduced by Hofmann [14] for document analysis, to obtain the distributions of the color names over L*a*b* values. They claim that their method is photometrically robust because the images for learning have been taken from the internet using Google Search with varying illuminants, cameras, and camera settings. However, in contrast to our approach, the method requires segmentation of an object in an image and the segmented object's color is subsequently determined by counting the number of color pixels. Furthermore, the method is unable to differentiate between the color of the flower and labellum, which is one of the aims of our research.
These limitations led us to propose a new method to identify the color of flower and labellum by color labels and deep learning. Basically, we adopted the idea to start with real-world images as suggested by Van der Weijer [13]. Instead of deploying images of general objects (such as cars, shoes, dresses, etc.), we used only flower images. To decide on the color name, we employed deep learning instead of PLSA.
In recent years, flower classification by means of deep learning has been evolving rapidly. Hiary et al. have proposed a two-step deep-learning method to classify flower species [5]. The first step consists of segmenting the flower region using a Fully Convolutional Network (FCN), composed of 5 blocks from the VGG16 architecture [21] and an additional three deconvolutional layers. The second step is concerned with classifying the type of flower using a Convolutional Neural Network (CNN), which also uses the VGG16 architecture, followed by 3 convolutional layers with 512 feature maps. They chose VGG16 because it better suits the flower classification task compared to other deep-learning methods. Evaluation of the performance of their method was conducted on three different datasets: two from Oxford (the Oxford 17 and the Oxford 102 datasets) and the Zou-Nagy dataset. The results show that their proposed method can achieve an accuracy of at least 97% on these datasets.
Gurnani et al. have compared the performance of the GoogleNet and AlexNet architecture in classifying different flowers from the Oxford 102 dataset [15]. The GoogleNet architecture uses Inception as the backbone, while AlexNet uses eight layers with the first 5 layers being convolutional layers and the last 3 layers being fully connected layers. They kept the same hyper-parameters during training of both architectures, concluding that the GoogleNet yielded a better performance than the AlexNet.
Use of the Inception-v3 feature extractor with transfer learning, together with a CNN, was proposed recently by Arwatchananukul et al. in an attempt to distinguish 15 species of Paphiopedilum orchids [16]. They also built a new Paphiopedilum orchid database consisting of 1500 images, 100 images per species, each of them a front-view image with the flower and labellum placed in a very similar standardized way. The highest accuracy reached by their classification system was 98.6%. Inception-v3 was also used by Xia et al. [17] for flower classification. They used the Oxford-17 and Oxford-102 flower datasets in their experiment. The results showed that the system can greatly improve the accuracy of flower classification.
In Liu et al. [3], two network architectures, VGG16 and ResNet50, were applied to recognize Chrysanthemum flowers. They used two datasets for training and evaluation. The dataset for training consisted of 14,000 images with 103 cultivars, while another dataset, comprising 197 images, was deployed for evaluation. Deep learning was selected as the method in this research because of its potential advantage of achieving a good performance. Furthermore, research on orchid flower recognition has been conducted by Zhang et al. [18]. They built a dataset comprising two million orchid images from 2608 species and proposed a joint framework between deep learning and a tree classifier to identify plant species at large scale. AlexNet was used as the deep learning architecture, while a two-layer tree classifier was constructed to organize the large number of plant species. The results showed that the proposed framework can achieve very competitive results in accuracy and computational time.
Research to compare the performance of CNNs combined with transfer learning against other machine learning methods using handcrafted feature extraction was initiated by Gogul and Kumar [19]. They used the Inception-v3, Xception, and Overfeat architecture to do feature extraction. In deep learning, specifying the features one by one as normally done in handcrafting is not needed. The handcrafted features that are often used in flower recognition are: color, shape, and texture. All of these features acted as input to the machine learning method. Decision trees, k-nearest neighbor, naive Bayesian networks, and random forests were compared to each other. The results showed that in extracting features, deep learning outperformed the other machine learning methods (using handcrafted feature extraction) in terms of accuracy. In this case, the highest accuracy was achieved by the Inception-v3 architecture.
A comparison of the performance of some deep learning architectures is also made in the work by Basa et al. [20]. In this research, they compared the performance of VGG16, ResNet-50, MobileNet, DenseNet, and NasNet-Mobile combined with a fine-tuning method on several datasets, including the Oxford-102 Flower dataset. The results showed that VGG16 and ResNet-50 achieved the highest performance, in contrast to NasNet, which performed poorly. However, in several studies [22,23], NasNet, which is a relatively new architecture in deep learning, often gives the highest performance.
Although the summary of related research above certainly conveys the impression that much progress has been made during the last decade by applying deep learning to flower classification, in our research we explicitly aim to move away from the black-box nature of deep learning, by providing information about the role of each of the features exploited in the classification process, although these features may be identified by deep learning. In this paper, the feature studied is color.

The color of orchids
Relevant features. Different from the situation with other flowering plants (angiosperms), the color of the labellum is employed as extra information by taxonomists, in addition to the color of the flower, for describing the characteristics of the orchid. Thus,
• Color of Flower, CF for short, and
• Color of Labellum, abbreviated to CL,
are the color features of orchids that have been selected for our research because of their easy identifiability by both humans and computer vision systems. In the following, these color features will be abbreviated together as 'CO'. Both CO features have an associated domain (range of values), denoted D(CO). For the domain of these CO features, we have designed two scenarios. The first scenario uses non-binary, multinomial data. For example, the variable CF (Color of Flower) can take color values such as 'red', 'purple', 'yellow', etc. The other, simpler binary scenario assumes that one of the colors is taken as the indicator, e.g., 'red', whereas the other value is 'non-red' (in general, 'color' and 'non-color'). Below, first the design of multinomial color schemes will be described, which will be followed later by the design of multi-class and binary classifiers, where the latter deal with the binary scenario.
RGB encoding. We use the RGB (Red Green Blue) color space as a foundation for the color labeling as this color model is also used in the description of the digital images of orchids. RGB color labeling was applied to the color labeling of CF and CL. As the RGB color model describes a 3-dimensional space, we used the 2-dimensional matrix mapping, shown in Fig 2, to facilitate designing a mapping.
To reduce the search space, we designed two color schemes from the RGB color model as described in Fig 3. One alternative (called below color scheme 1) investigated is based on the following colors:
• Red (standing for reddish, i.e., red, brown, and orange): first 2 columns of the matrix of
As blue neither occurs in a flower nor in a labellum of an orchid, columns 7-9 are ignored. Finally, note that white is actually the rectangle M[9,13], but M[9,3] (the end of the Yellow column) is close to white. This choice results in 4 different colors.
The other alternative (color scheme 2) investigated was to employ the following colors:
• Red (standing for reddish, i.e., red and brown): first column of the matrix of
• White: element M[9,13] in the color matrix.
As again blue neither occurs in an orchid flower nor in its labellum, columns 7-9 are ignored. This choice yields 5 different colors, one color more than color scheme 1. Also observe that only two of the color names of color schemes 1 and 2 have exactly the same semantics: green and purple; the meaning of the other color names is different, which is worth remembering, as otherwise the rest of the paper may be confusing. In the following, the color schemes in relationship to CF (Color of Flower) and CL (Color of Labellum) are referred to as CF1 and CF2, and CL1 and CL2, respectively.
Primary and secondary color combinations. Often flowers and labellums have multiple colors; the color combination may help in identifying them. One option is to use a multi-label classification method (with the possibility of having two or more color labels at the same time). However, this would imply that we had to combine two (or more) colors according to the Cartesian product of their domain, D(CO) × D(CO) for two colors, yielding 16, e.g. (Red, Yellow) and (Yellow, Red), and 25 colors, respectively, based on color scheme 1 and 2. As we only had a dataset of limited size, we decided to investigate whether the number of labels could be reduced.
As the color of a flower is often not unique, we have defined the variable CO as a subset of the Cartesian product of two other color variables called $CO_p$ and $CO_s$, respectively, i.e., $D(CO) \subseteq D(CO_p) \times D(CO_s)$, with $CO_p$ the primary color and $CO_s$ the secondary color. Based on the description of orchids, the primary color $CO_p$ has a domain with eight values: blue, brown, green, pink, purple, red, white, and yellow; the secondary color $CO_s$ has a domain consisting of seven values: brown, green, pink, purple, red, white, yellow. The advantage of this definition of CO is that unlikely or impossible color combinations can be left out of the definition, and do not have to be assessed during statistical estimation of the probability distribution. The order of colors is significant as the first, primary color will fill most of the flower or labellum area, and the secondary color a smaller part, often only a rim. However, to reduce the number of colors we assume commutativity of their combination, i.e., for colors A, B we assume that AB = BA. This choice implies that we can compute the number of color combinations with repetitions (the primary and secondary colors can be the same, which means that there is no secondary color), k at a time, of n colors:
$$\binom{n + k - 1}{k}.$$
As in this case we wish to model image colors, not the colors in the descriptions (although there is a strict mapping between the two), we assume that we have to handle 4 and 5 colors, respectively, using the color schemes previously mentioned. Thus, in the present situation for color scheme 1, n = 4 and k = 2, hence $\binom{5}{2} = 10$. The values include, for example, RedRed, which is just Red (the entire flower is red, there is no secondary color). Finally, some of the combinations do not occur in nature. This yields the following combinations:
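The count of unordered primary/secondary pairs can be checked with a few lines of Python. This is only a sketch of the arithmetic: the four color names below are illustrative stand-ins for the labels of color scheme 1, and the helper name is hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb


def count_color_pairs(n, k=2):
    """Number of combinations with repetition of n colors taken k at a time."""
    return comb(n + k - 1, k)


# Illustrative stand-ins for the four labels of color scheme 1 (assumed names).
scheme1_colors = ["Red", "Yellow", "Green", "Purple"]

# Unordered pairs, so (Red, Yellow) and (Yellow, Red) are counted once;
# a pair such as (Red, Red) means there is no secondary color.
scheme1_pairs = list(combinations_with_replacement(scheme1_colors, 2))

n_pairs_scheme1 = count_color_pairs(4)  # four colors
n_pairs_scheme2 = count_color_pairs(5)  # five colors
```

These are upper bounds before the combinations that do not occur in nature are removed, which is why the number of classes actually used is smaller.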

Orchid dataset
We composed a dataset that includes 7156 images of orchid flowers, covering 156 different orchid species. Most of the images were Flickr images under the Creative Commons license, downloaded by us through the Flickr API. Some of the images were obtained from websites such as Go Botany (Native Plant Trust) and the Encyclopedia of Life (EoL). Different from other flower image datasets, our dataset contains, in addition to the images, descriptions of the features of each orchid. The feature descriptions were obtained from the "Go Botany" and "Go Orchids" websites. Several features are used in this dataset: colors, texture, inflorescence, number of flowers, and labellum characteristics. However, in this paper, we only focus on the color features.
The dataset is quite challenging because it suffers from imbalance: some classes are covered by a large number of images, whereas other image classes are underrepresented, i.e., only appear in very small numbers. This imbalance in the data might be caused by the rareness of the species in nature, leading to a lack of pictures of certain species. Besides that, the dataset contains photographic flower images with a natural background, taken from various positions and angles, with varying conditions of illumination and noise, rendering this dataset non-uniform and thus hard to analyze (see Fig 6 for some example pictures).

Deep learning
As deep learning appears to be one of the best available methods for image interpretation, we compared the performance of some promising deep learning architectures on our orchid dataset with the hope of discovering the best architecture for our color detection system. We explored VGG16 [21], Inception-v3 [25], Resnet50 [26], Xception [27], and NasNetLarge [22]. Rather than training a custom-made convolutional network from scratch, transfer learning was used with these architectures, pre-trained on a large dataset (in our case the ImageNet dataset). A pre-trained network is particularly attractive if only a small dataset is available.
In transfer learning, we can freeze or unfreeze the layers in the pre-trained model to obtain the best performance. Table 1 shows the number of layers in each architecture. Based on the table, some experiments were performed by adjusting the number of frozen layers in the pre-trained models. We froze 3/4, 1/2, and 1/4 of the bottom layers. In addition, we also tried freezing only the first layer and unfreezing all of the layers.
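The freezing schemes can be sketched as a small helper that marks the bottom fraction of a list of layers as non-trainable. The sketch below is framework-free so that it runs on its own: `DummyLayer` and `freeze_bottom_fraction` are hypothetical names, and the only assumption about a real framework is that each layer exposes a boolean `trainable` attribute (as Keras layers do), so that the same function could be applied to `model.layers`.

```python
class DummyLayer:
    """Stand-in for a network layer; only the `trainable` flag matters here."""

    def __init__(self, name):
        self.name = name
        self.trainable = True


def freeze_bottom_fraction(layers, fraction):
    """Freeze the first `fraction` of the layers (the bottom of the network).

    Returns the number of frozen layers. fraction=1.0 freezes everything,
    fraction=0.0 unfreezes everything.
    """
    n_frozen = int(len(layers) * fraction)
    for index, layer in enumerate(layers):
        layer.trainable = index >= n_frozen
    return n_frozen


# Toy network with 8 layers (hypothetical; real architectures have many more).
toy_layers = [DummyLayer(f"layer_{i}") for i in range(8)]
frozen_three_quarters = freeze_bottom_fraction(toy_layers, 3 / 4)
trainable_names = [layer.name for layer in toy_layers if layer.trainable]
```

Freezing only the first layer corresponds to choosing `fraction = 1 / len(layers)` in this sketch.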
We added new layers on top of the last layer of the pre-trained model: a flatten layer, a dense layer with 512 neurons, a dropout layer with a probability of 0.5, and a final dense layer with the number of neurons equal to the number of colors that we wished to detect. ReLU was employed as the activation function in the first dense layer and softmax as the activation function for the last dense layer.
The hyper-parameters for deep learning that we employed in the experiments are shown in Table 2. The input to the neural networks was an RGB image of size 224 × 224 for all of the architectures except NasNetLarge, which uses 331 × 331. We used a batch size of 64 and trained for 100 epochs. To achieve a better performance we also fine-tuned our pre-trained models using data augmentation. Data augmentation is a method to increase the diversity of the training set by applying random transformations. The transformations we applied to the dataset were: rotation, shrink, flip, and zoom. The class weight method was applied to the data to obtain more balanced data; it works by replicating the smaller (in number of instances) class until as many samples are obtained as for the larger class. Adam was the optimizer used, with weighted binary cross entropy as the loss function. Software for the experiments was developed using the libraries tensorflow and keras, on top of the scripting language python [28,29].
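The balancing step, as described above, amounts to replicating samples of the smaller classes until every class is as large as the largest one. A minimal sketch of that idea follows; the function name, the random choice of which samples to replicate, and the toy file names are assumptions made for illustration, not the code used in the experiments.

```python
import random
from collections import defaultdict


def balance_by_replication(samples, labels, seed=0):
    """Replicate samples of smaller classes until all classes have equal size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    target = max(len(items) for items in by_class.values())
    balanced_samples, balanced_labels = [], []
    for label, items in by_class.items():
        # Draw the missing samples (with replacement) from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for item in items + extra:
            balanced_samples.append(item)
            balanced_labels.append(label)
    return balanced_samples, balanced_labels


# Toy example (hypothetical file names): three 'red' images, one 'white' image.
toy_samples = ["r1.jpg", "r2.jpg", "r3.jpg", "w1.jpg"]
toy_labels = ["red", "red", "red", "white"]
bal_samples, bal_labels = balance_by_replication(toy_samples, toy_labels)
```

In this sketch the replication is applied to the training data only, in line with the statement that augmentation is restricted to the training process.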

Color classifier methods
As is clear from the description above, the color features of both flower and labellum are described by more than two class labels. Hence, the classifier models we need to learn from the data are of the form $C : \mathbb{R}^{p \times q} \to \{1, \ldots, m\}$, $p, q, m \in \mathbb{N}$, where an image $I \in \mathbb{R}^{p \times q}$ is described by a pq-dimensional (here 224 × 224; 331 × 331 for NasNetLarge) real matrix, and for primary colors only, m = 4 using color scheme 1, and m = 5 for scheme 2; when we consider combining primary and secondary colors, m = 8 for scheme 1 and m = 10 or m = 11 for scheme 2, as discussed above.
There are multiple ways in which color classification can be handled. The first way is that one simply learns a color multi-class classifier C. Fig 7 shows the framework for a multi-class classifier using deep learning. After dividing the dataset into a training set, a validation set, and a testing set, we use the flower images together with their color labels as the input and train a multi-class classifier based on one of the deep learning architectures. With a multi-class classifier, we only need one classifier for predicting the various color labels. The outputs for both training and testing are the color label and the softmax value. The disadvantage is that with a small dataset, it is hard to learn C when the number of color labels is large, such as 10 and 11 in our case. An alternative solution is to learn multiple binary classifiers, $C_i : \mathbb{R}^{p \times q} \to \{0, 1\}$, $i = 1, \ldots, m - 1$, and to combine them.
Let $I \in \mathbb{R}^{p \times q}$ be an image and let us denote by $C_i$ the situation that $C_i(I) = 1$ and by $\neg C_i$ the case that $C_i(I) = 0$. Note that the situation where we know nothing about the actual color of a flower or labellum can be summarized by the following logical disjunction (with $\lor$ having the meaning of inclusive OR):
$$\Gamma = C_1 \lor C_2 \lor \cdots \lor C_m,$$
called the domain closure axiom in artificial intelligence [30], which is augmented with mutual exclusiveness, $\neg C_i \lor \neg C_j$, $1 \le i, j \le m$, $i \ne j$. It simply means that it is known that the actual color is one of the allowed colors, and that having two or more colors at the same time is inconsistent. Note the difference with knowing nothing at all; we do know something, but not yet which specific color the plant part has. This formula plays an essential role below. We have the following potential results from the merge of the binary classifiers:
• $(\neg C_1 \land \cdots \land \neg C_{i-1} \land C_i \land \neg C_{i+1} \land \cdots \land \neg C_{m-1} \land \neg C_m)$, where $C_i$ is the only established positive label for image I, $\neg C_k$, $k = 1, \ldots, m - 1$, $k \ne i$, is the output of classifier k, and $\neg C_m$ is obtained by the mutual exclusiveness axioms;
• $(\neg C_1 \land \cdots \land C_i \land \cdots \land C_j \land \cdots \land \neg C_{m-1})$, with $i \ne j$, meaning that the binary classifiers yield a contradictory result (more than one positive label). The label is unknown in that case.
• $(\neg C_1 \land \cdots \land \neg C_i \land \cdots \land \neg C_{m-1})$, which when combined with $\Gamma$ yields that $C_m$ is the right label of the image.
This way of combining the results of multiple binary classifiers is known as the one-versus-the-rest classifier [31].
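The three cases of the merge can be written down as a short decision function. The sketch below assumes that the outputs of the m − 1 binary classifiers are given as a list of 0/1 values and that the m-th label is the one without its own classifier; the function name, the use of `None` for the unknown label, and the example label names are illustrative choices.

```python
def combine_binary_outputs(binary_outputs, labels):
    """Merge the outputs of m - 1 binary classifiers into one of m labels.

    binary_outputs: list of 0/1 values, one per classifier C_1 .. C_{m-1}.
    labels: list of m label names; labels[-1] has no classifier of its own.
    Returns a label, or None when the classifiers contradict each other.
    """
    if len(binary_outputs) != len(labels) - 1:
        raise ValueError("expected one binary output per label except the last")

    positives = [i for i, out in enumerate(binary_outputs) if out == 1]
    if len(positives) == 1:
        # Exactly one positive: mutual exclusiveness rules out all other labels.
        return labels[positives[0]]
    if not positives:
        # All negative: the domain closure axiom leaves only the last label.
        return labels[-1]
    # More than one positive label: contradictory, label unknown.
    return None


# Illustrative label names (assumed), with 'purple' as the m-th label.
example_labels = ["red", "yellow", "green", "purple"]
single_positive = combine_binary_outputs([0, 1, 0], example_labels)
all_negative = combine_binary_outputs([0, 0, 0], example_labels)
contradiction = combine_binary_outputs([1, 1, 0], example_labels)
```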
A third alternative is to learn an approximate probability function $f_i : \mathbb{R}^{p \times q} \to [0, 1]$ for each color $i = 1, \ldots, m$ by deep learning and to select the color k for which an image I has maximum probability:
$$k = \arg\max_{i \in \{1, \ldots, m\}} f_i(I).$$
The fourth and last classifier we wish to consider is that of combining a multi-class and a combined-binary classifier as an ensemble [31]. Often ensembles of classifiers use some kind of voting mechanism to determine the output. As in this case the ensemble consists of just two classifiers, we have designed two ways to determine which class label to yield as a result. Let $C_1$ and $C_2$ be the two classifiers that make up the ensemble. The following heuristics have been designed (and will be evaluated below), yielding two different ensemble methods:
• Most likely true color (MLTC) ensemble. If both $C_1$ and $C_2$ produce the same label as output, then this is taken as the result. However, when the labels are different, we select the color of $C_1$ or $C_2$ that has the highest true positive rate (TPR; see next section for its definition).
• Most likely color ratio (MLCR) ensemble. Similar to the MLTC ensemble, if both $C_1$ and $C_2$ produce the same label as output, this is taken as the result. However, when the results are different, we decide to produce the color for which the true positive rate ratio of these two colors between the two classifiers is highest as output. The ratio is interpreted in this case as a heuristic saying that if the difference in ratio between a specific color of a classifier is larger than for the other classifier, then the best color choice is the one with the lowest ratio.
An example may help in understanding what we have just described.
Example 1. Consider the two classifiers $C_1$ and $C_2$, with outputs red and white, respectively, and the following results for the two different ensemble methods. For $C_1$ = red, TPR = 0.38, whereas for $C_1$ = white, TPR = 0.70; similarly, for $C_2$ = red, TPR = 0.58, whereas for $C_2$ = white, TPR = 0.59. If we use the MLTC ensemble method, the choice would be white (that is, $C_2$ wins) as 0.59 > 0.38. The MLCR ensemble method will choose red, as the ratio for white is in this case equal to 0.59/0.70 ≈ 0.84 and for red equal to 0.38/0.58 ≈ 0.66, indicating that for red the values are wider apart than for white. Hence, red is the proper choice for the MLCR method.
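Example 1 can be reproduced with the following sketch of the two heuristics. It encodes one reading of the description: for MLCR, the ratio of a color is taken as the smaller of its two TPR values divided by the larger, and the color with the lowest ratio is chosen, which matches the numbers in the example. The function names, the dictionary representation of the per-color TPR values, and the tie-breaking in favor of the first classifier are assumptions.

```python
def mltc(label1, label2, tpr1, tpr2):
    """Most likely true color: on disagreement, keep the label whose own
    classifier has the higher true positive rate for that label."""
    if label1 == label2:
        return label1
    # Assumption: ties go to the first classifier.
    return label1 if tpr1[label1] >= tpr2[label2] else label2


def mlcr(label1, label2, tpr1, tpr2):
    """Most likely color ratio: on disagreement, pick the candidate color whose
    TPR values differ most between the classifiers (lowest min/max ratio)."""
    if label1 == label2:
        return label1

    def ratio(color):
        low, high = sorted((tpr1[color], tpr2[color]))
        return low / high if high > 0 else 1.0

    return min((label1, label2), key=ratio)


# TPR values from Example 1: tpr_c1 belongs to C_1, tpr_c2 to C_2.
tpr_c1 = {"red": 0.38, "white": 0.70}
tpr_c2 = {"red": 0.58, "white": 0.59}

# C_1 outputs red and C_2 outputs white.
mltc_choice = mltc("red", "white", tpr_c1, tpr_c2)
mlcr_choice = mlcr("red", "white", tpr_c1, tpr_c2)
```

With these inputs MLTC selects white (0.59 > 0.38) and MLCR selects red (0.38/0.58 is lower than 0.59/0.70), in agreement with the example.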
Detailed schemas of the scenarios for the combined-binary classifiers and the ensemble classifier can be found in S1 Fig.

Training and testing procedures
We divided the dataset described above into two parts: one for training the classifiers, using a training and validation set, and the other one for testing purposes. The training and validation sets are used for optimization purposes, i.e., to fine-tune a model. The independent testing set ensures that testing is done on images that have not been met during training. The percentages of each part were approximately: 70% (training set), 20% (validation set), and 10% (testing set). We had 5119 images for training, 1235 images for validation, and 802 images for testing. Data augmentation was only applied in the training process. In this dataset, we used two color characteristics, CF and CL, as described in detail above. We extracted the primary color and also primary and secondary color from the images, using the color schemes described above, to handle the large number of flower species. Note that in training the images and associated color labels are used as input, in a form of supervised learning, whereas in testing only the images act as input and the associated color labels are only used to determine the classification performance of the various classifiers.
There are several measures in use to evaluate the performance of a classifier. As we are dealing with multi-class classification problems, no use is made of ROC analysis, which was originally designed for binary classification. Instead we will examine the performance by showing confusion matrices; they have the advantage that they include all the needed information to compute various performance measures, offering detailed insight into how well a classifier performs. A confusion matrix is also useful to visually show imbalance in a dataset.
A confusion matrix is computed by summarizing the number of correct and incorrect predictions per class. There are two kinds of confusion matrices. The first kind is a confusion matrix with entries computed by directly placing the number of correctly or incorrectly predicted cases into the table. Although we will use confusion matrices for multi-class classification, which do not have the 2 × 2 table structure as for binary classification, we illustrate the basic ideas by this simplest possible confusion matrix. TP stands for 'True Positive', representing the number of cases with positive classes that were predicted correctly as being positive. TN stands for 'True Negative'; it represents the number of cases with negative class that is predicted correctly as being negative. FP is short for 'False Positive', being the number of cases with negative class that the classifier predicted as being positive. Finally, FN stands for 'False Negative', being the number of cases with positive class that are predicted as being negative. Table 3(a) summarizes these measures in one matrix.
The second kind of confusion matrix contains rates (relative frequencies) rather than counts. It is obtained from the unnormalized confusion matrix in Table 3(a) by dividing each entry by the total number of cases in the corresponding class, as shown in Table 3.
Another often used measure is 'accuracy': the total number of correct predictions divided by the total number of cases. The accuracy can be calculated directly from the confusion matrix as follows [32]:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
In addition, the performance measure F 1 is applied frequently, which is defined as the harmonic mean of Recall and Precision [32]:
\[ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
where
\[ \text{Recall} = \frac{TP}{TP + FN} \]
and
\[ \text{Precision} = \frac{TP}{TP + FP} \]
In practice we will use the macro-F 1 measure, as it yields insight into the classification performance for the entire class variable. It is defined as the mean of the F 1 measures F_1^i of the individual classes i:
\[ \text{macro-}F_1 = \frac{1}{n} \sum_{i=1}^{n} F_1^i \]
where n represents the number of classes. Thus, henceforth, when we refer to F 1 , we actually mean macro-F 1 .
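The measures named above can all be computed from an unnormalized multi-class confusion matrix whose rows are true classes and whose columns are predicted classes. The sketch below is a minimal illustration under that assumption. The function names and the example matrix are invented, and setting a score to zero when its denominator is zero is a convention chosen here, not one stated in the text.

```python
def normalize_rows(matrix):
    """Divide each entry by its row total (the number of cases per true class)."""
    return [[v / sum(row) if sum(row) else 0.0 for v in row] for row in matrix]


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all cases."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores."""
    n = len(matrix)
    scores = []
    for i in range(n):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp
        fp = sum(row[i] for row in matrix) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention
        recall = tp / (tp + fn) if tp + fn else 0.0     # assumed convention
        denominator = precision + recall
        scores.append(2 * precision * recall / denominator if denominator else 0.0)
    return sum(scores) / n


# Invented 3-class example: rows are true classes, columns are predictions.
cm = [[8, 2, 0],
      [1, 4, 0],
      [1, 0, 4]]
rates = normalize_rows(cm)
```

Because macro-F 1 averages the per-class scores without weighting by class size, a poorly recognized minority class lowers it as much as a poorly recognized majority class would, which makes it informative for imbalanced data.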

Results
As mentioned above, we conducted the experiments using different deep learning architectures to find the pre-trained model that performed best on our orchid data. Using the best pre-trained model, we then conducted further experiments with two color schemes, both for primary color only and for relevant combinations of primary and secondary color; in both cases the color schemes are referred to as CF1 and CF2, for color of flower, and CL1 and CL2, for color of labellum. Three different types of classifier were trained and tested using the data: multi-class classifiers, combined binary classifiers, and ensemble classifiers.

Selection of a pre-trained deep-learning model
As detection of the color of the labellum is a more difficult problem than that of the flower, because of its smaller size, the choice of architecture was guided by its capability of dealing with this color detection problem. Fig 8 shows the results for the different architectures applied to our orchid dataset for the primary color of the labellum, using the CL1 color scheme. When all layers are frozen, each of the pre-trained models offers no more than 70% accuracy, except VGG16, which already is a good feature extractor; its performance is relatively stable across all freezing and unfreezing scenarios. The performance of the other pre-trained models did not improve significantly when we froze 3/4 of the bottom layers. Inception-v3 and Xception perform well once only 1/2 of the bottom layers are frozen. In contrast, the performance of ResNet50 decreases when 1/2 of the bottom layers are frozen. The performance of these models improves considerably when only the first layer is frozen and the others are unfrozen, and remains good when all layers are unfrozen. In the last two cases, all of the architectures give more or less similar performance. However, Xception gives the best performance when only the first layer is frozen. For that reason, from now on we use Xception to conduct further experiments with the different color schemes.
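The freezing scenarios compared in Fig 8 can be sketched as follows. The helpers are hypothetical and operate on any sequence of objects that expose a boolean `trainable` attribute, as layers in Keras do, for instance; the text does not state which framework was used. The sketch assumes that 'bottom layers' means the layers closest to the input and that the list holds only the layers of the pre-trained base, not a newly added classification head.

```python
def freeze_bottom_fraction(layers, fraction):
    """Freeze the first `fraction` of layers (assumed: those nearest the input).

    Returns the number of frozen layers; all remaining layers are unfrozen.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    n_frozen = int(len(layers) * fraction)
    for index, layer in enumerate(layers):
        layer.trainable = index >= n_frozen
    return n_frozen


def freeze_first_layer_only(layers):
    """Freeze only the first layer and unfreeze all the others."""
    for index, layer in enumerate(layers):
        layer.trainable = index > 0
    return 1 if layers else 0


class StubLayer:
    """Stand-in for a framework layer, used only for this illustration."""

    def __init__(self):
        self.trainable = True


layers = [StubLayer() for _ in range(8)]
scenarios = {"all frozen": 1.0, "3/4 frozen": 0.75,
             "1/2 frozen": 0.5, "none frozen": 0.0}
frozen_per_scenario = {name: freeze_bottom_fraction(layers, fraction)
                       for name, fraction in scenarios.items()}
```

In a real framework the model usually has to be recompiled after the `trainable` flags change, and a lower learning rate is commonly used for the unfrozen layers; both points depend on the framework and are not covered here.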

Computation times
For training the deep-learning models, a high-performance computing system with a graphics processing unit was employed, which greatly reduced computation times. Training time was around three hours, while the testing process was very fast. There were no significant differences in training time between VGG16, Inception, Xception, and ResNet50. For NasNet, we ran out of memory when trying to unfreeze the layers.
The results for the different classifiers are discussed next.

Results for the multi-class classifiers
The results of the various multi-class classifiers in terms of accuracy and the F 1 -score are shown in Table 4. From the table, we conclude that color scheme 1 yields better accuracy and F 1 -score for color of labellum (CL), whereas color scheme 2 works better for color of flower (CF). As the results obtained for the detection of primary and secondary color together are worse than those for primary color only, detecting a combination of colors is clearly more difficult than detecting one color only, as expected. Part of the decrease in performance is certainly due to the fact that the number of class labels is much higher for the color combination than for primary color only. Based on these results, we decided to proceed by developing separate binary classifiers for individual colors (a kind of 'color specialists'), which would subsequently be combined into multi-class classifiers, as described above in the section on color classifier methods.

Results for the combined binary classifiers
Recall that for combining binary classifiers, we use two methods: one-versus-the-rest (method 1) and maximum probability (method 2). Table 5 offers a summary of the accuracy and F 1 -score for all combinations of color schemes and methods. Overall, both for primary color and for primary and secondary color, method 1 yields lower performance than method 2. As Table 6 shows, the lower performance is often due to the unclassified (inconsistent) cases, which is what we expected. The advantage of method 2 is that it always produces consistent classifications, and thus below we will focus on this method.
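The difference between the two combination methods can be illustrated with a small sketch. The exact decision rule of method 1 is defined in the methods section and is not repeated here, so the version below is an assumption: a color is accepted only if exactly one binary classifier votes positive at a threshold of 0.5, and the case is otherwise left unclassified. The function names and probabilities are invented.

```python
def combine_one_vs_rest(probabilities, threshold=0.5):
    """Method 1 (as assumed here): accept a color only if exactly one binary
    classifier votes positive; otherwise return None (unclassified)."""
    positives = [color for color, p in probabilities.items() if p >= threshold]
    return positives[0] if len(positives) == 1 else None


def combine_max_probability(probabilities):
    """Method 2: pick the color whose binary classifier is most confident."""
    return max(probabilities, key=probabilities.get)


# Invented outputs of the binary 'color specialists' for three images.
clear_case = {"Purple": 0.91, "White": 0.12, "Yellow": 0.05}
no_vote = {"Purple": 0.40, "White": 0.35, "Yellow": 0.10}
two_votes = {"Purple": 0.80, "White": 0.70, "Yellow": 0.10}

method1 = [combine_one_vs_rest(p) for p in (clear_case, no_vote, two_votes)]
method2 = [combine_max_probability(p) for p in (clear_case, no_vote, two_votes)]
```

Under this assumed rule, the second and third images remain unclassified with method 1 because no specialist or more than one specialist fires, whereas method 2 always returns a single color.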
Both CF and CL show better accuracy using primary color only than using primary and secondary color. However, the color schemes yield different results for CF and CL. When used to classify CL, color scheme 1 appears to work better than color scheme 2. The opposite pattern occurs for CF with primary color, but not always, as shown for the color combination. Hence, it is clear that classifying the color of the flower and of the labellum are not tasks of identical difficulty, possibly because the labellum is much smaller than the flower; this may explain why the simpler color scheme appears to work well for the labellum.

Results for the ensemble classifiers
Because the results obtained by the combined binary classifiers were not consistently better than those obtained by the multi-class classifiers, nor the other way around, we decided to investigate two different, although closely related, ensemble classifiers, the MLTC and MLCR methods, as described at the end of the section on the color classifier methods. As can be noted from Table 7, identifying color using MLCR in general yields better accuracy than MLTC. However, in some cases, such as CL1 using primary color and using primary and secondary color, MLTC has the same performance as the MLCR method. It even slightly outperforms the MLCR method on CF2 using primary and secondary color. Thus, MLCR often has a slight positive effect on the performance in comparison to MLTC, and sometimes none at all. Which classifier performed best? Fig 9 summarizes the performance of the various classifiers by means of bar graphs, with binary classification method 1 now excluded, indicating that all of the classifiers are comparable. However, in general the ensemble classifier using MLCR shows better performance than the multi-class classifiers and combined binary classifiers.

Discussion
It has been repeatedly demonstrated that color is a useful discriminative feature in image-based flower recognition [1,33]. As photographic images of flowers are made under varying and usually non-optimal circumstances, color detection of flowers is far from an easy task and thus hard to automate. Often the color histogram has been taken as the method of choice. In contrast, we have explored color labels, a choice motivated by the fact that color labels are commonly employed by taxonomists in describing flowers. Automated color detection based on color labels offers certain benefits to the taxonomist. When used as part of a computer-based system that is able to provide the name of a flower species in an image, color labels can be used as part of an explanation of why the flower is classified as a certain species. However, a color label is in itself insufficient as a feature for flower identification, as different flowers often have the same color, implying that color cannot uniquely predict the name of a species. Therefore, color features have to be combined with other morphological features.
Even though the results from the classifiers are not significantly different, for the analysis we only use the results from the ensemble classifier MLCR, which slightly outperforms the other classifiers.
As discussed in the previous paragraph, the dataset used in our research was very challenging and reflects common difficulties met in automatic flower identification in the real world. Fig 10 indicates that the dataset suffers from class imbalance; there is a big difference between the number of samples belonging to the majority and the minority class. However, as can be seen in Fig 10, the color classes that have a limited number of training samples do not always have a low F 1 -score, as illustrated, for example, by the primary color 'Red' for CL1 and CL2 and, using both primary and secondary color, by 'PurpleWhite' and 'GreenYellow' for CF2 and 'GreenRed' and 'GreenWhite' for CL2. One explanation is that the class weights used to handle the imbalance in the data during training indeed had a positive effect on the performance for the minority class. Another possibility is that images in the minority class with a good F 1 -score have a similar appearance compared to other minority classes so that the classifier can recognize them more easily.
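The text does not state how the class weights were computed. One common choice, shown here purely as an assumption, is the 'balanced' heuristic, in which each class receives the weight n_samples / (n_classes × class_count), so that rare classes contribute more to the training loss. The function name and the label distribution are invented.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Invented label distribution with a 13:1 majority-to-minority ratio.
labels = ["Purple"] * 130 + ["Yellow"] * 60 + ["Red"] * 10
weights = balanced_class_weights(labels)
```

With these weights, every class contributes the same total weight to the loss, so the 10 'Red' images count as much in aggregate as the 130 'Purple' images.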
Next, we consider the confusion matrices for each classifier to obtain more detailed information about which colors are hard to predict. Our automated color detection system shows slight overfitting. We may note from the confusion matrices, for example from Fig 11, that the classes Yellow and Purple, which have the largest amount of training data, have high accuracy, while cases from the other classes are often misclassified into these two classes. Nevertheless, the other classes can still achieve high accuracy (for primary color most of the classification accuracies are above 80%), even though there is a big difference in the amount of training data between the class values with the highest and the lowest number of samples (the ratio between the two ranges from 1.5 to 13). Using primary color of the labellum, with color schemes 1 and 2, 'Red' is the hardest color to predict. 'Red' is often predicted as 'Yellow' with color scheme 1 and as 'White' with color scheme 2. Sometimes, it is also predicted as 'Purple' in both color schemes. In Fig 12, for the color of flower, 'Yellow' is the most difficult color to predict.
It is worthwhile to examine the images used in testing in more detail to understand why the classifiers often achieved a good performance and sometimes also failed. Therefore, the 20% of images that were misclassified were further analyzed. Three main reasons for the misclassification were uncovered:
1. It may be the case that the color of the flower in the image corresponds to that in the literature, whereas the predicted color is different. An example to illustrate this situation is shown in Fig 13. By their uncertain nature, all classifiers sometimes make mistakes.
2. The color appearance in the misclassified images sometimes differs from the colors mentioned in the literature. In that case, the predicted color may correspond either to the color mentioned in the literature (counted as correct), or to the color appearing in the image (which we count as incorrect). Fig 14 shows an example of a case where a seemingly correct prediction is counted as incorrect. Hence, in this case either the literature is mistaken, or the picture taken had for some reason colors that were not described previously. In both cases, a decision has to be made as to whether the prediction is considered to be correct or incorrect. We decided to be conservative in our assessment.
3. An image often includes not only the flower that is to be classified, but also other flowers, pictured from different angles, and a variety of backgrounds such as grass, leaves, trunks, etc. There are also some images that show the orchid's seed pods or flower buds, which may have a color that differs from that of the blooming flower. We classified these images as hard images, as it is almost impossible to detect the color correctly. Fig 15 gives some examples.
Not much can be done about the misclassified images of type (1) as our classifiers are already optimal. The distribution of the other two misclassified image types (2) and (3) is shown in Table 8, indicating that misclassification is reasonably balanced between the two categories.
Furthermore, we carried out an additional experiment in which we manually corrected the potentially wrong labels based on the literature; in addition, as an orchid was sometimes known to have a number of alternative colors, these colors were added. Next, we counted a prediction as correct if the predicted color occurred among the colors mentioned in the database. Table 9 shows the accuracy before and after label adjustment, using set membership to count the correct predictions. As can be seen, the classifier's performance improved by between 1.5% and 4.5% after implementing these modifications.
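Counting correct predictions by set membership can be sketched as follows: a prediction is a hit when it occurs in the set of colors accepted for that image. The function name and the toy data are invented for illustration; the numbers below are not results from Table 9.

```python
def set_membership_accuracy(predictions, accepted_colors):
    """Fraction of predictions found in the set of accepted colors per image."""
    if len(predictions) != len(accepted_colors):
        raise ValueError("one set of accepted colors is needed per prediction")
    hits = sum(pred in accepted
               for pred, accepted in zip(predictions, accepted_colors))
    return hits / len(predictions)


# Invented example: the second orchid is known to occur in two colors.
predictions = ["Purple", "White", "Yellow", "Red"]
single_label = [{"Purple"}, {"Yellow"}, {"Yellow"}, {"Purple"}]
adjusted = [{"Purple"}, {"Yellow", "White"}, {"Yellow"}, {"Purple"}]

before = set_membership_accuracy(predictions, single_label)
after = set_membership_accuracy(predictions, adjusted)
```

With a single accepted color per image, this measure reduces to ordinary accuracy, so the values before and after label adjustment are directly comparable.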

Commutativity of color
As mentioned in the material and methods section, we used commutativity of primary and secondary color to reduce the number of color combinations, hoping for an improvement in the classification performance. The reader may wonder whether this effect really occurred, which is why attention is paid to this issue here. We limited the study to multi-class classification, as computation times for the experiments would otherwise have increased considerably. This yielded 12, 15, 12, and 16 color combinations for CF1, CF2, CL1, and CL2, respectively. In Table 10, the performance of the multi-class classifier using these color combinations is compared to using color commutativity, confirming a general decrease in performance if commutativity of color is not deployed.
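The effect of commutativity on the number of class labels can be illustrated with a small sketch. It assumes that a combined label is formed by concatenating the two color names and that the order-independent form is obtained by sorting them alphabetically; this is consistent with labels such as 'GreenYellow' but is not confirmed by the text. The function name and the color pairs are invented.

```python
def commutative_label(primary, secondary=None):
    """Return an order-independent label for a primary/secondary color pair."""
    # Assumption: a missing or identical secondary color yields the primary alone.
    if secondary is None or secondary == primary:
        return primary
    return "".join(sorted((primary, secondary)))


# Invented (primary, secondary) pairs; None means no secondary color.
pairs = [("Green", "Yellow"), ("Yellow", "Green"), ("Purple", "White"),
         ("White", "Purple"), ("Red", None)]

ordered_labels = {primary + (secondary or "") for primary, secondary in pairs}
commutative_labels = {commutative_label(p, s) for p, s in pairs}
```

In this toy example the five ordered labels collapse into three commutative ones, which shows how commutativity leaves more training images per class.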

Conclusion and future work
The best classifier is able to detect the color of the flower and labellum reasonably well based on their appearance in the image. Of course, applying segmentation to the images, thus isolating the orchid from the background, would most likely offer better performance. However, our results appear to be satisfactory when one wishes to detect orchid colors, so that segmentation is not needed. This offers the advantage that the user is not confronted with the burden of manual segmentation, whereas automated segmentation of pictures that include plants with their complex background may not be feasible. Although a little overfitting remains, the classifier is able to keep it in check. Reliable color detection of orchids based on color labels can be used as input to an automated image-based flower recognition program, which may include models that are able to provide a better explanation of the classification than deep learning is able to offer.