Color-complexity enabled exhaustive color-dots identification and spatial patterns testing in images

Our computational developments and analyses of experimental images are designed to evaluate the effectiveness of chemical spraying via unmanned aerial vehicle (UAV). Our evaluations accord with the two perspectives of color-complexity: color variety within a color system and color distributional geometry on an image. First, working within the RGB and HSV color systems, we develop a new color-identification algorithm that relies on the highly associative relations among the three color-coordinates to exhaustively identify all targeted color-pixels. A color-dot is then identified as one isolated network of connected color-pixels. The identified color-dots vary in shape and size within each image. This pixel-based computing algorithm is shown to accommodate, robustly and efficiently, the heterogeneity due to shaded regions and lighting conditions. Secondly, all color-dots are classified by size into three categories. Since the number of small color-dots is rather large, we spatially divide the entire image into a 2D lattice of rectangles. Each rectangle thus becomes a collective of color-dots of various sizes and is classified with respect to its color-dot intensity. We progressively construct a series of minimum spanning trees (MSTs) as multiscale 2D distributional spatial geometries in a decreasing-intensity fashion. We extract the distributions of distances among connected rectangle-nodes in the observed MST and in simulated MSTs generated under the spatial uniformness assumption. We devise a new algorithm for testing 2D spatial uniformness based on a hierarchical clustering tree built upon all the MSTs involved. This new tree-based p-value evaluation has the capacity to become exact.


Introduction
Spray technologies via unmanned aerial vehicle (UAV) for liquid chemicals such as fertilizers, herbicides and pesticides are at a stage of intensive research and development [1]. From economic and environmental perspectives, these technologies are deemed vital in Precision Agriculture [2], since their wide use will not only save costs in many respects, particularly those related to human labor and illness, but also add capabilities for dynamic and optimal management. However, the success of such technologies relies heavily on effective evaluations of their performance in terms of efficiency and precision. There might be many ways of making such evaluations. One fundamental way is to test whether the sprayed liquid is distributed in a spatially uniform fashion over a target area.
Recently it has become very common for companies and research labs to design their own experiments to facilitate such fundamental testing. One key step of this testing involves color image analysis consisting of two coupled computational tasks: exhaustive color-dot identification, and spatial pattern extraction and testing. It is crucial to be able to exhaustively identify all targeted color dots of all sizes on a target area, since each dot of sprayed chemical gives rise to two pieces of information: its amount of chemical and its spatial location. Exhaustive search and extraction are often difficult to achieve computationally. Even though color identification is a major topic in computer science, publicly available techniques, such as Contour on gray scale and OpenCV with a priori selected color regions, are not practical choices for dealing with heterogeneous shading on color images.
After collecting almost all colored dots and sorting them into multiscale size-categories, we face two key difficulties: the sheer number of color-dots and their geometric representation. We need a practical unit that can effectively embrace the concept of spatial density of dot-locations. We also need a geometric representation simple enough to embed all the units involved, so that structural information about the spatial distribution can be extracted. Such computational endeavors for multiscale spatial pattern extraction and for testing spatial uniformness across all size-scales have, by and large, not yet been well reported in the literature.
In this paper, we develop computing algorithms to resolve these coupled computational tasks. We apply our algorithms to five experimental images taken under rather distinct lighting conditions. We use one image exclusively to illustrate our computational developments (see Figure 1). Here we refrain from discussing the Color Theory underlying the color image data; for details see [3]. We only briefly mention some relevant information about the color data in an image like the one in Figure 1(a). As a whole, this image contains a large number of pixels ($\sim 10^7$), including the target yellow test paper and the background. Each pixel in the image has two main 3D color measurements: RGB ($[0, 256) \times [0, 256) \times [0, 256)$) and HSV ($[0, 256) \times [0, 256) \times [0, 256)$). These two data formats, RGB and HSV, as shown in Figure 1(b), are mutually convertible. We remove most of the background and focus only on the area containing the yellow test paper, which contains around $10^6$ pixels. The image of the test paper can be represented as either an RGB or an HSV $10^6 \times 3$ matrix. That is, from the perspective of pixel-wise 3D intensities of RGB and HSV, the color image data here take the form of structured data.
Based on one test paper (a yellow colored strip) and the two representations of color data, RGB and HSV, our first goal is to exhaustively identify all "purple" dots of all sizes. Here the color "purple" is meant to be an unspecified 3D region within the 3D discrete space $[0, 256) \times [0, 256) \times [0, 256)$ that appears purple to our visual system. Humans practically have no capability to pinpoint the exact 3D region for any designated color. Since so many dots of varying sizes look purple, it is not possible to specify a single 3D color region that covers all purple dots. Therefore, the popular color-identification technique in OpenCV, which relies on a priori selected color regions, is neither feasible nor practical for the purpose of exhaustive search. Further, one test paper likely includes regions under varying shading conditions due to the photographing conditions and experimental setup when the image was taken. Heterogeneous shading is also vividly seen in the four images shown and analyzed in Section 5. Such shading complicates the choice of gray scale, and consequently reduces the efficiency of the Contour approach significantly.
We propose computational algorithms to resolve this color identification issue. One physical fact of color theory plays an important role: the three dimensions of RGB, or of HSV, are highly associated. This fact implies, interestingly and very importantly, that among the $256^3$ unit cubes (of size $1 \times 1 \times 1$), the millions of pixels of a color image typically occupy only a very small number of cubes. That is, a natural color image usually has a very small "color-complexity". The color-complexity of the test paper in Figure 2(a) is 0.002. In contrast, the color-complexity of the whole image in Figure 1(a) is calculated as $393014 / 256^3 \approx 0.023$. That is, the color-complexity of this test paper is only about one tenth of that of the original image. Therefore, we in fact deal with only tens of thousands, not a million, of distinct colors. This is the underlying reason why our computing cost is low and our color identification is effective.
After nearly exhaustively extracting all purple pixels, we build each purple dot as a connected network of purple pixels based on a common choice of neighborhood on the 2D lattice. We then measure each dot's size and classify the dots into several size categories. We consider two kinds of uniformness: dot-size and spatial. With dot-size uniformness, we aim to determine whether the spraying machine's mechanical design is proper. We pay particular attention to the behavior of the right tail of the dot-size distribution (large and very large dots).
Because the number of color-dots is so large, we focus on spatial uniformness in the sense of density. To bring out this sense of density, we divide the test paper into 400 squares and categorize their densities in terms of the distributions of size categories of the purple dots they contain. We first build a minimum spanning tree (MST) to connect all high-density squares, and examine whether its tree-based distribution of distances between immediately neighboring squares differs markedly from the many simulated distributions derived from randomly generated MSTs under the uniformness assumption. This comparison is carried out by transforming each distance distribution into a histogram with common data-driven bin boundaries, and then collecting all vectors of proportions into a matrix. We build a Hierarchical Clustering (HC) tree among these distribution-IDs. Then we develop a new algorithm to calculate a p-value based on the binary structure of the HC tree geometry. This p-value computation is novel in the sense that it is calculated from a series of odds-ratios along the descending tree-path leading to the tree-leaf of the observed MST. We then repeat the same uniformness test while including less dense squares in a cumulative manner.
We then apply the computational algorithms developed and illustrated on image no.1 to the remaining four images and report the results in Section 5.

Method

Identification of purple pixels
As shown in Figure 1(a) and Figure 2(a), this test paper clearly contains two main color families, yellow and purple, among its one million ($10^6$) pixels. It is also evident that it contains areas of heterogeneous shading intensity across the entire test paper. Such data complications are rather common in the majority of real-world color images, and they are part of the nature of data from Precision Agriculture, since images might be taken under drastically distinct lighting conditions: with or without sunlight, at different times of day. Further, it is well known that the human visual system, via brain and eye, is subject to color illusions. Such illusions make us perceive the same object as having different colors under shadows or against different backgrounds. Thus, any heterogeneously shaded image in general poses various challenges for color identification. One of these challenges is how to perform color identification in a data-driven fashion. In other words, we need the capability to identify color in any image from the perspective of a computer, not a human.
To make computing feasible, we need an idea of how many distinct colors are in fact contained in the test paper. This is the concept referred to as "color-complexity". Given the discrete nature of color data, it is crucial to ask: how many unit cubes of size $1 \times 1 \times 1$ among the $256^3$ "color-unit-cubes" are actually occupied by the one million color-pixels in the test paper? The answer is 28126, so the color-complexity is only $28126 / 256^3 \approx 0.002$. If we enlarge the unit cube to a $10 \times 10 \times 10$ cube, we checked and found that all potential colors contained within such a cube still look rather "uniform" to the naked eye. With respect to all pixels in the test paper, 880 among the $26 \times 26 \times 26$ such cubes are occupied. The color-complexity of this test paper on this larger scale is $880 / 26^3 \approx 0.05$. Hence, we decide to begin our machine learning computations on this scale first, and go back to the unit scale afterward. It is worth emphasizing that such low color-complexity is made possible by the very strong non-linear associations among R, G and B and among H, S and V. This is the underlying foundation for building data-driven algorithms for color identification.
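The counting behind color-complexity can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `occupied_cubes` and `color_complexity` are hypothetical, and pixels are assumed to be integer triples in $[0, 256)^3$.

```python
from typing import Iterable, Set, Tuple

Pixel = Tuple[int, int, int]

def occupied_cubes(pixels: Iterable[Pixel], n: int = 1) -> Set[Pixel]:
    """Return the set of n x n x n color-cubes occupied by at least one pixel."""
    return {(r // n, g // n, b // n) for r, g, b in pixels}

def color_complexity(pixels: Iterable[Pixel], n: int = 1, levels: int = 256) -> float:
    """Fraction of color-cubes at scale n that are occupied."""
    cubes_per_axis = -(-levels // n)  # ceiling division: 256 levels give 26 cubes when n = 10
    return len(occupied_cubes(pixels, n)) / cubes_per_axis ** 3

# Toy data (made up): three yellowish pixels share one 10 x 10 x 10 cube.
toy = [(200, 180, 20), (203, 185, 25), (209, 189, 29), (90, 40, 120)]
print(len(occupied_cubes(toy, n=1)))   # 4 distinct unit cubes
print(len(occupied_cubes(toy, n=10)))  # 2 occupied cubes at the coarser scale
```

With the counts reported in the text, the same ratio gives 28126 / 256**3 at scale n = 1 and 880 / 26**3 at scale n = 10.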
We then build a geometry among these 880 uniform color-cubes. This geometry is intended to serve as a platform for our color identification. We choose this geometry to be a tree for computational simplicity and practical applicability. We construct a hierarchical clustering (HC) tree as follows. We use the center of mass (3D average) of the pixels contained in a cube as that cube's representative. On this collective of 880 representatives, the HC algorithm can work efficiently.
To complete our protocol of color identification, we take a step to tentatively avoid shady areas and background noise. Even though these involve only a minority of pixels, their inclusion could yield non-negligible errors. To this end, we choose a rectangular area within the test paper as our "focal area", as shown in Figure 2. This focal area is divided into 39 rows. Each row contains $2.5 \times 10^3$ pixels (Figure 2(b)) and is further divided into 10 squares.
Our color identification begins with the following row-by-row operation. For each of a row's $2.5 \times 10^3$ pixels, we identify the color-cube to which it belongs and then find that color-cube's representative. The resultant set of distinct color-cube representatives has a size smaller than 880, and is surely much smaller than $2.5 \times 10^3$. On this row-specific set of color-cube representatives, we apply the HC algorithm. For each row-specific HC-tree, via its bifurcation, we designate the representatives within the smaller branch as "purple" and those in the larger branch as "yellow". We further use ROC curve analysis as a validation check to avoid misclassification due to uncontrolled environmental and lighting conditions. This validation check is performed on each square within each row.
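The row-wise split into a smaller "purple" branch and a larger "yellow" branch can be sketched as follows. The text does not state which linkage the HC algorithm uses, so average linkage with Euclidean distance is an assumption here, and `top_bifurcation` is a hypothetical helper. Running the agglomeration until two clusters remain yields the two branches at the top bifurcation of the tree.

```python
from itertools import combinations
from math import dist
from typing import List, Sequence, Tuple

Point = Sequence[float]

def top_bifurcation(points: List[Point]) -> Tuple[List[int], ...]:
    """Agglomerate with average linkage (assumed) until two clusters remain.

    Returns the clusters as lists of indices into `points`, smaller cluster first.
    """
    clusters = [[i] for i in range(len(points))]

    def avg_link(a: List[int], b: List[int]) -> float:
        # Mean pairwise Euclidean distance between members of the two clusters.
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 2:
        # Find and merge the closest pair of clusters.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: avg_link(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return tuple(sorted(clusters, key=len))

# Toy color-cube representatives (made up): four yellowish, two purplish.
reps = [(200, 180, 20), (205, 185, 25), (210, 190, 30), (195, 175, 15),
        (90, 40, 120), (95, 45, 125)]
purple, yellow = top_bifurcation(reps)
print(sorted(purple), sorted(yellow))
```

This naive loop is only meant to expose the logic; for several hundred representatives, a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` would likely be the more efficient choice.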
The ensemble of color identifications on the focal area via the RGB data file is shown in Figure 3, together with the results of the square-by-square validation check. Three squares have obvious misclassification. We "clean" these three squares by assigning all of their pixels to the yellow group.

Recovering the whole yellow test paper
Given that pixels outside the focal area are more likely to be subject to shade or other noise, their color identification needs extra effort. We propose the following remedy, based on experience derived from our explorations and experiments. With the RGB data format, we need to employ small $1 \times 1 \times 1$ RGB color-cubes, denoted as the scale "n = 1". That is, we need to drastically sharpen the color-uniformness within each color-cube. We therefore pay a higher computing cost to achieve color identification with RGB data, even though we still enjoy a reduction in color-complexity because only 0.2% of the $1 \times 1 \times 1$ RGB unit color-cubes are occupied. We collect the centers of all color-cubes that have been occupied by an identified purple pixel in the focal area, and likewise the centers of all color-cubes that have been occupied by an identified yellow pixel in the focal area. Within the pool of these two collections of color-cube centers, we compute the closest neighbor of each pixel outside the focal area, and declare its color identification accordingly. In this way, we are able to capture the majority of purple pixels and avoid misclassification as much as possible.
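The closest-neighbor rule for pixels outside the focal area can be sketched as below. Euclidean distance in color space and the tie-breaking toward "yellow" are assumptions, and `classify_by_nearest_center` is a hypothetical name.

```python
from math import dist
from typing import Sequence

Color = Sequence[float]

def classify_by_nearest_center(pixel: Color,
                               purple_centers: Sequence[Color],
                               yellow_centers: Sequence[Color]) -> str:
    """Label a pixel by the color family of its closest color-cube center.

    Assumes Euclidean distance; ties are assigned to "yellow".
    """
    d_purple = min(dist(pixel, c) for c in purple_centers)
    d_yellow = min(dist(pixel, c) for c in yellow_centers)
    return "purple" if d_purple < d_yellow else "yellow"

# Toy centers (made up) collected from the focal area.
purple_centers = [(90, 40, 120), (95, 45, 125)]
yellow_centers = [(200, 180, 20), (205, 185, 25)]
print(classify_by_nearest_center((100, 50, 130), purple_centers, yellow_centers))  # purple
print(classify_by_nearest_center((190, 170, 30), purple_centers, yellow_centers))  # yellow
```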
As for the HSV data format, we still employ the scale n = 10, i.e. $10 \times 10 \times 10$ HSV color-cubes. We obtain the recovery from the RGB file and the HSV file separately. It turns out that the RGB file helps identify more pixels in smaller purple dots, while the HSV file helps identify more pixels near the bottom and top, where the RGB file fails. The two results suggest a better recovery scheme: simply combining them. All results are shown in Figure 4.

Testing uniformness via sizes
Intuitively, a spraying device typically mixes air with liquid and then pushes the mixture out. The mixing of air and liquid is determined by a set of tuning parameters. Mechanically speaking, different sets of tuning parameters surely give rise to distinct degrees of inhomogeneous mixing. Consequently, the droplets leaving the device are likely heterogeneous in size, and some tuning parameters are better than others. One merit of exhaustively identifying the targeted color-dots contained in an image is that it allows us to check the validity of a parameter setup of the spraying device. For this purpose, there are two natural measures of the size of a droplet, which is an identified connected set of purple pixels. The first measure is the number of connected pixels. The second is the radius of the smallest circle containing all connected pixels. Accordingly, the best set of tuning parameters should ideally produce a Poisson distribution with respect to the counting measure, and an Exponential distribution with respect to the continuous measure. We consider the target collection of color-dots identified by combining the RGB and HSV data; see Figure 4(d). We first compute the MLEs of the intensity parameters, $\lambda_P$ and $\lambda_E$, under the Poisson and Exponential distribution assumptions, respectively, based on the two data sets derived from the target collection of purple dots within the test paper.
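The first size measure, the number of connected pixels per dot, can be sketched with a breadth-first search over a binary purple mask. The 8-neighborhood used here is an assumption (the text only refers to a common choice of neighborhood on the 2D lattice), and `label_dots` is a hypothetical helper.

```python
from collections import deque
from typing import List, Tuple

def label_dots(mask: List[List[int]]) -> List[List[Tuple[int, int]]]:
    """Group purple pixels (truthy entries of `mask`) into connected dots.

    Uses the 8-neighborhood (assumed). Each dot is a list of (row, col) pixels.
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    dots = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Start a new dot and flood-fill outward from this pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            dot = []
            while queue:
                y, x = queue.popleft()
                dot.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            dots.append(dot)
    return dots

# Toy mask (made up): 1 marks an identified purple pixel.
mask = [[1, 1, 0, 0],
        [0, 1, 0, 1],
        [0, 0, 0, 1],
        [1, 0, 0, 0]]
print([len(dot) for dot in label_dots(mask)])  # [3, 2, 1]
```

The resulting pixel counts are the data to which the Poisson assumption is applied; the second measure (radius of the smallest enclosing circle) is not covered by this sketch.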
Based on the pixel-count data set, the Poisson distribution specified by the MLE of $\lambda_P$ is computed and superimposed onto the histogram of pixel-counts from the target collection of purple dots, as shown in Figure 5(a). It is evident that many identified purple dots have large pixel counts that cannot be accounted for by a Poisson distribution. We draw a similar conclusion from the dot-size distribution with the superimposed Exponential distribution specified by the MLE of $\lambda_E$, shown in Figure 5(b).
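For reference, the two MLEs have closed forms under the standard parameterizations. The symbols $m$ (number of identified dots), $x_i$ (pixel count of dot $i$) and $r_i$ (radius of dot $i$) are introduced here for illustration only, and the rate form of the Exponential density is an assumption, since the text does not state its parameterization.

```latex
% Assumed parameterizations:
%   Poisson pmf:          P(X = x) = e^{-\lambda_P} \lambda_P^{x} / x!
%   Exponential density:  f(r) = \lambda_E e^{-\lambda_E r},  r \ge 0  (rate form)
\[
\hat{\lambda}_P = \frac{1}{m} \sum_{i=1}^{m} x_i ,
\qquad
\hat{\lambda}_E = \frac{m}{\sum_{i=1}^{m} r_i} .
\]
```

In words, the Poisson estimate is the sample mean of the pixel counts, and the Exponential rate estimate is the reciprocal of the sample mean of the radii.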
The Q-Q plot of the Exponential distribution specified by the MLE of $\lambda_E$ is compared with the empirical Q-Q plot of the continuous purple-dot sizes, as shown in Figure 5(c). This comparison shows evident departures. Further, we run the Kolmogorov-Smirnov test, which suggests that the observed dot-sizes do not follow an Exponential distribution (p-value < 0.05).

Testing spatial uniformness via rectangle neighborhood
In this section, we construct our major algorithmic developments for testing against 2D spatial uniformness. We adopt the concept of a 2D neighborhood as the basis for 2D spatial characteristics. The reasons are that the identified purple dots are too numerous and their sizes are rather heterogeneous. This neighborhood concept links directly to the idea of spatial density, which is a proper expression for addressing spatial uniformness here.
Given that we divide the entire target area into 400 small rectangles, one rectangle is taken as one 2D neighborhood. On this collective of rectangles, we treat each rectangle as if it were uniformly colored with an intensity (density) of purple that depends on all the purple dots it contains. In this fashion, we consider 2D spatial uniformness among the 2D entities of 400 rectangles. Since all identified purple dots have been classified into three size categories (small, medium and large), the intensity of purple dots in a rectangle is also categorized, as described below. We apply the Hierarchical Clustering (HC) algorithm to guide this categorization. The categorizing protocol is devised as follows.
We count the numbers of small, medium and large purple dots contained in a rectangle as the 3 features of this rectangle. That is, each rectangle of many pixels, an unstructured data format, is characterized by a 3-dim vector of counts. Via this characterization, we transform a rectangle into a structured data format. We employ a distance measure that is a weighted version of Euclidean distance in $\mathbb{R}^3$. To reflect that a larger dot-size gives rise to a higher purple-color intensity, the weighting scheme is specified with respect to the 3 average sizes of small, medium and large purple dots. With this weighted distance measure, we build a $400 \times 400$ distance matrix. An HC-tree is computed and reported in Figure 6(b).
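One possible form of the weighted distance is sketched below. The exact weighting scheme is not given in the text; here each squared count difference is multiplied by a weight taken to be the average dot size of its category, which is only one plausible reading. The function names and the toy numbers are hypothetical.

```python
from math import sqrt
from typing import List, Sequence

def weighted_distance(u: Sequence[float], v: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Weighted Euclidean distance between two 3-dim count vectors (assumed form)."""
    return sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, u, v)))

def distance_matrix(features: List[Sequence[float]],
                    weights: Sequence[float]) -> List[List[float]]:
    """Pairwise weighted distances; 400 rectangles would give a 400 x 400 matrix."""
    return [[weighted_distance(u, v, weights) for v in features] for u in features]

# Toy rectangles: (number of small, medium, large dots); the weights are made up.
rects = [(5, 0, 0), (4, 1, 0), (2, 2, 1)]
avg_sizes = (1.0, 4.0, 16.0)
for row in distance_matrix(rects, avg_sizes):
    print([round(d, 2) for d in row])
```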
On this HC-tree, we see two small branches (colored red and blue) constituting a clear pattern: their member rectangles contain either at least one large dot or at least two medium dots. The locations of these rectangles are shown in Figure 6(a). This data-driven pattern leads us to explore the intensity spectrum via the hierarchical clustering tree on these 400 rectangles. We find that 25 rectangles belong to the Highly-dense category; these are located on the blue and red branches of the HC tree in Figure 6(b). The spatial geometries of these four categories of rectangles can be seen in Figure 7 in a cumulative manner.
Via Minimum Spanning Tree (MST)
On these 25 highly-dense rectangles, we construct a Minimum Spanning Tree (MST), denoted $M_{obs}$, as shown in Figure 8(b). The intuitive idea underlying the MST is that its tree geometry, which spans a subregion by linking each tree-node to one of its close neighbors, reflects possibly heterogeneous degrees of spatial concentration among the 25 member rectangles. One way of expressing such heterogeneity in the spatial concentration of an MST is to look at the empirical distribution (or histogram) of distances between all connected immediate tree-neighbors. Such an empirical distribution is an informative summary of the degree of heterogeneous concentration pertaining to an MST. We particularly look out for extremely high concentrations, which lead an MST's empirical distribution of distances, or its histogram, to reveal a single mode located at a small distance value. With these focal characteristics in mind, to test whether $M_{obs}$ is coherent with the 2D spatial uniformness hypothesis, we compare the empirical distribution (or histogram) of $M_{obs}$ with those of $B$ (= 500) randomly generated MSTs. For each simulated MST, 25 numbers are sampled randomly with equal probability from the collection {1, 2, ..., 400}. We repeat this simulation scheme independently $B$ (= 500) times, and accordingly generate the corresponding $B$ MSTs, denoted $\{M_b\}_{b=1}^{B}$. We thus have $B$ simulated empirical distributions (or histograms) under the spatial uniformness hypothesis. To compare $M_{obs}$ with $\{M_b\}_{b=1}^{B}$ via their empirical distributions (or histograms), we propose two approaches: Mann-Whitney statistics and an unsupervised machine learning approach. It is known that the Mann-Whitney statistic is an efficient U-statistic when both distributions have the same distributional shape. There is no guarantee that this is the case throughout. In fact, regardless of this required assumption, its fundamental drawback rests on the fact that a one-dimensional statistic is unlikely to reveal structural differences between two distributions because of their high dimensionality. That is why the unsupervised machine learning approach is needed.
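The MST edge-length distribution and its simulated counterparts under uniformness can be sketched as follows. Several choices here are assumptions: rectangles are represented by their (row, column) grid coordinates, distances are Euclidean in grid units, the 25 cells are drawn without replacement, and the lattice layout (`n_cols`) is a free parameter because the text does not state how the 400 rectangles are arranged. Cell indices are 0-based in the code, whereas the text uses {1, ..., 400}.

```python
import random
from math import dist
from typing import List, Sequence, Tuple

def mst_edge_lengths(points: Sequence[Tuple[float, float]]) -> List[float]:
    """Prim's algorithm on the complete Euclidean graph; returns the n - 1 MST edge lengths."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    in_tree[0] = True
    # best[v] is the shortest distance from v to any node already in the tree.
    best = [dist(points[0], p) for p in points]
    lengths = []
    for _ in range(n - 1):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        lengths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], dist(points[u], points[v]))
    return lengths

def simulate_uniform_msts(n_cells: int, n_cols: int, k: int, B: int,
                          seed: int = 0) -> List[List[float]]:
    """Draw k distinct cells uniformly at random, B times, and return each MST's edge lengths."""
    rng = random.Random(seed)
    sims = []
    for _ in range(B):
        cells = rng.sample(range(n_cells), k)
        sims.append(mst_edge_lengths([divmod(c, n_cols) for c in cells]))
    return sims

# Example with a hypothetical 20 x 20 layout of the 400 rectangles.
sims = simulate_uniform_msts(n_cells=400, n_cols=20, k=25, B=500)
print(len(sims), len(sims[0]))  # 500 simulated MSTs, 24 edge lengths each
```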
Unsupervised machine learning approach: We want to directly compare the $B + 1$ empirical distributions derived from $M_{obs}$ and $\{M_b\}_{b=1}^{B}$. To facilitate such a direct comparison, we pool together all distance values from these $B + 1$ empirical distributions and build a histogram with 10 bins. With these data-driven bin boundaries, we transform each empirical distribution into a 10-dim vector of counts. These $B + 1$ 10-dim vectors are arranged as the rows of a $(B + 1) \times 10$ matrix.
Because all rows have equal total counts, we simply adopt Euclidean distance and calculate a $(B + 1) \times (B + 1)$ distance matrix, to which we apply the Hierarchical Clustering (HC) algorithm to build an HC tree, denoted $T$. We superimpose $T$ onto the row-axis of the $(B + 1) \times 10$ matrix, as shown in Figure 9(b). The tree-leaf of $M_{obs}$ is marked on the HC tree of this rearranged matrix, which is called a heatmap.
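The conversion of each distance distribution into a 10-dim count vector with common bin boundaries can be sketched as below. The text does not say how the data-driven boundaries are chosen; equal-width bins over the pooled range are assumed here, and both function names are hypothetical.

```python
from typing import List, Sequence

def common_bin_edges(all_lengths: Sequence[Sequence[float]], n_bins: int = 10) -> List[float]:
    """Equal-width bin edges (assumed) spanning the pooled range of all distance values."""
    pooled = [x for lengths in all_lengths for x in lengths]
    lo, hi = min(pooled), max(pooled)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]

def to_count_vector(lengths: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Histogram counts of `lengths` over the common bins; the last bin is closed on the right."""
    n_bins = len(edges) - 1
    lo, hi = edges[0], edges[-1]
    counts = [0] * n_bins
    for x in lengths:
        k = int((x - lo) / (hi - lo) * n_bins) if hi > lo else 0
        counts[min(k, n_bins - 1)] += 1
    return counts

# Toy edge-length lists (made up) for an observed MST and two simulated ones.
observed = [1.0, 1.0, 1.4, 1.0, 2.0]
simulated = [[3.0, 5.1, 2.2, 4.0, 6.0], [2.0, 7.5, 3.3, 1.0, 4.4]]
edges = common_bin_edges([observed] + simulated)
matrix = [to_count_vector(v, edges) for v in [observed] + simulated]
for row in matrix:
    print(row)
```

Every row of `matrix` sums to the same total (the number of MST edges), which is what justifies using plain Euclidean distance between rows.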
The resultant heatmap shows explicitly why $M_{obs}$ is found in an extreme subgroup of the ensemble $\{M_b\}_{b=1}^{B}$. The visible pattern is that the 25 members of $M_{obs}$ have predominantly very small distances between immediate neighbors. This pattern indicates a high degree of concentration among the 25 members of $M_{obs}$, and is a strong piece of evidence against the spatial uniformness assumption. How strong is it? Next we develop an algorithm for such an evaluation.
The HC tree $T$ is binary, so each of its $B + 1$ tree-leaves can be located by a binary descending trace from the root. We encode left-branching with code 0 and right-branching with code 1 at each internal node of $T$, so that each tree-leaf is encoded by a binary code sequence. Denote the binary code sequence for $M_{obs}$ as $\langle d^o_1, d^o_2, \ldots, d^o_{K_o} \rangle$. At each bifurcation along this descending path, the odds of correctly guessing which branch contains $M_{obs}$ are calculated from the sizes of the two branches. We then compute the overall odds of guessing correctly along the entire code sequence, denoted $PO(M_{obs})$, as the product of these odds. An example is illustrated in Figure 10. Likewise, we compute an ensemble of odds $\{PO(M_b)\}_{b=1}^{B}$. We then compute the p-value of observing odds like $PO(M_{obs})$ as the proportion of the $PO(M_b)$ that are less than $PO(M_{obs})$. Upon the HC tree shown in Figure 9(b), we have $PO(M_{obs}) = 0.023$ and the p-value is $p(M_{obs}) = 0.002$. Hence $M_{obs}$ is significantly extreme in the HC tree. Based on the visible patterns observed in the heatmap, we conclude that the 25 [Highly-dense] rectangles are not uniformly distributed.

We apply the computational algorithms developed and illustrated through image no.1 to the remaining four images. Heterogeneous shading conditions are evident across these four images. Overall, our exhaustive color identifications are satisfactory, and the tests of spatial uniformness are indeed much more effective than the one based on ROC analysis.
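Returning to the tree-based p-value on the HC tree $T$, the product of odds along a leaf's descending path can be sketched as below. The exact per-bifurcation formula is not recoverable from the text, so this sketch assumes that the odds at a bifurcation equal the number of leaves in the branch containing the target leaf divided by the number of leaves in the sibling branch. The tree is represented as nested 2-tuples with string leaf labels; all function names are hypothetical.

```python
from typing import List, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]

def leaf_count(tree: Tree) -> int:
    """Number of leaves under a node."""
    return 1 if not isinstance(tree, tuple) else leaf_count(tree[0]) + leaf_count(tree[1])

def contains(tree: Tree, leaf: str) -> bool:
    """True if `leaf` lies under this node."""
    if not isinstance(tree, tuple):
        return tree == leaf
    return contains(tree[0], leaf) or contains(tree[1], leaf)

def product_of_odds(tree: Tree, leaf: str) -> float:
    """Multiply, along the root-to-leaf path, |branch containing leaf| / |sibling branch| (assumed odds)."""
    po, node = 1.0, tree
    while isinstance(node, tuple):
        left, right = node
        inside, outside = (left, right) if contains(left, leaf) else (right, left)
        po *= leaf_count(inside) / leaf_count(outside)
        node = inside
    return po

def tree_p_value(tree: Tree, observed: str, simulated: List[str]) -> float:
    """Proportion of simulated leaves whose product of odds is less than the observed one."""
    po_obs = product_of_odds(tree, observed)
    return sum(product_of_odds(tree, b) < po_obs for b in simulated) / len(simulated)

# Toy tree with 7 leaves: the observed leaf sits in a small branch split off at the root.
toy = (("obs", "b1"), (("b2", "b3"), ("b4", ("b5", "b6"))))
print(product_of_odds(toy, "obs"))  # 2/5 at the root, then 1/1
print(product_of_odds(toy, "b4"))   # 5/2 * 3/2 * 1/2
```

A ratio against the sibling branch is used rather than a share of the parent branch because shares of the parent would telescope to the same value for every leaf, which could not discriminate between leaves.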

Image no.2
Image no.2 consists of two "pages" of test paper, as shown in Figure 13(a). The upper part of these two coupled test papers is under shading. Consequently, the color-dot identification based only on RGB missed quite a few small purple dots, as shown in Figure 13(b). Many of these small dots were also not picked up via the HSV data format based on fine-scale $1 \times 1 \times 1$ cubes.
We report the results of spatial uniformness testing separately for the two test papers, focusing only on dense squares, as shown in Figure 13(c). Two separate results are reported: Figure 14 for the Left and Figure 15 for the Right. In both figures, there exists a small discrepancy in p-values between the result based on ROC (Figure 14(a) and Figure 15(a)) and the results based on the HC-tree and heatmap (Figure 14(b) and Figure 15(b)). However, such small discrepancies do not seem to lead to incoherent conclusions.

Image no.3
The test paper in image no.3 seems slightly curved, as shown in Figure 16(a). This curved shape likely created shades around the upper left and lower right corners of the test paper. The coupled results from RGB and HSV seem to achieve a large degree of exhaustive identification, except for dots located at the two corners, as shown in Figure 16(b). For the spatial uniformness test, the result based on ROC analysis, as shown in Figure 17

Image no.4
Image no.4 has obvious shades at the upper and lower boundaries of the test paper, as shown in Figure 18(a). We report the color-dot identification result based on HSV cubes at the n = 10 scale, as shown in Figure 18(b). For spatial uniformness testing, a big gap is seen

Conclusions
The color identifications and the tests of 2D spatial uniformness via MST for the five images are rather satisfactory. Collectively, these results strongly indicate that our data-driven computational approach to color identification is rather effective, and that our testing methodology for 2D spatial uniformness is novel and practical. The underlying reason for the effectiveness of our color-identification approach is low color-complexity. Interestingly, this simple concept is not well known in the literature. In fact, our current color research has shown us that low color-complexity is seen in natural images as well as in images of famous paintings. That is, our data-driven color identification is applicable to a wide range of color images.
The MST structure and its distance distribution are new and essential summaries of pattern information in spatial data. They fit very well into Data Science. The novelty of evaluating a p-value via products of odds-ratios based on a tree structure, which is complex, can critically expand the applications of unsupervised machine learning methodologies to wider ranges of scientific fields, including medicine.
Our way of dealing with shading in images is not sophisticated. We rely on the fact that the RGB and HSV data formats can be affected differently by shading, and so we propose to combine results from both data formats at distinct scales. From the results reported in Section 5, we see that it works with varying degrees of success. It might be possible to develop systematic approaches to remove, or at least lessen, the shading effects. This is one of our ongoing research directions.

Figure 1: (a): Original image with purple dots in various sizes and shapes located on the yellow test paper; (b): RGB model and HSV model.

Figure 5: (a): The histogram of pixel-counts of identified purple-dots superimposed with the Poisson distribution specified by the MLE of $\lambda_P$ (red curve) and kernel density estimates (green dashed curve); (b): The histogram of dot-sizes of identified purple-dots superimposed with the Exponential distribution specified by the MLE of $\lambda_E$ (red curve) and kernel density estimates (green dashed curve); (c): Empirical Q-Q plot (circles) vs Exponential Q-Q plot specified by the MLE of $\lambda_E$ (dashed line).

Figure 7: The flow of the rectangles' 2D distribution, from large dots to medium and small dots; some specific spatial patterns are present; (a): rectangles with 1 large dot; (b): add rectangles with ≥ 3 medium dots; (c): add rectangles with ≥ 2 medium dots; (d): add rectangles with ≥ 1 medium dot; (e): add rectangles with ≥ 2 small dots; the blank rectangles have only 1 or 0 small dots.

Figure 8: (a): the 2D distribution of interest, of rectangles with ≥ 2 medium dots; (b): the corresponding Minimum Spanning Tree (MST), used to capture the spatial pattern.

Notation for the tree-based p-value: the binary code sequence for $M_{obs}$ is $\langle d^o_1, d^o_2, \ldots, d^o_{K_o} \rangle$, and the code sequence for $M_b$ is $\langle d^b_1, d^b_2, \ldots, d^b_{K_b} \rangle$ with $b = 1, \ldots, B$. The coding lengths $K_o$ and $\{K_b\}_{b=1}^{B}$ vary from one tree-leaf to another. Further, with the binary code sequence as the descending path of bifurcations locating $M_{obs}$, we denote the left and right branches at the $k$-th bifurcation as $L_{d^o_k}$ and $R_{d^o_k}$, with $k = 1, \ldots, K_o$. The sizes of the two branches are denoted $|L_{d^o_k}|$ and $|R_{d^o_k}|$. The size of the branch containing $M_{obs}$ at the $k$-th bifurcation is then $|L_{d^o_k}|$ when $d^o_k = 0$ and $|R_{d^o_k}|$ when $d^o_k = 1$.

Figure 10: An example as illustration of product of odds (P O) and p-value.

Figure 13: Image no.2: (a) two test papers in the original data; (b) recovering by RGB file; (c) dividing the paper into rectangles for 2D spatial uniformness testing.