
The authors have declared that no competing interests exist.

Conceived and designed the experiments: LM PR. Performed the experiments: LM. Analyzed the data: LM PR. Contributed reagents/materials/analysis tools: LM. Wrote the paper: LM PR.

Current address: Helsinki Institute for Information Technology, Aalto University, Helsinki, Finland

Venn diagrams with three curves are used extensively in various medical and scientific disciplines to visualize relationships between data sets and facilitate data analysis. The area of the regions formed by the overlapping curves is often directly proportional to the cardinality of the depicted set relation or any other related quantitative data. Drawing these diagrams manually is difficult and current automatic drawing methods do not always produce appropriate diagrams. Most methods depict the data sets as circles, as they perceptually pop out as complete distinct objects due to their smoothness and regularity. However, circles cannot draw accurate diagrams for most 3-set data and so the generated diagrams often have misleading region areas. Other methods use polygons to draw accurate diagrams. However, polygons are non-smooth and non-symmetric, so the curves are not easily distinguishable and the diagrams are difficult to comprehend. Ellipses are more flexible than circles and are similarly smooth, but none of the current automatic drawing methods use ellipses. We present euler

Data is routinely generated and analysed. For instance, relationships between groups of genes are studied to understand biological processes, improve health care, find cures to illnesses, and solve problems in agriculture. To aid analysis, Venn diagrams are often used. Each data set is represented by a closed curve and each set relation is represented by one of the spatial relationships between the curves. Both the curves and their spatial relationships are often easily visible, as closed curves are processed preattentively and pop out as complete distinct objects

A Venn diagram with ^{n}

Consequently, area-proportional 3-Venn diagrams have been used to, for instance: compare the cell-type of differentially regulated genes after an anti-cancer drug treatment

(A) Comparing the cell-type of differentially regulated genes after an anti-cancer drug treatment

An informal study identified various area-proportional Venn diagrams in the world's most cited journals (e.g., Nature)

Such area-proportional Venn diagrams cannot be drawn analytically using a specific curve shape and so numerical methods or heuristics are required

Ellipses have more degrees of freedom (i.e., a centre, two semi-axes, an angle of rotation) than circles and are similarly smooth. So diagrams drawn with ellipses are more likely to be accurate with respect to the required quantitative data and easy to comprehend due to their distinguishable curves. This is illustrated in
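The degrees of freedom listed above can be made concrete with a small sketch. The class below is purely illustrative: the name `Ellipse`, its field names and its methods are our own assumptions and are not taken from the eulerAPE source code. A circle is the special case in which both semi-axes are equal, so the angle of rotation has no effect.

```python
import math
from dataclasses import dataclass


@dataclass
class Ellipse:
    """Hypothetical container for the five ellipse parameters."""
    cx: float     # x-coordinate of the centre
    cy: float     # y-coordinate of the centre
    a: float      # first semi-axis
    b: float      # second semi-axis
    theta: float  # angle of rotation in radians

    def area(self) -> float:
        # The area depends only on the two semi-axes.
        return math.pi * self.a * self.b

    def contains(self, x: float, y: float) -> bool:
        # Translate the point to the centre, rotate it into the
        # ellipse's own axes, then apply the standard ellipse equation.
        dx, dy = x - self.cx, y - self.cy
        cos_t, sin_t = math.cos(self.theta), math.sin(self.theta)
        u = dx * cos_t + dy * sin_t
        v = -dx * sin_t + dy * cos_t
        return (u / self.a) ** 2 + (v / self.b) ** 2 <= 1.0
```

Rotating an ellipse changes which points it covers but not its area, which is one reason the extra parameters give more room to adjust the region areas independently.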

Each of these diagrams depicts the sets and the quantitative data indicated by the numeric labels in the regions of the corresponding diagram in

The benefits of ellipses were noted earlier (in 2004 in the first paper on area-proportional Venn diagrams

Our contributions include: (1) an optimization method to automatically draw accurate diagrams with ellipses, comprising (1a) a novel cost function to direct the optimization process (Section 3.2), (1b) a method to generate a rational starting diagram (Section 3.3), and (1c) a mechanism to adjust the properties of the ellipses in search of a good solution (Section 3.4); (2) evaluation of (2a) the effectiveness of euler

All the experiments mentioned in this article were run on an Intel Core i7-3770 CPU @3.4GHz with 8GB RAM, 64-bit Microsoft Windows 7 Professional SP1 and Java Platform 1.7.0_10.

The first automatic drawing methods to use circles were developed for area-proportional Venn diagrams with two

All the diagrams are meant to depict ^{−6}. Red indicates diagrams with inaccurate or missing regions. D is a redrawing of

The latest proposed method, venneuler, is different from most others as it uses a statistical model for fitting an area-proportional diagram to the required quantitative data. The model comprises a normalized loss function

An accurate area-proportional 2-Venn diagram can be drawn for any quantitative data using two circles _{a}_{b}_{ab}_{ab}_{ac}_{bc}_{ac}_{bc}
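As an illustration of why two circles always suffice for 2-set data, the sketch below computes the radii from the required circle areas and then searches for the distance between the centres at which the overlap has the required area. The use of bisection here, and the function names `lens_area` and `two_circle_layout`, are our own assumptions for this example and are not a description of any specific published method. The search works because the overlap area decreases monotonically as the centres move apart.

```python
import math


def lens_area(r1: float, r2: float, d: float) -> float:
    """Area of the overlap of two circles whose centres are d apart."""
    if d >= r1 + r2:
        return 0.0                           # circles are disjoint
    if d <= abs(r1 - r2):
        return math.pi * min(r1, r2) ** 2    # one circle inside the other
    # Half-angles subtended by the common chord at each centre;
    # the arguments are clamped to guard against rounding error.
    alpha = math.acos(max(-1.0, min(1.0, (d * d + r1 * r1 - r2 * r2) / (2 * d * r1))))
    beta = math.acos(max(-1.0, min(1.0, (d * d + r2 * r2 - r1 * r1) / (2 * d * r2))))
    # Sum of the two circular segments cut off by the common chord.
    return (r1 * r1 * (alpha - math.sin(2 * alpha) / 2)
            + r2 * r2 * (beta - math.sin(2 * beta) / 2))


def two_circle_layout(area_a: float, area_b: float, overlap: float):
    """Return (r1, r2, d) so that the circles have the given total areas
    and their overlap has the given area. area_a and area_b are the total
    circle areas, i.e. each includes the overlap."""
    if not 0.0 < overlap < min(area_a, area_b):
        raise ValueError("overlap must be positive and smaller than both circles")
    r1 = math.sqrt(area_a / math.pi)
    r2 = math.sqrt(area_b / math.pi)
    lo, hi = abs(r1 - r2), r1 + r2           # overlap is maximal at lo, zero at hi
    for _ in range(100):                     # bisection on the centre distance
        mid = (lo + hi) / 2
        if lens_area(r1, r2, mid) > overlap:
            lo = mid                         # too much overlap: move apart
        else:
            hi = mid                         # too little overlap: move closer
    return r1, r2, (lo + hi) / 2
```

For example, `two_circle_layout(4.0, 3.0, 1.0)` returns radii and a centre distance for which `lens_area` evaluates to the required overlap of 1 up to numerical precision.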

(A) The quantitative values in each region indicate the required region areas, for which an area-proportional 3-Venn diagram should be drawn. (B) The first step of the construction whereby the three accurate 2-Venn diagrams are drawn. (C) The second step of the construction whereby the identical copies of the circle labelled

The first proposed method, VennMaster

Our drawing method euler

Each of the quantities in the provided data, for which a diagram should be drawn, is first scaled by a factor of (

To verify whether the region areas of an area-proportional diagram are accurately and directly proportional to the required quantitative data, euler

If

In euler^{−6}, which value is consistent with that of other methods when defining a value for zero in their implementation (e.g., venneuler

Rather than using the absolute region area, euler

To obtain a good, accurate diagram with respect to the required quantitative data (as defined in Section 3.1), our optimization algorithm minimizes a cost function that takes into account the accuracy of the diagram as well as paths that could lead to a local minimum. In informal experimentation, we observed that the cost function of most of the current methods, such as venneuler's

If

then the cost of

Thus, the cost of a diagram is the mean of the cost of all the regions in that diagram. The sum could have been used since this work focuses on 3-Venn diagrams. However, we used the mean so that this function could be used in other future algorithms for diagrams with any number of curves and overlaps.
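The averaging step can be sketched as follows. The per-region cost used here, `squared_error`, is only a hypothetical stand-in so that the example runs; it is not the region cost defined in this section. The function and variable names are likewise our own.

```python
from statistics import mean
from typing import Callable, Dict

# Labels for the seven regions interior to the curves of a 3-Venn diagram.
REGIONS = ("a", "b", "c", "ab", "ac", "bc", "abc")


def squared_error(required: float, actual: float) -> float:
    """Hypothetical per-region cost, used only as a placeholder."""
    return (actual - required) ** 2


def diagram_cost(required: Dict[str, float],
                 actual: Dict[str, float],
                 region_cost: Callable[[float, float], float] = squared_error) -> float:
    """Mean of the per-region costs over all regions of the diagram."""
    return mean(region_cost(required[r], actual[r]) for r in required)
```

Because the mean divides by the number of regions, costs computed for diagrams with different numbers of regions remain on a comparable scale, whereas for a fixed number of regions the mean and the sum rank diagrams identically.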

A diagram is generated for the required region areas scaled (i.e., those in

The denominator

Though our cost function in

However, these two dimensionless functions will have a different effect from that intended by our non-dimensionless cost function, as the cost of a region would be much smaller than that in

To choose the most effective cost function for euler

The cost function of the optimization algorithm in Section 3.4 was replaced by one of F1–F8 and used to generate diagrams (with the rerun option disabled) for two libraries of 10,000 random 3-set data items each:

This evaluation and experimental comparison indicated that our non-dimensionless cost function F6 is the most effective in:

Generating good diagrams for quantitative data for which a good diagram is known to exist;

Converging to diagrams that have a low

Identifying and avoiding paths that lead the optimization to a local minimum when the overall error of the diagram is reduced at the expense of diminishing the area of a region to a point where it is close to non-existent and its actual-to-required region area ratio is close to zero;

Taking the least time and the fewest iterations to generate a diagram, particularly for data for which a good diagram is known to exist;

Generating a large majority of the diagrams (97.3%,

The effectiveness of F6 over the other cost functions with respect to the generation of good diagrams, the

Following the results of this evaluation, euler

The optimization process has to commence with a solution. This is often an arbitrary or an invariant solution. Both types of starting diagrams were considered for euler

A rational starting diagram that is adapted to the required quantitative data is more effective, as it reduces convergence time and the likelihood of reaching a local minimum. Such a starting diagram is used by for instance venneuler _{1}_{2}_{1}_{2}_{1}_{2}

Changes to the ellipses during the optimization affect the area of the region in exactly the three ellipses. So a starting diagram that minimizes the error of this region seems helpful. To achieve this, the centre for the third ellipse _{3}_{1}_{2}_{1}_{2}_{1}_{1}_{2}_{3}_{3}_{1}_{2}

The centre of ellipse _{3} is a point on the line _{1} and _{2}. The bisection method is applied in the interval indicated by the faded blue circles along

Out of the starting diagrams generated for 10,000 random 3-set data items for which an accurate Venn diagram with ellipses is known to exist, 63% had

Our simple hill-climbing algorithm commences with a rational starting diagram and systematically adjusts the properties of its ellipses to minimize our cost function, until a good diagram with respect to the given quantitative data is obtained. Though it is a simple local search, it rarely encounters a local minimum and, if it does, our algorithm is capable of handling such cases and obtaining a good solution whenever an accurate area-proportional 3-Venn diagram drawn with ellipses is known to exist for the given data (as shown in Section 4.1).
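The general pattern of such a hill-climbing search can be sketched on a toy problem. This is a generic illustration under our own assumptions and not the algorithm of this paper: it uses a single shared step size rather than separate parameters for the centre, semi-axes and angle of rotation, the stopping threshold `min_step` and the halving factor are arbitrary choices, and the parameter vector and cost function are placeholders.

```python
from typing import Callable, List, Tuple


def hill_climb(start: List[float],
               cost: Callable[[List[float]], float],
               step: float = 1.0,
               min_step: float = 1e-6,
               shrink: float = 2.0) -> Tuple[List[float], float]:
    """Greedy local search: try +/- step on each parameter, accept any
    change that lowers the cost, and shrink the step when nothing helps."""
    current = list(start)
    best = cost(current)
    while step > min_step:
        improved = False
        for i in range(len(current)):
            for delta in (step, -step):
                candidate = current.copy()
                candidate[i] += delta
                candidate_cost = cost(candidate)
                if candidate_cost < best:      # accept only improvements
                    current, best = candidate, candidate_cost
                    improved = True
                    break
        if not improved:
            step /= shrink                     # refine the search
    return current, best


def toy_cost(p: List[float]) -> float:
    """Placeholder cost with a single minimum at (3, -1)."""
    return (p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2
```

Calling `hill_climb([0.0, 0.0], toy_cost)` moves the two parameters towards the minimum of the toy cost. In a diagram-drawing setting the parameter vector would instead hold the properties of the ellipses and the cost would measure how far the region areas are from the required ones.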

Our optimization algorithm is characterized by the following three parameters that determine how at every iteration, each ellipse

At every iteration of the optimization algorithm, the (A) centre, (B) semi-axes and (C) angle of rotation of every ellipse are respectively modified by parameters

Changes that lead to a reduced cost of the diagram are accepted. At the start,

[Algorithm listing, steps 1 to 44: most of the pseudocode was lost in extraction. The surviving fragments are: step 12, "Change the centre of"; step 18, "Change the semi-axes of"; step 24, "Change the angle of rotation of"; steps 29, 32 and 35, each beginning with "Divide"; and step 37, of which only the exponent −6 remains.]

Step 38 is reached when a local minimum is encountered. To handle such cases, euler

The software executable and the Java source code are freely available under the GNU General Public License version 3 at

Further details, for example how to load the required quantitative data from a file or how to save the diagram, are available on euler

To evaluate the effectiveness of ellipses in drawing accurate area-proportional 3-Venn diagrams for given data, we first evaluated the effectiveness of euler

The error of the diagrams generated by euler^{−6}. In our experiments, the number of iterations and the time taken to generate the diagrams were also recorded.

This evaluation focuses on 3-set data that associates a quantity greater than zero to each of the seven regions interior to the curves of a 3-Venn diagram. Diagrams with region areas that are zero percent of the total area of the diagram can still be drawn with euler

In this section,

Diagrams were generated with ellipses by euler

By the first run, good diagrams were generated for 9939 of the 10,000 data items (i.e., 99.4%). Despite generating a non-good diagram for the remaining 61 data items (i.e., 0.6%), the ^{−4}, mean 2.38×10^{−3}, minimum 1.02×10^{−6}, maximum 3.09×10^{−2}) and 54 of them (i.e., 88.5%) had

The number of reruns (1–10) that were required for euler

When the optimization algorithm is rerun, more time and a greater total number of iterations are required to generate a good diagram (

The _{10}_{10}

(A) and (B) illustrate (

The majority of the non-good diagrams generated during the first run had a low ^{−4}.

(A) An example of (^{−4}) generated during the first run and (^{−3}) generated during the first run and (

The results of this evaluation indicate the effectiveness of euler

Since euler

Good diagrams drawn with ellipses were generated for 8607 of the 10,000 data items in L2 (i.e., 86.1%)—8372 after the first run (i.e., 97.3% of the 8607) and 235 after one to a maximum of 10 reruns (i.e., 2.7% of 8607). More than half of the 235 good diagrams (56.2%) were generated during the first rerun and only one was generated after 10 reruns, as the ^{−6}, 3.28×10^{−2}] with median 1.89×10^{−3} and mean 3.77×10^{−3}).

None of the diagrams drawn with circles for the 10,000 data items in L2 were good, and the ^{−2}, 6.73×10^{−2} for circles; 1.65×10^{−2}, 2.11×10^{−2} for ellipses). With 99% confidence, these results indicate that for 85.2% to 86.9% of random 3-set data, a good diagram can be drawn (using euler

The time and number of iterations that were required for the generation of the good diagrams using ellipses were similar to those of our evaluation in Section 4.1 (this evaluation: medians 0.4 seconds and 35 iterations, means 1.9 seconds and 201 iterations,

The majority of the 10,000 diagrams with ellipses were generated within 1 second (84.1%—8405/8607 good, 0/1393 non-good) and nearly all with ellipses within 10 seconds (96.9%—8569/8607 good, 1119/1393 non-good). So similar to Section 4.1, with 99% confidence, these results indicate that for 83.1% to 85.0% of random 3-set data, euler

This evaluation also revealed that data for which an area-proportional 3-Venn diagram can be drawn with ellipses often has larger areas for the regions in only one curve than for those in only two curves, and an area for the region in exactly the three curves that is typically similar to those for the regions in only one curve.
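The 99% confidence intervals quoted in this section can be reproduced with a Wilson score interval, which is the interval R's prop.test reports when the continuity correction is disabled. That the intervals were obtained this way is our assumption; the sketch below only shows that the reported counts are consistent with it. The function name `wilson_interval` is our own.

```python
import math
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.99):
    """Wilson score interval for a binomial proportion (no continuity correction)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # two-sided critical value
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# 8607 good diagrams out of 10,000 data items
low, high = wilson_interval(8607, 10_000)
print(f"{100 * low:.1f}% to {100 * high:.1f}%")
```

Rounded to one decimal place, the bounds for 8607 out of 10,000 agree with the 85.2% to 86.9% stated above, and the bounds for 8405 out of 10,000 agree with the 83.1% to 85.0% reported for diagrams generated within 1 second.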

Using a variant of euler

For euler^{−6} (^{−6}. Thus, to compare the accuracy of the diagrams generated by euler

None of the diagrams generated by venneuler for the 10,000 data items in L2 had ^{−6} or ^{−6}. Thus, none of the diagrams were good according to venneuler's and euler

^{−4} and 3.17×10^{−3} respectively), close to that of a good diagram (i.e., ^{−6}). However, ^{−2} and 2.07×10^{−2} respectively). Some of venneuler's diagrams also had aesthetic features that could impede diagram comprehension ^{−3}, ^{−2}), but greater than that of

Examples of diagrams generated with (^{−4} and ^{−2}. A^{−3} and ^{−2}. A^{−12} and ^{−7}. (B) Diagrams generated for data {^{−3} and ^{−2}. There are two regions in B^{−2} and ^{−2}. B^{−12} and ^{−7}. (C) Diagrams generated for data {^{−3} and ^{−2}. C^{−3} and ^{−2}. C^{−12} and ^{−7}.

The diagrams by euler^{−6} and ^{−6} and were thus considered good by both venneuler's and euler

As shown in

The (A) ^{−5}, 6.14×10^{−1}] with median 3.04×10^{−2} and mean 6.41×10^{−2}, and ^{−3}, 2.46×10^{−1}] with median 4.56×10^{−2} and mean 5.73×10^{−2}. The 10,000 diagrams generated with circles by euler^{−10}, 7.79×10^{−1}] with median 7.00×10^{−2} and mean 1.13×10^{−1}, and ^{−6}, 3.31×10^{−1}] with median 6.28×10^{−2} and mean 6.73×10^{−2}. The 10,000 diagrams generated with ellipses by euler^{−14}, 2.24×10^{−1}] with median 7.59×10^{−12} and mean 1.17×10^{−10}, and ^{−8}, 1.39×10^{−1}] with median 8.00×10^{−7} and mean 2.94×10^{−3}.

The differences between venneuler's diagrams and euler^{χ2}(1) = 2.48,

The differences between venneuler's and euler^{χ2}(1) = 5402.3, ^{−16}) and ^{χ2}(1) = 609.1, ^{−16}). Post-hoc tests using Wilcoxon tests with Bonferroni correction showed significant differences between venneuler and euler^{−16}, ^{−16}, ^{−2}). So, euler^{−6} (Section 4.3), but 28 (i.e., 0.3%) had ^{−6} (the difference between these percentages is statistically significant; using R's prop.test with Yates' continuity correction disabled, ^{χ2}(1) = 28.04, ^{−7}). Thus, with 99% confidence, these ^{−6} can be generated with circles for 0.2% to 0.5% of random 3-set data by euler^{−6}.

This evaluation also revealed that if the required areas for the regions in only one curve are around twice as large as those for the regions in only two curves, and the area for the region in exactly the three curves is larger or as large as the areas for the regions in only one curve, then it is highly likely that a close to accurate area-proportional 3-Venn diagram drawn with circles exists.

With respect to the time taken to generate each diagram, venneuler was faster than euler

Area-proportional 3-Venn diagrams are used extensively in various disciplines to facilitate data analysis, but often the diagrams are more misleading than helpful due to the limitations of the curve shapes used by current drawing methods. We investigated this further using real world medical data obtained from a BMC Medicine journal article

The selected article discusses the results from a web-based survey that assessed whether US trainees in family and internal medicine are aware of the complications, screening methods and therapy for chronic kidney disease (CKD). This survey data comprised sets

Diagrams with respect to

The

In

In contrast, most of the diagrams with polygons are either accurate with ^{−6}, as P3, P5, P6, or have region areas that are less misleading than those of diagrams with circles, as P2, P4. The latter is true as for instance, consistent with

Using ellipses, diagram E has region areas that are accurately and directly proportional to the quantities in ^{−6}). It is also easy to comprehend, as the curves are regular and have good continuation like circles. So ellipses can be more effective than both circles and polygons. This was also demonstrated with other real world data in Section ‘

Another area-proportional 3-Venn diagram for the same data sets but for the management of anaemia rather than secondary hyperparathyroidism (so set

(A) The figure with two Venn diagrams drawn with circles in a medical journal article

We have described euler

Our evaluation indicates that using ellipses and euler

The results of our evaluation also indicate great potential for using ellipses to draw area-proportional diagrams with more curves. However, first, further evaluation should be conducted to assess the effectiveness of ellipses and a method like euler

Apart from the shape of the curves, diagram design features (e.g., colours, labelling strategies) can also facilitate or impede understanding of the diagram and the depicted data. The effect of such features and the possible benefits of adding interaction should be investigated. Other features that could aid understanding for users with different abilities (e.g., spatial and numeracy abilities) should also be identified.

A number of studies could be conducted to understand: how such diagrams are processed perceptually and cognitively; how region areas are perceived; the effect of the shape of the regions and curves on area judgement; what discrepancies in areas are not noticeable; whether perceptual scaling measures like those proposed for map symbols in cartography

Following these studies, aesthetic criteria, metrics and cognitive measures as well as perceptual and design guidelines defining an effective, good diagram for human use that facilitates comprehension and reasoning should be formalized and prioritized. A variant of euler

It might also be interesting to assess the effectiveness of allowing users to select aspects of the diagram that they consider important and they would like to optimize. Such aspects could include aesthetic features, such as the shape of certain curves or the accuracy of the regions.

We acknowledge Prof Leland Wilkinson (University of Illinois at Chicago) for providing us with the source code of venneuler