Data visualization in SE research

I recently came across a couple of horrible examples of data visualization in two highly respectable publication venues for Software Engineering research: a major magazine and a major conference.

In one case a bar chart carries confidence interval bars that are clearly wrong: the lower bound of the interval is negative for a quantity that by definition must be positive.

In the other case “bubbles” are used instead of bars, with a result that is confusing and makes it difficult to compare values.

What puzzles me is: how is it possible for bright (if not top) scientists to make such mistakes or be so careless? Maybe we should build some data visualization content into the SE curricula (and possibly into other engineering curricula too).

Here are the two examples.

Confidence Distrust

The first example, from the main track of a top SE conference, concerns a bar graph whose bars report the average number of files in each change set. By definition a change set must contain at least one file (otherwise it is not a change…), therefore the values are strictly positive integers.

In addition the graph reports, “where applicable, standard deviations presented as error bars.” Error bars are used to show the accuracy of a measure or some form of confidence interval: under the intuitive interpretation, the height of the bar represents the point estimate, while the error bar shows the range in which the true value most likely lies.

If we take this interpretation, it is at least odd to observe a negative lower limit on the error bar, and that for a strictly positive quantity! What kind of (non-misleading) information are we supposed to get from such a graph?

By the way: a lower bound below zero means the standard deviation exceeds the mean, i.e., the relative standard deviation is larger than 100%. That is, any conclusion derived from these data is essentially arbitrary.
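To make the issue concrete, here is a minimal sketch in Python. The data are hypothetical, generated to mimic a skewed, strictly positive files-per-change-set sample: mean ± standard deviation bars can easily dip below zero, while a bootstrap percentile interval, one standard alternative, stays in the valid range by construction.

    import numpy as np

    rng = np.random.default_rng(42)

    # Hypothetical change-set sizes: integers >= 1, heavily skewed,
    # as files-per-change-set distributions typically are.
    sizes = np.ceil(rng.lognormal(mean=0.5, sigma=1.5, size=200)).astype(int)

    m, s = sizes.mean(), sizes.std(ddof=1)
    print(f"mean = {m:.2f}, std = {s:.2f}, mean - std = {m - s:.2f}")
    # With skew this pronounced, the lower "error bar" mean - std can
    # easily go negative, even though every observation is at least 1.

    # A bootstrap percentile interval for the mean cannot leave the
    # range spanned by the (positive) resampled means:
    boot = np.array([rng.choice(sizes, size=sizes.size, replace=True).mean()
                     for _ in range(10_000)])
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"95% bootstrap CI for the mean: [{lo:.2f}, {hi:.2f}]")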

Bubble bars

The second example comes from a well-respected magazine in the area of software engineering.

For each step in a theoretical reference process model, a frequency score is reported. In addition, color is used to encode the five macro-phases the steps can be grouped into. This graph exhibits several problems:

  • the numeric score is encoded as an area (not as a length as in, e.g., a bar chart), and it is well known that human perception is poor at precisely comparing areas, while it is very well suited to comparing lengths;
  • the bubbles overlap, so
    1. when two bubbles of different colors sit next to each other, the rightmost one hides part of the leftmost one, limiting our ability to judge its size,
    2. when two bubbles of the same color sit next to each other, the actual area of each bubble is difficult to determine; even the gestalt principle of closure does not help, since the overlap is very pronounced;
  • the bubbles are center-aligned, so even the one non-overlapping dimension (the vertical extent) lacks a common baseline and is hard to compare across bubbles;
  • the slanted labels are difficult to read, especially when compared to horizontal text.

To appreciate how impairing these problems are, consider a possible redesign of the same graph that avoids all of them.
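Since the original figure and its values are not reproduced here, the Python sketch below uses hypothetical step names, scores, and macro-phases; it only illustrates the redesign idea: horizontal bars on a common baseline, colored by macro-phase, with horizontal labels.

    import matplotlib.pyplot as plt

    # Hypothetical steps, frequency scores, and macro-phases standing in
    # for the magazine's data.
    steps = ["Elicit goals", "Model domain", "Define scope",
             "Specify requirements", "Review specification",
             "Plan iteration", "Implement", "Test", "Deploy", "Monitor"]
    scores = [12, 9, 7, 15, 6, 10, 18, 14, 5, 8]
    phases = ["Analysis", "Analysis", "Analysis",
              "Specification", "Specification",
              "Planning", "Construction", "Construction",
              "Transition", "Transition"]
    # One color per macro-phase (the grouping the original encoded in color).
    palette = {"Analysis": "#4C72B0", "Specification": "#DD8452",
               "Planning": "#55A868", "Construction": "#C44E52",
               "Transition": "#8172B3"}

    fig, ax = plt.subplots(figsize=(7, 5))
    ax.barh(range(len(steps)), scores, color=[palette[p] for p in phases])
    ax.set_yticks(range(len(steps)))
    ax.set_yticklabels(steps)   # horizontal labels: easy to read
    ax.invert_yaxis()           # keep the process order top to bottom
    ax.set_xlabel("Frequency score")
    plt.tight_layout()
    plt.show()

Lengths on a shared baseline are easy to compare at a glance, color still carries the macro-phase grouping, and nothing overlaps.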

Since the graph under consideration appeared in a magazine, it is quite possible that some “designer” produced it instead of the original authors, who are then not to blame for the initial bad design but only for letting it pass without complaint.

I wonder what reason led to such a poor graph. Just the (debatable) fact that it looks fancier?

Conclusions

Two examples of graphs that appeared in highly respectable publication venues show severe limitations.

The questions I ask myself are:

  • should we as authors pay more attention to how we visually present our data?
  • should we as reviewers in the community pay more attention to these aspects and reject visually flawed presentations?
  • should we as teachers devote some portion of the curricula to teaching how data can be presented in visual form?

Further recommended readings

Edward R. Tufte, 1983. The Visual Display of Quantitative Information. Graphics Press.

Stephen Few, 2004. Show Me the Numbers: Designing Tables and Graphs to Enlighten. Analytics Press.
