You know the one about lies, damned lies and statistics? The quip popularised by Mark Twain remains as valid today as it was in his time, and possibly even more so. Statistical or pseudo-statistical data abound in the media, in books and on the Internet, and graphs are often used to visualise them. A picture is, after all, worth a thousand words, as another old saying claims.
A good graph makes it easier for the viewer to grasp the meaning of the data: it should facilitate understanding while staying faithful to what the data contains. Bad graphs break one or the other of these rules.
*
Some bad statistical graphs are simply poorly constructed: they might lack labels, have so much information crammed in that they become unintelligible, use colours or patterns that are hard to tell apart, or use a graph style not suited to the character of the data.
Such graphs are worse than useless: the reader would be better off going back to the numerical tables or making their own graphs, but often the numerical data is not available. Graphs like these might cause major misunderstandings or make readers miss important patterns, albeit not on purpose.
*
Other bad statistical graphs, however, are purposefully constructed to distort or manipulate data. They don’t stay faithful to the data: they work as propaganda tools or marketing aids, not as objective scientific instruments.
Such graphs are particularly harmful when presented in brief media releases or online articles, without a link to the bigger data set, but even in a detailed report a cunningly placed graph of this kind might heavily influence the impression readers form, especially non-specialist ones.
How can you recognise bad statistical graphs that distort the data and are often examples of statistics that are worse than lies?
One of the most common and most treacherous ways to distort data is to change the starting point of the scale so that any differences appear larger. Let’s say a sample of people was asked whether they feared crime in their neighbourhood. In area A 58% of people said they feared crime and in area B 55% did. This information can be presented on a bar graph, with two bars of different length. But the impression the viewer gains if the scale starts at 0% will be diametrically different from the impression gained if the scale starts at 50%. In the first case the difference will appear small (bar A is only about 5% longer than bar B); in the second it will appear huge (bar A is 1.6 times as long as bar B). There are some cases in which starting at zero makes the chart less legible, and starting the scale at a different point is then perfectly justified.
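The effect of the starting point can be checked with a few lines of arithmetic. The sketch below uses the 58% and 55% figures from the example; the function name `drawn_length_ratio` and its `axis_start` parameter are made up for this illustration.

```python
def drawn_length_ratio(a, b, axis_start=0.0):
    """Ratio of the drawn bar lengths for values a and b
    when the value axis starts at axis_start."""
    if axis_start >= min(a, b):
        raise ValueError("axis_start must be below both values")
    return (a - axis_start) / (b - axis_start)


# Figures from the example: 58% in area A, 55% in area B.
area_a, area_b = 58.0, 55.0

ratio_from_zero = drawn_length_ratio(area_a, area_b, axis_start=0.0)
ratio_from_fifty = drawn_length_ratio(area_a, area_b, axis_start=50.0)

print(f"axis from 0%:  bar A is {ratio_from_zero:.2f} times as long as bar B")
print(f"axis from 50%: bar A is {ratio_from_fifty:.2f} times as long as bar B")
```

The same three-point gap is drawn as a bar about 1.05 times as long when the axis starts at zero, and 1.6 times as long when it starts at 50%.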
There are countless ways to magnify differences by playing with scales, but one of the more insidious is to use a two- or three-dimensional object to represent one-dimensional data. If a picture is scaled in every dimension in proportion to the value, its area grows with the square of the ratio between the values and its volume with the cube: a value twice as large is drawn as an object with four times the area, or eight times the volume.
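A rough way to see how quickly this gets out of hand is to raise the ratio of the two values to the power of the number of dimensions being scaled. This is a minimal sketch that assumes the picture is enlarged equally in every dimension; `apparent_ratio` is a name invented for the illustration.

```python
def apparent_ratio(a, b, dimensions=1):
    """Ratio of drawn sizes for values a and b when a pictogram is
    scaled in `dimensions` directions (1 = length, 2 = area, 3 = volume)."""
    if a <= 0 or b <= 0:
        raise ValueError("values must be positive")
    if dimensions not in (1, 2, 3):
        raise ValueError("dimensions must be 1, 2 or 3")
    return (a / b) ** dimensions


# One value is twice the other.
for dims, label in [(1, "length"), (2, "area"), (3, "volume")]:
    print(f"{label}: drawn {apparent_ratio(2.0, 1.0, dims):.0f} times as large")
```

A value that is merely doubled is drawn four times as large as an area and eight times as large as a volume.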
Some graphs actually change scale half way through the graph: for example a chart with time on the X (horizontal) axis might have ten-year intervals at the beginning which change to one-year intervals later on.
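To illustrate why a change of intervals matters, the sketch below takes an invented series that grows by exactly one unit per year and compares the true yearly rate with the rise between neighbouring points when those points are drawn at equal distances. The years and values are hypothetical, chosen only for the demonstration.

```python
# Hypothetical time axis: ten-year intervals that switch to one-year intervals.
years = [1970, 1980, 1990, 2000, 2001, 2002, 2003]
# Invented data: a perfectly steady increase of one unit per year.
values = [float(year - 1970) for year in years]


def per_year_rates(years, values):
    """True rate of change between neighbouring points, per year."""
    pairs = list(zip(years, values))
    return [(v2 - v1) / (y2 - y1) for (y1, v1), (y2, v2) in zip(pairs, pairs[1:])]


def per_step_rises(values):
    """Rise between neighbouring points when they are drawn equally far apart."""
    return [v2 - v1 for v1, v2 in zip(values, values[1:])]


print("true rate per year:  ", per_year_rates(years, values))
print("rise per drawn step: ", per_step_rises(values))
```

The true rate never changes, yet the drawn line rises by ten units per step at first and by only one unit per step later, so a steady trend looks as if it had suddenly flattened.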
Swapping the X and Y axes can have a distorting effect too, especially when one of them depicts time, which readers expect to find on the horizontal axis: trends can then appear steeper than they are.
Statistical graphs with two vertical axes are particularly prone to manipulation and distortion: the impression of a reader will be very heavily influenced by what intervals are used and how the lines or bars are placed against each other.
A common example of a bad statistical graph often seen in newspapers is one that, ostensibly for clarity’s sake, omits the labelling altogether: the reader doesn’t even know where the scale starts, or what intervals are used.
Reporting differences is very prone to distortion. Reporters or researchers might be tempted to prepare graphs (even with a faithful scale) which depict differences that are not statistically significant, i.e. differences that could plausibly be due to pure chance. A reader seeing a graph that shows a visible difference between two groups or points in time will assume the difference is meaningful, but often it is not, with this fact buried deep in the text or omitted altogether.
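Whether a gap such as 58% against 55% is statistically significant depends heavily on how many people were asked. The sketch below applies a standard two-proportion z-test using only the Python standard library. The sample sizes (200 and then 20,000 respondents per area) are assumptions made for the illustration, since the example gives only percentages, and the function name is likewise invented.

```python
import math


def two_proportion_z_test(successes_1, n_1, successes_2, n_2):
    """Two-sided z-test for the difference between two proportions.
    Returns (z statistic, p-value), using the pooled standard error."""
    p_1 = successes_1 / n_1
    p_2 = successes_2 / n_2
    pooled = (successes_1 + successes_2) / (n_1 + n_2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p_1 - p_2) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Assumed sample: 200 respondents per area (58% = 116 people, 55% = 110 people).
z_small, p_small = two_proportion_z_test(116, 200, 110, 200)
print(f"200 per area:    z = {z_small:.2f}, p = {p_small:.3f}")

# Same percentages with an assumed 20,000 respondents per area.
z_large, p_large = two_proportion_z_test(11600, 20000, 11000, 20000)
print(f"20,000 per area: z = {z_large:.2f}, p = {p_large:.2g}")
```

With 200 respondents per area the p-value is well above the conventional 0.05 threshold, so the gap could easily be chance; with 20,000 per area the very same percentages are clearly significant. A graph that shows only the two bars cannot tell the reader which situation applies.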
Data presented in a graph should also make sense: often it is presented out of context, or the context is manipulated in some way. Many times, a graph that presents two relationships will be designed in a way that emphasises one and de-emphasises the other.
A frequent way to lie with graphs is to simply omit part of the data – for example an inconvenient part of the time series, a group that defies the thesis and so on.
*