In his classic 1973 paper “Graphs in Statistical Analysis”, F. J. Anscombe presents what we now know as Anscombe’s quartet: a set of four datasets with nearly identical summary statistics that look very different when plotted.
Anscombe’s quartet highlights the fallacy of relying purely on summary statistics and emphasizes the importance of visualization. Anscombe’s paper begins with one of the shortest abstracts I have encountered in a technical paper:
Graphs are essential to good statistical analysis. Ordinary scatterplots and “triple” scatterplots are discussed in relation to regression analysis.
That’s it! All of two sentences and under 20 words. Anscombe then goes on to discuss regression. It is easy to perform a regression, says Anscombe, but that does not mean that the straight-line fit of regression is appropriate to the data:
In practice, we do not know that the theoretical description is correct, we should generally suspect that it is not, and we cannot therefore heave a sigh of relief when the regression calculation has been made, knowing that statistical justice has been done.
He recommends plotting the residuals (the differences between the observed values of the dependent variable and the fitted values) against the independent variable and, separately, against the fitted values, as well as checking whether the residuals are approximately normally distributed.
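To make the residual check concrete, here is a minimal sketch in plain Python. The helper names `fit_line` and `residuals` and the small dataset are my own inventions for illustration, not anything from Anscombe's paper. The data is deliberately curved so that the residuals from a straight-line fit show a pattern.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(x, y):
    """Return (fitted values, residuals) for a straight-line fit."""
    slope, intercept = fit_line(x, y)
    fitted = [intercept + slope * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, resid


# Made-up illustrative data: y is quadratic in x, so a line is the wrong model.
x = [1, 2, 3, 4, 5, 6, 7]
y = [xi ** 2 for xi in x]

fitted, resid = residuals(x, y)
for xi, fi, ri in zip(x, fitted, resid):
    print(f"x={xi}  fitted={fi:6.2f}  residual={ri:6.2f}")
```

The residuals are positive at both ends and negative in the middle, a U-shape that signals the straight line is missing curvature. With a plotting library you would scatter `resid` against `x` and against `fitted` to see the same thing at a glance. The normality check is not covered here.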
Ideally, the fitted values follow the observations closely and vary more than the residuals do. Certain observed inconsistencies in the residuals can be resolved either by log-transforming the dependent variable (y) or by including higher-order terms of the independent variable (x) in the regression. Visualizing the data helps us spot outliers easily and test our hypothesis about the relationship between the dependent and independent variables without relying on statistical tests. It is here that he presents the quartet of datasets and their plots as a case in point.
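The quartet's point is easy to verify numerically. The sketch below uses the widely reproduced quartet values and computes the usual summaries with plain Python. The function name `summarize` and the dictionary layout are my own choices for illustration.

```python
from math import sqrt

# Anscombe's quartet as commonly reproduced; datasets I-III share the same x.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Means, sample variances, correlation and least-squares line for (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "r": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


SUMMARIES = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}
for name, stats in SUMMARIES.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Rounded to two decimals, the four rows come out essentially the same, and nothing in them hints at the curve in II, the outlier in III, or the single influential point in IV. Only a plot shows those.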
The final part of Anscombe’s paper discusses scatterplots. He recommends a scatterplot of the dependent variable against the independent variable. He further recommends a triple scatterplot (TSCP), which plots two of the variables against each other and codes the value of a third variable as symbols of varying size and blackness. Two-way tables, examined through their row means, column means and residuals, are an additional way of looking at the data.
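For the two-way table, one common way to get row means, column means and residuals is an additive decomposition around the grand mean. I am assuming that approach here; it is not necessarily the exact procedure in the paper, and the function name `two_way_summary` and the table values are made up for illustration.

```python
def two_way_summary(table):
    """Row means, column means and additive residuals of a rectangular table.

    Assumed decomposition: residual = value - row_mean - col_mean + grand_mean.
    """
    n_rows = len(table)
    n_cols = len(table[0])
    grand = sum(sum(row) for row in table) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(table[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    resid = [[table[i][j] - row_means[i] - col_means[j] + grand
              for j in range(n_cols)]
             for i in range(n_rows)]
    return row_means, col_means, resid


# Made-up table: nearly additive, except the bottom-right cell is too large.
table = [
    [10, 12, 14],
    [20, 22, 24],
    [30, 32, 40],
]

row_means, col_means, resid = two_way_summary(table)
print("row means:", row_means)
print("column means:", col_means)
for row in resid:
    print([round(value, 2) for value in row])
```

The largest residual sits in the bottom-right cell, which is the one I made deliberately out of line with the rest of the table.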
Anscombe saves the best for last, in his concluding paragraph:
Unfortunately, most persons who have recourse to a computer for statistical analysis of data are not much interested either in computer programming or in statistical method, being primarily concerned with their own proper business. Hence the common use of library programs and various statistical packages. Most of these originated in the pre-visual era. The user is not showered with graphical displays. He can get them only with trouble, cunning and a fighting spirit. It’s time that was changed.
We have come a long way since 1973 with regard to visualizing data. Graphing data may not take as much trouble or cunning as it did when Anscombe wrote these words, though I think it still takes a bit of that fighting spirit to go the extra mile for a better visualization.