RFC: Statistics in Subsurface

Dirk Hohndel dirk at hohndel.org
Thu May 14 09:24:48 PDT 2020



> On May 14, 2020, at 12:21 AM, Willem Ferguson via subsurface <subsurface at subsurface-divelog.org> wrote:
> I must admit that I do not like any of these three representations. They are inappropriate and inaccurate, leading to misinterpretation.
> 
> The top graph is normally used to indicate trends in three *independent* variables that may or may not be correlated. In the dive the data represent a *single* variable with its min and max values.
> 
> The middle graph is a histogram that would normally also represent three *independent* variables that have been sampled on the same x-axis scale. Again, in the dive case the min and max values represent the *same* variable.
> 
> The bottom graph is normally used to indicate the proportion of a total that is formed by a specific component. In the case of this specific graph, the median would be indicated by the height of the orange bar (i.e. vertical distance between the grey-orange border and the orange/blue border). The max would be indicated by the height of the blue part of the graph, etc. Clearly this is not what is meant.
> 

I agree that the middle and bottom options aren't adequate for the purpose.

> I want to make a call that, if we are dealing with representing statistics, we actually use the proper statistics representations that we are all used to. Most likely that is either some variant of a box and whiskers diagram or a vertical bar chart with error bars. If these diagrams have been shown once to an uninformed person, the interpretation will always be easy. Let's use diagrams for what they are meant to convey and not use a sports car to drive offroad. We do not want any statistics related to Subsurface to be presented in an unprofessional and inappropriate way.
> 

I think we have a couple of choices here. Build the right tool for the statistics professional. Or build something that helps make the statistics accessible to most of our users.
The more I think about these options, the more I think that the statistics professional is best served by using R and creating the views that they are looking for, because otherwise this will become a never-ending "bring me another rock because I want to see things THIS way".

So box and whiskers are out, because the vast majority of our audience has a hard time understanding the difference between a mean and a median, and between naive gas pressure calculations and actually accurate math (I get at least two emails a month stating that our SAC rates are wrong).
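To illustrate the mean versus median distinction mentioned above, here is a minimal sketch in Python. The depth values are made up for illustration and are not taken from any real dive log; the point is only that a single deep dive pulls the mean up while the median barely moves.

```python
from statistics import mean, median

# Hypothetical maximum depths in metres for six dives (illustrative only);
# note the single deep outlier at the end.
depths_m = [12.0, 14.0, 15.0, 16.0, 18.0, 45.0]

avg = mean(depths_m)    # arithmetic mean, pulled upward by the 45 m dive
mid = median(depths_m)  # middle of the sorted values, barely affected

print(f"mean   = {avg:.1f} m")
print(f"median = {mid:.1f} m")
```

With these numbers the mean is 20.0 m and the median is 15.5 m, which is the kind of gap that tends to confuse readers who treat the two terms as interchangeable.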

Now as for which specific graph to use and which one is easier for users WITHOUT A BACKGROUND IN STATISTICS to grasp, I am certainly open to more input here. Ideally input that is based on actual feedback from such users or presentations about data accessibility and visualization. I found the video that Pedro shared rather compelling (especially if played at 1.25x speed because the presenter is taking his time). Which is why I am leaning towards a line graph, but I certainly could see floating bars with a marker for the mean.
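As a rough sketch of what "floating bars with a marker for the mean" would need as input, the following Python computes one (min, max, mean) triple per group. The grouping by year, the sample depths, and the helper name `floating_bar` are all assumptions made for this example; none of it reflects how Subsurface stores or computes its statistics.

```python
from statistics import mean

# Hypothetical data: maximum depth (m) of each dive, grouped by year.
dives_by_year = {
    2018: [10.0, 18.0, 20.0],
    2019: [12.0, 25.0, 30.0, 33.0],
}

def floating_bar(values):
    """Return (bottom, top, marker) for one bar: min, max and mean."""
    return min(values), max(values), mean(values)

# One floating bar per year: the bar spans min..max, the marker sits at the mean.
summary = {year: floating_bar(v) for year, v in dives_by_year.items()}

for year, (low, high, avg) in summary.items():
    print(f"{year}: bar {low:.0f}-{high:.0f} m, mean marker at {avg:.1f} m")
```

Each bar carries only three numbers, which is arguably easier to explain to a reader without a statistics background than the five-number summary behind a box and whiskers plot.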

> As far as the horizontal graphs are concerned, they have a place, but we need to understand where they come from, and that is from the old days when we tried to print graphs on a mainframe line printer that could not print characters vertically. The conventional way to represent histograms or bar charts is in the vertical way *unless there is good reason to do otherwise*. These days there is no problem in printing labels vertically. To have a horizontal bar graph with depth measurements along the vertical axis is just totally unorthodox and not up to modern standards.
> 

Willem, those are some very strong statements that initially provoked a rather negative reaction in me. Calling someone else's proposal "not up to modern standards" feels borderline insulting.
As a matter of fact, yes we can show vertical labels. They are also a complete pain to read. I would argue that the readability of a horizontal chart is actually much better than the vertical one that you so strongly argue for.
I did a quick survey of some of the other dive logs that have screen shots of their statistics pages up on their web sites. And they seem to be about equally split between the two different approaches.

To me, in the end, this doesn't really matter. I don't think I'd ever use this other than to test that it works. Which is true for two thirds, or more likely 80%, of the features in Subsurface.
What I do care about is that we continue to build something that stays maintainable, stays usable, and serves the needs of a broad user base. That's why I refuse the frequent attempts to turn Subsurface into an asset management tool. And that's why I will gently push back on attempts to turn Subsurface into a tool for statisticians. There are great tools for those purposes. Use them.

/D
