RFC: Statistics in Subsurface

Willem Ferguson willemferguson at zoology.up.ac.za
Sat May 16 12:11:43 PDT 2020


On 2020/05/16 18:22, Dirk Hohndel wrote:
> Hi Willem,
>
> That's a great detailed writeup. I'll comment in line and try to keep 
> all of the pictures in place!
>
>> On May 16, 2020, at 7:13 AM, Willem Ferguson 
>> <willemferguson at zoology.up.ac.za 
>> <mailto:willemferguson at zoology.up.ac.za>> wrote:
>>
>> This is just an attempt to enumerate how many types of graphs one is 
>> likely to need, given the discussion until now. As a basis I use 
>> Dirk's Proposal for selecting an appropriate graph.
>>
>> In the above diagram, the different types of variables have different 
>> colours.
>>
>> 1) The yellow ones are just totals (Total # dives, Total no. 
>> minutes/hours) that are unlikely to have any associated minimum or 
>> maximum.
>>
>> 2) The blue ones are variables defined in terms of categories. Date : 
>> day, week, month, etc; Trip : trip locality; suit: type of dive suit; 
>> tags : tag text. There is no dive suit value in between wetsuit and 
>> semidry suit because they are two distinct categories.
>>
>> 3) The white ones are continuous numeric variables. Duration can 
>> potentially have any arbitrary number of minutes or hours. The same 
>> goes for Max_depth, Min_temp, SAC, and all the other white ones. 
>> In between any two arbitrary depths there are innumerable intermediate 
>> depths, so depth is a value along a continuous scale.
>>
>
> Technically... we store depth in mm (and round when displaying)... 
> yeah, you know where I'm going here. I'm being a childish stickler for 
> details that these are of course discrete values, but for the purpose 
> of this discussion you are correct. They have "nearly continuous" 
> values and in that are distinctly different from the blue categories.
>
> I have two questions:
> - why is 'dive type' not blue. That's likely just an oversight, right? 
> It's not like depth, duration or temp that are "nearly continuous".

Apologies. Yes, an oversight.

Sorry, I did not get your 2nd question.

> - and what about date. that one is kinda weird - but I'd say for 
> someone with 8000 dives since 1983, date is "more continuous" than 
> temperature (we dive in water typically between about -2C and +40C, so 
> even at 0.01 degree intervals that's only ~4200 values - and 0.01deg C 
> is way beyond the measurement accuracy here)... I think 'by date' is 
> either white or it is 'special'.

One needs to make a strong distinction between the way the data are 
stored in Subsurface memory and the conceptual way that these types of 
data are treated as statistical objects. It is as you say: they have 
"nearly continuous" values and in that are distinctly different from the 
blue categories. That depth is stored in mm reflects the limited 
accuracy of the measuring equipment; it does not make depth an 
integer.  :-)  If storage space and speed of data manipulation were no 
issue, Subsurface would probably have stored depth as a floating point 
value.

As for date, a day, week, month or year is conceptually an integer. We 
would not talk of week 3.576. *Time* is not an integer, it is 
continuous. But when we talk of a day, a week or a month, we do not use 
a concept of continuous time, we use an integer. I totally understand 
your point about time, but this is not useful if I want to compare my 
dives in 2018 with those in 2019, or summer (months 10 to 3) with 
winter (months 4 to 9). Hope this makes sense?

Look at the printout of the annual summary. Years and dates are used as 
discrete numbers, as they should be. On the other hand, the average dive 
duration for Dec 2019 was 55 min and 14 sec. This is a continuous-number 
measurement. No one would argue that the duration is exactly 55:14.00, 
accurate to the whole second. The calculation just rounds the duration 
to the nearest second. It does not mean that dive duration is 
calculated in whole seconds. If we wished to, we could represent it to 
10 decimals, but it would not be informative.
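
To make the distinction concrete, here is a minimal Python sketch. It is 
not Subsurface code: the sample dives and the function names are made up 
for illustration. The grouping key (year, month) is discrete, while the 
averaged duration is treated as continuous and rounded only for display.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical dive records: (start time, duration in seconds).
dives = [
    (datetime(2019, 12, 3, 9, 30), 3300),
    (datetime(2019, 12, 14, 11, 0), 3328),
    (datetime(2020, 1, 5, 8, 15), 2700),
]

def mean_duration_by_month(dives):
    """Group dives by the discrete key (year, month); average the durations."""
    groups = defaultdict(list)
    for when, seconds in dives:
        groups[(when.year, when.month)].append(seconds)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

def format_min_sec(seconds):
    """Round a continuous duration to the nearest second, for display only."""
    total = round(seconds)
    return f"{total // 60}:{total % 60:02d}"

means = mean_duration_by_month(dives)
for (year, month), mean in sorted(means.items()):
    print(year, month, format_min_sec(mean))
```

The month never needs a fractional part, whereas the mean duration is a 
float that is rounded only at the very last step.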

>> The type of graph that best depicts a relationship between two types 
>> of variables depends on the colour that each of the variables above 
>> has. I need to emphasize that the graphs below are totally open to 
>> discussion. The purpose here is to assess how many types of graphical 
>> elements one would need for a basic statistics tab in Subsurface.
>>
>> Plotting a yellow variable against a blue variable is probably best 
>> represented by a simplistic bar graph like:
>>
>> There are no min and max values to indicate. The different suit 
>> categories are indicated along the horizontal axis. There is no need 
>> to specify a degree of "granularity" or increment along the 
>> horizontal axis and no min or max values are involved.
>>
>
> Yes, I think that's reasonable. People like me will run into trouble 
> here because of the way I name my suits. "dry, Whites Fusion with 
> Weezle", "wet, 3mm, with 2/1 hooded vest"

Myself included, same thing.


>
>> If one plots a yellow variable against a white (continuous) variable, 
>> then a granularity/increment needs to be specified. In the image 
>> below, an increment of 20m was used.
>>
>> Basically the same type of graph as the one used above. No need for 
>> min/max values.
>>
>
> The challenge here lies in the ability to come up with "clever" 
> groupings. To a human this is likely somewhat obvious, but you need to 
> invest some thought / algorithm / heuristics into this to get this 
> right. Especially if we include 'date' as a white variable (as I think 
> we should).
> Clearly something that is "solvable", just not necessarily straight 
> forward.

I agree completely. But this is an issue of implementation, not an issue 
for putting together a rough framework of visual presentation.
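
For what it is worth, the simplest form of the grouping is a fixed 
increment, as in the 20m example. A rough Python sketch, assuming depths 
in metres and a hypothetical helper name count_by_bin (this says nothing 
about how "clever" groupings would be chosen automatically):

```python
from collections import Counter

def count_by_bin(values, increment):
    """Count values per bin; bin k covers [k*increment, (k+1)*increment)."""
    counts = Counter(int(v // increment) for v in values)
    return {
        (k * increment, (k + 1) * increment): n
        for k, n in sorted(counts.items())
    }

# Made-up maximum depths in metres, for illustration only.
max_depths_m = [5.2, 18.0, 19.9, 20.0, 31.5, 44.8]
bins = count_by_bin(max_depths_m, 20)
for (low, high), n in bins.items():
    print(f"{low}-{high} m: {n} dives")
```

The increment is the only parameter the user would need to supply; 
choosing a sensible default is the implementation issue Dirk refers to.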

>
>> Of course, as was well argued previously, the bar graph can be 
>> horizontal in the case of long names on the horizontal axis, e.g. 
>> dive site names:
>>
>> While I personally have no qualms with horizontal diagrams where 
>> needed, I would argue it is a regression to default to horizontal 
>> orientations for all bar graphs.
>>
>
> I think for blue variables horizontal might be easier to get right. 
> But I'm willing to have us consider this a "phase 2" thing and start 
> with just vertical and see into how much trouble we run there...

Clearly there are cases that would necessitate horizontal. A simple rule 
that I would consider is, "if any label is longer than 8 characters, use 
horizontal, otherwise use vertical". Again my formulation is simplistic 
but I think you would get my drift.
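
As a sketch of that rule in Python (the function name and the default 
threshold of 8 are only my illustration, not existing Subsurface code):

```python
def bar_orientation(labels, max_label_len=8):
    """Return 'horizontal' if any label exceeds max_label_len characters."""
    if any(len(label) > max_label_len for label in labels):
        return "horizontal"
    return "vertical"

print(bar_orientation(["wetsuit", "drysuit", "semidry"]))
print(bar_orientation(["dry, Whites Fusion with Weezle", "wetsuit"]))
```

A single long suit name such as Dirk's is enough to flip the whole graph 
to horizontal, which is probably what one wants.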

>
>> The above graphs deal with yellow variables in Dirk's proposal. Now 
>> about the other categories. Plotting a White variable against a Blue 
>> variable has several options, including box and whisker plots that 
>> are not popular in this discussion. My proposal two days ago was 
>> something like this and there was some discussion around it:
>>
>>
>> Here SAC is a white (continuous) variable and Suit is a blue 
>> (categorical) variable. A graphical element that is likely to differ 
>> sharply from the bar graphs used above. Here again, because the 
>> horizontal axis comprises categories, there is no need to specify a 
>> granularity/increment. For lack of a better name (there is actually an 
>> esoteric statistical name for this graph) I call this a dot graph.
>>
>
> The more I look at this, the more I like it.
>
>> What about plotting a Blue (categorical) variable against a White 
>> (continuous) variable? For our case the order in which the blue and 
>> white variables are selected probably does not matter and the dot 
>> graph shown above (or some derivative of it) should suffice.
>>
>
> I had never thought of blue categories being plotted against the white 
> categories. What's the meaning of a plot of my tags over the depth? Or 
> trips over temperature? I'm not saying that this is wrong or not 
> useful, I'm just trying to understand how that would look and what 
> information the user would get from it?

Just Ferguson not being clear in communication. The bane of my life. 
Your next comment shows you got my basic idea.

>
>> What if a white (continuous) variable is plotted against another 
>> white variable (e.g. dive duration against dive depth)? The most 
>> appropriate type of graph is probably a scatter diagram:
>>
>> The raw data are indicated on the graph. There is no need for 
>> specifying a granularity value because there is no grouping of values 
>> along the horizontal or vertical axes. If a clear relationship 
>> between the two variables exists, it is clearly visible on the graph 
>> as in this case.
>>
>
> Yes, this seems like the obvious choice here, much better than 
> creating artificial columns by intervals of depth.
>
>> We have now dealt with
>>
>> Yellow/white
>>
>> Yellow/Blue
>>
>> White/Blue and Blue/White
>>
>> White/White
>>
>> What about Blue/Blue?
>>
>
> Same question as above. What does a graph of, say, suit over tags look 
> like? What does it tell me?

Depends how the tags are used by the diver. I have standard tags that 
denote the primary gas type (air, nitrox, trimix) and dive equipment 
(recreational, sidemount, twinset, rebreather). So I would like to 
summarise my dives on the different equipment by the type of suit I use. 
The 3rd variable could be "amount of dive time spent with each 
combination of equipment and suit" or "number of dives with each of such 
combinations".

The case of stacked graphs becomes more interesting when one starts 
graphing a categorical variable such as gas against depth or SAC or dive 
duration. For instance a stacked histogram (like the one below) of gas 
(air, nitrox, trimix) against depth, using either number of dives or 
total dive duration as a 3rd variable and a grouped continuous variable 
on the horizontal axis. This would allow one to investigate why he used 
air on some pretty deep dives, or why she ended up using trimix for some 
pretty shallow dives.
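
The data behind such a stacked histogram could be put together roughly 
as follows. This is a Python sketch under my own assumptions: each dive 
is reduced to a (max depth in metres, gas) pair, the number of dives is 
the 3rd variable, and stacked_counts is a hypothetical name.

```python
from collections import defaultdict

def stacked_counts(dives, increment):
    """Return {bin_start: {gas: number_of_dives}} for (depth, gas) pairs."""
    stacks = defaultdict(lambda: defaultdict(int))
    for depth, gas in dives:
        bin_start = int(depth // increment) * increment
        stacks[bin_start][gas] += 1
    return {start: dict(gases) for start, gases in sorted(stacks.items())}

# Made-up dives, for illustration only.
dives = [
    (12.0, "air"), (18.5, "nitrox"), (15.0, "trimix"),
    (25.0, "nitrox"), (38.0, "air"), (45.0, "trimix"),
]
stacks = stacked_counts(dives, 20)
for start, gases in stacks.items():
    print(f"{start}-{start + 20} m:", gases)
```

Each bin is one bar and each gas is one coloured segment of that bar. 
Summing dive duration instead of counting dives would give the other 
variant mentioned above.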

>> There is another type of graph that is potentially extremely useful : 
>> introduce a *third* variable to the graph. For instance, in the case 
>> of the second blue bargraph towards the start of this message 
>> (No.dives vs depth) one could ask what the distribution of a third 
>> category is. For instance, how long did I use various dive suits at 
>> different depths? Or how many dives did I use different dive suits at 
>> different depths? This is the above barchart, divided into the values 
>> for different dive suits. This is also useful to analyse variables 
>> used as tags, e.g. the use of air/nitrox/trimix during dives, the 
>> number of boat/shore dives, the number of training dives compared to 
>> fun dives, the number of dives using different dive modes as a 
>> function of depth, dive duration, temperature, or whatever white 
>> variable has been selected.
>>
>> Since the horizontal axis corresponds to a white (continuous) 
>> variable, one would need to specify a granularity/increment. The UI 
>> cost for this would be an additional dropdown list/combobox to select 
>> the appropriate categorical variable to appropriately subdivide each 
>> bar of the graph (Dirk's Granularity??). This diagram handles cases 
>> of graphs with a blue(categorical) variable plotted against another 
>> blue (categorical) variable, although a third variable needs to be 
>> specified to form the unit of measurement (e.g. dive duration in the 
>> above graph). This can probably be selected using Dirk's Granularity 
>> Combobox in his proposal.
>>
>
> We are of course going to get ourselves into trouble here. Yes, the 
> addition of a third variable and the use of color for that is really 
> cool. But that does add significant complexity. And BTW, I'm sure the 
> next person is going to say "I want to add that to a scatter plot"... 
> e.g., in the duration over depth chart, you could use dot color to 
> indicate the suit, or you could pick some tags to be shown (think 
> nitrox, air, trimix). Which are all wonderfully cool graphs, but the 
> complexity is quickly going up.
>
> How about a fourth axis? Think about this scatter plot where you track 
> suit over depth/duration - would you want to add different dot shapes 
> to indicate your dive type?
>
> I am half joking, but we all know that someone will ask for this as 
> soon as we release the new statistics :-)

Let's do what is realistically doable now without making Tomaz, or 
whoever writes the relevant code, throw up his/her arms in desperation.

>
>> This handles basically all the possibilities of the different 
>> combinations of Yellow, White and Blue variables in Dirk's proposal. 
>> There are fundamentally FOUR types of graphs that would be required, 
>> forming the basis of visual presentation of the Statistics tab.
>>
>
> I'd argue that's six (but that's because of my potential additions):
> two values:
> - Bar chart
> - Dot graph
> - Scatter plot
> three values:
> - Stacked bar chart
> - Dot graph with colors
> - Scatter plot with colors.
>
>> I hope this appears somewhat useful in the present discussion.
>>
>
> Yes, I think this was super useful.
>
If 6 graphs is what it comes down to, then I have achieved my aim with 
this email. !!!!!!!!!
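
To summarise how I read the thread, the choice among those six graphs 
could be sketched in Python as below. The names Kind and choose_chart 
are hypothetical, and the mapping is only my reading of the discussion, 
not an agreed design.

```python
from enum import Enum

class Kind(Enum):
    TOTAL = "yellow"        # e.g. number of dives, total time
    CATEGORICAL = "blue"    # e.g. suit, tags, trip
    CONTINUOUS = "white"    # e.g. depth, duration, SAC

def choose_chart(first, second, colour_by=None):
    """Pick a graph type from the kinds of the two variables plotted.

    colour_by is the kind of an optional third variable shown by colour.
    """
    pair = frozenset((first, second))
    if pair == {Kind.CATEGORICAL}:
        # Blue/Blue: a total supplies the unit of measurement,
        # so the result is a stacked bar chart either way.
        return "stacked bar chart"
    if pair == {Kind.CONTINUOUS}:
        base = "scatter plot"
    elif pair == {Kind.CONTINUOUS, Kind.CATEGORICAL}:
        base = "dot graph"
    elif Kind.TOTAL in pair and len(pair) == 2:
        # Yellow/White additionally needs a granularity for binning.
        base = "bar chart"
    else:
        raise ValueError("no graph type discussed for this combination")
    if colour_by is None:
        return base
    if colour_by is not Kind.CATEGORICAL:
        raise ValueError("only a categorical third variable was discussed")
    return "stacked bar chart" if base == "bar chart" else f"{base} with colors"

print(choose_chart(Kind.TOTAL, Kind.CATEGORICAL))
print(choose_chart(Kind.CONTINUOUS, Kind.CONTINUOUS, Kind.CATEGORICAL))
```

The order in which the two variables are given does not matter here, in 
line with the White/Blue and Blue/White case above.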


> There are a few questions in my response that I think would benefit 
> from more thought / responses.
>
> And then maybe some more exploration of how things would work to
> - pick which categories to show (for some the number is likely small 
> enough, but for tags, full text, and possibly trip you'd want to be 
> able to select those that you want to plot)
> - pick the grouping of date
> - enter the granularity for "near continuous" variables (depth, 
> duration, temperature)

As indicated above one should distinguish between the inherent 
continuous nature of some variables (e.g. depth, duration) from the 
memory representation (mm and sec, respectively) and distinguish that 
from the inherent discrete characteristics of other *related* units such 
as month or year. You would write today's date as 2020-05-16, NOT as 
2020-05-16.376. A day or month is discrete, a floating point representation makes 
no sense. On the other hand Dive duration is continuous, even if it is 
recorded only at the resolution of seconds. These distinctions strongly 
affect the way that they are represented in data summaries. Does this 
sound logical???

Thank you for taking the time to slog through this. There are 
undoubtedly many things I have not thought about.
> Once we have sketches / proposals for those, I think we have the UI 
> side handled and can design the backend data model that would be able 
> to provide those data.
>
> And then start implementing and iterating :-)
>
> Thanks
>
> /D


