<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
</head>
<body>
<div class="moz-cite-prefix">On 2020/05/16 18:22, Dirk Hohndel
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:6C3524BC-FCA2-4FCC-8761-9D7EEE8DB0CD@hohndel.org">
Hi Willem,
<div class=""><br class="">
</div>
<div class="">That's a great detailed writeup. I'll comment in
line and try to keep all of the pictures in place!<br class="">
<div><br class="">
<blockquote type="cite" class="">
<div class="">On May 16, 2020, at 7:13 AM, Willem Ferguson
&lt;<a href="mailto:willemferguson@zoology.up.ac.za"
class="" moz-do-not-send="true">willemferguson@zoology.up.ac.za</a>&gt;
wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div class="">
<p class="">This is just an attempt to enumerate how
many types of graphs one is likely to need, given the
discussion until now. As a basis I use Dirk's Proposal
for selecting an appropriate graph.</p>
<p class=""><img alt="" class="" apple-inline="yes"
id="B8AF1A02-F723-414A-BB96-2AD3B3E19E47"
src="cid:part2.C94A5E9F.CACA6A4A@zoology.up.ac.za"></p>
<p class="">In the above diagram, the different types of
variables have different colours.</p>
<p class="">1) The yellow ones are just totals (Total #
dives, Total no. minutes/hours) that are unlikely to
have any associated minimum or maximum.</p>
<p class="">2) The blue ones are variables defined in
terms of categories. Date: day, week, month, etc.;
Trip: trip locality; Suit: type of dive suit; Tags:
tag text. There is no dive suit value in between
wetsuit and semidry suit because they are two distinct
categories. <br class="">
</p>
<p class="">3) The white ones are continuous numeric
variables. Duration can potentially have any arbitrary
number of minutes or hours. The same goes for
Max_depth, Min_temp, SAC, and all the other white
ones. Between any two depths there are
innumerable intermediate depths, so depth is simply
a value along a continuous scale.<br class="">
</p>
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>Technically... we store depth in mm (and round when
displaying)... yeah, you know where I'm going here. I'm
being a childish stickler for details that these are of
course discrete values, but for the purpose of this
discussion you are correct. They have "nearly continuous"
values and in that are distinctly different from the blue
categories.</div>
<div><br class="">
</div>
<div>I have two questions: </div>
<div>- why is 'dive type' not blue? That's likely just an
oversight, right? It's not like depth, duration or temp that
are "nearly continuous".</div>
</div>
</div>
</blockquote>
<p>Apologies. Yes, an oversight. <br>
</p>
<p>Sorry, I did not get your 2nd question.<br>
</p>
<blockquote type="cite"
cite="mid:6C3524BC-FCA2-4FCC-8761-9D7EEE8DB0CD@hohndel.org">
<div class="">
<div>
<div>- and what about date. that one is kinda weird - but I'd
say for someone with 8000 dives since 1983, date is "more
continuous" than temperature (we dive in water typically
between about -2C and +40C, so even at 0.01 degree intervals
that's only ~4200 values - and 0.01deg C is way beyond the
measurement accuracy here)... I think 'by date' is either
white or it is 'special'.</div>
</div>
</div>
</blockquote>
<p>One needs to make a strong distinction between the way the data
are stored in Subsurface memory and the conceptual way that these
types of data are treated as statistical objects. It is as you say:
"They have 'nearly continuous' values and in that are distinctly
different from the blue categories." The fact that depth is stored
in mm reflects the limited accuracy of the equipment; it does not
make depth an integer. :-) If storage space and speed of data
manipulation were no issue, Subsurface would probably have stored
depth as a floating point value.
As for date, a day, week, month or year is conceptually an
integer. We would not talk of week 3.576. *Time* is not an
integer, it is continuous. But when we talk of a day, a week or a
month, we do not use a concept of continuous time, we use an
integer. I totally understand your point about time, but this is
not useful if I want to compare my dives in 2018 with those in
2019, or summer (months 10 to 3) with winter (months 4
to 9). Hope this makes sense? Look at the printout of the annual
summary. Years and dates are used as discrete numbers, as they
should be. On the other hand, the average dive duration for Dec
2019 was 55 min and 14 sec. This is a continuous-number
measurement. No one tries to argue that the duration is exactly
55:14.00, accurate to the whole second. The calculation
just rounds the duration to the nearest sec. It does not mean that
dive duration is calculated in whole seconds. If we wished to, we could
represent it to 10 decimals but it would not be informative.<br>
</p>
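<p>To illustrate the difference between storage and display, here is a
minimal Python sketch. The function names and the one-decimal depth
display are my own assumptions for illustration, not Subsurface code.
The stored values are integers (mm, sec), the average is a
non-integer, and only the displayed text is rounded.</p>
<pre>
```python
def format_depth(depth_mm: int) -> str:
    """Show a depth stored in whole millimetres as metres with one decimal."""
    return f"{depth_mm / 1000:.1f} m"


def format_duration(duration_s: float) -> str:
    """Round a duration in seconds to the nearest second for display."""
    minutes, seconds = divmod(round(duration_s), 60)
    return f"{minutes} min {seconds} sec"


# Hypothetical dive durations, stored as whole seconds.
durations_s = [3300, 3320, 3323]
# The average is a continuous quantity; it is not a whole number of seconds.
average_s = sum(durations_s) / len(durations_s)

print(format_duration(average_s))
print(format_depth(18288))
```
</pre>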
<blockquote type="cite"
cite="mid:6C3524BC-FCA2-4FCC-8761-9D7EEE8DB0CD@hohndel.org">
<div class="">
<div><img src="cid:part3.118C2D00.886070C1@zoology.up.ac.za"
alt="">
<blockquote type="cite" class="">
<div class="">
<div class="">
<p class=""> </p>
<p class="">The type of graph that best depicts a
relationship between two types of variables depends on
the colour that each of the variables above has. I
need to emphasize that the graphs below are totally
open to discussion. The purpose here is to assess how
many types of graphical elements one would need for a
basic statistics tab in Subsurface.<br class="">
</p>
<p class="">Plotting a yellow variable against a blue
variable is probably best represented by a simplistic
bar graph like:</p>
<p class=""><img alt="" class="" apple-inline="yes"
id="BE088A1B-5DB3-4478-85E9-7826514BED36"
src="cid:part4.173AA067.B1273741@zoology.up.ac.za"></p>
<p class="">There are no min and max values to indicate.
The different suit categories are indicated along the
horizontal axis. There is no need to specify a degree
of "granularity" or increment along the horizontal
axis and no min or max values are involved.<br
class="">
</p>
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>Yes, I think that's reasonable. People like me will run
into trouble here because of the way I name my suits. "dry,
Whites Fusion with Weezle", "wet, 3mm, with 2/1 hooded vest"</div>
</div>
</div>
</blockquote>
<p>Myself included, same thing.<br>
</p>
<p><br>
</p>
<blockquote type="cite"
cite="mid:6C3524BC-FCA2-4FCC-8761-9D7EEE8DB0CD@hohndel.org">
<div class="">
<div><br class="">
<blockquote type="cite" class="">
<div class="">
<div class="">
<p class=""> </p>
<p class="">If one plots a yellow variable against a
white (continuous) variable, then a
granularity/increment needs to be specified. In the
image below, an increment of 20m was used.</p>
<p class=""><img alt="" class="" apple-inline="yes"
id="6BF535B6-F135-4402-AD1B-5D1A6BE5946F"
src="cid:part5.71103FCC.ED744AFB@zoology.up.ac.za"></p>
<p class="">Basically the same type of graph as the one
used above. No need for min/max values. </p>
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>The challenge here lies in the ability to come up with
"clever" groupings. To a human this is likely somewhat
obvious, but you need to invest some thought / algorithm /
heuristics into this to get this right. Especially if we
include 'date' as a white variable (as I think we should).</div>
<div>Clearly something that is "solvable", just not
necessarily straightforward. <br>
</div>
</div>
</div>
</blockquote>
<p>I agree completely. But this is an issue of implementation, not
an issue for putting together a rough framework of visual
presentation.<br>
</p>
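<p>For what it is worth, the simplest form of grouping (a fixed
increment, as with the 20m bins above) could be sketched as follows
in Python. This is only an illustration under my own assumptions: the
function name and sample depths are made up, and it does nothing
"clever" about choosing the increment.</p>
<pre>
```python
from collections import Counter


def bin_counts(values, increment):
    """Count values per bin of width `increment`, keyed by each bin's lower edge."""
    if not increment > 0:
        raise ValueError("increment must be positive")
    # Floor division assigns every value to the bin that starts at or below it.
    counts = Counter(int(v // increment) * increment for v in values)
    return dict(sorted(counts.items()))


# Hypothetical maximum depths in metres.
max_depths_m = [5.2, 18.0, 19.9, 20.0, 31.5, 44.0, 61.3]
print(bin_counts(max_depths_m, 20))
```
</pre>
<p>Each key is the lower edge of a bin, so a dive to exactly 20m falls
in the 20-40m bin.</p>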
<blockquote type="cite"
cite="mid:6C3524BC-FCA2-4FCC-8761-9D7EEE8DB0CD@hohndel.org">
<div class="">
<div><br class="">
<blockquote type="cite" class="">
<div class="">
<div class="">
<p class="">Of course, as was well argued previously,
the bar graph can be horizontal in the case of long
names on the horizontal axis, e.g. dive site names:</p>
<p class=""><img alt="" class="" apple-inline="yes"
id="80375CB5-F42E-4BB3-A58F-A57BABF4F688"
src="cid:part6.ADC48F17.035B7EE5@zoology.up.ac.za"></p>
<p class="">While I personally have no qualms with
horizontal diagrams where needed, I would argue it is
a regression to default to horizontal orientations for
all bar graphs.</p>
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>I think for blue variables horizontal might be easier to
get right. But I'm willing to have us consider this a "phase
2" thing and start with just vertical and see how much
trouble we run into there...</div>
</div>
</div>
</blockquote>
<p>Clearly there are cases that would necessitate horizontal. A
simple rule that I would consider is, "if any label is longer than
8 characters, use horizontal, otherwise use vertical". Again my
formulation is simplistic but I think you would get my drift.<br>
</p>
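<p>As a sketch, that rule is a one-liner. The function name and the
default threshold parameter below are just my assumptions for
illustration:</p>
<pre>
```python
def bar_orientation(labels, max_label_length=8):
    """Return 'horizontal' if any label exceeds max_label_length, else 'vertical'."""
    if any(len(label) > max_label_length for label in labels):
        return "horizontal"
    return "vertical"


print(bar_orientation(["wetsuit", "semidry", "drysuit"]))
print(bar_orientation(["dry, Whites Fusion with Weezle", "wet, 3mm"]))
```
</pre>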
<blockquote type="cite"
cite="mid:6C3524BC-FCA2-4FCC-8761-9D7EEE8DB0CD@hohndel.org">
<div class="">
<div><br class="">
<blockquote type="cite" class="">
<div class="">
<div class="">
<p class="">The above graphs deal with yellow variables
in Dirk's proposal. Now about the other categories.
Plotting a White variable against a Blue variable has
several options, including box and whisker plots that
are not popular in this discussion. My proposal two
days ago was something like this and there was some
discussion around it:</p>
<p class=""><br class="">
</p>
<p class=""><img alt="" class="" apple-inline="yes"
id="849E9EA8-AD6A-4E4A-B79B-646FDA5DD21E"
src="cid:part7.19127227.77A562EB@zoology.up.ac.za"></p>
<p class="">Here SAC is a white (continuous) variable
and Suit is a blue (categorical) variable. This calls for a graphical
element that is likely to differ sharply from the bar
graphs used above. Here again, because the horizontal
axis comprises categories, there is no need to specify
a granularity/increment. For lack of a better name
(there is actually an esoteric statistical name for
this graph) I call this a dot graph.</p>
</div>
</div>
</blockquote>
<div><br class="">
</div>
The more I look at this, the more I like it.</div>
<div><br class="">
<blockquote type="cite" class="">
<div class="">
<div class="">
<p class="">What about plotting a Blue (categorical)
variable against a White (continuous) variable? For
our case the order in which the blue and white
variables are selected probably does not matter and
the dot graph shown above (or some derivative of it)
should suffice. <br class="">
</p>
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>I had never thought of blue categories being plotted
against the white categories. What's the meaning of a plot
of my tags over the depth? Or trips over temperature? I'm
not saying that this is wrong or not useful, I'm just trying
to understand how that would look and what information the
user would get from it?</div>
</div>
</div>
</blockquote>
<p>Just Ferguson not being clear in communication. The bane of my
life. Your next comment shows you got my basic idea.<br>
</p>
<blockquote type="cite"
cite="mid:6C3524BC-FCA2-4FCC-8761-9D7EEE8DB0CD@hohndel.org">
<div class="">
<div><br class="">
<blockquote type="cite" class="">
<div class="">
<div class="">
<p class=""> </p>
<p class="">What if a white (continuous) variable is
plotted against another white variable (e.g. dive
duration against dive depth)? The most appropriate
type of graph is probably a scatter diagram:</p>
<p class=""><img alt="" class="" apple-inline="yes"
id="13405809-25EF-4159-9A69-051527FDD233"
src="cid:part8.BE82B45E.95ADE9FF@zoology.up.ac.za"></p>
<p class="">The raw data are indicated on the graph.
There is no need for specifying a granularity value
because there is no grouping of values along the
horizontal or vertical axes. If a clear relationship
between the two variables exists, it is
visible on the graph, as in this case.</p>
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>Yes, this seems like the obvious choice here, much better
than creating artificial columns by intervals of depth.</div>
<div><br class="">
</div>
<blockquote type="cite" class="">
<div class="">
<div class="">
<p class="">We have now dealt with <br class="">
</p>
<p class="">Yellow/white</p>
<p class="">Yellow/Blue</p>
<p class="">White/Blue and Blue/White<br class="">
</p>
<p class="">White/White</p>
<p class="">What about Blue/Blue?<br class="">
</p>
</div>
</div>
</blockquote>
<div><br class="">
</div>
Same question as above. What does a graph of, say, suit over
tags look like? What does it tell me?<br class="">
</div>
</div>
</blockquote>
<p>Depends how the tags are used by the diver. I have standard tags
that denote the primary gas type (air, nitrox, trimix) and dive
equipment (recreational, sidemount, twinset, rebreather). So I
would like to summarise my dives on the different equipment by the
type of suit I use. The 3rd variable could be "amount of dive time
spent with each combination of equipment and suit" or "number of
dives with each of such combinations".</p>
<p>The case of stacked graphs becomes more interesting when one
starts graphing a categorical variable such as gas against depth
or SAC or dive duration. For instance, one could draw a stacked histogram (like
the one below) of gas (air, nitrox, trimix) against depth, using
either number of dives or total dive duration as a 3rd variable
and a grouped continuous variable on the horizontal axis. This
would allow one to investigate why he used air on some pretty deep
dives, or why she ended up using trimix for some pretty shallow
dives.<br>
</p>
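<p>The data behind such a stacked histogram could be put together
roughly like this. It is a Python sketch under my own assumptions: the
function name, the (depth, gas) tuples and the 20m increment are
invented for illustration, and number of dives is used as the 3rd
variable.</p>
<pre>
```python
from collections import defaultdict


def stacked_counts(dives, increment_m):
    """Map each depth bin (lower edge in m) to a per-gas count of dives."""
    stacks = defaultdict(lambda: defaultdict(int))
    for depth_m, gas in dives:
        bin_start = int(depth_m // increment_m) * increment_m
        stacks[bin_start][gas] += 1
    # Convert to plain dicts, with the bins in ascending order.
    return {start: dict(stacks[start]) for start in sorted(stacks)}


# Hypothetical dives: (maximum depth in m, primary gas tag).
dives = [(12.0, "air"), (18.5, "nitrox"), (25.0, "nitrox"),
         (38.0, "air"), (45.0, "trimix"), (15.0, "trimix")]
print(stacked_counts(dives, 20))
```
</pre>
<p>Each bin becomes one bar and each gas one coloured segment of that
bar. Summing dive durations instead of adding 1 would give total dive
time as the 3rd variable.</p>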
<blockquote type="cite"
cite="mid:6C3524BC-FCA2-4FCC-8761-9D7EEE8DB0CD@hohndel.org">
<div class="">
<div>
<blockquote type="cite" class="">
<div class="">
<div class="">
<p class=""> </p>
<p class="">There is another type of graph that is
potentially extremely useful : introduce a *third*
variable to the graph. For instance, in the case of
the second blue bar graph towards the start of this
message (No. dives vs depth) one could ask what the
distribution of a third category is. For instance, how
long did I use various dive suits at different depths?
Or on how many dives did I use different dive suits at
different depths? This is the above bar chart, divided
into the values for different dive suits. This is also
useful to analyse variables used as tags, e.g. the use
of air/nitrox/trimix during dives, the number of
boat/shore dives, the number of training dives
compared to fun dives, the number of dives using
different dive modes as a function of depth, dive
duration, temperature, or whatever white variable has
been selected.<br class="">
</p>
<p class=""><img alt="" class="" apple-inline="yes"
id="6EBBA3AF-B61D-4B5F-84D3-57027468C870"
src="cid:part9.697259B5.424939C8@zoology.up.ac.za"></p>
<p class="">Since the horizontal axis corresponds to a
white (continuous) variable, one would need to specify
a granularity/increment. The UI cost for this would be
an additional dropdown list/combobox to select the
appropriate categorical variable to
subdivide each bar of the graph (Dirk's
Granularity??). This diagram handles cases of graphs
with a blue (categorical) variable plotted against
another blue (categorical) variable, although a third
variable needs to be specified to form the unit of
measurement (e.g. dive duration in the above graph).
This can probably be selected using Dirk's Granularity
Combobox in his proposal.<br class="">
</p>
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>We are of course going to get ourselves into trouble
here. Yes, the addition of a third variable and the use of
color for that is really cool. But that does add significant
complexity. And BTW, I'm sure the next person is going to
say "I want to add that to a scatter plot"... e.g., in the
duration over depth chart, you could use dot color to
indicate the suit, or you could pick some tags to be shown
(think nitrox, air, trimix). Which are all wonderfully cool
graphs, but the complexity is quickly going up.</div>
<div><br class="">
</div>
<div>How about a fourth axis? Think about this scatter plot
where you track suit over depth/duration - would you want to
add different dot shapes to indicate your dive type?</div>
<div><br class="">
</div>
<div>I am half joking, but we all know that someone will ask
for this as soon as we release the new statistics :-)</div>
</div>
</div>
</blockquote>
<p>Let's do what is realistically doable now, without making Tomaz
or whoever writes the relevant code throw up his/her arms in
despair.<br>
</p>
<blockquote type="cite"
cite="mid:6C3524BC-FCA2-4FCC-8761-9D7EEE8DB0CD@hohndel.org">
<div class="">
<div><br class="">
<blockquote type="cite" class="">
<div class="">
<div class="">
<p class=""> </p>
<p class="">This handles basically all the possibilities
of the different combinations of Yellow, White and
Blue variables in Dirk's proposal. There are
fundamentally FOUR types of graphs that would be
required, forming the basis of visual presentation of
the Statistics tab.<br class="">
</p>
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>I'd argue that's six (but that's because of my potential
additions):</div>
<div>two values:</div>
<div>- Bar chart</div>
<div>- Dot graph</div>
<div>- Scatter plot</div>
<div>three values:</div>
<div>- Stacked bar chart</div>
<div>- Dot graph with colors</div>
<div>- Scatter plot with colors.</div>
<br class="">
<blockquote type="cite" class="">
<div class="">
<div class="">
<p class=""> </p>
<p class="">I hope this appears somewhat useful in the
present discussion.</p>
</div>
</div>
</blockquote>
<br class="">
</div>
<div>Yes, I think this was super useful.</div>
<br class="">
</div>
</blockquote>
<p>If 6 graphs is what it comes down to, then I have achieved my aim
with this email!</p>
<p><br>
</p>
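<p>To summarise the combinations discussed in this thread, here is a
small Python sketch that maps two variable types (and an optional
third variable) to one of the six graph types. The names "total",
"categorical" and "continuous" stand for yellow, blue and white. The
function and table names are hypothetical, and the mapping is only my
reading of the discussion so far, not a settled design.</p>
<pre>
```python
# Graph type for each pair of variable kinds (order does not matter).
TWO_VALUE = {
    frozenset(["total", "categorical"]): "bar chart",
    frozenset(["total", "continuous"]): "bar chart",  # needs a bin increment
    frozenset(["continuous", "categorical"]): "dot graph",
    frozenset(["continuous"]): "scatter plot",
}

# Variant used when a third variable is shown, e.g. by color.
THREE_VALUE = {
    "bar chart": "stacked bar chart",
    "dot graph": "dot graph with colors",
    "scatter plot": "scatter plot with colors",
}


def choose_chart(kind_a, kind_b, third_variable=False):
    """Pick a graph type for two variable kinds and an optional third variable."""
    pair = frozenset([kind_a, kind_b])
    if pair == frozenset(["categorical"]):
        # Blue/Blue always needs a third variable as the unit of measurement.
        return "stacked bar chart"
    base = TWO_VALUE[pair]
    return THREE_VALUE[base] if third_variable else base


print(choose_chart("total", "categorical"))
print(choose_chart("continuous", "continuous", third_variable=True))
```
</pre>
<p>Combinations not discussed here (e.g. two totals) simply raise a
KeyError in this sketch.</p>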
<blockquote type="cite"
cite="mid:6C3524BC-FCA2-4FCC-8761-9D7EEE8DB0CD@hohndel.org">
<div class="">There are a few questions in my response that I
think would benefit from more thought / responses. </div>
<div class=""><br class="">
</div>
<div class="">And then maybe some more exploration of how things
would work to</div>
<div class="">- pick which categories to show (for some the number
is likely small enough, but for tags, full text, and possibly
trip you'd want to be able to select those that you want to plot)</div>
<div class="">- pick the grouping of date</div>
<div class="">- enter the granularity for "near continuous"
variables (depth, duration, temperature)</div>
</blockquote>
<p>As indicated above, one should distinguish the inherent
continuous nature of some variables (e.g. depth, duration) from
their memory representation (mm and sec, respectively), and
distinguish that from the inherent discrete characteristics of
other *related* units such as month or year. You write
today's date as 2020-05-16, NOT as 2020-05-16.376. Month is
discrete; a floating point representation makes no sense. On the
other hand, dive duration is continuous, even if it is recorded
only at the resolution of seconds. These distinctions strongly
affect the way that they are represented in data summaries. Does
this sound logical?<br>
</p>
Thank you for taking the time to slog through this. There are
undoubtedly many things I have not thought about.<br>
<blockquote type="cite"
cite="mid:6C3524BC-FCA2-4FCC-8761-9D7EEE8DB0CD@hohndel.org">
<div class="">Once we have sketches / proposals for those, I think
we have the UI side handled and can design the backend data
model that would be able to provide those data.</div>
<div class=""><br class="">
</div>
<div class="">And then start implementing and iterating :-)</div>
<div class=""><br class="">
</div>
<div class="">Thanks</div>
<div class=""><br class="">
</div>
<div class="">/D</div>
</blockquote>
<p><br>
</p>
</body>
</html>