[PATCH 05/12] git repository reading: start reading the actual file contents

Linus Torvalds torvalds at linux-foundation.org
Sun Mar 9 20:31:55 PDT 2014


On Sun, Mar 9, 2014 at 8:10 PM, Dirk Hohndel <dirk at hohndel.org> wrote:
>
> But then I thought... hey, we should just start a worker thread that
> downloads the dive computers once the other data is loaded. Yes, if
> someone starts Subsurface and immediately selects all dives or opens the
> yearly statistics, things will come to a halt - but for 99% of the
> "normal" use cases this would be a huge win.
>
> The one thing to make sure of is that we first load all the dives (for
> the dive list) and then load the dive computers for the first dive we
> display, and then load all the other dive computers in the background.
>
> Does that make sense?

That would probably work fine.
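
Just to make the ordering concrete, Dirk's scheme would be something
like this pthread sketch (all the dive-table helpers here are made-up
names, not existing subsurface functions):

    #include <pthread.h>

    /* Hypothetical stand-ins for the real dive list accessors and
     * the per-dive loader. */
    struct dive;
    extern int dive_table_count(void);
    extern struct dive *get_dive(int idx);
    extern void load_divecomputer_data(struct dive *dive);

    static void *load_dcs_in_background(void *arg)
    {
            struct dive *displayed = arg;
            int i;

            /* The displayed dive was already loaded synchronously
             * before this thread was started; fill in the rest. */
            for (i = 0; i < dive_table_count(); i++) {
                    struct dive *dive = get_dive(i);
                    if (dive != displayed)
                            load_divecomputer_data(dive);
            }
            return NULL;
    }

    /* Call this once the dive list itself has been loaded */
    void start_background_dc_load(struct dive *displayed)
    {
            pthread_t tid;

            load_divecomputer_data(displayed); /* first dive, now */
            pthread_create(&tid, NULL, load_dcs_in_background, displayed);
            pthread_detach(tid);
    }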

That said, I'm actually happy with how fast the git loader is. I was
expecting the parsing itself to be much faster than the XML parsing
(and it is), but I was a bit nervous about the git overhead. It
doesn't seem to be bad at all.

So I did end up putting the dive computer data in git objects of
their own so that we can try to load them asynchronously (or
synchronously, just on demand) later, but I would suggest that be a
long-term plan rather than anything in the near future.
Performance-wise it doesn't seem to be an immediate problem (I did do
some timing, and quite frankly, all the shared library loading
overhead dominates startup time, as far as my profiles could tell).

But I think it is nice to know that we *could* load stuff
incrementally if it ever becomes a problem.

Another thing I'd like to eventually think about is allowing people
to put other random data in the git branch. Right now that "works" in
the sense that subsurface will ignore stuff it doesn't know about,
but when we then save changes we currently always start from an empty
tree.

But down the line, we could have things like "Pictures" directories
that subsurface leaves alone, but that can be used to carry data
along with the dive data quite naturally.

And one of the many nice things about the git object model is
that if we do subdirectories like that, they can be arbitrarily large,
and it won't affect reading *or* writing speed - the directories that
don't get changed simply don't get rewritten, we just carry the SHA1
pointers around. So it scales well.
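
With libgit2 this falls out almost for free: when building the new
tree, an unchanged subdirectory is just one treebuilder insert of the
old OID. A minimal sketch, error handling elided (this assumes a
libgit2 with the git_treebuilder_new() spelling; older versions call
it git_treebuilder_create(), with slightly different arguments):

    #include <git2.h>

    /*
     * Carry an unchanged subdirectory (say "Pictures") into a new
     * tree by its existing OID, without rewriting any of its
     * contents. "pictures_oid" is assumed to have been remembered
     * from the tree we are rewriting.
     */
    int carry_subtree(git_repository *repo, const git_oid *pictures_oid,
                      git_oid *new_tree_oid)
    {
            git_treebuilder *bld;

            git_treebuilder_new(&bld, repo, NULL);
            /* ... insert the dive entries that actually changed ... */
            git_treebuilder_insert(NULL, bld, "Pictures", pictures_oid,
                                   GIT_FILEMODE_TREE);
            git_treebuilder_write(new_tree_oid, bld);
            git_treebuilder_free(bld);
            return 0;
    }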

But the series I sent out does none of that. It kind of prepares for
reading dive computer data separately by consciously *not* reading
all the blob data in one central place, instead leaving the reading
of the data to whatever wants to parse it - so that if some
sub-parser decides "I can do this later", it would work fine.

One note: the current code also always saves a new commit whenever
you save - even if the tree doesn't actually change. So it acts a bit
like "git commit --allow-empty" in that it will create a new commit
with an empty diff to the previous commit. Eventually I was planning
on adding code that says "if the data you just wrote is the same,
don't create a new commit" (trivial: compare the SHA1 of the tree
object about to be committed with that of the parent commit's tree),
but for testing it was actually nice to see the explicit commit with
the exact same data.
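
The check would be something like this one-liner with libgit2
(sketch):

    #include <git2.h>

    /* Nothing changed if the tree we just wrote has the same OID as
     * the parent commit's tree. */
    static int tree_unchanged(git_commit *parent, const git_oid *new_tree)
    {
            return !git_oid_cmp(git_commit_tree_id(parent), new_tree);
    }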

So there are a few short-term things worth doing, and I'm sure there
are some bugs in all that new code, but on the whole I think it's in
fairly good shape.

Oh, and one thing I really want to do is get rid of some of those "if
(!strcmp(..))" things and add another level of arrays of keys and
function pointers, but that's a cleanup. The main parser already does
a binary search so that it doesn't have to do a linear walk through
string compares; it's just a few special cases that still do repeated
string compares the stupid way..
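
i.e. turn the remaining strcmp() chains into the same kind of sorted
table the main parser already uses. A sketch (the handler names here
are just examples):

    #include <stdlib.h>
    #include <string.h>

    /* A sorted array of keys and function pointers, searched with
     * bsearch() instead of a chain of strcmp() calls. */
    struct keyword {
            const char *name;
            void (*fn)(const char *value);
    };

    static void parse_model(const char *value) { /* ... */ }
    static void parse_serial(const char *value) { /* ... */ }

    /* Must stay sorted by name for bsearch() */
    static const struct keyword keywords[] = {
            { "model", parse_model },
            { "serial", parse_serial },
    };

    static int cmp_keyword(const void *key, const void *elem)
    {
            return strcmp(key, ((const struct keyword *)elem)->name);
    }

    static int dispatch(const char *name, const char *value)
    {
            const struct keyword *kw = bsearch(name, keywords,
                            sizeof(keywords) / sizeof(keywords[0]),
                            sizeof(keywords[0]), cmp_keyword);
            if (!kw)
                    return -1;
            kw->fn(value);
            return 0;
    }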

But I'll likely not get back to this for a few days, although I'll
respond to bug reports.

                Linus

