Find moved images: match path instead of hashes?
Berthold Stoeger
bstoeger at mail.tuwien.ac.at
Sat Jun 9 04:53:31 PDT 2018
Dear all,
In a previous mail I noted that the current file-hashing of images is a
disaster when it comes to video support. Moreover, the complexity of the
hashing-code caused me quite some headache and the current implementation is
buggy (different hashes for original and local filename). Instead of coming up
with a scheme for videos (e.g. hash 1 MB in the middle of the file if it
exceeds a certain size) and fixing those bugs, I wonder if the whole hashing
thing is necessary at all.
AFAIK, the reason for the hashes is twofold:
1) Notice when images have changed to recalculate their thumbnails.
2) Use it to find moved images, i.e. if the log is transported to a different
computer.
The first case never worked, since in routine operation the images are not
rehashed. This functionality is now instead provided by PR#1336.
The second case is questionable, because users might have edited their files
without changing the filename. To me this seems to be a more likely case than
pictures getting renamed. The only remaining purpose of the hash therefore
seems to be protecting against equally named images. This can be handled by
checking not only the filename, but also the names of the parent directories.
I implemented a proof-of-concept in PR#1349. In principle, it does two things:
1) Replace all the hash-to-filename associations by a simple
canonical_filename->local_filename associative array.
2) Find moved images based on file-paths. The way this works is by scoring the
match between the file-paths: the higher the number of matching path-items
(starting from the filename up to the first miss), the higher the score.
Note that a significant part of the PR is actually the conversion of the old
associations to the simplified ones. Apart from this, the final code is
distinctly less complex than the original one. It can/should of course still
be improved. For example, we could improve the heuristics by remembering image
meta-data. And certainly, the user should be presented with the list of new
associations before they are actually applied.
But before I continue to work on this (like probably all of us, I have only
limited time), a decision needs to be made on whether this is the correct
path forward and which features are must-haves (at least some sort of user
interaction, methinks).
If we choose to go this way, there is at least one additional implementation
detail to discuss: To emulate the old behavior, *all* pictures that we ever
encountered are matched. If a log is opened, all pictures in the log are
remembered in the canonical_filename->local_filename associative array. I
wonder if it would not be more sensible to match only the pictures of the
currently opened log (or even only selected dives?). Thus we would only have
to remember those images where canonical and local filenames differ.
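A minimal sketch of that last point, again in Python and with hypothetical
names (the real associations live in the C++ code, not in a Python dict):
only entries whose canonical and local filenames differ would be kept.

```python
def prune_associations(associations: dict[str, str]) -> dict[str, str]:
    """Keep only entries where the local filename differs from the
    canonical one; identical pairs carry no extra information."""
    return {
        canonical: local
        for canonical, local in associations.items()
        if canonical != local
    }


if __name__ == "__main__":
    remembered = {
        "/home/alice/dives/img1.jpg": "/home/alice/dives/img1.jpg",
        "/home/alice/dives/img2.jpg": "/mnt/backup/dives/img2.jpg",
    }
    print(prune_associations(remembered))
```

Any image without an entry would then simply be looked up under its
canonical filename.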
Thanks,
Berthold