Find moved images: match path instead of hashes?

Sat Jun 9 04:53:31 PDT 2018

Dear all,

In a previous mail I noted that the current file-hashing of images is a 
disaster when it comes to video support. Moreover, the complexity of the 
hashing code gave me quite a headache, and the current implementation is 
buggy (different hashes for original and local filename). Instead of coming up 
with a scheme for videos (e.g. hash 1 MB in the middle of the file if it 
exceeds a certain size) and fixing those bugs, I wonder if the whole hashing 
thing is necessary at all.

AFAIK, the reason for the hashes is twofold:
1) Notice when images have changed to recalculate their thumbnails.
2) Use it to find moved images, i.e. if the log is transported to a different 
computer.

The first case never worked, since in routine operation the images are not 
rehashed. This functionality is now instead provided by PR#1336.

The second case is questionable, because users might have edited their files 
without changing the filename. To me this seems to be a more likely case than 
pictures getting renamed. The only purpose of the hash therefore seems to be 
protection against identically named images. That problem can be circumvented 
by checking not only the filename, but also the names of the parent directories.

I implemented a proof-of-concept in PR#1349. In principle, it does two things:
1) Replace all the hash-to-filename associations by a simple 
canonical_filename->local_filename associative array.
2) Find moved images based on file-paths. The way this works is by scoring the 
match between the file-paths: the higher the number of matching path-items 
(starting from the filename up to the first miss), the higher the score.
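To illustrate the scoring idea, here is a minimal sketch in Python. It is not 
the code from the PR; the function names path_score and best_match are made up 
for this example, and it assumes '/'-separated paths.

```python
def path_score(canonical: str, candidate: str) -> int:
    """Count matching path items, compared from the filename backwards,
    stopping at the first miss. (Hypothetical helper, not from the PR.)"""
    a = [item for item in canonical.split("/") if item]
    b = [item for item in candidate.split("/") if item]
    score = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        score += 1
    return score


def best_match(canonical: str, candidates: list):
    """Return the candidate with the highest score, or None if not even
    the filename matches. Ties go to the first candidate in the list."""
    best = max(candidates, key=lambda c: path_score(canonical, c), default=None)
    if best is None or path_score(canonical, best) == 0:
        return None
    return best
```

With this sketch, "/home/alice/dives/2018/img1.jpg" compared against 
"/mnt/backup/dives/2018/img1.jpg" matches on the filename and two parent 
directories, so it would win over "/mnt/x/2017/img1.jpg", which matches on the 
filename only.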

Note that a significant part of the PR is actually the conversion of the old 
associations to the simplified ones. Apart from this, the final code is 
distinctly less complex than the original one. It can/should of course still 
be improved. For example, we could improve the heuristics by remembering image 
meta-data. And certainly, the user should be presented with the list of new 
associations before they are actually applied.

But before I continue to work on this (like probably all of us, I have only 
limited time), a decision needs to be made on whether this is the correct 
path forward and which features are must-haves (at least some sort of user 
interaction, methinks).

If we choose to go this way, there is at least one additional implementation 
detail to discuss: To emulate the old behavior, *all* pictures that we ever 
encountered are matched. If a log is opened, all pictures in the log are 
remembered in the canonical_filename->local_filename associative array. I 
wonder if it would not be more sensible to match only the pictures of the 
currently opened log (or even only selected dives?). Thus we would only have 
to remember those images where canonical and local filenames differ.
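As a rough sketch of that last variant (again Python, with hypothetical names 
remember and local_filename that do not appear in the PR): the array stores an 
entry only when the two filenames differ, and a lookup falls back to the 
canonical filename otherwise.

```python
def remember(associations: dict, canonical: str, local: str) -> None:
    """Store an association only if canonical and local filenames differ.
    (Hypothetical helper for illustration, not from the PR.)"""
    if canonical == local:
        # Nothing to remember; also drop a stale entry if one exists.
        associations.pop(canonical, None)
    else:
        associations[canonical] = local


def local_filename(associations: dict, canonical: str) -> str:
    """Look up the local filename, defaulting to the canonical one."""
    return associations.get(canonical, canonical)
```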

Thanks,

Berthold