Find moved images: match path instead of hashes?

Willem Ferguson willemferguson at zoology.up.ac.za
Sat Jun 9 10:24:04 PDT 2018


On 09/06/2018 13:53, Berthold Stoeger wrote:
> Dear all,
>
> In a previous mail I noted that the current file-hashing of images is a
> disaster when it comes to video support. Moreover, the complexity of the
> hashing-code caused me quite some headache and the current implementation is
> buggy (different hashes for original and local filename). Instead of coming up
> with a scheme for videos (e.g. hash 1 MB in the middle of the file if it
> exceeds a certain size) and fixing those bugs, I wonder if the whole hashing
> thing is necessary at all.
>
> AFAIK, the reason for the hashes is twofold:
> 1) Notice when images have changed to recalculate their thumbnails.
> 2) Use it to find moved images, i.e. if the log is transported to a different
> computer.
>
> The first case never worked, since in routine operation the images are not
> rehashed. This functionality is now instead provided by PR#1336.
>
> The second case is questionable, because users might have edited their files
> without changing the filename. To me this seems to be a more likely case than
> pictures getting renamed. The only remaining purpose of the hash therefore
> seems to be to protect against identically named images. The same protection
> can be achieved by checking not only the filename, but also the names of the
> parent directories.
>
> I implemented a proof-of-concept in PR#1349. In principle, it does two things:
> 1) Replace all the hash-to-filename associations by a simple
> canonical_filename->local_filename associative array.
> 2) Find moved images based on file-paths. The way this works is by scoring the
> match between the file-paths: the higher the number of matching path-items
> (starting from the filename up to the first miss), the higher the score.
>
> Note that a significant part of the PR is actually the conversion of the old
> associations to the simplified ones. Apart from this, the final code is
> distinctly less complex than the original one. It can/should of course still
> be improved. For example, we could improve the heuristics by remembering image
> meta-data. And certainly, the user should be presented with the list of new
> associations before they are actually applied.
>
> But before I continue to work on this (like probably all of us, I have only
> limited time), a decision needs to be made on whether this is the correct
> path forward and which features are must-haves (at least some sort of user
> interaction, methinks).
>
> If we choose to go this way, there is at least one additional implementation
> detail to discuss: To emulate the old behavior, *all* pictures that we ever
> encountered are matched. If a log is opened, all pictures in the log are
> remembered in the canonical_filename->local_filename associative array. I
> wonder if it would not be more sensible to match only the pictures of the
> currently opened log (or even only selected dives?). Thus we would only have
> to remember those images where canonical and local filenames differ.
>
> Thanks,
>
> Berthold

I would support any move to simplify a complex process such as hashing
and looking for files across the whole directory tree. My problem is
that I have many files with the same name. They belong to different
dives, but the actual images reside in a directory structure in off-line
storage. For instance, with fish photos I often use the English and
scientific names as the file name. For this reason I have many files
named "Redfang triggerfish Odonus niger.jpg". I am concerned that
this would cause confusion in a filename-based system.
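
To make the concern concrete, here is a minimal sketch of the path-scoring
idea as described in the quoted proposal: count matching path items from the
filename upwards and stop at the first miss. It is not the code from PR#1349.
The function names path_score and best_matches and all example paths are
hypothetical, and POSIX-style paths are assumed.

```python
from pathlib import PurePosixPath


def path_score(canonical, candidate):
    """Count matching path items, starting at the filename, up to the first miss."""
    a = PurePosixPath(canonical).parts[::-1]
    b = PurePosixPath(candidate).parts[::-1]
    score = 0
    for x, y in zip(a, b):
        if x != y:
            break
        score += 1
    return score


def best_matches(canonical, candidates):
    """Return all candidates sharing the top non-zero score.

    More than one result means the match is ambiguous.
    """
    scores = {c: path_score(canonical, c) for c in candidates}
    top = max(scores.values(), default=0)
    if top == 0:
        return []
    return [c for c in candidates if scores[c] == top]


NAME = "Redfang triggerfish Odonus niger.jpg"  # hypothetical example data
canonical = "/home/user/photos/2018-05/dive3/" + NAME

# Directory structure preserved on the other machine: parent names break the tie.
structured = [
    "/mnt/offline/photos/2018-05/dive3/" + NAME,
    "/mnt/offline/photos/2017-11/dive1/" + NAME,
]

# Parent directories unrelated to the original: only the filename matches.
unrelated = [
    "/mnt/a/" + NAME,
    "/mnt/b/" + NAME,
]

print(best_matches(canonical, structured))
print(best_matches(canonical, unrelated))
```

Under these assumptions, identically named files are told apart only as long
as the parent directory names are the same on both machines. When they are
not, several candidates tie and the choice would have to be left to the user.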

Kind regards,

willem


