So, I’m self-hosting Immich. The issue is that we tend to take a lot of pictures of the same scene or subject to later pick the best one, so we often end up with 5 to 10 photos that are basically duplicates, but not quite.
Some duplicate finding programs put those images at 95% or more similarity.

I’m wondering if there’s any way, probably at the file-system level, for these near-identical images to be compressed together.
Maybe deduplication?
Have any of you guys handled a similar situation?

    • simplymath@lemmy.world · 10 months ago

      Compressed length is already known to be a powerful metric for classification tasks, but it requires polynomial time to do the classification. As much as I hate to admit it, you’re better off using neural networks, because they work in linear time, or figuring out how to apply the kernel trick to the metric outlined in this paper.

      A formal paper on using compression length as a measure of similarity: https://arxiv.org/pdf/cs/0111054

      A blog post applying the idea to image classification: https://jakobs.dev/solving-mnist-with-gzip/
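As a concrete illustration of the metric from the linked paper, here is a minimal sketch of the Normalized Compression Distance using Python’s standard gzip module. The function name and the toy byte strings are my own, not from the paper or the blog post:

```python
import gzip

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: near 0 for near-identical
    inputs, near 1 for unrelated ones (up to compressor overhead)."""
    cx = len(gzip.compress(x))
    cy = len(gzip.compress(y))
    cxy = len(gzip.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Two "burst shots" of the same scene vs. unrelated data.
shot_a = b"sky sky sky tree tree cat" * 40
shot_b = shot_a[:-25] + b"tree tree dog" + b" " * 12
noise = bytes((i * 197 + 31) % 256 for i in range(1000))

print(ncd(shot_a, shot_b))  # small: the shots compress well together
print(ncd(shot_a, noise))   # larger: concatenating doesn't help the compressor
```

The expensive part is that every pairwise distance reruns the compressor over both inputs, which is what makes this approach slow on a whole photo library.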

          • simplymath@lemmy.world · 10 months ago

            Yeah. That’s what an MP4 does, but I was just saying that first you have to figure out which images are “close enough” to encode this way.

        • simplymath@lemmy.world · 10 months ago

          Yeah. I understand. But first you have to cluster your images so you know which ones are similar and can then do the deduplication. This would be a powerful way to do that. It’s just expensive compared to other clustering algorithms.
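To make that cost concrete, here is a toy sketch (my own illustration, not from the thread) of greedy threshold clustering on top of a compression distance: each new item is compared against one representative per existing cluster, so every comparison reruns gzip over both inputs. The `threshold` value and the sample byte strings are arbitrary assumptions:

```python
import gzip

def ncd(x: bytes, y: bytes) -> float:
    # Normalized Compression Distance via gzip, as in the linked paper.
    cx, cy = len(gzip.compress(x)), len(gzip.compress(y))
    return (len(gzip.compress(x + y)) - min(cx, cy)) / max(cx, cy)

def cluster(items: list[bytes], threshold: float = 0.5) -> list[list[int]]:
    """Greedily assign each item to the first cluster whose first
    member is within `threshold` NCD; otherwise start a new cluster."""
    clusters: list[list[int]] = []
    for i, item in enumerate(items):
        for members in clusters:
            if ncd(items[members[0]], item) <= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

burst1 = b"beach umbrella waves " * 60   # stand-ins for burst photos
burst2 = burst1[:-30] + b"waves waves gull " * 2
other = bytes((i * 89 + 7) % 256 for i in range(1200))
print(cluster([burst1, burst2, other]))  # the two near-duplicates group together
```

On real photos the same structure applies, but each `ncd` call would compress megabytes of image data, which is why cheaper similarity measures (perceptual hashes, embedding vectors) are usually preferred for the clustering step.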