So, I’m self-hosting Immich. The issue is that we tend to take a lot of pictures of the same scene/thing so we can later pick the best one, and well, we can end up with 5~10 photos which are basically duplicates, but not quite.
Some duplicate-finding programs rate those images at 95% or higher similarity.

I’m wondering if there’s any way, probably at the filesystem level, for these near-identical images to be compressed together.
Maybe deduplication?
Have any of you guys handled a similar situation?

  • simplymath@lemmy.world · 10 months ago

    Compressed length is already known to be a powerful metric for classification tasks, but it means compressing pairs of items, so the cost grows polynomially with the size of your library. As much as I hate to admit it, you’re better off using a neural network, since embeddings are computed once per image (linear time), or figuring out how to apply the kernel trick to the metric outlined in the paper linked below (rough gzip sketch after the links).

    a formal paper on using compression length as a measure of similarity: https://arxiv.org/pdf/cs/0111054

    a blog post on this topic, applied to image classification: https://jakobs.dev/solving-mnist-with-gzip/
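
    If it helps to see it concretely, here’s a minimal sketch of the normalized compression distance (NCD) idea from the paper above, using gzip on decoded pixels (compressing the raw JPEG bytes directly doesn’t tell you much, since they’re already entropy-coded). The file names and the 256×256 grayscale downscale are just placeholder assumptions:

    ```python
    import gzip

    from PIL import Image  # Pillow, assumed available for decoding

    def raw_pixels(path: str) -> bytes:
        """Decode to a small grayscale raster so gzip sees pixel data, not JPEG-coded bytes."""
        return Image.open(path).convert("L").resize((256, 256)).tobytes()

    def ncd(x: bytes, y: bytes) -> float:
        """Normalized compression distance: ~0 for near-duplicates, ~1 for unrelated images."""
        cx, cy, cxy = len(gzip.compress(x)), len(gzip.compress(y)), len(gzip.compress(x + y))
        return (cxy - min(cx, cy)) / max(cx, cy)

    # Hypothetical file names -- any two shots from the same burst.
    a = raw_pixels("IMG_0001.jpg")
    b = raw_pixels("IMG_0002.jpg")
    print(f"NCD = {ncd(a, b):.3f}")
    ```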

        • simplymath@lemmy.world · 10 months ago

          Yeah. That’s what an MP4 does, but I was just saying that first you have to figure out which images are “close enough” to encode this way.
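
          As a rough sketch of that idea (not something Immich does today), once a burst has been grouped you could let a video codec store mostly the differences between the shots, e.g. by handing them to ffmpeg. The file pattern, frame rate, and CRF below are arbitrary placeholders:

          ```python
          import subprocess

          # Hypothetical burst of near-duplicates named IMG_0001.jpg, IMG_0002.jpg, ...
          # libx264 encodes mostly the differences between frames instead of each photo in full.
          # Requires ffmpeg on PATH; note libx264 also wants even pixel dimensions.
          subprocess.run(
              ["ffmpeg", "-y", "-framerate", "1", "-i", "IMG_%04d.jpg",
               "-c:v", "libx264", "-crf", "18", "burst.mp4"],
              check=True,
          )
          ```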

      • simplymath@lemmy.world · 10 months ago

        Yeah, I understand. But first you have to cluster your images so you know which ones are similar and can then do the deduplication, and a compression-based metric would be a powerful way to do that clustering. It’s just expensive compared to other clustering approaches.
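
        To make the tradeoff concrete, here’s a minimal sketch of that clustering step, assuming the gzip-based NCD from the paper above and a hand-picked distance threshold; the O(n²) pairwise comparisons are exactly the expensive part:

        ```python
        import gzip
        from itertools import combinations
        from pathlib import Path

        from PIL import Image  # Pillow, assumed available

        def raw_pixels(path: Path) -> bytes:
            """Decode to a small grayscale raster so gzip compares pixels, not JPEG-coded bytes."""
            return Image.open(path).convert("L").resize((128, 128)).tobytes()

        def ncd(x: bytes, y: bytes) -> float:
            """Normalized compression distance: ~0 for near-duplicates, ~1 for unrelated images."""
            cx, cy, cxy = len(gzip.compress(x)), len(gzip.compress(y)), len(gzip.compress(x + y))
            return (cxy - min(cx, cy)) / max(cx, cy)

        def cluster_bursts(folder: str, threshold: float = 0.3) -> list[set[Path]]:
            """Group images whose pairwise NCD is below `threshold` (value picked arbitrarily)."""
            paths = sorted(Path(folder).glob("*.jpg"))
            pixels = {p: raw_pixels(p) for p in paths}
            parent = {p: p for p in paths}  # union-find over image paths

            def find(p: Path) -> Path:
                while parent[p] != p:
                    parent[p] = parent[parent[p]]  # path halving
                    p = parent[p]
                return p

            # The expensive part: O(n^2) compressions over all pairs.
            for a, b in combinations(paths, 2):
                if ncd(pixels[a], pixels[b]) < threshold:
                    parent[find(a)] = find(b)

            groups: dict[Path, set[Path]] = {}
            for p in paths:
                groups.setdefault(find(p), set()).add(p)
            return [g for g in groups.values() if len(g) > 1]

        print(cluster_bursts("photos/"))
        ```

        In practice you’d probably only run something like this inside small candidate sets (e.g. photos taken within a few seconds of each other) rather than across the whole library.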