So, I’m self-hosting Immich. The issue is that we tend to take a lot of pictures of the same scene/thing so we can later pick the best one, and we can easily end up with 5 to 10 photos that are basically duplicates, but not quite. Some duplicate-finding programs rate those images at 95% similarity or more.

I’m wondering if there’s any way, probably at the filesystem level, for these near-identical images to be compressed together.
Maybe deduplication?
Have any of you guys handled a similar situation?

  • smpl@discuss.tchncs.de · 2 months ago

    The first thing I would do when writing such a paper would be to test current compression algorithms by creating a collage of the similar images and seeing how its size compares to that of the individual images.
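    That probe is easy to mock up. Here’s a toy sketch (my own, nothing Immich does) using zlib on synthetic byte strings that stand in for decoded image data; real JPEGs are already entropy-coded, so you’d run the real test on decoded pixels or with a stronger codec:

```python
import random
import zlib

# Hypothetical stand-ins for a burst of similar photos: a shared random
# "scene" plus a small amount of per-shot variation.
random.seed(0)
scene = bytes(random.randrange(256) for _ in range(20_000))
images = [scene + bytes(random.randrange(256) for _ in range(500))
          for _ in range(5)]

# Total size when every image is compressed on its own.
individual = sum(len(zlib.compress(img)) for img in images)

# Size when the images are concatenated (a crude "collage") and
# compressed once, letting the codec reference the shared scene.
collage = len(zlib.compress(b"".join(images)))

print(individual, collage)  # the joint compression is far smaller here
```

    The shared scene has to fall inside the compressor’s back-reference window (32 KB for zlib) for this to work, which is why video codecs with much larger search ranges are a better fit for real photos.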

    • simplymath@lemmy.world · 2 months ago

      Compressed length is already known to be a powerful metric for classification tasks, but it requires polynomial time to do the classification. As much as I hate to admit it, you’re better off using a neural network, since inference runs in linear time, or figuring out how to apply the kernel trick to the metric outlined in this paper.

      a formal paper on using compression length as a measure of similarity: https://arxiv.org/pdf/cs/0111054

      a blog post on this topic, applied to image classification: https://jakobs.dev/solving-mnist-with-gzip/
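      For anyone curious, the metric from that paper is the normalized compression distance (NCD), and it’s only a few lines with any off-the-shelf compressor. A minimal sketch with zlib and made-up byte strings:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for similar inputs,
    near 1 for unrelated ones (Cilibrasi & Vitanyi)."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Made-up inputs: two near-duplicates and one unrelated blob.
a = b"the quick brown fox jumps over the lazy dog" * 20
b = b"the quick brown fox jumps over the lazy cat" * 20
c = bytes(range(256)) * 4

print(ncd(a, b))  # small: the inputs share almost everything
print(ncd(a, c))  # large: compressing them together saves nothing
```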

      • smpl@discuss.tchncs.de · 2 months ago

        I was not talking about classification. What I was talking about was a simple probe of how well a collage of similar images compares in compressed size to the images individually. The hypothesis is that a compression codec would compress images with similar color distributions better in a spritesheet than if it encoded each image individually. I don’t know, the savings might be negligible, but I’d assume there is something to gain, at least for some compression codecs. I doubt doing deduplication after compression has much to gain.

        I think you’re overthinking the classification task. These images are very similar and I think comparing the color distribution would be adequate. It would of course be interesting to compare the different methods :)
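        If color distribution is the feature, a plain histogram intersection is about as cheap as it gets. A toy sketch in pure Python (the pixel lists are made up; with real photos you’d feed in decoded grayscale or per-channel values):

```python
from collections import Counter

def histogram(pixels):
    """Normalized intensity histogram of a flat pixel sequence."""
    counts = Counter(pixels)
    n = len(pixels)
    return {value: c / n for value, c in counts.items()}

def intersection(h1, h2):
    """Histogram intersection: 1.0 = identical distributions, 0.0 = disjoint."""
    return sum(min(h1.get(v, 0.0), h2.get(v, 0.0)) for v in set(h1) | set(h2))

# Hypothetical pixel data: two near-duplicate shots and one unrelated one.
shot_a = [10, 10, 20, 20, 30, 200] * 100
shot_b = [10, 20, 20, 30, 30, 200] * 100
other  = [90, 120, 150, 180, 210, 240] * 100

print(intersection(histogram(shot_a), histogram(shot_b)))  # high overlap
print(intersection(histogram(shot_a), histogram(other)))   # no overlap
```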

        • smpl@discuss.tchncs.de · 2 months ago

          Wait… this is exactly the problem a video codec solves. Scoot and give me some sample data!
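          You can fake the video-codec effect in a few lines: store one keyframe and compress only the byte-wise deltas of the remaining frames, the way inter-frame prediction exploits redundancy. A toy sketch with synthetic "frames" (random bytes plus a few changed values, standing in for near-duplicate shots):

```python
import random
import zlib

# Hypothetical burst of near-duplicate frames: a base image plus tiny edits.
random.seed(1)
base = bytes(random.randrange(256) for _ in range(30_000))
frames = []
for _ in range(6):
    frame = bytearray(base)
    for _ in range(100):  # a handful of changed "pixels" per shot
        frame[random.randrange(len(frame))] = random.randrange(256)
    frames.append(bytes(frame))

# Intra-only: every frame compressed independently (what plain files do).
intra = sum(len(zlib.compress(f)) for f in frames)

# Inter-frame: one keyframe plus compressed XOR deltas, which are almost
# entirely zeros and therefore compress extremely well.
deltas = [bytes(a ^ b for a, b in zip(frames[0], f)) for f in frames[1:]]
inter = len(zlib.compress(frames[0])) + sum(len(zlib.compress(d)) for d in deltas)

print(intra, inter)  # the delta-coded version is a fraction of the size
```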

          • simplymath@lemmy.world · 2 months ago

            Yeah. That’s what an MP4 does, but I was just saying that first you have to figure out which images are “close enough” to encode this way.

        • simplymath@lemmy.world · 2 months ago

          Yeah, I understand. But first you have to cluster your images so you know which ones are similar, and can then do the deduplication. This would be a powerful way to do that. It’s just expensive compared to other clustering algorithms.
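          For completeness, clustering by compression distance only takes a few lines; this greedy single-pass version (my own toy, with synthetic byte strings as stand-ins for images) also shows why it’s expensive: every item triggers fresh compression calls against a representative of each existing cluster:

```python
import random
import zlib

def ncd(x: bytes, y: bytes) -> float:
    # Normalized compression distance: small for similar inputs.
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    return (len(zlib.compress(x + y)) - min(cx, cy)) / max(cx, cy)

def cluster(items, threshold=0.5):
    """Greedy single-pass clustering: compare each item to the first
    member of every existing cluster; start a new cluster on no match."""
    clusters = []
    for item in items:
        for group in clusters:
            if ncd(item, group[0]) < threshold:
                group.append(item)
                break
        else:
            clusters.append([item])
    return clusters

# Synthetic "images": two share a large common core, one is unrelated.
random.seed(3)
core = bytes(random.randrange(256) for _ in range(2000))
other = bytes(random.randrange(256) for _ in range(2000))
docs = [core + b"A" * 50, core + b"B" * 50, other + b"C" * 50]

print([len(g) for g in cluster(docs)])  # the two similar items group together
```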