• FooBarrington@lemmy.world · 13 hours ago

    It’s closer to running 8 high-end video games at once. Sure, from a scale perspective it’s further removed from training, but it’s still fairly expensive.

    • brucethemoose@lemmy.world · 9 hours ago

      Not at all. Not even close.

      Image generation is usually batched and takes seconds, so that’s 700W (a single H100 SXM) running for a few seconds to serve a batch of a few images to multiple users. Maybe more for the absolute biggest (but SFW, no porn) models.

      LLM generation takes more VRAM, but is MUCH more compute-light. Typically you have banks of 8 GPUs across multiple servers serving many, many users at once. Even my lowly RTX 3090 can serve 8+ users in parallel with TabbyAPI (and a modestly sized model) before it becomes compute-bound.

      So in a nutshell, imagegen (on an 80GB H100) is probably more like 1/4-1/8 of a video game at once (not 8 at once), and only for a few seconds.
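
      A rough back-of-envelope of that imagegen claim, treating the wattage, duration and batch size above as ballpark assumptions rather than measurements:

      ```python
      # Back-of-envelope: energy per generated image on a batched H100 vs. a gaming GPU.
      # All numbers are ballpark assumptions for illustration, not measurements.
      h100_power_w = 700   # single H100 SXM under load
      gen_seconds = 5      # "a few seconds" per batch
      batch_size = 4       # a batch of a few images served to multiple users

      joules_per_image = h100_power_w * gen_seconds / batch_size
      print(f"~{joules_per_image:.0f} J per image")  # ~875 J

      gaming_gpu_w = 350   # assumed draw of a high-end gaming card under load
      print(f"roughly {joules_per_image / gaming_gpu_w:.1f} s of gaming per image")  # ~2.5 s
      ```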

      Text generation is similarly efficient, if not more so. Responses take longer (many seconds, except on special hardware like Cerebras CS-2s), but they’re parallelized over dozens of users per GPU.
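
      The same napkin math for batched text generation (the user count and wattage below are assumptions, not measurements):

      ```python
      # Per-user power share of a batched LLM server. Assumed, illustrative numbers only.
      gpus_per_server = 8
      gpu_power_w = 700        # H100-class accelerator, fully loaded
      concurrent_users = 64    # "dozens of users per GPU" across the bank

      watts_per_user = gpus_per_server * gpu_power_w / concurrent_users
      print(f"~{watts_per_user:.0f} W per active user")  # ~88 W, well under one gaming GPU
      ```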


      This is excluding more specialized hardware like Google’s TPUs, Huawei NPUs, Cerebras CS-2s and so on. These are clocked far more efficiently than Nvidia/AMD GPUs.


      …The worst are probably video generation models. These are extremely compute intense and take a long time (at the moment), so you are burning like a few minutes of gaming time per output.

      ollama/sd-web-ui are terrible analogs for all this because they are single-user and relatively unoptimized.

    • jsomae@lemmy.ml · 13 hours ago

      Really depends. You can locally host an LLM on a typical gaming computer.

      • FooBarrington@lemmy.world · 12 hours ago

        You can, but that’s not the kind of LLM the meme is talking about. It’s about the big LLMs hosted by large companies.

      • floquant@lemmy.dbzer0.com · 12 hours ago

        True, and that’s how everyone who is able should use AI, but OpenAI’s models are in the trillion-parameter range. That’s 2-3 orders of magnitude more than what you can reasonably run yourself.

        • jsomae@lemmy.ml · 12 hours ago

          This is still orders of magnitude less than what it takes to run an EV, which is an eco-friendly form of carbrained transportation. Especially if you live in an area where the power source is renewable. On that note, it looks to me like AI is finally going to be the impetus to get the U.S. to invest in and switch to nuclear power – isn’t that altogether a good thing for the environment?
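
          For scale, the comparison can be made concrete with deliberately rough figures (the per-query energy in particular is an assumption, not a published number):

          ```python
          # Rough scale comparison: EV driving vs. hosted LLM queries.
          # Both figures are assumptions for illustration, not published numbers.
          ev_kwh_per_100km = 18     # typical EV consumption
          llm_wh_per_query = 3      # assumed per-request energy for a large hosted model

          queries_per_100km = ev_kwh_per_100km * 1000 / llm_wh_per_query
          print(f"100 km of EV driving ~ {queries_per_100km:.0f} LLM queries")  # ~6000
          ```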

      • Thorry84@feddit.nl · 12 hours ago

        Well, that’s sort of half right. Yes, you can run the smaller models locally, but usually it’s the bigger models that we want to use, and those would be very slow even on a high-end gaming computer. To make them go faster, the hardware used in datacenters is not only more optimised for the task, it’s also a lot faster: a speed increase per unit, plus far more units than you would normally find in a gaming PC.

        Now, these things aren’t magic; the basic technology is the same, so where does the speed come from? The answer is raw power: these things run insane amounts of power through them, with specialised cooling systems to keep them cool. This comes at the cost of efficiency.

        So whilst running a model is much cheaper than training one, it is far from free. And whilst you can run a smaller model on your home PC, that isn’t directly comparable to how it’s used in the datacenter. So the use of AI is still very power-hungry, even when not counting the training.
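
        To put rough numbers on the “more power per unit, and more units” point (all of these are assumptions, chosen only to show the shape of the comparison):

        ```python
        # Aggregate power draw of one inference server vs. one gaming PC.
        # Every number is an assumption for illustration.
        gaming_pc_w = 350 + 150       # GPU plus the rest of the system
        server_w = 8 * 700 + 2000     # 8 accelerators plus CPUs, RAM, fans, networking

        print(f"gaming PC: ~{gaming_pc_w} W, inference server: ~{server_w} W "
              f"(~{server_w / gaming_pc_w:.0f}x per box)")  # ~15x
        ```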

      • CheeseNoodle@lemmy.world · 12 hours ago

        Yeah, but those local models are usually pretty underpowered compared to the ones that run via online services, and they’re still more demanding than any game.

      • FooBarrington@lemmy.world · 11 hours ago

        I compared the TDP of an average high-end graphics card with that of the GPUs required to run big LLMs. Do you disagree?
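
        Spelled out, that comparison looks roughly like this (TDPs are public spec-sheet numbers; the exact ratio depends on which figures you pick):

        ```python
        # The TDP comparison behind the "8 high-end video games at once" framing.
        high_end_gaming_gpu_w = 450   # e.g. RTX 4090 board power
        h100_sxm_w = 700              # H100 SXM
        gpus_per_node = 8             # the GPU bank discussed above

        node_w = gpus_per_node * h100_sxm_w
        print(f"{node_w} W node vs {high_end_gaming_gpu_w} W gaming GPU "
              f"(~{node_w / high_end_gaming_gpu_w:.0f}x)")  # ~12x; ~8x if you count a ~700 W whole gaming rig
        ```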

          • FooBarrington@lemmy.world · 10 hours ago

            They are; it’d be uneconomical not to use them fully the whole time. Look up how batching works.
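
            A toy sketch of what batching buys you (not any provider’s actual serving code; just numpy standing in for a model layer):

            ```python
            # Toy illustration of batching: applying one weight matrix to many requests
            # at once amortizes the cost of reading the weights, which is why serving
            # GPUs are kept busy rather than left idle.
            import time
            import numpy as np

            weights = np.random.rand(4096, 4096).astype(np.float32)  # stand-in for model weights
            requests = [np.random.rand(4096).astype(np.float32) for _ in range(64)]

            t0 = time.perf_counter()
            for r in requests:                   # one user at a time
                _ = weights @ r
            serial = time.perf_counter() - t0

            t0 = time.perf_counter()
            _ = np.stack(requests) @ weights.T   # all 64 users in one batched matmul
            batched = time.perf_counter() - t0

            print(f"serial: {serial * 1e3:.1f} ms, batched: {batched * 1e3:.1f} ms")
            ```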

            • Jakeroxs@sh.itjust.works · 10 hours ago

              I mean, I literally run a local LLM, and while the model sits in memory it’s really not using up a crazy amount of resources. TBF, I should hook up something to actually measure exactly how much it’s pulling, vs just looking at htop/atop and guesstimating based on load.

              Vs when I play a game: the fans start blaring, it heats up, and you can clearly see the usage increasing across various metrics.
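
              For the measuring part, something like this sketch works on NVIDIA cards (assumes nvidia-smi is on PATH); run it once with the model idle in memory and once mid-generation:

              ```python
              # Poll GPU board power once a second via nvidia-smi (NVIDIA cards only).
              import subprocess
              import time

              while True:
                  out = subprocess.run(
                      ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
                      capture_output=True, text=True, check=True,
                  )
                  print(f"{out.stdout.strip()} W")
                  time.sleep(1)
              ```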

              • PeriodicallyPedantic@lemmy.ca · 10 hours ago

                He isn’t talking about running it locally, he’s talking about what it takes for the AI providers to provide the AI.

                Whether “it takes more energy during training” is true depends entirely on the load put on the inference servers and the size of the inference server farm.
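
                That dependence is easy to write down explicitly; every input below is a placeholder you’d have to replace with real fleet numbers:

                ```python
                # Total inference energy scales with fleet size and average load, while
                # training is a one-off cost. Every number here is a placeholder assumption.
                inference_gpus = 10_000
                gpu_power_kw = 0.7
                avg_utilization = 0.6
                hours_per_year = 24 * 365

                inference_mwh = inference_gpus * gpu_power_kw * avg_utilization * hours_per_year / 1000
                training_mwh = 10_000   # assumed one-off training cost

                print(f"inference/year: ~{inference_mwh:,.0f} MWh vs training: ~{training_mwh:,.0f} MWh")
                ```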

                • Jakeroxs@sh.itjust.works · 9 hours ago

                  There’s no functional difference aside from usage and scale, which is my point.

                  I find it interesting that the only actual energy calculations I see from researchers cover training and the things that go along with it, rather than the usage per actual request after training.

                  People then conflate training energy costs with normal usage costs without data to back it up. I don’t have the data either, but I do have what I can do/see on my side.

                    • PeriodicallyPedantic@lemmy.ca · 5 hours ago

                    I’m not sure that’s true. If you look up things like “tokens per kWh” or “tokens per second per watt” you’ll get results from people measuring their power usage while running specific models on specific hardware. This is mainly consumer hardware, since it’s people looking to run their own AI servers who post about it, but it sets an upper bound.

                    The AI providers are tight-lipped about how much energy they use for inference and how many tokens they complete per hour.

                    You can also infer a bit by looking up the power usage of a 4090, then looking at the tokens-per-second performance someone is getting from a particular model on a 4090 (people love posting their tokens-per-second numbers every time a new model comes out), and extrapolating from that.
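
                    That extrapolation is just a couple of divisions; the 4090 draw and token rate below are typical of what gets posted, but treat them as assumptions:

                    ```python
                    # Joules per token from the kind of consumer-GPU numbers people post.
                    # Both inputs are assumptions taken from typical community benchmarks.
                    power_draw_w = 350    # measured 4090 draw during inference
                    tokens_per_s = 100    # reported throughput for a mid-sized local model

                    j_per_token = power_draw_w / tokens_per_s
                    wh_per_1k_tokens = j_per_token * 1000 / 3600
                    print(f"~{j_per_token:.1f} J/token, ~{wh_per_1k_tokens:.2f} Wh per 1000 tokens")
                    ```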

              • MotoAsh@lemmy.world · 10 hours ago

                One user vs a public service is apples to oranges, and it’s actually hilarious you’re so willing to compare them.

                • Jakeroxs@sh.itjust.works · 9 hours ago

                  It’s literally the same thing; the obvious difference is how much usage each GPU is getting at a time. But everyone seems to assume all these data centers are running at full load at all times, for some reason?

              • FooBarrington@lemmy.world · 10 hours ago

                My guy, we’re not talking about just leaving a model loaded; we’re talking about actual usage in a cloud setting with far more GPUs and users involved.

                  • FooBarrington@lemmy.world · 9 hours ago

                    Given that cloud providers are desperately trying to get more compute resources but are limited by chip production - yes, of course? Why do you think they’d be trying to expand their capacity if their existing resources weren’t already maxed out?