• sugar_in_your_tea@sh.itjust.works
    9 months ago

    I asked an LLM to generate tests for a 10-line function with two arguments, no if branches, and only one library function call. It’s just a for loop and some math. Somehow it invented arguments, and the tests that actually ran didn’t even pass. It made like 5 test functions, spat out paragraphs explaining nonsense, and it still didn’t work.

    This was one of the smaller deepseek models, so perhaps a fancier model would do better.

    I’m still messing with it, so maybe I’ll find some tasks it’s good at.

    • KillingTimeItself@lemmy.dbzer0.com
      9 months ago

      From what I understand, the “preview” models are quite handicapped; usually the benchmark is the full-fat model for that reason. The recent OpenAI one (they have stupid names, idk what is what anymore) had a similar problem.

      If it’s not a preview model, it’s possible a bigger model would help, but usually prompt engineering is going to be more useful. AI is really quick to get confused sometimes.

      • sugar_in_your_tea@sh.itjust.works
        9 months ago

        It might be, idk, my coworker set it up. It’s definitely a distilled model, though. I did hope it would do a better job on such a small input.

        • KillingTimeItself@lemmy.dbzer0.com
          9 months ago

          The distilled models are a little goofier, and it’s possible that’s influencing it, since they tend to behave weirdly sometimes. It depends on the model and the application, though.

          AI is still fairly goofy, unfortunately. It’ll take time for it to become omniscient.