Google's Gemini 2.5 pro is out of beta.

diz@awful.systems · edit-2 4 months ago

Google's Gemini 2.5 pro is out of beta.

fubarx@lemmy.world · edit-2 4 months ago

233,324,900,064.

Off by 474,720.

mountainriver@awful.systems · 4 months ago

I find it a bit interesting that it isn’t more wrong. Has it ingested large tables and got a statistical relationship between certain large factors and certain answers? Or is there something else going on?

CodexArcanum@lemmy.dbzer0.com · 4 months ago

I posted a top level comment about this also, but Anthropic has done some research on this. The section on reasoning models discusses math I believe. The short version is it has a bunch of math in its corpus so it can approximate math (kind of, seemingly, similar to how you’d do a back of the envelope calculation in your head to get the orders of magnitude right) but it can’t actually do calculations which is why they often get the specifics wrong.

froztbyte@awful.systems · 4 months ago

reasoning models

that’s a shot, everyone drink up

CodexArcanum@lemmy.dbzer0.com · 4 months ago

One of the big AI companies (Anthropic with claude? Yep!) wrote a long paper that details some common LLM issues, and they get into why they do math wrong and lie about it in “reasoning” mode.

It’s actually pretty interesting, because you can’t say they “don’t know how to do math” exactly. The stochastic mechanisms that allow it to fool people with written prose also allow it to do approximate math. That’s why some digits are correct, or it gets the order of magnitude right but still does the math wrong. It’s actually layering together several levels of approximation.

The “reasoning” is just entirely made up. We barely understsnd how LLMs actually work, so none of them have been trained on research about that, which means LLMs don’t understand their own functioning (not that they “understand” anything strictly speaking).

scruiser@awful.systems · 4 months ago

We barely understsnd how LLMs actually work

I would be careful how you say this. Eliezer likes to go on about giant inscrutable matrices to fearmoner, and the promptfarmers use the (supposed) mysteriousness as another avenue for crithype.

It’s true reverse engineering any specific output or task takes a lot of effort and requires access to the model’s internals weights and hasn’t been done for most tasks, but the techniques exist for doing so. And in general there is a good high level conceptual understanding of what makes LLMs work.

which means LLMs don’t understand their own functioning (not that they “understand” anything strictly speaking).

This part is absolutely true. If you catch them in mistake, most of their data about responding is from how humans respond, or, at best fine-tuning on other LLM output and they don’t have any way of checking their own internals, so the words they say in response to mistakes is just more bs unrelated to anything.

diz@awful.systems · 4 months ago

Thing is, it has tool integration. Half of the time it uses python to calculate it. If it uses a tool, that means it writes a string that isn’t shown to the user, which runs the tool, and tool results are appended to the stream.

What is curious is that instead of request for precision causing it to use the tool (or just any request to do math), and then presence of the tool tokens causing it to claim that a tool was used, the requests for precision cause it to claim that a tool was used, directly.

Also, all of it is highly unnatural texts, so it is either coming from fine tuning or from training data contamination.

HedyL@awful.systems · 4 months ago

Also, if the LLM had reasoning capabilities that even remotely resembled those of an actual human, let alone someone who would be able to replace office workers, wouldn’t they use the best tool they had available for every task (especially in a case as clear-cut as this)? After all, almost all humans (even children) would automatically reach for their pocket calculators here, I assume.

HedyL@awful.systems · 4 months ago

As usual with chatbots, I’m not sure whether it is the wrongness of the answer itself that bothers me most or the self-confidence with which said answer is presented. I think it is the latter, because I suspect that is why so many people don’t question wrong answers (especially when they’re harder to check than a simple calculation).

diz@awful.systems · edit-2 4 months ago

The other interesting thing is that if you try it a bunch of times, sometimes it uses the calculator and sometimes it does not. It, however, always claims that it used the calculator, unless it didn’t and you tell it that the answer is wrong.

I think something very fishy is going on, along the lines of them having done empirical research and found that fucking up the numbers and lying about it makes people more likely to believe that gemini is sentient. It is a lot weirder (and a lot more dangerous, if someone used it to calculate things) than “it doesn’t have a calculator” or “poor LLMs cant do math”. It gets a lot of digits correct somehow.

Frankly this is ridiculous. They have a calculator integrated in the google search. That they don’t have one in their AIs feels deliberate, particularly given that there’s a plenty of LLMs that actually run calculator almost all of the time.

edit: lying that it used a calculator is rather strange, too. Humans don’t say “code interpreter” or “direct calculator” when asked to multiply two numbers. What the fuck is a “direct calculator”? Why is it talking about “code interpreter” and “direct calculator” conditionally on there being digits (I never saw it say that it used a “code interpreter” when the problem wasn’t mathematical), rather than conditional on there being a [run tool] token outputted earlier?

The whole thing is utterly ridiculous. Clearly for it to say that it used a “code interpreter” and a “direct calculator” (what ever that is), it had to be fine tuned to say that. Consequently to a bunch of numbers, rather than consequently to a [run tool] thing it uses to run a tool.

edit: basically, congratulations Google, you have halfway convinced me that an “artificial lying sack of shit” is possible after all. I don’t believe that tortured phrases like “code interpreter” and a “direct calculator” actually came from the internet.

These assurances - coming from an “AI” - seem like they would make the person asking the question be less likely to double check the answer (and perhaps less likely to click the downvote button), In my book this would qualify them as a lie, even if I consider LLM to not be any more sentient than a sack of shit.

ShakingMyHead@awful.systems · edit-2 4 months ago

I don’t believe that tortured phrases like “code interpreter” and a “direct calculator” actually came from the internet.

Code Interpreter was the name for the thing that ChatGPT used to run python code.

So, yeah, still taken from the internet.

diz@awful.systems · edit-2 4 months ago

Hmm, fair point, it could be training data contamination / model collapse.

It’s curious that it is a lot better at converting free form requests for accuracy, into assurances that it used a tool, than into actually using a tool.

And when it uses a tool, it has a bunch of fixed form tokens in the log. It’s a much more difficult language processing task to assure me that it used a tool conditionally on my free form, indirect implication that the result needs to be accurate, than to assure me it used a tool conditionally on actual tool use.

The human equivalent to this is “pathological lying”, not “bullshitting”. I think a good term for this is “lying sack of shit”, with the “sack of shit” specifying that “lying” makes no claim of any internal motivations or the like.

edit: also, testing it on 2.5 flash, it is quite curious: https://g.co/gemini/share/ea3f8b67370d . I did that sort of query several times and it follows the same pattern: it doesn’t use a calculator, it assures me the result is accurate, if asked again it uses a calculator, if asked if the numbers are equal it says they are not, if asked which one is correct it picks the last one and argues that the last one actually used a calculator. I hadn’t ever managed to get it to output a correct result and then follow up with an incorrect result.

edit: If i use the wording of “use an external calculator”, it gives a correct result, and then I can’t get it to produce an incorrect result to see if it just picks the last result as correct, or not.

I think this is lying without scare quotes, because it is a product of Google putting a lot more effort into trying to exploit Eliza effect to convince you that it is intelligent, than into actually making an useful tool. It, of course, doesn’t have any intent, but Google and its employees do.

lIlIlIlIlIlIl@lemmy.world · 4 months ago

Why would you think the machine that’s designed to make weighted guesses at what the next token should be would be arithmetically sound?

That’s not how any of this works (but you already knew that)

GregorGizeh@lemmy.zip · edit-2 4 months ago

Idk personally i kind of expect the ai makers to have at least had the sense to allow their bots to process math with a calculator and not guesswork. That seems like, an absurdly low bar both for testing the thing as a user as well as a feature to think of.

Didn’t one model refer scientific questions to wolfram alpha? How do they smartly decide to do this and not give them basic math processing?

lIlIlIlIlIlIl@lemmy.world · 4 months ago

I would not expect that.

Calculators haven’t been replaced, and the product managers of these services understand that their target market isn’t attempting to use them for things for which they were not intended.

brb, have to ride my lawnmower to work

diz@awful.systems · edit-2 4 months ago

Try asking my question to Google gemini a bunch of times, sometimes it gets it right, sometimes it doesn’t. Seems to be about 50/50 but I quickly ran out of free access.

And google is planning to replace their search (which includes a working calculator) with this stuff. So it is absolutely the case that there’s a plan to replace one of the world’s most popular calculators, if not the most popular, with it.

HedyL@awful.systems · 4 months ago

Also, a lawnmower is unlikely to say: “Sure, I am happy to take you to work” and “I am satisfied with my performance” afterwards. That’s why I sometimes find these bots’ pretentious demeanor worse than their functional shortcomings.

lIlIlIlIlIlIl@lemmy.world · 4 months ago

“Pretentious” is a trait expressed by something that’s thinking. These are the most likely words that best fit the context. Your emotional engagement with this technology is weird

diz@awful.systems · 4 months ago

Pretentious is a fine description of the writing style. Which actual humans fine tune.

swlabr@awful.systems · 4 months ago

Given that the LLMs typically have a system prompt that specifies a particular tone for the output, I think pretentious is an absolutely valid and accurate word to use.

ebu@awful.systems · 4 months ago

“emotional”

let me just slip the shades on real quick

“womanly”

checks out

froztbyte@awful.systems · 4 months ago

“Pretentious” is a trait expressed by something that’s thinking

diz@awful.systems · edit-2 4 months ago

The funny thing is, even though I wouldn’t expect it to be, it is still a lot more arithmetically sound than what ever is it that is going on with it claiming to use a code interpreter and a calculator to double check the result.

It is OK (7 out of 12 correct digits) at being a calculator and it is awesome at being a lying sack of shit.

lIlIlIlIlIlIl@lemmy.world · 4 months ago

lying sack of shit

Random tokens can’t lie to you, because they’re strings of text. Interpreting this as a lie is an interesting response

swlabr@awful.systems · 4 months ago

lol the corollary of this is that LLMs are incapable of producing meaningful output, you insufferable turd

lIlIlIlIlIlIl@lemmy.world · 4 months ago

Im using it literally every single day to make huge gains. Every single day I disprove this comment

self@awful.systems · 4 months ago

I knew you were a lying promptfondler the instant you came into the thread, but I didn’t expect you to start acting like a gymbro trying to justify their black market steroid habit. new type of AI booster unlocked!

now fuck off

diz@awful.systems · 4 months ago

lmao: they have fixed this issue, it seems to always run python now. Got to love how they just put this shit in production as “stable” Gemini 2.5 pro with that idiotic multiplication thing that everyone knows about, and expect what? to Eliza Effect people into marrying Gemini 2.5 pro?

scruiser@awful.systems · 4 months ago

Have they fixed it as in genuinely uses python completely reliably or “fixed” it, like they tweaked the prompt and now it use python 95% of the time instead of 50/50? I’m betting on the later.

aramova@infosec.pub · 4 months ago

Non-deterministic LLMs will always have randomness in their output. Best they can hope for is layers of sanity checke slowing things down and costing more.