Love this because I completely agree. “We fixed it and it no longer does the bad thing.” Uh, no. Unless you literally went through your entire dataset, stripped out every single occurrence of the thing, and retrained the model, there is no way you 100% “fixed” it.
I mean, I don’t know for sure, but I think they often just code in program logic to filter out requests they don’t want.
My evidence for that is that I can trigger some “I cannot help you with that” responses by asking completely normal things that just happen to use the wrong word.
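To make the "wrong word" thing concrete, here is a minimal sketch of what a naive keyword filter could look like. This is a guess at the general idea, not how any vendor actually does it; `BLOCKED_WORDS` and `keyword_filter` are made-up names for illustration.

```python
import re

# Hypothetical blocklist, purely for illustration.
BLOCKED_WORDS = {"kill", "exploit"}

def keyword_filter(prompt: str) -> bool:
    """Return True if the prompt contains any blocked word."""
    # Lowercase and split into alphabetic words so punctuation is ignored.
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    return bool(words & BLOCKED_WORDS)

# A harmless sysadmin question trips the filter because of one word.
print(keyword_filter("How do I kill a stuck process on Linux?"))
# The same question with a different verb passes.
print(keyword_filter("How do I stop a stuck process on Linux?"))
```

The first prompt gets flagged and the second doesn't, even though they ask the same thing. That matches the false refusals you see when a normal question happens to contain a trigger word.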
It’s not 100%. You’re more or less just asking the LLM to behave, and then filtering the response through another imperfect model that tries to decide whether it’s malicious or not. It’s not standard coding where a boolean is returned; it’s a probability, according to another model, that what the user asked is inappropriate. If that probability is over a threshold, the request gets rejected.
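Roughly, the threshold idea looks like the sketch below. The "model" here is a toy stand-in (a few hand-picked word weights pushed through a sigmoid), not a real moderation classifier; `TOY_WEIGHTS`, `BIAS`, `risk_probability`, and `should_reject` are all hypothetical names.

```python
import math

# Hypothetical per-word weights standing in for a learned classifier.
TOY_WEIGHTS = {"bomb": 3.0, "hack": 2.0, "kill": 1.5}
BIAS = -2.0

def risk_probability(text: str) -> float:
    """Toy score: probability (0..1) that the text is inappropriate."""
    score = BIAS + sum(TOY_WEIGHTS.get(w, 0.0) for w in text.lower().split())
    return 1.0 / (1.0 + math.exp(-score))  # logistic sigmoid

def should_reject(text: str, threshold: float = 0.5) -> bool:
    """Reject when the estimated risk exceeds the threshold."""
    return risk_probability(text) > threshold

prompt = "how do i kill a stuck process"
print(round(risk_probability(prompt), 3))
print(should_reject(prompt, threshold=0.5))  # lenient threshold
print(should_reject(prompt, threshold=0.3))  # strict threshold
```

The same harmless prompt is allowed at one threshold and rejected at another. Nothing about the prompt changed, only where the cutoff sits, which is why this kind of filter can never be 100% in either direction.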