Thread: lower() and unaccent() not leakproof
Hi, lower() and unaccent() (and most string functions) are not marked as leakproof. Is this due to possible locale / characterencoding errors they might encounter?
On Wednesday, August 25, 2021, Christophe Pettus <xof@thebuild.com> wrote:
lower() and unaccent() (and most string functions) are not marked as leakproof. Is this due to possible locale / character encoding errors they might encounter?
I think you are partially correct. Its due to the fact that error messages, regardless of the root cause, result in the printing of the input value in the error message as context, thus exists a leak via a violation of “ It reveals no information about its arguments other than by its return value. ”
David J.
On 26.08.21 06:52, David G. Johnston wrote: > On Wednesday, August 25, 2021, Christophe Pettus <xof@thebuild.com > <mailto:xof@thebuild.com>> wrote: > > lower() and unaccent() (and most string functions) are not marked as > leakproof. Is this due to possible locale / character encoding > errors they might encounter? > > > I think you are partially correct. Its due to the fact that error > messages, regardless of the root cause, result in the printing of the > input value in the error message as context, thus exists a leak via a > violation of “ It reveals no information about its arguments other than > by its return value. ” I think if you trace the code, you might find that lower() and upper() can't really leak anything. It might be worth taking a careful look and possibly lifting this restriction.
> On 26 Aug 2021, at 09:58, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > On 26.08.21 06:52, David G. Johnston wrote: >> On Wednesday, August 25, 2021, Christophe Pettus <xof@thebuild.com <mailto:xof@thebuild.com>> wrote: >> lower() and unaccent() (and most string functions) are not marked as >> leakproof. Is this due to possible locale / character encoding >> errors they might encounter? >> I think you are partially correct. Its due to the fact that error messages, regardless of the root cause, result in theprinting of the input value in the error message as context, thus exists a leak via a violation of “ It reveals no informationabout its arguments other than by its return value. ” > > I think if you trace the code, you might find that lower() and upper() can't really leak anything. It might be worth takinga careful look and possibly lifting this restriction. Wouldn’t the difference in possible error messages in upper/lower be able to leak whether the input is ascii or wide chars, and/or the collation? -- Daniel Gustafsson https://vmware.com/
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: > On 26.08.21 06:52, David G. Johnston wrote: >> On Wednesday, August 25, 2021, Christophe Pettus <xof@thebuild.com >> <mailto:xof@thebuild.com>> wrote: >>> lower() and unaccent() (and most string functions) are not marked as >>> leakproof. Is this due to possible locale / character encoding >>> errors they might encounter? > I think if you trace the code, you might find that lower() and upper() > can't really leak anything. It might be worth taking a careful look and > possibly lifting this restriction. Generally speaking, we've been resistant to marking anything leakproof unless it has a very small code footprint that can be easily audited. In particular, anything that shares a lot of infrastructure with not-leakproof functions seems quite hazardous. Even if you go through the code and convince yourself that it's OK today, innocent changes to the shared infrastructure could break the leakproofness tomorrow. regards, tom lane
On 26.08.21 10:40, Daniel Gustafsson wrote: >> On 26 Aug 2021, at 09:58, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: >> >> On 26.08.21 06:52, David G. Johnston wrote: >>> On Wednesday, August 25, 2021, Christophe Pettus <xof@thebuild.com <mailto:xof@thebuild.com>> wrote: >>> lower() and unaccent() (and most string functions) are not marked as >>> leakproof. Is this due to possible locale / character encoding >>> errors they might encounter? >>> I think you are partially correct. Its due to the fact that error messages, regardless of the root cause, result inthe printing of the input value in the error message as context, thus exists a leak via a violation of “ It reveals noinformation about its arguments other than by its return value. ” >> >> I think if you trace the code, you might find that lower() and upper() can't really leak anything. It might be worthtaking a careful look and possibly lifting this restriction. > > Wouldn’t the difference in possible error messages in upper/lower be able to > leak whether the input is ascii or wide chars, and/or the collation? Yeah, but there aren't any error messages that relate to the argument string, if you look through the code. There isn't any "could not find lower case equivalent of %s" or anything like that. Once you have found the right collation and locale and server encoding and have allocated some memory, the conversion always succeeds. The collation is not secret, it's determined by parse analysis.
On 26.08.21 16:00, Tom Lane wrote: > Generally speaking, we've been resistant to marking anything leakproof > unless it has a very small code footprint that can be easily audited. > > In particular, anything that shares a lot of infrastructure with > not-leakproof functions seems quite hazardous. Even if you go through > the code and convince yourself that it's OK today, innocent changes > to the shared infrastructure could break the leakproofness tomorrow. I think the complexity of the implementation of upper() and lower() is on the same order as bttextcmp() and similar, so it wouldn't be totally out of scope.
> On 26 Aug 2021, at 16:59, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > On 26.08.21 10:40, Daniel Gustafsson wrote: >> Wouldn’t the difference in possible error messages in upper/lower be able to >> leak whether the input is ascii or wide chars, and/or the collation? > > Yeah, but there aren't any error messages that relate to the argument string, if you look through the code. There isn'tany "could not find lower case equivalent of %s" or anything like that. Correct. My reading of "It reveals no information about its arguments other than by its return value” was that errormessages indicating different code- paths based on argument structure weren't allowed. That might have been a bit too lawyery interpretation though. -- Daniel Gustafsson https://vmware.com/
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: > I think the complexity of the implementation of upper() and lower() is > on the same order as bttextcmp() and similar, so it wouldn't be totally > out of scope. True. But you'll recall that the decision to mark bttextcmp() and cohorts as leakproof was not made without fear. IMV, that decision did not rest simply on code review but on two arguments about why we should take the risk: * The query-optimization usefulness of having those be leakproof is extremely high. * btree comparison functions should really not have any user-reachable failure modes (which comes close to being the definition of leakproof). If one did, that would mean there were legal values of the type that couldn't be put into a btree index. Maybe similar arguments can be made about upper/lower, but I think it's a far weaker case. As for unaccent, the fact that it relies on user-controllable definitions seems to me to make it almost certainly unsafe to be leakproof. regards, tom lane