Thread: lower() and unaccent() not leakproof

lower() and unaccent() not leakproof

From
Christophe Pettus
Date:
Hi,

lower() and unaccent() (and most string functions) are not marked as leakproof.  Is this due to possible locale /
characterencoding errors they might encounter? 


Re: lower() and unaccent() not leakproof

From
"David G. Johnston"
Date:
On Wednesday, August 25, 2021, Christophe Pettus <xof@thebuild.com> wrote:
lower() and unaccent() (and most string functions) are not marked as leakproof.  Is this due to possible locale / character encoding errors they might encounter?


I think you are partially correct.  Its due to the fact that error messages, regardless of the root cause, result in the printing of the input value in the error message as context, thus exists a leak via a violation of “ It reveals no information about its arguments other than by its return value. ”

David J.

Re: lower() and unaccent() not leakproof

From
Peter Eisentraut
Date:
On 26.08.21 06:52, David G. Johnston wrote:
> On Wednesday, August 25, 2021, Christophe Pettus <xof@thebuild.com 
> <mailto:xof@thebuild.com>> wrote:
> 
>     lower() and unaccent() (and most string functions) are not marked as
>     leakproof.  Is this due to possible locale / character encoding
>     errors they might encounter?
> 
> 
> I think you are partially correct.  Its due to the fact that error 
> messages, regardless of the root cause, result in the printing of the 
> input value in the error message as context, thus exists a leak via a 
> violation of “ It reveals no information about its arguments other than 
> by its return value. ”

I think if you trace the code, you might find that lower() and upper() 
can't really leak anything.  It might be worth taking a careful look and 
possibly lifting this restriction.



Re: lower() and unaccent() not leakproof

From
Daniel Gustafsson
Date:
> On 26 Aug 2021, at 09:58, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>
> On 26.08.21 06:52, David G. Johnston wrote:
>> On Wednesday, August 25, 2021, Christophe Pettus <xof@thebuild.com <mailto:xof@thebuild.com>> wrote:
>>    lower() and unaccent() (and most string functions) are not marked as
>>    leakproof.  Is this due to possible locale / character encoding
>>    errors they might encounter?
>> I think you are partially correct.  Its due to the fact that error messages, regardless of the root cause, result in
theprinting of the input value in the error message as context, thus exists a leak via a violation of “ It reveals no
informationabout its arguments other than by its return value. ” 
>
> I think if you trace the code, you might find that lower() and upper() can't really leak anything.  It might be worth
takinga careful look and possibly lifting this restriction. 

Wouldn’t the difference in possible error messages in upper/lower be able to
leak whether the input is ascii or wide chars, and/or the collation?

--
Daniel Gustafsson        https://vmware.com/




Re: lower() and unaccent() not leakproof

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> On 26.08.21 06:52, David G. Johnston wrote:
>> On Wednesday, August 25, 2021, Christophe Pettus <xof@thebuild.com 
>> <mailto:xof@thebuild.com>> wrote:
>>> lower() and unaccent() (and most string functions) are not marked as
>>> leakproof.  Is this due to possible locale / character encoding
>>> errors they might encounter?

> I think if you trace the code, you might find that lower() and upper() 
> can't really leak anything.  It might be worth taking a careful look and 
> possibly lifting this restriction.

Generally speaking, we've been resistant to marking anything leakproof
unless it has a very small code footprint that can be easily audited.

In particular, anything that shares a lot of infrastructure with
not-leakproof functions seems quite hazardous.  Even if you go through
the code and convince yourself that it's OK today, innocent changes
to the shared infrastructure could break the leakproofness tomorrow.

            regards, tom lane



Re: lower() and unaccent() not leakproof

From
Peter Eisentraut
Date:
On 26.08.21 10:40, Daniel Gustafsson wrote:
>> On 26 Aug 2021, at 09:58, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>>
>> On 26.08.21 06:52, David G. Johnston wrote:
>>> On Wednesday, August 25, 2021, Christophe Pettus <xof@thebuild.com <mailto:xof@thebuild.com>> wrote:
>>>     lower() and unaccent() (and most string functions) are not marked as
>>>     leakproof.  Is this due to possible locale / character encoding
>>>     errors they might encounter?
>>> I think you are partially correct.  Its due to the fact that error messages, regardless of the root cause, result
inthe printing of the input value in the error message as context, thus exists a leak via a violation of “ It reveals
noinformation about its arguments other than by its return value. ”
 
>>
>> I think if you trace the code, you might find that lower() and upper() can't really leak anything.  It might be
worthtaking a careful look and possibly lifting this restriction.
 
> 
> Wouldn’t the difference in possible error messages in upper/lower be able to
> leak whether the input is ascii or wide chars, and/or the collation?

Yeah, but there aren't any error messages that relate to the argument 
string, if you look through the code.  There isn't any "could not find 
lower case equivalent of %s" or anything like that.  Once you have found 
the right collation and locale and server encoding and have allocated 
some memory, the conversion always succeeds.

The collation is not secret, it's determined by parse analysis.



Re: lower() and unaccent() not leakproof

From
Peter Eisentraut
Date:
On 26.08.21 16:00, Tom Lane wrote:
> Generally speaking, we've been resistant to marking anything leakproof
> unless it has a very small code footprint that can be easily audited.
> 
> In particular, anything that shares a lot of infrastructure with
> not-leakproof functions seems quite hazardous.  Even if you go through
> the code and convince yourself that it's OK today, innocent changes
> to the shared infrastructure could break the leakproofness tomorrow.

I think the complexity of the implementation of upper() and lower() is 
on the same order as bttextcmp() and similar, so it wouldn't be totally 
out of scope.




Re: lower() and unaccent() not leakproof

From
Daniel Gustafsson
Date:
> On 26 Aug 2021, at 16:59, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
> On 26.08.21 10:40, Daniel Gustafsson wrote:

>> Wouldn’t the difference in possible error messages in upper/lower be able to
>> leak whether the input is ascii or wide chars, and/or the collation?
>
> Yeah, but there aren't any error messages that relate to the argument string, if you look through the code.  There
isn'tany "could not find lower case equivalent of %s" or anything like that. 

Correct.  My reading of "It reveals no information about its arguments other
than by its return value” was that errormessages indicating different code-
paths based on argument structure weren't allowed. That might have been a bit
too lawyery interpretation though.

--
Daniel Gustafsson        https://vmware.com/




Re: lower() and unaccent() not leakproof

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> I think the complexity of the implementation of upper() and lower() is 
> on the same order as bttextcmp() and similar, so it wouldn't be totally 
> out of scope.

True.  But you'll recall that the decision to mark bttextcmp() and
cohorts as leakproof was not made without fear.  IMV, that decision
did not rest simply on code review but on two arguments about why
we should take the risk:

* The query-optimization usefulness of having those be leakproof
is extremely high.

* btree comparison functions should really not have any user-reachable
failure modes (which comes close to being the definition of leakproof).
If one did, that would mean there were legal values of the type that
couldn't be put into a btree index.

Maybe similar arguments can be made about upper/lower, but I think
it's a far weaker case.  As for unaccent, the fact that it relies
on user-controllable definitions seems to me to make it almost
certainly unsafe to be leakproof.

            regards, tom lane