Re: can we mark upper/lower/textlike functions leakproof? - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: can we mark upper/lower/textlike functions leakproof? |
Date | |
Msg-id | CA+Tgmob78LN+Tkgms=dQyBsLyGWBuehv7_kEvBfjwmj6fDsr7g@mail.gmail.com Whole thread Raw |
In response to | Re: can we mark upper/lower/textlike functions leakproof? (Joe Conway <mail@joeconway.com>) |
Responses |
Re: can we mark upper/lower/textlike functions leakproof?
|
List | pgsql-hackers |
On Wed, Jul 31, 2024 at 9:14 AM Joe Conway <mail@joeconway.com> wrote: > In my opinion, for this use case and others, it should be possible to > redact the values substituted into log messages based on some criteria. > One of those criteria could be "I am in a leakproof call right now". In > fact in a similar fashion, an extension ought to be able to mutate the > log message based on the entire string, e.g. when "ALTER > ROLE...PASSWORD..." is spotted I would like to be able to redact > everything after "PASSWORD". This might be helpful, and unfortunately I'm all too familiar with the ALTER ROLE...PASSWORD example, but I think it's not really related to the question of whether we can mark upper() and lower() leakproof. If there are some inputs that cause upper() and lower() to fail and others that do not, the functions aren't leakproof, because an attacker can extract information about values that they can't see by feeding those values into these functions and seeing whether they get a failure or not. It doesn't matter what error message is produced; the fact that the function throws an error of any kind for some input values but enough for others is enough to make it unsafe, and it seems to me that we've repeatedly found that there's often a way to turn even what seems like a small leak into a very large one, so we need to be quite careful here. Annoyingly, we have a WHOLE BUNCH of different code paths that need to be assessed individually here. I think we can ignore the fact that upper() can throw errors when it's unclear which collation to use; I think that's a property of the query string rather than the input value. It's a bit less clear whether we can ignore out of memory conditions, but my judgement is that a possible OOM from a small allocation is not generally going to be useful as an attack vector. If a long string produces an error and a short one doesn't, that might qualify as a real leak. And other errors that depend on the input value are also leaks. So let's go through the code paths: - When lc_ctype_is_c(), we call asc_toupper(), which calls pg_ascii_toupper(), which just replaces a-z with A-Z. No errors. - When mylocale->provider == COLLPROVIDER_ICU, we call icu_to_uchar() and icu_convert_case(). icu_to_uchar() calls uchar_length(), which has this: if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR) ereport(ERROR, (errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status)))); I don't know what errors can be reported here, or if this ever fails in practice, so I can't prove it never fails consistently for some particular input string. uchar_convert() and icu_convert_case() have similar error reporting. - When mylocale->provider == COLLPROVIDER_BUILTIN, we call unicode_strupper() which calls unicode_to_utf8() which clearly throws no errors. Likewise unicode_utf8len() does not error. I don't see how we can get an error out of this path. - In cases not covered by the above, we take different paths depending on whether a multi-byte encoding is in use. In single-byte encodings, we rely on either pg_toupper() or toupper_l(). The latter is an OS-provided function so can't ereport(), and the former calls either does the work itself or calls toupper() which again is an OS-provided function and can't report(). - Finally, when mylocale == NULL or the provider is COLLPROVIDER_LIBC and the encoding is multi-byte, we use char2wchar(), then towupper_l() or towupper(), then wchar2char(). The whole thing can fall apart if the string is too long, which might be enough to declare this leakproof but it depends on whether the error guarded by /* Overflow paranoia */ is really just paranoia or whether it's actually reachable. Otherwise, towupper*() won't ereport because it's not part of PG, so we need to assess char2wchar() and wchar2char(). Here I note that char2wchar() claims that it does an ereport() if the input is invalidly encoded. This is kind of surprising when you think about it, because our usual policy is not to allow invalidly encoded data into the database in the first place, but whoever wrote this thought it would be possible to hit this case "if LC_CTYPE locale is different from the database encoding." But it seems that the logic here dates to commit 2ab0796d7a3a7116a79b65531fd33f1548514b52 back in 2011, so it seems at least possible to me that we've tightened things up enough since then that you can't actually hit this any more. But then again, maybe not. So in summary, I think upper() is ... pretty close to leakproof. But if ICU sometimes fails on certain strings, then it isn't. And if the multi-byte libc path can be made to fail reliably either with really long strings or with certain choices of the LC_CTYPE locale, then it isn't. -- Robert Haas EDB: http://www.enterprisedb.com
pgsql-hackers by date: