pg_strcasecmp(), etc., have a dependency on LC_CTYPE, which means a
dependency on setlocale(). I'd like to eliminate those dependencies in
the backend because they cause significant annoyance, especially when
using non-libc providers.

Right now, these functions are effectively very close to plain-ASCII
semantics. If the byte is in the ASCII range, then only the characters
A..Z are folded. If using a multibyte encoding, any other byte is part
of a multibyte sequence, so the behavior of tolower() is undefined, and
I believe it usually returns 0.

So the only time tolower() matters is when using a single-byte encoding
and folding a character outside the ASCII range.
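
As a rough illustration of the folding rule described above (a sketch
only, not copied from the source tree; sketch_tolower() is a made-up
name):

```c
#include <ctype.h>

/*
 * Hypothetical sketch of the folding rule described in the text.
 * ASCII letters are folded directly; only high-bit bytes ever reach
 * the libc calls, which consult the global LC_CTYPE.
 */
static unsigned char
sketch_tolower(unsigned char ch)
{
	if (ch >= 'A' && ch <= 'Z')
		ch += 'a' - 'A';	/* plain ASCII fold, locale-independent */
	else if (ch >= 0x80 && isupper(ch))
		ch = (unsigned char) tolower(ch);	/* depends on setlocale() */
	return ch;
}
```

In the "C" locale only the first branch has any effect. Under a
single-byte locale such as Latin-1, the second branch could fold a byte
like 0xC9 to 0xE9, which is the one case where the result depends on
LC_CTYPE.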

Most of the callers seem to use these functions in a context that only
cares about ASCII, anyway.

There are a few callers where it matters, such as the implementations
of UPPER()/LOWER()/INITCAP() and LIKE. Those already need special
cases, so it's easy to inline them and make use of the pg_locale_t
object, thus avoiding the dependency on the global LC_CTYPE.
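
To illustrate the general idea of passing the locale explicitly instead
of consulting the global LC_CTYPE, here is a minimal sketch that uses
the POSIX locale_t with isupper_l()/tolower_l() as a stand-in. That is
an assumption made purely for illustration: pg_locale_t is a different
type with its own API, and fold_with_locale() is a made-up name.

```c
#define _POSIX_C_SOURCE 200809L
#include <ctype.h>
#include <locale.h>

/*
 * Hypothetical helper: same folding rule as before, but the locale is
 * an explicit argument, so the result never depends on setlocale().
 * POSIX locale_t is used here only as a stand-in for pg_locale_t.
 */
static unsigned char
fold_with_locale(unsigned char ch, locale_t loc)
{
	if (ch >= 'A' && ch <= 'Z')
		return (unsigned char) (ch + ('a' - 'A'));
	if (ch >= 0x80 && isupper_l(ch, loc))
		return (unsigned char) tolower_l(ch, loc);
	return ch;
}
```

A caller would create the locale object once, for example with
newlocale(LC_CTYPE_MASK, "C", (locale_t) 0), pass it to each call, and
release it with freelocale() when done.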

There's a comment at the top of the file saying:

  NB: this code should match downcase_truncate_identifier() in
  scansup.c.

but I don't see call sites where that's likely to matter. I'd like to
do something about downcase_identifier() as well, but that has more
serious compatibility issues if someone is affected, so it needs a bit
more care. Also, given that downcase_identifier() checks for a
single-byte encoding and these other functions do not, I don't think
there's any guarantee that they are identical in behavior.

While I can imagine that the tolower() call may have been useful at one
time, the fact that it doesn't work for UTF-8 makes me think it's not
widely relied upon.

Am I missing something? Perhaps it matters for code outside the
backend?

Attached is a patch to remove the tolower() calls from pgstrcasecmp.c,
and fix up the few call sites where it's needed.
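
For illustration, an ASCII-only case-insensitive comparison with no
tolower() call might look roughly like this. It is a sketch under stated
assumptions, not taken from the attached patch; ascii_tolower() and
ascii_strcasecmp() are hypothetical names.

```c
/* Fold only A..Z; every other byte, including high-bit bytes, is kept. */
static unsigned char
ascii_tolower(unsigned char ch)
{
	if (ch >= 'A' && ch <= 'Z')
		ch += 'a' - 'A';
	return ch;
}

/*
 * Case-insensitive comparison with plain-ASCII semantics. Returns a
 * negative, zero, or positive value like strcmp(), and does not depend
 * on LC_CTYPE or setlocale().
 */
static int
ascii_strcasecmp(const char *s1, const char *s2)
{
	for (;;)
	{
		unsigned char ch1 = ascii_tolower((unsigned char) *s1++);
		unsigned char ch2 = ascii_tolower((unsigned char) *s2++);

		if (ch1 != ch2)
			return (int) ch1 - (int) ch2;
		if (ch1 == 0)
			break;
	}
	return 0;
}
```

With this version, bytes outside the ASCII range compare bytewise in
every encoding, so the result is the same regardless of the locale in
effect.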

Regards,
	Jeff Davis