Can we make pg_strcasecmp(), pg_tolower(), pg_toupper() plain ASCII semantics? - Mailing list pgsql-hackers

From	Jeff Davis
Subject	Can we make pg_strcasecmp(), pg_tolower(), pg_toupper() plain ASCII semantics?
Date	October 21 00:02:47
Msg-id	b2a9bec4d9fb7407967e3c4b762b990155a17340.camel@j-davis.com
List	pgsql-hackers

pg_strcasecmp(), etc., have a dependency on LC_CTYPE, which means a
dependency on setlocale(). I'd like to eliminate those dependencies in
the backend because they cause significant annoyance, especially when
using non-libc providers.

Right now, these functions are already very close to plain-ASCII
semantics. If the byte is in the ASCII range, only the characters A..Z
are folded. In a multibyte encoding, any other byte is part of a
multibyte sequence, so the behavior of tolower() on it is undefined,
and I believe it usually returns 0.

So the only time tolower() matters is when using a single-byte encoding
and folding a character outside the ASCII range.
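
To make the difference concrete, here is a minimal sketch of the two
behaviors. The function names are hypothetical and this is not the code
in pgstrcasecmp.c; it only assumes the structure described above (fold
A..Z directly, and consult the global LC_CTYPE only for high-bit bytes).

```c
#include <ctype.h>

/*
 * Hypothetical sketch of the locale-dependent folding described above:
 * ASCII letters are folded directly, and only high-bit bytes reach the
 * libc functions, which consult the global LC_CTYPE.
 */
static unsigned char
sketch_tolower_locale(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    else if ((ch & 0x80) && isupper(ch))
        ch = (unsigned char) tolower(ch);
    return ch;
}

/*
 * Hypothetical sketch of plain ASCII semantics: the same function with
 * the locale branch removed, so no setlocale() state is involved.
 */
static unsigned char
sketch_tolower_ascii(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    return ch;
}
```

The two can only disagree on a byte such as 0xC9 (É in LATIN1), and
only when LC_CTYPE is set to a matching single-byte locale. The ASCII
version always leaves that byte unchanged.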

Most of the callers seem to use these functions in a context that only
cares about ASCII, anyway.
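
As an illustration of such a caller-facing function, an ASCII-only
case-insensitive comparison might look like the following. This is a
sketch under my own naming (sketch_strcasecmp_ascii is hypothetical),
not the contents of the attached patch.

```c
/*
 * Hypothetical ASCII-only case-insensitive comparison. Bytes outside
 * A..Z are compared as-is, so the result never depends on LC_CTYPE.
 * Returns <0, 0, or >0 like strcmp().
 */
static int
sketch_strcasecmp_ascii(const char *s1, const char *s2)
{
    for (;;)
    {
        unsigned char ch1 = (unsigned char) *s1++;
        unsigned char ch2 = (unsigned char) *s2++;

        if (ch1 != ch2)
        {
            /* Fold only when the raw bytes differ. */
            if (ch1 >= 'A' && ch1 <= 'Z')
                ch1 += 'a' - 'A';
            if (ch2 >= 'A' && ch2 <= 'Z')
                ch2 += 'a' - 'A';
            if (ch1 != ch2)
                return (int) ch1 - (int) ch2;
        }
        if (ch1 == 0)
            break;              /* both strings ended together */
    }
    return 0;
}
```

A typical ASCII-only use would be matching a keyword or option name,
for example sketch_strcasecmp_ascii(value, "utf8") == 0.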

There are a few callers where it matters, such as the implementations
of UPPER()/LOWER()/INITCAP() and LIKE. Those already need special
cases, so it's easy to inline them and make use of the pg_locale_t
object, thus avoiding the dependency on the global LC_CTYPE.
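
The idea of passing the locale explicitly can be sketched with the
POSIX.1-2008 locale_t API, used here only as a stand-in: pg_locale_t is
PostgreSQL's own type and its real interface is not shown. The helper
name is hypothetical, and the sketch assumes a single-byte encoding and
a platform that provides isupper_l() and tolower_l() via <ctype.h>.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper: lower-case a single-byte-encoded buffer in
 * place using an explicit locale object rather than the global
 * LC_CTYPE. POSIX locale_t stands in for pg_locale_t here.
 */
static void
sketch_lower_with_locale(char *buf, size_t len, locale_t loc)
{
    assert(loc != (locale_t) 0);

    for (size_t i = 0; i < len; i++)
    {
        unsigned char ch = (unsigned char) buf[i];

        if (ch >= 'A' && ch <= 'Z')
            buf[i] = (char) (ch + ('a' - 'A'));
        else if ((ch & 0x80) && isupper_l(ch, loc))
            buf[i] = (char) tolower_l(ch, loc);
    }
}
```

A caller would obtain the locale with newlocale(LC_CTYPE_MASK, name,
(locale_t) 0) and release it with freelocale(); nothing in the helper
reads state set by setlocale().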

There's a comment at the top of the file saying:

  NB: this code should match downcase_truncate_identifier() in scansup.c.

but I don't see call sites where that's likely to matter. I'd like to
do something about downcase_identifier() as well, but that has more
serious compatibility issues if someone is affected, so it needs a bit
more care. Also, given that downcase_identifier() checks for a
single-byte encoding and these other functions do not, I don't think
there's any guarantee that they are identical in behavior.

While I can imagine that the tolower() call may have been useful at one
time, the fact that it doesn't work for UTF-8 makes me think it's not
widely relied upon.

Am I missing something? Perhaps it matters for code outside the
backend? 

Attached is a patch to remove the tolower() calls from pgstrcasecmp.c,
and fix up the few call sites where it's needed.

Regards,
    Jeff Davis

Attachment

v1-0001-Remove-tolower-call-from-pgstrcasecmp.c-functions.patch
