Thread: Collation patch's handling of wcstombs/mbstowcs is sheerest fantasy

Collation patch's handling of wcstombs/mbstowcs is sheerest fantasy

From

Tom Lane

Date:

22 April 2011, 20:32:38

I just noticed that the collation patch has modified char2wchar and
wchar2char to accept a collation OID as argument ... but it hasn't done
anything to make those arguments actually work.  Since those functions
depend on wcstombs and mbstowcs, which respond to LC_CTYPE and nothing
else, this flat out does not work in non-default collations.  What's
more, there doesn't seem to be any such thing as wcstombs_l or
mbstowcs_l (at least my Fedora box hasn't got them), so this can't be
fixed within the available glibc API.

Right at the moment this only affects str_tolower, str_toupper, and
str_initcap; there are other uses of these functions in the text search
code, but those always pass DEFAULT_COLLATION_OID.

It's possible that things are not too broken in practice, because it's
likely that the transformations done by these functions only depend on
the encoding indicated by LC_CTYPE, and we (try to) enforce that all
locales used in a given database match the database encoding.  Still,
that's a rather shaky chain of reasoning.

The complete lack of code comments on this doesn't make me any happier
--- in fact, the comments for char2wchar and wchar2char still claim that
they have the same API as wcstombs and mbstowcs, which can hardly be
considered true when they don't even have the same argument lists.

Any thoughts what to do about this?
        regards, tom lane

Re: Collation patch's handling of wcstombs/mbstowcs is sheerest fantasy

From

Tom Lane

Date:

23 April 2011, 03:58:55

I wrote:
> I just noticed that the collation patch has modified char2wchar and
> wchar2char to accept a collation OID as argument ... but it hasn't done
> anything to make those arguments actually work.  Since those functions
> depend on wcstombs and mbstowcs, which respond to LC_CTYPE and nothing
> else, this flat out does not work in non-default collations.  What's
> more, there doesn't seem to be any such thing as wcstombs_l or
> mbstowcs_l (at least my Fedora box hasn't got them), so this can't be
> fixed within the available glibc API.

I poked around a bit and found out that wcstombs_l and mbstowcs_l *do*
exist in some BSD-derived systems (in particular, OS X has 'em; and
Windows too).  Not clear why XBD didn't see fit to include them, because
so far as I can see there simply isn't any useful substitute.

What I think we need to do is:

* Redefine char2wchar and wchar2char as being like mbstowcs_l and
wcstombs_l (in particular, take pg_locale_t *not* a collation OID).

* Use mbstowcs_l and wcstombs_l where available.

* Where they're not, install the locale_t with uselocale(), do
mbstowcs or wcstombs, and revert to the former locale_t setting.
This is ugly as sin, and not thread-safe, but of course lots of
the backend is not thread-safe.  A quick look at the glibc source
for uselocale() suggests that it won't be too horribly inefficient.

Comments, better ideas?
        regards, tom lane

Re: Collation patch's handling of wcstombs/mbstowcs is sheerest fantasy

From

Tom Lane

Date:

23 April 2011, 15:37:58

I wrote:
> * Where they're not, install the locale_t with uselocale(), do
> mbstowcs or wcstombs, and revert to the former locale_t setting.
> This is ugly as sin, and not thread-safe, but of course lots of
> the backend is not thread-safe.

I've been corrected on that: uselocale() *is* thread safe, at least in
glibc (it only affects the locale used by the current thread).  And it's
"only a few instructions" according to Jakub Jelinek.  So a temporary
setting via uselocale is exactly what you're supposed to do for any
locale-sensitive function that hasn't got a *_l variant.
        regards, tom lane

Re: Collation patch's handling of wcstombs/mbstowcs is sheerest fantasy

From

Peter Eisentraut

Date:

24 April 2011, 23:57:08

On fre, 2011-04-22 at 16:32 -0400, Tom Lane wrote:
> It's possible that things are not too broken in practice, because it's
> likely that the transformations done by these functions only depend on
> the encoding indicated by LC_CTYPE, and we (try to) enforce that all
> locales used in a given database match the database encoding.  Still,
> that's a rather shaky chain of reasoning.

Yeah, that was the result I had arrived at.

Re: Collation patch's handling of wcstombs/mbstowcs is sheerest fantasy

From

Peter Eisentraut

Date:

24 April 2011, 23:57:45

On lör, 2011-04-23 at 11:37 -0400, Tom Lane wrote:
> I wrote:
> > * Where they're not, install the locale_t with uselocale(), do
> > mbstowcs or wcstombs, and revert to the former locale_t setting.
> > This is ugly as sin, and not thread-safe, but of course lots of
> > the backend is not thread-safe.
> 
> I've been corrected on that: uselocale() *is* thread safe, at least in
> glibc (it only affects the locale used by the current thread).  And it's
> "only a few instructions" according to Jakub Jelinek.  So a temporary
> setting via uselocale is exactly what you're supposed to do for any
> locale-sensitive function that hasn't got a *_l variant.

Good to know.  It was certainly very strange, this gap in the API.