Thread: Collation patch's handling of wcstombs/mbstowcs is sheerest fantasy
I just noticed that the collation patch has modified char2wchar and wchar2char to accept a collation OID as argument ... but it hasn't done anything to make those arguments actually work. Since those functions depend on wcstombs and mbstowcs, which respond to LC_CTYPE and nothing else, this flat out does not work in non-default collations. What's more, there doesn't seem to be any such thing as wcstombs_l or mbstowcs_l (at least my Fedora box hasn't got them), so this can't be fixed within the available glibc API. Right at the moment this only affects str_tolower, str_toupper, and str_initcap; there are other uses of these functions in the text search code, but those always pass DEFAULT_COLLATION_OID. It's possible that things are not too broken in practice, because it's likely that the transformations done by these functions only depend on the encoding indicated by LC_CTYPE, and we (try to) enforce that all locales used in a given database match the database encoding. Still, that's a rather shaky chain of reasoning. The complete lack of code comments on this doesn't make me any happier --- in fact, the comments for char2wchar and wchar2char still claim that they have the same API as wcstombs and mbstowcs, which can hardly be considered true when they don't even have the same argument lists. Any thoughts what to do about this? regards, tom lane
I wrote: > I just noticed that the collation patch has modified char2wchar and > wchar2char to accept a collation OID as argument ... but it hasn't done > anything to make those arguments actually work. Since those functions > depend on wcstombs and mbstowcs, which respond to LC_CTYPE and nothing > else, this flat out does not work in non-default collations. What's > more, there doesn't seem to be any such thing as wcstombs_l or > mbstowcs_l (at least my Fedora box hasn't got them), so this can't be > fixed within the available glibc API. I poked around a bit and found out that wcstombs_l and mbstowcs_l *do* exist in some BSD-derived systems (in particular, OS X has 'em; and Windows too). Not clear why XBD didn't see fit to include them, because so far as I can see there simply isn't any useful substitute. What I think we need to do is: * Redefine char2wchar and wchar2char as being like mbstowcs_l and wcstombs_l (in particular, take pg_locale_t *not* a collation OID). * Use mbstowcs_l and wcstombs_l where available. * Where they're not, install the locale_t with uselocale(), do mbstowcs or wcstombs, and revert to the former locale_t setting. This is ugly as sin, and not thread-safe, but of course lots of the backend is not thread-safe. A quick look at the glibc source for uselocale() suggests that it won't be too horribly inefficient. Comments, better ideas? regards, tom lane
I wrote: > * Where they're not, install the locale_t with uselocale(), do > mbstowcs or wcstombs, and revert to the former locale_t setting. > This is ugly as sin, and not thread-safe, but of course lots of > the backend is not thread-safe. I've been corrected on that: uselocale() *is* thread safe, at least in glibc (it only affects the locale used by the current thread). And it's "only a few instructions" according to Jakub Jelinek. So a temporary setting via uselocale is exactly what you're supposed to do for any locale-sensitive function that hasn't got a *_l variant. regards, tom lane
On fre, 2011-04-22 at 16:32 -0400, Tom Lane wrote: > It's possible that things are not too broken in practice, because it's > likely that the transformations done by these functions only depend on > the encoding indicated by LC_CTYPE, and we (try to) enforce that all > locales used in a given database match the database encoding. Still, > that's a rather shaky chain of reasoning. Yeah, that was the result I had arrived at.
On lör, 2011-04-23 at 11:37 -0400, Tom Lane wrote: > I wrote: > > * Where they're not, install the locale_t with uselocale(), do > > mbstowcs or wcstombs, and revert to the former locale_t setting. > > This is ugly as sin, and not thread-safe, but of course lots of > > the backend is not thread-safe. > > I've been corrected on that: uselocale() *is* thread safe, at least in > glibc (it only affects the locale used by the current thread). And it's > "only a few instructions" according to Jakub Jelinek. So a temporary > setting via uselocale is exactly what you're supposed to do for any > locale-sensitive function that hasn't got a *_l variant. Good to know. It was certainly very strange, this gap in the API.