Thread: Windows and locales and UTF-8 (oh my)
I've been learning much more than I wanted to know about $SUBJECT since
putting in the src/port/chklocale.c code to try to enforce that our
database encoding matches the system locale settings.  There's an ongoing
thread in -patches that's been focused on getting reasonable behavior from
the point of view of the Far Eastern contingent:
http://archives.postgresql.org/pgsql-patches/2007-10/msg00031.php
(Some of that's been applied, but not the very latest proposals.)

Here's some more info from an off-list discussion with Dave Page:

------- Forwarded Messages

Date: Fri, 05 Oct 2007 20:54:04 +0100
From: Dave Page <dpage@postgresql.org>
To: Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: [CORE] 8.3beta1 Available ...

Dave Page wrote:
> Some further info on that - utf-8 on Windows is actually a
> pseudo-codepage (65001) which doesn't have NLS files, hence why we have
> to convert to utf-16 before sorting. Perhaps the utf-8/65001 name
> difference is the problem here. I'll knock up a quick test program when
> the kids have gone to bed.

So, my test prog (below) returns the following:

Dave@SNAKE:~$ ./setlc "English_United Kingdom.65001"
LC_COLLATE=English_United Kingdom.65001;LC_CTYPE=C;LC_MONETARY=English_United Kingdom.65001;LC_NUMERIC=English_United Kingdom.65001;LC_TIME=English_United Kingdom.65001

So everything other than LC_CTYPE is acceptable in UTF-8 on Windows - and
we already handle LC_CTYPE for UTF-8 on Windows through our UTF-8 ->
UTF-16 conversions internally.

Can we change initdb to test against LC_TIME instead of LC_CTYPE perhaps?

Regards, Dave.

#include <locale.h>
#include <stdio.h>

int
main(int argc, char *argv[])
{
    char   *lc;

    if (argc > 1)
        setlocale(LC_ALL, argv[1]);
    lc = setlocale(LC_ALL, NULL);
    printf("%s\n", lc);
    return 0;
}

------- Message 2

Date: Fri, 05 Oct 2007 23:32:36 +0100
From: Dave Page <dpage@postgresql.org>
To: Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: [CORE] 8.3beta1 Available ...

Tom Lane wrote:
> Dave Page <dpage@postgresql.org> writes:
>> So, my test prog (below) returns the following:
>
>> Dave@SNAKE:~$ ./setlc "English_United Kingdom.65001"
>> LC_COLLATE=English_United Kingdom.65001;LC_CTYPE=C;
>> LC_MONETARY=English_United Kingdom.65001;LC_NUMERIC=English_United
>> Kingdom.65001;LC_TIME=English_United Kingdom.65001
>
> That's just frickin' weird ... and a bit scary.  There is a fair amount
> of code in PG that checks for lc_ctype_is_c and does things differently;
> one wonders if that isn't going to get misled by this behavior.  (Hmm,
> maybe this explains some of the "upper/lower doesn't work" reports we've
> been getting??)  Are you sure all variants of Windows act that way?

All the ones we support, AFAICT.

>> Can we change initdb to test against LC_TIME instead of LC_CTYPE perhaps?
>
> Is there something in Windows that constrains them to be all the same?
> If not this proposal seems just plain wrong :-(  But in any case I'd
> feel more comfortable having it look at LC_COLLATE.

They can all be set independently - it's just that there are no UTF-7
(65000) or UTF-8 (65001) NLS files (http://shlimazl.nm.ru/eng/nls.htm)
defining them fully, so Windows doesn't know any more than the characters
that are in both 'pseudo codepages'.

As a result, you can't set LC_CTYPE to .65001 because Windows knows it
can't handle ToUpper() or ToLower() etc., but you can use it to encode
messages and other text.
/D

------- End of Forwarded Messages

I am thinking that Dave's discovery explains some previously unsolved bug
reports, such as
http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php
If Windows returns LC_CTYPE=C in a situation like this, then the various
single-byte-charset optimization paths that are enabled by lc_ctype_is_c()
would be mistakenly used, leading to misbehavior in upper()/lower() and
other places.  ISTM we had better hack lc_ctype_is_c() so that on Windows
(only), if the database encoding is UTF-8 then it returns FALSE regardless
of what setlocale says.

That still leaves me with a boatload of questions, though.  If we can't
trust LC_CTYPE as an indicator of the system charset, what can we trust?
In particular this seems to say that looking at LC_CTYPE for chklocale's
purposes is completely useless; what do we look at instead?

Another issue: is it possible to set, say, LC_MESSAGES and LC_TIME to
different codepages, and if so what happens?  If that does enable
different bits of infrastructure to return incompatibly encoded strings,
seems we need a defense against that --- what should it be?

One bright spot is that this does seem to suggest a way to implement the
recommendation I made in the -patches thread: if we can't support the
encoding (codepage) used by the locale seen by initdb, we could try
stripping the codepage indicator (if any) and plastering on .65001 to get
a UTF8-compatible locale name.  That'd only work on Windows, but that
seems the platform where we're most likely to see unsupportable default
encodings.

Comments?  I don't have a Windows development environment, so I'm not in a
position to take the lead on testing/fixing this sort of stuff.

			regards, tom lane
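[For illustration only: a minimal, untested sketch of the Windows-only
special case proposed above.  The function name, the boolean parameter,
and the surrounding check are assumptions made for this sketch, not
PostgreSQL's actual lc_ctype_is_c() internals.]

#include <stdbool.h>
#include <string.h>
#include <locale.h>

/*
 * Sketch only: on Windows, if the database encoding is UTF-8, refuse to
 * believe a "C" answer from setlocale(), since Windows may report "C"
 * merely because the 65001 pseudo-codepage has no NLS data.
 */
static bool
ctype_locale_is_c(bool database_encoding_is_utf8)
{
    const char *ctype = setlocale(LC_CTYPE, NULL);

#ifdef WIN32
    if (database_encoding_is_utf8)
        return false;           /* don't take single-byte fast paths */
#endif
    return ctype != NULL &&
           (strcmp(ctype, "C") == 0 || strcmp(ctype, "POSIX") == 0);
}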
It seems like the root of the problems we're butting our heads against
with encoding and locale is all the same issue: it's nonsensical to take
the locale per-cluster at initdb time and then allow a user-specified
encoding per-database.  If anything it would make more sense to go the
other way around.

But actually it seems to me we could allow changing both on a per-database
basis with certain restrictions:

. template0 is always SQL_ASCII with locale C

. when creating a new database you can specify the encoding and locale,
  and we check that they're compatible.

. when creating a new database from a template, the new locale and
  encoding must be identical to the template database's encoding and
  locale -- unless the template is template0, in which case we rebuild all
  indexes after copying.

We could liberalize this last restriction if we created a new encoding
like SQL_ASCII but which enforces 7-bit ASCII.  But then the index rebuild
step could take a long time.

This would make the whole locale/encoding issue much more transparent.  In
database listings you would see both listed alongside each other, you
wouldn't be bound by any initdb environment choices, and errors when
running CREATE DATABASE would be able to tell you exactly what you're
doing wrong and what you have to do to avoid the problem.

--
 Gregory Stark
 EnterpriseDB          http://www.enterprisedb.com
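[A compact restatement of the second and third rules above, as a
hypothetical helper rather than actual createdb() code; the assumption
that template0 is always SQL_ASCII with locale C is the first rule.]

#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical sketch: a new database must match its template's encoding
 * and locale exactly, except when the template is template0, in which
 * case any combination is allowed and the caller is expected to rebuild
 * all indexes after copying.
 */
static bool
new_db_settings_allowed(const char *template_name,
                        const char *template_encoding,
                        const char *template_locale,
                        const char *new_encoding,
                        const char *new_locale)
{
    if (strcmp(template_name, "template0") == 0)
        return true;            /* allowed, but reindex after the copy */

    return strcmp(template_encoding, new_encoding) == 0 &&
           strcmp(template_locale, new_locale) == 0;
}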
On Friday, 12 October 2007, Gregory Stark wrote:
> . when creating a new database from a template, the new locale and encoding
>   must be identical to the template database's encoding and locale --
>   unless the template is template0, in which case we rebuild all indexes
>   after copying.

Why would you restrict the index rebuilding to only this particular case?
It could be done for any database.

The other issue is shared catalogs.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
"Peter Eisentraut" <peter_e@gmx.net> writes: > Am Freitag, 12. Oktober 2007 schrieb Gregory Stark: >> . when creating a new database from a template the new locale and encoding >> must be identical to the template database's encoding and locale. Unless >> the template is template0 in which case we rebuild all indexes after >> copying. > > Why would you restrict the index rebuilding only to this particular case? It > could be done for any database. Well there's no guarantee there isn't 8-bit data in other databases which would be invalid in the new encoding. I think it's reasonable to assume there's only 7-bit ascii in template0 however. An alternative would be introducing an ASCII7 encoding which template0 would use and any other database in that encoding could be used as a template for any encoding. However that would still require index rebuilds which would potentially take a long time. Another alternative would be recoding all the data from the template database encoding to the new encoding and throwing an error if a non-encodable character is found. I think it's a lot simpler to just declare it a non-problem by saying there won't be any non-ascii text in template0. > The other issue are shared catalogs. This approach doesn't address that but I don't think it makes the problems there any worse either. That is, I think already have these problems around shared tables. . If you have two databases with locales that don't agree then the indexes on those tables won't function properly. . What happens if you create a user while connected to a latin1 database with an é in his username and then connect to adatabase in a UTF8 database? That username is now an invalidly encoded UTF8 string. Perhaps we should be using pattern_ops for the indexes on the shared tables? Or using bytea with UTF8 encoded strings instead of name and text? That actually sounds reasonable now that we have convert() functions which take and generate bytea, at least for the text fields like in pltemplate -- less so for the name columns. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Fri, Oct 12, 2007 at 02:03:47PM +0100, Gregory Stark wrote:
> This approach doesn't address that, but I don't think it makes the
> problems there any worse either.  That is, I think we already have these
> problems around shared tables.

Or we could just set up encodings/locales per column and the problem goes
away entirely.  Most of the code's already been written; it's not even
terribly difficult.  Where we're stuck is that we can't agree on a source
of locale data.  People don't want the ICU or glibc data, and there's no
other source as readily available.

Perhaps we should fix that problem, rather than making more workarounds.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
"Martijn van Oosterhout" <kleptog@svana.org> writes: > People don't want the ICU or glibc data and there's no other source as > readily available. > > Perhaps we should fix that problem, rather than making more > workarounds. Fix the problem by making ICU a smaller less complex dependency? Or fix the problem that glibc isn't everyone's libc? I think realistically we're basically waiting for strcoll_l to become standardized by POSIX so we can depend on it. Personally I think we should just implement our own strcoll_l as a wrapper around setlocale-strcoll-setlocale and use strcoll_l if it's available and our, possibly slow, wrapper if not. If we ban direct use of strcoll and other lc_collate sensitive functions in Postgres we could also remember the last locale used and not do unnecessary setlocales so existing use cases aren't slowed down at all. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Friday, 12 October 2007, Martijn van Oosterhout wrote:
> Where we're stuck is that we can't agree on a
> source of locale data. People don't want the ICU or glibc data and
> there's no other source as readily available.

What were the objections to ICU?

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
"Peter Eisentraut" <peter_e@gmx.net> writes: > Am Freitag, 12. Oktober 2007 schrieb Martijn van Oosterhout: >> Where we're stuck is that we can't agree on a >> source of locale data. People don't want the ICU or glibc data and >> there's no other source as readily available. > > What were the objections to ICU? It's introducing a new dependency to do something fundamental to Postgres, one that's larger than all of Postgres. It would make Postgres inconsistent and less integrated with the rest of the OS. How do you explain that Postgres doesn't follow the system's configurations and the collations don't agree with the system collations? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Oct 12, 2007, at 10:19, Gregory Stark wrote:

> It would make Postgres inconsistent and less integrated with the
> rest of the OS.  How do you explain that Postgres doesn't follow the
> system's configurations and the collations don't agree with the system
> collations?

How is this fundamentally different from PostgreSQL using a separate
users/roles system from the OS?

Michael Glaesemann
grzm seespotcode net
On Friday, 12 October 2007, Gregory Stark wrote:
> It would make Postgres inconsistent and less integrated with the rest of
> the OS.  How do you explain that Postgres doesn't follow the system's
> configurations and the collations don't agree with the system collations?

We already have our own encoding support (for better or worse), and I
don't think having one's own locale support would be that much different.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
Michael Glaesemann wrote:
>
> On Oct 12, 2007, at 10:19, Gregory Stark wrote:
>
>> It would make Postgres inconsistent and less integrated with the rest
>> of the OS.  How do you explain that Postgres doesn't follow the
>> system's configurations and the collations don't agree with the system
>> collations?
>
> How is this fundamentally different from PostgreSQL using a separate
> users/roles system from the OS?

Even more so: eliminating dependencies on an OS's correct implementation
of locale stuff appears to be A Good Thing to me.

I wonder if a compile-time option to use ICU in 8.4 should be considered,
given all those lengthy threads about encoding/locale/collation problems.

Regards,
Andreas
On Fri, Oct 12, 2007 at 03:28:26PM +0100, Gregory Stark wrote:
> Fix the problem by making ICU a smaller, less complex dependency?

How?  It's 95% data; you can't reduce that.  glibc also has 10MB of locale
data.  The actual code is much smaller than Postgres and doesn't depend on
any other non-system libraries.

> I think realistically we're basically waiting for strcoll_l to become
> standardized by POSIX so we can depend on it.

I think we could be waiting forever then.  It's supported by Win32, MacOS X
and glibc.  The systems that don't support it tend not to support
multibyte collation anyway.  Patches have been created to use this and
rejected because not enough platforms support it...

> Personally I think we should just implement our own strcoll_l as a wrapper
> around setlocale-strcoll-setlocale, and use strcoll_l if it's available
> and our (possibly slow) wrapper if not.  If we ban direct use of strcoll
> and other LC_COLLATE-sensitive functions in Postgres, we could also
> remember the last locale used and not do unnecessary setlocales, so
> existing use cases aren't slowed down at all.

That's been done also.  As I recall it was *really* slow, not just a
little bit.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Peter Eisentraut <peter_e@gmx.net> writes:
> On Friday, 12 October 2007, Gregory Stark wrote:
>> It would make Postgres inconsistent and less integrated with the rest of
>> the OS.  How do you explain that Postgres doesn't follow the system's
>> configurations and the collations don't agree with the system collations?

> We already have our own encoding support (for better or worse), and I
> don't think having one's own locale support would be that much different.

Well, yes it would be, because encodings are pretty well standardized;
there is not likely to be any user-visible difference between one
platform's idea of UTF8 and another's.  This is very very far from being
the case for locales.  See for instance the recent thread in which we
found out that the "en_US" locale has utterly different sort orders on
Linux and OS X.

			regards, tom lane
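[A quick way to observe the kind of difference Tom mentions.  Results are
platform-dependent, so the expectations in the comments are only what one
would typically see; the locale name may also need to be spelled
differently on some systems.]

#include <locale.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    /*
     * glibc's en_US.UTF-8 typically sorts "a" before "B" (letters first,
     * case second); OS X reportedly falls back to byte order for UTF-8
     * locales, putting "B" first.  Hence the sign of strcoll() can differ
     * between platforms for the very same locale name.
     */
    if (setlocale(LC_COLLATE, "en_US.UTF-8") == NULL)
    {
        fprintf(stderr, "en_US.UTF-8 not available here\n");
        return 1;
    }
    printf("strcoll(\"a\", \"B\") = %d\n", strcoll("a", "B"));
    return 0;
}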
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Fri, Oct 12, 2007 at 03:28:26PM +0100, Gregory Stark wrote:
>> I think realistically we're basically waiting for strcoll_l to become
>> standardized by POSIX so we can depend on it.

> I think we could be waiting forever then.

strcoll is only a small fraction of the problem anyway.  The <ctype.h>
and <wctype.h> functions are another chunk of it, and then there are the
issues of system message spellings, LC_MONETARY info, etc etc.

			regards, tom lane
Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
>> On Friday, 12 October 2007, Gregory Stark wrote:
>>> It would make Postgres inconsistent and less integrated with the rest of
>>> the OS.  How do you explain that Postgres doesn't follow the system's
>>> configurations and the collations don't agree with the system collations?
>
>> We already have our own encoding support (for better or worse), and I
>> don't think having one's own locale support would be that much different.
>
> Well, yes it would be, because encodings are pretty well standardized;
> there is not likely to be any user-visible difference between one
> platform's idea of UTF8 and another's.  This is very very far from being
> the case for locales.  See for instance the recent thread in which we
> found out that the "en_US" locale has utterly different sort orders on
> Linux and OS X.

For me, this paragraph is more of an argument *in favour* of having our
own locale support.  At least for me, consistency between PG running on
different platforms would bring more benefits than consistency between PG
and the platform it runs on.

At the company I used to work for, we had all our databases running with
encoding=utf-8 and locale=C, because I didn't want our applications to
depend on platform-specific locale issues.  Plus, some of the applications
supported multiple languages, making a cluster-global locale unworkable
anyway - a restriction which would go away if we went with ICU.

regards, Florian Pflug
On Sat, Oct 06, 2007 at 01:53:31PM -0400, Tom Lane wrote:
> I've been learning much more than I wanted to know about $SUBJECT
> since putting in the src/port/chklocale.c code to try to enforce
> that our database encoding matches the system locale settings.
> There's an ongoing thread in -patches that's been focused on
> getting reasonable behavior from the point of view of the Far
> Eastern contingent:
> http://archives.postgresql.org/pgsql-patches/2007-10/msg00031.php
> (Some of that's been applied, but not the very latest proposals.)
> Here's some more info from an off-list discussion with Dave Page:

Sorry for the late response to this.  I missed the beginning and then got
mixed up in the different threads going around :-)

> Tom Lane wrote:
> > Dave Page <dpage@postgresql.org> writes:
> >> So, my test prog (below) returns the following:
> >
> >> Dave@SNAKE:~$ ./setlc "English_United Kingdom.65001"
> >> LC_COLLATE=English_United Kingdom.65001;LC_CTYPE=C;
> >> LC_MONETARY=English_United Kingdom.65001;LC_NUMERIC=English_United
> >> Kingdom.65001;LC_TIME=English_United Kingdom.65001
> >
> > That's just frickin' weird ... and a bit scary.  There is a fair amount
> > of code in PG that checks for lc_ctype_is_c and does things differently;
> > one wonders if that isn't going to get misled by this behavior.  (Hmm,
> > maybe this explains some of the "upper/lower doesn't work" reports we've
> > been getting??)  Are you sure all variants of Windows act that way?
>
> All the ones we support, AFAICT.

AFAICT, this has been standard behaviour in Windows since forever -
certainly since Windows 2000, which is what we care about.  Windows 9x had
different ways of dealing with it since it wasn't natively UTF16
internally, but that doesn't matter to us here.

> >> Can we change initdb to test against LC_TIME instead of LC_CTYPE perhaps?
> >
> > Is there something in Windows that constrains them to be all the same?
> > If not this proposal seems just plain wrong :-(  But in any case I'd
> > feel more comfortable having it look at LC_COLLATE.
>
> They can all be set independently - it's just that there are no UTF-7
> (65000) or UTF-8 (65001) NLS files (http://shlimazl.nm.ru/eng/nls.htm)
> defining them fully, so Windows doesn't know any more than the characters
> that are in both 'pseudo codepages'.
>
> As a result, you can't set LC_CTYPE to .65001 because Windows knows it
> can't handle ToUpper() or ToLower() etc., but you can use it to encode
> messages and other text.

Yes.  And also important: you can set LC_COLLATE to it, which will make
all the UTF16 versions of the functions behave properly.  Remember - all
Windows NT+ operations are UTF16 internally.  So when you set LC_TIME to
it, for example, the API functions will generate the resulting string in
UTF16 and then convert it to whatever encoding you chose - be it UTF8 or
LATIN1 or whatever.

> I am thinking that Dave's discovery explains some previously unsolved
> bug reports, such as
> http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php
> If Windows returns LC_CTYPE=C in a situation like this, then
> the various single-byte-charset optimization paths that are enabled by
> lc_ctype_is_c() would be mistakenly used, leading to misbehavior in
> upper()/lower() and other places.  ISTM we had better hack
> lc_ctype_is_c() so that on Windows (only), if the database encoding
> is UTF-8 then it returns FALSE regardless of what setlocale says.

Yes, I think we need a change to that routine.
But what about the case when we actually *have* locale=C and
encoding=UTF8?  We need to care for that one somehow.  Perhaps we should
look at LC_COLLATE instead (again, on Windows only; possibly even only in
the windows + locale_returns_C + encoding=UTF8 case, to distinguish these
two)?

> That still leaves me with a boatload of questions, though.  If we can't
> trust LC_CTYPE as an indicator of the system charset, what can we trust?
> In particular this seems to say that looking at LC_CTYPE for chklocale's
> purposes is completely useless; what do we look at instead?

GetACP() returns the "ANSI Codepage", which I *think* is what we're
looking for here.
http://msdn2.microsoft.com/en-us/library/ms776259.aspx
We should be able to compare that to something?

> Another issue: is it possible to set, say, LC_MESSAGES and LC_TIME to
> different codepages, and if so what happens?  If that does enable
> different bits of infrastructure to return incompatibly encoded strings,
> seems we need a defense against that --- what should it be?

AFAIK, yes, and then you get it back in the wrong encoding.  But as long
as we set them all to the same thing, we should be safe.  And AFAIK, only
UTF8 (and UTF7, but we don't support that) is the special case we need to
care about.

> One bright spot is that this does seem to suggest a way to implement the
> recommendation I made in the -patches thread: if we can't support the
> encoding (codepage) used by the locale seen by initdb, we could try
> stripping the codepage indicator (if any) and plastering on .65001
> to get a UTF8-compatible locale name.  That'd only work on Windows,
> but that seems the platform where we're most likely to see unsupportable
> default encodings.

Um, yes, that should work - assuming the encoding is set to UTF8.  We
can't do that for any other encoding, of course.

> Comments?  I don't have a Windows development environment, so I'm not
> in a position to take the lead on testing/fixing this sort of stuff.

I have the Windows dev environment, but I feel like I'm on deep water
whenever I talk locale/encoding stuff - I don't know it as well as I'd
like to.  But I'm happy to do coding and testing if I can get enough
pointers on what I need to test :)

//Magnus
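[A trivial Windows-only sketch of what such a comparison could start
from.  GetACP() is a real Win32 call; the code-page-to-encoding mapping
below is only stubbed out for illustration - a real check would need a
full lookup table.]

#include <windows.h>
#include <stdio.h>

int
main(void)
{
    UINT    acp = GetACP();     /* the system "ANSI" code page */

    printf("ANSI code page: %u\n", acp);
    if (acp == 65001)
        printf("system default is UTF-8\n");
    else if (acp == 1252)
        printf("system default maps to WIN1252\n");
    else
        printf("some other code page; a real check needs a lookup table\n");
    return 0;
}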
On Mon, Oct 15, 2007 at 11:09:54AM +0200, Magnus Hagander wrote:
> On Sat, Oct 06, 2007 at 01:53:31PM -0400, Tom Lane wrote:
> > I am thinking that Dave's discovery explains some previously unsolved
> > bug reports, such as
> > http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php
> > If Windows returns LC_CTYPE=C in a situation like this, then
> > the various single-byte-charset optimization paths that are enabled by
> > lc_ctype_is_c() would be mistakenly used, leading to misbehavior in
> > upper()/lower() and other places.  ISTM we had better hack
> > lc_ctype_is_c() so that on Windows (only), if the database encoding
> > is UTF-8 then it returns FALSE regardless of what setlocale says.
>
> Yes, I think we need a change to that routine.
>
> But what about the case when we actually *have* locale=C and
> encoding=UTF8?  We need to care for that one somehow.  Perhaps we should
> look at LC_COLLATE instead (again, on Windows only; possibly even only in
> the windows + locale_returns_C + encoding=UTF8 case, to distinguish these
> two)?

Hmm.  Looking more at that, might there be another problem?  Looking at
WriteControlFile(), it writes out what setlocale(LC_CTYPE) returns, which
will then be "C" - even if the database isn't in C.

But I don't really know when that code is called, or if I'm just looking
at things wrong.  Just starting up and shutting down the database leaves
it at Swedish_Sweden.1252, not C.  (1252 is still the wrong encoding
specifier, but it'll work anyway since we convert to UTF16.)

Now, I came across this trying to find a way for lc_ctype_is_c() to
determine whether the database is in the C locale or not, *without*
resorting to setlocale().  Any pointers on how to do that properly?  Also,
any pointers on a way to check for the kind of failure that's to be
expected from this one returning the wrong thing?

> > One bright spot is that this does seem to suggest a way to implement the
> > recommendation I made in the -patches thread: if we can't support the
> > encoding (codepage) used by the locale seen by initdb, we could try
> > stripping the codepage indicator (if any) and plastering on .65001
> > to get a UTF8-compatible locale name.  That'd only work on Windows,
> > but that seems the platform where we're most likely to see unsupportable
> > default encodings.
>
> Um, yes, that should work - assuming the encoding is set to UTF8.  We
> can't do that for any other encoding, of course.

Looking at that, we don't actually need to put that at the end of the
locale name - all locale names will work with UTF8, even one specifying
1252.  Attached patch seems to work for me for that part.  It still
doesn't touch lc_ctype_is_c().

//Magnus
On Mon, Oct 15, 2007 at 01:26:00PM +0200, Magnus Hagander wrote:
> On Mon, Oct 15, 2007 at 11:09:54AM +0200, Magnus Hagander wrote:
> > On Sat, Oct 06, 2007 at 01:53:31PM -0400, Tom Lane wrote:
> > > I am thinking that Dave's discovery explains some previously unsolved
> > > bug reports, such as
> > > http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php
> > > If Windows returns LC_CTYPE=C in a situation like this, then
> > > the various single-byte-charset optimization paths that are enabled by
> > > lc_ctype_is_c() would be mistakenly used, leading to misbehavior in
> > > upper()/lower() and other places.  ISTM we had better hack
> > > lc_ctype_is_c() so that on Windows (only), if the database encoding
> > > is UTF-8 then it returns FALSE regardless of what setlocale says.
> >
> > Yes, I think we need a change to that routine.
> >
> > But what about the case when we actually *have* locale=C and
> > encoding=UTF8?  We need to care for that one somehow.  Perhaps we should
> > look at LC_COLLATE instead (again, on Windows only; possibly even only in
> > the windows + locale_returns_C + encoding=UTF8 case, to distinguish these
> > two)?
>
> Hmm.  Looking more at that, might there be another problem?  Looking at
> WriteControlFile(), it writes out what setlocale(LC_CTYPE) returns, which
> will then be "C" - even if the database isn't in C.
>
> But I don't really know when that code is called, or if I'm just looking
> at things wrong.  Just starting up and shutting down the database leaves
> it at Swedish_Sweden.1252, not C.  (1252 is still the wrong encoding
> specifier, but it'll work anyway since we convert to UTF16.)

Gah, got that backwards.  Of course it does, because it only returns "C"
if we set it to Swedish_Sweden.65001, and we don't *do* that with the
patch I sent in earlier.  We set it to Swedish_Sweden, which is a
perfectly valid LC_CTYPE.

And given that, do we even need to special-case lc_ctype_is_c() at all, if
we never pass in a .65001 locale (which we don't, because it fails)?

//Magnus
Magnus Hagander <magnus@hagander.net> writes:
>>>> On Sat, Oct 06, 2007 at 01:53:31PM -0400, Tom Lane wrote:
>>>> I am thinking that Dave's discovery explains some previously unsolved
>>>> bug reports, such as
>>>> http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php

> ...
> And given that, do we even need to special-case lc_ctype_is_c() at all, if
> we never pass in a .65001 locale (which we don't, because it fails)?

Hmm.  If it doesn't need a special case, then we still lack an explanation
for the aforementioned bug report.

			regards, tom lane
Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>>>>> On Sat, Oct 06, 2007 at 01:53:31PM -0400, Tom Lane wrote:
>>>>> I am thinking that Dave's discovery explains some previously unsolved
>>>>> bug reports, such as
>>>>> http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php
>> ...
>> And given that, do we even need to special-case lc_ctype_is_c() at all, if
>> we never pass in a .65001 locale (which we don't, because it fails)?
>
> Hmm.  If it doesn't need a special case, then we still lack an
> explanation for the aforementioned bug report.

From what I can tell that report doesn't tell us very much - we don't know
the server encoding, we don't know the server locale, we don't even know
the client encoding.  So I don't think we know anywhere *near* enough to
say it's related to this.

//Magnus
Magnus Hagander <magnus@hagander.net> writes:
> Tom Lane wrote:
>> Hmm.  If it doesn't need a special case, then we still lack an
>> explanation for the aforementioned bug report.

> From what I can tell that report doesn't tell us very much - we don't
> know the server encoding, we don't know the server locale, we don't even
> know the client encoding.  So I don't think we know anywhere *near*
> enough to say it's related to this.

In the followup we found out that he was using UTF-8 encoding:
http://archives.postgresql.org/pgsql-bugs/2007-05/msg00264.php
So while that report certainly left a great deal to be desired in terms
of precision, my gut tells me it's related.  Has anyone tried to
reproduce that behavior by initdb'ing 8.2 in a suitable UTF-8-using
Windows locale?

			regards, tom lane
Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> Tom Lane wrote:
>>> Hmm.  If it doesn't need a special case, then we still lack an
>>> explanation for the aforementioned bug report.
>
>> From what I can tell that report doesn't tell us very much - we don't
>> know the server encoding, we don't know the server locale, we don't even
>> know the client encoding.  So I don't think we know anywhere *near*
>> enough to say it's related to this.
>
> In the followup we found out that he was using UTF-8 encoding:
> http://archives.postgresql.org/pgsql-bugs/2007-05/msg00264.php
> So while that report certainly left a great deal to be desired in terms
> of precision, my gut tells me it's related.  Has anyone tried to
> reproduce that behavior by initdb'ing 8.2 in a suitable UTF-8-using
> Windows locale?

It doesn't tell us whether it's the client or the server that's in UTF8,
and it doesn't tell us about the locale.

Euler Taveira de Oliveira's response says he can't reproduce it.  I
haven't tried myself, and that webpage really doesn't tell us what the
character is.  If someone can comment on that, I can try to repro it on my
systems.

//Magnus
On Mon, Oct 15, 2007 at 07:44:00PM +0200, Magnus Hagander wrote:
> Tom Lane wrote:
> > Magnus Hagander <magnus@hagander.net> writes:
> >> Tom Lane wrote:
> >>> Hmm.  If it doesn't need a special case, then we still lack an
> >>> explanation for the aforementioned bug report.
> >
> >> From what I can tell that report doesn't tell us very much - we don't
> >> know the server encoding, we don't know the server locale, we don't even
> >> know the client encoding.  So I don't think we know anywhere *near*
> >> enough to say it's related to this.
> >
> > In the followup we found out that he was using UTF-8 encoding:
> > http://archives.postgresql.org/pgsql-bugs/2007-05/msg00264.php
> > So while that report certainly left a great deal to be desired in terms
> > of precision, my gut tells me it's related.  Has anyone tried to
> > reproduce that behavior by initdb'ing 8.2 in a suitable UTF-8-using
> > Windows locale?
>
> It doesn't tell us whether it's the client or the server that's in UTF8,
> and it doesn't tell us about the locale.
>
> Euler Taveira de Oliveira's response says he can't reproduce it.  I
> haven't tried myself, and that webpage really doesn't tell us what the
> character is.  If someone can comment on that, I can try to repro it on
> my systems.

Got some help on IRC to identify the characters as ç and Ç.

I can confirm that both work perfectly fine with UTF-8 and locale
Swedish_Sweden.1252.  They sort correctly, and they work with both upper()
and lower() correctly.

This test is with 8.3-HEAD and the patch to allow UTF-8.

This leads me to believe that something is wrong with the OP's system.
Most likely it's just the client that's in UTF8 mode, and the server is
SQL_ASCII.

//Magnus
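[For anyone who wants to see the UTF-8 -> UTF-16 -> UTF-8 round trip that
Dave mentioned earlier in isolation, here is a stand-alone Windows sketch.
It uses the Win32 CharUpperW() call purely for illustration; it is not the
server's actual code path.]

#include <windows.h>
#include <stdio.h>

int
main(void)
{
    const char *in = "\xc3\xa7";        /* UTF-8 for U+00E7, 'ç' */
    wchar_t     wbuf[8];
    char        out[8];

    /* UTF-8 -> UTF-16, uppercase in UTF-16, then back to UTF-8 */
    MultiByteToWideChar(CP_UTF8, 0, in, -1, wbuf, 8);
    CharUpperW(wbuf);
    WideCharToMultiByte(CP_UTF8, 0, wbuf, -1, out, 8, NULL, NULL);
    printf("%s\n", out);                /* expect 'Ç' (U+00C7) */
    return 0;
}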
Magnus Hagander wrote:

> Got some help on IRC to identify the characters as ç and Ç.
>
Exactly.

> I can confirm that both work perfectly fine with UTF-8 and locale
> Swedish_Sweden.1252.  They sort correctly, and they work with both upper()
> and lower() correctly.
>
I don't remember what the locale is.  I'll check it.

> This test is with 8.3-HEAD and the patch to allow UTF-8.
>
I tested with 8.2.4 and my encoding is LATIN1, IIRC.  I didn't try UTF-8.
I'll give it a try when I have my dev environment.

--
Euler Taveira de Oliveira
http://www.timbira.com/