Thread: locales and encodings on Windows
I would like to bring to your attention a problem regarding locale
support on Windows. The description below uses UNICODE/UTF8, but the
issue isn't limited to just this encoding.

Because Postgres relies on the operating system for some string-related
functions, the OS needs to support the same encoding as the one that is
used as the database encoding. Unfortunately, Windows does not support
some encodings that are available as server-side encodings for PG.

Here is a short example in case the previous paragraph doesn't make much
sense: with a UNICODE database (actually UTF8) you need to use a
compatible locale when running initdb; in my case that's "sl_SI.utf8"
(on Linux) or "Slovenian_Slovenia.65001" (on Windows).

65001 is the Windows codepage number for UTF-8, except it's not really a
valid codepage. The document at
http://www.sharmahd.com/tm/codepages.html states that: "65000 (UTF-7)
and 65001 (UTF-8) are pseudo codepages. There are no corresponding NLS
files. The code page IDs can only be used with WideCharToMultiByte( )
and MultiByteToWideChar( ) API calls."

This means that UPPER(), LOWER() and ORDER BY do not work correctly for
Unicode databases. Currently it's not even possible to run initdb with
a locale which uses the 65001 encoding. A small change to initdb enabled
me to set LC_COLLATE to Slovenian_Slovenia.65001, but the sort order was
still badly messed up, which makes sense considering the above quote.

After some checking I came up with this list of encodings which are
supported by PG, but not mentioned anywhere as supported by Windows:

UTF8
EUC_CN
EUC_TW
LATIN6 (ISO 8859-10/ECMA 144)
LATIN7 (ISO 8859-13)
LATIN8 (ISO 8859-14)
LATIN10 (ISO 8859-16/ASRO SR 14111)

Is there a solution for this, other than marking these encodings as not
available on Windows?

Regards,
Aleksander
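To make the "badly messed up" sort order concrete, here is a minimal sketch of what happens when UTF-8 strings are compared without a usable locale. It is not PostgreSQL code. It assumes the comparison effectively falls back to raw byte order, and it uses a hand-written Slovenian alphabet table in place of a real sl_SI collation. The names `SLOVENIAN_ALPHABET`, `RANK` and `slovenian_key` are made up for this illustration.

```python
# Illustration only: byte-order sorting of UTF-8 versus Slovenian alphabet order.
# SLOVENIAN_ALPHABET and slovenian_key are hypothetical names for this sketch;
# a real collation handles far more (case, digits, punctuation, tie-breaking).

SLOVENIAN_ALPHABET = "abcčdefghijklmnoprsštuvzž"
RANK = {letter: i for i, letter in enumerate(SLOVENIAN_ALPHABET)}


def slovenian_key(word):
    """Map each letter to its position in the Slovenian alphabet."""
    # Letters outside the table sort after all known letters.
    return [RANK.get(ch, len(RANK)) for ch in word.lower()]


words = ["zima", "čaj", "cesta", "dan", "šola", "sonce", "žaba"]

# What a byte-wise comparison of the UTF-8 data yields.
byte_order = [b.decode("utf-8") for b in sorted(w.encode("utf-8") for w in words)]

# What a Slovenian reader expects.
alphabet_order = sorted(words, key=slovenian_key)

print(byte_order)      # ['cesta', 'dan', 'sonce', 'zima', 'čaj', 'šola', 'žaba']
print(alphabet_order)  # ['cesta', 'čaj', 'dan', 'sonce', 'šola', 'zima', 'žaba']
```

Byte order pushes every word starting with č, š or ž behind "zima", because those letters are encoded as multi-byte sequences with high byte values. A working Slovenian collation places each of them right after c, s and z respectively.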
Come on, people. This was the second time I reported this bug and also
the second time nobody responded to my report. :-(

If it is indeed not possible to initdb with a utf8 (65001) locale, then
this will cause a flood of bug reports once a large number of people
start using PG on Windows. Can somebody try and confirm this problem?

Simply try running initdb with a --locale value of german_germany.65001,
spanish_spain.65001, french_france.65001 or any other locale you think
should be supported by your system. You will need to do this from the
command line, not from the installer. Does initdb accept this value or
does it replace it with your current system locale?

Unless somebody can come up with a solution, my suggestion for a
work-around would be to remove unsupported encodings from the installer
or at least warn users that their database will not be fully functional
if they happen to choose one of the unsupported encodings.

Any comments?

Last October there was a discussion on pgsql-hackers about writing
locale support for PG, so it wouldn't depend on the system for locale
functionality any more. Is anyone still working on that?

Regards,
Aleksander
> Come on, people. This was the second time I reported this bug
> and also the second time nobody responded to my report. :-(

'fraid I know very little about this stuff, so I can't really comment on
the main issue. Was hoping someone else would pick it up...

> If it is indeed not possible to initdb with a utf8 (65001)
> locale, then this will cause a flood of bug reports once a
> large number of people start using PG on Windows. Can
> somebody try and confirm this problem?
> Simply try running initdb with a --locale value of
> german_germany.65001, spanish_spain.65001,
> french_france.65001 or any other locale you think should be
> supported by your system. You will need to do this from the
> command line, not from the installer. Does initdb accept this
> value or does it replace it with your current system locale?
>
> Unless somebody can come up with a solution, my suggestion
> for a work-around would be to remove unsupported encodings
> from the installer or at least warn users that their database
> will not be fully functional if they happen to choose one of
> the unsupported encodings.

Yeah, that sounds like what we'll have to do if nobody can fix this
completely. Do you know enough to say exactly which locale/encoding
combinations have to be removed from the installer?

Bruce - we probably need an open item on the backend side of this. If
not, then we need at least someone to say we can't fix this for 8.0.
Removing it from the installer is just a workaround...

> Last October there was a discussion on pgsql-hackers about
> writing locale support for PG, so it wouldn't depend on the
> system for locale functionality any more. Is anyone still
> working on that?

I have no idea, but I'm certain that if someone is, this is definitely
not going to happen for 8.0.

//Magnus
Added to open items list:

	o  Disallow encodings like UTF8 which PostgreSQL supports
	   but the operating system does not

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Magnus Hagander wrote:
> Do you know enough to say exactly which locale/encoding
> combinations have to be removed from the installer?

The encodings are:

UTF8
EUC_CN
EUC_TW
LATIN6 (ISO 8859-10/ECMA 144)
LATIN7 (ISO 8859-13)
LATIN8 (ISO 8859-14)
LATIN10 (ISO 8859-16/ASRO SR 14111)

While you can still create databases using these encodings, it's not
possible to initdb with a locale that uses the same encoding. This means
ORDER BY, UPPER() and similar will produce wrong results.

I guess I'll resubmit my installer patch for listing locales supported
by the system, this time without the encodings listed above. That way
most users won't see unsupported encodings, while people who know what
they're doing can still reach them by using
CREATE DATABASE newdb ENCODING 'encoding'.

>> Last October there was a discussion on pgsql-hackers about
>> writing locale support for PG, so it wouldn't depend on the
>> system for locale functionality any more. Is anyone still
>> working on that?
>
> I have no idea, but I'm certain that if someone is, this is definitely
> not going to happen for 8.0.

I know this feature can't make it into 8.0; but blaming Windows for more
than one release cycle might not look very good. :-(

Regards,
Aleksander
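The installer-side workaround described in this message amounts to filtering the list of encodings before showing it to the user. A rough sketch of that idea follows. It is not the actual installer patch; the function name `installer_encodings` and the sample input list are invented for illustration, and only the set of excluded encodings comes from the message above.

```python
# Hypothetical sketch of the installer-side filter; not the actual installer patch.
# The excluded names are the ones listed in the message above.
UNSUPPORTED_ON_WINDOWS = {
    "UTF8", "EUC_CN", "EUC_TW", "LATIN6", "LATIN7", "LATIN8", "LATIN10",
}


def installer_encodings(server_encodings):
    """Return the encodings the installer should offer, preserving order."""
    return [enc for enc in server_encodings
            if enc.upper() not in UNSUPPORTED_ON_WINDOWS]


# A made-up sample of server-side encodings, for illustration only.
offered = installer_encodings(
    ["SQL_ASCII", "LATIN1", "LATIN2", "UTF8", "EUC_CN", "WIN1250"]
)
print(offered)  # ['SQL_ASCII', 'LATIN1', 'LATIN2', 'WIN1250']
```

Filtering only hides the choices in the installer. As noted above, the encodings stay reachable through CREATE DATABASE for users who accept the limitations.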
Aleksander Kmetec wrote on 11.11.2004 21:05:
>> I have no idea, but I'm certain that if someone is, this is definitely
>> not going to happen for 8.0.
>
> I know this feature can't make it into 8.0; but blaming Windows for more
> than one release cycle might not look very good. :-(

I'm not really experienced with the whole locale/character set topic,
but what I'm wondering about (and I'm sure others will as well) is why
other databases (such as Firebird or Oracle) support UTF8/Unicode on the
Windows platform but PostgreSQL does not.

Thomas
> I'm not really experienced with the whole locale/character
> set topic, but what I'm wondering about (and I'm sure others
> will as well) is why other databases (such as Firebird or
> Oracle) support UTF8/Unicode on the Windows platform but
> PostgreSQL does not.

PostgreSQL uses the OS functions to do locale handling. The other
databases usually implement their own (or use a library that does it,
but they do not rely on the OS).

//Magnus
Magnus Hagander wrote on 22.12.2004 13:14:
> [about PG not supporting Unicode on Windows]
>
> PostgreSQL uses the OS functions to do locale handling. The other
> databases usually implement their own (or use a library that does it,
> but they do not rely on the OS).

Being a developer myself I can understand this reason, but in the PG
context I'm a user, and not having Unicode support is in my eyes a very
big deficiency which gives the whole Win32 port a semi-professional
"touch".

Unicode is something very important and I do hope this will be solved in
one of the next releases.

Cheers
Thomas

P.S.: don't get me wrong: the Win32 port is *very* much appreciated.
Thomas Kellerer <spam_eater@gmx.net> writes:
> Being a developer myself I can understand this reason, but in the PG
> context I'm a user, and not having Unicode support is in my eyes a very
> big deficiency which gives the whole Win32 port a semi-professional
> "touch".

> Unicode is something very important and I do hope this will be solved in
> one of the next releases.

[ shrug... ]  It can't be too important to Windows users, since their
platform doesn't support it.

			regards, tom lane
Tom Lane wrote:
> [ shrug... ]  It can't be too important to Windows users, since their
> platform doesn't support it.

And what solution options do we have? Is bundling our own Unicode
library something we even want to consider? I would think not.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian wrote:
> And what solution options do we have? Is bundling our own Unicode
> library something we even want to consider? I would think not.

If the Win32 Unicode implementation is buggy, can we work around the
bugs in our code?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
> If the Win32 Unicode implementation is buggy, can we work
> around the bugs in our code?

The implementation is not buggy.
The implementation of strcoll() etc. *does not exist* for UTF-8.
There is a perfectly working Unicode system on Windows - it has been
there since Windows NT 3.1. *Every* API in Windows is Unicode
internally. With Unicode in this case, MS means UTF-16.
How do other programs do it? They convert their strings to UTF-16 and
use the Unicode functions in the OS. UTF-8 support only exists in the
two functions used to convert to/from UTF-16.

That's at least how I understand it. I'm not a locale/encoding expert
though, so I could be wrong :)

Perhaps an emulation layer could be written for port/win32. I can't
really say, because I don't know these things well enough (on any
platform).

//Magnus
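The "convert, use the Unicode functions, convert back" pattern described in this message can be sketched in a few lines. This is only an illustration of the idea, not a proposal for port/win32. Python's `str` type stands in for the UTF-16 wide string, `decode`/`encode` stand in for MultiByteToWideChar()/WideCharToMultiByte(), and the helper names `utf8_upper` and `utf8_lower` are hypothetical.

```python
# Sketch of the "convert, operate, convert back" pattern described above.
# utf8_upper and utf8_lower are hypothetical helper names, not PostgreSQL functions.


def utf8_upper(data: bytes) -> bytes:
    """Upper-case UTF-8 encoded text via an intermediate Unicode string."""
    return data.decode("utf-8").upper().encode("utf-8")


def utf8_lower(data: bytes) -> bytes:
    """Lower-case UTF-8 encoded text via an intermediate Unicode string."""
    return data.decode("utf-8").lower().encode("utf-8")


raw = "čaša".encode("utf-8")

# Byte-level case mapping only touches ASCII letters, so č and š are left alone.
naive = raw.upper().decode("utf-8")

# Going through a Unicode string maps the accented letters as well.
converted = utf8_upper(raw).decode("utf-8")

print(naive)      # čAšA
print(converted)  # ČAŠA
```

The same shape would apply to comparison: convert both operands to the wide form, call the wide-character comparison routine, and return its result. The cost is an extra conversion and buffer per call, which is the main design trade-off of such an emulation layer.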
Magnus Hagander wrote:
> The implementation is not buggy.
> The implementation of strcoll() etc. *does not exist* for UTF-8.
> There is a perfectly working Unicode system on Windows - it has been
> there since Windows NT 3.1. *Every* API in Windows is Unicode
> internally. With Unicode in this case, MS means UTF-16.
> How do other programs do it? They convert their strings to UTF-16 and
> use the Unicode functions in the OS. UTF-8 support only exists in the
> two functions used to convert to/from UTF-16.

In general I agree. Most programs won't use UTF-8 at all, but will work
with wchar_t (i.e. UTF-16 or UTF-32) since coding is easier, and will
convert to UTF-8 on interfaces only. Additionally, storing UTF-8 seems
uncommon to me too; this is usually done using NVARCHAR.

> That's at least how I understand it. I'm not a locale/encoding expert
> though, so I could be wrong :)
>
> Perhaps an emulation layer could be written for port/win32. I can't
> really say, because I don't know these things well enough (on any
> platform).

Shouldn't be too complicated.

Regards,
Andreas