Thread: locales and encodings on Windows

locales and encodings on Windows

From

Aleksander Kmetec

Date:

06 November 2004, 04:26:19

I would like to bring to your attention a problem regarding locale
support on Windows. The description below uses UNICODE/UTF8, but the
issue isn't limited to just this encoding.

Because Postgres relies on the operating system for some string related
functions, the OS needs to support the same encoding as the one that is
used as the database encoding. Unfortunately, Windows does not support
some encodings that are available as server-side encodings for PG.

Here is a short example in case the previous paragraph doesn't make much
sense: with a UNICODE database (actually UTF8) you need to use a
compatible locale when running initdb; in my case that's "sl_SI.utf8"
(on Linux) or "Slovenian_Slovenia.65001" (on Windows).

65001 is the Windows codepage number for UTF-8, except it's not really a
valid codepage. The document at
http://www.sharmahd.com/tm/codepages.html states that: "65000 (UTF-7)
and 65001 (UTF-8) are pseudo codepages. There are no corresponding NLS
files. The code page IDs can only be used with WideCharToMultiByte()
and MultiByteToWideChar() API calls."
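
To make the naming scheme concrete, here is a minimal Python sketch (not
taken from PostgreSQL or initdb) that splits a Windows-style locale name
into its parts and flags the two pseudo codepages named in the quote. The
helper names are invented for illustration, and "Slovenian_Slovenia.1250"
is only a sample name for comparison.

```python
from typing import NamedTuple, Optional

# Code page IDs described as "pseudo codepages" in the quote above.
PSEUDO_CODEPAGES = {65000: "UTF-7", 65001: "UTF-8"}


class WindowsLocale(NamedTuple):
    language: str
    territory: Optional[str]
    codepage: Optional[int]


def parse_windows_locale(name: str) -> WindowsLocale:
    """Split 'Language_Territory.codepage' into its parts (hypothetical helper)."""
    base, _, cp = name.partition(".")
    language, _, territory = base.partition("_")
    codepage = int(cp) if cp.isdigit() else None
    return WindowsLocale(language, territory or None, codepage)


def uses_pseudo_codepage(name: str) -> bool:
    """True if the locale name asks for one of the pseudo codepages."""
    return parse_windows_locale(name).codepage in PSEUDO_CODEPAGES


if __name__ == "__main__":
    for name in ("Slovenian_Slovenia.65001", "Slovenian_Slovenia.1250"):
        print(name, "->", parse_windows_locale(name), uses_pseudo_codepage(name))
```

A check like this only looks at the name; it says nothing about whether
the C runtime will actually accept the locale.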

This means that UPPER(), LOWER() and ORDER BY do not work correctly for
Unicode databases. Currently it's not even possible to run initdb with
a locale which uses the 65001 encoding. A small change to initdb enabled
me to set LC_COLLATE to Slovenian_Slovenia.65001, but the sort order was
still badly messed up, which makes sense considering the above quote.
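
For anyone who wants to probe this without touching initdb, a rough
sketch in Python follows. It asks the C runtime whether it accepts a
locale name for LC_COLLATE and then restores the previous setting. The
function name is made up, and the result depends on the platform and C
runtime version, so treat the output as an observation rather than proof.

```python
import locale


def locale_is_accepted(name: str, category: int = locale.LC_COLLATE) -> bool:
    """Return True if the C runtime accepts `name` for the given category."""
    saved = locale.setlocale(category)  # query the current setting only
    try:
        locale.setlocale(category, name)
        return True
    except locale.Error:
        return False
    finally:
        locale.setlocale(category, saved)  # always restore the previous locale


if __name__ == "__main__":
    for candidate in ("C", "sl_SI.utf8", "Slovenian_Slovenia.65001"):
        print(candidate, "->", locale_is_accepted(candidate))
```

Note that acceptance alone is not enough: as described above, a locale
name can be accepted and the resulting sort order can still be wrong.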

After some checking I came up with this list of encodings which are
supported by PG, but not mentioned anywhere as supported by Windows:
UTF8
EUC_CN
EUC_TW
LATIN6 (ISO 8859-10/ECMA 144)
LATIN7 (ISO 8859-13)
LATIN8 (ISO 8859-14)
LATIN10 (ISO 8859-16/ASRO SR 14111)

Is there a solution for this, other than marking these encodings as not
available on Windows?

Regards,
Aleksander

Re: locales and encodings on Windows

From

Aleksander Kmetec

Date:

11 November 2004, 09:37:03

Come on, people. This was the second time I reported this bug and also
the second time nobody responded to my report. :-(

If it is indeed not possible to initdb with a utf8 (65001) locale, then
this will cause a flood of bug reports once a large number of people
start using PG on Windows. Can somebody try and confirm this problem?
Simply try running initdb with a --locale value of german_germany.65001,
spanish_spain.65001, french_france.65001 or any other locale you think
should be supported by your system. You will need to do this from the
command line, not from the installer. Does initdb accept this value or
does it replace it with your current system locale?

Unless somebody can come up with a solution, my suggestion for a
work-around would be to remove unsupported encodings from the installer
or at least warn users that their database will not be fully functional
if they happen to choose one of the unsupported encodings.

Any comments?

Last October there was a discussion on pgsql-hackers about writing
locale support for PG, so it wouldn't depend on the system for locale
functionality any more. Is anyone still working on that?

Regards,
Aleksander

Re: locales and encodings on Windows

From

"Magnus Hagander"

Date:

11 November 2004, 11:46:31

> Come on, people. This was the second time I reported this bug
> and also the second time nobody responded to my report. :-(

'fraid I know very little about this stuff, so I can't really comment on
the main issue. Was hoping someone else would pick it up...


> If it is indeed not possible to initdb with a utf8 (65001)
> locale, then this will cause a flood of bug reports once a
> large number of people start using PG on Windows. Can
> somebody try and confirm this problem?
> Simply try running initdb with a --locale value of
> german_germany.65001, spanish_spain.65001,
> french_france.65001 or any other locale you think should be
> supported by your system. You will need to do this from the
> command line, not from the installer. Does initdb accept this
> value or does it replace it with your current system locale?
>
> Unless somebody can come up with a solution, my suggestion
> for a work-around would be to remove unsupported encodings
> from the installer or at least warn users that their database
> will not be fully functional if they happen to choose one of
> the unsupported encodings.

Yeah, that sounds like what we'll have to do if nobody can fix this
completely. Do you know enough to say exactly which locale/encoding
combinations have to be removed from the installer?

Bruce - we probably need an open item on the backend side of this. If
not, then we need at least someone to say we can't fix this for 8.0.
Removing it from the installer is just a workaround...


> Last October there was a discussion on pgsql-hackers about
> writing locale support for PG, so it wouldn't depend on the
> system for locale functionality any more. Is anyone still
> working on that?

I have no idea, but I'm certain that even if someone is, this is
definitely not going to happen for 8.0.

//Magnus

Re: locales and encodings on Windows

From

Bruce Momjian

Date:

11 November 2004, 18:05:22

Added to open items list:

        o Disallow encodings like UTF8 which PostgreSQL supports
          but the operating system does not


--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: locales and encodings on Windows

From

Aleksander Kmetec

Date:

11 November 2004, 23:05:27

Magnus Hagander wrote:
> Do you know enough to say exactly which locale/encoding
> combinations have to be removed from the installer?

The encodings are:
UTF8
EUC_CN
EUC_TW
LATIN6 (ISO 8859-10/ECMA 144)
LATIN7 (ISO 8859-13)
LATIN8 (ISO 8859-14)
LATIN10 (ISO 8859-16/ASRO SR 14111)

While you can still create databases using these encodings, it's not
possible to initdb with a locale that uses the same encoding. This means
ORDER BY, UPPER() and similar will produce wrong results.
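
To show the kind of wrong result meant here, the following Python sketch
compares byte-oriented handling of UTF-8 text with Unicode-aware
handling. It only illustrates the failure mode; it is not how the
PostgreSQL backend implements UPPER() or ORDER BY, and the sample words
are my own.

```python
words = ["dan", "cesta", "čaj"]

# What a byte-oriented routine sees: UTF-8 bytes, where "č" is the two
# bytes C4 8D rather than a single character.
utf8_words = [w.encode("utf-8") for w in words]

# ASCII-only upper-casing leaves the non-ASCII bytes untouched.
bytewise_upper = [b.upper().decode("utf-8") for b in utf8_words]

# Unicode-aware upper-casing on decoded text handles "č" as well.
unicode_upper = [w.upper() for w in words]

# Sorting by raw bytes puts "č" after every ASCII letter.
bytewise_sorted = [b.decode("utf-8") for b in sorted(utf8_words)]

if __name__ == "__main__":
    print(bytewise_upper)   # ['DAN', 'CESTA', 'čAJ']
    print(unicode_upper)    # ['DAN', 'CESTA', 'ČAJ']
    print(bytewise_sorted)  # ['cesta', 'dan', 'čaj']
```

In Slovenian, "č" sorts between "c" and "d", so the expected order would
be cesta, čaj, dan.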

I guess I'll resubmit my installer patch for listing locales supported
by the system, this time without the encodings listed above. That way
most users won't see unsupported encodings, while people who know what
they're doing can still reach them by using CREATE DATABASE newdb
ENCODING 'encoding'.
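
The filtering idea can be sketched in a few lines of Python. This is not
the installer patch itself (the installer is not written in Python); the
names are hypothetical, the exclusion list is simply the one given in
this message, and the input list in the usage example is just a sample.

```python
# Encodings reported in this thread as lacking a matching Windows locale.
# Taken from the message above, not verified independently.
NO_MATCHING_WINDOWS_LOCALE = frozenset(
    {"UTF8", "EUC_CN", "EUC_TW", "LATIN6", "LATIN7", "LATIN8", "LATIN10"}
)


def installer_encodings(server_encodings):
    """Return the encodings the installer would offer, keeping input order."""
    return [
        enc for enc in server_encodings
        if enc.upper() not in NO_MATCHING_WINDOWS_LOCALE
    ]


if __name__ == "__main__":
    sample = ["SQL_ASCII", "LATIN1", "LATIN2", "UTF8", "EUC_CN"]
    print(installer_encodings(sample))  # ['SQL_ASCII', 'LATIN1', 'LATIN2']
```

Keeping the exclusion list in one place would make it easy to shrink if
any of these encodings later turns out to be usable.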

>>Last October there was a discussion on pgsql-hackers about
>>writing locale support for PG, so it wouldn't depend on the
>>system for locale functionality any more. Is anyone still
>>working on that?
>
> I have no idea, but I'm certain that even if someone is, this is
> definitely not going to happen for 8.0.

I know this feature can't make it into 8.0; but blaming Windows for more
than one release cycle might not look very good. :-(

Regards,
Aleksander

Re: locales and encodings on Windows

From

Thomas Kellerer

Date:

22 December 2004, 15:12:16

Aleksander Kmetec wrote on 11.11.2004 21:05:
>>
>> I have no idea, but I'm certain that even if someone is, this is
>> definitely not going to happen for 8.0.
>
>
> I know this feature can't make it into 8.0; but blaming Windows for more
> than one release cycle might not look very good. :-(
>

I'm not really experienced with the whole locale/character set topic, but
what I'm wondering (and I'm sure others will as well) is why other
databases (such as Firebird or Oracle) support UTF8/Unicode on the
Windows platform but PostgreSQL does not.

Thomas

Re: locales and encodings on Windows

From

"Magnus Hagander"

Date:

22 December 2004, 15:14:34

> I'm not really experienced with the whole locale/character
> set topic, but what I'm wondering (and I'm sure others
> will as well) is why other databases (such as Firebird or
> Oracle) support UTF8/Unicode on the Windows platform but
> PostgreSQL does not.

PostgreSQL uses the OS functions to do locale handling. The other
databases usually implement their own (or use a library that does it,
but they do not rely on the OS).

//Magnus

Re: locales and encodings on Windows

From

Thomas Kellerer

Date:

22 December 2004, 19:26:47

Magnus Hagander wrote on 22.12.2004 13:14:
> [about PG not supporting Unicode on Windows]
>
> PostgreSQL uses the OS functions to do locale handling. The other
> databases usually implement their own (or use a library that does it,
> but they do not rely on the OS).

Being a developer myself I can understand this reason, but in the PG
context I'm a user, and not having Unicode support is in my eyes a very
big deficiency which gives the whole Win32 port a semi-professional "touch".

Unicode is something very important and I do hope this will be solved in
one of the next releases.

Cheers
Thomas

P.S.: don't get me wrong: the Win32 port is *very* much appreciated.

Re: locales and encodings on Windows

From

Tom Lane

Date:

22 December 2004, 20:21:47

Thomas Kellerer <spam_eater@gmx.net> writes:
> Magnus Hagander wrote on 22.12.2004 13:14:
>>> [about PG not supporting Unicode on Windows]
>>
>> PostgreSQL uses the OS functions to do locale handling. The other
>> databases usually implement their own (or use a library that does it,
>> but they do not rely on the OS).

> Being a developer myself I can understand this reason, but in the PG
> context I'm a user, and not having Unicode support is in my eyes a very
> big deficiency which gives the whole Win32 port a semi-professional "touch".

> Unicode is something very important and I do hope this will be solved in
> one of the next releases.

[ shrug... ]  It can't be too important to Windows users, since their
platform doesn't support it.

            regards, tom lane

Re: locales and encodings on Windows

From

Bruce Momjian

Date:

22 December 2004, 20:25:41

Tom Lane wrote:
> [ shrug... ]  It can't be too important to Windows users, since their
> platform doesn't support it.

And what solution options do we have?  Is bundling our own Unicode
library something we even want to consider?  I would think not.

Re: locales and encodings on Windows

From

Bruce Momjian

Date:

22 December 2004, 20:28:03

Bruce Momjian wrote:
> And what solution options do we have?  Is bundling our own Unicode
> library something we even want to consider?  I would think not.

If the Win32 Unicode implementation is buggy, can we work around the bugs
in our code?

Re: locales and encodings on Windows

From

"Magnus Hagander"

Date:

22 December 2004, 20:40:24

> If the Win32 Unicode implementation is buggy, can we work
> around the bugs in our code?

The implementation is not buggy.
The implementation of strcoll() etc *does not exist* for UTF-8.
There is a perfectly working Unicode system on Windows - it has been
there since Windows NT 3.1. *Every* API in Windows is Unicode
internally. With Unicode in this case, MS means UTF-16.
How do other programs do it? They convert their strings to UTF-16 and use
the Unicode functions in the OS. UTF-8 support only exists in the two
functions used to convert to/from UTF-16.
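
The convert-then-compare approach described here can be sketched in
Python as a stand-in for the real thing. On Windows the conversion step
would presumably be MultiByteToWideChar() followed by a wide-character
comparison; the sketch below does not call those APIs. It decodes the
UTF-8 bytes first and lets Python's locale.strcoll do the comparison on
decoded text. Function names are invented for illustration.

```python
import locale


def utf8_to_utf16(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as UTF-16 (little-endian, no BOM)."""
    return data.decode("utf-8").encode("utf-16-le")


def compare_utf8(a: bytes, b: bytes) -> int:
    """Compare two UTF-8 byte strings by decoding first, then collating.

    The collation routine only ever sees decoded text, so it never has
    to understand UTF-8 itself.
    """
    return locale.strcoll(a.decode("utf-8"), b.decode("utf-8"))


if __name__ == "__main__":
    c_caron = "č".encode("utf-8")
    print(c_caron.hex(), "->", utf8_to_utf16(c_caron).hex())  # c48d -> 0d01
    print(compare_utf8(b"a", b"b") < 0)  # True
```

The ordering you get still depends on the LC_COLLATE setting in effect;
the sketch only shows where the conversion step would sit.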

That's at least how I understand it. I'm not a locale/encoding expert
though, so I could be wrong :)

Perhaps an emulation layer could be written for port/win32. I can't
really say, because I don't know these things well enough (on any
platform).

//Magnus

Re: locales and encodings on Windows

From

Andreas Pflug

Date:

22 December 2004, 21:35:58

Magnus Hagander wrote:
>
> The implementation is not buggy.
> The implementation of strcoll() etc *does not exist* for UTF-8.
> There is a perfectly working Unicode system on Windows - it has been
> there since Windows NT 3.1. *Every* API in Windows is Unicode
> internally. With Unicode in this case, MS means UTF-16.
> How do other programs do it? They convert their strings to UTF-16 and use
> the Unicode functions in the OS. UTF-8 support only exists in the two
> functions used to convert to/from UTF-16.

In general I agree. Most programs won't use UTF-8 at all, but will work
with wchar_t (i.e. UTF-16 or UTF-32) since coding is easier, and will
convert to UTF-8 on interfaces only. Additionally, storing UTF-8 seems
uncommon to me too; this is usually done using NVARCHAR.


> That's at least how I understand it. I'm not a locale/encoding expert
> though, so I could be wrong :)
>
> Perhaps an emulation layer could be written for port/win32. I can't
> really say, because I don't know these things well enough (on any
> platform).

Shouldn't be too complicated.

Regards,
Andreas