Thread: Locale + encoding combinations

Locale + encoding combinations

From

Dave Page

Date:

09 October 2007, 17:33:03

I'm working on some code for pgInstaller that will check that the locale
and encoding selected by the user are a valid combination.

The changes recently added to initdb (which highlighted the UTF-8 issue
on Windows that Tom posted about) appear to only allow the default
encoding for the locale to be selected. For example, for me that would be:

"English_United Kingdom.1252"

However, setlocale() will also accept other valid combinations on
Windows, which initdb will not, for example:

"English_United Kingdom.28591" (Latin1)

Is there any reason not to accept other combinations that setlocale() is
happy with?

Regards, Dave

Re: Locale + encoding combinations

From

Peter Eisentraut

Date:

09 October 2007, 18:27:14

Dave Page wrote:
> Is there any reason not to accept other combinations that setlocale()
> is happy with?

setlocale() sets the locale.  How does it "accept" a "combination"?

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: Locale + encoding combinations

From

Dave Page

Date:

09 October 2007, 18:39:01

Peter Eisentraut wrote:
> Dave Page wrote:
>> Is there any reason not to accept other combinations that setlocale()
>> is happy with?
> 
> setlocale() sets the locale.  How does it "accept" a "combination"?
> 

setlocale(LC_CTYPE, "English_United Kingdom.65001")

will return null (and not change anything) because it doesn't like the
combination of the locale and that encoding (UTF-8).

setlocale(LC_CTYPE, "English_United Kingdom.1252")

will return "English_United Kingdom.1252" and set the locale accordingly
because WIN1252 is a valid encoding for that locale. Similarly, LATIN1
and numerous other encodings are accepted in combination with that locale.
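
The accept/reject behaviour described above can be probed directly. The following is a minimal sketch in Python, whose `locale.setlocale` wraps the C library call; the helper name `try_locale` is hypothetical, and the Windows-style names in the demo loop are only meaningful on a Windows host, so results will differ by platform and OS version.

```python
import locale


def try_locale(name):
    """Return the LC_CTYPE string the C library reports for `name`,
    or None if the name is rejected.

    The previous LC_CTYPE setting is restored before returning.
    """
    previous = locale.setlocale(locale.LC_CTYPE)  # query only, no change
    try:
        return locale.setlocale(locale.LC_CTYPE, name)
    except locale.Error:
        # Python raises locale.Error where C setlocale() returns NULL.
        return None
    finally:
        locale.setlocale(locale.LC_CTYPE, previous)


if __name__ == "__main__":
    # Names taken from the thread, plus "C" as a portable baseline.
    for candidate in ("English_United Kingdom.1252",
                      "English_United Kingdom.65001",
                      "C"):
        print(candidate, "->", try_locale(candidate))
```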

Should initdb allow any combination that setlocale() accepts, or should
it *only* accept the default encoding for the specified locale?

/D

Re: Locale + encoding combinations

From

Peter Eisentraut

Date:

09 October 2007, 19:55:37

Dave Page wrote:
> setlocale(LC_CTYPE, "English_United Kingdom.65001")
>
> will return null (and not change anything) because it doesn't like
> the combination of the locale and that encoding (UTF-8).

The reason that that call fails is probably that the operating system 
does not provide such a locale.  But that's not what we are interested 
in.  We are interested in compatibility between *existing* operating 
system locales and *PostgreSQL* encoding names.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: Locale + encoding combinations

From

Dave Page

Date:

10 October 2007, 05:11:36

Peter Eisentraut wrote:
> Dave Page wrote:
>> setlocale(LC_CTYPE, "English_United Kingdom.65001")
>>
>> will return null (and not change anything) because it doesn't like
>> the combination of the locale and that encoding (UTF-8).
> 
> The reason that that call fails is probably that the operating system 
> does not provide such a locale.  

It doesn't - UTF-8/65001 is a pseudo codepage on Windows with no NLS
file defining collation rules etc. as we already discussed.

> But that's not what we are interested 
> in.  We are interested in compatibility between *existing* operating 
> system locales and *PostgreSQL* encoding names.

Yes.

Let me put my question another way.

Latin1 is a perfectly valid encoding for my locale English_United
Kingdom. It is accepted by setlocale for LC_ALL.

Why does initdb reject it? Why does it insist the encoding is not valid
for the locale?

/D

Re: Locale + encoding combinations

From

Peter Eisentraut

Date:

10 October 2007, 05:45:54

On Wednesday, 10 October 2007, Dave Page wrote:
> Latin1 is a perfectly valid encoding for my locale English_United
> Kingdom. It is accepted by setlocale for LC_ALL.
>
> Why does initdb reject it? Why does it insist the encoding is not valid
> for the locale?

Because initdb works with a finite list of known matches, and your particular 
combination might not be in that list -- yet.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: Locale + encoding combinations

From

Dave Page

Date:

10 October 2007, 05:51:48

Peter Eisentraut wrote:
> On Wednesday, 10 October 2007, Dave Page wrote:
>> Latin1 is a perfectly valid encoding for my locale English_United
>> Kingdom. It is accepted by setlocale for LC_ALL.
>>
>> Why does initdb reject it? Why does it insist the encoding is not valid
>> for the locale?
> 
> Because initdb works with a finite list of known matches, and your particular 
> combination might not be in that list -- yet.

So is it just a case of us generating a list of matches that may be
Windows specific, or is there more to it than that?

/D

Re: Locale + encoding combinations

From

Peter Eisentraut

Date:

10 October 2007, 06:47:24

On Wednesday, 10 October 2007, Dave Page wrote:
> So is it just a case of us generating a list of matches that may be
> Windows specific, or is there more to it than that?

You want to peruse src/port/chklocale.c.  There is already explicit Windows 
support in there, so maybe you just need to add on your particular cases.
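
The "finite list of known matches" idea can be pictured as a plain lookup table. This is only a sketch: the table and the function name `encoding_for_locale` are hypothetical, the pairs are simply the code page and encoding names mentioned in this thread, and none of it is the actual contents of src/port/chklocale.c.

```python
# Hypothetical table: Windows code page number -> PostgreSQL encoding name.
# Only pairs mentioned in this thread are listed; this is NOT the real
# list from src/port/chklocale.c.
CODEPAGE_TO_PG_ENCODING = {
    "1252": "WIN1252",
    "28591": "LATIN1",
    "28605": "LATIN9",
    "65001": "UTF8",
}


def encoding_for_locale(locale_name):
    """Return the encoding for 'Language_Country.codepage', or None if unknown."""
    _base, sep, codepage = locale_name.rpartition(".")
    if not sep:
        return None  # no code page suffix at all
    # A code page missing from the table is "not in the list yet".
    return CODEPAGE_TO_PG_ENCODING.get(codepage)
```

With such a table, a combination is rejected simply because nobody has added it, which matches the explanation given above.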

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: Locale + encoding combinations

From

Dave Page

Date:

10 October 2007, 08:08:11

Peter Eisentraut wrote:
> On Wednesday, 10 October 2007, Dave Page wrote:
>> So is it just a case of us generating a list of matches that may be
>> Windows specific, or is there more to it than that?
> 
> You want to peruse src/port/chklocale.c.  There is already explicit Windows 
> support in there, so maybe you just need to add on your particular cases.

Yup, found that - thanks. I'll look at updating that list.

/D

Re: Locale + encoding combinations

From

Dave Page

Date:

10 October 2007, 08:48:32

Dave Page wrote:
> Peter Eisentraut wrote:
>> On Wednesday, 10 October 2007, Dave Page wrote:
>>> So is it just a case of us generating a list of matches that may be
>>> Windows specific, or is there more to it than that?
>> You want to peruse src/port/chklocale.c.  There is already explicit Windows 
>> support in there, so maybe you just need to add on your particular cases.
> 
> Yup, found that - thanks. I'll look at updating that list.

OK so I added the appropriate entries (and posted the patch to
-patches), but my original question remains: why can I only select the
*default* encoding for the chosen locale, but not other ones that are
also valid according to setlocale? Is this a bug, or is there some
technical reason?

/D

Re: Locale + encoding combinations

From

Peter Eisentraut

Date:

10 October 2007, 09:18:21

On Wednesday, 10 October 2007, Dave Page wrote:
> my original question remains: why can I only select the
> *default* encoding for the chosen locale, but not other ones that are
> also valid according to setlocale? Is this a bug, or is there some
> technical reason?

One locale works only with one encoding.  There are no "default" or perhaps 
alternative encodings for one locale; there is only one.  The whole point of 
the exercise is to determine what the spelling of that one encoding is in 
PostgreSQL.

Perhaps you are confused about the naming.  These are all entirely separate 
locales:

en_GB.iso88591
en_GB.iso885915
en_GB.utf8

Someone was friendly enough to include the name of the encoding used by the 
locale into its name, but that doesn't mean that en_GB has three alternative 
encodings or something.

At least that's the model we have on POSIX platforms.
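
To make the POSIX model concrete, here is a small sketch that treats each full name as its own locale with exactly one encoding. The function name `split_posix_locale` is hypothetical, and `@modifier` suffixes are deliberately ignored to keep the example short.

```python
def split_posix_locale(name):
    """Split 'en_GB.utf8' into ('en_GB', 'utf8'); codeset is None if absent.

    Simplification: '@modifier' suffixes are not handled here.
    """
    base, sep, codeset = name.partition(".")
    return base, (codeset if sep else None)


# Three separate locales, each tied to exactly one encoding.
LOCALES = ["en_GB.iso88591", "en_GB.iso885915", "en_GB.utf8"]
ENCODING_OF = {name: split_posix_locale(name)[1] for name in LOCALES}

if __name__ == "__main__":
    for name, codeset in ENCODING_OF.items():
        print(name, "uses", codeset)
```

The dictionary has three keys rather than one key with three values, which is the distinction being drawn above.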

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: Locale + encoding combinations

From

Tom Lane

Date:

10 October 2007, 09:37:54

Dave Page <dpage@postgresql.org> writes:
> However, setlocale() will also accept other valid combinations on
> Windows, which initdb will not, for example:
> "English_United Kingdom.28591" (Latin1)
> Is there any reason not to accept other combinations that setlocale() is
> happy with?

Are you certain that that acceptance actually represents support?
Have you checked that it rejects combinations involving real code
pages (i.e., NOT 65001) that don't really work with the locale?

        regards, tom lane

Re: Locale + encoding combinations

From

Dave Page

Date:

10 October 2007, 09:50:11

Peter Eisentraut wrote:
> On Wednesday, 10 October 2007, Dave Page wrote:
>> my original question remains: why can I only select the
>> *default* encoding for the chosen locale, but not other ones that are
>> also valid according to setlocale? Is this a bug, or is there some
>> technical reason?
> 
> One locale works only with one encoding.  There are no "default" or perhaps 
> alternative encodings for one locale; there is only one.  The whole point of 
> the exercise is to determine what the spelling of that one encoding is in 
> PostgreSQL.
> 
> Perhaps you are confused about the naming.  These are all entirely separate 
> locales:
> 
> en_GB.iso88591
> en_GB.iso885915
> en_GB.utf8
> 
> Someone was friendly enough to include the name of the encoding used by the 
> locale into its name, but that doesn't mean that en_GB has three alternative 
> encodings or something.
> 
> At least that's the model we have on POSIX platforms.

OK, sorting out my terminology deficiencies has helped - thanks. The
problem seems to be:

initdb --locale "English_United Kingdom.28591"

works, but

initdb -E LATIN1 --locale "English_United Kingdom"

does not. That's good (albeit inconsistent); I know how to fix
pgInstaller now. What isn't so good is:

============
C:\pg>bin\initdb --locale "English_United Kingdom.99999" -D data
initdb: invalid locale name "English_United Kingdom.99999"
initdb: invalid locale name "English_United Kingdom.99999"
initdb: invalid locale name "English_United Kingdom.99999"
initdb: invalid locale name "English_United Kingdom.99999"
initdb: invalid locale name "English_United Kingdom.99999"
initdb: invalid locale name "English_United Kingdom.99999"
The files belonging to this database system will be owned by user "Dave".
This user must also own the server process.

The database cluster will be initialized with locale English_United Kingdom.1252.
The default database encoding has accordingly been set to WIN1252.
===========

Shouldn't that have failed?

Regards, Dave

Re: Locale + encoding combinations

From

Dave Page

Date:

10 October 2007, 09:58:15

Tom Lane wrote:
> Dave Page <dpage@postgresql.org> writes:
>> However, setlocale() will also accept other valid combinations on
>> Windows, which initdb will not, for example:
>> "English_United Kingdom.28591" (Latin1)
>> Is there any reason not to accept other combinations that setlocale() is
>> happy with?
> 
> Are you certain that that acceptance actually represents support?
> Have you checked that it rejects combinations involving real code
> pages (ie, NOT 65001) that don't really work with the locale?

It fails with ones that Microsoft have decided don't belong in my
language group and therefore aren't installed. It accepts all the others
I've tried, but then from the sample I've looked at, they all have
0-9a-zA-Z in them so I guess they're all capable of handling English.

Regards, Dave.

Re: Locale + encoding combinations

From

Tom Lane

Date:

10 October 2007, 10:47:06

Dave Page <dpage@postgresql.org> writes:
> Tom Lane wrote:
>> Are you certain that that acceptance actually represents support?
>> Have you checked that it rejects combinations involving real code
>> pages (ie, NOT 65001) that don't really work with the locale?

> It fails with ones that Microsoft have decided don't belong in my
> language group and therefore aren't installed. It accepts all the others
> I've tried, but then from the sample I've looked at, they all have
> 0-9a-zA-Z in them so I guess they're all capable of handling English.

That doesn't exactly fill me with confidence.  Maybe you need to make
some tests involving a non-English base locale?

        regards, tom lane

Re: Locale + encoding combinations

From

Dave Page

Date:

10 October 2007, 11:08:55

Tom Lane wrote:
> Dave Page <dpage@postgresql.org> writes:
>> Tom Lane wrote:
>>> Are you certain that that acceptance actually represents support?
>>> Have you checked that it rejects combinations involving real code
>>> pages (ie, NOT 65001) that don't really work with the locale?
> 
>> It fails with ones that Microsoft have decided don't belong in my
>> language group and therefore aren't installed. It accepts all the others
>> I've tried, but then from the sample I've looked at, they all have
>> 0-9a-zA-Z in them so I guess they're all capable of handling English.
> 
> That doesn't exactly fill me with confidence.  Maybe you need to make
> some tests involving a non-English base locale?

Hmm, I'm guessing these probably shouldn't work:

Dave@SNAKE:~$ setlc "Japanese_Japan.28605"
Japanese_Japan.28605
Dave@SNAKE:~$ setlc "Japanese_Japan.28595"
Japanese_Japan.28595
Dave@SNAKE:~$ setlc "Russian_Russia.1252"
Russian_Russia.1252
Dave@SNAKE:~$ setlc "Russian_Russia.28591"
Russian_Russia.28591

1252 == WIN1252
28591 == LATIN1
28605 == LATIN9
28595 == ISO8859-5 (Cyrillic)
28597 == ISO8859-7 (Greek)

In fact, it looks like it'll allow me to use anything that's installed,
regardless of whether they're likely to be compatible.  So much for
trusting setlocale() :-(

/D

Re: Locale + encoding combinations

From

Tom Lane

Date:

10 October 2007, 11:10:19

Dave Page <dpage@postgresql.org> writes:
> OK so I added the appropriate entries (and posted the patch to
> -patches), but my original question remains: why can I only select the
> *default* encoding for the chosen locale, but not other ones that are
> also valid according to setlocale? Is this a bug, or is there some
> technical reason?

Well, the chklocale code is designed around the assumption that there
*is* only one encoding for which a locale setting will work, with
C/POSIX being a special case.

I think we are talking a bit at cross-purposes here, because the Windows
equivalent to this notion seems to be "English_United Kingdom.1252"
whereas you seem to be defining locale as just "English_United Kingdom".
Does it not work the way you want if you make the installer pass locale
strings of the first form to initdb?

        regards, tom lane

Re: Locale + encoding combinations

From

Dave Page

Date:

10 October 2007, 11:16:23

Tom Lane wrote:
> Dave Page <dpage@postgresql.org> writes:
>> OK so I added the appropriate entries (and posted the patch to
>> -patches), but my original question remains: why can I only select the
>> *default* encoding for the chosen locale, but not other ones that are
>> also valid according to setlocale? Is this a bug, or is there some
>> technical reason?
> 
> Well, the chklocale code is designed around the assumption that there
> *is* only one encoding for which a locale setting will work, with
> C/POSIX being a special case.
> 
> I think we are talking a bit at cross-purposes here, because the Windows
> equivalent to this notion seems to be "English_United Kingdom.1252"
> whereas you seem to be defining locale as just "English_United Kingdom".
> Does it not work the way you want if you make the installer pass locale
> strings of the first form to initdb?

Yes, it seems it does (see my previous email to Peter):
http://archives.postgresql.org/pgsql-hackers/2007-10/msg00447.php

So I guess that's how I'll fix the installer. There is another issue
though as I mentioned in the post above - that it complains about an
invalid encoding specifier on the encoding name, then ignores it and
uses the default which seems wrong to me.
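
As a rough illustration of the installer-side fix, the code page can be folded into the locale name instead of being passed as a separate encoding switch. This is a sketch only: `build_initdb_args` is a hypothetical helper, not pgInstaller code, and it uses just the `--locale` and `-D` switches shown earlier in the thread.

```python
def build_initdb_args(base_locale, codepage, data_dir):
    """Build an initdb argument list with the code page inside the locale name.

    Hypothetical helper: the installer would pass the resulting list to
    whatever process-launching call it already uses.
    """
    full_locale = f"{base_locale}.{codepage}"
    return ["initdb", "--locale", full_locale, "-D", data_dir]


if __name__ == "__main__":
    print(build_initdb_args("English_United Kingdom", 28591, "data"))
```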

/D

Re: Locale + encoding combinations

From

Tom Lane

Date:

10 October 2007, 11:55:11

Dave Page <dpage@postgresql.org> writes:
> In fact, it looks like it'll allow me to use anything that's installed,
> regardless of whether they're likely to be compatible.  So much for
> trusting setlocale() :-(

Yech :-(.  Count on Microsloth to get this wrong.  Anyone have any ideas
on how to tell if a locale setting *really* works on Windows?

        regards, tom lane

Re: Locale + encoding combinations

From

Tom Lane

Date:

10 October 2007, 12:16:06

Dave Page <dpage@postgresql.org> writes:
> ... There is another issue
> though as I mentioned in the post above - that it complains about an
> invalid encoding specifier on the encoding name, then ignores it and
> uses the default which seems wrong to me.

Yeah, if you look at chklocale() in initdb.c this is clearly how it
works, but there's a comment /* should we exit here? */
so whoever wrote it wasn't all that convinced it was the right behavior.

Given that 8.3 is raising the stakes for having a correct locale
specification at initdb time, it seems right to me to error out if a
bogus locale switch is given, rather than whining and then substituting
the environment default.  Any objections?
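
The difference between the two behaviours can be sketched as follows. This is an illustration in Python, not the initdb code: `resolve_locale`, `InvalidLocaleError` and the injected `is_valid` callback are all hypothetical names, and the validity check is passed in so the example does not depend on any particular platform's setlocale().

```python
class InvalidLocaleError(ValueError):
    """Raised when a requested locale name is not accepted."""


def resolve_locale(requested, is_valid, environment_default, strict=True):
    """Pick the locale to use.

    strict=True  : error out on a bogus name (the behaviour proposed above).
    strict=False : complain, then fall back to the environment default
                   (the behaviour observed earlier in the thread).
    """
    if requested is None:
        return environment_default  # nothing requested, default is fine
    if is_valid(requested):
        return requested
    if strict:
        raise InvalidLocaleError(f'invalid locale name "{requested}"')
    print(f'invalid locale name "{requested}"')
    return environment_default
```

Injecting `is_valid` keeps the policy question (fail or fall back) separate from the still-open question of how to validate a locale on Windows.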

That still leaves us with the problem of how to tell whether a locale
spec is bad on Windows.  Judging by your example, Windows checks whether
the code page is present but not whether it is sane for the base locale.
What happens when there's a mismatch --- eg, what encoding do system
messages come out in?

        regards, tom lane

Re: Locale + encoding combinations

From

Dave Page

Date:

12 October 2007, 08:53:31

Tom Lane wrote:
> That still leaves us with the problem of how to tell whether a locale
> spec is bad on Windows.  Judging by your example, Windows checks whether
> the code page is present but not whether it is sane for the base locale.
> What happens when there's a mismatch --- eg, what encoding do system
> messages come out in?

I'm not sure how to test that specifically, but it seems that accented
characters simply fall back to their undecorated equivalents if the
encoding is not appropriate, eg:

Dave@SNAKE:~$ ./setlc French_France.1252
Locale: French_France.1252
The date is: sam. 01 of août  2007
Dave@SNAKE:~$ ./setlc French_France.28597
Locale: French_France.28597
The date is: sam. 01 of aout  2007

(the encodings used there are WIN1252 and ISO8859-7 (Greek)).
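
One possible way to spot this kind of mismatch, offered purely as a sketch, is to test whether some sample text for the language survives encoding in the chosen code page. Nothing here is something Windows or PostgreSQL does: `can_represent` is a hypothetical helper, the sample strings are arbitrary, and the codec names (`cp1252`, `iso8859_7`, `iso8859_5`) are Python's names for WIN1252, ISO8859-7 and ISO8859-5.

```python
def can_represent(sample, codec):
    """Return True if every character of `sample` exists in `codec`."""
    try:
        sample.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # French sample against WIN1252 and ISO8859-7 (Greek).
    print(can_represent("août", "cp1252"))
    print(can_represent("août", "iso8859_7"))
    # Russian sample against WIN1252 and ISO8859-5 (Cyrillic).
    print(can_represent("Привет", "cp1252"))
    print(can_represent("Привет", "iso8859_5"))
```

A check like this only says whether the characters exist in the code page; it says nothing about collation or case mapping.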

I'm happy to test further if you can suggest how I can figure out the
encoding actually output.

Regards, Dave.

Re: Locale + encoding combinations

From

"Trevor Talbot"

Date:

12 October 2007, 10:04:03

On 10/12/07, Dave Page <dpage@postgresql.org> wrote:
> Tom Lane wrote
> > That still leaves us with the problem of how to tell whether a locale
> > spec is bad on Windows.  Judging by your example, Windows checks whether
> > the code page is present but not whether it is sane for the base locale.
> > What happens when there's a mismatch --- eg, what encoding do system
> > messages come out in?
>
> I'm not sure how to test that specifically, but it seems that accented
> characters simply fall back to their undecorated equivalents if the
> encoding is not appropriate, eg:
>
> Dave@SNAKE:~$ ./setlc French_France.1252
> Locale: French_France.1252
> The date is: sam. 01 of août  2007
> Dave@SNAKE:~$ ./setlc French_France.28597
> Locale: French_France.28597
> The date is: sam. 01 of aout  2007
>
> (the encodings used there are WIN1252 and ISO8859-7 (Greek)).
>
> I'm happy to test further if you can suggest how I can figure out the
> encoding actually output.

The encoding output is the one you specified.  Keep in mind,
underneath Windows is mostly working with Unicode, so all characters
exist and the locale rules specify their behavior there.  The encoding
is just the byte stream it needs to force them all into after doing
whatever it does to them.  As you've seen, it uses some sort of
best-fit mapping I don't know the details of.  (It will drop accent
marks and choose characters with similar shape where possible, by
default.)
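
The effect can be imitated, very roughly, by decomposing each character that does not fit and dropping its accent marks. This sketch is an approximation built on Unicode NFKD decomposition and is not Windows' actual best-fit algorithm; `best_fit_encode` is a hypothetical name, and the per-character approach is only suitable for single-byte code pages.

```python
import unicodedata


def best_fit_encode(text, codec):
    """Encode `text`, falling back to unaccented characters where needed.

    Approximation only: characters that still do not fit after the
    accent marks are removed are replaced with '?'.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codec)
            continue
        except UnicodeEncodeError:
            pass
        # Decompose (e.g. 'û' -> 'u' + combining circumflex), drop the marks.
        stripped = "".join(
            c for c in unicodedata.normalize("NFKD", ch)
            if not unicodedata.combining(c)
        )
        out += stripped.encode(codec, errors="replace")
    return bytes(out)


if __name__ == "__main__":
    print(best_fit_encode("août", "cp1252"))
    print(best_fit_encode("août", "iso8859_7"))
```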

I think it's a bit more complex for input/transform cases where you
operate on the byte stream directly without intermediate conversion to
Unicode, which is why UTF-8 doesn't work as a codepage, but again I
don't have the details nearby.  I can try to do more digging if
needed.

Re: Locale + encoding combinations

From

Dave Page

Date:

12 October 2007, 11:26:33

Trevor Talbot wrote:
> The encoding output is the one you specified.  

OK.

> Keep in mind,
> underneath Windows is mostly working with Unicode, so all characters
> exist and the locale rules specify their behavior there.  The encoding
> is just the byte stream it needs to force them all into after doing
> whatever it does to them.  As you've seen, it uses some sort of
> best-fit mapping I don't know the details of.  (It will drop accent
> marks and choose characters with similar shape where possible, by
> default.)

Right, that makes sense. The codepages used by setlocale etc. are just
translation tables to/from the internal Unicode representation.

> I think it's a bit more complex for input/transform cases where you
> operate on the byte stream directly without intermediate conversion to
> Unicode, which is why UTF-8 doesn't work as a codepage, but again I
> don't have the details nearby.  I can try to do more digging if
> needed.

It does (sort of) work as a codepage, it just doesn't have the NLS file
to define how things like UPPER() and LOWER() should work.

Regards, Dave

Re: Locale + encoding combinations

From

Magnus Hagander

Date:

12 October 2007, 11:45:43

On Fri, Oct 12, 2007 at 06:03:52AM -0700, Trevor Talbot wrote:
> On 10/12/07, Dave Page <dpage@postgresql.org> wrote:
> > Tom Lane wrote
> > > That still leaves us with the problem of how to tell whether a locale
> > > spec is bad on Windows.  Judging by your example, Windows checks whether
> > > the code page is present but not whether it is sane for the base locale.
> > > What happens when there's a mismatch --- eg, what encoding do system
> > > messages come out in?
> >
> > I'm not sure how to test that specifically, but it seems that accented
> > characters simply fall back to their undecorated equivalents if the
> > encoding is not appropriate, eg:
> >
> > Dave@SNAKE:~$ ./setlc French_France.1252
> > Locale: French_France.1252
> > The date is: sam. 01 of août  2007
> > Dave@SNAKE:~$ ./setlc French_France.28597
> > Locale: French_France.28597
> > The date is: sam. 01 of aout  2007
> >
> > (the encodings used there are WIN1252 and ISO8859-7 (Greek)).
> >
> > > I'm happy to test further if you can suggest how I can figure out the
> > encoding actually output.
> 
> The encoding output is the one you specified.  Keep in mind,
> underneath Windows is mostly working with Unicode, so all characters
> exist and the locale rules specify their behavior there.  The encoding
> is just the byte stream it needs to force them all into after doing
> whatever it does to them.  As you've seen, it uses some sort of
> best-fit mapping I don't know the details of.  (It will drop accent
> marks and choose characters with similar shape where possible, by
> default.)
> 
> I think it's a bit more complex for input/transform cases where you
> operate on the byte stream directly without intermediate conversion to
> Unicode, which is why UTF-8 doesn't work as a codepage, but again I
> don't have the details nearby.  I can try to do more digging if
> needed.

Just so the non-Windows-savvy people get it: when Windows documentation or
users refer to Unicode, they mean UTF-16.

//Magnus
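
For readers who want to see the difference in bytes, a short sketch (the helper name `byte_lengths` is hypothetical) compares the same string in UTF-16 and UTF-8:

```python
def byte_lengths(text):
    """Return the encoded size of `text` in UTF-16 (little-endian) and UTF-8."""
    return {
        "utf-16-le": len(text.encode("utf-16-le")),
        "utf-8": len(text.encode("utf-8")),
    }


if __name__ == "__main__":
    # Every character in this sample takes two bytes in UTF-16,
    # while UTF-8 needs two bytes only for the accented one.
    print(byte_lengths("août"))
```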