Thread: Locale + encoding combinations
I'm working on some code for pgInstaller that will check that the locale and encoding selected by the user are a valid combination. The changes recently added to initdb (which highlighted the UTF-8 issue on Windows that Tom posted about) appear to allow only the default encoding for the locale to be selected. For example, for me that would be:

"English_United Kingdom.1252"

However, setlocale() will also accept other valid combinations on Windows, which initdb will not, for example:

"English_United Kingdom.28591" (Latin1)

Is there any reason not to accept other combinations that setlocale() is happy with?

Regards, Dave

Dave Page wrote:
> Is there any reason not to accept other combinations that setlocale()
> is happy with?

setlocale() sets the locale. How does it "accept" a "combination"?

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Peter Eisentraut wrote:
> Dave Page wrote:
>> Is there any reason not to accept other combinations that setlocale()
>> is happy with?
>
> setlocale() sets the locale. How does it "accept" a "combination"?

setlocale(LC_CTYPE, "English_United Kingdom.65001")

will return null (and not change anything) because it doesn't like the combination of the locale and that encoding (UTF-8).

setlocale(LC_CTYPE, "English_United Kingdom.1252")

will return "English_United Kingdom.1252" and set the locale accordingly because WIN1252 is a valid encoding for that locale. Similarly, LATIN1 and numerous other encodings are accepted in combination with that locale.

Should initdb allow any combination that setlocale() accepts, or should it *only* accept the default encoding for the specified locale?

/D

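The accept/reject behaviour described in this message can be probed from a short script. The following is a minimal sketch in Python, assuming its standard `locale` module, which wraps the C setlocale() and raises `locale.Error` where the C call would return NULL. `try_locale` is a hypothetical helper for illustration; it is not part of initdb or pgInstaller.

```python
import locale


def try_locale(name, category=locale.LC_CTYPE):
    """Return the name setlocale() reports for `name`, or None if it is rejected.

    The previous setting is restored afterwards, so the call only probes.
    """
    previous = locale.setlocale(category)  # query the current setting
    try:
        return locale.setlocale(category, name)
    except locale.Error:
        return None  # the C library refused this name
    finally:
        locale.setlocale(category, previous)


if __name__ == "__main__":
    # Names taken from the message above; results depend on the platform.
    for candidate in ("English_United Kingdom.1252",
                      "English_United Kingdom.28591",
                      "English_United Kingdom.65001"):
        print(candidate, "->", try_locale(candidate))
```

What each name returns depends entirely on the operating system and on which code pages are installed, so no expected output is given here.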
Dave Page wrote:
> setlocale(LC_CTYPE, "English_United Kingdom.65001")
>
> will return null (and not change anything) because it doesn't like
> the combination of the locale and that encoding (UTF-8).

The reason that that call fails is probably that the operating system does not provide such a locale. But that's not what we are interested in. We are interested in compatibility between *existing* operating system locales and *PostgreSQL* encoding names.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Peter Eisentraut wrote:
> Dave Page wrote:
>> setlocale(LC_CTYPE, "English_United Kingdom.65001")
>>
>> will return null (and not change anything) because it doesn't like
>> the combination of the locale and that encoding (UTF-8).
>
> The reason that that call fails is probably that the operating system
> does not provide such a locale.

It doesn't - UTF-8/65001 is a pseudo codepage on Windows with no NLS file defining collation rules etc. as we already discussed.

> But that's not what we are interested
> in. We are interested in compatibility between *existing* operating
> system locales and *PostgreSQL* encoding names.

Yes. Let me put my question another way.

Latin1 is a perfectly valid encoding for my locale English_United Kingdom. It is accepted by setlocale for LC_ALL.

Why does initdb reject it? Why does it insist the encoding is not valid for the locale?

/D

On Wednesday, 10 October 2007, Dave Page wrote:
> Latin1 is a perfectly valid encoding for my locale English_United
> Kingdom. It is accepted by setlocale for LC_ALL.
>
> Why does initdb reject it? Why does it insist the encoding is not valid
> for the locale?

Because initdb works with a finite list of known matches, and your particular combination might not be in that list -- yet.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Peter Eisentraut wrote:
> On Wednesday, 10 October 2007, Dave Page wrote:
>> Latin1 is a perfectly valid encoding for my locale English_United
>> Kingdom. It is accepted by setlocale for LC_ALL.
>>
>> Why does initdb reject it? Why does it insist the encoding is not valid
>> for the locale?
>
> Because initdb works with a finite list of known matches, and your particular
> combination might not be in that list -- yet.

So is it just a case of us generating a list of matches that may be Windows specific, or is there more to it than that?

/D

On Wednesday, 10 October 2007, Dave Page wrote:
> So is it just a case of us generating a list of matches that may be
> Windows specific, or is there more to it than that?

You want to peruse src/port/chklocale.c. There is already explicit Windows support in there, so maybe you just need to add on your particular cases.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Peter Eisentraut wrote:
> On Wednesday, 10 October 2007, Dave Page wrote:
>> So is it just a case of us generating a list of matches that may be
>> Windows specific, or is there more to it than that?
>
> You want to peruse src/port/chklocale.c. There is already explicit Windows
> support in there, so maybe you just need to add on your particular cases.

Yup, found that - thanks. I'll look at updating that list.

/D

Dave Page wrote:
> Peter Eisentraut wrote:
>> On Wednesday, 10 October 2007, Dave Page wrote:
>>> So is it just a case of us generating a list of matches that may be
>>> Windows specific, or is there more to it than that?
>> You want to peruse src/port/chklocale.c. There is already explicit Windows
>> support in there, so maybe you just need to add on your particular cases.
>
> Yup, found that - thanks. I'll look at updating that list.

OK so I added the appropriate entries (and posted the patch to -patches), but my original question remains: why can I only select the *default* encoding for the chosen locale, but not other ones that are also valid according to setlocale? Is this a bug, or is there some technical reason?

/D

On Wednesday, 10 October 2007, Dave Page wrote:
> my original question remains: why can I only select the
> *default* encoding for the chosen locale, but not other ones that are
> also valid according to setlocale? Is this a bug, or is there some
> technical reason?

One locale works only with one encoding. There are no "default" or perhaps alternative encodings for one locale; there is only one. The whole point of the exercise is to determine what the spelling of that one encoding is in PostgreSQL.

Perhaps you are confused about the naming. These are all entirely separate locales:

en_GB.iso88591
en_GB.iso885915
en_GB.utf8

Someone was friendly enough to include the name of the encoding used by the locale into its name, but that doesn't mean that en_GB has three alternative encodings or something.

At least that's the model we have on POSIX platforms.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

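To make the naming point concrete: the part after the dot is only a label inside the locale's name, and each full string names a separate locale. A small Python sketch that pulls the name apart is shown below. `split_locale` is a hypothetical helper, and it deliberately ignores "@modifier" suffixes, which some POSIX locale names carry.

```python
def split_locale(name):
    """Split 'base.encoding' into (base, encoding).

    The encoding part is None when the name has no dot, as with "C".
    """
    base, dot, encoding = name.rpartition(".")
    if not dot:
        return name, None
    return base, encoding


if __name__ == "__main__":
    # The three names from the message: same base, three distinct locales.
    for name in ("en_GB.iso88591", "en_GB.iso885915", "en_GB.utf8"):
        print(name, "->", split_locale(name))
```

Sharing the base "en_GB" does not make the three interchangeable; an application still has to pick one full name, and that choice fixes the encoding.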
Dave Page <dpage@postgresql.org> writes:
> However, setlocale() will also accept other valid combinations on
> Windows, which initdb will not, for example:
> "English_United Kingdom.28591" (Latin1)
> Is there any reason not to accept other combinations that setlocale() is
> happy with?

Are you certain that that acceptance actually represents support? Have you checked that it rejects combinations involving real code pages (ie, NOT 65001) that don't really work with the locale?

regards, tom lane

Peter Eisentraut wrote:
> On Wednesday, 10 October 2007, Dave Page wrote:
>> my original question remains: why can I only select the
>> *default* encoding for the chosen locale, but not other ones that are
>> also valid according to setlocale? Is this a bug, or is there some
>> technical reason?
>
> One locale works only with one encoding. There are no "default" or perhaps
> alternative encodings for one locale; there is only one. The whole point of
> the exercise is to determine what the spelling of that one encoding is in
> PostgreSQL.
>
> Perhaps you are confused about the naming. These are all entirely separate
> locales:
>
> en_GB.iso88591
> en_GB.iso885915
> en_GB.utf8
>
> Someone was friendly enough to include the name of the encoding used by the
> locale into its name, but that doesn't mean that en_GB has three alternative
> encodings or something.
>
> At least that's the model we have on POSIX platforms.

OK, sorting out my terminology deficiencies has helped - thanks.

The problem seems to be:

initdb --locale "English_United Kingdom.28591"

works, but

initdb -E LATIN1 --locale "English_United Kingdom"

does not. That's good (albeit inconsistent), I know how to fix pgInstaller now.

What isn't so good is:

============
C:\pg>bin\initdb --locale "English_United Kingdom.99999" -D data
initdb: invalid locale name "English_United Kingdom.99999"
initdb: invalid locale name "English_United Kingdom.99999"
initdb: invalid locale name "English_United Kingdom.99999"
initdb: invalid locale name "English_United Kingdom.99999"
initdb: invalid locale name "English_United Kingdom.99999"
initdb: invalid locale name "English_United Kingdom.99999"
The files belonging to this database system will be owned by user "Dave".
This user must also own the server process.

The database cluster will be initialized with locale English_United Kingdom.1252.
The default database encoding has accordingly been set to WIN1252.
===========

Shouldn't that have failed?

Regards, Dave

Tom Lane wrote:
> Dave Page <dpage@postgresql.org> writes:
>> However, setlocale() will also accept other valid combinations on
>> Windows, which initdb will not, for example:
>> "English_United Kingdom.28591" (Latin1)
>> Is there any reason not to accept other combinations that setlocale() is
>> happy with?
>
> Are you certain that that acceptance actually represents support?
> Have you checked that it rejects combinations involving real code
> pages (ie, NOT 65001) that don't really work with the locale?

It fails with ones that Microsoft have decided don't belong in my language group and therefore aren't installed. It accepts all the others I've tried, but then from the sample I've looked at, they all have 0-9a-zA-Z in them so I guess they're all capable of handling English.

Regards, Dave.

Dave Page <dpage@postgresql.org> writes:
> Tom Lane wrote:
>> Are you certain that that acceptance actually represents support?
>> Have you checked that it rejects combinations involving real code
>> pages (ie, NOT 65001) that don't really work with the locale?
> It fails with ones that Microsoft have decided don't belong in my
> language group and therefore aren't installed. It accepts all the others
> I've tried, but then from the sample I've looked at, they all have
> 0-9a-zA-Z in them so I guess they're all capable of handling English.

That doesn't exactly fill me with confidence. Maybe you need to make some tests involving a non-English base locale?

regards, tom lane

Tom Lane wrote:
> Dave Page <dpage@postgresql.org> writes:
>> Tom Lane wrote:
>>> Are you certain that that acceptance actually represents support?
>>> Have you checked that it rejects combinations involving real code
>>> pages (ie, NOT 65001) that don't really work with the locale?
>
>> It fails with ones that Microsoft have decided don't belong in my
>> language group and therefore aren't installed. It accepts all the others
>> I've tried, but then from the sample I've looked at, they all have
>> 0-9a-zA-Z in them so I guess they're all capable of handling English.
>
> That doesn't exactly fill me with confidence. Maybe you need to make
> some tests involving a non-English base locale?

Hmm, I'm guessing these probably shouldn't work:

Dave@SNAKE:~$ setlc "Japanese_Japan.28605"
Japanese_Japan.28605
Dave@SNAKE:~$ setlc "Japanese_Japan.28595"
Japanese_Japan.28595
Dave@SNAKE:~$ setlc "Russian_Russia.1252"
Russian_Russia.1252
Dave@SNAKE:~$ setlc "Russian_Russia.28591"
Russian_Russia.28591

1252 == WIN1252
28591 == LATIN1
28605 == LATIN9
28595 == ISO8859-5 (Cyrillic)
28597 == ISO8859-7 (Greek)

In fact, it looks like it'll allow me to use anything that's installed, regardless of whether they're likely to be compatible. So much for trusting setlocale() :-(

/D

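Since setlocale() accepting a name proves little here, what remains is the finite list of known matches mentioned earlier in the thread. A rough Python sketch of such a lookup follows, using only the five code pages listed in this message. The table and the helper name `encoding_for_windows_locale` are illustrative assumptions and do not reproduce the contents of src/port/chklocale.c.

```python
# Code page numbers and labels exactly as listed in the message above.
CODEPAGE_LABELS = {
    1252: "WIN1252",
    28591: "LATIN1",
    28605: "LATIN9",
    28595: "ISO8859-5 (Cyrillic)",
    28597: "ISO8859-7 (Greek)",
}


def encoding_for_windows_locale(name):
    """Return the label for the code page suffix of `name`.

    Returns None when there is no numeric suffix or it is not in the table.
    """
    _, dot, suffix = name.rpartition(".")
    if not dot or not suffix.isdecimal():
        return None
    return CODEPAGE_LABELS.get(int(suffix))


if __name__ == "__main__":
    for name in ("Russian_Russia.1252", "Japanese_Japan.28605",
                 "English_United Kingdom.99999"):
        print(name, "->", encoding_for_windows_locale(name))
```

A lookup like this only answers which encoding a code page number denotes. It says nothing about whether that code page is sensible for the base locale, which is the open question in the thread.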
Dave Page <dpage@postgresql.org> writes:
> OK so I added the appropriate entries (and posted the patch to
> -patches), but my original question remains: why can I only select the
> *default* encoding for the chosen locale, but not other ones that are
> also valid according to setlocale? Is this a bug, or is there some
> technical reason?

Well, the chklocale code is designed around the assumption that there *is* only one encoding for which a locale setting will work, with C/POSIX being a special case.

I think we are talking a bit at cross-purposes here, because the Windows equivalent to this notion seems to be "English_United Kingdom.1252" whereas you seem to be defining locale as just "English_United Kingdom". Does it not work the way you want if you make the installer pass locale strings of the first form to initdb?

regards, tom lane

Tom Lane wrote:
> Dave Page <dpage@postgresql.org> writes:
>> OK so I added the appropriate entries (and posted the patch to
>> -patches), but my original question remains: why can I only select the
>> *default* encoding for the chosen locale, but not other ones that are
>> also valid according to setlocale? Is this a bug, or is there some
>> technical reason?
>
> Well, the chklocale code is designed around the assumption that there
> *is* only one encoding for which a locale setting will work, with
> C/POSIX being a special case.
>
> I think we are talking a bit at cross-purposes here, because the Windows
> equivalent to this notion seems to be "English_United Kingdom.1252"
> whereas you seem to be defining locale as just "English_United Kingdom".
> Does it not work the way you want if you make the installer pass locale
> strings of the first form to initdb?

Yes, it seems it does (see my previous email to Peter):

http://archives.postgresql.org/pgsql-hackers/2007-10/msg00447.php

So I guess that's how I'll fix the installer. There is another issue though as I mentioned in the post above - that it complains about an invalid encoding specifier on the encoding name, then ignores it and uses the default which seems wrong to me.

/D

Dave Page <dpage@postgresql.org> writes:
> In fact, it looks like it'll allow me to use anything that's installed,
> regardless of whether they're likely to be compatible. So much for
> trusting setlocale() :-(

Yech :-(. Count on Microsloth to get this wrong.

Anyone have any ideas on how to tell if a locale setting *really* works on Windows?

regards, tom lane

Dave Page <dpage@postgresql.org> writes:
> ... There is another issue
> though as I mentioned in the post above - that it complains about an
> invalid encoding specifier on the encoding name, then ignores it and
> uses the default which seems wrong to me.

Yeah, if you look at chklocale() in initdb.c this is clearly how it works, but there's a comment

/* should we exit here? */

so whoever wrote it wasn't all that convinced it was the right behavior. Given that 8.3 is raising the stakes for having a correct locale specification at initdb time, it seems right to me to error out if a bogus locale switch is given, rather than whining and then substituting the environment default. Any objections?

That still leaves us with the problem of how to tell whether a locale spec is bad on Windows. Judging by your example, Windows checks whether the code page is present but not whether it is sane for the base locale. What happens when there's a mismatch --- eg, what encoding do system messages come out in?

regards, tom lane

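The two behaviours weighed in this message (warn and fall back to the environment default, or error out) can be contrasted in a few lines. This is only a Python sketch under stated assumptions: `resolve_locale`, `InvalidLocaleError`, and the injected `is_valid` check are hypothetical names, and the code does not mirror initdb's actual C implementation.

```python
import sys


class InvalidLocaleError(ValueError):
    """Raised when a requested locale name is not usable."""


def resolve_locale(requested, is_valid, environment_default, strict=True):
    """Pick the locale to use for a new cluster.

    strict=True  : reject a bogus name outright (the proposed behaviour).
    strict=False : warn, then substitute the default (the observed behaviour).
    """
    if requested is None:
        return environment_default
    if is_valid(requested):
        return requested
    if strict:
        raise InvalidLocaleError(f'invalid locale name "{requested}"')
    print(f'warning: invalid locale name "{requested}"', file=sys.stderr)
    return environment_default
```

Passing the validity check in as a function keeps the sketch independent of the harder question raised here, namely how to tell on Windows whether a locale specification is really sane.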
Tom Lane wrote:
> That still leaves us with the problem of how to tell whether a locale
> spec is bad on Windows. Judging by your example, Windows checks whether
> the code page is present but not whether it is sane for the base locale.
> What happens when there's a mismatch --- eg, what encoding do system
> messages come out in?

I'm not sure how to test that specifically, but it seems that accented characters simply fall back to their undecorated equivalents if the encoding is not appropriate, eg:

Dave@SNAKE:~$ ./setlc French_France.1252
Locale: French_France.1252
The date is: sam. 01 of août 2007
Dave@SNAKE:~$ ./setlc French_France.28597
Locale: French_France.28597
The date is: sam. 01 of aout 2007

(the encodings used there are WIN1252 and ISO8859-7 (Greek)).

I'm happy to test further if you can suggest how I can figure out the encoding actually output.

Regards, Dave.

On 10/12/07, Dave Page <dpage@postgresql.org> wrote:
> Tom Lane wrote:
> > That still leaves us with the problem of how to tell whether a locale
> > spec is bad on Windows. Judging by your example, Windows checks whether
> > the code page is present but not whether it is sane for the base locale.
> > What happens when there's a mismatch --- eg, what encoding do system
> > messages come out in?
>
> I'm not sure how to test that specifically, but it seems that accented
> characters simply fall back to their undecorated equivalents if the
> encoding is not appropriate, eg:
>
> Dave@SNAKE:~$ ./setlc French_France.1252
> Locale: French_France.1252
> The date is: sam. 01 of août 2007
> Dave@SNAKE:~$ ./setlc French_France.28597
> Locale: French_France.28597
> The date is: sam. 01 of aout 2007
>
> (the encodings used there are WIN1252 and ISO8859-7 (Greek)).
>
> I'm happy to test further if you can suggest how I can figure out the
> encoding actually output.

The encoding output is the one you specified. Keep in mind, underneath Windows is mostly working with Unicode, so all characters exist and the locale rules specify their behavior there. The encoding is just the byte stream it needs to force them all into after doing whatever it does to them. As you've seen, it uses some sort of best-fit mapping I don't know the details of. (It will drop accent marks and choose characters with similar shape where possible, by default.)

I think it's a bit more complex for input/transform cases where you operate on the byte stream directly without intermediate conversion to Unicode, which is why UTF-8 doesn't work as a codepage, but again I don't have the details nearby. I can try to do more digging if needed.

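The accent-dropping effect seen in the French date output can be imitated in Python, which may help when reasoning about it. The sketch below is only an approximation built on Unicode decomposition; it is not the Windows best-fit tables, whose details are not given in the thread. `approximate_best_fit` is a hypothetical name.

```python
import unicodedata


def approximate_best_fit(text, encoding):
    """Encode `text`, stripping accents from characters the encoding lacks.

    Characters that still cannot be encoded after stripping become '?'.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(encoding)
            continue
        except UnicodeEncodeError:
            pass
        # Decompose (U+00FB becomes 'u' plus a combining circumflex),
        # then drop the combining marks and keep the base character.
        stripped = "".join(
            c for c in unicodedata.normalize("NFD", ch)
            if not unicodedata.combining(c)
        )
        out += stripped.encode(encoding, errors="replace")
    return bytes(out)


if __name__ == "__main__":
    word = "ao\u00fbt"  # the French month name from the example
    print(approximate_best_fit(word, "cp1252"))     # code page 1252
    print(approximate_best_fit(word, "iso8859_7"))  # code page 28597
```

With code page 1252 the accented letter has its own byte and survives; with ISO 8859-7 it has none, so the base letter is emitted instead, which matches the "aout" in the transcript.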
Trevor Talbot wrote:
> The encoding output is the one you specified.

OK.

> Keep in mind,
> underneath Windows is mostly working with Unicode, so all characters
> exist and the locale rules specify their behavior there. The encoding
> is just the byte stream it needs to force them all into after doing
> whatever it does to them. As you've seen, it uses some sort of
> best-fit mapping I don't know the details of. (It will drop accent
> marks and choose characters with similar shape where possible, by
> default.)

Right, that makes sense. The codepages used by setlocale etc. are just translation tables to/from the internal Unicode representation.

> I think it's a bit more complex for input/transform cases where you
> operate on the byte stream directly without intermediate conversion to
> Unicode, which is why UTF-8 doesn't work as a codepage, but again I
> don't have the details nearby. I can try to do more digging if
> needed.

It does (sort of) work as a codepage, it just doesn't have the NLS file to define how things like UPPER() and LOWER() should work.

Regards, Dave

On Fri, Oct 12, 2007 at 06:03:52AM -0700, Trevor Talbot wrote:
> On 10/12/07, Dave Page <dpage@postgresql.org> wrote:
> > Tom Lane wrote:
> > > That still leaves us with the problem of how to tell whether a locale
> > > spec is bad on Windows. Judging by your example, Windows checks whether
> > > the code page is present but not whether it is sane for the base locale.
> > > What happens when there's a mismatch --- eg, what encoding do system
> > > messages come out in?
> >
> > I'm not sure how to test that specifically, but it seems that accented
> > characters simply fall back to their undecorated equivalents if the
> > encoding is not appropriate, eg:
> >
> > Dave@SNAKE:~$ ./setlc French_France.1252
> > Locale: French_France.1252
> > The date is: sam. 01 of août 2007
> > Dave@SNAKE:~$ ./setlc French_France.28597
> > Locale: French_France.28597
> > The date is: sam. 01 of aout 2007
> >
> > (the encodings used there are WIN1252 and ISO8859-7 (Greek)).
> >
> > I'm happy to test further if you can suggest how I can figure out the
> > encoding actually output.
>
> The encoding output is the one you specified. Keep in mind,
> underneath Windows is mostly working with Unicode, so all characters
> exist and the locale rules specify their behavior there. The encoding
> is just the byte stream it needs to force them all into after doing
> whatever it does to them. As you've seen, it uses some sort of
> best-fit mapping I don't know the details of. (It will drop accent
> marks and choose characters with similar shape where possible, by
> default.)
>
> I think it's a bit more complex for input/transform cases where you
> operate on the byte stream directly without intermediate conversion to
> Unicode, which is why UTF-8 doesn't work as a codepage, but again I
> don't have the details nearby. I can try to do more digging if
> needed.

Just so the non-Windows-savvy people get it: when Windows documentation or users refer to Unicode, they mean UTF-16.

//Magnus