Thread: Enforcing database encoding and locale match

Enforcing database encoding and locale match

From

Tom Lane

Date:

28 September 2007, 13:02:15

I was reminded again just now of the bad consequences of selecting a
database encoding that is not compatible with your LC_CTYPE setting:
http://archives.postgresql.org/pgsql-bugs/2007-09/msg00158.php
Aside from that one, which is perilously close to being a denial of
service attack, there are known problems with sorting, upper()/lower()
behavior, etc etc.  We're going to keep hearing those types of
complaints until we do something about enforcing that people don't use
an incompatible encoding.

This has been discussed before, of course, and has foundered on the
problem that there's no very reliable/portable way to determine what
encoding is implied by LC_CTYPE.  We do have code in initdb that
purports to determine this on common platforms, but I've never trusted
it very much, because it isn't stressed hard in common use.  So the
problem is how to develop some trust in it.

It occurs me that what we could do is put that code into CREATE
DATABASE, but have it throw a WARNING not an ERROR if it thinks the
encoding doesn't match the locale.  That would be sufficiently in
people's faces that we'd hear about it if it didn't work.  After a
release cycle or so of not hearing complaints, we could promote the
warning to an error.

Another possibility is to treat the case as a WARNING if you're
superuser and an ERROR if you're not.  This would satisfy people
who are uncomfortable with the idea that CREATEDB privilege comes
with a built-in denial-of-service attack, while still leaving a
loophole for anyone for whom the test didn't work properly.

Comments?
        regards, tom lane

Re: Enforcing database encoding and locale match

From

Gregory Stark

Date:

28 September 2007, 13:19:47

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Another possibility is to treat the case as a WARNING if you're
> superuser and an ERROR if you're not.  This would satisfy people
> who are uncomfortable with the idea that CREATEDB privilege comes
> with a built-in denial-of-service attack, while still leaving a
> loophole for anyone for whom the test didn't work properly.

That sounds like a good combination

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com

Re: Enforcing database encoding and locale match

From

Andrew Dunstan

Date:

28 September 2007, 13:30:57


Gregory Stark wrote:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>
>   
>> Another possibility is to treat the case as a WARNING if you're
>> superuser and an ERROR if you're not.  This would satisfy people
>> who are uncomfortable with the idea that CREATEDB privilege comes
>> with a built-in denial-of-service attack, while still leaving a
>> loophole for anyone for whom the test didn't work properly.
>>     
>
> That sounds like a good combination
>
>   

+1

cheers

andrew

Re: Enforcing database encoding and locale match

From

Tom Lane

Date:

28 September 2007, 16:31:15

Andrew Dunstan <andrew@dunslane.net> writes:
> Gregory Stark wrote:
>> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>>> Another possibility is to treat the case as a WARNING if you're
>>> superuser and an ERROR if you're not.  This would satisfy people
>>> who are uncomfortable with the idea that CREATEDB privilege comes
>>> with a built-in denial-of-service attack, while still leaving a
>>> loophole for anyone for whom the test didn't work properly.
>> 
>> That sounds like a good combination
> +1

After further experimentation I want to change the proposal a bit.
AFAICS, if we recognize the nl_langinfo(CODESET) result, there is
no reason not to trust the answer, so we might as well throw an
error always.  The case that is problematic is where we can get a
CODESET string but we don't recognize it.  In this case it seems
appropriate to do
   ereport(WARNING,           (errmsg("could not determine encoding for locale \"%s\": codeset is \"%s\"",
    ctype, sys),            errdetail("Please report this to <pgsql-bugs@postgresql.org>.")));

and then let the user do what he wants.

There need to be two exceptions to the error-on-mismatch policy.

First off, if the locale is C/POSIX then we can allow any encoding.

Second, it appears that we have to allow SQL_ASCII encoding to be
selected regardless of locale; if we don't, the "make installcheck"
regression tests fail, because they try to do exactly that; and I'm
sure that there are other users out there who don't (think they)
care about encoding.  This is not quite as bad as the generic mismatch
case, because the backend will never try to do encoding conversion
and so the recursive-error panic can't happen.  But you could still
have unexpected sorting behavior and probably index corruption.

What I propose is that we allow SQL_ASCII databases to be created
when the locale is not C, but only by superusers.

Comments?
        regards, tom lane

Re: Enforcing database encoding and locale match

From

Zdenek Kotala

Date:

28 September 2007, 17:19:04

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> Gregory Stark wrote:
>>> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>>>> Another possibility is to treat the case as a WARNING if you're
>>>> superuser and an ERROR if you're not.  This would satisfy people
>>>> who are uncomfortable with the idea that CREATEDB privilege comes
>>>> with a built-in denial-of-service attack, while still leaving a
>>>> loophole for anyone for whom the test didn't work properly.
>>> That sounds like a good combination
>> +1
>
> After further experimentation I want to change the proposal a bit.
> AFAICS, if we recognize the nl_langinfo(CODESET) result, there is
> no reason not to trust the answer, so we might as well throw an
> error always.

Agree. Code seems to be OK and on POSIX compatible OS it should be work.
  I attached testing code. With following command

  for LOCALE in `locale -a`; do ./a.out $LOCALE ; done

is should be possible to verify status on all unix OS.

On Solaris I got following problematic locales:

C                       ... 646        - NO MATCH
POSIX                   ... 646        - NO MATCH
cs                      ... 646        - NO MATCH
da                      ... 646        - NO MATCH
et                      ... 646        - NO MATCH
it                      ... 646        - NO MATCH
ja_JP.PCK               ... PCK        - NO MATCH
ko                      ... 646        - NO MATCH
no                      ... 646        - NO MATCH
ru                      ... 646        - NO MATCH
sl                      ... 646        - NO MATCH
sv                      ... 646        - NO MATCH
tr                      ... 646        - NO MATCH
zh.GBK                  ... GBK        - NO MATCH
zh_CN.GB18030           ... GB18030    - NO MATCH
zh_CN.GB18030@pinyin    ... GB18030    - NO MATCH
zh_CN.GB18030@radical   ... GB18030    - NO MATCH
zh_CN.GB18030@stroke    ... GB18030    - NO MATCH
zh_CN.GBK               ... GBK        - NO MATCH
zh_CN.GBK@pinyin        ... GBK        - NO MATCH
zh_CN.GBK@radical       ... GBK        - NO MATCH
zh_CN.GBK@stroke        ... GBK        - NO MATCH


> The case that is problematic is where we can get a
> CODESET string but we don't recognize it.  In this case it seems
> appropriate to do
>
>     ereport(WARNING,
>             (errmsg("could not determine encoding for locale \"%s\": codeset is \"%s\"",
>                     ctype, sys),
>              errdetail("Please report this to <pgsql-bugs@postgresql.org>.")));
>
> and then let the user do what he wants.

The another question is what do when we know that this codeset/encoding
is not supported by postgres. Maybe extend encoding match structure to

struct encoding_match
{
    enum pg_enc pg_enc_code;
    const char *system_enc_name;
    bool supported;
};

and in case when it is unsupported then generates error. In case when
codeset does not match anyway then generates only warning.


        Zdenek
#include <locale.h>
#include <langinfo.h>
#include "postgres_fe.h"
//#include "miscadmin.h"
#include "mb/pg_wchar.h"

/*
 * Checks whether the encoding selected for PostgreSQL and the
 * encoding used by the system locale match.
 */

struct encoding_match
{
    enum pg_enc pg_enc_code;
    const char *system_enc_name;
};

static const struct encoding_match encoding_match_list[] = {
    {PG_EUC_JP, "EUC-JP"},
    {PG_EUC_JP, "eucJP"},
    {PG_EUC_JP, "IBM-eucJP"},
    {PG_EUC_JP, "sdeckanji"},

    {PG_EUC_CN, "EUC-CN"},
    {PG_EUC_CN, "eucCN"},
    {PG_EUC_CN, "IBM-eucCN"},
    {PG_EUC_CN, "GB2312"},
    {PG_EUC_CN, "dechanzi"},

    {PG_EUC_KR, "EUC-KR"},
    {PG_EUC_KR, "eucKR"},
    {PG_EUC_KR, "IBM-eucKR"},
    {PG_EUC_KR, "deckorean"},
    {PG_EUC_KR, "5601"},

    {PG_EUC_TW, "EUC-TW"},
    {PG_EUC_TW, "eucTW"},
    {PG_EUC_TW, "IBM-eucTW"},
    {PG_EUC_TW, "cns11643"},

#ifdef NOT_VERIFIED
    {PG_JOHAB, "???"},
#endif

    {PG_UTF8, "UTF-8"},
    {PG_UTF8, "utf8"},

    {PG_LATIN1, "ISO-8859-1"},
    {PG_LATIN1, "ISO8859-1"},
    {PG_LATIN1, "iso88591"},

    {PG_LATIN2, "ISO-8859-2"},
    {PG_LATIN2, "ISO8859-2"},
    {PG_LATIN2, "iso88592"},

    {PG_LATIN3, "ISO-8859-3"},
    {PG_LATIN3, "ISO8859-3"},
    {PG_LATIN3, "iso88593"},

    {PG_LATIN4, "ISO-8859-4"},
    {PG_LATIN4, "ISO8859-4"},
    {PG_LATIN4, "iso88594"},

    {PG_LATIN5, "ISO-8859-9"},
    {PG_LATIN5, "ISO8859-9"},
    {PG_LATIN5, "iso88599"},

    {PG_LATIN6, "ISO-8859-10"},
    {PG_LATIN6, "ISO8859-10"},
    {PG_LATIN6, "iso885910"},

    {PG_LATIN7, "ISO-8859-13"},
    {PG_LATIN7, "ISO8859-13"},
    {PG_LATIN7, "iso885913"},

    {PG_LATIN8, "ISO-8859-14"},
    {PG_LATIN8, "ISO8859-14"},
    {PG_LATIN8, "iso885914"},

    {PG_LATIN9, "ISO-8859-15"},
    {PG_LATIN9, "ISO8859-15"},
    {PG_LATIN9, "iso885915"},

    {PG_LATIN10, "ISO-8859-16"},
    {PG_LATIN10, "ISO8859-16"},
    {PG_LATIN10, "iso885916"},

    {PG_WIN1252, "CP1252"},
    {PG_WIN1253, "CP1253"},
    {PG_WIN1254, "CP1254"},
    {PG_WIN1255, "CP1255"},
    {PG_WIN1256, "CP1256"},
    {PG_WIN1257, "CP1257"},
    {PG_WIN1258, "CP1258"},
#ifdef NOT_VERIFIED
    {PG_WIN874, "???"},
#endif
    {PG_KOI8R, "KOI8-R"},
    {PG_WIN1251, "CP1251"},
    {PG_WIN866, "CP866"},

    {PG_ISO_8859_5, "ISO-8859-5"},
    {PG_ISO_8859_5, "ISO8859-5"},
    {PG_ISO_8859_5, "iso88595"},

    {PG_ISO_8859_6, "ISO-8859-6"},
    {PG_ISO_8859_6, "ISO8859-6"},
    {PG_ISO_8859_6, "iso88596"},

    {PG_ISO_8859_7, "ISO-8859-7"},
    {PG_ISO_8859_7, "ISO8859-7"},
    {PG_ISO_8859_7, "iso88597"},

    {PG_ISO_8859_8, "ISO-8859-8"},
    {PG_ISO_8859_8, "ISO8859-8"},
    {PG_ISO_8859_8, "iso88598"},

    {PG_SQL_ASCII, NULL}        /* end marker */
};

static char *
get_encoding_from_locale(const char *ctype)
{
    char       *save;
    char       *sys;

    save = setlocale(LC_CTYPE, NULL);
    if (!save)
        return NULL;
    save = strdup(save);

    setlocale(LC_CTYPE, ctype);
    sys = nl_langinfo(CODESET);
    sys = strdup(sys);

    setlocale(LC_CTYPE, save);
    free(save);

    return sys;
}


static int
find_matching_encoding(const char *ctype, const char *sys)
{
//    char       *sys;
    int            i;

    sys = get_encoding_from_locale(ctype);

    for (i = 0; encoding_match_list[i].system_enc_name; i++)
    {
        if (strcasecmp(sys, encoding_match_list[i].system_enc_name) == 0)
        {
//            free(sys);
            return encoding_match_list[i].pg_enc_code;
        }
    }

//    free(sys);
    return -1;
}


int main(int argc, char **argv)
{
    int enc;
    char       *sys;

    if( argc != 2)
    {
        fprintf(stderr,"Invalid number of arguments.\n");
        return 1;
    }

    printf("%-23s ... ", argv[1]);

    sys = get_encoding_from_locale(argv[1]);
    printf("%-10s - ",sys);

    enc=find_matching_encoding(argv[1], sys);
    if( enc != -1 )
        printf("OK\n");
    else
        printf("NO MATCH\n");

    free(sys);
    return 0;
}

Re: Enforcing database encoding and locale match

From

Tom Lane

Date:

28 September 2007, 17:43:13

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
> On Solaris I got following problematic locales:

> C                       ... 646        - NO MATCH
> POSIX                   ... 646        - NO MATCH
> cs                      ... 646        - NO MATCH
> da                      ... 646        - NO MATCH
> et                      ... 646        - NO MATCH
> it                      ... 646        - NO MATCH
> ja_JP.PCK               ... PCK        - NO MATCH
> ko                      ... 646        - NO MATCH
> no                      ... 646        - NO MATCH
> ru                      ... 646        - NO MATCH
> sl                      ... 646        - NO MATCH
> sv                      ... 646        - NO MATCH
> tr                      ... 646        - NO MATCH
> zh.GBK                  ... GBK        - NO MATCH
> zh_CN.GB18030           ... GB18030    - NO MATCH
> zh_CN.GB18030@pinyin    ... GB18030    - NO MATCH
> zh_CN.GB18030@radical   ... GB18030    - NO MATCH
> zh_CN.GB18030@stroke    ... GB18030    - NO MATCH
> zh_CN.GBK               ... GBK        - NO MATCH
> zh_CN.GBK@pinyin        ... GBK        - NO MATCH
> zh_CN.GBK@radical       ... GBK        - NO MATCH
> zh_CN.GBK@stroke        ... GBK        - NO MATCH

Not sure what 646 or PCK are, but we don't need to worry about GB18030
or GBK, because those aren't allowed backend encodings.

> The another question is what do when we know that this codeset/encoding 
> is not supported by postgres.

I don't really see a need to worry about this case.  The proposed encoding
will already have been checked to be sure it's one that the backend supports.
All we need is to be able to recognize any variant spelling of the
encodings we allow.
        regards, tom lane

Re: Enforcing database encoding and locale match

From

Zdenek Kotala

Date:

28 September 2007, 18:11:19

Tom Lane wrote:
> Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
>> On Solaris I got following problematic locales:
> 
>> C                       ... 646        - NO MATCH
>> POSIX                   ... 646        - NO MATCH
>> cs                      ... 646        - NO MATCH
>> da                      ... 646        - NO MATCH
>> et                      ... 646        - NO MATCH
>> it                      ... 646        - NO MATCH
>> ja_JP.PCK               ... PCK        - NO MATCH
>> ko                      ... 646        - NO MATCH
>> no                      ... 646        - NO MATCH
>> ru                      ... 646        - NO MATCH
>> sl                      ... 646        - NO MATCH
>> sv                      ... 646        - NO MATCH
>> tr                      ... 646        - NO MATCH
>> zh.GBK                  ... GBK        - NO MATCH
>> zh_CN.GB18030           ... GB18030    - NO MATCH
>> zh_CN.GB18030@pinyin    ... GB18030    - NO MATCH
>> zh_CN.GB18030@radical   ... GB18030    - NO MATCH
>> zh_CN.GB18030@stroke    ... GB18030    - NO MATCH
>> zh_CN.GBK               ... GBK        - NO MATCH
>> zh_CN.GBK@pinyin        ... GBK        - NO MATCH
>> zh_CN.GBK@radical       ... GBK        - NO MATCH
>> zh_CN.GBK@stroke        ... GBK        - NO MATCH
> 
> Not sure what 646 or PCK are, but we don't need to worry about GB18030
> or GBK, because those aren't allowed backend encodings.


PCK is Japanese Shift-JIS encoding. (see
http://www.inter-locale.com/whitepaper/learn/learn_to_type.html)

http://en.wikipedia.org/wiki/Shift_JIS

646 looks like ISO646. I will check it.

http://en.wikipedia.org/wiki/ISO646

> 
>> The another question is what do when we know that this codeset/encoding 
>> is not supported by postgres.
> 
> I don't really see a need to worry about this case.  The proposed encoding
> will already have been checked to be sure it's one that the backend supports.
> All we need is to be able to recognize any variant spelling of the
> encodings we allow.

OK. Maybe would be good put mapping into text file (e.g. encoding.map) 
into share directory. (Similar to tz_abbrev)
    Zdenek

Re: Enforcing database encoding and locale match

From

Tom Lane

Date:

28 September 2007, 19:59:11

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
> On Solaris I got following problematic locales: [...]

I tried this program on Mac OS X 10.4.10 (the current release) and found
out that what that OS mostly returns is the encoding portion of the
locale name, for instance

sv_SE.ISO8859-15        ... ISO8859-15 - OK
sv_SE.UTF-8             ... UTF-8      - OK
tr_TR                   ...            - NO MATCH
tr_TR.ISO8859-9         ... ISO8859-9  - OK
tr_TR.UTF-8             ... UTF-8      - OK
uk_UA                   ...            - NO MATCH
uk_UA.ISO8859-5         ... ISO8859-5  - OK
uk_UA.KOI8-U            ... KOI8-U     - NO MATCH
uk_UA.UTF-8             ... UTF-8      - OK
zh_CN                   ...            - NO MATCH
zh_CN.eucCN             ... eucCN      - OK
zh_CN.GB18030           ... GB18030    - NO MATCH
zh_CN.GB2312            ... GB2312     - OK
zh_CN.GBK               ... GBK        - NO MATCH
zh_CN.UTF-8             ... UTF-8      - OK
zh_HK                   ...            - NO MATCH
zh_HK.Big5HKSCS         ... Big5HKSCS  - NO MATCH
zh_HK.UTF-8             ... UTF-8      - OK
zh_TW                   ...            - NO MATCH
zh_TW.Big5              ... Big5       - NO MATCH
zh_TW.UTF-8             ... UTF-8      - OK
C                       ... US-ASCII   - NO MATCH
POSIX                   ... US-ASCII   - NO MATCH

They didn't *quite* hard-wire it that way, as evidenced by the C/POSIX
results, but certainly the empty-string results are entirely useless.
Perhaps we should file a bug with Apple.  However, some poking around
in /usr/share/locale indicates that there's a consistent interpretation
to be made:

g42:/usr/share/locale tgl$ ls -l ??_??/LC_CTYPE
lrwxr-xr-x   1 root  wheel    17 Apr 26  2006 af_ZA/LC_CTYPE@ -> ../UTF-8/LC_CTYPE
-r--r--r--   1 root  wheel  3272 Mar 20  2005 am_ET/LC_CTYPE
lrwxr-xr-x   1 root  wheel    17 Apr 26  2006 be_BY/LC_CTYPE@ -> ../UTF-8/LC_CTYPE
lrwxr-xr-x   1 root  wheel    17 Apr 26  2006 bg_BG/LC_CTYPE@ -> ../UTF-8/LC_CTYPE
lrwxr-xr-x   1 root  wheel    17 Apr 26  2006 ca_ES/LC_CTYPE@ -> ../UTF-8/LC_CTYPE
lrwxr-xr-x   1 root  wheel    17 Apr 26  2006 cs_CZ/LC_CTYPE@ -> ../UTF-8/LC_CTYPE
lrwxr-xr-x   1 root  wheel    17 Apr 26  2006 da_DK/LC_CTYPE@ -> ../UTF-8/LC_CTYPE
lrwxr-xr-x   1 root  wheel    17 Apr 26  2006 de_AT/LC_CTYPE@ -> ../UTF-8/LC_CTYPE
lrwxr-xr-x   1 root  wheel    17 Apr 26  2006 de_CH/LC_CTYPE@ -> ../UTF-8/LC_CTYPE
lrwxr-xr-x   1 root  wheel    17 Apr 26  2006 de_DE/LC_CTYPE@ -> ../UTF-8/LC_CTYPE
lrwxr-xr-x   1 root  wheel    17 Apr 26  2006 el_GR/LC_CTYPE@ -> ../UTF-8/LC_CTYPE
lrwxr-xr-x   1 root  wheel    17 Apr 26  2006 en_AU/LC_CTYPE@ -> ../UTF-8/LC_CTYPE
lrwxr-xr-x   1 root  wheel    17 Apr 26  2006 en_CA/LC_CTYPE@ -> ../UTF-8/LC_CTYPE
(etc etc)

The only one that's not actually a symlink to the standard UTF-8 ctype
is am_ET/LC_CTYPE, which is identical to am_ET.UTF-8/LC_CTYPE.
So I think we can get away with something like

#ifdef __darwin__if (strlen(sys) == 0)    // assume UTF8
#endif

I suppose we'll need a few more hacks like this as the beta-test results
begin to roll in ...
        regards, tom lane

Re: Enforcing database encoding and locale match

From

Tom Lane

Date:

28 September 2007, 20:45:45

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
> The another question is what do when we know that this codeset/encoding 
> is not supported by postgres.

Ah, I finally grasped what you were on about here.  As CVS HEAD stands,
if you run initdb in an unrecognized locale, you get something like

$ LANG=zh_CN.GB18030 initdb        
The files belonging to this database system will be owned by user "tgl".
This user must also own the server process.

The database cluster will be initialized with locale zh_CN.GB18030.
could not determine encoding for locale "zh_CN.GB18030": codeset is "GB18030"
initdb: could not find suitable encoding for locale "zh_CN.GB18030"
Rerun initdb with the -E option.
Try "initdb --help" for more information.
$

which is OK, but if you override it incorrectly, it'll let you do so:

$ LANG=zh_CN.GB18030 initdb -E utf8
The files belonging to this database system will be owned by user "tgl".
This user must also own the server process.

The database cluster will be initialized with locale zh_CN.GB18030.
could not determine encoding for locale "zh_CN.GB18030": codeset is "GB18030"
... but it presses merrily along ...

leading to a database which is in fact broken.

To prevent this, I think it would be sufficient to add entries to the
table for our known frontend-only encodings.  It's reasonable to assume
that anyone who wants to run Postgres will probably have a default
locale that uses *some* encoding that we support; otherwise he's going
to have a pretty unpleasant experience anyway.  If the function returns
a frontend-only encoding value then initdb will fail in a good way,
since it won't let the user select that as a database encoding.  So I
don't think we need an explicit concept of an unsupported encoding in
the table, just some more entries.
        regards, tom lane

Re: Enforcing database encoding and locale match

From

Alvaro Herrera

Date:

01 October 2007, 14:07:13

Tom Lane wrote:

> I tried this program on Mac OS X 10.4.10 (the current release) and found
> out that what that OS mostly returns is the encoding portion of the
> locale name, for instance

FWIW I tried this program here, and I get

C                       ... ANSI_X3.4-1968 - NO MATCH
POSIX                   ... ANSI_X3.4-1968 - NO MATCH

Note the funny name.  Trying initdb with LC_ALL=C correctly uses
SQL_ASCII (I saw the special case in chklocale.c), but I'm wondering if
we should list those names explicitely.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Enforcing database encoding and locale match

From

Tom Lane

Date:

01 October 2007, 14:26:56

Alvaro Herrera <alvherre@commandprompt.com> writes:
> FWIW I tried this program here, and I get

> C                       ... ANSI_X3.4-1968 - NO MATCH
> POSIX                   ... ANSI_X3.4-1968 - NO MATCH

> Note the funny name.  Trying initdb with LC_ALL=C correctly uses
> SQL_ASCII (I saw the special case in chklocale.c), but I'm wondering if
> we should list those names explicitely.

Since we're already special-casing C/POSIX, I don't see a need.
It looks a bit hopeless to keep up with all the possibilities anyway
--- by my count we've tested four different platforms so far and
gotten four different answers for the CODESET name for C :-(

Linux    ANSI_X3.4-1968
Darwin    (empty)
Solaris    646
HP-UX    roman8
        regards, tom lane

Re: Enforcing database encoding and locale match

From

Zdenek Kotala

Date:

05 October 2007, 17:11:09

Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> FWIW I tried this program here, and I get
> 
>> C                       ... ANSI_X3.4-1968 - NO MATCH
>> POSIX                   ... ANSI_X3.4-1968 - NO MATCH
> 
>> Note the funny name.  Trying initdb with LC_ALL=C correctly uses
>> SQL_ASCII (I saw the special case in chklocale.c), but I'm wondering if
>> we should list those names explicitely.
> 
> Since we're already special-casing C/POSIX, I don't see a need.
> It looks a bit hopeless to keep up with all the possibilities anyway
> --- by my count we've tested four different platforms so far and
> gotten four different answers for the CODESET name for C :-(
> 
> Linux    ANSI_X3.4-1968
> Darwin    (empty)
> Solaris    646
> HP-UX    roman8

There is some useful link:

http://www.simeji.com/bun/characterencoding_jvm142.txt

with aliases. I also checked all possible locales on Solaris nevada and 
there are two new aliases and probably unsupported TIS620.2533 (thai) 
encoding by postgres.

Patch with new aliases attached.
    Zdenek

===================================================================
RCS file: /zfs_data/cvs_pgsql/cvsroot/pgsql/src/port/chklocale.c,v
retrieving revision 1.4
diff -c -r1.4 chklocale.c
*** src/port/chklocale.c        2007/10/03 17:16:39     1.4
--- src/port/chklocale.c        2007/10/05 17:55:10
***************
*** 127,132 ****
--- 127,133 ----        {PG_WIN874, "???"},  #endif        {PG_WIN1251, "CP1251"},
+       {PG_WIN1251, "ansi-1251"},        {PG_WIN866, "CP866"},
        {PG_ISO_8859_5, "ISO-8859-5"},
***************
*** 152,157 ****
--- 153,159 ----        {PG_BIG5, "BIG5"},        {PG_BIG5, "BIG5HKSCS"},        {PG_BIG5, "CP950"},
+       {PG_BIG5, "Big5-HKSCS"},
        {PG_GBK, "GBK"},        {PG_GBK, "CP936"},