Thread: Enforcing database encoding and locale match
I was reminded again just now of the bad consequences of selecting a database encoding that is not compatible with your LC_CTYPE setting: http://archives.postgresql.org/pgsql-bugs/2007-09/msg00158.php Aside from that one, which is perilously close to being a denial of service attack, there are known problems with sorting, upper()/lower() behavior, etc etc. We're going to keep hearing those types of complaints until we do something about enforcing that people don't use an incompatible encoding. This has been discussed before, of course, and has foundered on the problem that there's no very reliable/portable way to determine what encoding is implied by LC_CTYPE. We do have code in initdb that purports to determine this on common platforms, but I've never trusted it very much, because it isn't stressed hard in common use. So the problem is how to develop some trust in it. It occurs me that what we could do is put that code into CREATE DATABASE, but have it throw a WARNING not an ERROR if it thinks the encoding doesn't match the locale. That would be sufficiently in people's faces that we'd hear about it if it didn't work. After a release cycle or so of not hearing complaints, we could promote the warning to an error. Another possibility is to treat the case as a WARNING if you're superuser and an ERROR if you're not. This would satisfy people who are uncomfortable with the idea that CREATEDB privilege comes with a built-in denial-of-service attack, while still leaving a loophole for anyone for whom the test didn't work properly. Comments? regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Another possibility is to treat the case as a WARNING if you're > superuser and an ERROR if you're not. This would satisfy people > who are uncomfortable with the idea that CREATEDB privilege comes > with a built-in denial-of-service attack, while still leaving a > loophole for anyone for whom the test didn't work properly. That sounds like a good combination -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark wrote: > "Tom Lane" <tgl@sss.pgh.pa.us> writes: > > >> Another possibility is to treat the case as a WARNING if you're >> superuser and an ERROR if you're not. This would satisfy people >> who are uncomfortable with the idea that CREATEDB privilege comes >> with a built-in denial-of-service attack, while still leaving a >> loophole for anyone for whom the test didn't work properly. >> > > That sounds like a good combination > > +1 cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Gregory Stark wrote: >> "Tom Lane" <tgl@sss.pgh.pa.us> writes: >>> Another possibility is to treat the case as a WARNING if you're >>> superuser and an ERROR if you're not. This would satisfy people >>> who are uncomfortable with the idea that CREATEDB privilege comes >>> with a built-in denial-of-service attack, while still leaving a >>> loophole for anyone for whom the test didn't work properly. >> >> That sounds like a good combination > +1 After further experimentation I want to change the proposal a bit. AFAICS, if we recognize the nl_langinfo(CODESET) result, there is no reason not to trust the answer, so we might as well throw an error always. The case that is problematic is where we can get a CODESET string but we don't recognize it. In this case it seems appropriate to do ereport(WARNING, (errmsg("could not determine encoding for locale \"%s\": codeset is \"%s\"", ctype, sys), errdetail("Please report this to <pgsql-bugs@postgresql.org>."))); and then let the user do what he wants. There need to be two exceptions to the error-on-mismatch policy. First off, if the locale is C/POSIX then we can allow any encoding. Second, it appears that we have to allow SQL_ASCII encoding to be selected regardless of locale; if we don't, the "make installcheck" regression tests fail, because they try to do exactly that; and I'm sure that there are other users out there who don't (think they) care about encoding. This is not quite as bad as the generic mismatch case, because the backend will never try to do encoding conversion and so the recursive-error panic can't happen. But you could still have unexpected sorting behavior and probably index corruption. What I propose is that we allow SQL_ASCII databases to be created when the locale is not C, but only by superusers. Comments? regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> Gregory Stark wrote: >>> "Tom Lane" <tgl@sss.pgh.pa.us> writes: >>>> Another possibility is to treat the case as a WARNING if you're >>>> superuser and an ERROR if you're not. This would satisfy people >>>> who are uncomfortable with the idea that CREATEDB privilege comes >>>> with a built-in denial-of-service attack, while still leaving a >>>> loophole for anyone for whom the test didn't work properly. >>> That sounds like a good combination >> +1 > > After further experimentation I want to change the proposal a bit. > AFAICS, if we recognize the nl_langinfo(CODESET) result, there is > no reason not to trust the answer, so we might as well throw an > error always. Agree. Code seems to be OK and on POSIX compatible OS it should be work. I attached testing code. With following command for LOCALE in `locale -a`; do ./a.out $LOCALE ; done is should be possible to verify status on all unix OS. On Solaris I got following problematic locales: C ... 646 - NO MATCH POSIX ... 646 - NO MATCH cs ... 646 - NO MATCH da ... 646 - NO MATCH et ... 646 - NO MATCH it ... 646 - NO MATCH ja_JP.PCK ... PCK - NO MATCH ko ... 646 - NO MATCH no ... 646 - NO MATCH ru ... 646 - NO MATCH sl ... 646 - NO MATCH sv ... 646 - NO MATCH tr ... 646 - NO MATCH zh.GBK ... GBK - NO MATCH zh_CN.GB18030 ... GB18030 - NO MATCH zh_CN.GB18030@pinyin ... GB18030 - NO MATCH zh_CN.GB18030@radical ... GB18030 - NO MATCH zh_CN.GB18030@stroke ... GB18030 - NO MATCH zh_CN.GBK ... GBK - NO MATCH zh_CN.GBK@pinyin ... GBK - NO MATCH zh_CN.GBK@radical ... GBK - NO MATCH zh_CN.GBK@stroke ... GBK - NO MATCH > The case that is problematic is where we can get a > CODESET string but we don't recognize it. In this case it seems > appropriate to do > > ereport(WARNING, > (errmsg("could not determine encoding for locale \"%s\": codeset is \"%s\"", > ctype, sys), > errdetail("Please report this to <pgsql-bugs@postgresql.org>."))); > > and then let the user do what he wants. The another question is what do when we know that this codeset/encoding is not supported by postgres. Maybe extend encoding match structure to struct encoding_match { enum pg_enc pg_enc_code; const char *system_enc_name; bool supported; }; and in case when it is unsupported then generates error. In case when codeset does not match anyway then generates only warning. Zdenek #include <locale.h> #include <langinfo.h> #include "postgres_fe.h" //#include "miscadmin.h" #include "mb/pg_wchar.h" /* * Checks whether the encoding selected for PostgreSQL and the * encoding used by the system locale match. */ struct encoding_match { enum pg_enc pg_enc_code; const char *system_enc_name; }; static const struct encoding_match encoding_match_list[] = { {PG_EUC_JP, "EUC-JP"}, {PG_EUC_JP, "eucJP"}, {PG_EUC_JP, "IBM-eucJP"}, {PG_EUC_JP, "sdeckanji"}, {PG_EUC_CN, "EUC-CN"}, {PG_EUC_CN, "eucCN"}, {PG_EUC_CN, "IBM-eucCN"}, {PG_EUC_CN, "GB2312"}, {PG_EUC_CN, "dechanzi"}, {PG_EUC_KR, "EUC-KR"}, {PG_EUC_KR, "eucKR"}, {PG_EUC_KR, "IBM-eucKR"}, {PG_EUC_KR, "deckorean"}, {PG_EUC_KR, "5601"}, {PG_EUC_TW, "EUC-TW"}, {PG_EUC_TW, "eucTW"}, {PG_EUC_TW, "IBM-eucTW"}, {PG_EUC_TW, "cns11643"}, #ifdef NOT_VERIFIED {PG_JOHAB, "???"}, #endif {PG_UTF8, "UTF-8"}, {PG_UTF8, "utf8"}, {PG_LATIN1, "ISO-8859-1"}, {PG_LATIN1, "ISO8859-1"}, {PG_LATIN1, "iso88591"}, {PG_LATIN2, "ISO-8859-2"}, {PG_LATIN2, "ISO8859-2"}, {PG_LATIN2, "iso88592"}, {PG_LATIN3, "ISO-8859-3"}, {PG_LATIN3, "ISO8859-3"}, {PG_LATIN3, "iso88593"}, {PG_LATIN4, "ISO-8859-4"}, {PG_LATIN4, "ISO8859-4"}, {PG_LATIN4, "iso88594"}, {PG_LATIN5, "ISO-8859-9"}, {PG_LATIN5, "ISO8859-9"}, {PG_LATIN5, "iso88599"}, {PG_LATIN6, "ISO-8859-10"}, {PG_LATIN6, "ISO8859-10"}, {PG_LATIN6, "iso885910"}, {PG_LATIN7, "ISO-8859-13"}, {PG_LATIN7, "ISO8859-13"}, {PG_LATIN7, "iso885913"}, {PG_LATIN8, "ISO-8859-14"}, {PG_LATIN8, "ISO8859-14"}, {PG_LATIN8, "iso885914"}, {PG_LATIN9, "ISO-8859-15"}, {PG_LATIN9, "ISO8859-15"}, {PG_LATIN9, "iso885915"}, {PG_LATIN10, "ISO-8859-16"}, {PG_LATIN10, "ISO8859-16"}, {PG_LATIN10, "iso885916"}, {PG_WIN1252, "CP1252"}, {PG_WIN1253, "CP1253"}, {PG_WIN1254, "CP1254"}, {PG_WIN1255, "CP1255"}, {PG_WIN1256, "CP1256"}, {PG_WIN1257, "CP1257"}, {PG_WIN1258, "CP1258"}, #ifdef NOT_VERIFIED {PG_WIN874, "???"}, #endif {PG_KOI8R, "KOI8-R"}, {PG_WIN1251, "CP1251"}, {PG_WIN866, "CP866"}, {PG_ISO_8859_5, "ISO-8859-5"}, {PG_ISO_8859_5, "ISO8859-5"}, {PG_ISO_8859_5, "iso88595"}, {PG_ISO_8859_6, "ISO-8859-6"}, {PG_ISO_8859_6, "ISO8859-6"}, {PG_ISO_8859_6, "iso88596"}, {PG_ISO_8859_7, "ISO-8859-7"}, {PG_ISO_8859_7, "ISO8859-7"}, {PG_ISO_8859_7, "iso88597"}, {PG_ISO_8859_8, "ISO-8859-8"}, {PG_ISO_8859_8, "ISO8859-8"}, {PG_ISO_8859_8, "iso88598"}, {PG_SQL_ASCII, NULL} /* end marker */ }; static char * get_encoding_from_locale(const char *ctype) { char *save; char *sys; save = setlocale(LC_CTYPE, NULL); if (!save) return NULL; save = strdup(save); setlocale(LC_CTYPE, ctype); sys = nl_langinfo(CODESET); sys = strdup(sys); setlocale(LC_CTYPE, save); free(save); return sys; } static int find_matching_encoding(const char *ctype, const char *sys) { // char *sys; int i; sys = get_encoding_from_locale(ctype); for (i = 0; encoding_match_list[i].system_enc_name; i++) { if (strcasecmp(sys, encoding_match_list[i].system_enc_name) == 0) { // free(sys); return encoding_match_list[i].pg_enc_code; } } // free(sys); return -1; } int main(int argc, char **argv) { int enc; char *sys; if( argc != 2) { fprintf(stderr,"Invalid number of arguments.\n"); return 1; } printf("%-23s ... ", argv[1]); sys = get_encoding_from_locale(argv[1]); printf("%-10s - ",sys); enc=find_matching_encoding(argv[1], sys); if( enc != -1 ) printf("OK\n"); else printf("NO MATCH\n"); free(sys); return 0; }
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes: > On Solaris I got following problematic locales: > C ... 646 - NO MATCH > POSIX ... 646 - NO MATCH > cs ... 646 - NO MATCH > da ... 646 - NO MATCH > et ... 646 - NO MATCH > it ... 646 - NO MATCH > ja_JP.PCK ... PCK - NO MATCH > ko ... 646 - NO MATCH > no ... 646 - NO MATCH > ru ... 646 - NO MATCH > sl ... 646 - NO MATCH > sv ... 646 - NO MATCH > tr ... 646 - NO MATCH > zh.GBK ... GBK - NO MATCH > zh_CN.GB18030 ... GB18030 - NO MATCH > zh_CN.GB18030@pinyin ... GB18030 - NO MATCH > zh_CN.GB18030@radical ... GB18030 - NO MATCH > zh_CN.GB18030@stroke ... GB18030 - NO MATCH > zh_CN.GBK ... GBK - NO MATCH > zh_CN.GBK@pinyin ... GBK - NO MATCH > zh_CN.GBK@radical ... GBK - NO MATCH > zh_CN.GBK@stroke ... GBK - NO MATCH Not sure what 646 or PCK are, but we don't need to worry about GB18030 or GBK, because those aren't allowed backend encodings. > The another question is what do when we know that this codeset/encoding > is not supported by postgres. I don't really see a need to worry about this case. The proposed encoding will already have been checked to be sure it's one that the backend supports. All we need is to be able to recognize any variant spelling of the encodings we allow. regards, tom lane
Tom Lane wrote: > Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes: >> On Solaris I got following problematic locales: > >> C ... 646 - NO MATCH >> POSIX ... 646 - NO MATCH >> cs ... 646 - NO MATCH >> da ... 646 - NO MATCH >> et ... 646 - NO MATCH >> it ... 646 - NO MATCH >> ja_JP.PCK ... PCK - NO MATCH >> ko ... 646 - NO MATCH >> no ... 646 - NO MATCH >> ru ... 646 - NO MATCH >> sl ... 646 - NO MATCH >> sv ... 646 - NO MATCH >> tr ... 646 - NO MATCH >> zh.GBK ... GBK - NO MATCH >> zh_CN.GB18030 ... GB18030 - NO MATCH >> zh_CN.GB18030@pinyin ... GB18030 - NO MATCH >> zh_CN.GB18030@radical ... GB18030 - NO MATCH >> zh_CN.GB18030@stroke ... GB18030 - NO MATCH >> zh_CN.GBK ... GBK - NO MATCH >> zh_CN.GBK@pinyin ... GBK - NO MATCH >> zh_CN.GBK@radical ... GBK - NO MATCH >> zh_CN.GBK@stroke ... GBK - NO MATCH > > Not sure what 646 or PCK are, but we don't need to worry about GB18030 > or GBK, because those aren't allowed backend encodings. PCK is Japanese Shift-JIS encoding. (see http://www.inter-locale.com/whitepaper/learn/learn_to_type.html) http://en.wikipedia.org/wiki/Shift_JIS 646 looks like ISO646. I will check it. http://en.wikipedia.org/wiki/ISO646 > >> The another question is what do when we know that this codeset/encoding >> is not supported by postgres. > > I don't really see a need to worry about this case. The proposed encoding > will already have been checked to be sure it's one that the backend supports. > All we need is to be able to recognize any variant spelling of the > encodings we allow. OK. Maybe would be good put mapping into text file (e.g. encoding.map) into share directory. (Similar to tz_abbrev) Zdenek
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes: > On Solaris I got following problematic locales: [...] I tried this program on Mac OS X 10.4.10 (the current release) and found out that what that OS mostly returns is the encoding portion of the locale name, for instance sv_SE.ISO8859-15 ... ISO8859-15 - OK sv_SE.UTF-8 ... UTF-8 - OK tr_TR ... - NO MATCH tr_TR.ISO8859-9 ... ISO8859-9 - OK tr_TR.UTF-8 ... UTF-8 - OK uk_UA ... - NO MATCH uk_UA.ISO8859-5 ... ISO8859-5 - OK uk_UA.KOI8-U ... KOI8-U - NO MATCH uk_UA.UTF-8 ... UTF-8 - OK zh_CN ... - NO MATCH zh_CN.eucCN ... eucCN - OK zh_CN.GB18030 ... GB18030 - NO MATCH zh_CN.GB2312 ... GB2312 - OK zh_CN.GBK ... GBK - NO MATCH zh_CN.UTF-8 ... UTF-8 - OK zh_HK ... - NO MATCH zh_HK.Big5HKSCS ... Big5HKSCS - NO MATCH zh_HK.UTF-8 ... UTF-8 - OK zh_TW ... - NO MATCH zh_TW.Big5 ... Big5 - NO MATCH zh_TW.UTF-8 ... UTF-8 - OK C ... US-ASCII - NO MATCH POSIX ... US-ASCII - NO MATCH They didn't *quite* hard-wire it that way, as evidenced by the C/POSIX results, but certainly the empty-string results are entirely useless. Perhaps we should file a bug with Apple. However, some poking around in /usr/share/locale indicates that there's a consistent interpretation to be made: g42:/usr/share/locale tgl$ ls -l ??_??/LC_CTYPE lrwxr-xr-x 1 root wheel 17 Apr 26 2006 af_ZA/LC_CTYPE@ -> ../UTF-8/LC_CTYPE -r--r--r-- 1 root wheel 3272 Mar 20 2005 am_ET/LC_CTYPE lrwxr-xr-x 1 root wheel 17 Apr 26 2006 be_BY/LC_CTYPE@ -> ../UTF-8/LC_CTYPE lrwxr-xr-x 1 root wheel 17 Apr 26 2006 bg_BG/LC_CTYPE@ -> ../UTF-8/LC_CTYPE lrwxr-xr-x 1 root wheel 17 Apr 26 2006 ca_ES/LC_CTYPE@ -> ../UTF-8/LC_CTYPE lrwxr-xr-x 1 root wheel 17 Apr 26 2006 cs_CZ/LC_CTYPE@ -> ../UTF-8/LC_CTYPE lrwxr-xr-x 1 root wheel 17 Apr 26 2006 da_DK/LC_CTYPE@ -> ../UTF-8/LC_CTYPE lrwxr-xr-x 1 root wheel 17 Apr 26 2006 de_AT/LC_CTYPE@ -> ../UTF-8/LC_CTYPE lrwxr-xr-x 1 root wheel 17 Apr 26 2006 de_CH/LC_CTYPE@ -> ../UTF-8/LC_CTYPE lrwxr-xr-x 1 root wheel 17 Apr 26 2006 de_DE/LC_CTYPE@ -> ../UTF-8/LC_CTYPE lrwxr-xr-x 1 root wheel 17 Apr 26 2006 el_GR/LC_CTYPE@ -> ../UTF-8/LC_CTYPE lrwxr-xr-x 1 root wheel 17 Apr 26 2006 en_AU/LC_CTYPE@ -> ../UTF-8/LC_CTYPE lrwxr-xr-x 1 root wheel 17 Apr 26 2006 en_CA/LC_CTYPE@ -> ../UTF-8/LC_CTYPE (etc etc) The only one that's not actually a symlink to the standard UTF-8 ctype is am_ET/LC_CTYPE, which is identical to am_ET.UTF-8/LC_CTYPE. So I think we can get away with something like #ifdef __darwin__if (strlen(sys) == 0) // assume UTF8 #endif I suppose we'll need a few more hacks like this as the beta-test results begin to roll in ... regards, tom lane
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes: > The another question is what do when we know that this codeset/encoding > is not supported by postgres. Ah, I finally grasped what you were on about here. As CVS HEAD stands, if you run initdb in an unrecognized locale, you get something like $ LANG=zh_CN.GB18030 initdb The files belonging to this database system will be owned by user "tgl". This user must also own the server process. The database cluster will be initialized with locale zh_CN.GB18030. could not determine encoding for locale "zh_CN.GB18030": codeset is "GB18030" initdb: could not find suitable encoding for locale "zh_CN.GB18030" Rerun initdb with the -E option. Try "initdb --help" for more information. $ which is OK, but if you override it incorrectly, it'll let you do so: $ LANG=zh_CN.GB18030 initdb -E utf8 The files belonging to this database system will be owned by user "tgl". This user must also own the server process. The database cluster will be initialized with locale zh_CN.GB18030. could not determine encoding for locale "zh_CN.GB18030": codeset is "GB18030" ... but it presses merrily along ... leading to a database which is in fact broken. To prevent this, I think it would be sufficient to add entries to the table for our known frontend-only encodings. It's reasonable to assume that anyone who wants to run Postgres will probably have a default locale that uses *some* encoding that we support; otherwise he's going to have a pretty unpleasant experience anyway. If the function returns a frontend-only encoding value then initdb will fail in a good way, since it won't let the user select that as a database encoding. So I don't think we need an explicit concept of an unsupported encoding in the table, just some more entries. regards, tom lane
Tom Lane wrote: > I tried this program on Mac OS X 10.4.10 (the current release) and found > out that what that OS mostly returns is the encoding portion of the > locale name, for instance FWIW I tried this program here, and I get C ... ANSI_X3.4-1968 - NO MATCH POSIX ... ANSI_X3.4-1968 - NO MATCH Note the funny name. Trying initdb with LC_ALL=C correctly uses SQL_ASCII (I saw the special case in chklocale.c), but I'm wondering if we should list those names explicitely. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > FWIW I tried this program here, and I get > C ... ANSI_X3.4-1968 - NO MATCH > POSIX ... ANSI_X3.4-1968 - NO MATCH > Note the funny name. Trying initdb with LC_ALL=C correctly uses > SQL_ASCII (I saw the special case in chklocale.c), but I'm wondering if > we should list those names explicitely. Since we're already special-casing C/POSIX, I don't see a need. It looks a bit hopeless to keep up with all the possibilities anyway --- by my count we've tested four different platforms so far and gotten four different answers for the CODESET name for C :-( Linux ANSI_X3.4-1968 Darwin (empty) Solaris 646 HP-UX roman8 regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: >> FWIW I tried this program here, and I get > >> C ... ANSI_X3.4-1968 - NO MATCH >> POSIX ... ANSI_X3.4-1968 - NO MATCH > >> Note the funny name. Trying initdb with LC_ALL=C correctly uses >> SQL_ASCII (I saw the special case in chklocale.c), but I'm wondering if >> we should list those names explicitely. > > Since we're already special-casing C/POSIX, I don't see a need. > It looks a bit hopeless to keep up with all the possibilities anyway > --- by my count we've tested four different platforms so far and > gotten four different answers for the CODESET name for C :-( > > Linux ANSI_X3.4-1968 > Darwin (empty) > Solaris 646 > HP-UX roman8 There is some useful link: http://www.simeji.com/bun/characterencoding_jvm142.txt with aliases. I also checked all possible locales on Solaris nevada and there are two new aliases and probably unsupported TIS620.2533 (thai) encoding by postgres. Patch with new aliases attached. Zdenek =================================================================== RCS file: /zfs_data/cvs_pgsql/cvsroot/pgsql/src/port/chklocale.c,v retrieving revision 1.4 diff -c -r1.4 chklocale.c *** src/port/chklocale.c 2007/10/03 17:16:39 1.4 --- src/port/chklocale.c 2007/10/05 17:55:10 *************** *** 127,132 **** --- 127,133 ---- {PG_WIN874, "???"}, #endif {PG_WIN1251, "CP1251"}, + {PG_WIN1251, "ansi-1251"}, {PG_WIN866, "CP866"}, {PG_ISO_8859_5, "ISO-8859-5"}, *************** *** 152,157 **** --- 153,159 ---- {PG_BIG5, "BIG5"}, {PG_BIG5, "BIG5HKSCS"}, {PG_BIG5, "CP950"}, + {PG_BIG5, "Big5-HKSCS"}, {PG_GBK, "GBK"}, {PG_GBK, "CP936"},