Thread: Character encoding and string matches
I just upgraded our database from PostgreSQL 7.4.7 to 8.1.0. Our original character enconding was set to the default of 7.4, being SQL_ASCII. Along the way I realized our data does have non-ASCII strings, so I decided to use this upgrade to fix this setting. Our new 8.1.0 database now has the LATIN1 encoding. The pg_dump / pg_restore went fine with respect to the encodings. The LATIN1 characters are correct at the places where I expect them. Most of our data is supplied the customers, and sometimes we see bogus characters in strings. It's something we live with and we always report this to them in the hope it will not happen again. During the pg_restore, it complained about failing to create several indexes because of duplicates. Inspecting these cases showed that the bogus characters are to blame. It seems that these characters (that are probably not defined in LATIN1) are now turned into wildcards: they will match any other character. Is this an expected result from running a non SQL_ASCII database? An interesting example is this (I hope my E-mail character encoding is set up correctly): apps\ì_dir\ apps\ý_dir\ There are 2 different non-ascii characters before the "_dir". Postgres treats the first one as a wildcard, but not the second one. Strangely enough both characters are in the ISO-8859-1 table: 0xEC and 0xFD. Sincerely, -- Richard van den Berg, CISSP ------------------------------------------- Trust Factory B.V. | www.dna-portal.net Bazarstraat 44a | www.trust-factory.com 2518AK The Hague | Phone: +31 70 3620684 The Netherlands | Fax : +31 70 3603009 -------------------------------------------
Richard van den Berg wrote: > During the > pg_restore, it complained about failing to create several indexes > because of duplicates. Inspecting these cases showed that the bogus > characters are to blame. It seems that these characters (that are > probably not defined in LATIN1) are now turned into wildcards: they will > match any other character. Is this an expected result from running a non > SQL_ASCII database? It should depend on the locale. Can you tell the results of `show lc_ctype` and `show lc_collate`? -- Daniel PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org
"Daniel Verite" <daniel@manitou-mail.org> writes: > Richard van den Berg wrote: >> During the >> pg_restore, it complained about failing to create several indexes >> because of duplicates. Inspecting these cases showed that the bogus >> characters are to blame. It seems that these characters (that are >> probably not defined in LATIN1) are now turned into wildcards: they will >> match any other character. Is this an expected result from running a non >> SQL_ASCII database? > It should depend on the locale. Can you tell the results of > `show lc_ctype` and `show lc_collate`? Yeah, there are a lot of locale definitions out there that seem to simply fail on unexpected data :-( If you don't actually need locale-specific sorting or upper()/lower() behavior, a safe choice is to run with LC_CTYPE and LC_COLLATE set to 'C' while you set the database encoding to whatever you need. This will support encoding validity checks and transcoding to different client encodings, but sorting and upper/lower will just use the default C-locale behavior. regards, tom lane
"Daniel Verite" <daniel@manitou-mail.org> wrote on 07-12-2005 12:30:17:
> It should depend on the locale. Can you tell the results of
> `show lc_ctype` and `show lc_collate`?
I've fried one of the disks. When the new one arrives, I'll send you the results (will be sometime next week).
Sincerely,
Richard van den Berg