Thread: Character encoding and string matches

Character encoding and string matches

From
Richard van den Berg
Date:
I just upgraded our database from PostgreSQL 7.4.7 to 8.1.0. Our
original character enconding was set to the default of 7.4, being
SQL_ASCII. Along the way I realized our data does have non-ASCII
strings, so I decided to use this upgrade to fix this setting. Our new
8.1.0 database now has the LATIN1 encoding. The pg_dump / pg_restore
went fine with respect to the encodings. The LATIN1 characters are
correct at the places where I expect them.

Most of our data is supplied the customers, and sometimes we see bogus
characters in strings. It's something we live with and we always report
this to them in the hope it will not happen again. During the
pg_restore, it complained about failing to create several indexes
because of duplicates. Inspecting these cases showed that the bogus
characters are to blame. It seems that these characters (that are
probably not defined in LATIN1) are now turned into wildcards: they will
match any other character. Is this an expected result from running a non
SQL_ASCII database?

An interesting example is this (I hope my E-mail character encoding is
set up correctly):

apps\ì_dir\
apps\ý_dir\

There are 2 different non-ascii characters before the "_dir". Postgres
treats the first one as a wildcard, but not the second one. Strangely
enough both characters are in the ISO-8859-1 table: 0xEC and 0xFD.

Sincerely,

--
Richard van den Berg, CISSP
-------------------------------------------
Trust Factory B.V. |     www.dna-portal.net
Bazarstraat 44a    |  www.trust-factory.com
2518AK The Hague   |  Phone: +31 70 3620684
The Netherlands    |  Fax  : +31 70 3603009
-------------------------------------------

Re: Character encoding and string matches

From
"Daniel Verite"
Date:
    Richard van den Berg wrote:

> During the
> pg_restore, it complained about failing to create several indexes
> because of duplicates. Inspecting these cases showed that the bogus
> characters are to blame. It seems that these characters (that are
> probably not defined in LATIN1) are now turned into wildcards: they will
> match any other character. Is this an expected result from running a non
> SQL_ASCII database?

It should depend on the locale. Can you tell the results of
`show lc_ctype` and `show lc_collate`?

--
 Daniel
 PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org


Re: Character encoding and string matches

From
Tom Lane
Date:
"Daniel Verite" <daniel@manitou-mail.org> writes:
>     Richard van den Berg wrote:
>> During the
>> pg_restore, it complained about failing to create several indexes
>> because of duplicates. Inspecting these cases showed that the bogus
>> characters are to blame. It seems that these characters (that are
>> probably not defined in LATIN1) are now turned into wildcards: they will
>> match any other character. Is this an expected result from running a non
>> SQL_ASCII database?

> It should depend on the locale. Can you tell the results of
> `show lc_ctype` and `show lc_collate`?

Yeah, there are a lot of locale definitions out there that seem to
simply fail on unexpected data :-(

If you don't actually need locale-specific sorting or upper()/lower()
behavior, a safe choice is to run with LC_CTYPE and LC_COLLATE set to
'C' while you set the database encoding to whatever you need.  This
will support encoding validity checks and transcoding to different
client encodings, but sorting and upper/lower will just use the default
C-locale behavior.

            regards, tom lane

Re: Character encoding and string matches

From
Richard van den Berg
Date:

"Daniel Verite" <daniel@manitou-mail.org> wrote on 07-12-2005 12:30:17:
> It should depend on the locale. Can you tell the results of
> `show lc_ctype` and `show lc_collate`?

I've fried one of the disks. When the new one arrives, I'll send you the results (will be sometime next week).

Sincerely,

Richard van den Berg