Thread: Strange query results with invalid multibyte character

Strange query results with invalid multibyte character

From
"Joost Kraaijeveld"
Date:
Hi,

I have a strange problem.

I have 2 database, both created with:
CREATE DATABASE muntdev WITH OWNER = postgres ENCODING = 'UTF8' TABLESPACE = pg_default;

Running the queries below on the first database is OK and do what I expect.

If I create a backup of the first datase and restore that backup in the second database the following happens:
The first query (see below) returns 17 records, all containing 'Boonk%'.
The second query (see below)returns 'ERROR:  invalid multibyte character for locale'

Query 1:
select lastname from salesordercustomer where lower(lastname) like 'boonk%'

Query 2:
select lastname from salesordercustomer where upper(lastname) like 'BOONK%'

Both database are running *exactly* the same version of Debian Etch AMD64 and *exactly* the same version of PostgreSQL
(8.1.4for AMD64) .... 

Can anyone explain this to me?

TIA

Joost

Re: Strange query results with invalid multibyte

From
Gevik Babakhani
Date:
Hi Joost. (hoe gaat het ermee?)

I would like to test this too. Could you please provide the data you are
inserting into the database.

Regards,
Gevik


On Wed, 2006-09-27 at 13:13 +0200, Joost Kraaijeveld wrote:
> Hi,
>
> I have a strange problem.
>
> I have 2 database, both created with:
> CREATE DATABASE muntdev WITH OWNER = postgres ENCODING = 'UTF8' TABLESPACE = pg_default;
>
> Running the queries below on the first database is OK and do what I expect.
>
> If I create a backup of the first datase and restore that backup in the second database the following happens:
> The first query (see below) returns 17 records, all containing 'Boonk%'.
> The second query (see below)returns 'ERROR:  invalid multibyte character for locale'
>
> Query 1:
> select lastname from salesordercustomer where lower(lastname) like 'boonk%'
>
> Query 2:
> select lastname from salesordercustomer where upper(lastname) like 'BOONK%'
>
> Both database are running *exactly* the same version of Debian Etch AMD64 and *exactly* the same version of
PostgreSQL(8.1.4 for AMD64) .... 
>
> Can anyone explain this to me?
>
> TIA
>
> Joost
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

--
Regards,
Gevik Babakhani




Re: Strange query results with invalid multibyte character

From
Tom Lane
Date:
"Joost Kraaijeveld" <J.Kraaijeveld@Askesis.nl> writes:
> I have 2 database, both created with:
> CREATE DATABASE muntdev WITH OWNER = postgres ENCODING = 'UTF8' TABLESPACE = pg_default;

But have they got the same locale settings (lc_collate, lc_ctype)?

            regards, tom lane

Re: Strange query results with invalid multibyte

From
Joost Kraaijeveld
Date:
Hi Tom,

On Wed, 2006-09-27 at 11:00 -0400, Tom Lane wrote:
> "Joost Kraaijeveld" <J.Kraaijeveld@Askesis.nl> writes:
> > I have 2 database, both created with:
> > CREATE DATABASE muntdev WITH OWNER = postgres ENCODING = 'UTF8' TABLESPACE = pg_default;
>
> But have they got the same locale settings (lc_collate, lc_ctype)?

According to the postgresql.conf of the *working* database the locales
are:
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'

According to the other obe:
lc_messages = 'en_US'
lc_monetary = 'en_US'
lc_numeric = 'en_US'
lc_time = 'en_US'


Could this be the difference? Is there any documentation available
somewhere, on how these locale settings work and  interact (in
combination with the server and/or client platform???), besides the
explanation in the PostgreSQL manual
(http://www.postgresql.org/docs/8.1/interactive/charset.html#LOCALE)
(which is to terse for me to understand)?

TIA


--
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
web: www.askesis.nl

Re: Strange query results with invalid multibyte character

From
Tom Lane
Date:
Joost Kraaijeveld <J.Kraaijeveld@Askesis.nl> writes:
> On Wed, 2006-09-27 at 11:00 -0400, Tom Lane wrote:
>> But have they got the same locale settings (lc_collate, lc_ctype)?

> According to the postgresql.conf of the *working* database the locales
> are:
> lc_messages = 'C'
> lc_monetary = 'C'
> lc_numeric = 'C'
> lc_time = 'C'

> According to the other obe:
> lc_messages = 'en_US'
> lc_monetary = 'en_US'
> lc_numeric = 'en_US'
> lc_time = 'en_US'

You did not show us the settings that actually count, but based on this
I'm guessing they are en_US also.  What you need to find out next is
what character set encoding that locale implies on your machine.  I'm
betting it's not utf8 though :-(

            regards, tom lane

Re: Strange query results with invalid multibyte

From
Joost Kraaijeveld
Date:

On Wed, 2006-09-27 at 12:10 -0400, Tom Lane wrote:
> Joost Kraaijeveld <J.Kraaijeveld@Askesis.nl> writes:
> > On Wed, 2006-09-27 at 11:00 -0400, Tom Lane wrote:
> >> But have they got the same locale settings (lc_collate, lc_ctype)?
>
> > According to the postgresql.conf of the *working* database the locales
> > are:
> > lc_messages = 'C'
> > lc_monetary = 'C'
> > lc_numeric = 'C'
> > lc_time = 'C'
>
> > According to the other obe:
> > lc_messages = 'en_US'
> > lc_monetary = 'en_US'
> > lc_numeric = 'en_US'
> > lc_time = 'en_US'
>
> You did not show us the settings that actually count, but based on this
> I'm guessing they are en_US also
Ah, sorry: "show lc_collate" and "show lc_ctype" confirm your guess.

> What you need to find out next is
> what character set encoding that locale implies on your machine.  I'm
> betting it's not utf8 though :-(
I am not sure what you mean by that but maybe this helps:

the output of "locale" on the working machine is:

LANG=
LANGUAGE=en_US:en_GB
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

The output of the same command on the non-working machine:

LANG=en_US
LANGUAGE=en_NL:en_US:en_GB:en
LC_CTYPE="en_US"
LC_NUMERIC="en_US"
LC_TIME="en_US"
LC_COLLATE="en_US"
LC_MONETARY="en_US"
LC_MESSAGES="en_US"
LC_PAPER="en_US"
LC_NAME="en_US"
LC_ADDRESS="en_US"
LC_TELEPHONE="en_US"
LC_MEASUREMENT="en_US"
LC_IDENTIFICATION="en_US"
LC_ALL=

If this is not what you mean, could you help me in the right direction?


--
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
web: www.askesis.nl

Re: Strange query results with invalid multibyte

From
Tom Lane
Date:
Joost Kraaijeveld <J.Kraaijeveld@Askesis.nl> writes:
> On Wed, 2006-09-27 at 12:10 -0400, Tom Lane wrote:
>> What you need to find out next is
>> what character set encoding that locale implies on your machine.  I'm
>> betting it's not utf8 though :-(

> If this is not what you mean, could you help me in the right direction?

Try this command:

LANG=en_US locale charmap

            regards, tom lane

Re: Strange query results with invalid multibyte

From
Joost Kraaijeveld
Date:
Hi Tom,

Thanks for putting up with the questions.

On Wed, 2006-09-27 at 14:58 -0400, Tom Lane wrote:
> LANG=en_US locale charmap
>
I have done this on both machines:

The working machine says: ISO-8859-1
The not working machine says: ISO-8859-1

I still do not understand what is happening and maybe more important,
how I can solve this (without re-installing the OS / database).

A better understanding of how the server OS, database, codepages, client
OS, charmaps etc work would be nice....

--
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
web: www.askesis.nl

Re: Strange query results with invalid multibyte

From
Tom Lane
Date:
Joost Kraaijeveld <J.Kraaijeveld@Askesis.nl> writes:
> On Wed, 2006-09-27 at 14:58 -0400, Tom Lane wrote:
>> LANG=en_US locale charmap

> The working machine says: ISO-8859-1
> The not working machine says: ISO-8859-1

OK, so the problem is that you have a locale that expects ISO-8859-1
encoding, but the database is set to UTF8 encoding, and that means
it may contain characters that the locale can't deal with.
Locale-dependent operations ... like upper() ... will fail when that
happens.

If we had a more reliable way of telling what encoding the locale
expects, we'd probably forbid creation of databases with incompatible
encodings.

> I still do not understand what is happening and maybe more important,
> how I can solve this (without re-installing the OS / database).

If you want to use UTF8 then you're stuck with a re-initdb I'm afraid.

            regards, tom lane