Thread: character problem

character problem

From
Luca Ferrari
Date:
Hi,
I've got a postgres installation which is accessed by users thru a set
of perl CGIs. Unlickily, when the user enters some characters in a
description (string) field, like "èòéàùì", the string is stored in the
database with other characters instead of those digited by the user. It
seems as the locale setting of the client and of the server are
different, but in the postgresql.conf file I didn't find such setting.
My question is: how can I fix the problem? And most important, is it
possible to adjust the already inserted strings without having to update
the tables? I mean, is it possible to change the locale settings thus
the strings in the database are displayed rightly?

Thanks,
Luca
--
Luca Ferrari
fluca1978@infinito.it


Re: character problem

From
Hannes Dorbath
Date:
> like "èòéàùì", the string is stored in the database with other characters instead of those digited by the user.

Well, how does `èòéàùì' look like in your database? Like `èòéà ùì'?
In that case you most probably have initialized your cluster with an
UTF-8 locale while not using a / the correct client encoding.

You can check your database encoding with \l in psql.

If your perl clients don't handle UTF-8, just set a client encoding for
them.

SET client_encoding = LATIN1;

for example.

The client encoding can be set for a user, as default in
postgresql.conf, interactive using the command above and maybe other ways.

> And most important, is it possible to adjust the already inserted strings without having to update the tables?

Your first need to deside what is wrong at all -- the way the servers
stores your chars, or the way your client displays them.

> is it possible to change the locale settings thus the strings in the database are displayed rightly?

The client encoding can be set anytime, to change the cluster LOCALE or
database encoding you need to initDB + dump and restore AFAIK.


--
Regards,
Hannes Dorbath

Re: character problem

From
Luca Ferrari
Date:
Hannes Dorbath's cat, on 10/10/2005 9.28, walking on the keyboard wrote:
> The client encoding can be set anytime, to change the cluster LOCALE or
> database encoding you need to initDB + dump and restore AFAIK.

Do you mean dumping the database, destroying it and then recreating it?
I'm sorry, but I'm still new in postgresql, thus it is not clear if
meant that or another trick I don't know.

Thanks,
Luca


--
Luca Ferrari
fluca1978@infinito.it

Re: character problem

From
Hannes Dorbath
Date:
On 10.10.2005 10:34, Luca Ferrari wrote:
> Do you mean dumping the database, destroying it and then recreating it?
> I'm sorry, but I'm still new in postgresql, thus it is not clear if
> meant that or another trick I don't know.

Let us find the problem before jumping to solutions. Start psql
<database> <user>, type \l inside it and paste the output for the
database in question. Then type "show client_encoding;" and paste the
output as well, please.

--
Regards,
Hannes Dorbath

Re: character problem

From
Hannes Dorbath
Date:
On 10.10.2005 11:56, Luca Ferrari wrote:
> gestione_personale=# \l
>              List of databases
>         Name        |   Owner   | Encoding
> --------------------+-----------+-----------
>  gestione_database  | dbmanager | SQL_ASCII
>  gestione_personale | wwwrun    | SQL_ASCII
>  template0          | dbmanager | SQL_ASCII
>  template1          | dbmanager | SQL_ASCII
> (4 rows)
>
> gestione_personale=# show client_encoding;
>  client_encoding
> -----------------
>  SQL_ASCII
> (1 row)

OK, so you are not using Unicode. I don't know much about the SQL_ASCII
charset, but I expect it to slightly differ from LATIN1. Type

SET client_encoding = LATIN1;

in psql, then SELECT some data containing the chars you have problems
with and see if it is fixed. If not, INSERT a new row with such chars
from psql, SELECT that row again and if the chars look right now, your
other rows are stored invalid in the datebase and you need to convert
them somehow.


--
Regards,
Hannes Dorbath

Re: character problem

From
Luca Ferrari
Date:
Hannes Dorbath's cat, on 10/10/2005 12.08, walking on the keyboard wrote:
> OK, so you are not using Unicode. I don't know much about the SQL_ASCII
> charset, but I expect it to slightly differ from LATIN1. Type
>
> SET client_encoding = LATIN1;
>
> in psql, then SELECT some data containing the chars you have problems
> with and see if it is fixed. If not, INSERT a new row with such chars
> from psql, SELECT that row again and if the chars look right now, your
> other rows are stored invalid in the datebase and you need to convert
> them somehow.

The old rows are still displayed in a bad format, the newly inserted one
is instead shown rightly. Also data in pg_dump present wrong strings,
thus any suggestion about how to reconvert already inserted data?
Moreover, the set client_encoding change is permanent or do I have to
configure it in the postgresql.conf file?

Thanks,
Luca


--
Luca Ferrari
fluca1978@infinito.it

Re: character problem

From
Hannes Dorbath
Date:
On 10.10.2005 14:24, Luca Ferrari wrote:
> The old rows are still displayed in a bad format, the newly inserted one
> is instead shown rightly. Also data in pg_dump present wrong strings,
> thus any suggestion about how to reconvert already inserted data?

It should be possible to convert it with an intentionally "wrong"
client_encoding and one or two reloads, but I don't have the time ATM to
think out how to do it exactly, maybe someone else can help in the meantime.

> Moreover, the set client_encoding change is permanent or do I have to
> configure it in the postgresql.conf file?

No it's for the running session only. Set it global in postgresql.conf,
for each user or just fire "SET client_encoding = LATIN1;" as first
query in your scripts.


--
Regards,
Hannes Dorbath

Re: character problem

From
Tom Lane
Date:
Hannes Dorbath <light@theendofthetunnel.de> writes:
> On 10.10.2005 11:56, Luca Ferrari wrote:
>> gestione_personale=# \l
>> List of databases
>> Name        |   Owner   | Encoding
>> --------------------+-----------+-----------
>> gestione_database  | dbmanager | SQL_ASCII
>> gestione_personale | wwwrun    | SQL_ASCII
>> template0          | dbmanager | SQL_ASCII
>> template1          | dbmanager | SQL_ASCII

> OK, so you are not using Unicode. I don't know much about the SQL_ASCII
> charset, but I expect it to slightly differ from LATIN1. Type

> SET client_encoding = LATIN1;

No, SQL_ASCII represents the complete absence of any encoding
knowledge.  With this database setting, changing client_encoding is a
complete no-op.  Postgres will just absorb and re-emit strings exactly
as they were supplied originally, no matter what client_encoding is.

The cause of the unexpected conversions Luca is having trouble with must
be somewhere on the client side.  It's not going to be done by Postgres
with this encoding setting.

If you are actively dealing with non-7-bit-ASCII data, using SQL_ASCII
for the database encoding is probably a bad idea, exactly because
Postgres won't help you out at all with converting encodings or
enforcing that data is validly encoded.

            regards, tom lane

Re: character problem

From
Andrew Sullivan
Date:
On Mon, Oct 10, 2005 at 10:27:27AM -0400, Tom Lane wrote:
> No, SQL_ASCII represents the complete absence of any encoding
> knowledge.  With this database setting, changing client_encoding is a
> complete no-op.  Postgres will just absorb and re-emit strings exactly
> as they were supplied originally, no matter what client_encoding is.

The documents remain pretty confusing about this, assuming I still
understand the current state of affairs (always a dangerous
assumption).  The chart in
<http://developer.postgresql.org/docs/postgres/multibyte.html>, for
instance, says "SQL_ASCII" supports "ASCII".  I'm not sure what to do
about this (I've noticed it before, and run into the same quandary).

One possibility is to add something like this immediately below the
chart in the page above:

---snip---

NOTE: SQL_ASCII _does not_ enforce a 7 bit restriction on insertions.
SQL_ASCII does not represent a positive claim that the database knows
all the characters to be 7 bit characters.  It represents instead the
complete absence of any encoding knowledge.  Inserting high-bit
characters into a database using the SQL_ASCII character set may have
unpredictable results.

---snip---

--
Andrew Sullivan  | ajs@crankycanuck.ca
I remember when computers were frustrating because they *did* exactly what
you told them to.  That actually seems sort of quaint now.
        --J.D. Baldwin

Re: character problem

From
Tom Lane
Date:
Andrew Sullivan <ajs@crankycanuck.ca> writes:
> On Mon, Oct 10, 2005 at 10:27:27AM -0400, Tom Lane wrote:
>> No, SQL_ASCII represents the complete absence of any encoding
>> knowledge.  With this database setting, changing client_encoding is a
>> complete no-op.  Postgres will just absorb and re-emit strings exactly
>> as they were supplied originally, no matter what client_encoding is.

> The documents remain pretty confusing about this, assuming I still
> understand the current state of affairs (always a dangerous
> assumption).  The chart in
> <http://developer.postgresql.org/docs/postgres/multibyte.html>, for
> instance, says "SQL_ASCII" supports "ASCII".  I'm not sure what to do
> about this (I've noticed it before, and run into the same quandary).

You're right, the documentation basically doesn't explain this at all :-(
I'll try to add something.

            regards, tom lane