Thread: Encoding-related errors when moving from 7.3 to 8.0.1

Encoding-related errors when moving from 7.3 to 8.0.1

From
Carlos Moreno
Date:
Hello,

I'm trying to upgrade a system that is currently running version 7.4.5
to the new 8.0.1

I create a backup, using pg_dump, and I expect it to work when
restoring it to 8.0.1.

However, when I run:

psql -U user -f backup.sql

Whenever there is a field value that contains characters with accents
(e.g., HTML á , which would be the Alt-160 character when
using the numeric keypad on Windows), I get an error about invalid
UNICODE characters in the COPY statements.

The error reads like:

psql:db_backup.sql:1548: ERROR:  invalid byte sequence for encoding
"UNICODE": 0xe12020
CONTEXT:  COPY country, line 5, column namespanish:
"Canad?                        "

(that ? should be an a with a ' on top -- in HTML, it would be the
á character)

I get this error with or without the line char_encoding="SQL_ASCII" at
the beginning of the pg_dump'ed file  (I noticed it and removed it to
see if that would fix the problem -- it didn't change the behaviour).

I know this feels like it could be the dumbest question ever asked
around here  :-)   But I can't figure out why it's happening and
how to fix it (I mean, it's a backup creaetd by pg_dump -- it should
be compatible with a psql restore, even if it is a different version).

Any ideas?

Thanks,

Carlos
--


Re: Encoding-related errors when moving from 7.3 to 8.0.1

From
Alvaro Herrera
Date:
On Sat, Mar 19, 2005 at 05:25:46PM -0500, Carlos Moreno wrote:

Carlos,

> The error reads like:
>
> psql:db_backup.sql:1548: ERROR:  invalid byte sequence for encoding
> "UNICODE": 0xe12020
> CONTEXT:  COPY country, line 5, column namespanish:
> "Canad?                        "

Hmm.  The sequence looks like latin1 interpreted as utf8.  This seems
the inverse of the problem reported (and solved) here

http://archives.postgresql.org/pgsql-es-ayuda/2005-03/msg00491.php

Maybe you should try sticking a

SET client_encoding TO latin1;

at the beggining of the dump file.

Why are you using CHAR(n) fields anyway?  It should probably be better
if you used VARCHAR(n) ...

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"El destino baraja y nosotros jugamos" (A. Schopenhauer)

Re: Encoding-related errors when moving from 7.3 to 8.0.1

From
Carlos Moreno
Date:
Hi Alvaro, thanks for your reply!

Alvaro Herrera wrote:
>>psql:db_backup.sql:1548: ERROR:  invalid byte sequence for encoding
>>"UNICODE": 0xe12020
>>CONTEXT:  COPY country, line 5, column namespanish:
>>"Canad?                        "
>
> Hmm.  The sequence looks like latin1 interpreted as utf8.  This seems
> the inverse of the problem reported (and solved) here
>
> http://archives.postgresql.org/pgsql-es-ayuda/2005-03/msg00491.php
>
> Maybe you should try sticking a
>
> SET client_encoding TO latin1;
>
> at the beggining of the dump file.

One thing worries me, though.  With all of the previous versions
of postgresql (I think when we started to use it in our system,
it was version 7.1), I have never worried about any encoding
issues.  Our users are mostly Spanish-speaking, and they register
to our system via web-based interfaces;  virtually 100% of them
use Windows (and perhaps most of them Windows in Spanish, with
a Spanish keyboard).

So, our system (CGI's written in C++ running on a Linux server)
simply takes whatever the user gives (properly validated and
escaped) and throws it in the database.  We've never encountered
any problem  (well, or perhaps it's the opposite?  Perhaps we've
always been living with the problem without realizing it?)

I worry now that if I needed to put a set client_encoding
statement to make the insert or COPY statements work, does
that mean that I should modify each and every program that I
have that interacts with the database, and add a "set client
encoding" statement before whatever other statement(s) we
execute?

Or is this client_encoding setting something that gets attached
to the database (or the tables) itself?

Where can I find more documentation on these issues?  I'd like
to get a deeper understanding, to avoid any future problems.

> Why are you using CHAR(n) fields anyway?  It should probably be better
> if you used VARCHAR(n) ...

Una de esas cosas que pasan hasta en las mejores familias  ;-)

(I was also surprised when noticing the bunch of spaces at the
end -- I would have thought that we were using varchars in
fields like that one)

Thanks again!

Cheers,

Carlos
--
PS: I have a strict white-list anti-spam filter in place, which
     is why a direct e-mail would be rejected -- let me know if
     you want to write directly through e-mail, so that I can
     add you to the white list file.

Re: Encoding-related errors when moving from 7.3 to 8.0.1

From
Alvaro Herrera
Date:
On Sun, Mar 20, 2005 at 10:02:24AM -0500, Carlos Moreno wrote:

Carlos,

> So, our system (CGI's written in C++ running on a Linux server)
> simply takes whatever the user gives (properly validated and
> escaped) and throws it in the database.  We've never encountered
> any problem  (well, or perhaps it's the opposite?  Perhaps we've
> always been living with the problem without realizing it?)

The latter, I think.  The problem is character recoding.  If your old
system has been running with encoding SQL_ASCII, then no recoding ever
takes place.  If you are now using UTF8 or latin1 (say) as server
encoding, then as soon as the client is using a different encoding,
there should be conversion in order to make the new data correct w.r.t.
the server encoding.  If the wrong conversion takes place, or if no
conversion takes place, you may either end up with invalid data, or
have the server reject your input (as was this case.)

So the moral of the story seems to be that yes, you need to make each
application issue the correct client_encoding before entering any data.
You can attach it to the user or database, by issuing ALTER USER (resp.
DATABASE).  But if you are using a web interface, where the user can
enter data in either win1252 or latin1 encoding (or whatever) depending
on the environment, then I'm not sure what you should do.  One idea
would be "do nothing," but that seems very invalid-data-prone.  Another
idea would be having the user select an encoding (and maybe display the
data to them after the recoding has taken place so they can correct it
in case they got it wrong.)  This seems messy and likely to upset your
users.

Someone else may have better advise for you on this.  I haven't really
worked with these things.

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"I can't go to a restaurant and order food because I keep looking at the
fonts on the menu.  Five minutes later I realize that it's also talking
about food" (Donald Knuth)

Re: Encoding-related errors when moving from 7.3 to 8.0.1

From
Alvaro Herrera
Date:
On Sun, Mar 20, 2005 at 10:02:24AM -0500, Carlos Moreno wrote:

Carlos,

> Carlos
> --
> PS: I have a strict white-list anti-spam filter in place, which
>     is why a direct e-mail would be rejected -- let me know if
>     you want to write directly through e-mail, so that I can
>     add you to the white list file.

I forgot to mention that I did receive the "rejected mail" message,
which directed me to an URL which thrown a 404 error.  You may want to
take a look ...

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"Amanece.                                               (Ignacio Reyes)
 El Cerro San Cristóbal me mira, cínicamente, con ojos de virgen"

Re: Encoding-related errors when moving from 7.3 to 8.0.1

From
Carlos Moreno
Date:
Thanks again, Alvaro!

Alvaro Herrera wrote:

>>So, our system (CGI's written in C++ running on a Linux server)
>>simply takes whatever the user gives (properly validated and
>>escaped) and throws it in the database.  We've never encountered
>>any problem  (well, or perhaps it's the opposite?  Perhaps we've
>>always been living with the problem without realizing it?)
>
> The latter, I think.  The problem is character recoding.  If your old
> system has been running with encoding SQL_ASCII, then no recoding ever
> takes place.  If you are now using UTF8 or latin1 (say) as server
> encoding, then as soon as the client is using a different encoding,
> there should be conversion in order to make the new data correct w.r.t.
> the server encoding.  If the wrong conversion takes place, or if no
> conversion takes place, you may either end up with invalid data, or
> have the server reject your input (as was this case.)

This makes sense to me, yes.  The reason why I'm a bit lost is that
we never did anything whatsoever with respect to encoding.  Oddly
enough, I couldn't find much about this in the docs.

I see references to it in the runtime configuration docs (the part
where they describe the postgres.conf file).  There's one line,
commented out, where they set (as an example), the client_encoding
to sql_ascii, and a comment to the end of that line says "actually,
it defaults to the server encoding").

I just found out that in the create database statement, one of the
options specifies the encoding using for that database.

I guess what changed from version 7.4.x to 8.0 is that the default
server_encoding changed?

This means that a temporary solution (or rather, a temporary patch)
would be to create the database specifying the right server_encoding
to match what I currently have on my system?  (I wouldn't want to do
that if it is nothing more than a patch to keep hiding the problem)

> So the moral of the story seems to be that yes, you need to make each
> application issue the correct client_encoding before entering any data.
> You can attach it to the user or database, by issuing ALTER USER (resp.
> DATABASE).  But if you are using a web interface, where the user can
> enter data in either win1252 or latin1 encoding (or whatever) depending
> on the environment, then I'm not sure what you should do.

This is indeed the case;  and "do nothing" is what we have always
done with respect to this issue...  Why has it been so long without
us realizing that there was a hidden problem, I really don't know.
(and we do have users with plenty of "weird" characters -- accent
aigue, grave, circumflex, tilde, dieresis, etc. -- and they have
always worked).

I'm so lost!  :-(

BTW, the correct e-mail to pass through the anti-spam filter is
my first name, followed by a dot, followed by my last name (the
rest after the @ is the same)

Thanks again for your message!

Carlos
--

Re: Encoding-related errors when moving from 7.3 to 8.0.1

From
Carlos Moreno
Date:
[...]

> This makes sense to me, yes.  The reason why I'm a bit lost is that
> we never did anything whatsoever with respect to encoding.  Oddly
> enough, I couldn't find much about this in the docs.
> [...]
>
> I guess what changed from version 7.4.x to 8.0 is that the default
> server_encoding changed?

Some further digging through the documentation reveals (in addition
to the fact that I seem to be an incompetent when it comes to search
through documentation :-)) that the above is indeed the case.

Each database apparently has its own encoding, that can be specified
with either the createdb shell command (switch -e or --encoding, IIRC),
or with the SQL command create database.

With version 7.4.x, when I execute the command:

show server_encoding;

I get SQL_ASCII as response.  Also, when I do:

select * from pg_database;

I get a bunch of columns where the encoding field contains 0 for all
of them (I never specified encoding when creating databases, so in
all those cases it took the default).

With 8.0.1, the command show server_encoding returns UNICODE as
result -- this is also the default, since I created the database
without explicitly indicating encoding.

Apparently, conversion from latin1 to UNICODE works?  (or at least
doesn't trigger an error -- I wonder if some characters appear now
incorrectly?  I'll check that).

Yes, I know I'm blah-blah-ing and answering my own questions :-)

I'm just posting the progress I've made in case someone else was
also interested in finding out more about the problem or possible
solutions.

Thanks,

Carlos
--

Re: Encoding-related errors when moving from 7.3 to 8.0.1

From
Alvaro Herrera
Date:
On Tue, Mar 22, 2005 at 09:28:17PM -0500, Carlos Moreno wrote:

> >I guess what changed from version 7.4.x to 8.0 is that the default
> >server_encoding changed?
>
> Some further digging through the documentation reveals (in addition
> to the fact that I seem to be an incompetent when it comes to search
> through documentation :-)) that the above is indeed the case.

Oh, I forgot that.  In 8.0 the encoding is actually derived from the
locale settings, in case the platforms supports that.  Previously,
SQL_ASCII was always the default IIRC.

> Apparently, conversion from latin1 to UNICODE works?  (or at least
> doesn't trigger an error -- I wonder if some characters appear now
> incorrectly?  I'll check that).

Yes, it works (both ways).

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"El realista sabe lo que quiere; el idealista quiere lo que sabe" (Anónimo)