Thread: Encoding-related errors when moving from 7.3 to 8.0.1
Hello, I'm trying to upgrade a system that is currently running version 7.4.5 to the new 8.0.1 I create a backup, using pg_dump, and I expect it to work when restoring it to 8.0.1. However, when I run: psql -U user -f backup.sql Whenever there is a field value that contains characters with accents (e.g., HTML á , which would be the Alt-160 character when using the numeric keypad on Windows), I get an error about invalid UNICODE characters in the COPY statements. The error reads like: psql:db_backup.sql:1548: ERROR: invalid byte sequence for encoding "UNICODE": 0xe12020 CONTEXT: COPY country, line 5, column namespanish: "Canad? " (that ? should be an a with a ' on top -- in HTML, it would be the á character) I get this error with or without the line char_encoding="SQL_ASCII" at the beginning of the pg_dump'ed file (I noticed it and removed it to see if that would fix the problem -- it didn't change the behaviour). I know this feels like it could be the dumbest question ever asked around here :-) But I can't figure out why it's happening and how to fix it (I mean, it's a backup creaetd by pg_dump -- it should be compatible with a psql restore, even if it is a different version). Any ideas? Thanks, Carlos --
On Sat, Mar 19, 2005 at 05:25:46PM -0500, Carlos Moreno wrote: Carlos, > The error reads like: > > psql:db_backup.sql:1548: ERROR: invalid byte sequence for encoding > "UNICODE": 0xe12020 > CONTEXT: COPY country, line 5, column namespanish: > "Canad? " Hmm. The sequence looks like latin1 interpreted as utf8. This seems the inverse of the problem reported (and solved) here http://archives.postgresql.org/pgsql-es-ayuda/2005-03/msg00491.php Maybe you should try sticking a SET client_encoding TO latin1; at the beggining of the dump file. Why are you using CHAR(n) fields anyway? It should probably be better if you used VARCHAR(n) ... -- Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) "El destino baraja y nosotros jugamos" (A. Schopenhauer)
Hi Alvaro, thanks for your reply! Alvaro Herrera wrote: >>psql:db_backup.sql:1548: ERROR: invalid byte sequence for encoding >>"UNICODE": 0xe12020 >>CONTEXT: COPY country, line 5, column namespanish: >>"Canad? " > > Hmm. The sequence looks like latin1 interpreted as utf8. This seems > the inverse of the problem reported (and solved) here > > http://archives.postgresql.org/pgsql-es-ayuda/2005-03/msg00491.php > > Maybe you should try sticking a > > SET client_encoding TO latin1; > > at the beggining of the dump file. One thing worries me, though. With all of the previous versions of postgresql (I think when we started to use it in our system, it was version 7.1), I have never worried about any encoding issues. Our users are mostly Spanish-speaking, and they register to our system via web-based interfaces; virtually 100% of them use Windows (and perhaps most of them Windows in Spanish, with a Spanish keyboard). So, our system (CGI's written in C++ running on a Linux server) simply takes whatever the user gives (properly validated and escaped) and throws it in the database. We've never encountered any problem (well, or perhaps it's the opposite? Perhaps we've always been living with the problem without realizing it?) I worry now that if I needed to put a set client_encoding statement to make the insert or COPY statements work, does that mean that I should modify each and every program that I have that interacts with the database, and add a "set client encoding" statement before whatever other statement(s) we execute? Or is this client_encoding setting something that gets attached to the database (or the tables) itself? Where can I find more documentation on these issues? I'd like to get a deeper understanding, to avoid any future problems. > Why are you using CHAR(n) fields anyway? It should probably be better > if you used VARCHAR(n) ... Una de esas cosas que pasan hasta en las mejores familias ;-) (I was also surprised when noticing the bunch of spaces at the end -- I would have thought that we were using varchars in fields like that one) Thanks again! Cheers, Carlos -- PS: I have a strict white-list anti-spam filter in place, which is why a direct e-mail would be rejected -- let me know if you want to write directly through e-mail, so that I can add you to the white list file.
On Sun, Mar 20, 2005 at 10:02:24AM -0500, Carlos Moreno wrote: Carlos, > So, our system (CGI's written in C++ running on a Linux server) > simply takes whatever the user gives (properly validated and > escaped) and throws it in the database. We've never encountered > any problem (well, or perhaps it's the opposite? Perhaps we've > always been living with the problem without realizing it?) The latter, I think. The problem is character recoding. If your old system has been running with encoding SQL_ASCII, then no recoding ever takes place. If you are now using UTF8 or latin1 (say) as server encoding, then as soon as the client is using a different encoding, there should be conversion in order to make the new data correct w.r.t. the server encoding. If the wrong conversion takes place, or if no conversion takes place, you may either end up with invalid data, or have the server reject your input (as was this case.) So the moral of the story seems to be that yes, you need to make each application issue the correct client_encoding before entering any data. You can attach it to the user or database, by issuing ALTER USER (resp. DATABASE). But if you are using a web interface, where the user can enter data in either win1252 or latin1 encoding (or whatever) depending on the environment, then I'm not sure what you should do. One idea would be "do nothing," but that seems very invalid-data-prone. Another idea would be having the user select an encoding (and maybe display the data to them after the recoding has taken place so they can correct it in case they got it wrong.) This seems messy and likely to upset your users. Someone else may have better advise for you on this. I haven't really worked with these things. -- Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) "I can't go to a restaurant and order food because I keep looking at the fonts on the menu. Five minutes later I realize that it's also talking about food" (Donald Knuth)
On Sun, Mar 20, 2005 at 10:02:24AM -0500, Carlos Moreno wrote: Carlos, > Carlos > -- > PS: I have a strict white-list anti-spam filter in place, which > is why a direct e-mail would be rejected -- let me know if > you want to write directly through e-mail, so that I can > add you to the white list file. I forgot to mention that I did receive the "rejected mail" message, which directed me to an URL which thrown a 404 error. You may want to take a look ... -- Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) "Amanece. (Ignacio Reyes) El Cerro San Cristóbal me mira, cínicamente, con ojos de virgen"
Thanks again, Alvaro! Alvaro Herrera wrote: >>So, our system (CGI's written in C++ running on a Linux server) >>simply takes whatever the user gives (properly validated and >>escaped) and throws it in the database. We've never encountered >>any problem (well, or perhaps it's the opposite? Perhaps we've >>always been living with the problem without realizing it?) > > The latter, I think. The problem is character recoding. If your old > system has been running with encoding SQL_ASCII, then no recoding ever > takes place. If you are now using UTF8 or latin1 (say) as server > encoding, then as soon as the client is using a different encoding, > there should be conversion in order to make the new data correct w.r.t. > the server encoding. If the wrong conversion takes place, or if no > conversion takes place, you may either end up with invalid data, or > have the server reject your input (as was this case.) This makes sense to me, yes. The reason why I'm a bit lost is that we never did anything whatsoever with respect to encoding. Oddly enough, I couldn't find much about this in the docs. I see references to it in the runtime configuration docs (the part where they describe the postgres.conf file). There's one line, commented out, where they set (as an example), the client_encoding to sql_ascii, and a comment to the end of that line says "actually, it defaults to the server encoding"). I just found out that in the create database statement, one of the options specifies the encoding using for that database. I guess what changed from version 7.4.x to 8.0 is that the default server_encoding changed? This means that a temporary solution (or rather, a temporary patch) would be to create the database specifying the right server_encoding to match what I currently have on my system? (I wouldn't want to do that if it is nothing more than a patch to keep hiding the problem) > So the moral of the story seems to be that yes, you need to make each > application issue the correct client_encoding before entering any data. > You can attach it to the user or database, by issuing ALTER USER (resp. > DATABASE). But if you are using a web interface, where the user can > enter data in either win1252 or latin1 encoding (or whatever) depending > on the environment, then I'm not sure what you should do. This is indeed the case; and "do nothing" is what we have always done with respect to this issue... Why has it been so long without us realizing that there was a hidden problem, I really don't know. (and we do have users with plenty of "weird" characters -- accent aigue, grave, circumflex, tilde, dieresis, etc. -- and they have always worked). I'm so lost! :-( BTW, the correct e-mail to pass through the anti-spam filter is my first name, followed by a dot, followed by my last name (the rest after the @ is the same) Thanks again for your message! Carlos --
[...] > This makes sense to me, yes. The reason why I'm a bit lost is that > we never did anything whatsoever with respect to encoding. Oddly > enough, I couldn't find much about this in the docs. > [...] > > I guess what changed from version 7.4.x to 8.0 is that the default > server_encoding changed? Some further digging through the documentation reveals (in addition to the fact that I seem to be an incompetent when it comes to search through documentation :-)) that the above is indeed the case. Each database apparently has its own encoding, that can be specified with either the createdb shell command (switch -e or --encoding, IIRC), or with the SQL command create database. With version 7.4.x, when I execute the command: show server_encoding; I get SQL_ASCII as response. Also, when I do: select * from pg_database; I get a bunch of columns where the encoding field contains 0 for all of them (I never specified encoding when creating databases, so in all those cases it took the default). With 8.0.1, the command show server_encoding returns UNICODE as result -- this is also the default, since I created the database without explicitly indicating encoding. Apparently, conversion from latin1 to UNICODE works? (or at least doesn't trigger an error -- I wonder if some characters appear now incorrectly? I'll check that). Yes, I know I'm blah-blah-ing and answering my own questions :-) I'm just posting the progress I've made in case someone else was also interested in finding out more about the problem or possible solutions. Thanks, Carlos --
On Tue, Mar 22, 2005 at 09:28:17PM -0500, Carlos Moreno wrote: > >I guess what changed from version 7.4.x to 8.0 is that the default > >server_encoding changed? > > Some further digging through the documentation reveals (in addition > to the fact that I seem to be an incompetent when it comes to search > through documentation :-)) that the above is indeed the case. Oh, I forgot that. In 8.0 the encoding is actually derived from the locale settings, in case the platforms supports that. Previously, SQL_ASCII was always the default IIRC. > Apparently, conversion from latin1 to UNICODE works? (or at least > doesn't trigger an error -- I wonder if some characters appear now > incorrectly? I'll check that). Yes, it works (both ways). -- Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) "El realista sabe lo que quiere; el idealista quiere lo que sabe" (Anónimo)