Thread: Locale/encoding problem/question
Hello all, I somehow managed to stuff up the encoding (or locale or something) in a transfer of a database from one machine to another (also different linux distribution). The problem is this: the origional database was created and populated with data using whatever default locale/encoding was installed on the first machine. Words would appear correctly, as in: Hoë After dumping the database and restoring on a different one, the word incorrectly appears as: Hoë Before I do any more damage or waste time, I'd appreciate any ideas or pointers on correctly performing the initdb -E<encoding> --locale=<locale>... (or any other ideas on resolving this issue) , before I try and restore the data again. Thanks henk
On Fri, Aug 04, 2006 at 10:48:17AM +0200, henka@cityweb.co.za wrote: > Hello all, > > I somehow managed to stuff up the encoding (or locale or something) in a > transfer of a database from one machine to another (also different linux > distribution). > > The problem is this: the origional database was created and populated > with data using whatever default locale/encoding was installed on the > first machine. Two big questions: 1. What encoding are the two database (\l will tell you)? 2. What encoding are the clients expecting? It is entirely possible that the databases have got the right encoding, but the client you're accessing it doesn't understand. For example, if you've got both your databases on UTF-8, then the transfer went fine, but perhaps the client on the new machine doesn't display UTF-8. This happens because psql and other clients default to the same encoding as the server and don't check to see if the terminal actually supports that. So what you're decribing could also be a result of the server sending you UTF-8, the client displaying that, but the xterm or whatever you're using thinking it's Latin1. Hope this helps, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Attachment
> On Fri, Aug 04, 2006 at 10:48:17AM +0200, henka@cityweb.co.za wrote: >> Hello all, >> >> I somehow managed to stuff up the encoding (or locale or something) in a >> transfer of a database from one machine to another (also different linux >> distribution). >> >> The problem is this: the origional database was created and populated >> with data using whatever default locale/encoding was installed on the >> first machine. > > Two big questions: > > 1. What encoding are the two database (\l will tell you)? > 2. What encoding are the clients expecting? Thanks for the response, Martijn. I *think* the client_encoding origionally in the db was UTF-8 (but I could be wrong, it might have been LATIN1). I would imagine that LATIN1 would be the right one, since it needs to display standard english, plus some others (such as é ä ë è etc). The multibyte chars show up in xterm (putty) -and- when the data is displayed using php in a browser - both incorrectly. I've even tried using LATIN1 (ie, explicitly setting it to latin1 using initdb, and then restoring the database after changing the 'utf-8' strings in the dump data to 'latin1'). This still yields the funny chars. To be honest, I have no idea what the origional encoding was. Can you suggest any other approaches I can try to restore the database so that those chars display correctly? All comments are welcome. Regards Henk
On Fri, Aug 04, 2006 at 11:58:22AM +0200, henka@cityweb.co.za wrote: > > On Fri, Aug 04, 2006 at 10:48:17AM +0200, henka@cityweb.co.za wrote: > > Two big questions: > > > > 1. What encoding are the two database (\l will tell you)? > > 2. What encoding are the clients expecting? > I've even tried using LATIN1 (ie, explicitly setting it to latin1 using > initdb, and then restoring the database after changing the 'utf-8' strings > in the dump data to 'latin1'). This still yields the funny chars. Wait, so the dump is in utf-8? You shouldn't need to edit the dump, postgresql will convert the encodings on the fly while loading. > To be honest, I have no idea what the origional encoding was. It should be in the dump file, almost the first line. Locale is of no interest to pg_dump, you'll have to decide how you want it. > Can you suggest any other approaches I can try to restore the database so > that those chars display correctly? Well, at the very least, does it go away if you type: set client_encoding=latin1; Please provide more details about your setup too, your client is on windows? The server is ...? Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Attachment
> On Fri, Aug 04, 2006 at 11:58:22AM +0200, henka@cityweb.co.za wrote: >> > On Fri, Aug 04, 2006 at 10:48:17AM +0200, henka@cityweb.co.za wrote: >> > Two big questions: >> > >> > 1. What encoding are the two database (\l will tell you)? >> > 2. What encoding are the clients expecting? > >> I've even tried using LATIN1 (ie, explicitly setting it to latin1 using >> initdb, and then restoring the database after changing the 'utf-8' >> strings >> in the dump data to 'latin1'). This still yields the funny chars. > > Wait, so the dump is in utf-8? You shouldn't need to edit the dump, > postgresql will convert the encodings on the fly while loading. I've actually found two versions - one with UTF-8, and the other LATIN1. >> To be honest, I have no idea what the origional encoding was. > > It should be in the dump file, almost the first line. Locale is of no > interest to pg_dump, you'll have to decide how you want it. Yes: UTF-8 and the other is LATIN1 > Well, at the very least, does it go away if you type: > > set client_encoding=latin1; No it doesn't. That was one of the first things I tried after reading the docs. > Please provide more details about your setup too, your client is on > windows? The server is ...? Server: Linux debian sarge PG: 8.1.4 show all: lc_collate C lc_ctype C lc_messages C lc_monetary C lc_numeric C lc_time C client_encoding LATIN1 (or UTF-8) Clients: Windows using PuTTY (ie, for psql), and dynamic web content with PHP/Pg (on any browser).
henka@cityweb.co.za writes: >> It should be in the dump file, almost the first line. Locale is of no >> interest to pg_dump, you'll have to decide how you want it. > Yes: UTF-8 and the other is LATIN1 Note that this represents what the original server *thought* the encoding was. But it's not at all impossible that the server thought the data was LATIN1 when it was really UTF8. (The other way around is less plausible because the server would have been able to detect encoding errors.) If you were using clients that treated the data as UTF8 without paying attention to what the server thought, you'd not have realized you were mislabeling the data. But, if you tried to load data marked as LATIN1 into a server using UTF8, it'd have applied a LATIN1 to UTF8 conversion, and then everything's hosed. I'd suggest actually inspecting the data in the dump file: it's not that hard to tell UTF8 from LATIN1 if you look at the byte sequences. Or you could just take the file marked LATIN1, edit it to change the client_encoding setting to say the data is UTF8, and see if you can load it. If it's not UTF8, 8.1.4 will almost certainly detect a ton of encoding errors. regards, tom lane
> henka@cityweb.co.za writes: >>> It should be in the dump file, almost the first line. Locale is of no >>> interest to pg_dump, you'll have to decide how you want it. > >> Yes: UTF-8 and the other is LATIN1 > > Note that this represents what the original server *thought* the > encoding was. But it's not at all impossible that the server thought > the data was LATIN1 when it was really UTF8. (The other way around is > less plausible because the server would have been able to detect > encoding errors.) If you were using clients that treated the data > as UTF8 without paying attention to what the server thought, you'd > not have realized you were mislabeling the data. > > But, if you tried to load data marked as LATIN1 into a server using > UTF8, it'd have applied a LATIN1 to UTF8 conversion, and then > everything's hosed. > > I'd suggest actually inspecting the data in the dump file: it's not that > hard to tell UTF8 from LATIN1 if you look at the byte sequences. > > Or you could just take the file marked LATIN1, edit it to change the > client_encoding setting to say the data is UTF8, and see if you can > load it. If it's not UTF8, 8.1.4 will almost certainly detect a ton of > encoding errors. Thanks Tom, your suggestion worked. Just to document this for others, this is what I did: - created a new empty DB: initdb -ELATIN1 -D data. - edited dump file with UTF8 encoding and changed to LATIN1 (doing the reverse resulted in encoding errors during restore). - restored database So, it looks like it was the reverse: the db thought it was UTF8, when in fact it was LATIN1. Regards