Thread: Postgres Backup and Restore
Hello List, I have done some searching and looking at all the documentation I could find, and I have found a (what i consider) a bug in pg_dump pg_restore that I wanted to bounce off the list to find their solutions. I have a database which gets updated from many locations and it looks like we have gotten some character encoding mis matches in the system. I am going through and fixing these as this is my problem, not postgres's, however, in finding this bug in our data, I found a situation with postgres. If you have bad character encodings in your data, you can not backup and restore that data using pg_dump and pg_restore, on Windows Server 2003, using compress or plain text modes. Both modes fail on the bad encoding when you try to restore. I found this out while doing a test run on our backup and restore process, this data had crept in over a recent period and got into the database somehow - I am finding out how :) I see this as a bug because the data got in there and postgres accepted it on the copy in, but then refuses to accept it on the restore, but I wanted to find out what the list thought. The database is 8.2, running on Windows Server 2003. The database is UTF8 as reported by \l in psql. Now I have managed to do a backup and test restore, but I did this by writing a ruby script that went through the plain text version of the backup line by line and split out the good from the bad by attempting to parse each line through Iconv. It was not a fast process, but it did result in me finding the 120 lines that contained bad data and get them out of the file's COPY tables. I then had to go and individually correct each of these fields to handle the incorrect encodings. Now, the problem is, I should be able to restore my backup, regardless of this sort of problem. The problem characters were all in text fields and so should be able to be restored. How are other admins handling such a situation? What backup tools do you use? I know on Oracle I can do a Binary hot backup which just bypasses the whole problem. But as it stands right now, I will have to keep this script handy or constantly double and triple check encodings. Note, I know the bad stuff shouldn't get in there to begin with, but that is not the point here. I should be able to restore to the state I made the backup in. Anyway, comments? Ideas? I have attached some of the encodings below if anyone is interested. Regards Mikel http://lindsaar.net/ "\201L J\303\203\342\200\234ZSEF U." "\201N\t\\N\tHU\t2084\t20" "DON\303\203\302\201T U." "\201ROS\t\\N\tHU\t3580\t" "\215HE 25\t\\N\t\\N\tHAM" "\215GER\t\\N\t\\N\t2001-" "\201SHINDEN 955-2\t\\" "\215NBUCH\t\\N\tDE\t710"
"Mikel Lindsaar" <raasdnil@gmail.com> writes: > If you have bad character encodings in your data, you can not backup > and restore that data using pg_dump and pg_restore, on Windows Server > 2003, using compress or plain text modes. Both modes fail on the bad > encoding when you try to restore. Are you sure that the destination database has the same encoding declaration as the source did? How was the badly-encoded data inserted in the first place? regards, tom lane
> Are you sure that the destination database has the same encoding > declaration as the source did? That's what I'd be looking at - I had similar problems a few weeks ago. See this thread: http://archives.postgresql.org/pgsql-admin/2007-12/msg00235.php THINK BEFORE YOU PRINT - Save paper if you don't really need to print this *******************Confidentiality and Privilege Notice******************* The material contained in this message is privileged and confidential to the addressee. If you are not the addressee indicated in this message or responsible for delivery of the message to such person, you may not copy or deliver this message to anyone, and you should destroy it and kindly notify the sender by reply email. Information in this message that does not relate to the official business of Weatherbeeta must be treated as neither given nor endorsed by Weatherbeeta. Weatherbeeta, its employees, contractors or associates shall not be liable for direct, indirect or consequential loss arising from transmission of this message or any attachments e-mail.
> > Are you sure that the destination database has the same encoding > > declaration as the source did? > > That's what I'd be looking at - I had similar problems a few weeks ago. > See > this thread: > http://archives.postgresql.org/pgsql-admin/2007-12/msg00235.php I've been thinking about this, and while I don't agree it's a bug, I think that perhaps PostgreSQL should raise a notice or warning that the destination database has different encoding than the file being restored...? Obviously I'm not the only one (seemingly) caught out by this so I'm sure it would save some heartache for people. THINK BEFORE YOU PRINT - Save paper if you don't really need to print this *******************Confidentiality and Privilege Notice******************* The material contained in this message is privileged and confidential to the addressee. If you are not the addressee indicated in this message or responsible for delivery of the message to such person, you may not copy or deliver this message to anyone, and you should destroy it and kindly notify the sender by reply email. Information in this message that does not relate to the official business of Weatherbeeta must be treated as neither given nor endorsed by Weatherbeeta. Weatherbeeta, its employees, contractors or associates shall not be liable for direct, indirect or consequential loss arising from transmission of this message or any attachments e-mail.
"Phillip Smith" <phillip.smith@weatherbeeta.com.au> writes: > I've been thinking about this, and while I don't agree it's a bug, I think > that perhaps PostgreSQL should raise a notice or warning that the > destination database has different encoding than the file being restored...? If Postgres actually *knows* that the encodings are different, it can deal with that. The cases that are problematic are where the software has been misinformed for one reason or another. I doubt that a warning issued (or not) on the basis of misinformation will be especially helpful. regards, tom lane
> "Phillip Smith" <phillip.smith@weatherbeeta.com.au> writes: > > I've been thinking about this, and while I don't agree it's a bug, I > think > > that perhaps PostgreSQL should raise a notice or warning that the > > destination database has different encoding than the file being > restored...? > > If Postgres actually *knows* that the encodings are different, it can > deal with that. The cases that are problematic are where the software > has been misinformed for one reason or another. I doubt that a warning > issued (or not) on the basis of misinformation will be especially helpful. How has the software been misinformed and how can we avoid it then? :) The start of the pgdump sets the client encoding, so shouldn't that tell PG that things could get strange? THINK BEFORE YOU PRINT - Save paper if you don't really need to print this *******************Confidentiality and Privilege Notice******************* The material contained in this message is privileged and confidential to the addressee. If you are not the addressee indicated in this message or responsible for delivery of the message to such person, you may not copy or deliver this message to anyone, and you should destroy it and kindly notify the sender by reply email. Information in this message that does not relate to the official business of Weatherbeeta must be treated as neither given nor endorsed by Weatherbeeta. Weatherbeeta, its employees, contractors or associates shall not be liable for direct, indirect or consequential loss arising from transmission of this message or any attachments e-mail.
"Phillip Smith" <phillip.smith@weatherbeeta.com.au> writes: > How has the software been misinformed and how can we avoid it then? :) Well, if that were entirely clear then we could fix it. > The start of the pgdump sets the client encoding, so shouldn't that tell PG > that things could get strange? The problems seem to arise when the data doesn't actually have the claimed encoding. We've been gradually tightening the encoding checks over the years, so data that older versions let pass without comment might now be recognized as malformed. In the OP's case there may be something else going on (such as an actual encoding-conversion bug). I've encouraged him to try to trace things forward from his original input data and see if we can find out exactly where it went off the rails. regards, tom lane