Thread: Problem with pg_compresslog'd archives
This may belong in a bug report, but I'll post it here first... There appears to be a **SERIOUS** problem with using pg_compresslog and pg_uncompresslog with Postgresql 8.4.2. Here's my configuration snippet: full_page_writes = on # recover from partial page writes wal_buffers = 256kB # min 32kB # (change requires restart) #wal_writer_delay = 200ms # 1-10000 milliseconds #commit_delay = 0 # range 0-100000, in microseconds #commit_siblings = 5 # range 1-1000 # - Checkpoints - checkpoint_segments = 64 # in logfile segments, min 1, 16MB each #checkpoint_timeout = 5min # range 30s-1h checkpoint_completion_target = 0.9 # checkpoint target duration, 0.0 - 1.0 #checkpoint_warning = 30s # 0 disables archive_command = 'test ! -f /dbms/pg_archive/%f.bz2 && pg_compresslog %p | bzip2 - >/dbms/pg_archive/%f.bz2' #command to use to archive a logfile segment All appears to be fine with the writes, and they are being saved off on the nightly backups without incident. I take a full dump using the instructions in the documentation and make sure I copy the proper "must have" file for consistency to be reached. The problem comes when I try to restore. recovery_conf contains: restore_command = '/usr/local/pgsql/recovery.sh %f %p' And that file contains: #! /bin/sh infile=$1 outfile=$2 if test -f /dbms/pg_archive/$infile.bz2 then bunzip2 -c /dbms/pg_archive/$infile.bz2 | /usr/local/pgsql/bin/pg_decompresslog - $outfile exit 0 else exit 1 fi ============== The problem is that it appears that some of the segments being saved are no good! On occasion I get this when trying to restore... Feb 7 12:43:51 dbms2 postgres[2001]: [210-1] LOG: restored log file "00000001000001710000009A" from archive Feb 7 12:43:52 dbms2 postgres[2001]: [211-1] LOG: restored log file "00000001000001710000009B" from archive Feb 7 12:43:52 dbms2 postgres[2001]: [212-1] LOG: restored log file "00000001000001710000009C" from archive Feb 7 12:43:52 dbms2 postgres[2001]: [213-1] LOG: restored log file "00000001000001710000009D" from archive Feb 7 12:43:53 dbms2 postgres[2001]: [214-1] LOG: restored log file "00000001000001710000009E" from archive Feb 7 12:43:53 dbms2 postgres[2001]: [215-1] LOG: restored log file "00000001000001710000009F" from archive Feb 7 12:43:54 dbms2 postgres[2001]: [216-1] LOG: restored log file "0000000100000171000000A0" from archive Feb 7 12:43:54 dbms2 postgres[2001]: [217-1] LOG: restored log file "0000000100000171000000A1" from archive Feb 7 12:43:55 dbms2 postgres[2001]: [218-1] LOG: restored log file "0000000100000171000000A2" from archive Feb 7 12:43:55 dbms2 postgres[2001]: [219-1] LOG: restored log file "0000000100000171000000A3" from archive Feb 7 12:43:56 dbms2 postgres[2001]: [220-1] LOG: restored log file "0000000100000171000000A4" from archive Feb 7 12:43:56 dbms2 postgres[2001]: [221-1] LOG: restored log file "0000000100000171000000A5" from archive Feb 7 12:43:57 dbms2 postgres[2001]: [222-1] LOG: restored log file "0000000100000171000000A6" from archive Feb 7 12:43:57 dbms2 postgres[2001]: [223-1] PANIC: corrupted page pointers: lower = 772, upper = 616, special = 0 Feb 7 12:43:57 dbms2 postgres[2001]: [223-2] CONTEXT: xlog redo hot_update: rel 1663/616245/1193269; tid 53/93; new 53/4 Feb 7 12:43:57 dbms2 postgres[2000]: [1-1] LOG: startup process (PID 2001) was terminated by signal 6: Abort trap Feb 7 12:43:57 dbms2 postgres[2000]: [2-1] LOG: terminating any other active server processes Eek. I assume this means that either A6 or A7 is corrupt. But I have the file both in the restore AND ON THE MACHINE WHERE IT ORIGINATED: On the SOURCE machine (which is running just fine): tickerforum# cksum *171*A[67]* 172998591 830621 0000000100000171000000A6.bz2 1283345296 1541006 0000000100000171000000A7.bz2 And off the BACKUP archive, which is what I'm trying to restore: # cksum *171*A[67]* 172998591 830621 0000000100000171000000A6.bz2 1283345296 1541006 0000000100000171000000A7.bz2 Identical, says the checksums. This is VERY BAD - if pg_compresslog is damaging the files in some instances then ANY BACKUP TAKEN USING THEM IS SUSPECT AND MAY NOT RESTORE!!!!!! Needless to say this is a MAJOR problem. -- Karl Denninger
Attachment
Koichi-san, seems, very similar problem. Regards, Oleg On Sun, 7 Feb 2010, Karl Denninger wrote: > This may belong in a bug report, but I'll post it here first... > > There appears to be a **SERIOUS** problem with using pg_compresslog and > pg_uncompresslog with Postgresql 8.4.2. > > Here's my configuration snippet: > > full_page_writes = on # recover from partial page writes > wal_buffers = 256kB # min 32kB > # (change requires restart) > #wal_writer_delay = 200ms # 1-10000 milliseconds > > #commit_delay = 0 # range 0-100000, in microseconds > #commit_siblings = 5 # range 1-1000 > > # - Checkpoints - > > checkpoint_segments = 64 # in logfile segments, min 1, > 16MB each > #checkpoint_timeout = 5min # range 30s-1h > checkpoint_completion_target = 0.9 # checkpoint target duration, > 0.0 - 1.0 > #checkpoint_warning = 30s # 0 disables > > archive_command = 'test ! -f /dbms/pg_archive/%f.bz2 && pg_compresslog > %p | bzip2 - >/dbms/pg_archive/%f.bz2' #command to use to > archive a logfile segment > > All appears to be fine with the writes, and they are being saved off on > the nightly backups without incident. > > I take a full dump using the instructions in the documentation and make > sure I copy the proper "must have" file for consistency to be reached. > > The problem comes when I try to restore. > > recovery_conf contains: > > restore_command = '/usr/local/pgsql/recovery.sh %f %p' > > And that file contains: > > > #! /bin/sh > > infile=$1 > outfile=$2 > > if test -f /dbms/pg_archive/$infile.bz2 > then > bunzip2 -c /dbms/pg_archive/$infile.bz2 | > /usr/local/pgsql/bin/pg_decompresslog - $outfile > exit 0 > else > exit 1 > fi > > ============== > > The problem is that it appears that some of the segments being saved are > no good! On occasion I get this when trying to restore... > > Feb 7 12:43:51 dbms2 postgres[2001]: [210-1] LOG: restored log file > "00000001000001710000009A" from archive > Feb 7 12:43:52 dbms2 postgres[2001]: [211-1] LOG: restored log file > "00000001000001710000009B" from archive > Feb 7 12:43:52 dbms2 postgres[2001]: [212-1] LOG: restored log file > "00000001000001710000009C" from archive > Feb 7 12:43:52 dbms2 postgres[2001]: [213-1] LOG: restored log file > "00000001000001710000009D" from archive > Feb 7 12:43:53 dbms2 postgres[2001]: [214-1] LOG: restored log file > "00000001000001710000009E" from archive > Feb 7 12:43:53 dbms2 postgres[2001]: [215-1] LOG: restored log file > "00000001000001710000009F" from archive > Feb 7 12:43:54 dbms2 postgres[2001]: [216-1] LOG: restored log file > "0000000100000171000000A0" from archive > Feb 7 12:43:54 dbms2 postgres[2001]: [217-1] LOG: restored log file > "0000000100000171000000A1" from archive > Feb 7 12:43:55 dbms2 postgres[2001]: [218-1] LOG: restored log file > "0000000100000171000000A2" from archive > Feb 7 12:43:55 dbms2 postgres[2001]: [219-1] LOG: restored log file > "0000000100000171000000A3" from archive > Feb 7 12:43:56 dbms2 postgres[2001]: [220-1] LOG: restored log file > "0000000100000171000000A4" from archive > Feb 7 12:43:56 dbms2 postgres[2001]: [221-1] LOG: restored log file > "0000000100000171000000A5" from archive > Feb 7 12:43:57 dbms2 postgres[2001]: [222-1] LOG: restored log file > "0000000100000171000000A6" from archive > Feb 7 12:43:57 dbms2 postgres[2001]: [223-1] PANIC: corrupted page > pointers: lower = 772, upper = 616, special = 0 > Feb 7 12:43:57 dbms2 postgres[2001]: [223-2] CONTEXT: xlog redo > hot_update: rel 1663/616245/1193269; tid 53/93; new 53/4 > Feb 7 12:43:57 dbms2 postgres[2000]: [1-1] LOG: startup process (PID > 2001) was terminated by signal 6: Abort trap > Feb 7 12:43:57 dbms2 postgres[2000]: [2-1] LOG: terminating any other > active server processes > > Eek. > > I assume this means that either A6 or A7 is corrupt. But I have the > file both in the restore AND ON THE MACHINE WHERE IT ORIGINATED: > > On the SOURCE machine (which is running just fine): > tickerforum# cksum *171*A[67]* > 172998591 830621 0000000100000171000000A6.bz2 > 1283345296 1541006 0000000100000171000000A7.bz2 > > And off the BACKUP archive, which is what I'm trying to restore: > > # cksum *171*A[67]* > 172998591 830621 0000000100000171000000A6.bz2 > 1283345296 1541006 0000000100000171000000A7.bz2 > > Identical, says the checksums. > > This is VERY BAD - if pg_compresslog is damaging the files in some > instances then ANY BACKUP TAKEN USING THEM IS SUSPECT AND MAY NOT > RESTORE!!!!!! > > Needless to say this is a MAJOR problem. > > -- Karl Denninger > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
I found it's pg_compresslog problem (calculation of XNOOP record length used in pg_decompresslog). I'm fixing the bug and will upload the fix shortly. Sorry for inconvenience. ------------------ Koichi Suzuki 2010/2/8 Karl Denninger <karl@denninger.net>: > This may belong in a bug report, but I'll post it here first... > > There appears to be a **SERIOUS** problem with using pg_compresslog and > pg_uncompresslog with Postgresql 8.4.2. > > Here's my configuration snippet: > > full_page_writes = on # recover from partial page writes > wal_buffers = 256kB # min 32kB > # (change requires restart) > #wal_writer_delay = 200ms # 1-10000 milliseconds > > #commit_delay = 0 # range 0-100000, in microseconds > #commit_siblings = 5 # range 1-1000 > > # - Checkpoints - > > checkpoint_segments = 64 # in logfile segments, min 1, > 16MB each > #checkpoint_timeout = 5min # range 30s-1h > checkpoint_completion_target = 0.9 # checkpoint target duration, > 0.0 - 1.0 > #checkpoint_warning = 30s # 0 disables > > archive_command = 'test ! -f /dbms/pg_archive/%f.bz2 && pg_compresslog > %p | bzip2 - >/dbms/pg_archive/%f.bz2' #command to use to > archive a logfile segment > > All appears to be fine with the writes, and they are being saved off on > the nightly backups without incident. > > I take a full dump using the instructions in the documentation and make > sure I copy the proper "must have" file for consistency to be reached. > > The problem comes when I try to restore. > > recovery_conf contains: > > restore_command = '/usr/local/pgsql/recovery.sh %f %p' > > And that file contains: > > > #! /bin/sh > > infile=$1 > outfile=$2 > > if test -f /dbms/pg_archive/$infile.bz2 > then > bunzip2 -c /dbms/pg_archive/$infile.bz2 | > /usr/local/pgsql/bin/pg_decompresslog - $outfile > exit 0 > else > exit 1 > fi > > ============== > > The problem is that it appears that some of the segments being saved are > no good! On occasion I get this when trying to restore... > > Feb 7 12:43:51 dbms2 postgres[2001]: [210-1] LOG: restored log file > "00000001000001710000009A" from archive > Feb 7 12:43:52 dbms2 postgres[2001]: [211-1] LOG: restored log file > "00000001000001710000009B" from archive > Feb 7 12:43:52 dbms2 postgres[2001]: [212-1] LOG: restored log file > "00000001000001710000009C" from archive > Feb 7 12:43:52 dbms2 postgres[2001]: [213-1] LOG: restored log file > "00000001000001710000009D" from archive > Feb 7 12:43:53 dbms2 postgres[2001]: [214-1] LOG: restored log file > "00000001000001710000009E" from archive > Feb 7 12:43:53 dbms2 postgres[2001]: [215-1] LOG: restored log file > "00000001000001710000009F" from archive > Feb 7 12:43:54 dbms2 postgres[2001]: [216-1] LOG: restored log file > "0000000100000171000000A0" from archive > Feb 7 12:43:54 dbms2 postgres[2001]: [217-1] LOG: restored log file > "0000000100000171000000A1" from archive > Feb 7 12:43:55 dbms2 postgres[2001]: [218-1] LOG: restored log file > "0000000100000171000000A2" from archive > Feb 7 12:43:55 dbms2 postgres[2001]: [219-1] LOG: restored log file > "0000000100000171000000A3" from archive > Feb 7 12:43:56 dbms2 postgres[2001]: [220-1] LOG: restored log file > "0000000100000171000000A4" from archive > Feb 7 12:43:56 dbms2 postgres[2001]: [221-1] LOG: restored log file > "0000000100000171000000A5" from archive > Feb 7 12:43:57 dbms2 postgres[2001]: [222-1] LOG: restored log file > "0000000100000171000000A6" from archive > Feb 7 12:43:57 dbms2 postgres[2001]: [223-1] PANIC: corrupted page > pointers: lower = 772, upper = 616, special = 0 > Feb 7 12:43:57 dbms2 postgres[2001]: [223-2] CONTEXT: xlog redo > hot_update: rel 1663/616245/1193269; tid 53/93; new 53/4 > Feb 7 12:43:57 dbms2 postgres[2000]: [1-1] LOG: startup process (PID > 2001) was terminated by signal 6: Abort trap > Feb 7 12:43:57 dbms2 postgres[2000]: [2-1] LOG: terminating any other > active server processes > > Eek. > > I assume this means that either A6 or A7 is corrupt. But I have the > file both in the restore AND ON THE MACHINE WHERE IT ORIGINATED: > > On the SOURCE machine (which is running just fine): > tickerforum# cksum *171*A[67]* > 172998591 830621 0000000100000171000000A6.bz2 > 1283345296 1541006 0000000100000171000000A7.bz2 > > And off the BACKUP archive, which is what I'm trying to restore: > > # cksum *171*A[67]* > 172998591 830621 0000000100000171000000A6.bz2 > 1283345296 1541006 0000000100000171000000A7.bz2 > > Identical, says the checksums. > > This is VERY BAD - if pg_compresslog is damaging the files in some > instances then ANY BACKUP TAKEN USING THEM IS SUSPECT AND MAY NOT > RESTORE!!!!!! > > Needless to say this is a MAJOR problem. > > -- Karl Denninger > > > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general > > -- ------ Koichi Suzuki
Will this come through as a commit on the pgfoundry codebase? I've subscribed looking for it....
The last edit, if I read the release notes and tracebacks on the codebase correctly, goes back to the early part of 2009 - which strongly implies that there are a **LOT** of people out there that could be running this code with un-restoreable archives!
That, for obvious reasons, could be VERY, VERY bad if someone was to suffer a system crash....
Koichi Suzuki wrote:
The last edit, if I read the release notes and tracebacks on the codebase correctly, goes back to the early part of 2009 - which strongly implies that there are a **LOT** of people out there that could be running this code with un-restoreable archives!
That, for obvious reasons, could be VERY, VERY bad if someone was to suffer a system crash....
Koichi Suzuki wrote:
I found it's pg_compresslog problem (calculation of XNOOP record length used in pg_decompresslog). I'm fixing the bug and will upload the fix shortly. Sorry for inconvenience. ------------------ Koichi Suzuki 2010/2/8 Karl Denninger <karl@denninger.net>:This may belong in a bug report, but I'll post it here first... There appears to be a **SERIOUS** problem with using pg_compresslog and pg_uncompresslog with Postgresql 8.4.2. Here's my configuration snippet: full_page_writes = on # recover from partial page writes wal_buffers = 256kB # min 32kB # (change requires restart) #wal_writer_delay = 200ms # 1-10000 milliseconds #commit_delay = 0 # range 0-100000, in microseconds #commit_siblings = 5 # range 1-1000 # - Checkpoints - checkpoint_segments = 64 # in logfile segments, min 1, 16MB each #checkpoint_timeout = 5min # range 30s-1h checkpoint_completion_target = 0.9 # checkpoint target duration, 0.0 - 1.0 #checkpoint_warning = 30s # 0 disables archive_command = 'test ! -f /dbms/pg_archive/%f.bz2 && pg_compresslog %p | bzip2 - >/dbms/pg_archive/%f.bz2' #command to use to archive a logfile segment All appears to be fine with the writes, and they are being saved off on the nightly backups without incident. I take a full dump using the instructions in the documentation and make sure I copy the proper "must have" file for consistency to be reached. The problem comes when I try to restore. recovery_conf contains: restore_command = '/usr/local/pgsql/recovery.sh %f %p' And that file contains: #! /bin/sh infile=$1 outfile=$2 if test -f /dbms/pg_archive/$infile.bz2 then bunzip2 -c /dbms/pg_archive/$infile.bz2 | /usr/local/pgsql/bin/pg_decompresslog - $outfile exit 0 else exit 1 fi ============== The problem is that it appears that some of the segments being saved are no good! On occasion I get this when trying to restore... Feb 7 12:43:51 dbms2 postgres[2001]: [210-1] LOG: restored log file "00000001000001710000009A" from archive Feb 7 12:43:52 dbms2 postgres[2001]: [211-1] LOG: restored log file "00000001000001710000009B" from archive Feb 7 12:43:52 dbms2 postgres[2001]: [212-1] LOG: restored log file "00000001000001710000009C" from archive Feb 7 12:43:52 dbms2 postgres[2001]: [213-1] LOG: restored log file "00000001000001710000009D" from archive Feb 7 12:43:53 dbms2 postgres[2001]: [214-1] LOG: restored log file "00000001000001710000009E" from archive Feb 7 12:43:53 dbms2 postgres[2001]: [215-1] LOG: restored log file "00000001000001710000009F" from archive Feb 7 12:43:54 dbms2 postgres[2001]: [216-1] LOG: restored log file "0000000100000171000000A0" from archive Feb 7 12:43:54 dbms2 postgres[2001]: [217-1] LOG: restored log file "0000000100000171000000A1" from archive Feb 7 12:43:55 dbms2 postgres[2001]: [218-1] LOG: restored log file "0000000100000171000000A2" from archive Feb 7 12:43:55 dbms2 postgres[2001]: [219-1] LOG: restored log file "0000000100000171000000A3" from archive Feb 7 12:43:56 dbms2 postgres[2001]: [220-1] LOG: restored log file "0000000100000171000000A4" from archive Feb 7 12:43:56 dbms2 postgres[2001]: [221-1] LOG: restored log file "0000000100000171000000A5" from archive Feb 7 12:43:57 dbms2 postgres[2001]: [222-1] LOG: restored log file "0000000100000171000000A6" from archive Feb 7 12:43:57 dbms2 postgres[2001]: [223-1] PANIC: corrupted page pointers: lower = 772, upper = 616, special = 0 Feb 7 12:43:57 dbms2 postgres[2001]: [223-2] CONTEXT: xlog redo hot_update: rel 1663/616245/1193269; tid 53/93; new 53/4 Feb 7 12:43:57 dbms2 postgres[2000]: [1-1] LOG: startup process (PID 2001) was terminated by signal 6: Abort trap Feb 7 12:43:57 dbms2 postgres[2000]: [2-1] LOG: terminating any other active server processes Eek. I assume this means that either A6 or A7 is corrupt. But I have the file both in the restore AND ON THE MACHINE WHERE IT ORIGINATED: On the SOURCE machine (which is running just fine): tickerforum# cksum *171*A[67]* 172998591 830621 0000000100000171000000A6.bz2 1283345296 1541006 0000000100000171000000A7.bz2 And off the BACKUP archive, which is what I'm trying to restore: # cksum *171*A[67]* 172998591 830621 0000000100000171000000A6.bz2 1283345296 1541006 0000000100000171000000A7.bz2 Identical, says the checksums. This is VERY BAD - if pg_compresslog is damaging the files in some instances then ANY BACKUP TAKEN USING THEM IS SUSPECT AND MAY NOT RESTORE!!!!!! Needless to say this is a MAJOR problem. -- Karl Denninger -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Attachment
I understand the situation. I'll upload the improved code ASAP. ---------- Koichi Suzuki 2010/2/11 Karl Denninger <karl@denninger.net>: > Will this come through as a commit on the pgfoundry codebase? I've > subscribed looking for it.... > > The last edit, if I read the release notes and tracebacks on the codebase > correctly, goes back to the early part of 2009 - which strongly implies that > there are a **LOT** of people out there that could be running this code with > un-restoreable archives! > > That, for obvious reasons, could be VERY, VERY bad if someone was to suffer > a system crash.... > > > Koichi Suzuki wrote: > > I found it's pg_compresslog problem (calculation of XNOOP record > length used in pg_decompresslog). I'm fixing the bug and will > upload the fix shortly. > > Sorry for inconvenience. > > ------------------ > Koichi Suzuki > > 2010/2/8 Karl Denninger <karl@denninger.net>: > > > This may belong in a bug report, but I'll post it here first... > > There appears to be a **SERIOUS** problem with using pg_compresslog and > pg_uncompresslog with Postgresql 8.4.2. > > Here's my configuration snippet: > > full_page_writes = on # recover from partial page writes > wal_buffers = 256kB # min 32kB > # (change requires restart) > #wal_writer_delay = 200ms # 1-10000 milliseconds > > #commit_delay = 0 # range 0-100000, in microseconds > #commit_siblings = 5 # range 1-1000 > > # - Checkpoints - > > checkpoint_segments = 64 # in logfile segments, min 1, > 16MB each > #checkpoint_timeout = 5min # range 30s-1h > checkpoint_completion_target = 0.9 # checkpoint target duration, > 0.0 - 1.0 > #checkpoint_warning = 30s # 0 disables > > archive_command = 'test ! -f /dbms/pg_archive/%f.bz2 && pg_compresslog > %p | bzip2 - >/dbms/pg_archive/%f.bz2' #command to use to > archive a logfile segment > > All appears to be fine with the writes, and they are being saved off on > the nightly backups without incident. > > I take a full dump using the instructions in the documentation and make > sure I copy the proper "must have" file for consistency to be reached. > > The problem comes when I try to restore. > > recovery_conf contains: > > restore_command = '/usr/local/pgsql/recovery.sh %f %p' > > And that file contains: > > > #! /bin/sh > > infile=$1 > outfile=$2 > > if test -f /dbms/pg_archive/$infile.bz2 > then > bunzip2 -c /dbms/pg_archive/$infile.bz2 | > /usr/local/pgsql/bin/pg_decompresslog - $outfile > exit 0 > else > exit 1 > fi > > ============== > > The problem is that it appears that some of the segments being saved are > no good! On occasion I get this when trying to restore... > > Feb 7 12:43:51 dbms2 postgres[2001]: [210-1] LOG: restored log file > "00000001000001710000009A" from archive > Feb 7 12:43:52 dbms2 postgres[2001]: [211-1] LOG: restored log file > "00000001000001710000009B" from archive > Feb 7 12:43:52 dbms2 postgres[2001]: [212-1] LOG: restored log file > "00000001000001710000009C" from archive > Feb 7 12:43:52 dbms2 postgres[2001]: [213-1] LOG: restored log file > "00000001000001710000009D" from archive > Feb 7 12:43:53 dbms2 postgres[2001]: [214-1] LOG: restored log file > "00000001000001710000009E" from archive > Feb 7 12:43:53 dbms2 postgres[2001]: [215-1] LOG: restored log file > "00000001000001710000009F" from archive > Feb 7 12:43:54 dbms2 postgres[2001]: [216-1] LOG: restored log file > "0000000100000171000000A0" from archive > Feb 7 12:43:54 dbms2 postgres[2001]: [217-1] LOG: restored log file > "0000000100000171000000A1" from archive > Feb 7 12:43:55 dbms2 postgres[2001]: [218-1] LOG: restored log file > "0000000100000171000000A2" from archive > Feb 7 12:43:55 dbms2 postgres[2001]: [219-1] LOG: restored log file > "0000000100000171000000A3" from archive > Feb 7 12:43:56 dbms2 postgres[2001]: [220-1] LOG: restored log file > "0000000100000171000000A4" from archive > Feb 7 12:43:56 dbms2 postgres[2001]: [221-1] LOG: restored log file > "0000000100000171000000A5" from archive > Feb 7 12:43:57 dbms2 postgres[2001]: [222-1] LOG: restored log file > "0000000100000171000000A6" from archive > Feb 7 12:43:57 dbms2 postgres[2001]: [223-1] PANIC: corrupted page > pointers: lower = 772, upper = 616, special = 0 > Feb 7 12:43:57 dbms2 postgres[2001]: [223-2] CONTEXT: xlog redo > hot_update: rel 1663/616245/1193269; tid 53/93; new 53/4 > Feb 7 12:43:57 dbms2 postgres[2000]: [1-1] LOG: startup process (PID > 2001) was terminated by signal 6: Abort trap > Feb 7 12:43:57 dbms2 postgres[2000]: [2-1] LOG: terminating any other > active server processes > > Eek. > > I assume this means that either A6 or A7 is corrupt. But I have the > file both in the restore AND ON THE MACHINE WHERE IT ORIGINATED: > > On the SOURCE machine (which is running just fine): > tickerforum# cksum *171*A[67]* > 172998591 830621 0000000100000171000000A6.bz2 > 1283345296 1541006 0000000100000171000000A7.bz2 > > And off the BACKUP archive, which is what I'm trying to restore: > > # cksum *171*A[67]* > 172998591 830621 0000000100000171000000A6.bz2 > 1283345296 1541006 0000000100000171000000A7.bz2 > > Identical, says the checksums. > > This is VERY BAD - if pg_compresslog is damaging the files in some > instances then ANY BACKUP TAKEN USING THEM IS SUSPECT AND MAY NOT > RESTORE!!!!!! > > Needless to say this is a MAJOR problem. > > -- Karl Denninger > > > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general > > > > >
Not yet. I'm rebuilding the test suit for better testing. If you're in a hurry, I can send you an update for test. ---------- Koichi Suzuki 2010/4/19 Karl Denninger <karl@denninger.net>: > Has there been an update on this situation? > > Koichi Suzuki wrote: > > I understand the situation. I'll upload the improved code ASAP. > > ---------- > Koichi Suzuki > > > > 2010/2/11 Karl Denninger <karl@denninger.net>: > > > Will this come through as a commit on the pgfoundry codebase? I've > subscribed looking for it.... > > The last edit, if I read the release notes and tracebacks on the codebase > correctly, goes back to the early part of 2009 - which strongly implies that > there are a **LOT** of people out there that could be running this code with > un-restoreable archives! > > That, for obvious reasons, could be VERY, VERY bad if someone was to suffer > a system crash.... > > > Koichi Suzuki wrote: > > I found it's pg_compresslog problem (calculation of XNOOP record > length used in pg_decompresslog). I'm fixing the bug and will > upload the fix shortly. > > Sorry for inconvenience. > > ------------------ > Koichi Suzuki > > 2010/2/8 Karl Denninger <karl@denninger.net>: > > > This may belong in a bug report, but I'll post it here first... > > There appears to be a **SERIOUS** problem with using pg_compresslog and > pg_uncompresslog with Postgresql 8.4.2. > > Here's my configuration snippet: > > full_page_writes = on # recover from partial page writes > wal_buffers = 256kB # min 32kB > # (change requires restart) > #wal_writer_delay = 200ms # 1-10000 milliseconds > > #commit_delay = 0 # range 0-100000, in microseconds > #commit_siblings = 5 # range 1-1000 > > # - Checkpoints - > > checkpoint_segments = 64 # in logfile segments, min 1, > 16MB each > #checkpoint_timeout = 5min # range 30s-1h > checkpoint_completion_target = 0.9 # checkpoint target duration, > 0.0 - 1.0 > #checkpoint_warning = 30s # 0 disables > > archive_command = 'test ! -f /dbms/pg_archive/%f.bz2 && pg_compresslog > %p | bzip2 - >/dbms/pg_archive/%f.bz2' #command to use to > archive a logfile segment > > All appears to be fine with the writes, and they are being saved off on > the nightly backups without incident. > > I take a full dump using the instructions in the documentation and make > sure I copy the proper "must have" file for consistency to be reached. > > The problem comes when I try to restore. > > recovery_conf contains: > > restore_command = '/usr/local/pgsql/recovery.sh %f %p' > > And that file contains: > > > #! /bin/sh > > infile=$1 > outfile=$2 > > if test -f /dbms/pg_archive/$infile.bz2 > then > bunzip2 -c /dbms/pg_archive/$infile.bz2 | > /usr/local/pgsql/bin/pg_decompresslog - $outfile > exit 0 > else > exit 1 > fi > > ============== > > The problem is that it appears that some of the segments being saved are > no good! On occasion I get this when trying to restore... > > Feb 7 12:43:51 dbms2 postgres[2001]: [210-1] LOG: restored log file > "00000001000001710000009A" from archive > Feb 7 12:43:52 dbms2 postgres[2001]: [211-1] LOG: restored log file > "00000001000001710000009B" from archive > Feb 7 12:43:52 dbms2 postgres[2001]: [212-1] LOG: restored log file > "00000001000001710000009C" from archive > Feb 7 12:43:52 dbms2 postgres[2001]: [213-1] LOG: restored log file > "00000001000001710000009D" from archive > Feb 7 12:43:53 dbms2 postgres[2001]: [214-1] LOG: restored log file > "00000001000001710000009E" from archive > Feb 7 12:43:53 dbms2 postgres[2001]: [215-1] LOG: restored log file > "00000001000001710000009F" from archive > Feb 7 12:43:54 dbms2 postgres[2001]: [216-1] LOG: restored log file > "0000000100000171000000A0" from archive > Feb 7 12:43:54 dbms2 postgres[2001]: [217-1] LOG: restored log file > "0000000100000171000000A1" from archive > Feb 7 12:43:55 dbms2 postgres[2001]: [218-1] LOG: restored log file > "0000000100000171000000A2" from archive > Feb 7 12:43:55 dbms2 postgres[2001]: [219-1] LOG: restored log file > "0000000100000171000000A3" from archive > Feb 7 12:43:56 dbms2 postgres[2001]: [220-1] LOG: restored log file > "0000000100000171000000A4" from archive > Feb 7 12:43:56 dbms2 postgres[2001]: [221-1] LOG: restored log file > "0000000100000171000000A5" from archive > Feb 7 12:43:57 dbms2 postgres[2001]: [222-1] LOG: restored log file > "0000000100000171000000A6" from archive > Feb 7 12:43:57 dbms2 postgres[2001]: [223-1] PANIC: corrupted page > pointers: lower = 772, upper = 616, special = 0 > Feb 7 12:43:57 dbms2 postgres[2001]: [223-2] CONTEXT: xlog redo > hot_update: rel 1663/616245/1193269; tid 53/93; new 53/4 > Feb 7 12:43:57 dbms2 postgres[2000]: [1-1] LOG: startup process (PID > 2001) was terminated by signal 6: Abort trap > Feb 7 12:43:57 dbms2 postgres[2000]: [2-1] LOG: terminating any other > active server processes > > Eek. > > I assume this means that either A6 or A7 is corrupt. But I have the > file both in the restore AND ON THE MACHINE WHERE IT ORIGINATED: > > On the SOURCE machine (which is running just fine): > tickerforum# cksum *171*A[67]* > 172998591 830621 0000000100000171000000A6.bz2 > 1283345296 1541006 0000000100000171000000A7.bz2 > > And off the BACKUP archive, which is what I'm trying to restore: > > # cksum *171*A[67]* > 172998591 830621 0000000100000171000000A6.bz2 > 1283345296 1541006 0000000100000171000000A7.bz2 > > Identical, says the checksums. > > This is VERY BAD - if pg_compresslog is damaging the files in some > instances then ANY BACKUP TAKEN USING THEM IS SUSPECT AND MAY NOT > RESTORE!!!!!! > > Needless to say this is a MAJOR problem. > > -- Karl Denninger > > > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general > > > > > > > >
Has there been an update on this situation?
Koichi Suzuki wrote:
Koichi Suzuki wrote:
I understand the situation. I'll upload the improved code ASAP. ---------- Koichi Suzuki 2010/2/11 Karl Denninger <karl@denninger.net>:Will this come through as a commit on the pgfoundry codebase? I've subscribed looking for it.... The last edit, if I read the release notes and tracebacks on the codebase correctly, goes back to the early part of 2009 - which strongly implies that there are a **LOT** of people out there that could be running this code with un-restoreable archives! That, for obvious reasons, could be VERY, VERY bad if someone was to suffer a system crash.... Koichi Suzuki wrote: I found it's pg_compresslog problem (calculation of XNOOP record length used in pg_decompresslog). I'm fixing the bug and will upload the fix shortly. Sorry for inconvenience. ------------------ Koichi Suzuki 2010/2/8 Karl Denninger <karl@denninger.net>: This may belong in a bug report, but I'll post it here first... There appears to be a **SERIOUS** problem with using pg_compresslog and pg_uncompresslog with Postgresql 8.4.2. Here's my configuration snippet: full_page_writes = on # recover from partial page writes wal_buffers = 256kB # min 32kB # (change requires restart) #wal_writer_delay = 200ms # 1-10000 milliseconds #commit_delay = 0 # range 0-100000, in microseconds #commit_siblings = 5 # range 1-1000 # - Checkpoints - checkpoint_segments = 64 # in logfile segments, min 1, 16MB each #checkpoint_timeout = 5min # range 30s-1h checkpoint_completion_target = 0.9 # checkpoint target duration, 0.0 - 1.0 #checkpoint_warning = 30s # 0 disables archive_command = 'test ! -f /dbms/pg_archive/%f.bz2 && pg_compresslog %p | bzip2 - >/dbms/pg_archive/%f.bz2' #command to use to archive a logfile segment All appears to be fine with the writes, and they are being saved off on the nightly backups without incident. I take a full dump using the instructions in the documentation and make sure I copy the proper "must have" file for consistency to be reached. The problem comes when I try to restore. recovery_conf contains: restore_command = '/usr/local/pgsql/recovery.sh %f %p' And that file contains: #! /bin/sh infile=$1 outfile=$2 if test -f /dbms/pg_archive/$infile.bz2 then bunzip2 -c /dbms/pg_archive/$infile.bz2 | /usr/local/pgsql/bin/pg_decompresslog - $outfile exit 0 else exit 1 fi ============== The problem is that it appears that some of the segments being saved are no good! On occasion I get this when trying to restore... Feb 7 12:43:51 dbms2 postgres[2001]: [210-1] LOG: restored log file "00000001000001710000009A" from archive Feb 7 12:43:52 dbms2 postgres[2001]: [211-1] LOG: restored log file "00000001000001710000009B" from archive Feb 7 12:43:52 dbms2 postgres[2001]: [212-1] LOG: restored log file "00000001000001710000009C" from archive Feb 7 12:43:52 dbms2 postgres[2001]: [213-1] LOG: restored log file "00000001000001710000009D" from archive Feb 7 12:43:53 dbms2 postgres[2001]: [214-1] LOG: restored log file "00000001000001710000009E" from archive Feb 7 12:43:53 dbms2 postgres[2001]: [215-1] LOG: restored log file "00000001000001710000009F" from archive Feb 7 12:43:54 dbms2 postgres[2001]: [216-1] LOG: restored log file "0000000100000171000000A0" from archive Feb 7 12:43:54 dbms2 postgres[2001]: [217-1] LOG: restored log file "0000000100000171000000A1" from archive Feb 7 12:43:55 dbms2 postgres[2001]: [218-1] LOG: restored log file "0000000100000171000000A2" from archive Feb 7 12:43:55 dbms2 postgres[2001]: [219-1] LOG: restored log file "0000000100000171000000A3" from archive Feb 7 12:43:56 dbms2 postgres[2001]: [220-1] LOG: restored log file "0000000100000171000000A4" from archive Feb 7 12:43:56 dbms2 postgres[2001]: [221-1] LOG: restored log file "0000000100000171000000A5" from archive Feb 7 12:43:57 dbms2 postgres[2001]: [222-1] LOG: restored log file "0000000100000171000000A6" from archive Feb 7 12:43:57 dbms2 postgres[2001]: [223-1] PANIC: corrupted page pointers: lower = 772, upper = 616, special = 0 Feb 7 12:43:57 dbms2 postgres[2001]: [223-2] CONTEXT: xlog redo hot_update: rel 1663/616245/1193269; tid 53/93; new 53/4 Feb 7 12:43:57 dbms2 postgres[2000]: [1-1] LOG: startup process (PID 2001) was terminated by signal 6: Abort trap Feb 7 12:43:57 dbms2 postgres[2000]: [2-1] LOG: terminating any other active server processes Eek. I assume this means that either A6 or A7 is corrupt. But I have the file both in the restore AND ON THE MACHINE WHERE IT ORIGINATED: On the SOURCE machine (which is running just fine): tickerforum# cksum *171*A[67]* 172998591 830621 0000000100000171000000A6.bz2 1283345296 1541006 0000000100000171000000A7.bz2 And off the BACKUP archive, which is what I'm trying to restore: # cksum *171*A[67]* 172998591 830621 0000000100000171000000A6.bz2 1283345296 1541006 0000000100000171000000A7.bz2 Identical, says the checksums. This is VERY BAD - if pg_compresslog is damaging the files in some instances then ANY BACKUP TAKEN USING THEM IS SUSPECT AND MAY NOT RESTORE!!!!!! Needless to say this is a MAJOR problem. -- Karl Denninger -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Attachment
Not in a huge hurry (fixed right is better than fixed fast); just trying to figure out where the fix is in terms of progress. Koichi Suzuki wrote: > Not yet. I'm rebuilding the test suit for better testing. If you're > in a hurry, I can send you an update for test. > > ---------- > Koichi Suzuki > > > > 2010/4/19 Karl Denninger <karl@denninger.net>: > >> Has there been an update on this situation? >> >> Koichi Suzuki wrote: >> >> I understand the situation. I'll upload the improved code ASAP. >> >> ---------- >> Koichi Suzuki >> >> >> >> 2010/2/11 Karl Denninger <karl@denninger.net>: >> >> >> Will this come through as a commit on the pgfoundry codebase? I've >> subscribed looking for it.... >> >> The last edit, if I read the release notes and tracebacks on the codebase >> correctly, goes back to the early part of 2009 - which strongly implies that >> there are a **LOT** of people out there that could be running this code with >> un-restoreable archives! >> >> That, for obvious reasons, could be VERY, VERY bad if someone was to suffer >> a system crash.... >> >> >> Koichi Suzuki wrote: >> >> I found it's pg_compresslog problem (calculation of XNOOP record >> length used in pg_decompresslog). I'm fixing the bug and will >> upload the fix shortly. >> >> Sorry for inconvenience. >> >> ------------------ >> Koichi Suzuki >> >> 2010/2/8 Karl Denninger <karl@denninger.net>: >> >> >> This may belong in a bug report, but I'll post it here first... >> >> There appears to be a **SERIOUS** problem with using pg_compresslog and >> pg_uncompresslog with Postgresql 8.4.2. >> >> Here's my configuration snippet: >> >> full_page_writes =n # recover from partial page writes >> wal_buffers =56kB # min 32kB >> # (change requires restart) >> #wal_writer_delay =00ms # 1-10000 milliseconds >> >> #commit_delay = # range 0-100000, in microseconds >> #commit_siblings = # range 1-1000 >> >> # - Checkpoints - >> >> checkpoint_segments =4 # in logfile segments, min 1, >> 16MB each >> #checkpoint_timeout =min # range 30s-1h >> checkpoint_completion_target =.9 # checkpoint target duration, >> 0.0 - 1.0 >> #checkpoint_warning =0s # 0 disables >> >> archive_command =test ! -f /dbms/pg_archive/%f.bz2 && pg_compresslog >> %p | bzip2 - >/dbms/pg_archive/%f.bz2' #command to use to >> archive a logfile segment >> >> All appears to be fine with the writes, and they are being saved off on >> the nightly backups without incident. >> >> I take a full dump using the instructions in the documentation and make >> sure I copy the proper "must have" file for consistency to be reached. >> >> The problem comes when I try to restore. >> >> recovery_conf contains: >> >> restore_command =/usr/local/pgsql/recovery.sh %f %p' >> >> And that file contains: >> >> >> #! /bin/sh >> >> infile= >> outfile= >> >> if test -f /dbms/pg_archive/$infile.bz2 >> then >> bunzip2 -c /dbms/pg_archive/$infile.bz2 | >> /usr/local/pgsql/bin/pg_decompresslog - $outfile >> exit 0 >> else >> exit 1 >> fi >> >> ============ >> >> The problem is that it appears that some of the segments being saved are >> no good! On occasion I get this when trying to restore... >> >> Feb 7 12:43:51 dbms2 postgres[2001]: [210-1] LOG: restored log file >> "00000001000001710000009A" from archive >> Feb 7 12:43:52 dbms2 postgres[2001]: [211-1] LOG: restored log file >> "00000001000001710000009B" from archive >> Feb 7 12:43:52 dbms2 postgres[2001]: [212-1] LOG: restored log file >> "00000001000001710000009C" from archive >> Feb 7 12:43:52 dbms2 postgres[2001]: [213-1] LOG: restored log file >> "00000001000001710000009D" from archive >> Feb 7 12:43:53 dbms2 postgres[2001]: [214-1] LOG: restored log file >> "00000001000001710000009E" from archive >> Feb 7 12:43:53 dbms2 postgres[2001]: [215-1] LOG: restored log file >> "00000001000001710000009F" from archive >> Feb 7 12:43:54 dbms2 postgres[2001]: [216-1] LOG: restored log file >> "0000000100000171000000A0" from archive >> Feb 7 12:43:54 dbms2 postgres[2001]: [217-1] LOG: restored log file >> "0000000100000171000000A1" from archive >> Feb 7 12:43:55 dbms2 postgres[2001]: [218-1] LOG: restored log file >> "0000000100000171000000A2" from archive >> Feb 7 12:43:55 dbms2 postgres[2001]: [219-1] LOG: restored log file >> "0000000100000171000000A3" from archive >> Feb 7 12:43:56 dbms2 postgres[2001]: [220-1] LOG: restored log file >> "0000000100000171000000A4" from archive >> Feb 7 12:43:56 dbms2 postgres[2001]: [221-1] LOG: restored log file >> "0000000100000171000000A5" from archive >> Feb 7 12:43:57 dbms2 postgres[2001]: [222-1] LOG: restored log file >> "0000000100000171000000A6" from archive >> Feb 7 12:43:57 dbms2 postgres[2001]: [223-1] PANIC: corrupted page >> pointers: lower =72, upper = 616, special = 0 >> Feb 7 12:43:57 dbms2 postgres[2001]: [223-2] CONTEXT: xlog redo >> hot_update: rel 1663/616245/1193269; tid 53/93; new 53/4 >> Feb 7 12:43:57 dbms2 postgres[2000]: [1-1] LOG: startup process (PID >> 2001) was terminated by signal 6: Abort trap >> Feb 7 12:43:57 dbms2 postgres[2000]: [2-1] LOG: terminating any other >> active server processes >> >> Eek. >> >> I assume this means that either A6 or A7 is corrupt. But I have the >> file both in the restore AND ON THE MACHINE WHERE IT ORIGINATED: >> >> On the SOURCE machine (which is running just fine): >> tickerforum# cksum *171*A[67]* >> 172998591 830621 0000000100000171000000A6.bz2 >> 1283345296 1541006 0000000100000171000000A7.bz2 >> >> And off the BACKUP archive, which is what I'm trying to restore: >> >> # cksum *171*A[67]* >> 172998591 830621 0000000100000171000000A6.bz2 >> 1283345296 1541006 0000000100000171000000A7.bz2 >> >> Identical, says the checksums. >> >> This is VERY BAD - if pg_compresslog is damaging the files in some >> instances then ANY BACKUP TAKEN USING THEM IS SUSPECT AND MAY NOT >> RESTORE!!!!!! >> >> Needless to say this is a MAJOR problem. >> >> -- Karl Denninger >> >> >> >> -- >> Sent via pgsql-general mailing list (pgsql-general@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-general >> >> >> >> >> >> >> >> >>