Thread: Problem with pg_compresslog'd archives

Problem with pg_compresslog'd archives

From

Karl Denninger

Date:

07 February 2010, 19:03:07

This may belong in a bug report, but I'll post it here first...

There appears to be a **SERIOUS** problem with using pg_compresslog and
pg_uncompresslog with Postgresql 8.4.2.

Here's my configuration snippet:

full_page_writes = on                   # recover from partial page writes
wal_buffers = 256kB                     # min 32kB
                                        # (change requires restart)
#wal_writer_delay = 200ms               # 1-10000 milliseconds

#commit_delay = 0                       # range 0-100000, in microseconds
#commit_siblings = 5                    # range 1-1000

# - Checkpoints -

checkpoint_segments = 64                # in logfile segments, min 1,
16MB each
#checkpoint_timeout = 5min              # range 30s-1h
checkpoint_completion_target = 0.9      # checkpoint target duration,
0.0 - 1.0
#checkpoint_warning = 30s               # 0 disables

archive_command = 'test ! -f /dbms/pg_archive/%f.bz2 && pg_compresslog
%p | bzip2 - >/dbms/pg_archive/%f.bz2'           #command to use to
archive a logfile segment

All appears to be fine with the writes, and they are being saved off on
the nightly backups without incident.

I take a full dump using the instructions in the documentation and make
sure I copy the proper "must have" file for consistency to be reached.

The problem comes when I try to restore.

recovery_conf contains:

restore_command = '/usr/local/pgsql/recovery.sh %f %p'

And that file contains:


#! /bin/sh

infile=$1
outfile=$2

if test -f /dbms/pg_archive/$infile.bz2
then
        bunzip2 -c /dbms/pg_archive/$infile.bz2 |
/usr/local/pgsql/bin/pg_decompresslog - $outfile
        exit 0
else
        exit 1
fi

==============

The problem is that it appears that some of the segments being saved are
no good!  On occasion I get this when trying to restore...

Feb  7 12:43:51 dbms2 postgres[2001]: [210-1] LOG:  restored log file
"00000001000001710000009A" from archive
Feb  7 12:43:52 dbms2 postgres[2001]: [211-1] LOG:  restored log file
"00000001000001710000009B" from archive
Feb  7 12:43:52 dbms2 postgres[2001]: [212-1] LOG:  restored log file
"00000001000001710000009C" from archive
Feb  7 12:43:52 dbms2 postgres[2001]: [213-1] LOG:  restored log file
"00000001000001710000009D" from archive
Feb  7 12:43:53 dbms2 postgres[2001]: [214-1] LOG:  restored log file
"00000001000001710000009E" from archive
Feb  7 12:43:53 dbms2 postgres[2001]: [215-1] LOG:  restored log file
"00000001000001710000009F" from archive
Feb  7 12:43:54 dbms2 postgres[2001]: [216-1] LOG:  restored log file
"0000000100000171000000A0" from archive
Feb  7 12:43:54 dbms2 postgres[2001]: [217-1] LOG:  restored log file
"0000000100000171000000A1" from archive
Feb  7 12:43:55 dbms2 postgres[2001]: [218-1] LOG:  restored log file
"0000000100000171000000A2" from archive
Feb  7 12:43:55 dbms2 postgres[2001]: [219-1] LOG:  restored log file
"0000000100000171000000A3" from archive
Feb  7 12:43:56 dbms2 postgres[2001]: [220-1] LOG:  restored log file
"0000000100000171000000A4" from archive
Feb  7 12:43:56 dbms2 postgres[2001]: [221-1] LOG:  restored log file
"0000000100000171000000A5" from archive
Feb  7 12:43:57 dbms2 postgres[2001]: [222-1] LOG:  restored log file
"0000000100000171000000A6" from archive
Feb  7 12:43:57 dbms2 postgres[2001]: [223-1] PANIC:  corrupted page
pointers: lower = 772, upper = 616, special = 0
Feb  7 12:43:57 dbms2 postgres[2001]: [223-2] CONTEXT:  xlog redo
hot_update: rel 1663/616245/1193269; tid 53/93; new 53/4
Feb  7 12:43:57 dbms2 postgres[2000]: [1-1] LOG:  startup process (PID
2001) was terminated by signal 6: Abort trap
Feb  7 12:43:57 dbms2 postgres[2000]: [2-1] LOG:  terminating any other
active server processes

Eek.

I assume this means that either A6 or A7 is corrupt.  But I have the
file both in the restore AND ON THE MACHINE WHERE IT ORIGINATED:

On the SOURCE machine (which is running just fine):
tickerforum# cksum *171*A[67]*
172998591 830621 0000000100000171000000A6.bz2
1283345296 1541006 0000000100000171000000A7.bz2

And off the BACKUP archive, which is what I'm trying to restore:

# cksum *171*A[67]*
172998591 830621 0000000100000171000000A6.bz2
1283345296 1541006 0000000100000171000000A7.bz2

Identical, says the checksums.

This is VERY BAD - if pg_compresslog is damaging the files in some
instances then ANY BACKUP TAKEN USING THEM IS SUSPECT AND MAY NOT
RESTORE!!!!!!

Needless to say this is a MAJOR problem.

-- Karl Denninger

Attachment

karl.vcf

Re: Problem with pg_compresslog'd archives

From

Oleg Bartunov

Date:

07 February 2010, 19:18:16

Koichi-san,

seems, very similar problem.


Regards,
   Oleg

On Sun, 7 Feb 2010, Karl Denninger wrote:

> This may belong in a bug report, but I'll post it here first...
>
> There appears to be a **SERIOUS** problem with using pg_compresslog and
> pg_uncompresslog with Postgresql 8.4.2.
>
> Here's my configuration snippet:
>
> full_page_writes = on                   # recover from partial page writes
> wal_buffers = 256kB                     # min 32kB
>                                        # (change requires restart)
> #wal_writer_delay = 200ms               # 1-10000 milliseconds
>
> #commit_delay = 0                       # range 0-100000, in microseconds
> #commit_siblings = 5                    # range 1-1000
>
> # - Checkpoints -
>
> checkpoint_segments = 64                # in logfile segments, min 1,
> 16MB each
> #checkpoint_timeout = 5min              # range 30s-1h
> checkpoint_completion_target = 0.9      # checkpoint target duration,
> 0.0 - 1.0
> #checkpoint_warning = 30s               # 0 disables
>
> archive_command = 'test ! -f /dbms/pg_archive/%f.bz2 && pg_compresslog
> %p | bzip2 - >/dbms/pg_archive/%f.bz2'           #command to use to
> archive a logfile segment
>
> All appears to be fine with the writes, and they are being saved off on
> the nightly backups without incident.
>
> I take a full dump using the instructions in the documentation and make
> sure I copy the proper "must have" file for consistency to be reached.
>
> The problem comes when I try to restore.
>
> recovery_conf contains:
>
> restore_command = '/usr/local/pgsql/recovery.sh %f %p'
>
> And that file contains:
>
>
> #! /bin/sh
>
> infile=$1
> outfile=$2
>
> if test -f /dbms/pg_archive/$infile.bz2
> then
>        bunzip2 -c /dbms/pg_archive/$infile.bz2 |
> /usr/local/pgsql/bin/pg_decompresslog - $outfile
>        exit 0
> else
>        exit 1
> fi
>
> ==============
>
> The problem is that it appears that some of the segments being saved are
> no good!  On occasion I get this when trying to restore...
>
> Feb  7 12:43:51 dbms2 postgres[2001]: [210-1] LOG:  restored log file
> "00000001000001710000009A" from archive
> Feb  7 12:43:52 dbms2 postgres[2001]: [211-1] LOG:  restored log file
> "00000001000001710000009B" from archive
> Feb  7 12:43:52 dbms2 postgres[2001]: [212-1] LOG:  restored log file
> "00000001000001710000009C" from archive
> Feb  7 12:43:52 dbms2 postgres[2001]: [213-1] LOG:  restored log file
> "00000001000001710000009D" from archive
> Feb  7 12:43:53 dbms2 postgres[2001]: [214-1] LOG:  restored log file
> "00000001000001710000009E" from archive
> Feb  7 12:43:53 dbms2 postgres[2001]: [215-1] LOG:  restored log file
> "00000001000001710000009F" from archive
> Feb  7 12:43:54 dbms2 postgres[2001]: [216-1] LOG:  restored log file
> "0000000100000171000000A0" from archive
> Feb  7 12:43:54 dbms2 postgres[2001]: [217-1] LOG:  restored log file
> "0000000100000171000000A1" from archive
> Feb  7 12:43:55 dbms2 postgres[2001]: [218-1] LOG:  restored log file
> "0000000100000171000000A2" from archive
> Feb  7 12:43:55 dbms2 postgres[2001]: [219-1] LOG:  restored log file
> "0000000100000171000000A3" from archive
> Feb  7 12:43:56 dbms2 postgres[2001]: [220-1] LOG:  restored log file
> "0000000100000171000000A4" from archive
> Feb  7 12:43:56 dbms2 postgres[2001]: [221-1] LOG:  restored log file
> "0000000100000171000000A5" from archive
> Feb  7 12:43:57 dbms2 postgres[2001]: [222-1] LOG:  restored log file
> "0000000100000171000000A6" from archive
> Feb  7 12:43:57 dbms2 postgres[2001]: [223-1] PANIC:  corrupted page
> pointers: lower = 772, upper = 616, special = 0
> Feb  7 12:43:57 dbms2 postgres[2001]: [223-2] CONTEXT:  xlog redo
> hot_update: rel 1663/616245/1193269; tid 53/93; new 53/4
> Feb  7 12:43:57 dbms2 postgres[2000]: [1-1] LOG:  startup process (PID
> 2001) was terminated by signal 6: Abort trap
> Feb  7 12:43:57 dbms2 postgres[2000]: [2-1] LOG:  terminating any other
> active server processes
>
> Eek.
>
> I assume this means that either A6 or A7 is corrupt.  But I have the
> file both in the restore AND ON THE MACHINE WHERE IT ORIGINATED:
>
> On the SOURCE machine (which is running just fine):
> tickerforum# cksum *171*A[67]*
> 172998591 830621 0000000100000171000000A6.bz2
> 1283345296 1541006 0000000100000171000000A7.bz2
>
> And off the BACKUP archive, which is what I'm trying to restore:
>
> # cksum *171*A[67]*
> 172998591 830621 0000000100000171000000A6.bz2
> 1283345296 1541006 0000000100000171000000A7.bz2
>
> Identical, says the checksums.
>
> This is VERY BAD - if pg_compresslog is damaging the files in some
> instances then ANY BACKUP TAKEN USING THEM IS SUSPECT AND MAY NOT
> RESTORE!!!!!!
>
> Needless to say this is a MAJOR problem.
>
> -- Karl Denninger
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Problem with pg_compresslog'd archives

From

Koichi Suzuki

Date:

11 February 2010, 01:00:05

I found it's pg_compresslog problem (calculation of XNOOP record
length used in pg_decompresslog).    I'm fixing the bug and will
upload the fix shortly.

Sorry for inconvenience.

------------------
Koichi Suzuki

2010/2/8 Karl Denninger <karl@denninger.net>:
> This may belong in a bug report, but I'll post it here first...
>
> There appears to be a **SERIOUS** problem with using pg_compresslog and
> pg_uncompresslog with Postgresql 8.4.2.
>
> Here's my configuration snippet:
>
> full_page_writes = on                   # recover from partial page writes
> wal_buffers = 256kB                     # min 32kB
>                                        # (change requires restart)
> #wal_writer_delay = 200ms               # 1-10000 milliseconds
>
> #commit_delay = 0                       # range 0-100000, in microseconds
> #commit_siblings = 5                    # range 1-1000
>
> # - Checkpoints -
>
> checkpoint_segments = 64                # in logfile segments, min 1,
> 16MB each
> #checkpoint_timeout = 5min              # range 30s-1h
> checkpoint_completion_target = 0.9      # checkpoint target duration,
> 0.0 - 1.0
> #checkpoint_warning = 30s               # 0 disables
>
> archive_command = 'test ! -f /dbms/pg_archive/%f.bz2 && pg_compresslog
> %p | bzip2 - >/dbms/pg_archive/%f.bz2'           #command to use to
> archive a logfile segment
>
> All appears to be fine with the writes, and they are being saved off on
> the nightly backups without incident.
>
> I take a full dump using the instructions in the documentation and make
> sure I copy the proper "must have" file for consistency to be reached.
>
> The problem comes when I try to restore.
>
> recovery_conf contains:
>
> restore_command = '/usr/local/pgsql/recovery.sh %f %p'
>
> And that file contains:
>
>
> #! /bin/sh
>
> infile=$1
> outfile=$2
>
> if test -f /dbms/pg_archive/$infile.bz2
> then
>        bunzip2 -c /dbms/pg_archive/$infile.bz2 |
> /usr/local/pgsql/bin/pg_decompresslog - $outfile
>        exit 0
> else
>        exit 1
> fi
>
> ==============
>
> The problem is that it appears that some of the segments being saved are
> no good!  On occasion I get this when trying to restore...
>
> Feb  7 12:43:51 dbms2 postgres[2001]: [210-1] LOG:  restored log file
> "00000001000001710000009A" from archive
> Feb  7 12:43:52 dbms2 postgres[2001]: [211-1] LOG:  restored log file
> "00000001000001710000009B" from archive
> Feb  7 12:43:52 dbms2 postgres[2001]: [212-1] LOG:  restored log file
> "00000001000001710000009C" from archive
> Feb  7 12:43:52 dbms2 postgres[2001]: [213-1] LOG:  restored log file
> "00000001000001710000009D" from archive
> Feb  7 12:43:53 dbms2 postgres[2001]: [214-1] LOG:  restored log file
> "00000001000001710000009E" from archive
> Feb  7 12:43:53 dbms2 postgres[2001]: [215-1] LOG:  restored log file
> "00000001000001710000009F" from archive
> Feb  7 12:43:54 dbms2 postgres[2001]: [216-1] LOG:  restored log file
> "0000000100000171000000A0" from archive
> Feb  7 12:43:54 dbms2 postgres[2001]: [217-1] LOG:  restored log file
> "0000000100000171000000A1" from archive
> Feb  7 12:43:55 dbms2 postgres[2001]: [218-1] LOG:  restored log file
> "0000000100000171000000A2" from archive
> Feb  7 12:43:55 dbms2 postgres[2001]: [219-1] LOG:  restored log file
> "0000000100000171000000A3" from archive
> Feb  7 12:43:56 dbms2 postgres[2001]: [220-1] LOG:  restored log file
> "0000000100000171000000A4" from archive
> Feb  7 12:43:56 dbms2 postgres[2001]: [221-1] LOG:  restored log file
> "0000000100000171000000A5" from archive
> Feb  7 12:43:57 dbms2 postgres[2001]: [222-1] LOG:  restored log file
> "0000000100000171000000A6" from archive
> Feb  7 12:43:57 dbms2 postgres[2001]: [223-1] PANIC:  corrupted page
> pointers: lower = 772, upper = 616, special = 0
> Feb  7 12:43:57 dbms2 postgres[2001]: [223-2] CONTEXT:  xlog redo
> hot_update: rel 1663/616245/1193269; tid 53/93; new 53/4
> Feb  7 12:43:57 dbms2 postgres[2000]: [1-1] LOG:  startup process (PID
> 2001) was terminated by signal 6: Abort trap
> Feb  7 12:43:57 dbms2 postgres[2000]: [2-1] LOG:  terminating any other
> active server processes
>
> Eek.
>
> I assume this means that either A6 or A7 is corrupt.  But I have the
> file both in the restore AND ON THE MACHINE WHERE IT ORIGINATED:
>
> On the SOURCE machine (which is running just fine):
> tickerforum# cksum *171*A[67]*
> 172998591 830621 0000000100000171000000A6.bz2
> 1283345296 1541006 0000000100000171000000A7.bz2
>
> And off the BACKUP archive, which is what I'm trying to restore:
>
> # cksum *171*A[67]*
> 172998591 830621 0000000100000171000000A6.bz2
> 1283345296 1541006 0000000100000171000000A7.bz2
>
> Identical, says the checksums.
>
> This is VERY BAD - if pg_compresslog is damaging the files in some
> instances then ANY BACKUP TAKEN USING THEM IS SUSPECT AND MAY NOT
> RESTORE!!!!!!
>
> Needless to say this is a MAJOR problem.
>
> -- Karl Denninger
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>
>



--
------
Koichi Suzuki

Re: Problem with pg_compresslog'd archives

From

Karl Denninger

Date:

11 February 2010, 12:51:01

Will this come through as a commit on the pgfoundry codebase? I've subscribed looking for it....

The last edit, if I read the release notes and tracebacks on the codebase correctly, goes back to the early part of 2009 - which strongly implies that there are a **LOT** of people out there that could be running this code with un-restoreable archives!

That, for obvious reasons, could be VERY, VERY bad if someone was to suffer a system crash....

Koichi Suzuki wrote:

I found it's pg_compresslog problem (calculation of XNOOP record
length used in pg_decompresslog).    I'm fixing the bug and will
upload the fix shortly.

Sorry for inconvenience.

------------------
Koichi Suzuki

2010/2/8 Karl Denninger <karl@denninger.net>:

This may belong in a bug report, but I'll post it here first...

There appears to be a **SERIOUS** problem with using pg_compresslog and
pg_uncompresslog with Postgresql 8.4.2.

Here's my configuration snippet:

full_page_writes = on                   # recover from partial page writes
wal_buffers = 256kB                     # min 32kB
                                       # (change requires restart)
#wal_writer_delay = 200ms               # 1-10000 milliseconds

#commit_delay = 0                       # range 0-100000, in microseconds
#commit_siblings = 5                    # range 1-1000

# - Checkpoints -

checkpoint_segments = 64                # in logfile segments, min 1,
16MB each
#checkpoint_timeout = 5min              # range 30s-1h
checkpoint_completion_target = 0.9      # checkpoint target duration,
0.0 - 1.0
#checkpoint_warning = 30s               # 0 disables

archive_command = 'test ! -f /dbms/pg_archive/%f.bz2 && pg_compresslog
%p | bzip2 - >/dbms/pg_archive/%f.bz2'           #command to use to
archive a logfile segment

All appears to be fine with the writes, and they are being saved off on
the nightly backups without incident.

I take a full dump using the instructions in the documentation and make
sure I copy the proper "must have" file for consistency to be reached.

The problem comes when I try to restore.

recovery_conf contains:

restore_command = '/usr/local/pgsql/recovery.sh %f %p'

And that file contains:


#! /bin/sh

infile=$1
outfile=$2

if test -f /dbms/pg_archive/$infile.bz2
then
       bunzip2 -c /dbms/pg_archive/$infile.bz2 |
/usr/local/pgsql/bin/pg_decompresslog - $outfile
       exit 0
else
       exit 1
fi

==============

The problem is that it appears that some of the segments being saved are
no good!  On occasion I get this when trying to restore...

Feb  7 12:43:51 dbms2 postgres[2001]: [210-1] LOG:  restored log file
"00000001000001710000009A" from archive
Feb  7 12:43:52 dbms2 postgres[2001]: [211-1] LOG:  restored log file
"00000001000001710000009B" from archive
Feb  7 12:43:52 dbms2 postgres[2001]: [212-1] LOG:  restored log file
"00000001000001710000009C" from archive
Feb  7 12:43:52 dbms2 postgres[2001]: [213-1] LOG:  restored log file
"00000001000001710000009D" from archive
Feb  7 12:43:53 dbms2 postgres[2001]: [214-1] LOG:  restored log file
"00000001000001710000009E" from archive
Feb  7 12:43:53 dbms2 postgres[2001]: [215-1] LOG:  restored log file
"00000001000001710000009F" from archive
Feb  7 12:43:54 dbms2 postgres[2001]: [216-1] LOG:  restored log file
"0000000100000171000000A0" from archive
Feb  7 12:43:54 dbms2 postgres[2001]: [217-1] LOG:  restored log file
"0000000100000171000000A1" from archive
Feb  7 12:43:55 dbms2 postgres[2001]: [218-1] LOG:  restored log file
"0000000100000171000000A2" from archive
Feb  7 12:43:55 dbms2 postgres[2001]: [219-1] LOG:  restored log file
"0000000100000171000000A3" from archive
Feb  7 12:43:56 dbms2 postgres[2001]: [220-1] LOG:  restored log file
"0000000100000171000000A4" from archive
Feb  7 12:43:56 dbms2 postgres[2001]: [221-1] LOG:  restored log file
"0000000100000171000000A5" from archive
Feb  7 12:43:57 dbms2 postgres[2001]: [222-1] LOG:  restored log file
"0000000100000171000000A6" from archive
Feb  7 12:43:57 dbms2 postgres[2001]: [223-1] PANIC:  corrupted page
pointers: lower = 772, upper = 616, special = 0
Feb  7 12:43:57 dbms2 postgres[2001]: [223-2] CONTEXT:  xlog redo
hot_update: rel 1663/616245/1193269; tid 53/93; new 53/4
Feb  7 12:43:57 dbms2 postgres[2000]: [1-1] LOG:  startup process (PID
2001) was terminated by signal 6: Abort trap
Feb  7 12:43:57 dbms2 postgres[2000]: [2-1] LOG:  terminating any other
active server processes

Eek.

I assume this means that either A6 or A7 is corrupt.  But I have the
file both in the restore AND ON THE MACHINE WHERE IT ORIGINATED:

On the SOURCE machine (which is running just fine):
tickerforum# cksum *171*A[67]*
172998591 830621 0000000100000171000000A6.bz2
1283345296 1541006 0000000100000171000000A7.bz2

And off the BACKUP archive, which is what I'm trying to restore:

# cksum *171*A[67]*
172998591 830621 0000000100000171000000A6.bz2
1283345296 1541006 0000000100000171000000A7.bz2

Identical, says the checksums.

This is VERY BAD - if pg_compresslog is damaging the files in some
instances then ANY BACKUP TAKEN USING THEM IS SUSPECT AND MAY NOT
RESTORE!!!!!!

Needless to say this is a MAJOR problem.

-- Karl Denninger



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Attachment

karl.vcf

Re: Problem with pg_compresslog'd archives

From

Koichi Suzuki

Date:

11 February 2010, 13:28:14

I understand the situation.   I'll upload the improved code ASAP.

----------
Koichi Suzuki



2010/2/11 Karl Denninger <karl@denninger.net>:
> Will this come through as a commit on the pgfoundry codebase?  I've
> subscribed looking for it....
>
> The last edit, if I read the release notes and tracebacks on the codebase
> correctly, goes back to the early part of 2009 - which strongly implies that
> there are a **LOT** of people out there that could be running this code with
> un-restoreable archives!
>
> That, for obvious reasons, could be VERY, VERY bad if someone was to suffer
> a system crash....
>
>
> Koichi Suzuki wrote:
>
> I found it's pg_compresslog problem (calculation of XNOOP record
> length used in pg_decompresslog).    I'm fixing the bug and will
> upload the fix shortly.
>
> Sorry for inconvenience.
>
> ------------------
> Koichi Suzuki
>
> 2010/2/8 Karl Denninger <karl@denninger.net>:
>
>
> This may belong in a bug report, but I'll post it here first...
>
> There appears to be a **SERIOUS** problem with using pg_compresslog and
> pg_uncompresslog with Postgresql 8.4.2.
>
> Here's my configuration snippet:
>
> full_page_writes = on                   # recover from partial page writes
> wal_buffers = 256kB                     # min 32kB
>                                        # (change requires restart)
> #wal_writer_delay = 200ms               # 1-10000 milliseconds
>
> #commit_delay = 0                       # range 0-100000, in microseconds
> #commit_siblings = 5                    # range 1-1000
>
> # - Checkpoints -
>
> checkpoint_segments = 64                # in logfile segments, min 1,
> 16MB each
> #checkpoint_timeout = 5min              # range 30s-1h
> checkpoint_completion_target = 0.9      # checkpoint target duration,
> 0.0 - 1.0
> #checkpoint_warning = 30s               # 0 disables
>
> archive_command = 'test ! -f /dbms/pg_archive/%f.bz2 && pg_compresslog
> %p | bzip2 - >/dbms/pg_archive/%f.bz2'           #command to use to
> archive a logfile segment
>
> All appears to be fine with the writes, and they are being saved off on
> the nightly backups without incident.
>
> I take a full dump using the instructions in the documentation and make
> sure I copy the proper "must have" file for consistency to be reached.
>
> The problem comes when I try to restore.
>
> recovery_conf contains:
>
> restore_command = '/usr/local/pgsql/recovery.sh %f %p'
>
> And that file contains:
>
>
> #! /bin/sh
>
> infile=$1
> outfile=$2
>
> if test -f /dbms/pg_archive/$infile.bz2
> then
>        bunzip2 -c /dbms/pg_archive/$infile.bz2 |
> /usr/local/pgsql/bin/pg_decompresslog - $outfile
>        exit 0
> else
>        exit 1
> fi
>
> ==============
>
> The problem is that it appears that some of the segments being saved are
> no good!  On occasion I get this when trying to restore...
>
> Feb  7 12:43:51 dbms2 postgres[2001]: [210-1] LOG:  restored log file
> "00000001000001710000009A" from archive
> Feb  7 12:43:52 dbms2 postgres[2001]: [211-1] LOG:  restored log file
> "00000001000001710000009B" from archive
> Feb  7 12:43:52 dbms2 postgres[2001]: [212-1] LOG:  restored log file
> "00000001000001710000009C" from archive
> Feb  7 12:43:52 dbms2 postgres[2001]: [213-1] LOG:  restored log file
> "00000001000001710000009D" from archive
> Feb  7 12:43:53 dbms2 postgres[2001]: [214-1] LOG:  restored log file
> "00000001000001710000009E" from archive
> Feb  7 12:43:53 dbms2 postgres[2001]: [215-1] LOG:  restored log file
> "00000001000001710000009F" from archive
> Feb  7 12:43:54 dbms2 postgres[2001]: [216-1] LOG:  restored log file
> "0000000100000171000000A0" from archive
> Feb  7 12:43:54 dbms2 postgres[2001]: [217-1] LOG:  restored log file
> "0000000100000171000000A1" from archive
> Feb  7 12:43:55 dbms2 postgres[2001]: [218-1] LOG:  restored log file
> "0000000100000171000000A2" from archive
> Feb  7 12:43:55 dbms2 postgres[2001]: [219-1] LOG:  restored log file
> "0000000100000171000000A3" from archive
> Feb  7 12:43:56 dbms2 postgres[2001]: [220-1] LOG:  restored log file
> "0000000100000171000000A4" from archive
> Feb  7 12:43:56 dbms2 postgres[2001]: [221-1] LOG:  restored log file
> "0000000100000171000000A5" from archive
> Feb  7 12:43:57 dbms2 postgres[2001]: [222-1] LOG:  restored log file
> "0000000100000171000000A6" from archive
> Feb  7 12:43:57 dbms2 postgres[2001]: [223-1] PANIC:  corrupted page
> pointers: lower = 772, upper = 616, special = 0
> Feb  7 12:43:57 dbms2 postgres[2001]: [223-2] CONTEXT:  xlog redo
> hot_update: rel 1663/616245/1193269; tid 53/93; new 53/4
> Feb  7 12:43:57 dbms2 postgres[2000]: [1-1] LOG:  startup process (PID
> 2001) was terminated by signal 6: Abort trap
> Feb  7 12:43:57 dbms2 postgres[2000]: [2-1] LOG:  terminating any other
> active server processes
>
> Eek.
>
> I assume this means that either A6 or A7 is corrupt.  But I have the
> file both in the restore AND ON THE MACHINE WHERE IT ORIGINATED:
>
> On the SOURCE machine (which is running just fine):
> tickerforum# cksum *171*A[67]*
> 172998591 830621 0000000100000171000000A6.bz2
> 1283345296 1541006 0000000100000171000000A7.bz2
>
> And off the BACKUP archive, which is what I'm trying to restore:
>
> # cksum *171*A[67]*
> 172998591 830621 0000000100000171000000A6.bz2
> 1283345296 1541006 0000000100000171000000A7.bz2
>
> Identical, says the checksums.
>
> This is VERY BAD - if pg_compresslog is damaging the files in some
> instances then ANY BACKUP TAKEN USING THEM IS SUSPECT AND MAY NOT
> RESTORE!!!!!!
>
> Needless to say this is a MAJOR problem.
>
> -- Karl Denninger
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>
>
>
>
>

Re: Problem with pg_compresslog'd archives

From

Koichi Suzuki

Date:

19 April 2010, 13:30:40

Not yet.  I'm rebuilding the test suit for better testing.   If you're
in a hurry, I can send you an update for test.

----------
Koichi Suzuki



2010/4/19 Karl Denninger <karl@denninger.net>:
> Has there been an update on this situation?
>
> Koichi Suzuki wrote:
>
> I understand the situation.   I'll upload the improved code ASAP.
>
> ----------
> Koichi Suzuki
>
>
>
> 2010/2/11 Karl Denninger <karl@denninger.net>:
>
>
> Will this come through as a commit on the pgfoundry codebase?  I've
> subscribed looking for it....
>
> The last edit, if I read the release notes and tracebacks on the codebase
> correctly, goes back to the early part of 2009 - which strongly implies that
> there are a **LOT** of people out there that could be running this code with
> un-restoreable archives!
>
> That, for obvious reasons, could be VERY, VERY bad if someone was to suffer
> a system crash....
>
>
> Koichi Suzuki wrote:
>
> I found it's pg_compresslog problem (calculation of XNOOP record
> length used in pg_decompresslog).    I'm fixing the bug and will
> upload the fix shortly.
>
> Sorry for inconvenience.
>
> ------------------
> Koichi Suzuki
>
> 2010/2/8 Karl Denninger <karl@denninger.net>:
>
>
> This may belong in a bug report, but I'll post it here first...
>
> There appears to be a **SERIOUS** problem with using pg_compresslog and
> pg_uncompresslog with Postgresql 8.4.2.
>
> Here's my configuration snippet:
>
> full_page_writes = on                   # recover from partial page writes
> wal_buffers = 256kB                     # min 32kB
>                                        # (change requires restart)
> #wal_writer_delay = 200ms               # 1-10000 milliseconds
>
> #commit_delay = 0                       # range 0-100000, in microseconds
> #commit_siblings = 5                    # range 1-1000
>
> # - Checkpoints -
>
> checkpoint_segments = 64                # in logfile segments, min 1,
> 16MB each
> #checkpoint_timeout = 5min              # range 30s-1h
> checkpoint_completion_target = 0.9      # checkpoint target duration,
> 0.0 - 1.0
> #checkpoint_warning = 30s               # 0 disables
>
> archive_command = 'test ! -f /dbms/pg_archive/%f.bz2 && pg_compresslog
> %p | bzip2 - >/dbms/pg_archive/%f.bz2'           #command to use to
> archive a logfile segment
>
> All appears to be fine with the writes, and they are being saved off on
> the nightly backups without incident.
>
> I take a full dump using the instructions in the documentation and make
> sure I copy the proper "must have" file for consistency to be reached.
>
> The problem comes when I try to restore.
>
> recovery_conf contains:
>
> restore_command = '/usr/local/pgsql/recovery.sh %f %p'
>
> And that file contains:
>
>
> #! /bin/sh
>
> infile=$1
> outfile=$2
>
> if test -f /dbms/pg_archive/$infile.bz2
> then
>        bunzip2 -c /dbms/pg_archive/$infile.bz2 |
> /usr/local/pgsql/bin/pg_decompresslog - $outfile
>        exit 0
> else
>        exit 1
> fi
>
> ==============
>
> The problem is that it appears that some of the segments being saved are
> no good!  On occasion I get this when trying to restore...
>
> Feb  7 12:43:51 dbms2 postgres[2001]: [210-1] LOG:  restored log file
> "00000001000001710000009A" from archive
> Feb  7 12:43:52 dbms2 postgres[2001]: [211-1] LOG:  restored log file
> "00000001000001710000009B" from archive
> Feb  7 12:43:52 dbms2 postgres[2001]: [212-1] LOG:  restored log file
> "00000001000001710000009C" from archive
> Feb  7 12:43:52 dbms2 postgres[2001]: [213-1] LOG:  restored log file
> "00000001000001710000009D" from archive
> Feb  7 12:43:53 dbms2 postgres[2001]: [214-1] LOG:  restored log file
> "00000001000001710000009E" from archive
> Feb  7 12:43:53 dbms2 postgres[2001]: [215-1] LOG:  restored log file
> "00000001000001710000009F" from archive
> Feb  7 12:43:54 dbms2 postgres[2001]: [216-1] LOG:  restored log file
> "0000000100000171000000A0" from archive
> Feb  7 12:43:54 dbms2 postgres[2001]: [217-1] LOG:  restored log file
> "0000000100000171000000A1" from archive
> Feb  7 12:43:55 dbms2 postgres[2001]: [218-1] LOG:  restored log file
> "0000000100000171000000A2" from archive
> Feb  7 12:43:55 dbms2 postgres[2001]: [219-1] LOG:  restored log file
> "0000000100000171000000A3" from archive
> Feb  7 12:43:56 dbms2 postgres[2001]: [220-1] LOG:  restored log file
> "0000000100000171000000A4" from archive
> Feb  7 12:43:56 dbms2 postgres[2001]: [221-1] LOG:  restored log file
> "0000000100000171000000A5" from archive
> Feb  7 12:43:57 dbms2 postgres[2001]: [222-1] LOG:  restored log file
> "0000000100000171000000A6" from archive
> Feb  7 12:43:57 dbms2 postgres[2001]: [223-1] PANIC:  corrupted page
> pointers: lower = 772, upper = 616, special = 0
> Feb  7 12:43:57 dbms2 postgres[2001]: [223-2] CONTEXT:  xlog redo
> hot_update: rel 1663/616245/1193269; tid 53/93; new 53/4
> Feb  7 12:43:57 dbms2 postgres[2000]: [1-1] LOG:  startup process (PID
> 2001) was terminated by signal 6: Abort trap
> Feb  7 12:43:57 dbms2 postgres[2000]: [2-1] LOG:  terminating any other
> active server processes
>
> Eek.
>
> I assume this means that either A6 or A7 is corrupt.  But I have the
> file both in the restore AND ON THE MACHINE WHERE IT ORIGINATED:
>
> On the SOURCE machine (which is running just fine):
> tickerforum# cksum *171*A[67]*
> 172998591 830621 0000000100000171000000A6.bz2
> 1283345296 1541006 0000000100000171000000A7.bz2
>
> And off the BACKUP archive, which is what I'm trying to restore:
>
> # cksum *171*A[67]*
> 172998591 830621 0000000100000171000000A6.bz2
> 1283345296 1541006 0000000100000171000000A7.bz2
>
> Identical, says the checksums.
>
> This is VERY BAD - if pg_compresslog is damaging the files in some
> instances then ANY BACKUP TAKEN USING THEM IS SUSPECT AND MAY NOT
> RESTORE!!!!!!
>
> Needless to say this is a MAJOR problem.
>
> -- Karl Denninger
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>
>
>
>
>
>
>
>

Re: Problem with pg_compresslog'd archives

From

Karl Denninger

Date:

19 April 2010, 13:38:41

Has there been an update on this situation?

Koichi Suzuki wrote:

I understand the situation.   I'll upload the improved code ASAP.

----------
Koichi Suzuki



2010/2/11 Karl Denninger <karl@denninger.net>:

Will this come through as a commit on the pgfoundry codebase?  I've
subscribed looking for it....

The last edit, if I read the release notes and tracebacks on the codebase
correctly, goes back to the early part of 2009 - which strongly implies that
there are a **LOT** of people out there that could be running this code with
un-restoreable archives!

That, for obvious reasons, could be VERY, VERY bad if someone was to suffer
a system crash....


Koichi Suzuki wrote:

I found it's pg_compresslog problem (calculation of XNOOP record
length used in pg_decompresslog).    I'm fixing the bug and will
upload the fix shortly.

Sorry for inconvenience.

------------------
Koichi Suzuki

2010/2/8 Karl Denninger <karl@denninger.net>:


This may belong in a bug report, but I'll post it here first...

There appears to be a **SERIOUS** problem with using pg_compresslog and
pg_uncompresslog with Postgresql 8.4.2.

Here's my configuration snippet:

full_page_writes = on                   # recover from partial page writes
wal_buffers = 256kB                     # min 32kB
                                       # (change requires restart)
#wal_writer_delay = 200ms               # 1-10000 milliseconds

#commit_delay = 0                       # range 0-100000, in microseconds
#commit_siblings = 5                    # range 1-1000

# - Checkpoints -

checkpoint_segments = 64                # in logfile segments, min 1,
16MB each
#checkpoint_timeout = 5min              # range 30s-1h
checkpoint_completion_target = 0.9      # checkpoint target duration,
0.0 - 1.0
#checkpoint_warning = 30s               # 0 disables

archive_command = 'test ! -f /dbms/pg_archive/%f.bz2 && pg_compresslog
%p | bzip2 - >/dbms/pg_archive/%f.bz2'           #command to use to
archive a logfile segment

All appears to be fine with the writes, and they are being saved off on
the nightly backups without incident.

I take a full dump using the instructions in the documentation and make
sure I copy the proper "must have" file for consistency to be reached.

The problem comes when I try to restore.

recovery_conf contains:

restore_command = '/usr/local/pgsql/recovery.sh %f %p'

And that file contains:


#! /bin/sh

infile=$1
outfile=$2

if test -f /dbms/pg_archive/$infile.bz2
then
       bunzip2 -c /dbms/pg_archive/$infile.bz2 |
/usr/local/pgsql/bin/pg_decompresslog - $outfile
       exit 0
else
       exit 1
fi

==============

The problem is that it appears that some of the segments being saved are
no good!  On occasion I get this when trying to restore...

Feb  7 12:43:51 dbms2 postgres[2001]: [210-1] LOG:  restored log file
"00000001000001710000009A" from archive
Feb  7 12:43:52 dbms2 postgres[2001]: [211-1] LOG:  restored log file
"00000001000001710000009B" from archive
Feb  7 12:43:52 dbms2 postgres[2001]: [212-1] LOG:  restored log file
"00000001000001710000009C" from archive
Feb  7 12:43:52 dbms2 postgres[2001]: [213-1] LOG:  restored log file
"00000001000001710000009D" from archive
Feb  7 12:43:53 dbms2 postgres[2001]: [214-1] LOG:  restored log file
"00000001000001710000009E" from archive
Feb  7 12:43:53 dbms2 postgres[2001]: [215-1] LOG:  restored log file
"00000001000001710000009F" from archive
Feb  7 12:43:54 dbms2 postgres[2001]: [216-1] LOG:  restored log file
"0000000100000171000000A0" from archive
Feb  7 12:43:54 dbms2 postgres[2001]: [217-1] LOG:  restored log file
"0000000100000171000000A1" from archive
Feb  7 12:43:55 dbms2 postgres[2001]: [218-1] LOG:  restored log file
"0000000100000171000000A2" from archive
Feb  7 12:43:55 dbms2 postgres[2001]: [219-1] LOG:  restored log file
"0000000100000171000000A3" from archive
Feb  7 12:43:56 dbms2 postgres[2001]: [220-1] LOG:  restored log file
"0000000100000171000000A4" from archive
Feb  7 12:43:56 dbms2 postgres[2001]: [221-1] LOG:  restored log file
"0000000100000171000000A5" from archive
Feb  7 12:43:57 dbms2 postgres[2001]: [222-1] LOG:  restored log file
"0000000100000171000000A6" from archive
Feb  7 12:43:57 dbms2 postgres[2001]: [223-1] PANIC:  corrupted page
pointers: lower = 772, upper = 616, special = 0
Feb  7 12:43:57 dbms2 postgres[2001]: [223-2] CONTEXT:  xlog redo
hot_update: rel 1663/616245/1193269; tid 53/93; new 53/4
Feb  7 12:43:57 dbms2 postgres[2000]: [1-1] LOG:  startup process (PID
2001) was terminated by signal 6: Abort trap
Feb  7 12:43:57 dbms2 postgres[2000]: [2-1] LOG:  terminating any other
active server processes

Eek.

I assume this means that either A6 or A7 is corrupt.  But I have the
file both in the restore AND ON THE MACHINE WHERE IT ORIGINATED:

On the SOURCE machine (which is running just fine):
tickerforum# cksum *171*A[67]*
172998591 830621 0000000100000171000000A6.bz2
1283345296 1541006 0000000100000171000000A7.bz2

And off the BACKUP archive, which is what I'm trying to restore:

# cksum *171*A[67]*
172998591 830621 0000000100000171000000A6.bz2
1283345296 1541006 0000000100000171000000A7.bz2

Identical, says the checksums.

This is VERY BAD - if pg_compresslog is damaging the files in some
instances then ANY BACKUP TAKEN USING THEM IS SUSPECT AND MAY NOT
RESTORE!!!!!!

Needless to say this is a MAJOR problem.

-- Karl Denninger



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Attachment

karl.vcf

Re: Problem with pg_compresslog'd archives

From

Karl Denninger

Date:

19 April 2010, 13:43:18

Not in a huge hurry (fixed right is better than fixed fast); just trying
to figure out where the fix is in terms of progress.

Koichi Suzuki wrote:
> Not yet.  I'm rebuilding the test suit for better testing.   If you're
> in a hurry, I can send you an update for test.
>
> ----------
> Koichi Suzuki
>
>
>
> 2010/4/19 Karl Denninger <karl@denninger.net>:
>
>> Has there been an update on this situation?
>>
>> Koichi Suzuki wrote:
>>
>> I understand the situation.   I'll upload the improved code ASAP.
>>
>> ----------
>> Koichi Suzuki
>>
>>
>>
>> 2010/2/11 Karl Denninger <karl@denninger.net>:
>>
>>
>> Will this come through as a commit on the pgfoundry codebase?  I've
>> subscribed looking for it....
>>
>> The last edit, if I read the release notes and tracebacks on the codebase
>> correctly, goes back to the early part of 2009 - which strongly implies that
>> there are a **LOT** of people out there that could be running this code with
>> un-restoreable archives!
>>
>> That, for obvious reasons, could be VERY, VERY bad if someone was to suffer
>> a system crash....
>>
>>
>> Koichi Suzuki wrote:
>>
>> I found it's pg_compresslog problem (calculation of XNOOP record
>> length used in pg_decompresslog).    I'm fixing the bug and will
>> upload the fix shortly.
>>
>> Sorry for inconvenience.
>>
>> ------------------
>> Koichi Suzuki
>>
>> 2010/2/8 Karl Denninger <karl@denninger.net>:
>>
>>
>> This may belong in a bug report, but I'll post it here first...
>>
>> There appears to be a **SERIOUS** problem with using pg_compresslog and
>> pg_uncompresslog with Postgresql 8.4.2.
>>
>> Here's my configuration snippet:
>>
>> full_page_writes =n                   # recover from partial page writes
>> wal_buffers =56kB                     # min 32kB
>>                                        # (change requires restart)
>> #wal_writer_delay =00ms               # 1-10000 milliseconds
>>
>> #commit_delay =                       # range 0-100000, in microseconds
>> #commit_siblings =                    # range 1-1000
>>
>> # - Checkpoints -
>>
>> checkpoint_segments =4                # in logfile segments, min 1,
>> 16MB each
>> #checkpoint_timeout =min              # range 30s-1h
>> checkpoint_completion_target =.9      # checkpoint target duration,
>> 0.0 - 1.0
>> #checkpoint_warning =0s               # 0 disables
>>
>> archive_command =test ! -f /dbms/pg_archive/%f.bz2 && pg_compresslog
>> %p | bzip2 - >/dbms/pg_archive/%f.bz2'           #command to use to
>> archive a logfile segment
>>
>> All appears to be fine with the writes, and they are being saved off on
>> the nightly backups without incident.
>>
>> I take a full dump using the instructions in the documentation and make
>> sure I copy the proper "must have" file for consistency to be reached.
>>
>> The problem comes when I try to restore.
>>
>> recovery_conf contains:
>>
>> restore_command =/usr/local/pgsql/recovery.sh %f %p'
>>
>> And that file contains:
>>
>>
>> #! /bin/sh
>>
>> infile=
>> outfile=
>>
>> if test -f /dbms/pg_archive/$infile.bz2
>> then
>>        bunzip2 -c /dbms/pg_archive/$infile.bz2 |
>> /usr/local/pgsql/bin/pg_decompresslog - $outfile
>>        exit 0
>> else
>>        exit 1
>> fi
>>
>> ============
>>
>> The problem is that it appears that some of the segments being saved are
>> no good!  On occasion I get this when trying to restore...
>>
>> Feb  7 12:43:51 dbms2 postgres[2001]: [210-1] LOG:  restored log file
>> "00000001000001710000009A" from archive
>> Feb  7 12:43:52 dbms2 postgres[2001]: [211-1] LOG:  restored log file
>> "00000001000001710000009B" from archive
>> Feb  7 12:43:52 dbms2 postgres[2001]: [212-1] LOG:  restored log file
>> "00000001000001710000009C" from archive
>> Feb  7 12:43:52 dbms2 postgres[2001]: [213-1] LOG:  restored log file
>> "00000001000001710000009D" from archive
>> Feb  7 12:43:53 dbms2 postgres[2001]: [214-1] LOG:  restored log file
>> "00000001000001710000009E" from archive
>> Feb  7 12:43:53 dbms2 postgres[2001]: [215-1] LOG:  restored log file
>> "00000001000001710000009F" from archive
>> Feb  7 12:43:54 dbms2 postgres[2001]: [216-1] LOG:  restored log file
>> "0000000100000171000000A0" from archive
>> Feb  7 12:43:54 dbms2 postgres[2001]: [217-1] LOG:  restored log file
>> "0000000100000171000000A1" from archive
>> Feb  7 12:43:55 dbms2 postgres[2001]: [218-1] LOG:  restored log file
>> "0000000100000171000000A2" from archive
>> Feb  7 12:43:55 dbms2 postgres[2001]: [219-1] LOG:  restored log file
>> "0000000100000171000000A3" from archive
>> Feb  7 12:43:56 dbms2 postgres[2001]: [220-1] LOG:  restored log file
>> "0000000100000171000000A4" from archive
>> Feb  7 12:43:56 dbms2 postgres[2001]: [221-1] LOG:  restored log file
>> "0000000100000171000000A5" from archive
>> Feb  7 12:43:57 dbms2 postgres[2001]: [222-1] LOG:  restored log file
>> "0000000100000171000000A6" from archive
>> Feb  7 12:43:57 dbms2 postgres[2001]: [223-1] PANIC:  corrupted page
>> pointers: lower =72, upper = 616, special = 0
>> Feb  7 12:43:57 dbms2 postgres[2001]: [223-2] CONTEXT:  xlog redo
>> hot_update: rel 1663/616245/1193269; tid 53/93; new 53/4
>> Feb  7 12:43:57 dbms2 postgres[2000]: [1-1] LOG:  startup process (PID
>> 2001) was terminated by signal 6: Abort trap
>> Feb  7 12:43:57 dbms2 postgres[2000]: [2-1] LOG:  terminating any other
>> active server processes
>>
>> Eek.
>>
>> I assume this means that either A6 or A7 is corrupt.  But I have the
>> file both in the restore AND ON THE MACHINE WHERE IT ORIGINATED:
>>
>> On the SOURCE machine (which is running just fine):
>> tickerforum# cksum *171*A[67]*
>> 172998591 830621 0000000100000171000000A6.bz2
>> 1283345296 1541006 0000000100000171000000A7.bz2
>>
>> And off the BACKUP archive, which is what I'm trying to restore:
>>
>> # cksum *171*A[67]*
>> 172998591 830621 0000000100000171000000A6.bz2
>> 1283345296 1541006 0000000100000171000000A7.bz2
>>
>> Identical, says the checksums.
>>
>> This is VERY BAD - if pg_compresslog is damaging the files in some
>> instances then ANY BACKUP TAKEN USING THEM IS SUSPECT AND MAY NOT
>> RESTORE!!!!!!
>>
>> Needless to say this is a MAJOR problem.
>>
>> -- Karl Denninger
>>
>>
>>
>> --
>> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-general
>>
>>
>>
>>
>>
>>
>>
>>
>>

Attachment

karl.vcf