Thread: BUG #5602: Recovering from Hot-Standby file backup leads to corrupted indexes

The following bug has been logged online:

Bug reference:      5602
Logged by:          Chris
Email address:      christoph.springer@magicinternet.de
PostgreSQL version: 9.0beta3
Operating system:   Debian 5.0.3
Description:        Recovering from Hot-Standby file backup leads to
corrupted indexes
Details:

A Hot-Standby database is being backed up to a new machine like this:

0. pause WAL file deletion from Hot-Standby
1. backup pg_control
2. sync data and tablespace files
3. wait for a checkpoint, polling pg_controldata until the checkpoint
location changes
4. copy all WAL files
5. recover the newly created database using WAL files
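
Roughly, in shell terms (host names, paths, and the WAL-cleanup switch
below are placeholders for our site-specific bits):

#!/bin/bash
# Rough sketch of the steps above; host names and paths are placeholders.
set -e
STANDBY_DATA=$PGDATA          # data directory of the Hot-Standby
BACKUP_HOST=newmachine        # machine receiving the copy

# 0. pause WAL file deletion on the Hot-Standby (site-specific switch)
touch /tmp/pause_wal_cleanup

# 1. back up pg_control first
cp $STANDBY_DATA/global/pg_control /backup_stage/pg_control

# 2. sync data and tablespace files to the new machine
rsync -a $STANDBY_DATA/ $BACKUP_HOST:/backup/data/

# 3. poll pg_controldata until "Latest checkpoint location" changes

# 4. copy all WAL files
rsync -a /wal_archive/ $BACKUP_HOST:/backup/wal_archive/

# 5. recover the new database from the copied WAL files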

The database recovers and starts normally, but some of the indexes are
corrupt afterwards. I could not find any strange log records in the logs
for the recovered database.

This is the definition of an index that is corrupt afterwards:
CREATE INDEX idx_tbl_textcol1_textcol2
  ON tbl
  USING btree
  (textcol1, textcol2)
TABLESPACE data5_tbs;

Thank you in advance,

Chris
"Chris" <christoph.springer@magicinternet.de> writes:
> A Hot-Standby database is being backed up to a new machine like this:

> 0. pause WAL file deletion from Hot-Standby
> 1. backup pg_control
> 2. sync data and tablespace files
> 3. wait for a checkpoint, polling pg_controldata until the checkpoint
> location changes
> 4. copy all WAL files
> 5. recover the newly created database using WAL files

This process seems almost entirely unrelated to the documented way of
doing it; I'm not surprised that you end up with some files not in sync.
Please see pg_start_backup and friends.

            regards, tom lane
> This process seems almost entirely unrelated to the documented way of
> doing it; I'm not surprised that you end up with some files not in sync.
> Please see pg_start_backup and friends.
>
>             regards, tom lane
>

Hello Tom,

It was done as documented in http://www.postgresql.org/docs/9.0/static/backup-incremental-updated.html

With my best regards,

-- Valentine
valgog <valgog@gmail.com> writes:
>> This process seems almost entirely unrelated to the documented way of
>> doing it; I'm not surprised that you end up with some files not in sync.
>> Please see pg_start_backup and friends.

> It was done as documented in http://www.postgresql.org/docs/9.0/static/backup-incremental-updated.html

Oh, I missed that you were copying from a hot-standby slave rather than
the master.  Still, your procedure doesn't clearly match step 2, and
that step is the weak point of the process --- the grandchild slave
isn't consistent until it's replayed WAL far enough, but we don't have
any automated support for verifying that.  (I hope that's going to get
improved in 9.1.)  I suspect you allowed the grandchild to go live
before it was really consistent.

            regards, tom lane
On Thu, 2010-08-05 at 11:28 -0700, valgog wrote:
> > This process seems almost entirely unrelated to the documented way of
> > doing it; I'm not surprised that you end up with some files not in sync.
> > Please see pg_start_backup and friends.

> It was done as documented in http://www.postgresql.org/docs/9.0/static/backup-incremental-updated.html

The procedure used does differ from that documented. However, IMHO the
procedure *documented* is *not* safe and could lead to corrupt indexes
in the way described, since the last recovered point might be mid-way
between two halves of an index split record, which will never be
corrected during HS. What I find surprising is that the technique the OP
describes should be safe, assuming step 5 waits for the correct point of
consistency before attempting to run queries.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Thu, 2010-08-05 at 11:28 -0700, valgog wrote:
>> It was done as documented in http://www.postgresql.org/docs/9.0/static/backup-incremental-updated.html

> The procedure used does differ from that documented. However, IMHO the
> procedure *documented* is *not* safe and could lead to corrupt indexes
> in the way described, since the last recovered point might be mid-way
> between two halves of an index split record, which will never be
> corrected during HS.

Hm, I was looking at that and thinking it seemed unsafe for entirely
different reasons.  But if you didn't write it, who did?

            regards, tom lane
On Fri, Aug 6, 2010 at 12:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hm, I was looking at that and thinking it seemed unsafe for entirely
> different reasons.  But if you didn't write it, who did?

The Incrementally Updated Backups technique has been in the documentation
since 8.2. In the 9.0 development cycle, Heikki and I added some
description to make the technique more robust: the pg_control file should
be backed up first, and the backup end point should be checked before
running queries.

If it's unsafe, I'm happy to modify it. Which part looks unsafe?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Hi all,

the procedure that waits for the checkpoint location change looks like
this:

function wait_for_checkpoint_location_change() {
    # Remember the current checkpoint location, then poll pg_controldata
    # until it reports a different one.
    PRE_WAIT=$( pg_controldata $PGDATA | awk -F: '/Latest checkpoint location/ { print $2 }' )
    log "Waiting for checkpoint"
    while true ; do
        sleep 5
        if [[ "$PRE_WAIT" != "$( pg_controldata $PGDATA | awk -F: '/Latest checkpoint location/ { print $2 }' )" ]] ; then
            log "Checkpoint."
            return
        fi
    done
}


The grandchild database log has the following records about the
recovery process:

2010-08-06 09:48:31.266 CEST,,,30649,,4c5bbe46.77b9,15,,2010-08-06 09:48:22 CEST,,0,LOG,00000,"restored log file ""000000010000005B000000DC"" from archive",,,,,,,,,""
2010-08-06 09:48:31.564 CEST,,,30649,,4c5bbe46.77b9,16,,2010-08-06 09:48:22 CEST,,0,LOG,00000,"consistent recovery state reached at 5B/DD000000",,,,,,,,,""
2010-08-06 09:48:31.751 CEST,,,30649,,4c5bbe46.77b9,17,,2010-08-06 09:48:22 CEST,,0,LOG,00000,"restored log file ""000000010000005B000000DD"" from archive",,,,,,,,,""

Regards,

-- Chris
On Fri, Aug 6, 2010 at 7:50 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The procedure used does differ from that documented. However, IMHO the
> procedure *documented* is *not* safe and could lead to corrupt indexes
> in the way described, since the last recovered point might be mid-way
> between two halves of an index split record, which will never be
> corrected during HS.

Is an index split record replayed by two calls of rm_redo()? If not,
we don't need to worry about the above, since the last recovered point
which pg_last_xlog_replay_location() returns is updated after every
rm_redo().
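
For reference, the location that pg_last_xlog_replay_location() returns
can be watched with something like (a trivial sketch; the host name is a
placeholder):

psql -h standby -Atc "SELECT pg_last_xlog_replay_location();"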

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao <masao.fujii@gmail.com> writes:
> On Fri, Aug 6, 2010 at 7:50 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> The procedure used does differ from that documented. However, IMHO the
>> procedure *documented* is *not* safe and could lead to corrupt indexes
>> in the way described, since the last recovered point might be mid-way
>> between two halves of an index split record, which will never be
>> corrected during HS.

> Is an index split record replayed by two calls of rm_redo()? If not,
> we don't need to worry about the above, since the last recovered point
> which pg_last_xlog_replay_location() returns is updated after every
> rm_redo().

Yeah, I thought that was bogus too.  If we're following a live master,
the second xlog record should be along shortly, and in any case queries
will give the correct result in between.  The problem is only interesting
if the WAL series ends and we have to cons up the split completion by
ourselves; but the logic to do that does exist.

What was bothering me about the procedure is that it's not clear when
the new slave has reached consistency, in the sense of having used WAL
to clean up any out-of-sync conditions in the base backup it was started
from.  So you can't be sure when it's okay to begin treating it as a
trustworthy backup or potential master.  We track the minimum safe
recovery point for normal PITR recovery cases, but that mechanism isn't
available for slaves cloned according to this procedure.  So the DBA is
just flying blind as to whether the slave is trustworthy yet.  I can't
prove that that's what burnt the original complainant, but it fits the
symptoms.

            regards, tom lane
On Thu, 2010-08-12 at 01:31 -0400, Tom Lane wrote:
> So the DBA is
> just flying blind as to whether the slave is trustworthy yet.  I can't
> prove that that's what burnt the original complainant, but it fits the
> symptoms.

The safest approach is to

1. run pg_start_backup() on master, remember LSN
2. copy backup_label from master to standby
3. wait for starting LSN to be applied on standby
4. run backup on standby
5. run pg_stop_backup() on master

That ensures we don't run without full page writes during backup and
that we have an explicit consistency point to work from.
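
A minimal sketch of those steps (host names and the copy command are
placeholders; note that LSNs of the form HI/LO must be compared
numerically, not as strings):

#!/bin/bash
# Sketch only; assumes psql/ssh access to both machines.
set -e

# Compare LSNs of the form HI/LO as numbers (hi * 2^32 + lo).
lsn_to_int() { echo $(( 16#${1%%/*} * 4294967296 + 16#${1##*/} )); }

# 1. start the backup on the master and remember the returned LSN
START_LSN=$( psql -h master -Atc "SELECT pg_start_backup('standby-copy');" )

# 2. copy backup_label to the standby so the backup in step 4 includes it
scp master:$PGDATA/backup_label standby:$PGDATA/backup_label

# 3. wait until the standby has replayed past the starting LSN
while true ; do
    REPLAYED=$( psql -h standby -Atc "SELECT pg_last_xlog_replay_location();" )
    if (( $(lsn_to_int "$REPLAYED") >= $(lsn_to_int "$START_LSN") )) ; then
        break
    fi
    sleep 5
done

# 4. run the base backup on the standby
rsync -a standby:$PGDATA/ /backup/data/

# 5. end the backup on the master
psql -h master -c "SELECT pg_stop_backup();"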

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Aug 12, 2010 at 2:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What was bothering me about the procedure is that it's not clear when
> the new slave has reached consistency, in the sense of having used WAL
> to clean up any out-of-sync conditions in the base backup it was started
> from.  So you can't be sure when it's okay to begin treating it as a
> trustworthy backup or potential master.  We track the minimum safe
> recovery point for normal PITR recovery cases, but that mechanism isn't
> available for slaves cloned according to this procedure.  So the DBA is
> just flying blind as to whether the slave is trustworthy yet.  I can't
> prove that that's what burnt the original complainant, but it fits the
> symptoms.

Step 2 of the procedure can ensure that the new slave has reached
consistency, no?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Thu, Aug 12, 2010 at 4:18 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The safest approach is to
>
> 1. run pg_start_backup() on master, remember LSN
> 2. copy backup_label from master to standby
> 3. wait for starting LSN to be applied on standby

ISTM we should wait for the latest checkpoint redo location to reach
the starting location which pg_start_backup() returns.

> That ensures we don't run without full page writes during backup and
> that we have an explicit consistency point to work from.

Yeah, the current documented procedure might not work when
full_page_writes is off on the master. This should be documented.
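
For example, the setting can be checked on the master with (host name is
a placeholder):

psql -h master -Atc "SHOW full_page_writes;"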

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Hi,

Actually, full_page_writes being turned off on the master is probably the
problem.

-- Valentine

On Thu, Aug 12, 2010 at 9:43 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

> On Thu, Aug 12, 2010 at 4:18 PM, Simon Riggs <simon@2ndquadrant.com>
> wrote:
> > The safest approach is to
> >
> > 1. run pg_start_backup() on master, remember LSN
> > 2. copy backup_label from master to standby
> > 3. wait for starting LSN to be applied on standby
>
> ISTM we should wait for the latest checkpoint redo location to reach
> the starting location which pg_start_backup() returns.
>
> > That ensures we don't run without full page writes during backup and
> > that we have an explicit consistency point to work from.
>
> Yeah, the current documented procedure might not work when
> full_page_writes is off on the master. This should be documented.
>
> Regards,
>
> --
> Fujii Masao
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center
>
On Thu, Aug 12, 2010 at 5:33 PM, Valentine Gogichashvili
<valgog@gmail.com> wrote:
> Hi,
> Actually, full_page_writes being turned off on the master is probably
> the problem.

Yep. As Simon suggests, we must run pg_start_backup() on the master to
take a backup from the standby safely even if full_page_writes is
disabled. So we should:

(based on Simon's suggestion)
1. run pg_start_backup() on the master.
2. copy backup_label from the master to a temporary area.
   Copying backup_label directly to the standby would create another
   weakness (e.g., what if the standby is restarted while backup_label
   exists on it?), so backup_label should be copied somewhere other
   than the standby.
3. wait for the "Latest checkpoint's REDO location" that pg_controldata
   on the standby reports to reach or exceed the "START WAL LOCATION"
   in the backup_label copied in step 2. This could take a long time,
   but we can run CHECKPOINT on the standby to shorten the wait.
4. run the base backup on the standby
5. run pg_stop_backup() on the master
6. copy backup_label from the temporary area into the backup

Is this procedure still unsafe?
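
A minimal sketch of those six steps (host names and the copy commands are
placeholders; lsn_to_int is the same numeric LSN comparison helper as in
the sketch upthread):

#!/bin/bash
# Sketch only; assumes psql/ssh access to both machines.
set -e
lsn_to_int() { echo $(( 16#${1%%/*} * 4294967296 + 16#${1##*/} )); }

# 1. start the backup on the master; the LSN returned equals the
#    "START WAL LOCATION" written into backup_label
START_LSN=$( psql -h master -Atc "SELECT pg_start_backup('standby-copy');" )

# 2. stash backup_label somewhere other than the standby
scp master:$PGDATA/backup_label /tmp/backup_label

# 3. wait until the standby's latest restartpoint REDO location reaches
#    the start location; CHECKPOINT on the standby shortens the wait
psql -h standby -c "CHECKPOINT;"
while true ; do
    REDO=$( ssh standby pg_controldata $PGDATA |
            awk -F': *' '/Latest checkpoint.s REDO location/ { print $2 }' )
    if (( $(lsn_to_int "$REDO") >= $(lsn_to_int "$START_LSN") )) ; then
        break
    fi
    sleep 5
done

# 4. run the base backup on the standby
rsync -a standby:$PGDATA/ /backup/data/

# 5. end the backup on the master
psql -h master -c "SELECT pg_stop_backup();"

# 6. put the stashed backup_label into the backup
cp /tmp/backup_label /backup/data/backup_label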

I'm thinking that it's worth keeping the currently documented procedure,
since it has less impact on the master, if we state clearly that it is
safe only when full_page_writes is enabled. Comments?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao <masao.fujii@gmail.com> writes:
> On Thu, Aug 12, 2010 at 5:33 PM, Valentine Gogichashvili
> <valgog@gmail.com> wrote:
>> Actually, full_page_writes being turned off on the master is probably
>> the problem.

> Yep. As Simon suggests, we must run pg_start_backup() on the master to
> take a backup from the standby safely even if full_page_writes is
> disabled. So we should:

> (based on Simon's suggestion)
> 1. run pg_start_backup() on the master.
> 2. copy backup_label from the master to a temporary area.
>    Copying backup_label directly to the standby would create another
>    weakness (e.g., what if the standby is restarted while backup_label
>    exists on it?), so backup_label should be copied somewhere other
>    than the standby.
> 3. wait for the "Latest checkpoint's REDO location" that pg_controldata
>    on the standby reports to reach or exceed the "START WAL LOCATION"
>    in the backup_label copied in step 2. This could take a long time,
>    but we can run CHECKPOINT on the standby to shorten the wait.

Hm, can you actually execute CHECKPOINT on a HS slave?  Is it guaranteed
to cause a restartpoint to be created?

> 4. run the base backup on the standby
> 5. run pg_stop_backup() on the master
> 6. copy backup_label from the temporary area into the backup

> Is this procedure still unsafe?

This still isn't doing anything to address the problem I'm worried
about, which is when does the copy actually reach consistency.  The
above procedure might guarantee that it eventually will reach
consistency, but you don't know when it has.

            regards, tom lane
On Thu, Aug 12, 2010 at 11:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> (based on Simon's suggestion)
>> 1. run pg_start_backup() on the master.
>> 2. copy backup_label from the master to a temporary area.
>>    Copying backup_label directly to the standby would create another
>>    weakness (e.g., what if the standby is restarted while backup_label
>>    exists on it?), so backup_label should be copied somewhere other
>>    than the standby.
>> 3. wait for the "Latest checkpoint's REDO location" that pg_controldata
>>    on the standby reports to reach or exceed the "START WAL LOCATION"
>>    in the backup_label copied in step 2. This could take a long time,
>>    but we can run CHECKPOINT on the standby to shorten the wait.
>
> Hm, can you actually execute CHECKPOINT on a HS slave?

Yes.

> Is it guaranteed
> to cause a restartpoint to be created?

CHECKPOINT on a HS slave creates a restartpoint only when there
is a CHECKPOINT record that has already been replayed but has not
yet created a restartpoint. Such a CHECKPOINT record is expected
to exist after step 2 because it's generated by pg_start_backup
in step 1. So executing CHECKPOINT on a HS slave at step 3 would
almost certainly create a restartpoint.

But in the file-based log shipping case, it might take a long time
to ship such a CHECKPOINT record, so we might need to execute
pg_switch_xlog() on the master before executing CHECKPOINT on the
slave.
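
For example (a sketch; the host names are placeholders):

# Force the WAL file containing the CHECKPOINT record out to the archive,
# then let the slave turn the already-replayed record into a restartpoint.
psql -h master -c "SELECT pg_switch_xlog();"
psql -h standby -c "CHECKPOINT;"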

>> 4. run the base backup on the standby
>> 5. run pg_stop_backup() on the master
>> 6. copy backup_label from the temporary area into the backup
>
>> Is this procedure still unsafe?
>
> This still isn't doing anything to address the problem I'm worried
> about, which is when does the copy actually reach consistency.  The
> above procedure might guarantee that it eventually will reach
> consistency, but you don't know when it has.

Once a new standby started from the backup taken from another
standby has reached the backup end location (i.e., it has read
the XLOG_BACKUP_END record generated by pg_stop_backup in
step 5), we can consider that the database has reached
consistency. Since the new standby doesn't accept connections
from clients until then, we can ensure that users will not
access an inconsistent database.
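
Reaching consistency is also visible in the new standby's log as a
"consistent recovery state reached" message (as in the log excerpt
upthread), e.g. with CSV logging:

# log location depends on the logging configuration
grep "consistent recovery state reached" $PGDATA/pg_log/*.csv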

Regards,

PS. I'll be unable to read hackers from Aug 13th to 20th because of
a vacation.

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center