Thread: BUG #7710: Xid epoch is not updated properly during checkpoint

BUG #7710: Xid epoch is not updated properly during checkpoint

From

tarvip@gmail.com

Date:

28 November 2012, 15:45:26

The following bug has been logged on the website:

Bug reference:      7710
Logged by:          Tarvi Pillessaar
Email address:      tarvip@gmail.com
PostgreSQL version: 9.1.6
Operating system:   linux
Description:        =


This happens only if wal_level=3Dhot_standby.


Here are the steps to reproduce this issue.


We have following db cluster:

postgres@sbox /usr/local/pgsql $ pg_controldata data|grep NextXID
Latest checkpoint's NextXID:          0/4294966303
postgres@sbox /usr/local/pgsql $

Basically we have less than 1000 XIDs to epoch boundary.

Modify following parameters in conf:
checkpoint_segments =3D 16
checkpoint_completion_target =3D 0.9
checkpoint_timeout =3D 2min
log_checkpoints =3D on
log_line_prefix =3D '%t %r %p %d %u '
wal_level =3D hot_standby



Let's start up the cluster:
postgres@sbox /usr/local/pgsql $ postgres -D /usr/local/pgsql/data
2012-11-27 20:44:43 EET  26353   LOG:  database system was shut down at
2012-11-27 18:43:12 EET
2012-11-27 20:44:43 EET  26352   LOG:  database system is ready to accept
connections
...


In another session:
postgres@sbox /usr/local/pgsql $ psql -c "select now(),txid_current(),
txid_current()-2^32"
              now              | txid_current | ?column? =

-------------------------------+--------------+----------
 2012-11-27 20:45:01.394324+02 |   4294966303 |     -993



Now let's consume some XIDs, otherwise we have nothing to checkpoint.

postgres@sbox /usr/local/pgsql $ pgbench -c 1 -t 700
...
postgres@sbox /usr/local/pgsql $ psql -c "select now(),txid_current(),
txid_current()-2^32"
              now              | txid_current | ?column? =

-------------------------------+--------------+----------
 2012-11-27 20:45:27.256096+02 |   4294967005 |     -291



After a while, checkpoint starts:
2012-11-27 20:46:43 EET  26354   LOG:  checkpoint starting: time


Now let's cross the epoch boundary:

postgres@sbox /usr/local/pgsql $ pgbench -c 1 -t 700
...

postgres@sbox /usr/local/pgsql $ psql -c "select now(),txid_current(),
txid_current()-2^32"
              now              | txid_current | ?column? =

-------------------------------+--------------+----------
 2012-11-27 20:46:51.205384+02 |   4294967713 |      417

Seems that we have successfully crossed the boundary.


When checkpoint completes:
2012-11-27 20:47:32 EET  26354   LOG:  checkpoint complete: wrote 779
buffers (25.4%); 0 transaction log file(s) added, 0 removed, 9 recycled;
write=3D49.441 s, sync=3D0.170 s, total=3D49.636 s; sync files=3D16, longes=
t=3D0.131
s, average=3D0.010 s

postgres@sbox /usr/local/pgsql $ psql -c "select now(),txid_current(),
txid_current()-2^32"
              now              | txid_current |  ?column?   =

-------------------------------+--------------+-------------
 2012-11-27 20:47:42.031007+02 |          421 | -4294966875

It seems that epoch bump was rolled back.

Re: BUG #7710: Xid epoch is not updated properly during checkpoint

From

Sergey Burladyan

Date:

30 November 2012, 15:59:34

Hi all.

We also have reproduced this behaviour of txid wraparound. pg version 9.0.5.
We have done some test.

We are using hot standby and our chekpoint interval is closly to one
hour (actually, our pg does chekpoints all time). So probability of
intersection active chekpoint and getting max txid is near 100%

First time we got fail whith this wraparound was some month ago during
londiste-replication. and we haven't found any satisfactory
explaination, but now, we get new conditions.

Please, review checkpoint and hotstandy.

Does anybody know this bug exists in head?

We have done our test in 9.2.1 and bug have been found again:

--- checkpoint begin
$ /usr/lib/postgresql/9.2/bin/psql -p 5430 -h `pwd`/s -d postgres -c
'select now(),txid_current(),txid_current()-2^32'
             now              | txid_current | ?column?
------------------------------+--------------+----------
 2012-11-30 19:48:48.57472+04 |   4294967730 |      434
(1 row)

$ /usr/lib/postgresql/9.2/bin/psql -p 5430 -h `pwd`/s -d postgres -c
'select now(),txid_current(),txid_current()-2^32'
              now              | txid_current | ?column?
-------------------------------+--------------+----------
 2012-11-30 19:49:11.758169+04 |   4294967731 |      435
(1 row)
--- checkpoint end

$ /usr/lib/postgresql/9.2/bin/psql -p 5430 -h `pwd`/s -d postgres -c
'select now(),txid_current(),txid_current()-2^32'
              now              | txid_current |  ?column?
-------------------------------+--------------+-------------
 2012-11-30 19:51:06.302396+04 |          439 | -4294966857
(1 row)

Note this previous threads: Standbys, txid_current_snapshot,
wraparound http://archives.postgresql.org/pgsql-hackers/2012-06/msg00888.php

--
Sergey Burladyan

Re: BUG #7710: Xid epoch is not updated properly during checkpoint

From

Tom Lane

Date:

01 December 2012, 22:56:42

tarvip@gmail.com writes:
> [ txid_current can show a bogus value near XID wraparound ]
> This happens only if wal_level=hot_standby.

I believe what is happening here is

(1) CreateCheckPoint sets up checkPoint.nextXid and
checkPoint.nextXidEpoch, near xlog.c line 7070 in HEAD.  At this point,
nextXid is still a bit less than the wrap point.

(2) After performing the checkpoint, at line 7113, CreateCheckPoint
calls LogStandbySnapshot() which "helpfully" updates checkPoint.nextXid
to the latest value.  Which by now has wrapped around.  But it doesn't
fix checkPoint.nextXidEpoch, so the checkpoint that gets written out has
effectively lost the epoch bump that should have happened.

While we could add some more logic to try to correct the epoch value
in this scenario, I think it's a much better idea to just stop having
LogStandbySnapshot update the nextXid.  That seems to me to be useless
complication.  I also quite dislike the fact that we're effectively
redefining the checkpoint nextXid from being taken before the main
body of the checkpoint to being taken afterwards, but *only* in
XLogStandbyInfoActive mode.  If that inconsistency isn't already causing
bugs (besides this one) today, it'll probably cause them in the future.

So barring objections, I'm going to remove LogStandbySnapshot's behavior
of returning the updated nextXid.

            regards, tom lane

Re: BUG #7710: Xid epoch is not updated properly during checkpoint

From

Andres Freund

Date:

01 December 2012, 22:59:33

Hi all,

On 2012-11-27 19:52:09 +0000, tarvip@gmail.com wrote:
> This happens only if wal_level=hot_standby.
>
>
> Here are the steps to reproduce this issue.

Oh. Fucking. Wow.

I think this tiny comment just helped me find the bug. After previously
having looked for it without success for some time I just recognized the
problem while working on something unrelated.

Here it goes:

void
CreateCheckPoint(int flags)
{
...
    /*
     * Get the other info we need for the checkpoint record.
     */
    LWLockAcquire(XidGenLock, LW_SHARED);
    checkPoint.nextXid = ShmemVariableCache->nextXid;
    checkPoint.oldestXid = ShmemVariableCache->oldestXid;
    checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
    LWLockRelease(XidGenLock);

    /* Increase XID epoch if we've wrapped around since last checkpoint */
    checkPoint.nextXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
    if (checkPoint.nextXid < ControlFile->checkPointCopy.nextXid)
        checkPoint.nextXidEpoch++;
// i.e. compute the epoch based on the *current* nextXid.

...
// do all the writing, take some time
        CheckPointGuts(checkPoint.redo, flags);
...
     * Update checkPoint.nextXid since we have a later value
     */
    if (!shutdown && XLogStandbyInfoActive())
        LogStandbySnapshot(&checkPoint.nextXid);
...
    /* Update shared-memory copy of checkpoint XID/epoch */
    {
        /* use volatile pointer to prevent code rearrangement */
        volatile XLogCtlData *xlogctl = XLogCtl;

        SpinLockAcquire(&xlogctl->info_lck);
        xlogctl->ckptXidEpoch = checkPoint.nextXidEpoch;
        xlogctl->ckptXid = checkPoint.nextXid;
        SpinLockRelease(&xlogctl->info_lck);
    }
...
}

Notice the end of the comment above/about LogStandbySnapshot. It updates
nextXid! But it does *not* recheck whether we have had a wraparound +
so it does not increase the epoch counter. Although we might have had a
wraparound inbetween.

Which in turn means the the ShmemVariableCache->nextXid <
xlogctl->ckptXid computation in GetNextXidAndEpoch doesn't return
correct results anymore because it can't recognize that were in a new
epoch now.

Trivial patch attached.

I am not sure that I understand why its interesting to have a newer
->nextXid in the checkpoint, but I don't see anything that would
make it dangerous besides this.

This looks like it also explains #6291 and the slighly different issue
described in CAAZKuFbB7UR3NXV1pkZFRXy=6V1QBq_OeHJWJNTLpKBpH=QFgg@mail.gmail.com,
as the issue is just as present on standbys as it is on primaries.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

0001-Fix-xid-epoch-calculation-with-wal_level-hot_standby.patch

Re: BUG #7710: Xid epoch is not updated properly during checkpoint

From

Andres Freund

Date:

01 December 2012, 23:10:31

On 2012-12-01 17:56:33 -0500, Tom Lane wrote:
> tarvip@gmail.com writes:
> > [ txid_current can show a bogus value near XID wraparound ]
> > This happens only if wal_level=hot_standby.
>
> I believe what is happening here is

Hrmpf. You had to report the fix for that three minutes before
me. ;)

> (1) CreateCheckPoint sets up checkPoint.nextXid and
> checkPoint.nextXidEpoch, near xlog.c line 7070 in HEAD.  At this point,
> nextXid is still a bit less than the wrap point.
>
> (2) After performing the checkpoint, at line 7113, CreateCheckPoint
> calls LogStandbySnapshot() which "helpfully" updates checkPoint.nextXid
> to the latest value.  Which by now has wrapped around.  But it doesn't
> fix checkPoint.nextXidEpoch, so the checkpoint that gets written out has
> effectively lost the epoch bump that should have happened.

Same conclusion here.

> While we could add some more logic to try to correct the epoch value
> in this scenario, I think it's a much better idea to just stop having
> LogStandbySnapshot update the nextXid.  That seems to me to be useless
> complication.  I also quite dislike the fact that we're effectively
> redefining the checkpoint nextXid from being taken before the main
> body of the checkpoint to being taken afterwards, but *only* in
> XLogStandbyInfoActive mode.  If that inconsistency isn't already causing
> bugs (besides this one) today, it'll probably cause them in the future.
>
> So barring objections, I'm going to remove LogStandbySnapshot's behavior
> of returning the updated nextXid.

I don't see any reason why it would be bad to remove this. I think the
current behaviour could actually even delay getting to an active state
slightly in the presence of prepared transactions because its used to
create to initialize the KnownAssignedXid machinery in xlog_redo. If the
prepared xacts are suboverflown its a *good* thing to have an old
->nextXid.

Greetings,

Andres Freund

Re: BUG #7710: Xid epoch is not updated properly during checkpoint

From

Andres Freund

Date:

01 December 2012, 23:49:21

On 2012-12-02 00:10:22 +0100, Andres Freund wrote:
> On 2012-12-01 17:56:33 -0500, Tom Lane wrote:
> > So barring objections, I'm going to remove LogStandbySnapshot's behavior
> > of returning the updated nextXid.
>
> I don't see any reason why it would be bad to remove this. I think the
> current behaviour could actually even delay getting to an active state
> slightly in the presence of prepared transactions because its used to
> create to initialize the KnownAssignedXid machinery in xlog_redo. If the
> prepared xacts are suboverflown its a *good* thing to have an old
> ->nextXid.

That doesn't hold true btw, due to the fact LogStandbySnapshot isn't
called in the shutdown case.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #7710: Xid epoch is not updated properly during checkpoint

From

Simon Riggs

Date:

02 December 2012, 10:13:02

On 1 December 2012 23:10, Andres Freund <andres@anarazel.de> wrote:

>> So barring objections, I'm going to remove LogStandbySnapshot's behavior
>> of returning the updated nextXid.
>
> I don't see any reason why it would be bad to remove this. I think the
> current behaviour could actually even delay getting to an active state
> slightly in the presence of prepared transactions because its used to
> create to initialize the KnownAssignedXid machinery in xlog_redo. If the
> prepared xacts are suboverflown its a *good* thing to have an old
> ->nextXid.

That particular coding was pretty weird. Please change.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #7710: Xid epoch is not updated properly during checkpoint

From

Simon Riggs

Date:

02 December 2012, 12:52:07

On 1 December 2012 22:56, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> tarvip@gmail.com writes:
>> [ txid_current can show a bogus value near XID wraparound ]
>> This happens only if wal_level=hot_standby.
>
> I believe what is happening here is
>
> (1) CreateCheckPoint sets up checkPoint.nextXid and
> checkPoint.nextXidEpoch, near xlog.c line 7070 in HEAD.  At this point,
> nextXid is still a bit less than the wrap point.
>
> (2) After performing the checkpoint, at line 7113, CreateCheckPoint
> calls LogStandbySnapshot() which "helpfully" updates checkPoint.nextXid
> to the latest value.  Which by now has wrapped around.  But it doesn't
> fix checkPoint.nextXidEpoch, so the checkpoint that gets written out has
> effectively lost the epoch bump that should have happened.
>
> While we could add some more logic to try to correct the epoch value
> in this scenario, I think it's a much better idea to just stop having
> LogStandbySnapshot update the nextXid.  That seems to me to be useless
> complication.  I also quite dislike the fact that we're effectively
> redefining the checkpoint nextXid from being taken before the main
> body of the checkpoint to being taken afterwards, but *only* in
> XLogStandbyInfoActive mode.  If that inconsistency isn't already causing
> bugs (besides this one) today, it'll probably cause them in the future.

I agree that the coding looks weird and agree it shouldn't be there.
The meaning of the checkpoint values should not differ because
wal_level has changed.

> So barring objections, I'm going to remove LogStandbySnapshot's behavior
> of returning the updated nextXid.

Removing it may cause other bugs, but if so, those other bugs need to
be solved in the right way, not by having a too-far-forwards nextxid
on the checkpoint record. Having said that, I can't see any bugs that
would be caused by this.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #7710: Xid epoch is not updated properly during checkpoint

From

Simon Riggs

Date:

02 December 2012, 15:08:41

On 2 December 2012 12:51, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 1 December 2012 22:56, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> tarvip@gmail.com writes:
>>> [ txid_current can show a bogus value near XID wraparound ]
>>> This happens only if wal_level=hot_standby.
>>
>> I believe what is happening here is
>>
>> (1) CreateCheckPoint sets up checkPoint.nextXid and
>> checkPoint.nextXidEpoch, near xlog.c line 7070 in HEAD.  At this point,
>> nextXid is still a bit less than the wrap point.
>>
>> (2) After performing the checkpoint, at line 7113, CreateCheckPoint
>> calls LogStandbySnapshot() which "helpfully" updates checkPoint.nextXid
>> to the latest value.  Which by now has wrapped around.  But it doesn't
>> fix checkPoint.nextXidEpoch, so the checkpoint that gets written out has
>> effectively lost the epoch bump that should have happened.
>>
>> While we could add some more logic to try to correct the epoch value
>> in this scenario, I think it's a much better idea to just stop having
>> LogStandbySnapshot update the nextXid.  That seems to me to be useless
>> complication.  I also quite dislike the fact that we're effectively
>> redefining the checkpoint nextXid from being taken before the main
>> body of the checkpoint to being taken afterwards, but *only* in
>> XLogStandbyInfoActive mode.  If that inconsistency isn't already causing
>> bugs (besides this one) today, it'll probably cause them in the future.
>
> I agree that the coding looks weird and agree it shouldn't be there.
> The meaning of the checkpoint values should not differ because
> wal_level has changed.
>
>> So barring objections, I'm going to remove LogStandbySnapshot's behavior
>> of returning the updated nextXid.
>
> Removing it may cause other bugs, but if so, those other bugs need to
> be solved in the right way, not by having a too-far-forwards nextxid
> on the checkpoint record. Having said that, I can't see any bugs that
> would be caused by this.

Andres'  patch is invasive and is not for my liking in backbranches.

Tom's recommended change goes further still and not one I'm
comfortable risking, even though I agree with the root cause/change.

I've applied an absolutely minimal fix on this, which introduces no
other changes that could cause unforeseen consequences.

Others may wish to go further, overriding my patches, as they choose.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #7710: Xid epoch is not updated properly during checkpoint

From

Tom Lane

Date:

02 December 2012, 15:25:31

Simon Riggs <simon@2ndQuadrant.com> writes:
> I've applied an absolutely minimal fix on this, which introduces no
> other changes that could cause unforeseen consequences.

This is not what we'd agreed to do, I thought.

Now that I've thought more about this bug, the existing coding is flat
out wrong, with or without correction of the epoch.  As you yourself
just wrote in a comment, the checkpoint record logically belongs to the
"redo" point in the WAL stream, not to where it's physically located.
Having it carry a nextXid that belongs to the later point is simply
wrong.  Having it carry different nextXids depending on wal_level is
even more wrong.

I can point right now to one misbehavior this causes: if you run a
point-in-time recovery with a stop point somewhere in the middle of the
checkpoint, you should end up with a nextXid corresponding to the stop
point.  This hack in LogStandbySnapshot causes you to end up with a
much later nextXid, if you were running hot-standby.

> Others may wish to go further, overriding my patches, as they choose.

Okay, I will take the responsibility for changing this, but it needs to
change.  This coding was ill-considered from the word go.

            regards, tom lane

Re: BUG #7710: Xid epoch is not updated properly during checkpoint

From

Simon Riggs

Date:

02 December 2012, 15:58:34

On 2 December 2012 15:25, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I can point right now to one misbehavior this causes: if you run a
> point-in-time recovery with a stop point somewhere in the middle of the
> checkpoint, you should end up with a nextXid corresponding to the stop
> point.  This hack in LogStandbySnapshot causes you to end up with a
> much later nextXid, if you were running hot-standby.

True, though that does not cause any problem.

>> Others may wish to go further, overriding my patches, as they choose.
>
> Okay, I will take the responsibility for changing this, but it needs to
> change.

OK. At least we have the minimal coding to fall back on if need be.

> This coding was ill-considered from the word go.

Agreed, but then I don't have a clear reason why it is that way and
yet I'm sure I did it for some reason.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #7710: Xid epoch is not updated properly during checkpoint

From

Tom Lane

Date:

02 December 2012, 16:44:39

Simon Riggs <simon@2ndQuadrant.com> writes:
> On 2 December 2012 15:25, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This coding was ill-considered from the word go.

> Agreed, but then I don't have a clear reason why it is that way and
> yet I'm sure I did it for some reason.

I think you just did it because it looked easy, and you didn't think
very hard about the consequences.

As far as the concern about new bugs is concerned: if we start replay
from this checkpoint, we will certainly not consider that the DB is
consistent until we reach the checkpoint's physical position.  And by
that point we'll have replayed the XLOG_RUNNING_XACTS record emitted by
LogStandbySnapshot, so our idea of the nextXid should be fully up to
date anyway.  The same goes for checkpoints encountered later in the
replay run --- they'd just be duplicative of the preceding
XLOG_RUNNING_XACTS record.  There is no reason to put the same XID into
the checkpoint record.

            regards, tom lane

Re: BUG #7710: Xid epoch is not updated properly during checkpoint

From

Simon Riggs

Date:

02 December 2012, 16:56:22

On 2 December 2012 16:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> On 2 December 2012 15:25, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> This coding was ill-considered from the word go.
>
>> Agreed, but then I don't have a clear reason why it is that way and
>> yet I'm sure I did it for some reason.
>
> I think you just did it because it looked easy, and you didn't think
> very hard about the consequences.

I don't typically think such thoughts. Your ability to see more deeply
into certain topics is a credit to you, not an implication of laziness
on me.

I'm sure I believed it a necessary or useful thing to do.

> As far as the concern about new bugs is concerned: if we start replay
> from this checkpoint, we will certainly not consider that the DB is
> consistent until we reach the checkpoint's physical position.  And by
> that point we'll have replayed the XLOG_RUNNING_XACTS record emitted by
> LogStandbySnapshot, so our idea of the nextXid should be fully up to
> date anyway.  The same goes for checkpoints encountered later in the
> replay run --- they'd just be duplicative of the preceding
> XLOG_RUNNING_XACTS record.  There is no reason to put the same XID into
> the checkpoint record.

Exactly, the end result is identical, but the intermediate states may differ.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #7710: Xid epoch is not updated properly during checkpoint

From

Tom Lane

Date:

02 December 2012, 17:44:18

Simon Riggs <simon@2ndQuadrant.com> writes:
> On 2 December 2012 16:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> As far as the concern about new bugs is concerned: if we start replay
>> from this checkpoint, we will certainly not consider that the DB is
>> consistent until we reach the checkpoint's physical position.  And by
>> that point we'll have replayed the XLOG_RUNNING_XACTS record emitted by
>> LogStandbySnapshot, so our idea of the nextXid should be fully up to
>> date anyway.  The same goes for checkpoints encountered later in the
>> replay run --- they'd just be duplicative of the preceding
>> XLOG_RUNNING_XACTS record.  There is no reason to put the same XID into
>> the checkpoint record.

> Exactly, the end result is identical, but the intermediate states may differ.

Indeed, and the intermediate states are *wrong*, as things stand.  But
they won't be visible to backends AFAICS, since we aren't allowing any
backends in yet.  I'm more concerned about the possible effects on tools
that scan WAL --- and in that connection, it's not apparent to me that
your "minimal fix" is any safer than the other change.  It's a
behavioral change either way.  I think correct tools shouldn't have a
problem either way, but it seems to me to be real hard to argue that
buggy tools would be affected by one change but not the other.  More
likely it could go either way depending on the nature of the bug.

            regards, tom lane