Thread: BUG #7710: Xid epoch is not updated properly during checkpoint
The following bug has been logged on the website: Bug reference: 7710 Logged by: Tarvi Pillessaar Email address: tarvip@gmail.com PostgreSQL version: 9.1.6 Operating system: linux Description: = This happens only if wal_level=3Dhot_standby. Here are the steps to reproduce this issue. We have following db cluster: postgres@sbox /usr/local/pgsql $ pg_controldata data|grep NextXID Latest checkpoint's NextXID: 0/4294966303 postgres@sbox /usr/local/pgsql $ Basically we have less than 1000 XIDs to epoch boundary. Modify following parameters in conf: checkpoint_segments =3D 16 checkpoint_completion_target =3D 0.9 checkpoint_timeout =3D 2min log_checkpoints =3D on log_line_prefix =3D '%t %r %p %d %u ' wal_level =3D hot_standby Let's start up the cluster: postgres@sbox /usr/local/pgsql $ postgres -D /usr/local/pgsql/data 2012-11-27 20:44:43 EET 26353 LOG: database system was shut down at 2012-11-27 18:43:12 EET 2012-11-27 20:44:43 EET 26352 LOG: database system is ready to accept connections ... In another session: postgres@sbox /usr/local/pgsql $ psql -c "select now(),txid_current(), txid_current()-2^32" now | txid_current | ?column? = -------------------------------+--------------+---------- 2012-11-27 20:45:01.394324+02 | 4294966303 | -993 Now let's consume some XIDs, otherwise we have nothing to checkpoint. postgres@sbox /usr/local/pgsql $ pgbench -c 1 -t 700 ... postgres@sbox /usr/local/pgsql $ psql -c "select now(),txid_current(), txid_current()-2^32" now | txid_current | ?column? = -------------------------------+--------------+---------- 2012-11-27 20:45:27.256096+02 | 4294967005 | -291 After a while, checkpoint starts: 2012-11-27 20:46:43 EET 26354 LOG: checkpoint starting: time Now let's cross the epoch boundary: postgres@sbox /usr/local/pgsql $ pgbench -c 1 -t 700 ... postgres@sbox /usr/local/pgsql $ psql -c "select now(),txid_current(), txid_current()-2^32" now | txid_current | ?column? = -------------------------------+--------------+---------- 2012-11-27 20:46:51.205384+02 | 4294967713 | 417 Seems that we have successfully crossed the boundary. When checkpoint completes: 2012-11-27 20:47:32 EET 26354 LOG: checkpoint complete: wrote 779 buffers (25.4%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=3D49.441 s, sync=3D0.170 s, total=3D49.636 s; sync files=3D16, longes= t=3D0.131 s, average=3D0.010 s postgres@sbox /usr/local/pgsql $ psql -c "select now(),txid_current(), txid_current()-2^32" now | txid_current | ?column? = -------------------------------+--------------+------------- 2012-11-27 20:47:42.031007+02 | 421 | -4294966875 It seems that epoch bump was rolled back.
Hi all. We also have reproduced this behaviour of txid wraparound. pg version 9.0.5. We have done some test. We are using hot standby and our chekpoint interval is closly to one hour (actually, our pg does chekpoints all time). So probability of intersection active chekpoint and getting max txid is near 100% First time we got fail whith this wraparound was some month ago during londiste-replication. and we haven't found any satisfactory explaination, but now, we get new conditions. Please, review checkpoint and hotstandy. Does anybody know this bug exists in head? We have done our test in 9.2.1 and bug have been found again: --- checkpoint begin $ /usr/lib/postgresql/9.2/bin/psql -p 5430 -h `pwd`/s -d postgres -c 'select now(),txid_current(),txid_current()-2^32' now | txid_current | ?column? ------------------------------+--------------+---------- 2012-11-30 19:48:48.57472+04 | 4294967730 | 434 (1 row) $ /usr/lib/postgresql/9.2/bin/psql -p 5430 -h `pwd`/s -d postgres -c 'select now(),txid_current(),txid_current()-2^32' now | txid_current | ?column? -------------------------------+--------------+---------- 2012-11-30 19:49:11.758169+04 | 4294967731 | 435 (1 row) --- checkpoint end $ /usr/lib/postgresql/9.2/bin/psql -p 5430 -h `pwd`/s -d postgres -c 'select now(),txid_current(),txid_current()-2^32' now | txid_current | ?column? -------------------------------+--------------+------------- 2012-11-30 19:51:06.302396+04 | 439 | -4294966857 (1 row) Note this previous threads: Standbys, txid_current_snapshot, wraparound http://archives.postgresql.org/pgsql-hackers/2012-06/msg00888.php -- Sergey Burladyan
tarvip@gmail.com writes: > [ txid_current can show a bogus value near XID wraparound ] > This happens only if wal_level=hot_standby. I believe what is happening here is (1) CreateCheckPoint sets up checkPoint.nextXid and checkPoint.nextXidEpoch, near xlog.c line 7070 in HEAD. At this point, nextXid is still a bit less than the wrap point. (2) After performing the checkpoint, at line 7113, CreateCheckPoint calls LogStandbySnapshot() which "helpfully" updates checkPoint.nextXid to the latest value. Which by now has wrapped around. But it doesn't fix checkPoint.nextXidEpoch, so the checkpoint that gets written out has effectively lost the epoch bump that should have happened. While we could add some more logic to try to correct the epoch value in this scenario, I think it's a much better idea to just stop having LogStandbySnapshot update the nextXid. That seems to me to be useless complication. I also quite dislike the fact that we're effectively redefining the checkpoint nextXid from being taken before the main body of the checkpoint to being taken afterwards, but *only* in XLogStandbyInfoActive mode. If that inconsistency isn't already causing bugs (besides this one) today, it'll probably cause them in the future. So barring objections, I'm going to remove LogStandbySnapshot's behavior of returning the updated nextXid. regards, tom lane
Hi all, On 2012-11-27 19:52:09 +0000, tarvip@gmail.com wrote: > This happens only if wal_level=hot_standby. > > > Here are the steps to reproduce this issue. Oh. Fucking. Wow. I think this tiny comment just helped me find the bug. After previously having looked for it without success for some time I just recognized the problem while working on something unrelated. Here it goes: void CreateCheckPoint(int flags) { ... /* * Get the other info we need for the checkpoint record. */ LWLockAcquire(XidGenLock, LW_SHARED); checkPoint.nextXid = ShmemVariableCache->nextXid; checkPoint.oldestXid = ShmemVariableCache->oldestXid; checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB; LWLockRelease(XidGenLock); /* Increase XID epoch if we've wrapped around since last checkpoint */ checkPoint.nextXidEpoch = ControlFile->checkPointCopy.nextXidEpoch; if (checkPoint.nextXid < ControlFile->checkPointCopy.nextXid) checkPoint.nextXidEpoch++; // i.e. compute the epoch based on the *current* nextXid. ... // do all the writing, take some time CheckPointGuts(checkPoint.redo, flags); ... * Update checkPoint.nextXid since we have a later value */ if (!shutdown && XLogStandbyInfoActive()) LogStandbySnapshot(&checkPoint.nextXid); ... /* Update shared-memory copy of checkpoint XID/epoch */ { /* use volatile pointer to prevent code rearrangement */ volatile XLogCtlData *xlogctl = XLogCtl; SpinLockAcquire(&xlogctl->info_lck); xlogctl->ckptXidEpoch = checkPoint.nextXidEpoch; xlogctl->ckptXid = checkPoint.nextXid; SpinLockRelease(&xlogctl->info_lck); } ... } Notice the end of the comment above/about LogStandbySnapshot. It updates nextXid! But it does *not* recheck whether we have had a wraparound + so it does not increase the epoch counter. Although we might have had a wraparound inbetween. Which in turn means the the ShmemVariableCache->nextXid < xlogctl->ckptXid computation in GetNextXidAndEpoch doesn't return correct results anymore because it can't recognize that were in a new epoch now. Trivial patch attached. I am not sure that I understand why its interesting to have a newer ->nextXid in the checkpoint, but I don't see anything that would make it dangerous besides this. This looks like it also explains #6291 and the slighly different issue described in CAAZKuFbB7UR3NXV1pkZFRXy=6V1QBq_OeHJWJNTLpKBpH=QFgg@mail.gmail.com, as the issue is just as present on standbys as it is on primaries. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2012-12-01 17:56:33 -0500, Tom Lane wrote: > tarvip@gmail.com writes: > > [ txid_current can show a bogus value near XID wraparound ] > > This happens only if wal_level=hot_standby. > > I believe what is happening here is Hrmpf. You had to report the fix for that three minutes before me. ;) > (1) CreateCheckPoint sets up checkPoint.nextXid and > checkPoint.nextXidEpoch, near xlog.c line 7070 in HEAD. At this point, > nextXid is still a bit less than the wrap point. > > (2) After performing the checkpoint, at line 7113, CreateCheckPoint > calls LogStandbySnapshot() which "helpfully" updates checkPoint.nextXid > to the latest value. Which by now has wrapped around. But it doesn't > fix checkPoint.nextXidEpoch, so the checkpoint that gets written out has > effectively lost the epoch bump that should have happened. Same conclusion here. > While we could add some more logic to try to correct the epoch value > in this scenario, I think it's a much better idea to just stop having > LogStandbySnapshot update the nextXid. That seems to me to be useless > complication. I also quite dislike the fact that we're effectively > redefining the checkpoint nextXid from being taken before the main > body of the checkpoint to being taken afterwards, but *only* in > XLogStandbyInfoActive mode. If that inconsistency isn't already causing > bugs (besides this one) today, it'll probably cause them in the future. > > So barring objections, I'm going to remove LogStandbySnapshot's behavior > of returning the updated nextXid. I don't see any reason why it would be bad to remove this. I think the current behaviour could actually even delay getting to an active state slightly in the presence of prepared transactions because its used to create to initialize the KnownAssignedXid machinery in xlog_redo. If the prepared xacts are suboverflown its a *good* thing to have an old ->nextXid. Greetings, Andres Freund
On 2012-12-02 00:10:22 +0100, Andres Freund wrote: > On 2012-12-01 17:56:33 -0500, Tom Lane wrote: > > So barring objections, I'm going to remove LogStandbySnapshot's behavior > > of returning the updated nextXid. > > I don't see any reason why it would be bad to remove this. I think the > current behaviour could actually even delay getting to an active state > slightly in the presence of prepared transactions because its used to > create to initialize the KnownAssignedXid machinery in xlog_redo. If the > prepared xacts are suboverflown its a *good* thing to have an old > ->nextXid. That doesn't hold true btw, due to the fact LogStandbySnapshot isn't called in the shutdown case. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 1 December 2012 23:10, Andres Freund <andres@anarazel.de> wrote: >> So barring objections, I'm going to remove LogStandbySnapshot's behavior >> of returning the updated nextXid. > > I don't see any reason why it would be bad to remove this. I think the > current behaviour could actually even delay getting to an active state > slightly in the presence of prepared transactions because its used to > create to initialize the KnownAssignedXid machinery in xlog_redo. If the > prepared xacts are suboverflown its a *good* thing to have an old > ->nextXid. That particular coding was pretty weird. Please change. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 1 December 2012 22:56, Tom Lane <tgl@sss.pgh.pa.us> wrote: > tarvip@gmail.com writes: >> [ txid_current can show a bogus value near XID wraparound ] >> This happens only if wal_level=hot_standby. > > I believe what is happening here is > > (1) CreateCheckPoint sets up checkPoint.nextXid and > checkPoint.nextXidEpoch, near xlog.c line 7070 in HEAD. At this point, > nextXid is still a bit less than the wrap point. > > (2) After performing the checkpoint, at line 7113, CreateCheckPoint > calls LogStandbySnapshot() which "helpfully" updates checkPoint.nextXid > to the latest value. Which by now has wrapped around. But it doesn't > fix checkPoint.nextXidEpoch, so the checkpoint that gets written out has > effectively lost the epoch bump that should have happened. > > While we could add some more logic to try to correct the epoch value > in this scenario, I think it's a much better idea to just stop having > LogStandbySnapshot update the nextXid. That seems to me to be useless > complication. I also quite dislike the fact that we're effectively > redefining the checkpoint nextXid from being taken before the main > body of the checkpoint to being taken afterwards, but *only* in > XLogStandbyInfoActive mode. If that inconsistency isn't already causing > bugs (besides this one) today, it'll probably cause them in the future. I agree that the coding looks weird and agree it shouldn't be there. The meaning of the checkpoint values should not differ because wal_level has changed. > So barring objections, I'm going to remove LogStandbySnapshot's behavior > of returning the updated nextXid. Removing it may cause other bugs, but if so, those other bugs need to be solved in the right way, not by having a too-far-forwards nextxid on the checkpoint record. Having said that, I can't see any bugs that would be caused by this. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2 December 2012 12:51, Simon Riggs <simon@2ndquadrant.com> wrote: > On 1 December 2012 22:56, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> tarvip@gmail.com writes: >>> [ txid_current can show a bogus value near XID wraparound ] >>> This happens only if wal_level=hot_standby. >> >> I believe what is happening here is >> >> (1) CreateCheckPoint sets up checkPoint.nextXid and >> checkPoint.nextXidEpoch, near xlog.c line 7070 in HEAD. At this point, >> nextXid is still a bit less than the wrap point. >> >> (2) After performing the checkpoint, at line 7113, CreateCheckPoint >> calls LogStandbySnapshot() which "helpfully" updates checkPoint.nextXid >> to the latest value. Which by now has wrapped around. But it doesn't >> fix checkPoint.nextXidEpoch, so the checkpoint that gets written out has >> effectively lost the epoch bump that should have happened. >> >> While we could add some more logic to try to correct the epoch value >> in this scenario, I think it's a much better idea to just stop having >> LogStandbySnapshot update the nextXid. That seems to me to be useless >> complication. I also quite dislike the fact that we're effectively >> redefining the checkpoint nextXid from being taken before the main >> body of the checkpoint to being taken afterwards, but *only* in >> XLogStandbyInfoActive mode. If that inconsistency isn't already causing >> bugs (besides this one) today, it'll probably cause them in the future. > > I agree that the coding looks weird and agree it shouldn't be there. > The meaning of the checkpoint values should not differ because > wal_level has changed. > >> So barring objections, I'm going to remove LogStandbySnapshot's behavior >> of returning the updated nextXid. > > Removing it may cause other bugs, but if so, those other bugs need to > be solved in the right way, not by having a too-far-forwards nextxid > on the checkpoint record. Having said that, I can't see any bugs that > would be caused by this. Andres' patch is invasive and is not for my liking in backbranches. Tom's recommended change goes further still and not one I'm comfortable risking, even though I agree with the root cause/change. I've applied an absolutely minimal fix on this, which introduces no other changes that could cause unforeseen consequences. Others may wish to go further, overriding my patches, as they choose. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes: > I've applied an absolutely minimal fix on this, which introduces no > other changes that could cause unforeseen consequences. This is not what we'd agreed to do, I thought. Now that I've thought more about this bug, the existing coding is flat out wrong, with or without correction of the epoch. As you yourself just wrote in a comment, the checkpoint record logically belongs to the "redo" point in the WAL stream, not to where it's physically located. Having it carry a nextXid that belongs to the later point is simply wrong. Having it carry different nextXids depending on wal_level is even more wrong. I can point right now to one misbehavior this causes: if you run a point-in-time recovery with a stop point somewhere in the middle of the checkpoint, you should end up with a nextXid corresponding to the stop point. This hack in LogStandbySnapshot causes you to end up with a much later nextXid, if you were running hot-standby. > Others may wish to go further, overriding my patches, as they choose. Okay, I will take the responsibility for changing this, but it needs to change. This coding was ill-considered from the word go. regards, tom lane
On 2 December 2012 15:25, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I can point right now to one misbehavior this causes: if you run a > point-in-time recovery with a stop point somewhere in the middle of the > checkpoint, you should end up with a nextXid corresponding to the stop > point. This hack in LogStandbySnapshot causes you to end up with a > much later nextXid, if you were running hot-standby. True, though that does not cause any problem. >> Others may wish to go further, overriding my patches, as they choose. > > Okay, I will take the responsibility for changing this, but it needs to > change. OK. At least we have the minimal coding to fall back on if need be. > This coding was ill-considered from the word go. Agreed, but then I don't have a clear reason why it is that way and yet I'm sure I did it for some reason. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes: > On 2 December 2012 15:25, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> This coding was ill-considered from the word go. > Agreed, but then I don't have a clear reason why it is that way and > yet I'm sure I did it for some reason. I think you just did it because it looked easy, and you didn't think very hard about the consequences. As far as the concern about new bugs is concerned: if we start replay from this checkpoint, we will certainly not consider that the DB is consistent until we reach the checkpoint's physical position. And by that point we'll have replayed the XLOG_RUNNING_XACTS record emitted by LogStandbySnapshot, so our idea of the nextXid should be fully up to date anyway. The same goes for checkpoints encountered later in the replay run --- they'd just be duplicative of the preceding XLOG_RUNNING_XACTS record. There is no reason to put the same XID into the checkpoint record. regards, tom lane
On 2 December 2012 16:44, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On 2 December 2012 15:25, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> This coding was ill-considered from the word go. > >> Agreed, but then I don't have a clear reason why it is that way and >> yet I'm sure I did it for some reason. > > I think you just did it because it looked easy, and you didn't think > very hard about the consequences. I don't typically think such thoughts. Your ability to see more deeply into certain topics is a credit to you, not an implication of laziness on me. I'm sure I believed it a necessary or useful thing to do. > As far as the concern about new bugs is concerned: if we start replay > from this checkpoint, we will certainly not consider that the DB is > consistent until we reach the checkpoint's physical position. And by > that point we'll have replayed the XLOG_RUNNING_XACTS record emitted by > LogStandbySnapshot, so our idea of the nextXid should be fully up to > date anyway. The same goes for checkpoints encountered later in the > replay run --- they'd just be duplicative of the preceding > XLOG_RUNNING_XACTS record. There is no reason to put the same XID into > the checkpoint record. Exactly, the end result is identical, but the intermediate states may differ. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes: > On 2 December 2012 16:44, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> As far as the concern about new bugs is concerned: if we start replay >> from this checkpoint, we will certainly not consider that the DB is >> consistent until we reach the checkpoint's physical position. And by >> that point we'll have replayed the XLOG_RUNNING_XACTS record emitted by >> LogStandbySnapshot, so our idea of the nextXid should be fully up to >> date anyway. The same goes for checkpoints encountered later in the >> replay run --- they'd just be duplicative of the preceding >> XLOG_RUNNING_XACTS record. There is no reason to put the same XID into >> the checkpoint record. > Exactly, the end result is identical, but the intermediate states may differ. Indeed, and the intermediate states are *wrong*, as things stand. But they won't be visible to backends AFAICS, since we aren't allowing any backends in yet. I'm more concerned about the possible effects on tools that scan WAL --- and in that connection, it's not apparent to me that your "minimal fix" is any safer than the other change. It's a behavioral change either way. I think correct tools shouldn't have a problem either way, but it seems to me to be real hard to argue that buggy tools would be affected by one change but not the other. More likely it could go either way depending on the nature of the bug. regards, tom lane