Re: [HACKERS] xlog min recovery request ... is past current point ... - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: [HACKERS] xlog min recovery request ... is past current point ... |
Date | |
Msg-id | CA+TgmoYujN3vnS-bMBcpBia15NnDRUebq+HMzB+JVUCHs6LA6A@mail.gmail.com Whole thread Raw |
In response to | xlog min recovery request ... is past current point ... (Christophe Pettus <xof@thebuild.com>) |
List | pgsql-hackers |
[ blast-from-the-past department ] On Fri, Feb 3, 2012 at 11:33 AM Christophe Pettus <xof@thebuild.com> wrote: > While bringing up a streaming replica, and while it is working its way through the WAL segments before connecting to theprimary, I see a lot of messages of the form: > > 2012-02-01 21:26:13.978 PST,,,24448,,4f2a1e61.5f80,54,,2012-02-01 21:25:53 PST,1/0,0,LOG,00000,"restored log file ""0000000100000DB400000065""from archive",,,,,,,,,"" > 2012-02-01 21:26:14.032 PST,,,24448,,4f2a1e61.5f80,55,,2012-02-01 21:25:53 PST,1/0,0,WARNING,01000,"xlog min recovery requestDB5/42E15098 is past current point DB4/657FA490",,,,,"writing block 5 of relation base/155650/156470_vm > xlog redo insert: rel 1663/155650/1658867; tid 9640/53",,,,"" > 2012-02-01 21:26:14.526 PST,,,24448,,4f2a1e61.5f80,56,,2012-02-01 21:25:53 PST,1/0,0,LOG,00000,"restored log file ""0000000100000DB400000066""from archive",,,,,,,,,"" > > All of these are on _vm relations. The recovery completed successfully and the secondary connected to the primary withoutissue, so: Are these messages something to be concerned over? It appears that we never did anything about this, and I just came across some evidence that it's still a live issue. And, looking at the code, I can't see any reason why it shouldn't be. I found Heikki's list of steps (in his 2012 reply) a bit difficult to understand, so here's my attempt to restate the scenario he describes: Suppose that a user begins a base backup. While the base backup is running, the user performs a DML-operation on an all-visible page, clearing the visibility map bit. Then, they run VACUUM, resetting the bit. The page is copied after both of these things happen. Then, the base backup ends. Next, the user tries to start up the standby. When the DML operation is replayed, the visibility map bit is cleared without changing the page LSN (e.g. heap_xlog_delete -> visibilitymap_clear). If the buffer is now flushed out of shared_buffers before the VACUUM is replayed, then we can get the complaint mentioned in $SUBJECT. OK, so what could/should we do about this? VM pages are subject to a somewhat unusual scheme with regard to write-ahead logging: to minimize WAL traffic, records that clear the visibilitymap bit in passing do not register the buffer, meaning that such records generate no full-page images for the visibility map. Records that set the visibility map bit do register the buffer, but suppress full-page images unless checksums or wal_log_hints are in use. At least if checksums are not in use, this is reasonable, because modifications to VM pages don't rely on the existing page contents, so even if the page is torn, replay need not get confused, as long as it ignores the page LSN when clearing the bit. And that's just what heap_xlog_delete() and friends actually do. It's acceptable but useless to pay attention to the page LSN when setting a bit, because the only thing we can do is decide to not set a bit that should have been set, which doesn't break anything, and in fact heap_xlog_visible() does pay attention to the page LSN "out of an abundance of conservatism" -- a comment I wrote originally, and with which I'm no longer very impressed, because it seems like it's just making things more confusing. The overall picture here is that the LSN of the VM page matters very little. If heap_xlog_visible() didn't have that "if (lsn > PageGetLSN(vmpage))" test guarding visibilitymap_set here, the rule would be simple: the LSN of a VM page would have no effect on replay of related WAL records, period. However, it would probably make the warning that is the subject of this thread more common. Imagine the same scenario outlined above but, instead of the bit starting out set and then being cleared and set again, it starts out cleared and gets set and cleared again. The rest is the same: the page is copied by the base backup in its final form. As things stand, that if-test prevents the VM bit from being set, which also means the page LSN is not set nor is the page dirtied, which means that evicting the page can't trigger this warning. But, regardless of whether we actually remove that if-test, considering it as a hypothetical seems useful. If we actually had a rule that the page LSN doesn't really affect whether actions affecting VM pages are replayed, then we would just need visibilitymap_clear() to set the page LSN just like visibilitymap_set() does. The only thing the page LSN would be doing is causing the warning which is the subject of the thread to get generated, or not, so it would make sense to just set in a way that avoids the warning, since it matters to nothing else. However, checksums seem to complicate the situation considerably. Ignore everything proposed above and suppose that, with checksums enabled, the user deletes a tuple from a page marked all-visible, clearing the VM bit. The dirty page is flushed to the OS. As it is being written back to durable storage, the system crashes, tearing the page. There's no full-page image in the WAL to fix things because the buffer isn't even registered, so I guess the next access to that VM page just errors out. Am I missing something? If not, perhaps the thing to do here is go the other direction and make the WAL logging for VM pages more standard: always register the buffer, take full page images just like any other buffer, and make the decision as to whether or not to replay VM actions fully conditional on the page LSN. The visibility map is so small that the additional volume of full-page writes seems unlikely to matter very much. There's an additional downside in that registering an additional buffer makes the WAL records bigger, and making the WAL records for insert, update, and delete bigger is pretty undesirable, but I think we could do it just in the case where the bit actually needs to be cleared, which ought to be comparatively rare, so maybe it's fine. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: