Thread: 'Invalid lp' during heap_xlog_delete
Page on disk has empty lp 1
* Insert into page lp 1
checkpoint START. Redo eventually starts here.
** Delete all rows on page.
autovac truncate
DropRelFileNodeBuffers - dirty page NOT written. lp 1 on disk still empty
checkpoint completes
crash
smgrtruncate - Not reached
heap_xlog_delete reads page with empty lp 1 and the delete fails.
For this to repro, the checkpoint must not yet have written * or ** before DropRelFileNodeBuffers invalidates either of those dirty page versions.
Even if we reach the truncate we don't fsync it till the next checkpoint. So on filesystems which delay metadata updates, a crash can lose the truncate.
Once we do the fsync() for the truncate, the REDO read will return BLK_NOTFOUND and the DELETE REDO attempt will be skipped.
Without the fsync(), or crashing before the truncate, the delete redo depends on the most recent version of the page having been written by the checkpoint.
Found during a stress test and verified with pg_usleep()s to test the hypothesis.
Is DropRelFileNodeBuffers purely for performance, or would there be any correctness problems if it were not done?
On Fri, Nov 08, 2019 at 12:46:51PM -0800, Daniel Wood wrote:
> Is DropRelFileNodeBuffers purely for performance or would there be
> any correctness problems if not done.

On which version did you find that? Only HEAD or did you use a
version on a stable branch? There has been some work done in this
area lately as of 6d05086.
--
Michael
I repro'ed on PG11 and PG10 STABLE but several months old.
I looked at 6d05086 but it doesn't address the core issue.
DropRelFileNodeBuffers prevents the checkpoint from writing all needed dirty pages for any REDO's that exist BEFORE the truncate. If we crash after a checkpoint but before the physical truncate then the REDO will need to replay the operation against the dirty page that the Drop invalidated.
Teja Mupparti, an engineer I work with, suggested moving DropRelFileNodeBuffers to the bottom of smgrtruncate() after the physical truncate. Doing that along with a fsync() after the truncate seems to plug the hole.
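The suggestion amounts to reordering smgrtruncate() roughly like this (a sketch of the idea only, not a patch; the current code drops the buffers before the physical truncate):

```
smgrtruncate(reln, forknum, nblocks):
    ...
    smgr_truncate callback            /* physical truncate first            */
    fsync the fork                    /* make the new length durable now,
                                         not at the next checkpoint         */
    DropRelFileNodeBuffers(...)       /* invalidate cached pages only after
                                         the truncate is on disk            */
```

Note that later in the thread the fsync part is questioned, since an fsync does nothing for standbys or PITR.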
On Fri, Nov 08, 2019 at 06:44:08PM -0800, Daniel Wood wrote:
> I repro'ed on PG11 and PG10 STABLE but several months old.
> I looked at 6d05086 but it doesn't address the core issue.
>
> DropRelFileNodeBuffers prevents the checkpoint from writing all
> needed dirty pages for any REDO's that exist BEFORE the truncate.
> If we crash after a checkpoint but before the physical truncate then
> the REDO will need to replay the operation against the dirty page
> that the Drop invalidated.

I am beginning to look at this thread more seriously, and I'd like to
first try to reproduce that by myself. Could you share the steps you
used to do that? This includes any manual sleep calls you may have
added, the timing of the crash, manual checkpoints, debugger
breakpoints, etc. It may be possible to extract some more generic
test from that.
--
Michael
It's been tedious to get it exactly right but I think I got it. FYI, I was delayed because today we had yet another customer hit this: 'redo max offset' error. The system crashed as a number of autovacuums and a checkpoint happened and then the REDO failure.
Two tiny code changes:

bufmgr.c: BufferSync(): add at the beginning of the function:

    pg_usleep(10000000);

smgr.c: smgrtruncate(): add the following just after CacheInvalidateSmgr():

    if (forknum == MAIN_FORKNUM && nblocks == 0)
    {
        pg_usleep(40000000);
        { char *cp = NULL; *cp = 13; }   /* null-pointer write to force a crash */
    }
Now for the heavily commented SQL repro. It will require that you execute a checkpoint in another session when instructed by the repro.sql script. You have 4 seconds to do that. The repro script explains exactly what must happen.
-----------------------------------------------------------
create table t (c char(1111));
alter table t alter column c set storage plain;
-- Make sure there actually is an allocated page 0 and it is empty.
-- REDO Delete would ignore a non-existent page: XLogReadBufferForRedoExtended: return BLK_NOTFOUND;
-- Hopefully two row deletes don't trigger autovacuum and truncate the empty page.
insert into t values ('1'), ('2');
delete from t;
checkpoint; -- Checkpoint the empty page to disk.
-- This insert should be before the next checkpoint 'start'. I don't want to replay it.
-- And, yes, there needs to be another checkpoint completed to skip its replay and start
-- with the replay of the delete below.
insert into t values ('1'), ('2');
-- Checkpoint needs to start in another session. However, I need to stall the checkpoint
-- to prevent it from writing the dirty page to disk until I get to the vacuum below.
select 'Please start checkpoint in another session';
select pg_sleep(4);
-- Below is the problematic delete.
-- It succeeds now (online) because the dirty page has two rows on it.
-- However, with respect to crash recovery there are 3 possible scenarios depending on timing.
-- 1) The ongoing checkpoint might write the page with the two rows on it before
-- the deletes. This leads to BLK_NEEDS_REDO for the deletes. It works
-- because the page read from disk has the rows on it.
-- 2) The ongoing checkpoint might write the page just after the deletes.
-- In that case BLK_DONE will happen and there'll be no problem. LSN check.
-- 3) The checkpoint can fail to write the dirty page because a vacuum can call
-- smgrtruncate->DropRelFileNodeBuffers() which invalidates the dirty page.
-- If smgrtruncate safely completes the physical truncation then BLK_NOTFOUND
-- happens and we skip the redo of the delete. So the skipped dirty write is OK.
-- The problem happens if we crash after the 2nd checkpoint completes
-- but before the physical truncate 'mdtruncate()'.
delete from t;
-- The vacuum must complete DropRelFileNodeBuffers.
-- The vacuum must sleep for a few seconds to allow the checkpoint to complete
-- such that recovery starts with the Delete REDO.
-- We must crash before mdtruncate() does the physical truncate. If the physical
-- truncate happens the BLK_NOTFOUND will be returned and the Delete REDO skipped.
vacuum t;
--------------------------------------------------------
Sorry I missed one thing. Turn off full page writes. I'm running in an env. with atomic 8K writes.
On Thu, Nov 14, 2019 at 07:38:19PM -0800, Daniel Wood wrote:
> Sorry I missed one thing. Turn off full page writes.

Hmm. Linux FSes typically use 4kB pages. I'll try to look into that
with lower page sizes for relation and WAL pages.

> I'm running in an env. with atomic 8K writes.

What's that?
--
Michael
> I'll try to look into that with lower page sizes for relation and WAL pages.
The page size is totally unrelated to this bug. When you repro the redo failure it is because the log record is being applied to an old page version. The correct newer page version never got written because of the truncate page invalidation. The cause is not a torn write.
> What's that?
Azure PostgreSQL on Windows has proprietary mechanisms in its filesystem to guarantee that a write is atomic.
- Dan
Hi,

On 2019-11-08 12:46:51 -0800, Daniel Wood wrote:
> Page on disk has empty lp 1
> * Insert into page lp 1
>
> checkpoint START. Redo eventually starts here.
> ** Delete all rows on page.

It's worthwhile to note that this part cannot happen unless full page
writes are disabled. By dint of a checkpoint having started previously,
this will otherwise always include an FPW (or be marked as WILL_INIT by
a previous record, which functionally is equivalent).

> autovac truncate
> DropRelFileNodeBuffers - dirty page NOT written. lp 1 on disk still empty
> checkpoint completes

If I understand correctly, the DropRelFileNodeBuffers() needs to happen
before the BufferSync() reaches the buffer containing the page with the
deletion, but the crash before the relevant file(s) are truncated. And
obviously the deletion needs to have finished modifying the page in
question. Not a conflict with what you wrote, just confirming.

> crash
> smgrtruncate - Not reached

This seems like a somewhat confusing description to me, because
smgrtruncate() is what calls DropRelFileNodeBuffers(). I assume what
you mean by "smgrtruncate" is not the function, but the smgr_truncate()
callback?

> Even if we reach the truncate we don't fsync it till the next
> checkpoint. So on filesystems which delay metadata updates a crash
> can lose the truncate.

I think that's probably fine though. Leaving aside the issue of a
checkpoint completing in between the DropRelFileNodeBuffers() and the
actual truncation, we'd have the WAL-logged truncation truncating the
file. I don't think it's unreasonable to expect a filesystem that
claims to support running without full_page_writes (I've seen several
such claims turn out not to be true under load) to preserve either the
original page contents or the new file size after a crash. If your
filesystem doesn't, you really ought not to use it with FPW = off.

I do wonder, halfway related, if there's an argument that
XLogReadBufferForRedoExtended() ought to return something other than
BLK_NEEDS_REDO for pages read during recovery that are all-zeroes, at
least for some RBM_* modes.

> Once we do the fsync(), for the truncate, the REDO read will return
> BLK_NOTFOUND and the DELETE REDO attempt will be skipped. Without the
> fsync() or crashing before the truncate, the delete redo depends on
> the most recent version of the page having been written by the
> checkpoint.

We also need to correctly replay this on a standby, so I don't think
just adding an smgrimmedsync() is the answer. We'd not replay the
truncation on standbys / during PITR, unless I miss something. So we'd
just end up with the same problem in slightly different situations.

> Is DropRelFileNodeBuffers purely for performance or would there be any
> correctness problems if not done.

There would be correctness problems if we left that out - the on-disk
state and the in-memory state would diverge.

To me it sounds like the fix here would be to rejigger the
RelationTruncate() sequence for truncation of the main fork as follows:

1) MyPgXact->delayChkpt = true
2) XLogInsert(XLOG_SMGR_TRUNCATE)
3) smgrtruncate() (which, as now, first does a DropRelFileNodeBuffers(),
   and then calls the smgr_truncate callback)
4) MyPgXact->delayChkpt = false

I'm not worried about the increased delayChkpt = true time. Compared
with the frequency of RecordTransactionCommit() this seems harmless.

I'm inclined to think that we should make the XLogFlush() in the
RelationNeedsWAL() branch of RelationTruncate() unconditional.
Performing the truncation on the filesystem level without actually
having persisted the corresponding WAL record is dangerous.

I think we need to backpatch a fix for this (even if one were to
consider fpw = off unsupported). There are enough other nasty edge
cases here.

While fpw=on fixes the deletion case at hand, I think you could very
well end up in a nasty situation in other cases where either the redo
location or the actual checkpoint record would fall between the WAL
record and the actual truncation. Imagine e.g. a base backup starting
in such a situation - you'd potentially end up with a relation that
contains old data, without later replaying the truncation record. The
window isn't huge, but also not negligible.

I'll start a separate thread about whether we need to do a good bit of
the work in smgrtruncate() / smgrdounlink() / ... in critical sections.

- Andres
> On December 6, 2019 at 3:06 PM Andres Freund <andres@anarazel.de> wrote:
...
> > crash
> > smgrtruncate - Not reached
>
> This seems like a somewhat confusing description to me, because
> smgrtruncate() is what calls DropRelFileNodeBuffers(). I assume what you
> mean by "smgrtruncate" is not the function, but the smgr_truncate()
> callback?

My mistake. Yes, smgr_truncate().

> > Even if we reach the truncate we don't fsync it till the next
> > checkpoint. So on filesystems which delay metadata updates a crash
> > can lose the truncate.
>
> I think that's probably fine though. Leaving aside the issue of a
> checkpoint completing in between the DropRelFileNodeBuffers() and the
> actual truncation, we'd have the WAL-logged truncation truncating the
> file. I don't think it's unreasonable to expect a filesystem that claims
> to support running without full_page_writes (I've seen several such
> claims turn out not to be true under load) to preserve either the
> original page contents or the new file size after a crash. If your
> filesystem doesn't, you really ought not to use it with FPW = off.

If the physical truncate doesn't occur in the seconds after the cache
invalidation, nor the fsync within the minutes till the next checkpoint,
you are NOT left with a torn page on disk. You are left with an
'incorrect' page on disk: an older page, because the invalidation
prevented the write of the most recent dirty page. Redos don't like old
incorrect pages. But, yes, full page writes cover up this anomaly (to be
generous).

> > Once we do the fsync(), for the truncate, the REDO read will return
> > BLK_NOTFOUND and the DELETE REDO attempt will be skipped. Without the
> > fsync() or crashing before the truncate, the delete redo depends on
> > the most recent version of the page having been written by the
> > checkpoint.
>
> We also need to correctly replay this on a standby, so I don't think
> just adding an smgrimmedsync() is the answer. We'd not replay the
> truncation on standbys / during PITR, unless I miss something. So we'd
> just end up with the same problem in slightly different situations.

I haven't mentioned to you that we have seen what appears to be the same
problem during PITRs, depending on which base backup we start with. I
didn't mention it because I haven't created a repro to prove it. I
simply suspect it.

> To me it sounds like the fix here would be to rejigger the RelationTruncate()
> sequence for truncation of the main fork as follows:
>
> 1) MyPgXact->delayChkpt = true
> 2) XLogInsert(XLOG_SMGR_TRUNCATE)
> 3) smgrtruncate() (which, as now, first does a DropRelFileNodeBuffers(),
>    and then calls the smgr_truncate callback)
> 4) MyPgXact->delayChkpt = false
>
> I'm not worried about the increased delayChkpt = true time. Compared
> with the frequency of RecordTransactionCommit() this seems harmless.

Seems reasonable.

> I'm inclined to think that we should make the XLogFlush() in the
> RelationNeedsWAL() branch of RelationTruncate()
> unconditional. Performing the truncation on the filesystem level without
> actually having persisted the corresponding WAL record is dangerous.

Yes, I also was curious about why it was conditional.

- Dan Wood
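Combining the two points being agreed on here (the delayChkpt reordering and an unconditional XLogFlush()), the proposed sequence might look like this sketch (pseudocode only, not a patch):

```
RelationTruncate(rel, nblocks):
    MyPgXact->delayChkpt = true          /* 1) keep a checkpoint from completing
                                               between the WAL record and the
                                               physical truncate              */
    lsn = XLogInsert(XLOG_SMGR_TRUNCATE) /* 2) WAL-log the truncation         */
    XLogFlush(lsn)                       /*    unconditionally, so the record
                                               is durable before the filesystem
                                               truncate                       */
    smgrtruncate(...)                    /* 3) DropRelFileNodeBuffers(), then
                                               the smgr_truncate callback     */
    MyPgXact->delayChkpt = false         /* 4) allow checkpoints again        */
```

With this ordering, any checkpoint's redo pointer is either before the truncation record (so recovery replays the truncate and the delete redo returns BLK_NOTFOUND) or after the physical truncate has completed.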