Re: 'Invalid lp' during heap_xlog_delete - Mailing list pgsql-hackers
From | Daniel Wood |
---|---|
Subject | Re: 'Invalid lp' during heap_xlog_delete |
Date | |
Msg-id | 1335373813.287510.1573611814107@connect.xfinity.com |
In response to | Re: 'Invalid lp' during heap_xlog_delete (Michael Paquier <michael@paquier.xyz>) |
Responses |
Re: 'Invalid lp' during heap_xlog_delete
|
List | pgsql-hackers |
It's been tedious to get it exactly right, but I think I've got it. FYI, I was delayed because today we had yet another customer hit this 'redo max offset' error. The system crashed while a number of autovacuums and a checkpoint were running, and then REDO failed.
Two tiny code changes:
bufmgr.c:BufferSync(): pg_usleep(10000000); // At the beginning of the function
smgr.c:smgrtruncate(): Add the following just after CacheInvalidateSmgr()
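if (forknum == MAIN_FORKNUM && nblocks == 0) {
    pg_usleep(40000000);        /* 40 s: let the stalled checkpoint complete */
    { char *cp=NULL; *cp=13; }  /* NULL write: crash before the physical truncate */
}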
Now for the heavily commented SQL repro. It will require that you execute a checkpoint in another session when instructed by the repro.sql script. You have 4 seconds to do that. The repro script explains exactly what must happen.
-----------------------------------------------------------
create table t (c char(1111));
alter table t alter column c set storage plain;
-- Make sure there actually is an allocated page 0 and it is empty.
-- REDO Delete would ignore a non-existent page: XLogReadBufferForRedoExtended: return BLK_NOTFOUND;
-- Hopefully two row deletes don't trigger autovacuum and truncate the empty page.
insert into t values ('1'), ('2');
delete from t;
checkpoint; -- Checkpoint the empty page to disk.
-- This insert should be before the next checkpoint 'start'. I don't want to replay it.
-- And, yes, there needs to be another checkpoint completed to skip its replay and start
-- with the replay of the delete below.
insert into t values ('1'), ('2');
-- Checkpoint needs to start in another session. However, I need to stall the checkpoint
-- to prevent it from writing the dirty page to disk until I get to the vacuum below.
select 'Please start checkpoint in another session';
select pg_sleep(4);
-- Below is the problematic delete.
-- It succeeds now (online) because the dirty page has two rows on it.
-- However, with respect to crash recovery there are 3 possible scenarios depending on timing.
-- 1) The ongoing checkpoint might write the page with the two rows on it before
-- the deletes. This leads to BLK_NEEDS_REDO for the deletes. It works
-- because the page read from disk has the rows on it.
-- 2) The ongoing checkpoint might write the page just after the deletes.
-- In that case BLK_DONE will happen and there'll be no problem. LSN check.
-- 3) The checkpoint can fail to write the dirty page because a vacuum can call
-- smgrtruncate->DropRelFileNodeBuffers() which invalidates the dirty page.
-- If smgrtruncate safely completes the physical truncation then BLK_NOTFOUND
-- happens and we skip the redo of the delete. So the skipped dirty write is OK.
-- The problem happens if we crash after the 2nd checkpoint completes
-- but before the physical truncate 'mdtruncate()'.
delete from t;
-- The vacuum must complete DropRelFileNodeBuffers.
-- The vacuum must sleep for a few seconds to allow the checkpoint to complete
-- such that recovery starts with the Delete REDO.
-- We must crash before mdtruncate() does the physical truncate. If the physical
-- truncate happens the BLK_NOTFOUND will be returned and the Delete REDO skipped.
vacuum t;
--------------------------------------------------------
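The three scenarios in the script comments, plus the failing crash window, can be modeled with a toy decision function. This is only a sketch under stated assumptions: `DiskPage`, `read_buffer_for_redo`, `replay_delete`, and the `REPLAY_*` values are hypothetical names made up for illustration, not PostgreSQL code, and the LSNs and offsets used with it are arbitrary. Only the three `BLK_*` outcomes come from the repro comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Outcomes named in the repro comments. */
typedef enum { BLK_NEEDS_REDO, BLK_DONE, BLK_NOTFOUND } RedoAction;

/* Hypothetical result of replaying one delete record. */
typedef enum { REPLAY_APPLIED, REPLAY_SKIPPED, REPLAY_INVALID_LP } ReplayResult;

/* Hypothetical model of what recovery finds on disk for the block. */
typedef struct {
    bool     exists;      /* is the block still physically in the file? */
    uint64_t lsn;         /* LSN stamped on the on-disk copy of the page */
    int      max_offset;  /* highest line pointer on the on-disk copy */
} DiskPage;

/* Decide how redo treats a record with LSN record_lsn for this block. */
RedoAction read_buffer_for_redo(const DiskPage *page, uint64_t record_lsn)
{
    if (!page->exists)
        return BLK_NOTFOUND;      /* physical truncate finished: skip record */
    if (page->lsn >= record_lsn)
        return BLK_DONE;          /* change is already on disk (LSN check) */
    return BLK_NEEDS_REDO;
}

/* Replay a delete of line pointer offnum against the on-disk page. */
ReplayResult replay_delete(const DiskPage *page, uint64_t record_lsn, int offnum)
{
    assert(offnum >= 1);          /* line pointer offsets start at 1 */

    if (read_buffer_for_redo(page, record_lsn) != BLK_NEEDS_REDO)
        return REPLAY_SKIPPED;
    if (offnum > page->max_offset)
        return REPLAY_INVALID_LP; /* stale page lacks the targeted row */
    return REPLAY_APPLIED;
}
```

With a delete record at LSN 200 targeting offset 3, a page written with the rows on it (`{true, 150, 4}`) gives `REPLAY_APPLIED`, a page written after the delete (`{true, 250, 4}`) or a truncated file (`{false, 0, 0}`) gives `REPLAY_SKIPPED`, and a stale page that survived because the dirty buffer was dropped before the physical truncate (`{true, 100, 2}`) gives `REPLAY_INVALID_LP`. That last case is the one the repro is built to hit.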
On November 10, 2019 at 11:51 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, Nov 08, 2019 at 06:44:08PM -0800, Daniel Wood wrote:
>> I repro'ed on PG11 and PG10 STABLE but several months old.
>> I looked at 6d05086 but it doesn't address the core issue.
>>
>> DropRelFileNodeBuffers prevents the checkpoint from writing all
>> needed dirty pages for any REDO's that exist BEFORE the truncate.
>> If we crash after a checkpoint but before the physical truncate then
>> the REDO will need to replay the operation against the dirty page
>> that the Drop invalidated.
>
> I am beginning to look at this thread more seriously, and I'd like to
> first try to reproduce that by myself. Could you share the steps you
> used to do that? This includes any manual sleep calls you may have
> added, the timing of the crash, manual checkpoints, debugger
> breakpoints, etc. It may be possible to extract some more generic
> test from that.
> --
> Michael