Re: hash_xlog_split_allocate_page: failed to acquire cleanup lock - Mailing list pgsql-hackers

From vignesh C
Subject Re: hash_xlog_split_allocate_page: failed to acquire cleanup lock
Date
Msg-id CALDaNm2Avar8KWTx-DxxOGW1pE9VtYrYMcue7gQPj5pJ-5ttbw@mail.gmail.com
In response to Re: hash_xlog_split_allocate_page: failed to acquire cleanup lock  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Wed, Aug 10, 2022 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Aug 10, 2022 at 10:58 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2022-08-09 20:21:19 -0700, Mark Dilger wrote:
> > > > On Aug 9, 2022, at 7:26 PM, Andres Freund <andres@anarazel.de> wrote:
> > > >
> > > > The relevant code triggering it:
> > > >
> > > >     newbuf = XLogInitBufferForRedo(record, 1);
> > > >     _hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket,
> > > >                               xlrec->new_bucket_flag, true);
> > > >     if (!IsBufferCleanupOK(newbuf))
> > > >             elog(PANIC, "hash_xlog_split_allocate_page: failed to acquire cleanup lock");
> > > >
> > > > Why do we just crash if we don't already have a cleanup lock? That can't be
> > > > right. Or is there supposed to be a guarantee this can't happen?
> > >
> > > Perhaps the code assumes that when xl_hash_split_allocate_page record was
> > > written, the new_bucket field referred to an unused page, and so during
> > > replay it should also refer to an unused page, and being unused, that nobody
> > > will have it pinned.  But at least in heap we sometimes pin unused pages
> > > just long enough to examine them and to see that they are unused.  Maybe
> > > something like that is happening here?
> >
> > I don't think it's a safe assumption that nobody would hold a pin on such a
> > page during recovery. While not the case here, somebody else could have used
> > pg_prewarm to read it in.
> >
> > But also, the checkpointer or bgwriter could have it temporarily pinned, to
> > write it out, or another backend could try to write it out as a victim buffer
> > and have it temporarily pinned.
> >
> >
> > static int
> > SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
> > {
> > ...
> >         /*
> >          * Pin it, share-lock it, write it.  (FlushBuffer will do nothing if the
> >          * buffer is clean by the time we've locked it.)
> >          */
> >         PinBuffer_Locked(bufHdr);
> >         LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
> >
> >
> > As you can see we acquire a pin without holding a lock on the page (and that
> > can't be changed!).
> >
>
> I think this could be the probable reason for failure though I didn't
> try to debug/reproduce this yet. AFAIU, this is possible during
> recovery/replay of WAL record XLOG_HASH_SPLIT_ALLOCATE_PAGE as via
> XLogReadBufferForRedoExtended, we can mark the buffer dirty while
> restoring from full page image. OTOH, because during normal operation
> we didn't mark the page dirty SyncOneBuffer would have skipped it due
> to check (if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))).

I'm trying to simulate the scenario in streaming replication using the following:
CREATE TABLE pvactst (i INT, a INT[], p POINT) with (autovacuum_enabled = off);
CREATE INDEX hash_pvactst ON pvactst USING hash (i);
INSERT INTO pvactst SELECT i, array[1,2,3], point(i, i+1) FROM
generate_series(1,1000) i;

With the above, the standby replays the allocation of a page for the
split operation. I will vary these statements and debug to see whether
the background writer process can be made to pin this buffer at the
right moment, simulating the scenario. I will post my findings once
the analysis is done.
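Andres's pg_prewarm remark upthread suggests one way to get a concurrent pin on the standby side. Purely a sketch (pg_prewarm only pins each page transiently while reading it, so actually hitting the window will likely need a debugger breakpoint in the replay or bgwriter path):

```sql
-- Hypothetical: on the standby, read the hash index's pages into shared
-- buffers around the time the primary's split record is replayed.
SELECT pg_prewarm('hash_pvactst');
```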

Regards,
Vignesh
