Re: Assertion being hit during WAL replay - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Assertion being hit during WAL replay
Date
Msg-id 20230414184015.v55b7u4s6v4wipxo@awork3.anarazel.de
In response to Re: Assertion being hit during WAL replay  (Andres Freund <andres@anarazel.de>)
Responses Re: Assertion being hit during WAL replay  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi,

On 2023-04-11 15:03:02 -0700, Andres Freund wrote:
> On 2023-04-11 16:54:53 -0400, Tom Lane wrote:
> > Here's something related to what I hit that time:
> > 
> > diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
> > index 052263aea6..d43a7c7bcb 100644
> > --- a/src/backend/optimizer/plan/subselect.c
> > +++ b/src/backend/optimizer/plan/subselect.c
> > @@ -2188,6 +2188,7 @@ SS_charge_for_initplans(PlannerInfo *root, RelOptInfo *final_rel)
> >  void
> >  SS_attach_initplans(PlannerInfo *root, Plan *plan)
> >  {
> > +   Assert(root->init_plans == NIL);
> >     plan->initPlan = root->init_plans;
> >  }
> >  
> > You won't get through initdb with this, but if you install this change
> > into a successfully init'd database and then "make installcheck-parallel",
> > it will crash and then fail to recover, at least a lot of the time.
> 
> Ah, that allowed me to reproduce. Thanks.
> 
> 
> Took me a bit to understand how we actually get into this situation: a PRUNE
> record for a relation+block that doesn't exist during recovery. That doesn't
> commonly happen outside of PITR or such, because we obviously need a block
> with content to generate the PRUNE. The way it happens here is that the
> relation is vacuumed and then truncated, and then we crash. Thus we end up
> with a PRUNE record for a block that doesn't exist on disk.
> 
> Which is also why the test is quite timing sensitive.
> 
> Seems like it'd be good to have a test that covers this scenario. There's
> plenty of code around it that doesn't currently get exercised.
> 
> None of the existing tests seem like a great fit. I guess it could be added to
> 013_crash_restart, but that really focuses on something else.
> 
> So I guess I'll write a 036_notsureyet.pl...
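
To make the sequence quoted above concrete, here is a toy model in C of
replaying a PRUNE record whose block was truncated away before the crash.
It is only a sketch, not PostgreSQL code: every type and function name in it
(ToyRelation, toy_redo, toy_replay_after_crash, ...) is made up for
illustration, and "skip the record when the block is missing" is an
assumption about what a redo routine should do, not a description of the
pushed fix.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome of looking up a block during redo (toy model only). */
typedef enum { TOY_BLK_NEEDS_REDO, TOY_BLK_NOTFOUND } ToyRedoAction;

/* The two record kinds that matter for the scenario. */
typedef enum { TOY_REC_PRUNE, TOY_REC_TRUNCATE } ToyRecType;

typedef struct
{
    ToyRecType  type;
    unsigned    block;      /* PRUNE: target block; TRUNCATE: new length in blocks */
} ToyWalRecord;

typedef struct
{
    unsigned    nblocks;    /* relation length on disk */
    unsigned    pruned;     /* PRUNE records applied to an existing block */
    unsigned    skipped;    /* PRUNE records whose block did not exist */
} ToyRelation;

/* A block can only be redone if it still exists on disk. */
ToyRedoAction
toy_read_block_for_redo(const ToyRelation *rel, unsigned block)
{
    return (block < rel->nblocks) ? TOY_BLK_NEEDS_REDO : TOY_BLK_NOTFOUND;
}

/* Replay one record; a PRUNE for a missing block is counted and skipped. */
void
toy_redo(ToyRelation *rel, const ToyWalRecord *rec)
{
    assert(rel != NULL && rec != NULL);

    switch (rec->type)
    {
        case TOY_REC_PRUNE:
            if (toy_read_block_for_redo(rel, rec->block) == TOY_BLK_NEEDS_REDO)
                rel->pruned++;
            else
                rel->skipped++;     /* no page to touch, nothing to release */
            break;
        case TOY_REC_TRUNCATE:
            if (rec->block < rel->nblocks)
                rel->nblocks = rec->block;
            break;
    }
}

/*
 * Start from the on-disk length left behind by the crash and replay the
 * records written since the last checkpoint, in order.
 */
ToyRelation
toy_replay_after_crash(unsigned nblocks_on_disk,
                       const ToyWalRecord *recs, size_t nrecs)
{
    ToyRelation rel = {nblocks_on_disk, 0, 0};

    for (size_t i = 0; i < nrecs; i++)
        toy_redo(&rel, &recs[i]);
    return rel;
}
```

In this model, if the truncation already reached disk (say one block is left)
and recovery replays a PRUNE for block 3 followed by the TRUNCATE, the PRUNE
is counted as skipped and nothing is pruned. If the crash happens before the
truncation reaches disk, the block still exists and the PRUNE is applied
normally, which fits the observation that the failure is timing sensitive.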

See also the separate report by Alexander Lakhin at
https://postgr.es/m/0b5eb82b-cb99-e0a4-b932-3dc60e2e3926@gmail.com

I pushed the fix + test now.

Greetings,

Andres Freund


