Re: START_REPLICATION SLOT causing a crash in an assert build - Mailing list pgsql-hackers

From Andres Freund
Subject Re: START_REPLICATION SLOT causing a crash in an assert build
Date
Msg-id 20221008025633.5njv7ufypkggsbuf@awork3.anarazel.de
Whole thread Raw
In response to Re: START_REPLICATION SLOT causing a crash in an assert build  (Andres Freund <andres@anarazel.de>)
Responses Re: START_REPLICATION SLOT causing a crash in an assert build
List pgsql-hackers
Hi,

On 2022-10-07 12:00:56 -0700, Andres Freund wrote:
> On 2022-10-07 15:30:43 +0900, Kyotaro Horiguchi wrote:
> > The key point  of this is this:
> > 
> > +     * XXX: I think there cannot actually be data from an older slot
> > +     * here. After a crash we throw away the old stats data and if a slot is
> > +     * dropped while shut down, we'll not load the slot data at startup.
> > 
> > I think this is true.  Assuming that we don't recreate or rename
> > objects that have stats after writing out stats, we won't have stats
> > for a different object with the same name.
> 
> Thanks for thinking through this!

> > If we can rely on that fact, the existing check in pgstat_acquire_replslot()
> > becomes useless. Thus we don't need to store object name in stats entry. I
> > agree to that.
> 
> I don't agree with this aspect. I think it's better to ensure that the stats
> object exists when acquiring a slot, rather than later, when reporting. It's a
> lot simpler to think through the lock nesting etc that way.
> 
> I'd also eventually like to remove the stats that are currently kept
> separately in ReorderBuffer, and that will be easier/cheaper if we can rely on
> the stats objects to have already been created.

Here's a cleaned up version of my earlier prototype.

- I wrapped the access to ReplicationSlotCtl->replication_slots[i].data.name
  in a new function bool ReplicationSlotName(index, name). I'm not entirely
  happy with that name, as it sounds like a more general accessor than it is -
  I toyed with ReplicationSlotNameForIndex(), but that seemed somewhat
  unnecessary, I don't forsee a need for another name accessor.

  Anyone wants to weigh in?

- Substantial comment and commit message polishing

- I'm planning to drop PgStat_StatReplSlotEntry.slotname and a
  PGSTAT_FILE_FORMAT_ID bump in master and to rename slotname to
  slotname_unused in 15.

- Self-review found a bug, I copy-pasted create=false in the call to
  pgstat_get_entry_ref() in pgstat_acquire_replslot(). This did *NOT* cause
  any test failures - clearly our test coverage in this area is woefully
  inadequate.

- This patch does not contain a test for the fix. I think this can only be
  tested by a tap test starting pg_recvlogical in the background and checking
  pg_recvlogical's output. That type of test is notoriously hard to be
  reliable, so committing it shortly before the release is wrapped seems like
  a bad idea.

I manually verified that:
- a continually active slot reporting stats after pgstat_reset_replslot()
  works correctly (this is what crashed before)
- replslot stats reporting works correctly after stats were removed
- upgrading from pre-fix to post-fix preserves stats (when keeping slotname /
  not increasing the stats version, of course)


I'm planning to push this either later tonight (if I feel up to it after
cooking dinner) or tomorrow morning PST, due to the release wrap deadline.

Greetings,

Andres Freund

Attachment

pgsql-hackers by date:

Previous
From: Ajay P S
Date:
Subject: Difference between HeapTupleData and TupleTableSlot structures
Next
From: Tom Lane
Date:
Subject: Re: Difference between HeapTupleData and TupleTableSlot structures