Hi,
On 2022-10-07 12:00:56 -0700, Andres Freund wrote:
> On 2022-10-07 15:30:43 +0900, Kyotaro Horiguchi wrote:
> > The key point of this is this:
> >
> > + * XXX: I think there cannot actually be data from an older slot
> > + * here. After a crash we throw away the old stats data and if a slot is
> > + * dropped while shut down, we'll not load the slot data at startup.
> >
> > I think this is true. Assuming that we don't recreate or rename
> > objects that have stats after writing out stats, we won't have stats
> > for a different object with the same name.
>
> Thanks for thinking through this!
> > If we can rely on that fact, the existing check in pgstat_acquire_replslot()
> > becomes useless. Thus we don't need to store object name in stats entry. I
> > agree to that.
>
> I don't agree with this aspect. I think it's better to ensure that the stats
> object exists when acquiring a slot, rather than later, when reporting. It's a
> lot simpler to think through the lock nesting etc that way.
>
> I'd also eventually like to remove the stats that are currently kept
> separately in ReorderBuffer, and that will be easier/cheaper if we can rely on
> the stats objects to have already been created.
Here's a cleaned up version of my earlier prototype.
- I wrapped the access to ReplicationSlotCtl->replication_slots[i].data.name
in a new function bool ReplicationSlotName(index, name). I'm not entirely
happy with that name, as it sounds like a more general accessor than it is -
I toyed with ReplicationSlotNameForIndex(), but that seemed somewhat
unnecessary, I don't forsee a need for another name accessor.
Anyone wants to weigh in?
- Substantial comment and commit message polishing
- I'm planning to drop PgStat_StatReplSlotEntry.slotname and a
PGSTAT_FILE_FORMAT_ID bump in master and to rename slotname to
slotname_unused in 15.
- Self-review found a bug, I copy-pasted create=false in the call to
pgstat_get_entry_ref() in pgstat_acquire_replslot(). This did *NOT* cause
any test failures - clearly our test coverage in this area is woefully
inadequate.
- This patch does not contain a test for the fix. I think this can only be
tested by a tap test starting pg_recvlogical in the background and checking
pg_recvlogical's output. That type of test is notoriously hard to be
reliable, so committing it shortly before the release is wrapped seems like
a bad idea.
I manually verified that:
- a continually active slot reporting stats after pgstat_reset_replslot()
works correctly (this is what crashed before)
- replslot stats reporting works correctly after stats were removed
- upgrading from pre-fix to post-fix preserves stats (when keeping slotname /
not increasing the stats version, of course)
I'm planning to push this either later tonight (if I feel up to it after
cooking dinner) or tomorrow morning PST, due to the release wrap deadline.
Greetings,
Andres Freund