On Wed, Jun 12, 2024 at 08:00:00PM +0300, Alexander Lakhin wrote:
> With extra logging added, I see the following events happening:
> 1) pgstat_rc_1.setup calls pgstat_create_replslot(), gets
> ReplicationSlotIndex(slot) = 0 and calls
> pgstat_get_entry_ref_locked(PGSTAT_KIND_REPLSLOT, InvalidOid, 0, 0).
>
> 2) pgstat_rc_1.s0_get_changes executes pg_logical_slot_get_changes(...)
> and then calls pgstat_gc_entry_refs on shmem_exit() ->
> pgstat_shutdown_hook() ...;
> with the sleep added inside pgstat_release_entry_ref, this backend waits
> after decreasing entry_ref->shared_entry->refcount to 0.
>
> 3) pgstat_rc_1.stop removes the replication slot.
>
> 4) pgstat_rc_2.setup calls pgstat_create_replslot(), gets
> ReplicationSlotIndex(slot) = 0 and calls
> pgstat_get_entry_ref_locked(PGSTAT_KIND_REPLSLOT, InvalidOid, 0, 0),
> which leads to the call pgstat_reinit_entry(), which increases refcount
> for the same shared_entry as in (1) and (2), and then to the call
> pgstat_acquire_entry_ref(), which increases refcount once more.
>
> 5) the backend 2 reaches
> Assert(pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0),
> which fails due to refcount = 2.
Thanks for the details.
So this comes down to the point that we offer no guarantee that the
stats entry a backend sees at shutdown is the same as the one it wants
to clean up. That's the same problem as what Floris has reported
here, with an OID wraparound and tables to get the same hash key.
That can happen for all stats kinds:
https://www.postgresql.org/message-id/b14ae28029f64757bb64613be2549a74@Optiver.com
I don't think that this is going to fly far except if we introduce a
concept of "generation" or "age" in the stats entries. The idea is
simple: when a stats entry is reinitialized because of a drop&create,
increment a counter to tell that this is a new generation, and keep
track of it in *both* PgStat_EntryRef (local backend reference to the
shmem stats entry) *and* PgStatShared_HashEntry (the central one).
When releasing an entry, if we know that the shared entry we are
pointing at is not of the same generation as the local reference, it
means that the entry has been reused for something else with the same
hash key, so give up. It should not be that invasive, still it means
ABI breakage in the two pgstats internal structures I am mentioning,
which is OK for a backpatch as this stuff is internal. On top of
that, this problem means that we can silently and randomly lose stats,
which is not cool.
Note that Noah has been working on a better integration of injection
points with isolation tests. We could reuse that here to have a test
case, with an injection point waiting around the pg_usleep() you have
hardcoded:
https://www.postgresql.org/message-id/20240614003549.c2.nmisch@google.com
I'll try to give it a go on Monday.
--
Michael