Hello hackers,
20.03.2023 09:10, Peter Smith wrote:
>
> Using this I was also able to reproduce the problem. But test failures
> were rare. The make check-world seemed OK, and indeed the
> test_decoding tests would also appear to PASS around 14 out of 15
> times.
I've stumbled upon this assertion failure again while testing after commit cd312adc5.
This time I've simplified the reproducer to the attached modification.
With this patch applied, `make -s check -C contrib/test_decoding` fails on master as below:
ok 1 - pgstat_rc_1 14 ms
not ok 2 - pgstat_rc_2 1351 ms
contrib/test_decoding/output_iso/log/postmaster.log contains:
TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0"), File: "pgstat_shmem.c", Line: 562, PID: 1130928
With extra logging added, I see the following events happening:
1) pgstat_rc_1.setup calls pgstat_create_replslot(), gets
ReplicationSlotIndex(slot) = 0 and calls
pgstat_get_entry_ref_locked(PGSTAT_KIND_REPLSLOT, InvalidOid, 0, 0).
2) pgstat_rc_1.s0_get_changes executes pg_logical_slot_get_changes(...)
and then calls pgstat_gc_entry_refs on shmem_exit() ->
pgstat_shutdown_hook() ...;
with the sleep added inside pgstat_release_entry_ref, this backend waits
after decreasing entry_ref->shared_entry->refcount to 0.
3) pgstat_rc_1.stop removes the replication slot.
4) pgstat_rc_2.setup calls pgstat_create_replslot(), gets
ReplicationSlotIndex(slot) = 0 and calls
pgstat_get_entry_ref_locked(PGSTAT_KIND_REPLSLOT, InvalidOid, 0, 0),
which leads to a call to pgstat_reinit_entry(), which increases the refcount
for the same shared_entry as in (1) and (2), and then to a call to
pgstat_acquire_entry_ref(), which increases the refcount once more.
5) the backend from (2) wakes up and reaches
Assert(pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0),
which fails because refcount is now 2.
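
To make the interleaving in (2), (4) and (5) easier to follow, here is a small standalone sketch that replays it sequentially on a toy counter. This is not PostgreSQL code: ToyEntry, toy_release_begin(), toy_recreate() and toy_replay() are hypothetical names, C11 atomics stand in for pg_atomic_*, and the starting refcount of 1 is my assumption that only the exiting backend still holds a reference at that point.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for a shared stats entry (hypothetical, not PostgreSQL's struct). */
typedef struct ToyEntry
{
    atomic_uint refcount;
} ToyEntry;

/*
 * First half of the release path: drop our reference and report whether
 * we were the last holder (i.e. the previous value was 1).
 */
static bool
toy_release_begin(ToyEntry *e)
{
    return atomic_fetch_sub(&e->refcount, 1) == 1;
}

/*
 * What step (4) does to the same entry: one increment modelling
 * pgstat_reinit_entry(), one modelling pgstat_acquire_entry_ref().
 */
static void
toy_recreate(ToyEntry *e)
{
    atomic_fetch_add(&e->refcount, 1);
    atomic_fetch_add(&e->refcount, 1);
}

/*
 * Replay the interleaving in a single thread and return the refcount the
 * releasing backend sees when it reaches its "refcount == 0" check.
 */
static unsigned
toy_replay(void)
{
    ToyEntry    e;
    bool        last;

    /* Assumption: only the exiting backend still holds a reference. */
    atomic_init(&e.refcount, 1);

    last = toy_release_begin(&e);   /* step (2): 1 -> 0, then the backend sleeps */
    assert(last);                   /* it believes it is the last holder */

    toy_recreate(&e);               /* step (4): 0 -> 2 on the same entry */

    return atomic_load(&e.refcount);    /* step (5): value the Assert would check */
}
```

Calling toy_replay() returns 2, which corresponds to the value the failing Assert observes. The sketch only illustrates the ordering; it says nothing about where the fix should go.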
Best regards,
Alexander