Re: BF mamba failure - Mailing list pgsql-hackers

From Alexander Lakhin
Subject Re: BF mamba failure
Date
Msg-id 42227456-1132-4d4e-d6ef-e096668a9a4a@gmail.com
In response to Re: BF mamba failure  (Peter Smith <smithpb2250@gmail.com>)
Responses Re: BF mamba failure
List pgsql-hackers
Hello hackers,

20.03.2023 09:10, Peter Smith wrote:
>
> Using this I was also able to reproduce the problem. But test failures
> were rare. The make check-world seemed OK, and indeed the
> test_decoding tests would also appear to PASS around 14 out of 15
> times.

I've stumbled upon this assertion failure again while testing after cd312adc5.

This time I've simplified the reproducer to the attached modification.
With this patch applied, `make -s check -C contrib/test_decoding` fails on master as below:
ok 1         - pgstat_rc_1                                14 ms
not ok 2     - pgstat_rc_2                              1351 ms


contrib/test_decoding/output_iso/log/postmaster.log contains:
TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0"), File: "pgstat_shmem.c", Line: 562, PID: 1130928

With extra logging added, I see the following events happening:
1) pgstat_rc_1.setup calls pgstat_create_replslot(), gets
   ReplicationSlotIndex(slot) = 0 and calls
   pgstat_get_entry_ref_locked(PGSTAT_KIND_REPLSLOT, InvalidOid, 0, 0).

2) pgstat_rc_1.s0_get_changes executes pg_logical_slot_get_changes(...)
   and then calls pgstat_gc_entry_refs on shmem_exit() ->
   pgstat_shutdown_hook() ...;
   with the sleep added inside pgstat_release_entry_ref, this backend waits
   after decreasing entry_ref->shared_entry->refcount to 0.

3) pgstat_rc_1.stop removes the replication slot.

4) pgstat_rc_2.setup calls pgstat_create_replslot(), gets
   ReplicationSlotIndex(slot) = 0 and calls
   pgstat_get_entry_ref_locked(PGSTAT_KIND_REPLSLOT, InvalidOid, 0, 0),
   which leads to the call pgstat_reinit_entry(), which increases refcount
   for the same shared_entry as in (1) and (2), and then to the call
   pgstat_acquire_entry_ref(), which increases refcount once more.

5) the backend from (2) wakes up and reaches
   Assert(pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0),
   which fails because refcount is now 2.
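
To make the interleaving easier to follow, here is a minimal standalone sketch that replays steps (2), (4) and (5) in a single thread using C11 atomics. It is not PostgreSQL code: SharedEntry, replay_interleaving() and release_assert_would_hold() are hypothetical names made up for illustration, and the starting refcount of 1 is an assumption (the releasing backend is taken to hold the last reference).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared stats entry; only the refcount is modeled. */
typedef struct SharedEntry
{
    atomic_uint refcount;
} SharedEntry;

/*
 * Replays steps (2), (4) and (5) sequentially, in the order described above.
 * Returns the refcount that the releasing backend observes when it finally
 * runs its check.
 */
static unsigned
replay_interleaving(void)
{
    SharedEntry entry;

    /* Assumption: the releasing backend holds the last reference. */
    atomic_init(&entry.refcount, 1);

    /* (2) release: decrement to 0; the injected sleep would start here. */
    unsigned before = atomic_fetch_sub(&entry.refcount, 1);
    assert(before == 1);

    /* (4) the new slot reuses index 0: re-initialization takes one reference ... */
    atomic_fetch_add(&entry.refcount, 1);
    /* ... and acquiring the entry ref takes another. */
    atomic_fetch_add(&entry.refcount, 1);

    /* (5) the sleeping backend wakes up and reads the counter. */
    return atomic_load(&entry.refcount);
}

/* True only if the "refcount == 0" check would still pass after the replay. */
static bool
release_assert_would_hold(void)
{
    return replay_interleaving() == 0;
}
```

In this model the check in (5) sees 2 rather than 0, which matches the failure above: the decrement and the later check are not one atomic step, so a reuse of the same entry can slip in between them.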

Best regards,
Alexander
Attachment
