Hi,
On Tue, Jan 3, 2023 at 9:57 PM Alex Richman <alexrichman@onesignal.com> wrote:
>
> Apologies for the delay (and happy christmas/new years).
>
> Please find included a full backtrace[1] of a sample of this crash, replicated on postgres 15.1-1 in the same
> environment described in my original email. Included as a gist due to the length but lmk if it should be pasted in full
> for posterity. I've also added the python script[2] used to replicate, if that's relevant.
>
> Unfortunately we have not been able to reproduce this in a clean room environment, however we can note a few
> additional things:
> - This has occurred over multiple distinct servers with different data sets, though similar write loads. Suggesting
> it's not a specific server with data corruption.
> - Disabling pg_repack, autovacuum, automatic reindexing, has no effect, the bug can still occur
> - Running the same script on a read-only logical replica does not hit the bug
> - As above, if the server is idle (no write traffic), then it does not hit the bug
> - The bug occurs roughly 1 in every 10 executions of the create replication slot, so is not 100% consistent.
> - We're fairly confident that this did not occur pre 14.5-1, and started occurring in 14.6-1 & 15.1-1.
> So we would assume that there is some concurrent write traffic from our web tier that sometimes causes a segfault in
> the logical replication slot creation.
>
> Please let me know if you need any more information.

Thank you for providing more information.

One possibility is that you encountered the bug in snapbuild.c that is
already fixed by commit 898ef41bf6f4 and will be included in 14.7 and
15.2. I've attached patches of this fix for PG14 and PG15. Could you
please try the same scenario again with these patches and see if the
issue happens?

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com