Re: Segfault while creating logical replication slots on active DB 14.6-1 + 15.1-1 - Mailing list pgsql-bugs

From Alex Richman
Subject Re: Segfault while creating logical replication slots on active DB 14.6-1 + 15.1-1
Date
Msg-id CAMnUB3oJ5rjxxrtB6vc15RdyqDeVtsfMn1aEehzWATw56VNtow@mail.gmail.com
Whole thread Raw
In response to Re: Segfault while creating logical replication slots on active DB 14.6-1 + 15.1-1  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Segfault while creating logical replication slots on active DB 14.6-1 + 15.1-1  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-bugs
Apologies for the delay (and happy christmas/new years).

Please find included a full backtrace[1] of a sample of this crash, replicated on postgres 15.1-1 in the same environment described in my original email.  Included as a gist due to the length but lmk if it should be pasted in full for posterity.  I've also added the python script[2] used to replicate, if that's relevant.

Unfortunately we have not been able to reproduce this in a clean room environment, however we can note a few additional things:
- This has occurred over multiple distinct servers with different data sets, though similar write loads.  Suggesting it's not a specific server with data corruption.
- Disabling pg_repack, autovacuum, automatic reindexing, has no effect, the bug can still occur
- Running the same script on a read-only logical replica does not hit the bug
- As above, if the server is idle (no write traffic), then it does not hit the bug
- The bug occurs roughly 1 in every 10 executions of the create replication slot, so is not 100% consistent.
- We're fairly confident that this did not occur pre 14.5-1, and started occurring in 14.6-1 & 15.1-1.
So we would assume that there is some concurrent write traffic from our web tier that sometimes causes a segfault in the logical replication slot creation.

Please let me know if you need any more information.

Thanks,
- Alex.


On Thu, 15 Dec 2022 at 03:06, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Dec 15, 2022 at 4:35 AM Alex Richman <alexrichman@onesignal.com> wrote:
>
> We've noticed a segfault bug while creating logical replication slots.  We've seen this on 14.6-1 and 15.1-1 (both installed from postgres apt-archive repos to debian 11 hosts).  We're confident that this behaviour is not present in 14.5-1.
>
> We can only reproduce this if the server is under load (reads/writes from clients, perhaps also pg_repacks, reindexes etc).  If the server is idle then we cannot reproduce it.
>
> Dmesg logs:
> 211935.284834] postgres[2297089]: segfault at 10 ip 000055907a33dd97 sp 00007fff6ebef068 error 4 in postgres[559079e4d000+51a000]
>
> Postgres logs:
> 2022-12-14 22:33:58 UTC 2297089STATEMENT:  SELECT pg_create_logical_replication_slot('replica_b7fa97cc_28', 'pgoutput', false);
> 2022-12-14 22:33:58 UTC 778310LOG:  server process (PID 2297089) was terminated by signal 11: Segmentation fault
> 2022-12-14 22:33:58 UTC 778310DETAIL:  Failed process was running: SELECT pg_create_logical_replication_slot('replica_b7fa97cc_28', 'pgoutput', false);
> 2022-12-14 22:33:58 UTC 778310LOG:  terminating any other active server processes
>
> Journal logs:
> Dec 14 22:33:58 c7a2da82-x60-postgres-persistence-onesignal kernel: postgres[2297089]: segfault at 10 ip 000055907a33dd97 sp 00007fff6ebef068 error 4 in postgres[559079e4d000+51a000]
> Dec 14 22:33:58 c7a2da82-x60-postgres-persistence-onesignal kernel: Code: 48 83 c4 08 49 89 c0 5b 41 5c 4c 89 c0 41 5d 5d c3 66 90 83 e3 02 75 b8 e9 71 18 b9 ff e9 d3 18 b9 ff 90 48 89 fe 48 8b 7f f8 <48> 8b 47 >
>
> Let me know if you need any more info.
>

It is difficult to diagnose this without more information because
there is no clear call stack or a reproducible scenario. Will it be
possible for you to get a reproducible test case or at least call
stack to proceed?

--
With Regards,
Amit Kapila.

pgsql-bugs by date:

Previous
From: Amit Kapila
Date:
Subject: Re: BUG #17716: walsender process hang while decoding 'DROP PUBLICATION' XLOG
Next
From: PG Bug reporting form
Date:
Subject: BUG #17736: when psql -c is used, the $ sign is escaped