Thread: Segfault while creating logical replication slots on active DB 14.6-1 + 15.1-1

Segfault while creating logical replication slots on active DB 14.6-1 + 15.1-1

From
Alex Richman
Date:
Hi,

We've noticed a segfault bug while creating logical replication slots.  We've seen this on 14.6-1 and 15.1-1 (both installed from postgres apt-archive repos to debian 11 hosts).  We're confident that this behaviour is not present in 14.5-1.

We can only reproduce this if the server is under load (reads/writes from clients, perhaps also pg_repacks, reindexes etc).  If the server is idle then we cannot reproduce it.

Dmesg logs:
211935.284834] postgres[2297089]: segfault at 10 ip 000055907a33dd97 sp 00007fff6ebef068 error 4 in postgres[559079e4d000+51a000]

Postgres logs:
2022-12-14 22:33:58 UTC 2297089STATEMENT:  SELECT pg_create_logical_replication_slot('replica_b7fa97cc_28', 'pgoutput', false);
2022-12-14 22:33:58 UTC 778310LOG:  server process (PID 2297089) was terminated by signal 11: Segmentation fault
2022-12-14 22:33:58 UTC 778310DETAIL:  Failed process was running: SELECT pg_create_logical_replication_slot('replica_b7fa97cc_28', 'pgoutput', false);
2022-12-14 22:33:58 UTC 778310LOG:  terminating any other active server processes

Journal logs:
Dec 14 22:33:58 c7a2da82-x60-postgres-persistence-onesignal kernel: postgres[2297089]: segfault at 10 ip 000055907a33dd97 sp 00007fff6ebef068 error 4 in postgres[559079e4d000+51a000]
Dec 14 22:33:58 c7a2da82-x60-postgres-persistence-onesignal kernel: Code: 48 83 c4 08 49 89 c0 5b 41 5c 4c 89 c0 41 5d 5d c3 66 90 83 e3 02 75 b8 e9 71 18 b9 ff e9 d3 18 b9 ff 90 48 89 fe 48 8b 7f f8 <48> 8b 47 >

Let me know if you need any more info.

Thanks,
- Alex.
On Thu, Dec 15, 2022 at 4:35 AM Alex Richman <alexrichman@onesignal.com> wrote:
>
> We've noticed a segfault bug while creating logical replication slots.  We've seen this on 14.6-1 and 15.1-1 (both
installedfrom postgres apt-archive repos to debian 11 hosts).  We're confident that this behaviour is not present in
14.5-1.
>
> We can only reproduce this if the server is under load (reads/writes from clients, perhaps also pg_repacks, reindexes
etc). If the server is idle then we cannot reproduce it.
 
>
> Dmesg logs:
> 211935.284834] postgres[2297089]: segfault at 10 ip 000055907a33dd97 sp 00007fff6ebef068 error 4 in
postgres[559079e4d000+51a000]
>
> Postgres logs:
> 2022-12-14 22:33:58 UTC 2297089STATEMENT:  SELECT pg_create_logical_replication_slot('replica_b7fa97cc_28',
'pgoutput',false);
 
> 2022-12-14 22:33:58 UTC 778310LOG:  server process (PID 2297089) was terminated by signal 11: Segmentation fault
> 2022-12-14 22:33:58 UTC 778310DETAIL:  Failed process was running: SELECT
pg_create_logical_replication_slot('replica_b7fa97cc_28','pgoutput', false);
 
> 2022-12-14 22:33:58 UTC 778310LOG:  terminating any other active server processes
>
> Journal logs:
> Dec 14 22:33:58 c7a2da82-x60-postgres-persistence-onesignal kernel: postgres[2297089]: segfault at 10 ip
000055907a33dd97sp 00007fff6ebef068 error 4 in postgres[559079e4d000+51a000]
 
> Dec 14 22:33:58 c7a2da82-x60-postgres-persistence-onesignal kernel: Code: 48 83 c4 08 49 89 c0 5b 41 5c 4c 89 c0 41
5d5d c3 66 90 83 e3 02 75 b8 e9 71 18 b9 ff e9 d3 18 b9 ff 90 48 89 fe 48 8b 7f f8 <48> 8b 47 >
 
>
> Let me know if you need any more info.
>

It is difficult to diagnose this without more information because
there is no clear call stack or a reproducible scenario. Will it be
possible for you to get a reproducible test case or at least call
stack to proceed?

-- 
With Regards,
Amit Kapila.



Apologies for the delay (and happy christmas/new years).

Please find included a full backtrace[1] of a sample of this crash, replicated on postgres 15.1-1 in the same environment described in my original email.  Included as a gist due to the length but lmk if it should be pasted in full for posterity.  I've also added the python script[2] used to replicate, if that's relevant.

Unfortunately we have not been able to reproduce this in a clean room environment, however we can note a few additional things:
- This has occurred over multiple distinct servers with different data sets, though similar write loads.  Suggesting it's not a specific server with data corruption.
- Disabling pg_repack, autovacuum, automatic reindexing, has no effect, the bug can still occur
- Running the same script on a read-only logical replica does not hit the bug
- As above, if the server is idle (no write traffic), then it does not hit the bug
- The bug occurs roughly 1 in every 10 executions of the create replication slot, so is not 100% consistent.
- We're fairly confident that this did not occur pre 14.5-1, and started occurring in 14.6-1 & 15.1-1.
So we would assume that there is some concurrent write traffic from our web tier that sometimes causes a segfault in the logical replication slot creation.

Please let me know if you need any more information.

Thanks,
- Alex.


On Thu, 15 Dec 2022 at 03:06, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Dec 15, 2022 at 4:35 AM Alex Richman <alexrichman@onesignal.com> wrote:
>
> We've noticed a segfault bug while creating logical replication slots.  We've seen this on 14.6-1 and 15.1-1 (both installed from postgres apt-archive repos to debian 11 hosts).  We're confident that this behaviour is not present in 14.5-1.
>
> We can only reproduce this if the server is under load (reads/writes from clients, perhaps also pg_repacks, reindexes etc).  If the server is idle then we cannot reproduce it.
>
> Dmesg logs:
> 211935.284834] postgres[2297089]: segfault at 10 ip 000055907a33dd97 sp 00007fff6ebef068 error 4 in postgres[559079e4d000+51a000]
>
> Postgres logs:
> 2022-12-14 22:33:58 UTC 2297089STATEMENT:  SELECT pg_create_logical_replication_slot('replica_b7fa97cc_28', 'pgoutput', false);
> 2022-12-14 22:33:58 UTC 778310LOG:  server process (PID 2297089) was terminated by signal 11: Segmentation fault
> 2022-12-14 22:33:58 UTC 778310DETAIL:  Failed process was running: SELECT pg_create_logical_replication_slot('replica_b7fa97cc_28', 'pgoutput', false);
> 2022-12-14 22:33:58 UTC 778310LOG:  terminating any other active server processes
>
> Journal logs:
> Dec 14 22:33:58 c7a2da82-x60-postgres-persistence-onesignal kernel: postgres[2297089]: segfault at 10 ip 000055907a33dd97 sp 00007fff6ebef068 error 4 in postgres[559079e4d000+51a000]
> Dec 14 22:33:58 c7a2da82-x60-postgres-persistence-onesignal kernel: Code: 48 83 c4 08 49 89 c0 5b 41 5c 4c 89 c0 41 5d 5d c3 66 90 83 e3 02 75 b8 e9 71 18 b9 ff e9 d3 18 b9 ff 90 48 89 fe 48 8b 7f f8 <48> 8b 47 >
>
> Let me know if you need any more info.
>

It is difficult to diagnose this without more information because
there is no clear call stack or a reproducible scenario. Will it be
possible for you to get a reproducible test case or at least call
stack to proceed?

--
With Regards,
Amit Kapila.

Re: Segfault while creating logical replication slots on active DB 14.6-1 + 15.1-1

From
Masahiko Sawada
Date:
Hi,

On Tue, Jan 3, 2023 at 9:57 PM Alex Richman <alexrichman@onesignal.com> wrote:
>
> Apologies for the delay (and happy christmas/new years).
>
> Please find included a full backtrace[1] of a sample of this crash, replicated on postgres 15.1-1 in the same
environmentdescribed in my original email.  Included as a gist due to the length but lmk if it should be pasted in full
forposterity.  I've also added the python script[2] used to replicate, if that's relevant. 
>
> Unfortunately we have not been able to reproduce this in a clean room environment, however we can note a few
additionalthings: 
> - This has occurred over multiple distinct servers with different data sets, though similar write loads.  Suggesting
it'snot a specific server with data corruption. 
> - Disabling pg_repack, autovacuum, automatic reindexing, has no effect, the bug can still occur
> - Running the same script on a read-only logical replica does not hit the bug
> - As above, if the server is idle (no write traffic), then it does not hit the bug
> - The bug occurs roughly 1 in every 10 executions of the create replication slot, so is not 100% consistent.
> - We're fairly confident that this did not occur pre 14.5-1, and started occurring in 14.6-1 & 15.1-1.
> So we would assume that there is some concurrent write traffic from our web tier that sometimes causes a segfault in
thelogical replication slot creation. 
>
> Please let me know if you need any more information.

Thank you for providing more information.

One possibility is that you encountered the bug in snapbuild.c that is
already fixed by commit 898ef41bf6f4 and will be included in 14.7 and
15.2. I've attached patches of this fix for PG14 and PG15. Could you
please try the same scenario again with these patches and see if the
issue happens?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment
 I've attached patches of this fix for PG14 and PG15. Could you
please try the same scenario again with these patches and see if the
issue happens?

 
I applied your pg15 patch to the 15.1 source and it seems to be fixed, thanks!

Look forward to the 15.2 release so we can deploy this more broadly.

Thanks,
- Alex.

Re: Segfault while creating logical replication slots on active DB 14.6-1 + 15.1-1

From
Masahiko Sawada
Date:
On Sat, Jan 7, 2023 at 12:20 AM Alex Richman <alexrichman@onesignal.com> wrote:
>>
>>  I've attached patches of this fix for PG14 and PG15. Could you
>> please try the same scenario again with these patches and see if the
>> issue happens?
>
>
>
> I applied your pg15 patch to the 15.1 source and it seems to be fixed, thanks!
>
> Look forward to the 15.2 release so we can deploy this more broadly.

Thank you for your confirmation!

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com