processes stuck in shutdown following OOM/recovery - Mailing list pgsql-hackers

From Justin Pryzby
Subject processes stuck in shutdown following OOM/recovery
Date
Msg-id ZWlrdQarrZvLsgIk@pryzbyj2023
List pgsql-hackers
If postgres starts, one of its children is immediately killed, and the
cluster is then told to stop, then instead of shutting down, the whole
system gets wedged.

$ initdb -D ./pgdev.dat1
$ pg_ctl -D ./pgdev.dat1 start -o '-c port=5678'
$ kill -9 2524495; sleep 0.05; pg_ctl -D ./pgdev.dat1 stop -m fast # 2524495 is a child's pid
.......................................................... failed
pg_ctl: server does not shut down
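
The steps above can be sketched as a script so the child's pid does not have
to be looked up by hand. This is only a rough sketch under assumptions: the
function names (postmaster_pid, first_child_pid, repro) are made up for
illustration, initdb, pg_ctl and pgrep are assumed to be on PATH, and the
lowest-numbered child is killed only because the transcript above killed
postmaster's pid + 1. Which child is hit may matter for reproducing the hang.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names come from PostgreSQL itself.

# Print the postmaster's PID, which is the first line of postmaster.pid
# in the data directory passed as $1.
postmaster_pid() {
    head -n 1 "$1/postmaster.pid"
}

# Print the lowest-numbered child PID of the parent PID passed as $1.
first_child_pid() {
    pgrep -P "$1" | sort -n | head -n 1
}

# Run the reproduction against a fresh data directory ($1), using the
# port given as $2 (default 5678, as in the transcript above).
repro() {
    datadir=$1
    port=${2:-5678}

    initdb -D "$datadir" >/dev/null || return 1
    pg_ctl -D "$datadir" start -o "-c port=$port" || return 1

    parent=$(postmaster_pid "$datadir")
    child=$(first_child_pid "$parent")

    # Kill one child, then ask for a fast shutdown almost immediately.
    # Fractional sleep is not POSIX but works with GNU coreutils.
    kill -9 "$child"
    sleep 0.05
    pg_ctl -D "$datadir" stop -m fast
}
```

After sourcing the file, `repro ./pgdev.dat1` runs the whole sequence. On an
affected build the final pg_ctl is expected to time out as shown above; the
timing is racy, so it may take more than one attempt.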

$ ps -wwwf --ppid 2524494
UID          PID    PPID  C STIME TTY          TIME CMD
pryzbyj  2524552 2524494  0 20:47 ?        00:00:00 postgres: checkpointer 

(gdb) bt
#0  0x00007f0ce2d08c03 in epoll_wait (epfd=10, events=0x55cb4cbaac28, maxevents=1, timeout=timeout@entry=156481) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x000055cb4c219208 in WaitEventSetWaitBlock (set=set@entry=0x55cb4cbaabc0, cur_timeout=cur_timeout@entry=156481, occurred_events=occurred_events@entry=0x7ffd80130410, nevents=nevents@entry=1) at ../src/backend/storage/ipc/latch.c:1583
#2  0x000055cb4c219e02 in WaitEventSetWait (set=0x55cb4cbaabc0, timeout=timeout@entry=300000, occurred_events=occurred_events@entry=0x7ffd80130410, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=83886084) at ../src/backend/storage/ipc/latch.c:1529
#3  0x000055cb4c219f87 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=300000, wait_event_info=wait_event_info@entry=83886084) at ../src/backend/storage/ipc/latch.c:539
#4  0x000055cb4c1aabc2 in CheckpointerMain () at ../src/backend/postmaster/checkpointer.c:523
#5  0x000055cb4c1a8207 in AuxiliaryProcessMain (auxtype=auxtype@entry=CheckpointerProcess) at ../src/backend/postmaster/auxprocess.c:153
#6  0x000055cb4c1ae63d in StartChildProcess (type=type@entry=CheckpointerProcess) at ../src/backend/postmaster/postmaster.c:5331
#7  0x000055cb4c1b07f3 in ServerLoop () at ../src/backend/postmaster/postmaster.c:1792
#8  0x000055cb4c1b1c56 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x55cb4cbaa380) at ../src/backend/postmaster/postmaster.c:1466
#9  0x000055cb4c0f4c1b in main (argc=5, argv=0x55cb4cbaa380) at ../src/backend/main/main.c:198

I noticed this because of the counterproductive behavior of the systemd+PGDG
unit files, which run "pg_ctl stop" whenever a backend is killed for OOM:
https://www.postgresql.org/message-id/ZVI112aVNCHOQgfF@pryzbyj2023

This affects v15, and fails at 7ff23c6d27 but not its parent.

commit 7ff23c6d277d1d90478a51f0dd81414d343f3850 (HEAD)
Author: Thomas Munro <tmunro@postgresql.org>
Date:   Mon Aug 2 17:32:20 2021 +1200

    Run checkpointer and bgwriter in crash recovery.

-- 
Justin


