BUG #14945: postmaster deadlock while logging after syslogger exited - Mailing list pgsql-bugs
| From | dap@joyent.com |
|---|---|
| Subject | BUG #14945: postmaster deadlock while logging after syslogger exited |
| Date | |
| Msg-id | 20171204201055.27108.18283@wrigleys.postgresql.org |
| List | pgsql-bugs |
The following bug has been logged on the website:

Bug reference: 14945
Logged by: David Pacheco
Email address: dap@joyent.com
PostgreSQL version: Unsupported/Unknown
Operating system: SmartOS joyent_20160830T004914Z

Description:

I've run into what appears to be a deadlock in the logging subsystem. What I know for sure is:

- New attempts to establish connections to the database block indefinitely. Some queries on existing connections appear to work.
- The postmaster process is currently blocked in its SIGUSR1 handler. It's blocked trying to write to its error log via stderr (a pipe). The message that it's writing indicates that it failed to start an autovacuum worker because there was insufficient memory available.
- This pipe is full, containing a very long SQL string.
- The only open fd on the whole system for the other side of the pipe is inside the postmaster process itself. (Since the postmaster is blocked, nothing will ever read from the pipe.)
- The syslogger process (which normally has a reference to the other side of the pipe) has exited and has not yet been reaped.
- The postmaster process (which normally would reap the dead syslogger process) has SIGCLD both pending and masked.

We're deadlocked, and there's no indication that anything could allow the database to resume accepting connections without operator intervention. Also relevant: there are a number of errors in the PostgreSQL log around the time when the syslogger process exited that indicate out-of-memory conditions. Details below.

I ran into this on PostgreSQL 9.2.4 on SmartOS joyent_20160830T004914Z. Based on my reading of the code, I expect this is still an issue on versions 9.6 and 10 as well, and I don't believe there's anything platform-specific about this behavior (except maybe the underlying out-of-memory condition, but that can happen on any system).

There was some discussion on this issue on pgsql-general:

https://www.postgresql.org/message-id/CACukRjOaR2_DY2tLzPe-ZJUActf8RcZ%3Dr1HimQxFgYBy%2B%2BaR8Q%40mail.gmail.com

As discussed there, I think the out-of-memory condition was the catalyst for this, and that's likely just operator error (i.e., having overprovisioned the container in which I was running PostgreSQL). However, I don't think it's necessary to run out of memory in order to hit this deadlock. I think all that's necessary are high log volume, a fatal syslogger failure, a postmaster attempt to log anything, and unfortunate timing. I understand that arbitrary programmer errors and out-of-memory errors are not necessarily recoverable, but this deadlock is considerably worse (at least in my use case) than if the database itself had failed fatally.

The rest of this email is the raw data. Let me know if there's anything else I can provide!

------------------------

Initial observation: when I ran "psql" to connect to the database, it would hang (for hours, it turned out).
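Aside, before digging into the process state: to make the deadlock pattern described above concrete, here is a minimal, standalone C sketch (illustrative only, not PostgreSQL source; the program structure is my own). It reproduces the same shape of failure: the writer itself holds the only surviving read end of a pipe, the process that should drain it has exited without being reaped, and SIGCHLD is blocked, so a blocking write() hangs forever once the pipe buffer fills.

/*
 * Illustrative sketch only (not PostgreSQL source): the parent keeps the
 * only surviving read end of the pipe open, the child that should drain
 * the pipe exits immediately, and SIGCHLD is blocked, so the child is
 * never reaped and a blocking write() hangs once the pipe buffer fills.
 */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int
main(void)
{
	int		pfd[2];
	sigset_t	block;
	char		msg[1024];
	pid_t		reader;

	if (pipe(pfd) != 0) {
		perror("pipe");
		return (1);
	}

	reader = fork();
	if (reader == 0) {
		/* plays the syslogger: should drain pfd[0], but dies at once */
		_exit(1);
	}

	/*
	 * Block SIGCHLD, as the postmaster's sigusr1_handler() does when it
	 * starts, so the dead reader stays unreaped and cannot be replaced.
	 */
	sigemptyset(&block);
	sigaddset(&block, SIGCHLD);
	sigprocmask(SIG_BLOCK, &block, NULL);

	/*
	 * The parent keeps pfd[0] open itself, so write() never fails with
	 * EPIPE; once the pipe's buffer is full it simply blocks forever.
	 */
	memset(msg, 'x', sizeof (msg));
	for (;;) {
		if (write(pfd[1], msg, sizeof (msg)) < 0) {
			perror("write");
			return (1);
		}
	}
}

In the report above, the writer is the postmaster (which itself holds the only remaining read-side fd), the dead reader is the syslogger, and the SIGCHLD blocking happens in sigusr1_handler.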
Stack of postmaster process:

$ mdb core.41349
Loading modules: [ libumem.so.1 libc.so.1 ld.so.1 ]
> $C
fffffd7fffdf9df0 libc.so.1`__write+0xa()
fffffd7fffdfb220 write_pipe_chunks+0x142()
fffffd7fffdfb280 send_message_to_server_log+0x55b()
fffffd7fffdfb2a0 EmitErrorReport+0xe6()
fffffd7fffdfb390 errfinish+0x1ac()
fffffd7fffdfb3b0 StartAutoVacWorker+0x5c()
fffffd7fffdfb3e0 StartAutovacuumWorker+0x6f()
fffffd7fffdfb410 sigusr1_handler+0x185()
fffffd7fffdfb420 libc.so.1`__sighndlr+6()
fffffd7fffdfb4b0 libc.so.1`call_user_handler+0x1db(10, 0, fffffd7fffdfb520)
fffffd7fffdfb500 libc.so.1`sigacthandler+0x116(10, 0, fffffd7fffdfb520)
fffffd7fffdfb970 libc.so.1`__pollsys+0xa()
fffffd7fffdfba90 libc.so.1`pselect+0x1cb(5, fffffd7fffdfbaf0, 0, 0, fffffd7fffdfbaa0, 0)
fffffd7fffdfbae0 libc.so.1`select+0x5a(5, fffffd7fffdfbaf0, 0, 0, fffffd7fffdfdaf0)
fffffd7fffdffb40 ServerLoop+0xb5()
fffffd7fffdffbe0 PostmasterMain+0xec2()
fffffd7fffdffc00 main+0x23a()
fffffd7fffdffc10 _start+0x6c()

From the registers, we can get the arguments to the write syscall, which indicate the file descriptor and buffer being written:

> <rdi=p
                2
> <rsi=p
                0xfffffd7fffdf9e10
> <rdx=E
                101
> 0xfffffd7fffdf9e10,0t101/c
0xfffffd7fffdf9e10: \??t2017-10-27 23:47:27.937 UTCLOG: could not fork autovacuum worker process: Not enough space

It's writing to its own stderr (a pipe) a message about being out of memory. I've left out the kernel state about the pipe being full and what other processes had it open because it's rather long and irrelevant to the PostgreSQL side, but I can provide that output if it's useful.

The postmaster has SIGCLD pending and masked:

$ pflags 41349
41349:  /opt/postgresql/9.2.4/bin/postgres -D /manatee/pg/data
        data model = _LP64  flags = ORPHAN|MSACCT|MSFORK
        sigpend = 0x00028000,0x00000000,0x00000000
 /1:    flags = ASLEEP  write(0x2,0xfffffd7fffdf9e10,0x65)
        sigmask = 0xfebff047,0xfffffff7,0x000003ff

The signal mask may be system-specific. I've decoded it and confirmed that SIGCHLD is in "sigpend" as well as "sigmask" (see the decoding sketch at the end of this message). And that makes sense, because the stack above indicates we're in "sigusr1_handler", which masks SIGCHLD when it starts.

I found the last syslogger's pid from the core file I saved of the postmaster:

> SysLoggerPID/E
postgres`SysLoggerPID:
postgres`SysLoggerPID:          41351

Here's the status of that process:

[root@529d1cc4 (postgres) ~]$ ps -opid,args -p 41351
  PID COMMAND
41351 <defunct>

That output indicates that the process exited but hasn't been reaped yet. Other kernel state indicates that the process exited around 2017-10-27T23:46:21.

Here are all of the entries in the PostgreSQL log from 23:19:12 until the top of the next hour:

https://gist.githubusercontent.com/davepacheco/c5541bb464532075f2da761dd990a457/raw/2ba242055aca2fb374e9118045a830d08c590e0a/gistfile1.txt

There are certainly a lot of indications of low-memory conditions, although no smoking gun about why the syslogger exited. That's much less worrisome to me than the resulting deadlock.
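For completeness, the sigpend/sigmask decoding referenced above can be reproduced mechanically. The following is a small sketch, assuming the illumos/Solaris signal numbering on this system (SIGUSR1 = 16, SIGCHLD = 18) and the convention that signal N corresponds to bit N-1 of the first 32-bit word that pflags prints; the helper name is made up for illustration.

#include <stdio.h>

/* 1 if signal "signo" is set in the 32-bit word, assuming bit N-1 <-> signal N */
static int
has_signal(unsigned int word, int signo)
{
	return ((word >> (signo - 1)) & 1);
}

int
main(void)
{
	/* first words of sigpend and sigmask from the pflags output above */
	unsigned int sigpend = 0x00028000;
	unsigned int sigmask = 0xfebff047;

	printf("SIGUSR1 (16): pending=%d masked=%d\n",
	    has_signal(sigpend, 16), has_signal(sigmask, 16));
	printf("SIGCHLD (18): pending=%d masked=%d\n",
	    has_signal(sigpend, 18), has_signal(sigmask, 18));
	return (0);
}

With those inputs it reports SIGCHLD as both pending and masked, matching the hand decoding above (SIGUSR1 shows up the same way, which is unsurprising while the process is inside sigusr1_handler).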