Thread: checkpointer and other server processes crashing
Hello, We've been experiencing PG server process crashes about every other week on a mostly read only website (except for a single insert/update on page access). Typical log entries look like LOG: checkpointer process (PID 11200) was terminated by signal 9: Killed LOG: terminating any other active server processes Other than the checkpointer, the server process that was terminated was either doing a "BEGIN READ WRITE", a "COMMIT" or executing a specific SELECT. The database is always recovered within a second and everything else appears to resume normally. We're not certain about what triggers this, but in several instances the web logs show an external bot issuing multiple HEAD requests on what is logically a single page. The web server logs show "broken pipe" and EOF errors, and PG logs sometimes shows a number of "incomplete startup packet" messages before the termination message. This started roughly when the site was migrated to Go, whose web "processes" run as "goroutines", scheduled by Go's runtime (previously the site used Python and Gunicorn to serve the pages, which probably isolated the PG processes from a barrage of nearly simultaneous requests). As I understand it, the PG server processes doing a SELECT are spawned as children of the Go process, so presumably if a "goroutine" dies, the associated PG process would die too, but I'm not sure I grasp why that would cause a recovery/restart. I also don't understand where the checkpointer process fits in the picture (and what would cause it to die). For the record, this is on PG 11.9 running on Debian. TIA, Joe
On 2/15/21 1:15 PM, Joe Abbate wrote: > Hello, > > We've been experiencing PG server process crashes about every other week > on a mostly read only website (except for a single insert/update on page > access). Typical log entries look like > > LOG: checkpointer process (PID 11200) was terminated by signal 9: Killed > LOG: terminating any other active server processes Have you looked at the system logs to see if the OOM killer is involved? > > Other than the checkpointer, the server process that was terminated was > either doing a "BEGIN READ WRITE", a "COMMIT" or executing a specific > SELECT. > > The database is always recovered within a second and everything else > appears to resume normally. We're not certain about what triggers this, > but in several instances the web logs show an external bot issuing > multiple HEAD requests on what is logically a single page. The web > server logs show "broken pipe" and EOF errors, and PG logs sometimes > shows a number of "incomplete startup packet" messages before the > termination message. > > This started roughly when the site was migrated to Go, whose web > "processes" run as "goroutines", scheduled by Go's runtime (previously > the site used Python and Gunicorn to serve the pages, which probably > isolated the PG processes from a barrage of nearly simultaneous requests). > > As I understand it, the PG server processes doing a SELECT are spawned > as children of the Go process, so presumably if a "goroutine" dies, the > associated PG process would die too, but I'm not sure I grasp why that > would cause a recovery/restart. I also don't understand where the > checkpointer process fits in the picture (and what would cause it to die). > > For the record, this is on PG 11.9 running on Debian. > > TIA, > > Joe > > -- Adrian Klaver adrian.klaver@aklaver.com
Joe Abbate <jma@freedomcircle.com> writes: > Hello, > > We've been experiencing PG server process crashes about every other week > on a mostly read only website (except for a single insert/update on page > access). Typical log entries look like > > LOG: checkpointer process (PID 11200) was terminated by signal 9: Killed > LOG: terminating any other active server processes > > Other than the checkpointer, the server process that was terminated was > either doing a "BEGIN READ WRITE", a "COMMIT" or executing a specific > SELECT. > > The database is always recovered within a second and everything else > appears to resume normally. We're not certain about what triggers this, > but in several instances the web logs show an external bot issuing > multiple HEAD requests on what is logically a single page. The web > server logs show "broken pipe" and EOF errors, and PG logs sometimes > shows a number of "incomplete startup packet" messages before the > termination message. > > This started roughly when the site was migrated to Go, whose web > "processes" run as "goroutines", scheduled by Go's runtime (previously > the site used Python and Gunicorn to serve the pages, which probably > isolated the PG processes from a barrage of nearly simultaneous requests). > > As I understand it, the PG server processes doing a SELECT are spawned > as children of the Go process, so presumably if a "goroutine" dies, the > associated PG process would die too, but I'm not sure I grasp why that > would cause a recovery/restart. I also don't understand where the > checkpointer process fits in the picture (and what would cause it to die). > A signal 9 typically means something is explicitly killing processes. I would check your system logs in case something is killing processes due to running out of some resource (like memory). If it is a fairly recent Debian system, journalctl might be useful for checking. -- Tim Cross
On 15/2/21 16:29, Adrian Klaver wrote: > On 2/15/21 1:15 PM, Joe Abbate wrote: >> We've been experiencing PG server process crashes about every other >> week on a mostly read only website (except for a single insert/update >> on page access). Typical log entries look like >> >> LOG: checkpointer process (PID 11200) was terminated by signal 9: Killed >> LOG: terminating any other active server processes > > Have you looked at the system logs to see if the OOM killer is involved? No, I hadn't, but now I see that of the past three crashes, four postgres processes were OOM victims. I guess I have to read the OOM articles I've been skipping (read one a long time ago). If you have any specific suggestions, let me know. Thanks Adrian, Joe
On 2/15/21 1:50 PM, Joe Abbate wrote: > On 15/2/21 16:29, Adrian Klaver wrote: >> On 2/15/21 1:15 PM, Joe Abbate wrote: >>> We've been experiencing PG server process crashes about every other >>> week on a mostly read only website (except for a single insert/update >>> on page access). Typical log entries look like >>> >>> LOG: checkpointer process (PID 11200) was terminated by signal 9: >>> Killed >>> LOG: terminating any other active server processes >> >> Have you looked at the system logs to see if the OOM killer is involved? > > No, I hadn't, but now I see that of the past three crashes, four > postgres processes were OOM victims. I guess I have to read the OOM > articles I've been skipping (read one a long time ago). If you have any > specific suggestions, let me know. There are suggestions here: https://www.postgresql.org/docs/12/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT It probably would not hurt to figure why this seemed to happen with the Python -> Go switch. Even if you can get the OOM not to kick in, the fact that it was kicking in would indicate you now have memory hungry processes that did not exist before. The questions being is this strictly due to the language/framework change, natural growth of data set being worked with, or something else? > > Thanks Adrian, > > Joe -- Adrian Klaver adrian.klaver@aklaver.com
On 2021-02-15 15:34:30 -0800, Adrian Klaver wrote: > It probably would not hurt to figure why this seemed to happen with the > Python -> Go switch. Even if you can get the OOM not to kick in, the fact > that it was kicking in would indicate you now have memory hungry processes > that did not exist before. Or just more of them. I could imagine that switching from Python/Gunicorn to Go increased the number of queries that could be in-flight at the same time. hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"
Attachment
On 2021-02-15 16:15:44 -0500, Joe Abbate wrote: > As I understand it, the PG server processes doing a SELECT are spawned as > children of the Go process, There's a misconception here: The server processes are children of the postmaster process, not of the client program (whether it's written in Go or any other language). Think about it: The database server and the client may not even be on the same machine, so there cannot be a parent/child relationship between client and server processes. > so presumably if a "goroutine" dies, the associated PG process would > die too, A server process will terminate when the connection is closed, but it may not notice that immediately. Moreover, since Goroutines are handled by the Go runtime, not the Linux kernel, the connection may not be closed just because a goroutine terminates (actually, I think the standard API pretty much guarantuees the existence of a connection pool). hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"