checkpointer and other server processes crashing

From: Joe Abbate
Hello,

We've been experiencing PG server process crashes about every other week 
on a mostly read only website (except for a single insert/update on page 
access).  Typical log entries look like

LOG:  checkpointer process (PID 11200) was terminated by signal 9: Killed
LOG:  terminating any other active server processes

Other than the checkpointer, the server process that was terminated was 
either doing a "BEGIN READ WRITE", a "COMMIT" or executing a specific 
SELECT.

The database is always recovered within a second and everything else 
appears to resume normally.  We're not certain about what triggers this, 
but in several instances the web logs show an external bot issuing 
multiple HEAD requests on what is logically a single page.  The web 
server logs show "broken pipe" and EOF errors, and PG logs sometimes 
show a number of "incomplete startup packet" messages before the
termination message.

This started roughly when the site was migrated to Go, whose web 
"processes" run as "goroutines", scheduled by Go's runtime (previously 
the site used Python and Gunicorn to serve the pages, which probably 
isolated the PG processes from a barrage of nearly simultaneous requests).
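
For concreteness, the Go side follows the usual net/http pattern, 
roughly like the sketch below (not our actual code; the driver, 
connection string and table/column names are just placeholders).  Each 
incoming request is served in its own goroutine, and all of them share 
one *sql.DB:

package main

import (
	"database/sql"
	"log"
	"net/http"

	_ "github.com/lib/pq" // Postgres driver
)

var db *sql.DB

// Each request is served in its own goroutine by net/http, so a burst
// of nearly simultaneous requests becomes a burst of concurrent
// queries against the shared pool.
func pageHandler(w http.ResponseWriter, r *http.Request) {
	var title string
	// "pages", "slug" and "title" are made-up names for illustration.
	err := db.QueryRow("SELECT title FROM pages WHERE slug = $1",
		r.URL.Path).Scan(&title)
	if err != nil {
		http.NotFound(w, r)
		return
	}
	w.Write([]byte(title))
}

func main() {
	var err error
	db, err = sql.Open("postgres", "dbname=site sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	http.HandleFunc("/", pageHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

So a bot firing off multiple HEAD requests turns more or less directly 
into that many concurrent queries against the shared pool.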

As I understand it, the PG server processes doing a SELECT are spawned 
as children of the Go process, so presumably if a "goroutine" dies, the 
associated PG process would die too, but I'm not sure I grasp why that 
would cause a recovery/restart.  I also don't understand where the 
checkpointer process fits in the picture (and what would cause it to die).

For the record, this is on PG 11.9 running on Debian.

TIA,

Joe



Re: checkpointer and other server processes crashing

From: Adrian Klaver
On 2/15/21 1:15 PM, Joe Abbate wrote:
> Hello,
> 
> We've been experiencing PG server process crashes about every other week 
> on a mostly read only website (except for a single insert/update on page 
> access).  Typical log entries look like
> 
> LOG:  checkpointer process (PID 11200) was terminated by signal 9: Killed
> LOG:  terminating any other active server processes

Have you looked at the system logs to see if the OOM killer is involved?

> 
> Other than the checkpointer, the server process that was terminated was 
> either doing a "BEGIN READ WRITE", a "COMMIT" or executing a specific 
> SELECT.
> 
> The database is always recovered within a second and everything else 
> appears to resume normally.  We're not certain about what triggers this, 
> but in several instances the web logs show an external bot issuing 
> multiple HEAD requests on what is logically a single page.  The web 
> server logs show "broken pipe" and EOF errors, and PG logs sometimes 
> show a number of "incomplete startup packet" messages before the 
> termination message.
> 
> This started roughly when the site was migrated to Go, whose web 
> "processes" run as "goroutines", scheduled by Go's runtime (previously 
> the site used Python and Gunicorn to serve the pages, which probably 
> isolated the PG processes from a barrage of nearly simultaneous requests).
> 
> As I understand it, the PG server processes doing a SELECT are spawned 
> as children of the Go process, so presumably if a "goroutine" dies, the 
> associated PG process would die too, but I'm not sure I grasp why that 
> would cause a recovery/restart.  I also don't understand where the 
> checkpointer process fits in the picture (and what would cause it to die).
> 
> For the record, this is on PG 11.9 running on Debian.
> 
> TIA,
> 
> Joe
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: checkpointer and other server processes crashing

From: Tim Cross
Joe Abbate <jma@freedomcircle.com> writes:

> Hello,
>
> We've been experiencing PG server process crashes about every other week
> on a mostly read only website (except for a single insert/update on page
> access).  Typical log entries look like
>
> LOG:  checkpointer process (PID 11200) was terminated by signal 9: Killed
> LOG:  terminating any other active server processes
>
> Other than the checkpointer, the server process that was terminated was
> either doing a "BEGIN READ WRITE", a "COMMIT" or executing a specific
> SELECT.
>
> The database is always recovered within a second and everything else
> appears to resume normally.  We're not certain about what triggers this,
> but in several instances the web logs show an external bot issuing
> multiple HEAD requests on what is logically a single page.  The web
> server logs show "broken pipe" and EOF errors, and PG logs sometimes
> show a number of "incomplete startup packet" messages before the
> termination message.
>
> This started roughly when the site was migrated to Go, whose web
> "processes" run as "goroutines", scheduled by Go's runtime (previously
> the site used Python and Gunicorn to serve the pages, which probably
> isolated the PG processes from a barrage of nearly simultaneous requests).
>
> As I understand it, the PG server processes doing a SELECT are spawned
> as children of the Go process, so presumably if a "goroutine" dies, the
> associated PG process would die too, but I'm not sure I grasp why that
> would cause a recovery/restart.  I also don't understand where the
> checkpointer process fits in the picture (and what would cause it to die).
>

A signal 9 typically means something is explicitly killing processes. I
would check your system logs in case something is killing processes due
to running out of some resource (like memory). If it is a fairly recent
Debian system, journalctl might be useful for checking.

--
Tim Cross



Re: checkpointer and other server processes crashing

From: Joe Abbate
On 15/2/21 16:29, Adrian Klaver wrote:
> On 2/15/21 1:15 PM, Joe Abbate wrote:
>> We've been experiencing PG server process crashes about every other 
>> week on a mostly read only website (except for a single insert/update 
>> on page access).  Typical log entries look like
>>
>> LOG:  checkpointer process (PID 11200) was terminated by signal 9: Killed
>> LOG:  terminating any other active server processes
> 
> Have you looked at the system logs to see if the OOM killer is involved?

No, I hadn't, but now I see that of the past three crashes, four 
postgres processes were OOM victims.  I guess I have to read the OOM 
articles I've been skipping (read one a long time ago).  If you have any 
specific suggestions, let me know.

Thanks Adrian,

Joe



Re: checkpointer and other server processes crashing

From: Adrian Klaver
On 2/15/21 1:50 PM, Joe Abbate wrote:
> On 15/2/21 16:29, Adrian Klaver wrote:
>> On 2/15/21 1:15 PM, Joe Abbate wrote:
>>> We've been experiencing PG server process crashes about every other 
>>> week on a mostly read only website (except for a single insert/update 
>>> on page access).  Typical log entries look like
>>>
>>> LOG:  checkpointer process (PID 11200) was terminated by signal 9: Killed
>>> LOG:  terminating any other active server processes
>>
>> Have you looked at the system logs to see if the OOM killer is involved?
> 
> No, I hadn't, but now I see that of the past three crashes, four 
> postgres processes were OOM victims.  I guess I have to read the OOM 
> articles I've been skipping (read one a long time ago).  If you have any 
> specific suggestions, let me know.

There are suggestions here:

https://www.postgresql.org/docs/12/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

It probably would not hurt to figure out why this seemed to happen with 
the Python -> Go switch. Even if you can get the OOM killer not to kick 
in, the fact that it was kicking in would indicate you now have 
memory-hungry processes that did not exist before. The question is 
whether this is strictly due to the language/framework change, natural 
growth of the data set being worked with, or something else.

> 
> Thanks Adrian,
> 
> Joe


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: checkpointer and other server processes crashing

From: "Peter J. Holzer"
On 2021-02-15 15:34:30 -0800, Adrian Klaver wrote:
> It probably would not hurt to figure out why this seemed to happen with the
> Python -> Go switch. Even if you can get the OOM killer not to kick in, the
> fact that it was kicking in would indicate you now have memory-hungry
> processes that did not exist before.

Or just more of them. I could imagine that switching from
Python/Gunicorn to Go increased the number of queries that could be
in-flight at the same time.
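
If that's what's going on, database/sql lets you cap how many backends 
the application can hold open at once.  Something along these lines (a 
sketch only; the driver, connection string and limits are placeholders 
to be tuned for your memory budget):

package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // Postgres driver
)

func openDB() *sql.DB {
	db, err := sql.Open("postgres", "dbname=site sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	// Bound the number of Postgres backends this application can hold
	// open, no matter how many goroutines are serving requests.
	db.SetMaxOpenConns(10)                  // hard cap on concurrent backends
	db.SetMaxIdleConns(5)                   // idle connections kept for reuse
	db.SetConnMaxLifetime(30 * time.Minute) // recycle connections periodically
	return db
}

func main() {
	db := openDB()
	defer db.Close()
	// Goroutines that need a connection beyond the cap block inside
	// Query/QueryRow until one is free, instead of opening more backends.
}

Requests beyond the cap then queue inside the Go application instead of 
fanning out into ever more Postgres backends.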

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"


Re: checkpointer and other server processes crashing

From: "Peter J. Holzer"
On 2021-02-15 16:15:44 -0500, Joe Abbate wrote:
> As I understand it, the PG server processes doing a SELECT are spawned as
> children of the Go process,

There's a misconception here: The server processes are children of the
postmaster process, not of the client program (whether it's written in
Go or any other language). Think about it: The database server and the
client may not even be on the same machine, so there cannot be a
parent/child relationship between client and server processes.

> so presumably if a "goroutine" dies, the associated PG process would
> die too,

A server process will terminate when the connection is closed, but it
may not notice that immediately. Moreover, since goroutines are handled
by the Go runtime, not the Linux kernel, the connection may not be
closed just because a goroutine terminates (actually, I think the
standard API pretty much guarantees the existence of a connection
pool).
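
You can observe this from the Go side. In a sketch like the one below 
(assuming the stock database/sql pool; driver and connection string are 
placeholders), the backend that served the query is still there after 
the goroutine has exited, parked as an idle connection in the pool:

package main

import (
	"database/sql"
	"fmt"
	"log"
	"sync"

	_ "github.com/lib/pq" // Postgres driver
)

func main() {
	db, err := sql.Open("postgres", "dbname=site sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		var one int
		// The backend serving this query belongs to the pool, not to
		// this goroutine.
		if err := db.QueryRow("SELECT 1").Scan(&one); err != nil {
			log.Print(err)
		}
	}()
	wg.Wait()

	// The goroutine is gone, but its connection (and the Postgres
	// backend behind it) is still open, now idle in the pool.
	s := db.Stats()
	fmt.Printf("open=%d idle=%d\n", s.OpenConnections, s.Idle)
}

Only db.Close() (or the pool's idle/lifetime limits) actually closes 
those connections.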

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
