Thread: server process exited with code 1

server process exited with code 1

From

Eric Hill

Date:

23 April 2021, 15:58:32

Hey,

We are experiencing a periodic PostgreSQL crash. It happens overnight when automated processes are updating content on our website. My PostgreSQL version information is:

PostgreSQL 11.10, compiled by Visual C++ build 1914, 64-bit

It is running on Windows Server 2019 Standard. We were having this same crash on PostgreSQL 9.6, also running on Windows Server, I believe 2012, before a recent upgrade. Here is the relevant part of the log:

2021-04-23 03:20:57 EDT [5324]: LOG: connection received: host=10.120.80.162 port=54017

2021-04-23 03:20:57 EDT [5324]: LOG: connection authorized: user=dba_webjmp database=webjmp

2021-04-23 03:21:00 EDT [15776]: LOG: server process (PID 14820) exited with exit code 1

2021-04-23 03:21:00 EDT [15776]: DETAIL: Failed process was running: SET client_min_messages TO warning;SET TIME ZONE INTERVAL '+00:00' HOUR TO MINUTE;

2021-04-23 03:21:00 EDT [15776]: LOG: terminating any other active server processes

2021-04-23 03:21:00 EDT [5324]: WARNING: terminating connection because of crash of another server process

2021-04-23 03:21:00 EDT [5324]: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

2021-04-23 03:21:00 EDT [5324]: HINT: In a moment you should be able to reconnect to the database and repeat your command.

I would really like for PostgreSQL to produce a crashdump of this crash but one is not produced. I made a crashdumps folder in the data directory and gave it every permission I could think of, but no joy there.

We’ve looked at various aspects of the PostgreSQL instance and the machine it is running on, and we don’t seem to be running into any resource constraints (connections, memory). This is what I’m logging:

# pgBadger-approved log line prefix

log_line_prefix = '%t [%p]: '

# S1561728 - Log any query that takes > 250 milliseconds to return

log_min_duration_statement = 250ms

# Additional logging to troubleshoot crashes, also using pgBadger recommendations

log_checkpoints = on # Default is off

log_lock_waits = on # Default is off

log_connections = on # Default is off

log_disconnections = on # Default is off

log_temp_files = 0 # log temporary files equal or larger

log_autovacuum_min_duration = 0 # log all autovacuums

log_error_verbosity = default # default is the default

Any thoughts? Any other information I could provide that would be helpful?

Thanks!

Eric

Re: server process exited with code 1

From

Julien Rouhaud

Date:

23 April 2021, 16:22:39

On Fri, Apr 23, 2021 at 03:58:32PM +0000, Eric Hill wrote:
> Hey,
> 
> We are experiencing a periodic PostgreSQL crash. It happens overnight when automated processes are updating content
onour website.  My PostgreSQL version information is:
 
> 
> PostgreSQL 11.10, compiled by Visual C++ build 1914, 64-bit
> 
> It is running on Windows Server 2019 Standard.  We were having this same crash on PostgreSQL 9.6, also running on
WindowsServer, I believe 2012, before a recent upgrade.  Here is the relevant part of the log:
 
> 
> 2021-04-23 03:20:57 EDT [5324]: LOG:  connection received: host=10.120.80.162 port=54017
> 2021-04-23 03:20:57 EDT [5324]: LOG:  connection authorized: user=dba_webjmp database=webjmp
> 2021-04-23 03:21:00 EDT [15776]: LOG:  server process (PID 14820) exited with exit code 1
> 2021-04-23 03:21:00 EDT [15776]: DETAIL:  Failed process was running: SET client_min_messages TO warning;SET TIME
ZONEINTERVAL '+00:00' HOUR TO MINUTE;
 
> 2021-04-23 03:21:00 EDT [15776]: LOG:  terminating any other active server processes
> 2021-04-23 03:21:00 EDT [5324]: WARNING:  terminating connection because of crash of another server process
> 2021-04-23 03:21:00 EDT [5324]: DETAIL:  The postmaster has commanded this server process to roll back the current
transactionand exit, because another server process exited abnormally and possibly corrupted shared memory.
 
> 2021-04-23 03:21:00 EDT [5324]: HINT:  In a moment you should be able to reconnect to the database and repeat your
command.
> 
> I would really like for PostgreSQL to produce a crashdump of this crash but one is not produced.  I made a crashdumps
folderin the data directory and gave it every permission I could think of, but no joy there.
 

The mentioned query works just fine for me, so I doubt we could help without a
core dump.  Did you check the instructions at

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows#Using_crash_dumps_to_debug_random_and_unpredictable_backend_crashes?

Re: server process exited with code 1

From

Tom Lane

Date:

23 April 2021, 16:36:41

Eric Hill <Eric.Hill@jmp.com> writes:
> 2021-04-23 03:20:57 EDT [5324]: LOG:  connection received: host=10.120.80.162 port=54017
> 2021-04-23 03:20:57 EDT [5324]: LOG:  connection authorized: user=dba_webjmp database=webjmp
> 2021-04-23 03:21:00 EDT [15776]: LOG:  server process (PID 14820) exited with exit code 1
> 2021-04-23 03:21:00 EDT [15776]: DETAIL:  Failed process was running: SET client_min_messages TO warning;SET TIME
ZONEINTERVAL '+00:00' HOUR TO MINUTE; 
> 2021-04-23 03:21:00 EDT [15776]: LOG:  terminating any other active server processes

Hm.  That's fairly odd.  Exit code 1 ordinarily means that the backend
hit a "FATAL" error, which is one that causes process termination but
*not* a database-wide restart.  The fact that the postmaster is treating
it that way anyway implies that the backend failed to disarm the "dead
man switch" mechanism that exists to check that backends cleaned up
their shared-memory state before exiting.  Another thing that doesn't
square with this being an ordinary FATAL exit is that there's no message
visible in the log.

(I wonder what this looks like from the client's end --- is it getting
any sort of termination message, or just a lost network connection?)

The implication is that something in the backend did _exit(1) to force
process termination without any of the normal atexit cleanup.  There
is not, or at least isn't supposed to be, anything in Postgres proper
that does that.  So I'm wondering if you have any nonstandard code
in there, such as unusual extensions.  A badly-written event trigger
could perhaps do it too.

This being Windows, a certain amount of suspicion has to be directed
towards bogus antivirus code, too.

            regards, tom lane

Re: server process exited with code 1

From

Eric Hill

Date:

23 April 2021, 19:08:39

Hey, Julien,

Yes, I have looked at those instructions and followed them to a T. My data directory is C:\Program Files\PostgreSQL\11\data as confirmed by SHOW data_directory;, and I gave the postgres, pgsqladmin, and NETWORK SERVICE accounts (the latter is the account that is running the services) Full Control on the crashdumps directory I created under data. No crash dumps have yet appeared in that folder. I’m not sure how to convince crashdumps to show up. Maybe with all the antivirus stuff turned off it will help?

Thanks!

Eric

From: Julien Rouhaud <rjuju123@gmail.com>
Date: Friday, April 23, 2021 at 12:21 PM
To: Eric Hill <Eric.Hill@jmp.com>
Cc: pgsql-general@lists.postgresql.org <pgsql-general@lists.postgresql.org>
Subject: Re: server process exited with code 1

EXTERNAL

On Fri, Apr 23, 2021 at 03:58:32PM +0000, Eric Hill wrote:
> Hey,
>
> We are experiencing a periodic PostgreSQL crash. It happens overnight when automated processes are updating content on our website. My PostgreSQL version information is:
>
> PostgreSQL 11.10, compiled by Visual C++ build 1914, 64-bit
>
> It is running on Windows Server 2019 Standard. We were having this same crash on PostgreSQL 9.6, also running on Windows Server, I believe 2012, before a recent upgrade. Here is the relevant part of the log:
>
> 2021-04-23 03:20:57 EDT [5324]: LOG: connection received: host=10.120.80.162 port=54017
> 2021-04-23 03:20:57 EDT [5324]: LOG: connection authorized: user=dba_webjmp database=webjmp
> 2021-04-23 03:21:00 EDT [15776]: LOG: server process (PID 14820) exited with exit code 1
> 2021-04-23 03:21:00 EDT [15776]: DETAIL: Failed process was running: SET client_min_messages TO warning;SET TIME ZONE INTERVAL '+00:00' HOUR TO MINUTE;
> 2021-04-23 03:21:00 EDT [15776]: LOG: terminating any other active server processes
> 2021-04-23 03:21:00 EDT [5324]: WARNING: terminating connection because of crash of another server process
> 2021-04-23 03:21:00 EDT [5324]: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2021-04-23 03:21:00 EDT [5324]: HINT: In a moment you should be able to reconnect to the database and repeat your command.
>
> I would really like for PostgreSQL to produce a crashdump of this crash but one is not produced. I made a crashdumps folder in the data directory and gave it every permission I could think of, but no joy there.

The mentioned query works just fine for me, so I doubt we could help without a
core dump. Did you check the instructions at
https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwiki.postgresql.org%2Fwiki%2FGetting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows%23Using_crash_dumps_to_debug_random_and_unpredictable_backend_crashes&data=04%7C01%7CEric.Hill%40jmp.com%7Cddb46609d2e148820b3908d90673e86a%7Cb1c14d5c362545b3a4309552373a0c2f%7C0%7C0%7C637547917153130030%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=VJTI%2FHtgXN6u5F6k5YYcImSuS3MwQe%2BIDdLy7CQLo7g%3D&reserved=0?

Re: server process exited with code 1

From

Eric Hill

Date:

23 April 2021, 19:30:12

Wow, Tom, thanks, that gives us a lot to go on.

The first sign of trouble on the client is just ECONNRESET:

[[37m2021-04-23T07:21:01.073Z[39m] [35m WARN[39m: main/6716 on [redacted]:

[37m [36mPubSub lost connection to datatabase, retrying connection. Error: read ECONNRESET

at TCP.onStreamRead (internal/stream_base_commons.js:209:20)

at TCP.callbackTrampoline (internal/async_hooks.js:131:14) {

errno: -4077,

code: 'ECONNRESET',

syscall: 'read'

}[39m

[39m[[37m2021-04-23T07:21:01.079Z[39m] [35m WARN[39m: main/6716 on [redacted]:

[37m [36mPubSub lost connection to datatabase, retrying connection. Error: Connection terminated unexpectedly

(Times in that log are GMT, so 07:21am here === 03:21am in the psql log).

I don’t believe we have any unusual extensions. We do have triggers, and the VM does have antivirus protection. I’ll work on exclusions for the AV, and we’ll look into our triggers a bit.

Thanks!

Eric

From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Friday, April 23, 2021 at 12:36 PM
To: Eric Hill <Eric.Hill@jmp.com>
Cc: pgsql-general@lists.postgresql.org <pgsql-general@lists.postgresql.org>
Subject: Re: server process exited with code 1

EXTERNAL

Eric Hill <Eric.Hill@jmp.com> writes:
> 2021-04-23 03:20:57 EDT [5324]: LOG: connection received: host=10.120.80.162 port=54017
> 2021-04-23 03:20:57 EDT [5324]: LOG: connection authorized: user=dba_webjmp database=webjmp
> 2021-04-23 03:21:00 EDT [15776]: LOG: server process (PID 14820) exited with exit code 1
> 2021-04-23 03:21:00 EDT [15776]: DETAIL: Failed process was running: SET client_min_messages TO warning;SET TIME ZONE INTERVAL '+00:00' HOUR TO MINUTE;
> 2021-04-23 03:21:00 EDT [15776]: LOG: terminating any other active server processes

Hm. That's fairly odd. Exit code 1 ordinarily means that the backend
hit a "FATAL" error, which is one that causes process termination but
*not* a database-wide restart. The fact that the postmaster is treating
it that way anyway implies that the backend failed to disarm the "dead
man switch" mechanism that exists to check that backends cleaned up
their shared-memory state before exiting. Another thing that doesn't
square with this being an ordinary FATAL exit is that there's no message
visible in the log.

(I wonder what this looks like from the client's end --- is it getting
any sort of termination message, or just a lost network connection?)

The implication is that something in the backend did _exit(1) to force
process termination without any of the normal atexit cleanup. There
is not, or at least isn't supposed to be, anything in Postgres proper
that does that. So I'm wondering if you have any nonstandard code
in there, such as unusual extensions. A badly-written event trigger
could perhaps do it too.

This being Windows, a certain amount of suspicion has to be directed
towards bogus antivirus code, too.

regards, tom lane

Re: server process exited with code 1

From

Tom Lane

Date:

23 April 2021, 23:12:41

Eric Hill <Eric.Hill@jmp.com> writes:
> I dont believe we have any unusual extensions. We do have triggers, and the VM does have antivirus protection.
Illwork on exclusions for the AV, and well look into our triggers a bit.

BTW, I happened to check in our commit log to see when we installed the
dead man switch I referred to, and the commit message is quite
interesting:

Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: master Release: REL8_4_BR [969d7cd43] 2009-05-05 19:59:00 +0000

Install a "dead man switch" to allow the postmaster to detect cases where
a backend has done exit(0) or exit(1) without having disengaged itself
from shared memory. We are at risk for this whenever third-party code is
loaded into a backend, since such code might not know it's supposed to go
through proc_exit() instead. Also, it is reported that under Windows
there are ways to externally kill a process that cause the status code
returned to the postmaster to be indistinguishable from a voluntary exit
(thank you, Microsoft). If this does happen then the system is probably
hosed --- for instance, the dead session might still be holding locks.
So the best recovery method is to treat this like a backend crash.

The dead man switch is armed for a particular child process when it
acquires a regular PGPROC, and disarmed when the PGPROC is released;
these should be the first and last touches of shared memory resources
in a backend, or close enough anyway. This choice means there is no
coverage for auxiliary processes, but I doubt we need that, since they
shouldn't be executing any user-provided code anyway.

So it seems like you'd better be looking into the possibility that
something entirely external to Postgres is causing that backend process
to quit with what looks like _exit(1). Unfortunately, this commit is
from before we had a habit of including links to mailing list
discussions in commit messages; but if you dig around in the pghackers
archives near that date you can probably find the thread, and maybe
there will be more info about the Windows aspect. (I'm not a Windows
guy, so this was purely hearsay on my part.)

regards, tom lane