Thread: server process exited with code 1
Hey,
We are experiencing a periodic PostgreSQL crash. It happens overnight when automated processes are updating content on our website. My PostgreSQL version information is:
PostgreSQL 11.10, compiled by Visual C++ build 1914, 64-bit
It is running on Windows Server 2019 Standard. We were having this same crash on PostgreSQL 9.6, also running on Windows Server, I believe 2012, before a recent upgrade. Here is the relevant part of the log:
2021-04-23 03:20:57 EDT [5324]: LOG: connection received: host=10.120.80.162 port=54017
2021-04-23 03:20:57 EDT [5324]: LOG: connection authorized: user=dba_webjmp database=webjmp
2021-04-23 03:21:00 EDT [15776]: LOG: server process (PID 14820) exited with exit code 1
2021-04-23 03:21:00 EDT [15776]: DETAIL: Failed process was running: SET client_min_messages TO warning;SET TIME ZONE INTERVAL '+00:00' HOUR TO MINUTE;
2021-04-23 03:21:00 EDT [15776]: LOG: terminating any other active server processes
2021-04-23 03:21:00 EDT [5324]: WARNING: terminating connection because of crash of another server process
2021-04-23 03:21:00 EDT [5324]: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2021-04-23 03:21:00 EDT [5324]: HINT: In a moment you should be able to reconnect to the database and repeat your command.
I would really like for PostgreSQL to produce a crashdump of this crash but one is not produced. I made a crashdumps folder in the data directory and gave it every permission I could think of, but no joy there.
We’ve looked at various aspects of the PostgreSQL instance and the machine it is running on, and we don’t seem to be running into any resource constraints (connections, memory). This is what I’m logging:
# pgBadger-approved log line prefix
log_line_prefix = '%t [%p]: '
# S1561728 - Log any query that takes > 250 milliseconds to return
log_min_duration_statement = 250ms
# Additional logging to troubleshoot crashes, also using pgBadger recommendations
log_checkpoints = on # Default is off
log_lock_waits = on # Default is off
log_connections = on # Default is off
log_disconnections = on # Default is off
log_temp_files = 0 # log temporary files equal or larger
log_autovacuum_min_duration = 0 # log all autovacuums
log_error_verbosity = default # default is the default
Any thoughts? Any other information I could provide that would be helpful?
Thanks!
Eric
On Fri, Apr 23, 2021 at 03:58:32PM +0000, Eric Hill wrote: > Hey, > > We are experiencing a periodic PostgreSQL crash. It happens overnight when automated processes are updating content onour website. My PostgreSQL version information is: > > PostgreSQL 11.10, compiled by Visual C++ build 1914, 64-bit > > It is running on Windows Server 2019 Standard. We were having this same crash on PostgreSQL 9.6, also running on WindowsServer, I believe 2012, before a recent upgrade. Here is the relevant part of the log: > > 2021-04-23 03:20:57 EDT [5324]: LOG: connection received: host=10.120.80.162 port=54017 > 2021-04-23 03:20:57 EDT [5324]: LOG: connection authorized: user=dba_webjmp database=webjmp > 2021-04-23 03:21:00 EDT [15776]: LOG: server process (PID 14820) exited with exit code 1 > 2021-04-23 03:21:00 EDT [15776]: DETAIL: Failed process was running: SET client_min_messages TO warning;SET TIME ZONEINTERVAL '+00:00' HOUR TO MINUTE; > 2021-04-23 03:21:00 EDT [15776]: LOG: terminating any other active server processes > 2021-04-23 03:21:00 EDT [5324]: WARNING: terminating connection because of crash of another server process > 2021-04-23 03:21:00 EDT [5324]: DETAIL: The postmaster has commanded this server process to roll back the current transactionand exit, because another server process exited abnormally and possibly corrupted shared memory. > 2021-04-23 03:21:00 EDT [5324]: HINT: In a moment you should be able to reconnect to the database and repeat your command. > > I would really like for PostgreSQL to produce a crashdump of this crash but one is not produced. I made a crashdumps folderin the data directory and gave it every permission I could think of, but no joy there. The mentioned query works just fine for me, so I doubt we could help without a core dump. Did you check the instructions at https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows#Using_crash_dumps_to_debug_random_and_unpredictable_backend_crashes?
Eric Hill <Eric.Hill@jmp.com> writes: > 2021-04-23 03:20:57 EDT [5324]: LOG: connection received: host=10.120.80.162 port=54017 > 2021-04-23 03:20:57 EDT [5324]: LOG: connection authorized: user=dba_webjmp database=webjmp > 2021-04-23 03:21:00 EDT [15776]: LOG: server process (PID 14820) exited with exit code 1 > 2021-04-23 03:21:00 EDT [15776]: DETAIL: Failed process was running: SET client_min_messages TO warning;SET TIME ZONEINTERVAL '+00:00' HOUR TO MINUTE; > 2021-04-23 03:21:00 EDT [15776]: LOG: terminating any other active server processes Hm. That's fairly odd. Exit code 1 ordinarily means that the backend hit a "FATAL" error, which is one that causes process termination but *not* a database-wide restart. The fact that the postmaster is treating it that way anyway implies that the backend failed to disarm the "dead man switch" mechanism that exists to check that backends cleaned up their shared-memory state before exiting. Another thing that doesn't square with this being an ordinary FATAL exit is that there's no message visible in the log. (I wonder what this looks like from the client's end --- is it getting any sort of termination message, or just a lost network connection?) The implication is that something in the backend did _exit(1) to force process termination without any of the normal atexit cleanup. There is not, or at least isn't supposed to be, anything in Postgres proper that does that. So I'm wondering if you have any nonstandard code in there, such as unusual extensions. A badly-written event trigger could perhaps do it too. This being Windows, a certain amount of suspicion has to be directed towards bogus antivirus code, too. regards, tom lane
Hey, Julien,
Yes, I have looked at those instructions and followed them to a T. My data directory is C:\Program Files\PostgreSQL\11\data as confirmed by SHOW data_directory;, and I gave the postgres, pgsqladmin, and NETWORK SERVICE accounts (the latter is the account that is running the services) Full Control on the crashdumps directory I created under data. No crash dumps have yet appeared in that folder. I’m not sure how to convince crashdumps to show up. Maybe with all the antivirus stuff turned off it will help?
Thanks!
Eric
From: Julien Rouhaud <rjuju123@gmail.com>
Date: Friday, April 23, 2021 at 12:21 PM
To: Eric Hill <Eric.Hill@jmp.com>
Cc: pgsql-general@lists.postgresql.org <pgsql-general@lists.postgresql.org>
Subject: Re: server process exited with code 1
EXTERNAL
On Fri, Apr 23, 2021 at 03:58:32PM +0000, Eric Hill wrote:
> Hey,
>
> We are experiencing a periodic PostgreSQL crash. It happens overnight when automated processes are updating content on our website. My PostgreSQL version information is:
>
> PostgreSQL 11.10, compiled by Visual C++ build 1914, 64-bit
>
> It is running on Windows Server 2019 Standard. We were having this same crash on PostgreSQL 9.6, also running on Windows Server, I believe 2012, before a recent upgrade. Here is the relevant part of the log:
>
> 2021-04-23 03:20:57 EDT [5324]: LOG: connection received: host=10.120.80.162 port=54017
> 2021-04-23 03:20:57 EDT [5324]: LOG: connection authorized: user=dba_webjmp database=webjmp
> 2021-04-23 03:21:00 EDT [15776]: LOG: server process (PID 14820) exited with exit code 1
> 2021-04-23 03:21:00 EDT [15776]: DETAIL: Failed process was running: SET client_min_messages TO warning;SET TIME ZONE INTERVAL '+00:00' HOUR TO MINUTE;
> 2021-04-23 03:21:00 EDT [15776]: LOG: terminating any other active server processes
> 2021-04-23 03:21:00 EDT [5324]: WARNING: terminating connection because of crash of another server process
> 2021-04-23 03:21:00 EDT [5324]: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2021-04-23 03:21:00 EDT [5324]: HINT: In a moment you should be able to reconnect to the database and repeat your command.
>
> I would really like for PostgreSQL to produce a crashdump of this crash but one is not produced. I made a crashdumps folder in the data directory and gave it every permission I could think of, but no joy there.
The mentioned query works just fine for me, so I doubt we could help without a
core dump. Did you check the instructions at
https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwiki.postgresql.org%2Fwiki%2FGetting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows%23Using_crash_dumps_to_debug_random_and_unpredictable_backend_crashes&data=04%7C01%7CEric.Hill%40jmp.com%7Cddb46609d2e148820b3908d90673e86a%7Cb1c14d5c362545b3a4309552373a0c2f%7C0%7C0%7C637547917153130030%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=VJTI%2FHtgXN6u5F6k5YYcImSuS3MwQe%2BIDdLy7CQLo7g%3D&reserved=0?
Wow, Tom, thanks, that gives us a lot to go on.
The first sign of trouble on the client is just ECONNRESET:
[[37m2021-04-23T07:21:01.073Z[39m] [35m WARN[39m: main/6716 on [redacted]:
[37m [36mPubSub lost connection to datatabase, retrying connection. Error: read ECONNRESET
at TCP.onStreamRead (internal/stream_base_commons.js:209:20)
at TCP.callbackTrampoline (internal/async_hooks.js:131:14) {
errno: -4077,
code: 'ECONNRESET',
syscall: 'read'
}[39m
[39m[[37m2021-04-23T07:21:01.079Z[39m] [35m WARN[39m: main/6716 on [redacted]:
[37m [36mPubSub lost connection to datatabase, retrying connection. Error: Connection terminated unexpectedly
(Times in that log are GMT, so 07:21am here === 03:21am in the psql log).
I don’t believe we have any unusual extensions. We do have triggers, and the VM does have antivirus protection. I’ll work on exclusions for the AV, and we’ll look into our triggers a bit.
Thanks!
Eric
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Friday, April 23, 2021 at 12:36 PM
To: Eric Hill <Eric.Hill@jmp.com>
Cc: pgsql-general@lists.postgresql.org <pgsql-general@lists.postgresql.org>
Subject: Re: server process exited with code 1
EXTERNAL
Eric Hill <Eric.Hill@jmp.com> writes:
> 2021-04-23 03:20:57 EDT [5324]: LOG: connection received: host=10.120.80.162 port=54017
> 2021-04-23 03:20:57 EDT [5324]: LOG: connection authorized: user=dba_webjmp database=webjmp
> 2021-04-23 03:21:00 EDT [15776]: LOG: server process (PID 14820) exited with exit code 1
> 2021-04-23 03:21:00 EDT [15776]: DETAIL: Failed process was running: SET client_min_messages TO warning;SET TIME ZONE INTERVAL '+00:00' HOUR TO MINUTE;
> 2021-04-23 03:21:00 EDT [15776]: LOG: terminating any other active server processes
Hm. That's fairly odd. Exit code 1 ordinarily means that the backend
hit a "FATAL" error, which is one that causes process termination but
*not* a database-wide restart. The fact that the postmaster is treating
it that way anyway implies that the backend failed to disarm the "dead
man switch" mechanism that exists to check that backends cleaned up
their shared-memory state before exiting. Another thing that doesn't
square with this being an ordinary FATAL exit is that there's no message
visible in the log.
(I wonder what this looks like from the client's end --- is it getting
any sort of termination message, or just a lost network connection?)
The implication is that something in the backend did _exit(1) to force
process termination without any of the normal atexit cleanup. There
is not, or at least isn't supposed to be, anything in Postgres proper
that does that. So I'm wondering if you have any nonstandard code
in there, such as unusual extensions. A badly-written event trigger
could perhaps do it too.
This being Windows, a certain amount of suspicion has to be directed
towards bogus antivirus code, too.
regards, tom lane
Eric Hill <Eric.Hill@jmp.com> writes: > I dont believe we have any unusual extensions. We do have triggers, and the VM does have antivirus protection. Illwork on exclusions for the AV, and well look into our triggers a bit. BTW, I happened to check in our commit log to see when we installed the dead man switch I referred to, and the commit message is quite interesting: Author: Tom Lane <tgl@sss.pgh.pa.us> Branch: master Release: REL8_4_BR [969d7cd43] 2009-05-05 19:59:00 +0000 Install a "dead man switch" to allow the postmaster to detect cases where a backend has done exit(0) or exit(1) without having disengaged itself from shared memory. We are at risk for this whenever third-party code is loaded into a backend, since such code might not know it's supposed to go through proc_exit() instead. Also, it is reported that under Windows there are ways to externally kill a process that cause the status code returned to the postmaster to be indistinguishable from a voluntary exit (thank you, Microsoft). If this does happen then the system is probably hosed --- for instance, the dead session might still be holding locks. So the best recovery method is to treat this like a backend crash. The dead man switch is armed for a particular child process when it acquires a regular PGPROC, and disarmed when the PGPROC is released; these should be the first and last touches of shared memory resources in a backend, or close enough anyway. This choice means there is no coverage for auxiliary processes, but I doubt we need that, since they shouldn't be executing any user-provided code anyway. So it seems like you'd better be looking into the possibility that something entirely external to Postgres is causing that backend process to quit with what looks like _exit(1). Unfortunately, this commit is from before we had a habit of including links to mailing list discussions in commit messages; but if you dig around in the pghackers archives near that date you can probably find the thread, and maybe there will be more info about the Windows aspect. (I'm not a Windows guy, so this was purely hearsay on my part.) regards, tom lane