Thread: Very slow queries followed by checkpointer process killed with signal 9
Hello,
I'm troubleshooting an issue where about once a week, a database appears to lock up and then the PostgreSQL process crashes and recovers. When this happens, a few queries will be logged, but there is no pattern to which queries are executing when the crash happens, and the queries logged don't appear to be queries that would consume a lot of resources.
It seems like something in the backend is locking up, causing queries to slow down or fail before Postgres crashes.
Here's an example of the log output leading up to and following one of these crashes:
2023-03-30 13:03:21.943 UTC [4155] LOG: duration: 14232.602 ms statement: START TRANSACTION;
2023-03-30 13:03:25.947 UTC [8899] LOG: duration: 17269.756 ms statement: BEGIN
2023-03-30 13:03:28.805 UTC [8874] LOG: duration: 19987.241 ms statement: BEGIN
2023-03-30 13:03:32.068 UTC [8326] LOG: duration: 21541.082 ms statement: BEGIN
2023-03-30 13:04:12.164 UTC [1] LOG: checkpointer process (PID 23) was terminated by signal 9: Killed
2023-03-30 13:04:12.457 UTC [444] LOG: duration: 58248.342 ms parse <unnamed>: INSERT INTO simple_table (id, value) VALUES ($1, $2)
ON CONFLICT(id) DO UPDATE SET value = $2
2023-03-30 13:04:18.256 UTC [4155] LOG: duration: 42389.362 ms statement: COMMIT;
2023-03-30 13:04:23.720 UTC [1] LOG: terminating any other active server processes
2023-03-30 13:04:28.444 UTC [8916] FATAL: the database system is in recovery mode
I'm curious where I should look for root causes.
Thanks!
Jason McLaurin <jason@jcore.io> writes:
> I'm troubleshooting an issue where about once a week, a database appears to
> lock up and then the PostgreSQL process crashes and recovers. When this
> happens, a few queries will be logged, but there is no pattern to which
> queries are executing when the crash happens, and the queries logged don't
> appear to be queries that would consume a lot of resources.

Hmm, is it always the checkpointer that gets the OOM kill? That seems
quite odd.

What PG version is this exactly? Do you have any extensions installed?

I recall having seen somebody before reporting odd slowness of trivial
commands like BEGIN. I failed to find the thread(s) in the archives
though, so I'm not sure if we identified the cause.

regards, tom lane
Re: Very slow queries followed by checkpointer process killed with signal 9
From: Jason McLaurin
Hi Tom,
> Hmm, is it always the checkpointer that gets the OOM kill? That seems
> quite odd.
> What PG version is this exactly? Do you have any extensions installed?
> I recall having seen somebody before reporting odd slowness of trivial
> commands like BEGIN. I failed to find the thread(s) in the archives
> though, so I'm not sure if we identified the cause.
This is Postgres 14.5 running in the TimescaleDB Docker image. We're not actually using TimescaleDB on this database instance (there are no hypertables), but we use the same Docker image everywhere. So it's certainly possible that a TimescaleDB background process is to blame despite the complete lack of TimescaleDB-enabled tables.
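For reference, this is roughly how I'd double-check that from the catalogs (a sketch; the hypertables view name assumes TimescaleDB 2.x):

-- list installed extensions and their versions
SELECT extname, extversion FROM pg_extension ORDER BY extname;

-- confirm there are no hypertables (TimescaleDB 2.x information view)
SELECT count(*) FROM timescaledb_information.hypertables;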
Is there anywhere you'd suggest we start looking for hints? I'd be interested in increasing relevant logging verbosity so that we can see when key background processes are running, both in Postgres core and Timescale.
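For example, I was thinking of starting with something like this to get checkpointer and autovacuum activity into the logs (just a sketch):

-- log every checkpoint and every autovacuum run, plus any temp file spills
ALTER SYSTEM SET log_checkpoints = on;
ALTER SYSTEM SET log_autovacuum_min_duration = 0;
ALTER SYSTEM SET log_temp_files = 0;
SELECT pg_reload_conf();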
Thanks so much!
Jason McLaurin <jason@jcore.io> writes:
> Is there anywhere you'd suggest we start looking for hints? I'd be
> interested in increasing relevant logging verbosity so that we can see when
> key background processes are running, both in Postgres core and Timescale.

It might be helpful to try to identify which wait events the slow
queries are blocking on (pg_stat_activity.wait_event_type and
.wait_event). I'm not sure if you're going to be able to extract
useful data, because your query on pg_stat_activity is likely to
be slow too. But it's a place to start.

Also, given that you're evidently incurring the wrath of the OOM
killer, you should try to understand why the kernel thinks it's
under memory pressure. Do you have too many processes, or perhaps
you've configured too much shared memory?

regards, tom lane
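A minimal version of the wait-event check Tom suggests might look like this (sketch only; these are the stock pg_stat_activity columns):

-- snapshot of what every backend is doing/waiting on right now
SELECT pid, backend_type, state, wait_event_type, wait_event,
       now() - query_start AS running_for,
       left(query, 60) AS query
FROM pg_stat_activity
ORDER BY query_start;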
On 4/2/23 21:40, Tom Lane wrote:
> Jason McLaurin <jason@jcore.io> writes:
>> Is there anywhere you'd suggest we start looking for hints? I'd be
>> interested in increasing relevant logging verbosity so that we can see when
>> key background processes are running, both in Postgres core and Timescale.
>
> It might be helpful to try to identify which wait events the slow
> queries are blocking on (pg_stat_activity.wait_event_type and
> .wait_event). I'm not sure if you're going to be able to extract
> useful data, because your query on pg_stat_activity is likely to
> be slow too. But it's a place to start.
>
> Also, given that you're evidently incurring the wrath of the OOM
> killer, you should try to understand why the kernel thinks it's
> under memory pressure. Do you have too many processes, or perhaps
> you've configured too much shared memory?

Given this:

> This is Postgres 14.5 running in the TimescaleDB Docker image.

Possibly the docker image is running with a cgroup memory.limit set?
The OOM killer will trigger on any cgroup limit even if the host has
plenty of free memory.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
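One way to check for such a limit from inside the container is pg_read_file() (a sketch; requires superuser, and the exact path depends on whether the host uses cgroup v1 or v2):

-- cgroup v2 layout (assumed path)
SELECT pg_read_file('/sys/fs/cgroup/memory.max');

-- cgroup v1 layout (assumed path)
SELECT pg_read_file('/sys/fs/cgroup/memory/memory.limit_in_bytes');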
Re: Very slow queries followed by checkpointer process killed with signal 9
From: Christoph Moench-Tegeder
## Tom Lane (tgl@sss.pgh.pa.us):
> Jason McLaurin <jason@jcore.io> writes:
> > I'm troubleshooting an issue where about once a week, a database appears to
> > lock up and then the PostgreSQL process crashes and recovers. When this
> > happens, a few queries will be logged, but there is no pattern to which
> > queries are executing when the crash happens, and the queries logged don't
> > appear to be queries that would consume a lot of resources.
>
> Hmm, is it always the checkpointer that gets the OOM kill? That seems
> quite odd.

Are you sure it's OOM? That's a frequent source of SIGKILLs, but SIGKILL
is not exclusive to the OOM killer. Maybe something in the container stack
(some cgroup resource limiting? that would at least be in line with the
slowdown; or a hard ulimit on CPU time?) or maybe even some "anti virus"
thingy ("It's been 0 days since AV ate the database").

Regards,
Christoph

--
Spare Space.