Re: backend stuck in DataFileExtend - Mailing list pgsql-hackers
| From | Andres Freund |
|---|---|
| Subject | Re: backend stuck in DataFileExtend |
| Date | |
| Msg-id | 20240506170413.u7rybdg7nmgrizfu@awork3.anarazel.de |
| In response to | backend stuck in DataFileExtend (Justin Pryzby <pryzby@telsasoft.com>) |
| Responses | Re: backend stuck in DataFileExtend |
| List | pgsql-hackers |
Hi,

On 2024-05-06 09:05:38 -0500, Justin Pryzby wrote:
> In March, I noticed that a backend got stuck overnight doing:
>
> backend_start   | 2024-03-27 22:34:12.774195-07
> xact_start      | 2024-03-27 22:34:39.741933-07
> query_start     | 2024-03-27 22:34:41.997276-07
> state_change    | 2024-03-27 22:34:41.997307-07
> wait_event_type | IO
> wait_event      | DataFileExtend
> state           | active
> backend_xid     | 330896991
> backend_xmin    | 330896991
> query_id        | -3255287420265634540
> query           | PREPARE mml_1 AS INSERT INTO child.un...
>
> The backend was spinning at 100% CPU:
>
> [pryzbyj@telsa2021 ~]$ ps -O wchan,pcpu 7881
>   PID WCHAN  %CPU S TTY     TIME COMMAND
>  7881 ?      99.4 R ?   08:14:55 postgres: telsasoft ts [local] INSERT
>
> This was postgres 16 STABLE compiled at 14e991db8.
>
> It's possible that this is a rare issue that we haven't hit before.
> It's also possible this is a recent regression. We previously
> compiled at b2c9936a7 without hitting any issue (possibly by chance).
>
> I could neither strace the process nor attach a debugger. They got
> stuck attaching. Maybe there's a kernel issue. This is a
> VM running centos7 / 3.10.0-1160.49.1.el7.x86_64.
> $ awk '{print $14, $15}' /proc/7881/stat # usr / sys
> 229 3088448
>
> When I tried to shut down postgres (hoping to finally be able to attach
> a debugger), instead it got stuck:

That strongly indicates either a kernel bug or storage having an issue. It
can't be postgres' fault if an IO never completes.

What does /proc/$pid/stack say?

> In both cases, the backend got stuck after 10pm, which is when a backup
> job kicks off, followed by other DB maintenance. Our backup job uses
> pg_export_snapshot() + pg_dump --snapshot. In today's case, the pg_dump
> would've finished and snapshot closed at 2023-05-05 22:15. The backup
> job did some more pg_dumps involving historic tables without snapshots
> and finished at 01:11:40, at which point a reindex job started, but it
> looks like the DB was already stuck for the purpose of reindex, and so
> the script ended after a handful of commands were "[canceled] due to
> statement timeout".

Is it possible that you're "just" waiting for very slow IO? Is there a lot
of dirty memory? Particularly on these old kernels that can lead to very
extreme delays.

grep -Ei 'dirty|writeback' /proc/meminfo

> [...]
> Full disclosure: the VM that hit this issue today has had storage-level
> errors (reported here at ZZqr_GTaHyuW7fLp@pryzbyj2023), as recently as 3
> days ago.

So indeed, my suspicion from above is confirmed.

Greetings,

Andres Freund
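A minimal sketch of the checks suggested in this exchange, assuming the stuck backend's PID is 7881 as in the report; reading /proc/<pid>/stack requires root, and the pg_stat_activity columns used here exist in PostgreSQL 16:

    # Find backends stuck waiting on a data file extension.
    psql -c "SELECT pid, state, wait_event_type, wait_event, query_start
               FROM pg_stat_activity
              WHERE wait_event = 'DataFileExtend';"

    # Kernel-side stack of the stuck process: if the IO really never
    # completes, this shows where the process is blocked in the kernel.
    sudo cat /proc/7881/stack

    # How much dirty memory is pending writeback; large values on old
    # kernels (e.g. 3.10.x) can mean extreme IO stalls.
    grep -Ei 'dirty|writeback' /proc/meminfo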