From:    Justin Pryzby
Subject: backend stuck in DataFileExtend
Msg-id:  ZjjjssKamHyWpxLT@pryzbyj2023
List:    pgsql-hackers
In March, I noticed that a backend got stuck overnight doing:

backend_start   | 2024-03-27 22:34:12.774195-07
xact_start      | 2024-03-27 22:34:39.741933-07
query_start     | 2024-03-27 22:34:41.997276-07
state_change    | 2024-03-27 22:34:41.997307-07
wait_event_type | IO
wait_event      | DataFileExtend
state           | active
backend_xid     | 330896991
backend_xmin    | 330896991
query_id        | -3255287420265634540
query           | PREPARE mml_1 AS INSERT INTO child.un...

The backend was spinning at 100% CPU:

[pryzbyj@telsa2021 ~]$ ps -O wchan,pcpu 7881
   PID WCHAN  %CPU S TTY      TIME COMMAND
  7881 ?      99.4 R ?    08:14:55 postgres: telsasoft ts [local] INSERT

This was postgres 16 STABLE compiled at 14e991db8.

It's possible that this is a rare issue that we haven't hit before.
It's also possible that this is a recent regression; we previously
compiled at b2c9936a7 without hitting any issue (possibly by chance).

I could neither strace the process nor attach a debugger - both got
stuck attaching.  Maybe there's a kernel issue; this is a VM running
centos7 / 3.10.0-1160.49.1.el7.x86_64.

$ awk '{print $14, $15}' /proc/7881/stat # usr / sys
229 3088448

When I tried to shut down postgres (hoping to finally be able to
attach a debugger), it got stuck instead:

$ ps -fu postgres
UID         PID    PPID  C STIME TTY      TIME CMD
postgres   7881  119674 99 mar27 ?    08:38:06 postgres: telsasoft ts [local] INSERT
postgres 119674       1  0 mar25 ?    00:07:13 /usr/pgsql-16/bin/postgres -D /var/lib/pgsql/16/data/
postgres 119676  119674  0 mar25 ?    00:00:11 postgres: logger
postgres 119679  119674  0 mar25 ?    00:11:56 postgres: checkpointer

This first occurred on Mar 27, but I see today that it's now recurring
at a different customer:

backend_start   | 2024-05-05 22:19:17.009477-06
xact_start      | 2024-05-05 22:20:18.129305-06
query_start     | 2024-05-05 22:20:19.409555-06
state_change    | 2024-05-05 22:20:19.409556-06
pid             | 18468
wait_event_type | IO
wait_event      | DataFileExtend
state           | active
backend_xid     | 4236249136
backend_xmin    | 4236221661
query_id        | 2601062835886299431
left            | PREPARE mml_1 AS INSERT INTO chil

This time it's running v16.2 (REL_16_STABLE compiled at b78fa8547),
under centos7 / 3.10.0-1160.66.1.el7.x86_64.  The symptoms are the
same: a backend stuck using 100% CPU in user mode:

[pryzbyj@telsasoft-centos7 ~]$ awk '{print $14, $15}' /proc/18468/stat # usr / sys
364 2541168

This seems to have affected other backends, which are now waiting on
RegisterSyncRequest, frozenid, and CheckpointerComm.

In both cases, the backend got stuck after 10pm, which is when a
backup job kicks off, followed by other DB maintenance.  Our backup
job uses pg_export_snapshot() + pg_dump --snapshot.  In today's case,
the pg_dump would have finished and the snapshot closed at 2024-05-05
22:15.  The backup job did some more pg_dumps involving historic
tables without snapshots and finished at 01:11:40, at which point a
reindex job started; but it looks like the DB was already stuck as far
as reindex was concerned, so the script ended after a handful of
commands were "[canceled] due to statement timeout".
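In case it's useful for spotting this state on other systems: a query along
these lines should list backends that have been sitting in DataFileExtend (or
the other wait events mentioned above) for a while.  This is only a sketch -
the 5-minute threshold is arbitrary, and "ts" is the database name taken from
the ps output above:

$ psql ts -Axc "SELECT pid, now() - state_change AS stuck_for,
       wait_event_type, wait_event, left(query, 40) AS query
    FROM pg_stat_activity
    WHERE state = 'active'
      AND wait_event IN ('DataFileExtend', 'RegisterSyncRequest')
      AND state_change < now() - interval '5 minutes'"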
Full disclosure: the VM that hit this issue today has had storage-level
errors (reported here at ZZqr_GTaHyuW7fLp@pryzbyj2023), as recently as
3 days ago.

Maybe more importantly, this VM also seems to suffer from a memory
leak, and the leaky process was killed shortly after the stuck query
started.  This makes me suspect a race condition which was triggered
while swapping:

May  5 22:24:05 localhost kernel: Out of memory: Kill process 17157 (python3.8) score 134 or sacrifice child

We don't have equally good logs from March, and I'm not aware of any
killed processes on the VM where we hit this then, but it's true that
the I/O there is not fast.

Also, I fibbed when I said these were compiled at 16 STABLE - I'd
backpatched a small number of patches from master:

a97bbe1f1df Reduce branches in heapgetpage()'s per-tuple loop
98f320eb2ef Increase default vacuum_buffer_usage_limit to 2MB.
44086b09753 Preliminary refactor of heap scanning functions
959b38d770b Invent --transaction-size option for pg_restore.
a45c78e3284 Rearrange pg_dump's handling of large objects for better efficiency.
9d1a5354f58 Fix costing bug in MergeAppend
a5cf808be55 Read include/exclude commands for dump/restore from file
8c16ad3b432 Allow using syncfs() in frontend utilities.
cccc6cdeb32 Add support for syncfs() in frontend support functions.
3ed19567198 Make enum for sync methods available to frontend code.
f39b265808b Move PG_TEMP_FILE* macros to file_utils.h.
a14354cac0e Add GUC parameter "huge_pages_status"
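For reference, the awk output above is fields 14 and 15 of /proc/<pid>/stat,
i.e. utime and stime in clock ticks.  A rough sketch for converting those to
seconds of user/system CPU (tick rate from getconf CLK_TCK, typically 100),
and for sampling the kernel side given that strace/gdb hang when attaching -
this assumes the kernel exposes /proc/<pid>/stack and that you're root:

$ hz=$(getconf CLK_TCK)   # clock ticks per second
$ awk -v hz=$hz '{printf "usr=%.0fs sys=%.0fs\n", $14/hz, $15/hz}' /proc/18468/stat
$ for i in $(seq 5); do cat /proc/18468/stack; echo; sleep 1; done   # sample kernel stack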
I will need to restart services this morning, but if someone wants to
suggest diagnostic measures, I will see whether the command gets stuck
or not.

-- 
Justin