Thread: backend stuck in DataFileExtend
In March, I noticed that a backend got stuck overnight doing:

backend_start    | 2024-03-27 22:34:12.774195-07
xact_start       | 2024-03-27 22:34:39.741933-07
query_start      | 2024-03-27 22:34:41.997276-07
state_change     | 2024-03-27 22:34:41.997307-07
wait_event_type  | IO
wait_event       | DataFileExtend
state            | active
backend_xid      | 330896991
backend_xmin     | 330896991
query_id         | -3255287420265634540
query            | PREPARE mml_1 AS INSERT INTO child.un...

The backend was spinning at 100% CPU:

[pryzbyj@telsa2021 ~]$ ps -O wchan,pcpu 7881
  PID WCHAN  %CPU S TTY          TIME COMMAND
 7881 ?      99.4 R ?        08:14:55 postgres: telsasoft ts [local] INSERT

This was postgres 16 STABLE compiled at 14e991db8.

It's possible that this is a rare, pre-existing issue that we simply
hadn't hit before.  It's also possible that it's a recent regression.
We previously compiled at b2c9936a7 without hitting any issue (possibly
by chance).

I could neither strace the process nor attach a debugger; both got stuck
attaching.  Maybe there's a kernel issue.  This is a VM running centos7 /
3.10.0-1160.49.1.el7.x86_64.

$ awk '{print $14, $15}' /proc/7881/stat # usr / sys
229 3088448

When I tried to shut down postgres (hoping to finally be able to attach
a debugger), it got stuck instead:

$ ps -fu postgres
UID         PID    PPID  C STIME TTY          TIME CMD
postgres   7881  119674 99 mar27 ?        08:38:06 postgres: telsasoft ts [local] INSERT
postgres 119674       1  0 mar25 ?        00:07:13 /usr/pgsql-16/bin/postgres -D /var/lib/pgsql/16/data/
postgres 119676  119674  0 mar25 ?        00:00:11 postgres: logger
postgres 119679  119674  0 mar25 ?        00:11:56 postgres: checkpointer

This first occurred on Mar 27, but I see today that it's now recurring
at a different customer:

backend_start    | 2024-05-05 22:19:17.009477-06
xact_start       | 2024-05-05 22:20:18.129305-06
query_start      | 2024-05-05 22:20:19.409555-06
state_change     | 2024-05-05 22:20:19.409556-06
pid              | 18468
wait_event_type  | IO
wait_event       | DataFileExtend
state            | active
backend_xid      | 4236249136
backend_xmin     | 4236221661
query_id         | 2601062835886299431
left             | PREPARE mml_1 AS INSERT INTO chil

This time it's running v16.2 (REL_16_STABLE compiled at b78fa8547), under
centos7 / 3.10.0-1160.66.1.el7.x86_64.  The symptoms are the same:
backend stuck using 100% CPU in user mode:

[pryzbyj@telsasoft-centos7 ~]$ awk '{print $14, $15}' /proc/18468/stat # usr / sys
364 2541168

This seems to have affected other backends, which are now waiting on
RegisterSyncRequest, frozenid, and CheckpointerComm.

In both cases, the backend got stuck after 10pm, which is when a backup
job kicks off, followed by other DB maintenance.  Our backup job uses
pg_export_snapshot() + pg_dump --snapshot.  In today's case, the pg_dump
would've finished and the snapshot closed at 2024-05-05 22:15.  The
backup job did some more pg_dumps involving historic tables without
snapshots and finished at 01:11:40, at which point a reindex job started,
but it looks like the DB was already stuck as far as reindex was
concerned, so the script ended after a handful of commands were
"[canceled] due to statement timeout".

Full disclosure: the VM that hit this issue today has had storage-level
errors (reported here at ZZqr_GTaHyuW7fLp@pryzbyj2023), as recently as 3
days ago.

Maybe more importantly, this VM also seems to suffer from some memory
leak, and the leaky process was killed shortly after the stuck query
started.  This makes me suspect a race condition that was triggered
while swapping:

May  5 22:24:05 localhost kernel: Out of memory: Kill process 17157 (python3.8) score 134 or sacrifice child

We don't have logs as good from March, and I'm not aware of any killed
processes on the VM where we hit this then, but it's true that the I/O
there is not fast.

Also, I fibbed when I said these were compiled at 16 STABLE - I'd
backpatched a small number of patches from master:

a97bbe1f1df Reduce branches in heapgetpage()'s per-tuple loop
98f320eb2ef Increase default vacuum_buffer_usage_limit to 2MB.
44086b09753 Preliminary refactor of heap scanning functions
959b38d770b Invent --transaction-size option for pg_restore.
a45c78e3284 Rearrange pg_dump's handling of large objects for better efficiency.
9d1a5354f58 Fix costing bug in MergeAppend
a5cf808be55 Read include/exclude commands for dump/restore from file
8c16ad3b432 Allow using syncfs() in frontend utilities.
cccc6cdeb32 Add support for syncfs() in frontend support functions.
3ed19567198 Make enum for sync methods available to frontend code.
f39b265808b Move PG_TEMP_FILE* macros to file_utils.h.
a14354cac0e Add GUC parameter "huge_pages_status"

I will need to restart services this morning, but if someone wants to
suggest diagnostic measures, I will see whether the command gets stuck
or not.

--
Justin
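For reference, records like the ones above come from pg_stat_activity; a
query along these lines (a sketch - the database name "ts" and the
ten-minute threshold are only illustrative) lists backends that have been
sitting in an IO wait for a while:

    psql -d ts -x -c "
      SELECT pid, backend_type, wait_event_type, wait_event, state,
             backend_start, xact_start, query_start, state_change,
             backend_xid, backend_xmin, query_id, left(query, 33)
      FROM pg_stat_activity
      WHERE wait_event_type = 'IO'
        AND state_change < now() - interval '10 minutes';"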
Hi,

On 2024-05-06 09:05:38 -0500, Justin Pryzby wrote:
> In March, I noticed that a backend got stuck overnight doing:
>
> backend_start    | 2024-03-27 22:34:12.774195-07
> xact_start       | 2024-03-27 22:34:39.741933-07
> query_start      | 2024-03-27 22:34:41.997276-07
> state_change     | 2024-03-27 22:34:41.997307-07
> wait_event_type  | IO
> wait_event       | DataFileExtend
> state            | active
> backend_xid      | 330896991
> backend_xmin     | 330896991
> query_id         | -3255287420265634540
> query            | PREPARE mml_1 AS INSERT INTO child.un...
>
> The backend was spinning at 100% CPU:
>
> [pryzbyj@telsa2021 ~]$ ps -O wchan,pcpu 7881
>   PID WCHAN  %CPU S TTY          TIME COMMAND
>  7881 ?      99.4 R ?        08:14:55 postgres: telsasoft ts [local] INSERT
>
> This was postgres 16 STABLE compiled at 14e991db8.
>
> It's possible that this is a rare, pre-existing issue that we simply
> hadn't hit before.  It's also possible that it's a recent regression.
> We previously compiled at b2c9936a7 without hitting any issue (possibly
> by chance).
>
> I could neither strace the process nor attach a debugger; both got stuck
> attaching.  Maybe there's a kernel issue.  This is a VM running centos7 /
> 3.10.0-1160.49.1.el7.x86_64.
>
> $ awk '{print $14, $15}' /proc/7881/stat # usr / sys
> 229 3088448
>
> When I tried to shut down postgres (hoping to finally be able to attach
> a debugger), it got stuck instead:

That strongly indicates either a kernel bug or storage having an issue.  It
can't be postgres' fault if an IO never completes.

What does /proc/$pid/stack say?

> In both cases, the backend got stuck after 10pm, which is when a backup
> job kicks off, followed by other DB maintenance.  Our backup job uses
> pg_export_snapshot() + pg_dump --snapshot.  In today's case, the pg_dump
> would've finished and the snapshot closed at 2024-05-05 22:15.  The
> backup job did some more pg_dumps involving historic tables without
> snapshots and finished at 01:11:40, at which point a reindex job started,
> but it looks like the DB was already stuck as far as reindex was
> concerned, so the script ended after a handful of commands were
> "[canceled] due to statement timeout".

Is it possible that you're "just" waiting for very slow IO?  Is there a lot
of dirty memory?  Particularly on these old kernels that can lead to very
extreme delays.

grep -Ei 'dirty|writeback' /proc/meminfo

> [...]
> Full disclosure: the VM that hit this issue today has had storage-level
> errors (reported here at ZZqr_GTaHyuW7fLp@pryzbyj2023), as recently as 3
> days ago.

So indeed, my suspicion from above is confirmed.

Greetings,

Andres Freund
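A single sample only shows the instantaneous state; one way to see whether
writeback is making any progress is to watch the counters for a while (a
minimal sketch; the interval is arbitrary):

    while sleep 5; do
        date '+%T'
        grep -E 'Dirty|Writeback' /proc/meminfo
    done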
On Mon, May 06, 2024 at 10:04:13AM -0700, Andres Freund wrote:
> Hi,
>
> On 2024-05-06 09:05:38 -0500, Justin Pryzby wrote:
> > In March, I noticed that a backend got stuck overnight doing:
> >
> > backend_start    | 2024-03-27 22:34:12.774195-07
> > xact_start       | 2024-03-27 22:34:39.741933-07
> > query_start      | 2024-03-27 22:34:41.997276-07
> > state_change     | 2024-03-27 22:34:41.997307-07
> > wait_event_type  | IO
> > wait_event       | DataFileExtend
> > state            | active
> > backend_xid      | 330896991
> > backend_xmin     | 330896991
> > query_id         | -3255287420265634540
> > query            | PREPARE mml_1 AS INSERT INTO child.un...
> >
> > The backend was spinning at 100% CPU:
> >
> > [pryzbyj@telsa2021 ~]$ ps -O wchan,pcpu 7881
> >   PID WCHAN  %CPU S TTY          TIME COMMAND
> >  7881 ?      99.4 R ?        08:14:55 postgres: telsasoft ts [local] INSERT
> >
> > This was postgres 16 STABLE compiled at 14e991db8.
> >
> > It's possible that this is a rare, pre-existing issue that we simply
> > hadn't hit before.  It's also possible that it's a recent regression.
> > We previously compiled at b2c9936a7 without hitting any issue (possibly
> > by chance).
> >
> > I could neither strace the process nor attach a debugger; both got stuck
> > attaching.  Maybe there's a kernel issue.  This is a VM running centos7 /
> > 3.10.0-1160.49.1.el7.x86_64.
> >
> > $ awk '{print $14, $15}' /proc/7881/stat # usr / sys
> > 229 3088448
> >
> > When I tried to shut down postgres (hoping to finally be able to attach
> > a debugger), it got stuck instead:
>
> That strongly indicates either a kernel bug or storage having an issue.  It
> can't be postgres' fault if an IO never completes.

Is that for sure, even though wchan=? (which I take to mean "not in a
system call") and the process is stuck in user mode?

> What does /proc/$pid/stack say?

[pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/18468/stat
18468 (postgres) R 2274 18468 18468 0 -1 4857920 91836 0 3985 0 364 3794271 0 0 20 0 1 0 6092292660 941846528 10 18446744073709551615 4194304 12848820 140732995870240 140732995857304 139726958536394 0 4194304 19929088 536896135 0 0 0 173 0 0 1682 0 0 14949632 15052146 34668544 140732995874457 140732995874511 140732995874511 140732995874781 0

> > In both cases, the backend got stuck after 10pm, which is when a backup
> > job kicks off, followed by other DB maintenance.  Our backup job uses
> > pg_export_snapshot() + pg_dump --snapshot.  In today's case, the pg_dump
> > would've finished and the snapshot closed at 2024-05-05 22:15.  The
> > backup job did some more pg_dumps involving historic tables without
> > snapshots and finished at 01:11:40, at which point a reindex job started,
> > but it looks like the DB was already stuck as far as reindex was
> > concerned, so the script ended after a handful of commands were
> > "[canceled] due to statement timeout".
>
> Is it possible that you're "just" waiting for very slow IO?  Is there a lot
> of dirty memory?  Particularly on these old kernels that can lead to very
> extreme delays.
>
> grep -Ei 'dirty|writeback' /proc/meminfo

[pryzbyj@telsasoft-centos7 ~]$ grep -Ei 'dirty|writeback' /proc/meminfo
Dirty:                28 kB
Writeback:             0 kB
WritebackTmp:          0 kB

> > Full disclosure: the VM that hit this issue today has had storage-level
> > errors (reported here at ZZqr_GTaHyuW7fLp@pryzbyj2023), as recently as 3
> > days ago.
>
> So indeed, my suspicion from above is confirmed.

I'd be fine with that conclusion (as in the earlier thread), except this
has now happened on 2 different VMs, and the first one has no I/O
issues.  If this were another symptom of a storage failure, and hadn't
previously happened on another VM, I wouldn't be re-reporting it.

--
Justin
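As an aside, the two numbers printed by that awk command are utime and
stime from /proc/<pid>/stat, counted in clock ticks; a rough conversion to
seconds looks like this (a sketch using the pid of the May backend; the
field numbers assume the comm field contains no spaces, which holds for
"postgres"):

    pid=18468
    hz=$(getconf CLK_TCK)        # ticks per second, typically 100
    read -r utime stime <<< "$(awk '{print $14, $15}' /proc/$pid/stat)"
    echo "user: $((utime / hz))s  system: $((stime / hz))s"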
Hi,

On 2024-05-06 12:37:26 -0500, Justin Pryzby wrote:
> On Mon, May 06, 2024 at 10:04:13AM -0700, Andres Freund wrote:
> > Hi,
> >
> > On 2024-05-06 09:05:38 -0500, Justin Pryzby wrote:
> > > In March, I noticed that a backend got stuck overnight doing:
> > >
> > > backend_start    | 2024-03-27 22:34:12.774195-07
> > > xact_start       | 2024-03-27 22:34:39.741933-07
> > > query_start      | 2024-03-27 22:34:41.997276-07
> > > state_change     | 2024-03-27 22:34:41.997307-07
> > > wait_event_type  | IO
> > > wait_event       | DataFileExtend
> > > state            | active
> > > backend_xid      | 330896991
> > > backend_xmin     | 330896991
> > > query_id         | -3255287420265634540
> > > query            | PREPARE mml_1 AS INSERT INTO child.un...
> > >
> > > The backend was spinning at 100% CPU:
> > >
> > > [pryzbyj@telsa2021 ~]$ ps -O wchan,pcpu 7881
> > >   PID WCHAN  %CPU S TTY          TIME COMMAND
> > >  7881 ?      99.4 R ?        08:14:55 postgres: telsasoft ts [local] INSERT
> > >
> > > This was postgres 16 STABLE compiled at 14e991db8.
> > >
> > > It's possible that this is a rare, pre-existing issue that we simply
> > > hadn't hit before.  It's also possible that it's a recent regression.
> > > We previously compiled at b2c9936a7 without hitting any issue (possibly
> > > by chance).
> > >
> > > I could neither strace the process nor attach a debugger; both got stuck
> > > attaching.  Maybe there's a kernel issue.  This is a VM running centos7 /
> > > 3.10.0-1160.49.1.el7.x86_64.
> > >
> > > $ awk '{print $14, $15}' /proc/7881/stat # usr / sys
> > > 229 3088448
> > >
> > > When I tried to shut down postgres (hoping to finally be able to attach
> > > a debugger), it got stuck instead:
> >
> > That strongly indicates either a kernel bug or storage having an issue.  It
> > can't be postgres' fault if an IO never completes.
>
> Is that for sure, even though wchan=? (which I take to mean "not in a
> system call") and the process is stuck in user mode?

Postgres doesn't do anything to prevent a debugger from working, so this is
just indicative that the kernel is stuck somewhere that it didn't set up
information about being blocked - because it's busy doing something.

> > What does /proc/$pid/stack say?
>
> [pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/18468/stat
> 18468 (postgres) R 2274 18468 18468 0 -1 4857920 91836 0 3985 0 364 3794271 0 0 20 0 1 0 6092292660 941846528 10 18446744073709551615 4194304 12848820 140732995870240 140732995857304 139726958536394 0 4194304 19929088 536896135 0 0 0 173 0 0 1682 0 0 14949632 15052146 34668544 140732995874457 140732995874511 140732995874511 140732995874781 0

stack, not stat...

> > > Full disclosure: the VM that hit this issue today has had storage-level
> > > errors (reported here at ZZqr_GTaHyuW7fLp@pryzbyj2023), as recently as 3
> > > days ago.
> >
> > So indeed, my suspicion from above is confirmed.
>
> I'd be fine with that conclusion (as in the earlier thread), except this
> has now happened on 2 different VMs, and the first one has no I/O
> issues.  If this were another symptom of a storage failure, and hadn't
> previously happened on another VM, I wouldn't be re-reporting it.

Is it the same VM hosting environment? And the same old distro?

Greetings,

Andres Freund
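One way to grab the stack along with a few related /proc files in a single
pass (a sketch; /proc/<pid>/stack generally requires root, and the pid is
the stuck backend from above):

    pid=18468
    for f in wchan stack status syscall; do
        echo "== /proc/$pid/$f =="
        sudo cat /proc/$pid/$f
    done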
On Mon, May 06, 2024 at 10:51:08AM -0700, Andres Freund wrote:
> Hi,
>
> On 2024-05-06 12:37:26 -0500, Justin Pryzby wrote:
> > On Mon, May 06, 2024 at 10:04:13AM -0700, Andres Freund wrote:
> > > Hi,
> > >
> > > On 2024-05-06 09:05:38 -0500, Justin Pryzby wrote:
> > > > In March, I noticed that a backend got stuck overnight doing:
> > > >
> > > > backend_start    | 2024-03-27 22:34:12.774195-07
> > > > xact_start       | 2024-03-27 22:34:39.741933-07
> > > > query_start      | 2024-03-27 22:34:41.997276-07
> > > > state_change     | 2024-03-27 22:34:41.997307-07
> > > > wait_event_type  | IO
> > > > wait_event       | DataFileExtend
> > > > state            | active
> > > > backend_xid      | 330896991
> > > > backend_xmin     | 330896991
> > > > query_id         | -3255287420265634540
> > > > query            | PREPARE mml_1 AS INSERT INTO child.un...
> > > >
> > > > The backend was spinning at 100% CPU:
> > > >
> > > > [pryzbyj@telsa2021 ~]$ ps -O wchan,pcpu 7881
> > > >   PID WCHAN  %CPU S TTY          TIME COMMAND
> > > >  7881 ?      99.4 R ?        08:14:55 postgres: telsasoft ts [local] INSERT
> > > >
> > > > This was postgres 16 STABLE compiled at 14e991db8.
> > > >
> > > > It's possible that this is a rare, pre-existing issue that we simply
> > > > hadn't hit before.  It's also possible that it's a recent regression.
> > > > We previously compiled at b2c9936a7 without hitting any issue (possibly
> > > > by chance).
> > > >
> > > > I could neither strace the process nor attach a debugger; both got stuck
> > > > attaching.  Maybe there's a kernel issue.  This is a VM running centos7 /
> > > > 3.10.0-1160.49.1.el7.x86_64.
> > > >
> > > > $ awk '{print $14, $15}' /proc/7881/stat # usr / sys
> > > > 229 3088448
> > > >
> > > > When I tried to shut down postgres (hoping to finally be able to attach
> > > > a debugger), it got stuck instead:
> > >
> > > That strongly indicates either a kernel bug or storage having an issue.  It
> > > can't be postgres' fault if an IO never completes.
> >
> > Is that for sure, even though wchan=? (which I take to mean "not in a
> > system call") and the process is stuck in user mode?
>
> Postgres doesn't do anything to prevent a debugger from working, so this is
> just indicative that the kernel is stuck somewhere that it didn't set up
> information about being blocked - because it's busy doing something.
>
> > > What does /proc/$pid/stack say?
> >
> > [pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/18468/stat
> > 18468 (postgres) R 2274 18468 18468 0 -1 4857920 91836 0 3985 0 364 3794271 0 0 20 0 1 0 6092292660 941846528 10 18446744073709551615 4194304 12848820 140732995870240 140732995857304 139726958536394 0 4194304 19929088 536896135 0 0 0 173 0 0 1682 0 0 14949632 15052146 34668544 140732995874457 140732995874511 140732995874511 140732995874781 0
>
> stack, not stat...

Ah, that is illuminating - thanks.

[pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/18468/stack
[<ffffffffaa2d7856>] __cond_resched+0x26/0x30
[<ffffffffc10af35e>] dbuf_rele+0x1e/0x40 [zfs]
[<ffffffffc10bb730>] dmu_buf_rele_array.part.6+0x40/0x70 [zfs]
[<ffffffffc10bd96a>] dmu_write_uio_dnode+0x11a/0x180 [zfs]
[<ffffffffc10bda24>] dmu_write_uio_dbuf+0x54/0x70 [zfs]
[<ffffffffc11abd1b>] zfs_write+0xb9b/0xfb0 [zfs]
[<ffffffffc11ed202>] zpl_aio_write+0x152/0x1a0 [zfs]
[<ffffffffaa44dadb>] do_sync_readv_writev+0x7b/0xd0
[<ffffffffaa44f62e>] do_readv_writev+0xce/0x260
[<ffffffffaa44f855>] vfs_writev+0x35/0x60
[<ffffffffaa44fc12>] SyS_pwritev+0xc2/0xf0
[<ffffffffaa999f92>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff

FWIW: both are running zfs-2.2.3 RPMs from zfsonlinux.org.

It's surely possible that there's an issue that affects older kernels but
not more recent ones.

> > > > Full disclosure: the VM that hit this issue today has had storage-level
> > > > errors (reported here at ZZqr_GTaHyuW7fLp@pryzbyj2023), as recently as 3
> > > > days ago.
> > >
> > > So indeed, my suspicion from above is confirmed.
> >
> > I'd be fine with that conclusion (as in the earlier thread), except this
> > has now happened on 2 different VMs, and the first one has no I/O
> > issues.  If this were another symptom of a storage failure, and hadn't
> > previously happened on another VM, I wouldn't be re-reporting it.
>
> Is it the same VM hosting environment? And the same old distro?

Yes, they're running centos7 with the indicated kernels.

dmidecode shows they're both running:
Product Name: VMware Virtual Platform

But they're different customers, so I'd be somewhat surprised if they're
running the same version of the hypervisor.

--
Justin
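To make the two incidents easier to compare, it may help to record the
exact kernel, ZFS and platform versions on each VM; a sketch (the rpm
package names are an assumption based on the zfsonlinux.org el7 repo):

    uname -r
    cat /sys/module/zfs/version 2>/dev/null
    rpm -q zfs kmod-zfs 2>/dev/null
    sudo dmidecode -s system-product-name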
On Tue, May 7, 2024 at 6:21 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> FWIW: both are running zfs-2.2.3 RPMs from zfsonlinux.org.
...
> Yes, they're running centos7 with the indicated kernels.

So far we've got:

 * spurious EIO when opening a file (your previous report)
 * hanging with CPU spinning (?) inside pwritev()
 * old kernel, bleeding edge ZFS

From an (uninformed) peek at the ZFS code, if it really is spinning then
it seems like a pretty low-level problem: it has finished the write, and
now it's just trying to release (something like our unpin) and unlock the
buffers, which involves various code paths that might touch various
mutexes and spinlocks; to get stuck like that, I guess it has either
corrupted itself or it is deadlocking against something else, but what?
Do you see any other processes (including kernel threads) with any stuck
stacks that might be a deadlock partner?

While looking around for reported issues I found your abandoned report
against an older ZFS version from a few years ago, same old Linux
version:

https://github.com/openzfs/zfs/issues/11641

I don't know enough to say anything useful about that but it certainly
smells similar...

I see you've been busy reporting lots of issues, which seem to involve
big data, big "recordsize" (= ZFS block sizes), compression and
PostgreSQL:

https://github.com/openzfs/zfs/issues?q=is%3Aissue+author%3Ajustinpryzby
On Tue, May 07, 2024 at 10:55:28AM +1200, Thomas Munro wrote:
> On Tue, May 7, 2024 at 6:21 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > FWIW: both are running zfs-2.2.3 RPMs from zfsonlinux.org.
> ...
> > Yes, they're running centos7 with the indicated kernels.
>
> So far we've got:
>
>  * spurious EIO when opening a file (your previous report)
>  * hanging with CPU spinning (?) inside pwritev()
>  * old kernel, bleeding edge ZFS
>
> From an (uninformed) peek at the ZFS code, if it really is spinning then
> it seems like a pretty low-level problem: it has finished the write, and
> now it's just trying to release (something like our unpin) and unlock the
> buffers, which involves various code paths that might touch various
> mutexes and spinlocks; to get stuck like that, I guess it has either
> corrupted itself or it is deadlocking against something else, but what?
> Do you see any other processes (including kernel threads) with any stuck
> stacks that might be a deadlock partner?

Sorry - after forgetting several times, by the time I finally remembered
to go back to this issue, I'd already rebooted the VM (as needed to kill
the stuck process).  But it seems to have recurred again this AM.

Note that yesterday, I'd taken the opportunity to upgrade to zfs-2.2.4.

These two procs are the oldest active postgres procs, and also (now)
adjacent in ps -ef --sort start_time.

-[ RECORD 1 ]----+----------------------------------
backend_start    | 2024-05-07 09:45:06.228528-06
application_name |
xact_start       | 2024-05-07 09:55:38.409549-06
query_start      | 2024-05-07 09:55:38.409549-06
state_change     | 2024-05-07 09:55:38.409549-06
pid              | 27449
backend_type     | autovacuum worker
wait_event_type  | BufferPin
wait_event       | BufferPin
state            | active
backend_xid      |
backend_xmin     | 4293757489
query_id         |
left             | autovacuum: VACUUM ANALYZE child.
-[ RECORD 2 ]----+----------------------------------
backend_start    | 2024-05-07 09:49:24.686314-06
application_name | MasterMetricsLoader -n -m Xml
xact_start       | 2024-05-07 09:50:30.387156-06
query_start      | 2024-05-07 09:50:32.744435-06
state_change     | 2024-05-07 09:50:32.744436-06
pid              | 5051
backend_type     | client backend
wait_event_type  | IO
wait_event       | DataFileExtend
state            | active
backend_xid      | 4293757489
backend_xmin     | 4293757429
query_id         | 2230046159014513529
left             | PREPARE mml_0 AS INSERT INTO chil

The earlier proc is doing:

strace: Process 27449 attached
epoll_wait(11, ^Cstrace: Process 27449 detached
 <detached ...>

The later process is stuck, with:

[pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/5051/stack
[<ffffffffffffffff>] 0xffffffffffffffff

For good measure:

[pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/27433/stack
[<ffffffffc0600c2e>] taskq_thread+0x48e/0x4e0 [spl]
[<ffffffff9eec5f91>] kthread+0xd1/0xe0
[<ffffffff9f599df7>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff
[pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/27434/stack
[<ffffffffc0600c2e>] taskq_thread+0x48e/0x4e0 [spl]
[<ffffffff9eec5f91>] kthread+0xd1/0xe0
[<ffffffff9f599df7>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff

[pryzbyj@telsasoft-centos7 ~]$ ps -O wchan================ 5051 27449
  PID ================ S TTY          TIME COMMAND
 5051 ?                R ?        02:14:27 postgres: telsasoft ts ::1(53708) INSERT
27449 ep_poll          S ?        00:05:16 postgres: autovacuum worker   ts

The interesting procs might be:

ps -eO wchan===============,lstart --sort start_time
...
15632 worker_thread   Mon May  6 23:51:34 2024 S ?        00:00:00 [kworker/2:2H]
27433 taskq_thread    Tue May  7 09:35:59 2024 S ?        00:00:56 [z_wr_iss]
27434 taskq_thread    Tue May  7 09:35:59 2024 S ?        00:00:57 [z_wr_iss]
27449 ep_poll         Tue May  7 09:45:05 2024 S ?        00:05:16 postgres: autovacuum worker   ts
 5051 ?               Tue May  7 09:49:23 2024 R ?        02:23:04 postgres: telsasoft ts ::1(53708) INSERT
 7861 ep_poll         Tue May  7 09:51:25 2024 S ?        00:03:04 /usr/local/bin/python3.8 -u /home/telsasoft/server/alarms/core/pr...
 7912 ep_poll         Tue May  7 09:51:27 2024 S ?        00:00:00 postgres: telsasoft ts ::1(53794) idle
24518 futex_wait_que  Tue May  7 10:42:56 2024 S ?        00:00:55 java -jar /home/telsasoft/server/alarms/alcatel_lucent/jms/jms2rm...
...

> While looking around for reported issues I found your abandoned report
> against an older ZFS version from a few years ago, same old Linux
> version:
>
> https://github.com/openzfs/zfs/issues/11641
>
> I don't know enough to say anything useful about that but it certainly
> smells similar...

Wow - I'd completely forgotten about that problem report.
The symptoms are the same, even with a zfs version 3+ years newer.
I wish the ZFS people would do more with their problem reports.

BTW, we'll be upgrading this VM to a newer kernel, if not a newer OS (for
some reason, these projects take a very long time).  With any luck, it'll
either recur, or not.

I'm not sure if any of that is useful, or interesting.

--
Justin
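To follow up on the deadlock-partner question, something along these lines
dumps the kernel stacks of the stuck backend and the ZFS writer taskq
threads together (a sketch using the pids from the output above;
/proc/<pid>/stack needs root):

    for pid in 5051 $(pgrep z_wr_iss); do
        echo "== $pid $(cat /proc/$pid/comm 2>/dev/null) =="
        sudo cat /proc/$pid/stack
    done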
On Wed, May 8, 2024 at 6:54 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Tue, May 07, 2024 at 10:55:28AM +1200, Thomas Munro wrote:
> > https://github.com/openzfs/zfs/issues/11641
> >
> > I don't know enough to say anything useful about that but it certainly
> > smells similar...
>
> Wow - I'd completely forgotten about that problem report.
> The symptoms are the same, even with a zfs version 3+ years newer.
> I wish the ZFS people would do more with their problem reports.

If I had to guess, my first idea would be that your 1MB or ginormous 16MB
recordsize (a relatively new option) combined with PostgreSQL's 8KB
block-at-a-time random order I/O patterns are tickling strange corners and
finding a bug that no one has seen before.  I would imagine that almost
everyone in the galaxy who uses very large records does so with 'settled'
data that gets streamed out once sequentially (for example, I think some
of the OpenZFS maintainers are at Lawrence Livermore National Lab, where I
guess they might pump around petabytes of data produced by particle
physics research or whatever it might be, probably why they are also
adding direct I/O to avoid caches completely...).  But for us, if we have
lots of backends reading, writing and extending random 8KB fragments of a
16MB record concurrently (2048 pages per record!), maybe we hit some
broken edge...  I'd be sure to include that sort of detail in any future
reports.

Normally I suppress urges to blame problems on kernels, file systems etc.,
and in the past accusations that ZFS was buggy turned out to be bugs in
PostgreSQL IIRC, but user space sure seems to be off the hook for this
one...
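If it helps with a future report, the record size and related settings of
the dataset holding the data directory can be captured like this (a
sketch; the path is the one from the ps output earlier in the thread):

    dataset=$(df --output=source /var/lib/pgsql/16/data | tail -n 1)
    zfs get recordsize,compression,logbias "$dataset"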