Re: backend stuck in DataFileExtend - Mailing list pgsql-hackers
From | Justin Pryzby |
---|---|
Subject | Re: backend stuck in DataFileExtend |
Date | |
Msg-id | Zjp42NK4abklf6t5@pryzbyj2023 |
In response to | Re: backend stuck in DataFileExtend (Thomas Munro <thomas.munro@gmail.com>) |
Responses | Re: backend stuck in DataFileExtend |
List | pgsql-hackers |
On Tue, May 07, 2024 at 10:55:28AM +1200, Thomas Munro wrote:
> On Tue, May 7, 2024 at 6:21 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > FWIW: both are running zfs-2.2.3 RPMs from zfsonlinux.org.
> ...
> > Yes, they're running centos7 with the indicated kernels.
>
> So far we've got:
>
> * spurious EIO when opening a file (your previous report)
> * hanging with CPU spinning (?) inside pwritev()
> * old kernel, bleeding edge ZFS
>
> From an (uninformed) peek at the ZFS code, if it really is spinning
> there it seems like a pretty low level problem: it's finished the
> write, and now is just trying to release (something like our unpin)
> and unlock the buffers, which involves various code paths that might
> touch various mutexes and spinlocks, and to get stuck like that I
> guess it's either corrupted itself or it is deadlocking against
> something else, but what?  Do you see any other processes (including
> kernel threads) with any stuck stacks that might be a deadlock partner?

Sorry, but even after forgetting several times, I finally remembered to
go back to the issue, but had already rebooted the VM as needed to kill
the stuck process.

But .. it seems to have recurred again this AM.  Note that yesterday,
I'd taken the opportunity to upgrade to zfs-2.2.4.

These two procs are the oldest active postgres procs, and are also (now)
adjacent in ps -ef --sort start_time.

-[ RECORD 1 ]----+----------------------------------------------------------------
backend_start    | 2024-05-07 09:45:06.228528-06
application_name |
xact_start       | 2024-05-07 09:55:38.409549-06
query_start      | 2024-05-07 09:55:38.409549-06
state_change     | 2024-05-07 09:55:38.409549-06
pid              | 27449
backend_type     | autovacuum worker
wait_event_type  | BufferPin
wait_event       | BufferPin
state            | active
backend_xid      |
backend_xmin     | 4293757489
query_id         |
left             | autovacuum: VACUUM ANALYZE child.
-[ RECORD 2 ]----+----------------------------------------------------------------
backend_start    | 2024-05-07 09:49:24.686314-06
application_name | MasterMetricsLoader -n -m Xml
xact_start       | 2024-05-07 09:50:30.387156-06
query_start      | 2024-05-07 09:50:32.744435-06
state_change     | 2024-05-07 09:50:32.744436-06
pid              | 5051
backend_type     | client backend
wait_event_type  | IO
wait_event       | DataFileExtend
state            | active
backend_xid      | 4293757489
backend_xmin     | 4293757429
query_id         | 2230046159014513529
left             | PREPARE mml_0 AS INSERT INTO chil

The earlier proc is doing:

strace: Process 27449 attached
epoll_wait(11, ^Cstrace: Process 27449 detached
 <detached ...>

The later process is stuck, with:

[pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/5051/stack
[<ffffffffffffffff>] 0xffffffffffffffff

For good measure:

[pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/27433/stack
[<ffffffffc0600c2e>] taskq_thread+0x48e/0x4e0 [spl]
[<ffffffff9eec5f91>] kthread+0xd1/0xe0
[<ffffffff9f599df7>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff

[pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/27434/stack
[<ffffffffc0600c2e>] taskq_thread+0x48e/0x4e0 [spl]
[<ffffffff9eec5f91>] kthread+0xd1/0xe0
[<ffffffff9f599df7>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff

[pryzbyj@telsasoft-centos7 ~]$ ps -O wchan================ 5051 27449
  PID =============== S TTY          TIME COMMAND
 5051 ?               R ?        02:14:27 postgres: telsasoft ts ::1(53708) INSERT
27449 ep_poll         S ?        00:05:16 postgres: autovacuum worker   ts

The interesting procs might be:

ps -eO wchan===============,lstart --sort start_time
...
15632 worker_thread   Mon May  6 23:51:34 2024 S ?        00:00:00 [kworker/2:2H]
27433 taskq_thread    Tue May  7 09:35:59 2024 S ?        00:00:56 [z_wr_iss]
27434 taskq_thread    Tue May  7 09:35:59 2024 S ?        00:00:57 [z_wr_iss]
27449 ep_poll         Tue May  7 09:45:05 2024 S ?        00:05:16 postgres: autovacuum worker   ts
 5051 ?               Tue May  7 09:49:23 2024 R ?        02:23:04 postgres: telsasoft ts ::1(53708) INSERT
 7861 ep_poll         Tue May  7 09:51:25 2024 S ?        00:03:04 /usr/local/bin/python3.8 -u /home/telsasoft/server/alarms/core/pr...
 7912 ep_poll         Tue May  7 09:51:27 2024 S ?        00:00:00 postgres: telsasoft ts ::1(53794) idle
24518 futex_wait_que  Tue May  7 10:42:56 2024 S ?        00:00:55 java -jar /home/telsasoft/server/alarms/alcatel_lucent/jms/jms2rm...
...

> While looking around for reported issues I found your abandoned report
> against an older ZFS version from a few years ago, same old Linux
> version:
>
> https://github.com/openzfs/zfs/issues/11641
>
> I don't know enough to say anything useful about that but it certainly
> smells similar...

Wow - I'd completely forgotten about that problem report.  The symptoms
are the same, even with a zfs version 3+ years newer.  I wish the ZFS
people would do more with their problem reports.

BTW, we'll be upgrading this VM to a newer kernel, if not a newer OS
(for some reason, these projects take a very long time).  With any luck,
it'll either recur, or not.

I'm not sure if any of that is useful, or interesting.

--
Justin
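For reference, the expanded pg_stat_activity records shown above can be
produced with a query along these lines under psql's \x mode; the column
list mirrors the output shown, while the WHERE and ORDER BY here are only
an assumption about how the two backends were selected, not the exact
query that was run:

SELECT backend_start, application_name, xact_start, query_start,
       state_change, pid, backend_type, wait_event_type, wait_event,
       state, backend_xid, backend_xmin, query_id, left(query, 33)
FROM pg_stat_activity
WHERE state = 'active'      -- both stuck backends report state=active
ORDER BY backend_start      -- oldest active backends first
LIMIT 2;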