Re: backend stuck in DataFileExtend - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: backend stuck in DataFileExtend
Date
Msg-id ZjkfklbJXs9cuIDl@pryzbyj2023
Whole thread Raw
In response to Re: backend stuck in DataFileExtend  (Andres Freund <andres@anarazel.de>)
Responses Re: backend stuck in DataFileExtend
List pgsql-hackers
On Mon, May 06, 2024 at 10:51:08AM -0700, Andres Freund wrote:
> Hi,
> 
> On 2024-05-06 12:37:26 -0500, Justin Pryzby wrote:
> > On Mon, May 06, 2024 at 10:04:13AM -0700, Andres Freund wrote:
> > > Hi,
> > >
> > > On 2024-05-06 09:05:38 -0500, Justin Pryzby wrote:
> > > > In March, I noticed that a backend got stuck overnight doing:
> > > >
> > > > backend_start    | 2024-03-27 22:34:12.774195-07
> > > > xact_start       | 2024-03-27 22:34:39.741933-07
> > > > query_start      | 2024-03-27 22:34:41.997276-07
> > > > state_change     | 2024-03-27 22:34:41.997307-07
> > > > wait_event_type  | IO
> > > > wait_event       | DataFileExtend
> > > > state            | active
> > > > backend_xid      | 330896991
> > > > backend_xmin     | 330896991
> > > > query_id         | -3255287420265634540
> > > > query            | PREPARE mml_1 AS INSERT INTO child.un...
> > > >
> > > > The backend was spinning at 100% CPU:
> > > >
> > > > [pryzbyj@telsa2021 ~]$ ps -O wchan,pcpu 7881
> > > >    PID WCHAN  %CPU S TTY          TIME COMMAND
> > > >   7881 ?      99.4 R ?        08:14:55 postgres: telsasoft ts [local] INSERT
> > > >
> > > > This was postgres 16 STABLE compiled at 14e991db8.
> > > >
> > > > It's possible that this is a rare issue that we haven't hit before.
> > > > It's also possible this this is a recent regression.  We previously
> > > > compiled at b2c9936a7 without hitting any issue (possibly by chance).
> > > >
> > > > I could neither strace the process nor attach a debugger.  They got
> > > > stuck attaching.  Maybe it's possible there's a kernel issue.  This is a
> > > > VM running centos7 / 3.10.0-1160.49.1.el7.x86_64.
> > >
> > > > $ awk '{print $14, $15}' /proc/7881/stat # usr / sys
> > > > 229 3088448
> > > >
> > > > When I tried to shut down postgres (hoping to finally be able to attach
> > > > a debugger), instead it got stuck:
> > >
> > > That strongly indicates either a kernel bug or storage having an issue. It
> > > can't be postgres' fault if an IO never completes.
> >
> > Is that for sure even though wchan=? (which I take to mean "not in a system
> > call"), and the process is stuck in user mode ?
> 
> Postgres doesn't do anything to prevent a debugger from working, so this is
> just indicative that the kernel is stuck somewhere that it didn't set up
> information about being blocked - because it's busy doing something.
> 
> 
> > > What do /proc/$pid/stack say?
> >
> > [pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/18468/stat
> > 18468 (postgres) R 2274 18468 18468 0 -1 4857920 91836 0 3985 0 364 3794271 0 0 20 0 1 0 6092292660 941846528 10
184467440737095516154194304 12848820 140732995870240 140732995857304 139726958536394 0 4194304 19929088 536896135 0 0 0
173 0 0 1682 0 0 14949632 15052146 34668544 140732995874457 140732995874511 140732995874511 140732995874781 0
 
> 
> stack, not stat...

Ah, that is illuminating - thanks.

[pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/18468/stack 
[<ffffffffaa2d7856>] __cond_resched+0x26/0x30
[<ffffffffc10af35e>] dbuf_rele+0x1e/0x40 [zfs]
[<ffffffffc10bb730>] dmu_buf_rele_array.part.6+0x40/0x70 [zfs]
[<ffffffffc10bd96a>] dmu_write_uio_dnode+0x11a/0x180 [zfs]
[<ffffffffc10bda24>] dmu_write_uio_dbuf+0x54/0x70 [zfs]
[<ffffffffc11abd1b>] zfs_write+0xb9b/0xfb0 [zfs]
[<ffffffffc11ed202>] zpl_aio_write+0x152/0x1a0 [zfs]
[<ffffffffaa44dadb>] do_sync_readv_writev+0x7b/0xd0
[<ffffffffaa44f62e>] do_readv_writev+0xce/0x260
[<ffffffffaa44f855>] vfs_writev+0x35/0x60
[<ffffffffaa44fc12>] SyS_pwritev+0xc2/0xf0
[<ffffffffaa999f92>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff

FWIW: both are running zfs-2.2.3 RPMs from zfsonlinux.org.

It's surely possible that there's an issue that affects older kernels
but not more recent ones.

> > > > Full disclosure: the VM that hit this issue today has had storage-level
> > > > errors (reported here at ZZqr_GTaHyuW7fLp@pryzbyj2023), as recently as 3
> > > > days ago.
> > >
> > > So indeed, my suspicion from above is confirmed.
> >
> > I'd be fine with that conclusion (as in the earlier thread), except this
> > has now happened on 2 different VMs, and the first one has no I/O
> > issues.  If this were another symptom of a storage failure, and hadn't
> > previously happened on another VM, I wouldn't be re-reporting it.
> 
> Is it the same VM hosting environment? And the same old distro?

Yes, they're running centos7 with the indicated kernels.

dmidecode shows they're both running:

        Product Name: VMware Virtual Platform

But they're different customers, so I'd be somewhat surprised if they're
running same versions of the hypervisor.

-- 
Justin



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: On disable_cost
Next
From: Christophe Pettus
Date:
Subject: Re: Control flow in logical replication walsender