Re: RFC: Logging plan of the running query - Mailing list pgsql-hackers

From James Coleman
Subject Re: RFC: Logging plan of the running query
Date
Msg-id CAAaqYe_Xz_bU09dZxPEP5C63N+V_j2qy7Xj2kW22BX2hcwGfGw@mail.gmail.com
Whole thread Raw
In response to Re: RFC: Logging plan of the running query  (torikoshia <torikoshia@oss.nttdata.com>)
List pgsql-hackers
On Fri, Oct 6, 2023 at 8:58 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> On 2023-10-04 03:00, James Coleman wrote:
> > On Thu, Sep 7, 2023 at 2:09 AM torikoshia <torikoshia@oss.nttdata.com>
> > wrote:
> >>
> >> On 2023-09-06 11:17, James Coleman wrote:
> >>
> >> >> > I've never been able to reproduce it (haven't tested the new version,
> >> >> > but v28 at least) on my M1 Mac; where I've reproduced it is on Debian
> >> >> > (first buster and now bullseye).
> >> >> >
> >> >> > I'm attaching several stacktraces in the hope that they provide some
> >> >> > clues. These all match the ps output I sent earlier, though note in
> >> >> > that output there is both the regress instance and my test instance
> >> >> > (pid 3213249) running (different ports, of course, and they are from
> >> >> > the exact same compilation run). I've attached ps output for the
> >> >> > postgres processes under the make check process to simplify cross
> >> >> > referencing.
> >> >> >
> >> >> > A few interesting things:
> >> >> > - There's definitely a lock on a relation that seems to be what's
> >> >> > blocking the processes.
> >> >> > - When I try to connect with psql the process forks but then hangs
> >> >> > (see the ps output with task names stuck in "authentication"). I've
> >> >> > also included a trace from one of these.
> >> >>
> >> >> Thanks for sharing them!
> >> >>
> >> >> Many processes are waiting to acquire the LW lock, including the
> >> >> process
> >> >> trying to output the plan(select1.trace).
> >> >>
> >> >> I suspect that this is due to a lock that was acquired prior to being
> >> >> interrupted by ProcessLogQueryPlanInterrupt(), but have not been able
> >> >> to
> >> >> reproduce the same situation..
> >> >>
> >> >
> >> > I don't have time immediately to dig into this, but perhaps loading up
> >> > the core dumps would allow us to see what query is running in each
> >> > backend process (if it hasn't already been discarded by that point)
> >> > and thereby determine what point in each test process led to the error
> >> > condition?
> >>
> >> Thanks for the suggestion.
> >> I am concerned that core dumps may not be readable on different
> >> operating systems or other environments. (Unfortunately, I do not have
> >> Debian on hand)
> >>
> >> It seems that we can know what queries were running from the stack
> >> traces you shared.
> >> As described above, I suspect a lock which was acquired prior to
> >> ProcessLogQueryPlanInterrupt() caused the issue.
> >> I will try a little more to see if I can devise a way to create the
> >> same
> >> situation.
> >>
> >> > Alternatively we might be able to apply the same trick to the test
> >> > client instead...
> >> >
> >> > BTW, for my own easy reference in this thread: relid 1259 is pg_class
> >> > if I'm not mistaken.
> >>
> >> Yeah, and I think it's strange that the lock to 1259 appears twice and
> >> must be avoided.
> >>
> >>    #10 0x0000559d61d8ee6e in LockRelationOid (relid=1259, lockmode=1)
> >> at
> >> lmgr.c:117
> >>    ..
> >>    #49 0x0000559d61b4989d in ProcessLogQueryPlanInterrupt () at
> >> explain.c:5158
> >>    ..
> >>    #53 0x0000559d61d8ee6e in LockRelationOid (relid=1259, lockmode=1)
> >> at
> >> lmgr.c:117
> >
> > I chatted with Andres and David about this at PGConf.NYC,
> Thanks again for the discussion with hackers!
>
> > and I think
> > what we need to do is explicitly disallow running this code any time
> > we are inside of lock acquisition code.
>
> Yeah, I think it's a straightforward workaround.
> And I'm also concerned that the condition of being in the process
> of acquiring some kind of lock is too strict and will make it almost
> impossible to output a running plan.

I was concerned about this too, but I was wondering if we might be
able to cheat a bit by making such a case not clear the flag but
instead just skip it for now.

Regards,
James



pgsql-hackers by date:

Previous
From: James Coleman
Date:
Subject: Re: Opportunistically pruning page before update
Next
From: Peter Eisentraut
Date:
Subject: Add const to values and nulls arguments