Re: RFC: Logging plan of the running query - Mailing list pgsql-hackers
From | James Coleman |
---|---|
Subject | Re: RFC: Logging plan of the running query |
Date | |
Msg-id | CAAaqYe_Xz_bU09dZxPEP5C63N+V_j2qy7Xj2kW22BX2hcwGfGw@mail.gmail.com Whole thread Raw |
In response to | Re: RFC: Logging plan of the running query (torikoshia <torikoshia@oss.nttdata.com>) |
List | pgsql-hackers |
On Fri, Oct 6, 2023 at 8:58 AM torikoshia <torikoshia@oss.nttdata.com> wrote: > > On 2023-10-04 03:00, James Coleman wrote: > > On Thu, Sep 7, 2023 at 2:09 AM torikoshia <torikoshia@oss.nttdata.com> > > wrote: > >> > >> On 2023-09-06 11:17, James Coleman wrote: > >> > >> >> > I've never been able to reproduce it (haven't tested the new version, > >> >> > but v28 at least) on my M1 Mac; where I've reproduced it is on Debian > >> >> > (first buster and now bullseye). > >> >> > > >> >> > I'm attaching several stacktraces in the hope that they provide some > >> >> > clues. These all match the ps output I sent earlier, though note in > >> >> > that output there is both the regress instance and my test instance > >> >> > (pid 3213249) running (different ports, of course, and they are from > >> >> > the exact same compilation run). I've attached ps output for the > >> >> > postgres processes under the make check process to simplify cross > >> >> > referencing. > >> >> > > >> >> > A few interesting things: > >> >> > - There's definitely a lock on a relation that seems to be what's > >> >> > blocking the processes. > >> >> > - When I try to connect with psql the process forks but then hangs > >> >> > (see the ps output with task names stuck in "authentication"). I've > >> >> > also included a trace from one of these. > >> >> > >> >> Thanks for sharing them! > >> >> > >> >> Many processes are waiting to acquire the LW lock, including the > >> >> process > >> >> trying to output the plan(select1.trace). > >> >> > >> >> I suspect that this is due to a lock that was acquired prior to being > >> >> interrupted by ProcessLogQueryPlanInterrupt(), but have not been able > >> >> to > >> >> reproduce the same situation.. > >> >> > >> > > >> > I don't have time immediately to dig into this, but perhaps loading up > >> > the core dumps would allow us to see what query is running in each > >> > backend process (if it hasn't already been discarded by that point) > >> > and thereby determine what point in each test process led to the error > >> > condition? > >> > >> Thanks for the suggestion. > >> I am concerned that core dumps may not be readable on different > >> operating systems or other environments. (Unfortunately, I do not have > >> Debian on hand) > >> > >> It seems that we can know what queries were running from the stack > >> traces you shared. > >> As described above, I suspect a lock which was acquired prior to > >> ProcessLogQueryPlanInterrupt() caused the issue. > >> I will try a little more to see if I can devise a way to create the > >> same > >> situation. > >> > >> > Alternatively we might be able to apply the same trick to the test > >> > client instead... > >> > > >> > BTW, for my own easy reference in this thread: relid 1259 is pg_class > >> > if I'm not mistaken. > >> > >> Yeah, and I think it's strange that the lock to 1259 appears twice and > >> must be avoided. > >> > >> #10 0x0000559d61d8ee6e in LockRelationOid (relid=1259, lockmode=1) > >> at > >> lmgr.c:117 > >> .. > >> #49 0x0000559d61b4989d in ProcessLogQueryPlanInterrupt () at > >> explain.c:5158 > >> .. > >> #53 0x0000559d61d8ee6e in LockRelationOid (relid=1259, lockmode=1) > >> at > >> lmgr.c:117 > > > > I chatted with Andres and David about this at PGConf.NYC, > Thanks again for the discussion with hackers! > > > and I think > > what we need to do is explicitly disallow running this code any time > > we are inside of lock acquisition code. > > Yeah, I think it's a straightforward workaround. > And I'm also concerned that the condition of being in the process > of acquiring some kind of lock is too strict and will make it almost > impossible to output a running plan. I was concerned about this too, but I was wondering if we might be able to cheat a bit by making such a case not clear the flag but instead just skip it for now. Regards, James
pgsql-hackers by date: