Re: RFC: Logging plan of the running query - Mailing list pgsql-hackers
| From | James Coleman | 
|---|---|
| Subject | Re: RFC: Logging plan of the running query | 
| Date | |
| Msg-id | CAAaqYe_p-849MtsVgN9_y-YL-uFzLFMpGw38T3Omb11gw-=K1w@mail.gmail.com Whole thread Raw | 
| In response to | Re: RFC: Logging plan of the running query (torikoshia <torikoshia@oss.nttdata.com>) | 
| Responses | Re: RFC: Logging plan of the running query | 
| List | pgsql-hackers | 
On Thu, Sep 7, 2023 at 2:09 AM torikoshia <torikoshia@oss.nttdata.com> wrote: > > On 2023-09-06 11:17, James Coleman wrote: > > >> > I've never been able to reproduce it (haven't tested the new version, > >> > but v28 at least) on my M1 Mac; where I've reproduced it is on Debian > >> > (first buster and now bullseye). > >> > > >> > I'm attaching several stacktraces in the hope that they provide some > >> > clues. These all match the ps output I sent earlier, though note in > >> > that output there is both the regress instance and my test instance > >> > (pid 3213249) running (different ports, of course, and they are from > >> > the exact same compilation run). I've attached ps output for the > >> > postgres processes under the make check process to simplify cross > >> > referencing. > >> > > >> > A few interesting things: > >> > - There's definitely a lock on a relation that seems to be what's > >> > blocking the processes. > >> > - When I try to connect with psql the process forks but then hangs > >> > (see the ps output with task names stuck in "authentication"). I've > >> > also included a trace from one of these. > >> > >> Thanks for sharing them! > >> > >> Many processes are waiting to acquire the LW lock, including the > >> process > >> trying to output the plan(select1.trace). > >> > >> I suspect that this is due to a lock that was acquired prior to being > >> interrupted by ProcessLogQueryPlanInterrupt(), but have not been able > >> to > >> reproduce the same situation.. > >> > > > > I don't have time immediately to dig into this, but perhaps loading up > > the core dumps would allow us to see what query is running in each > > backend process (if it hasn't already been discarded by that point) > > and thereby determine what point in each test process led to the error > > condition? > > Thanks for the suggestion. > I am concerned that core dumps may not be readable on different > operating systems or other environments. (Unfortunately, I do not have > Debian on hand) > > It seems that we can know what queries were running from the stack > traces you shared. > As described above, I suspect a lock which was acquired prior to > ProcessLogQueryPlanInterrupt() caused the issue. > I will try a little more to see if I can devise a way to create the same > situation. > > > Alternatively we might be able to apply the same trick to the test > > client instead... > > > > BTW, for my own easy reference in this thread: relid 1259 is pg_class > > if I'm not mistaken. > > Yeah, and I think it's strange that the lock to 1259 appears twice and > must be avoided. > > #10 0x0000559d61d8ee6e in LockRelationOid (relid=1259, lockmode=1) at > lmgr.c:117 > .. > #49 0x0000559d61b4989d in ProcessLogQueryPlanInterrupt () at > explain.c:5158 > .. > #53 0x0000559d61d8ee6e in LockRelationOid (relid=1259, lockmode=1) at > lmgr.c:117 I chatted with Andres and David about this at PGConf.NYC, and I think what we need to do is explicitly disallow running this code any time we are inside of lock acquisition code. Regards, James Coleman
pgsql-hackers by date: