Re: Expose lock group leader pid in pg_stat_activity - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: Expose lock group leader pid in pg_stat_activity
Date
Msg-id 20200316054341.GG26184@telsasoft.com
Whole thread Raw
In response to Re: Expose lock group leader pid in pg_stat_activity  (Justin Pryzby <pryzby@telsasoft.com>)
Responses Re: Expose lock group leader pid in pg_stat_activity  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
On Sun, Mar 15, 2020 at 11:27:52PM -0500, Justin Pryzby wrote:
> On Tue, Jan 28, 2020 at 12:36:41PM +0100, Julien Rouhaud wrote:
> > So, AFAICT the LockHashPartitionLockByProc is required when
> > iterating/modifying lockGroupMembers or lockGroupLink, but just
> > getting the leader pid should be safe.
> 
> This still seems unsafe:
> 
> git show -U11 -w --patience b025f32e0b src/backend/utils/adt/pgstatfuncs.c
> +                       /*
> +                        * If a PGPROC entry was retrieved, display wait events and lock
> +                        * group leader information if any.  To avoid extra overhead, no
> +                        * extra lock is being held, so there is no guarantee of
> +                        * consistency across multiple rows.
> +                        */
> ...
> +                               PGPROC     *leader;
> ...
> +                               leader = proc->lockGroupLeader;
> +                               if (leader)
> +                               {
> # does something guarantee that leader doesn't change ?
> +                                       values[29] = Int32GetDatum(leader->pid);
> +                                       nulls[29] = false;
>                                 }
> 
> Michael seems to have raised the issue:
> 
> On Thu, Jan 16, 2020 at 04:49:12PM +0900, Michael Paquier wrote:
> > And actually, the way you are looking at the leader's PID is visibly
> > incorrect and inconsistent because the patch takes no shared LWLock on
> > the leader using LockHashPartitionLockByProc() followed by
> > LWLockAcquire(), no?  That's incorrect because it could be perfectly
> > possible to crash with this code between the moment you check if 
> > lockGroupLeader is NULL and the moment you look at
> > lockGroupLeader->pid if a process is being stopped in-between and
> > removes itself from a lock group in ProcKill().  
> 
> But I don't see how it was addressed ?
> 
> I read this:
> 
> src/backend/storage/lmgr/lock.c:         * completely valid.  We cannot safely dereference another backend's
> src/backend/storage/lmgr/lock.c-         * lockGroupLeader field without holding all lock partition locks, and
> src/backend/storage/lmgr/lock.c-         * it's not worth that.)
> 
> I think if you do:
> |LWLockAcquire(&proc->backendLock, LW_SHARED);
> ..then it's at least *safe* to access leader->pid, but it may be inconsistent
> unless you also call LockHashPartitionLockByProc.
> 
> I wasn't able to produce a crash, so maybe I missed something.

I think I see.  Julien's v3 patch did this:
https://www.postgresql.org/message-id/attachment/106429/pgsa_leader_pid-v3.diff
+                if (proc->lockGroupLeader)
+                    values[29] = Int32GetDatum(proc->lockGroupLeader->pid);

..which is racy because a proc with a leader might die and be replaced by
another proc without a leader between 1 and 2.

But the code since v4 does:

+                leader = proc->lockGroupLeader;
+                if (leader)
+                                       values[29] = Int32GetDatum(leader->pid);

..which is safe because PROCs are allocated in shared memory, so leader is for
sure a non-NULL pointer to a PROC.  leader->pid may be read inconsistently,
which is what the comment says: "no extra lock is being held".

-- 
Justin



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Refactor compile-time assertion checks for C/C++
Next
From: Michael Paquier
Date:
Subject: Re: Expose lock group leader pid in pg_stat_activity