Thread: Expose lock group leader pid in pg_stat_activity
Hello,

Guillaume (in Cc) recently pointed out [1] that it's currently not possible to retrieve the list of parallel workers for a given backend at the SQL level. His use case was to develop a function in plpgsql to sample a given query wait event, but it's not hard to imagine other useful use cases for this information, for instance doing some analysis on the average number of workers per parallel query, or ratio of parallel queries. IIUC parallel queries is for now the only user of lock group, so this should work just fine.

I'm attaching a trivial patch to expose the group leader pid if any in pg_stat_activity. Quick example of usage:

=# SELECT query, leader_pid,
   array_agg(pid) filter(WHERE leader_pid != pid) AS members
   FROM pg_stat_activity
   WHERE leader_pid IS NOT NULL
   GROUP BY query, leader_pid;
       query       | leader_pid |    members
-------------------+------------+---------------
 select * from t1; |      28701 | {28728,28732}
(1 row)

[1] https://twitter.com/g_lelarge/status/1209486212190343168
Attachment
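To make the sampling use case above concrete, here is a minimal plpgsql sketch, illustrative only: the function name, the scratch temp table and the 100 ms sampling interval are arbitrary choices, and it assumes the leader_pid column proposed in the patch.

CREATE OR REPLACE FUNCTION sample_wait_events(p_leader int, p_samples int)
RETURNS TABLE (wait_event_type text, wait_event text, occurrences bigint)
LANGUAGE plpgsql AS $$
BEGIN
    -- scratch table for the collected samples
    CREATE TEMP TABLE IF NOT EXISTS wait_samples
        (wait_event_type text, wait_event text);
    TRUNCATE wait_samples;
    FOR i IN 1 .. p_samples LOOP
        -- leader_pid = pid for the leader itself, so this catches the
        -- leader and all of its parallel workers
        INSERT INTO wait_samples
            SELECT a.wait_event_type, a.wait_event
              FROM pg_stat_activity a
             WHERE a.leader_pid = p_leader;
        PERFORM pg_sleep(0.1);
    END LOOP;
    RETURN QUERY
        SELECT s.wait_event_type, s.wait_event, count(*)
          FROM wait_samples s
         GROUP BY 1, 2
         ORDER BY 3 DESC;
END;
$$;

For example, SELECT * FROM sample_wait_events(28701, 50); would sample the group led by pid 28701 for about five seconds.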
On Wed, Dec 25, 2019 at 7:03 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> [...]
>
> I'm attaching a trivial patch to expose the group leader pid if any
> in pg_stat_activity.

And I just realized that I forgot to update rule.out, sorry about that. v2 attached.
Attachment
Le mer. 25 déc. 2019 à 19:30, Julien Rouhaud <rjuju123@gmail.com> a écrit :
On Wed, Dec 25, 2019 at 7:03 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> Guillaume (in Cc) recently pointed out [1] that it's currently not
> possible to retrieve the list of parallel workers for a given backend
> at the SQL level. His use case was to develop a function in plpgsql
> to sample a given query wait event, but it's not hard to imagine other
> useful use cases for this information, for instance doing some
> analysis on the average number of workers per parallel query, or ratio
> of parallel queries. IIUC parallel queries is for now the only user
> of lock group, so this should work just fine.
>
> I'm attaching a trivial patch to expose the group leader pid if any
> in pg_stat_activity. Quick example of usage:
>
> =# SELECT query, leader_pid,
> array_agg(pid) filter(WHERE leader_pid != pid) AS members
> FROM pg_stat_activity
> WHERE leader_pid IS NOT NULL
> GROUP BY query, leader_pid;
> query | leader_pid | members
> -------------------+------------+---------------
> select * from t1; | 28701 | {28728,28732}
> (1 row)
>
>
> [1] https://twitter.com/g_lelarge/status/1209486212190343168
And I just realized that I forgot to update rule.out, sorry about
that. v2 attached.
So I tried your patch this morning, and it works really well.
On a SELECT count(*), I got this:
SELECT leader_pid, pid, wait_event_type, wait_event, state, backend_type FROM pg_stat_activity WHERE pid=111439 or leader_pid=111439;
┌────────────┬────────┬─────────────────┬──────────────┬────────┬─────────────────┐
│ leader_pid │ pid │ wait_event_type │ wait_event │ state │ backend_type │
├────────────┼────────┼─────────────────┼──────────────┼────────┼─────────────────┤
│ 111439 │ 111439 │ LWLock │ WALWriteLock │ active │ client backend │
│ 111439 │ 116887 │ LWLock │ WALWriteLock │ active │ parallel worker │
│ 111439 │ 116888 │ IO │ WALSync │ active │ parallel worker │
└────────────┴────────┴─────────────────┴──────────────┴────────┴─────────────────┘
(3 rows)
and this from a CREATE INDEX:
┌────────────┬────────┬─────────────────┬────────────┬────────┬─────────────────┐
│ leader_pid │ pid │ wait_event_type │ wait_event │ state │ backend_type │
├────────────┼────────┼─────────────────┼────────────┼────────┼─────────────────┤
│ 111439 │ 111439 │ │ │ active │ client backend │
│ 111439 │ 118775 │ │ │ active │ parallel worker │
└────────────┴────────┴─────────────────┴────────────┴────────┴─────────────────┘
(2 rows)
Anyway, it applies cleanly, it compiles, and it works. Documentation is available. So it looks to me it's good to go :)
--
Guillaume.
On Thu, Dec 26, 2019 at 9:08 AM Guillaume Lelarge <guillaume@lelarge.info> wrote:
>
> [...]
>
> Anyway, it applies cleanly, it compiles, and it works. Documentation is available. So it looks to me it's good to go :)

Thanks for the review Guillaume. Double checking the doc, I see that I made a copy/pasto mistake in the new field name. Attached v3 should be all good.
Attachment
Le jeu. 26 déc. 2019 à 09:49, Julien Rouhaud <rjuju123@gmail.com> a écrit :
On Thu, Dec 26, 2019 at 9:08 AM Guillaume Lelarge
<guillaume@lelarge.info> wrote:
>
> Le mer. 25 déc. 2019 à 19:30, Julien Rouhaud <rjuju123@gmail.com> a écrit :
>>
>> On Wed, Dec 25, 2019 at 7:03 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>> >
>> > Guillaume (in Cc) recently pointed out [1] that it's currently not
>> > possible to retrieve the list of parallel workers for a given backend
>> > at the SQL level. His use case was to develop a function in plpgsql
>> > to sample a given query wait event, but it's not hard to imagine other
>> > useful use cases for this information, for instance doing some
>> > analysis on the average number of workers per parallel query, or ratio
>> > of parallel queries. IIUC parallel queries is for now the only user
>> > of lock group, so this should work just fine.
>> >
>> > I'm attaching a trivial patch to expose the group leader pid if any
>> > in pg_stat_activity. Quick example of usage:
>> >
>> > =# SELECT query, leader_pid,
>> > array_agg(pid) filter(WHERE leader_pid != pid) AS members
>> > FROM pg_stat_activity
>> > WHERE leader_pid IS NOT NULL
>> > GROUP BY query, leader_pid;
>> > query | leader_pid | members
>> > -------------------+------------+---------------
>> > select * from t1; | 28701 | {28728,28732}
>> > (1 row)
>> >
>> >
>> > [1] https://twitter.com/g_lelarge/status/1209486212190343168
>>
>> And I just realized that I forgot to update rule.out, sorry about
>> that. v2 attached.
>
>
> So I tried your patch this morning, and it works really well.
>
> On a SELECT count(*), I got this:
>
> SELECT leader_pid, pid, wait_event_type, wait_event, state, backend_type FROM pg_stat_activity WHERE pid=111439 or leader_pid=111439;
>
> ┌────────────┬────────┬─────────────────┬──────────────┬────────┬─────────────────┐
> │ leader_pid │ pid │ wait_event_type │ wait_event │ state │ backend_type │
> ├────────────┼────────┼─────────────────┼──────────────┼────────┼─────────────────┤
> │ 111439 │ 111439 │ LWLock │ WALWriteLock │ active │ client backend │
> │ 111439 │ 116887 │ LWLock │ WALWriteLock │ active │ parallel worker │
> │ 111439 │ 116888 │ IO │ WALSync │ active │ parallel worker │
> └────────────┴────────┴─────────────────┴──────────────┴────────┴─────────────────┘
> (3 rows)
>
> and this from a CREATE INDEX:
>
> ┌────────────┬────────┬─────────────────┬────────────┬────────┬─────────────────┐
> │ leader_pid │ pid │ wait_event_type │ wait_event │ state │ backend_type │
> ├────────────┼────────┼─────────────────┼────────────┼────────┼─────────────────┤
> │ 111439 │ 111439 │ │ │ active │ client backend │
> │ 111439 │ 118775 │ │ │ active │ parallel worker │
> └────────────┴────────┴─────────────────┴────────────┴────────┴─────────────────┘
> (2 rows)
>
> Anyway, it applies cleanly, it compiles, and it works. Documentation is available. So it looks to me it's good to go :)
Thanks for the review Guillaume. Double checking the doc, I see that
I made a copy/pasto mistake in the new field name. Attached v3 should
be all good.
Feeling bad I missed this. But, yeah, it's much better with the right column's name.
For me, it's looking good to be ready for commiter. Should I set it this way in the Commit Fest app?
Guillaume.
On Thu, Dec 26, 2019 at 10:20 AM Guillaume Lelarge <guillaume@lelarge.info> wrote:
>
> [...]
>
> Feeling bad I missed this. But, yeah, it's much better with the right column's name.
>
> For me, it's looking good to be ready for commiter. Should I set it this way in the Commit Fest app?

If you don't see any other issue with the patch, I'd say yes. A committer can still put it back to waiting on author if needed.
Le jeu. 26 déc. 2019 à 10:26, Julien Rouhaud <rjuju123@gmail.com> a écrit :
On Thu, Dec 26, 2019 at 10:20 AM Guillaume Lelarge
<guillaume@lelarge.info> wrote:
>
> Le jeu. 26 déc. 2019 à 09:49, Julien Rouhaud <rjuju123@gmail.com> a écrit :
>>
>> On Thu, Dec 26, 2019 at 9:08 AM Guillaume Lelarge
>> <guillaume@lelarge.info> wrote:
>> >
>> > Le mer. 25 déc. 2019 à 19:30, Julien Rouhaud <rjuju123@gmail.com> a écrit :
>> >>
>> >> On Wed, Dec 25, 2019 at 7:03 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>> >> >
>> >> > Guillaume (in Cc) recently pointed out [1] that it's currently not
>> >> > possible to retrieve the list of parallel workers for a given backend
>> >> > at the SQL level. His use case was to develop a function in plpgsql
>> >> > to sample a given query wait event, but it's not hard to imagine other
>> >> > useful use cases for this information, for instance doing some
>> >> > analysis on the average number of workers per parallel query, or ratio
>> >> > of parallel queries. IIUC parallel queries is for now the only user
>> >> > of lock group, so this should work just fine.
>> >> >
>> >> > I'm attaching a trivial patch to expose the group leader pid if any
>> >> > in pg_stat_activity. Quick example of usage:
>> >> >
>> >> > =# SELECT query, leader_pid,
>> >> > array_agg(pid) filter(WHERE leader_pid != pid) AS members
>> >> > FROM pg_stat_activity
>> >> > WHERE leader_pid IS NOT NULL
>> >> > GROUP BY query, leader_pid;
>> >> > query | leader_pid | members
>> >> > -------------------+------------+---------------
>> >> > select * from t1; | 28701 | {28728,28732}
>> >> > (1 row)
>> >> >
>> >> >
>> >> > [1] https://twitter.com/g_lelarge/status/1209486212190343168
>> >>
>> >> And I just realized that I forgot to update rule.out, sorry about
>> >> that. v2 attached.
>> >
>> >
>> > So I tried your patch this morning, and it works really well.
>> >
>> > On a SELECT count(*), I got this:
>> >
>> > SELECT leader_pid, pid, wait_event_type, wait_event, state, backend_type FROM pg_stat_activity WHERE pid=111439 or leader_pid=111439;
>> >
>> > ┌────────────┬────────┬─────────────────┬──────────────┬────────┬─────────────────┐
>> > │ leader_pid │ pid │ wait_event_type │ wait_event │ state │ backend_type │
>> > ├────────────┼────────┼─────────────────┼──────────────┼────────┼─────────────────┤
>> > │ 111439 │ 111439 │ LWLock │ WALWriteLock │ active │ client backend │
>> > │ 111439 │ 116887 │ LWLock │ WALWriteLock │ active │ parallel worker │
>> > │ 111439 │ 116888 │ IO │ WALSync │ active │ parallel worker │
>> > └────────────┴────────┴─────────────────┴──────────────┴────────┴─────────────────┘
>> > (3 rows)
>> >
>> > and this from a CREATE INDEX:
>> >
>> > ┌────────────┬────────┬─────────────────┬────────────┬────────┬─────────────────┐
>> > │ leader_pid │ pid │ wait_event_type │ wait_event │ state │ backend_type │
>> > ├────────────┼────────┼─────────────────┼────────────┼────────┼─────────────────┤
>> > │ 111439 │ 111439 │ │ │ active │ client backend │
>> > │ 111439 │ 118775 │ │ │ active │ parallel worker │
>> > └────────────┴────────┴─────────────────┴────────────┴────────┴─────────────────┘
>> > (2 rows)
>> >
>> > Anyway, it applies cleanly, it compiles, and it works. Documentation is available. So it looks to me it's good to go :)
>>
>> Thanks for the review Guillaume. Double checking the doc, I see that
>> I made a copy/pasto mistake in the new field name. Attached v3 should
>> be all good.
>
>
> Feeling bad I missed this. But, yeah, it's much better with the right column's name.
>
> For me, it's looking good to be ready for commiter. Should I set it this way in the Commit Fest app?
If you don't see any other issue with the patch, I'd say yes. A
committer can still put it back to waiting on author if needed.
That's also what I thought, but as I was the only one commenting on this... Anyway, done.
--
Guillaume.
Hello

I doubt that "Process ID of the lock group leader" is enough for user documentation. I think we need to note:
- this field is related to parallel query execution
- leader_pid = pid if the process is a parallel leader
- leader_pid points to the pid of the leader if the process is a parallel worker
- leader_pid will be NULL for non-parallel queries or idle sessions

Also the patch has no tests. Possibly this is normal; I'm not sure how to write a reliable test for this feature.

Patch applies, compiles, passes tests.

regards, Sergei
Hello,

On Thu, Dec 26, 2019 at 12:18 PM Sergei Kornilov <sk@zsrv.org> wrote:
>
> I doubt that "Process ID of the lock group leader" is enough for user
> documentation. I think we need to note:
> - this field is related to parallel query execution
> - leader_pid = pid if the process is a parallel leader
> - leader_pid points to the pid of the leader if the process is a parallel worker
> - leader_pid will be NULL for non-parallel queries or idle sessions

As I understand it, lock group is some infrastructure that is used by parallel queries, but could be used for something else too. So if more documentation is needed, we should say something like "For now, only parallel queries can have a lock group" or something like that. The fact that leader_pid == pid for the leader and differs for the other members should be obvious, so I'm not sure that it's worth documenting.

> Also the patch has no tests. Possibly this is normal; I'm not sure how
> to write a reliable test for this feature.

Yes, I was unsure if some extra testing was required. We could set force_parallel_mode to on and query "select leader_pid is not null from pg_stat_activity where pid = pg_backend_pid()", and then the opposite test, which should do the trick.
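A rough sketch of that test idea, purely illustrative (the parallel-safe query and the GUC setting are placeholders; also, as comes up later in the thread, leader_pid then stays set for the rest of the session once the backend has led a parallel query, so the "opposite" check can't simply be run afterwards in the same session):

SET force_parallel_mode = on;
SELECT count(*) FROM pg_class;            -- any trivially parallel-safe query
SELECT leader_pid IS NOT NULL AS is_group_leader
  FROM pg_stat_activity
 WHERE pid = pg_backend_pid();
RESET force_parallel_mode;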
Hello

> As I understand it, lock group is some infrastructure that is used by
> parallel queries, but could be used for something else too. So if
> more documentation is needed, we should say something like "For now,
> only parallel queries can have a lock group" or something like that.

If lockGroupLeader will be used in some way for non-parallel queries, then the name leader_pid could be confusing, no?

I treat pg_stat_activity as a view for the user. Do we document anywhere what a "lock group leader" is (apart from the README in the source tree)? I mean the user is going to read the documentation: "ok, this field is the process ID of the lock group leader, but what is it?". Exposing a leader pid for a parallel worker is a clear improvement for the user, and it seems lockGroupLeader->pid is exactly that. Therefore, I would like to see such a description and meaning for the field.

> The fact that leader_pid == pid for the leader and differs for the
> other members should be obvious, so I'm not sure that it's worth
> documenting.

It may not be obvious that leader_pid is not null in this case. But ok, no objections.

regards, Sergei
On Fri, Dec 27, 2019 at 10:01 AM Sergei Kornilov <sk@zsrv.org> wrote:
>
> > As I understand it, lock group is some infrastructure that is used by
> > parallel queries, but could be used for something else too. So if
> > more documentation is needed, we should say something like "For now,
> > only parallel queries can have a lock group" or something like that.
>
> If lockGroupLeader will be used in some way for non-parallel queries,
> then the name leader_pid could be confusing, no?
>
> I treat pg_stat_activity as a view for the user. Do we document anywhere
> what a "lock group leader" is (apart from the README in the source tree)?
> I mean the user is going to read the documentation: "ok, this field is
> the process ID of the lock group leader, but what is it?". Exposing a
> leader pid for a parallel worker is a clear improvement for the user, and
> it seems lockGroupLeader->pid is exactly that. Therefore, I would like to
> see such a description and meaning for the field.

I think that not using "parallel" to name this field will help to avoid confusion if the lock group infrastructure is eventually used for something else, but that's only true if indeed we explain what a lock group is.

> > The fact that leader_pid == pid for the leader and differs for the
> > other members should be obvious, so I'm not sure that it's worth
> > documenting.
>
> It may not be obvious that leader_pid is not null in this case. But ok,
> no objections.

If we adapt lmgr/README to document group locking, it also addresses this. What do you think of:

The leader_pid is NULL for processes not involved in parallel query. When a process wants to cooperate with parallel workers, it becomes a lock group leader, which means that this field will be set to its own pid. When a parallel worker starts up, this field will be set to the leader's pid.
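For illustration only (not part of the patch or its documentation), the semantics proposed above translate to SQL like this:

SELECT pid,
       leader_pid,
       CASE WHEN leader_pid IS NULL THEN 'not involved in parallel query'
            WHEN leader_pid = pid   THEN 'parallel group leader'
            ELSE 'parallel worker'
       END AS role
  FROM pg_stat_activity;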
On Fri, Dec 27, 2019 at 10:15:33AM +0100, Julien Rouhaud wrote:
> I think that not using "parallel" to name this field will help to
> avoid confusion if the lock group infrastructure is eventually used
> for something else, but that's only true if indeed we explain what a
> lock group is.

As you already pointed out, src/backend/storage/lmgr/README includes a full description of this stuff under the section "Group Locking". So I agree that the patch ought to document your new field in a much better way; not mentioning the term "group locking" is even better, so as not to confuse the reader, because this term is not present in the docs at all.

> The leader_pid is NULL for processes not involved in parallel query.
> When a process wants to cooperate with parallel workers, it becomes a
> lock group leader, which means that this field will be set to its own
> pid. When a parallel worker starts up, this field will be set to the
> leader's pid.

The first sentence is good to have. Now instead of "lock group leader", I think that we had better use "parallel group leader" as in other parts of the docs (see wait events for example). Then we just need to say that if leader_pid has the same value as pg_stat_activity.pid, then we have a group leader. If not, then it is a parallel worker process initially spawned by the leader whose PID is leader_pid (when executing Gather, Gather Merge, the soon-to-be parallel vacuum, or a parallel btree build, but this does not need a mention in the docs). There could be an argument as well to have leader_pid set to NULL for a leader, but that would be inconsistent with what the PGPROC entry reports for the backend.

While looking at the code, I think that we could refactor things a bit for raw_wait_event, wait_event_type and wait_event, which have some duplicated code for backend and auxiliary processes. What about filling in the wait event data after fetching the PGPROC entry, and also filling in leader_pid for auxiliary processes? This does not matter now, perhaps it will never matter (or not), but that would make the code much more consistent.
--
Michael
Attachment
On Thu, Jan 16, 2020 at 04:27:27PM +0900, Michael Paquier wrote:
> While looking at the code, I think that we could refactor things a bit
> for raw_wait_event, wait_event_type and wait_event, which have some
> duplicated code for backend and auxiliary processes. What about
> filling in the wait event data after fetching the PGPROC entry, and
> also filling in leader_pid for auxiliary processes? This does not
> matter now, perhaps it will never matter (or not), but that would make
> the code much more consistent.

And actually, the way you are looking at the leader's PID is visibly incorrect and inconsistent because the patch takes no shared LWLock on the leader using LockHashPartitionLockByProc() followed by LWLockAcquire(), no? That's incorrect because it could be perfectly possible to crash with this code between the moment you check if lockGroupLeader is NULL and the moment you look at lockGroupLeader->pid if a process is being stopped in-between and removes itself from a lock group in ProcKill(). That's also inconsistent because it could be perfectly possible to finish with an incorrect view of the data while scanning for all the backend entries, like a leader set to NULL with workers pointing to the leader for example, or even workers marked incorrectly as NULL. The second one may not be a problem, but the first one could be confusing.
--
Michael
Attachment
On Thu, Jan 16, 2020 at 8:28 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Dec 27, 2019 at 10:15:33AM +0100, Julien Rouhaud wrote:
> > I think that not using "parallel" to name this field will help to
> > avoid confusion if the lock group infrastructure is eventually used
> > for something else, but that's only true if indeed we explain what a
> > lock group is.
>
> As you already pointed out, src/backend/storage/lmgr/README includes a
> full description of this stuff under the section "Group Locking". So
> I agree that the patch ought to document your new field in a much
> better way; not mentioning the term "group locking" is even better, so
> as not to confuse the reader, because this term is not present in the
> docs at all.
>
> > The leader_pid is NULL for processes not involved in parallel query.
> > When a process wants to cooperate with parallel workers, it becomes a
> > lock group leader, which means that this field will be set to its own
> > pid. When a parallel worker starts up, this field will be set to the
> > leader's pid.
>
> The first sentence is good to have. Now instead of "lock group
> leader", I think that we had better use "parallel group leader" as in
> other parts of the docs (see wait events for example).

Ok, I'll change it this way.

> Then we just need to say that if leader_pid has the same value as
> pg_stat_activity.pid, then we have a group leader. If not, then it is
> a parallel worker process initially spawned by the leader whose PID is
> leader_pid (when executing Gather, Gather Merge, the soon-to-be parallel
> vacuum, or a parallel btree build, but this does not need a mention in
> the docs). There could be an argument as well to have leader_pid set to
> NULL for a leader, but that would be inconsistent with what the PGPROC
> entry reports for the backend.

It would also slightly complicate things to get the full set of backends involved in a parallel query, while excluding the leader is entirely trivial.

> While looking at the code, I think that we could refactor things a bit
> for raw_wait_event, wait_event_type and wait_event, which have some
> duplicated code for backend and auxiliary processes. What about
> filling in the wait event data after fetching the PGPROC entry, and
> also filling in leader_pid for auxiliary processes? This does not
> matter now, perhaps it will never matter (or not), but that would make
> the code much more consistent.

Yeah, I didn't think that auxiliary processes would be involved any time soon, but I can include this refactoring.
On Thu, Jan 16, 2020 at 8:49 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Jan 16, 2020 at 04:27:27PM +0900, Michael Paquier wrote:
> > While looking at the code, I think that we could refactor things a bit
> > for raw_wait_event, wait_event_type and wait_event, which have some
> > duplicated code for backend and auxiliary processes. What about
> > filling in the wait event data after fetching the PGPROC entry, and
> > also filling in leader_pid for auxiliary processes? This does not
> > matter now, perhaps it will never matter (or not), but that would make
> > the code much more consistent.
>
> And actually, the way you are looking at the leader's PID is visibly
> incorrect and inconsistent because the patch takes no shared LWLock on
> the leader using LockHashPartitionLockByProc() followed by
> LWLockAcquire(), no? That's incorrect because it could be perfectly
> possible to crash with this code between the moment you check if
> lockGroupLeader is NULL and the moment you look at
> lockGroupLeader->pid if a process is being stopped in-between and
> removes itself from a lock group in ProcKill(). That's also
> inconsistent because it could be perfectly possible to finish with an
> incorrect view of the data while scanning for all the backend entries,
> like a leader set to NULL with workers pointing to the leader for
> example, or even workers marked incorrectly as NULL. The second one
> may not be a problem, but the first one could be confusing.

Oh indeed. But unless we hold some LWLock during the whole function execution, we cannot guarantee a consistent view, right? And isn't it already possible to e.g. see a parallel worker in pg_stat_activity while all other queries are shown as idle, if you're unlucky enough?

Also, LockHashPartitionLockByProc requires the leader PGPROC, and there's no guarantee that we'll see the leader before any of the workers, so I'm unsure how to implement what you said. Wouldn't it be better to simply fetch the leader PGPROC after acquiring a shared ProcArrayLock, and use that copy to display the pid, after checking that we retrieved a non-null PGPROC?
On Fri, Jan 17, 2020 at 05:07:55PM +0100, Julien Rouhaud wrote:
> Oh indeed. But unless we hold some LWLock during the whole function
> execution, we cannot guarantee a consistent view, right?

Yep. That's possible.

> And isn't it already possible to e.g. see a parallel worker in
> pg_stat_activity while all other queries are shown as idle, if you're
> unlucky enough?

Yep. That's possible.

> Also, LockHashPartitionLockByProc requires the leader PGPROC, and
> there's no guarantee that we'll see the leader before any of the
> workers, so I'm unsure how to implement what you said. Wouldn't it be
> better to simply fetch the leader PGPROC after acquiring a shared
> ProcArrayLock, and use that copy to display the pid, after checking
> that we retrieved a non-null PGPROC?

Another idea would be to check if the current PGPROC entry is a leader and print in an int[] the list of PIDs of all the workers while holding a shared LWLock to avoid anything being unregistered. Less handy, but a bit more consistent. One problem with doing that is that you may have in your list of PIDs some worker processes that are already gone when going through all the other backend entries. An advantage is that an empty array could mean "I am the leader, but nothing has been registered yet to my group lock." (note that the leader adds itself to lockGroupMembers).
--
Michael
Attachment
On Sat, Jan 18, 2020 at 3:51 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> [...]
>
> Another idea would be to check if the current PGPROC entry is a leader
> and print in an int[] the list of PIDs of all the workers while holding
> a shared LWLock to avoid anything being unregistered. Less handy, but a
> bit more consistent. One problem with doing that is that you may have
> in your list of PIDs some worker processes that are already gone when
> going through all the other backend entries. An advantage is that an
> empty array could mean "I am the leader, but nothing has been registered
> yet to my group lock." (note that the leader adds itself to
> lockGroupMembers).

So, AFAICT the LockHashPartitionLockByProc is required when iterating/modifying lockGroupMembers or lockGroupLink, but just getting the leader pid should be safe. Since we'll never be able to get a totally consistent view of the data here, I'm in favor of avoiding taking extra locks here. I agree that outputting an array of the pids would be more consistent for the leader, but it will have its own set of corner cases. It seems to me that a new leader_pid column is easier to handle at the SQL level, so I kept that approach in the attached v4. If you have strong objections to it, I can still change it.
Attachment
On Tue, Jan 28, 2020 at 12:36:41PM +0100, Julien Rouhaud wrote:
>
> [...]
>
> So, AFAICT the LockHashPartitionLockByProc is required when
> iterating/modifying lockGroupMembers or lockGroupLink, but just
> getting the leader pid should be safe. Since we'll never be able to
> get a totally consistent view of the data here, I'm in favor of
> avoiding taking extra locks here. I agree that outputting an array of
> the pids would be more consistent for the leader, but it will have its
> own set of corner cases. It seems to me that a new leader_pid column
> is easier to handle at the SQL level, so I kept that approach in the
> attached v4. If you have strong objections to it, I can still change it.

I agree a separate "leader_pid" column is easier to work with, as it does not require unnesting and so on.

As for the consistency, I agree we probably can't make this perfect, as we're fetching and processing the PGPROC records one by one. Fixing that would require acquiring a much stronger lock on PGPROC, and perhaps some other locks. That's pre-existing behavior, of course, it's just not very obvious as we don't have any dependencies between the rows, I think. Adding the leader_pid will change that, of course. But I think it's still mostly OK, even with the possible inconsistency.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
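To illustrate that point, the per-leader worker list can be rebuilt with a plain aggregate over the new column, with no unnesting involved (this is essentially the query from the first message of the thread):

SELECT leader_pid,
       array_agg(pid) FILTER (WHERE pid <> leader_pid) AS worker_pids
  FROM pg_stat_activity
 WHERE leader_pid IS NOT NULL
 GROUP BY leader_pid;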
On Tue, Jan 28, 2020 at 2:09 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> I agree a separate "leader_pid" column is easier to work with, as it
> does not require unnesting and so on.
>
> As for the consistency, I agree we probably can't make this perfect, as
> we're fetching and processing the PGPROC records one by one. Fixing that
> would require acquiring a much stronger lock on PGPROC, and perhaps some
> other locks. That's pre-existing behavior, of course, it's just not very
> obvious as we don't have any dependencies between the rows, I think.
> Adding the leader_pid will change that, of course. But I think it's
> still mostly OK, even with the possible inconsistency.

There were already some dependencies between the rows since parallel queries were added, as you could see e.g. a parallel worker while no query is currently active. This patch will make those corner cases more obvious. Should I document the possible inconsistencies?
On Tue, Jan 28, 2020 at 02:26:34PM +0100, Julien Rouhaud wrote:
> There were already some dependencies between the rows since parallel
> queries were added, as you could see e.g. a parallel worker while no
> query is currently active. This patch will make those corner cases
> more obvious.

Yeah, sure. I mean explicit dependencies, e.g. a column referencing values from another row, like leader_pid does.

> Should I document the possible inconsistencies?

I think it's worth mentioning that as a comment in the code, say before the pg_stat_get_activity function. IMO we don't need to document all possible inconsistencies, a generic explanation is enough.

Not sure about the user docs. Does it currently say anything about this topic - consistency with stat catalogs?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 28, 2020 at 02:52:08PM +0100, Tomas Vondra wrote:
> On Tue, Jan 28, 2020 at 02:26:34PM +0100, Julien Rouhaud wrote:
> > There were already some dependencies between the rows since parallel
> > queries were added, as you could see e.g. a parallel worker while no
> > query is currently active. This patch will make those corner cases
> > more obvious.

I was reviewing the code, and one thing that I was wondering is if it would be better to make the code more defensive and return NULL when the PID of the referenced leader is 0 or InvalidPid. However, that would mean that we have a dummy 2PC entry from PGPROC or something not yet started, which makes no sense. So your simpler version is actually fine. What you have here is that in the worst case you could finish with an incorrect reference to shared memory if a PGPROC is recycled between the moment you look at the leader field and the moment the PID value is fetched. That's very unlikely to happen, and I agree that it does not really justify the cost of taking extra locks every time we scan pg_stat_activity.

> Yeah, sure. I mean explicit dependencies, e.g. a column referencing
> values from another row, like leader_pid does.

+ The leader_pid is NULL for processes not involved in parallel query.
This is missing two markups, one for "NULL" and a second for "leader_pid". The documentation does not match the surroundings either, so I would suggest a reformulation for the beginning:
PID of the leader process if this process is involved in parallel query.
And actually this paragraph is not completely true, because leader_pid remains set (with leader_pid = pid) even after a parallel query run has finished for a session, as lockGroupLeader is only set to NULL once the process is stopped in ProcKill().

> > Should I document the possible inconsistencies?
>
> I think it's worth mentioning that as a comment in the code, say before
> the pg_stat_get_activity function. IMO we don't need to document all
> possible inconsistencies, a generic explanation is enough.

Agreed that adding some information in the area where we look at the PGPROC entries for wait events and such would be nice.

> Not sure about the user docs. Does it currently say anything about this
> topic - consistency with stat catalogs?

Not sure that it is the job of this patch to do that. Do you have something specific in mind?
--
Michael
Attachment
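Illustrative only: because of the behaviour described above, a query like the following also returns sessions that are now idle but have led a parallel query at some point in their lifetime, not just currently running parallel leaders:

SELECT pid, leader_pid, state
  FROM pg_stat_activity
 WHERE leader_pid = pid;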
On Thu, Jan 30, 2020 at 10:03:01PM +0900, Michael Paquier wrote:
>
> [...]
>
> I was reviewing the code, and one thing that I was wondering is if it
> would be better to make the code more defensive and return NULL when
> the PID of the referenced leader is 0 or InvalidPid. However, that
> would mean that we have a dummy 2PC entry from PGPROC or something not
> yet started, which makes no sense. So your simpler version is actually
> fine. What you have here is that in the worst case you could finish
> with an incorrect reference to shared memory if a PGPROC is recycled
> between the moment you look at the leader field and the moment the PID
> value is fetched. That's very unlikely to happen, and I agree that it
> does not really justify the cost of taking extra locks every time we
> scan pg_stat_activity.

Ok.

> + The leader_pid is NULL for processes not involved in parallel query.
> This is missing two markups, one for "NULL" and a second for
> "leader_pid".

The extra "leader_pid" disappeared when I reworked the doc. I'm not sure what you meant here for NULL, as I don't see any extra markup used in nearby documentation, so I hope this version is ok.

> The documentation does not match the surroundings either, so I would
> suggest a reformulation for the beginning:
> PID of the leader process if this process is involved in parallel query.
> And actually this paragraph is not completely true, because leader_pid
> remains set (with leader_pid = pid) even after a parallel query run has
> finished for a session, as lockGroupLeader is only set to NULL once the
> process is stopped in ProcKill().

Oh, good point, that's unfortunately not a super friendly behavior. I tried to adapt the documentation to address all of that. It's maybe slightly too verbose, but I guess that extra clarity is welcome here.

> > I think it's worth mentioning that as a comment in the code, say before
> > the pg_stat_get_activity function. IMO we don't need to document all
> > possible inconsistencies, a generic explanation is enough.
>
> Agreed that adding some information in the area where we look at the
> PGPROC entries for wait events and such would be nice.

I added some code comments to remind the reader that we don't guarantee any consistency here.
Attachment
On Tue, Feb 04, 2020 at 03:27:25PM +0100, Julien Rouhaud wrote:
> I added some code comments to remind the reader that we don't guarantee
> any consistency here.

That's mostly fine. I have moved the comment related to AuxiliaryPidGetProc() within the inner part of its "if" (or the comment should be changed to be conditional). An extra thing is that nulls[29] was not set to true for a user without the proper permission rights. Does that look fine to you?
--
Michael
Attachment
On Wed, Feb 05, 2020 at 10:48:31AM +0900, Michael Paquier wrote:
> On Tue, Feb 04, 2020 at 03:27:25PM +0100, Julien Rouhaud wrote:
> > I added some code comments to remind the reader that we don't guarantee
> > any consistency here.
>
> That's mostly fine. I have moved the comment related to
> AuxiliaryPidGetProc() within the inner part of its "if" (or the comment
> should be changed to be conditional). An extra thing is that nulls[29]
> was not set to true for a user without the proper permission rights.

Oh, oops indeed.

> Does that look fine to you?

This looks good, thanks a lot!
On Wed, Feb 05, 2020 at 07:57:20AM +0100, Julien Rouhaud wrote:
> This looks good, thanks a lot!

Thanks for double-checking. And done.
--
Michael
Attachment
On Thu, Feb 06, 2020 at 09:24:16AM +0900, Michael Paquier wrote:
> On Wed, Feb 05, 2020 at 07:57:20AM +0100, Julien Rouhaud wrote:
> > This looks good, thanks a lot!
>
> Thanks for double-checking. And done.

Thanks!

While on the topic, is there any reason why the backend stays a group leader for the rest of its lifetime, and should we change that?

Also, while reading ProcKill, I noticed a typo in a comment:

 /*
  * Detach from any lock group of which we are a member.  If the leader
- * exist before all other group members, it's PGPROC will remain allocated
+ * exist before all other group members, its PGPROC will remain allocated
  * until the last group process exits; that process must return the
  * leader's PGPROC to the appropriate list.
  */
Attachment
On Thu, Feb 06, 2020 at 09:23:33AM +0100, Julien Rouhaud wrote:
> While on the topic, is there any reason why the backend stays a group
> leader for the rest of its lifetime, and should we change that?

Nothing happens without a reason. a1c1af2 is the original commit, and the thread is here:
https://www.postgresql.org/message-id/CA+TgmoapgKdy_Z0W9mHqZcGSo2t_t-4_V36DXaKim+X_fYp0oQ@mail.gmail.com
By looking at the surroundings, there are a couple of assumptions behind the timing of the shutdown for the workers and the leader. I have not studied the details much, but my guess is that it makes the handling of the leader shutting down before its workers easier. Robert or Amit likely know all the details here.

> Also, while reading ProcKill, I noticed a typo in a comment:
>
>  /*
>   * Detach from any lock group of which we are a member.  If the leader
> - * exist before all other group members, it's PGPROC will remain allocated
> + * exist before all other group members, its PGPROC will remain allocated
>   * until the last group process exits; that process must return the
>   * leader's PGPROC to the appropriate list.
>   */

Thanks, fixed.
--
Michael
Attachment
On Tue, Jan 28, 2020 at 12:36:41PM +0100, Julien Rouhaud wrote:
> So, AFAICT the LockHashPartitionLockByProc is required when
> iterating/modifying lockGroupMembers or lockGroupLink, but just
> getting the leader pid should be safe.

This still seems unsafe:

git show -U11 -w --patience b025f32e0b src/backend/utils/adt/pgstatfuncs.c

+		/*
+		 * If a PGPROC entry was retrieved, display wait events and lock
+		 * group leader information if any.  To avoid extra overhead, no
+		 * extra lock is being held, so there is no guarantee of
+		 * consistency across multiple rows.
+		 */
...
+			PGPROC	   *leader;
...
+			leader = proc->lockGroupLeader;
+			if (leader)
+			{
# does something guarantee that leader doesn't change ?
+				values[29] = Int32GetDatum(leader->pid);
+				nulls[29] = false;
			}

Michael seems to have raised the issue:

On Thu, Jan 16, 2020 at 04:49:12PM +0900, Michael Paquier wrote:
> And actually, the way you are looking at the leader's PID is visibly
> incorrect and inconsistent because the patch takes no shared LWLock on
> the leader using LockHashPartitionLockByProc() followed by
> LWLockAcquire(), no? That's incorrect because it could be perfectly
> possible to crash with this code between the moment you check if
> lockGroupLeader is NULL and the moment you look at
> lockGroupLeader->pid if a process is being stopped in-between and
> removes itself from a lock group in ProcKill(). That's also
> inconsistent because it could be perfectly possible to finish with an
> incorrect view of the data while scanning for all the backend entries,
> like a leader set to NULL with workers pointing to the leader for
> example, or even workers marked incorrectly as NULL. The second one
> may not be a problem, but the first one could be confusing.

But I don't see how it was addressed?

I read this:

src/backend/storage/lmgr/lock.c: * completely valid.  We cannot safely dereference another backend's
src/backend/storage/lmgr/lock.c- * lockGroupLeader field without holding all lock partition locks, and
src/backend/storage/lmgr/lock.c- * it's not worth that.)

I think if you do:
|LWLockAcquire(&proc->backendLock, LW_SHARED);
..then it's at least *safe* to access leader->pid, but it may be inconsistent unless you also call LockHashPartitionLockByProc.

I wasn't able to produce a crash, so maybe I missed something.

--
Justin
On Sun, Mar 15, 2020 at 11:27:52PM -0500, Justin Pryzby wrote:
> On Tue, Jan 28, 2020 at 12:36:41PM +0100, Julien Rouhaud wrote:
> > So, AFAICT the LockHashPartitionLockByProc is required when
> > iterating/modifying lockGroupMembers or lockGroupLink, but just
> > getting the leader pid should be safe.
>
> This still seems unsafe:
>
> [...]
>
> I wasn't able to produce a crash, so maybe I missed something.

I think I see. Julien's v3 patch did this:
https://www.postgresql.org/message-id/attachment/106429/pgsa_leader_pid-v3.diff

+	if (proc->lockGroupLeader)
+		values[29] = Int32GetDatum(proc->lockGroupLeader->pid);

..which is racy because a proc with a leader might die and be replaced by another proc without a leader between the NULL check and the dereference of its pid.

But the code since v4 does:

+	leader = proc->lockGroupLeader;
+	if (leader)
+		values[29] = Int32GetDatum(leader->pid);

..which is safe because PGPROCs are allocated in shared memory, so leader is for sure a non-NULL pointer to a PGPROC. leader->pid may be read inconsistently, which is what the comment says: "no extra lock is being held".

--
Justin
On Mon, Mar 16, 2020 at 12:43:41AM -0500, Justin Pryzby wrote:
> I think I see. Julien's v3 patch did this:
> https://www.postgresql.org/message-id/attachment/106429/pgsa_leader_pid-v3.diff
>
> +	if (proc->lockGroupLeader)
> +		values[29] = Int32GetDatum(proc->lockGroupLeader->pid);
>
> ..which is racy because a proc with a leader might die and be replaced
> by another proc without a leader between the NULL check and the
> dereference of its pid.
>
> But the code since v4 does:
>
> +	leader = proc->lockGroupLeader;
> +	if (leader)
> +		values[29] = Int32GetDatum(leader->pid);
>
> ..which is safe because PGPROCs are allocated in shared memory, so
> leader is for sure a non-NULL pointer to a PGPROC. leader->pid may be
> read inconsistently, which is what the comment says: "no extra lock is
> being held".

Yes, you have the correct answer here. As shaped, the code relies on the state of a PGPROC entry in shared memory.
--
Michael