Thread: Parallel leader process info in EXPLAIN

Parallel leader process info in EXPLAIN

From
Thomas Munro
Date:
Hi,

While working on some slides explaining EXPLAIN, I couldn't resist the
urge to add the missing $SUBJECT.  The attached 0001 patch gives the
following:

Gather  ... time=0.146..33.077 rows=1 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=4425
  ->  Parallel Seq Scan on public.t ... time=19.421..30.092 rows=0 loops=3)
        Filter: (t.i = 42)
        Rows Removed by Filter: 333333
        Leader: actual time=0.013..32.025 rows=1 loops=1      <--- NEW
          Buffers: shared hit=1546                            <--- NEW
        Worker 0: actual time=29.069..29.069 rows=0 loops=1
          Buffers: shared hit=1126
        Worker 1: actual time=29.181..29.181 rows=0 loops=1
          Buffers: shared hit=1753

Without that, you have to deduce what work was done in the leader, but
I'd rather just show it.

The 0002 patch adjusts Sort for consistency with that scheme, so you get:

Sort  ... time=84.303..122.238 rows=333333 loops=3)
   Output: t1.i
   Sort Key: t1.i
   Leader:  Sort Method: external merge  Disk: 5864kB       <--- DIFFERENT
   Worker 0:  Sort Method: external merge  Disk: 3376kB
   Worker 1:  Sort Method: external merge  Disk: 4504kB
   Leader: actual time=119.624..165.949 rows=426914 loops=1
   Worker 0: actual time=61.239..90.984 rows=245612 loops=1
   Worker 1: actual time=72.046..109.782 rows=327474 loops=1

Without the "Leader" label, it's not really clear to the uninitiated
whether you're looking at combined, average or single process numbers.

Of course there are some more things that could be reported in a
similar way eventually, such as filter counters and hash join details.

For the XML/JSON/YAML formats, I decided to use a <Worker> element
with <Worker-Number>-1</Worker-Number> to indicate the leader.
Perhaps there should be a <Leader> element instead?

Thoughts?

Attachment

Re: Parallel leader process info in EXPLAIN

From
Melanie Plageman
Date:

On Wed, Oct 23, 2019 at 12:30 AM Thomas Munro <thomas.munro@gmail.com> wrote:

> While working on some slides explaining EXPLAIN, I couldn't resist the
> urge to add the missing $SUBJECT.  The attached 0001 patch gives the
> following:
>
> Gather  ... time=0.146..33.077 rows=1 loops=1)
>   Workers Planned: 2
>   Workers Launched: 2
>   Buffers: shared hit=4425
>   ->  Parallel Seq Scan on public.t ... time=19.421..30.092 rows=0 loops=3)
>         Filter: (t.i = 42)
>         Rows Removed by Filter: 333333
>         Leader: actual time=0.013..32.025 rows=1 loops=1      <--- NEW
>           Buffers: shared hit=1546                            <--- NEW
>         Worker 0: actual time=29.069..29.069 rows=0 loops=1
>           Buffers: shared hit=1126
>         Worker 1: actual time=29.181..29.181 rows=0 loops=1
>           Buffers: shared hit=1753
>
> Without that, you have to deduce what work was done in the leader, but
> I'd rather just show it.
>
> The 0002 patch adjusts Sort for consistency with that scheme, so you get:
>
> Sort  ... time=84.303..122.238 rows=333333 loops=3)
>    Output: t1.i
>    Sort Key: t1.i
>    Leader:  Sort Method: external merge  Disk: 5864kB       <--- DIFFERENT
>    Worker 0:  Sort Method: external merge  Disk: 3376kB
>    Worker 1:  Sort Method: external merge  Disk: 4504kB
>    Leader: actual time=119.624..165.949 rows=426914 loops=1
>    Worker 0: actual time=61.239..90.984 rows=245612 loops=1
>    Worker 1: actual time=72.046..109.782 rows=327474 loops=1
>
> Without the "Leader" label, it's not really clear to the uninitiated
> whether you're looking at combined, average or single process numbers.

Cool! I dig it.
Checked out the patches a bit and noticed that the tuplesort
instrumentation uses spaceUsed and I saw this comment in
tuplesort_get_stats()

/*
 * Note: it might seem we should provide both memory and disk usage for a
 * disk-based sort.  However, the current code doesn't track memory space
 * accurately once we have begun to return tuples to the caller (since we
 * don't account for pfree's the caller is expected to do), so we cannot
 * rely on availMem in a disk sort.  This does not seem worth the overhead
 * to fix.  Is it worth creating an API for the memory context code to
 * tell us how much is actually used in sortcontext?
 */

might it be worth trying out the memory accounting API
5dd7fc1519461548eebf26c33eac6878ea3e8788 here?
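
To spell out what I mean, here is a rough, untested sketch only: memUsed is
a made-up field, and it assumes the MemoryContextMemAllocated() function
added by that commit plus the existing sortcontext and tapeset members of
Tuplesortstate:

/* in tuplesort_get_stats(), disk-sort branch -- hypothetical sketch */
if (state->tapeset)
{
    stats->spaceType = SORT_SPACE_TYPE_DISK;
    stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
    /* hypothetical extra field: memory actually held by the sort context */
    stats->memUsed = MemoryContextMemAllocated(state->sortcontext, true) / 1024;
}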
 

> Of course there are some more things that could be reported in a
> similar way eventually, such as filter counters and hash join details.


This made me think about other explain wishlist items.
For parallel hashjoin, I would find it useful to know which batches
each worker participated in (maybe just probing to start with, but
loading would be great too).

I'm not sure anyone else (especially users) would care about this,
though.

--
Melanie Plageman

Re: Parallel leader process info in EXPLAIN

From
Peter Geoghegan
Date:
On Wed, Oct 30, 2019 at 9:24 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Checked out the patches a bit and noticed that the tuplesort
> instrumentation uses spaceUsed and I saw this comment in
> tuplesort_get_stats()

> might it be worth trying out the memory accounting API
> 5dd7fc1519461548eebf26c33eac6878ea3e8788 here?

I made exactly the same suggestion several years back, not long after
the API was first proposed by Jeff. However, tuplesort.c has changed a
lot since that time, to the extent that that comment now seems kind of
obsolete. These days, availMem accounting still isn't used at all for
disk sorts. Rather, the slab allocator is used. Virtually all the
memory used to merge is now managed by logtape.c, with only fixed
per-tape memory buffers within tuplesort.c. This per-tape stuff is
tiny anyway -- slightly more than 1KiB per tape.

It would now be relatively straightforward to report the memory used
by disk-based sorts, without needing to use the memory accounting API.
I imagine that it would report the high watermark memory usage during
the final merge. It's possible for this to be lower than the high
watermark during initial run generation (e.g. because of the
MaxAllocSize limit in buffer size within logtape.c), but that still
seems like the most useful figure to users. There'd be a new
"LogicalTapeSetMemory()" function to go along with the existing
LogicalTapeSetBlocks() function, or something along those lines.
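
To be concrete, I'm imagining something along these lines.  This is a
sketch only -- the function doesn't exist, and the return type and the
hypothetical mergeMemUsed field are just for illustration:

/* proposed sibling of LogicalTapeSetBlocks(): memory held by the tape set
 * for its read/write buffers */
extern int64 LogicalTapeSetMemory(LogicalTapeSet *lts);

/* tuplesort_get_stats() could then expose merge-phase memory for disk
 * sorts, roughly: */
stats->mergeMemUsed = LogicalTapeSetMemory(state->tapeset) / 1024;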

Not planning to work on this now, but perhaps you have time for it.

-- 
Peter Geoghegan



Re: Parallel leader process info in EXPLAIN

From
Tomas Vondra
Date:
On Wed, Oct 30, 2019 at 10:39:04AM -0700, Peter Geoghegan wrote:
>On Wed, Oct 30, 2019 at 9:24 AM Melanie Plageman
><melanieplageman@gmail.com> wrote:
>> Checked out the patches a bit and noticed that the tuplesort
>> instrumentation uses spaceUsed and I saw this comment in
>> tuplesort_get_stats()
>
>> might it be worth trying out the memory accounting API
>> 5dd7fc1519461548eebf26c33eac6878ea3e8788 here?
>
>I made exactly the same suggestion several years back, not long after
>the API was first proposed by Jeff. However, tuplesort.c has changed a
>lot since that time, to the extent that that comment now seems kind of
>obsolete. These days, availMem accounting still isn't used at all for
>disk sorts. Rather, the slab allocator is used. Virtually all the
>memory used to merge is now managed by logtape.c, with only fixed
>per-tape memory buffers within tuplesort.c. This per-tape stuff is
>tiny anyway -- slightly more than 1KiB per tape.
>
>It would now be relatively straightforward to report the memory used
>by disk-based sorts, without needing to use the memory accounting API.
>I imagine that it would report the high watermark memory usage during
>the final merge. It's possible for this to be lower than the high
>watermark during initial run generation (e.g. because of the
>MaxAllocSize limit in buffer size within logtape.c), but that still
>seems like the most useful figure to users. There'd be a new
>"LogicalTapeSetMemory()" function to go along with the existing
>LogicalTapeSetBlocks() function, or something along those lines.
>
>Not planning to work on this now, but perhaps you have time for it.
>

Another thing worth mentioning is that the memory accounting API does
nothing about the pfree() calls, mentioned in the comment. The memory is
tracked at the block level, so unless the pfree() frees the whole block
(which only really happens for oversized chunks) the accounting info
does not change.
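
To illustrate (just a sketch, ordinary palloc/pfree against the current
context):

Size    before, during, after;
char   *chunk;

before = MemoryContextMemAllocated(CurrentMemoryContext, true);
chunk = palloc(512);    /* small chunk, carved out of a block */
during = MemoryContextMemAllocated(CurrentMemoryContext, true);
pfree(chunk);           /* chunk returns to a freelist, the block stays */
after = MemoryContextMemAllocated(CurrentMemoryContext, true);

/* after == during here: only freeing a whole block (e.g. an oversized
 * chunk that got a dedicated block) makes the reported number go down */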

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Parallel leader process info in EXPLAIN

From
Thomas Munro
Date:
On Thu, Oct 31, 2019 at 5:24 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> On Wed, Oct 23, 2019 at 12:30 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Of course there are some more things that could be reported in a
>> similar way eventually, such as filter counters and hash join details.
>
> This made me think about other explain wishlist items.
> For parallel hashjoin, I would find it useful to know which batches
> each worker participated in (maybe just probing to start with, but
> loading would be great too).
>
> I'm not sure anyone else (especially users) would care about this,
> though.

Yeah, I think that'd be interesting.  At some point in the patch set
when I was working on the batch load balancing strategy I showed the
number of tuples hashed and number of batches probed by each process
(not the actual batch numbers, since that seems a bit over the top):

https://www.postgresql.org/message-id/CAEepm%3D0th8Le2SDCv32zN7tMyCJYR9oGYJ52fXNYJz1hrpGW%2BQ%40mail.gmail.com

I guess I thought of that as a debugging feature and took it out
because it was too verbose, but maybe it just needs to be controlled
by the VERBOSE switch.  Do you think we should put that back?



Re: Parallel leader process info in EXPLAIN

From
Thomas Munro
Date:
On Mon, Nov 4, 2019 at 12:11 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I guess I thought of that as a debugging feature and took it out
> because it was too verbose, but maybe it just needs to be controlled
> by the VERBOSE switch.  Do you think we should put that back?

By which I mean: would you like to send a patch?  :-)

Here is a new version of the "Leader:" patch, because cfbot told me
that gcc didn't like it as much as clang.

Attachment

Re: Parallel leader process info in EXPLAIN

From
Rafia Sabih
Date:
On Mon, 4 Nov 2019 at 00:30, Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, Nov 4, 2019 at 12:11 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > I guess I thought of that as a debugging feature and took it out
> > because it was too verbose, but maybe it just needs to be controlled
> > by the VERBOSE switch.  Do you think we should put that back?
>
> By which I mean: would you like to send a patch?  :-)
>
> Here is a new version of the "Leader:" patch, because cfbot told me
> that gcc didn't like it as much as clang.

I was reviewing this patch and here are a few comments,

+static void
+ExplainNodePerProcess(ExplainState *es, bool *opened_group,
+  int worker_number, Instrumentation *instrument)
+{

A small description about this routine would be helpful and will give the file a consistent look.
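
Something along these lines, perhaps (just a suggestion for the wording,
based on my reading of the patch):

/*
 * ExplainNodePerProcess -
 *	  Show the instrumentation collected for one process that executed
 *	  this plan node, labelled "Leader" when worker_number is -1 and
 *	  "Worker N" otherwise.  *opened_group tracks whether the enclosing
 *	  group has already been opened in structured output formats.
 */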

Also, I noticed that the worker details are displayed for the sort node even without verbose, but for scans it is only with verbose. Am I missing something, or is there something behind this? However, I am not sure if this is introduced by this patch set.

--
Regards,
Rafia Sabih

Re: Parallel leader process info in EXPLAIN

From
Thomas Munro
Date:
On Thu, Nov 7, 2019 at 11:37 PM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> I was reviewing this patch and here are a few comments,

Hi Rafia,

Thanks!

> +static void
> +ExplainNodePerProcess(ExplainState *es, bool *opened_group,
> +  int worker_number, Instrumentation *instrument)
> +{
>
> A small description about this routine would be helpful and will give the file a consistent look.

Done for both new functions.  I also improved the commit message for
0001 a bit to explain the change better.

> Also, I noticed that the worker details are displayed for the sort node even without verbose, but for scans it is
> only with verbose. Am I missing something, or is there something behind this? However, I am not sure if this is
> introduced by this patch set.

Yeah, it's a pre-existing thing, but I agree it's an interesting
difference.  We currently don't have a way to show a 'combined'
version of a parallel (oblivious) sort: we always show the per-process
version, and all this patch changes is how we label the leader's
stats.  I suppose someone could argue that in non-VERBOSE mode we
should show the total memory usage (sum from all processes). I suppose
it's possible they use different sort types (one worker runs out of
work_mem and another doesn't), and I'm not sure how you'd
aggregate that.

Attachment

Re: Parallel leader process info in EXPLAIN

From
Melanie Plageman
Date:
Neither patch applies cleanly anymore.
The first patch in the set applied cleanly for me before b925a00f4ef65.

It mostly seems like the default settings for the patch program were
my problem, but, since I noticed that the patch tester bot was also
failing to apply it, I thought I would suggest rebasing it.

I applied it to the sha before b925a00f4ef65 and then cherry-picked it
to master and it applied fine. I attached that rebased patch with a
new version number (is that the preferred way to indicate that it is
newer even if it contains no new content?).

The second patch in the set needs a bit more work to rebase,
which I haven't done yet.

I played around with the first patch in the patchset and very much
appreciate seeing the leader's contribution.
However, I noticed that there were no EXPLAIN diffs in any test files
and just wondered whether this was a conscious choice (even with the
actual numbers xxx'd out, I would have thought that there would be an
EXPLAIN VERBOSE with leader participation somewhere in regress).

--
Melanie Plageman
Attachment

Re: Parallel leader process info in EXPLAIN

From
Melanie Plageman
Date:
So, I think from a code review perspective the code in the patches
LGTM.
As for the EXPLAIN ANALYZE tests--I don't see that many of them in
regress, so maybe that's because they aren't normally very useful.  In
this case, it would only be to protect against regressions in printing
the leader instrumentation, I think.
The problem with that is, even with all the non-deterministic info
redacted, if the leader doesn't participate (which is not guaranteed),
then its stats wouldn't be printed at all and that would cause an
incorrectly failing test case...okay I just talked myself out of the
usefulness of testing this.
So, I would move it to "ready for committer", but, since it is not
applying cleanly, I have changed the status to "waiting on author".

Re: Parallel leader process info in EXPLAIN

From
Thomas Munro
Date:
On Sat, Jan 25, 2020 at 3:39 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> So, I think from a code review perspective the code in the patches
> LGTM.
> As for the EXPLAIN ANALYZE tests--I don't see that many of them in
> regress, so maybe that's because they aren't normally very useful.  In
> this case, it would only be to protect against regressions in printing
> the leader instrumentation, I think.
> The problem with that is, even with all the non-deterministic info
> redacted, if the leader doesn't participate (which is not guaranteed),
> then its stats wouldn't be printed at all and that would cause an
> incorrectly failing test case...okay I just talked myself out of the
> usefulness of testing this.
> So, I would move it to "ready for committer", but, since it is not
> applying cleanly, I have changed the status to "waiting on author".

Hi Melanie,

Thanks for the reviews!

I think I'm going to abandon 0002 for now, because that stuff is being
refactored independently over here, so rebasing would be futile:

https://www.postgresql.org/message-id/flat/CAOtHd0AvAA8CLB9Xz0wnxu1U%3DzJCKrr1r4QwwXi_kcQsHDVU%3DQ%40mail.gmail.com

On that basis, I've set it to ready for committer (meaning 0001 only).
Thanks for the rebase.  I'll let that sit for a couple of days and see
if anything conflicting comes out of that other thread.  It's a fair
complaint that we lack tests that show the new output; I'll think
about adding one too.



Re: Parallel leader process info in EXPLAIN

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> I think I'm going to abandon 0002 for now, because that stuff is being
> refactored independently over here, so rebasing would be futile:
> https://www.postgresql.org/message-id/flat/CAOtHd0AvAA8CLB9Xz0wnxu1U%3DzJCKrr1r4QwwXi_kcQsHDVU%3DQ%40mail.gmail.com

Yeah, your 0002 needs some rethinking.  I kind of like the proposed
change in the text-format output:

          Workers Launched: 4
          ->  Sort (actual rows=2000 loops=15)
                Sort Key: tenk1.ten
-               Sort Method: quicksort  Memory: xxx
+               Leader:  Sort Method: quicksort  Memory: xxx
                Worker 0:  Sort Method: quicksort  Memory: xxx
                Worker 1:  Sort Method: quicksort  Memory: xxx
                Worker 2:  Sort Method: quicksort  Memory: xxx

but it's quite unclear to me how that translates into non-text
formats, especially if we're not to break invariants about which
fields are present in a non-text output structure (cf [1]).

I've occasionally wondered whether we'd be better off presenting
this info as if the leader were "worker 0" and then the N workers
are workers 1 to N.  I've not worked out the implications of that
in any detail though.  It's fairly easy to see what to do for
fields that can be aggregated (the numbers printed for the node
as a whole are totals), but it doesn't help us any with something
like Sort Method.

On a narrower note, I'm not at all happy with the fact that 0001
adds yet another field to *every* PlanState.  I think this is
doubling down on a fundamentally wrong decision to have
ExecParallelRetrieveInstrumentation do some aggregation immediately.
I think we should abandon that and just say that it returns the raw
leader and per-worker data, and then explain.c can aggregate as it
wishes.

            regards, tom lane

[1] https://www.postgresql.org/message-id/19416.1580069629%40sss.pgh.pa.us



Re: Parallel leader process info in EXPLAIN

From
Thomas Munro
Date:
On Mon, Jan 27, 2020 at 11:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I've occasionally wondered whether we'd be better off presenting
> this info as if the leader were "worker 0" and then the N workers
> are workers 1 to N.  I've not worked out the implications of that
> in any detail though.  It's fairly easy to see what to do for
> fields that can be aggregated (the numbers printed for the node
> as a whole are totals), but it doesn't help us any with something
> like Sort Method.

Yeah, in the 0001 patch (which no longer applies and probably just
needs to be rewritten now), I used "Leader:" in the text format, but
worker number -1 in the structured formats, which I expected some
blowback on.  I also thought about adding one to all the numbers as
you suggest.

In PHJ I had a related problem: I had to +1 the worker number to get a
zero-based "participant number" so that the leader would have a slot
in various data structures, and I wondered if we shouldn't just do
that to the whole system (eg not just in explain's output or in
localised bits of PHJ code).
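
Written out, it's just this mapping (hypothetical macro name, only to
illustrate):

/* map the leader (worker number -1) and workers 0..n-1 onto 0..n */
#define ParticipantNumber(worker_number) ((worker_number) + 1)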

> On a narrower note, I'm not at all happy with the fact that 0001
> adds yet another field to *every* PlanState.  I think this is
> doubling down on a fundamentally wrong decision to have
> ExecParallelRetrieveInstrumentation do some aggregation immediately.
> I think we should abandon that and just say that it returns the raw
> leader and per-worker data, and then explain.c can aggregate as it
> wishes.

Fair point.  I will look into that.



Re: Parallel leader process info in EXPLAIN

From
David Steele
Date:
Hi Thomas,

On 1/26/20 7:03 PM, Thomas Munro wrote:
> 
> Fair point.  I will look into that.

Are you still planning on looking at this patch for PG13?

Based on the current state (002 abandoned, 001 needs total rework) I'd 
say it should just be Returned with Feedback or Closed for now.

Regards,
-- 
-David
david@pgmasters.net



Re: Parallel leader process info in EXPLAIN

From
Thomas Munro
Date:
On Tue, Mar 17, 2020 at 2:39 AM David Steele <david@pgmasters.net> wrote:
> On 1/26/20 7:03 PM, Thomas Munro wrote:
> > Fair point.  I will look into that.
>
> Are you still planning on looking at this patch for PG13?
>
> Based on the current state (002 abandoned, 001 needs total rework) I'd
> say it should just be Returned with Feedback or Closed for now.

When you put it like that, yeah :-)  I marked it returned with
feedback.  Thanks Melanie and Rafia for the reviews so far, and I'll
be back with a new version for PG14.



Re: Parallel leader process info in EXPLAIN

From
James Coleman
Date:
On Thu, Nov 7, 2019 at 9:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Thu, Nov 7, 2019 at 11:37 PM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> > ...
> > Also, I noticed that the worker details are displayed for the sort node even without verbose, but for scans it is
> > only with verbose. Am I missing something, or is there something behind this? However, I am not sure if this is
> > introduced by this patch set.
>
> Yeah, it's a pre-existing thing, but I agree it's an interesting
> difference.  We currently don't have a way to show a 'combined'
> version of a parallel (oblivious) sort: we always show the per-process
> version, and all this patch changes is how we label the leader's
> stats.  I suppose someone could argue that in non-VERBOSE mode we
> should show the total memory usage (sum from all processes). I suppose
> it's possible they use different sort types (one worker runs out of
> work_mem and another doesn't), and I'm not sure how you'd
> aggregate that.

Over at [1] (incremental sort patch) I had a similar question, since
each sort node (even non-parallel) can execute multiple tuplesorts.
The approach I took was to show both average and max for both disk and
memory usage as well as all sort strategies used. It looks like this:

   ->  Incremental Sort
         Sort Key: a, b
         Presorted Key: a
         Full-sort Groups: 4 (Methods: quicksort) Memory: 26kB (avg), 26kB (max)
         ->  Index Scan using idx_t_a...

It'd be great if that had a use here too :)

James

[1]: https://www.postgresql.org/message-id/CAAaqYe_ctGqQsauuYS5StPULkES7%3Dt8vNwvEPyzXQdbjAuZ6vA%40mail.gmail.com