Thread: Parallel leader process info in EXPLAIN
Hi,

While working on some slides explaining EXPLAIN, I couldn't resist the
urge to add the missing $SUBJECT.  The attached 0001 patch gives the
following:

 Gather ... time=0.146..33.077 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=4425
   ->  Parallel Seq Scan on public.t ... time=19.421..30.092 rows=0 loops=3)
         Filter: (t.i = 42)
         Rows Removed by Filter: 333333
         Leader: actual time=0.013..32.025 rows=1 loops=1    <--- NEW
           Buffers: shared hit=1546                          <--- NEW
         Worker 0: actual time=29.069..29.069 rows=0 loops=1
           Buffers: shared hit=1126
         Worker 1: actual time=29.181..29.181 rows=0 loops=1
           Buffers: shared hit=1753

Without that, you have to deduce what work was done in the leader, but
I'd rather just show it.

The 0002 patch adjusts Sort for consistency with that scheme, so you get:

 Sort ... time=84.303..122.238 rows=333333 loops=3)
   Output: t1.i
   Sort Key: t1.i
   Leader: Sort Method: external merge Disk: 5864kB    <--- DIFFERENT
   Worker 0: Sort Method: external merge Disk: 3376kB
   Worker 1: Sort Method: external merge Disk: 4504kB
   Leader: actual time=119.624..165.949 rows=426914 loops=1
   Worker 0: actual time=61.239..90.984 rows=245612 loops=1
   Worker 1: actual time=72.046..109.782 rows=327474 loops=1

Without the "Leader" label, it's not really clear to the uninitiated
whether you're looking at combined, average or single process numbers.

Of course there are some more things that could be reported in a
similar way eventually, such as filter counters and hash join details.

For the XML/JSON/YAML formats, I decided to use a <Worker> element with
<Worker-Number>-1</Worker-Number> to indicate the leader.  Perhaps there
should be a <Leader> element instead?  Thoughts?
Attachment
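For the structured formats, a rough sketch of what emitting such a leader entry could look like with explain.c's existing helpers, using -1 as the worker number.  This is illustrative only and may not match the attached 0001 exactly; "instrument" here is assumed to be the leader's Instrumentation, and the TEXT branch mirrors how explain.c indents its existing "Worker N:" lines.

/*
 * Sketch only: print the leader's instrumentation either as a
 * "Leader:" line (TEXT) or as a <Worker> group with number -1.
 */
if (es->format == EXPLAIN_FORMAT_TEXT)
{
    appendStringInfoSpaces(es->str, es->indent * 2);
    appendStringInfo(es->str,
                     "Leader: actual time=%.3f..%.3f rows=%.0f loops=%.0f\n",
                     1000.0 * instrument->startup,
                     1000.0 * instrument->total,
                     instrument->ntuples,
                     instrument->nloops);
}
else
{
    ExplainOpenGroup("Worker", NULL, true, es);
    ExplainPropertyInteger("Worker Number", NULL, -1, es);
    ExplainPropertyFloat("Actual Startup Time", "ms",
                         1000.0 * instrument->startup, 3, es);
    ExplainPropertyFloat("Actual Total Time", "ms",
                         1000.0 * instrument->total, 3, es);
    ExplainPropertyFloat("Actual Rows", NULL, instrument->ntuples, 0, es);
    ExplainPropertyFloat("Actual Loops", NULL, instrument->nloops, 0, es);
    ExplainCloseGroup("Worker", NULL, true, es);
}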
On Wed, Oct 23, 2019 at 12:30 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> While working on some slides explaining EXPLAIN, I couldn't resist the
> urge to add the missing $SUBJECT.  The attached 0001 patch gives the
> following:
>
>  Gather ... time=0.146..33.077 rows=1 loops=1)
>    Workers Planned: 2
>    Workers Launched: 2
>    Buffers: shared hit=4425
>    ->  Parallel Seq Scan on public.t ... time=19.421..30.092 rows=0 loops=3)
>          Filter: (t.i = 42)
>          Rows Removed by Filter: 333333
>          Leader: actual time=0.013..32.025 rows=1 loops=1    <--- NEW
>            Buffers: shared hit=1546                          <--- NEW
>          Worker 0: actual time=29.069..29.069 rows=0 loops=1
>            Buffers: shared hit=1126
>          Worker 1: actual time=29.181..29.181 rows=0 loops=1
>            Buffers: shared hit=1753
>
> Without that, you have to deduce what work was done in the leader, but
> I'd rather just show it.
>
> The 0002 patch adjusts Sort for consistency with that scheme, so you get:
>
>  Sort ... time=84.303..122.238 rows=333333 loops=3)
>    Output: t1.i
>    Sort Key: t1.i
>    Leader: Sort Method: external merge Disk: 5864kB    <--- DIFFERENT
>    Worker 0: Sort Method: external merge Disk: 3376kB
>    Worker 1: Sort Method: external merge Disk: 4504kB
>    Leader: actual time=119.624..165.949 rows=426914 loops=1
>    Worker 0: actual time=61.239..90.984 rows=245612 loops=1
>    Worker 1: actual time=72.046..109.782 rows=327474 loops=1
>
> Without the "Leader" label, it's not really clear to the uninitiated
> whether you're looking at combined, average or single process numbers.
Cool! I dig it.
Checked out the patches a bit and noticed that the tuplesort
instrumentation uses spaceUsed and I saw this comment in
tuplesort_get_stats()
/*
* Note: it might seem we should provide both memory and disk usage for a
* disk-based sort. However, the current code doesn't track memory space
* accurately once we have begun to return tuples to the caller (since we
* don't account for pfree's the caller is expected to do), so we cannot
* rely on availMem in a disk sort. This does not seem worth the overhead
* to fix. Is it worth creating an API for the memory context code to
* tell us how much is actually used in sortcontext?
*/
might it be worth trying out the memory accounting API
5dd7fc1519461548eebf26c33eac6878ea3e8788 here?
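For concreteness, a minimal sketch of what using that API inside tuplesort_get_stats() might look like.  The field names (state->tapeset, state->sortcontext) are assumptions about tuplesort.c's private Tuplesortstate, so treat this as an illustration rather than a drop-in change:

/*
 * Sketch only: report in-memory space via the block-level memory
 * accounting API (commit 5dd7fc1519) instead of the availMem
 * bookkeeping the comment above distrusts.
 */
if (state->tapeset != NULL)
{
    /* disk sort: unchanged, still based on logtape.c block counts */
    stats->spaceType = SORT_SPACE_TYPE_DISK;
    stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
}
else
{
    stats->spaceType = SORT_SPACE_TYPE_MEMORY;
    /* whole-context accounting, unaffected by chunks the caller pfree'd */
    stats->spaceUsed = MemoryContextMemAllocated(state->sortcontext, true) / 1024;
}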
> Of course there are some more things that could be reported in a
> similar way eventually, such as filter counters and hash join details.
This made me think about other explain wishlist items.
For parallel hashjoin, I would find it useful to know which batches
each worker participated in (maybe just probing to start with, but
loading would be great too).
I'm not sure anyone else (especially users) would care about this,
though.
Melanie Plageman
On Wed, Oct 30, 2019 at 9:24 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Checked out the patches a bit and noticed that the tuplesort
> instrumentation uses spaceUsed and I saw this comment in
> tuplesort_get_stats()
>
> might it be worth trying out the memory accounting API
> 5dd7fc1519461548eebf26c33eac6878ea3e8788 here?

I made exactly the same suggestion several years back, not long after
the API was first proposed by Jeff.  However, tuplesort.c has changed a
lot since that time, to the extent that that comment now seems kind of
obsolete.  These days, availMem accounting still isn't used at all for
disk sorts.  Rather, the slab allocator is used.  Virtually all the
memory used to merge is now managed by logtape.c, with only fixed
per-tape memory buffers within tuplesort.c.  This per-tape stuff is
tiny anyway -- slightly more than 1KiB per tape.

It would now be relatively straightforward to report the memory used
by disk-based sorts, without needing to use the memory accounting API.
I imagine that it would report the high watermark memory usage during
the final merge.  It's possible for this to be lower than the high
watermark during initial run generation (e.g. because of the
MaxAllocSize limit in buffer size within logtape.c), but that still
seems like the most useful figure to users.  There'd be a new
"LogicalTapeSetMemory()" function to go along with the existing
LogicalTapeSetBlocks() function, or something along those lines.

Not planning to work on this now, but perhaps you have time for it.

--
Peter Geoghegan
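A rough sketch of such a function, purely for illustration; the member names used here (nTapes, tapes[], buffer_size) are assumptions about logtape.c's private LogicalTapeSet/LogicalTape structs:

/*
 * Hypothetical companion to LogicalTapeSetBlocks(): total memory
 * currently devoted to per-tape buffers.  During the final merge this
 * approximates the sort's memory high watermark.
 */
size_t
LogicalTapeSetMemory(LogicalTapeSet *lts)
{
    size_t      total = 0;
    int         i;

    for (i = 0; i < lts->nTapes; i++)
        total += lts->tapes[i].buffer_size;

    return total;
}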
On Wed, Oct 30, 2019 at 10:39:04AM -0700, Peter Geoghegan wrote:
>On Wed, Oct 30, 2019 at 9:24 AM Melanie Plageman
><melanieplageman@gmail.com> wrote:
>> Checked out the patches a bit and noticed that the tuplesort
>> instrumentation uses spaceUsed and I saw this comment in
>> tuplesort_get_stats()
>
>> might it be worth trying out the memory accounting API
>> 5dd7fc1519461548eebf26c33eac6878ea3e8788 here?
>
>I made exactly the same suggestion several years back, not long after
>the API was first proposed by Jeff. However, tuplesort.c has changed a
>lot since that time, to the extent that that comment now seems kind of
>obsolete. These days, availMem accounting still isn't used at all for
>disk sorts. Rather, the slab allocator is used. Virtually all the
>memory used to merge is now managed by logtape.c, with only fixed
>per-tape memory buffers within tuplesort.c. This per-tape stuff is
>tiny anyway -- slightly more than 1KiB per tape.
>
>It would now be relatively straightforward to report the memory used
>by disk-based sorts, without needing to use the memory accounting API.
>I imagine that it would report the high watermark memory usage during
>the final merge. It's possible for this to be lower than the high
>watermark during initial run generation (e.g. because of the
>MaxAllocSize limit in buffer size within logtape.c), but that still
>seems like the most useful figure to users. There'd be a new
>"LogicalTapeSetMemory()" function to go along with the existing
>LogicalTapeSetBlocks() function, or something along those lines.
>
>Not planning to work on this now, but perhaps you have time for it.

Another thing worth mentioning is that the memory accounting API does
nothing about the pfree() calls, mentioned in the comment.  The memory
is tracked at the block level, so unless the pfree() frees the whole
block (which only really happens for oversized chunks) the accounting
info does not change.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
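To illustrate that point, a contrived backend-only snippet (not from any patch in this thread): freeing an ordinary chunk leaves the accounted total unchanged, while freeing an oversized chunk, which owns its whole block, reduces it.

MemoryContext cxt = AllocSetContextCreate(CurrentMemoryContext,
                                          "accounting demo",
                                          ALLOCSET_DEFAULT_SIZES);
char       *small = MemoryContextAlloc(cxt, 64);        /* lives inside a shared block */
char       *big = MemoryContextAlloc(cxt, 512 * 1024);  /* oversized: gets its own block */
int64       before = MemoryContextMemAllocated(cxt, true);

pfree(small);       /* chunk recycled, block retained: total unchanged */
Assert(MemoryContextMemAllocated(cxt, true) == before);

pfree(big);         /* whole block released: total drops */
Assert(MemoryContextMemAllocated(cxt, true) < before);

MemoryContextDelete(cxt);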
On Thu, Oct 31, 2019 at 5:24 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> On Wed, Oct 23, 2019 at 12:30 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Of course there are some more things that could be reported in a
>> similar way eventually, such as filter counters and hash join details.
>
> This made me think about other explain wishlist items.
> For parallel hashjoin, I would find it useful to know which batches
> each worker participated in (maybe just probing to start with, but
> loading would be great too).
>
> I'm not sure anyone else (especially users) would care about this,
> though.

Yeah, I think that'd be interesting.  At some point in the patch set
when I was working on the batch load balancing strategy I showed the
number of tuples hashed and number of batches probed by each process
(not the actual batch numbers, since that seems a bit over the top):

https://www.postgresql.org/message-id/CAEepm%3D0th8Le2SDCv32zN7tMyCJYR9oGYJ52fXNYJz1hrpGW%2BQ%40mail.gmail.com

I guess I thought of that as a debugging feature and took it out
because it was too verbose, but maybe it just needs to be controlled
by the VERBOSE switch.  Do you think we should put that back?
On Mon, Nov 4, 2019 at 12:11 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I guess I thought of that as a debugging feature and took it out
> because it was too verbose, but maybe it just needs to be controlled
> by the VERBOSE switch.  Do you think we should put that back?

By which I mean: would you like to send a patch? :-)

Here is a new version of the "Leader:" patch, because cfbot told me
that gcc didn't like it as much as clang.
Attachment
On Mon, 4 Nov 2019 at 00:30, Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, Nov 4, 2019 at 12:11 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > I guess I thought of that as a debugging feature and took it out
> > because it was too verbose, but maybe it just needs to be controlled
> > by the VERBOSE switch. Do you think we should put that back?
>
> By which I mean: would you like to send a patch? :-)
>
> Here is a new version of the "Leader:" patch, because cfbot told me
> that gcc didn't like it as much as clang.
I was reviewing this patch and here are a few comments,
+static void
+ExplainNodePerProcess(ExplainState *es, bool *opened_group,
+ int worker_number, Instrumentation *instrument)
+{
A small description about this routine would be helpful and will give the file a consistent look.
Also, I noticed that the worker details are displayed for the sort node
even without verbose, but for scans only with verbose. Am I missing
something, or is there a reason behind this? However, I am not sure if
this is introduced by this patch-set.
Regards,
Rafia Sabih
On Thu, Nov 7, 2019 at 11:37 PM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> I was reviewing this patch and here are a few comments,

Hi Rafia,

Thanks!

> +static void
> +ExplainNodePerProcess(ExplainState *es, bool *opened_group,
> + int worker_number, Instrumentation *instrument)
> +{
>
> A small description about this routine would be helpful and will give
> the file a consistent look.

Done for both new functions.  I also improved the commit message for
0001 a bit to explain the change better.

> Also, I noticed that the worker details are displayed for the sort node
> even without verbose, but for scans only with verbose.  Am I missing
> something, or is there a reason behind this?  However, I am not sure if
> this is introduced by this patch-set.

Yeah, it's a pre-existing thing, but I agree it's an interesting
difference.  We currently don't have a way to show a 'combined' version
of a parallel (oblivious) sort: we always show the per-process version,
and all this patch changes is how we label the leader's stats.  I
suppose someone could argue that in non-VERBOSE mode we should show the
total memory usage (sum from all processes).  I suppose it's possible
they use different sort types (one worker runs out of work_mem and
another doesn't), and I'm not sure how you'd aggregate that.
Attachment
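To make the "sum from all processes" idea concrete, a sketch of how explain.c's sort-reporting code could total space across the leader and workers.  The function name and property labels are made up, and real code would still have to cope with the mixed sort-type case mentioned above:

/*
 * Sketch only: total sort space used by the leader and all workers,
 * keeping memory and disk figures separate.
 */
static void
show_sort_space_totals(SortState *sortstate, ExplainState *es)
{
    int64       mem_kb = 0;
    int64       disk_kb = 0;

    /* leader's own sort, if it ran one */
    if (sortstate->sort_Done && sortstate->tuplesortstate != NULL)
    {
        TuplesortInstrumentation stats;

        tuplesort_get_stats((Tuplesortstate *) sortstate->tuplesortstate, &stats);
        if (stats.spaceType == SORT_SPACE_TYPE_DISK)
            disk_kb += stats.spaceUsed;
        else
            mem_kb += stats.spaceUsed;
    }

    /* workers' sorts, as gathered by ExecSortRetrieveInstrumentation() */
    if (sortstate->shared_info != NULL)
    {
        int         n;

        for (n = 0; n < sortstate->shared_info->num_workers; n++)
        {
            TuplesortInstrumentation *s = &sortstate->shared_info->sinstrument[n];

            if (s->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
                continue;       /* this worker never ran its sort */
            if (s->spaceType == SORT_SPACE_TYPE_DISK)
                disk_kb += s->spaceUsed;
            else
                mem_kb += s->spaceUsed;
        }
    }

    if (mem_kb > 0)
        ExplainPropertyInteger("Total Sort Memory Used", "kB", mem_kb, es);
    if (disk_kb > 0)
        ExplainPropertyInteger("Total Sort Disk Used", "kB", disk_kb, es);
}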
Neither patch applies cleanly anymore.
The first patch in the set applied cleanly for me before b925a00f4ef65
It mostly seems like the default settings for the patch program were
my problem, but, since I noticed that the patch tester bot was failing
to apply it also, I thought I would suggest rebasing it.
I applied it to the sha before b925a00f4ef65 and then cherry-picked it
to master and it applied fine. I attached that rebased patch with a
new version number (is that the preferred way to indicate that it is
newer even if it contains no new content?).
The second patch in the set needed a bit more looking at to rebase,
which I didn't do yet.
I played around with the first patch in the patchset and very much
appreciate seeing the leader's contribution.
However, I noticed that there were no EXPLAIN diffs in any test files
and just wondered if this was a conscious choice (even with the actual
numbers xxx'd out, I would have thought that there would be an EXPLAIN
VERBOSE with leader participation somewhere in regress).
--
Melanie Plageman
Attachment
So, I think from a code review perspective the code in the patches
LGTM.
As for the EXPLAIN ANALYZE tests--I don't see that many of them in
regress, so maybe that's because they aren't normally very useful. In
this case, it would only be to protect against regressions in printing
the leader instrumentation, I think.
The problem with that is, even with all the non-deterministic info
redacted, if the leader doesn't participate (which is not guaranteed),
then its stats wouldn't be printed at all and that would cause an
incorrectly failing test case...okay I just talked myself out of the
usefulness of testing this.
So, I would move it to "ready for committer", but, since it is not
applying cleanly, I have changed the status to "waiting on author".
On Sat, Jan 25, 2020 at 3:39 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> So, I think from a code review perspective the code in the patches
> LGTM.
>
> As for the EXPLAIN ANALYZE tests--I don't see that many of them in
> regress, so maybe that's because they aren't normally very useful. In
> this case, it would only be to protect against regressions in printing
> the leader instrumentation, I think.
>
> The problem with that is, even with all the non-deterministic info
> redacted, if the leader doesn't participate (which is not guaranteed),
> then its stats wouldn't be printed at all and that would cause an
> incorrectly failing test case...okay I just talked myself out of the
> usefulness of testing this.
>
> So, I would move it to "ready for committer", but, since it is not
> applying cleanly, I have changed the status to "waiting on author".

Hi Melanie,

Thanks for the reviews!

I think I'm going to abandon 0002 for now, because that stuff is being
refactored independently over here, so rebasing would be futile:

https://www.postgresql.org/message-id/flat/CAOtHd0AvAA8CLB9Xz0wnxu1U%3DzJCKrr1r4QwwXi_kcQsHDVU%3DQ%40mail.gmail.com

On that basis, I've set it to ready for committer (meaning 0001 only).
Thanks for the rebase.  I'll let that sit for a couple of days and see
if anything conflicting comes out of that other thread.  It's a fair
complaint that we lack tests that show the new output; I'll think about
adding one too.
Thomas Munro <thomas.munro@gmail.com> writes:
> I think I'm going to abandon 0002 for now, because that stuff is being
> refactored independently over here, so rebasing would be futile:
> https://www.postgresql.org/message-id/flat/CAOtHd0AvAA8CLB9Xz0wnxu1U%3DzJCKrr1r4QwwXi_kcQsHDVU%3DQ%40mail.gmail.com

Yeah, your 0002 needs some rethinking.  I kind of like the proposed
change in the text-format output:

          Workers Launched: 4
          ->  Sort (actual rows=2000 loops=15)
                Sort Key: tenk1.ten
-               Sort Method: quicksort Memory: xxx
+               Leader: Sort Method: quicksort Memory: xxx
                Worker 0: Sort Method: quicksort Memory: xxx
                Worker 1: Sort Method: quicksort Memory: xxx
                Worker 2: Sort Method: quicksort Memory: xxx

but it's quite unclear to me how that translates into non-text formats,
especially if we're not to break invariants about which fields are
present in a non-text output structure (cf [1]).

I've occasionally wondered whether we'd be better off presenting
this info as if the leader were "worker 0" and then the N workers
are workers 1 to N.  I've not worked out the implications of that
in any detail though.  It's fairly easy to see what to do for
fields that can be aggregated (the numbers printed for the node
as a whole are totals), but it doesn't help us any with something
like Sort Method.

On a narrower note, I'm not at all happy with the fact that 0001
adds yet another field to *every* PlanState.  I think this is
doubling down on a fundamentally wrong decision to have
ExecParallelRetrieveInstrumentation do some aggregation immediately.
I think we should abandon that and just say that it returns the raw
leader and per-worker data, and then explain.c can aggregate as it
wishes.

			regards, tom lane

[1] https://www.postgresql.org/message-id/19416.1580069629%40sss.pgh.pa.us
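A sketch of what that could look like on the explain.c side, using the existing InstrAggNode() helper; "leader_instrument" is a hypothetical PlanState field holding the leader's un-aggregated numbers:

/*
 * Sketch only: if ExecParallelRetrieveInstrumentation returned raw data,
 * explain.c could build the node-level totals itself while keeping the
 * per-process numbers around for the "Leader:" / "Worker N:" lines.
 */
Instrumentation total = *planstate->leader_instrument;  /* hypothetical field */

if (planstate->worker_instrument != NULL)
{
    WorkerInstrumentation *wi = planstate->worker_instrument;
    int         n;

    for (n = 0; n < wi->num_workers; n++)
    {
        Instrumentation *winstr = &wi->instrument[n];

        if (winstr->nloops <= 0)
            continue;           /* this worker never ran the node */

        InstrAggNode(&total, winstr);   /* fold into node-level totals */

        /* ...and also print winstr under "Worker n", as today... */
    }
}

/* "total" now plays the role of planstate->instrument for the node line */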
On Mon, Jan 27, 2020 at 11:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I've occasionally wondered whether we'd be better off presenting
> this info as if the leader were "worker 0" and then the N workers
> are workers 1 to N.  I've not worked out the implications of that
> in any detail though.  It's fairly easy to see what to do for
> fields that can be aggregated (the numbers printed for the node
> as a whole are totals), but it doesn't help us any with something
> like Sort Method.

Yeah, in the 0001 patch (which no longer applies and probably just
needs to be rewritten now), I used "Leader:" in the text format, but
worker number -1 in the structured formats, which I expected some
blowback on.  I also thought about adding one to all the numbers as you
suggest.  In PHJ I had a related problem: I had to +1 the worker number
to get a zero-based "participant number" so that the leader would have
a slot in various data structures, and I wondered if we shouldn't just
do that to the whole system (eg not just in explain's output or in
localised bits of PHJ code).

> On a narrower note, I'm not at all happy with the fact that 0001
> adds yet another field to *every* PlanState.  I think this is
> doubling down on a fundamentally wrong decision to have
> ExecParallelRetrieveInstrumentation do some aggregation immediately.
> I think we should abandon that and just say that it returns the raw
> leader and per-worker data, and then explain.c can aggregate as it
> wishes.

Fair point.  I will look into that.
Hi Thomas,

On 1/26/20 7:03 PM, Thomas Munro wrote:
>
> Fair point.  I will look into that.

Are you still planning on looking at this patch for PG13?

Based on the current state (002 abandoned, 001 needs total rework) I'd
say it should just be Returned with Feedback or Closed for now.

Regards,
--
-David
david@pgmasters.net
On Tue, Mar 17, 2020 at 2:39 AM David Steele <david@pgmasters.net> wrote:
> On 1/26/20 7:03 PM, Thomas Munro wrote:
> > Fair point.  I will look into that.
>
> Are you still planning on looking at this patch for PG13?
>
> Based on the current state (002 abandoned, 001 needs total rework) I'd
> say it should just be Returned with Feedback or Closed for now.

When you put it like that, yeah :-)  I marked it returned with
feedback.  Thanks Melanie and Rafia for the reviews so far, and I'll be
back with a new version for PG14.
On Thu, Nov 7, 2019 at 9:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Thu, Nov 7, 2019 at 11:37 PM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> >
> > ...
> > Also, I noticed that the worker details are displayed for the sort node
> > even without verbose, but for scans only with verbose.  Am I missing
> > something, or is there a reason behind this?  However, I am not sure if
> > this is introduced by this patch-set.
>
> Yeah, it's a pre-existing thing, but I agree it's an interesting
> difference.  We currently don't have a way to show a 'combined'
> version of a parallel (oblivious) sort: we always show the per-process
> version, and all this patch changes is how we label the leader's
> stats.  I suppose someone could argue that in non-VERBOSE mode we
> should show the total memory usage (sum from all processes).  I suppose
> it's possible they use different sort types (one worker runs out of
> work_mem and another doesn't), and I'm not sure how you'd
> aggregate that.

Over at [1] (incremental sort patch) I had a similar question, since
each sort node (even non-parallel) can execute multiple tuplesorts.  The
approach I took was to show both average and max for both disk and
memory usage as well as all sort strategies used.  It looks like this:

   ->  Incremental Sort
         Sort Key: a, b
         Presorted Key: a
         Full-sort Groups: 4 (Methods: quicksort) Memory: 26kB (avg), 26kB (max)
         ->  Index Scan using idx_t_a...

It'd be great if that had a use here too :)

James

[1]: https://www.postgresql.org/message-id/CAAaqYe_ctGqQsauuYS5StPULkES7%3Dt8vNwvEPyzXQdbjAuZ6vA%40mail.gmail.com