Thread: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE
Hi hackers,

EXPLAIN ANALYZE for parallel Bitmap Heap Scans currently only reports the
number of heap blocks processed by the leader. It's missing the per-worker
stats. The attached patch adds that functionality in the spirit of e.g. Sort
or Memoize. Here is a simple test case and the EXPLAIN ANALYZE output with
and without the patch:

create table foo(col0 int, col1 int);
insert into foo select generate_series(1, 1000, 0.001), generate_series(1000, 2000, 0.001);
create index idx0 on foo(col0);
create index idx1 on foo(col1);
set parallel_tuple_cost = 0;
set parallel_setup_cost = 0;
explain (analyze, costs off, timing off) select * from foo where col0 > 900 or col1 = 1;

With the patch:

Gather (actual rows=99501 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Bitmap Heap Scan on foo (actual rows=33167 loops=3)
        Recheck Cond: ((col0 > 900) OR (col1 = 1))
        Heap Blocks: exact=98
        Worker 0: Heap Blocks: exact=171 lossy=0
        Worker 1: Heap Blocks: exact=172 lossy=0
        ->  BitmapOr (actual rows=0 loops=1)
              ->  Bitmap Index Scan on idx0 (actual rows=99501 loops=1)
                    Index Cond: (col0 > 900)
              ->  Bitmap Index Scan on idx1 (actual rows=0 loops=1)
                    Index Cond: (col1 = 1)

Without the patch:

Gather (actual rows=99501 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Bitmap Heap Scan on foo (actual rows=33167 loops=3)
        Recheck Cond: ((col0 > 900) OR (col1 = 1))
        Heap Blocks: exact=91
        ->  BitmapOr (actual rows=0 loops=1)
              ->  Bitmap Index Scan on idx0 (actual rows=99501 loops=1)
                    Index Cond: (col0 > 900)
              ->  Bitmap Index Scan on idx1 (actual rows=0 loops=1)
                    Index Cond: (col1 = 1)

So in total the parallel Bitmap Heap Scan actually processed 441 heap blocks
instead of just 91.

In shared memory, two variable-length arrays would now be needed: one for the
snapshot and one for the stats. Since a struct can only have a single
flexible array member, that obviously doesn't work, so I now use a single,
big variable-length array and added functions to retrieve pointers to the
respective fields. I'm using MAXALIGN() to make sure the latter field is
aligned properly. Am I doing this correctly? I'm not entirely sure about the
alignment conventions and the requirements of other platforms.

I couldn't find existing tests that exercise the EXPLAIN ANALYZE output of
specific nodes. I could only find a few basic smoke tests for EXPLAIN ANALYZE
with parallel nodes in parallel_select.sql. Do we want tests for the changed
functionality? If so, I could right away also add tests for EXPLAIN ANALYZE
covering other parallel nodes.

Thank you for your feedback.

--
David Geier
(ServiceNow)
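To make the layout question concrete, here is roughly what is meant; this is
only an illustrative sketch, and the struct, field and helper names below are
made up rather than taken from the attached patch:

typedef struct BitmapHeapScanInstrumentationSketch
{
	long		exact_pages;
	long		lossy_pages;
} BitmapHeapScanInstrumentationSketch;

typedef struct ParallelBitmapHeapStateSketch
{
	/* ... fixed-size coordination fields ... */
	Size		phs_snapshot_len;	/* length of the serialized snapshot */
	char		phs_snapshot_and_stats[FLEXIBLE_ARRAY_MEMBER];
} ParallelBitmapHeapStateSketch;

/* pointer to the serialized snapshot, at the start of the variable-length area */
static inline char *
sketch_snapshot_ptr(ParallelBitmapHeapStateSketch *pstate)
{
	return pstate->phs_snapshot_and_stats;
}

/* pointer to the per-worker stats, MAXALIGN'ed past the snapshot */
static inline BitmapHeapScanInstrumentationSketch *
sketch_stats_ptr(ParallelBitmapHeapStateSketch *pstate)
{
	return (BitmapHeapScanInstrumentationSketch *)
		(pstate->phs_snapshot_and_stats + MAXALIGN(pstate->phs_snapshot_len));
}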
Attachment
Hi,

On 1/20/23 09:34, David Geier wrote:
> EXPLAIN ANALYZE for parallel Bitmap Heap Scans currently only reports
> the number of heap blocks processed by the leader. It's missing the
> per-worker stats. The attached patch adds that functionality in the
> spirit of e.g. Sort or Memoize. Here is a simple test case and the
> EXPLAIN ANALYZE output with and without the patch:

Attached is a rebased version of the patch. I would appreciate someone
taking a look.

As background: the change doesn't come out of thin air. We repeatedly drew
wrong conclusions in our query analysis because we assumed that the reported
block counts include the workers.

If no one objects, I would also register the patch at the commit fest. The
patch is passing cleanly on CI.

--
David Geier
(ServiceNow)
Attachment
> On Tue, Feb 21, 2023 at 01:02:35PM +0100, David Geier wrote:
> Hi,
>
> On 1/20/23 09:34, David Geier wrote:
> > EXPLAIN ANALYZE for parallel Bitmap Heap Scans currently only reports
> > the number of heap blocks processed by the leader. It's missing the
> > per-worker stats. The attached patch adds that functionality in the
> > spirit of e.g. Sort or Memoize. Here is a simple test case and the
> > EXPLAIN ANALYZE output with and without the patch:
>
> Attached is a rebased version of the patch. I would appreciate someone
> taking a look.
>
> As background: the change doesn't come out of thin air. We repeatedly drew
> wrong conclusions in our query analysis because we assumed that the
> reported block counts include the workers.
>
> If no one objects, I would also register the patch at the commit fest. The
> patch is passing cleanly on CI.

Thanks for the patch. The idea sounds reasonable to me, but I have to admit
the snapshot_and_stats implementation looks awkward. Maybe it would be
better to have a separate structure field for each of the stats and the
snapshot, which would be set to point to the corresponding place in the
shared FAM, e.g. when the worker is getting initialized? shm_toc_allocate
mentions BUFFERALIGN to handle the possibility of some atomic operations
needing it, so I guess that would have to be the alignment in this case as
well.

Probably another option would be to allocate two separate pieces of shared
memory, which resolves questions like proper alignment, but annoyingly would
require an extra lookup and a new key.
Hi Dmitry,

Thanks for looking at the patch, and sorry for the long line.

On 3/17/23 21:14, Dmitry Dolgov wrote:
> The idea sounds reasonable to me, but I have to admit the
> snapshot_and_stats implementation looks awkward. Maybe it would be better
> to have a separate structure field for each of the stats and the snapshot,
> which would be set to point to the corresponding place in the shared FAM,
> e.g. when the worker is getting initialized? shm_toc_allocate mentions
> BUFFERALIGN to handle the possibility of some atomic operations needing
> it, so I guess that would have to be the alignment in this case as well.
>
> Probably another option would be to allocate two separate pieces of shared
> memory, which resolves questions like proper alignment, but annoyingly
> would require an extra lookup and a new key.

I considered the other options, and it seems to me none of them is
particularly superior. All of them have pros and cons, with the cons mostly
outweighing the pros. Let me quickly elaborate:

1. Use multiple shm_toc entries: The shared state is split into multiple
pieces, and extra pointers in BitmapHeapScanState are needed to point at the
split-out data. BitmapHeapScanState already has a shared_info member, which
is not a pointer to the shared memory but a pointer to leader-local data
allocated to store the instrumentation data from the workers. This is
confusing, but at least consistent with how it's done in other places (e.g.
nodeSort.c, nodeHash.c, nodeIncrementalSort.c). Having another pointer there
which points to the shared memory makes it even more confusing. If we go
this way we would have e.g. shared_info_copy and shared_info members in
BitmapHeapScanState.

2. Store two extra pointers to the shared FAM entries in BitmapHeapScanState:
IMHO, that is the better alternative to (1), as it doesn't need an extra TOC
entry, but it comes with the same confusion of multiple pointers to
SharedBitmapHeapScanInfo in BitmapHeapScanState. But maybe that's not too
bad?

3. Solution in the initial patch (use two functions to obtain pointers
where/when needed): Avoids the need for another pointer in
BitmapHeapScanState at the cost / ugliness of having to call the helper
functions.

Another, not yet discussed, option I can see working is:

4. Allocate a fixed amount of memory for the instrumentation stats based on
MAX_PARALLEL_WORKER_LIMIT: MAX_PARALLEL_WORKER_LIMIT is 1024 and is used as
the limit of the max_parallel_workers GUC. This way
MAX_PARALLEL_WORKER_LIMIT * sizeof(BitmapHeapScanInstrumentation) = 1024 * 8
= 8192 bytes would be allocated. To cut this down by half we could
additionally change the type of lossy_pages and exact_pages from long to
uint32. Only the memory actually needed would have to be initialized; the
remaining unused memory would stay untouched so as not to waste cycles.

My first preference is the new option (4). My second preference is option
(1). What's your take?

--
David Geier
(ServiceNow)
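For concreteness, option (4) would amount to something like the following
sketch, with the uint32 narrowing mentioned above; the names are illustrative
only and not taken from any posted patch:

typedef struct BitmapHeapScanInstrumentationSketch
{
	uint32		exact_pages;
	uint32		lossy_pages;
} BitmapHeapScanInstrumentationSketch;

typedef struct SharedBitmapHeapScanInfoSketch
{
	/* only the first num_workers entries ever get initialized */
	int			num_workers;
	BitmapHeapScanInstrumentationSketch sinstrument[MAX_PARALLEL_WORKER_LIMIT];
} SharedBitmapHeapScanInfoSketch;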
> On Wed, Sep 20, 2023 at 03:42:43PM +0200, David Geier wrote:
> Another, not yet discussed, option I can see working is:
>
> 4. Allocate a fixed amount of memory for the instrumentation stats based on
> MAX_PARALLEL_WORKER_LIMIT: MAX_PARALLEL_WORKER_LIMIT is 1024 and is used as
> the limit of the max_parallel_workers GUC. This way
> MAX_PARALLEL_WORKER_LIMIT * sizeof(BitmapHeapScanInstrumentation) = 1024 * 8
> = 8192 bytes would be allocated. To cut this down by half we could
> additionally change the type of lossy_pages and exact_pages from long to
> uint32. Only the memory actually needed would have to be initialized; the
> remaining unused memory would stay untouched so as not to waste cycles.

I'm not sure that would be acceptable -- if I understand correctly it would
be 8192 bytes per parallel bitmap heap scan node, which could be noticeable
in the worst-case scenario with too many connections and too many such nodes
in every query.

I find the original approach with an offset not that bad; after all, there
is something similar going on in other places, e.g. parallel heap scan also
has phs_snapshot_off (although the rest is fixed-size). My commentary above
in the thread was mostly about the cosmetic side: give snapshot_and_stats a
decent name, maybe even ditch the access functions and just use the offset
field in the couple of places it's needed, and it would look better to me.
From: Michael Christofides
> EXPLAIN ANALYZE for parallel Bitmap Heap Scans currently only reports
> the number of heap blocks processed by the leader. It's missing the
> per-worker stats.
Hi David,
According to the docs[1]: "In a parallel bitmap heap scan, one process is chosen as the leader. That process performs a scan of one or more indexes and builds a bitmap indicating which table blocks need to be visited. These blocks are then divided among the cooperating processes as in a parallel sequential scan."
My understanding is that the "Heap Blocks" statistic is only reporting blocks for the bitmap (i.e. not the subsequent scan). As such, I think it is correct that the workers do not report additional exact heap blocks.
> explain (analyze, costs off, timing off) select * from foo where col0 >
> 900 or col1 = 1;
In your example, if you add the buffers and verbose parameters, do the worker-reported buffer numbers show what you are looking for?
i.e. explain (analyze, buffers, verbose, costs off, timing off) select * from foo where col0 > 900 or col1 = 1;
—
Michael Christofides
Founder, pgMustard
[1]: https://www.postgresql.org/docs/current/parallel-plans.html#PARALLEL-SCANS
On Mon, Oct 16, 2023 at 12:31 PM Michael Christofides <michael@pgmustard.com> wrote:
> According to the docs[1]: "In a parallel bitmap heap scan, one process is
> chosen as the leader. That process performs a scan of one or more indexes
> and builds a bitmap indicating which table blocks need to be visited.
> These blocks are then divided among the cooperating processes as in a
> parallel sequential scan."
>
> My understanding is that the "Heap Blocks" statistic is only reporting
> blocks for the bitmap (i.e. not the subsequent scan). As such, I think it
> is correct that the workers do not report additional exact heap blocks.

I think you're wrong about that. The bitmap index scans are what scan the
indexes and build the bitmap. The bitmap heap scan node is what scans the
heap i.e. the table, and that is what is divided across the workers.

On the patch itself, snapshot_and_stats doesn't strike me as a great name.
If we added three more variable-length things would we call the member
snapshot_and_stats_and_pink_and_orange_and_blue? Probably better to pick a
name that is somehow more generic.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Jan 20, 2023 at 2:04 PM David Geier <geidav.pg@gmail.com> wrote:
>
> Hi hackers,
>
> EXPLAIN ANALYZE for parallel Bitmap Heap Scans currently only reports
> the number of heap blocks processed by the leader. It's missing the
> per-worker stats. The attached patch adds that functionality in the
> spirit of e.g. Sort or Memoize. Here is a simple test case and the
> EXPLAIN ANALYZE output with and without the patch:
>
> With the patch:
>
> Gather (actual rows=99501 loops=1)
>   Workers Planned: 2
>   Workers Launched: 2
>   ->  Parallel Bitmap Heap Scan on foo (actual rows=33167 loops=3)
>         Recheck Cond: ((col0 > 900) OR (col1 = 1))
>         Heap Blocks: exact=98
>         Worker 0: Heap Blocks: exact=171 lossy=0
>         Worker 1: Heap Blocks: exact=172 lossy=0

 else
 {
+	if (planstate->stats.exact_pages > 0)
+		appendStringInfo(es->str, " exact=%ld", planstate->stats.exact_pages);
+	if (planstate->stats.lossy_pages > 0)
+		appendStringInfo(es->str, " lossy=%ld", planstate->stats.lossy_pages);
 	appendStringInfoChar(es->str, '\n');
 }
 }
....
+	for (int n = 0; n < planstate->shared_info->num_workers; n++)
+	{
....
+					 "Heap Blocks: exact="UINT64_FORMAT" lossy=" INT64_FORMAT"\n",
+					 si->exact_pages, si->lossy_pages);

Shouldn't we use the same format for reporting exact and lossy pages for the
actual backend and the worker? I mean here for the backend you are showing
lossy pages only if it is > 0 whereas for workers we are showing 0 lossy
pages as well.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi David,

Do you plan to continue working on this patch? I did take a look, and on the
whole it looks reasonable - it modifies the right places etc.

I think there are two things that may need an improvement:

1) Storing variable-length data in ParallelBitmapHeapState

I agree with Robert the snapshot_and_stats name is not great. I see Dmitry
mentioned phs_snapshot_off as used by ParallelTableScanDescData - the reasons
are somewhat different (phs_snapshot_off exists because we don't know which
exact struct will be allocated), while here we simply need to allocate two
variable-length pieces of memory. But it seems like it would work nicely for
this. That is, we could try adding an offset for each of those pieces of
memory:

- snapshot_off
- stats_off

I don't like the GetSharedSnapshotData name very much, it seems very close
to GetSnapshotData - quite confusing, I think.

Dmitry also suggested we might add a separate piece of shared memory. I
don't quite see how that would work for ParallelBitmapHeapState, but I doubt
it'd be simpler than having two offsets. I don't think the extra complexity
(paid by everyone) would be worth it just to make EXPLAIN ANALYZE work.

2) Leader vs. worker counters

It seems to me this does nothing to add the per-worker values from "Heap
Blocks" into the leader, which means we get stuff like this:

Heap Blocks: exact=102 lossy=10995
Worker 0: actual time=50.559..209.773 rows=215253 loops=1
Heap Blocks: exact=207 lossy=19354
Worker 1: actual time=50.543..211.387 rows=162934 loops=1
Heap Blocks: exact=161 lossy=14636

I think this is wrong / confusing, and inconsistent with what we do for
other nodes. It's also inconsistent with how we deal e.g. with BUFFERS,
where we *do* add the values to the leader:

Heap Blocks: exact=125 lossy=10789
Buffers: shared hit=11 read=45420
Worker 0: actual time=51.419..221.904 rows=150437 loops=1
Heap Blocks: exact=136 lossy=13541
Buffers: shared hit=4 read=13541
Worker 1: actual time=56.610..222.469 rows=229738 loops=1
Heap Blocks: exact=209 lossy=20655
Buffers: shared hit=4 read=20655

Here it's not entirely obvious, because leader participates in the
execution, but once we disable leader participation, it's clearer:

Buffers: shared hit=7 read=45421
Worker 0: actual time=28.540..247.683 rows=309112 loops=1
Heap Blocks: exact=282 lossy=27806
Buffers: shared hit=4 read=28241
Worker 1: actual time=24.290..251.993 rows=190815 loops=1
Heap Blocks: exact=188 lossy=17179
Buffers: shared hit=3 read=17180

Not only is "Buffers" clearly a sum of per-worker stats, but the "Heap
Blocks" simply disappeared because the leader does nothing and we don't
print zeros.

3) I'm not sure the handling of the various EXPLAIN flags is entirely
correct. Consider this:

EXPLAIN (ANALYZE):

-> Parallel Bitmap Heap Scan on t (...)
Recheck Cond: (a < 5000)
Rows Removed by Index Recheck: 246882
Worker 0: Heap Blocks: exact=168 lossy=15648
Worker 1: Heap Blocks: exact=302 lossy=29337

EXPLAIN (ANALYZE, VERBOSE):

-> Parallel Bitmap Heap Scan on public.t (...)
Recheck Cond: (t.a < 5000)
Rows Removed by Index Recheck: 246882
Worker 0: actual time=35.067..300.882 rows=282108 loops=1
Heap Blocks: exact=257 lossy=25358
Worker 1: actual time=32.827..302.224 rows=217819 loops=1
Heap Blocks: exact=213 lossy=19627

EXPLAIN (ANALYZE, BUFFERS):

-> Parallel Bitmap Heap Scan on t (...)
Recheck Cond: (a < 5000)
Rows Removed by Index Recheck: 246882
Buffers: shared hit=7 read=45421
Worker 0: Heap Blocks: exact=236 lossy=21870
Worker 1: Heap Blocks: exact=234 lossy=23115

EXPLAIN (ANALYZE, VERBOSE, BUFFERS):

-> Parallel Bitmap Heap Scan on public.t (...)
Recheck Cond: (t.a < 5000)
Rows Removed by Index Recheck: 246882
Buffers: shared hit=7 read=45421
Worker 0: actual time=28.265..260.381 rows=261264 loops=1
Heap Blocks: exact=260 lossy=23477
Buffers: shared hit=3 read=23478
Worker 1: actual time=28.224..261.627 rows=238663 loops=1
Heap Blocks: exact=210 lossy=21508
Buffers: shared hit=4 read=21943

Why should the per-worker buffer info be shown when combined with the
VERBOSE flag, and not just with BUFFERS, when the patch shows the per-worker
info always?

4) Now that I think about this, isn't the *main* problem really that we
don't display the sum of the per-worker stats (which I think is wrong)? I
mean, we already can get the worker details with VERBOSE, right? So the only
reason to display that by default seems to be that the values in "Heap
Blocks" are from the leader only.

BTW doesn't this also suggest some of the code added to explain.c may not be
quite necessary? Wouldn't it be enough to just "extend" the existing code
printing per-worker stats? (I haven't tried, so maybe I'm wrong and we need
the new code.)

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
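For illustration, folding the per-worker values into the node's own totals
might look roughly like the sketch below; the field names only loosely follow
the posted patch and are not taken from it verbatim:

static void
sum_heap_block_counters(BitmapHeapScanState *planstate,
						long *exact_total, long *lossy_total)
{
	/* start from the leader's (or non-parallel scan's) own counters */
	*exact_total = planstate->stats.exact_pages;
	*lossy_total = planstate->stats.lossy_pages;

	if (planstate->shared_info == NULL)
		return;

	/* add the counters each worker published in shared memory */
	for (int n = 0; n < planstate->shared_info->num_workers; n++)
	{
		*exact_total += planstate->shared_info->sinstrument[n].exact_pages;
		*lossy_total += planstate->shared_info->sinstrument[n].lossy_pages;
	}
}

The "Heap Blocks" line would then print these totals, with the per-worker
breakdown optionally shown below it.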
From: Melanie Plageman
On Sat, Feb 17, 2024 at 5:31 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Hi David,
>
> Do you plan to work continue working on this patch? I did take a look,
> and on the whole it looks reasonable - it modifies the right places etc.

I haven't started reviewing this patch yet, but I just ran into the behavior
that it fixes and was very outraged. +10000 to fixing this.

- Melanie
From: Heikki Linnakangas
On 18/02/2024 00:31, Tomas Vondra wrote:
> Do you plan to work continue working on this patch? I did take a look,
> and on the whole it looks reasonable - it modifies the right places etc.

+1

> I think there are two things that may need an improvement:
>
> 1) Storing variable-length data in ParallelBitmapHeapState
>
> I agree with Robert the snapshot_and_stats name is not great. I see
> Dmitry mentioned phs_snapshot_off as used by ParallelTableScanDescData -
> the reasons are somewhat different (phs_snapshot_off exists because we
> don't know which exact struct will be allocated), while here we simply
> need to allocate two variable-length pieces of memory. But it seems like
> it would work nicely for this. That is, we could try adding an offset
> for each of those pieces of memory:
>
> - snapshot_off
> - stats_off
>
> I don't like the GetSharedSnapshotData name very much, it seems very
> close to GetSnapshotData - quite confusing, I think.
>
> Dmitry also suggested we might add a separate piece of shared memory. I
> don't quite see how that would work for ParallelBitmapHeapState, but I
> doubt it'd be simpler than having two offsets. I don't think the extra
> complexity (paid by everyone) would be worth it just to make EXPLAIN
> ANALYZE work.

I just removed phs_snapshot_data in commit 84c18acaf6. I thought that would
make this moot, but now that I rebased this, there are still some aesthetic
questions on how best to represent this.

In all the other node types that use shared instrumentation like this, the
pattern is as follows: (using Memoize here as an example, but it's similar
for Sort, IncrementalSort, Agg and Hash)

/* ----------------
 *	 Shared memory container for per-worker memoize information
 * ----------------
 */
typedef struct SharedMemoizeInfo
{
	int			num_workers;
	MemoizeInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
} SharedMemoizeInfo;

/* this struct is backend-private */
typedef struct MemoizeState
{
	ScanState	ss;				/* its first field is NodeTag */
	...
	MemoizeInstrumentation stats;	/* execution statistics */
	SharedMemoizeInfo *shared_info; /* statistics for parallel workers */
} MemoizeState;

While the scan is running, the node updates its private data in
MemoizeState->stats. At the end of a parallel scan, the worker process
copies the MemoizeState->stats to MemoizeState->shared_info->stats, which
lives in shared memory. The leader process copies
MemoizeState->shared_info->stats to its own backend-private copy, which it
then stores in its MemoizeState->shared_info, replacing the pointer to the
shared memory with a pointer to the private copy. That happens in
ExecMemoizeRetrieveInstrumentation().

This is a little different for parallel bitmap heap scans, because a bitmap
heap scan keeps some other data in shared memory too, not just
instrumentation data. Also, the naming is inconsistent: the equivalent of
SharedMemoizeInfo is actually called ParallelBitmapHeapState. I think we
should rename it to SharedBitmapHeapInfo, to make it clear that it lives in
shared memory, but I digress.

We could now put the new stats at the end of ParallelBitmapHeapState as a
varlen field. But I'm not sure that's a good idea. In
ExecBitmapHeapRetrieveInstrumentation(), would we make a backend-private
copy of the whole ParallelBitmapHeapState struct, even though the other
fields don't make sense after the shared memory is released? Sounds
confusing.

Or we could introduce a separate struct for the stats, and copy just that:

typedef struct SharedBitmapHeapInstrumentation
{
	int			num_workers;
	BitmapHeapScanInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
} SharedBitmapHeapInstrumentation;

typedef struct BitmapHeapScanState
{
	ScanState	ss;				/* its first field is NodeTag */
	...
	SharedBitmapHeapInstrumentation sinstrument;
} BitmapHeapScanState;

that compiles, at least with my compiler, but I find it weird to have a
variable-length inner struct embedded in an outer struct like that.

Long story short, I think it's still better to store
ParallelBitmapHeapInstrumentationInfo separately in the DSM chunk, not as
part of ParallelBitmapHeapState. Attached patch does that, rebased over
current master.

I didn't address any of the other things that you, Tomas, pointed out, but I
think they're valid concerns.

--
Heikki Linnakangas
Neon (https://neon.tech)
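For illustration, the leader-side retrieval step under that scheme might look
roughly like the sketch below, mirroring the existing pattern of
ExecSortRetrieveInstrumentation(); this is a sketch and not necessarily the
exact code in the attached patch:

void
ExecBitmapHeapRetrieveInstrumentation(BitmapHeapScanState *node)
{
	SharedBitmapHeapInstrumentation *shared = node->sinstrument;
	Size		size;

	if (shared == NULL)
		return;

	/* copy the stats into backend-private memory before the DSM goes away */
	size = offsetof(SharedBitmapHeapInstrumentation, sinstrument)
		+ shared->num_workers * sizeof(BitmapHeapScanInstrumentation);
	node->sinstrument = palloc(size);
	memcpy(node->sinstrument, shared, size);
}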
Attachment
From: Melanie Plageman
On Thu, Mar 14, 2024 at 05:30:30PM +0200, Heikki Linnakangas wrote:
> On 18/02/2024 00:31, Tomas Vondra wrote:
> > [...]
>
> [...]
>
> This is a little different for parallel bitmap heap scans, because a bitmap
> heap scan keeps some other data in shared memory too, not just
> instrumentation data. Also, the naming is inconsistent: the equivalent of
> SharedMemoizeInfo is actually called ParallelBitmapHeapState. I think we
> should rename it to SharedBitmapHeapInfo, to make it clear that it lives in
> shared memory, but I digress.

FWIW, if we merge a BHS streaming read user like the one I propose in [1]
(not as a pre-condition to this but just as something to make you more
comfortable with these names), the ParallelBitmapHeapState will basically
only contain the shared iterator and the coordination state for accessing
it, and could be named as such. Then if you really wanted to be consistent
with Memoize, you could name the instrumentation SharedBitmapHeapInfo. But,
personally, I prefer the name you gave it: SharedBitmapHeapInstrumentation.
I think that would have been a better name for SharedMemoizeInfo, since
num_workers is really just used as the length of the array of
instrumentation info.

> We could now put the new stats at the end of ParallelBitmapHeapState as a
> varlen field. But I'm not sure that's a good idea. In
> ExecBitmapHeapRetrieveInstrumentation(), would we make a backend-private
> copy of the whole ParallelBitmapHeapState struct, even though the other
> fields don't make sense after the shared memory is released? Sounds
> confusing. Or we could introduce a separate struct for the stats, and copy
> just that:
>
> typedef struct SharedBitmapHeapInstrumentation
> {
> 	int			num_workers;
> 	BitmapHeapScanInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
> } SharedBitmapHeapInstrumentation;
>
> typedef struct BitmapHeapScanState
> {
> 	ScanState	ss;				/* its first field is NodeTag */
> 	...
> 	SharedBitmapHeapInstrumentation sinstrument;
> } BitmapHeapScanState;
>
> that compiles, at least with my compiler, but I find it weird to have a
> variable-length inner struct embedded in an outer struct like that.

In the attached patch, BitmapHeapScanState->sinstrument is a pointer,
though. Or are you proposing the above as an alternative that you decided
not to go with?

> Long story short, I think it's still better to store
> ParallelBitmapHeapInstrumentationInfo separately in the DSM chunk, not as
> part of ParallelBitmapHeapState. Attached patch does that, rebased over
> current master.

The approach in the attached patch looks good to me.

> I didn't address any of the other things that you, Tomas, pointed out, but
> I think they're valid concerns.

I'll send a separate review of these issues that are still present in your
patch as well.

- Melanie

[1] https://www.postgresql.org/message-id/flat/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com
From: Heikki Linnakangas
On 14/03/2024 22:00, Melanie Plageman wrote:
> On Thu, Mar 14, 2024 at 05:30:30PM +0200, Heikki Linnakangas wrote:
>> typedef struct SharedBitmapHeapInstrumentation
>> {
>> 	int			num_workers;
>> 	BitmapHeapScanInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
>> } SharedBitmapHeapInstrumentation;
>>
>> typedef struct BitmapHeapScanState
>> {
>> 	ScanState	ss;				/* its first field is NodeTag */
>> 	...
>> 	SharedBitmapHeapInstrumentation sinstrument;
>> } BitmapHeapScanState;
>>
>> that compiles, at least with my compiler, but I find it weird to have a
>> variable-length inner struct embedded in an outer struct like that.
>
> In the attached patch, BitmapHeapScanState->sinstrument is a pointer,
> though. Or are you proposing the above as an alternative that you
> decided not to go with?

Right, the above is what I contemplated at first but decided it was a bad
idea.

--
Heikki Linnakangas
Neon (https://neon.tech)
Hi Tomas,
>
> On Sat, Feb 17, 2024 at 2:31 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> Hi David,
>
> Do you plan to work continue working on this patch? I did take a look,
> and on the whole it looks reasonable - it modifies the right places etc.
>
> I think there are two things that may need an improvement:
>
> 1) Storing variable-length data in ParallelBitmapHeapState
>
> I agree with Robert the snapshot_and_stats name is not great. I see
> Dmitry mentioned phs_snapshot_off as used by ParallelTableScanDescData -
> the reasons are somewhat different (phs_snapshot_off exists because we
> don't know which exact struct will be allocated), while here we simply
> need to allocate two variable-length pieces of memory. But it seems like
> it would work nicely for this. That is, we could try adding an offset
> for each of those pieces of memory:
>
> - snapshot_off
> - stats_off
>
> I don't like the GetSharedSnapshotData name very much, it seems very
> close to GetSnapshotData - quite confusing, I think.
>
> Dmitry also suggested we might add a separate piece of shared memory. I
> don't quite see how that would work for ParallelBitmapHeapState, but I
> doubt it'd be simpler than having two offsets. I don't think the extra
> complexity (paid by everyone) would be worth it just to make EXPLAIN
> ANALYZE work.
This issue is now gone after Heikki's fix.
> 2) Leader vs. worker counters
>
> It seems to me this does nothing to add the per-worker values from "Heap
> Blocks" into the leader, which means we get stuff like this:
>
> Heap Blocks: exact=102 lossy=10995
> Worker 0: actual time=50.559..209.773 rows=215253 loops=1
> Heap Blocks: exact=207 lossy=19354
> Worker 1: actual time=50.543..211.387 rows=162934 loops=1
> Heap Blocks: exact=161 lossy=14636
>
> I think this is wrong / confusing, and inconsistent with what we do for
> other nodes. It's also inconsistent with how we deal e.g. with BUFFERS,
> where we *do* add the values to the leader:
>
> Heap Blocks: exact=125 lossy=10789
> Buffers: shared hit=11 read=45420
> Worker 0: actual time=51.419..221.904 rows=150437 loops=1
> Heap Blocks: exact=136 lossy=13541
> Buffers: shared hit=4 read=13541
> Worker 1: actual time=56.610..222.469 rows=229738 loops=1
> Heap Blocks: exact=209 lossy=20655
> Buffers: shared hit=4 read=20655
>
> Here it's not entirely obvious, because leader participates in the
> execution, but once we disable leader participation, it's clearer:
>
> Buffers: shared hit=7 read=45421
> Worker 0: actual time=28.540..247.683 rows=309112 loops=1
> Heap Blocks: exact=282 lossy=27806
> Buffers: shared hit=4 read=28241
> Worker 1: actual time=24.290..251.993 rows=190815 loops=1
> Heap Blocks: exact=188 lossy=17179
> Buffers: shared hit=3 read=17180
>
> Not only is "Buffers" clearly a sum of per-worker stats, but the "Heap
> Blocks" simply disappeared because the leader does nothing and we don't
> print zeros.
Heap Blocks is specific to Bitmap Heap Scan. It seems that node-specific stats
are not aggregated from the workers into the leader for some existing nodes. For example,
the Memoize node does this for Hits, Misses, etc.:
-> Nested Loop (actual rows=166667 loops=3)
-> Parallel Seq Scan on t (actual rows=33333 loops=3)
-> Memoize (actual rows=5 loops=100000)
Cache Key: t.j
Cache Mode: logical
Hits: 32991 Misses: 5 Evictions: 0 Overflows: 0 Memory Usage: 2kB
Worker 0: Hits: 33551 Misses: 5 Evictions: 0 Overflows: 0 Memory Usage: 2kB
Worker 1: Hits: 33443 Misses: 5 Evictions: 0 Overflows: 0 Memory Usage: 2kB
-> Index Scan using uj on u (actual rows=5 loops=15)
Index Cond: (j = t.j)
Sort, HashAggregate also do the same stuff.
> 3) I'm not sure dealing with various EXPLAIN flags may not be entirely
> correct. Consider this:
>
> EXPLAIN (ANALYZE):
>
> -> Parallel Bitmap Heap Scan on t (...)
> Recheck Cond: (a < 5000)
> Rows Removed by Index Recheck: 246882
> Worker 0: Heap Blocks: exact=168 lossy=15648
> Worker 1: Heap Blocks: exact=302 lossy=29337
>
> EXPLAIN (ANALYZE, VERBOSE):
>
> -> Parallel Bitmap Heap Scan on public.t (...)
> Recheck Cond: (t.a < 5000)
> Rows Removed by Index Recheck: 246882
> Worker 0: actual time=35.067..300.882 rows=282108 loops=1
> Heap Blocks: exact=257 lossy=25358
> Worker 1: actual time=32.827..302.224 rows=217819 loops=1
> Heap Blocks: exact=213 lossy=19627
>
> EXPLAIN (ANALYZE, BUFFERS):
>
> -> Parallel Bitmap Heap Scan on t (...)
> Recheck Cond: (a < 5000)
> Rows Removed by Index Recheck: 246882
> Buffers: shared hit=7 read=45421
> Worker 0: Heap Blocks: exact=236 lossy=21870
> Worker 1: Heap Blocks: exact=234 lossy=23115
>
> EXPLAIN (ANALYZE, VERBOSE, BUFFERS):
>
> -> Parallel Bitmap Heap Scan on public.t (...)
> Recheck Cond: (t.a < 5000)
> Rows Removed by Index Recheck: 246882
> Buffers: shared hit=7 read=45421
> Worker 0: actual time=28.265..260.381 rows=261264 loops=1
> Heap Blocks: exact=260 lossy=23477
> Buffers: shared hit=3 read=23478
> Worker 1: actual time=28.224..261.627 rows=238663 loops=1
> Heap Blocks: exact=210 lossy=21508
> Buffers: shared hit=4 read=21943
>
> Why should the per-worker buffer info be shown when combined with the
> VERBOSE flag, and not just with BUFFERS, when the patch shows the
> per-worker info always?
>
It seems that the general EXPLAIN print framework requires VERBOSE mode to show per-worker stats;
see for example how buffer hits and JIT are printed. Some specific nodes that involve parallelism,
however, always show worker blocks. This is why some worker blocks don't have buffer
stats in non-VERBOSE mode. Several existing nodes have the same issue as this
patch: Memoize, Sort, HashAggregate.
> 4) Now that I think about this, isn't the *main* problem really that we
> don't display the sum of the per-worker stats (which I think is wrong)?
> I mean, we already can get the worker details VERBOSEm right? So the
> only reason to display that by default seems to be that it the values in
> "Heap Blocks" are from the leader only.
If we print the aggregated Heap Blocks in the 'leader' block and show worker stats only in
VERBOSE mode, does that look better? That way it matches how the
general framework prints worker stats, but it differs from the existing nodes that
also print worker stats, i.e. the Memoize, Sort, and HashAggregate nodes mentioned above.
By the way, is that also a problem for those nodes?
> BTW doesn't this also suggest some of the code added to explain.c may
> not be quite necessary? Wouldn't it be enough to just "extend" the
> existing code printing per-worker stats. (I haven't tried, so maybe I'm
> wrong and we need the new code.)
We need the new code because these are node-specific stats and we call the worker print function
to lay out the explain plan. I think the real question is in which mode we should show the worker blocks; that
is discussed in the section above.
v3 failed to apply on master; attached is a rebased version.
Regards,
Donghang Lin
(ServiceNow)
Attachment
From: Melanie Plageman
On Mon, Mar 25, 2024 at 2:29 AM Donghang Lin <donghanglin@gmail.com> wrote:
>
> > On Sat, Feb 17, 2024 at 2:31 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> > 2) Leader vs. worker counters
> >
> > It seems to me this does nothing to add the per-worker values from "Heap
> > Blocks" into the leader, which means we get stuff like this:
> > [...]
> > Not only is "Buffers" clearly a sum of per-worker stats, but the "Heap
> > Blocks" simply disappeared because the leader does nothing and we don't
> > print zeros.
>
> Heap Blocks is specific to Bitmap Heap Scan. It seems that node-specific
> stats are not aggregated from the workers into the leader for some existing
> nodes. For example, the Memoize node does this for Hits, Misses, etc.
> [...]
> Sort, HashAggregate also do the same stuff.
>
> > 3) I'm not sure dealing with various EXPLAIN flags may not be entirely
> > correct. Consider this:
> > [...]
> > Why should the per-worker buffer info be shown when combined with the
> > VERBOSE flag, and not just with BUFFERS, when the patch shows the
> > per-worker info always?
>
> It seems that the general EXPLAIN print framework requires VERBOSE mode to
> show per-worker stats; see for example how buffer hits and JIT are printed.
> Some specific nodes that involve parallelism, however, always show worker
> blocks. This is why some worker blocks don't have buffer stats in
> non-VERBOSE mode. Several existing nodes have the same issue as this patch:
> Memoize, Sort, HashAggregate.

I don't think passing explain the BUFFERS option should impact what is shown
for bitmap heap scan lossy/exact. BUFFERS has to do with buffer usage. "Heap
Blocks" here is actually more accurately "Bitmap Blocks": 1) it is not heap
specific, so at least changing it to "Table Blocks" would be better, and
2) the meaning is about whether or not the blocks were represented lossily
in the bitmap -- yes, the blocks we are talking about are table blocks (in
contrast to index blocks), but the important part is the bitmap. So, BUFFERS
shouldn't cause this info to show.

As for whether or not the leader number should be inclusive of all the
worker numbers: if there is a combination of options in which per-worker
stats are not displayed, then the leader count should be inclusive of the
worker numbers. However, if the worker numbers are always displayed, I think
it is okay for the leader number to only display its own count. Though I
agree with Tomas that that can be confusing when
parallel_leader_participation is off; you end up without a topline number.

As for whether or not per-worker stats should be displayed by default or
only with VERBOSE, it sounds like there are two different precedents. I
don't have a strong feeling one way or the other. Whichever is most
consistent. Donghang, could you list again which plan nodes and explain
options always print per-worker stats and which only do with the VERBOSE
option?

I think there is an issue with the worker counts on rescan though.
I was playing around with this patch with one of the regression test suite
tables:

drop table if exists tenk1;
drop table if exists tenk2;

CREATE TABLE tenk1 (
    unique1     int4,
    unique2     int4,
    two         int4,
    four        int4,
    ten         int4,
    twenty      int4,
    hundred     int4,
    thousand    int4,
    twothousand int4,
    fivethous   int4,
    tenthous    int4,
    odd         int4,
    even        int4,
    stringu1    name,
    stringu2    name,
    string4     name
) with (autovacuum_enabled = false);

COPY tenk1 FROM '/[source directory]/src/test/regress/data/tenk.data';
CREATE TABLE tenk2 AS SELECT * FROM tenk1;
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
VACUUM ANALYZE tenk1;
VACUUM ANALYZE tenk2;

set enable_seqscan to off;
set enable_indexscan to off;
set enable_hashjoin to off;
set enable_mergejoin to off;
set enable_material to off;
set parallel_setup_cost=0;
set parallel_tuple_cost=0;
set min_parallel_table_scan_size=0;
set max_parallel_workers_per_gather=2;
set parallel_leader_participation = off;

explain (analyze, costs off, verbose) select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;

I don't think the worker counts are getting carried across rescans. For this
query, with parallel_leader_participation, you can see the leader has a high
number of exact heap blocks (there are 30 rescans, so it is the sum across
all of those). But the workers have low counts. That might be normal because
maybe they did less work, but if you set parallel_leader_participation = off,
you see the worker numbers are still small. I did some logging and I do see
workers with counts of lossy/exact not making it into the final count. I
haven't had time to debug more, but it is worth looking into.

parallel_leader_participation = on
-----------------------------------------------------------------------------------------------
 Aggregate (actual time=62.321..63.178 rows=1 loops=1)
   Output: count(*)
   ->  Nested Loop (actual time=2.058..55.202 rows=98000 loops=1)
         ->  Seq Scan on public.tenk2 (actual time=0.182..3.699 rows=10 loops=1)
               Output: tenk2.unique1 ...
               Filter: (tenk2.thousand = 0)
               Rows Removed by Filter: 9990
         ->  Gather (actual time=1.706..4.142 rows=9800 loops=10)
               Workers Planned: 2
               Workers Launched: 2
               ->  Parallel Bitmap Heap Scan on public.tenk1 (actual time=0.365..0.958 rows=3267 loops=30)
                     Recheck Cond: (tenk1.hundred > 1)
                     Heap Blocks: exact=1801
                     Worker 0: actual time=0.033..0.414 rows=1993 loops=10
                       Heap Blocks: exact=78 lossy=0
                     Worker 1: actual time=0.032..0.550 rows=2684 loops=10
                       Heap Blocks: exact=86 lossy=0
                     ->  Bitmap Index Scan on tenk1_hundred (actual time=0.972..0.972 rows=9800 loops=10)
                           Index Cond: (tenk1.hundred > 1)

parallel_leader_participation = off
-----------------------------------------------------------------------------------------------
 Aggregate (actual time=84.502..84.977 rows=1 loops=1)
   Output: count(*)
   ->  Nested Loop (actual time=6.185..77.085 rows=98000 loops=1)
         ->  Seq Scan on public.tenk2 (actual time=0.182..3.709 rows=10 loops=1)
               Output: tenk2.unique1...
               Filter: (tenk2.thousand = 0)
               Rows Removed by Filter: 9990
         ->  Gather (actual time=5.265..6.355 rows=9800 loops=10)
               Workers Planned: 2
               Workers Launched: 2
               ->  Parallel Bitmap Heap Scan on public.tenk1 (actual time=0.951..1.863 rows=4900 loops=20)
                     Recheck Cond: (tenk1.hundred > 1)
                     Worker 0: actual time=0.794..1.705 rows=4909 loops=10
                       Heap Blocks: exact=168 lossy=0
                     Worker 1: actual time=1.108..2.021 rows=4891 loops=10
                       Heap Blocks: exact=177 lossy=0
                     ->  Bitmap Index Scan on tenk1_hundred (actual time=1.024..1.024 rows=9800 loops=10)
                           Index Cond: (tenk1.hundred > 1)
                           Worker 1: actual time=1.024..1.024 rows=9800 loops=10

- Melanie
On Mon, Mar 25, 2024 at 2:11 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> As for whether or not per-worker stats should be displayed by default
> or only with VERBOSE, it sounds like there are two different
> precedents. I don't have a strong feeling one way or the other.
> Whichever is most consistent.
> Donghang, could you list again which plan nodes and explain options
> always print per-worker stats and which only do with the VERBOSE
> option?
I took a look at explain.c, where the workers' info is printed out.
These apply to every parallel-aware node:
Buffers stats print for workers with VERBOSE and BUFFERS
WAL stats print for workers with VERBOSE and WAL
JIT stats print for workers with VERBOSE and COSTS
Timing print for workers with VERBOSE and TIMING
Rows and loops print for workers with VERBOSE
Some specific nodes, namely
Sort / Incremental Sort / Hash / HashAggregate / Memoize, and Bitmap Heap Scan (with this patch),
always print their node-specific stats for workers.
> I did some logging and I do see workers with
> counts of lossy/exact not making it into the final count. I haven't
> had time to debug more, but it is worth looking into.
Indeed, a rescan overwrites the previous scan's stats in the workers.
Attached is v5, which is v4 plus a fix to aggregate the counts.
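For illustration, the fix amounts to accumulating into the shared per-worker
slot instead of overwriting it when the worker publishes its stats; a rough
sketch (the names may not match v5 exactly):

	if (node->sinstrument != NULL && IsParallelWorker())
	{
		BitmapHeapScanInstrumentation *si;

		Assert(ParallelWorkerNumber < node->sinstrument->num_workers);
		si = &node->sinstrument->sinstrument[ParallelWorkerNumber];

		/* += rather than =, so counts from earlier scans are not lost on rescan */
		si->exact_pages += node->stats.exact_pages;
		si->lossy_pages += node->stats.lossy_pages;
	}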
Regards,
Donghang Lin
(ServiceNow)
Attachment
Hi!

Thank you for your work on this issue! Your patch required a little revision,
which I did; the revised patch is attached.

Also, I think you should add some clarification to the comments about
printing the 'exact' and 'lossy' pages in the show_tidbitmap_info function,
which you now take from planstate->stats, whereas previously they were output
directly from planstate. Perhaps it is enough to mention this in the commit
message. I mean this place:

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 926d70afaf..02251994c6 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3467,26 +3467,57 @@ show_hashagg_info(AggState *aggstate, ExplainState *es)
 static void
 show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es)
 {
+	Assert(es->analyze);
+
 	if (es->format != EXPLAIN_FORMAT_TEXT)
 	{
 		ExplainPropertyInteger("Exact Heap Blocks", NULL,
-							   planstate->exact_pages, es);
+							   planstate->stats.exact_pages, es);
 		ExplainPropertyInteger("Lossy Heap Blocks", NULL,
-							   planstate->lossy_pages, es);
+							   planstate->stats.lossy_pages, es);
 	}
 	else
 	{
-		if (planstate->exact_pages > 0 || planstate->lossy_pages > 0)
+		if (planstate->stats.exact_pages > 0 || planstate->stats.lossy_pages > 0)
 		{
 			ExplainIndentText(es);
 			appendStringInfoString(es->str, "Heap Blocks:");
-			if (planstate->exact_pages > 0)
-				appendStringInfo(es->str, " exact=%ld", planstate->exact_pages);
-			if (planstate->lossy_pages > 0)
-				appendStringInfo(es->str, " lossy=%ld", planstate->lossy_pages);
+			if (planstate->stats.exact_pages > 0)
+				appendStringInfo(es->str, " exact=%ld", planstate->stats.exact_pages);
+			if (planstate->stats.lossy_pages > 0)
+				appendStringInfo(es->str, " lossy=%ld", planstate->stats.lossy_pages);
 			appendStringInfoChar(es->str, '\n');
 		}
 	}
+
+	if (planstate->pstate != NULL)
+	{
+		for (int n = 0; n < planstate->sinstrument->num_workers; n++)
+		{
+			BitmapHeapScanInstrumentation *si = &planstate->sinstrument->sinstrument[n];
+
+			if (si->exact_pages == 0 && si->lossy_pages == 0)
+				continue;
+
+			if (es->workers_state)
+				ExplainOpenWorker(n, es);
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				ExplainIndentText(es);
+				appendStringInfo(es->str, "Heap Blocks: exact=%ld lossy=%ld\n",
+								 si->exact_pages, si->lossy_pages);
+			}
+			else
+			{
+				ExplainPropertyInteger("Exact Heap Blocks", NULL, si->exact_pages, es);
+				ExplainPropertyInteger("Lossy Heap Blocks", NULL, si->lossy_pages, es);
+			}
+
+			if (es->workers_state)
+				ExplainCloseWorker(n, es);
+		}
+	}
 }

I also suggest some code refactoring (the diff.diff.no-cfbot file) that would
let you improve the code.

--
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
Hi,

Thanks for working on it! I'm interested in this feature, so I'd like to participate in the patch review. Though I've just started looking at the patch, I have two comments about the v6 patch. (And I want to confirm the thread is still active.)

1) Unify the print format of leader and workers

In show_tidbitmap_info(), the numbers of exact/lossy blocks of the leader and the workers are printed. I think the printed format should be the same. Currently, the leader does not print the exact/lossy blocks when the value is 0, but the workers do, even if one of the values is 0. IMHO, it's better to print both exact and lossy blocks if at least one of the two numbers is greater than 0. After all, the print logic is redundant between the leader and the workers, so I think it would be better to factor it out into a common function (a rough sketch of such a helper follows below).

2) Move the es->workers_state check

In show_tidbitmap_info(), ExplainOpenWorker() and ExplainCloseWorker() are called after checking that es->workers_state is not NULL. However, es->workers_state seems to be able to be NULL only for the Gather node (see ExplainPrintPlan()). Also, reading the comments, there is a description that each worker's information needs to be hidden when printing the plan. Even if es->workers_state becomes NULL for the Bitmap Heap Scan node in the future, I think the workers' information (Heap Blocks) should not be printed. Therefore, shouldn't the es->workers_state check be moved to the "if (planstate->pstate != NULL)" condition, as in ExplainNode()? IIUC, we would need to correct show_sort_info() and so on too...

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
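To make (1) concrete, such a common helper could look roughly like this. The name and exact shape are assumed here, not taken from any posted patch:

/* Print a "Heap Blocks" line for either the leader or one worker,
 * showing both counters whenever at least one of them is non-zero. */
static void
show_heap_blocks(ExplainState *es, long exact_pages, long lossy_pages)
{
    if (exact_pages == 0 && lossy_pages == 0)
        return;

    if (es->format == EXPLAIN_FORMAT_TEXT)
    {
        ExplainIndentText(es);
        appendStringInfo(es->str, "Heap Blocks: exact=%ld lossy=%ld\n",
                         exact_pages, lossy_pages);
    }
    else
    {
        ExplainPropertyInteger("Exact Heap Blocks", NULL, exact_pages, es);
        ExplainPropertyInteger("Lossy Heap Blocks", NULL, lossy_pages, es);
    }
}

Both the leader path and the per-worker loop in show_tidbitmap_info() could then call it, which would also make the leader and worker output identical.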
On Wed, 26 Jun 2024 at 22:22, <Masahiro.Ikeda@nttdata.com> wrote:
> 1) Unify the print format of leader and workers
>
> In show_tidbitmap_info(), the numbers of exact/lossy blocks of the leader and
> the workers are printed. I think the printed format should be the same.
> Currently, the leader does not print the exact/lossy blocks when the value is
> 0, but the workers do, even if one of the values is 0.

I agree with this. The two should match. I've fixed that in the attached.

I also made a pass over the patch and changed the following:

1. Fixed up a few outdated comments in execnodes.h.
2. Added a comment in ExecEndBitmapHeapScan() to explain why we += the stats rather than memcpy the BitmapHeapScanInstrumentation.
3. A bunch of other comments.
4. Updated typedefs.list and ran pgindent.

For #2, I was surprised at this. I think there's probably a bug in the Memoize stats code for the same reason. I've not looked into that yet.

I find it a little bit strange that we show stats for "Worker N" when that worker could have been made up of possibly hundreds of different parallel workers, in the case where the Gather/GatherMerge node is rescanned: the workers get shut down at the end of each Gather, and fresh ones are started up on rescan. I do agree that we need to accumulate the totals from previous scans, as that's what the non-parallel version does.

Many people have been hacking on this, and I'm wondering who should be listed as authors. I plan to put David Geier first. Should anyone else be listed there?

I've attached the rebased v5 patch with part of Alena's changes from the diff.diff.no-cfbot file. I left the following one out as it looks wrong:

- ptr += MAXALIGN(sizeof(ParallelBitmapHeapState));
+ ptr += size;

That would make ptr point to the end of the allocation (see the layout sketch below).

I'd like to commit this patch soon, so if anyone wants to give it a final look, can they do so before next week?

David
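To illustrate why, here is a rough sketch of the intended DSM layout, using the names from the patch. This is illustrative only, not the exact code from ExecBitmapHeapInitializeDSM():

/*
 * One shm_toc allocation holds the ParallelBitmapHeapState followed, at a
 * MAXALIGN'd offset, by the shared per-worker instrumentation.
 * (pstate, node and pcxt come from the surrounding DSM setup code.)
 */
Size        size;
char       *ptr;

size = MAXALIGN(sizeof(ParallelBitmapHeapState));
if (node->ss.ps.instrument && pcxt->nworkers > 0)
    size = add_size(size,
                    offsetof(SharedBitmapHeapInstrumentation, sinstrument) +
                    mul_size(pcxt->nworkers, sizeof(BitmapHeapScanInstrumentation)));

ptr = shm_toc_allocate(pcxt->toc, size);
pstate = (ParallelBitmapHeapState *) ptr;

/* Advance past the first struct only; "ptr += size" would overshoot. */
ptr += MAXALIGN(sizeof(ParallelBitmapHeapState));
node->sinstrument = (SharedBitmapHeapInstrumentation *) ptr;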
Attachment
On Fri, 5 Jul 2024 at 01:59, David Rowley <dgrowleyml@gmail.com> wrote:
> I also made a pass over the patch and changed the following:
>
> 1. Fixed up a few outdated comments in execnodes.h.
> 2. Added a comment in ExecEndBitmapHeapScan() to explain why we += the
> stats rather than memcpy the BitmapHeapScanInstrumentation.
> 3. A bunch of other comments.
> 4. Updated typedefs.list and ran pgindent.

One other thing I think we should do while on this topic is move away from using "long" as the data type for storing the number of exact and lossy pages. The problem is that sizeof(long) on 64-bit MSVC is 32 bits, and a signed 32-bit type isn't large enough to store anything more than 16TB worth of 8k pages.

I propose we change these to uint64 while we're already causing churn in this area, probably as a follow-on patch (a sketch of what that would amount to is below). I think uint32 isn't wide enough either, as you could exceed the limit with rescans.

David
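Concretely, the follow-on change would amount to something like the following. The struct and field names follow the current patch, but this is a sketch and the real patch may differ in detail:

/* Widen the counters from "long" to uint64 in the instrumentation
 * struct (and in the matching fields of BitmapHeapScanState): */
typedef struct BitmapHeapScanInstrumentation
{
    uint64      exact_pages;    /* was: long */
    uint64      lossy_pages;    /* was: long */
} BitmapHeapScanInstrumentation;

/* ... and switch explain.c from %ld to a portable 64-bit format, e.g.: */
appendStringInfo(es->str, " exact=" UINT64_FORMAT,
                 planstate->stats.exact_pages);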
On Fri, 5 Jul 2024 at 12:52, David Rowley <dgrowleyml@gmail.com> wrote:
> I propose we change these to uint64 while we're already causing churn in
> this area, probably as a follow-on patch. I think uint32 isn't wide enough
> either, as you could exceed the limit with rescans.

I wondered how large a query it would take to cause this problem. I tried:

create table a (a int);
insert into a select x%1000 from generate_Series(1,1500000)x;
create index on a(a);
vacuum freeze analyze a;
set enable_hashjoin=0;
set enable_mergejoin=0;
set enable_indexscan=0;
set max_parallel_workers_per_gather=0;
explain (analyze, costs off, timing off, summary off)
select count(*) from a a1 inner join a a2 on a1.a=a2.a;

After about 15 minutes, the trimmed output on Linux is:

Aggregate (actual rows=1 loops=1)
  ->  Nested Loop (actual rows=2250000000 loops=1)
        ->  Seq Scan on a a1 (actual rows=1500000 loops=1)
        ->  Bitmap Heap Scan on a a2 (actual rows=1500 loops=1500000)
              Recheck Cond: (a1.a = a)
              Heap Blocks: exact=2250000000
              ->  Bitmap Index Scan on a_a_idx (actual rows=1500 loops=1500000)
                    Index Cond: (a = a1.a)

Whereas on MSVC, due to sizeof(long) == 4, it's:

Aggregate (actual rows=1 loops=1)
  ->  Nested Loop (actual rows=2250000000 loops=1)
        ->  Seq Scan on a a1 (actual rows=1500000 loops=1)
        ->  Bitmap Heap Scan on a a2 (actual rows=1500 loops=1500000)
              Recheck Cond: (a1.a = a)
              ->  Bitmap Index Scan on a_a_idx (actual rows=1500 loops=1500000)
                    Index Cond: (a = a1.a)

Notice that "Heap Blocks: exact=2250000000" is missing on Windows: 1,500,000 loops of 1,500 exact blocks is 2,250,000,000, which exceeds INT_MAX, so the counter wrapped around to a negative value, and show_tidbitmap_info() only shows values > 0.

I feel this is a good enough justification to increase the width of those counters to uint64, so I'll do that too.

David
On Mon, 8 Jul 2024 at 12:19, David Rowley <dgrowleyml@gmail.com> wrote:
> Notice that "Heap Blocks: exact=2250000000" is missing on Windows:
> 1,500,000 loops of 1,500 exact blocks is 2,250,000,000, which exceeds
> INT_MAX, so the counter wrapped around to a negative value, and
> show_tidbitmap_info() only shows values > 0.
>
> I feel this is a good enough justification to increase the width of
> those counters to uint64, so I'll do that too.

I pushed the widening of the types first, as I saw some code in the EXPLAIN patch which assumed "var == 0" is the negator of "var > 0". I couldn't bring myself to commit that knowing it was wrong, and I also couldn't bring myself to write "<= 0" knowing I was about to make that look like a weird thing to write for an unsigned type.

David
On Sun, 18 Feb 2024 at 11:31, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> 2) Leader vs. worker counters
>
> It seems to me this does nothing to add the per-worker values from "Heap
> Blocks" into the leader, which means we get stuff like this:
>
> Heap Blocks: exact=102 lossy=10995
> Worker 0: actual time=50.559..209.773 rows=215253 loops=1
>   Heap Blocks: exact=207 lossy=19354
> Worker 1: actual time=50.543..211.387 rows=162934 loops=1
>   Heap Blocks: exact=161 lossy=14636
>
> I think this is wrong / confusing, and inconsistent with what we do for
> other nodes.

Are you able to share which other nodes you mean here?

I used the following to compare with Sort and Memoize, and as far as I can see, the behaviour of the attached v8 patch matches theirs. Is there some inconsistency here that I'm not seeing?

create table mill (a int);
create index on mill(a);
insert into mill select x%1000 from generate_Series(1,10000000)x;
vacuum analyze mill;

create table big (a int primary key);
insert into big select x from generate_series(1,10000000)x;

create table probe (a int);
insert into probe select 1 from generate_Series(1,1000000);

analyze big;
analyze probe;

set parallel_tuple_cost=0;
set parallel_setup_cost=0;
set enable_indexscan=0;

-- compare Parallel Bitmap Heap Scan with Memoize and Sort.
-- each includes "Worker N:" with stats for the operation.
explain (analyze) select * from mill where a < 100;
explain (analyze) select * from big b inner join probe p on b.a=p.a;
explain (analyze) select * from probe order by a;

-- each includes "Worker N:" with stats for the operation
-- also includes actual time and rows for each worker.
explain (analyze, verbose) select * from mill where a < 100;
explain (analyze, verbose) select * from big b inner join probe p on b.a=p.a;
explain (analyze, verbose) select * from probe order by a;

-- each includes "Worker N:" with stats for the operation
-- shows a single Buffers total that includes leader and worker buffers.
explain (analyze, buffers) select * from mill where a < 100;
explain (analyze, buffers) select * from big b inner join probe p on b.a=p.a;
explain (analyze, buffers) select * from probe order by a;

-- each includes "Worker N:" with stats for the operation
-- also includes actual time and rows for each worker.
-- shows a single Buffers total that includes leader and worker buffers.
-- shows buffer counts for each worker process.
explain (analyze, buffers, verbose) select * from mill where a < 100;
explain (analyze, buffers, verbose) select * from big b inner join probe p on b.a=p.a;
explain (analyze, buffers, verbose) select * from probe order by a;

If we did want to adjust things to show the totals for each worker rather than the stats for the leader, what would "Sort Method" show if one worker spilled to disk and another did not?

David
Attachment
On Mon, 8 Jul 2024 at 15:43, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Sun, 18 Feb 2024 at 11:31, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> > 2) Leader vs. worker counters
> >
> > It seems to me this does nothing to add the per-worker values from "Heap
> > Blocks" into the leader, which means we get stuff like this:
> >
> > Heap Blocks: exact=102 lossy=10995
> > Worker 0: actual time=50.559..209.773 rows=215253 loops=1
> >   Heap Blocks: exact=207 lossy=19354
> > Worker 1: actual time=50.543..211.387 rows=162934 loops=1
> >   Heap Blocks: exact=161 lossy=14636
> >
> > I think this is wrong / confusing, and inconsistent with what we do for
> > other nodes.
>
> Are you able to share which other nodes you mean here?

I did the analysis on this, and out of the node types that have parallel instrumentation (per ExecParallelRetrieveInstrumentation()), Parallel Hash is the only node that does anything different from the others. Looking at the loop inside show_hash_info(), you can see it takes the Max() of each property (a rough sketch of that loop is below). There's some discussion in [1] about why this came about, in particular [2].

I see no reason to copy the odd one out here, so I'm planning on going ahead with the patch that has Bitmap Heap Scan copy what the majority of the other nodes do. I think we should consider aligning Parallel Hash with the other parallel nodes' behaviour.

I've attached the (roughly done) schema and queries I used to obtain the plans for this analysis.

David

[1] https://www.postgresql.org/message-id/flat/20200323165059.GA24950%40alvherre.pgsql
[2] https://www.postgresql.org/message-id/31321.1586549487%40sss.pgh.pa.us
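For reference, that loop does roughly the following (paraphrased from show_hash_info(), not copied verbatim):

/* Each worker's HashInstrumentation describes the same shared hash
 * table, so show_hash_info() combines them by taking the maximum of
 * each property rather than summing them: */
for (int i = 0; i < shared_info->num_workers; i++)
{
    HashInstrumentation *worker_hi = &shared_info->hinstrument[i];

    hinstrument.nbuckets = Max(hinstrument.nbuckets, worker_hi->nbuckets);
    hinstrument.nbatch = Max(hinstrument.nbatch, worker_hi->nbatch);
    hinstrument.space_peak = Max(hinstrument.space_peak, worker_hi->space_peak);
}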
Attachment
On Tue, 9 Jul 2024 at 11:51, David Rowley <dgrowleyml@gmail.com> wrote:
> I think we should consider aligning Parallel Hash with the other parallel
> nodes' behaviour.

I looked at that and quickly realised that it makes sense for Parallel Hash to do something different here. All the workers contribute to building the same hash table, so they're all going to show the same set of values, provided they managed to help build it. We can already tell how much each worker helped from EXPLAIN (ANALYZE, VERBOSE)'s "Worker N: rows=n" output. I don't think there's anything else not already shown that would be interesting to know per worker.

David