Thread: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

From
David Geier
Date:
Hi hackers,

EXPLAIN ANALYZE for parallel Bitmap Heap Scans currently only reports 
the number of heap blocks processed by the leader. It's missing the 
per-worker stats. The attached patch adds that functionality in the 
spirit of e.g. Sort or Memoize. Here is a simple test case and the 
EXPLAIN ANALYZE output with and without the patch:

create table foo(col0 int, col1 int);
insert into foo select generate_series(1, 1000, 0.001), 
generate_series(1000, 2000, 0.001);
create index idx0 on foo(col0);
create index idx1 on foo(col1);
set parallel_tuple_cost = 0;
set parallel_setup_cost = 0;
explain (analyze, costs off, timing off) select * from foo where col0 > 
900 or col1 = 1;

With the patch:

  Gather (actual rows=99501 loops=1)
    Workers Planned: 2
    Workers Launched: 2
    ->  Parallel Bitmap Heap Scan on foo (actual rows=33167 loops=3)
          Recheck Cond: ((col0 > 900) OR (col1 = 1))
          Heap Blocks: exact=98
          Worker 0:  Heap Blocks: exact=171 lossy=0
          Worker 1:  Heap Blocks: exact=172 lossy=0
          ->  BitmapOr (actual rows=0 loops=1)
                ->  Bitmap Index Scan on idx0 (actual rows=99501 loops=1)
                      Index Cond: (col0 > 900)
                ->  Bitmap Index Scan on idx1 (actual rows=0 loops=1)
                      Index Cond: (col1 = 1)

Without the patch:

  Gather (actual rows=99501 loops=1)
    Workers Planned: 2
    Workers Launched: 2
    ->  Parallel Bitmap Heap Scan on foo (actual rows=33167 loops=3)
          Recheck Cond: ((col0 > 900) OR (col1 = 1))
          Heap Blocks: exact=91
          ->  BitmapOr (actual rows=0 loops=1)
                ->  Bitmap Index Scan on idx0 (actual rows=99501 loops=1)
                      Index Cond: (col0 > 900)
                ->  Bitmap Index Scan on idx1 (actual rows=0 loops=1)
                      Index Cond: (col1 = 1)

So in total the parallel Bitmap Heap Scan actually processed 441 heap 
blocks instead of just 91.

With this change, two variable-length arrays (VLAs) would be needed in the 
shared state: one for the snapshot and one for the stats. As a struct can 
only have a single flexible array member, that obviously doesn't work, so I 
instead use one single, big VLA and added functions that return pointers to 
the respective fields. I'm using MAXALIGN() to make sure the latter field is 
aligned properly. Am I doing this correctly? I'm not entirely sure about the 
alignment conventions and requirements of other platforms.
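
Roughly, the idea looks like this (heavily simplified sketch, just to 
illustrate the layout; the helper names are illustrative):

typedef struct ParallelBitmapHeapState
{
    /* ... fixed-size fields ... */

    /*
     * Single variable-length area holding the serialized snapshot followed
     * by the per-worker instrumentation array; the stats part starts at a
     * MAXALIGN'ed offset.
     */
    char        snapshot_and_stats[FLEXIBLE_ARRAY_MEMBER];
} ParallelBitmapHeapState;

static char *
GetSharedSnapshotData(ParallelBitmapHeapState *pstate)
{
    return pstate->snapshot_and_stats;
}

static BitmapHeapScanInstrumentation *
GetSharedStats(ParallelBitmapHeapState *pstate, Size snapshot_size)
{
    /* stats start right after the snapshot, rounded up to MAXALIGN */
    return (BitmapHeapScanInstrumentation *)
        (pstate->snapshot_and_stats + MAXALIGN(snapshot_size));
}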

I couldn't find existing tests that exercise the EXPLAIN ANALYZE output 
of specific nodes. I could only find a few basic smoke tests for EXPLAIN 
ANALYZE with parallel nodes in select_parallel.sql. Do we want tests for 
the changed functionality? If so, I could right away also add tests for 
EXPLAIN ANALYZE covering other parallel nodes.

Thank you for your feedback.

-- 
David Geier
(ServiceNow)

Attachment

Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

From
David Geier
Date:
Hi,

On 1/20/23 09:34, David Geier wrote:
> EXPLAIN ANALYZE for parallel Bitmap Heap Scans currently only reports 
> the number of heap blocks processed by the leader. It's missing the 
> per-worker stats. The attached patch adds that functionality in the 
> spirit of e.g. Sort or Memoize. Here is a simple test case and the 
> EXPLAIN ANALYZE output with and without the patch:

Attached is a rebased version of the patch. I would appreciate someone 
taking a look.

As background: the change doesn't come out of thin air. We repeatedly 
drew wrong conclusions in our query analysis because we assumed that the 
reported block counts included the workers' blocks.

If no one objects I would also register the patch at the commit fest. 
The patch is passing cleanly on CI.

-- 
David Geier
(ServiceNow)

Attachment

Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

From
Dmitry Dolgov
Date:
> On Tue, Feb 21, 2023 at 01:02:35PM +0100, David Geier wrote:
> Hi,
>
> On 1/20/23 09:34, David Geier wrote:
> > EXPLAIN ANALYZE for parallel Bitmap Heap Scans currently only reports
> > the number of heap blocks processed by the leader. It's missing the
> > per-worker stats. The attached patch adds that functionality in the
> > spirit of e.g. Sort or Memoize. Here is a simple test case and the
> > EXPLAIN ANALYZE output with and without the patch:
>
> Attached is a rebased version of the patch. I would appreciate someone
> taking a look.
>
> As background: the change doesn't come out of thin air. We repeatedly took
> wrong conclusions in our query analysis because we assumed that the reported
> block counts include the workers.
>
> If no one objects I would also register the patch at the commit fest. The
> patch is passing cleanly on CI.

Thanks for the patch.

The idea sounds reasonable to me, but I have to admit snapshot_and_stats
implementation looks awkward. Maybe it would be better to have a
separate structure field for both stats and snapshot, which will be set
to point to a corresponding place in the shared FAM e.g. when the worker
is getting initialized? shm_toc_allocate mentions BUFFERALIGN to handle
possibility of some atomic operations needing it, so I guess that would
have to be an alignment in this case as well.

Probably another option would be to allocate two separate pieces of
shared memory, which resolves questions like proper alignment, but
annoyingly will require an extra lookup and a new key.



Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

From
David Geier
Date:
Hi Dmitry,

Thanks for looking at the patch and sorry for the long line.

On 3/17/23 21:14, Dmitry Dolgov wrote:
> The idea sounds reasonable to me, but I have to admit snapshot_and_stats
> implementation looks awkward. Maybe it would be better to have a
> separate structure field for both stats and snapshot, which will be set
> to point to a corresponding place in the shared FAM e.g. when the worker
> is getting initialized? shm_toc_allocate mentions BUFFERALIGN to handle
> possibility of some atomic operations needing it, so I guess that would
> have to be an alignment in this case as well.
>
> Probably another option would be to allocate two separate pieces of
> shared memory, which resolves questions like proper alignment, but
> annoyingly will require an extra lookup and a new key.

I considered the other options and it seems to me none of them is 
particularly superior. All of them have pros and cons with the cons 
mostly outweighing the pros. Let me quickly elaborate:

1. Use multiple shm_toc entries: The shared state is split into multiple 
pieces, and extra pointers in BitmapHeapScanState are needed to point at 
the split-out data. BitmapHeapScanState already has a shared_info member, 
which is not a pointer into shared memory but a pointer to leader-local 
memory allocated to store the instrumentation data collected from the 
workers. This is confusing but at least consistent with how it's done in 
other places (e.g. nodeSort.c, nodeHash.c, nodeIncrementalSort.c). 
Having another pointer there which points into shared memory makes it 
even more confusing. If we go this way we would end up with e.g. 
shared_info_copy and shared_info members in BitmapHeapScanState.

2. Store two extra pointers to the shared FAM entries in 
BitmapHeapScanState: IMHO, this is the better variant of (1) as it 
doesn't need an extra TOC entry, but it comes with the same confusion of 
multiple pointers to SharedBitmapHeapScanInfo in BitmapHeapScanState. 
But maybe that's not too bad?

3. Solution in initial patch (use two functions to obtain pointers 
where/when needed): Avoids the need for another pointer in 
BitmapHeapScanState at the cost / ugliness of having to call the helper 
functions.

Another, not yet discussed, option I can see work is:

4. Allocate a fixed amount of memory for the instrumentation stats based 
on MAX_PARALLEL_WORKER_LIMIT: MAX_PARALLEL_WORKER_LIMIT is 1024 and is used 
as the limit of the max_parallel_workers GUC. This way 
MAX_PARALLEL_WORKER_LIMIT * sizeof(BitmapHeapScanInstrumentation) = 1024 
* 8 = 8192 bytes would be allocated. To cut this in half we could 
additionally change the type of lossy_pages and exact_pages from long to 
uint32. Only the memory actually needed would have to be initialized; the 
remaining unused memory would stay untouched so as not to waste cycles.
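
Just to illustrate, (4) would look roughly like this (sketch only, the 
struct name is made up):

typedef struct SharedBitmapHeapScanInfo
{
    /* ... other shared state ... */

    /* one slot per potential worker; only the used slots get initialized */
    BitmapHeapScanInstrumentation stats[MAX_PARALLEL_WORKER_LIMIT];
} SharedBitmapHeapScanInfo;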

My first preference is the new option (4). My second preference is 
option (1). What's your take?

-- 
David Geier
(ServiceNow)




Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

From
Dmitry Dolgov
Date:
> On Wed, Sep 20, 2023 at 03:42:43PM +0200, David Geier wrote:
> Another, not yet discussed, option I can see work is:
>
> 4. Allocate a fixed amount of memory for the instrumentation stats based on
> MAX_PARALLEL_WORKER_LIMIT: MAX_PARALLEL_WORKER_LIMIT is 1024 and used as the
> limit of the max_parallel_workers GUC. This way MAX_PARALLEL_WORKER_LIMIT *
> sizeof(BitmapHeapScanInstrumentation) = 1024 * 8 = 8192 bytes would be
> allocated. To cut this down in half we could additionally change the type of
> lossy_pages and exact_pages from long to uint32. Only possibly needed memory
> would have to get initialized, the remaining unused memory would remain
> untouched to not waste cycles.

I'm not sure that it would be acceptable -- if I understand correctly it
would be 8192 bytes per parallel bitmap heap scan node, and could be
noticeable in the worst case scenario with too many connections and too
many such nodes in every query.

I find the original approach with an offset not that bad, after all
there is something similar going on in other places, e.g. parallel heap
scan also has phs_snapshot_off (although the rest is fixed sized). My
commentary above in the thread was mostly about the cosmetic side.
Giving snapshot_and_stats a decent name, and maybe even ditching the
access functions and using just the offset field in the couple of places
that need it, would look better to me.
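
Something along these lines, just to illustrate (the names are made up):

typedef struct ParallelBitmapHeapState
{
    /* ... fixed-size fields ... */
    Size        stats_off;      /* offset of per-worker stats in 'data' */
    char        data[FLEXIBLE_ARRAY_MEMBER];    /* snapshot, then stats */
} ParallelBitmapHeapState;

/* and the couple of places that need the stats would simply do: */
stats = (BitmapHeapScanInstrumentation *) (pstate->data + pstate->stats_off);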



Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

From
Michael Christofides
Date:
 
> EXPLAIN ANALYZE for parallel Bitmap Heap Scans currently only reports
> the number of heap blocks processed by the leader. It's missing the
> per-worker stats.

Hi David,

According to the docs[1]: "In a parallel bitmap heap scan, one process is chosen as the leader. That process performs a scan of one or more indexes and builds a bitmap indicating which table blocks need to be visited. These blocks are then divided among the cooperating processes as in a parallel sequential scan."

My understanding is that the "Heap Blocks" statistic is only reporting blocks for the bitmap (i.e. not the subsequent scan). As such, I think it is correct that the workers do not report additional exact heap blocks. 
 
> explain (analyze, costs off, timing off) select * from foo where col0 >
> 900 or col1 = 1;

In your example, if you add the buffers and verbose parameters, do the worker reported buffers numbers report what you are looking for?

i.e. explain (analyze, buffers, verbose, costs off, timing off) select * from foo where col0 > 900 or col1 = 1;

Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

From
Robert Haas
Date:
On Mon, Oct 16, 2023 at 12:31 PM Michael Christofides
<michael@pgmustard.com> wrote:
> According to the docs[1]: "In a parallel bitmap heap scan, one process is chosen as the leader. That process performs
> a scan of one or more indexes and builds a bitmap indicating which table blocks need to be visited. These blocks are
> then divided among the cooperating processes as in a parallel sequential scan."
>
> My understanding is that the "Heap Blocks" statistic is only reporting blocks for the bitmap (i.e. not the subsequent
> scan). As such, I think it is correct that the workers do not report additional exact heap blocks.

I think you're wrong about that. The bitmap index scans are what scan
the indexes and build the bitmap. The bitmap heap scan node is what
scans the heap i.e. the table, and that is what is divided across the
workers.

On the patch itself, snapshot_and_stats doesn't strike me as a great
name. If we added three more variable-length things would we call the
member snapshot_and_stats_and_pink_and_orange_and_blue? Probably
better to pick a name that is somehow more generic.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

From
Dilip Kumar
Date:
On Fri, Jan 20, 2023 at 2:04 PM David Geier <geidav.pg@gmail.com> wrote:
>
> Hi hackers,
>
> EXPLAIN ANALYZE for parallel Bitmap Heap Scans currently only reports
> the number of heap blocks processed by the leader. It's missing the
> per-worker stats. The attached patch adds that functionality in the
> spirit of e.g. Sort or Memoize. Here is a simple test case and the
> EXPLAIN ANALYZE output with and without the patch:
>

> With the patch:
>
>   Gather (actual rows=99501 loops=1)
>     Workers Planned: 2
>     Workers Launched: 2
>     ->  Parallel Bitmap Heap Scan on foo (actual rows=33167 loops=3)
>           Recheck Cond: ((col0 > 900) OR (col1 = 1))
>           Heap Blocks: exact=98
>           Worker 0:  Heap Blocks: exact=171 lossy=0
>           Worker 1:  Heap Blocks: exact=172 lossy=0


  else
  {
+     if (planstate->stats.exact_pages > 0)
+         appendStringInfo(es->str, " exact=%ld", planstate->stats.exact_pages);
+     if (planstate->stats.lossy_pages > 0)
+         appendStringInfo(es->str, " lossy=%ld", planstate->stats.lossy_pages);
      appendStringInfoChar(es->str, '\n');
  }
  }
....
+ for (int n = 0; n < planstate->shared_info->num_workers; n++)
+ {
....
+ "Heap Blocks: exact="UINT64_FORMAT" lossy=" INT64_FORMAT"\n",
+ si->exact_pages, si->lossy_pages);
Shouldn't we use the same format for reporting exact and lossy pages
for the actual backend and the worker? I mean here for the backend you
are showing lossy pages only if it is > 0 whereas for workers we are
showing 0 lossy pages as well.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

From
Tomas Vondra
Date:
Hi David,

Do you plan to continue working on this patch? I did take a look, and on
the whole it looks reasonable - it modifies the right places etc.

I think there are two things that may need an improvement:

1) Storing variable-length data in ParallelBitmapHeapState

I agree with Robert the snapshot_and_stats name is not great. I see
Dmitry mentioned phs_snapshot_off as used by ParallelTableScanDescData -
the reasons are somewhat different (phs_snapshot_off exists because we
don't know which exact struct will be allocated), while here we simply
need to allocate two variable-length pieces of memory. But it seems like
it would work nicely for this. That is, we could try adding an offset
for each of those pieces of memory:

 - snapshot_off
 - stats_off

I don't like the GetSharedSnapshotData name very much, it seems very
close to GetSnapshotData - quite confusing, I think.

Dmitry also suggested we might add a separate piece of shared memory. I
don't quite see how that would work for ParallelBitmapHeapState, but I
doubt it'd be simpler than having two offsets. I don't think the extra
complexity (paid by everyone) would be worth it just to make EXPLAIN
ANALYZE work.


2) Leader vs. worker counters

It seems to me this does nothing to add the per-worker values from "Heap
Blocks" into the leader, which means we get stuff like this:

    Heap Blocks: exact=102 lossy=10995
    Worker 0:  actual time=50.559..209.773 rows=215253 loops=1
       Heap Blocks: exact=207 lossy=19354
    Worker 1:  actual time=50.543..211.387 rows=162934 loops=1
       Heap Blocks: exact=161 lossy=14636

I think this is wrong / confusing, and inconsistent with what we do for
other nodes. It's also inconsistent with how we deal e.g. with BUFFERS,
where we *do* add the values to the leader:

    Heap Blocks: exact=125 lossy=10789
    Buffers: shared hit=11 read=45420
    Worker 0:  actual time=51.419..221.904 rows=150437 loops=1
      Heap Blocks: exact=136 lossy=13541
      Buffers: shared hit=4 read=13541
    Worker 1:  actual time=56.610..222.469 rows=229738 loops=1
      Heap Blocks: exact=209 lossy=20655
      Buffers: shared hit=4 read=20655

Here it's not entirely obvious, because leader participates in the
execution, but once we disable leader participation, it's clearer:

    Buffers: shared hit=7 read=45421
    Worker 0:  actual time=28.540..247.683 rows=309112 loops=1
      Heap Blocks: exact=282 lossy=27806
      Buffers: shared hit=4 read=28241
    Worker 1:  actual time=24.290..251.993 rows=190815 loops=1
      Heap Blocks: exact=188 lossy=17179
      Buffers: shared hit=3 read=17180

Not only is "Buffers" clearly a sum of per-worker stats, but the "Heap
Blocks" simply disappeared because the leader does nothing and we don't
print zeros.


3) I'm not sure the handling of the various EXPLAIN flags is entirely
correct. Consider this:

EXPLAIN (ANALYZE):

   ->  Parallel Bitmap Heap Scan on t  (...)
         Recheck Cond: (a < 5000)
         Rows Removed by Index Recheck: 246882
         Worker 0:  Heap Blocks: exact=168 lossy=15648
         Worker 1:  Heap Blocks: exact=302 lossy=29337

EXPLAIN (ANALYZE, VERBOSE):

   ->  Parallel Bitmap Heap Scan on public.t  (...)
         Recheck Cond: (t.a < 5000)
         Rows Removed by Index Recheck: 246882
         Worker 0:  actual time=35.067..300.882 rows=282108 loops=1
           Heap Blocks: exact=257 lossy=25358
         Worker 1:  actual time=32.827..302.224 rows=217819 loops=1
           Heap Blocks: exact=213 lossy=19627

EXPLAIN (ANALYZE, BUFFERS):

   ->  Parallel Bitmap Heap Scan on t  (...)
         Recheck Cond: (a < 5000)
         Rows Removed by Index Recheck: 246882
         Buffers: shared hit=7 read=45421
         Worker 0:  Heap Blocks: exact=236 lossy=21870
         Worker 1:  Heap Blocks: exact=234 lossy=23115

EXPLAIN (ANALYZE, VERBOSE, BUFFERS):

   ->  Parallel Bitmap Heap Scan on public.t  (...)
         Recheck Cond: (t.a < 5000)
         Rows Removed by Index Recheck: 246882
         Buffers: shared hit=7 read=45421
         Worker 0:  actual time=28.265..260.381 rows=261264 loops=1
           Heap Blocks: exact=260 lossy=23477
           Buffers: shared hit=3 read=23478
         Worker 1:  actual time=28.224..261.627 rows=238663 loops=1
           Heap Blocks: exact=210 lossy=21508
           Buffers: shared hit=4 read=21943

Why should the per-worker buffer info be shown when combined with the
VERBOSE flag, and not just with BUFFERS, when the patch shows the
per-worker info always?


4) Now that I think about this, isn't the *main* problem really that we
don't display the sum of the per-worker stats (which I think is wrong)?
I mean, we already can get the worker details with VERBOSE, right? So the
only reason to display that by default seems to be that the values in
"Heap Blocks" are from the leader only.

BTW doesn't this also suggest some of the code added to explain.c may
not be quite necessary? Wouldn't it be enough to just "extend" the
existing code printing per-worker stats? (I haven't tried, so maybe I'm
wrong and we need the new code.)


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

From
Melanie Plageman
Date:
On Sat, Feb 17, 2024 at 5:31 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Hi David,
>
> Do you plan to work continue working on this patch? I did take a look,
> and on the whole it looks reasonable - it modifies the right places etc.

I haven't started reviewing this patch yet, but I just ran into the
behavior that it fixes and was very outraged. +10000 to fixing this.

- Melanie



Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

From
Heikki Linnakangas
Date:
On 18/02/2024 00:31, Tomas Vondra wrote:
> Do you plan to work continue working on this patch? I did take a look,
> and on the whole it looks reasonable - it modifies the right places etc.

+1

> I think there are two things that may need an improvement:
> 
> 1) Storing variable-length data in ParallelBitmapHeapState
> 
> I agree with Robert the snapshot_and_stats name is not great. I see
> Dmitry mentioned phs_snapshot_off as used by ParallelTableScanDescData -
> the reasons are somewhat different (phs_snapshot_off exists because we
> don't know which exact struct will be allocated), while here we simply
> need to allocate two variable-length pieces of memory. But it seems like
> it would work nicely for this. That is, we could try adding an offset
> for each of those pieces of memory:
> 
>   - snapshot_off
>   - stats_off
> 
> I don't like the GetSharedSnapshotData name very much, it seems very
> close to GetSnapshotData - quite confusing, I think.
> 
> Dmitry also suggested we might add a separate piece of shared memory. I
> don't quite see how that would work for ParallelBitmapHeapState, but I
> doubt it'd be simpler than having two offsets. I don't think the extra
> complexity (paid by everyone) would be worth it just to make EXPLAIN
> ANALYZE work.

I just removed phs_snapshot_data in commit 84c18acaf6. I thought that 
would make this moot, but now that I rebased this, there are still some 
aesthetic questions about how best to represent this.

In all the other node types that use shared instrumentation like this, 
the pattern is as follows: (using Memoize here as an example, but it's 
similar for Sort, IncrementalSort, Agg and Hash)

/* ----------------
  *     Shared memory container for per-worker memoize information
  * ----------------
  */
typedef struct SharedMemoizeInfo
{
    int            num_workers;
    MemoizeInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
} SharedMemoizeInfo;

/* this struct is backend-private */
typedef struct MemoizeState
{
    ScanState    ss;                /* its first field is NodeTag */
    ...
    MemoizeInstrumentation stats;    /* execution statistics */
    SharedMemoizeInfo *shared_info; /* statistics for parallel workers */
} MemoizeState;

While the scan is running, the node updates its private data in 
MemoizeState->stats. At the end of a parallel scan, the worker process 
copies the MemoizeState->stats to MemoizeState->shared_info->stats, 
which lives in shared memory. The leader process copies 
MemoizeState->shared_info->stats to its own backend-private copy, which 
it then stores in its MemoizeState->shared_info, replacing the pointer 
to the shared memory with a pointer to the private copy. That happens in 
ExecMemoizeRetrieveInstrumentation().
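
For reference, that retrieval step looks roughly like this (simplified from 
nodeMemoize.c, so details may differ slightly):

void
ExecMemoizeRetrieveInstrumentation(MemoizeState *node)
{
    Size        size;
    SharedMemoizeInfo *si;

    if (node->shared_info == NULL)
        return;

    /* replace the pointer into the DSM with a backend-private copy */
    size = offsetof(SharedMemoizeInfo, sinstrument)
        + node->shared_info->num_workers * sizeof(MemoizeInstrumentation);
    si = palloc(size);
    memcpy(si, node->shared_info, size);
    node->shared_info = si;
}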

This is a little different for parallel bitmap heap scans, because a 
bitmap heap scan keeps some other data in shared memory too, not just 
instrumentation data. Also, the naming is inconsistent: the equivalent 
of SharedMemoizeInfo is actually called ParallelBitmapHeapState. I think 
we should rename it to SharedBitmapHeapInfo, to make it clear that it 
lives in shared memory, but I digress.

We could now put the new stats at the end of ParallelBitmapHeapState as 
a varlen field. But I'm not sure that's a good idea. In 
ExecBitmapHeapRetrieveInstrumentation(), would we make a backend-private 
copy of the whole ParallelBitmapHeapState struct, even though the other 
fields don't make sense after the shared memory is released? Sounds 
confusing. Or we could introduce a separate struct for the stats, and 
copy just that:

typedef struct SharedBitmapHeapInstrumentation
{
    int            num_workers;
    BitmapHeapScanInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
} SharedBitmapHeapInstrumentation;

typedef struct BitmapHeapScanState
{
    ScanState    ss;                /* its first field is NodeTag */
    ...
    SharedBitmapHeapInstrumentation sinstrument;
} BitmapHeapScanState;

that compiles, at least with my compiler, but I find it weird to have a 
variable-length inner struct embedded in an outer struct like that.

Long story short, I think it's still better to store 
ParallelBitmapHeapInstrumentationInfo separately in the DSM chunk, not 
as part of ParallelBitmapHeapState. Attached patch does that, rebased 
over current master.


I didn't address any of the other things that you, Tomas, pointed out, 
but I think they're valid concerns.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

From
Melanie Plageman
Date:
On Thu, Mar 14, 2024 at 05:30:30PM +0200, Heikki Linnakangas wrote:
> On 18/02/2024 00:31, Tomas Vondra wrote:
> > Do you plan to work continue working on this patch? I did take a look,
> > and on the whole it looks reasonable - it modifies the right places etc.
> 
> +1
> 
> > I think there are two things that may need an improvement:
> > 
> > 1) Storing variable-length data in ParallelBitmapHeapState
> > 
> > I agree with Robert the snapshot_and_stats name is not great. I see
> > Dmitry mentioned phs_snapshot_off as used by ParallelTableScanDescData -
> > the reasons are somewhat different (phs_snapshot_off exists because we
> > don't know which exact struct will be allocated), while here we simply
> > need to allocate two variable-length pieces of memory. But it seems like
> > it would work nicely for this. That is, we could try adding an offset
> > for each of those pieces of memory:
> > 
> >   - snapshot_off
> >   - stats_off
> > 
> > I don't like the GetSharedSnapshotData name very much, it seems very
> > close to GetSnapshotData - quite confusing, I think.
> > 
> > Dmitry also suggested we might add a separate piece of shared memory. I
> > don't quite see how that would work for ParallelBitmapHeapState, but I
> > doubt it'd be simpler than having two offsets. I don't think the extra
> > complexity (paid by everyone) would be worth it just to make EXPLAIN
> > ANALYZE work.
> 
> I just removed phs_snapshot_data in commit 84c18acaf6. I thought that would
> make this moot, but now that I rebased this, there are stills some aesthetic
> questions on how best to represent this.
> 
> In all the other node types that use shared instrumentation like this, the
> pattern is as follows: (using Memoize here as an example, but it's similar
> for Sort, IncrementalSort, Agg and Hash)
> 
> /* ----------------
>  *     Shared memory container for per-worker memoize information
>  * ----------------
>  */
> typedef struct SharedMemoizeInfo
> {
>     int            num_workers;
>     MemoizeInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
> } SharedMemoizeInfo;
> 
> /* this struct is backend-private */
> typedef struct MemoizeState
> {
>     ScanState    ss;                /* its first field is NodeTag */
>     ...
>     MemoizeInstrumentation stats;    /* execution statistics */
>     SharedMemoizeInfo *shared_info; /* statistics for parallel workers */
> } MemoizeState;
> 
> While the scan is running, the node updates its private data in
> MemoizeState->stats. At the end of a parallel scan, the worker process
> copies the MemoizeState->stats to MemoizeState->shared_info->stats, which
> lives in shared memory. The leader process copies
> MemoizeState->shared_info->stats to its own backend-private copy, which it
> then stores in its MemoizeState->shared_info, replacing the pointer to the
> shared memory with a pointer to the private copy. That happens in
> ExecMemoizeRetrieveInstrumentation().
> 
> This is a little different for parallel bitmap heap scans, because a bitmap
> heap scan keeps some other data in shared memory too, not just
> instrumentation data. Also, the naming is inconsistent: the equivalent of
> SharedMemoizeInfo is actually called ParallelBitmapHeapState. I think we
> should rename it to SharedBitmapHeapInfo, to make it clear that it lives in
> shared memory, but I digress.

FWIW, if we merge a BHS streaming read user like the one I propose in
[1] (not as a pre-condition to this but just as something to make you
more comfortable with these names), the ParallelBitmapHeapState will
basically only contain the shared iterator and the coordination state
for accessing it and could be named as such.

Then if you really wanted to be consistent with Memoize, you could name
the instrumentation SharedBitmapHeapInfo. But, personally I prefer the
name you gave it: SharedBitmapHeapInstrumentation. I think that would
have been a better name for SharedMemoizeInfo since num_workers is
really just used as the length of the array of instrumentation info.

> We could now put the new stats at the end of ParallelBitmapHeapState as a
> varlen field. But I'm not sure that's a good idea. In
> ExecBitmapHeapRetrieveInstrumentation(), would we make a backend-private
> copy of the whole ParallelBitmapHeapState struct, even though the other
> fields don't make sense after the shared memory is released? Sounds
> confusing. Or we could introduce a separate struct for the stats, and copy
> just that:
> 
> typedef struct SharedBitmapHeapInstrumentation
> {
>     int            num_workers;
>     BitmapHeapScanInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
> } SharedBitmapHeapInstrumentation;
> 
> typedef struct BitmapHeapScanState
> {
>     ScanState    ss;                /* its first field is NodeTag */
>     ...
>     SharedBitmapHeapInstrumentation sinstrument;
> } BitmapHeapScanState;
> 
> that compiles, at least with my compiler, but I find it weird to have a
> variable-length inner struct embedded in an outer struct like that.

In the attached patch, BitmapHeapScanState->sinstrument is a pointer,
though. Or are you proposing the above as an alternative that you
decided not to go with?

> Long story short, I think it's still better to store
> ParallelBitmapHeapInstrumentationInfo separately in the DSM chunk, not as
> part of ParallelBitmapHeapState. Attached patch does that, rebased over
> current master.

The approach in the attached patch looks good to me.

> I didn't address any of the other things that you, Tomas, pointed out, but I
> think they're valid concerns.

I'll send a separate review of these issues that are still present in
your patch as well.

- Melanie

[1] https://www.postgresql.org/message-id/flat/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com



Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

From
Heikki Linnakangas
Date:
On 14/03/2024 22:00, Melanie Plageman wrote:
> On Thu, Mar 14, 2024 at 05:30:30PM +0200, Heikki Linnakangas wrote:
>> typedef struct SharedBitmapHeapInstrumentation
>> {
>>     int            num_workers;
>>     BitmapHeapScanInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
>> } SharedBitmapHeapInstrumentation;
>>
>> typedef struct BitmapHeapScanState
>> {
>>     ScanState    ss;                /* its first field is NodeTag */
>>     ...
>>     SharedBitmapHeapInstrumentation sinstrument;
>> } BitmapHeapScanState;
>>
>> that compiles, at least with my compiler, but I find it weird to have a
>> variable-length inner struct embedded in an outer struct like that.
> 
> In the attached patch, BitmapHeapScanState->sinstrument is a pointer,
> though. Or are you proposing the above as an alternative that you
> decided not to go with?

Right, the above is what I contemplated at first but decided it was a 
bad idea.

-- 
Heikki Linnakangas
Neon (https://neon.tech)





Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

From
Donghang Lin
Date:
Hi Tomas,
>
> On Sat, Feb 17, 2024 at 2:31 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> Hi David,
>
> Do you plan to work continue working on this patch? I did take a look,
> and on the whole it looks reasonable - it modifies the right places etc.
>
> I think there are two things that may need an improvement:
>
> 1) Storing variable-length data in ParallelBitmapHeapState
>
> I agree with Robert the snapshot_and_stats name is not great. I see
> Dmitry mentioned phs_snapshot_off as used by ParallelTableScanDescData -
> the reasons are somewhat different (phs_snapshot_off exists because we
> don't know which exact struct will be allocated), while here we simply
> need to allocate two variable-length pieces of memory. But it seems like
> it would work nicely for this. That is, we could try adding an offset
> for each of those pieces of memory:
>
>  - snapshot_off
>  - stats_off
>
> I don't like the GetSharedSnapshotData name very much, it seems very
> close to GetSnapshotData - quite confusing, I think.
>
> Dmitry also suggested we might add a separate piece of shared memory. I
> don't quite see how that would work for ParallelBitmapHeapState, but I
> doubt it'd be simpler than having two offsets. I don't think the extra
> complexity (paid by everyone) would be worth it just to make EXPLAIN
> ANALYZE work.

This issue is now gone after Heikki's fix.  

> 2) Leader vs. worker counters
>
> It seems to me this does nothing to add the per-worker values from "Heap
> Blocks" into the leader, which means we get stuff like this:
>
>     Heap Blocks: exact=102 lossy=10995
>     Worker 0:  actual time=50.559..209.773 rows=215253 loops=1
>        Heap Blocks: exact=207 lossy=19354
>     Worker 1:  actual time=50.543..211.387 rows=162934 loops=1
>        Heap Blocks: exact=161 lossy=14636
>
> I think this is wrong / confusing, and inconsistent with what we do for
> other nodes. It's also inconsistent with how we deal e.g. with BUFFERS,
> where we *do* add the values to the leader:
>
>     Heap Blocks: exact=125 lossy=10789
>     Buffers: shared hit=11 read=45420
>     Worker 0:  actual time=51.419..221.904 rows=150437 loops=1
>       Heap Blocks: exact=136 lossy=13541
>       Buffers: shared hit=4 read=13541
>     Worker 1:  actual time=56.610..222.469 rows=229738 loops=1
>       Heap Blocks: exact=209 lossy=20655
>       Buffers: shared hit=4 read=20655
>
> Here it's not entirely obvious, because leader participates in the
> execution, but once we disable leader participation, it's clearer:
>
>     Buffers: shared hit=7 read=45421
>     Worker 0:  actual time=28.540..247.683 rows=309112 loops=1
>       Heap Blocks: exact=282 lossy=27806
>       Buffers: shared hit=4 read=28241
>     Worker 1:  actual time=24.290..251.993 rows=190815 loops=1
>       Heap Blocks: exact=188 lossy=17179
>       Buffers: shared hit=3 read=17180
>
> Not only is "Buffers" clearly a sum of per-worker stats, but the "Heap
> Blocks" simply disappeared because the leader does nothing and we don't
> print zeros.

Heap Blocks is specific to Bitmap Heap Scan. It seems that node-specific stats
do not aggregate the workers' stats into the leader's for some existing nodes. For example,
the Memoize node for Hits, Misses, etc.:

   ->  Nested Loop (actual rows=166667 loops=3)
         ->  Parallel Seq Scan on t (actual rows=33333 loops=3)
         ->  Memoize (actual rows=5 loops=100000)
               Cache Key: t.j
               Cache Mode: logical
               Hits: 32991  Misses: 5  Evictions: 0  Overflows: 0  Memory Usage: 2kB
               Worker 0:  Hits: 33551  Misses: 5  Evictions: 0  Overflows: 0  Memory Usage: 2kB
               Worker 1:  Hits: 33443  Misses: 5  Evictions: 0  Overflows: 0  Memory Usage: 2kB
               ->  Index Scan using uj on u (actual rows=5 loops=15)
                     Index Cond: (j = t.j)

Sort, HashAggregate also do the same stuff.

> 3) I'm not sure dealing with various EXPLAIN flags may not be entirely
> correct. Consider this:
>
> EXPLAIN (ANALYZE):
>
>    ->  Parallel Bitmap Heap Scan on t  (...)
>          Recheck Cond: (a < 5000)
>          Rows Removed by Index Recheck: 246882
>          Worker 0:  Heap Blocks: exact=168 lossy=15648
>          Worker 1:  Heap Blocks: exact=302 lossy=29337
>
> EXPLAIN (ANALYZE, VERBOSE):
>
>    ->  Parallel Bitmap Heap Scan on public.t  (...)
>          Recheck Cond: (t.a < 5000)
>          Rows Removed by Index Recheck: 246882
>          Worker 0:  actual time=35.067..300.882 rows=282108 loops=1
>            Heap Blocks: exact=257 lossy=25358
>          Worker 1:  actual time=32.827..302.224 rows=217819 loops=1
>            Heap Blocks: exact=213 lossy=19627
>
> EXPLAIN (ANALYZE, BUFFERS):
>
>    ->  Parallel Bitmap Heap Scan on t  (...)
>          Recheck Cond: (a < 5000)
>          Rows Removed by Index Recheck: 246882
>          Buffers: shared hit=7 read=45421
>          Worker 0:  Heap Blocks: exact=236 lossy=21870
>          Worker 1:  Heap Blocks: exact=234 lossy=23115
>
> EXPLAIN (ANALYZE, VERBOSE, BUFFERS):
>
>    ->  Parallel Bitmap Heap Scan on public.t  (...)
>          Recheck Cond: (t.a < 5000)
>          Rows Removed by Index Recheck: 246882
>          Buffers: shared hit=7 read=45421
>          Worker 0:  actual time=28.265..260.381 rows=261264 loops=1
>            Heap Blocks: exact=260 lossy=23477
>            Buffers: shared hit=3 read=23478
>          Worker 1:  actual time=28.224..261.627 rows=238663 loops=1
>            Heap Blocks: exact=210 lossy=21508
>            Buffers: shared hit=4 read=21943
>
> Why should the per-worker buffer info be shown when combined with the
> VERBOSE flag, and not just with BUFFERS, when the patch shows the
> per-worker info always?
>

It seems that the general EXPLAIN print framework requires VERBOSE mode to show per-worker stats;
that is how e.g. buffer hits and JIT are printed. However, some specific nodes that involve parallelism
always show their worker blocks. This is why some worker blocks don't have buffer stats in
non-verbose mode. Several existing nodes have the same issue as this patch: Memoize, Sort,
and HashAggregate.

> 4) Now that I think about this, isn't the *main* problem really that we
> don't display the sum of the per-worker stats (which I think is wrong)?
> I mean, we already can get the worker details VERBOSEm right? So the
> only reason to display that by default seems to be that it the values in
> "Heap Blocks" are from the leader only.

Would it look better if we printed the aggregated Heap Blocks in the 'leader' block and showed
worker stats only in VERBOSE mode? That would match better how the general framework prints
worker stats, but it would differ from the existing nodes that also print worker stats, namely
the Memoize, Sort, and HashAggregate nodes mentioned above.
By the way, is that also a problem for those nodes?

> BTW doesn't this also suggest some of the code added to explain.c may
> not be quite necessary? Wouldn't it be enough to just "extend" the
> existing code printing per-worker stats. (I haven't tried, so maybe I'm
> wrong and we need the new code.)
 
We need the new code, since these are node-specific stats and we do call the worker print function
to lay out the EXPLAIN plan. I think the open question is in which mode we should show the worker
blocks; this is discussed in the section above.

v3 failed to apply on master, so attached is a rebased version.

Regards,
Donghang Lin
(ServiceNow)
Attachment

Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

From
Melanie Plageman
Date:
On Mon, Mar 25, 2024 at 2:29 AM Donghang Lin <donghanglin@gmail.com> wrote:
>
>
> > On Sat, Feb 17, 2024 at 2:31 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> > 2) Leader vs. worker counters
> >
> > It seems to me this does nothing to add the per-worker values from "Heap
> > Blocks" into the leader, which means we get stuff like this:
> >
> >     Heap Blocks: exact=102 lossy=10995
> >     Worker 0:  actual time=50.559..209.773 rows=215253 loops=1
> >        Heap Blocks: exact=207 lossy=19354
> >     Worker 1:  actual time=50.543..211.387 rows=162934 loops=1
> >        Heap Blocks: exact=161 lossy=14636
> >
> > I think this is wrong / confusing, and inconsistent with what we do for
> > other nodes. It's also inconsistent with how we deal e.g. with BUFFERS,
> > where we *do* add the values to the leader:
> >
> >     Heap Blocks: exact=125 lossy=10789
> >     Buffers: shared hit=11 read=45420
> >     Worker 0:  actual time=51.419..221.904 rows=150437 loops=1
> >       Heap Blocks: exact=136 lossy=13541
> >       Buffers: shared hit=4 read=13541
> >     Worker 1:  actual time=56.610..222.469 rows=229738 loops=1
> >       Heap Blocks: exact=209 lossy=20655
> >       Buffers: shared hit=4 read=20655
> >
> > Here it's not entirely obvious, because leader participates in the
> > execution, but once we disable leader participation, it's clearer:
> >
> >     Buffers: shared hit=7 read=45421
> >     Worker 0:  actual time=28.540..247.683 rows=309112 loops=1
> >       Heap Blocks: exact=282 lossy=27806
> >       Buffers: shared hit=4 read=28241
> >     Worker 1:  actual time=24.290..251.993 rows=190815 loops=1
> >       Heap Blocks: exact=188 lossy=17179
> >       Buffers: shared hit=3 read=17180
> >
> > Not only is "Buffers" clearly a sum of per-worker stats, but the "Heap
> > Blocks" simply disappeared because the leader does nothing and we don't
> > print zeros.
>
> Heap Blocks is specific to Bitmap Heap Scan. It seems that node specific stats
> do not aggregate workers' stats into leaders for some existing nodes. For example,
> Memorize node for Hits, Misses, etc
>
>    ->  Nested Loop (actual rows=166667 loops=3)
>          ->  Parallel Seq Scan on t (actual rows=33333 loops=3)
>          ->  Memoize (actual rows=5 loops=100000)
>                Cache Key: t.j
>                Cache Mode: logical
>                Hits: 32991  Misses: 5  Evictions: 0  Overflows: 0  Memory Usage: 2kB
>                Worker 0:  Hits: 33551  Misses: 5  Evictions: 0  Overflows: 0  Memory Usage: 2kB
>                Worker 1:  Hits: 33443  Misses: 5  Evictions: 0  Overflows: 0  Memory Usage: 2kB
>                ->  Index Scan using uj on u (actual rows=5 loops=15)
>                      Index Cond: (j = t.j)
>
> Sort, HashAggregate also do the same stuff.
>
> > 3) I'm not sure dealing with various EXPLAIN flags may not be entirely
> > correct. Consider this:
> >
> > EXPLAIN (ANALYZE):
> >
> >    ->  Parallel Bitmap Heap Scan on t  (...)
> >          Recheck Cond: (a < 5000)
> >          Rows Removed by Index Recheck: 246882
> >          Worker 0:  Heap Blocks: exact=168 lossy=15648
> >          Worker 1:  Heap Blocks: exact=302 lossy=29337
> >
> > EXPLAIN (ANALYZE, VERBOSE):
> >
> >    ->  Parallel Bitmap Heap Scan on public.t  (...)
> >          Recheck Cond: (t.a < 5000)
> >          Rows Removed by Index Recheck: 246882
> >          Worker 0:  actual time=35.067..300.882 rows=282108 loops=1
> >            Heap Blocks: exact=257 lossy=25358
> >          Worker 1:  actual time=32.827..302.224 rows=217819 loops=1
> >            Heap Blocks: exact=213 lossy=19627
> >
> > EXPLAIN (ANALYZE, BUFFERS):
> >
> >    ->  Parallel Bitmap Heap Scan on t  (...)
> >          Recheck Cond: (a < 5000)
> >          Rows Removed by Index Recheck: 246882
> >          Buffers: shared hit=7 read=45421
> >          Worker 0:  Heap Blocks: exact=236 lossy=21870
> >          Worker 1:  Heap Blocks: exact=234 lossy=23115
> >
> > EXPLAIN (ANALYZE, VERBOSE, BUFFERS):
> >
> >    ->  Parallel Bitmap Heap Scan on public.t  (...)
> >          Recheck Cond: (t.a < 5000)
> >          Rows Removed by Index Recheck: 246882
> >          Buffers: shared hit=7 read=45421
> >          Worker 0:  actual time=28.265..260.381 rows=261264 loops=1
> >            Heap Blocks: exact=260 lossy=23477
> >            Buffers: shared hit=3 read=23478
> >          Worker 1:  actual time=28.224..261.627 rows=238663 loops=1
> >            Heap Blocks: exact=210 lossy=21508
> >            Buffers: shared hit=4 read=21943
> >
> > Why should the per-worker buffer info be shown when combined with the
> > VERBOSE flag, and not just with BUFFERS, when the patch shows the
> > per-worker info always?
> >
>
> It seems that the general explain print framework requires verbose mode to show per worker stats.
> For example, how Buffers hits, JIT are printed. While in some specific nodes which involves parallelism,
> they always show worker blocks. This is why we see that some worker blocks don't have buffers
> stats in non verbose mode. There are several existing nodes have the same issue as what this
> patch does: memorize, sort, hashaggregate.

I don't think passing explain the BUFFERS option should impact what is
shown for bitmap heap scan lossy/exact. BUFFERS has to do with buffer
usage. "Heap Blocks" here is actually more accurately "Bitmap Blocks".
1) it is not heap specific, so at least changing it to "Table Blocks"
would be better and 2) the meaning is about whether or not the blocks
were represented lossily in the bitmap -- yes, those blocks that we
are talking about are table blocks (in contrast to index blocks), but
the important part is the bitmap.

So, BUFFERS shouldn't cause this info to show.

As for whether or not the leader number should be inclusive of all the
worker numbers, if there is a combination of options in which
per-worker stats are not  displayed, then the leader count should be
inclusive of the worker numbers. However, if the worker numbers are
always displayed, I think it is okay for the leader number to only
display its own count. Though I agree with Tomas that that can be
confusing when parallel_leader_participation is off. You end up
without a topline number.

As for whether or not per-worker stats should be displayed by default
or only with VERBOSE, it sounds like there are two different
precedents. I don't have a strong feeling one way or the other.
Whichever is most consistent.
Donghang, could you list again which plan nodes and explain options
always print per-worker stats and which only do with the VERBOSE
option?

I think there is an issue with the worker counts on rescan though. I
was playing around with this patch with one of the regression test
suite tables:

drop table if exists tenk1;
drop table if exists tenk2;
CREATE TABLE tenk1 (
    unique1        int4,
    unique2        int4,
    two            int4,
    four        int4,
    ten            int4,
    twenty        int4,
    hundred        int4,
    thousand    int4,
    twothousand    int4,
    fivethous    int4,
    tenthous    int4,
    odd            int4,
    even        int4,
    stringu1    name,
    stringu2    name,
    string4        name
) with (autovacuum_enabled = false);

COPY tenk1 FROM '/[source directory]/src/test/regress/data/tenk.data';

CREATE TABLE tenk2 AS SELECT * FROM tenk1;
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
VACUUM ANALYZE tenk1;
VACUUM ANALYZE tenk2;

set enable_seqscan to off;
set enable_indexscan to off;
set enable_hashjoin to off;
set enable_mergejoin to off;
set enable_material to off;
set parallel_setup_cost=0;
set parallel_tuple_cost=0;
set min_parallel_table_scan_size=0;
set max_parallel_workers_per_gather=2;
set parallel_leader_participation = off;
explain (analyze, costs off, verbose)
    select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and
tenk2.thousand=0;

I don't think the worker counts are getting carried across rescans.
For this query, with parallel_leader_participation, you can see the
leader has a high number of exact heap blocks (there are 30 rescans,
so it is the sum across all of those). But the workers have low
counts. That might be normal because maybe they did less work, but if
you set parallel_leader_participation = off, you see the worker
numbers are still small. I did some logging and I do see workers with
counts of lossy/exact not making it into the final count. I haven't
had time to debug more, but it is worth looking into.

parallel_leader_participation = on
-----------------------------------------------------------------------------------------------
 Aggregate (actual time=62.321..63.178 rows=1 loops=1)
   Output: count(*)
   ->  Nested Loop (actual time=2.058..55.202 rows=98000 loops=1)
         ->  Seq Scan on public.tenk2 (actual time=0.182..3.699 rows=10 loops=1)
               Output: tenk2.unique1 ...
               Filter: (tenk2.thousand = 0)
               Rows Removed by Filter: 9990
         ->  Gather (actual time=1.706..4.142 rows=9800 loops=10)
               Workers Planned: 2
               Workers Launched: 2
               ->  Parallel Bitmap Heap Scan on public.tenk1 (actual
time=0.365..0.958 rows=3267 loops=30)
                     Recheck Cond: (tenk1.hundred > 1)
                     Heap Blocks: exact=1801
                     Worker 0:  actual time=0.033..0.414 rows=1993 loops=10
                       Heap Blocks: exact=78 lossy=0
                     Worker 1:  actual time=0.032..0.550 rows=2684 loops=10
                       Heap Blocks: exact=86 lossy=0
                     ->  Bitmap Index Scan on tenk1_hundred (actual
time=0.972..0.972 rows=9800 loops=10)
                           Index Cond: (tenk1.hundred > 1)

parallel_leader_participation = off
-----------------------------------------------------------------------------------------------
 Aggregate (actual time=84.502..84.977 rows=1 loops=1)
   Output: count(*)
   ->  Nested Loop (actual time=6.185..77.085 rows=98000 loops=1)
         ->  Seq Scan on public.tenk2 (actual time=0.182..3.709 rows=10 loops=1)
               Output: tenk2.unique1...
               Filter: (tenk2.thousand = 0)
               Rows Removed by Filter: 9990
         ->  Gather (actual time=5.265..6.355 rows=9800 loops=10)
               Workers Planned: 2
               Workers Launched: 2
               ->  Parallel Bitmap Heap Scan on public.tenk1 (actual
time=0.951..1.863 rows=4900 loops=20)
                     Recheck Cond: (tenk1.hundred > 1)
                     Worker 0:  actual time=0.794..1.705 rows=4909 loops=10
                       Heap Blocks: exact=168 lossy=0
                     Worker 1:  actual time=1.108..2.021 rows=4891 loops=10
                       Heap Blocks: exact=177 lossy=0
                     ->  Bitmap Index Scan on tenk1_hundred (actual
time=1.024..1.024 rows=9800 loops=10)
                           Index Cond: (tenk1.hundred > 1)
                           Worker 1:  actual time=1.024..1.024
rows=9800 loops=10

- Melanie



On Mon, Mar 25, 2024 at 2:11 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> As for whether or not per-worker stats should be displayed by default
> or only with VERBOSE, it sounds like there are two different
> precedents. I don't have a strong feeling one way or the other.
> Whichever is most consistent.
> Donghang, could you list again which plan nodes and explain options
> always print per-worker stats and which only do with the VERBOSE
> option?

I took a look at explain.c where workers info is printed out.

These work for every parallel-aware node:
Buffers stats print for workers with VERBOSE and BUFFERS
WAL stats print for workers with VERBOSE and WAL
JIT stats print for workers with VERBOSE and COSTS
Timing print for workers with VERBOSE and TIMING
Rows and loops print for workers with VERBOSE

Some specific nodes:
The Sort / Incremental Sort / Hash / HashAggregate / Memoize nodes, and Bitmap Heap Scan (this patch),
always print their node-specific stats for workers.

> I did some logging and I do see workers with
> counts of lossy/exact not making it into the final count. I haven't
> had time to debug more, but it is worth looking into.

Indeed, a rescan overwrites the previous scan's stats in the workers.
Attached is v5, which is v4 plus a fix to aggregate the counts.

Regards,
Donghang Lin
(ServiceNow)
Attachment
Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

From
Alena Rybakina
Date:
Hi! Thank you for your work on this issue!

Your patch required a little revision. I did this and attached the patch.

Also, I think you should add some clarification to the comments about 
printing 'exact' and 'lossy' pages in the show_tidbitmap_info function, which 
are now taken from planstate->stats, whereas previously they were taken 
directly from planstate. Perhaps it is enough to mention this in the 
commit message.

I mean this place:

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 926d70afaf..02251994c6 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3467,26 +3467,57 @@ show_hashagg_info(AggState *aggstate, ExplainState *es)
  static void
  show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es)
  {
+    Assert(es->analyze);
+
      if (es->format != EXPLAIN_FORMAT_TEXT)
      {
          ExplainPropertyInteger("Exact Heap Blocks", NULL,
-                               planstate->exact_pages, es);
+                               planstate->stats.exact_pages, es);
          ExplainPropertyInteger("Lossy Heap Blocks", NULL,
-                               planstate->lossy_pages, es);
+                               planstate->stats.lossy_pages, es);
      }
      else
      {
-        if (planstate->exact_pages > 0 || planstate->lossy_pages > 0)
+        if (planstate->stats.exact_pages > 0 || planstate->stats.lossy_pages > 0)
          {
              ExplainIndentText(es);
              appendStringInfoString(es->str, "Heap Blocks:");
-            if (planstate->exact_pages > 0)
-                appendStringInfo(es->str, " exact=%ld", planstate->exact_pages);
-            if (planstate->lossy_pages > 0)
-                appendStringInfo(es->str, " lossy=%ld", planstate->lossy_pages);
+            if (planstate->stats.exact_pages > 0)
+                appendStringInfo(es->str, " exact=%ld", planstate->stats.exact_pages);
+            if (planstate->stats.lossy_pages > 0)
+                appendStringInfo(es->str, " lossy=%ld", planstate->stats.lossy_pages);
              appendStringInfoChar(es->str, '\n');
          }
      }
+
+    if (planstate->pstate != NULL)
+    {
+        for (int n = 0; n < planstate->sinstrument->num_workers; n++)
+        {
+            BitmapHeapScanInstrumentation *si = &planstate->sinstrument->sinstrument[n];
+
+            if (si->exact_pages == 0 && si->lossy_pages == 0)
+                continue;
+
+            if (es->workers_state)
+                ExplainOpenWorker(n, es);
+
+            if (es->format == EXPLAIN_FORMAT_TEXT)
+            {
+                ExplainIndentText(es);
+                appendStringInfo(es->str, "Heap Blocks: exact=%ld lossy=%ld\n",
+                         si->exact_pages, si->lossy_pages);
+            }
+            else
+            {
+                ExplainPropertyInteger("Exact Heap Blocks", NULL, si->exact_pages, es);
+                ExplainPropertyInteger("Lossy Heap Blocks", NULL, si->lossy_pages, es);
+            }
+
+            if (es->workers_state)
+                ExplainCloseWorker(n, es);
+        }
+    }
  }

I also suggest some code refactoring (see the diff.diff.no-cfbot file) that 
should improve the code.

-- 
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment
Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

From
Masahiro Ikeda
Date:
Hi,

Thanks for working on it! I'm interested in this feature, so I'd like to participate in the
patch review. Though I've just started looking at the patch, I have two comments
about the v6 patch. (And I want to confirm the thread is still active.)


1) Unify the print format of leader and workers

In show_tidbitmap_info(), the numbers of exact/lossy blocks of the leader and the workers
are printed. I think the printed format should be the same. Currently, the leader does not
print exact/lossy blocks with a value of 0, but the workers do.

IMHO, it's better to print both exact and lossy blocks if at least one of the two counts
is greater than 0. Also, the print logic is currently duplicated for the leader and the
workers; I think it would be better to factor it into a common function.
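
For example, something like this (just a sketch; the helper name is made up):

/* Print "Heap Blocks: exact=... lossy=..." for either the leader or a worker. */
static void
show_tidbitmap_block_counts(ExplainState *es, long exact_pages, long lossy_pages)
{
    if (exact_pages <= 0 && lossy_pages <= 0)
        return;

    if (es->format == EXPLAIN_FORMAT_TEXT)
    {
        ExplainIndentText(es);
        appendStringInfo(es->str, "Heap Blocks: exact=%ld lossy=%ld\n",
                         exact_pages, lossy_pages);
    }
    else
    {
        ExplainPropertyInteger("Exact Heap Blocks", NULL, exact_pages, es);
        ExplainPropertyInteger("Lossy Heap Blocks", NULL, lossy_pages, es);
    }
}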

2) Move the es->workers_state check

In show_tidbitmap_info(), ExplainOpenWorker() and ExplainCloseWorker() are called
only after checking that es->workers_state is not NULL. However, es->workers_state seems
to be NULL only for the Gather node (see ExplainPrintPlan()). Also, reading the
comments, there is a description saying that each worker's information needs to be hidden
when printing the plan.

Even if es->workers_state were NULL for a BitmapHeapScan node in the future,
I think the workers' information (Heap Blocks) should not be printed. Therefore,
shouldn't the es->workers_state check be moved up to the
"if (planstate->pstate != NULL)" condition, like in ExplainNode()?

IIUC, we would need to correct show_sort_info() and so on too…

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION



On Wed, 26 Jun 2024 at 22:22, <Masahiro.Ikeda@nttdata.com> wrote:
> 1) Unify the print format of leader and worker
>
> In show_tidbitmap_info(), the numbers of exact/lossy blocks are printed for the
> leader and for the workers. I think the printed format should be the same for both.
> Currently, the leader omits an exact/lossy count when its value is 0, but the
> workers print it even when it is 0.

I agree with this. The two should match. I've fixed that in the attached.

I also made a pass over the patch, and I also changed:

1. Fixed up a few outdated comments in execnodes.h.
2. Added a comment in ExecEndBitmapHeapScan() to explain why we += the
stats rather than memcpy the BitmapHeapScanInstrumentation.
3. A bunch of other comments.
4. updated typedefs.list and ran pgindent.

For #2, I was surprised at this. I think there's probably a bug in the
Memoize stats code for the same reason. I've not looked into that yet.
I find it a little bit strange that we're showing stats for Worker N when
that slot could have been filled by possibly hundreds of different parallel
workers, in the case where the Gather/GatherMerge node is rescanned: the
workers are shut down at the end of each Gather execution and fresh ones
are started up on rescan. I do agree that we need to accumulate the totals
from previous scans, as that's what the non-parallel version does.
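
To spell out what I mean by accumulating, the worker-side code in
ExecEndBitmapHeapScan() ends up doing roughly the following (a sketch, not
the exact hunk from the patch):

    if (node->sinstrument != NULL && IsParallelWorker())
    {
        BitmapHeapScanInstrumentation *si;

        Assert(ParallelWorkerNumber < node->sinstrument->num_workers);
        si = &node->sinstrument->sinstrument[ParallelWorkerNumber];

        /*
         * Add rather than overwrite: the node can be rescanned with a
         * fresh set of workers, and slot N may already hold counts from
         * a previous worker that used the same slot.
         */
        si->exact_pages += node->stats.exact_pages;
        si->lossy_pages += node->stats.lossy_pages;
    }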

Many people have been hacking on this and I'm wondering who should be
listed as authors.  I plan to put David Geier first. Should anyone
else be listed there?

I've attached the rebased v5 patch with part of Alena's changes from
the diff.diff.no-cfbot file. I left the following one off as it looks
wrong.

- ptr += MAXALIGN(sizeof(ParallelBitmapHeapState));
+ ptr += size;

That would make ptr point to the end of the allocation.
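
To spell out the intent: the shared chunk holds the ParallelBitmapHeapState
followed by the per-worker instrumentation, and that line is only meant to
step over the first part so the second part starts at a MAXALIGN'd offset.
Roughly (the variable and struct names here are illustrative):

    char   *ptr = (char *) pstate;      /* start of the shared chunk */

    /* skip over the first struct, keeping what follows suitably aligned */
    ptr += MAXALIGN(sizeof(ParallelBitmapHeapState));

    /* the per-worker instrumentation lives right after it */
    sinstrument = (SharedBitmapHeapInstrumentation *) ptr;

Advancing ptr by the full allocation size instead would skip past the
instrumentation array as well.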

I'd like to commit this patch soon, so if anyone wants to give it a
final look, can they do so before next week?

David

Attachment
On Fri, 5 Jul 2024 at 01:59, David Rowley <dgrowleyml@gmail.com> wrote:
> I also made a pass over the patch, and I also changed:
>
> 1. Fixed up a few outdated comments in execnodes.h.
> 2. Added a comment in ExecEndBitmapHeapScan() to explain why we += the
> stats rather than memcpy the BitmapHeapScanInstrumentation.
> 3. A bunch of other comments.
> 4. updated typedefs.list and ran pgindent.

One other thing I think we should do while on this topic is move away
from using "long" as the data type for storing the number of exact and
lossy pages. The problem is that "long" is only 32 bits wide on 64-bit
MSVC. A signed 32-bit counter can't represent anything more than 16TB
worth of 8k pages (2^31 pages * 8KB per page = 16TB).

I propose we change these to uint64 while we're already causing churn in
this area, probably as a follow-on patch.  I think uint32 isn't wide
enough, as you could exceed that limit with rescans.
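
Roughly, that just means (illustrative only, not the actual diff):

    uint64      exact_pages;    /* was: long exact_pages */
    uint64      lossy_pages;    /* was: long lossy_pages */

wherever those counters live, plus switching the matching %ld format
strings in show_tidbitmap_info() over to UINT64_FORMAT.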

David



On Fri, 5 Jul 2024 at 12:52, David Rowley <dgrowleyml@gmail.com> wrote:
> I propose we change these to uint64 while causing churn in this area,
> probably as a follow-on patch.  I think a uint32 isn't wide enough as
> you could exceed the limit with rescans.

I wondered how large a query it would take to cause this problem.  I tried:

create table a (a int);
insert into a select x%1000 from generate_Series(1,1500000)x;
create index on a(a);
vacuum freeze analyze a;

set enable_hashjoin=0;
set enable_mergejoin=0;
set enable_indexscan=0;
set max_parallel_workers_per_gather=0;

explain (analyze, costs off, timing off, summary off)
select count(*) from a a1 inner join a a2 on a1.a=a2.a;

After about 15 mins, the trimmed output from Linux is:

 Aggregate (actual rows=1 loops=1)
   ->  Nested Loop (actual rows=2250000000 loops=1)
         ->  Seq Scan on a a1 (actual rows=1500000 loops=1)
         ->  Bitmap Heap Scan on a a2 (actual rows=1500 loops=1500000)
               Recheck Cond: (a1.a = a)
               Heap Blocks: exact=2250000000
               ->  Bitmap Index Scan on a_a_idx (actual rows=1500 loops=1500000)
                     Index Cond: (a = a1.a)

Whereas, on MSVC, due to sizeof(long) == 4, it's:

 Aggregate  (actual rows=1 loops=1)
   ->  Nested Loop  (actual rows=2250000000 loops=1)
         ->  Seq Scan on a a1  (actual rows=1500000 loops=1)
         ->  Bitmap Heap Scan on a a2  (actual rows=1500 loops=1500000)
               Recheck Cond: (a1.a = a)
               ->  Bitmap Index Scan on a_a_idx (actual rows=1500 loops=1500000)
                     Index Cond: (a = a1.a)

Notice the "Heap Blocks: exact=2250000000" is missing on Windows.
This is because it wrapped around to a negative value and
show_tidbitmap_info() only shows > 0 values.

I feel this is a good enough justification to increase the width of
those counters to uint64, so I'll do that too.

David



On Mon, 8 Jul 2024 at 12:19, David Rowley <dgrowleyml@gmail.com> wrote:
> Notice the "Heap Blocks: exact=2250000000" is missing on Windows.
> This is because it wrapped around to a negative value and
> show_tidbitmap_info() only shows > 0 values.
>
> I feel this is a good enough justification to increase the width of
> those counters to uint64, so I'll do that too.

I pushed the widening of the types first as I saw some code in the
EXPLAIN patch which assumed var == 0 is the negation of var > 0; with a
signed type that can go negative, those two tests aren't complements. I
couldn't bring myself to commit that knowing it was wrong, and I also
couldn't bring myself to write <= 0 knowing I was about to make that look
like a weird thing to write for an unsigned type.

David



On Sun, 18 Feb 2024 at 11:31, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> 2) Leader vs. worker counters
>
> It seems to me this does nothing to add the per-worker values from "Heap
> Blocks" into the leader, which means we get stuff like this:
>
>     Heap Blocks: exact=102 lossy=10995
>     Worker 0:  actual time=50.559..209.773 rows=215253 loops=1
>        Heap Blocks: exact=207 lossy=19354
>     Worker 1:  actual time=50.543..211.387 rows=162934 loops=1
>        Heap Blocks: exact=161 lossy=14636
>
> I think this is wrong / confusing, and inconsistent with what we do for
> other nodes.

Are you able to share which other nodes you mean here?

I used the following to compare to Sort and Memoize, and as far as I
see, the behaviour matches with the attached v8 patch.

Is there some inconsistency here that I'm not seeing?

create table mill (a int);
create index on mill(a);
insert into mill select x%1000 from generate_Series(1,10000000)x;
vacuum analyze mill;
create table big (a int primary key);
insert into big select x from generate_series(1,10000000)x;
create table probe (a int);
insert into probe select 1 from generate_Series(1,1000000);
analyze big;
analyze probe;

set parallel_tuple_cost=0;
set parallel_setup_cost=0;
set enable_indexscan=0;

-- compare Parallel Bitmap Heap Scan with Memoize and Sort.

-- each includes "Worker N:" with stats for the operation.
explain (analyze) select * from mill where a < 100;
explain (analyze) select * from big b inner join probe p on b.a=p.a;
explain (analyze) select * from probe order by a;

-- each includes "Worker N:" with stats for the operation
-- also includes actual time and rows for each worker.
explain (analyze, verbose) select * from mill where a < 100;
explain (analyze, verbose) select * from big b inner join probe p on b.a=p.a;
explain (analyze, verbose) select * from probe order by a;

-- each includes "Worker N:" with stats for the operation
-- shows a single total buffers which includes leader and worker buffers.
explain (analyze, buffers) select * from mill where a < 100;
explain (analyze, buffers) select * from big b inner join probe p on b.a=p.a;
explain (analyze, buffers) select * from probe order by a;

-- each includes "Worker N:" with stats for the operation
-- also includes actual time and rows for each worker.
-- shows a single total buffers which includes leader and worker buffers.
-- shows buffer counts for each worker process
explain (analyze, buffers, verbose) select * from mill where a < 100;
explain (analyze, buffers, verbose) select * from big b inner join
probe p on b.a=p.a;
explain (analyze, buffers, verbose) select * from probe order by a;

If we did want to adjust things so that the node's main line showed totals
that include each worker's counts rather than just the leader's stats, what
would "Sort Method" show if one worker spilled to disk and another did not?

David

Attachment
On Mon, 8 Jul 2024 at 15:43, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Sun, 18 Feb 2024 at 11:31, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> > 2) Leader vs. worker counters
> >
> > It seems to me this does nothing to add the per-worker values from "Heap
> > Blocks" into the leader, which means we get stuff like this:
> >
> >     Heap Blocks: exact=102 lossy=10995
> >     Worker 0:  actual time=50.559..209.773 rows=215253 loops=1
> >        Heap Blocks: exact=207 lossy=19354
> >     Worker 1:  actual time=50.543..211.387 rows=162934 loops=1
> >        Heap Blocks: exact=161 lossy=14636
> >
> > I think this is wrong / confusing, and inconsistent with what we do for
> > other nodes.
>
> Are you able to share which other nodes that you mean here?

I did the analysis on this and out of the node types that have
parallel instrumentation (per ExecParallelRetrieveInstrumentation()),
Parallel Hash is the only node that does anything different from the
others. Looking at the loop inside show_hash_info(), you can see it
takes the Max() of each property. There's some discussion in [1] about
why this came about. In particular [2].

I see no reason to copy the odd one out here, so I'm planning on going
ahead with the patch that has Bitmap Heap Scan copy what the majority
of other nodes do. I think we should consider aligning Parallel Hash
with the other Parallel node behaviour.

I've attached the (roughly done) schema and queries I used to obtain
the plans to do this analysis.

David

[1] https://www.postgresql.org/message-id/flat/20200323165059.GA24950%40alvherre.pgsql
[2] https://www.postgresql.org/message-id/31321.1586549487%40sss.pgh.pa.us

Attachment
On Tue, 9 Jul 2024 at 11:51, David Rowley <dgrowleyml@gmail.com> wrote:
> I think we should consider aligning Parallel Hash
> with the other Parallel node behaviour.

I looked at that and quickly realised that it makes sense for Parallel
Hash to do something different here.  All the workers contribute to
building the same hash table, so they're all going to show the same set
of values, provided they managed to help build it.

We're able to tell how much each worker helped according to EXPLAIN
(ANALYZE, VERBOSE)'s Worker N: rows=n output.  I don't think there's
anything else not already shown that would be interesting to know per
worker.

David