Thread: Explain buffers wrong counter with parallel plans

Explain buffers wrong counter with parallel plans

From
Adrien Nayrat
Date:
Hello,


I noticed that EXPLAIN reports wrong counters with parallel plans:

create unlogged table t1 (c1 int);
insert into t1 select generate_series(1,100000000);
vacuum t1;

(stop PG, drop caches, start)

set track_io_timing to on;
explain (analyze,buffers)  select count(*) from t1;

                                                                  QUERY PLAN


-----------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=964311.55..964311.56 rows=1 width=8) (actual
time=34478.972..34478.972 rows=1 loops=1)
   Buffers: shared read=147952
   I/O Timings: read=20407.894
   ->  Gather  (cost=964311.33..964311.54 rows=2 width=8) (actual
time=34478.835..34478.937 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared read=147952
         I/O Timings: read=20407.894
         ->  Partial Aggregate  (cost=963311.33..963311.34 rows=1 width=8)
(actual time=34283.809..34283.810 rows=1 loops=3)
               Buffers: shared read=442478 dirtied=1 written=1
               I/O Timings: read=60706.314 write=0.498
               ->  Parallel Seq Scan on t1  (cost=0.00..859144.67 rows=41666667
width=0) (actual time=27.027..27405.058 rows=33333333 loops=3)
                     Buffers: shared read=442478 dirtied=1 written=1
                     I/O Timings: read=60706.314 write=0.498
 Planning time: 39.037 ms
 Execution time: 34480.280 ms

(stop PG, drop caches, start)

set max_parallel_workers_per_gather to 0;
explain (analyze,buffers)  select count(*) from t1;

                                                         QUERY PLAN


----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1692478.00..1692478.01 rows=1 width=8) (actual
time=34367.678..34367.678 rows=1 loops=1)
   Buffers: shared read=442478
   I/O Timings: read=7977.775
   ->  Seq Scan on t1  (cost=0.00..1442478.00 rows=100000000 width=0) (actual
time=8.672..19685.654 rows=100000000 loops=1)
         Buffers: shared read=442478
         I/O Timings: read=7977.775
 Planning time: 38.971 ms
 Execution time: 34368.097 ms
(8 rows)



Gather and Finalize Aggregate should report shared read=442478.

I am also surprised by the I/O Timings in the parallel nodes:
I/O Timings: read=60706.314, but the query only lasts 34s?

I played with git bisect and found this commit:

commit 01edb5c7fc3bcf6aea15f2b3be36189b52ad9d1a
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Fri Sep 1 17:38:54 2017 -0400

    Improve division of labor between execParallel.c and nodeGather[Merge].c.

    Move the responsibility for creating/destroying TupleQueueReaders into
    execParallel.c, to avoid duplicative coding in nodeGather.c and
    nodeGatherMerge.c.  Also, instead of having DestroyTupleQueueReader do
    shm_mq_detach, do it in the caller (which is now only ExecParallelFinish).
    This means execParallel.c does both the attaching and detaching of the
    tuple-queue-reader shm_mqs, which seems less weird than the previous
    arrangement.

    These changes also eliminate a vestigial memory leak (of the pei->tqueue
    array).  It's now demonstrable that rescans of Gather or GatherMerge don't
    leak memory.

    Discussion: https://postgr.es/m/8670.1504192177@sss.pgh.pa.us



Before this commit, here is the result:

explain (analyze,buffers)  select count(*) from t1;

                                                                  QUERY PLAN


-----------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=964311.55..964311.56 rows=1 width=8) (actual
time=34745.387..34745.387 rows=1 loops=1)
   Buffers: shared hit=160 read=442478 dirtied=1 written=1
   I/O Timings: read=61219.308 write=0.033
   ->  Gather  (cost=964311.33..964311.54 rows=2 width=8) (actual
time=34745.273..34745.379 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=160 read=442478 dirtied=1 written=1
         I/O Timings: read=61219.308 write=0.033
         ->  Partial Aggregate  (cost=963311.33..963311.34 rows=1 width=8)
(actual time=34660.023..34660.023 rows=1 loops=3)
               Buffers: shared read=442478 dirtied=1 written=1
               I/O Timings: read=61219.308 write=0.033
               ->  Parallel Seq Scan on t1  (cost=0.00..859144.67 rows=41666667
width=0) (actual time=11.559..27607.271 rows=33333333 loops=3)
                     Buffers: shared read=442478 dirtied=1 written=1
                     I/O Timings: read=61219.308 write=0.033
 Planning time: 38.094 ms
 Execution time: 34746.395 ms
(16 rows)

set max_parallel_workers_per_gather to 0;
explain (analyze,buffers)  select count(*) from t1;

                                                         QUERY PLAN


-----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1692478.00..1692478.01 rows=1 width=8) (actual
time=34368.540..34368.540 rows=1 loops=1)
   Buffers: shared read=442478
   I/O Timings: read=8874.662
   ->  Seq Scan on t1  (cost=0.00..1442478.00 rows=100000000 width=0) (actual
time=20.604..19992.941 rows=100000000 loops=1)
         Buffers: shared read=442478
         I/O Timings: read=8874.662
 Planning time: 37.834 ms
 Execution time: 34368.817 ms
(8 rows)

I got similar results on 9.6.

Regards,

--
Adrien NAYRAT





Re: Explain buffers wrong counter with parallel plans

From
Adrien Nayrat
Date:
Hello,

I tried to understand this issue, and it seems the Gather node only accounts for
its own buffer usage:


create unlogged table t1 (c1 int);
insert into t1 select generate_series(1,1000000);
vacuum t1;

explain (analyze,buffers,timing off,costs off) select count(*) from t1;
                              QUERY PLAN
------------------------------------------------------------------------
 Finalize Aggregate (actual rows=1 loops=1)
   Buffers: shared hit=1531
   ->  Gather (actual rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=1531
         ->  Partial Aggregate (actual rows=1 loops=3)
               Buffers: shared hit=4425
               ->  Parallel Seq Scan on t1 (actual rows=333333 loops=3)
                     Buffers: shared hit=4425


Same query without parallelism

                     QUERY PLAN
----------------------------------------------------
 Aggregate (actual rows=1 loops=1)
   Buffers: shared hit=4425
   ->  Seq Scan on t1 (actual rows=1000000 loops=1)
         Buffers: shared hit=4425


We can notice that Parallel Seq Scan and Partial Aggregate report 4425 buffers,
the same as the plan without parallelism.


I put an elog debug message around these lines in execParallel.c:

/* Accumulate the statistics from all workers. */
instrument = GetInstrumentationArray(instrumentation);
instrument += i * instrumentation->num_workers;
for (n = 0; n < instrumentation->num_workers; ++n)
{
    elog(LOG, "worker %d - shared_blks_read: %ld - shared_blks_hit: %ld",
         n,
         instrument[n].bufusage.shared_blks_read,
         instrument[n].bufusage.shared_blks_hit);
    InstrAggNode(planstate->instrument, &instrument[n]);
}


And I get these messages:

LOG:  worker 0 - shared_blks_read: 0 - shared_blks_hit: 1443
LOG:  worker 1 - shared_blks_read: 0 - shared_blks_hit: 1451


If you sum each worker's shared_blks_hit and Gather's shared hit, you get:
1443 + 1451 + 1531 = 4425 (the size of t1 in blocks)
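
As a sanity check (this query is my own addition, not part of the test above),
the block count of t1 can be confirmed directly:

select pg_relation_size('t1') / current_setting('block_size')::int as blocks;
-- expected to return 4425 for this 1,000,000-row table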

It seems the Gather node and Finalize Aggregate only report the buffers touched
by the main process; the workers seem to be omitted.

The same issue occurs for Gather Merge:

explain (analyze,buffers,timing off,costs off)
  select * from t1 where c1%10=4 order by 1;

                           QUERY PLAN
-----------------------------------------------------------------
 Gather Merge (actual rows=100000 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=1549
   ->  Sort (actual rows=33333 loops=3)
         Sort Key: c1
         Sort Method: quicksort  Memory: 3143kB
         Buffers: shared hit=4521
         ->  Parallel Seq Scan on t1 (actual rows=33333 loops=3)
               Filter: ((c1 % 10) = 4)
               Rows Removed by Filter: 300000
               Buffers: shared hit=4425


In my log:
worker 0 - shared_blks_read: 0 - shared_blks_hit: 1487
worker 1 - shared_blks_read: 0 - shared_blks_hit: 1485

1487 + 1485 + 1549 = 4521

I tried to understand how instrumentation works, but it is above my knowledge :(



Regards,

--
Adrien




Re: Explain buffers wrong counter with parallel plans

From
Amit Kapila
Date:
On Wed, Mar 28, 2018 at 12:07 AM, Adrien Nayrat
<adrien.nayrat@anayrat.info> wrote:
> Hello,
>
> I notice Explain report wrong counters with parallel plans :
>
> create unlogged table t1 (c1 int);
> insert into t1 select generate_series(1,100000000);
> vacuum t1;
>
> (stop PG, drop caches,start)
>
> set track_io_timing to on;
> explain (analyze,buffers)  select count(*) from t1;
>
>                                                                   QUERY PLAN
>
>
-----------------------------------------------------------------------------------------------------------------------------------------------
>  Finalize Aggregate  (cost=964311.55..964311.56 rows=1 width=8) (actual
> time=34478.972..34478.972 rows=1 loops=1)
>    Buffers: shared read=147952
>    I/O Timings: read=20407.894
>    ->  Gather  (cost=964311.33..964311.54 rows=2 width=8) (actual
> time=34478.835..34478.937 rows=3 loops=1)
>          Workers Planned: 2
>          Workers Launched: 2
>          Buffers: shared read=147952
>          I/O Timings: read=20407.894
>          ->  Partial Aggregate  (cost=963311.33..963311.34 rows=1 width=8)
> (actual time=34283.809..34283.810 rows=1 loops=3)
>                Buffers: shared read=442478 dirtied=1 written=1
>                I/O Timings: read=60706.314 write=0.498
>                ->  Parallel Seq Scan on t1  (cost=0.00..859144.67 rows=41666667
> width=0) (actual time=27.027..27405.058 rows=33333333 loops=3)
>                      Buffers: shared read=442478 dirtied=1 written=1
>                      I/O Timings: read=60706.314 write=0.498
>  Planning time: 39.037 ms
>  Execution time: 34480.280 ms
>
> (stop pg, drop caches,start)
>
> set max_parallel_workers_per_gather to 0;
> explain (analyze,buffers)  select count(*) from t1;
>
>                                                          QUERY PLAN
>
>
----------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=1692478.00..1692478.01 rows=1 width=8) (actual
> time=34367.678..34367.678 rows=1 loops=1)
>    Buffers: shared read=442478
>    I/O Timings: read=7977.775
>    ->  Seq Scan on t1  (cost=0.00..1442478.00 rows=100000000 width=0) (actual
> time=8.672..19685.654 rows=100000000 loops=1)
>          Buffers: shared read=442478
>          I/O Timings: read=7977.775
>  Planning time: 38.971 ms
>  Execution time: 34368.097 ms
> (8 rows)
>
>
>
> Gather and Finalize Aggregate should report shared read=442478.
>

Yeah, it would have been convenient if Gather and Finalize Aggregate displayed
it that way, as it would have been easier for users to understand.  However, as
of now, the aggregated stats for the parallel workers and the leader are
displayed at the parallel/partial nodes, as shown in the plan.  So, according to
me, what you are seeing is okay; it is only slightly tricky to understand.
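
(Concretely, in the plan upthread, the shared read=442478 shown at the Partial
Aggregate / Parallel Seq Scan level is the total across the leader and both
workers, whereas the shared read=147952 shown at Gather and Finalize Aggregate
counts only the blocks the leader process itself read.)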

> I am also surprised for I/O Timings in parallel node:
> I/O Timings: read=60706.314 but the query last 34s?
>

This is the aggregated time of all the parallel workers, so you need
to divide it by loops (which in your case is 3).
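For example, 60706.314 ms / 3 ≈ 20235 ms of read time per process, which fits
comfortably within the ~34 s total runtime.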

> I played with git bisect and I found this commit :
>
> commit 01edb5c7fc3bcf6aea15f2b3be36189b52ad9d1a
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   Fri Sep 1 17:38:54 2017 -0400
>

I think you were seeing different results before this commit because, before
it, we were shutting down the workers as soon as they were done, and the
buffer_usage stats were accumulated and used for the upper nodes.  According to
me, the behavior after the commit is consistent; for example, I think if you
check the case of Gather Merge before this commit, you will still get the stats
the way they appear after the commit.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Explain buffers wrong counter with parallel plans

From
Amit Kapila
Date:
On Fri, Apr 20, 2018 at 6:36 PM, Adrien Nayrat
<adrien.nayrat@anayrat.info> wrote:
> Hello,
>
> I tried to understand this issue and it seems Gather node only take account of
> this own buffer usage :
>
>
> create unlogged table t1 (c1 int);
> insert into t1 select generate_series(1,1000000);
> vacuum t1;
>
> explain (analyze,buffers,timing off,costs off) select count(*) from t1;
>                               QUERY PLAN
> ------------------------------------------------------------------------
>  Finalize Aggregate (actual rows=1 loops=1)
>    Buffers: shared hit=1531
>    ->  Gather (actual rows=3 loops=1)
>          Workers Planned: 2
>          Workers Launched: 2
>          Buffers: shared hit=1531
>          ->  Partial Aggregate (actual rows=1 loops=3)
>                Buffers: shared hit=4425
>                ->  Parallel Seq Scan on t1 (actual rows=333333 loops=3)
>                      Buffers: shared hit=4425
>
>
> Same query without parallelism
>
>                      QUERY PLAN
> ----------------------------------------------------
>  Aggregate (actual rows=1 loops=1)
>    Buffers: shared hit=4425
>    ->  Seq Scan on t1 (actual rows=1000000 loops=1)
>          Buffers: shared hit=4425
>
>
> We can notice Parallel Seq Scan and Partial Aggregate report 4425 buffers, same
> for the plan without parallelism.
>
>
> I put elog debug around theses lines in execParallel.c :
>
> /* Accumulate the statistics from all workers. */
> instrument = GetInstrumentationArray(instrumentation);
> instrument += i * instrumentation->num_workers;
> for (n = 0; n < instrumentation->num_workers; ++n)
>   {
>     elog(LOG, "worker %d - shared_blks_read: %ld - shared_blks_hit: %ld", n,
> instrument[n].bufusage.shared_blks_read,instrument[n].bufusage.shared_blks_hit);
>     InstrAggNode(planstate->instrument, &instrument[n]);
>   }
>
>
> And I get theses messages :
>
> LOG:  worker 0 - shared_blks_read: 0 - shared_blks_hit: 1443
> LOG:  worker 1 - shared_blks_read: 0 - shared_blks_hit: 1451
>

I think you can try the 'verbose' option; it will give per-worker stats.
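
Something like the following (an untested sketch against the same t1) should
show it; with verbose, the parallel nodes get per-worker detail lines in
addition to the aggregated totals:

explain (analyze, verbose, buffers, timing off, costs off)
  select count(*) from t1;
-- under Parallel Seq Scan you should see lines such as
--   Worker 0: actual rows=... loops=1
--     Buffers: shared hit=...
--   Worker 1: actual rows=... loops=1
--     Buffers: shared hit=...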

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Explain buffers wrong counter with parallel plans

From
Adrien Nayrat
Date:
On 04/29/2018 05:11 PM, Amit Kapila wrote:
> I think you can try 'verbose' option, it will give per-worker stats.

Thanks! I didn't know verbose gives per-worker stats!

--
Adrien




Re: Explain buffers wrong counter with parallel plans

From
Adrien Nayrat
Date:
On 04/29/2018 05:10 PM, Amit Kapila wrote:
> Yeah, it would have been convenient, if Gather and Finalize Aggregate
> displays that way as it would have been easier for users to
> understand.  However, as of now, the aggregated stats for parallel
> workers and leader are displayed at parallel/partial nodes as is
> displayed in the plan.  So according to me, what you are seeing is
> okay, it is only slightly tricky to understand it.
>
[...]
>
>> I played with git bisect and I found this commit :
>>
>> commit 01edb5c7fc3bcf6aea15f2b3be36189b52ad9d1a
>> Author: Tom Lane <tgl@sss.pgh.pa.us>
>> Date:   Fri Sep 1 17:38:54 2017 -0400
>>
> I think you were seeing different results before this commit because
> before that we were shutting down workers as soon as parallel workers
> are done and the buffer_usage stats were accumulated and were being
> used for upper nodes.  According to me behavior after the commit is
> consistent, for example, I think if you check the case of GatherMerge
> before this commit, you will still get the stats in the way it is
> after commit.

I understand.  Maybe this change should be mentioned in the release notes and/or
documentation?

Thanks,



Re: Explain buffers wrong counter with parallel plans

From
Andres Freund
Date:
Hi,

On 2018-04-29 20:40:43 +0530, Amit Kapila wrote:
> On Wed, Mar 28, 2018 at 12:07 AM, Adrien Nayrat
> >                                                          QUERY PLAN
> >
> >
----------------------------------------------------------------------------------------------------------------------------
> >  Aggregate  (cost=1692478.00..1692478.01 rows=1 width=8) (actual
> > time=34367.678..34367.678 rows=1 loops=1)
> >    Buffers: shared read=442478
> >    I/O Timings: read=7977.775
> >    ->  Seq Scan on t1  (cost=0.00..1442478.00 rows=100000000 width=0) (actual
> > time=8.672..19685.654 rows=100000000 loops=1)
> >          Buffers: shared read=442478
> >          I/O Timings: read=7977.775
> >  Planning time: 38.971 ms
> >  Execution time: 34368.097 ms
> > (8 rows)
> >
> >
> >
> > Gather and Finalize Aggregate should report shared read=442478.
> >
> 
> Yeah, it would have been convenient, if Gather and Finalize Aggregate
> displays that way as it would have been easier for users to
> understand.  However, as of now, the aggregated stats for parallel
> workers and leader are displayed at parallel/partial nodes as is
> displayed in the plan.  So according to me, what you are seeing is
> okay, it is only slightly tricky to understand it.
> 
> > I am also surprised for I/O Timings in parallel node:
> > I/O Timings: read=60706.314 but the query last 34s?
> >
> 
> This is the aggregated time of all the parallel workers, so you need
> to divide it by loops (which in your case is 3).
> 
> > I played with git bisect and I found this commit :
> >
> > commit 01edb5c7fc3bcf6aea15f2b3be36189b52ad9d1a
> > Author: Tom Lane <tgl@sss.pgh.pa.us>
> > Date:   Fri Sep 1 17:38:54 2017 -0400
> >
> 
> I think you were seeing different results before this commit because
> before that we were shutting down workers as soon as parallel workers
> are done and the buffer_usage stats were accumulated and were being
> used for upper nodes.  According to me behavior after the commit is
> consistent, for example, I think if you check the case of GatherMerge
> before this commit, you will still get the stats in the way it is
> after commit.

Robert, you added this as an open item. I don't think it's clear that
there's a bug here, and even less clear that it's something new for
v11. Am I understanding that right?  Should we close this open item?

Greetings,

Andres Freund


Re: Explain buffers wrong counter with parallel plans

From
Adrien Nayrat
Date:
On 05/01/2018 10:57 PM, Andres Freund wrote:
> Robert, you added this as an open item. I don't think it's clear that
> there's a bug here, and even less clear that it's something new for
> v11. Am I understanding that right?  Should we close this open item?

Hi,

FYI, this change is related to 01edb5c7fc, which is not specific to v11;
01edb5c7fc is also in v10.

Regards,

--
Adrien



Re: Explain buffers wrong counter with parallel plans

From
Robert Haas
Date:
On Tue, May 1, 2018 at 4:57 PM, Andres Freund <andres@anarazel.de> wrote:
> Robert, you added this as an open item. I don't think it's clear that
> there's a bug here, and even less clear that it's something new for
> v11. Am I understanding that right?  Should we close this open item?

Yeah, I added it in response to the original post, but it sounds like
this is just another case of somebody being confused by the fact that
EXPLAIN divides the row counts, startup time, total time, and number
of rows filtered by the loop count.  So it's not an open item.

That having been said, what I think we ought to do about the confusion
that regularly results from that logic is rip it out, as in the
attached patch.  This would affect not only the parallel query case
but also the inner sides of nested loops, but that doesn't bother me a
bit because I think dividing is unhelpful in both cases.  Note that
explain.depesz.com, for example, goes and re-multiplies the times by
the loop count to get a figure for how much time was spent in that
node; I regularly do the same thing by hand when I read such EXPLAIN
output.  Also, if you did think those cases should be handled
differently, you'd have to explain what should happen when there's a
nested loop that gets run by multiple workers, possibly different
numbers of times in each one.
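(For instance, re-multiplying the Parallel Seq Scan's actual time of ~27405 ms
from the first plan in this thread by loops=3 gives ~82 s of scan time summed
across the leader and the two workers.)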

One problem with this patch is that tools like explain.depesz.com (or
my brain) would need to know if they were getting output from 11+ or
10-, so it might be good to perturb this output in some other way so
as to avoid confusion, like by replacing "loops=" with some other
text.  I don't have a specific proposal for that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Explain buffers wrong counter with parallel plans

From
Adrien Nayrat
Date:
Hi,

On 05/02/2018 05:22 PM, Robert Haas wrote:
> On Tue, May 1, 2018 at 4:57 PM, Andres Freund <andres@anarazel.de> wrote:
>> Robert, you added this as an open item. I don't think it's clear that
>> there's a bug here, and even less clear that it's something new for
>> v11. Am I understanding that right?  Should we close this open item?
>
> Yeah, I added it in response to the original post, but it sounds like
> this is just another case of somebody being confused by the fact that
> EXPLAIN divides the row counts, startup time, total time, and number
> of rows filtered by the loop count.  So it's not an open item.

To be precise, I am not confused by these counters; they do not change between
9.6 and 10.

I am confused by the buffers counters, which are different between 9.6 and 10.

In 9.6, the Gather node reports the sum of buffers for the main process +
workers.  In 10, the Gather node only reports the buffers from the main process.


Thanks,


--
Adrien




Re: Explain buffers wrong counter with parallel plans

From
Robert Haas
Date:
On Wed, May 2, 2018 at 11:37 AM, Adrien Nayrat
<adrien.nayrat@anayrat.info> wrote:
> In 9.6 gather node reports sum of buffers for main process + workers. In 10,
> gather node only reports buffers from the main process.

Oh, really?  Well, that sounds like a bug.  Amit seems to think it's
expected behavior, but I don't know why it should be.  The commit
message makes it sound like it's just refactoring, but in fact it
seems to have made a significant behavior change that doesn't look
very desirable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Explain buffers wrong counter with parallel plans

From
Amit Kapila
Date:
 On Fri, May 4, 2018 at 10:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 2, 2018 at 11:37 AM, Adrien Nayrat
> <adrien.nayrat@anayrat.info> wrote:
>> In 9.6 gather node reports sum of buffers for main process + workers. In 10,
>> gather node only reports buffers from the main process.
>
> Oh, really?  Well, that sounds like a bug.  Amit seems to think it's
> expected behavior, but I don't know why it should be.
>

The reason why I think the current behavior is okay is that it was only
coincidental that they were displayed correctly.  We have not made any effort to
percolate it to the upper nodes.  For example, before that commit as well, it
was not being displayed for Gather Merge, or for Gather under some kind of node
like 'Limit' where we have to stop before reaching the end of the result.

>  The commit
> message makes it sound like it's just refactoring, but in fact it
> seems to have made a significant behavior change that doesn't look
> very desirable.
>

I think it is the below part of that commit which has made this difference,
basically the shutting down of the workers.

                if (readerdone)
                {
                        Assert(!tup);
..
                        if (gatherstate->nreaders == 0)
-                       {
-                               ExecShutdownGatherWorkers(gatherstate);
                                return NULL;
-                       }


The reason why we were getting different results with the above code is that,
while shutting down the workers, we gather the buffer usage of all workers into
'pgBufferUsage' via InstrAccumParallelQuery, and then the upper-level node, say
Gather in this case, gets a chance to use those stats via
ExecProcNodeInstr->InstrStopNode.  However, this won't be true in other cases
where we need to exit before reaching the end of the results, like the 'Limit'
node case: in such cases, after shutting down the workers via ExecShutdownNode,
we won't do InstrStopNode for the upper-level nodes.  I think if we want all the
stats collected by the workers to percolate to Gather and the nodes above it,
then we need to somehow ensure that we always shut down Gather/GatherMerge
before we do InstrStopNode for those nodes.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Explain buffers wrong counter with parallel plans

From
Robert Haas
Date:
On Sat, May 5, 2018 at 8:56 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> The reason why I think the current behavior is okay because it is
> coincidental that they were displayed correctly.  We have not made any
> effort to percolate it to upper nodes.  For ex., before that commit
> also, it was not being displayed for Gather Merge or Gather with some
> kind of node like 'Limit' where we have to stop before reaching the
> end of the result.

It's not entirely coincidental.  I had the intention to try to ensure
that the workers would be shut down before the Gather or Gather Merge,
and I think that various things in the code testify to that intention.
It sounds like I missed some cases, but now we're missing even more
cases.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Explain buffers wrong counter with parallel plans

From
Alvaro Herrera
Date:
On 2018-May-07, Robert Haas wrote:

> On Sat, May 5, 2018 at 8:56 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > The reason why I think the current behavior is okay because it is
> > coincidental that they were displayed correctly.  We have not made any
> > effort to percolate it to upper nodes.  For ex., before that commit
> > also, it was not being displayed for Gather Merge or Gather with some
> > kind of node like 'Limit' where we have to stop before reaching the
> > end of the result.
> 
> It's not entirely coincidental.  I had the intention to try to ensure
> that the workers would be shut down before the Gather or Gather Merge,
> and I think that various things in the code testify to that intention.
> It sounds like I missed some cases, but now we're missing even more
> cases.

So, apparently this is not a Postgres 11 open item, but rather a bug
that goes back to pg10.  However, maybe it would be worth fixing soon
anyway?  In particular, if we want to perturb the explain output as
suggested upthread, maybe *that* should be considered an open item?

Anyway, this has stalled for a month now.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Explain buffers wrong counter with parallel plans

From
Robert Haas
Date:
On Wed, Jun 6, 2018 at 12:08 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> So, apparently this is not a Postgres 11 open item, but rather a bug
> that goes back to pg10.  However, maybe it would be worth fixing soon
> anyway?  In particular, if we want to perturb the explain output as
> suggested upthread, maybe *that* should be considered an open item?
>
> Anyway, this has stalled for a month now.

Yeah.  I'm not willing to do anything here unilaterally.  There is
neither universal agreement that there is a problem here nor agreement
on a fix.  Unless that changes, I don't know what to do here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Explain buffers wrong counter with parallel plans

From
Amit Kapila
Date:
On Fri, Jun 8, 2018 at 11:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jun 6, 2018 at 12:08 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> So, apparently this is not a Postgres 11 open item, but rather a bug
>> that goes back to pg10.  However, maybe it would be worth fixing soon
>> anyway?  In particular, if we want to perturb the explain output as
>> suggested upthread, maybe *that* should be considered an open item?
>>
>> Anyway, this has stalled for a month now.
>
> Yeah.  I'm not willing to do anything here unilaterally.  There is
> neither universal agreement that there is a problem here nor agreement
> on a fix.
>

Right, I think we have the following options:
(a) Come up with a solution that allows percolating the buffer usage and
similar stats to upper nodes in all cases.
(b) Allow it to work for some of the cases, as it did earlier.

I think (b) can cause confusion and could lead to further questions about which
specific cases it will work for and which it won't.  If we think (a) is a
reasonable approach, then we can close this item with the conclusion that it is
a work item for the future; OTOH, if we think option (b) is the better way to
deal with it, then we can come up with a patch to do so.  My inclination is to
go with option (a), but I don't mind if the decision is to choose option (b).

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Explain buffers wrong counter with parallel plans

From
Alvaro Herrera
Date:
On 2018-Jun-08, Robert Haas wrote:

> On Wed, Jun 6, 2018 at 12:08 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > So, apparently this is not a Postgres 11 open item, but rather a bug
> > that goes back to pg10.  However, maybe it would be worth fixing soon
> > anyway?  In particular, if we want to perturb the explain output as
> > suggested upthread, maybe *that* should be considered an open item?
> >
> > Anyway, this has stalled for a month now.
> 
> Yeah.  I'm not willing to do anything here unilaterally.  There is
> neither universal agreement that there is a problem here nor agreement
> on a fix.  Unless that changes, I don't know what to do here.

After going over this whole thing once more, my opinion is that this
reporting is broken.  IMO if we want to display per-process counts, then
let's display them all, or none; if we're going to display aggregated
counts, then we should mark them as such.  Otherwise it's pretty much
impossible to read the output unless you know exactly what the code does
for each value.  (The other option would be to document what the code
does for each value, but that seems suboptimal -- that just requires
people to glue one more cheat sheet to their walls, which is not Pg
style.)

I'm CCing Tom here, as author of the patch that caused (most of) the
issue.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Explain buffers wrong counter with parallel plans

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> I'm CCing Tom here, as author of the patch that caused (most of) the
> issue.

Uh ... me?  I thought this was a parallel-query issue, which I've
pretty much not been involved in.

            regards, tom lane


Re: Explain buffers wrong counter with parallel plans

From
Alvaro Herrera
Date:
On 2018-Jun-29, Tom Lane wrote:

> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > I'm CCing Tom here, as author of the patch that caused (most of) the
> > issue.
> 
> Uh ... me?  I thought this was a parallel-query issue, which I've
> pretty much not been involved in.

Well, maybe it's a matter of opinion.  Amit K said a few messages back
that your 01edb5c7fc3b ("Improve division of labor between
execParallel.c and nodeGather[Merge].c.") had changed the way these
numbers are printed, but only now I realize that he then indicated that
a different code path was already behaving in that way.

I stand by my opinion that we should not give misleading/confusing info;
either let's show it all in the default output, or only do it in
VERBOSE, but if the latter then let's suppress the misleading numbers in
the default output.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Explain buffers wrong counter with parallel plans

From
Amit Kapila
Date:
On Sat, Jun 30, 2018 at 4:08 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> On 2018-Jun-29, Tom Lane wrote:
>
>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> > I'm CCing Tom here, as author of the patch that caused (most of) the
>> > issue.
>>
>> Uh ... me?  I thought this was a parallel-query issue, which I've
>> pretty much not been involved in.
>
> Well, maybe it's a matter of opinion.  Amit K said a few messages back
> that your 01edb5c7fc3b ("Improve division of labor between
> execParallel.c and nodeGather[Merge].c.") had changed the way these
> numbers are printed, but only now I realize that he then indicated that
> a different code path was already behaving in that way.
>
> I stand by my opinion that we should not give misleading/confusing info;
> either let's show it all in the default output, or only do it in
> VERBOSE, but if the latter then let's suppress the misleading numbers in
> the default output.
>

It is not clear to me what exactly your expectation is.  Can you be a bit more
specific?  AFAIU, the primary confusion for the OP is that the aggregated stats
like buffer_usage (Buffers: ..) should be displayed correctly at Gather/Gather
Merge and the nodes above them.  Currently, those are being displayed correctly
only at the Parallel (Seq Scan) nodes.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Explain buffers wrong counter with parallel plans

From
Robert Haas
Date:
On Fri, Jun 29, 2018 at 6:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> I'm CCing Tom here, as author of the patch that caused (most of) the
>> issue.
>
> Uh ... me?  I thought this was a parallel-query issue, which I've
> pretty much not been involved in.

Well, it was your commit that caused the behavior change, as the
original poster mentioned in his email.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Explain buffers wrong counter with parallel plans

From
Robert Haas
Date:
On Sun, Jun 10, 2018 at 1:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Right, I think we have following options:
> (a) Come up with a solution which allows percolating the buffer usage
> and or similar stats to upper nodes in all cases.
> (b) Allow it to work for some of the cases as it was earlier.
>
> I think (b) can cause confusion and could lead to further questions on
> which specific cases will it work and for which it won't work.  If we
> think (a) is a reasonable approach, then we can close this item with a
> conclusion as a work item for future and OTOH if we think option (b)
> is the better way to deal with it, then we can come up with a patch to
> do so.  My inclination is to go with option (a), but I don't mind if
> the decision is to choose option (b).

I think the core problem here is this hunk from gather_readnext:

         {
             Assert(!tup);
-            DestroyTupleQueueReader(reader);
             --gatherstate->nreaders;
             if (gatherstate->nreaders == 0)
-            {
-                ExecShutdownGatherWorkers(gatherstate);
                 return NULL;
-            }
             memmove(&gatherstate->reader[gatherstate->nextreader],
                     &gatherstate->reader[gatherstate->nextreader + 1],
                     sizeof(TupleQueueReader *)

Since ExecShutdownGatherWorkers() is no longer called there, the
instrumentation data isn't accumulated into the Gather node when the
workers are shut down.  I think that's a bug and we should fix it.

To fix the problem with Limit that you mention, we could just modify
nodeLimit.c so that when the state is changed from LIMIT_INWINDOW to
LIMIT_WINDOWEND, we also call ExecShutdownNode on the child plan.

We can fix other cases as we find them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Explain buffers wrong counter with parallel plans

From
Amit Kapila
Date:
On Mon, Jul 2, 2018 at 6:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> I think the core problem here is this hunk from gather_readnext:
>
>          {
>              Assert(!tup);
> -            DestroyTupleQueueReader(reader);
>              --gatherstate->nreaders;
>              if (gatherstate->nreaders == 0)
> -            {
> -                ExecShutdownGatherWorkers(gatherstate);
>                  return NULL;
> -            }
>              memmove(&gatherstate->reader[gatherstate->nextreader],
>                      &gatherstate->reader[gatherstate->nextreader + 1],
>                      sizeof(TupleQueueReader *)
>
> Since ExecShutdownGatherWorkers() is no longer called there, the
> instrumentation data isn't accumulated into the Gather node when the
> workers are shut down.  I think that's a bug and we should fix it.
>

Yeah, previously, I have also pointed out the same code [1].  However,
I have not done any testing to prove it.

> To fix the problem with Limit that you mention, we could just modify
> nodeLimit.c so that when the state is changed from LIMIT_INWINDOW to
> LIMIT_WINDOWEND, we also call ExecShutdownNode on the child plan.
>

It should work.

> We can fix other cases as we find them.
>

I think we have a similar problem with GatherMerge, but that also
appears to be fixable.

Are you planning to work on it?  If not, then I can look into it.


[1] - https://www.postgresql.org/message-id/CAA4eK1KZEbYKj9HHP-6WqqjAXuoB%2BWJu-w1s9uovj%3DeeBxC48Q%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Explain buffers wrong counter with parallel plans

From
Robert Haas
Date:
On Tue, Jul 3, 2018 at 6:48 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yeah, previously, I have also pointed out the same code [1].  However,
> I have not done any testing to prove it.

Ah, OK.  Sorry, forgot about that email.

> Are you planning to work on it?  If not, then I can look into it.

I won't have time for a few weeks, so if you have time sooner please have at it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Explain buffers wrong counter with parallel plans

From
Amit Kapila
Date:
On Tue, Jul 3, 2018 at 4:18 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Jul 2, 2018 at 6:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>
>> To fix the problem with Limit that you mention, we could just modify
>> nodeLimit.c so that when the state is changed from LIMIT_INWINDOW to
>> LIMIT_WINDOWEND, we also call ExecShutdownNode on the child plan.
>>
>
> It should work.
>

I have tried this idea, but it doesn't completely solve the problem.  The
problem is that the nodes below the Limit won't get a chance to accumulate the
stats, as they won't be able to call InstrStopNode.  So the result will be
something like below:

postgres=# explain (analyze,buffers,timing off,costs off) select *
from t1 limit 50000;
                           QUERY PLAN
-----------------------------------------------------------------
 Limit (actual rows=50000 loops=1)
   Buffers: shared hit=6 read=224
   ->  Gather (actual rows=50000 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=1 read=63
         ->  Parallel Seq Scan on t1 (actual rows=17213 loops=3)
               Buffers: shared hit=6 read=224
 Planning Time: 0.105 ms
 Execution Time: 1363068.675 ms
(10 rows)

In this plan, you can notice that the stats (Buffers:) at Parallel Seq Scan and
Limit are the same, but the Gather node shows different stats.  One idea could
be that in ExecShutdownNode we somehow allow the nodes to count the stats, maybe
by calling InstrStartNode/InstrStopNode, but I am not sure that is the best way
to fix it.

Thoughts?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Explain buffers wrong counter with parallel plans

From
Robert Haas
Date:
On Fri, Jul 6, 2018 at 9:44 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have tried this idea, but it doesn't completely solve the problem.
> The problem is that nodes below LIMIT won't get a chance to accumulate
> the stats as they won't be able to call InstrStopNode.

I'm not sure I understand.  Why not?  I see that we'd need to insert
an extra call to InstrStopNode() if we were stopping the node while it
was running, because then InstrStartNode() would have already been
done, but the corresponding call to InstrStopNode() would not have
been done.  But I'm not sure how that would happen in this case.  Can
you explain further?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Explain buffers wrong counter with parallel plans

From
Amit Kapila
Date:
On Sat, Jul 7, 2018 at 12:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jul 6, 2018 at 9:44 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I have tried this idea, but it doesn't completely solve the problem.
>> The problem is that nodes below LIMIT won't get a chance to accumulate
>> the stats as they won't be able to call InstrStopNode.
>
> I'm not sure I understand.  Why not?  I see that we'd need to insert
> an extra call to InstrStopNode() if we were stopping the node while it
> was running, because then InstrStartNode() would have already been
> done, but the corresponding call to InstrStopNode() would not have
> been done.  But I'm not sure how that would happen in this case.  Can
> you explain further?
>

Okay, let me try.  The code flow is that for each node we call
InstrStartNode() -> ExecProcNodeReal() -> InstrStopNode().  Now let's say we
have to execute the plan Limit -> Gather -> Parallel Seq Scan.  Here, first, for
the Limit node, we call InstrStartNode() and ExecProcNodeReal(), and then for
Gather we call InstrStartNode(), ExecProcNodeReal() and InstrStopNode().  Now,
the Limit node decides that it needs to shut down all the nodes
(ExecShutdownNode), and only after that does it call InstrStopNode() for the
Limit node.  So, in this flow, after shutting down the nodes, the Gather node
never gets a chance to use the stats collected during ExecShutdownNode.
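
In other words, roughly:

  InstrStartNode(Limit)
    ExecProcNodeReal(Limit)
      InstrStartNode(Gather) -> ExecProcNodeReal(Gather) -> InstrStopNode(Gather)
      ... Limit decides it is done ...
      ExecShutdownNode()        <- worker buffer usage accumulated here
  InstrStopNode(Limit)          <- only the Limit node picks up those totals

which matches the earlier plan, where Limit and Parallel Seq Scan agree but
Gather does not.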

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Explain buffers wrong counter with parallel plans

From
Amit Kapila
Date:
On Sat, Jul 7, 2018 at 7:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, Jul 7, 2018 at 12:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Jul 6, 2018 at 9:44 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> I have tried this idea, but it doesn't completely solve the problem.
>>> The problem is that nodes below LIMIT won't get a chance to accumulate
>>> the stats as they won't be able to call InstrStopNode.
>>
>> I'm not sure I understand.  Why not?  I see that we'd need to insert
>> an extra call to InstrStopNode() if we were stopping the node while it
>> was running, because then InstrStartNode() would have already been
>> done, but the corresponding call to InstrStopNode() would not have
>> been done.  But I'm not sure how that would happen in this case.  Can
>> you explain further?
>>
>
> Okay, let me try.   The code flow is that for each node we will call
> InstrStartNode()->ExecProcNodeReal()->InstrStopNode().  Now let's say
> we have to execute a plan Limit->Gather-> Parallel SeqScan.  In this,
> first for Limit node, we will call InstrStartNode() and
> ExecProcNodeReal() and then for Gather we will call InstrStartNode(),
> ExecProcNodeReal() and InstrStopNode().  Now, Limit node decides that
> it needs to shutdown all the nodes (ExecShutdownNode) and after that
> it will call InstrStopNode() for Limit node.  So, in this flow after
> shutting down nodes, we never get chance for Gather node to use stats
> collected during ExecShutdownNode.
>

I went ahead and tried the solution I had mentioned yesterday, that is, to
allow ExecShutdownNode to count stats.  Apart from fixing this problem, it will
also fix the problem with Gather Merge reported by Adrien [1], because now
Gather Merge will also get a chance to count stats after shutting down the
workers.

Note that I have changed the location of InstrStartParallelQuery in
ParallelQueryMain so that the buffer usage stats are accumulated only for the
plan execution, which is what we do for the instrumentation information as well.
If we don't do that, it will count some additional stats for ExecutorStart which
won't match what we have in the Instrumentation structure of each node.

[1] - https://www.postgresql.org/message-id/01952aab-33ca-36cd-e74b-ce32f3eefc84%40anayrat.info

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Explain buffers wrong counter with parallel plans

From
"Jonathan S. Katz"
Date:

On Jul 7, 2018, at 12:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Jul 7, 2018 at 7:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Jul 7, 2018 at 12:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jul 6, 2018 at 9:44 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I have tried this idea, but it doesn't completely solve the problem.
The problem is that nodes below LIMIT won't get a chance to accumulate
the stats as they won't be able to call InstrStopNode.

I'm not sure I understand.  Why not?  I see that we'd need to insert
an extra call to InstrStopNode() if we were stopping the node while it
was running, because then InstrStartNode() would have already been
done, but the corresponding call to InstrStopNode() would not have
been done.  But I'm not sure how that would happen in this case.  Can
you explain further?


Okay, let me try.   The code flow is that for each node we will call
InstrStartNode()->ExecProcNodeReal()->InstrStopNode().  Now let's say
we have to execute a plan Limit->Gather-> Parallel SeqScan.  In this,
first for Limit node, we will call InstrStartNode() and
ExecProcNodeReal() and then for Gather we will call InstrStartNode(),
ExecProcNodeReal() and InstrStopNode().  Now, Limit node decides that
it needs to shutdown all the nodes (ExecShutdownNode) and after that
it will call InstrStopNode() for Limit node.  So, in this flow after
shutting down nodes, we never get chance for Gather node to use stats
collected during ExecShutdownNode.


I went ahead and tried the solution which I had mentioned yesterday,
that is to allow ExecShutdownNode to count stats.  Apart from fixing
this problem, it will also fix the problem with Gather Merge as
reported by Adrien [1], because now Gather Merge will also get a
chance to count stats after shutting down workers.

Note that, I have changed the location of InstrStartParallelQuery in
ParallelQueryMain so that the buffer usage stats are accumulated only
for the plan execution which is what we do for instrumentation
information as well.  If we don't do that, it will count some
additional stats for ExecutorStart which won't match with what we have
in Instrumentation structure of each node.

I tried running the below on both 11beta2 and HEAD with the patch
applied:

CREATE UNLOGGED TABLE t1 (c1 int);
INSERT INTO t1 SELECT generate_series(1,100000000);
/** restart PostgreSQL */
EXPLAIN (analyze,buffers,timing off,costs off)
SELECT count(*) FROM t1;
/** repeat above two queries */

I have identical postgresql.conf files on both instances as well.

11beta2
======
    buffers=# explain (analyze,buffers,timing off,costs off) select count(*)
    from t1;
                                    QUERY PLAN                                
    --------------------------------------------------------------------------
     Finalize Aggregate (actual rows=1 loops=1)
       Buffers: shared read=63175
       ->  Gather (actual rows=7 loops=1)
             Workers Planned: 6
             Workers Launched: 6
             Buffers: shared read=63175
             ->  Partial Aggregate (actual rows=1 loops=7)
                   Buffers: shared read=442478
                   ->  Parallel Seq Scan on t1 (actual rows=14285714 loops=7)
                         Buffers: shared read=442478
     Planning Time: 1.422 ms
     Execution Time: 3214.407 ms
    (12 rows)

    buffers=# explain (analyze,buffers,timing off,costs off) select count(*)
    from t1;
                                    QUERY PLAN                                
    --------------------------------------------------------------------------
     Finalize Aggregate (actual rows=1 loops=1)
       Buffers: shared hit=27 read=64960
       ->  Gather (actual rows=7 loops=1)
             Workers Planned: 6
             Workers Launched: 6
             Buffers: shared hit=27 read=64960
             ->  Partial Aggregate (actual rows=1 loops=7)
                   Buffers: shared hit=224 read=442254
                   ->  Parallel Seq Scan on t1 (actual rows=14285714 loops=7)
                         Buffers: shared hit=224 read=442254
     Planning Time: 0.080 ms
     Execution Time: 3774.091 ms
    (12 rows)

HEAD
=====
    buffers=# explain (analyze,buffers,timing off,costs off) select count(*)
    from t1;
                                    QUERY PLAN                                
    --------------------------------------------------------------------------
     Finalize Aggregate (actual rows=1 loops=1)
       Buffers: shared read=442478
       ->  Gather (actual rows=7 loops=1)
             Workers Planned: 6
             Workers Launched: 6
             Buffers: shared read=442478
             ->  Partial Aggregate (actual rows=1 loops=7)
                   Buffers: shared read=442478
                   ->  Parallel Seq Scan on t1 (actual rows=14285714 loops=7)
                         Buffers: shared read=442478
     Planning Time: 1.594 ms
     Execution Time: 6207.799 ms
    (12 rows)

    buffers=# explain (analyze,buffers,timing off,costs off) select count(*)
    from t1;
                                    QUERY PLAN                                
    --------------------------------------------------------------------------
     Finalize Aggregate (actual rows=1 loops=1)
       Buffers: shared hit=224 read=442254
       ->  Gather (actual rows=7 loops=1)
             Workers Planned: 6
             Workers Launched: 6
             Buffers: shared hit=224 read=442254
             ->  Partial Aggregate (actual rows=1 loops=7)
                   Buffers: shared hit=224 read=442254
                   ->  Parallel Seq Scan on t1 (actual rows=14285714 loops=7)
                         Buffers: shared hit=224 read=442254
     Planning Time: 0.074 ms
     Execution Time: 5006.395 ms
    (12 rows)

Notice that the “read” numbers just below “Finalize Aggregate” are much higher,
and there appears to be a performance hit.

Jonathan

Re: Explain buffers wrong counter with parallel plans

From
Amit Kapila
Date:
On Wed, Jul 25, 2018 at 7:42 PM, Jonathan S. Katz
<jonathan.katz@excoventures.com> wrote:
>
> On Jul 7, 2018, at 12:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I tried running the below on both 11beta2 and HEAD with the patch
> applied:
>
> CREATE UNLOGGED TABLE t1 (c1 int);
> INSERT INTO t1 SELECT generate_series(1,100000000);
> /** restart PostgreSQL */
> EXPLAIN (analyze,buffers,timing off,costs off)
> SELECT count(*) FROM t1;
> /** repeat above two queries */
>
> I have identical postgresql.conf files on both instances as well.
>
> 11beta2
> ======
>     buffers=# explain (analyze,buffers,timing off,costs off) select count(*)
>     from t1;
>                                     QUERY PLAN
>
> --------------------------------------------------------------------------
>      Finalize Aggregate (actual rows=1 loops=1)
>        Buffers: shared read=63175
>        ->  Gather (actual rows=7 loops=1)
>              Workers Planned: 6
>              Workers Launched: 6
>              Buffers: shared read=63175
>              ->  Partial Aggregate (actual rows=1 loops=7)
>                    Buffers: shared read=442478
>                    ->  Parallel Seq Scan on t1 (actual rows=14285714
> loops=7)
>                          Buffers: shared read=442478
>      Planning Time: 1.422 ms
>      Execution Time: 3214.407 ms
>     (12 rows)
>
>     buffers=# explain (analyze,buffers,timing off,costs off) select count(*)
>     from t1;
>                                     QUERY PLAN
>
> --------------------------------------------------------------------------
>      Finalize Aggregate (actual rows=1 loops=1)
>        Buffers: shared hit=27 read=64960
>        ->  Gather (actual rows=7 loops=1)
>              Workers Planned: 6
>              Workers Launched: 6
>              Buffers: shared hit=27 read=64960
>              ->  Partial Aggregate (actual rows=1 loops=7)
>                    Buffers: shared hit=224 read=442254
>                    ->  Parallel Seq Scan on t1 (actual rows=14285714
> loops=7)
>                          Buffers: shared hit=224 read=442254
>      Planning Time: 0.080 ms
>      Execution Time: 3774.091 ms
>     (12 rows)
>
> HEAD
> =====
>     buffers=# explain (analyze,buffers,timing off,costs off) select count(*)
>     from t1;
>                                     QUERY PLAN
>
> --------------------------------------------------------------------------
>      Finalize Aggregate (actual rows=1 loops=1)
>        Buffers: shared read=442478
>        ->  Gather (actual rows=7 loops=1)
>              Workers Planned: 6
>              Workers Launched: 6
>              Buffers: shared read=442478
>              ->  Partial Aggregate (actual rows=1 loops=7)
>                    Buffers: shared read=442478
>                    ->  Parallel Seq Scan on t1 (actual rows=14285714
> loops=7)
>                          Buffers: shared read=442478
>      Planning Time: 1.594 ms
>      Execution Time: 6207.799 ms
>     (12 rows)
>
>     buffers=# explain (analyze,buffers,timing off,costs off) select count(*)
>     from t1;
>                                     QUERY PLAN
>
> --------------------------------------------------------------------------
>      Finalize Aggregate (actual rows=1 loops=1)
>        Buffers: shared hit=224 read=442254
>        ->  Gather (actual rows=7 loops=1)
>              Workers Planned: 6
>              Workers Launched: 6
>              Buffers: shared hit=224 read=442254
>              ->  Partial Aggregate (actual rows=1 loops=7)
>                    Buffers: shared hit=224 read=442254
>                    ->  Parallel Seq Scan on t1 (actual rows=14285714
> loops=7)
>                          Buffers: shared hit=224 read=442254
>      Planning Time: 0.074 ms
>      Execution Time: 5006.395 ms
>     (12 rows)
>
> Notice the “read” numbers just before the “Finalize Aggregate” is much
> higher,
>

You mean the number (Buffers: shared read=442478) in HEAD, right?  If so, yeah,
I am also wondering why the results with the patch are different on HEAD and
11beta2.  So, if I read correctly, the numbers on 11beta2 appear correct, but on
HEAD they are not; it should have divided the buffers read by loops.  I will
figure that out, but this is just to clarify that both of us are seeing the
results the same way.

>and there appears to be a performance hit.
>

Do you mean the performance of the same query differs between 11beta2 and HEAD,
or something else?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Explain buffers wrong counter with parallel plans

From
"Jonathan S. Katz"
Date:
> On Jul 25, 2018, at 10:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jul 25, 2018 at 7:42 PM, Jonathan S. Katz
> <jonathan.katz@excoventures.com> wrote:
>>
>> On Jul 7, 2018, at 12:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> I tried running the below on both 11beta2 and HEAD with the patch
>> applied:
>>
>> CREATE UNLOGGED TABLE t1 (c1 int);
>> INSERT INTO t1 SELECT generate_series(1,100000000);
>> /** restart PostgreSQL */
>> EXPLAIN (analyze,buffers,timing off,costs off)
>> SELECT count(*) FROM t1;
>> /** repeat above two queries */
>>
>> I have identical postgresql.conf files on both instances as well.
>>
>> 11beta2
>> ======
>>    buffers=# explain (analyze,buffers,timing off,costs off) select count(*)
>>    from t1;
>>                                    QUERY PLAN
>>
>> --------------------------------------------------------------------------
>>     Finalize Aggregate (actual rows=1 loops=1)
>>       Buffers: shared read=63175
>>       ->  Gather (actual rows=7 loops=1)
>>             Workers Planned: 6
>>             Workers Launched: 6
>>             Buffers: shared read=63175
>>             ->  Partial Aggregate (actual rows=1 loops=7)
>>                   Buffers: shared read=442478
>>                   ->  Parallel Seq Scan on t1 (actual rows=14285714
>> loops=7)
>>                         Buffers: shared read=442478
>>     Planning Time: 1.422 ms
>>     Execution Time: 3214.407 ms
>>    (12 rows)
>>
>>    buffers=# explain (analyze,buffers,timing off,costs off) select count(*)
>>    from t1;
>>                                    QUERY PLAN
>>
>> --------------------------------------------------------------------------
>>     Finalize Aggregate (actual rows=1 loops=1)
>>       Buffers: shared hit=27 read=64960
>>       ->  Gather (actual rows=7 loops=1)
>>             Workers Planned: 6
>>             Workers Launched: 6
>>             Buffers: shared hit=27 read=64960
>>             ->  Partial Aggregate (actual rows=1 loops=7)
>>                   Buffers: shared hit=224 read=442254
>>                   ->  Parallel Seq Scan on t1 (actual rows=14285714
>> loops=7)
>>                         Buffers: shared hit=224 read=442254
>>     Planning Time: 0.080 ms
>>     Execution Time: 3774.091 ms
>>    (12 rows)
>>
>> HEAD
>> =====
>>    buffers=# explain (analyze,buffers,timing off,costs off) select count(*)
>>    from t1;
>>                                    QUERY PLAN
>>
>> --------------------------------------------------------------------------
>>     Finalize Aggregate (actual rows=1 loops=1)
>>       Buffers: shared read=442478
>>       ->  Gather (actual rows=7 loops=1)
>>             Workers Planned: 6
>>             Workers Launched: 6
>>             Buffers: shared read=442478
>>             ->  Partial Aggregate (actual rows=1 loops=7)
>>                   Buffers: shared read=442478
>>                   ->  Parallel Seq Scan on t1 (actual rows=14285714
>> loops=7)
>>                         Buffers: shared read=442478
>>     Planning Time: 1.594 ms
>>     Execution Time: 6207.799 ms
>>    (12 rows)
>>
>>    buffers=# explain (analyze,buffers,timing off,costs off) select count(*)
>>    from t1;
>>                                    QUERY PLAN
>>
>> --------------------------------------------------------------------------
>>     Finalize Aggregate (actual rows=1 loops=1)
>>       Buffers: shared hit=224 read=442254
>>       ->  Gather (actual rows=7 loops=1)
>>             Workers Planned: 6
>>             Workers Launched: 6
>>             Buffers: shared hit=224 read=442254
>>             ->  Partial Aggregate (actual rows=1 loops=7)
>>                   Buffers: shared hit=224 read=442254
>>                   ->  Parallel Seq Scan on t1 (actual rows=14285714
>> loops=7)
>>                         Buffers: shared hit=224 read=442254
>>     Planning Time: 0.074 ms
>>     Execution Time: 5006.395 ms
>>    (12 rows)
>>
>> Notice the “read” numbers just before the “Finalize Aggregate” is much
>> higher,
>>
>
> You mean to say the number (Buffers: shared read=442478) in HEAD,
> right?  If so, yeah, I am also wondering why the results of the patch
> are different in HEAD and 11beta2.  So, if I read correctly, the
> numbers in 11beta2 appears correct, but on HEAD it is not correct, it
> should have divided the buffers read by loops.  I will figure that
> out, but this is just to clarify that both of us are seeing the
> results in the same way.

I’ll try it again but patch it against 11beta2 and see what results I get.

>
>> and there appears to be a performance hit.
>>
>
> Do you mean to say the performance of the same query in 11beta2 and
> HEAD or something else?

Correct. But per your advice let me try running it against a patched
version of 11beta2 and see what happens.

Jonathan


Attachment

Re: Explain buffers wrong counter with parallel plans

From
Amit Kapila
Date:
On Thu, Jul 26, 2018 at 7:31 PM, Jonathan S. Katz
<jonathan.katz@excoventures.com> wrote:
>
>> On Jul 25, 2018, at 10:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>>
>> You mean to say the number (Buffers: shared read=442478) in HEAD,
>> right?  If so, yeah, I am also wondering why the results of the patch
>> are different in HEAD and 11beta2.  So, if I read correctly, the
>> numbers in 11beta2 appears correct, but on HEAD it is not correct, it
>> should have divided the buffers read by loops.

I want to correct myself here: the numbers on HEAD are correct, but
not on PG11beta2.  Is there any chance that you either forgot to apply
the patch or that it is not using the correct binaries in the case of
11beta2?

>>  I will figure that
>> out, but this is just to clarify that both of us are seeing the
>> results in the same way.
>
> I’ll try it again but patch it against 11beta2 and see what results I get.
>
>>
>>> and there appears to be a performance hit.
>>>
>>
>> Do you mean to say the performance of the same query in 11beta2 and
>> HEAD or something else?
>
> Correct. But per your advice let me try running it against a patched
> version of 11beta2 and see what happens.
>

Yeah, that would be better.  Today, I have tried the patch on both
HEAD and PG11 and I am getting the same, correct results.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Explain buffers wrong counter with parallel plans

From
"Jonathan S. Katz"
Date:
> On Jul 27, 2018, at 8:31 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 26, 2018 at 7:31 PM, Jonathan S. Katz
> <jonathan.katz@excoventures.com> wrote:
>>
>>> On Jul 25, 2018, at 10:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>
>>>
>>> You mean to say the number (Buffers: shared read=442478) in HEAD,
>>> right?  If so, yeah, I am also wondering why the results of the patch
>>> are different in HEAD and 11beta2.  So, if I read correctly, the
>>> numbers in 11beta2 appears correct, but on HEAD it is not correct, it
>>> should have divided the buffers read by loops.
>
> I want to correct myself here, the numbers on HEAD are correct, but
> not on PG11beta2.  Is there any chance that either you forgot to apply
> the patch or maybe it is not using correct binaries in case of
> 11beta2.

I need to correct myself too as I was unclear - that was an *unpatched*
11beta2, sorry for the confusion.

>>> I will figure that
>>> out, but this is just to clarify that both of us are seeing the
>>> results in the same way.
>>
>> I’ll try it again but patch it against 11beta2 and see what results I get.
>>
>>>
>>>> and there appears to be a performance hit.
>>>>
>>>
>>> Do you mean to say the performance of the same query in 11beta2 and
>>> HEAD or something else?
>>
>> Correct. But per your advice let me try running it against a patched
>> version of 11beta2 and see what happens.
>>
>
> Yeah, that would be better.  Today, I have tried the patch on both
> Head and PG11 and I am getting same and correct results.

I have applied the patch to PG11beta2 and tested. I received
similar numbers to the patched HEAD tests, e.g.:

    => EXPLAIN (analyze,buffers,timing off,costs off) SELECT count(*) FROM t1;
                                    QUERY PLAN
    --------------------------------------------------------------------------
     Finalize Aggregate (actual rows=1 loops=1)
       Buffers: shared hit=224 read=442254
       ->  Gather (actual rows=7 loops=1)
             Workers Planned: 6
             Workers Launched: 6
             Buffers: shared hit=224 read=442254
             ->  Partial Aggregate (actual rows=1 loops=7)
                   Buffers: shared hit=224 read=442254
                   ->  Parallel Seq Scan on t1 (actual rows=14285714 loops=7)
                         Buffers: shared hit=224 read=442254
     Planning Time: 0.104 ms
     Execution Time: 5308.140 ms


I ran the tests a few more times and, with an explanation from Andres,
I now understand why the execution times vary, so I am satisfied.

Thanks for working on this!

Jonathan

Attachment

Re: Explain buffers wrong counter with parallel plans

From
Amit Kapila
Date:
On Fri, Jul 27, 2018 at 11:12 PM, Jonathan S. Katz
<jonathan.katz@excoventures.com> wrote:
>
>> On Jul 27, 2018, at 8:31 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>>
>> Yeah, that would be better.  Today, I have tried the patch on both
>> Head and PG11 and I am getting same and correct results.
>
> I have applied the patch to PG11beta2 and tested.
>

I think we should back-patch this to 9.6, where parallel query was
introduced.  Note that for HEAD and 11, the patch is the same.  For
PG10, the patch code is the same, but because the surrounding code is
different, the same patch didn't apply.  For 9.6, we don't need to
collect stats in ExecShutdownNode.  I have tested it in all the back
branches and it works fine.

Robert,

We have done verification that the approach works and fixes the bug in
all known cases.  Do you see any problem with this approach?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Explain buffers wrong counter with parallel plans

From
"Jonathan S. Katz"
Date:
> On Jul 28, 2018, at 2:14 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jul 27, 2018 at 11:12 PM, Jonathan S. Katz
> <jonathan.katz@excoventures.com> wrote:
>>
>>> On Jul 27, 2018, at 8:31 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>
>>>
>>> Yeah, that would be better.  Today, I have tried the patch on both
>>> Head and PG11 and I am getting same and correct results.
>>
>> I have applied the patch to PG11beta2 and tested.
>>
>
> I think we should backpatch this till 9.6 where the parallel query was
> introduced.  Note, that for HEAD and 11, the patch is same.  For PG10,
> the patch code is same, but because surrounding code is different, the
> same patch didn't apply.  For 9.6, we don't need to collect stats in
> ExecShutdownNode.   I have tested it in all the back branches and it
> works fine.

The logic on backpatching seems sound. I confirmed my tests of the respective
patches against 9.6.9 and 10.4. I'll defer to someone else for comments on the
code, but from my read it appears to be a consistent approach for each version.

Jonathan


Attachment

Re: Explain buffers wrong counter with parallel plans

From
Andres Freund
Date:
Hi,

I'm not an expert in the area of the code, but here's a review anyway. I
did not read through the entire thread.


I think we should try to get this fixed soon, to make some progress
towards release-ability. Or just declare it to be entirely unrelated to
the release, and remove it from the open items list; not an unreasonable
choice either. This has been an open item for quite a while...


> diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
> index 60aaa822b7e..ac22bedf0e2 100644
> --- a/src/backend/executor/execParallel.c
> +++ b/src/backend/executor/execParallel.c
> @@ -979,9 +979,6 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
>      /* Report workers' query for monitoring purposes */
>      pgstat_report_activity(STATE_RUNNING, debug_query_string);
>  
> -    /* Prepare to track buffer usage during query execution. */
> -    InstrStartParallelQuery();
> -
>      /* Attach to the dynamic shared memory area. */
>      area_space = shm_toc_lookup(toc, PARALLEL_KEY_DSA, false);
>      area = dsa_attach_in_place(area_space, seg);
> @@ -993,6 +990,9 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
>      queryDesc->planstate->state->es_query_dsa = area;
>      ExecParallelInitializeWorker(queryDesc->planstate, toc);
>  
> +    /* Prepare to track buffer usage during query execution. */
> +    InstrStartParallelQuery();
> +
>      /* Run the plan */
>      ExecutorRun(queryDesc, ForwardScanDirection, 0L, true);

Isn't this an independent change?  And one with potentially negative
side effects? I think there are some arguments for changing this (and some
against), but I think it's a bad idea to do so in the same patch.


> diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
> index 36d2914249c..a0d49ec0fba 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -737,6 +737,13 @@ ExecShutdownNode(PlanState *node)
>  
>      planstate_tree_walker(node, ExecShutdownNode, NULL);
>  
> +    /*
> +     * Allow instrumentation to count stats collected during shutdown for
> +     * nodes that are executed atleast once.
> +     */
> +    if (node->instrument && node->instrument->running)
> +        InstrStartNode(node->instrument);
> +
>      switch (nodeTag(node))
>      {
>          case T_GatherState:
> @@ -755,5 +762,12 @@ ExecShutdownNode(PlanState *node)
>              break;
>      }
>  
> +    /*
> +     * Allow instrumentation to count stats collected during shutdown for
> +     * nodes that are executed atleast once.
> +     */
> +    if (node->instrument && node->instrument->running)
> +        InstrStopNode(node->instrument, 0);
> +
>      return false;
>  }

Duplicating the comment doesn't seem like a good idea. Just say
something like "/* see comment for InstrStartNode above */".


> diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
> index ac5a2ff0e60..cf1851e235f 100644
> --- a/src/backend/executor/nodeLimit.c
> +++ b/src/backend/executor/nodeLimit.c
> @@ -134,6 +134,8 @@ ExecLimit(PlanState *pstate)
>                      node->position - node->offset >= node->count)
>                  {
>                      node->lstate = LIMIT_WINDOWEND;
> +                    /* Allow nodes to release or shut down resources. */
> +                    (void) ExecShutdownNode(outerPlan);
>                      return NULL;
>                  }
>  

Um, is this actually correct?  What if somebody rewinds afterwards?
That won't happen for parallel query currently, but architecturally I
don't think we can do so unconditionally?


Greetings,

Andres Freund


Re: Explain buffers wrong counter with parallel plans

From
Amit Kapila
Date:
On Tue, Jul 31, 2018 at 12:58 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> I'm not an expert in the area of the code, but here's a review anyway. I
> did not read through the entire thread.
>
>
> I think we should try to get this fixed soon, to make some progress
> towards release-ability. Or just declare it to be entirely unrelated to
> the release, and remove it from the open items list; not an unreasonable
> choice either. This has been an open item for quite a while...
>

I am okay either way, but I can work on getting it committed for this release.

>
>> diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
>> index 60aaa822b7e..ac22bedf0e2 100644
>> --- a/src/backend/executor/execParallel.c
>> +++ b/src/backend/executor/execParallel.c
>> @@ -979,9 +979,6 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
>>       /* Report workers' query for monitoring purposes */
>>       pgstat_report_activity(STATE_RUNNING, debug_query_string);
>>
>> -     /* Prepare to track buffer usage during query execution. */
>> -     InstrStartParallelQuery();
>> -
>>       /* Attach to the dynamic shared memory area. */
>>       area_space = shm_toc_lookup(toc, PARALLEL_KEY_DSA, false);
>>       area = dsa_attach_in_place(area_space, seg);
>> @@ -993,6 +990,9 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
>>       queryDesc->planstate->state->es_query_dsa = area;
>>       ExecParallelInitializeWorker(queryDesc->planstate, toc);
>>
>> +     /* Prepare to track buffer usage during query execution. */
>> +     InstrStartParallelQuery();
>> +
>>       /* Run the plan */
>>       ExecutorRun(queryDesc, ForwardScanDirection, 0L, true);
>
> Isn't this an independent change?  And one with potentially negative
> side effects? I think there's some arguments for changing this (and some
> against),
>

This is to ensure that buffer usage stats are tracked consistently in
the master and worker backends.  In the master backend, we don't track
similar stats for the ExecutorStart phase.  Also, it helps us report
the same buffer usage stats for the parallel and non-parallel versions
of the query.
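
To make the intended symmetry concrete, here is a rough sketch of the
worker-side ordering after the move (the InstrEndParallelQuery and
buffer_usage lines are not part of the quoted diff and are only my
assumption of how the per-worker totals reach the leader):

    /* Worker-side executor startup is not counted, matching the leader. */
    ExecParallelInitializeWorker(queryDesc->planstate, toc);

    /* Prepare to track buffer usage during query execution. */
    InstrStartParallelQuery();

    /* Run the plan and finish execution. */
    ExecutorRun(queryDesc, ForwardScanDirection, 0L, true);
    ExecutorFinish(queryDesc);

    /*
     * Hand this worker's buffer usage to the leader via shared memory;
     * buffer_usage comes from the shm_toc (lookup omitted here).
     */
    InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber]);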

> but I think it's a bad idea to do so in the same patch.
>

Agreed, we can do this part separately.

>
>> diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
>> index 36d2914249c..a0d49ec0fba 100644
>> --- a/src/backend/executor/execProcnode.c
>> +++ b/src/backend/executor/execProcnode.c
>> @@ -737,6 +737,13 @@ ExecShutdownNode(PlanState *node)
>>
>>       planstate_tree_walker(node, ExecShutdownNode, NULL);
>>
>> +     /*
>> +      * Allow instrumentation to count stats collected during shutdown for
>> +      * nodes that are executed atleast once.
>> +      */
>> +     if (node->instrument && node->instrument->running)
>> +             InstrStartNode(node->instrument);
>> +
>>       switch (nodeTag(node))
>>       {
>>               case T_GatherState:
>> @@ -755,5 +762,12 @@ ExecShutdownNode(PlanState *node)
>>                       break;
>>       }
>>
>> +     /*
>> +      * Allow instrumentation to count stats collected during shutdown for
>> +      * nodes that are executed atleast once.
>> +      */
>> +     if (node->instrument && node->instrument->running)
>> +             InstrStopNode(node->instrument, 0);
>> +
>>       return false;
>>  }
>
> Duplicating the comment doesn't seem like a good idea. Just say
> something like "/* see comment for InstrStartNode above */".
>

Okay, will change in the next version.

>
>> diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
>> index ac5a2ff0e60..cf1851e235f 100644
>> --- a/src/backend/executor/nodeLimit.c
>> +++ b/src/backend/executor/nodeLimit.c
>> @@ -134,6 +134,8 @@ ExecLimit(PlanState *pstate)
>>                                       node->position - node->offset >= node->count)
>>                               {
>>                                       node->lstate = LIMIT_WINDOWEND;
>> +                                     /* Allow nodes to release or shut down resources. */
>> +                                     (void) ExecShutdownNode(outerPlan);
>>                                       return NULL;
>>                               }
>>
>
> Um, is this actually correct?  What if somebody rewinds afterwards?
> That won't happen for parallel query currently, but architecturally I
> don't think we can do so unconditionally?
>

Currently, this is done only for forward scans (basically under
ScanDirectionIsForward).  Is that enough, and if not, what kind of
additional check do you think we need here?  I am not sure at this
stage how we will design it for backward scans, but I think we already
do this in other places in the code; see the code below from ExecutePlan.

ExecutePlan()
{
    ..
    if (numberTuples && numberTuples == current_tuple_count)
    {
        /* Allow nodes to release or shut down resources. */
        (void) ExecShutdownNode(planstate);
        break;
    }
    ..
}

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Explain buffers wrong counter with parallel plans

From
Robert Haas
Date:
On Sat, Jul 28, 2018 at 2:14 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> We have done verification that the approach works and fixes the bug in
> all known cases.  Do you see any problem with this approach?

Regarding the change to execParallel.c, I think if you're going to
move that code you should add a comment explaining the justification
for the new positioning, like:

/*
 * Prepare to track buffer usage during query execution.
 *
 * We do this after starting up the executor to match what happens in the
 * leader, which also doesn't count buffer accesses that occur during
 * executor startup.
 */

If you just make a change and move the code without adding a comment
explaining why you moved it, the next person will wonder why they
shouldn't just move it back when they hit some other problem.  I also
agree with Andres that this could be committed separately.

Regarding the change to execProcnode.c, "atleast" should be "at
least".  But more broadly, I don't think the comment is very clear
about the justification for what we're doing, and Andres is right that
duplicating the comment isn't helpful.  I've attempted to improve this
in the attached version.

Regarding Andres's comments about the changes to nodeLimit.c, I think
he's right, but as you say, this isn't the first place to have this
problem.  The issue is that someone might be using cursor operations
to fetch forward through the plan one tuple at a time, and then after
triggering the shutdown, they might start fetching backward.  That
doesn't matter so long as shutdown callbacks are only used for
shutting down stuff related to parallel query, because you can't fetch
backwards from a Gather node or anything under it.  But if we want to
make broader use of shutdown callbacks -- and I think that would be a
good idea -- then it's a problem.  There's already a latent bug here,
because custom scans and foreign scans are allowed to have shutdown
callbacks (but postgres_fdw and file_fdw don't).

Fixing that seems like a separate issue, and I think it would probably
be OK to proceed with the change as you have it for now, but we ought
to do something about it.  I'm not sure there's a problem with
rewinding, as Andres suggests: if the node is entirely rescanned, I
believe it's allowed to regenerate the output, and Gather (Merge) can
do that by firing up a new set of workers.  But scanning backwards is
a problem.  I'm not exactly sure what the best way of handling that
is, but one thing I think might work is to save ExecutePlan's
execute_once flag in the EState and then make the call in nodeLimit.c
and the one in ExecutePlan itself conditional on that flag.  If we
know that the plan is only going to be executed once, then there can
never be any backward fetches and it's fine to shut down as soon as we
finish going forward.
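
A purely hypothetical sketch of that idea (es_execute_once is an
invented field name, not something EState currently has):

    /* In standard_ExecutorRun: remember whether the caller promised
     * single execution of this plan. */
    estate->es_execute_once = execute_once;

    /* In ExecLimit (and likewise in ExecutePlan): shut down early only
     * when no rescan or backward fetch can follow. */
    if (estate->es_execute_once)
        (void) ExecShutdownNode(outerPlan);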

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Explain buffers wrong counter with parallel plans

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> ... But scanning backwards is
> a problem.  I'm not exactly sure what the best way of handling that
> is, but one thing I think might work is to save ExecutePlan's
> execute_once flag in the EState and then make the call in nodeLimit.c
> and the one in ExecutePlan itself conditional on that flag.  If we
> know that the plan is only going to be executed once, then there can
> never be any backward fetches and it's fine to shut down as soon as we
> finish going forward.

Shouldn't this be dealt with by an eflag bit passed down at executor
init time?  What you're describing sounds a lot like somebody
invented a different way because they were unfamiliar with eflags.

            regards, tom lane


Re: Explain buffers wrong counter with parallel plans

From
Robert Haas
Date:
On Wed, Aug 1, 2018 at 10:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> ... But scanning backwards is
>> a problem.  I'm not exactly sure what the best way of handling that
>> is, but one thing I think might work is to save ExecutePlan's
>> execute_once flag in the EState and then make the call in nodeLimit.c
>> and the one in ExecutePlan itself conditional on that flag.  If we
>> know that the plan is only going to be executed once, then there can
>> never be any backward fetches and it's fine to shut down as soon as we
>> finish going forward.
>
> Shouldn't this be dealt with by a eflag bit passed down at executor
> init time?  What you're describing sounds a lot like somebody
> invented a different way because they were unfamiliar with eflags.

Uh, yeah, maybe.  I'm a bit unfamiliar with eflags.  :-)

Looking at it, this sounds suspiciously like something where we could
just test EXEC_FLAG_BACKWARD, based on the comments for that flag, but
I wouldn't be willing to bet very much on me being right about that.
Can somebody try to fetch backward even if this flag wasn't provided?
If so, we need to make sure that shutting down a node doesn't leave it
in a state where a fetch attempt will crash.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Explain buffers wrong counter with parallel plans

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Looking at it, this sounds suspiciously something where we could just
> test EXEC_FLAG_BACKWARD, based on the comments for that flag, but I
> wouldn't be willing to bet very much on me being right about that.
> Can somebody try to fetch backward even if this flag wasn't provided?

They're not supposed to, and it would be a bug of theirs not yours
if they did.  Other node types that rely on eflags being provided
honestly generally just Assert that they're not asked for something
else later.
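
For illustration only (this is not a quote of any existing node), such
an Assert could look roughly like:

    /* If backward-fetch support wasn't requested at init time, insist
     * that nobody drives us backward later. */
    Assert(!ScanDirectionIsBackward(estate->es_direction) ||
           (estate->es_top_eflags & EXEC_FLAG_BACKWARD) != 0);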

            regards, tom lane


Re: Explain buffers wrong counter with parallel plans

From
Amit Kapila
Date:
On Wed, Aug 1, 2018 at 7:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Jul 28, 2018 at 2:14 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> We have done verification that the approach works and fixes the bug in
>> all known cases.  Do you see any problem with this approach?
>
> Regarding the change to execParallel.c, I think if you're going to
> move that code you should add a comment explaining the justification
> for the new positioning, like:
>
> /*
>  * Prepare to track buffer usage during query execution.
>  *
>  * We do this after starting up the executor to match what happens in the
>  * leader, which also doesn't count buffer accesses that occur during
>  * executor startup.
>  */
>
> If you just make a change and move the code without adding a comment
> explaining why you moved it, the next person will wonder why they
> shouldn't just move it back when they hit some other problem.  I also
> agree with Andres that this could be committed separately.
>
> Regarding the change to execProcnode.c, "atleast" should be "at
> least".  But more broadly, I don't think the comment is very clear
> about the justification for what we're doing, and Andres is right that
> duplication the comment isn't helpful.  I've attempted to improve this
> in the attached version.
>
> Regarding Andres's comments about the changes to nodeLimit.c, I think
> he's right, but as you say, this isn't the first place to have this
> problem.  The issue is that someone might be using cursor operations
> to fetch forward through the plan one tuple at a time, and then after
> triggering the shutdown, they might start fetching backward.  That
> doesn't matter so long as shutdown callbacks are only used for
> shutting down stuff related to parallel query, because you can't fetch
> backwards from a Gather node or anything under it.
>

Yeah, and in the first place, we won't allow parallel query for
cursors: CURSOR_OPT_PARALLEL_OK won't be set for them, so
standard_planner will prohibit generating parallel plans.
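
A simplified sketch of that planner-side guard (the condition list is
abridged; the real test in standard_planner has more checks):

    /* Parallel plans are considered only when the caller passes
     * CURSOR_OPT_PARALLEL_OK, which cursors don't. */
    if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
        parse->commandType == CMD_SELECT &&
        max_parallel_workers_per_gather > 0 /* ... more checks ... */)
        glob->parallelModeOK =
            (max_parallel_hazard(parse) != PROPARALLEL_UNSAFE);
    else
        glob->parallelModeOK = false;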


>  But if we want to
> make broader use of shutdown callbacks -- and I think that would be a
> good idea -- then it's a problem.  There's already a latent bug here,
> because custom scans and foreign scans are allowed to have shutdown
> callbacks (but postgres_fdw and file_fdw don't).
>
> Fixing that seems like a separate issue, and I think it would probably
> be OK to proceed with the change as you have it for now, but we ought
> to do something about it.  I'm not sure there's a problem with
> rewinding, as Andres suggests: if the node is entirely rescanned, I
> believe it's allowed to regenerate the output, and Gather (Merge) can
> do that by firing up a new set of workers.  But scanning backwards is
> a problem.  I'm not exactly sure what the best way of handling that
> is, but one thing I think might work is to save ExecutePlan's
> execute_once flag in the EState and then make the call in nodeLimit.c
> and the one in ExecutePlan itself conditional on that flag.  If we
> know that the plan is only going to be executed once, then there can
> never be any backward fetches and it's fine to shut down as soon as we
> finish going forward.
>

I think something along the lines of what Tom and you are suggesting
can be done with the help of EXEC_FLAG_BACKWARD, but I don't see the
need to do anything more for this patch.  The change in nodeLimit.c is
anyway only for forward scans, so there shouldn't be any need for any
other check.

If we agree that the current patch is sufficient for this bug fix,
then I can work on splitting the patch and preparing the patches for
the back branches.  As mentioned above, I think we should back-patch
this to 9.6.  Any objections to back-patching this fix?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Explain buffers wrong counter with parallel plans

From
Andres Freund
Date:
Hi,

On 2018-08-02 08:21:58 +0530, Amit Kapila wrote:
> I think something on the lines what Tom and you are suggesting can be
> done with the help of EXEC_FLAG_BACKWARD, but I don't see the need to
> do anything for this patch.  The change in nodeLimit.c is any way for
> forward scans, so there shouldn't be any need for any other check.

I think this is almost a guarantee to introduce bugs in the future. And
besides that, as Robert points out, it's essentially an existing bug for
custom scans.  Given that EXEC_FLAG_BACKWARD already exists, why not do
the right thing here?

Greetings,

Andres Freund


Re: Explain buffers wrong counter with parallel plans

From
Amit Kapila
Date:
On Thu, Aug 2, 2018 at 8:38 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2018-08-02 08:21:58 +0530, Amit Kapila wrote:
>> I think something on the lines what Tom and you are suggesting can be
>> done with the help of EXEC_FLAG_BACKWARD, but I don't see the need to
>> do anything for this patch.  The change in nodeLimit.c is any way for
>> forward scans, so there shouldn't be any need for any other check.
>
> I think this is almost a guarantee to introduce bugs in the future. And
> besides that, as Robert points out, it's essentially an existing bug for
> custom scans.  Given that EXEC_FLAG_BACKWARD already exists, why not do
> the right thing here?
>

Sure, if we want we can do it in this patch, but this is not a problem
introduced by this patch; it is also related to the existing usage of
shutdown callbacks.  I think we can use es_top_eflags in EState to
detect it and then call shutdown only if EXEC_FLAG_BACKWARD is not
set.  I think we should do that as a separate patch, either before or
after this one, rather than conflating that change into this patch.
IIUC, Robert also said that we should fix that separately.  I will do
as per whatever consensus we reach here.
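
A minimal sketch of that check, assuming it is added at the existing
early-shutdown call site in ExecLimit:

    /* If we know we won't need to back up, we can release resources
     * at this point. */
    if (!(node->ps.state->es_top_eflags & EXEC_FLAG_BACKWARD))
        (void) ExecShutdownNode(outerPlan);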

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Explain buffers wrong counter with parallel plans

From
Amit Kapila
Date:
On Thu, Aug 2, 2018 at 10:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Aug 2, 2018 at 8:38 AM, Andres Freund <andres@anarazel.de> wrote:
>> Hi,
>>
>> On 2018-08-02 08:21:58 +0530, Amit Kapila wrote:
>>> I think something on the lines what Tom and you are suggesting can be
>>> done with the help of EXEC_FLAG_BACKWARD, but I don't see the need to
>>> do anything for this patch.  The change in nodeLimit.c is any way for
>>> forward scans, so there shouldn't be any need for any other check.
>>
>> I think this is almost a guarantee to introduce bugs in the future. And
>> besides that, as Robert points out, it's essentially an existing bug for
>> custom scans.  Given that EXEC_FLAG_BACKWARD already exists, why not do
>> the right thing here?
>>
>
> Sure, if we want we can do it in this patch, but this is not the
> problem of this patch.  It is also related to existing usage of
> shutdown callbacks.  I think we can use es_top_eflags in Estate to
> detect it and then call shutdown only if EXEC_FLAG_BACKWARD is not
> set.  I think we should do that as a separate patch either before or
> after this patch rather than conflating that change into this patch.
> IIUC, then Robert also said that we should fix that separately.  I
> will do as per whatever consensus we reach here.
>

I have created three patches: (a) move InstrStartParallelQuery from its
original location so that we perform it just before ExecutorRun; (b) fix
the Gather stats by calling shutdown at the appropriate place and
allowing stats collection in ExecShutdownNode; and (c) prohibit calling
ExecShutdownNode if there is a possibility of backward scans (I have
done some basic tests with this patch; if we decide to proceed with it,
some more verification and testing would be required).

I think we should commit the first two patches, as they fix the problem
being discussed in this thread, and then do some additional
verification for the third patch (option c).  I can understand if
people want to commit the third patch before the second one, so let me
know what you think.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Explain buffers wrong counter with parallel plans

From
Robert Haas
Date:
On Thu, Aug 2, 2018 at 5:41 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have created three patches (a) move InstrStartParallelQuery from its
> original location so that we perform it just before ExecutorRun (b)
> patch to fix the gather stats by calling shutdown at appropriate place
> and allow stats collection in ExecShutdownNode (c) prohibit calling
> ExecShutdownNode if there is a possibility of backward scans (I have
> done some basic tests with this patch, if we decide to proceed with
> it, then some more verification and testing would be required).
>
> I think we should commit first two patches as that fixes the problem
> being discussed in this thread and then do some additional
> verification for the third patch (mentioned in option c).  I can
> understand if people want to commit the third patch before the second
> patch, so let me know what you guys think.

I'm happy with the first two patches.  In the third one, I don't think
"See ExecLimit" is a good thing to put a comment like this, because
it's too hard to find the comment to which it refers, and because
future commits are likely to edit or remove that comment without
noticing the references to it from elsewhere.  Instead I would just
write, in all three places, /* If we know we won't need to back up, we
can release resources at this point. */ or something like that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Explain buffers wrong counter with parallel plans

From
Amit Kapila
Date:
On Thu, Aug 2, 2018 at 11:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Aug 2, 2018 at 5:41 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I have created three patches (a) move InstrStartParallelQuery from its
>> original location so that we perform it just before ExecutorRun (b)
>> patch to fix the gather stats by calling shutdown at appropriate place
>> and allow stats collection in ExecShutdownNode (c) prohibit calling
>> ExecShutdownNode if there is a possibility of backward scans (I have
>> done some basic tests with this patch, if we decide to proceed with
>> it, then some more verification and testing would be required).
>>
>> I think we should commit first two patches as that fixes the problem
>> being discussed in this thread and then do some additional
>> verification for the third patch (mentioned in option c).  I can
>> understand if people want to commit the third patch before the second
>> patch, so let me know what you guys think.
>
> I'm happy with the first two patches.
>

Thanks.  I have pushed those two patches.

>  In the third one, I don't think
> "See ExecLimit" is a good thing to put a comment like this, because
> it's too hard to find the comment to which it refers, and because
> future commits are likely to edit or remove that comment without
> noticing the references to it from elsewhere.  Instead I would just
> write, in all three places, /* If we know we won't need to back up, we
> can release resources at this point. */ or something like that.
>

Okay, I have changed the comment as per your suggestion in the
attached patch.  I will do some more testing/verification of this
patch and will commit over the weekend or on Monday if everything is
fine.

I have noticed that part of the comment atop ExecShutdownNode is now
redundant.  See attached edit_comments_shutdown_node_v1, let me know
if you think otherwise.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Explain buffers wrong counter with parallel plans

From
Adrien NAYRAT
Date:
On 08/03/2018 10:39 AM, Amit Kapila wrote:
> Thanks.  I have pushed those two patches.

Thanks!


Re: Explain buffers wrong counter with parallel plans

From
Robert Haas
Date:
On Fri, Aug 3, 2018 at 4:39 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have noticed that part of the comment atop ExecShutdownNode is now
> redundant.  See attached edit_comments_shutdown_node_v1, let me know
> if you think otherwise.

Oh, yes, that should be adjusted.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Explain buffers wrong counter with parallel plans

From
Alvaro Herrera
Date:
On 2018-Aug-03, Adrien NAYRAT wrote:

> On 08/03/2018 10:39 AM, Amit Kapila wrote:
> > Thanks.  I have pushed those two patches.
> 
> Thanks!

I am marking this open item as closed.  If that's wrong, let's hear why
:-)

Thanks,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Explain buffers wrong counter with parallel plans

From
Amit Kapila
Date:
On Fri, Aug 3, 2018 at 2:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Aug 2, 2018 at 11:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Aug 2, 2018 at 5:41 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> I have created three patches (a) move InstrStartParallelQuery from its
>>> original location so that we perform it just before ExecutorRun (b)
>>> patch to fix the gather stats by calling shutdown at appropriate place
>>> and allow stats collection in ExecShutdownNode (c) prohibit calling
>>> ExecShutdownNode if there is a possibility of backward scans (I have
>>> done some basic tests with this patch, if we decide to proceed with
>>> it, then some more verification and testing would be required).
>>>
>>> I think we should commit first two patches as that fixes the problem
>>> being discussed in this thread and then do some additional
>>> verification for the third patch (mentioned in option c).  I can
>>> understand if people want to commit the third patch before the second
>>> patch, so let me know what you guys think.
>>
>> I'm happy with the first two patches.
>>
>
> Thanks.  I have pushed those two patches.
>
>>  In the third one, I don't think
>> "See ExecLimit" is a good thing to put a comment like this, because
>> it's too hard to find the comment to which it refers, and because
>> future commits are likely to edit or remove that comment without
>> noticing the references to it from elsewhere.  Instead I would just
>> write, in all three places, /* If we know we won't need to back up, we
>> can release resources at this point. */ or something like that.
>>
>
> Okay, I have changed the comment as per your suggestion in the
> attached patch.  I will do some more testing/verification of this
> patch and will commit over the weekend or on Monday if everything is
> fine.
>

I have verified that the patch works whenever we use scrollable
cursors.  Please find the attached patch with the modified commit
message.  Since it is now a bit late for this minor release and this
doesn't appear to be a blocker issue, it is better to push it after
the release.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Explain buffers wrong counter with parallel plans

From
Amit Kapila
Date:
On Fri, Aug 3, 2018 at 8:08 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2018-Aug-03, Adrien NAYRAT wrote:
>
>> On 08/03/2018 10:39 AM, Amit Kapila wrote:
>> > Thanks.  I have pushed those two patches.
>>
>> Thanks!
>
> I am marking this open item as closed.  If that's wrong, let's hear why
>

Thanks, it is not wrong.  I initially thought I would close it after
committing the pending patch which I just posted.  However, it is okay,
as that patch addresses a somewhat related but different issue.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Explain buffers wrong counter with parallel plans

From
Amit Kapila
Date:
On Sun, Aug 5, 2018 at 5:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Aug 3, 2018 at 2:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Thu, Aug 2, 2018 at 11:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>  In the third one, I don't think
>>> "See ExecLimit" is a good thing to put a comment like this, because
>>> it's too hard to find the comment to which it refers, and because
>>> future commits are likely to edit or remove that comment without
>>> noticing the references to it from elsewhere.  Instead I would just
>>> write, in all three places, /* If we know we won't need to back up, we
>>> can release resources at this point. */ or something like that.
>>>
>>
>> Okay, I have changed the comment as per your suggestion in the
>> attached patch.  I will do some more testing/verification of this
>> patch and will commit over the weekend or on Monday if everything is
>> fine.
>>
>
> I have verified that the patch works whenever we use scrollable
> cursors.  Please find the attached patch with the modified commit
> message.  I think now it is a bit late for this minor-release and this
> doesn't appear to be a blocker issue, it is better to push it after
> the release.
>

Pushed.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Explain buffers wrong counter with parallel plans

From
Amit Kapila
Date:
On Fri, Aug 3, 2018 at 6:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Aug 3, 2018 at 4:39 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I have noticed that part of the comment atop ExecShutdownNode is now
>> redundant.  See attached edit_comments_shutdown_node_v1, let me know
>> if you think otherwise.
>
> Oh, yes, that should be adjusted.
>

Pushed the patch for the same.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com