Thread: Explain buffers wrong counter with parallel plans
Hello,

I notice EXPLAIN reports wrong counters with parallel plans:

create unlogged table t1 (c1 int);
insert into t1 select generate_series(1,100000000);
vacuum t1;

(stop PG, drop caches, start)

set track_io_timing to on;
explain (analyze,buffers) select count(*) from t1;

                                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=964311.55..964311.56 rows=1 width=8) (actual time=34478.972..34478.972 rows=1 loops=1)
   Buffers: shared read=147952
   I/O Timings: read=20407.894
   ->  Gather  (cost=964311.33..964311.54 rows=2 width=8) (actual time=34478.835..34478.937 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared read=147952
         I/O Timings: read=20407.894
         ->  Partial Aggregate  (cost=963311.33..963311.34 rows=1 width=8) (actual time=34283.809..34283.810 rows=1 loops=3)
               Buffers: shared read=442478 dirtied=1 written=1
               I/O Timings: read=60706.314 write=0.498
               ->  Parallel Seq Scan on t1  (cost=0.00..859144.67 rows=41666667 width=0) (actual time=27.027..27405.058 rows=33333333 loops=3)
                     Buffers: shared read=442478 dirtied=1 written=1
                     I/O Timings: read=60706.314 write=0.498
 Planning time: 39.037 ms
 Execution time: 34480.280 ms

(stop PG, drop caches, start)

set max_parallel_workers_per_gather to 0;
explain (analyze,buffers) select count(*) from t1;

                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1692478.00..1692478.01 rows=1 width=8) (actual time=34367.678..34367.678 rows=1 loops=1)
   Buffers: shared read=442478
   I/O Timings: read=7977.775
   ->  Seq Scan on t1  (cost=0.00..1442478.00 rows=100000000 width=0) (actual time=8.672..19685.654 rows=100000000 loops=1)
         Buffers: shared read=442478
         I/O Timings: read=7977.775
 Planning time: 38.971 ms
 Execution time: 34368.097 ms
(8 rows)

Gather and Finalize Aggregate should report shared read=442478.

I am also surprised by the I/O Timings in the parallel node:
"I/O Timings: read=60706.314" -- but the query lasted 34 s?

I played with git bisect and I found this commit:

commit 01edb5c7fc3bcf6aea15f2b3be36189b52ad9d1a
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Fri Sep 1 17:38:54 2017 -0400

    Improve division of labor between execParallel.c and nodeGather[Merge].c.

    Move the responsibility for creating/destroying TupleQueueReaders into
    execParallel.c, to avoid duplicative coding in nodeGather.c and
    nodeGatherMerge.c.  Also, instead of having DestroyTupleQueueReader do
    shm_mq_detach, do it in the caller (which is now only ExecParallelFinish).
    This means execParallel.c does both the attaching and detaching of the
    tuple-queue-reader shm_mqs, which seems less weird than the previous
    arrangement.

    These changes also eliminate a vestigial memory leak (of the pei->tqueue
    array).  It's now demonstrable that rescans of Gather or GatherMerge don't
    leak memory.

    Discussion: https://postgr.es/m/8670.1504192177@sss.pgh.pa.us

Before this commit, here is the result:

explain (analyze,buffers) select count(*) from t1;

                                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=964311.55..964311.56 rows=1 width=8) (actual time=34745.387..34745.387 rows=1 loops=1)
   Buffers: shared hit=160 read=442478 dirtied=1 written=1
   I/O Timings: read=61219.308 write=0.033
   ->  Gather  (cost=964311.33..964311.54 rows=2 width=8) (actual time=34745.273..34745.379 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=160 read=442478 dirtied=1 written=1
         I/O Timings: read=61219.308 write=0.033
         ->  Partial Aggregate  (cost=963311.33..963311.34 rows=1 width=8) (actual time=34660.023..34660.023 rows=1 loops=3)
               Buffers: shared read=442478 dirtied=1 written=1
               I/O Timings: read=61219.308 write=0.033
               ->  Parallel Seq Scan on t1  (cost=0.00..859144.67 rows=41666667 width=0) (actual time=11.559..27607.271 rows=33333333 loops=3)
                     Buffers: shared read=442478 dirtied=1 written=1
                     I/O Timings: read=61219.308 write=0.033
 Planning time: 38.094 ms
 Execution time: 34746.395 ms
(16 rows)

set max_parallel_workers_per_gather to 0;
explain (analyze,buffers) select count(*) from t1;

                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1692478.00..1692478.01 rows=1 width=8) (actual time=34368.540..34368.540 rows=1 loops=1)
   Buffers: shared read=442478
   I/O Timings: read=8874.662
   ->  Seq Scan on t1  (cost=0.00..1442478.00 rows=100000000 width=0) (actual time=20.604..19992.941 rows=100000000 loops=1)
         Buffers: shared read=442478
         I/O Timings: read=8874.662
 Planning time: 37.834 ms
 Execution time: 34368.817 ms
(8 rows)

I got similar results on 9.6.

Regards,

--
Adrien NAYRAT
Hello,

I tried to understand this issue, and it seems the Gather node only takes
account of its own buffer usage:

create unlogged table t1 (c1 int);
insert into t1 select generate_series(1,1000000);
vacuum t1;

explain (analyze,buffers,timing off,costs off) select count(*) from t1;
                               QUERY PLAN
------------------------------------------------------------------------
 Finalize Aggregate (actual rows=1 loops=1)
   Buffers: shared hit=1531
   ->  Gather (actual rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=1531
         ->  Partial Aggregate (actual rows=1 loops=3)
               Buffers: shared hit=4425
               ->  Parallel Seq Scan on t1 (actual rows=333333 loops=3)
                     Buffers: shared hit=4425

Same query without parallelism:

                     QUERY PLAN
----------------------------------------------------
 Aggregate (actual rows=1 loops=1)
   Buffers: shared hit=4425
   ->  Seq Scan on t1 (actual rows=1000000 loops=1)
         Buffers: shared hit=4425

We can notice that Parallel Seq Scan and Partial Aggregate report 4425
buffers, the same as the plan without parallelism.

I put elog debugging around these lines in execParallel.c:

    /* Accumulate the statistics from all workers. */
    instrument = GetInstrumentationArray(instrumentation);
    instrument += i * instrumentation->num_workers;
    for (n = 0; n < instrumentation->num_workers; ++n)
    {
        elog(LOG, "worker %d - shared_blks_read: %ld - shared_blks_hit: %ld", n,
             instrument[n].bufusage.shared_blks_read, instrument[n].bufusage.shared_blks_hit);
        InstrAggNode(planstate->instrument, &instrument[n]);
    }

And I get these messages:

LOG:  worker 0 - shared_blks_read: 0 - shared_blks_hit: 1443
LOG:  worker 1 - shared_blks_read: 0 - shared_blks_hit: 1451

If you sum each worker's shared_blks_hit and Gather's shared hit you get:
1443 + 1451 + 1531 = 4425 (size of t1)

It seems the Gather node and Finalize Aggregate only report buffers read
by the main process; the workers seem to be omitted.

The same issue occurs for Gather Merge:

explain (analyze,buffers,timing off,costs off) select * from t1 where c1%10=4 order by 1;
                            QUERY PLAN
-----------------------------------------------------------------
 Gather Merge (actual rows=100000 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=1549
   ->  Sort (actual rows=33333 loops=3)
         Sort Key: c1
         Sort Method: quicksort  Memory: 3143kB
         Buffers: shared hit=4521
         ->  Parallel Seq Scan on t1 (actual rows=33333 loops=3)
               Filter: ((c1 % 10) = 4)
               Rows Removed by Filter: 300000
               Buffers: shared hit=4425

In my log:

worker 0 - shared_blks_read: 0 - shared_blks_hit: 1487
worker 1 - shared_blks_read: 0 - shared_blks_hit: 1485

1487 + 1485 + 1549 = 4521

I tried to understand how instrumentation works, but it is above my
knowledge :(

Regards,

--
Adrien
On Wed, Mar 28, 2018 at 12:07 AM, Adrien Nayrat <adrien.nayrat@anayrat.info> wrote:
> Hello,
>
> I notice Explain report wrong counters with parallel plans :
>
> create unlogged table t1 (c1 int);
> insert into t1 select generate_series(1,100000000);
> vacuum t1;
>
> (stop PG, drop caches,start)
>
> set track_io_timing to on;
> explain (analyze,buffers) select count(*) from t1;
>
>                                                                    QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------------------
>  Finalize Aggregate  (cost=964311.55..964311.56 rows=1 width=8) (actual time=34478.972..34478.972 rows=1 loops=1)
>    Buffers: shared read=147952
>    I/O Timings: read=20407.894
>    ->  Gather  (cost=964311.33..964311.54 rows=2 width=8) (actual time=34478.835..34478.937 rows=3 loops=1)
>          Workers Planned: 2
>          Workers Launched: 2
>          Buffers: shared read=147952
>          I/O Timings: read=20407.894
>          ->  Partial Aggregate  (cost=963311.33..963311.34 rows=1 width=8) (actual time=34283.809..34283.810 rows=1 loops=3)
>                Buffers: shared read=442478 dirtied=1 written=1
>                I/O Timings: read=60706.314 write=0.498
>                ->  Parallel Seq Scan on t1  (cost=0.00..859144.67 rows=41666667 width=0) (actual time=27.027..27405.058 rows=33333333 loops=3)
>                      Buffers: shared read=442478 dirtied=1 written=1
>                      I/O Timings: read=60706.314 write=0.498
>  Planning time: 39.037 ms
>  Execution time: 34480.280 ms
>
> (stop pg, drop caches,start)
>
> set max_parallel_workers_per_gather to 0;
> explain (analyze,buffers) select count(*) from t1;
>
>                                                           QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=1692478.00..1692478.01 rows=1 width=8) (actual time=34367.678..34367.678 rows=1 loops=1)
>    Buffers: shared read=442478
>    I/O Timings: read=7977.775
>    ->  Seq Scan on t1  (cost=0.00..1442478.00 rows=100000000 width=0) (actual time=8.672..19685.654 rows=100000000 loops=1)
>          Buffers: shared read=442478
>          I/O Timings: read=7977.775
>  Planning time: 38.971 ms
>  Execution time: 34368.097 ms
> (8 rows)
>
> Gather and Finalize Aggregate should report shared read=442478.
>

Yeah, it would have been convenient if Gather and Finalize Aggregate
displayed it that way, as it would have been easier for users to
understand.  However, as of now, the aggregated stats for the parallel
workers and the leader are displayed at the parallel/partial nodes, as
shown in the plan.  So according to me, what you are seeing is okay; it
is only slightly tricky to understand.

> I am also surprised for I/O Timings in parallel node:
> I/O Timings: read=60706.314 but the query last 34s?
>

This is the aggregated time of all the parallel workers, so you need to
divide it by loops (which in your case is 3).

> I played with git bisect and I found this commit :
>
> commit 01edb5c7fc3bcf6aea15f2b3be36189b52ad9d1a
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   Fri Sep 1 17:38:54 2017 -0400
>

I think you were seeing different results before this commit because
before it we were shutting down the workers as soon as the parallel
workers were done, and the buffer_usage stats were accumulated and were
being used for the upper nodes.  According to me the behavior after the
commit is consistent; for example, I think if you check the case of
GatherMerge before this commit, you will still get the stats in the way
they are reported after the commit.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
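To put numbers on the division Amit describes, here is a quick standalone check (plain C, not part of the original mail; the figures are taken from the parallel plan above):

```c
#include <stdio.h>

int main(void)
{
    /* Values from the parallel plan above. */
    double aggregated_read_ms = 60706.314; /* "I/O Timings: read=" at the Parallel Seq Scan */
    int    loops = 3;                      /* leader + 2 workers */

    /* EXPLAIN reports the I/O timing summed over all processes, so divide
     * by loops to get the average wall-clock read time per process. */
    printf("per-process read time: %.3f ms\n", aggregated_read_ms / loops);
    /* Prints roughly 20235.438 ms, which fits within the 34 s runtime. */
    return 0;
}
```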
On Fri, Apr 20, 2018 at 6:36 PM, Adrien Nayrat <adrien.nayrat@anayrat.info> wrote: > Hello, > > I tried to understand this issue and it seems Gather node only take account of > this own buffer usage : > > > create unlogged table t1 (c1 int); > insert into t1 select generate_series(1,1000000); > vacuum t1; > > explain (analyze,buffers,timing off,costs off) select count(*) from t1; > QUERY PLAN > ------------------------------------------------------------------------ > Finalize Aggregate (actual rows=1 loops=1) > Buffers: shared hit=1531 > -> Gather (actual rows=3 loops=1) > Workers Planned: 2 > Workers Launched: 2 > Buffers: shared hit=1531 > -> Partial Aggregate (actual rows=1 loops=3) > Buffers: shared hit=4425 > -> Parallel Seq Scan on t1 (actual rows=333333 loops=3) > Buffers: shared hit=4425 > > > Same query without parallelism > > QUERY PLAN > ---------------------------------------------------- > Aggregate (actual rows=1 loops=1) > Buffers: shared hit=4425 > -> Seq Scan on t1 (actual rows=1000000 loops=1) > Buffers: shared hit=4425 > > > We can notice Parallel Seq Scan and Partial Aggregate report 4425 buffers, same > for the plan without parallelism. > > > I put elog debug around theses lines in execParallel.c : > > /* Accumulate the statistics from all workers. */ > instrument = GetInstrumentationArray(instrumentation); > instrument += i * instrumentation->num_workers; > for (n = 0; n < instrumentation->num_workers; ++n) > { > elog(LOG, "worker %d - shared_blks_read: %ld - shared_blks_hit: %ld", n, > instrument[n].bufusage.shared_blks_read,instrument[n].bufusage.shared_blks_hit); > InstrAggNode(planstate->instrument, &instrument[n]); > } > > > And I get theses messages : > > LOG: worker 0 - shared_blks_read: 0 - shared_blks_hit: 1443 > LOG: worker 1 - shared_blks_read: 0 - shared_blks_hit: 1451 > I think you can try 'verbose' option, it will give per-worker stats. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 04/29/2018 05:11 PM, Amit Kapila wrote:
> I think you can try 'verbose' option, it will give per-worker stats.

Thanks! I didn't know verbose gives per-worker stats!

--
Adrien
On 04/29/2018 05:10 PM, Amit Kapila wrote:
> Yeah, it would have been convenient, if Gather and Finalize Aggregate
> displays that way as it would have been easier for users to
> understand.  However, as of now, the aggregated stats for parallel
> workers and leader are displayed at parallel/partial nodes as is
> displayed in the plan.  So according to me, what you are seeing is
> okay, it is only slightly tricky to understand it.
> [...]
>
>> I played with git bisect and I found this commit :
>>
>> commit 01edb5c7fc3bcf6aea15f2b3be36189b52ad9d1a
>> Author: Tom Lane <tgl@sss.pgh.pa.us>
>> Date:   Fri Sep 1 17:38:54 2017 -0400
>>
> I think you were seeing different results before this commit because
> before that we were shutting down workers as soon as parallel workers
> are done and the buffer_usage stats were accumulated and were being
> used for upper nodes.  According to me behavior after the commit is
> consistent, for example, I think if you check the case of GatherMerge
> before this commit, you will still get the stats in the way it is
> after commit.

I understand.  Maybe this change should be mentioned in the release notes
and/or documentation?

Thanks,
Hi, On 2018-04-29 20:40:43 +0530, Amit Kapila wrote: > On Wed, Mar 28, 2018 at 12:07 AM, Adrien Nayrat > > QUERY PLAN > > > > ---------------------------------------------------------------------------------------------------------------------------- > > Aggregate (cost=1692478.00..1692478.01 rows=1 width=8) (actual > > time=34367.678..34367.678 rows=1 loops=1) > > Buffers: shared read=442478 > > I/O Timings: read=7977.775 > > -> Seq Scan on t1 (cost=0.00..1442478.00 rows=100000000 width=0) (actual > > time=8.672..19685.654 rows=100000000 loops=1) > > Buffers: shared read=442478 > > I/O Timings: read=7977.775 > > Planning time: 38.971 ms > now> Execution time: 34368.097 ms > > (8 rows) > > > > > > > > Gather and Finalize Aggregate should report shared read=442478. > > > > Yeah, it would have been convenient, if Gather and Finalize Aggregate > displays that way as it would have been easier for users to > understand. However, as of now, the aggregated stats for parallel > workers and leader are displayed at parallel/partial nodes as is > displayed in the plan. So according to me, what you are seeing is > okay, it is only slightly tricky to understand it. > > > I am also surprised for I/O Timings in parallel node: > > I/O Timings: read=60706.314 but the query last 34s? > > > > This is the aggregated time of all the parallel workers, so you need > to divide it by loops (which in your case is 3). > > > I played with git bisect and I found this commit : > > > > commit 01edb5c7fc3bcf6aea15f2b3be36189b52ad9d1a > > Author: Tom Lane <tgl@sss.pgh.pa.us> > > Date: Fri Sep 1 17:38:54 2017 -0400 > > > > I think you were seeing different results before this commit because > before that we were shutting down workers as soon as parallel workers > are done and the buffer_usage stats were accumulated and were being > used for upper nodes. According to me behavior after the commit is > consistent, for example, I think if you check the case of GatherMerge > before this commit, you will still get the stats in the way it is > after commit. Robert, you added this as an open item. I don't think it's clear that there's a bug here, and even less clear that it's something new for v11. Am I understanding that right? Should we close this open item? Greetings, Andres Freund
On 05/01/2018 10:57 PM, Andres Freund wrote:
> Robert, you added this as an open item. I don't think it's clear that
> there's a bug here, and even less clear that it's something new for
> v11. Am I understanding that right? Should we close this open item?

Hi,

FYI, this change is related to 01edb5c7fc, which is not specific to v11;
01edb5c7fc is also in v10.

Regards,

--
Adrien
On Tue, May 1, 2018 at 4:57 PM, Andres Freund <andres@anarazel.de> wrote: > Robert, you added this as an open item. I don't think it's clear that > there's a bug here, and even less clear that it's something new for > v11. Am I understanding that right? Should we close this open item? Yeah, I added it in response to the original post, but it sounds like this is just another case of somebody being confused by the fact that EXPLAIN divides the row counts, startup time, total time, and number of rows filtered by the loop count. So it's not an open item. That having been said, what I think we ought to do about the confusion that regularly results from that logic is rip it out, as in the attached patch. This would affect not only the parallel query case but also the inner sides of nested loops, but that doesn't bother me a bit because I think dividing is unhelpful in both cases. Note that explain.depesz.com, for example, goes and re-multiplies the times by the loop count to get a figure for how much time was spent in that node; I regularly do the same thing by hand when I read such EXPLAIN output. Also, if you did think those cases should be handled differently, you'd have to explain what should happen when there's a nested loop that gets run by multiple workers, possibly different numbers of times in each one. One problem with this patch is that tools like explain.depesz.com (or my brain) would need to know if they were getting output from 11+ or 10-, so it might be good to perturb this output in some other way so as to avoid confusion, like by replacing "loops=" with some other text. I don't have a specific proposal for that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On 05/02/2018 05:22 PM, Robert Haas wrote:
> On Tue, May 1, 2018 at 4:57 PM, Andres Freund <andres@anarazel.de> wrote:
>> Robert, you added this as an open item. I don't think it's clear that
>> there's a bug here, and even less clear that it's something new for
>> v11. Am I understanding that right? Should we close this open item?
>
> Yeah, I added it in response to the original post, but it sounds like
> this is just another case of somebody being confused by the fact that
> EXPLAIN divides the row counts, startup time, total time, and number
> of rows filtered by the loop count.  So it's not an open item.

To be precise, I am not confused by those counters.  They do not change
between 9.6 and 10.  I am confused by the buffer counters, which are
different between 9.6 and 10.

In 9.6 the Gather node reports the sum of buffers for the main process +
workers.  In 10, the Gather node only reports buffers from the main
process.

Thanks,

--
Adrien
On Wed, May 2, 2018 at 11:37 AM, Adrien Nayrat <adrien.nayrat@anayrat.info> wrote: > In 9.6 gather node reports sum of buffers for main process + workers. In 10, > gather node only reports buffers from the main process. Oh, really? Well, that sounds like a bug. Amit seems to think it's expected behavior, but I don't know why it should be. The commit message makes it sound like it's just refactoring, but in fact it seems to have made a significant behavior change that doesn't look very desirable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, May 4, 2018 at 10:35 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, May 2, 2018 at 11:37 AM, Adrien Nayrat > <adrien.nayrat@anayrat.info> wrote: >> In 9.6 gather node reports sum of buffers for main process + workers. In 10, >> gather node only reports buffers from the main process. > > Oh, really? Well, that sounds like a bug. Amit seems to think it's > expected behavior, but I don't know why it should be. > The reason why I think the current behavior is okay because it is coincidental that they were displayed correctly. We have not made any effort to percolate it to upper nodes. For ex., before that commit also, it was not being displayed for Gather Merge or Gather with some kind of node like 'Limit' where we have to stop before reaching the end of the result. > The commit > message makes it sound like it's just refactoring, but in fact it > seems to have made a significant behavior change that doesn't look > very desirable. > I think it is below part of that commit which has made this difference, basically shutting down the workers. if (readerdone) { Assert(!tup); .. if (gatherstate->nreaders == 0) - { - ExecShutdownGatherWorkers(gatherstate); return NULL; - } The reason why we were getting different results due to above code is that because while shutting down workers, we gather the buffer usage of all workers in 'pgBufferUsage' via InstrAccumParallelQuery and then upper-level node say Gather in this case would get chance to use that stats via ExecProcNodeInstr->InstrStopNode. However, this won't be true in other cases where we need to exit before reaching the end of results like in 'Limit' node case as in such cases after shutting down the workers via ExecShutdownNode we won't do InstrStopNode for upper-level nodes. I think if we want that all the stats being collected by workers should percolate to Gather and nodes above it, then we need to somehow ensure that we always shutdown Gather/GatherMerge before we do InstrStopNode for those nodes. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, May 5, 2018 at 8:56 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > The reason why I think the current behavior is okay because it is > coincidental that they were displayed correctly. We have not made any > effort to percolate it to upper nodes. For ex., before that commit > also, it was not being displayed for Gather Merge or Gather with some > kind of node like 'Limit' where we have to stop before reaching the > end of the result. It's not entirely coincidental. I had the intention to try to ensure that the workers would be shut down before the Gather or Gather Merge, and I think that various things in the code testify to that intention. It sounds like I missed some cases, but now we're missing even more cases. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2018-May-07, Robert Haas wrote: > On Sat, May 5, 2018 at 8:56 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > The reason why I think the current behavior is okay because it is > > coincidental that they were displayed correctly. We have not made any > > effort to percolate it to upper nodes. For ex., before that commit > > also, it was not being displayed for Gather Merge or Gather with some > > kind of node like 'Limit' where we have to stop before reaching the > > end of the result. > > It's not entirely coincidental. I had the intention to try to ensure > that the workers would be shut down before the Gather or Gather Merge, > and I think that various things in the code testify to that intention. > It sounds like I missed some cases, but now we're missing even more > cases. So, apparently this is not a Postgres 11 open item, but rather a bug that goes back to pg10. However, maybe it would be worth fixing soon anyway? In particular, if we want to perturb the explain output as suggested upthread, maybe *that* should be considered an open item? Anyway, this has stalled for a month now. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jun 6, 2018 at 12:08 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > So, apparently this is not a Postgres 11 open item, but rather a bug > that goes back to pg10. However, maybe it would be worth fixing soon > anyway? In particular, if we want to perturb the explain output as > suggested upthread, maybe *that* should be considered an open item? > > Anyway, this has stalled for a month now. Yeah. I'm not willing to do anything here unilaterally. There is neither universal agreement that there is a problem here nor agreement on a fix. Unless that changes, I don't know what to do here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jun 8, 2018 at 11:34 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jun 6, 2018 at 12:08 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> So, apparently this is not a Postgres 11 open item, but rather a bug >> that goes back to pg10. However, maybe it would be worth fixing soon >> anyway? In particular, if we want to perturb the explain output as >> suggested upthread, maybe *that* should be considered an open item? >> >> Anyway, this has stalled for a month now. > > Yeah. I'm not willing to do anything here unilaterally. There is > neither universal agreement that there is a problem here nor agreement > on a fix. > Right, I think we have following options: (a) Come up with a solution which allows percolating the buffer usage and or similar stats to upper nodes in all cases. (b) Allow it to work for some of the cases as it was earlier. I think (b) can cause confusion and could lead to further questions on which specific cases will it work and for which it won't work. If we think (a) is a reasonable approach, then we can close this item with a conclusion as a work item for future and OTOH if we think option (b) is the better way to deal with it, then we can come up with a patch to do so. My inclination is to go with option (a), but I don't mind if the decision is to choose option (b). -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 2018-Jun-08, Robert Haas wrote: > On Wed, Jun 6, 2018 at 12:08 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > So, apparently this is not a Postgres 11 open item, but rather a bug > > that goes back to pg10. However, maybe it would be worth fixing soon > > anyway? In particular, if we want to perturb the explain output as > > suggested upthread, maybe *that* should be considered an open item? > > > > Anyway, this has stalled for a month now. > > Yeah. I'm not willing to do anything here unilaterally. There is > neither universal agreement that there is a problem here nor agreement > on a fix. Unless that changes, I don't know what to do here. After going over this whole thing once more, my opinion is that this reporting is broken. IMO if we want to display per-process counts, then let's display them all, or none; if we're going to display aggregated counts, then we should mark them as such. Otherwise it's pretty much impossible to read the output unless you know exactly what the code does for each value. (The other option would be to document what the code does for each value, but that seems suboptimal -- that just requires people to glue one more cheat sheet to their walls, which is not Pg style.) I'm CCing Tom here, as author of the patch that caused (most of) the issue. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > I'm CCing Tom here, as author of the patch that caused (most of) the > issue. Uh ... me? I thought this was a parallel-query issue, which I've pretty much not been involved in. regards, tom lane
On 2018-Jun-29, Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > I'm CCing Tom here, as author of the patch that caused (most of) the > > issue. > > Uh ... me? I thought this was a parallel-query issue, which I've > pretty much not been involved in. Well, maybe it's a matter of opinion. Amit K said a few messages back that your 01edb5c7fc3b ("Improve division of labor between execParallel.c and nodeGather[Merge].c.") had changed the way these numbers are printed, but only now I realize that he then indicated that a different code path was already behaving in that way. I stand by my opinion that we should not give misleading/confusing info; either let's show it all in the default output, or only do it in VERBOSE, but if the latter then let's suppress the misleading numbers in the default output. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jun 30, 2018 at 4:08 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > On 2018-Jun-29, Tom Lane wrote: > >> Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> > I'm CCing Tom here, as author of the patch that caused (most of) the >> > issue. >> >> Uh ... me? I thought this was a parallel-query issue, which I've >> pretty much not been involved in. > > Well, maybe it's a matter of opinion. Amit K said a few messages back > that your 01edb5c7fc3b ("Improve division of labor between > execParallel.c and nodeGather[Merge].c.") had changed the way these > numbers are printed, but only now I realize that he then indicated that > a different code path was already behaving in that way. > > I stand by my opinion that we should not give misleading/confusing info; > either let's show it all in the default output, or only do it in > VERBOSE, but if the latter then let's suppress the misleading numbers in > the default output. > It is not clear to me what exactly is your expectation. Can you be a bit more specific? AFAIU, the primary confusion to OP is that the aggregated stats like buffer_usage (Buffers: ..) should be displayed correctly at Gather/Gather Merge and nodes above it. Currently, those are being displayed correctly at Parallel (Seq Scan) nodes. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Jun 29, 2018 at 6:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> I'm CCing Tom here, as author of the patch that caused (most of) the >> issue. > > Uh ... me? I thought this was a parallel-query issue, which I've > pretty much not been involved in. Well, it was your commit that caused the behavior change, as the original poster mentioned in his email. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Jun 10, 2018 at 1:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Right, I think we have following options:
> (a) Come up with a solution which allows percolating the buffer usage
> and or similar stats to upper nodes in all cases.
> (b) Allow it to work for some of the cases as it was earlier.
>
> I think (b) can cause confusion and could lead to further questions on
> which specific cases will it work and for which it won't work.  If we
> think (a) is a reasonable approach, then we can close this item with a
> conclusion as a work item for future and OTOH if we think option (b)
> is the better way to deal with it, then we can come up with a patch to
> do so.  My inclination is to go with option (a), but I don't mind if
> the decision is to choose option (b).

I think the core problem here is this hunk from gather_readnext:

         {
             Assert(!tup);
-            DestroyTupleQueueReader(reader);
             --gatherstate->nreaders;
             if (gatherstate->nreaders == 0)
-            {
-                ExecShutdownGatherWorkers(gatherstate);
                 return NULL;
-            }
             memmove(&gatherstate->reader[gatherstate->nextreader],
                     &gatherstate->reader[gatherstate->nextreader + 1],
                     sizeof(TupleQueueReader *)

Since ExecShutdownGatherWorkers() is no longer called there, the
instrumentation data isn't accumulated into the Gather node when the
workers are shut down.  I think that's a bug and we should fix it.

To fix the problem with Limit that you mention, we could just modify
nodeLimit.c so that when the state is changed from LIMIT_INWINDOW to
LIMIT_WINDOWEND, we also call ExecShutdownNode on the child plan.  We
can fix other cases as we find them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
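For concreteness, the nodeLimit.c change Robert is proposing would look roughly like the following sketch (based on the hunk quoted later in Andres' review; the enclosing condition is abbreviated, and this is in-tree executor code, not a standalone program):

```c
/*
 * Sketch of the suggested change in ExecLimit() (nodeLimit.c).  Once the
 * limit count has been reached, end the window and shut the subtree down,
 * so that a Gather underneath gets a chance to fold its workers' buffer
 * usage into its counters before EXPLAIN reads them.
 */
if (node->position - node->offset >= node->count)
{
    node->lstate = LIMIT_WINDOWEND;

    /* Allow nodes to release or shut down resources. */
    (void) ExecShutdownNode(outerPlan);

    return NULL;
}
```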
On Mon, Jul 2, 2018 at 6:02 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > I think the core problem here is this hunk from gather_readnext: > > { > Assert(!tup); > - DestroyTupleQueueReader(reader); > --gatherstate->nreaders; > if (gatherstate->nreaders == 0) > - { > - ExecShutdownGatherWorkers(gatherstate); > return NULL; > - } > memmove(&gatherstate->reader[gatherstate->nextreader], > &gatherstate->reader[gatherstate->nextreader + 1], > sizeof(TupleQueueReader *) > > Since ExecShutdownGatherWorkers() is no longer called there, the > instrumentation data isn't accumulated into the Gather node when the > workers are shut down. I think that's a bug and we should fix it. > Yeah, previously, I have also pointed out the same code [1]. However, I have not done any testing to prove it. > To fix the problem with Limit that you mention, we could just modify > nodeLimit.c so that when the state is changed from LIMIT_INWINDOW to > LIMIT_WINDOWEND, we also call ExecShutdownNode on the child plan. > It should work. > We can fix other cases as we find them. > I think we have a similar problem with GatherMerge, but that also appears to be fixable. Are you planning to work on it? If not, then I can look into it. [1] - https://www.postgresql.org/message-id/CAA4eK1KZEbYKj9HHP-6WqqjAXuoB%2BWJu-w1s9uovj%3DeeBxC48Q%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Jul 3, 2018 at 6:48 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Yeah, previously, I have also pointed out the same code [1]. However, > I have not done any testing to prove it. Ah, OK. Sorry, forgot about that email. > Are you planning to work on it? If not, then I can look into it. I won't have time for a few weeks, so if you have time sooner please have at it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jul 3, 2018 at 4:18 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Jul 2, 2018 at 6:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> To fix the problem with Limit that you mention, we could just modify
>> nodeLimit.c so that when the state is changed from LIMIT_INWINDOW to
>> LIMIT_WINDOWEND, we also call ExecShutdownNode on the child plan.
>>
>
> It should work.
>

I have tried this idea, but it doesn't completely solve the problem.
The problem is that nodes below the Limit won't get a chance to
accumulate the stats, as they won't be able to call InstrStopNode.  So
the result will be something like below:

postgres=# explain (analyze,buffers,timing off,costs off) select * from t1 limit 50000;
                            QUERY PLAN
-----------------------------------------------------------------
 Limit (actual rows=50000 loops=1)
   Buffers: shared hit=6 read=224
   ->  Gather (actual rows=50000 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=1 read=63
         ->  Parallel Seq Scan on t1 (actual rows=17213 loops=3)
               Buffers: shared hit=6 read=224
 Planning Time: 0.105 ms
 Execution Time: 1363068.675 ms
(10 rows)

In this plan, you can notice that the stats (Buffers:) at Parallel Seq
Scan and Limit are the same, but the Gather node shows different stats.

One idea could be that in ExecShutdownNode we somehow allow the nodes to
count the stats, maybe by calling InstrStartNode/InstrStopNode, but I am
not sure if that is the best way to fix it.  Thoughts?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Jul 6, 2018 at 9:44 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I have tried this idea, but it doesn't completely solve the problem. > The problem is that nodes below LIMIT won't get a chance to accumulate > the stats as they won't be able to call InstrStopNode. I'm not sure I understand. Why not? I see that we'd need to insert an extra call to InstrStopNode() if we were stopping the node while it was running, because then InstrStartNode() would have already been done, but the corresponding call to InstrStopNode() would not have been done. But I'm not sure how that would happen in this case. Can you explain further? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Jul 7, 2018 at 12:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jul 6, 2018 at 9:44 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I have tried this idea, but it doesn't completely solve the problem.
>> The problem is that nodes below LIMIT won't get a chance to accumulate
>> the stats as they won't be able to call InstrStopNode.
>
> I'm not sure I understand.  Why not?  I see that we'd need to insert
> an extra call to InstrStopNode() if we were stopping the node while it
> was running, because then InstrStartNode() would have already been
> done, but the corresponding call to InstrStopNode() would not have
> been done.  But I'm not sure how that would happen in this case.  Can
> you explain further?
>

Okay, let me try.  The code flow is that for each node we will call
InstrStartNode()->ExecProcNodeReal()->InstrStopNode().  Now let's say
we have to execute a plan Limit->Gather->Parallel SeqScan.  In this,
first for Limit node, we will call InstrStartNode() and
ExecProcNodeReal() and then for Gather we will call InstrStartNode(),
ExecProcNodeReal() and InstrStopNode().  Now, Limit node decides that
it needs to shutdown all the nodes (ExecShutdownNode) and after that
it will call InstrStopNode() for Limit node.  So, in this flow after
shutting down nodes, we never get chance for Gather node to use stats
collected during ExecShutdownNode.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
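The call ordering Amit describes can be modelled with a small standalone C program (a toy model, not PostgreSQL source; the buffer counts are borrowed from Adrien's earlier count(*) example just to make the arithmetic concrete):

```c
/* Toy model of the instrumentation bracketing: a node only sees counter
 * growth that happens between its "start" and "stop" calls. */
#include <stdio.h>

static long counter;                    /* stand-in for the global buffer-usage counter */

typedef struct { long delta; } Instr;

static void start(Instr *i) { i->delta -= counter; }
static void stop(Instr *i)  { i->delta += counter; }

int main(void)
{
    Instr limit = {0}, gather = {0};

    start(&limit);
    start(&gather);
    counter += 1531;      /* buffers the leader itself reads below Gather */
    stop(&gather);        /* Gather's counters are frozen here... */
    stop(&limit);         /* ...and so are the upper node's */

    /* ExecShutdownNode() runs only after the stops above, so the workers'
     * contribution (1443 + 1451 in the earlier example) arrives too late
     * for either node to see it. */
    counter += 1443 + 1451;

    printf("Gather sees %ld buffers, total was %ld\n", gather.delta, counter);
    return 0;   /* prints: Gather sees 1531 buffers, total was 4425 */
}
```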
On Sat, Jul 7, 2018 at 7:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Sat, Jul 7, 2018 at 12:44 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Jul 6, 2018 at 9:44 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> I have tried this idea, but it doesn't completely solve the problem. >>> The problem is that nodes below LIMIT won't get a chance to accumulate >>> the stats as they won't be able to call InstrStopNode. >> >> I'm not sure I understand. Why not? I see that we'd need to insert >> an extra call to InstrStopNode() if we were stopping the node while it >> was running, because then InstrStartNode() would have already been >> done, but the corresponding call to InstrStopNode() would not have >> been done. But I'm not sure how that would happen in this case. Can >> you explain further? >> > > Okay, let me try. The code flow is that for each node we will call > InstrStartNode()->ExecProcNodeReal()->InstrStopNode(). Now let's say > we have to execute a plan Limit->Gather-> Parallel SeqScan. In this, > first for Limit node, we will call InstrStartNode() and > ExecProcNodeReal() and then for Gather we will call InstrStartNode(), > ExecProcNodeReal() and InstrStopNode(). Now, Limit node decides that > it needs to shutdown all the nodes (ExecShutdownNode) and after that > it will call InstrStopNode() for Limit node. So, in this flow after > shutting down nodes, we never get chance for Gather node to use stats > collected during ExecShutdownNode. > I went ahead and tried the solution which I had mentioned yesterday, that is to allow ExecShutdownNode to count stats. Apart from fixing this problem, it will also fix the problem with Gather Merge as reported by Adrien [1], because now Gather Merge will also get a chance to count stats after shutting down workers. Note that, I have changed the location of InstrStartParallelQuery in ParallelQueryMain so that the buffer usage stats are accumulated only for the plan execution which is what we do for instrumentation information as well. If we don't do that, it will count some additional stats for ExecutorStart which won't match with what we have in Instrumentation structure of each node. [1] - https://www.postgresql.org/message-id/01952aab-33ca-36cd-e74b-ce32f3eefc84%40anayrat.info -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
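The fix Amit describes amounts to bracketing the node-specific shutdown work in ExecShutdownNode() with instrumentation start/stop calls; a rough sketch of that part follows (the actual hunks are quoted in Andres' review further down, so this is illustrative excerpting of in-tree code, not a complete function):

```c
/* Sketch of the instrumentation bracketing added to ExecShutdownNode()
 * in execProcnode.c. */
planstate_tree_walker(node, ExecShutdownNode, NULL);

/* Let stats collected during shutdown be counted for nodes that were
 * executed at least once. */
if (node->instrument && node->instrument->running)
    InstrStartNode(node->instrument);

switch (nodeTag(node))
{
    case T_GatherState:
        ExecShutdownGather((GatherState *) node);
        break;
    /* ... other shutdown-capable node types ... */
    default:
        break;
}

/* Close the bracket opened above so the shutdown-time buffer usage is
 * credited to this node. */
if (node->instrument && node->instrument->running)
    InstrStopNode(node->instrument, 0);
```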
On Jul 7, 2018, at 12:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, Jul 7, 2018 at 7:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Sat, Jul 7, 2018 at 12:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Fri, Jul 6, 2018 at 9:44 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>> I have tried this idea, but it doesn't completely solve the problem.
>>>> The problem is that nodes below LIMIT won't get a chance to accumulate
>>>> the stats as they won't be able to call InstrStopNode.
>>>
>>> I'm not sure I understand.  Why not?  I see that we'd need to insert
>>> an extra call to InstrStopNode() if we were stopping the node while it
>>> was running, because then InstrStartNode() would have already been
>>> done, but the corresponding call to InstrStopNode() would not have
>>> been done.  But I'm not sure how that would happen in this case.  Can
>>> you explain further?
>>
>> Okay, let me try.  The code flow is that for each node we will call
>> InstrStartNode()->ExecProcNodeReal()->InstrStopNode().  Now let's say
>> we have to execute a plan Limit->Gather->Parallel SeqScan.  In this,
>> first for Limit node, we will call InstrStartNode() and
>> ExecProcNodeReal() and then for Gather we will call InstrStartNode(),
>> ExecProcNodeReal() and InstrStopNode().  Now, Limit node decides that
>> it needs to shutdown all the nodes (ExecShutdownNode) and after that
>> it will call InstrStopNode() for Limit node.  So, in this flow after
>> shutting down nodes, we never get chance for Gather node to use stats
>> collected during ExecShutdownNode.
>
> I went ahead and tried the solution which I had mentioned yesterday,
> that is to allow ExecShutdownNode to count stats.  Apart from fixing
> this problem, it will also fix the problem with Gather Merge as
> reported by Adrien [1], because now Gather Merge will also get a
> chance to count stats after shutting down workers.
>
> Note that, I have changed the location of InstrStartParallelQuery in
> ParallelQueryMain so that the buffer usage stats are accumulated only
> for the plan execution which is what we do for instrumentation
> information as well.  If we don't do that, it will count some
> additional stats for ExecutorStart which won't match with what we have
> in Instrumentation structure of each node.
I tried running the below on both 11beta2 and HEAD with the patch
applied:
CREATE UNLOGGED TABLE t1 (c1 int);
INSERT INTO t1 SELECT generate_series(1,100000000);
/** restart PostgreSQL */
EXPLAIN (analyze,buffers,timing off,costs off)
SELECT count(*) FROM t1;
/** repeat above two queries */
I have identical postgresql.conf files on both instances as well.
11beta2
======
buffers=# explain (analyze,buffers,timing off,costs off) select count(*)
from t1;
QUERY PLAN
--------------------------------------------------------------------------
Finalize Aggregate (actual rows=1 loops=1)
Buffers: shared read=63175
-> Gather (actual rows=7 loops=1)
Workers Planned: 6
Workers Launched: 6
Buffers: shared read=63175
-> Partial Aggregate (actual rows=1 loops=7)
Buffers: shared read=442478
-> Parallel Seq Scan on t1 (actual rows=14285714 loops=7)
Buffers: shared read=442478
Planning Time: 1.422 ms
Execution Time: 3214.407 ms
(12 rows)
buffers=# explain (analyze,buffers,timing off,costs off) select count(*)
from t1;
QUERY PLAN
--------------------------------------------------------------------------
Finalize Aggregate (actual rows=1 loops=1)
Buffers: shared hit=27 read=64960
-> Gather (actual rows=7 loops=1)
Workers Planned: 6
Workers Launched: 6
Buffers: shared hit=27 read=64960
-> Partial Aggregate (actual rows=1 loops=7)
Buffers: shared hit=224 read=442254
-> Parallel Seq Scan on t1 (actual rows=14285714 loops=7)
Buffers: shared hit=224 read=442254
Planning Time: 0.080 ms
Execution Time: 3774.091 ms
(12 rows)
HEAD
=====
buffers=# explain (analyze,buffers,timing off,costs off) select count(*)
from t1;
QUERY PLAN
--------------------------------------------------------------------------
Finalize Aggregate (actual rows=1 loops=1)
Buffers: shared read=442478
-> Gather (actual rows=7 loops=1)
Workers Planned: 6
Workers Launched: 6
Buffers: shared read=442478
-> Partial Aggregate (actual rows=1 loops=7)
Buffers: shared read=442478
-> Parallel Seq Scan on t1 (actual rows=14285714 loops=7)
Buffers: shared read=442478
Planning Time: 1.594 ms
Execution Time: 6207.799 ms
(12 rows)
buffers=# explain (analyze,buffers,timing off,costs off) select count(*)
from t1;
QUERY PLAN
--------------------------------------------------------------------------
Finalize Aggregate (actual rows=1 loops=1)
Buffers: shared hit=224 read=442254
-> Gather (actual rows=7 loops=1)
Workers Planned: 6
Workers Launched: 6
Buffers: shared hit=224 read=442254
-> Partial Aggregate (actual rows=1 loops=7)
Buffers: shared hit=224 read=442254
-> Parallel Seq Scan on t1 (actual rows=14285714 loops=7)
Buffers: shared hit=224 read=442254
Planning Time: 0.074 ms
Execution Time: 5006.395 ms
(12 rows)
Notice the “read” numbers just before the “Finalize Aggregate” are much
higher, and there appears to be a performance hit.
Jonathan
On Wed, Jul 25, 2018 at 7:42 PM, Jonathan S. Katz <jonathan.katz@excoventures.com> wrote: > > On Jul 7, 2018, at 12:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > I tried running the below on both 11beta2 and HEAD with the patch > applied: > > CREATE UNLOGGED TABLE t1 (c1 int); > INSERT INTO t1 SELECT generate_series(1,100000000); > /** restart PostgreSQL */ > EXPLAIN (analyze,buffers,timing off,costs off) > SELECT count(*) FROM t1; > /** repeat above two queries */ > > I have identical postgresql.conf files on both instances as well. > > 11beta2 > ====== > buffers=# explain (analyze,buffers,timing off,costs off) select count(*) > from t1; > QUERY PLAN > > -------------------------------------------------------------------------- > Finalize Aggregate (actual rows=1 loops=1) > Buffers: shared read=63175 > -> Gather (actual rows=7 loops=1) > Workers Planned: 6 > Workers Launched: 6 > Buffers: shared read=63175 > -> Partial Aggregate (actual rows=1 loops=7) > Buffers: shared read=442478 > -> Parallel Seq Scan on t1 (actual rows=14285714 > loops=7) > Buffers: shared read=442478 > Planning Time: 1.422 ms > Execution Time: 3214.407 ms > (12 rows) > > buffers=# explain (analyze,buffers,timing off,costs off) select count(*) > from t1; > QUERY PLAN > > -------------------------------------------------------------------------- > Finalize Aggregate (actual rows=1 loops=1) > Buffers: shared hit=27 read=64960 > -> Gather (actual rows=7 loops=1) > Workers Planned: 6 > Workers Launched: 6 > Buffers: shared hit=27 read=64960 > -> Partial Aggregate (actual rows=1 loops=7) > Buffers: shared hit=224 read=442254 > -> Parallel Seq Scan on t1 (actual rows=14285714 > loops=7) > Buffers: shared hit=224 read=442254 > Planning Time: 0.080 ms > Execution Time: 3774.091 ms > (12 rows) > > HEAD > ===== > buffers=# explain (analyze,buffers,timing off,costs off) select count(*) > from t1; > QUERY PLAN > > -------------------------------------------------------------------------- > Finalize Aggregate (actual rows=1 loops=1) > Buffers: shared read=442478 > -> Gather (actual rows=7 loops=1) > Workers Planned: 6 > Workers Launched: 6 > Buffers: shared read=442478 > -> Partial Aggregate (actual rows=1 loops=7) > Buffers: shared read=442478 > -> Parallel Seq Scan on t1 (actual rows=14285714 > loops=7) > Buffers: shared read=442478 > Planning Time: 1.594 ms > Execution Time: 6207.799 ms > (12 rows) > > buffers=# explain (analyze,buffers,timing off,costs off) select count(*) > from t1; > QUERY PLAN > > -------------------------------------------------------------------------- > Finalize Aggregate (actual rows=1 loops=1) > Buffers: shared hit=224 read=442254 > -> Gather (actual rows=7 loops=1) > Workers Planned: 6 > Workers Launched: 6 > Buffers: shared hit=224 read=442254 > -> Partial Aggregate (actual rows=1 loops=7) > Buffers: shared hit=224 read=442254 > -> Parallel Seq Scan on t1 (actual rows=14285714 > loops=7) > Buffers: shared hit=224 read=442254 > Planning Time: 0.074 ms > Execution Time: 5006.395 ms > (12 rows) > > Notice the “read” numbers just before the “Finalize Aggregate” is much > higher, > You mean to say the number (Buffers: shared read=442478) in HEAD, right? If so, yeah, I am also wondering why the results of the patch are different in HEAD and 11beta2. So, if I read correctly, the numbers in 11beta2 appears correct, but on HEAD it is not correct, it should have divided the buffers read by loops. 
I will figure that out, but this is just to clarify that both of us are seeing the results in the same way. >and there appears to be a performance hit. > Do you mean to say the performance of the same query in 11beta2 and HEAD or something else? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
> On Jul 25, 2018, at 10:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jul 25, 2018 at 7:42 PM, Jonathan S. Katz > <jonathan.katz@excoventures.com> wrote: >> >> On Jul 7, 2018, at 12:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> I tried running the below on both 11beta2 and HEAD with the patch >> applied: >> >> CREATE UNLOGGED TABLE t1 (c1 int); >> INSERT INTO t1 SELECT generate_series(1,100000000); >> /** restart PostgreSQL */ >> EXPLAIN (analyze,buffers,timing off,costs off) >> SELECT count(*) FROM t1; >> /** repeat above two queries */ >> >> I have identical postgresql.conf files on both instances as well. >> >> 11beta2 >> ====== >> buffers=# explain (analyze,buffers,timing off,costs off) select count(*) >> from t1; >> QUERY PLAN >> >> -------------------------------------------------------------------------- >> Finalize Aggregate (actual rows=1 loops=1) >> Buffers: shared read=63175 >> -> Gather (actual rows=7 loops=1) >> Workers Planned: 6 >> Workers Launched: 6 >> Buffers: shared read=63175 >> -> Partial Aggregate (actual rows=1 loops=7) >> Buffers: shared read=442478 >> -> Parallel Seq Scan on t1 (actual rows=14285714 >> loops=7) >> Buffers: shared read=442478 >> Planning Time: 1.422 ms >> Execution Time: 3214.407 ms >> (12 rows) >> >> buffers=# explain (analyze,buffers,timing off,costs off) select count(*) >> from t1; >> QUERY PLAN >> >> -------------------------------------------------------------------------- >> Finalize Aggregate (actual rows=1 loops=1) >> Buffers: shared hit=27 read=64960 >> -> Gather (actual rows=7 loops=1) >> Workers Planned: 6 >> Workers Launched: 6 >> Buffers: shared hit=27 read=64960 >> -> Partial Aggregate (actual rows=1 loops=7) >> Buffers: shared hit=224 read=442254 >> -> Parallel Seq Scan on t1 (actual rows=14285714 >> loops=7) >> Buffers: shared hit=224 read=442254 >> Planning Time: 0.080 ms >> Execution Time: 3774.091 ms >> (12 rows) >> >> HEAD >> ===== >> buffers=# explain (analyze,buffers,timing off,costs off) select count(*) >> from t1; >> QUERY PLAN >> >> -------------------------------------------------------------------------- >> Finalize Aggregate (actual rows=1 loops=1) >> Buffers: shared read=442478 >> -> Gather (actual rows=7 loops=1) >> Workers Planned: 6 >> Workers Launched: 6 >> Buffers: shared read=442478 >> -> Partial Aggregate (actual rows=1 loops=7) >> Buffers: shared read=442478 >> -> Parallel Seq Scan on t1 (actual rows=14285714 >> loops=7) >> Buffers: shared read=442478 >> Planning Time: 1.594 ms >> Execution Time: 6207.799 ms >> (12 rows) >> >> buffers=# explain (analyze,buffers,timing off,costs off) select count(*) >> from t1; >> QUERY PLAN >> >> -------------------------------------------------------------------------- >> Finalize Aggregate (actual rows=1 loops=1) >> Buffers: shared hit=224 read=442254 >> -> Gather (actual rows=7 loops=1) >> Workers Planned: 6 >> Workers Launched: 6 >> Buffers: shared hit=224 read=442254 >> -> Partial Aggregate (actual rows=1 loops=7) >> Buffers: shared hit=224 read=442254 >> -> Parallel Seq Scan on t1 (actual rows=14285714 >> loops=7) >> Buffers: shared hit=224 read=442254 >> Planning Time: 0.074 ms >> Execution Time: 5006.395 ms >> (12 rows) >> >> Notice the “read” numbers just before the “Finalize Aggregate” is much >> higher, >> > > You mean to say the number (Buffers: shared read=442478) in HEAD, > right? If so, yeah, I am also wondering why the results of the patch > are different in HEAD and 11beta2. 
So, if I read correctly, the > numbers in 11beta2 appears correct, but on HEAD it is not correct, it > should have divided the buffers read by loops. I will figure that > out, but this is just to clarify that both of us are seeing the > results in the same way. I’ll try it again but patch it against 11beta2 and see what results I get. > >> and there appears to be a performance hit. >> > > Do you mean to say the performance of the same query in 11beta2 and > HEAD or something else? Correct. But per your advice let me try running it against a patched version of 11beta2 and see what happens. Jonathan
On Thu, Jul 26, 2018 at 7:31 PM, Jonathan S. Katz <jonathan.katz@excoventures.com> wrote: > >> On Jul 25, 2018, at 10:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> >> You mean to say the number (Buffers: shared read=442478) in HEAD, >> right? If so, yeah, I am also wondering why the results of the patch >> are different in HEAD and 11beta2. So, if I read correctly, the >> numbers in 11beta2 appears correct, but on HEAD it is not correct, it >> should have divided the buffers read by loops. I want to correct myself here, the numbers on HEAD are correct, but not on PG11beta2. Is there any chance that either you forgot to apply the patch or maybe it is not using correct binaries in case of 11beta2. >> I will figure that >> out, but this is just to clarify that both of us are seeing the >> results in the same way. > > I’ll try it again but patch it against 11beta2 and see what results I get. > >> >>> and there appears to be a performance hit. >>> >> >> Do you mean to say the performance of the same query in 11beta2 and >> HEAD or something else? > > Correct. But per your advice let me try running it against a patched > version of 11beta2 and see what happens. > Yeah, that would be better. Today, I have tried the patch on both Head and PG11 and I am getting same and correct results. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
> On Jul 27, 2018, at 8:31 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 26, 2018 at 7:31 PM, Jonathan S. Katz
> <jonathan.katz@excoventures.com> wrote:
>>
>>> On Jul 25, 2018, at 10:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>
>>> You mean to say the number (Buffers: shared read=442478) in HEAD,
>>> right?  If so, yeah, I am also wondering why the results of the patch
>>> are different in HEAD and 11beta2.  So, if I read correctly, the
>>> numbers in 11beta2 appears correct, but on HEAD it is not correct, it
>>> should have divided the buffers read by loops.
>
> I want to correct myself here, the numbers on HEAD are correct, but
> not on PG11beta2.  Is there any chance that either you forgot to apply
> the patch or maybe it is not using correct binaries in case of
> 11beta2.

I need to correct myself too, as I was unclear - that was an *unpatched*
11beta2, sorry for the confusion.

>>> I will figure that
>>> out, but this is just to clarify that both of us are seeing the
>>> results in the same way.
>>
>> I'll try it again but patch it against 11beta2 and see what results I get.
>>
>>>> and there appears to be a performance hit.
>>>
>>> Do you mean to say the performance of the same query in 11beta2 and
>>> HEAD or something else?
>>
>> Correct.  But per your advice let me try running it against a patched
>> version of 11beta2 and see what happens.
>
> Yeah, that would be better.  Today, I have tried the patch on both
> Head and PG11 and I am getting same and correct results.

I have applied the patch to PG11beta2 and tested.  I received similar
numbers to the patched HEAD tests, e.g.:

=> EXPLAIN (analyze,buffers,timing off,costs off) SELECT count(*) FROM t1;
                                QUERY PLAN
--------------------------------------------------------------------------
 Finalize Aggregate (actual rows=1 loops=1)
   Buffers: shared hit=224 read=442254
   ->  Gather (actual rows=7 loops=1)
         Workers Planned: 6
         Workers Launched: 6
         Buffers: shared hit=224 read=442254
         ->  Partial Aggregate (actual rows=1 loops=7)
               Buffers: shared hit=224 read=442254
               ->  Parallel Seq Scan on t1 (actual rows=14285714 loops=7)
                     Buffers: shared hit=224 read=442254
 Planning Time: 0.104 ms
 Execution Time: 5308.140 ms

I ran the tests a few more times, and I understand why the execution
times vary (with an explanation from Andres), so I am satisfied.

Thanks for working on this!

Jonathan
On Fri, Jul 27, 2018 at 11:12 PM, Jonathan S. Katz <jonathan.katz@excoventures.com> wrote: > >> On Jul 27, 2018, at 8:31 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> >> Yeah, that would be better. Today, I have tried the patch on both >> Head and PG11 and I am getting same and correct results. > > I have applied the the patch to PG11beta2 and tested. > I think we should backpatch this to 9.6, where parallel query was introduced. Note that for HEAD and 11, the patch is the same. For PG10, the patch code is the same, but because the surrounding code is different, the same patch didn't apply. For 9.6, we don't need to collect stats in ExecShutdownNode. I have tested it in all the back branches and it works fine. Robert, We have done verification that the approach works and fixes the bug in all known cases. Do you see any problem with this approach? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
> On Jul 28, 2018, at 2:14 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jul 27, 2018 at 11:12 PM, Jonathan S. Katz > <jonathan.katz@excoventures.com> wrote: >> >>> On Jul 27, 2018, at 8:31 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> >>> >>> Yeah, that would be better. Today, I have tried the patch on both >>> Head and PG11 and I am getting same and correct results. >> >> I have applied the the patch to PG11beta2 and tested. >> > > I think we should backpatch this till 9.6 where the parallel query was > introduced. Note, that for HEAD and 11, the patch is same. For PG10, > the patch code is same, but because surrounding code is different, the > same patch didn't apply. For 9.6, we don't need to collect stats in > ExecShutdownNode. I have tested it in all the back branches and it > works fine. The logic on backpatching seems sound. I confirmed my tests with the respective patches against 9.6.9 and 10.4. I'll defer to someone else for comments on the code, but from my read it appears to be a consistent approach for each version. Jonathan
Hi, I'm not an expert in the area of the code, but here's a review anyway. I did not read through the entire thread. I think we should try to get this fixed soon, to make some progress towards release-ability. Or just declare it to be entirely unrelated to the release, and remove it from the open items list; not an unreasonable choice either. This has been an open item for quite a while... > diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c > index 60aaa822b7e..ac22bedf0e2 100644 > --- a/src/backend/executor/execParallel.c > +++ b/src/backend/executor/execParallel.c > @@ -979,9 +979,6 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc) > /* Report workers' query for monitoring purposes */ > pgstat_report_activity(STATE_RUNNING, debug_query_string); > > - /* Prepare to track buffer usage during query execution. */ > - InstrStartParallelQuery(); > - > /* Attach to the dynamic shared memory area. */ > area_space = shm_toc_lookup(toc, PARALLEL_KEY_DSA, false); > area = dsa_attach_in_place(area_space, seg); > @@ -993,6 +990,9 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc) > queryDesc->planstate->state->es_query_dsa = area; > ExecParallelInitializeWorker(queryDesc->planstate, toc); > > + /* Prepare to track buffer usage during query execution. */ > + InstrStartParallelQuery(); > + > /* Run the plan */ > ExecutorRun(queryDesc, ForwardScanDirection, 0L, true); Isn't this an independent change? And one with potentially negative side effects? I think there's some arguments for changing this (and some against), but I think it's a bad idea to do so in the same patch. > diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c > index 36d2914249c..a0d49ec0fba 100644 > --- a/src/backend/executor/execProcnode.c > +++ b/src/backend/executor/execProcnode.c > @@ -737,6 +737,13 @@ ExecShutdownNode(PlanState *node) > > planstate_tree_walker(node, ExecShutdownNode, NULL); > > + /* > + * Allow instrumentation to count stats collected during shutdown for > + * nodes that are executed atleast once. > + */ > + if (node->instrument && node->instrument->running) > + InstrStartNode(node->instrument); > + > switch (nodeTag(node)) > { > case T_GatherState: > @@ -755,5 +762,12 @@ ExecShutdownNode(PlanState *node) > break; > } > > + /* > + * Allow instrumentation to count stats collected during shutdown for > + * nodes that are executed atleast once. > + */ > + if (node->instrument && node->instrument->running) > + InstrStopNode(node->instrument, 0); > + > return false; > } Duplicating the comment doesn't seem like a good idea. Just say something like "/* see comment for InstrStartNode above */". > diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c > index ac5a2ff0e60..cf1851e235f 100644 > --- a/src/backend/executor/nodeLimit.c > +++ b/src/backend/executor/nodeLimit.c > @@ -134,6 +134,8 @@ ExecLimit(PlanState *pstate) > node->position - node->offset >= node->count) > { > node->lstate = LIMIT_WINDOWEND; > + /* Allow nodes to release or shut down resources. */ > + (void) ExecShutdownNode(outerPlan); > return NULL; > } > Um, is this actually correct? What if somebody rewinds afterwards? That won't happen for parallel query currently, but architecturally I don't think we can do so unconditionally? Greetings, Andres Freund
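For readers following the review: the reason bracketing the shutdown with InstrStartNode/InstrStopNode helps is that per-node buffer counts are normally collected by exactly this kind of bracketing around each ExecProcNode call. The following is a simplified sketch of that instrumentation wrapper, approximating what execProcnode.c does; it is meant to illustrate the pattern, not to reproduce any branch verbatim:

    static TupleTableSlot *
    ExecProcNodeInstr(PlanState *node)
    {
        TupleTableSlot *result;

        /* snapshot this backend's global pgBufferUsage counters, start timing */
        InstrStartNode(node->instrument);

        result = node->ExecProcNodeReal(node);

        /* fold the pgBufferUsage delta (and elapsed time, if enabled) into this node */
        InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);

        return result;
    }

The workers' buffer usage is folded into the leader's pgBufferUsage only when the parallel machinery is shut down (ExecParallelFinish); if that happens outside any such bracket, the workers' counts are never attributed to the Gather node, which is the symptom reported at the top of the thread. Wrapping ExecShutdownNode in the same InstrStartNode/InstrStopNode pair is what the quoted hunk is trying to achieve.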
On Tue, Jul 31, 2018 at 12:58 AM, Andres Freund <andres@anarazel.de> wrote: > Hi, > > I'm not an expert in the area of the code, but here's a review anyway. I > did not read through the entire thread. > > > I think we should try to get this fixed soon, to make some progress > towards release-ability. Or just declare it to be entirely unrelated to > the release, and remove it from the open items list; not an unreasonable > choice either. This has been an open item for quite a while... > I am okay either way, but I can work on getting it committed for this release. > >> diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c >> index 60aaa822b7e..ac22bedf0e2 100644 >> --- a/src/backend/executor/execParallel.c >> +++ b/src/backend/executor/execParallel.c >> @@ -979,9 +979,6 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc) >> /* Report workers' query for monitoring purposes */ >> pgstat_report_activity(STATE_RUNNING, debug_query_string); >> >> - /* Prepare to track buffer usage during query execution. */ >> - InstrStartParallelQuery(); >> - >> /* Attach to the dynamic shared memory area. */ >> area_space = shm_toc_lookup(toc, PARALLEL_KEY_DSA, false); >> area = dsa_attach_in_place(area_space, seg); >> @@ -993,6 +990,9 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc) >> queryDesc->planstate->state->es_query_dsa = area; >> ExecParallelInitializeWorker(queryDesc->planstate, toc); >> >> + /* Prepare to track buffer usage during query execution. */ >> + InstrStartParallelQuery(); >> + >> /* Run the plan */ >> ExecutorRun(queryDesc, ForwardScanDirection, 0L, true); > > Isn't this an independent change? And one with potentially negative > side effects? I think there's some arguments for changing this (and some > against), > This is to ensure that buffer usage stats are tracked consistently in master and worker backends. In the master backend, we don't track the similar stats for ExecutorStart phase. Also, it would help us give the same stats for buffer usage in parallel and non-parallel version of the query. > but I think it's a bad idea to do so in the same patch. > Agreed, we can do this part separately. > >> diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c >> index 36d2914249c..a0d49ec0fba 100644 >> --- a/src/backend/executor/execProcnode.c >> +++ b/src/backend/executor/execProcnode.c >> @@ -737,6 +737,13 @@ ExecShutdownNode(PlanState *node) >> >> planstate_tree_walker(node, ExecShutdownNode, NULL); >> >> + /* >> + * Allow instrumentation to count stats collected during shutdown for >> + * nodes that are executed atleast once. >> + */ >> + if (node->instrument && node->instrument->running) >> + InstrStartNode(node->instrument); >> + >> switch (nodeTag(node)) >> { >> case T_GatherState: >> @@ -755,5 +762,12 @@ ExecShutdownNode(PlanState *node) >> break; >> } >> >> + /* >> + * Allow instrumentation to count stats collected during shutdown for >> + * nodes that are executed atleast once. >> + */ >> + if (node->instrument && node->instrument->running) >> + InstrStopNode(node->instrument, 0); >> + >> return false; >> } > > Duplicating the comment doesn't seem like a good idea. Just say > something like "/* see comment for InstrStartNode above */". > Okay, will change in the next version. 
> >> diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c >> index ac5a2ff0e60..cf1851e235f 100644 >> --- a/src/backend/executor/nodeLimit.c >> +++ b/src/backend/executor/nodeLimit.c >> @@ -134,6 +134,8 @@ ExecLimit(PlanState *pstate) >> node->position - node->offset >= node->count) >> { >> node->lstate = LIMIT_WINDOWEND; >> + /* Allow nodes to release or shut down resources. */ >> + (void) ExecShutdownNode(outerPlan); >> return NULL; >> } >> > > Um, is this actually correct? What if somebody rewinds afterwards? > That won't happen for parallel query currently, but architecturally I > don't think we can do so unconditionally? > Currently, this is done only for forward scans (basically under ScanDirectionIsForward); is that enough, and if not, what kind of additional check do you think we need here? I am not sure at this stage how we will design it for backward scans, but I think we already do this at other places in the code; see the code below from ExecutePlan. ExecutePlan() { .. if (numberTuples && numberTuples == current_tuple_count) { /* Allow nodes to release or shut down resources. */ (void) ExecShutdownNode(planstate); break; } .. } -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
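To make the leader/worker symmetry mentioned in this message concrete, here is a condensed sketch of the parallel buffer-usage bookkeeping, approximating the PG11-era code in execParallel.c; setup, error handling and unrelated steps are omitted, and some variable names are abbreviations of the real ones:

    /* Worker side (ParallelQueryMain), with the proposed reordering: start
     * counting only once executor startup is done, as the leader effectively
     * does for its own counters. */
    InstrStartParallelQuery();      /* snapshot this worker's pgBufferUsage */
    ExecutorRun(queryDesc, ForwardScanDirection, 0L, true);
    ExecutorFinish(queryDesc);
    InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber]);  /* publish the delta */

    /* Leader side (ExecParallelFinish), reached from Gather's shutdown: fold
     * each worker's published counters back into the leader's pgBufferUsage. */
    for (i = 0; i < pei->pcxt->nworkers_launched; i++)
        InstrAccumParallelQuery(&pei->buffer_usage[i]);

The fix under discussion makes sure both steps land where EXPLAIN can see them: the worker's delta excludes executor startup, and the leader's accumulation happens while the Gather node's instrumentation is still running.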
On Sat, Jul 28, 2018 at 2:14 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > We have done verification that the approach works and fixes the bug in > all known cases. Do you see any problem with this approach? Regarding the change to execParallel.c, I think if you're going to move that code you should add a comment explaining the justification for the new positioning, like: /* * Prepare to track buffer usage during query execution. * * We do this after starting up the executor to match what happens in the * leader, which also doesn't count buffer accesses that occur during * executor startup. */ If you just make a change and move the code without adding a comment explaining why you moved it, the next person will wonder why they shouldn't just move it back when they hit some other problem. I also agree with Andres that this could be committed separately. Regarding the change to execProcnode.c, "atleast" should be "at least". But more broadly, I don't think the comment is very clear about the justification for what we're doing, and Andres is right that duplicating the comment isn't helpful. I've attempted to improve this in the attached version. Regarding Andres's comments about the changes to nodeLimit.c, I think he's right, but as you say, this isn't the first place to have this problem. The issue is that someone might be using cursor operations to fetch forward through the plan one tuple at a time, and then after triggering the shutdown, they might start fetching backward. That doesn't matter so long as shutdown callbacks are only used for shutting down stuff related to parallel query, because you can't fetch backwards from a Gather node or anything under it. But if we want to make broader use of shutdown callbacks -- and I think that would be a good idea -- then it's a problem. There's already a latent bug here, because custom scans and foreign scans are allowed to have shutdown callbacks (but postgres_fdw and file_fdw don't). Fixing that seems like a separate issue, and I think it would probably be OK to proceed with the change as you have it for now, but we ought to do something about it. I'm not sure there's a problem with rewinding, as Andres suggests: if the node is entirely rescanned, I believe it's allowed to regenerate the output, and Gather (Merge) can do that by firing up a new set of workers. But scanning backwards is a problem. I'm not exactly sure what the best way of handling that is, but one thing I think might work is to save ExecutePlan's execute_once flag in the EState and then make the call in nodeLimit.c and the one in ExecutePlan itself conditional on that flag. If we know that the plan is only going to be executed once, then there can never be any backward fetches and it's fine to shut down as soon as we finish going forward. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > ... But scanning backwards is > a problem. I'm not exactly sure what the best way of handling that > is, but one thing I think might work is to save ExecutePlan's > execute_once flag in the EState and then make the call in nodeLimit.c > and the one in ExecutePlan itself conditional on that flag. If we > know that the plan is only going to be executed once, then there can > never be any backward fetches and it's fine to shut down as soon as we > finish going forward. Shouldn't this be dealt with by an eflag bit passed down at executor init time? What you're describing sounds a lot like somebody invented a different way because they were unfamiliar with eflags. regards, tom lane
On Wed, Aug 1, 2018 at 10:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> ... But scanning backwards is >> a problem. I'm not exactly sure what the best way of handling that >> is, but one thing I think might work is to save ExecutePlan's >> execute_once flag in the EState and then make the call in nodeLimit.c >> and the one in ExecutePlan itself conditional on that flag. If we >> know that the plan is only going to be executed once, then there can >> never be any backward fetches and it's fine to shut down as soon as we >> finish going forward. > > Shouldn't this be dealt with by a eflag bit passed down at executor > init time? What you're describing sounds a lot like somebody > invented a different way because they were unfamiliar with eflags. Uh, yeah, maybe. I'm a bit unfamiliar with eflags. :-) Looking at it, this sounds suspiciously like something where we could just test EXEC_FLAG_BACKWARD, based on the comments for that flag, but I wouldn't be willing to bet very much on me being right about that. Can somebody try to fetch backward even if this flag wasn't provided? If so, we need to make sure that shutting down a node doesn't leave it in a state where a fetch attempt will crash. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Looking at it, this sounds suspiciously something where we could just > test EXEC_FLAG_BACKWARD, based on the comments for that flag, but I > wouldn't be willing to bet very much on me being right about that. > Can somebody try to fetch backward even if this flag wasn't provided? They're not supposed to, and it would be a bug of theirs not yours if they did. Other node types that rely on eflags being provided honestly generally just Assert that they're not asked for something else later. regards, tom lane
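To make the idiom Tom describes concrete: a node that cannot cope with backward fetch (or mark/restore) typically just asserts the contract at init time and never checks again at run time. Many ExecInit* functions begin with a check along these lines (shown purely as an illustration of the convention, not as code from the patch under discussion):

    /* check for unsupported flags */
    Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));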
On Wed, Aug 1, 2018 at 7:51 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sat, Jul 28, 2018 at 2:14 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> We have done verification that the approach works and fixes the bug in >> all known cases. Do you see any problem with this approach? > > Regarding the change to execParallel.c, I think if you're going to > move that code you should add a comment explaining the justification > for the new positioning, like: > > /* > * Prepare to track buffer usage during query execution. > * > * We do this after starting up the executor to match what happens in the > * leader, which also doesn't count buffer accesses that occur during > * executor startup. > */ > > If you just make a change and move the code without adding a comment > explaining why you moved it, the next person will wonder why they > shouldn't just move it back when they hit some other problem. I also > agree with Andres that this could be committed separately. > > Regarding the change to execProcnode.c, "atleast" should be "at > least". But more broadly, I don't think the comment is very clear > about the justification for what we're doing, and Andres is right that > duplication the comment isn't helpful. I've attempted to improve this > in the attached version. > > Regarding Andres's comments about the changes to nodeLimit.c, I think > he's right, but as you say, this isn't the first place to have this > problem. The issue is that someone might be using cursor operations > to fetch forward through the plan one tuple at a time, and then after > triggering the shutdown, they might start fetching backward. That > doesn't matter so long as shutdown callbacks are only used for > shutting down stuff related to parallel query, because you can't fetch > backwards from a Gather node or anything under it. > Yeah, and in the first place, we won't allow parallel query for cursors: CURSOR_OPT_PARALLEL_OK won't be set for them, and standard_planner will prohibit generating parallel plans. > But if we want to > make broader use of shutdown callbacks -- and I think that would be a > good idea -- then it's a problem. There's already a latent bug here, > because custom scans and foreign scans are allowed to have shutdown > callbacks (but postgres_fdw and file_fdw don't). > > Fixing that seems like a separate issue, and I think it would probably > be OK to proceed with the change as you have it for now, but we ought > to do something about it. I'm not sure there's a problem with > rewinding, as Andres suggests: if the node is entirely rescanned, I > believe it's allowed to regenerate the output, and Gather (Merge) can > do that by firing up a new set of workers. But scanning backwards is > a problem. I'm not exactly sure what the best way of handling that > is, but one thing I think might work is to save ExecutePlan's > execute_once flag in the EState and then make the call in nodeLimit.c > and the one in ExecutePlan itself conditional on that flag. If we > know that the plan is only going to be executed once, then there can > never be any backward fetches and it's fine to shut down as soon as we > finish going forward. > I think something along the lines of what Tom and you are suggesting can be done with the help of EXEC_FLAG_BACKWARD, but I don't see the need to do anything for this patch. The change in nodeLimit.c is anyway only for forward scans, so there shouldn't be any need for any other check.
If we agree that the current patch is sufficient for this bug fix, then I can work on splitting the patch and preparing the patches for the back-branches. As mentioned above, I think we should backpatch this to 9.6. Any objections to backpatching this fix? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Hi, On 2018-08-02 08:21:58 +0530, Amit Kapila wrote: > I think something on the lines what Tom and you are suggesting can be > done with the help of EXEC_FLAG_BACKWARD, but I don't see the need to > do anything for this patch. The change in nodeLimit.c is any way for > forward scans, so there shouldn't be any need for any other check. I think this is almost a guarantee to introduce bugs in the future. And besides that, as Robert points out, it's essentially an existing bug for custom scans. Given that EXEC_FLAG_BACKWARD already exists, why not do the right thing here? Greetings, Andres Freund
On Thu, Aug 2, 2018 at 8:38 AM, Andres Freund <andres@anarazel.de> wrote: > Hi, > > On 2018-08-02 08:21:58 +0530, Amit Kapila wrote: >> I think something on the lines what Tom and you are suggesting can be >> done with the help of EXEC_FLAG_BACKWARD, but I don't see the need to >> do anything for this patch. The change in nodeLimit.c is any way for >> forward scans, so there shouldn't be any need for any other check. > > I think this is almost a guarantee to introduce bugs in the future. And > besides that, as Robert points out, it's essentially an exiting bug for > custom scans. Given that EXEC_FLAG_BACKWARD already exists, why not do > the right thing here? > Sure, if we want, we can do it in this patch, but this is not a problem introduced by this patch; it also relates to the existing usage of shutdown callbacks. I think we can use es_top_eflags in EState to detect it and then call shutdown only if EXEC_FLAG_BACKWARD is not set. I think we should do that as a separate patch, either before or after this patch, rather than conflating that change into this patch. IIUC, Robert also said that we should fix that separately. I will proceed as per whatever consensus we reach here. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
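A minimal sketch of what this would look like at the nodeLimit.c call site, assuming the existing es_top_eflags field and the EXEC_FLAG_BACKWARD flag; this illustrates the idea rather than reproducing the eventual patch:

    /* In ExecLimit, once the limit has been exhausted: */
    node->lstate = LIMIT_WINDOWEND;

    /*
     * If the caller never asked for backward fetch, we know we won't need
     * to back up, so it is safe to release resources (for example, shut
     * down parallel workers below us) right away.
     */
    if (!(node->ps.state->es_top_eflags & EXEC_FLAG_BACKWARD))
        (void) ExecShutdownNode(outerPlan);

    return NULL;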
On Thu, Aug 2, 2018 at 10:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Aug 2, 2018 at 8:38 AM, Andres Freund <andres@anarazel.de> wrote: >> Hi, >> >> On 2018-08-02 08:21:58 +0530, Amit Kapila wrote: >>> I think something on the lines what Tom and you are suggesting can be >>> done with the help of EXEC_FLAG_BACKWARD, but I don't see the need to >>> do anything for this patch. The change in nodeLimit.c is any way for >>> forward scans, so there shouldn't be any need for any other check. >> >> I think this is almost a guarantee to introduce bugs in the future. And >> besides that, as Robert points out, it's essentially an exiting bug for >> custom scans. Given that EXEC_FLAG_BACKWARD already exists, why not do >> the right thing here? >> > > Sure, if we want we can do it in this patch, but this is not the > problem of this patch. It is also related to existing usage of > shutdown callbacks. I think we can use es_top_eflags in Estate to > detect it and then call shutdown only if EXEC_FLAG_BACKWARD is not > set. I think we should do that as a separate patch either before or > after this patch rather than conflating that change into this patch. > IIUC, then Robert also said that we should fix that separately. I > will do as per whatever consensus we reach here. > I have created three patches: (a) move InstrStartParallelQuery from its original location so that we perform it just before ExecutorRun; (b) a patch to fix the gather stats by calling shutdown at the appropriate place and allowing stats collection in ExecShutdownNode; (c) prohibit calling ExecShutdownNode if there is a possibility of backward scans (I have done some basic tests with this patch; if we decide to proceed with it, then some more verification and testing would be required). I think we should commit the first two patches, as they fix the problem being discussed in this thread, and then do some additional verification for the third patch (mentioned in option c). I can understand if people want to commit the third patch before the second patch, so let me know what you guys think. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Aug 2, 2018 at 5:41 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I have created three patches (a) move InstrStartParallelQuery from its > original location so that we perform it just before ExecutorRun (b) > patch to fix the gather stats by calling shutdown at appropriate place > and allow stats collection in ExecShutdownNode (c) Probit calling > ExecShutdownNode if there is a possibility of backward scans (I have > done some basic tests with this patch, if we decide to proceed with > it, then some more verification and testing would be required). > > I think we should commit first two patches as that fixes the problem > being discussed in this thread and then do some additional > verification for the third patch (mentioned in option c). I can > understand if people want to commit the third patch before the second > patch, so let me know what you guys think. I'm happy with the first two patches. In the third one, I don't think "See ExecLimit" is a good thing to put in a comment like this, because it's too hard to find the comment to which it refers, and because future commits are likely to edit or remove that comment without noticing the references to it from elsewhere. Instead I would just write, in all three places, /* If we know we won't need to back up, we can release resources at this point. */ or something like that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Aug 2, 2018 at 11:14 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Aug 2, 2018 at 5:41 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I have created three patches (a) move InstrStartParallelQuery from its >> original location so that we perform it just before ExecutorRun (b) >> patch to fix the gather stats by calling shutdown at appropriate place >> and allow stats collection in ExecShutdownNode (c) Probit calling >> ExecShutdownNode if there is a possibility of backward scans (I have >> done some basic tests with this patch, if we decide to proceed with >> it, then some more verification and testing would be required). >> >> I think we should commit first two patches as that fixes the problem >> being discussed in this thread and then do some additional >> verification for the third patch (mentioned in option c). I can >> understand if people want to commit the third patch before the second >> patch, so let me know what you guys think. > > I'm happy with the first two patches. > Thanks. I have pushed those two patches. > In the third one, I don't think > "See ExecLimit" is a good thing to put a comment like this, because > it's too hard to find the comment to which it refers, and because > future commits are likely to edit or remove that comment without > noticing the references to it from elsewhere. Instead I would just > write, in all three places, /* If we know we won't need to back up, we > can release resources at this point. */ or something like that. > Okay, I have changed the comment as per your suggestion in the attached patch. I will do some more testing/verification of this patch and will commit over the weekend or on Monday if everything is fine. I have noticed that part of the comment atop ExecShutdownNode is now redundant. See attached edit_comments_shutdown_node_v1, let me know if you think otherwise. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 08/03/2018 10:39 AM, Amit Kapila wrote: > Thanks. I have pushed those two patches. Thanks!
On Fri, Aug 3, 2018 at 4:39 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I have noticed that part of the comment atop ExecShutdownNode is now > redundant. See attached edit_comments_shutdown_node_v1, let me know > if you think otherwise. Oh, yes, that should be adjusted. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2018-Aug-03, Adrien NAYRAT wrote: > On 08/03/2018 10:39 AM, Amit Kapila wrote: > > Thanks. I have pushed those two patches. > > Thanks! I am marking this open item as closed. If that's wrong, let's hear why :-) Thanks, -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Aug 3, 2018 at 2:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Aug 2, 2018 at 11:14 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, Aug 2, 2018 at 5:41 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> I have created three patches (a) move InstrStartParallelQuery from its >>> original location so that we perform it just before ExecutorRun (b) >>> patch to fix the gather stats by calling shutdown at appropriate place >>> and allow stats collection in ExecShutdownNode (c) Probit calling >>> ExecShutdownNode if there is a possibility of backward scans (I have >>> done some basic tests with this patch, if we decide to proceed with >>> it, then some more verification and testing would be required). >>> >>> I think we should commit first two patches as that fixes the problem >>> being discussed in this thread and then do some additional >>> verification for the third patch (mentioned in option c). I can >>> understand if people want to commit the third patch before the second >>> patch, so let me know what you guys think. >> >> I'm happy with the first two patches. >> > > Thanks. I have pushed those two patches. > >> In the third one, I don't think >> "See ExecLimit" is a good thing to put a comment like this, because >> it's too hard to find the comment to which it refers, and because >> future commits are likely to edit or remove that comment without >> noticing the references to it from elsewhere. Instead I would just >> write, in all three places, /* If we know we won't need to back up, we >> can release resources at this point. */ or something like that. >> > > Okay, I have changed the comment as per your suggestion in the > attached patch. I will do some more testing/verification of this > patch and will commit over the weekend or on Monday if everything is > fine. > I have verified that the patch works whenever we use scrollable cursors. Please find the attached patch with the modified commit message. I think it is now a bit late for this minor release, and since this doesn't appear to be a blocker issue, it is better to push it after the release. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Aug 3, 2018 at 8:08 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > On 2018-Aug-03, Adrien NAYRAT wrote: > >> On 08/03/2018 10:39 AM, Amit Kapila wrote: >> > Thanks. I have pushed those two patches. >> >> Thanks! > > I am marking this open item as closed. If that's wrong, let's hear why > Thanks, it is not wrong. I initially thought I would close it after committing the pending patch which I just posted. However, it is okay, as that patch addresses a somewhat related but different issue. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sun, Aug 5, 2018 at 5:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Aug 3, 2018 at 2:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Thu, Aug 2, 2018 at 11:14 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> In the third one, I don't think >>> "See ExecLimit" is a good thing to put a comment like this, because >>> it's too hard to find the comment to which it refers, and because >>> future commits are likely to edit or remove that comment without >>> noticing the references to it from elsewhere. Instead I would just >>> write, in all three places, /* If we know we won't need to back up, we >>> can release resources at this point. */ or something like that. >>> >> >> Okay, I have changed the comment as per your suggestion in the >> attached patch. I will do some more testing/verification of this >> patch and will commit over the weekend or on Monday if everything is >> fine. >> > > I have verified that the patch works whenever we use scrollable > cursors. Please find the attached patch with the modified commit > message. I think now it is a bit late for this minor-release and this > doesn't appear to be a blocker issue, it is better to push it after > the release. > Pushed. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Aug 3, 2018 at 6:56 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Aug 3, 2018 at 4:39 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I have noticed that part of the comment atop ExecShutdownNode is now >> redundant. See attached edit_comments_shutdown_node_v1, let me know >> if you think otherwise. > > Oh, yes, that should be adjusted. > Pushed the patch for the same. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com