Thread: Planning performance problem (67626.278ms)
Hi,

We have a very strange query planning problem. Long story short, it takes 67626.278 ms just to plan the query. Execution takes 12 ms.

The query has 7 joins and 2 subselects. The issue does not look deterministic; sometimes it takes only a few ms to plan.

One of the tables has 550,485,942 live tuples and 743,504,012 dead tuples. Running ANALYZE on that table solves the problem only temporarily.

The question is: how can we debug what is going on?

Best Regards,
Krzysztof Płocharz
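For reference, planning and execution time can be measured separately with EXPLAIN ANALYZE; a minimal sketch (the query here is only a placeholder for the actual one):

EXPLAIN (ANALYZE, BUFFERS)
SELECT 1;
-- the output ends with "Planning Time: ..." and "Execution Time: ..." lines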
-----Original Message-----
From: Krzysztof Plocharz [mailto:plocharz@9livesdata.com]
Sent: Monday, April 08, 2019 10:11 AM
To: pgsql-performance@lists.postgresql.org
Subject: Planning performance problem (67626.278ms)

Why do you have to run ANALYZE? Did you turn off autovacuum?

Regards,
Igor Neyman
No, autovacuum is running.

On 2019/04/08 16:18, Igor Neyman wrote:
> Why do you have to run ANALYZE? Did you turn off autovacuum?
On Mon, Apr 8, 2019 at 4:11 PM Krzysztof Plocharz <plocharz@9livesdata.com> wrote:
> Hi
>
> We have a very strange query planning problem. Long story short, it
> takes 67626.278 ms just to plan the query. Execution takes 12 ms.
>
> The query has 7 joins and 2 subselects.
> The issue does not look deterministic; sometimes it takes only a few
> ms to plan.
>
> One of the tables has 550,485,942 live tuples and 743,504,012 dead
> tuples. Running ANALYZE on that table solves the problem only
> temporarily.
>
> The question is: how can we debug what is going on?

Please check your indexes for bloat. The planner gets the min and max
values from indexes, and that operation is slow on bloated indexes.

But 67 sec is really slow - it could be some other problem. Is it a
real computer or a virtual one?

> Best Regards,
> Krzysztof Płocharz
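One way to check a btree index for bloat, assuming the pgstattuple extension is available ('some_index' is a placeholder):

CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- low avg_leaf_density and high leaf_fragmentation suggest a bloated index
SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('some_index');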
On Mon, Apr 08, 2019 at 04:33:34PM +0200, Pavel Stehule wrote:
> Please check your indexes for bloat. The planner gets the min and max
> values from indexes, and that operation is slow on bloated indexes.

I think that's from get_actual_variable_range(), right?

If it's due to bloating, I think the first step would be to 1) vacuum
right now; and 2) set more aggressive autovacuum, like:

ALTER TABLE t SET (autovacuum_vacuum_scale_factor = 0.005);

What version is the Postgres server?

Justin
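To see which tables carry the most dead tuples and when autovacuum last touched them, a query against the standard statistics views is enough:

SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 5;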
On 2019/04/08 16:42, Justin Pryzby wrote:
> On Mon, Apr 08, 2019 at 04:33:34PM +0200, Pavel Stehule wrote:
>> Please check your indexes for bloat. The planner gets the min and max
>> values from indexes, and that operation is slow on bloated indexes.

Yes, we thought about this; there are over 700,000,000 dead tuples. But
as you said, it should not result in 67-second planning...

> I think that's from get_actual_variable_range(), right?
>
> If it's due to bloating, I think the first step would be to 1) vacuum
> right now; and 2) set more aggressive autovacuum, like:
> ALTER TABLE t SET (autovacuum_vacuum_scale_factor = 0.005);

We did pg_repack and it did help, but is it possible for
get_actual_variable_range() to take over 60 seconds?
Is there any other workaround for this except for pg_repack/vacuum?
Any way to actually debug this?
On 2019/04/08 16:33, Pavel Stehule wrote:
> Please check your indexes for bloat. The planner gets the min and max
> values from indexes, and that operation is slow on bloated indexes.

Yes, we thought about this; there are over 700,000,000 dead tuples. But
as you said, it should not result in 67-second planning...

> But 67 sec is really slow - it could be some other problem. Is it a
> real computer or a virtual one?

Real, with pretty good specs: NVMe drives, six-core AMD Opteron, 64 GB
of RAM. During testing the system was mostly idle.
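For reference, a typical pg_repack invocation for a single table might look like this (table and database names are placeholders):

pg_repack --table=big_table mydb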
On Mon, Apr 8, 2019 at 4:55 PM Krzysztof Plocharz <plocharz@9livesdata.com> wrote:
> We did pg_repack and it did help, but is it possible for
> get_actual_variable_range() to take over 60 seconds?
> Is there any other workaround for this except for pg_repack/vacuum?
> Any way to actually debug this?

You can use perf and get a profile.
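A minimal perf session along those lines might look like this (the pid is a placeholder for the backend doing the slow planning):

# attach to the backend while the slow EXPLAIN runs, then inspect
perf record --call-graph dwarf -p <backend_pid> -- sleep 30
perf report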
On Mon, Apr 08, 2019 at 04:55:36PM +0200, Krzysztof Plocharz wrote:
> We did pg_repack and it did help, but is it possible for
> get_actual_variable_range() to take over 60 seconds?

You have many tables being joined, perhaps in an exhaustive search, so
maybe that's being called many times.

What version is the Postgres server?

Justin
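For context, whether the join search is exhaustive is governed by a few settings, which can be checked directly:

SHOW geqo_threshold;       -- default 12; at this many FROM items, genetic search replaces exhaustive search
SHOW join_collapse_limit;  -- default 8
SHOW from_collapse_limit;  -- default 8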
On 4/8/19 07:42, Justin Pryzby wrote:
> I think that's from get_actual_variable_range(), right?

For what it's worth, I have seen a similar issue on Aurora PG 9.6 where
query planning took a very long time (multiple minutes). In that
particular case, there wasn't anything Aurora-specific about the call
to get_actual_variable_range. We weren't able to definitively identify
the root cause or build a reproducible test case -- but we suspect that
an inefficiency might exist in community PostgreSQL code.

For debugging, a few ideas:

1) capture a stack with pstack or perf record --call-graph
2) capture the execution plan of the SQL with slow planning
3) capture detailed stats for all relations and objects involved
4) capture the usual info for bug reporting (preface section in the docs)

A reproducible test case is the gold standard; I'm keeping my eyes open
for another case too.

For the slow planning case that I saw, the slow process was almost
entirely in this call stack (captured with perf record --call-graph):

...
index_fetch_heap
index_getnext
get_actual_variable_range
ineq_histogram_selectivity
scalarineqsel
mergejoinscansel
initial_cost_mergejoin
try_mergejoin_path
add_paths_to_joinrel
make_join_rel
join_search_one_level
standard_join_search
make_one_rel
query_planner
...

-Jeremy

--
Jeremy Schneider
Database Engineer
Amazon Web Services
Hi,

On 2019-04-08 16:10:17 -0700, Jeremy Schneider wrote:
> For the slow planning case that I saw, the slow process was almost
> entirely in this call stack (captured with perf record --call-graph):
> index_fetch_heap
> index_getnext
> get_actual_variable_range
> ...

I suspect some of this might be related to < 11 not having the
following commit:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3ca930fc39ccf987c1c22fd04a1e7463b5dd0dfd

Greetings,

Andres Freund
Two years later, I still remember this. And today I just confirmed
someone hitting this on open-source PG 13.

On the incident I saw today, the tuple reads during planning were
leading to excessive mxid waits - I suspect because memory pressure
probably caused the mxid offsets/members files to get flushed and start
being read from disk. It was the mxid waits that actually made the
system go sideways - I was just surprised that we saw most of the
system CPU in planning time (rather than execution).

I really think there's something here in the planner that's still
causing headaches for people; I've seen this a few times now. The
common theme looks to me like:

try_mergejoin_path
-> initial_cost_mergejoin
-> mergejoinscansel
-> scalarineqsel
-> ineq_histogram_selectivity
-> get_actual_variable_range

And from here it starts calling index_getnext(), which can go on for a
very long time, and the system seems to fall over if it begins to
involve much physical I/O.

I'll continue to keep an eye out for this, and keep this thread updated
if I find anything else that might move the understanding forward.

Thanks,
Jeremy Schneider

On 4/8/19 16:26, Andres Freund wrote:
> I suspect some of this might be related to < 11 not having the
> following commit:
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3ca930fc39ccf987c1c22fd04a1e7463b5dd0dfd

--
Jeremy Schneider
Database Engineer
Amazon Web Services
On Thu, 22 Apr 2021 at 00:03, Jeremy Schneider <schnjere@amazon.com> wrote:
>
> Two years later, I still remember this. And today I just confirmed
> someone hitting this on open-source PG 13.

The only thing that changed about get_actual_variable_range() is that
it now uses a SnapshotNonVacuumable snapshot. Previously, a
long-running transaction could have caused vacuum to be unable to
remove tuples, which could have made get_actual_variable_range() slow
if it had to skip over the unvacuumable tuples. That has now changed:
SnapshotNonVacuumable will see any tuples required by that
long-running transaction and use them to determine the range instead
of skipping over them.

Anyone with a large number of vacuum-removable tuples at either end of
the range of a column indexed by a btree index could still have
issues. Vacuuming more often might be a good thing to consider. With
the original report on this thread there were more dead tuples in the
table than live tuples. Disabling autovacuum, or tuning it so that it
waits that long, is likely a bad idea.

FWIW, here's a simple test case that shows the problem in current master.

create table a (a int primary key) with (autovacuum_enabled = off);
insert into a select x from generate_series(1,10000000) x;
analyze a;
delete from a;

\timing on
explain select * from a where a < 10000000;
                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on a  (cost=0.00..169247.71 rows=9998977 width=4)
   Filter: (a < 10000000)
(2 rows)

Time: 9062.600 ms (00:09.063)

vacuum a;
explain select * from a where a < 10000000;
                    QUERY PLAN
-------------------------------------------------
 Seq Scan on a  (cost=0.00..0.00 rows=1 width=4)
   Filter: (a < 10000000)
(2 rows)

Time: 2.665 ms

Notice that it became faster again after I did a vacuum.

David
David Rowley <dgrowleyml@gmail.com> writes:
> FWIW, here's a simple test case that shows the problem in current master.

This isn't telling the whole story. That first EXPLAIN did set the
killed bits in the index, so that subsequent ones are fairly fast,
even without VACUUM:

regression=# explain select * from a where a < 10000000;
                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on a  (cost=0.00..169247.71 rows=9998977 width=4)
   Filter: (a < 10000000)
(2 rows)

Time: 3711.089 ms (00:03.711)

regression=# explain select * from a where a < 10000000;
                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on a  (cost=0.00..169247.71 rows=9998977 width=4)
   Filter: (a < 10000000)
(2 rows)

Time: 230.094 ms

Admittedly this is still more than after VACUUM gets rid of the index
entries altogether:

regression=# vacuum a;
VACUUM
Time: 2559.571 ms (00:02.560)
regression=# explain select * from a where a < 10000000;
                    QUERY PLAN
-------------------------------------------------
 Seq Scan on a  (cost=0.00..0.00 rows=1 width=4)
   Filter: (a < 10000000)
(2 rows)

Time: 0.698 ms

However, I'm skeptical that any problem actually remains in
real-world use cases.

            regards, tom lane
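As an aside, those killed index entries can be observed directly with the pageinspect extension on v13 and later, where bt_page_items() exposes a 'dead' column; the page number below is just the first leaf page of the test table's primary key index:

CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT itemoffset, dead
FROM bt_page_items('a_pkey', 1)
LIMIT 5;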
> However, I'm skeptical that any problem actually remains in
> real-world use cases.

Hello Tom,

We also had some issues with planning and get_actual_variable_range().
We found some interesting behaviour that probably requires an eye with
better expertise in how the planner works.

For the example being discussed, you can add some joins into the
equation and planning times deteriorate quite a bit. I'll skip posting
the first executions, as it is already established that a subsequent
one will be faster.

create table b (b int primary key, a int references a(a))
  with (autovacuum_enabled = off);
insert into a select x from generate_series(1,10000000) x;
insert into b select x, x from generate_series(1,10000000) x;
create index b_a_idx on b(a);
analyze a, b;

In our case, a rollback of a bulk insert causes bloat on the index:

begin;
insert into a select x from generate_series(10000001,20000000) x;
rollback;

explain (analyze, buffers)
select * from a
join b on (b.a = a.a)
where b.a in (1,100,10000,1000000,1000001);

 Planning:
   Buffers: shared hit=9 read=27329
 Planning Time: 134.560 ms
 Execution Time: 0.100 ms

I see a lot of buffers being read for some reason (wasn't this fixed?),
and times are slow too. But it gets worse with each join added to the
select:

explain (analyze, buffers)
select * from a
join b b1 on (b1.a = a.a)
join b b2 on (b2.a = a.a)
where b1.a in (1,100,10000,1000000,1000001);

 Planning:
   Buffers: shared hit=38 read=81992
 Planning Time: 312.826 ms
 Execution Time: 0.131 ms

Just add a few more joins and it is a recipe for disaster. Apparently,
the planner isn't reusing the data boundaries across alternative plans.
It would be nicer if the planner remembered each column's boundaries
for later reuse (within the same planner execution).

Another thing that worries me is that even though the second run has
faster planning, it is still way slower than the case without lots of
bloat in the index.

And I don't think this is just an edge case. Rollbacks of bulk inserts
can be quite common, and joins are expected in a SQL database.

We had downtime due to how the planner works in this case.
Unfortunately, setting more aggressive vacuum settings won't fix our
problems: most of the read queries are issued to a replica. When the
issues with the planner start happening, CPU usage on that node goes to
100%, which interferes with the replication process. This means the
replica can neither get to a new checkpoint with a new live max value
in the index, nor delete the bloat that vacuum has already cleaned on
the leader server.

Oh, by the way, we're running version 13.2.

Regards,
Manuel
Hello everyone,

> Apparently, the planner isn't reusing the data boundaries across
> alternative plans. It would be nicer if the planner remembered each
> column's boundaries for later reuse (within the same planner execution).

I've written a very naive (and crappy) patch to show how adding
memoization to get_actual_variable_range() could help the planner in
scenarios with a big number of joins.

For the previous example,

> explain (analyze, buffers)
> select * from a
> join b b1 on (b1.a = a.a)
> join b b2 on (b2.a = a.a)
> where b1.a in (1,100,10000,1000000,1000001);

each time you add a join clause the planner has to read an extra ~55k
buffers and gets about 200 ms slower:

1 join
 Planning:
   Buffers: shared hit=9 read=27329
 Planning Time: 101.745 ms
 Execution Time: 0.082 ms

2 joins
 Planning:
   Buffers: shared hit=42 read=81988
 Planning Time: 303.237 ms
 Execution Time: 0.102 ms

3 joins
 Planning:
   Buffers: shared hit=94 read=136660
 Planning Time: 508.947 ms
 Execution Time: 0.155 ms

4 joins
 Planning:
   Buffers: shared hit=188 read=191322
 Planning Time: 710.981 ms
 Execution Time: 0.168 ms

After adding memoization, the cost in buffers remains constant and the
latency deteriorates only marginally (as expected) with each join:

1 join
 Planning:
   Buffers: shared hit=10 read=27328
 Planning Time: 97.889 ms
 Execution Time: 0.066 ms

2 joins
 Planning:
   Buffers: shared hit=7 read=27331
 Planning Time: 100.589 ms
 Execution Time: 0.111 ms

3 joins
 Planning:
   Buffers: shared hit=9 read=27329
 Planning Time: 105.669 ms
 Execution Time: 0.134 ms

4 joins
 Planning:
   Buffers: shared hit=132 read=27370
 Planning Time: 155.716 ms
 Execution Time: 0.219 ms

I'd be happy to improve this patch into something better, though I'd
like suggestions on how to do it. I have this idea of creating a local
"memoization" struct instance within standard_planner(). That would
require passing a pointer down until it reaches
get_actual_variable_range(), which I think would be quite ugly if done
just to improve the planner for this scenario.

Is there any better mechanism I could reuse from other modules (utils
or cache, for example)?

Regards,
Manuel
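Not the attached patch itself, but a rough sketch of the shape of the idea; the struct, its fields, and the cache list (which would have to hang off PlannerInfo) are all invented for illustration:

/* Hypothetical cache entry: extremal values already fetched for one index. */
typedef struct VariableRangeCacheEntry
{
    Oid     indexoid;   /* btree index the range was read from */
    Datum   min_value;
    Datum   max_value;
    bool    have_min;
    bool    have_max;
} VariableRangeCacheEntry;

/*
 * Sketch of a lookup inside get_actual_variable_range(): consult a list
 * of previously fetched ranges before touching the index at all.
 */
static VariableRangeCacheEntry *
variable_range_cache_lookup(List *cache, Oid indexoid)
{
    ListCell   *lc;

    foreach(lc, cache)
    {
        VariableRangeCacheEntry *entry = (VariableRangeCacheEntry *) lfirst(lc);

        if (entry->indexoid == indexoid)
            return entry;       /* hit: skip the index scan entirely */
    }
    return NULL;                /* miss: caller scans the index and appends */
}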
Manuel Weitzman <manuelweitzman@gmail.com> writes:
> I've written a very naive (and crappy) patch to show how adding
> memoization to get_actual_variable_range() could help the planner in
> scenarios with a big number of joins.

So ... the reason why there's no caching of get_actual_variable_range
results already is that I'd supposed it wouldn't be necessary given
the caching of selectivity estimates that happens at the RestrictInfo
level. I don't have any objection in principle to adding another
caching layer if that one's not working well enough, but I think it'd
be wise to first understand why it's needed.

            regards, tom lane
On Sun, Jun 20, 2021 at 2:50 PM Manuel Weitzman <manuelweitzman@gmail.com> wrote:
> I've written a very naive (and crappy) patch to show how adding
> memoization to get_actual_variable_range() could help the planner in
> scenarios with a big number of joins. [...]
> Is there any better mechanism I could reuse from other modules (utils
> or cache, for example)?
Without going into the merits of whether this cache will be adopted or not,
I have some comments about the code.
1. Prefer .patch instead of .diff; it makes it easier for browsers such
as Firefox to read and show the content automatically.

2. New struct: Oid is unsigned int, smaller than int64, so a better
field order is:

+struct ActualVariableRangeCache {
+    int64 min_value;  /* 8 bytes */
+    int64 max_value;  /* 8 bytes */
+    Oid   indexoid;   /* 4 bytes */
+    bool  has_min;    /* 1 byte */
+    bool  has_max;    /* 1 byte */
+};

This takes up less space.

3. Avoid use of the type *long*; it is very problematic on 64 bits.
On 64-bit Windows, long is 4 (four) bytes.
On 64-bit Linux, long is 8 (eight) bytes.

4. Avoid C99-style declarations:

for (unsigned long i = 0;)

Prefer:

size_t i;
for (i = 0;)

This helps with backpatching to C89 versions.
Ranier Vilela
Ranier Vilela <ranier.vf@gmail.com> writes:
> 3. Avoid use of the type *long*; it is very problematic on 64 bits.
> On 64-bit Windows, long is 4 (four) bytes.
> On 64-bit Linux, long is 8 (eight) bytes.

Agreed.

> 4. Avoid C99-style declarations:
> for (unsigned long i = 0;)
> Prefer:
> size_t i;
> for (i = 0;)
> This helps with backpatching to C89 versions.

It seems unlikely that we'd consider back-patching this into pre-C99
branches, so I see no reason not to use C99 loop style. (But do keep
in mind that we avoid most other C99-isms, such as intermixed decls
and code.)

            regards, tom lane
> On 20-06-2021, at 17:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> So ... the reason why there's no caching of get_actual_variable_range
> results already is that I'd supposed it wouldn't be necessary given
> the caching of selectivity estimates that happens at the RestrictInfo
> level.

For what I could make out from the code, the caching done at the
RestrictInfo level is already saving a lot of work, but there's a
different RestrictInfo instance for each alternative path created by
make_one_rel().

I wouldn't know how to reuse instances of RestrictInfo, or whether
that would even be the correct approach; my guess is that doing so
would be incorrect.

I'll improve the patch with the suggestions from Ranier and you in the
meantime.

Best regards,
Manuel
Manuel Weitzman <manuelweitzman@gmail.com> writes:
> For what I could make out from the code, the caching done at the
> RestrictInfo level is already saving a lot of work, but there's a
> different RestrictInfo instance for each alternative path created by
> make_one_rel().

That seems a bit broken; a given WHERE clause should produce only one
RestrictInfo. Can you provide a more concrete example?

            regards, tom lane
> On 29-06-2021, at 15:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> That seems a bit broken; a given WHERE clause should produce only one
> RestrictInfo. Can you provide a more concrete example?

I added some logging to see hits and misses on cached_scansel() for
this query:

> explain (analyze, buffers)
> select * from a
> join b b1 on (b1.a = a.a)
> join b b2 on (b2.a = a.a)
> where b1.a in (1,100,10000,1000000,1000001);

Apparently there's a RestrictInfo for each possible way of doing a
merge join (are those created dynamically during planning?), for
example:

- a join (b1 join b2)
- b1 join (a join b2)
- b2 join (a join b1)

When the cost of a possible merge join path hasn't been computed yet,
mergejoinscansel() has to check the bloated index again.

I attached a patch so you can see the hits and misses on
cached_scansel(). Each time a miss is logged, there's also a different
RestrictInfo pointer involved.

Best regards,
Manuel
Manuel Weitzman <manuelweitzman@gmail.com> writes:
> On 29-06-2021, at 15:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That seems a bit broken; a given WHERE clause should produce only one
>> RestrictInfo. Can you provide a more concrete example?

>> explain (analyze, buffers)
>> select * from a
>> join b b1 on (b1.a = a.a)
>> join b b2 on (b2.a = a.a)
>> where b1.a in (1,100,10000,1000000,1000001);

Hm. By my count, this example generates 3 RestrictInfos during
deconstruct_jointree, representing the three original clauses from the
query, and then 4 more in generate_join_implied_equalities,
representing the EC-derived clauses

    a.a = b1.a
    a.a = b2.a
    b1.a = b2.a
    b1.a = a.a

The third of these seems legit enough; it's a new fact that we want to
apply while considering joining b1 directly to b2. The other ones get
made despite create_join_clause's efforts to avoid making duplicate
clauses, because of two things:

1. create_join_clause doesn't trouble to look for commuted
equivalents, which perhaps is penny-wise and pound-foolish. The cost
of re-deriving selectivity estimates could be way more than the cost
of checking this.

2. Although these look like they ought to be equivalent to the
original clauses (modulo commutation, for some of them), they don't
look that way to create_join_clause, because it's also checking for
parent_ec equality. Per the comment,

 * parent_ec is either equal to ec (if the clause is a potentially-redundant
 * join clause) or NULL (if not). We have to treat this as part of the
 * match requirements --- it's possible that a clause comparing the same two
 * EMs is a join clause in one join path and a restriction clause in another.

It might be worth digging into the git history to see why that became
a thing and then considering whether there's a way around it. (I'm
pretty sure that comment is mine, but I don't recall the details
anymore.)

Anyway, it's certainly not the case that we're making new
RestrictInfos for every pair of rels. It looks that way in this
example because the join vars all belong to the same EC, but that
typically wouldn't be the case in more complex queries.

So we could look into whether this code can be improved to share
RestrictInfos across more cases. Another thought is that even if we
need to keep original and derived clauses separate, maybe it'd be all
right to copy previously-determined cached selectivity values from an
original clause to an otherwise-identical derived clause
(cf. commute_restrictinfo()). I'm not sure though whether it's
reliably the case that we'd have filled in selectivities for the
original clauses before this code wants to clone them.

            regards, tom lane
> 1. create_join_clause doesn't trouble to look for commuted
> equivalents, which perhaps is penny-wise and pound-foolish.
> The cost of re-deriving selectivity estimates could be way
> more than the cost of checking this.

Agreed.

> 2. Although these look like they ought to be equivalent to the
> original clauses (modulo commutation, for some of them), they don't
> look that way to create_join_clause, because it's also checking
> for parent_ec equality. Per the comment,
>
> * parent_ec is either equal to ec (if the clause is a potentially-redundant
> * join clause) or NULL (if not). We have to treat this as part of the
> * match requirements --- it's possible that a clause comparing the same two
> * EMs is a join clause in one join path and a restriction clause in another.
>
> It might be worth digging into the git history to see why that
> became a thing and then considering whether there's a way around it.

To me that sounds OK. I cannot prove that they're equivalent to the
original clauses, so I think it is fine to assume they're not (not an
expert here, quite the opposite).

> Anyway, it's certainly not the case that we're making new
> RestrictInfos for every pair of rels. It looks that way in this
> example because the join vars all belong to the same EC, but
> that typically wouldn't be the case in more complex queries.

Good to know, this wasn't clear to me.

> So we could look into whether this code can be improved to share
> RestrictInfos across more cases. Another thought is that even
> if we need to keep original and derived clauses separate, maybe it'd
> be all right to copy previously-determined cached selectivity values
> from an original clause to an otherwise-identical derived clause
> (cf. commute_restrictinfo()).

To be honest, even if that sounds like a good idea to dig into, I
think it wouldn't completely solve the problem of repeated calls to
get_actual_variable_range().

The example query I gave does a lot of simple self-joins, which makes
the thought process simpler, but I worry more about the more "common"
case in which more than 2 distinct tables are involved in the query.

For example, instead of having "b1, b2, ..., bn" as aliases of "b" in
this query

>> explain (analyze, buffers)
>> select * from a
>> join b b1 on (b1.a = a.a)
>> join b b2 on (b2.a = a.a)
>> where b1.a in (1,100,10000,1000000,1000001);

it is also possible to reproduce the increasing cost in planning
buffers for each new join on a distinct table being added:

explain (analyze, buffers)
select * from a
join b on (b.a = a.a)
join c on (c.a = a.a)
-- ... (etc)
where c.a in (1,100,10000,1000000,1000001);

I can imagine that deconstruct_jointree() and
generate_join_implied_equalities() would generate multiple
RestrictInfos, many of which would involve a constraint on a.a (each
involving a different table):

b.a = a.a
c.a = a.a
c.a = b.a
a.a = b.a
a.a = c.a
... (etc)

(If we wanted, we could also add a different WHERE clause on each of
the tables involved to make really sure all RestrictInfos are
different.)

For each of these RestrictInfos there *could* be one cache miss on
cached_scansel() that *could* force the planner to compute
get_actual_variable_range() for the same variable (a.a) over and over,
as mergejoinscansel() always computes the selectivity for the
intervals that require actual extremal values. In practice this
re-computation of the variable range seems to happen a lot.

One way in which I see it possible to share this kind of information
(of extremal values) across RestrictInfos is to store the known
variable ranges in PlannerInfo (or within a member of that struct),
which seems to be available everywhere it would be needed.

Best regards,
Manuel
> On 30-06-2021, at 16:56, Manuel Weitzman <manuelweitzman@gmail.com> wrote:
>
> One way in which I see it possible to share this kind of information
> (of extremal values) across RestrictInfos is to store the known
> variable ranges in PlannerInfo (or within a member of that struct),
> which seems to be available everywhere it would be needed.

I have written a new patch that is (hopefully) better than the first
one I sent, storing the extremal index values within PlannerInfo.

> it is also possible to reproduce the increasing cost in planning
> buffers for each new join on a distinct table being added:
>
> [...]
>
> I can imagine that deconstruct_jointree() and
> generate_join_implied_equalities() would generate multiple
> RestrictInfos, many of which would involve a constraint on a.a (each
> involving a different table).

I also attached an example in which RestrictInfos are generated for
multiple tables instead of just a single aliased one. The buffers read
for planning also increase with each join added to the query.

Best regards,
Manuel
On Thu, 1 Jul 2021 at 08:56, Manuel Weitzman <manuelweitzman@gmail.com> wrote:
> For each of these RestrictInfos there *could* be one cache miss on
> cached_scansel() that *could* force the planner to compute
> get_actual_variable_range() for the same variable (a.a) over and over,
> as mergejoinscansel() always computes the selectivity for the
> intervals that require actual extremal values. In practice this
> re-computation of the variable range seems to happen a lot.

Recently, for some other topic, I was thinking about what would happen
if we were ever to have the executor give up on a plan because
something did not play out the way the planner expected it to. If the
executor were ever given the ability to throw the plan back at the
planner with some hints about where it went wrong, I wondered what
exactly those hints would look like. I guessed they would be a list of
rough guidelines, e.g. saying that x JOIN y ON x.a = y.b produces at
least Z rows; the planner would then consult that feedback when
estimating the size of the join between x and y and take the maximum
of the two values. It would need to be designed in such a way that the
planner could consult the feedback quickly, e.g. a hashtable lookup on
Oid.

Anyway, I don't really have any clearly-thought-through plans for that
idea, as it would be a large project needing a lot of changes before
it could even be considered seriously. However, it did cause me to
think about it again when reading this thread, as this seems similar:
you've learned the actual variable range through actual execution, so
it does not seem too unreasonable that this information might get
stored in a similar place, if that place were to exist.

I'm not saying that's what needs to happen here; it's more just food
for thought. That said, the caching you need does seem more specific
to an index Oid than to a relid.

David