Thread: Planning performance problem (67626.278ms)
Hi,

We have a very strange query planning problem. Long story short, it takes 67626.278 ms just to plan the query. Execution takes 12 ms.

The query has 7 joins and 2 subselects. The issue does not look deterministic; sometimes it takes only a few ms to plan.

One of the tables has 550,485,942 live tuples and 743,504,012 dead tuples. Running ANALYZE on that table solves the problem only temporarily.

The question is: how can we debug what is going on?

Best Regards,
Krzysztof Płocharz
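For reference, planning and execution time can be measured separately with EXPLAIN ANALYZE; a minimal sketch (the query here is only a placeholder for the actual one):

EXPLAIN (ANALYZE, BUFFERS)
SELECT 1;
-- the output ends with "Planning Time: ..." and "Execution Time: ..." lines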
-----Original Message-----
From: Krzysztof Plocharz [mailto:plocharz@9livesdata.com]
Sent: Monday, April 08, 2019 10:11 AM
To: pgsql-performance@lists.postgresql.org
Subject: Planning performance problem (67626.278ms)

Why do you have to run ANALYZE? Did you turn off autovacuum?

Regards,
Igor Neyman
No, autovacuum is running.

On 2019/04/08 16:18, Igor Neyman wrote:
> Why do you have to run ANALYZE? Did you turn off autovacuum?
On Mon, Apr 8, 2019 at 4:11 PM Krzysztof Plocharz <plocharz@9livesdata.com> wrote:
> Hi
>
> We have a very strange query planning problem. Long story short, it
> takes 67626.278 ms just to plan the query. Execution takes 12 ms.
>
> The query has 7 joins and 2 subselects.
> The issue does not look deterministic; sometimes it takes only a few
> ms to plan.
>
> One of the tables has 550,485,942 live tuples and 743,504,012 dead
> tuples. Running ANALYZE on that table solves the problem only
> temporarily.
>
> The question is: how can we debug what is going on?

Please check your indexes for bloat. The planner gets the min and max
values from indexes, and that operation is slow on bloated indexes.

But 67 sec is really slow - it could be some other problem. Is it a
real computer or a virtual one?

> Best Regards,
> Krzysztof Płocharz
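One way to check a btree index for bloat, assuming the pgstattuple extension is available ('some_index' is a placeholder):

CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- low avg_leaf_density and high leaf_fragmentation suggest a bloated index
SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('some_index');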
On Mon, Apr 08, 2019 at 04:33:34PM +0200, Pavel Stehule wrote:
> Please check your indexes for bloat. The planner gets the min and max
> values from indexes, and that operation is slow on bloated indexes.

I think that's from get_actual_variable_range(), right?

If it's due to bloating, I think the first step would be to 1) vacuum
right now; and 2) set more aggressive autovacuum, like:

ALTER TABLE t SET (autovacuum_vacuum_scale_factor = 0.005);

What version is the Postgres server?

Justin
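To see which tables carry the most dead tuples and when autovacuum last touched them, a query against the standard statistics views is enough:

SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 5;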
On 2019/04/08 16:42, Justin Pryzby wrote:
> On Mon, Apr 08, 2019 at 04:33:34PM +0200, Pavel Stehule wrote:
>> Please check your indexes for bloat. The planner gets the min and max
>> values from indexes, and that operation is slow on bloated indexes.

Yes, we thought about this; there are over 700,000,000 dead tuples. But
as you said, it should not result in 67-second planning...

> I think that's from get_actual_variable_range(), right?
>
> If it's due to bloating, I think the first step would be to 1) vacuum
> right now; and 2) set more aggressive autovacuum, like:
> ALTER TABLE t SET (autovacuum_vacuum_scale_factor = 0.005);

We did pg_repack and it did help, but is it possible for
get_actual_variable_range() to take over 60 seconds?
Is there any other workaround for this except for pg_repack/vacuum?
Any way to actually debug this?
On 2019/04/08 16:33, Pavel Stehule wrote:
> Please check your indexes for bloat. The planner gets the min and max
> values from indexes, and that operation is slow on bloated indexes.

Yes, we thought about this; there are over 700,000,000 dead tuples. But
as you said, it should not result in 67-second planning...

> But 67 sec is really slow - it could be some other problem. Is it a
> real computer or a virtual one?

Real, with pretty good specs: NVMe drives, six-core AMD Opteron, 64 GB
of RAM. During testing the system was mostly idle.
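For reference, a typical pg_repack invocation for a single table might look like this (table and database names are placeholders):

pg_repack --table=big_table mydb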
On Mon, Apr 8, 2019 at 4:55 PM Krzysztof Plocharz <plocharz@9livesdata.com> wrote:
> We did pg_repack and it did help, but is it possible for
> get_actual_variable_range() to take over 60 seconds?
> Is there any other workaround for this except for pg_repack/vacuum?
> Any way to actually debug this?

You can use perf and get a profile.
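A minimal perf session along those lines might look like this (the pid is a placeholder for the backend doing the slow planning):

# attach to the backend while the slow EXPLAIN runs, then inspect
perf record --call-graph dwarf -p <backend_pid> -- sleep 30
perf report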
On Mon, Apr 08, 2019 at 04:55:36PM +0200, Krzysztof Plocharz wrote:
> We did pg_repack and it did help, but is it possible for
> get_actual_variable_range() to take over 60 seconds?

You have many tables being joined, perhaps in an exhaustive search, so
maybe that's being called many times.

What version is the Postgres server?

Justin
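For context, whether the join search is exhaustive is governed by a few settings, which can be checked directly:

SHOW geqo_threshold;       -- default 12; at this many FROM items, genetic search replaces exhaustive search
SHOW join_collapse_limit;  -- default 8
SHOW from_collapse_limit;  -- default 8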
On 4/8/19 07:42, Justin Pryzby wrote:
> I think that's from get_actual_variable_range(), right?

For what it's worth, I have seen a similar issue on Aurora PG 9.6 where
query planning took a very long time (multiple minutes). In that
particular case, there wasn't anything Aurora-specific about the call
to get_actual_variable_range. We weren't able to definitively identify
the root cause or build a reproducible test case -- but we suspect that
an inefficiency might exist in community PostgreSQL code.

For debugging, a few ideas:

1) capture a stack with pstack or perf record --call-graph
2) capture the execution plan of the SQL with slow planning
3) capture detailed stats for all relations and objects involved
4) capture the usual info for bug reporting (preface section in the docs)

A reproducible test case is the gold standard; I'm keeping my eyes open
for another case too.

For the slow planning case that I saw, the slow process was almost
entirely in this call stack (captured with perf record --call-graph):

...
index_fetch_heap
index_getnext
get_actual_variable_range
ineq_histogram_selectivity
scalarineqsel
mergejoinscansel
initial_cost_mergejoin
try_mergejoin_path
add_paths_to_joinrel
make_join_rel
join_search_one_level
standard_join_search
make_one_rel
query_planner
...

-Jeremy

--
Jeremy Schneider
Database Engineer
Amazon Web Services
Hi,

On 2019-04-08 16:10:17 -0700, Jeremy Schneider wrote:
> For the slow planning case that I saw, the slow process was almost
> entirely in this call stack (captured with perf record --call-graph):
> index_fetch_heap
> index_getnext
> get_actual_variable_range
> ...

I suspect some of this might be related to < 11 not having the
following commit:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3ca930fc39ccf987c1c22fd04a1e7463b5dd0dfd

Greetings,

Andres Freund
Two years later, I still remember this. And today I just confirmed
someone hitting this on open-source PG 13.

On the incident I saw today, the tuple reads during planning were
leading to excessive mxid waits - I suspect because memory pressure
probably caused the mxid offsets/members files to get flushed and start
being read from disk. It was the mxid waits that actually made the
system go sideways - I was just surprised that we saw most of the
system CPU in planning time (rather than execution).

I really think there's something here in the planner that's still
causing headaches for people; I've seen this a few times now. The
common theme looks to me like:

try_mergejoin_path
-> initial_cost_mergejoin
-> mergejoinscansel
-> scalarineqsel
-> ineq_histogram_selectivity
-> get_actual_variable_range

And from here it starts calling index_getnext(), which can go on for a
very long time, and the system seems to fall over if it begins to
involve much physical I/O.

I'll continue to keep an eye out for this, and keep this thread updated
if I find anything else that might move the understanding forward.

Thanks,
Jeremy Schneider

On 4/8/19 16:26, Andres Freund wrote:
> I suspect some of this might be related to < 11 not having the
> following commit:
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3ca930fc39ccf987c1c22fd04a1e7463b5dd0dfd

--
Jeremy Schneider
Database Engineer
Amazon Web Services
On Thu, 22 Apr 2021 at 00:03, Jeremy Schneider <schnjere@amazon.com> wrote:
>
> Two years later, I still remember this. And today I just confirmed
> someone hitting this on open-source PG 13.

The only thing that changed about get_actual_variable_range() is that
it now uses a SnapshotNonVacuumable snapshot. Previously, a
long-running transaction could have caused vacuum to be unable to
remove tuples, which could have made get_actual_variable_range() slow
if it had to skip over the unvacuumable tuples. That has now changed:
SnapshotNonVacuumable will see any tuples required by that
long-running transaction and use them to determine the range instead
of skipping over them.

Anyone with a large number of vacuum-removable tuples at either end of
the range of a column indexed by a btree index could still have
issues. Vacuuming more often might be a good thing to consider. With
the original report on this thread there were more dead tuples in the
table than live tuples. Disabling autovacuum, or tuning it so that it
waits that long, is likely a bad idea.

FWIW, here's a simple test case that shows the problem in current master.

create table a (a int primary key) with (autovacuum_enabled = off);
insert into a select x from generate_series(1,10000000) x;
analyze a;
delete from a;

\timing on
explain select * from a where a < 10000000;
                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on a  (cost=0.00..169247.71 rows=9998977 width=4)
   Filter: (a < 10000000)
(2 rows)

Time: 9062.600 ms (00:09.063)

vacuum a;
explain select * from a where a < 10000000;
                    QUERY PLAN
-------------------------------------------------
 Seq Scan on a  (cost=0.00..0.00 rows=1 width=4)
   Filter: (a < 10000000)
(2 rows)

Time: 2.665 ms

Notice that it became faster again after I did a vacuum.

David
David Rowley <dgrowleyml@gmail.com> writes:
> FWIW, here's a simple test case that shows the problem in current master.

This isn't telling the whole story. That first EXPLAIN did set the
killed bits in the index, so that subsequent ones are fairly fast,
even without VACUUM:

regression=# explain select * from a where a < 10000000;
                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on a  (cost=0.00..169247.71 rows=9998977 width=4)
   Filter: (a < 10000000)
(2 rows)

Time: 3711.089 ms (00:03.711)

regression=# explain select * from a where a < 10000000;
                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on a  (cost=0.00..169247.71 rows=9998977 width=4)
   Filter: (a < 10000000)
(2 rows)

Time: 230.094 ms

Admittedly this is still more than after VACUUM gets rid of the index
entries altogether:

regression=# vacuum a;
VACUUM
Time: 2559.571 ms (00:02.560)
regression=# explain select * from a where a < 10000000;
                    QUERY PLAN
-------------------------------------------------
 Seq Scan on a  (cost=0.00..0.00 rows=1 width=4)
   Filter: (a < 10000000)
(2 rows)

Time: 0.698 ms

However, I'm skeptical that any problem actually remains in
real-world use cases.

            regards, tom lane
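As an aside, those killed index entries can be observed directly with the pageinspect extension on v13 and later, where bt_page_items() exposes a 'dead' column; the page number below is just the first leaf page of the test table's primary key index:

CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT itemoffset, dead
FROM bt_page_items('a_pkey', 1)
LIMIT 5;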
> However, I'm skeptical that any problem actually remains in
> real-world use cases.

Hello Tom,

We also had some issues with planning and get_actual_variable_range().
We found some interesting behaviour that probably requires an eye with
better expertise in how the planner works.

For the example being discussed, you can add some joins into the
equation and planning times deteriorate quite a bit. I'll skip posting
the first executions, as it is already established that a subsequent
one will be faster.

create table b (b int primary key, a int references a(a))
  with (autovacuum_enabled = off);
insert into a select x from generate_series(1,10000000) x;
insert into b select x, x from generate_series(1,10000000) x;
create index b_a_idx on b(a);
analyze a, b;

In our case, a rollback of a bulk insert causes bloat on the index:

begin;
insert into a select x from generate_series(10000001,20000000) x;
rollback;

explain (analyze, buffers)
select * from a
join b on (b.a = a.a)
where b.a in (1,100,10000,1000000,1000001);

 Planning:
   Buffers: shared hit=9 read=27329
 Planning Time: 134.560 ms
 Execution Time: 0.100 ms

I see a lot of buffers being read for some reason (wasn't this fixed?),
and times are slow too. But it gets worse with each join added to the
select:

explain (analyze, buffers)
select * from a
join b b1 on (b1.a = a.a)
join b b2 on (b2.a = a.a)
where b1.a in (1,100,10000,1000000,1000001);

 Planning:
   Buffers: shared hit=38 read=81992
 Planning Time: 312.826 ms
 Execution Time: 0.131 ms

Just add a few more joins and it is a recipe for disaster. Apparently,
the planner isn't reusing the data boundaries across alternative plans.
It would be nicer if the planner remembered each column's boundaries
for later reuse (within the same planner execution).

Another thing that worries me is that even though the second run has
faster planning, it is still way slower than the case without lots of
bloat in the index.

And I don't think this is just an edge case. Rollbacks of bulk inserts
can be quite common, and joins are expected in a SQL database.

We had downtime due to how the planner works in this case.
Unfortunately, setting more aggressive vacuum settings won't fix our
problems: most of the read queries are issued to a replica. When the
issues with the planner start happening, CPU usage on that node goes to
100%, which interferes with the replication process. This means the
replica can neither get to a new checkpoint with a new live max value
in the index, nor delete the bloat that vacuum has already cleaned on
the leader server.

Oh, by the way, we're running version 13.2.

Regards,
Manuel
Hello everyone,

> Apparently, the planner isn't reusing the data boundaries across
> alternative plans. It would be nicer if the planner remembered each
> column's boundaries for later reuse (within the same planner execution).

I've written a very naive (and crappy) patch to show how adding
memoization to get_actual_variable_range() could help the planner in
scenarios with a big number of joins.

For the previous example,

> explain (analyze, buffers)
> select * from a
> join b b1 on (b1.a = a.a)
> join b b2 on (b2.a = a.a)
> where b1.a in (1,100,10000,1000000,1000001);

each time you add a join clause the planner has to read an extra ~55k
buffers and gets about 200 ms slower:

1 join
 Planning:
   Buffers: shared hit=9 read=27329
 Planning Time: 101.745 ms
 Execution Time: 0.082 ms

2 joins
 Planning:
   Buffers: shared hit=42 read=81988
 Planning Time: 303.237 ms
 Execution Time: 0.102 ms

3 joins
 Planning:
   Buffers: shared hit=94 read=136660
 Planning Time: 508.947 ms
 Execution Time: 0.155 ms

4 joins
 Planning:
   Buffers: shared hit=188 read=191322
 Planning Time: 710.981 ms
 Execution Time: 0.168 ms

After adding memoization, the cost in buffers remains constant and the
latency deteriorates only marginally (as expected) with each join:

1 join
 Planning:
   Buffers: shared hit=10 read=27328
 Planning Time: 97.889 ms
 Execution Time: 0.066 ms

2 joins
 Planning:
   Buffers: shared hit=7 read=27331
 Planning Time: 100.589 ms
 Execution Time: 0.111 ms

3 joins
 Planning:
   Buffers: shared hit=9 read=27329
 Planning Time: 105.669 ms
 Execution Time: 0.134 ms

4 joins
 Planning:
   Buffers: shared hit=132 read=27370
 Planning Time: 155.716 ms
 Execution Time: 0.219 ms

I'd be happy to improve this patch into something better, though I'd
like suggestions on how to do it. I have this idea of creating a local
"memoization" struct instance within standard_planner(). That would
require passing a pointer down until it reaches
get_actual_variable_range(), which I think would be quite ugly if done
just to improve the planner for this scenario.

Is there any better mechanism I could reuse from other modules (utils
or cache, for example)?

Regards,
Manuel
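Not the attached patch itself, but a rough sketch of the shape of the idea; the struct, its fields, and the cache list (which would have to hang off PlannerInfo) are all invented for illustration:

/* Hypothetical cache entry: extremal values already fetched for one index. */
typedef struct VariableRangeCacheEntry
{
    Oid     indexoid;   /* btree index the range was read from */
    Datum   min_value;
    Datum   max_value;
    bool    have_min;
    bool    have_max;
} VariableRangeCacheEntry;

/*
 * Sketch of a lookup inside get_actual_variable_range(): consult a list
 * of previously fetched ranges before touching the index at all.
 */
static VariableRangeCacheEntry *
variable_range_cache_lookup(List *cache, Oid indexoid)
{
    ListCell   *lc;

    foreach(lc, cache)
    {
        VariableRangeCacheEntry *entry = (VariableRangeCacheEntry *) lfirst(lc);

        if (entry->indexoid == indexoid)
            return entry;       /* hit: skip the index scan entirely */
    }
    return NULL;                /* miss: caller scans the index and appends */
}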
Manuel Weitzman <manuelweitzman@gmail.com> writes:
> I've written a very naive (and crappy) patch to show how adding
> memoization to get_actual_variable_range() could help the planner in
> scenarios with a big number of joins.

So ... the reason why there's no caching of get_actual_variable_range
results already is that I'd supposed it wouldn't be necessary given
the caching of selectivity estimates that happens at the RestrictInfo
level. I don't have any objection in principle to adding another
caching layer if that one's not working well enough, but I think it'd
be wise to first understand why it's needed.

            regards, tom lane
On Sun, Jun 20, 2021 at 2:50 PM Manuel Weitzman <manuelweitzman@gmail.com> wrote:
> I've written a very naive (and crappy) patch to show how adding
> memoization to get_actual_variable_range() could help the planner in
> scenarios with a big number of joins. [...]
> Is there any better mechanism I could reuse from other modules (utils
> or cache, for example)?
Without going into the merits of whether this cache will be adopted or not,
I have some comments about the code.
1. Prefer .patch instead of .diff; it makes it easier for browsers such
as Firefox to read and show the content automatically.

2. New struct: Oid is unsigned int, smaller than int64, so a better
field order is:

+struct ActualVariableRangeCache {
+    int64 min_value;  /* 8 bytes */
+    int64 max_value;  /* 8 bytes */
+    Oid   indexoid;   /* 4 bytes */
+    bool  has_min;    /* 1 byte */
+    bool  has_max;    /* 1 byte */
+};

This takes up less space.

3. Avoid use of the type *long*; it is very problematic on 64 bits.
On 64-bit Windows, long is 4 (four) bytes.
On 64-bit Linux, long is 8 (eight) bytes.

4. Avoid C99-style declarations:

for (unsigned long i = 0;)

Prefer:

size_t i;
for (i = 0;)

This helps with backpatching to C89 versions.
Ranier Vilela
Ranier Vilela <ranier.vf@gmail.com> writes:
> 3. Avoid use of the type *long*; it is very problematic on 64 bits.
> On 64-bit Windows, long is 4 (four) bytes.
> On 64-bit Linux, long is 8 (eight) bytes.

Agreed.

> 4. Avoid C99-style declarations:
> for (unsigned long i = 0;)
> Prefer:
> size_t i;
> for (i = 0;)
> This helps with backpatching to C89 versions.

It seems unlikely that we'd consider back-patching this into pre-C99
branches, so I see no reason not to use C99 loop style. (But do keep
in mind that we avoid most other C99-isms, such as intermixed decls
and code.)

            regards, tom lane
> On 20-06-2021, at 17:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> So ... the reason why there's no caching of get_actual_variable_range
> results already is that I'd supposed it wouldn't be necessary given
> the caching of selectivity estimates that happens at the RestrictInfo
> level.

For what I could make out from the code, the caching done at the
RestrictInfo level is already saving a lot of work, but there's a
different RestrictInfo instance for each alternative path created by
make_one_rel().

I wouldn't know how to reuse instances of RestrictInfo, or whether
that would even be the correct approach; my guess is that doing so
would be incorrect.

I'll improve the patch with the suggestions from Ranier and you in the
meantime.

Best regards,
Manuel
Manuel Weitzman <manuelweitzman@gmail.com> writes:
> For what I could make out from the code, the caching done at the
> RestrictInfo level is already saving a lot of work, but there's a
> different RestrictInfo instance for each alternative path created by
> make_one_rel().

That seems a bit broken; a given WHERE clause should produce only one
RestrictInfo. Can you provide a more concrete example?

            regards, tom lane
> On 29-06-2021, at 15:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> That seems a bit broken; a given WHERE clause should produce only one
> RestrictInfo. Can you provide a more concrete example?

I added some logging to see hits and misses on cached_scansel() for
this query:

> explain (analyze, buffers)
> select * from a
> join b b1 on (b1.a = a.a)
> join b b2 on (b2.a = a.a)
> where b1.a in (1,100,10000,1000000,1000001);

Apparently there's a RestrictInfo for each possible way of doing a
merge join (are those created dynamically during planning?), for
example:

- a join (b1 join b2)
- b1 join (a join b2)
- b2 join (a join b1)

When the cost of a possible merge join path hasn't been computed yet,
mergejoinscansel() has to check the bloated index again.

I attached a patch so you can see the hits and misses on
cached_scansel(). Each time a miss is logged, there's also a different
RestrictInfo pointer involved.

Best regards,
Manuel
Manuel Weitzman <manuelweitzman@gmail.com> writes:
> On 29-06-2021, at 15:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That seems a bit broken; a given WHERE clause should produce only one
>> RestrictInfo. Can you provide a more concrete example?

>> explain (analyze, buffers)
>> select * from a
>> join b b1 on (b1.a = a.a)
>> join b b2 on (b2.a = a.a)
>> where b1.a in (1,100,10000,1000000,1000001);

Hm. By my count, this example generates 3 RestrictInfos during
deconstruct_jointree, representing the three original clauses from the
query, and then 4 more in generate_join_implied_equalities,
representing the EC-derived clauses

    a.a = b1.a
    a.a = b2.a
    b1.a = b2.a
    b1.a = a.a

The third of these seems legit enough; it's a new fact that we want to
apply while considering joining b1 directly to b2. The other ones get
made despite create_join_clause's efforts to avoid making duplicate
clauses, because of two things:

1. create_join_clause doesn't trouble to look for commuted
equivalents, which perhaps is penny-wise and pound-foolish. The cost
of re-deriving selectivity estimates could be way more than the cost
of checking this.

2. Although these look like they ought to be equivalent to the
original clauses (modulo commutation, for some of them), they don't
look that way to create_join_clause, because it's also checking for
parent_ec equality. Per the comment,

 * parent_ec is either equal to ec (if the clause is a potentially-redundant
 * join clause) or NULL (if not). We have to treat this as part of the
 * match requirements --- it's possible that a clause comparing the same two
 * EMs is a join clause in one join path and a restriction clause in another.

It might be worth digging into the git history to see why that became
a thing and then considering whether there's a way around it. (I'm
pretty sure that comment is mine, but I don't recall the details
anymore.)

Anyway, it's certainly not the case that we're making new
RestrictInfos for every pair of rels. It looks that way in this
example because the join vars all belong to the same EC, but that
typically wouldn't be the case in more complex queries.

So we could look into whether this code can be improved to share
RestrictInfos across more cases. Another thought is that even if we
need to keep original and derived clauses separate, maybe it'd be all
right to copy previously-determined cached selectivity values from an
original clause to an otherwise-identical derived clause
(cf. commute_restrictinfo()). I'm not sure though whether it's
reliably the case that we'd have filled in selectivities for the
original clauses before this code wants to clone them.

            regards, tom lane
> 1. create_join_clause doesn't trouble to look for commuted
> equivalents, which perhaps is penny-wise and pound-foolish.
> The cost of re-deriving selectivity estimates could be way
> more than the cost of checking this.

Agreed.

> 2. Although these look like they ought to be equivalent to the
> original clauses (modulo commutation, for some of them), they don't
> look that way to create_join_clause, because it's also checking
> for parent_ec equality. Per the comment,
>
> * parent_ec is either equal to ec (if the clause is a potentially-redundant
> * join clause) or NULL (if not). We have to treat this as part of the
> * match requirements --- it's possible that a clause comparing the same two
> * EMs is a join clause in one join path and a restriction clause in another.
>
> It might be worth digging into the git history to see why that
> became a thing and then considering whether there's a way around it.

To me that sounds OK. I cannot prove that they're equivalent to the
original clauses, so I think it is fine to assume they're not (not an
expert here, quite the opposite).

> Anyway, it's certainly not the case that we're making new
> RestrictInfos for every pair of rels. It looks that way in this
> example because the join vars all belong to the same EC, but
> that typically wouldn't be the case in more complex queries.

Good to know, this wasn't clear to me.

> So we could look into whether this code can be improved to share
> RestrictInfos across more cases. Another thought is that even
> if we need to keep original and derived clauses separate, maybe it'd
> be all right to copy previously-determined cached selectivity values
> from an original clause to an otherwise-identical derived clause
> (cf. commute_restrictinfo()).

To be honest, even if that sounds like a good idea to dig into, I
think it wouldn't completely solve the problem of repeated calls to
get_actual_variable_range().

The example query I gave does a lot of simple self-joins, which makes
the thought process simpler, but I worry more about the more "common"
case in which more than 2 distinct tables are involved in the query.

For example, instead of having "b1, b2, ..., bn" as aliases of "b" in
this query

>> explain (analyze, buffers)
>> select * from a
>> join b b1 on (b1.a = a.a)
>> join b b2 on (b2.a = a.a)
>> where b1.a in (1,100,10000,1000000,1000001);

it is also possible to reproduce the increasing cost in planning
buffers for each new join on a distinct table being added:

explain (analyze, buffers)
select * from a
join b on (b.a = a.a)
join c on (c.a = a.a)
-- ... (etc)
where c.a in (1,100,10000,1000000,1000001);

I can imagine that deconstruct_jointree() and
generate_join_implied_equalities() would generate multiple
RestrictInfos, many of which would involve a constraint on a.a (each
involving a different table):

b.a = a.a
c.a = a.a
c.a = b.a
a.a = b.a
a.a = c.a
... (etc)

(If we wanted, we could also add a different WHERE clause on each of
the tables involved to make really sure all RestrictInfos are
different.)

For each of these RestrictInfos there *could* be one cache miss on
cached_scansel() that *could* force the planner to compute
get_actual_variable_range() for the same variable (a.a) over and over,
as mergejoinscansel() always computes the selectivity for the
intervals that require actual extremal values. In practice this
re-computation of the variable range seems to happen a lot.

One way in which I see it possible to share this kind of information
(of extremal values) across RestrictInfos is to store the known
variable ranges in PlannerInfo (or within a member of that struct),
which seems to be available everywhere it would be needed.

Best regards,
Manuel
> On 30-06-2021, at 16:56, Manuel Weitzman <manuelweitzman@gmail.com> wrote:
>
> One way in which I see it possible to share this kind of information
> (of extremal values) across RestrictInfos is to store the known
> variable ranges in PlannerInfo (or within a member of that struct),
> which seems to be available everywhere it would be needed.

I have written a new patch that is (hopefully) better than the first
one I sent, storing the extremal index values within PlannerInfo.

> it is also possible to reproduce the increasing cost in planning
> buffers for each new join on a distinct table being added:
>
> [...]
>
> I can imagine that deconstruct_jointree() and
> generate_join_implied_equalities() would generate multiple
> RestrictInfos, many of which would involve a constraint on a.a (each
> involving a different table).

I also attached an example in which RestrictInfos are generated for
multiple tables instead of just a single aliased one. The buffers read
for planning also increase with each join added to the query.

Best regards,
Manuel
On Thu, 1 Jul 2021 at 08:56, Manuel Weitzman <manuelweitzman@gmail.com> wrote:
> For each of these RestrictInfos there *could* be one cache miss on
> cached_scansel() that *could* force the planner to compute
> get_actual_variable_range() for the same variable (a.a) over and over,
> as mergejoinscansel() always computes the selectivity for the
> intervals that require actual extremal values. In practice this
> re-computation of the variable range seems to happen a lot.

Recently, for some other topic, I was thinking about what would happen
if we were ever to have the executor give up on a plan because
something did not play out the way the planner expected it to. If the
executor were ever given the ability to throw the plan back at the
planner with some hints about where it went wrong, I wondered what
exactly those hints would look like. I guessed they would be a list of
rough guidelines, e.g. saying that x JOIN y ON x.a = y.b produces at
least Z rows; the planner would then consult that feedback when
estimating the size of the join between x and y and take the maximum
of the two values. It would need to be designed in such a way that the
planner could consult the feedback quickly, e.g. a hashtable lookup on
Oid.

Anyway, I don't really have any clearly-thought-through plans for that
idea, as it would be a large project needing a lot of changes before
it could even be considered seriously. However, it did cause me to
think about it again when reading this thread, as this seems similar:
you've learned the actual variable range through actual execution, so
it does not seem too unreasonable that this information might get
stored in a similar place, if that place were to exist.

I'm not saying that's what needs to happen here; it's more just food
for thought. That said, the caching you need does seem more specific
to an index Oid than to a relid.

David