Thread: Postgres non-deterministically uses a bad plan, how to convince it otherwise?

Hi,

I am running the same application (identical codebase) as two separate 
instances to index (save) different sets of data. Both run PostgreSQL 13.

The queries are the same but the content in actual databases is 
different. One database is around 1TB and the other around 300GB.


There is a problem with a paginated select query with a join and an 
ORDER BY. Depending on which column you order by (primary key or FK), 
the query is either instant or takes minutes.

So on one database, ordering by the primary key is instant but on the 
other it is slow, and vice versa. Switching the columns around in the 
slow case fixes the issue.

All relevant columns are indexed.


Simplified:

Slow: SELECT * FROM table1 AS t1 LEFT JOIN table2 AS t2 ON 
t2.t1_id=t1.id ORDER BY t1.id ASC LIMIT 10 OFFSET 0

Fast: SELECT * FROM table1 AS t1 LEFT JOIN table2 AS t2 ON 
t2.t1_id=t1.id ORDER BY t2.t1_id ASC LIMIT 10 OFFSET 0

(and the opposite, on the other instance the first one is fast and 
second one is slow).


I have run all the statistics recalculations but that doesn't help. As 
far as I can tell from the docs, there is no real way to affect the plan 
other than reworking the query (I've read about fencing?), which can't 
be done here because the behaviour seems unpredictable and depends on 
the actual data and data quantity.
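
(To illustrate what I mean by fencing: as I understand it, a MATERIALIZED 
CTE acts as an optimization fence on PG 12+, so something like this 
sketch, untested, would pin the pagination step of the simplified query 
above:

WITH first_page AS MATERIALIZED (   -- MATERIALIZED fences the CTE from the outer query (PG 12+)
    SELECT id FROM table1 ORDER BY id ASC LIMIT 10 OFFSET 0
)
SELECT t1.*, t2.*
FROM first_page
JOIN table1 AS t1 ON t1.id = first_page.id   -- re-join to fetch the full rows
LEFT JOIN table2 AS t2 ON t2.t1_id = t1.id
ORDER BY t1.id ASC;

But rewriting every affected query like this is exactly what I'd like 
to avoid.)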

I haven't tried reindexing.


Before providing the specific query plans and diving into them:

- does the planner take previous runs of the same query and its 
execution time into account? If not, why?

- assuming the query to be immutable, would it be possible for the 
planner to microbenchmark a few different plans instead of trying to 
estimate the cost?
As in, actually executing the query with different plans and caching the 
best one.



PS: good job to the FOSDEM Devroom organisers, one of the best rooms, 
was a blast.

Best regards, Cen




Re: Postgres non-deterministically uses a bad plan, how to convince it otherwise?

From: "David G. Johnston"
On Thu, Feb 16, 2023 at 8:48 AM cen <cen.is.imba@gmail.com> wrote:

- does the planner take previous runs of the same query and its
execution time into account? If not, why?

No, because that isn't how it works.  And while I'm no planner expert, I can't imagine any particularly compelling argument for why it would even make sense to try.  The book-keeping would be expensive, and with a dataset that is presumably ever-changing, any such comparisons would in many cases be meaningless.


- assuming the query to be immutable, would it be possible for the
planner to microbenchmark a few different plans instead of trying to
estimate the cost?
As in, actually executing the query with different plans and caching the
best one.

No, the planner may not cause execution.  While I could imagine extending EXPLAIN to somehow retrieve, and maybe even try, alternative plans that have been fully constructed today, I'm not holding my breath.

There is little reason for the project to give any real weight to "assuming the query to be immutable".  We do want to fix the planner to behave better if it is misbehaving; otherwise, you do have access to cost parameters, and potentially other planner toggles, if you've truly run into an intractable problem.
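
For illustration only, these are the kinds of knobs I mean (session-level, a sketch against the simplified query upthread, not a recommendation for your case):

SET random_page_cost = 1.1;   -- cost parameter: price random I/O closer to sequential, typical for SSDs
SET enable_mergejoin = off;   -- planner toggle: strongly discourage merge joins in this session
EXPLAIN ANALYZE
SELECT * FROM table1 AS t1
LEFT JOIN table2 AS t2 ON t2.t1_id = t1.id
ORDER BY t1.id ASC LIMIT 10 OFFSET 0;   -- see whether the chosen plan changes
RESET enable_mergejoin;       -- the enable_* toggles are diagnostic aids, not permanent settings
RESET random_page_cost;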

David J.

On 2/16/23 09:47, cen wrote:
> Hi,
>
> I am running the same application (identical codebase) as two separate 
> instances to index (save) different sets of data. Both run PostgreSQL 13.
>
> The queries are the same but the content in actual databases is different. 
> One database is around 1TB and the other around 300GB.
>
>
> There is a problem with a paginated select query with a join and an ORDER BY. 
> Depending on which column you order by (primary key or FK), the query is either 
> instant or takes minutes.
>
> So on one database, ordering by the primary key is instant but on the other it is 
> slow, and vice versa. Switching the columns around in the slow case fixes 
> the issue.
>
> All relevant columns are indexed.
>
>
> Simplified:
>
> Slow: SELECT * FROM table1 AS t1 LEFT JOIN table2 AS t2 ON t2.t1_id=t1.id 
> ORDER BY t1.id ASC LIMIT 10 OFFSET 0
>
> Fast: SELECT * FROM table1 AS t1 LEFT JOIN table2 AS t2 ON t2.t1_id=t1.id 
> ORDER BY t2.t1_id ASC LIMIT 10 OFFSET 0
>
> (and the opposite, on the other instance the first one is fast and second 
> one is slow).

What does EXPLAIN ANALYZE say?

> I have run all the statistics recalculations but that doesn't help. As far 
> as I can tell from the docs, there is no real way to affect the plan 
> other than reworking the query (I've read about fencing?), which can't be 
> done here because the behaviour seems unpredictable and depends on the 
> actual data and data quantity.
>
> I haven't tried reindexing.

Since you've run ANALYZE, when were the tables last vacuumed?

-- 
Born in Arizona, moved to Babylonia.



On 16/02/2023 17:07, David G. Johnston wrote:

No, the planner may not cause execution.  While I could imagine extending EXPLAIN to somehow retrieve, and maybe even try, alternative plans that have been fully constructed today, I'm not holding my breath.

There is little reason for the project to give any real weight to "assuming the query to be immutable".  We do want to fix the planner to behave better if it is misbehaving; otherwise, you do have access to cost parameters, and potentially other planner toggles, if you've truly run into an intractable problem.

David J.

Fair on both points. I didn't know planner toggles existed; I'll play with those.

Re: Postgres non-deterministically uses a bad plan, how to convince it otherwise?

From: Marc Millas
If I may, this answer seems more "philosophical" than "practical".

Oracle (maybe version 10, I don't remember exactly) introduced the ability to explicitly store an execution plan, so that a given query uses THAT plan, i.e. doesn't go through the planner at all. Sure, if someone does something stupid, they may get stupid results... it was an "expert only" feature  :-)

Still, it was very useful for managing the rare cases where the planner cannot, for whatever reason, do a good job.

OK, it's not the way Postgres behaves. Still, in some cases...
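
(On the Postgres side, the closest thing I know of is the third-party pg_hint_plan extension, which reads hints out of SQL comments. A rough sketch with made-up table names, just to show the idea:

/*+ IndexScan(t1) NestLoop(t1 t2) */   -- pin the scan method for t1 and the join method for t1/t2
SELECT * FROM t1 JOIN t2 ON t1.x = t2.y WHERE t1.x = 3;

But that is an extension, not core Postgres.)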



Marc MILLAS
Senior Architect
+33607850334




On 16/02/2023 17:15, Ron wrote:
> On 2/16/23 09:47, cen wrote:
>> Hi,
>>
>> I am running the same application (identical codebase) as two 
>> separate instances to index (save) different sets of data. Both run 
>> PostgreSQL 13.
>>
>> The queries are the same but the content in actual databases is 
>> different. One database is around 1TB and the other around 300GB.
>>
>>
>> There is a problem with a paginated select query with a join and an 
>> ORDER BY. Depending on which column you order by (primary key or FK), 
>> the query is either instant or takes minutes.
>>
>> So on one database, ordering by the primary key is instant but on the 
>> other it is slow, and vice versa. Switching the columns around in the 
>> slow case fixes the issue.
>>
>> All relevant columns are indexed.
>
> What does EXPLAIN ANALYZE say?

I finally managed to get back to this and have a similar sample query. 
In this case, changing the WHERE clause to use the joined table column 
slows the query down.

The initial case was about the ORDER BY column, but the symptoms are the 
same, I think.

I understand that even though both columns are indexed, the indexes are 
completely different. But the point is: how would one know in advance 
which one will be faster when designing the query?

And as I mentioned in my initial message, which column is the fast one 
can switch around as the database grows.


Fast case:

SELECT
   t0."status",
   b1."timestamp"
FROM
   "transactions" AS t0
INNER JOIN
   "blocks" AS b1
ON
   b1."number" = t0."block_number"
WHERE
   (((t0."to_address_hash" = '\x3012c'))
     OR (t0."from_address_hash" = '\x3012c')
     OR (t0."created_contract_address_hash" = '\x3012c'))
   AND (t0."block_number" >= 30926000)
   AND (t0."block_number" <= 31957494)
ORDER BY
   t0."block_number" DESC
LIMIT
   150
OFFSET
   300;

Plan:

Limit  (cost=15911.73..23367.09 rows=150 width=16) (actual 
time=205.093..305.423 rows=150 loops=1)
    ->  Gather Merge  (cost=1001.03..812143.50 rows=16320 width=16) 
(actual time=36.140..305.333 rows=450 loops=1)
          Workers Planned: 2
          Workers Launched: 2
          ->  Nested Loop  (cost=1.00..809259.75 rows=6800 width=16) 
(actual time=2.662..155.779 rows=153 loops=3)
                ->  Parallel Index Scan Backward using 
transactions_block_number_index on transactions t0 (cost=0.56..753566.08 
rows=6800 width=8) (actual time=0.224..145.998 rows=153 loops=3)
                      Index Cond: ((block_number >= 30926000) AND 
(block_number <= 31957494))
                      Filter: ((to_address_hash = '\x3012c'::bytea) OR 
(from_address_hash = '\x3012c'::bytea) OR (created_contract_address_hash 
= '\x3012c'::bytea))
                      Rows Removed by Filter: 22471
                ->  Index Scan using blocks_number_index on blocks b1  
(cost=0.44..8.18 rows=1 width=16) (actual time=0.059..0.060 rows=1 
loops=460)
                      Index Cond: (number = t0.block_number)
  Planning Time: 0.513 ms
  Execution Time: 305.541 ms

------------------------------

Slow case:

SELECT
   t0."status",
   b1."timestamp"
FROM
   "transactions" AS t0
INNER JOIN
   "blocks" AS b1
ON
   b1."number" = t0."block_number"
WHERE
   (((t0."to_address_hash" = '\x3012c'))
     OR (t0."from_address_hash" = '\x3012c')
     OR (t0."created_contract_address_hash" = '\x3012c'))
   AND (b1."number" >= 30926000) -- using col from joined table instead
   AND (b1."number" <= 31957494) -- using col from joined table instead
ORDER BY
   t0."block_number" DESC
LIMIT
   150
OFFSET
   300;

Plan:

Limit  (cost=1867319.63..1877754.02 rows=150 width=16) (actual 
time=95830.704..95962.116 rows=150 loops=1)
    ->  Gather Merge  (cost=1846450.83..2015348.94 rows=2428 width=16) 
(actual time=95805.872..95962.075 rows=450 loops=1)
          Workers Planned: 2
          Workers Launched: 2
          ->  Merge Join  (cost=1845450.81..2014068.67 rows=1012 
width=16) (actual time=95791.362..95824.633 rows=159 loops=3)
                Merge Cond: (t0.block_number = b1.number)
                ->  Sort  (cost=1845402.63..1845823.81 rows=168474 
width=8) (actual time=95790.194..95790.270 rows=186 loops=3)
                      Sort Key: t0.block_number DESC
                      Sort Method: external merge  Disk: 2496kB
                      Worker 0:  Sort Method: external merge  Disk: 2408kB
                      Worker 1:  Sort Method: external merge  Disk: 2424kB
                      ->  Parallel Bitmap Heap Scan on transactions t0  
(cost=39601.64..1828470.76 rows=168474 width=8) (actual 
time=7274.149..95431.494 rows=137658 loops=3)
                            Recheck Cond: ((to_address_hash = 
'\x3012c'::bytea) OR (from_address_hash = '\x3012c'::bytea) OR 
(created_contract_address_hash = '\x3012c'::bytea))
                            Rows Removed by Index Recheck: 716205
                            Heap Blocks: exact=7043 lossy=134340
                            ->  BitmapOr (cost=39601.64..39601.64 
rows=404359 width=0) (actual time=7264.127..7264.130 rows=0 loops=1)
                                  ->  Bitmap Index Scan on 
transactions_to_address_hash_recent_collated_index (cost=0.00..326.32 
rows=3434 width=0) (actual time=2.314..2.314 rows=5 loops=1)
                                        Index Cond: (to_address_hash = 
'\x3012c'::bytea)
                                  ->  Bitmap Index Scan on 
transactions_from_address_hash_recent_collated_index 
(cost=0.00..38967.50 rows=400924 width=0) (actual 
time=7261.750..7261.750 rows=419509 loops=1)
                                        Index Cond: (from_address_hash = 
'\x3012c'::bytea)
                                  ->  Bitmap Index Scan on 
transactions_created_contract_address_hash_recent_collated_inde 
(cost=0.00..4.57 rows=1 width=0) (actual time=0.059..0.060 rows=0 loops=1)
                                        Index Cond: 
(created_contract_address_hash = '\x3012c'::bytea)
                ->  Index Scan Backward using blocks_number_index on 
blocks b1  (cost=0.44..167340.18 rows=189631 width=16) (actual 
time=0.139..26.964 rows=27612 loops=3)
                      Index Cond: ((number >= 30926000) AND (number <= 
31957494))
  Planning Time: 0.736 ms
  Execution Time: 95963.436 ms


> Since you've run ANALYZE, when were the tables last vacuumed?

Last autovacuum for transactions: 2023-02-23
Last autovacuum for blocks: 2023-01-23







On Thu, Mar 2, 2023 at 8:20 AM cen <cen.is.imba@gmail.com> wrote:
> [query plans snipped; quoted in full in the previous message]

I shoved your plans into depesz... (https://explain.depesz.com/s/hRaZ)
I made the slow one the primary, and the fast one the optimized.

This allows me to compare them side by side more easily than in the email.

And this makes it clear, in one line:

Sort  (cost=1845402.63..1845823.81 rows=168474 width=8) (actual time=95,790.194..95,790.270 rows=186 loops=3)

The estimation was for 168,474 rows. The actual was 186. You spun over a TON of data 
that you did not use. And notice the TIME. That's your query time...

Had you included BUFFERS, I am sure it would have shown a massive difference in data read.
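
Next time, something like this (on PG 13, BUFFERS requires ANALYZE; the per-node block counts show what was actually read):

EXPLAIN (ANALYZE, BUFFERS)   -- adds shared hit/read/dirtied block counts to every plan node
SELECT
   t0."status",
   b1."timestamp"
FROM
   "transactions" AS t0
INNER JOIN
   "blocks" AS b1 ON b1."number" = t0."block_number"
WHERE
   (((t0."to_address_hash" = '\x3012c'))
     OR (t0."from_address_hash" = '\x3012c')
     OR (t0."created_contract_address_hash" = '\x3012c'))
   AND (b1."number" >= 30926000)
   AND (b1."number" <= 31957494)
ORDER BY
   t0."block_number" DESC
LIMIT 150 OFFSET 300;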

You're wondering: if the two columns are being equated, what difference should it make?
But the system has to choose, and that extra information might have tricked the optimizer into thinking, "Oh, I can just scan."
Or, "Oh, I will use the other index as the driver, since I have a filter to apply."

Furthermore, I RECENTLY LEARNED this: be very careful optimizing queries with LIMIT in them.
This can cause the optimizer to think brute force is fine, since it only needs X rows...

Kirk
On Fri, 3 Mar 2023 at 02:20, cen <cen.is.imba@gmail.com> wrote:
> I understand that even though both columns are indexed, the indexes are
> completely different. But the point is: how would one know in advance
> which one will be faster when designing the query?

Likely to be safe, you'd just include both. The problem is that the
query planner makes use of equivalence classes to deduce equivalence
in quals.

If you have a query such as:

select * from t1 inner join t2 on t1.x = t2.y where t1.x = 3;

then the planner can deduce that t2.y must also be 3 and that qual can
be pushed down to the scan level. If t2.y = 3 is quite selective and
there's an index on that column, then this deduction is likely going
to be a very good win, as the alternative of not using it requires
looking at all rows in t2.

The problem is that the equivalence class code can only deduce
equality.  If we had written:

select * from t1 inner join t2 on t1.x = t2.y where t1.x > 2 and t1.x < 4;

then we'd not have gotten quite as optimal a plan.

Provided we're doing an inner join, we could just write both sets of
quals to force the planner's hand:

select * from t1 inner join t2 on t1.x = t2.y where t1.x > 2 and t1.x
< 4 and t2.y > 2 and t2.y < 4;

You could likely do the same in your query.

I still hope to improve this in the planner one day.  A few other
things are getting closer, which lowers the bar a bit for getting
something like this committed. There's some relevant discussion in
[1].

David

[1] https://postgr.es/m/CAKJS1f9FK_X_5HKcPcSeimy16Owe3EmPmmGsGWLcKkj_rW9s6A%40mail.gmail.com



> Likely to be safe, you'd just include both. The problem is that the
> query planner makes use of equivalence classes to deduce equivalence
> in quals.
>
> If you have a query such as:
>
> select * from t1 inner join t2 on t1.x = t2.y where t1.x = 3;
>
> then the planner can deduce that t2.y must also be 3 and that qual can
> be pushed down to the scan level. If t2.y = 3 is quite selective and
> there's an index on that column, then this deduction is likely going
> to be a very good win, as the alternative of not using it requires
> looking at all rows in t2.

Does equivalence only work for constants, as in the sample you provided, 
or will it also be found between b1."number" and t0."block_number" in my 
sample query?

Meaning, if the columns could be used interchangeably in all the WHERE 
clauses and the ORDER BY clause, then it would just be a matter of 
figuring out which costs less.






On Fri, 3 Mar 2023 at 22:35, cen <cen.is.imba@gmail.com> wrote:
> Does equivalence only work for constants, as in the sample you provided,
> or will it also be found between b1."number" and t0."block_number" in my
> sample query?

It works for more than constants, but in this case, it's the presence
of the constant that would allow the qual to be pushed down into the
scan level of the other relation.

For cases such as t1 INNER JOIN t2 ON t1.x = t2.y INNER JOIN t3 ON
t2.y = t3.z the equivalence classes could allow t1 to be joined to t3
using t1.x = t3.z before t2 is joined in, so certainly it still does
things with classes not containing constants. No derived quals will
get pushed down to the scan level without constants, however.

> Meaning, if the columns could be used interchangeably in all the WHERE
> clauses and the ORDER BY clause, then it would just be a matter of
> figuring out which costs less.

The problem is you have:   AND (t0."block_number" >= 30926000) AND
(t0."block_number" <= 31957494).  The >= and <= operators are not used
to help build the equivalence class.  You'd see a very different plan
had you just been asking for block_number = 30926000.

I think your best solution will be to just also include the seemingly
surplus: AND (b1."number" >= 30926000)  AND (b1."number" <= 31957494)
quals.
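
Going by the query you posted, something like this (untested):

SELECT
   t0."status",
   b1."timestamp"
FROM
   "transactions" AS t0
INNER JOIN
   "blocks" AS b1
ON
   b1."number" = t0."block_number"
WHERE
   (((t0."to_address_hash" = '\x3012c'))
     OR (t0."from_address_hash" = '\x3012c')
     OR (t0."created_contract_address_hash" = '\x3012c'))
   AND (t0."block_number" >= 30926000)   -- range qual on the transactions side
   AND (t0."block_number" <= 31957494)
   AND (b1."number" >= 30926000)         -- the seemingly surplus copy on the blocks side
   AND (b1."number" <= 31957494)
ORDER BY
   t0."block_number" DESC
LIMIT 150 OFFSET 300;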

David