Thread: Bad query plan when you add many OR conditions

Bad query plan when you add many OR conditions

From
Marco Colli
Date:
Hello!

I have a query on a large table that is very fast (0s):

Basically the query matches the rows that have a tag1 OR tag2 OR tag3 OR tag4 OR tag5... 

However if you increase the number of OR at some point PostgreSQL makes the bad decision to change its query plan! And the new plan makes the query terribly slow:

Instead of this (which is fast):
  Bitmap Index Scan on index_subscriptions_on_project_id_and_tags
It starts using this (which is slow):
  Parallel Index Scan using index_subscriptions_on_project_id_and_created_at
The choice seems quite stupid since it doesn't have the tags on the new index... and indeed the query takes about 1 minute instead of a few milliseconds. Here's a list of the available indexes:

How can I encourage PostgreSQL to use the Bitmap Index Scan even when there are many OR conditions? I have tried with VACUUM ANALYZE subscriptions but it doesn't help.

Note: the query is generated dynamically by customers of a SaaS, so I don't have full control on it


Thank you very much for any advice!
Marco Colli


Re: Bad query plan when you add many OR conditions

From
Justin Pryzby
Date:
On Fri, Jan 10, 2020 at 02:11:14AM +0100, Marco Colli wrote:
> I have a query on a large table that is very fast (0s):
> https://gist.github.com/collimarco/039412b4fe0dcf39955888f96eff29db#file-fast_query-txt

ORDER BY + LIMIT is a query which sometimes has issues, you can probably find
more by searching.  The planner thinks it'll hit the LIMIT pretty soon and only
run a fraction of the index scan - but then it turns out to be wrong.

You might have poor statistics on project_id and/or tags.  This *might* help:
ALTER TABLE subscriptions ALTER project_id SET STATISTICS 2000; ANALYZE subscriptions;

But I'm guessing there's correlation between the two, which the planner doesn't
know.  If you're running at least v10, I'm guessing it would help to CREATE
STATISTICS on those columns (and analyze).

See one similar problem here (not involving LIMIT).
https://www.postgresql.org/message-id/flat/CABFxtPedz4zL%2BaPWut4%2B%3Dum4av1aAXr6OVRfRB_6K7mJKMbEcw%40mail.gmail.com



Re: Bad query plan when you add many OR conditions

From
Marco Colli
Date:
I am trying different solutions and what I have found is even more surprising to me...

The query is always this:

I have added this index which would allow an index only scan:
"index_subscriptions_on_project_id_and_created_at_and_tags" btree (project_id, created_at DESC, tags) WHERE trashed_at IS NULL

But Postgresql continues to use this index (which has less information and then requires slow access to disk):
"index_subscriptions_on_project_id_and_created_at" btree (project_id, created_at DESC)



On Fri, Jan 10, 2020 at 4:06 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Fri, Jan 10, 2020 at 02:11:14AM +0100, Marco Colli wrote:
> I have a query on a large table that is very fast (0s):
> https://gist.github.com/collimarco/039412b4fe0dcf39955888f96eff29db#file-fast_query-txt

ORDER BY + LIMIT is a query which sometimes has issues, you can probably find
more by searching.  The planner thinks it'll hit the LIMIT pretty soon and only
run a fraction of the index scan - but then it turns out to be wrong.

You might have poor statistics on project_id and/or tags.  This *might* help:
ALTER TABLE subscriptions ALTER project_id SET STATISTICS 2000; ANALYZE subscriptions;

But I'm guessing there's correlation between the two, which the planner doesn't
know.  If you're running at least v10, I'm guessing it would help to CREATE
STATISTICS on those columns (and analyze).

See one similar problem here (not involving LIMIT).
https://www.postgresql.org/message-id/flat/CABFxtPedz4zL%2BaPWut4%2B%3Dum4av1aAXr6OVRfRB_6K7mJKMbEcw%40mail.gmail.com

Re: Bad query plan when you add many OR conditions

From
Marco Colli
Date:
@Justin Pryzby I have tried this as you suggested:

CREATE STATISTICS statistics_on_subscriptions_project_id_and_tags ON project_id, tags FROM subscriptions;
VACUUM ANALYZE subscriptions;

Unfortunately nothing changes and Postgresql continues to use the wrong plan (maybe stats don't work well on array fields like tags??).

On Fri, Jan 10, 2020 at 4:06 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Fri, Jan 10, 2020 at 02:11:14AM +0100, Marco Colli wrote:
> I have a query on a large table that is very fast (0s):
> https://gist.github.com/collimarco/039412b4fe0dcf39955888f96eff29db#file-fast_query-txt

ORDER BY + LIMIT is a query which sometimes has issues, you can probably find
more by searching.  The planner thinks it'll hit the LIMIT pretty soon and only
run a fraction of the index scan - but then it turns out to be wrong.

You might have poor statistics on project_id and/or tags.  This *might* help:
ALTER TABLE subscriptions ALTER project_id SET STATISTICS 2000; ANALYZE subscriptions;

But I'm guessing there's correlation between the two, which the planner doesn't
know.  If you're running at least v10, I'm guessing it would help to CREATE
STATISTICS on those columns (and analyze).

See one similar problem here (not involving LIMIT).
https://www.postgresql.org/message-id/flat/CABFxtPedz4zL%2BaPWut4%2B%3Dum4av1aAXr6OVRfRB_6K7mJKMbEcw%40mail.gmail.com

Re: Bad query plan when you add many OR conditions

From
Justin Pryzby
Date:
On Fri, Jan 10, 2020 at 12:03:39PM +0100, Marco Colli wrote:
> I have added this index which would allow an index only scan:
> "index_subscriptions_on_project_id_and_created_at_and_tags" btree
> (project_id, created_at DESC, tags) WHERE trashed_at IS NULL

Are those the only columns in subscriptions ?

> But Postgresql continues to use this index (which has less information and
> then requires slow access to disk):
> "index_subscriptions_on_project_id_and_created_at" btree (project_id,
> created_at DESC)

Did you vacuum the table ?
Did you try to "explain" the query after dropping the 1st index (like: begin;
DROP INDEX..; explain analyze..; rollback).

Also, is the first (other) index btree_gin (you can \dx to show extensions) ?

I think it needs to be a gin index to search tags ?

On Fri, Jan 10, 2020 at 01:42:24PM +0100, Marco Colli wrote:
> I would like to try your solution but I read that ALTER TABLE... SET
> STATISTICS  locks the table... Since it is just an experiment and we don't
> know if it actually works it would be greate to avoid locking a large table
> (50M) in production.

I suggest to CREATE TABLE test_subscriptions (LIKE subscriptions INCLUDING
ALL); INSERT INTO test_subscriptions SELECT * FROM subscriptions; ANALYZE test_subscriptions;

Anyway, ALTER..SET STATS requires a strong lock but for only a brief moment
(assuming it doesn't have to wait).  Possibly you'd be ok doing SET
statement_timeout='1s'; ALTER TABLE....  

> Does CREATE  STATISTICS lock the table too?

You can check by SET client_min_messages=debug; SET lock_timeout=333; SET log_lock_waits=on;
Looks like it needs ShareUpdateExclusiveLock.

> Does statistics work on an array field like tags? (I can't find any
> information)

It think it'd be data type agnostic.  And seems to work with arrays.

On Fri, Jan 10, 2020 at 02:30:27PM +0100, Marco Colli wrote:
> @Justin Pryzby I have tried this as you suggested:
> 
> CREATE STATISTICS statistics_on_subscriptions_project_id_and_tags ON
> project_id, tags FROM subscriptions;
> VACUUM ANALYZE subscriptions;
> 
> Unfortunately nothing changes and Postgresql continues to use the wrong
> plan (maybe stats don't work well on array fields like tags??).

It'd help to see SELECT stxddependencies FROM pg_statistic_ext WHERE
stxoid='subscriptions'::regclass



Re: Bad query plan when you add many OR conditions

From
Marco Colli
Date:
Before trying other solutions I would like to make PG use an index-only scan (it should be fast enough for our purpose).

I have tried to disable the other indexes and forced PG to use this index (which includes all the fields of the query):
index_subscriptions_on_project_id_and_created_at_and_tags

The problem is that the query plan is this:

As you can see it is a *index scan* and not an *index only* scan... I don't understand why. The index includes all the fields used by the query... so an index only scan should be possible.


On Fri, Jan 10, 2020 at 2:34 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Fri, Jan 10, 2020 at 12:03:39PM +0100, Marco Colli wrote:
> I have added this index which would allow an index only scan:
> "index_subscriptions_on_project_id_and_created_at_and_tags" btree
> (project_id, created_at DESC, tags) WHERE trashed_at IS NULL

Are those the only columns in subscriptions ?

> But Postgresql continues to use this index (which has less information and
> then requires slow access to disk):
> "index_subscriptions_on_project_id_and_created_at" btree (project_id,
> created_at DESC)

Did you vacuum the table ?
Did you try to "explain" the query after dropping the 1st index (like: begin;
DROP INDEX..; explain analyze..; rollback).

Also, is the first (other) index btree_gin (you can \dx to show extensions) ?

I think it needs to be a gin index to search tags ?

On Fri, Jan 10, 2020 at 01:42:24PM +0100, Marco Colli wrote:
> I would like to try your solution but I read that ALTER TABLE... SET
> STATISTICS  locks the table... Since it is just an experiment and we don't
> know if it actually works it would be greate to avoid locking a large table
> (50M) in production.

I suggest to CREATE TABLE test_subscriptions (LIKE subscriptions INCLUDING
ALL); INSERT INTO test_subscriptions SELECT * FROM subscriptions; ANALYZE test_subscriptions;

Anyway, ALTER..SET STATS requires a strong lock but for only a brief moment
(assuming it doesn't have to wait).  Possibly you'd be ok doing SET
statement_timeout='1s'; ALTER TABLE.... 

> Does CREATE  STATISTICS lock the table too?

You can check by SET client_min_messages=debug; SET lock_timeout=333; SET log_lock_waits=on;
Looks like it needs ShareUpdateExclusiveLock.

> Does statistics work on an array field like tags? (I can't find any
> information)

It think it'd be data type agnostic.  And seems to work with arrays.

On Fri, Jan 10, 2020 at 02:30:27PM +0100, Marco Colli wrote:
> @Justin Pryzby I have tried this as you suggested:
>
> CREATE STATISTICS statistics_on_subscriptions_project_id_and_tags ON
> project_id, tags FROM subscriptions;
> VACUUM ANALYZE subscriptions;
>
> Unfortunately nothing changes and Postgresql continues to use the wrong
> plan (maybe stats don't work well on array fields like tags??).

It'd help to see SELECT stxddependencies FROM pg_statistic_ext WHERE
stxoid='subscriptions'::regclass

Re: Bad query plan when you add many OR conditions

From
Tom Lane
Date:
Marco Colli <collimarco91@gmail.com> writes:
> As you can see it is a *index scan* and not an *index only* scan... I don't
> understand why. The index includes all the fields used by the query... so
> an index only scan should be possible.

Huh?  The query is "select * from ...", so it retrieves *all* columns
of the table.

            regards, tom lane



Re: Bad query plan when you add many OR conditions

From
Marco Colli
Date:
Sorry, I didn't notice the SELECT * and I said something stupid... 
However my reasoning should be still valid: I mean, PG could find the few relevant rows (there's a LIMIT 30) using ONLY the index. It has all the information required inside the index! Then it can simply access to that rows on disk... It cannot take ~1 minute to access a few rows on disk (max 30 rows, actual 0 rows).



On Fri, Jan 10, 2020 at 4:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Marco Colli <collimarco91@gmail.com> writes:
> As you can see it is a *index scan* and not an *index only* scan... I don't
> understand why. The index includes all the fields used by the query... so
> an index only scan should be possible.

Huh?  The query is "select * from ...", so it retrieves *all* columns
of the table.

                        regards, tom lane

Re: Bad query plan when you add many OR conditions

From
Jeff Janes
Date:
On Thu, Jan 9, 2020 at 8:11 PM Marco Colli <collimarco91@gmail.com> wrote:
Hello!

I have a query on a large table that is very fast (0s):

Basically the query matches the rows that have a tag1 OR tag2 OR tag3 OR tag4 OR tag5... 

Each branch of the OR is anticipated to return 400 rows, but it actually returns 0.  If it actually were to return 400 rows per branch, than eventually the plan switch actually would make sense.

Why is the estimate off by so much?  If you run a simple select, what the actual and expected number of rows WHERE project_id = 12345?  WHERE tags @> '{crt:2018_11}'?  Is one of those estimates way off reality, or is it only the conjunction which is deranged?
 
How can I encourage PostgreSQL to use the Bitmap Index Scan even when there are many OR conditions? I have tried with VACUUM ANALYZE subscriptions but it doesn't help.

Note: the query is generated dynamically by customers of a SaaS, so I don't have full control on it

Do you have enough control to change the ORDER BY to:

ORDER BY ("subscriptions"."created_at" + interval '0 days') DESC  

Cheers,

Jeff

Re: Bad query plan when you add many OR conditions

From
Tomas Vondra
Date:
On Fri, Jan 10, 2020 at 02:30:27PM +0100, Marco Colli wrote:
>@Justin Pryzby I have tried this as you suggested:
>
>CREATE STATISTICS statistics_on_subscriptions_project_id_and_tags ON
>project_id, tags FROM subscriptions;
>VACUUM ANALYZE subscriptions;
>
>Unfortunately nothing changes and Postgresql continues to use the wrong
>plan (maybe stats don't work well on array fields like tags??).
>

We support this type of clause for extended statistics (yet).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Bad query plan when you add many OR conditions

From
Thomas Kellerer
Date:
Marco Colli schrieb am 10.01.2020 um 02:11:
> I have a query on a large table that is very fast (0s):
> https://gist.github.com/collimarco/039412b4fe0dcf39955888f96eff29db#file-fast_query-txt
> 
> Basically the query matches the rows that have a tag1 OR tag2 OR tag3 OR tag4 OR tag5... 
> 
> However if you increase the number of OR at some point PostgreSQL makes the bad decision to change its query plan!
Andthe new plan makes the query terribly slow:
 
> https://gist.github.com/collimarco/039412b4fe0dcf39955888f96eff29db#file-slow_query-txt
> 
> Instead of this (which is fast):
>   Bitmap Index Scan on index_subscriptions_on_project_id_and_tags
> It starts using this (which is slow):
>   Parallel Index Scan using index_subscriptions_on_project_id_and_created_at
> The choice seems quite stupid since it doesn't have the tags on the new index... and indeed the query takes about 1
minuteinstead of a few milliseconds. Here's a list of the available indexes:
 
> https://gist.github.com/collimarco/039412b4fe0dcf39955888f96eff29db#file-_indexes-txt
> 
> How can I encourage PostgreSQL to use the Bitmap Index Scan even when there are many OR conditions? I have tried with
VACUUMANALYZE subscriptions but it doesn't help.
 
> 
> Note: the query is generated dynamically by customers of a SaaS, so I don't have full control on it

Can you replace the many ORs with a single "overlaps" comparison?

This 

    (tags @> ARRAY['crt:2018_04']::varchar[]) OR (tags @> ARRAY['crt:2018_05']::varchar[]) OR (tags @>
ARRAY['crt:2018_06']::varchar[])

is equivalent to 

    tags && array['crt:2018_04','crt:2018_05','crt:2018_06', ...]

The && operator can make use of a GIN index so maybe that uses the index_subscriptions_on_project_id_and_tags
regardlessof the number of elements.