Thread: Is the plan for IN(1,2,3) always the same as for =ANY('{1,2,3}') when using PQexec with no params?

Is the plan for IN(1,2,3) always the same as for =ANY('{1,2,3}') when using PQexec with no params?

From

Dmitry Koterov

Date:

17 November 2022, 03:40:13

Hi.

PG13+. Assume we have two identical queries with no arguments (as a plain text, e.g. passed to PQexec - NOT to PQexecParams!):

- one with "a=X AND b IN(...)"

- and one with "a=X and b=ANY('{...}')

The question: is it guaranteed that the planner will always choose identical plans for them (or, at least, the plan for ANY will not match an existing index worse than the plan with IN)? In my experiments it shows that the answer is "yes", but I don't know the PG internals to make sure that it's true in ALL situations.

Assumptions:

- The number of values in IN/ANY is of medium cardinality (say, 10-100 values)

- Again, all those values are static; no parameters are involved; plain simple SQL as a text

- There is also another column "a" which is compared against a constant ("a" is of 1000x lower cardinality than "b" to make it interesting), and an index on (a, b)

Example:

create table test(a bigint, b bigint);
create index test_idx on test(a, b);

truncate test;
insert into test(a, b) select round(s/10000), s from generate_series(1, 1000000) s;

# explain analyze select * from test where a=10 and b in(1,2,3);
----------------------------------------------------------------------------
Index Only Scan using test_idx on test (cost=0.42..13.31 rows=1 width=16)
Index Cond: ((a = 10) AND (b = ANY ('{1,2,3}'::bigint[])))

# explain analyze select * from test where a=10 and b=any('{1,2,3}');
----------------------------------------------------------------------------
Index Only Scan using test_idx on test (cost=0.42..13.31 rows=1 width=16)
Index Cond: ((a = 10) AND (b = ANY ('{1,2,3}'::bigint[])))

It shows exactly the same plan here. Would it always be the same, or it may be different? (E.g. my worry is that for IN variant, the planner can use the statistics against the actual values in that IN parentheses, whilst when using ANY('{...}'), I can imagine that in some circumstances, it would ignore the actual values within the literal and instead build a "generic" plan which may not utilize the index properly; is it the case?)

P.S.

A slightly correlated question was raised on StackOverflow, e.g. https://stackoverflow.com/questions/34627026/in-vs-any-operator-in-postgresql - but it wasn't about this particular usecase, they were discussing more complicated things, like when each element of an IN/ANY clause is actually a pair of elements (which makes the planner go crazy in some circumstances). My question is way simpler, I don't deal with clauses like "IN((1,2),(3,4),(5,6))" etc.; it's all about the plain 2-column selects. Unfortunately, it's super-hard to find more info about this question, because both "in" and "any" are stop-words in search engines, so they don't show good answers, including any of the PG mail list archives.

P.S.S.

Why does the answer matter? Because for "IN(1,2,3)" case, e.g. pg_stat_statements generalizes the query to "IN($1,$2,$3)" and doesn't coalesce multiple queries into one, whilst for "=ANY('{1,2,3}')", it coalesces them all to "=ANY($1)". Having those 1,2,3,... of a different cardinality all the time, the logs/stats are flooded with the useless variants of the same query basically. It also applies to e.g. logging to Datadog which normalizes the queries. We'd love to use "=ANY(...)" variant everywhere and never use IN() anymore, but are scared of getting some unexpected regressions.

Re: Is the plan for IN(1,2,3) always the same as for =ANY('{1,2,3}') when using PQexec with no params?

From

Tom Lane

Date:

17 November 2022, 03:51:20

Dmitry Koterov <dmitry.koterov@gmail.com> writes:
> PG13+. Assume we have two identical queries with no arguments (as a plain
> text, e.g. passed to PQexec - NOT to PQexecParams!):

> - one with "a=X AND b IN(...)"
> - and one with "a=X and b=ANY('{...}')

> The question: is it guaranteed that the planner will always choose
> identical plans for them (or, at least, the plan for ANY will not match an
> existing index worse than the plan with IN)?

This depends greatly on what "..." represents.  But if it's a list
of constants, they're probably equivalent.  transformAExprIn()
offers some caveats:

     * We try to generate a ScalarArrayOpExpr from IN/NOT IN, but this is only
     * possible if there is a suitable array type available.  If not, we fall
     * back to a boolean condition tree with multiple copies of the lefthand
     * expression.  Also, any IN-list items that contain Vars are handled as
     * separate boolean conditions, because that gives the planner more scope
     * for optimization on such clauses.

            regards, tom lane

Re: Is the plan for IN(1,2,3) always the same as for =ANY('{1,2,3}') when using PQexec with no params?

From

Dmitry Koterov

Date:

07 December 2022, 04:45:30

Thanks Tom!

It sounds like that for multi-value IN/ANY, they act the same way as you mentioned. But I found the difference in plans for the single-value variant.

Imagine we have a btree(a, b) index. Compare two queries for one-element use case:

1. a='aaa' AND b=ANY('{bbb}')
2. a='aaa' AND b IN('bbb')

They may produce different plans: IN() always coalesces to field='aaa' in the plan, whilst =ANY() always remains =ANY(). This causes PG to choose a "post-filtering" plan sometimes:

1. For =ANY: Index Cond: (a='aaa'); Filter: b=ANY('{bbb}')
2. For IN(): Index Cond: (a='aaa') AND (b='bbb')

Do you think that this difference is significant? Or maybe something is off in the planner, should it treat them differently by design, is it intended?

Below is an example screenshot from the production database with real data. We see that IN(20) is literally the same as =20 (green marker), whilst =any('{20}') causes PG to use post-filtering. (The cardinality of data in "type" field is low, just several unique values there in the entire table, so it probably doesn't make a big difference, whether a post-filtering is used or not, but anyways, the difference between IN() and =ANY looks a little scary. The index is "btree (cred_id, external_id, type)".)

Thanks!

On Wed, Nov 16, 2022 at 7:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Dmitry Koterov <dmitry.koterov@gmail.com> writes:
> PG13+. Assume we have two identical queries with no arguments (as a plain
> text, e.g. passed to PQexec - NOT to PQexecParams!):

> - one with "a=X AND b IN(...)"
> - and one with "a=X and b=ANY('{...}')

> The question: is it guaranteed that the planner will always choose
> identical plans for them (or, at least, the plan for ANY will not match an
> existing index worse than the plan with IN)?

This depends greatly on what "..." represents. But if it's a list
of constants, they're probably equivalent. transformAExprIn()
offers some caveats:

* We try to generate a ScalarArrayOpExpr from IN/NOT IN, but this is only
* possible if there is a suitable array type available. If not, we fall
* back to a boolean condition tree with multiple copies of the lefthand
* expression. Also, any IN-list items that contain Vars are handled as
* separate boolean conditions, because that gives the planner more scope
* for optimization on such clauses.

regards, tom lane

Attachment

image.png