Is the plan for IN(1,2,3) always the same as for =ANY('{1,2,3}') when using PQexec with no params? - Mailing list pgsql-hackers

From Dmitry Koterov
Subject Is the plan for IN(1,2,3) always the same as for =ANY('{1,2,3}') when using PQexec with no params?
Date
Msg-id CA+CZih7CB5t-EO-XE5DHaStzbVfFN8CqfyKB1Nkc9JQocVntWQ@mail.gmail.com
Whole thread Raw
Responses Re: Is the plan for IN(1,2,3) always the same as for =ANY('{1,2,3}') when using PQexec with no params?
List pgsql-hackers
Hi.

PG13+. Assume we have two identical queries with no arguments (as a plain text, e.g. passed to PQexec - NOT to PQexecParams!):

- one with "a=X AND b IN(...)"
- and one with "a=X and b=ANY('{...}')

The question: is it guaranteed that the planner will always choose identical plans for them (or, at least, the plan for ANY will not match an existing index worse than the plan with IN)? In my experiments it shows that the answer is "yes", but I don't know the PG internals to make sure that it's true in ALL situations.

Assumptions:

- The number of values in IN/ANY is of medium cardinality (say, 10-100 values)
- Again, all those values are static; no parameters are involved; plain simple SQL as a text
- There is also another column "a" which is compared against a constant ("a" is of 1000x lower cardinality than "b" to make it interesting), and an index on (a, b)

Example:

create table test(a bigint, b bigint);
create index test_idx on test(a, b);

truncate test;
insert into test(a, b) select round(s/10000), s from generate_series(1, 1000000) s;

# explain analyze select * from test where a=10 and b in(1,2,3);
----------------------------------------------------------------------------
 Index Only Scan using test_idx on test  (cost=0.42..13.31 rows=1 width=16)
   Index Cond: ((a = 10) AND (b = ANY ('{1,2,3}'::bigint[])))

# explain analyze select * from test where a=10 and b=any('{1,2,3}');
----------------------------------------------------------------------------
 Index Only Scan using test_idx on test  (cost=0.42..13.31 rows=1 width=16)
   Index Cond: ((a = 10) AND (b = ANY ('{1,2,3}'::bigint[])))

It shows exactly the same plan here. Would it always be the same, or it may be different? (E.g. my worry is that for IN variant, the planner can use the statistics against the actual values in that IN parentheses, whilst when using ANY('{...}'), I can imagine that in some circumstances, it would ignore the actual values within the literal and instead build a "generic" plan which may not utilize the index properly; is it the case?)

P.S.
A slightly correlated question was raised on StackOverflow, e.g. https://stackoverflow.com/questions/34627026/in-vs-any-operator-in-postgresql - but it wasn't about this particular usecase, they were discussing more complicated things, like when each element of an IN/ANY clause is actually a pair of elements (which makes the planner go crazy in some circumstances). My question is way simpler, I don't deal with clauses like "IN((1,2),(3,4),(5,6))" etc.; it's all about the plain 2-column selects. Unfortunately, it's super-hard to find more info about this question, because both "in" and "any" are stop-words in search engines, so they don't show good answers, including any of the PG mail list archives.

P.S.S.
Why does the answer matter? Because for "IN(1,2,3)" case, e.g. pg_stat_statements generalizes the query to "IN($1,$2,$3)" and doesn't coalesce multiple queries into one, whilst for "=ANY('{1,2,3}')", it coalesces them all to "=ANY($1)". Having those 1,2,3,... of a different cardinality all the time, the logs/stats are flooded with the useless variants of the same query basically. It also applies to e.g. logging to Datadog which normalizes the queries. We'd love to use "=ANY(...)" variant everywhere and never use IN() anymore, but are scared of getting some unexpected regressions.

pgsql-hackers by date:

Previous
From: Justin Pryzby
Date:
Subject: Re: pg_upgrade test failure
Next
From: Andres Freund
Date:
Subject: Re: CI and test improvements