Thread: Re: Replace IN VALUES with ANY in WHERE clauses during optimization

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Alena Rybakina

Date:

03 October 2024, 23:10:25

Hi!

On 03.10.2024 22:52, Ivan Kush wrote:
>
> Hello, hackers! I with my friends propose the patch to replace IN 
> VALUES to ANY in WHERE clauses.
>
> # Intro
>
> The `VALUES` in the `IN VALUES` construct is replaced with with an 
> array of values when `VALUES` contains 1 column. In the end it will be 
> replaced with ANY by the existing function makeA_Expr 
> (src/backend/nodes/makefuncs.c)
>
> This improves performance, especially if the values are small.
>
> # Patch
>
> v1-in_values_to_array.patch
>
> # How realized
>
> `VALUES` statement corresponds to `values_clause` nonterminal symbol 
> in gram.y, where it's parsed to `SelectStmt` node.
>
> `IN` is parsed in `a_expr` symbol. When it contains `VALUES` with 1 
> column, parser extracts data from `SelectStmt` and passes it
>
> to function call `makeSimpleA_Expr` where simple `A_Expr` is created.
>
> Later during optimizations of parser tree this `A_Expr` will be 
> transformed to `ArrayExpr` (already realized in Postgres)
>
>
> # Authors.
> Author: Ivan Kush <ivan.kush@tantorlabs.com>
> Author: Vadim Yacenko <vadim.yacenko@tantorlabs.com>
> Author: Alexander Simonov <alexander.simonov@tantorlabs.com>
>
> # Tests
> Implementation contains many regression tests of varying complexity, 
> which check supported features.
>
> # Platform
> This patch was checkouted from tag REL_17_STABLE. Code is developed in 
> Linux, doesn't contain platfrom-specific code, only Postgres internal 
> data structures and functions.
>
> # Documentation
> Regression tests contain many examples
>
> # Performance
> It increases performance
>
> # Example
> Let's compare result. With path the execution time is significantly 
> lower.
>
> We have a table table1 with 10000 rows.
>
> postgres=# \d table1;
>                          Table "public.table1"
>  Column |            Type             | Collation | Nullable | Default
> --------+-----------------------------+-----------+----------+---------
>  fld1   | timestamp without time zone |           | not null |
>  fld2   | bytea                       |           | not null |
> Indexes:
>     "table1index" btree (fld2)
>
> Let's execute several commands
> see commands.sql
>
> Plan no patch
> see plan_no_patch.txt
>
>
> Plan with patch
> see plan_with_patch.txt

I think you should think about putting these constants in ANY Array

EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
select * from t
     where x in (VALUES(1200), (1));
                     QUERY PLAN
---------------------------------------------------
  Seq Scan on t (actual rows=1 loops=1)
    Filter: (x = ANY ('{1200,1}'::integer[]))
(3 rows)

Anlrey Lepikhov and I recently described this in an article [0] here and 
the implementation already exists, but for now it was posted a binary 
application for testing. The acceleration is significant I agree.

[0] https://danolivo.substack.com/p/7456653e-9716-4e91-ad09-83737784c665

-- 
Regards,
Alena Rybakina
Postgres Professional

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Laurenz Albe

Date:

03 October 2024, 23:19:56

On Thu, 2024-10-03 at 23:10 +0300, Alena Rybakina wrote:
> On 03.10.2024 22:52, Ivan Kush wrote:
> >
> > Hello, hackers! I with my friends propose the patch to replace IN
> > VALUES to ANY in WHERE clauses.
> >
> > # Intro
> >
> > The `VALUES` in the `IN VALUES` construct is replaced with with an
> > array of values when `VALUES` contains 1 column. In the end it will be
> > replaced with ANY by the existing function makeA_Expr
> > (src/backend/nodes/makefuncs.c)
> >
> > This improves performance, especially if the values are small.
>
> Anlrey Lepikhov and I recently described this in an article [0] here and
> the implementation already exists, but for now it was posted a binary
> application for testing. The acceleration is significant I agree.
>
> [0] https://danolivo.substack.com/p/7456653e-9716-4e91-ad09-83737784c665

I believe that the speed improvement is significant, but who writes a
query like

  ... WHERE col IN (VALUES (1), (2), (3))

when they could write the much shorter

  ... WHERE col IN (1, 2, 3)

which is already converted to "= ANY"?

I wonder if it is worth the extra planning time to detect and improve
such queries.

Yours,
Laurenz Albe

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Tom Lane

Date:

04 October 2024, 00:08:28

Laurenz Albe <laurenz.albe@cybertec.at> writes:
> I wonder if it is worth the extra planning time to detect and improve
> such queries.

I'm skeptical too.  I'm *very* skeptical of implementing it in the
grammar as shown here --- I'd go so far as to say that that approach
cannot be accepted.  That's far too early, and it risks all sorts
of problems.  An example is that the code as given seems to assume
that all the sublists are the same length ... but we haven't checked
that yet.  I also suspect that this does not behave the same as the
original construct for purposes like resolving dissimilar types in
the VALUES list.  (In an ideal world, perhaps it'd behave the same,
but that ship sailed a couple decades ago.)

            regards, tom lane

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Ivan Kush

Date:

04 October 2024, 11:43:28

Do you mean, that I should try to execute such command?

In this patch it gives ANY

postgres=# EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
select * from table1
    where fld2 in 
(VALUES('\\230\\211\\030f\\332\\261R\\333\\021\\356\\337z5\\336\\032\\372'::bytea), 
('\\235\\204 \\004\\017\\353\\301\\200\\021\\355a&d}\\245\\312'::byte
a));

                                                                                                                             QUERY
PLAN


---------------------------------------------------------------------------------------------------------------------------------------------------------------------------


-------------------------------------------------------------------------------------------- 

Bitmap Heap Scan on table1 (actual rows=0 loops=1)
   Recheck Cond: (fld2 = ANY 

('{"\\x5c3233305c3231315c303330665c3333325c323631525c3333335c3032315c3335365c3333377a355c3333365c3033325c333732","\\x5c3233355c323034205c30303
45c3031375c3335335c3330315c3230305c3032315c3335356126647d5c3234355c333132"}'::bytea[])) 

   ->  Bitmap Index Scan on table1index (actual rows=0 loops=1)
         Index Cond: (fld2 = ANY 

('{"\\x5c3233305c3231315c303330665c3333325c323631525c3333335c3032315c3335365c3333377a355c3333365c3033325c333732","\\x5c3233355c323034205c3
030345c3031375c3335335c3330315c3230305c3032315c3335356126647d5c3234355c333132"}'::bytea[])) 

(4 rows)

Do you plan to send your implementation to the hackers?

On 10/3/24 23:10, Alena Rybakina wrote:
> I think you should think about putting these constants in ANY Array
>
> EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
> select * from t
>     where x in (VALUES(1200), (1));
>                     QUERY PLAN
> ---------------------------------------------------
>  Seq Scan on t (actual rows=1 loops=1)
>    Filter: (x = ANY ('{1200,1}'::integer[]))
> (3 rows)

-- 
Best wishes,
Ivan Kush
Tantor Labs LLC

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Ivan Kush

Date:

04 October 2024, 11:48:04

Some ORMs or proprietary software may write it mistakenly. In these 
cases this idea may be helpful.

This patch contains GUC to enable/disable this optimization

On 10/3/24 23:19, Laurenz Albe wrote:
> On Thu, 2024-10-03 at 23:10 +0300, Alena Rybakina wrote:
>> On 03.10.2024 22:52, Ivan Kush wrote:
>>> Hello, hackers! I with my friends propose the patch to replace IN
>>> VALUES to ANY in WHERE clauses.
>>>
>>> # Intro
>>>
>>> The `VALUES` in the `IN VALUES` construct is replaced with with an
>>> array of values when `VALUES` contains 1 column. In the end it will be
>>> replaced with ANY by the existing function makeA_Expr
>>> (src/backend/nodes/makefuncs.c)
>>>
>>> This improves performance, especially if the values are small.
>> Anlrey Lepikhov and I recently described this in an article [0] here and
>> the implementation already exists, but for now it was posted a binary
>> application for testing. The acceleration is significant I agree.
>>
>> [0] https://danolivo.substack.com/p/7456653e-9716-4e91-ad09-83737784c665
> I believe that the speed improvement is significant, but who writes a
> query like
>
>    ... WHERE col IN (VALUES (1), (2), (3))
>
> when they could write the much shorter
>
>    ... WHERE col IN (1, 2, 3)
>
> which is already converted to "= ANY"?
>
> I wonder if it is worth the extra planning time to detect and improve
> such queries.
>
> Yours,
> Laurenz Albe

-- 
Best wishes,
Ivan Kush
Tantor Labs LLC

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Alena Rybakina

Date:

04 October 2024, 12:15:56

On 04.10.2024 11:43, Ivan Kush wrote:
> Do you mean, that I should try to execute such command?
>
> In this patch it gives ANY
>
> postgres=# EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
> select * from table1
>    where fld2 in 
> (VALUES('\\230\\211\\030f\\332\\261R\\333\\021\\356\\337z5\\336\\032\\372'::bytea), 
> ('\\235\\204 \\004\\017\\353\\301\\200\\021\\355a&d}\\245\\312'::byte
> a));
>
                                                                                                                             QUERY

> PLAN
>
>
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------

>
> -------------------------------------------------------------------------------------------- 
>
> Bitmap Heap Scan on table1 (actual rows=0 loops=1)
>   Recheck Cond: (fld2 = ANY 
>
('{"\\x5c3233305c3231315c303330665c3333325c323631525c3333335c3032315c3335365c3333377a355c3333365c3033325c333732","\\x5c3233355c323034205c30303
> 45c3031375c3335335c3330315c3230305c3032315c3335356126647d5c3234355c333132"}'::bytea[])) 
>
>   ->  Bitmap Index Scan on table1index (actual rows=0 loops=1)
>         Index Cond: (fld2 = ANY 
>
('{"\\x5c3233305c3231315c303330665c3333325c323631525c3333335c3032315c3335365c3333377a355c3333365c3033325c333732","\\x5c3233355c323034205c3
> 030345c3031375c3335335c3330315c3230305c3032315c3335356126647d5c3234355c333132"}'::bytea[])) 
>
> (4 rows)
Yes I meant it.
>
> Do you plan to send your implementation to the hackers?
>
It was sent here [0].

[0] 
https://www.postgresql.org/message-id/21d5fca5-0c02-4afd-8c98-d0930b298a8d%40gmail.com

-- 
Regards,
Alena Rybakina
Postgres Professional

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Ivan Kush

Date:

11 November 2024, 20:29:14

I agree, your realization is better: reliability is better and debugging 
is simplier.
I've looked at the code, looks good to me. Only style notes like 
VTA/VtA, SELECT/select, etc. may be corrected

On 10/4/24 12:15, Alena Rybakina wrote:
>
> It was sent here [0].
>
> [0] 
> https://www.postgresql.org/message-id/21d5fca5-0c02-4afd-8c98-d0930b298a8d%40gmail.com
>
-- 
Best wishes,
Ivan Kush
Tantor Labs LLC

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Alexander Korotkov

Date:

09 February, 14:58:56

Hi, Alena!

On Thu, Jan 9, 2025 at 3:11 PM Alena Rybakina <a.rybakina@postgrespro.ru> wrote:
> On 04.10.2024 12:05, Andrei Lepikhov wrote:
> > We also have an implementation of VALUES -> ARRAY transformation.
> > Because enterprises must deal with users' problems, many of these
> > users employ automatically generated queries.
> > Being informed very well of the consensus about that stuff, we've
> > designed it as a library. But, looking into the code now, I see that
> > it only needs a few cycles if no one 'x IN VALUES' expression is
> > presented in the query. Who knows? It may be OK for the core.
> > So, I've rewritten the code into the patch - see it in the attachment.
> >
> > The idea is quite simple - at the same place as
> > convert_ANY_sublink_to_join, we can test the SubLink on proper VALUES
> > RTE and perform the transformation if it's convertible.
>
> I updated the patch due to the problem with the coercion types for both
> sides of the expression.
>
> We must find a common type for both leftop of the expression and rightop
> including constants for correct transformation, and at the same time
> check that the resulting types are compatible.
>
> To do this we find an operator for the two input types if it is
> possible, and also remember the target types for the left and right
> sides, and after that make a coercion.
>
> This processing is only needed in cases where we are not working with
> parameters since the final type is not specified for the parameters.

I took a look at this patch.

+ /* TODO: remember parameters */

What was intended to do here?

Also, aren't we too restrictive while requiring is_simple_values_sequence()?

For instance, I believe cases like this (containing Var) could be transformed too.

select * from t t1, lateral (select * from t t2 where t2.i in (values (t1.i), (1)));

------
Regards,
Alexander Korotkov
Supabase

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Alena Rybakina

Date:

21 February, 00:09:27

Hi!

On 09.02.2025 18:38, Alexander Korotkov wrote:

Also, aren't we too restrictive while requiring is_simple_values_sequence()?
For instance, I believe cases like this (containing Var) could be transformed too.

select * from t t1, lateral (select * from t t2 where t2.i in (values (t1.i), (1)));

I'm still working on it.

Also, I think there is quite a code duplication about construction of
SAOP between match_orclause_to_indexcol() and convert_VALUES_to_ANY()
functions.  I would like to see a refactoring as a separate first
patch, which extracts the common part into a function.

Done.

I have attached a patch. In addition to the transfer, I added the process of searching for a suitable operator and type for the left expression for input expressions: const and left expression, since they may differ from the declared types. Additionally, we convert the left expr to a type suitable for the found operator.

-- 
Regards,
Alena Rybakina
Postgres Professional

Attachment

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Alena Rybakina

Date:

28 February, 14:48:47

Hi!

On 21.02.2025 00:09, Alena Rybakina wrote:

Hi!

On 09.02.2025 18:38, Alexander Korotkov wrote:

Also, aren't we too restrictive while requiring is_simple_values_sequence()?
For instance, I believe cases like this (containing Var) could be transformed too.

select * from t t1, lateral (select * from t t2 where t2.i in (values (t1.i), (1)));

I added it and attached a patch with diff file. To be honest, I didn't find queries except for var with volatile functions where the transform can't be applied.

I'm not sure about only cases where var can refer to something outside available_rels list but I couldn't come up with an example where that's possible, what do you think?

-- 
Regards,
Alena Rybakina
Postgres Professional

Attachment

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Alena Rybakina

Date:

01 March, 14:39:11

On 28.02.2025 14:48, Alena Rybakina wrote:

Hi!
On 21.02.2025 00:09, Alena Rybakina wrote:
Hi!
On 09.02.2025 18:38, Alexander Korotkov wrote:
Also, aren't we too restrictive while requiring is_simple_values_sequence()?
For instance, I believe cases like this (containing Var) could be transformed too.

select * from t t1, lateral (select * from t t2 where t2.i in (values (t1.i), (1)));
I added it and attached a patch with diff file. To be honest, I didn't find queries except for var with volatile functions where the transform can't be applied.

I removed the function volatility check that I added in the previous version, since we already check it in is_simple_values_sequence.

I'm not sure about only cases where var can refer to something outside available_rels list but I couldn't come up with an example where that's possible, what do you think?

Considering it again, I think we can't face problems like that because we don't work with join.

I attached a diff file as a difference with the 3rd version of the patch, when we did not consider the values with var for transformation.

-- 
Regards,
Alena Rybakina
Postgres Professional

Attachment

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Alexander Korotkov

Date:

06 March, 11:23:29

Hi, Alena!

On Sat, Mar 1, 2025 at 1:39 PM Alena Rybakina <a.rybakina@postgrespro.ru> wrote:
> On 09.02.2025 18:38, Alexander Korotkov wrote:
>>
>> Also, aren't we too restrictive while requiring is_simple_values_sequence()?
>> For instance, I believe cases like this (containing Var) could be transformed too.
>>
>> select * from t t1, lateral (select * from t t2 where t2.i in (values (t1.i), (1)));
>
>
> I added it and attached a patch with diff file. To be honest, I didn't find queries except for var with volatile
functionswhere the transform can't be applied. 
>
> I removed the function volatility check that I added in the previous version, since we already check it in
is_simple_values_sequence.
>
> I'm not sure about only cases where var can refer to something outside available_rels list but I couldn't come up
withan example where that's possible, what do you think? 
>
> Considering it again, I think we can't face problems like that because we don't work with join.
>
> I attached a diff file as a difference with the 3rd version of the patch, when we did not consider the values with
varfor transformation. 

I take detailed look at makeSAOPArrayExpr() function, which is much
more complex than corresponding fragment from
match_orclause_to_indexcol().  And I found it to be mostly wrong.  We
are working in post parse-analyze stage.  That means it's too late to
do type coercion or lookup operator by name.  We have already all the
catalog objects nailed down.  In connection with that, second argument
of OpExpr shouldn't be ignored as it might contain amrelevant type
cast.  I think I've fixed the most of them problems in the attached
patchset.

------
Regards,
Alexander Korotkov
Supabase

Attachment

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Alena Rybakina

Date:

12 March, 21:11:24

Hi, Alexander!

On 06.03.2025 11:23, Alexander Korotkov wrote:

Hi, Alena!

On Sat, Mar 1, 2025 at 1:39 PM Alena Rybakina <a.rybakina@postgrespro.ru> wrote:

On 09.02.2025 18:38, Alexander Korotkov wrote:

Also, aren't we too restrictive while requiring is_simple_values_sequence()?
For instance, I believe cases like this (containing Var) could be transformed too.

select * from t t1, lateral (select * from t t2 where t2.i in (values (t1.i), (1)));

I added it and attached a patch with diff file. To be honest, I didn't find queries except for var with volatile functions where the transform can't be applied.

I removed the function volatility check that I added in the previous version, since we already check it in is_simple_values_sequence.

I'm not sure about only cases where var can refer to something outside available_rels list but I couldn't come up with an example where that's possible, what do you think?

Considering it again, I think we can't face problems like that because we don't work with join.

I attached a diff file as a difference with the 3rd version of the patch, when we did not consider the values with var for transformation.

I take detailed look at makeSAOPArrayExpr() function, which is much
more complex than corresponding fragment from
match_orclause_to_indexcol().  And I found it to be mostly wrong.  We
are working in post parse-analyze stage.  That means it's too late to
do type coercion or lookup operator by name.  We have already all the
catalog objects nailed down.  In connection with that, second argument
of OpExpr shouldn't be ignored as it might contain amrelevant type
cast.  I think I've fixed the most of them problems in the attached
patchset.

I agree with your conclusion and changes.

-- 
Regards,
Alena Rybakina
Postgres Professional

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Alexander Korotkov

Date:

29 March, 14:03:07

On Wed, Mar 12, 2025 at 8:11 PM Alena Rybakina
<a.rybakina@postgrespro.ru> wrote:
> On 06.03.2025 11:23, Alexander Korotkov wrote:
>
> Hi, Alena!
>
> On Sat, Mar 1, 2025 at 1:39 PM Alena Rybakina <a.rybakina@postgrespro.ru> wrote:
>
> On 09.02.2025 18:38, Alexander Korotkov wrote:
>
> Also, aren't we too restrictive while requiring is_simple_values_sequence()?
> For instance, I believe cases like this (containing Var) could be transformed too.
>
> select * from t t1, lateral (select * from t t2 where t2.i in (values (t1.i), (1)));
>
> I added it and attached a patch with diff file. To be honest, I didn't find queries except for var with volatile
functionswhere the transform can't be applied. 
>
> I removed the function volatility check that I added in the previous version, since we already check it in
is_simple_values_sequence.
>
> I'm not sure about only cases where var can refer to something outside available_rels list but I couldn't come up
withan example where that's possible, what do you think? 
>
> Considering it again, I think we can't face problems like that because we don't work with join.
>
> I attached a diff file as a difference with the 3rd version of the patch, when we did not consider the values with
varfor transformation. 
>
> I take detailed look at makeSAOPArrayExpr() function, which is much
> more complex than corresponding fragment from
> match_orclause_to_indexcol().  And I found it to be mostly wrong.  We
> are working in post parse-analyze stage.  That means it's too late to
> do type coercion or lookup operator by name.  We have already all the
> catalog objects nailed down.  In connection with that, second argument
> of OpExpr shouldn't be ignored as it might contain amrelevant type
> cast.  I think I've fixed the most of them problems in the attached
> patchset.
>
>
> I agree with your conclusion and changes.

I've revised the patchset.  Mostly comments/commit messages and minor
refactoring.  One thing I have to fix: we must do
IncrementVarSublevelsUp() unconditionally for all expressions as Vars
could be deeper inside.  Also, I've removed our subquery check
completely.  Not sure if we need it at all.  I'll further analyze
that.

------
Regards,
Alexander Korotkov
Supabase

Attachment

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Alena Rybakina

Date:

29 March, 22:03:00

Hi!

On 29.03.2025 14:03, Alexander Korotkov wrote:

On Wed, Mar 12, 2025 at 8:11 PM Alena Rybakina
<a.rybakina@postgrespro.ru> wrote:

On 06.03.2025 11:23, Alexander Korotkov wrote:

Hi, Alena!

On Sat, Mar 1, 2025 at 1:39 PM Alena Rybakina <a.rybakina@postgrespro.ru> wrote:

On 09.02.2025 18:38, Alexander Korotkov wrote:

Also, aren't we too restrictive while requiring is_simple_values_sequence()?
For instance, I believe cases like this (containing Var) could be transformed too.

select * from t t1, lateral (select * from t t2 where t2.i in (values (t1.i), (1)));

I added it and attached a patch with diff file. To be honest, I didn't find queries except for var with volatile functions where the transform can't be applied.

I removed the function volatility check that I added in the previous version, since we already check it in is_simple_values_sequence.

I'm not sure about only cases where var can refer to something outside available_rels list but I couldn't come up with an example where that's possible, what do you think?

Considering it again, I think we can't face problems like that because we don't work with join.

I attached a diff file as a difference with the 3rd version of the patch, when we did not consider the values with var for transformation.

I take detailed look at makeSAOPArrayExpr() function, which is much
more complex than corresponding fragment from
match_orclause_to_indexcol().  And I found it to be mostly wrong.  We
are working in post parse-analyze stage.  That means it's too late to
do type coercion or lookup operator by name.  We have already all the
catalog objects nailed down.  In connection with that, second argument
of OpExpr shouldn't be ignored as it might contain amrelevant type
cast.  I think I've fixed the most of them problems in the attached
patchset.


I agree with your conclusion and changes.

I've revised the patchset.  Mostly comments/commit messages and minor
refactoring.  Also, I've removed our subquery check
completely.  Not sure if we need it at all.  I'll further analyze
that.

I agree with your changes, the code really started to look better and more understandable. Thank you!

As for function - it checked that values didn't contain any subquery elements (if they consists RangeTblEntry type variables) and when you removed it you caused the problem with sublevel parameter.

One thing I have to fix: we must do
IncrementVarSublevelsUp() unconditionally for all expressions as Vars
could be deeper inside.

Yes, I'm looking at it too, I've just understood that it was needed for subqueries - they can contain var elements which needs decrease the sublevel parameter.

for example for the query:

EXPLAIN (COSTS OFF)
SELECT ten FROM onek t
WHERE unique1 IN (VALUES (0), ((2 IN (SELECT unique2 FROM onek c
WHERE c.unique2 = t.unique1))::integer));

We are interested in this element: ((2 IN (SELECT unique2 FROM onek c WHERE c.unique2 = t.unique1))

It is funcexpr object with RabgeTblEntry variable. I highlighted

WARNING: 1{FUNCEXPR :funcid 2558 :funcresulttype 23 :funcretset false :funcvariadic false :funcformat 1 :funccollid 0 :inputcollid 0 :args ({SUBLINK :subLinkType 2 :subLinkId 0 :testexpr {OPEXPR :opno 96 :opfuncid 65 :opresulttype 16 :opretset false :opcollid 0 :inputcollid 0 :args ({CONST :consttype 23 :consttypmod -1 :constcollid 0 :constlen 4 :constbyval true :constisnull false :location -1 :constvalue 4 [ 2 0 0 0 0 0 0 0 ]} {PARAM :paramkind 2 :paramid 1 :paramtype 23 :paramtypmod -1 :paramcollid 0 :location -1}) :location -1} :operName ("=") :subselect {QUERY :commandType 1 :querySource 0 :canSetTag true :utilityStmt <> :resultRelation 0 :hasAggs false :hasWindowFuncs false :hasTargetSRFs false :hasSubLinks false :hasDistinctOn false :hasRecursive false :hasModifyingCTE false :hasForUpdate false :hasRowSecurity false :hasGroupRTE false :isReturn false :cteList <> :rtable ({RANGETBLENTRY :alias {ALIAS :aliasname c :colnames <>} :eref {ALIAS :aliasname c :colnames ("unique1" "unique2" "two" "four" "ten" "twenty" "hundred" "thousand" "twothousand" "fivethous" "tenthous" "odd" "even" "stringu1" "stringu2" "string4")} :rtekind 0 :relid 32795 :inh true :relkind r :rellockmode 1 :perminfoindex 1 :tablesample <> :lateral false :inFromCl true :securityQuals <>}) :rteperminfos ({RTEPERMISSIONINFO :relid 32795 :inh true :requiredPerms 2 :checkAsUser 0 :selectedCols (b 9) :insertedCols (b) :updatedCols (b)}) :jointree {FROMEXPR :fromlist ({RANGETBLREF :rtindex 1}) :quals {OPEXPR :opno 96 :opfuncid 65 :opresulttype 16 :opretset false :opcollid 0 :inputcollid 0 :args ({VAR :varno 1 :varattno 2 :vartype 23 :vartypmod -1 :varcollid 0 :varnullingrels (b) :varlevelsup 0 :varreturningtype 0 :varnosyn 1 :varattnosyn 2 :location -1} {VAR :varno 1 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0 :varnullingrels (b) :varlevelsup 2 :varreturningtype 0 :varnosyn 1 :varattnosyn 1 :location -1}) :location -1}} :mergeActionList <> :mergeTargetRelation 0 :mergeJoinCondition <> :targetList ({TARGETENTRY :expr {VAR :varno 1 :varattno 2 :vartype 23 :vartypmod -1 :varcollid 0 :varnullingrels (b) :varlevelsup 0 :varreturningtype 0 :varnosyn 1 :varattnosyn 2 :location -1} :resno 1 :resname unique2 :ressortgroupref 0 :resorigtbl 32795 :resorigcol 2 :resjunk false}) :override 0 :onConflict <> :returningOldAlias <> :returningNewAlias <> :returningList <> :groupClause <> :groupDistinct false :groupingSets <> :havingQual <> :windowClause <> :distinctClause <> :sortClause <> :limitOffset <> :limitCount <> :limitOption 0 :rowMarks <> :setOperations <> :constraintDeps <> :withCheckOptions <> :stmt_location -1 :stmt_len -1} :location -1}) :location -1}

I highlighted in bold the var we need - since it is in a subquery in the in expression will be flattened, all elements contained in it should decrease the level number by one, since they will belong to the subtree located above it. Because of that condition, this did not happen.

I generally agree with you that it is better to remove that condition. The function IncrementVarSublevelsUp essentially goes through the structures below and will decrease the level of only the vars for which this needs to be done, and the condition with 1 will protect us from touching those vars that should not. So the varlevelsup for this var should be 1.

I am currently investigating whether this transformation will be fair for all cases; I have not found any problems yet.

-- 
Regards,
Alena Rybakina
Postgres Professional

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Alexander Korotkov

Date:

30 March, 00:59:01

Hi, Alena!

On Sat, Mar 29, 2025 at 9:03 PM Alena Rybakina
<a.rybakina@postgrespro.ru> wrote:
> On 29.03.2025 14:03, Alexander Korotkov wrote:
>> One thing I have to fix: we must do
>> IncrementVarSublevelsUp() unconditionally for all expressions as Vars
>> could be deeper inside.
>
> Yes, I'm looking at it too, I've just understood that it was needed for subqueries - they can contain var elements
whichneeds decrease the sublevel parameter. 
>
> for example for the query:
>
> EXPLAIN (COSTS OFF)
> SELECT ten FROM onek t
> WHERE unique1 IN (VALUES (0), ((2 IN (SELECT unique2 FROM onek c
>   WHERE c.unique2 = t.unique1))::integer));
>
> We are interested in this element: ((2 IN (SELECT unique2 FROM onek c  WHERE c.unique2 = t.unique1))
>
> It is funcexpr object with RabgeTblEntry variable. I highlighted
>
> WARNING:  1{FUNCEXPR :funcid 2558 :funcresulttype 23 :funcretset false :funcvariadic false :funcformat 1 :funccollid
0:inputcollid 0 :args ({SUBLINK :subLinkType 2 :subLinkId 0 :testexpr {OPEXPR :opno 96 :opfuncid 65 :opresulttype 16
:opretsetfalse :opcollid 0 :inputcollid 0 :args ({CONST :consttype 23 :consttypmod -1 :constcollid 0 :constlen 4
:constbyvaltrue :constisnull false :location -1 :constvalue 4 [ 2 0 0 0 0 0 0 0 ]} {PARAM :paramkind 2 :paramid 1
:paramtype23 :paramtypmod -1 :paramcollid 0 :location -1}) :location -1} :operName ("=") :subselect {QUERY :commandType
1:querySource 0 :canSetTag true :utilityStmt <> :resultRelation 0 :hasAggs false :hasWindowFuncs false :hasTargetSRFs
false:hasSubLinks false :hasDistinctOn false :hasRecursive false :hasModifyingCTE false :hasForUpdate false
:hasRowSecurityfalse :hasGroupRTE false :isReturn false :cteList <> :rtable ({RANGETBLENTRY :alias {ALIAS :aliasname c
:colnames<>} :eref {ALIAS :aliasname c :colnames ("unique1" "unique2" "two" "four" "ten" "twenty" "hundred" "thousand"
"twothousand""fivethous" "tenthous" "odd" "even" "stringu1" "stringu2" "string4")} :rtekind 0 :relid 32795 :inh true
:relkindr :rellockmode 1 :perminfoindex 1 :tablesample <> :lateral false :inFromCl true :securityQuals <>})
:rteperminfos({RTEPERMISSIONINFO :relid 32795 :inh true :requiredPerms 2 :checkAsUser 0 :selectedCols (b 9)
:insertedCols(b) :updatedCols (b)}) :jointree {FROMEXPR :fromlist ({RANGETBLREF :rtindex 1}) :quals {OPEXPR :opno 96
:opfuncid65 :opresulttype 16 :opretset false :opcollid 0 :inputcollid 0 :args ({VAR :varno 1 :varattno 2 :vartype 23
:vartypmod-1 :varcollid 0 :varnullingrels (b) :varlevelsup 0 :varreturningtype 0 :varnosyn 1 :varattnosyn 2 :location
-1}{VAR :varno 1 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0 :varnullingrels (b) :varlevelsup 2
:varreturningtype0 :varnosyn 1 :varattnosyn 1 :location -1}) :location -1}} :mergeActionList <> :mergeTargetRelation 0
:mergeJoinCondition<> :targetList ({TARGETENTRY :expr {VAR :varno 1 :varattno 2 :vartype 23 :vartypmod -1 :varcollid 0
:varnullingrels(b) :varlevelsup 0 :varreturningtype 0 :varnosyn 1 :varattnosyn 2 :location -1} :resno 1 :resname
unique2:ressortgroupref 0 :resorigtbl 32795 :resorigcol 2 :resjunk false}) :override 0 :onConflict <>
:returningOldAlias<> :returningNewAlias <> :returningList <> :groupClause <> :groupDistinct false :groupingSets <>
:havingQual<> :windowClause <> :distinctClause <> :sortClause <> :limitOffset <> :limitCount <> :limitOption 0
:rowMarks<> :setOperations <> :constraintDeps <> :withCheckOptions <> :stmt_location -1 :stmt_len -1} :location -1})
:location-1} 
>
>
> I highlighted in bold the var we need - since it is in a subquery in the in expression will be flattened, all
elementscontained in it should decrease the level number by one, since they will belong to the subtree located above
it.Because of that condition, this did not happen. 
>
> I generally agree with you that it is better to remove that condition. The function IncrementVarSublevelsUp
essentiallygoes through the structures below and will decrease the level of only the vars for which this needs to be
done,and the condition with 1 will protect us from touching those vars that should not. So the varlevelsup for this var
shouldbe 1. 
>
> I am currently investigating whether this transformation will be fair for all cases; I have not found any problems
yet.

Thank you for your feedback.  I appreciate you're also looking for the
potential problems.  On thing to highlight: doing
IncrementVarSublevelsUp() unconditionally is required not just for
subqueries.  Consider the following example.

SELECT * FROM t WHERE val1 IN (VALUES (val2), (val2 +1));

The second value contain Var, which needs IncrementVarSublevelsUp(),
but the top node is OpExpr.

------
Regards,
Alexander Korotkov
Supabase

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Alena Rybakina

Date:

01 April, 02:10:58

Hi, Alexander!

On 30.03.2025 00:59, Alexander Korotkov wrote:

Hi, Alena!

On Sat, Mar 29, 2025 at 9:03 PM Alena Rybakina
<a.rybakina@postgrespro.ru> wrote:

On 29.03.2025 14:03, Alexander Korotkov wrote:

One thing I have to fix: we must do
IncrementVarSublevelsUp() unconditionally for all expressions as Vars
could be deeper inside.

Yes, I'm looking at it too, I've just understood that it was needed for subqueries - they can contain var elements which needs decrease the sublevel parameter.

for example for the query:

EXPLAIN (COSTS OFF)
SELECT ten FROM onek t
WHERE unique1 IN (VALUES (0), ((2 IN (SELECT unique2 FROM onek c  WHERE c.unique2 = t.unique1))::integer));

We are interested in this element: ((2 IN (SELECT unique2 FROM onek c  WHERE c.unique2 = t.unique1))

It is funcexpr object with RabgeTblEntry variable. I highlighted

WARNING:  1{FUNCEXPR :funcid 2558 :funcresulttype 23 :funcretset false :funcvariadic false :funcformat 1 :funccollid 0 :inputcollid 0 :args ({SUBLINK :subLinkType 2 :subLinkId 0 :testexpr {OPEXPR :opno 96 :opfuncid 65 :opresulttype 16 :opretset false :opcollid 0 :inputcollid 0 :args ({CONST :consttype 23 :consttypmod -1 :constcollid 0 :constlen 4 :constbyval true :constisnull false :location -1 :constvalue 4 [ 2 0 0 0 0 0 0 0 ]} {PARAM :paramkind 2 :paramid 1 :paramtype 23 :paramtypmod -1 :paramcollid 0 :location -1}) :location -1} :operName ("=") :subselect {QUERY :commandType 1 :querySource 0 :canSetTag true :utilityStmt <> :resultRelation 0 :hasAggs false :hasWindowFuncs false :hasTargetSRFs false :hasSubLinks false :hasDistinctOn false :hasRecursive false :hasModifyingCTE false :hasForUpdate false :hasRowSecurity false :hasGroupRTE false :isReturn false :cteList <> :rtable ({RANGETBLENTRY :alias {ALIAS :aliasname c :colnames <>} :eref {ALIAS :aliasname c :colnames ("unique1" "unique2" "two" "four" "ten" "twenty" "hundred" "thousand" "twothousand" "fivethous" "tenthous" "odd" "even" "stringu1" "stringu2" "string4")} :rtekind 0 :relid 32795 :inh true :relkind r :rellockmode 1 :perminfoindex 1 :tablesample <> :lateral false :inFromCl true :securityQuals <>}) :rteperminfos ({RTEPERMISSIONINFO :relid 32795 :inh true :requiredPerms 2 :checkAsUser 0 :selectedCols (b 9) :insertedCols (b) :updatedCols (b)}) :jointree {FROMEXPR :fromlist ({RANGETBLREF :rtindex 1}) :quals {OPEXPR :opno 96 :opfuncid 65 :opresulttype 16 :opretset false :opcollid 0 :inputcollid 0 :args ({VAR :varno 1 :varattno 2 :vartype 23 :vartypmod -1 :varcollid 0 :varnullingrels (b) :varlevelsup 0 :varreturningtype 0 :varnosyn 1 :varattnosyn 2 :location -1} {VAR :varno 1 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0 :varnullingrels (b) :varlevelsup 2 :varreturningtype 0 :varnosyn 1 :varattnosyn 1 :location -1}) :location -1}} :mergeActionList <> :mergeTargetRelation 0 :mergeJoinCondition <> :targetList ({TARGETENTRY :expr {VAR :varno 1 :varattno 2 :vartype 23 :vartypmod -1 :varcollid 0 :varnullingrels (b) :varlevelsup 0 :varreturningtype 0 :varnosyn 1 :varattnosyn 2 :location -1} :resno 1 :resname unique2 :ressortgroupref 0 :resorigtbl 32795 :resorigcol 2 :resjunk false}) :override 0 :onConflict <> :returningOldAlias <> :returningNewAlias <> :returningList <> :groupClause <> :groupDistinct false :groupingSets <> :havingQual <> :windowClause <> :distinctClause <> :sortClause <> :limitOffset <> :limitCount <> :limitOption 0 :rowMarks <> :setOperations <> :constraintDeps <> :withCheckOptions <> :stmt_location -1 :stmt_len -1} :location -1}) :location -1}


I highlighted in bold the var we need - since it is in a subquery in the in expression will be flattened, all elements contained in it should decrease the level number by one, since they will belong to the subtree located above it. Because of that condition, this did not happen.

I generally agree with you that it is better to remove that condition. The function IncrementVarSublevelsUp essentially goes through the structures below and will decrease the level of only the vars for which this needs to be done, and the condition with 1 will protect us from touching those vars that should not. So the varlevelsup for this var should be 1.

I am currently investigating whether this transformation will be fair for all cases; I have not found any problems yet.

Thank you for your feedback.  I appreciate you're also looking for the
potential problems.  On thing to highlight: doing
IncrementVarSublevelsUp() unconditionally is required not just for
subqueries.  Consider the following example.

SELECT * FROM t WHERE val1 IN (VALUES (val2), (val2 +1));

The second value contain Var, which needs IncrementVarSublevelsUp(),
but the top node is OpExpr.

Yes, I agree with that - this is precisely why we need to call IncrementVarSublevelsUp() unconditionally for all types.

As you mentioned earlier, Var nodes can be nested more deeply, and skipping this step could lead to incorrect behavior in those cases. So, now it works fine)

Thank you for an example.

I analyzed this transformation with various types of values that might be used in conditions.

First, I verified whether the change would affect semantics, especially in the presence of NULL elements. The only notable behavior I observed was
the coercion of NULL to an integer type. However, this behavior remains the same even without our transformation, so everything is fine.

To test this, I created a onek table containing NULL values:

CREATE TABLE onek ( unique1 INT, unique2 INT, ten TEXT );

INSERT INTO onek VALUES (0, 10, 'zero'), (1, NULL, 'one'), (2, 2, 'two'), (3, NULL, 'three'), (4, NULL, 'only null match'), (5, NULL, 'two + null match'), (6, NULL, 'no match');

1.1) explain analyze SELECT ten FROM onek t WHERE unique1 IN (VALUES (0), (NULL));

QUERY PLAN ----------------------------------------------------------------------------------------------------- Seq Scan on onek t (cost=0.00..25.00 rows=6 width=32) (actual time=0.274..0.282 rows=1.00 loops=1) Filter: (unique1 = ANY ('{0,NULL}'::integer[])) Rows Removed by Filter: 6 Buffers: shared read=1 Planning: Buffers: shared hit=52 read=23 Planning Time: 2.124 ms Execution Time: 0.374 ms (8 rows)

The query plan without our patch:

QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Hash Semi Join (cost=0.05..25.33 rows=12 width=32) (actual time=0.064..0.081 rows=1.00 loops=1) Hash Cond: (t.unique1 = "*VALUES*".column1) Buffers: shared hit=1 -> Seq Scan on onek t (cost=0.00..22.00 rows=1200 width=36) (actual time=0.028..0.034 rows=7.00 loops=1) Buffers: shared hit=1 -> Hash (cost=0.03..0.03 rows=2 width=4) (actual time=0.018..0.020 rows=1.00 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=2 width=4) (actual time=0.004..0.008 rows=2.00 loops=1) Planning: Buffers: shared hit=8 Planning Time: 0.513 ms Execution Time: 0.182 ms (12 rows)

I added another tuple with a NULL value in the unique1 column to verify that the semantics remain correct when comparing NULL with NULL.

I didn't observe any issues, as the behavior was identical to how it worked before applying the patch.

insert into onek values (NULL, 1, 'match');

1.2) explain analyze SELECT ten FROM onek t WHERE unique1 IN (VALUES (0), (NULL));

QUERY PLAN ----------------------------------------------------------------------------------------------------- Seq Scan on onek t (cost=0.00..25.00 rows=6 width=32) (actual time=0.053..0.063 rows=1.00 loops=1) Filter: (unique1 = ANY ('{0,NULL}'::integer[])) Rows Removed by Filter: 7 Buffers: shared hit=1 Planning Time: 0.178 ms Execution Time: 0.109 ms (6 rows)

The query plan without our patch:

QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Hash Semi Join (cost=0.05..25.33 rows=12 width=32) (actual time=0.076..0.090 rows=1.00 loops=1) Hash Cond: (t.unique1 = "*VALUES*".column1) Buffers: shared hit=1 -> Seq Scan on onek t (cost=0.00..22.00 rows=1200 width=36) (actual time=0.043..0.048 rows=8.00 loops=1) Buffers: shared hit=1 -> Hash (cost=0.03..0.03 rows=2 width=4) (actual time=0.017..0.018 rows=1.00 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=2 width=4) (actual time=0.005..0.008 rows=2.00 loops=1) Planning Time: 0.312 ms Execution Time: 0.174 ms (10 rows)

Since the subquery became correlated with our transformation when it included var objects, I also checked whether our transformation had any negative impact on query performance.

To do this, I added unique values to the table, forcing the subquery to be re-executed for each outer tuple. I observed a little performance degradation (see the number of shared hit in a query 2.1) and
the worst scenario involving nested VALUES clauses, where the performance impact is substantial (a query 2.2).

As for the 2.2 query, without our transformation, the query requires scanning significantly fewer blocks - roughly half as many (see the number of shared hit).

insert into onek select id, id, 'match' from generate_series(1,10000) id;

2.1) explain analyze SELECT ten FROM onek t WHERE unique1 IN ( VALUES (0), ((2 IN ( SELECT unique2 FROM onek c WHERE c.unique2 = t.unique1 ))::integer) );

QUERY PLAN ----------------------------------------------------------------------------------------------------------------- Seq Scan on onek t (cost=0.00..901463.05 rows=51 width=6) (actual time=5.703..15131.084 rows=1.00 loops=1) Filter: (unique1 = ANY (ARRAY[0, ((ANY (2 = (SubPlan 1).col1)))::integer])) Rows Removed by Filter: 10007 Buffers: shared hit=550389 SubPlan 1 -> Seq Scan on onek c (cost=0.00..180.10 rows=1 width=4) (actual time=0.818..1.511 rows=1.00 loops=10008) Filter: (unique2 = t.unique1) Rows Removed by Filter: 10005 Buffers: shared hit=550334 Planning Time: 0.279 ms Execution Time: 15131.148 ms (11 rows)

The query plan without our patch:

QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ Nested Loop Semi Join (cost=0.00..385.00 rows=66 width=32) (actual time=0.099..19935.638 rows=1.00 loops=1) Buffers: shared hit=550334 -> Seq Scan on onek t (cost=0.00..121.00 rows=6600 width=36) (actual time=0.082..1.723 rows=10008.00 loops=1) Buffers: shared hit=55 -> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=1 width=4) (actual time=1.991..1.991 rows=0.00 loops=10008) Filter: (t.unique1 = column1) Rows Removed by Filter: 2 Buffers: shared hit=550279 SubPlan 1 -> Seq Scan on onek c (cost=0.00..137.50 rows=33 width=4) (actual time=1.106..1.989 rows=1.00 loops=10007) Filter: (unique2 = t.unique1) Rows Removed by Filter: 10005 Buffers: shared hit=550279 Planning Time: 0.393 ms Execution Time: 19935.710 ms (15 rows)

2.2) explain analyze SELECT ten FROM onek t WHERE unique1 IN ( VALUES (0), ((2 IN ( SELECT unique2 FROM onek c WHERE c.unique2 in ((values(0),((2 in (select unique2 from onek c1 where c1.unique2 = t.unique1))::integer))) ))::integer) );

-------------------------------------------------------------------------------------------------------------------------- Seq Scan on onek t (cost=0.00..2954341.54 rows=51 width=6) (actual time=8.095..93017.686 rows=1.00 loops=1) Filter: (unique1 = ANY (ARRAY[0, ((ANY (2 = (SubPlan 2).col1)))::integer])) Rows Removed by Filter: 10007 Buffers: shared hit=1100935 SubPlan 2 -> Seq Scan on onek c (cost=180.10..410.24 rows=2 width=4) (actual time=9.290..9.292 rows=0.00 loops=10008) Filter: (unique2 = ANY (ARRAY[0, ((ANY (2 = (hashed SubPlan 1).col1)))::integer])) Rows Removed by Filter: 10008 Buffers: shared hit=1100880 SubPlan 1 -> Seq Scan on onek c1 (cost=0.00..180.10 rows=1 width=4) (actual time=1.183..2.291 rows=1.00 loops=10008) Filter: (unique2 = t.unique1) Rows Removed by Filter: 10007 Buffers: shared hit=550440 Planning: Buffers: shared hit=19 Planning Time: 0.733 ms Execution Time: 93017.795 ms (18 rows)

The query plan without our patch:

-------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop Semi Join (cost=0.00..555.40 rows=2 width=6) (actual time=0.046..50110.402 rows=1.00 loops=1) Buffers: shared hit=1100825 -> Seq Scan on onek t (cost=0.00..155.08 rows=10008 width=10) (actual time=0.028..1.374 rows=10008.00 loops=1) Buffers: shared hit=55 -> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=1 width=4) (actual time=5.006..5.006 rows=0.00 loops=10008) Filter: (t.unique1 = column1) Rows Removed by Filter: 2 Buffers: shared hit=1100770 SubPlan 2 -> Hash Semi Join (cost=0.05..181.42 rows=2 width=4) (actual time=5.003..5.003 rows=0.00 loops=10007) Hash Cond: (c.unique2 = "*VALUES*_1".column1) Buffers: shared hit=1100770 -> Seq Scan on onek c (cost=0.00..155.08 rows=10008 width=4) (actual time=0.004..1.165 rows=10008.00 loops=10007) Buffers: shared hit=550385 -> Hash (cost=0.03..0.03 rows=2 width=4) (actual time=1.921..1.921 rows=2.00 loops=10007) Buckets: 1024 Batches: 1 Memory Usage: 9kB Buffers: shared hit=550385 -> Values Scan on "*VALUES*_1" (cost=0.00..0.03 rows=2 width=4) (actual time=0.000..1.920 rows=2.00 loops=10007) Buffers: shared hit=550385 SubPlan 1 -> Seq Scan on onek c1 (cost=0.00..180.10 rows=1 width=4) (actual time=1.010..1.917 rows=1.00 loops=10007) Filter: (unique2 = t.unique1) Rows Removed by Filter: 10007 Buffers: shared hit=550385 Planning: Buffers: shared hit=6 Planning Time: 0.874 ms Execution Time: 50110.531 ms (28 rows)

If we build an index, the number of scanned blocks remains the same or even decreases, so I don't observe any performance degradation in that case.

Does this mean that we should consider applying this transformation later, perhaps where the OR->ANY transformation is performed, at least for cases where VALUES clauses contain subqueries or Var nodes?

create index on onek (unique2);

3.1)explain analyze SELECT ten FROM onek t WHERE unique1 IN ( VALUES (0), ((2 IN ( SELECT unique2 FROM onek c WHERE c.unique2 = t.unique1 ))::integer) );

--------------------------------------------------------------------------------------------------------------------------------------------- Seq Scan on onek t (cost=0.00..23198.50 rows=51 width=6) (actual time=0.142..60.369 rows=1.00 loops=1) Filter: (unique1 = ANY (ARRAY[0, ((ANY (2 = (SubPlan 1).col1)))::integer])) Rows Removed by Filter: 10007 Buffers: shared hit=20070 SubPlan 1 -> Index Only Scan using onek_unique2_idx on onek c (cost=0.29..4.30 rows=1 width=4) (actual time=0.004..0.005 rows=1.00 loops=10008) Index Cond: (unique2 = t.unique1) Heap Fetches: 0 Index Searches: 10007 Buffers: shared hit=20015 Planning: Buffers: shared hit=121 Planning Time: 2.426 ms Execution Time: 60.512 ms (14 rows)

The query plan without our patch:

---------------------- Nested Loop Semi Join (cost=0.00..555.40 rows=2 width=6) (actual time=0.118..59.554 rows=1.00 loops=1) Buffers: shared hit=19983 read=85 -> Seq Scan on onek t (cost=0.00..155.08 rows=10008 width=10) (actual time=0.090..1.834 rows=10008.00 loops=1) Buffers: shared read=55 -> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=1 width=4) (actual time=0.005..0.005 rows=0.00 loops=10008) Filter: (t.unique1 = column1) Rows Removed by Filter: 2 Buffers: shared hit=19983 read=30 SubPlan 1 -> Index Only Scan using onek_unique2_idx on onek c (cost=0.29..4.30 rows=1 width=4) (actual time=0.004..0.004 rows=1.00 loops=10007) Index Cond: (unique2 = t.unique1) Heap Fetches: 0 Index Searches: 10006 Buffers: shared hit=19983 read=30 Planning: Buffers: shared hit=120 read=24 Planning Time: 3.731 ms Execution Time: 59.644 ms (18 rows)

3.2) explain analyze SELECT ten FROM onek t WHERE unique1 IN ( VALUES (0), ((2 IN ( SELECT unique2 FROM onek c WHERE c.unique2 in ((values(0),((2 in (select unique2 from onek c1 where c1.unique2 = t.unique1))::integer))) ))::integer) );

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ Nested Loop Semi Join (cost=0.00..555.40 rows=2 width=6) (actual time=0.055..131.421 rows=1.00 loops=1) Buffers: shared hit=40090 -> Seq Scan on onek t (cost=0.00..155.08 rows=10008 width=10) (actual time=0.038..1.775 rows=10008.00 loops=1) Buffers: shared hit=55 -> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=1 width=4) (actual time=0.013..0.013 rows=0.00 loops=10008) Filter: (t.unique1 = column1) Rows Removed by Filter: 2 Buffers: shared hit=40035 SubPlan 2 -> Nested Loop (cost=0.32..8.67 rows=2 width=4) (actual time=0.012..0.012 rows=0.00 loops=10007) Buffers: shared hit=40035 -> Unique (cost=0.04..0.04 rows=2 width=4) (actual time=0.008..0.008 rows=1.00 loops=10007) Buffers: shared hit=20016 -> Sort (cost=0.04..0.04 rows=2 width=4) (actual time=0.007..0.007 rows=2.00 loops=10007) Sort Key: "*VALUES*_1".column1 Sort Method: quicksort Memory: 25kB Buffers: shared hit=20016 -> Values Scan on "*VALUES*_1" (cost=0.00..0.03 rows=2 width=4) (actual time=0.000..0.005 rows=2.00 loops=10007) Buffers: shared hit=20013 SubPlan 1 -> Index Only Scan using onek_unique2_idx on onek c1 (cost=0.29..4.30 rows=1 width=4) (actual time=0.004..0.004 rows=1.00 loops=10007) Index Cond: (unique2 = t.unique1) Heap Fetches: 0 Index Searches: 10006 Buffers: shared hit=20013 -> Index Only Scan using onek_unique2_idx on onek c (cost=0.29..4.30 rows=1 width=4) (actual time=0.003..0.003 rows=0.00 loops=10009) Index Cond: (unique2 = "*VALUES*_1".column1) Heap Fetches: 0 Index Searches: 10009 Buffers: shared hit=20019 Planning: Buffers: shared hit=10 Planning Time: 1.183 ms Execution Time: 131.616 ms (34 rows)

The query plan without our patch:

------------------------------------------------------------------------------------------------------------------------------------------------------ Seq Scan on onek t (cost=0.00..87875.20 rows=51 width=6) (actual time=0.217..174.053 rows=1.00 loops=1) Filter: (unique1 = ANY (ARRAY[0, ((ANY (2 = (SubPlan 2).col1)))::integer])) Rows Removed by Filter: 10007 Buffers: shared hit=40002 read=85 SubPlan 2 -> Index Only Scan using onek_unique2_idx on onek c (cost=4.60..12.92 rows=2 width=4) (actual time=0.007..0.007 rows=0.00 loops=10008) Index Cond: (unique2 = ANY (ARRAY[0, ((ANY (2 = (hashed SubPlan 1).col1)))::integer])) Heap Fetches: 0 Index Searches: 10008 Buffers: shared hit=20017 SubPlan 1 -> Index Only Scan using onek_unique2_idx on onek c1 (cost=0.29..4.30 rows=1 width=4) (actual time=0.006..0.006 rows=1.00 loops=10008) Index Cond: (unique2 = t.unique1) Heap Fetches: 0 Index Searches: 10007 Buffers: shared hit=19985 read=30 Planning: Buffers: shared hit=91 read=25 Planning Time: 2.858 ms Execution Time: 174.233 ms (20 rows)

I don't observe any performance degradation if VALUES contains constants.

4.1) explain analyze SELECT ten FROM onek t WHERE unique1 IN ( VALUES (0), ((2 IN ( SELECT unique2 FROM onek c WHERE c.unique2 in ((values(0),(2))))::integer)) );

QUERY PLAN ------------------------------------------------------------------------------------------------------------- Seq Scan on onek t (cost=180.11..410.25 rows=2 width=6) (actual time=5.014..13.256 rows=3.00 loops=1) Filter: (unique1 = ANY (ARRAY[0, ((ANY (2 = (hashed SubPlan 1).col1)))::integer])) Rows Removed by Filter: 10005 Buffers: shared hit=110 SubPlan 1 -> Seq Scan on onek c (cost=0.00..180.10 rows=3 width=4) (actual time=0.022..4.951 rows=2.00 loops=1) Filter: (unique2 = ANY ('{0,2}'::integer[])) Rows Removed by Filter: 10006 Buffers: shared hit=55 Planning: Buffers: shared hit=6 dirtied=1 Planning Time: 0.502 ms Execution Time: 13.348 ms (13 rows)

The query plan without our patch:

-------------------------------------------------------------------------------------------------------------------------------------------- Hash Semi Join (cost=0.05..181.42 rows=2 width=6) (actual time=5.072..9.076 rows=3.00 loops=1) Hash Cond: (t.unique1 = "*VALUES*".column1) Buffers: shared hit=55 read=55 -> Seq Scan on onek t (cost=0.00..155.08 rows=10008 width=10) (actual time=0.145..1.802 rows=10008.00 loops=1) Buffers: shared hit=52 read=3 -> Hash (cost=0.03..0.03 rows=2 width=4) (actual time=4.908..4.912 rows=2.00 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB Buffers: shared hit=3 read=52 -> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=2 width=4) (actual time=0.003..4.901 rows=2.00 loops=1) Buffers: shared hit=3 read=52 SubPlan 1 -> Hash Semi Join (cost=0.05..181.42 rows=2 width=4) (actual time=0.036..4.861 rows=2.00 loops=1) Hash Cond: (c.unique2 = "*VALUES*_1".column1) Buffers: shared hit=3 read=52 -> Seq Scan on onek c (cost=0.00..155.08 rows=10008 width=4) (actual time=0.009..2.120 rows=10008.00 loops=1) Buffers: shared hit=3 read=52 -> Hash (cost=0.03..0.03 rows=2 width=4) (actual time=0.006..0.008 rows=2.00 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Values Scan on "*VALUES*_1" (cost=0.00..0.03 rows=2 width=4) (actual time=0.001..0.002 rows=2.00 loops=1) Planning: Buffers: shared hit=102 read=22 Planning Time: 1.853 ms Execution Time: 9.281 ms (23 rows)

4.2) explain analyze SELECT ten FROM onek t WHERE unique1 IN ( VALUES (0),(2) );

QUERY PLAN ----------------------------------------------------------------------------------------------------- Seq Scan on onek t (cost=0.00..180.10 rows=3 width=6) (actual time=0.200..3.777 rows=3.00 loops=1) Filter: (unique1 = ANY ('{0,2}'::integer[])) Rows Removed by Filter: 10005 Buffers: shared read=55 Planning: Buffers: shared hit=65 read=26 Planning Time: 1.345 ms Execution Time: 3.826 ms (8 rows)

QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Hash Semi Join (cost=0.05..181.42 rows=2 width=6) (actual time=0.094..4.935 rows=3.00 loops=1) Hash Cond: (t.unique1 = "*VALUES*".column1) Buffers: shared hit=55 -> Seq Scan on onek t (cost=0.00..155.08 rows=10008 width=10) (actual time=0.056..2.052 rows=10008.00 loops=1) Buffers: shared hit=55 -> Hash (cost=0.03..0.03 rows=2 width=4) (actual time=0.019..0.019 rows=2.00 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=2 width=4) (actual time=0.005..0.009 rows=2.00 loops=1) Planning Time: 0.332 ms Execution Time: 4.998 ms (10 rows)

-- 
Regards,
Alena Rybakina
Postgres Professional

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Alexander Korotkov

Date:

01 April, 07:14:02

Hi, Alena!

On Tue, Apr 1, 2025 at 2:11 AM Alena Rybakina <a.rybakina@postgrespro.ru> wrote:
> Yes, I agree with that - this is precisely why we need to call IncrementVarSublevelsUp() unconditionally for all types.
>
> As you mentioned earlier, Var nodes can be nested more deeply, and skipping this step could lead to incorrect behavior in those cases. So, now it works fine)
>
> Thank you for an example.
>
> I analyzed this transformation with various types of values that might be used in conditions.
>
> First, I verified whether the change would affect semantics, especially in the presence of NULL elements. The only notable behavior I observed was
> the coercion of NULL to an integer type. However, this behavior remains the same even without our transformation, so everything is fine.

Thank you for your experiments! I've also rechecked we don't sacrifice lazy evaluation. But it appears we don't have one anyway.

CREATE FUNCTION my_func() RETURNS text AS $$
BEGIN
RAISE NOTICE 'notice';
RETURN 'b';
END;
$$ LANGUAGE 'plpgsql';

# create table test (val text);
# insert into test values ('a');
# explain analyze select * from test where val in (VALUES ('a'), (my_func()));
NOTICE: notice
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Hash Semi Join (cost=0.05..21.26 rows=9 width=64) (actual time=0.178..0.183 rows=1.00 loops=1)
Hash Cond: (test.val = ("*VALUES*".column1)::text)
Buffers: shared hit=1
-> Seq Scan on test (cost=0.00..18.80 rows=880 width=64) (actual time=0.045..0.048 rows=1.00 loops=1)
Buffers: shared hit=1
-> Hash (cost=0.03..0.03 rows=2 width=32) (actual time=0.111..0.112 rows=2.00 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=2 width=32) (actual time=0.004..0.065 rows=2.00 loops=1)
Planning Time: 0.250 ms
Execution Time: 0.267 ms
(10 rows)

------
Regards,
Alexander Korotkov
Supabase

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Alexander Korotkov

Date:

01 April, 15:07:59

Hi, Alena!

On Tue, Apr 1, 2025 at 2:11 AM Alena Rybakina <a.rybakina@postgrespro.ru> wrote:

4.1) explain analyze SELECT ten
FROM onek t WHERE unique1 IN ( VALUES (0), ((2 IN ( SELECT unique2 FROM onek c WHERE c.unique2 in ((values(0),(2))))::integer)) );
QUERY PLAN ------------------------------------------------------------------------------------------------------------- Seq Scan on onek t (cost=180.11..410.25 rows=2 width=6) (actual time=5.014..13.256 rows=3.00 loops=1) Filter: (unique1 = ANY (ARRAY[0, ((ANY (2 = (hashed SubPlan 1).col1)))::integer])) Rows Removed by Filter: 10005 Buffers: shared hit=110 SubPlan 1 -> Seq Scan on onek c (cost=0.00..180.10 rows=3 width=4) (actual time=0.022..4.951 rows=2.00 loops=1) Filter: (unique2 = ANY ('{0,2}'::integer[])) Rows Removed by Filter: 10006 Buffers: shared hit=55 Planning: Buffers: shared hit=6 dirtied=1 Planning Time: 0.502 ms Execution Time: 13.348 ms (13 rows)
The query plan without our patch:
-------------------------------------------------------------------------------------------------------------------------------------------- Hash Semi Join (cost=0.05..181.42 rows=2 width=6) (actual time=5.072..9.076 rows=3.00 loops=1) Hash Cond: (t.unique1 = "*VALUES*".column1) Buffers: shared hit=55 read=55 -> Seq Scan on onek t (cost=0.00..155.08 rows=10008 width=10) (actual time=0.145..1.802 rows=10008.00 loops=1) Buffers: shared hit=52 read=3 -> Hash (cost=0.03..0.03 rows=2 width=4) (actual time=4.908..4.912 rows=2.00 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB Buffers: shared hit=3 read=52 -> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=2 width=4) (actual time=0.003..4.901 rows=2.00 loops=1) Buffers: shared hit=3 read=52 SubPlan 1 -> Hash Semi Join (cost=0.05..181.42 rows=2 width=4) (actual time=0.036..4.861 rows=2.00 loops=1) Hash Cond: (c.unique2 = "*VALUES*_1".column1) Buffers: shared hit=3 read=52 -> Seq Scan on onek c (cost=0.00..155.08 rows=10008 width=4) (actual time=0.009..2.120 rows=10008.00 loops=1) Buffers: shared hit=3 read=52 -> Hash (cost=0.03..0.03 rows=2 width=4) (actual time=0.006..0.008 rows=2.00 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Values Scan on "*VALUES*_1" (cost=0.00..0.03 rows=2 width=4) (actual time=0.001..0.002 rows=2.00 loops=1) Planning: Buffers: shared hit=102 read=22 Planning Time: 1.853 ms Execution Time: 9.281 ms (23 rows)

I think I managed to understand what is going on.

When we run a query with SOAP over a constant array then convert_saop_to_hashed_saop_walker() provides acceleration with hashing.

# explain analyze select * from test where val IN (5000, 4000, 9000, 2000, 1000, 140050);
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..21925.00 rows=6 width=4) (actual time=2.015..223.984 rows=6.00 loops=1)
Filter: (val = ANY ('{5000,4000,9000,2000,1000,140050}'::integer[]))
Rows Removed by Filter: 999994
Buffers: shared hit=2228 read=2197
Planning Time: 0.246 ms
Execution Time: 224.036 ms
(6 rows)

But when there is expression or subselect, then hashing doesn't work and query becomes slower.

# explain analyze select * from test where val IN (5000, 4000, 9000, 2000, 1000, (select 140050));
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.01..21925.01 rows=6 width=4) (actual time=0.904..396.495 rows=6.00 loops=1)
Filter: (val = ANY (ARRAY[5000, 4000, 9000, 2000, 1000, (InitPlan 1).col1]))
Rows Removed by Filter: 999994
Buffers: shared hit=2292 read=2133
InitPlan 1
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1.00 loops=1)
Planning Time: 0.160 ms
Execution Time: 396.538 ms
(8 rows)

In contrast, hashing is always available with VALUES.

# explain analyze select * from test where val in (VALUES (5000), (4000), (9000), (2000), (1000), ((select 140050)));
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Hash Semi Join (cost=0.16..17050.23 rows=6 width=4) (actual time=1.589..225.061 rows=6.00 loops=1)
Hash Cond: (test.val = "*VALUES*".column1)
Buffers: shared hit=2356 read=2069
InitPlan 1
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.003..0.003 rows=1.00 loops=1)
-> Seq Scan on test (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.460..91.912 rows=1000000.00 loops=1)
Buffers: shared hit=2356 read=2069
-> Hash (cost=0.08..0.08 rows=6 width=4) (actual time=0.049..0.050 rows=6.00 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Values Scan on "*VALUES*" (cost=0.00..0.08 rows=6 width=4) (actual time=0.009..0.032 rows=6.00 loops=1)
Planning Time: 0.627 ms
Execution Time: 225.155 ms
(12 rows)

I think we should allow our transformation only when the array is constant (attached patchset). In future we may implement dynamic SAOP hashing, and then allow our transformation in more cases.

------
Regards,
Alexander Korotkov

Supabase

Attachment

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Alena Rybakina

Date:

01 April, 17:23:02

Hi, Alexander!

On 01.04.2025 15:07, Alexander Korotkov wrote:

Hi, Alena!

On Tue, Apr 1, 2025 at 2:11 AM Alena Rybakina <a.rybakina@postgrespro.ru> wrote:
4.1) explain analyze SELECT ten
FROM onek t WHERE unique1 IN ( VALUES (0), ((2 IN ( SELECT unique2 FROM onek c WHERE c.unique2 in ((values(0),(2))))::integer)) );
QUERY PLAN ------------------------------------------------------------------------------------------------------------- Seq Scan on onek t (cost=180.11..410.25 rows=2 width=6) (actual time=5.014..13.256 rows=3.00 loops=1) Filter: (unique1 = ANY (ARRAY[0, ((ANY (2 = (hashed SubPlan 1).col1)))::integer])) Rows Removed by Filter: 10005 Buffers: shared hit=110 SubPlan 1 -> Seq Scan on onek c (cost=0.00..180.10 rows=3 width=4) (actual time=0.022..4.951 rows=2.00 loops=1) Filter: (unique2 = ANY ('{0,2}'::integer[])) Rows Removed by Filter: 10006 Buffers: shared hit=55 Planning: Buffers: shared hit=6 dirtied=1 Planning Time: 0.502 ms Execution Time: 13.348 ms (13 rows)
The query plan without our patch:
-------------------------------------------------------------------------------------------------------------------------------------------- Hash Semi Join (cost=0.05..181.42 rows=2 width=6) (actual time=5.072..9.076 rows=3.00 loops=1) Hash Cond: (t.unique1 = "*VALUES*".column1) Buffers: shared hit=55 read=55 -> Seq Scan on onek t (cost=0.00..155.08 rows=10008 width=10) (actual time=0.145..1.802 rows=10008.00 loops=1) Buffers: shared hit=52 read=3 -> Hash (cost=0.03..0.03 rows=2 width=4) (actual time=4.908..4.912 rows=2.00 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB Buffers: shared hit=3 read=52 -> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=2 width=4) (actual time=0.003..4.901 rows=2.00 loops=1) Buffers: shared hit=3 read=52 SubPlan 1 -> Hash Semi Join (cost=0.05..181.42 rows=2 width=4) (actual time=0.036..4.861 rows=2.00 loops=1) Hash Cond: (c.unique2 = "*VALUES*_1".column1) Buffers: shared hit=3 read=52 -> Seq Scan on onek c (cost=0.00..155.08 rows=10008 width=4) (actual time=0.009..2.120 rows=10008.00 loops=1) Buffers: shared hit=3 read=52 -> Hash (cost=0.03..0.03 rows=2 width=4) (actual time=0.006..0.008 rows=2.00 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Values Scan on "*VALUES*_1" (cost=0.00..0.03 rows=2 width=4) (actual time=0.001..0.002 rows=2.00 loops=1) Planning: Buffers: shared hit=102 read=22 Planning Time: 1.853 ms Execution Time: 9.281 ms (23 rows)

I think I managed to understand what is going on.

When we run a query with SOAP over a constant array then convert_saop_to_hashed_saop_walker() provides acceleration with hashing.

# explain analyze select * from test where val IN (5000, 4000, 9000, 2000, 1000, 140050);
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..21925.00 rows=6 width=4) (actual time=2.015..223.984 rows=6.00 loops=1)
Filter: (val = ANY ('{5000,4000,9000,2000,1000,140050}'::integer[]))
Rows Removed by Filter: 999994
Buffers: shared hit=2228 read=2197
Planning Time: 0.246 ms
Execution Time: 224.036 ms
(6 rows)

But when there is expression or subselect, then hashing doesn't work and query becomes slower.

# explain analyze select * from test where val IN (5000, 4000, 9000, 2000, 1000, (select 140050));
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.01..21925.01 rows=6 width=4) (actual time=0.904..396.495 rows=6.00 loops=1)
Filter: (val = ANY (ARRAY[5000, 4000, 9000, 2000, 1000, (InitPlan 1).col1]))
Rows Removed by Filter: 999994
Buffers: shared hit=2292 read=2133
InitPlan 1
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1.00 loops=1)
Planning Time: 0.160 ms
Execution Time: 396.538 ms
(8 rows)

In contrast, hashing is always available with VALUES.

# explain analyze select * from test where val in (VALUES (5000), (4000), (9000), (2000), (1000), ((select 140050)));
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Hash Semi Join (cost=0.16..17050.23 rows=6 width=4) (actual time=1.589..225.061 rows=6.00 loops=1)
Hash Cond: (test.val = "*VALUES*".column1)
Buffers: shared hit=2356 read=2069
InitPlan 1
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.003..0.003 rows=1.00 loops=1)
-> Seq Scan on test (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.460..91.912 rows=1000000.00 loops=1)
Buffers: shared hit=2356 read=2069
-> Hash (cost=0.08..0.08 rows=6 width=4) (actual time=0.049..0.050 rows=6.00 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Values Scan on "*VALUES*" (cost=0.00..0.08 rows=6 width=4) (actual time=0.009..0.032 rows=6.00 loops=1)
Planning Time: 0.627 ms
Execution Time: 225.155 ms
(12 rows)

I think we should allow our transformation only when the array is constant (attached patchset).

Yes, I agree with your conclusions; however, I noticed that you didn’t take Param-type variables into account.
These still get executed during the VALUES -> ANY transformation (see regression tests).

+PREPARE test2 (int,numeric, text) AS
+ SELECT ten FROM onek
+ WHERE sin(two)*four/($3::real) IN (VALUES (2), ($2), ($2), ($1));
+-- VTA forbidden because of unresolved casting of numeric parameter to common type
+EXPLAIN (COSTS OFF) EXECUTE test2(2, 2, '2');
+                                                         QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------------------
+ Seq Scan on onek
+   Filter: (((sin((two)::double precision) * (four)::double precision) / '2'::real) = ANY ('{2,2,2,2}'::double precision[]))
+(2 rows)
+
+PREPARE test3 (int,int, text) AS
+ SELECT ten FROM onek
+ WHERE sin(two)*four/($3::real) IN (VALUES (2), ($2), ($2), ($1));
+EXPLAIN (COSTS OFF) EXECUTE test3(2, 2, '2');
+                                                         QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------------------
+ Seq Scan on onek
+   Filter: (((sin((two)::double precision) * (four)::double precision) / '2'::real) = ANY ('{2,2,2,2}'::double precision[]))
+(2 rows)

In my opinion, we can apply the VALUES ->ANY transformation to them as well. What do you think? I ran some queries and didn’t notice any significant performance degradation.

create table test (x int);
insert into test select id from generate_series(1,1000) id;
PREPARE test4 (int,int, int) AS select * from test where x IN ($1, $2, $3);
PREPARE test3 (int,int, int) AS select * from test where x IN ($1, $2,
(select $3));
EXPLAIN ANALYZE EXECUTE test4(2, 2, 2);
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..18.75 rows=3 width=4) (actual time=0.016..0.353 rows=1.00 loops=1)
   Filter: (x = ANY (ARRAY[$1, $2, $3]))
   Rows Removed by Filter: 999
   Buffers: shared hit=5
Planning:
   Buffers: shared hit=20
Planning Time: 0.266 ms
Execution Time: 0.367 ms
(8 rows)

alena@postgres=# EXPLAIN ANALYZE EXECUTE test3(2, 2, 2);
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.01..18.76 rows=3 width=4) (actual time=0.072..1.379 rows=1.00 loops=1)
   Filter: (x = ANY (ARRAY[2, 2, (InitPlan 1).col1]))
   Rows Removed by Filter: 999
   Buffers: shared hit=5
   InitPlan 1
     -> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.003..0.003 rows=1.00 loops=1)
Planning Time: 0.350 ms
Execution Time: 1.431 ms
(8 rows)

alena@postgres=# PREPARE test6 (int,int, int) AS select * from test where x IN (values($1), ($2), ($3));
PREPARE
alena@postgres=# EXPLAIN ANALYZE EXECUTE test6(2, 2, 2);
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..18.75 rows=3 width=4) (actual time=0.055..0.683 rows=1.00 loops=1)
   Filter: (x = ANY ('{2,2,2}'::integer[]))
   Rows Removed by Filter: 999
   Buffers: shared hit=5
Planning Time: 0.230 ms
Execution Time: 0.724 ms
(6 rows)

We can’t use hashing for them, but without this transformation, we still have to perform a join.

----------------------------------------------------------------------------------------------------------------------
Hash Semi Join (cost=0.08..17.73 rows=3 width=4) (actual time=0.124..0.943 rows=1.00 loops=1)
   Hash Cond: (test.x = "*VALUES*".column1)
   Buffers: shared hit=5
   -> Seq Scan on test (cost=0.00..15.00 rows=1000 width=4) (actual time=0.051..0.389 rows=1000.00 loops=1)
         Buffers: shared hit=5
   -> Hash (cost=0.04..0.04 rows=3 width=4) (actual time=0.028..0.030 rows=3.00 loops=1)
         Buckets: 1024 Batches: 1 Memory Usage: 9kB
         -> Values Scan on "*VALUES*" (cost=0.00..0.04 rows=3 width=4) (actual time=0.004..0.010 rows=3.00 loops=1)
Planning:
   Buffers: shared hit=105 read=1
Planning Time: 2.176 ms
Execution Time: 1.077 ms
(12 rows)

So, I think we can bring it back and construct the Array node based on the have_param flag.

foreach (lc, rte->values_lists)
+    {
+        List *elem = lfirst(lc);
+        Node *value = linitial(elem);
+
+        value = eval_const_expressions(NULL, value);
+
+        if (!IsA(value, Const))
+            have_param = true;
+
+        consts = lappend(consts, value);
+
+    }

Regarding the check for the presence of Var elements before the transformation, I think we should, for now, restore the walker function (values_simplicity_check_walker) that
traverses the query to identify Var nodes. This function was included in the initial version of the patch:

+/*
+ * The function traverses the tree looking for elements of type var.
+ * If it finds it, it returns true.
+ */
+static bool
+values_simplicity_check_walker(Node *node, void *ctx)
+{
+    if (node == NULL)
+    {
+        return false;
+    }
+    else if(IsA(node, Var))
+        return true;
+    else if(IsA(node, Query))
+        return query_tree_walker((Query *) node,
+                                values_simplicity_check_walker,
+                                (void*) ctx,
+                                QTW_EXAMINE_RTES_BEFORE);
+
+    return expression_tree_walker(node, values_simplicity_check_walker,
+                                (void *) ctx);
+}

In future we may implement dynamic SAOP hashing, and then allow our transformation in more cases.

I agree with your suggestion) Thank you for your interest to this subject and contribution!

-- 
Regards,
Alena Rybakina
Postgres Professional

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Alena Rybakina

Date:

02 April, 17:33:43

On 01.04.2025 17:23, Alena Rybakina wrote:

Hi, Alexander!
On 01.04.2025 15:07, Alexander Korotkov wrote:
Hi, Alena!

On Tue, Apr 1, 2025 at 2:11 AM Alena Rybakina <a.rybakina@postgrespro.ru> wrote:
4.1) explain analyze SELECT ten
FROM onek t WHERE unique1 IN ( VALUES (0), ((2 IN ( SELECT unique2 FROM onek c WHERE c.unique2 in ((values(0),(2))))::integer)) );
QUERY PLAN ------------------------------------------------------------------------------------------------------------- Seq Scan on onek t (cost=180.11..410.25 rows=2 width=6) (actual time=5.014..13.256 rows=3.00 loops=1) Filter: (unique1 = ANY (ARRAY[0, ((ANY (2 = (hashed SubPlan 1).col1)))::integer])) Rows Removed by Filter: 10005 Buffers: shared hit=110 SubPlan 1 -> Seq Scan on onek c (cost=0.00..180.10 rows=3 width=4) (actual time=0.022..4.951 rows=2.00 loops=1) Filter: (unique2 = ANY ('{0,2}'::integer[])) Rows Removed by Filter: 10006 Buffers: shared hit=55 Planning: Buffers: shared hit=6 dirtied=1 Planning Time: 0.502 ms Execution Time: 13.348 ms (13 rows)
The query plan without our patch:
-------------------------------------------------------------------------------------------------------------------------------------------- Hash Semi Join (cost=0.05..181.42 rows=2 width=6) (actual time=5.072..9.076 rows=3.00 loops=1) Hash Cond: (t.unique1 = "*VALUES*".column1) Buffers: shared hit=55 read=55 -> Seq Scan on onek t (cost=0.00..155.08 rows=10008 width=10) (actual time=0.145..1.802 rows=10008.00 loops=1) Buffers: shared hit=52 read=3 -> Hash (cost=0.03..0.03 rows=2 width=4) (actual time=4.908..4.912 rows=2.00 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB Buffers: shared hit=3 read=52 -> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=2 width=4) (actual time=0.003..4.901 rows=2.00 loops=1) Buffers: shared hit=3 read=52 SubPlan 1 -> Hash Semi Join (cost=0.05..181.42 rows=2 width=4) (actual time=0.036..4.861 rows=2.00 loops=1) Hash Cond: (c.unique2 = "*VALUES*_1".column1) Buffers: shared hit=3 read=52 -> Seq Scan on onek c (cost=0.00..155.08 rows=10008 width=4) (actual time=0.009..2.120 rows=10008.00 loops=1) Buffers: shared hit=3 read=52 -> Hash (cost=0.03..0.03 rows=2 width=4) (actual time=0.006..0.008 rows=2.00 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Values Scan on "*VALUES*_1" (cost=0.00..0.03 rows=2 width=4) (actual time=0.001..0.002 rows=2.00 loops=1) Planning: Buffers: shared hit=102 read=22 Planning Time: 1.853 ms Execution Time: 9.281 ms (23 rows)

I think I managed to understand what is going on.

When we run a query with SOAP over a constant array then convert_saop_to_hashed_saop_walker() provides acceleration with hashing.

# explain analyze select * from test where val IN (5000, 4000, 9000, 2000, 1000, 140050);
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..21925.00 rows=6 width=4) (actual time=2.015..223.984 rows=6.00 loops=1)
Filter: (val = ANY ('{5000,4000,9000,2000,1000,140050}'::integer[]))
Rows Removed by Filter: 999994
Buffers: shared hit=2228 read=2197
Planning Time: 0.246 ms
Execution Time: 224.036 ms
(6 rows)

But when there is expression or subselect, then hashing doesn't work and query becomes slower.

# explain analyze select * from test where val IN (5000, 4000, 9000, 2000, 1000, (select 140050));
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.01..21925.01 rows=6 width=4) (actual time=0.904..396.495 rows=6.00 loops=1)
Filter: (val = ANY (ARRAY[5000, 4000, 9000, 2000, 1000, (InitPlan 1).col1]))
Rows Removed by Filter: 999994
Buffers: shared hit=2292 read=2133
InitPlan 1
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1.00 loops=1)
Planning Time: 0.160 ms
Execution Time: 396.538 ms
(8 rows)

In contrast, hashing is always available with VALUES.

# explain analyze select * from test where val in (VALUES (5000), (4000), (9000), (2000), (1000), ((select 140050)));
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Hash Semi Join (cost=0.16..17050.23 rows=6 width=4) (actual time=1.589..225.061 rows=6.00 loops=1)
Hash Cond: (test.val = "*VALUES*".column1)
Buffers: shared hit=2356 read=2069
InitPlan 1
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.003..0.003 rows=1.00 loops=1)
-> Seq Scan on test (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.460..91.912 rows=1000000.00 loops=1)
Buffers: shared hit=2356 read=2069
-> Hash (cost=0.08..0.08 rows=6 width=4) (actual time=0.049..0.050 rows=6.00 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Values Scan on "*VALUES*" (cost=0.00..0.08 rows=6 width=4) (actual time=0.009..0.032 rows=6.00 loops=1)
Planning Time: 0.627 ms
Execution Time: 225.155 ms
(12 rows)

I think we should allow our transformation only when the array is constant (attached patchset).
Yes, I agree with your conclusions; however, I noticed that you didn’t take Param-type variables into account.
These still get executed during the VALUES -> ANY transformation (see regression tests).
+PREPARE test2 (int,numeric, text) AS
+ SELECT ten FROM onek
+ WHERE sin(two)*four/($3::real) IN (VALUES (2), ($2), ($2), ($1));
+-- VTA forbidden because of unresolved casting of numeric parameter to common type
+EXPLAIN (COSTS OFF) EXECUTE test2(2, 2, '2');
+                                                         QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------------------
+ Seq Scan on onek
+   Filter: (((sin((two)::double precision) * (four)::double precision) / '2'::real) = ANY ('{2,2,2,2}'::double precision[]))
+(2 rows)
+
+PREPARE test3 (int,int, text) AS
+ SELECT ten FROM onek
+ WHERE sin(two)*four/($3::real) IN (VALUES (2), ($2), ($2), ($1));
+EXPLAIN (COSTS OFF) EXECUTE test3(2, 2, '2');
+                                                         QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------------------
+ Seq Scan on onek
+   Filter: (((sin((two)::double precision) * (four)::double precision) / '2'::real) = ANY ('{2,2,2,2}'::double precision[]))
+(2 rows)
In my opinion, we can apply the VALUES ->ANY transformation to them as well. What do you think? I ran some queries and didn’t notice any significant performance degradation.
create table test (x int);
insert into test select id from generate_series(1,1000) id;
PREPARE test4 (int,int, int) AS select * from test where x IN ($1, $2, $3);
PREPARE test3 (int,int, int) AS select * from test where x IN ($1, $2,
(select $3));
EXPLAIN ANALYZE EXECUTE test4(2, 2, 2);
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..18.75 rows=3 width=4) (actual time=0.016..0.353 rows=1.00 loops=1)
   Filter: (x = ANY (ARRAY[$1, $2, $3]))
   Rows Removed by Filter: 999
   Buffers: shared hit=5
Planning:
   Buffers: shared hit=20
Planning Time: 0.266 ms
Execution Time: 0.367 ms
(8 rows)

alena@postgres=# EXPLAIN ANALYZE EXECUTE test3(2, 2, 2);
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.01..18.76 rows=3 width=4) (actual time=0.072..1.379 rows=1.00 loops=1)
   Filter: (x = ANY (ARRAY[2, 2, (InitPlan 1).col1]))
   Rows Removed by Filter: 999
   Buffers: shared hit=5
   InitPlan 1
     -> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.003..0.003 rows=1.00 loops=1)
Planning Time: 0.350 ms
Execution Time: 1.431 ms
(8 rows)

alena@postgres=# PREPARE test6 (int,int, int) AS select * from test where x IN (values($1), ($2), ($3));
PREPARE
alena@postgres=# EXPLAIN ANALYZE EXECUTE test6(2, 2, 2);
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..18.75 rows=3 width=4) (actual time=0.055..0.683 rows=1.00 loops=1)
   Filter: (x = ANY ('{2,2,2}'::integer[]))
   Rows Removed by Filter: 999
   Buffers: shared hit=5
Planning Time: 0.230 ms
Execution Time: 0.724 ms
(6 rows)
We can’t use hashing for them, but without this transformation, we still have to perform a join.
----------------------------------------------------------------------------------------------------------------------
Hash Semi Join (cost=0.08..17.73 rows=3 width=4) (actual time=0.124..0.943 rows=1.00 loops=1)
   Hash Cond: (test.x = "*VALUES*".column1)
   Buffers: shared hit=5
   -> Seq Scan on test (cost=0.00..15.00 rows=1000 width=4) (actual time=0.051..0.389 rows=1000.00 loops=1)
         Buffers: shared hit=5
   -> Hash (cost=0.04..0.04 rows=3 width=4) (actual time=0.028..0.030 rows=3.00 loops=1)
         Buckets: 1024 Batches: 1 Memory Usage: 9kB
         -> Values Scan on "*VALUES*" (cost=0.00..0.04 rows=3 width=4) (actual time=0.004..0.010 rows=3.00 loops=1)
Planning:
   Buffers: shared hit=105 read=1
Planning Time: 2.176 ms
Execution Time: 1.077 ms
(12 rows)
So, I think we can bring it back and construct the Array node based on the have_param flag.
foreach (lc, rte->values_lists)
+    {
+        List *elem = lfirst(lc);
+        Node *value = linitial(elem);
+
+        value = eval_const_expressions(NULL, value);
+
+        if (!IsA(value, Const))
+            have_param = true;
+
+        consts = lappend(consts, value);
+
+    }
Regarding the check for the presence of Var elements before the transformation, I think we should, for now, restore the walker function (values_simplicity_check_walker) that
traverses the query to identify Var nodes. This function was included in the initial version of the patch:
+/*
+ * The function traverses the tree looking for elements of type var.
+ * If it finds it, it returns true.
+ */
+static bool
+values_simplicity_check_walker(Node *node, void *ctx)
+{
+    if (node == NULL)
+    {
+        return false;
+    }
+    else if(IsA(node, Var))
+        return true;
+    else if(IsA(node, Query))
+        return query_tree_walker((Query *) node,
+                                values_simplicity_check_walker,
+                                (void*) ctx,
+                                QTW_EXAMINE_RTES_BEFORE);
+
+    return expression_tree_walker(node, values_simplicity_check_walker,
+                                (void *) ctx);
+}
In future we may implement dynamic SAOP hashing, and then allow our transformation in more cases.
I agree with your suggestion) Thank you for your interest to this subject and contribution!

I prepared a patch according to my suggestions, it just checks that the transformation is not carried out if there is a var element, there are changes only in one test, but I think it is correct.

diff -U3 /home/alena/postgrespro_or3/src/test/regress/expected/subselect.out /home/alena/postgrespro_or3/src/test/regress/results/subselect.out
--- /home/alena/postgrespro_or3/src/test/regress/expected/subselect.out    2025-04-02 02:50:07.018329864 +0300
+++ /home/alena/postgrespro_or3/src/test/regress/results/subselect.out    2025-04-02 17:27:09.845104001 +0300
@@ -3027,18 +3027,15 @@
SELECT ten FROM onek t WHERE unique1 IN (VALUES (0), ((2 IN
   (SELECT (3)))::integer)
);
-                     QUERY PLAN
-----------------------------------------------------
- Nested Loop
-   -> Unique
-         -> Sort
-               Sort Key: "*VALUES*".column1
-               -> Values Scan on "*VALUES*"
-                     SubPlan 1
-                       -> Result
-   -> Index Scan using onek_unique1 on onek t
-         Index Cond: (unique1 = "*VALUES*".column1)
-(9 rows)
+                                           QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Bitmap Heap Scan on onek t
+   Recheck Cond: (unique1 = ANY (ARRAY[0, ((ANY (2 = (hashed SubPlan 1).col1)))::integer]))
+   -> Bitmap Index Scan on onek_unique1
+         Index Cond: (unique1 = ANY (ARRAY[0, ((ANY (2 = (hashed SubPlan 1).col1)))::integer]))
+   SubPlan 1
+     -> Result
+(6 rows)

-- Alow to transformation and hold conversion between types of colemns and
-- declared type of column pointed in RTE

-- 
Regards,
Alena Rybakina
Postgres Professional

Attachment

values_to_any.diff.no-cfbot

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Alexander Korotkov

Date:

02 April, 22:00:49

Hi, Alena!

On Wed, Apr 2, 2025 at 5:33 PM Alena Rybakina <a.rybakina@postgrespro.ru> wrote:

I prepared a patch according to my suggestions, it just checks that the transformation is not carried out if there is a var element, there are changes only in one test, but I think it is correct.
diff -U3 /home/alena/postgrespro_or3/src/test/regress/expected/subselect.out /home/alena/postgrespro_or3/src/test/regress/results/subselect.out
--- /home/alena/postgrespro_or3/src/test/regress/expected/subselect.out    2025-04-02 02:50:07.018329864 +0300
+++ /home/alena/postgrespro_or3/src/test/regress/results/subselect.out    2025-04-02 17:27:09.845104001 +0300
@@ -3027,18 +3027,15 @@
SELECT ten FROM onek t WHERE unique1 IN (VALUES (0), ((2 IN
   (SELECT (3)))::integer)
);
-                     QUERY PLAN
-----------------------------------------------------
- Nested Loop
-   -> Unique
-         -> Sort
-               Sort Key: "*VALUES*".column1
-               -> Values Scan on "*VALUES*"
-                     SubPlan 1
-                       -> Result
-   -> Index Scan using onek_unique1 on onek t
-         Index Cond: (unique1 = "*VALUES*".column1)
-(9 rows)
+                                           QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Bitmap Heap Scan on onek t
+   Recheck Cond: (unique1 = ANY (ARRAY[0, ((ANY (2 = (hashed SubPlan 1).col1)))::integer]))
+   -> Bitmap Index Scan on onek_unique1
+         Index Cond: (unique1 = ANY (ARRAY[0, ((ANY (2 = (hashed SubPlan 1).col1)))::integer]))
+   SubPlan 1
+     -> Result
+(6 rows)

-- Alow to transformation and hold conversion between types of colemns and
-- declared type of column pointed in RTE

Your patch turns back performance degradation issue that I mention in [1].

With your patch.

# explain analyze select * from test where val in (VALUES (5000), (4000), (9000), (2000), (1000), (140050), ((select 1)));

QUERY PLAN
-------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.01..23175.01 rows=7 width=4) (actual time=0.455..417.929 rows=7.00 loops=1)
Filter: (val = ANY (ARRAY[5000, 4000, 9000, 2000, 1000, 140050, (InitPlan 1).col1]))
Rows Removed by Filter: 999993
Buffers: shared hit=2362 read=2063
InitPlan 1
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1.00 loops=1)
Planning Time: 0.211 ms
Execution Time: 417.984 ms
(8 rows)

Without your patch.

# explain analyze select * from test where val in (VALUES (5000), (4000), (9000), (2000), (1000), (140050), ((select 1)));
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Hash Semi Join (cost=0.19..17050.26 rows=7 width=4) (actual time=0.289..210.335 rows=7.00 loops=1)
Hash Cond: (test.val = "*VALUES*".column1)
Buffers: shared hit=2211 read=2214
InitPlan 1
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1.00 loops=1)
-> Seq Scan on test (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.203..85.925 rows=1000000.00 loops=1)
Buffers: shared hit=2211 read=2214
-> Hash (cost=0.09..0.09 rows=7 width=4) (actual time=0.067..0.068 rows=7.00 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Values Scan on "*VALUES*" (cost=0.00..0.09 rows=7 width=4) (actual time=0.004..0.023 rows=7.00 loops=1)
Planning Time: 0.321 ms
Execution Time: 210.409 ms
(12 rows)

Only when we have all consts in the array, we can have SAOP hashing which is competitive with hashing of VALUES node. So, I'm continuing with my version of patch.

Links.

1. https://www.postgresql.org/message-id/CAPpHfds-7eJ3ZMgyeVrMfC5E1nTHD4Bp0ch5MZhrYSoiCfERXw%40mail.gmail.com

------
Regards,
Alexander Korotkov

Supabase

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Alena Rybakina

Date:

03 April, 17:18:46

On 02.04.2025 22:00, Alexander Korotkov wrote:

Hi, Alena!

On Wed, Apr 2, 2025 at 5:33 PM Alena Rybakina <a.rybakina@postgrespro.ru> wrote:
I prepared a patch according to my suggestions, it just checks that the transformation is not carried out if there is a var element, there are changes only in one test, but I think it is correct.
diff -U3 /home/alena/postgrespro_or3/src/test/regress/expected/subselect.out /home/alena/postgrespro_or3/src/test/regress/results/subselect.out
--- /home/alena/postgrespro_or3/src/test/regress/expected/subselect.out    2025-04-02 02:50:07.018329864 +0300
+++ /home/alena/postgrespro_or3/src/test/regress/results/subselect.out    2025-04-02 17:27:09.845104001 +0300
@@ -3027,18 +3027,15 @@
SELECT ten FROM onek t WHERE unique1 IN (VALUES (0), ((2 IN
   (SELECT (3)))::integer)
);
-                     QUERY PLAN
-----------------------------------------------------
- Nested Loop
-   -> Unique
-         -> Sort
-               Sort Key: "*VALUES*".column1
-               -> Values Scan on "*VALUES*"
-                     SubPlan 1
-                       -> Result
-   -> Index Scan using onek_unique1 on onek t
-         Index Cond: (unique1 = "*VALUES*".column1)
-(9 rows)
+                                           QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Bitmap Heap Scan on onek t
+   Recheck Cond: (unique1 = ANY (ARRAY[0, ((ANY (2 = (hashed SubPlan 1).col1)))::integer]))
+   -> Bitmap Index Scan on onek_unique1
+         Index Cond: (unique1 = ANY (ARRAY[0, ((ANY (2 = (hashed SubPlan 1).col1)))::integer]))
+   SubPlan 1
+     -> Result
+(6 rows)

-- Alow to transformation and hold conversion between types of colemns and
-- declared type of column pointed in RTE
Your patch turns back performance degradation issue that I mention in [1].

With your patch.
# explain analyze select * from test where val in (VALUES (5000), (4000), (9000), (2000), (1000), (140050), ((select 1)));
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.01..23175.01 rows=7 width=4) (actual time=0.455..417.929 rows=7.00 loops=1)
Filter: (val = ANY (ARRAY[5000, 4000, 9000, 2000, 1000, 140050, (InitPlan 1).col1]))
Rows Removed by Filter: 999993
Buffers: shared hit=2362 read=2063
InitPlan 1
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1.00 loops=1)
Planning Time: 0.211 ms
Execution Time: 417.984 ms
(8 rows)

Without your patch.
# explain analyze select * from test where val in (VALUES (5000), (4000), (9000), (2000), (1000), (140050), ((select 1)));
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Hash Semi Join (cost=0.19..17050.26 rows=7 width=4) (actual time=0.289..210.335 rows=7.00 loops=1)
Hash Cond: (test.val = "*VALUES*".column1)
Buffers: shared hit=2211 read=2214
InitPlan 1
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1.00 loops=1)
-> Seq Scan on test (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.203..85.925 rows=1000000.00 loops=1)
Buffers: shared hit=2211 read=2214
-> Hash (cost=0.09..0.09 rows=7 width=4) (actual time=0.067..0.068 rows=7.00 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Values Scan on "*VALUES*" (cost=0.00..0.09 rows=7 width=4) (actual time=0.004..0.023 rows=7.00 loops=1)
Planning Time: 0.321 ms
Execution Time: 210.409 ms
(12 rows)

Only when we have all consts in the array, we can have SAOP hashing which is competitive with hashing of VALUES node. So, I'm continuing with my version of patch.

Links.
1. https://www.postgresql.org/message-id/CAPpHfds-7eJ3ZMgyeVrMfC5E1nTHD4Bp0ch5MZhrYSoiCfERXw%40mail.gmail.com

Okay, I agree with you.

-- 
Regards,
Alena Rybakina
Postgres Professional

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Alexander Korotkov

Date:

03 April, 18:26:48

On Thu, Apr 3, 2025 at 5:18 PM Alena Rybakina <a.rybakina@postgrespro.ru> wrote:
> Okay, I agree with you.

Good.  I've reflected this limitation in comments and the commit
message.  Also, I've adjust regression tests by removing excessive
ones and adding more important cases.  I'm going to push this if no
objections.

------
Regards,
Alexander Korotkov
Supabase

Attachment

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Alena Rybakina

Date:

03 April, 18:51:10

On 03.04.2025 18:26, Alexander Korotkov wrote:
> On Thu, Apr 3, 2025 at 5:18 PM Alena Rybakina <a.rybakina@postgrespro.ru> wrote:
>> Okay, I agree with you.
> Good.  I've reflected this limitation in comments and the commit
> message.
Thank you, it looks fine)
> Also, I've adjust regression tests by removing excessive
> ones and adding more important cases.  I'm going to push this if no
> objections.
I agree with your changes.

-- 
Regards,
Alena Rybakina
Postgres Professional

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Andres Freund

Date:

04 April, 17:39:53

On 2025-04-03 18:26:48 +0300, Alexander Korotkov wrote:
> On Thu, Apr 3, 2025 at 5:18 PM Alena Rybakina <a.rybakina@postgrespro.ru> wrote:
> > Okay, I agree with you.
>
> Good.  I've reflected this limitation in comments and the commit
> message.  Also, I've adjust regression tests by removing excessive
> ones and adding more important cases.  I'm going to push this if no
> objections.

This just failed on both buildfarm and CI:

BF:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rhinoceros&dt=2025-04-04%2013%3A52%3A13
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=iguana&dt=2025-04-04%2014%3A08%3A50
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rhinoceros&dt=2025-04-04%2013%3A52%3A13
...


CI:

https://cirrus-ci.com/task/5466958990147584

https://api.cirrus-ci.com/v1/artifact/task/5466958990147584/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

--- C:/cirrus/src/test/regress/expected/subselect.out    2025-04-04 13:20:49.429815400 +0000
+++ C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/results/subselect.out    2025-04-04 13:23:33.602847700
+0000
@@ -2769,15 +2769,16 @@
 EXPLAIN (COSTS OFF)
 SELECT c.oid,c.relname FROM pg_class c JOIN pg_am a USING (oid)
 WHERE c.oid IN (VALUES (1), (2));
-                          QUERY PLAN
----------------------------------------------------------------
- Hash Join
-   Hash Cond: (a.oid = c.oid)
-   ->  Seq Scan on pg_am a
-   ->  Hash
-         ->  Index Scan using pg_class_oid_index on pg_class c
-               Index Cond: (oid = ANY ('{1,2}'::oid[]))
-(6 rows)
+                       QUERY PLAN
+---------------------------------------------------------
+ Merge Join
+   Merge Cond: (c.oid = a.oid)
+   ->  Index Scan using pg_class_oid_index on pg_class c
+         Index Cond: (oid = ANY ('{1,2}'::oid[]))
+   ->  Sort
+         Sort Key: a.oid
+         ->  Seq Scan on pg_am a
+(7 rows)

 -- Constant expressions are simplified
 EXPLAIN (COSTS OFF)


This failure was entirely predictable, as this patch has been failing on cfbot
for *months*.  Seriously. The tests passed exactly once since 2024-11-26, see:

https://cirrus-ci.com/github/postgresql-cfbot/postgresql/cf%2F5283

Come-on.

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Tom Lane

Date:

04 April, 18:17:53

Andres Freund <andres@anarazel.de> writes:
> This just failed on both buildfarm and CI:

Yeah, the plan for that join seems fairly unstable.  Is there
actually any need for the join to pg_am?

            regards, tom lane

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Alena Rybakina

Date:

04 April, 18:19:37

On 04.04.2025 17:39, Andres Freund wrote:
> On 2025-04-03 18:26:48 +0300, Alexander Korotkov wrote:
>> On Thu, Apr 3, 2025 at 5:18 PM Alena Rybakina <a.rybakina@postgrespro.ru> wrote:
>>> Okay, I agree with you.
>> Good.  I've reflected this limitation in comments and the commit
>> message.  Also, I've adjust regression tests by removing excessive
>> ones and adding more important cases.  I'm going to push this if no
>> objections.
> This just failed on both buildfarm and CI:
>
> BF:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rhinoceros&dt=2025-04-04%2013%3A52%3A13
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=iguana&dt=2025-04-04%2014%3A08%3A50
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rhinoceros&dt=2025-04-04%2013%3A52%3A13
> ...
>
>
> CI:
>
> https://cirrus-ci.com/task/5466958990147584
>
https://api.cirrus-ci.com/v1/artifact/task/5466958990147584/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade
>
> --- C:/cirrus/src/test/regress/expected/subselect.out    2025-04-04 13:20:49.429815400 +0000
> +++ C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/results/subselect.out    2025-04-04 13:23:33.602847700
+0000
> @@ -2769,15 +2769,16 @@
>   EXPLAIN (COSTS OFF)
>   SELECT c.oid,c.relname FROM pg_class c JOIN pg_am a USING (oid)
>   WHERE c.oid IN (VALUES (1), (2));
> -                          QUERY PLAN
> ----------------------------------------------------------------
> - Hash Join
> -   Hash Cond: (a.oid = c.oid)
> -   ->  Seq Scan on pg_am a
> -   ->  Hash
> -         ->  Index Scan using pg_class_oid_index on pg_class c
> -               Index Cond: (oid = ANY ('{1,2}'::oid[]))
> -(6 rows)
> +                       QUERY PLAN
> +---------------------------------------------------------
> + Merge Join
> +   Merge Cond: (c.oid = a.oid)
> +   ->  Index Scan using pg_class_oid_index on pg_class c
> +         Index Cond: (oid = ANY ('{1,2}'::oid[]))
> +   ->  Sort
> +         Sort Key: a.oid
> +         ->  Seq Scan on pg_am a
> +(7 rows)
>
>   -- Constant expressions are simplified
>   EXPLAIN (COSTS OFF)
>
>
> This failure was entirely predictable, as this patch has been failing on cfbot
> for *months*.  Seriously. The tests passed exactly once since 2024-11-26, see:
>
> https://cirrus-ci.com/github/postgresql-cfbot/postgresql/cf%2F5283
>
> Come-on.
>
>
Sorry, I didn't understand at first how to find a specific diff file to 
analyze what was wrong with my test, to be honest.
I only started to understand how to do this recently, then later I 
thought everything was fine after the letter several days before.

I fixed it - changed the tables and didn't use system tables.

Sorry again, I'll be more careful with this.

-- 
Regards,
Alena Rybakina
Postgres Professional

Attachment

values_test.diff

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Peter Geoghegan

Date:

04 April, 18:33:56

On Fri, Apr 4, 2025 at 11:19 AM Alena Rybakina
<a.rybakina@postgrespro.ru> wrote:
> I fixed it - changed the tables and didn't use system tables.

Is somebody going to commit this soon? Alexander?


--
Peter Geoghegan

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Tom Lane

Date:

04 April, 19:00:39

Peter Geoghegan <pg@bowt.ie> writes:
> Is somebody going to commit this soon? Alexander?

Done.

I thought the point of the test case was to check VALUES folding
on both sides of the join.  I'm not sure why that's actually
a useful expenditure of test cycles, but Alena's patch undid that
without adjusting the comment claiming that's what it's doing.
So I kept the shape of the test the same while substituting
hopefully-stabler tables as she did.

If it's still not stable I think the next step is to nuke both
test queries, since I remain of the opinion that they're likely
a waste of cycles.

            regards, tom lane

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Peter Geoghegan

Date:

04 April, 19:02:51

On Fri, Apr 4, 2025 at 12:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
> > Is somebody going to commit this soon? Alexander?
>
> Done.

Thanks!

> If it's still not stable I think the next step is to nuke both
> test queries, since I remain of the opinion that they're likely
> a waste of cycles.

+1

--
Peter Geoghegan

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Alexander Korotkov

Date:

04 April, 19:04:36

On Fri, Apr 4, 2025 at 7:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
> > Is somebody going to commit this soon? Alexander?
>
> Done.
>
> I thought the point of the test case was to check VALUES folding
> on both sides of the join.  I'm not sure why that's actually
> a useful expenditure of test cycles, but Alena's patch undid that
> without adjusting the comment claiming that's what it's doing.
> So I kept the shape of the test the same while substituting
> hopefully-stabler tables as she did.
>
> If it's still not stable I think the next step is to nuke both
> test queries, since I remain of the opinion that they're likely
> a waste of cycles.

Thank you for pushing this!  I just start looking at the problem and
have just seen your commit.

------
Regards,
Alexander Korotkov
Supabase

Re: Replace IN VALUES with ANY in WHERE clauses during optimization

From

Alena Rybakina

Date:

04 April, 19:30:47

On 04.04.2025 19:00, Tom Lane wrote:

Peter Geoghegan <pg@bowt.ie> writes:

Is somebody going to commit this soon? Alexander?

Done.

I thought the point of the test case was to check VALUES folding
on both sides of the join.  I'm not sure why that's actually
a useful expenditure of test cycles, but Alena's patch undid that
without adjusting the comment claiming that's what it's doing.
So I kept the shape of the test the same while substituting
hopefully-stabler tables as she did.

Yes, that's right. Thank you.

If it's still not stable I think the next step is to nuke both
test queries, since I remain of the opinion that they're likely
a waste of cycles.

Agree.

-- 
Regards,
Alena Rybakina
Postgres Professional