Re: POC, WIP: OR-clause support for indexes - Mailing list pgsql-hackers

From a.rybakina
Subject Re: POC, WIP: OR-clause support for indexes
Date
Msg-id c27333ea-3515-487e-8025-468e7d86ddff@postgrespro.ru
Whole thread Raw
In response to Re: POC, WIP: OR-clause support for indexes  (Alexander Korotkov <aekorotkov@gmail.com>)
List pgsql-hackers
Hi! Thank you for your review!

On 15.10.2023 01:34, Alexander Korotkov wrote:
Hi, Alena!

Thank you for your work on the subject.

On Wed, Oct 4, 2023 at 10:21 PM a.rybakina <a.rybakina@postgrespro.ru> wrote:
I fixed the kernel dump issue and all the regression tests were successful, but I discovered another problem when I added my own regression tests.
Some queries that contain "or" expressions do not convert to "ANY". I have described this in more detail using diff as expected and real results:

diff -U3 /home/alena/postgrespro__copy6/src/test/regress/expected/create_index.out /home/alena/postgrespro__copy6/src/test/regress/results/create_index.out
--- /home/alena/postgrespro__copy6/src/test/regress/expected/create_index.out 2023-10-04 21:54:12.496282667 +0300
+++ /home/alena/postgrespro__copy6/src/test/regress/results/create_index.out  2023-10-04 21:55:41.665422459 +0300
@@ -1925,17 +1925,20 @@ EXPLAIN (COSTS OFF) SELECT count(*) FROM tenk1   WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3) OR thousand = 41;
-                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
+                                                        QUERY PLAN
+---------------------------------------------------------------------------------------------------------------------------  Aggregate    ->  Bitmap Heap Scan on tenk1
-         Recheck Cond: (((thousand = 42) AND (tenthous = ANY ('{1,3}'::integer[]))) OR (thousand = 41))
+         Recheck Cond: ((((thousand = 42) AND (tenthous = 1)) OR ((thousand = 42) AND (tenthous = 3))) OR (thousand = 41))          ->  BitmapOr
-               ->  Bitmap Index Scan on tenk1_thous_tenthous
-                     Index Cond: ((thousand = 42) AND (tenthous = ANY ('{1,3}'::integer[])))
+               ->  BitmapOr
+                     ->  Bitmap Index Scan on tenk1_thous_tenthous
+                           Index Cond: ((thousand = 42) AND (tenthous = 1))
+                     ->  Bitmap Index Scan on tenk1_thous_tenthous
+                           Index Cond: ((thousand = 42) AND (tenthous = 3))                ->  Bitmap Index Scan on tenk1_thous_tenthous                      Index Cond: (thousand = 41)
-(8 rows)
+(11 rows)
I think this query is not converted, because you only convert
top-level ORs in the transform_ors() function.  But in the example
given, the target OR lays under AND, which in turn lays under another
OR.  I think you need to make transform_ors() recursive to handle
cases like this.
Yes, you are right, it seems that a recursive method is needed here.
I wonder about the default value of the parameter or_transform_limit
of 500. In [1] and [2] you show the execution time degradation from 0
to ~500 OR clauses.  I made a simple SQL script with the query "SELECT
* FROM pgbench_accounts a WHERE  aid = 1 OR aid = 2 OR ... OR aid =
100;". The pgbench results for a single connection in prepared mode
are the following.
master: 936 tps
patched (or_transform_limit == 0) :1414 tps
So, transformation to ANY obviously accelerates the execution.

I think it's important to identify the cases where this patch causes
the degradation.  Generally, I don't see why ANY could be executed
slower than the equivalent OR clause.  So, the possible degradation
cases are slower plan generation and worse plans.  I managed to find
both.

As you stated before, currently the OR transformation has a quadratic
complexity depending on the number of or-clause-groups.  I made a
simple test to evaluate this. containing 10000 or-clause-groups.
SELECT * FROM pgbench_accounts a WHERE aid + 1 * bid = 1 OR aid + 2 *
bid = 1 OR ... OR aid + 10000 * bid = 1;
master: 316ms
patched: 7142ms
Note, that the current or_transform_limit GUC parameter is not capable
of cutting such cases, because it cuts cases lower than the limit not
higher than the limit.  In the comment, you mention that we could
invent something like hash to handle this.  Hash should be nice, but
the problem is that we currently don't have a generic facility to hash
nodes (or even order them).  It would be nice to add this facility,
that would be quite a piece of work.  I would propose to limit this
patch for now to handle just a single Var node as a non-const side of
the clause and implement a simple hash for Vars.
I ran the query and saw that you were right, this place in the patch turns out to be very expensive. In addition to the hash, I saw a second solution to this problem - parameterize constants and store them in the list, but this will not be such a universal solution as hashing. If the variable, not the constant, changes, parameterization will not help.

I agree with your suggestion to try adding hashing. I'll take a closer look at this.

Another problem is the possible generation of worse plans.  I made an
example table with two partial indexes.
create table test as (select (random()*10)::int x, (random()*1000) y
from generate_series(1,1000000) i);
create index test_x_1_y on test (y) where x = 1;
create index test_x_2_y on test (y) where x = 2;
vacuum analyze test;

Without the transformation of ORs to ANY, our planner manages to use
both indexes with a Bitmap scan.
# explain select * from test where (x = 1 or x = 2) and y = 100;                                                  QUERY PLAN
-------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on test  (cost=8.60..12.62 rows=1 width=12)   Recheck Cond: (((y = '100'::double precision) AND (x = 1)) OR ((y =
'100'::double precision) AND (x = 2)))   ->  BitmapOr  (cost=8.60..8.60 rows=1 width=0)         ->  Bitmap Index Scan on test_x_1_y  (cost=0.00..4.30 rows=1 width=0)               Index Cond: (y = '100'::double precision)         ->  Bitmap Index Scan on test_x_2_y  (cost=0.00..4.30 rows=1 width=0)               Index Cond: (y = '100'::double precision)
(7 rows)

With transformation, the planner can't use indexes.
# explain select * from test where (x = 1 or x = 2) and y = 100;                                 QUERY PLAN
----------------------------------------------------------------------------- Gather  (cost=1000.00..12690.10 rows=1 width=12)   Workers Planned: 2   ->  Parallel Seq Scan on test  (cost=0.00..11690.00 rows=1 width=12)         Filter: ((x = ANY (ARRAY[1, 2])) AND (y = '100'::double precision))
(4 rows)

The solution I see would be to tech Bitmap scan to handle ANY clause
in the same way as the OR clause.  I think the entry point for the
relevant logic is the choose_bitmap_and() function.
It's a good idea, I'll try.
But to be honest, I'm afraid that problems with selectivity may come up again and in order to solve them, additional processing of RestrictInfo may be required, which will be unnecessarily expensive. As far as I understand, at this stage we are creating indexes for AND expressions and there is a risk that its transformation may cause the need to change references in all possible places where it was referenced.
Regarding the GUC parameter, I don't see we need a limit.  It's not
yet clear whether a small number or a large number of OR clauses are
more favorable for transformation.  I propose to have just a boolean
enable_or_transformation GUC.

Links
1. https://www.postgresql.org/message-id/6b97b517-f36a-f0c6-3b3a-0cf8cfba220c%40yandex.ru
2. https://www.postgresql.org/message-id/938d82e1-98df-6553-334c-9db7c4e288ae%40yandex.ru

I tend to agree with you and I see that in some cases it really doesn't help.

pgsql-hackers by date:

Previous
From: "David E. Wheeler"
Date:
Subject: Re: Patch: Improve Boolean Predicate JSON Path Docs
Next
From: Alexander Korotkov
Date:
Subject: Re: On login trigger: take three