Re: POC, WIP: OR-clause support for indexes - Mailing list pgsql-hackers
From | a.rybakina |
---|---|
Subject | Re: POC, WIP: OR-clause support for indexes |
Date | |
Msg-id | c27333ea-3515-487e-8025-468e7d86ddff@postgrespro.ru Whole thread Raw |
In response to | Re: POC, WIP: OR-clause support for indexes (Alexander Korotkov <aekorotkov@gmail.com>) |
List | pgsql-hackers |
Hi! Thank you for your review!
On 15.10.2023 01:34, Alexander Korotkov wrote:
Yes, you are right, it seems that a recursive method is needed here.Hi, Alena! Thank you for your work on the subject. On Wed, Oct 4, 2023 at 10:21 PM a.rybakina <a.rybakina@postgrespro.ru> wrote:I fixed the kernel dump issue and all the regression tests were successful, but I discovered another problem when I added my own regression tests. Some queries that contain "or" expressions do not convert to "ANY". I have described this in more detail using diff as expected and real results: diff -U3 /home/alena/postgrespro__copy6/src/test/regress/expected/create_index.out /home/alena/postgrespro__copy6/src/test/regress/results/create_index.out --- /home/alena/postgrespro__copy6/src/test/regress/expected/create_index.out 2023-10-04 21:54:12.496282667 +0300 +++ /home/alena/postgrespro__copy6/src/test/regress/results/create_index.out 2023-10-04 21:55:41.665422459 +0300 @@ -1925,17 +1925,20 @@ EXPLAIN (COSTS OFF) SELECT count(*) FROM tenk1 WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3) OR thousand = 41; - QUERY PLAN --------------------------------------------------------------------------------------------------------- + QUERY PLAN +--------------------------------------------------------------------------------------------------------------------------- Aggregate -> Bitmap Heap Scan on tenk1 - Recheck Cond: (((thousand = 42) AND (tenthous = ANY ('{1,3}'::integer[]))) OR (thousand = 41)) + Recheck Cond: ((((thousand = 42) AND (tenthous = 1)) OR ((thousand = 42) AND (tenthous = 3))) OR (thousand = 41)) -> BitmapOr - -> Bitmap Index Scan on tenk1_thous_tenthous - Index Cond: ((thousand = 42) AND (tenthous = ANY ('{1,3}'::integer[]))) + -> BitmapOr + -> Bitmap Index Scan on tenk1_thous_tenthous + Index Cond: ((thousand = 42) AND (tenthous = 1)) + -> Bitmap Index Scan on tenk1_thous_tenthous + Index Cond: ((thousand = 42) AND (tenthous = 3)) -> Bitmap Index Scan on tenk1_thous_tenthous Index Cond: (thousand = 41) -(8 rows) +(11 rows)I think this query is not converted, because you only convert top-level ORs in the transform_ors() function. But in the example given, the target OR lays under AND, which in turn lays under another OR. I think you need to make transform_ors() recursive to handle cases like this.
I ran the query and saw that you were right, this place in the patch turns out to be very expensive. In addition to the hash, I saw a second solution to this problem - parameterize constants and store them in the list, but this will not be such a universal solution as hashing. If the variable, not the constant, changes, parameterization will not help.I wonder about the default value of the parameter or_transform_limit of 500. In [1] and [2] you show the execution time degradation from 0 to ~500 OR clauses. I made a simple SQL script with the query "SELECT * FROM pgbench_accounts a WHERE aid = 1 OR aid = 2 OR ... OR aid = 100;". The pgbench results for a single connection in prepared mode are the following. master: 936 tps patched (or_transform_limit == 0) :1414 tps So, transformation to ANY obviously accelerates the execution. I think it's important to identify the cases where this patch causes the degradation. Generally, I don't see why ANY could be executed slower than the equivalent OR clause. So, the possible degradation cases are slower plan generation and worse plans. I managed to find both. As you stated before, currently the OR transformation has a quadratic complexity depending on the number of or-clause-groups. I made a simple test to evaluate this. containing 10000 or-clause-groups. SELECT * FROM pgbench_accounts a WHERE aid + 1 * bid = 1 OR aid + 2 * bid = 1 OR ... OR aid + 10000 * bid = 1; master: 316ms patched: 7142ms Note, that the current or_transform_limit GUC parameter is not capable of cutting such cases, because it cuts cases lower than the limit not higher than the limit. In the comment, you mention that we could invent something like hash to handle this. Hash should be nice, but the problem is that we currently don't have a generic facility to hash nodes (or even order them). It would be nice to add this facility, that would be quite a piece of work. I would propose to limit this patch for now to handle just a single Var node as a non-const side of the clause and implement a simple hash for Vars.
I agree with your suggestion to try adding hashing. I'll take a closer look at this.
It's a good idea, I'll try.Another problem is the possible generation of worse plans. I made an example table with two partial indexes. create table test as (select (random()*10)::int x, (random()*1000) y from generate_series(1,1000000) i); create index test_x_1_y on test (y) where x = 1; create index test_x_2_y on test (y) where x = 2; vacuum analyze test; Without the transformation of ORs to ANY, our planner manages to use both indexes with a Bitmap scan. # explain select * from test where (x = 1 or x = 2) and y = 100; QUERY PLAN -------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on test (cost=8.60..12.62 rows=1 width=12) Recheck Cond: (((y = '100'::double precision) AND (x = 1)) OR ((y = '100'::double precision) AND (x = 2))) -> BitmapOr (cost=8.60..8.60 rows=1 width=0) -> Bitmap Index Scan on test_x_1_y (cost=0.00..4.30 rows=1 width=0) Index Cond: (y = '100'::double precision) -> Bitmap Index Scan on test_x_2_y (cost=0.00..4.30 rows=1 width=0) Index Cond: (y = '100'::double precision) (7 rows) With transformation, the planner can't use indexes. # explain select * from test where (x = 1 or x = 2) and y = 100; QUERY PLAN ----------------------------------------------------------------------------- Gather (cost=1000.00..12690.10 rows=1 width=12) Workers Planned: 2 -> Parallel Seq Scan on test (cost=0.00..11690.00 rows=1 width=12) Filter: ((x = ANY (ARRAY[1, 2])) AND (y = '100'::double precision)) (4 rows) The solution I see would be to tech Bitmap scan to handle ANY clause in the same way as the OR clause. I think the entry point for the relevant logic is the choose_bitmap_and() function.
But to be honest, I'm afraid that problems with selectivity may come up again and in order to solve them, additional processing of RestrictInfo may be required, which will be unnecessarily expensive. As far as I understand, at this stage we are creating indexes for AND expressions and there is a risk that its transformation may cause the need to change references in all possible places where it was referenced.
Regarding the GUC parameter, I don't see we need a limit. It's not yet clear whether a small number or a large number of OR clauses are more favorable for transformation. I propose to have just a boolean enable_or_transformation GUC. Links 1. https://www.postgresql.org/message-id/6b97b517-f36a-f0c6-3b3a-0cf8cfba220c%40yandex.ru 2. https://www.postgresql.org/message-id/938d82e1-98df-6553-334c-9db7c4e288ae%40yandex.ru
I tend to agree with you and I see that in some cases it really doesn't help.
pgsql-hackers by date: