Re: POC, WIP: OR-clause support for indexes - Mailing list pgsql-hackers

From Alena Rybakina
Subject Re: POC, WIP: OR-clause support for indexes
Date
Msg-id c43ff0d4-a431-4232-8484-dcf8baac1c4e@postgrespro.ru
Whole thread Raw
In response to Re: POC, WIP: OR-clause support for indexes  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: POC, WIP: OR-clause support for indexes
List pgsql-hackers
On 26.06.2024 23:19, Robert Haas wrote:
I think maybe replying to multiple emails with a single email is
something you'd be better off doing less often.
Ok, I won't do this in the future. After thinking it over, I realized that it turned out to be some kind of mess in the end.
On Tue, Jun 25, 2024 at 7:14 PM Alena Rybakina
<a.rybakina@postgrespro.ru> wrote:
Sorry, you are right and I'll try to explain more precisely. The first approach is the first part of the patch, where we made "Or" expressions into an SAOP at an early stage of plan generation [0], the second one was the one proposed by A.Korotkov [1].
[0] isn't doing anything "at an early stage of plan generation".  It's
changing something in *the parser*. The parser and planner are VERY
different stages of query parsing, and it's really important to keep
them separate mentally and in discussions. 

Thanks for the detailed explanation, I'm always glad to learn new things for myself) 

To be honest, I had an intuitive feeling that the transformation was called in the analyzer stage, but I wasn't sure about it, so I tried to summarize it. 

As for the fact that in general all this can be divided into two large stages, parsing and planner is a little new to me. I have reread the documentation [0] and I found information about it there.

Before that, I was guided by information from the Carnegie Mellon University lecture and the Bruce Mamjian report[1], which was wrong on my part.

By the way, it turns out that the query rewriting stage refers to an independent stage, which is located between the parser stage and the planner/optimizer. I found it from the documentation [2].

[0] https://www.postgresql.org/docs/current/planner-optimizer.html

[1] https://momjian.us/main/writings/pgsql/optimizer.pdf

[2] https://www.postgresql.org/docs/16/rule-system.html

We should not be changing
anything about the query in the parser, because that will, as
Alexander also pointed out, change what gets stored if the user does
something like CREATE VIEW whatever AS SELECT ...; and we should, for
the most part, be storing the query as the user entered it, not some
transformed version of it. Further, at the parser stage, we do not
know the cost of anything, so we can only transform things when the
transformed version is always - or practically always - going to be
cheaper than the untransformed version.

Thank you, now it has become clear to me why it is so important to leave the transformation at the planner stage.

On 24.06.2024 18:28, Robert Haas wrote:
Andrei mentioned the problem, which might be caused by the transformation on the later stage of optimization is updating references to expressions in RestrictInfo [3] lists, because they can be used in different parts during the formation of the query plan. As the practice of Self-join removal [4] has shown, this can be expensive, but feasible. By applying the transformation at the analysis stage [0], because no links were created, so we did not encounter such problems, so this approach was more suitable than the others.
The link you provided for [3] doesn't show me exactly what code you're
talking about, but I can see why mutating a RestrictInfo after
creating it could be problematic. However, I'm not proposing that, and
I don't think it's a good idea. Instead of mutating an existing data
structure after it's been created, we want to get each data structure
correct at the moment that it is created. What that means is that at
each stage of processing, whenever we create a new in-memory data
structure, we have an opportunity to make changes along the way.

For instance, let's say we have a RestrictInfo and we are creating a
Path, perhaps via create_index_path(). One argument to that function
is a list of indexclauses. The indexclauses are derived from the
RestrictInfo list associated with the RelOptInfo. We take some subset
of those quals that are deemed to be indexable and we reorder them and
maybe change a few things and we build this new list of indexclauses
that is then passed to create_index_path(). The RelOptInfo's list of
RestrictInfos is not changed -- only the new list of clauses derived
from it is being built up here, without any mutation of the original
structure.

This is the kind of thing that this patch can and probably should do.
Join removal is quite awkward, as you rightly point out, because we
end up modifying existing data structures after they've been created,
and that requires us to run around and fix up a bunch of stuff, and
that can have bugs. Whenever possible, we don't want to do it that
way. Instead, we want to pick points in the processing when we're
anyway constructing some new structure and use that as an opportunity
to do transformations when building the new structure that incorporate
optimizations that make sense.

Thanks for the idea! I hadn't thought in this direction before, but it really might just work and solve all our original problems.
By the way, I saw that the optimizer is smart enough to eliminate duplicates. Below I have conducted a couple of examples where he decides for himself which expression is more profitable for him to leave.
We just need to add this transformation, and the optimizer will choose the appropriate one)

alena@postgres=# explain select * from x where (a = 1 or a = 2) and a in (1,2);

                             QUERY PLAN                             
--------------------------------------------------------------------
 Index Only Scan using a_idx on x  (cost=0.28..8.61 rows=1 width=4)
   Index Cond: (a = ANY ('{1,2}'::integer[]))
(2 rows)

alena@postgres=# explain select * from x where a < 3 and (a = 1 or a = 2) and a = ANY(ARRAY[1,2]);
                             QUERY PLAN                             
--------------------------------------------------------------------
 Index Only Scan using a_idx on x  (cost=0.28..8.60 rows=1 width=4)
   Index Cond: ((a < 3) AND (a = ANY ('{1,2}'::integer[])))
(2 rows)

It works for Korotkov's case too, as I see it:

alena@postgres=# create table test as (select (random()*10)::int x, (random()*1000) y from generate_series(1,1000000) i); create index test_x_1_y on test (y) where x = 1; create index test_x_2_y on test (y) where x = 2; vacuum analyze test; SELECT 1000000 CREATE INDEX CREATE INDEX VACUUM alena@postgres=# explain select * from test where (x = 1 or x = 2) and y = 100 and x in (1,2); QUERY PLAN -------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on test (cost=8.60..12.62 rows=1 width=12) Recheck Cond: (((y = '100'::double precision) AND (x = 1)) OR ((y = '100'::double precision) AND (x = 2))) -> BitmapOr (cost=8.60..8.60 rows=1 width=0) -> Bitmap Index Scan on test_x_1_y (cost=0.00..4.30 rows=1 width=0) Index Cond: (y = '100'::double precision) -> Bitmap Index Scan on test_x_2_y (cost=0.00..4.30 rows=1 width=0) Index Cond: (y = '100'::double precision) (7 rows)

I noticed that the distribute_quals_to_rels function launches at the stage when it is necessary to generate RestrictInfo lists for relation - it might be a suitable place for applying transformation.
So, instead of completely replacing the list, we should form a complex BoolExpr structure with the "AND" operator, which should contain two expressions, where one of them is BoolExpr with the "OR" operator and the second is ScalarArrayOpExpr.

To be honest, I've already started writing code to do this, but I'm faced with a misunderstanding of how to correctly create a condition for "OR" expressions that are not subject to transformation. For example, the expressions b=1 in the query below:

alena@postgres=# explain select * from x where ( (a =5 or a=4) and a = ANY(ARRAY[5,4])) or (b=1); QUERY PLAN ---------------------------------------------------------------------------------- Seq Scan on x (cost=0.00..123.00 rows=1 width=8) Filter: ((((a = 5) OR (a = 4)) AND (a = ANY ('{5,4}'::integer[]))) OR (b = 1)) (2 rows)

I see that two expressions have remained unchanged and it only works for "AND" binary operations.

But I think it might be worth applying this together, where does the optimizer generate indexes (build_paths_for_OR function)?

-- 
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: Is missing LOGIN Event on Trigger Firing Matrix ?
Next
From: "David G. Johnston"
Date:
Subject: Re: Is missing LOGIN Event on Trigger Firing Matrix ?