Re: POC, WIP: OR-clause support for indexes - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: POC, WIP: OR-clause support for indexes
Date
Msg-id CAPpHfdu5iQOjF93vGbjidsQkhHvY2NSm29duENYH_cbhC6x+Mg@mail.gmail.com
Whole thread Raw
In response to Re: POC, WIP: OR-clause support for indexes  (Alena Rybakina <a.rybakina@postgrespro.ru>)
Responses Re: POC, WIP: OR-clause support for indexes
List pgsql-hackers
On Sun, Jul 28, 2024 at 12:59 PM Alena Rybakina
<a.rybakina@postgrespro.ru> wrote:
> On 27.07.2024 13:56, Alexander Korotkov wrote:
> > On Thu, Jul 25, 2024 at 5:04 PM Alena Rybakina
> > <a.rybakina@postgrespro.ru>  wrote:
> >> To be honest, I have found a big problem in this patch - we try to perform the transformation every time we
examimea column: 
> >>
> >> for (indexcol = 0; indexcol < index->nkeycolumns; indexcol++) { ...
> >>
> >> }
> >>
> >> I have fixed it and moved the transformation before going through the loop.
> > What makes you think there is a problem?
>
> To be honest, I was bothered by the fact that we need to go through
> expressions several times that obviously will not fit under other
> conditions.
> Just yesterday I thought that it would be worthwhile to create a list of
> candidates - expressions that did not fit because the column did not
> match the index (!match_index_to_operand(nconst_expr, indexcol, index)).

I admit that this area probably could use some optimization,
especially for case of many clauses and many indexes.  But in the
scope of this patch, I think this is enough to not make things worse
in this area.

> Another problem that is related to the first one that the boolexpr could
> contain expressions referring to different operands, for example, both x
> and y. that is, we may have the problem that the optimal "ANY"
> expression may not be used, because the expression with x may come
> earlier and the loop may end earlier.
>
> alena@postgres=# create table b (x int, y int);
> CREATE TABLE
> alena@postgres=# insert into b select id, id from
> generate_series(1,1000) as id;
> INSERT 0 1000
> alena@postgres=# create index x_idx on b(x);
> CREATE INDEX
> alena@postgres=# analyze;
> ANALYZE
> alena@postgres=# explain select * from b where y =3 or x =4 or x=5 or
> x=6 or x = 7 or x=8 or x=9;
>                                        QUERY PLAN
> ---------------------------------------------------------------------------------------
>   Seq Scan on b  (cost=0.00..32.50 rows=7 width=8)
>     Filter: ((y = 3) OR (x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x =
> 8) OR (x = 9))
> (2 rows)
> alena@postgres=# explain select * from b where x =4 or x=5 or x=6 or x =
> 7 or x=8 or x=9 or y=1;
>                                        QUERY PLAN
> ---------------------------------------------------------------------------------------
>   Seq Scan on b  (cost=0.00..32.50 rows=7 width=8)
>     Filter: ((x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x = 8) OR (x =
> 9) OR (y = 1))
> (2 rows)
> alena@postgres=# explain select * from b where x =4 or x=5 or x=6 or x =
> 7 or x=8 or x=9;
>                             QUERY PLAN
> ----------------------------------------------------------------
>   Index Scan using x_idx on b  (cost=0.28..12.75 rows=6 width=8)
>     Index Cond: (x = ANY ('{4,5,6,7,8,9}'::integer[]))
> (2 rows)
>
> Furthermore expressions can be stored in a different order.
> For example, first comes "AND" expr, and then group of "OR" expr, which
> we can convert to "ANY" expr, but we won't do this due to the fact that
> we will exit the loop early, according to this condition:
>
> if (!IsA(sub_rinfo->clause, OpExpr))
>             return NULL;
>
> or it may occur due to other conditions.
>
> alena@postgres=# create index x_y_idx on b(x,y);
> CREATE INDEX
> alena@postgres=# analyze;
> ANALYZE
>
> alena@postgres=# explain select * from b where (x = 2 and y =3) or x =4
> or x=5 or x=6 or x = 7 or x=8 or x=9;
>                                               QUERY PLAN
> -----------------------------------------------------------------------------------------------------
>   Seq Scan on b  (cost=0.00..35.00 rows=6 width=8)
>     Filter: (((x = 2) AND (y = 3)) OR (x = 4) OR (x = 5) OR (x = 6) OR
> (x = 7) OR (x = 8) OR (x = 9))
> (2 rows)
>
> Because of these reasons, I tried to save this and that transformation
> together for each column and try to analyze for each expr separately
> which method would be optimal.

Yes, with v27 of the patch, optimization wouldn't work in these cases.
However, you are using quite small table.  If you will use larger
table or disable sequential scans, there would be bitmap plans to
handle these queries.  So, v27 doesn't make the situation worse. It
just doesn't optimize all that it could potentially optimize and
that's OK.

I've written a separate 0002 patch to address this.  Now, before
generation of paths for bitmap OR, similar OR entries are grouped
together.  When considering a group of similar entries, they are
considered both together and one-by-one.  Ideally we could consider
more sophisticated grouping, but that seems fine for now.  You can
check how this patch handles the cases of above.

Also, 0002 address issue of duplicated bitmap scan conditions in
different forms. During generate_bitmap_or_paths() we need to exclude
considered condition for other clauses.  It couldn't be as normal
filtered out in the latter stage, because could reach the index in
another form.

> > Do you have a test case
> > illustrating a slow planning time?
> No, I didn't have time to measure it and sorry for that. I'll do it.
> > When v27 performs transformation for a particular column, it just
> > stops facing the first unmatched OR entry.  So,
> > match_orclause_to_indexcol() examines just the first OR entry for all
> > the columns excepts at most one.  So, the check
> > match_orclause_to_indexcol() does is not much slower than other
> > match_*_to_indexcol() do.
> >
> > I actually think this could help performance in many cases, not hurt
> > it.  At least we get rid of O(n^2) complexity over the number of OR
> > entries, which could be very many.
>
> I agree with you that there is an overhead and your patch fixes this
> problem, but optimizer needs to have a good ordering of expressions for
> application.
>
> I think we can try to move the transformation to another place where
> there is already a loop pass, and also save two options "OR" expr and
> "ANY" expr in one place (through BoolExpr) (like find_duplicate_ors
> function) and teach the optimizer to determine which option is better,
> for example, like now in match_orclause_to_indexcol() function.
>
> What do you thing about it?

find_duplicate_ors() and similar places were already tried before.
Please, check upthread.  This approach receives severe critics. AFAIU,
the problem is that find_duplicate_ors() during preprocessing, a
cost-blind stage.

This is why I'd like to continue developing ideas of v27, because it
fits the existing framework.

------
Regards,
Alexander Korotkov
Supabase

Attachment

pgsql-hackers by date:

Previous
From: "Joel Jacobson"
Date:
Subject: Re: Optimize mul_var() for var1ndigits >= 8
Next
From: "Anton A. Melnikov"
Date:
Subject: Re: Maybe don't process multi xmax in FreezeMultiXactId() if it is already marked as invalid?