Re: POC, WIP: OR-clause support for indexes - Mailing list pgsql-hackers

From Alena Rybakina
Subject Re: POC, WIP: OR-clause support for indexes
Date
Msg-id 0f50882c-e639-4856-aab6-6ccfec848164@postgrespro.ru
Whole thread Raw
In response to Re: POC, WIP: OR-clause support for indexes  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: POC, WIP: OR-clause support for indexes
List pgsql-hackers
On 21.06.2024 23:35, Robert Haas wrote:
On Fri, Jun 21, 2024 at 1:05 PM Alena Rybakina
<a.rybakina@postgrespro.ru> wrote:
I'm confused, I have seen that we have two threads [1] and [2] about this thread and I haven't found any explanation for how they differ.

And I don't understand, why am I not listed as the author of the patch? I was developing the first part of the patch before Andrey came to review it [3] and his first part hasn't changed much since then.
v25 still lists you as an author (in fact, the first author) but I
can't say why we have two CommitFest entries. Surely that's a mistake.

Sorry, maybe I was overreacting. 

Thank you very much for taking the time to do a detailed review!

On 22.06.2024 00:07, Alexander Korotkov wrote:
Sorry, I didn't notice that the [1] commitfest entry exists and
created the [2] commitfest entry.  I'm removed [2].
Thank you!

On the patch itself, I'm really glad we got to a design where this is
part of planning, not parsing. I'm not sure yet whether we're doing it
at the right time within the planner, but I think this *might* be
right, whereas the old way was definitely wrong.

It's hard to tell, but I think it might be one of the good places to apply transformation. Let me describe a brief conclusion on the two approaches.

In the first approach, we definitely did not process the extra "OR" expressions in the first approach, since they were packaged as an Array. It could lead to the fact that less planning time would be spent on the optimizer.

Also the selectivity for Array expressions is estimated better, which could lead to the generation of a more optimal plan, but, to be honest, this is just an observation from changes in regression tests and, in general, how the process of calculating the selectivity of a complex expression works.

 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE (a = 1 OR a = 51) AND b = ''1''');

  estimated | actual

 -----------+--------

-        99 |    100

+       100 |    100

 (1 row)

 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE (a = 1 OR a = 51) AND (b = ''1'' OR b = ''2'')');

  estimated | actual

 -----------+--------

-        99 |    100

+       100 |    100

 (1 row)

 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE (a = 1 OR a = 2 OR a = 51 OR a = 52) AND (b = ''1'' OR b = ''2'')');

  estimated | actual

 -----------+--------

-       197 |    200

+       200 |    200

In addition, we do not have new equivalence classes, since some “Or” expressions are not available for their formation.  This can result in reduced memory and time spent generating the query plan, especially in partitions.

Speaking of the main disadvantages, we do not give the optimizer the opportunity to generate a plan using BitmapScan, which can lead to the generation of a suboptimal plan, but in the current approach the same thing happens [0].

And the second one might be related the lack of generation Equivalence Classes and generation of useful pathkeys as a result, so we could miss an optimal plan again. But I haven't caught something like this on practice. I see we won't have such problems if we apply the transformation later.

Overall, I have not yet noticed any very different parts from what was in the first approach: I didn’t see any significant degradation or improvement, which is good, but so far the main problem with the degradation of the plan has not yet been solved, that is, we have not escaped from the main problems.

Andrei mentioned the problem in the second approach about updating references to expressions in RestrictInfo [1] lists, because the can be used in different variables during the formation of the query plan. As the practice of Self-join removal [2] has shown, this can be expensive, but feasible. 

By applying the transformation at the analysis stage in the first approach, because no links were created, so we did not encounter such problems, so this approach was more suitable than the others.

[0] https://www.postgresql.org/message-id/7d5aed92-d4cc-4b76-8ae0-051d182c9eec%40postgrespro.ru

[1] https://www.postgresql.org/message-id/6850c306-4e9d-40b7-8096-1f3c7d29cd9e%40postgrespro.ru

[2] https://commitfest.postgresql.org/48/5043/

What exactly is the strategy around OR-clauses with type differences?
If I'm reading the code correctly, the first loop requires an exact
opno match, which presumably implies that the constant-type elements
are of the same type. But then why does the second loop need to use
coerce_to_common_type?

It needs to transform all similar constants to one type, because some constants of "OR" expressions can belong others, like the numeric and int types. Due to the fact that array structure demands that all types must be belonged to one type, so for this reason we applied this procedure.

You can find the similar strategy in transformAExprIn function, when we transform "In" list to SaopArray expression. Frankly, initially, I took it as the main example to make my patch.

+ if (!OperatorIsVisible(entry->opno))
+ namelist = lappend(namelist,
makeString(get_namespace_name(operform->oprnamespace)));
+
+ namelist = lappend(namelist, makeString(pstrdup(NameStr(operform->oprname))));
+ ReleaseSysCache(opertup);
+
+ saopexpr =
+ (ScalarArrayOpExpr *)
+ make_scalar_array_op(NULL,
+ namelist,
+ true,
+ (Node *) entry->expr,
+ (Node *) newa,
+ -1);

I do not think this is acceptable. We should find a way to get the
right operator into the ScalarArrayOpExpr without translating the OID
back into a name and then back into an OID.
I don’t really understand the reason why it’s better not to do this. Can you explain please?

Also, why is the array built with eval_const_expressions instead of
something like makeArrayResult? There should be no need for general
expression evaluation here if we are just dealing with constants.
I'm not ready to answer this question right now, I need time to study the use of the makeArrayResult function in the optimizer.
+ foreach(lc2, entry->exprs)
+ {
+ RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc2);
+
+ is_pushed_down = is_pushed_down || rinfo->is_pushed_down;
+ has_clone = has_clone || rinfo->is_pushed_down;
+ security_level = Max(security_level, rinfo->security_level);
+ required_relids = bms_union(required_relids, rinfo->required_relids);
+ incompatible_relids = bms_union(incompatible_relids,
rinfo->incompatible_relids);
+ outer_relids = bms_union(outer_relids, rinfo->outer_relids);
+ }
This seems like an extremely bad idea. Smushing together things with
different security levels (or a bunch of these other properties) seems
like it will break things. Presumably we wouldn't track these
properties on a per-RelOptInfo basis unless we needed an accurate idea
of the property value for each RelOptInfo. If the values are
guaranteed to match, then it's fine, but then we don't need this code
to merge possibly-different values. If they're not guaranteed to
match, then presumably we shouldn't merge into a single OR clause
unless they do.

We hadn't thought about it before, to be honest. But I agree with you that this may be one of the reasons not to make the transformation.
On a related note, it looks to me like the tests focus too much on
simple cases. It seems like it's mostly testing cases where there are
no security quals, no weird operator classes, no type mismatches, and
few joins. In the cases where there are joins, it's an inner join and
there's no distinction between an ON-qual and a WHERE-qual. I strongly
suggest adding some test cases for weirder scenarios.
+ /* One more trick: assemble correct clause */

This comment doesn't seem to make much sense. Some other comments
contain spelling mistakes. The patch should have comments in more
places explaining key design decisions.
+extern JumbleState *JumbleExpr(Expr *expr, uint64 *exprId);

This is no longer needed.
Agree.
-- 
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

pgsql-hackers by date:

Previous
From: Andreas 'ads' Scherbaum
Date:
Subject: Re: call for applications: mentoring program for code contributors
Next
From: "Joel Jacobson"
Date:
Subject: Re: Add pg_get_acl() function get the ACL for a database object