Re: POC, WIP: OR-clause support for indexes - Mailing list pgsql-hackers

From Alena Rybakina
Subject Re: POC, WIP: OR-clause support for indexes
Date
Msg-id fa5f57e1-4051-93d8-5f54-f1e2f4587cc2@yandex.ru
Whole thread Raw
In response to Re: POC, WIP: OR-clause support for indexes  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: POC, WIP: OR-clause support for indexes
List pgsql-hackers
Hi! Thank you for your research, I'm sure it will help me to fix the 
problem of calculating selectivity faster)
I'm sorry I didn't answer right away, to be honest, I had a full diary 
of urgent matters at work. For this reason, I didn't have enough time to 
work on this patch properly.

> The optimizer will itself do a limited form of "normalizing to CNF".
> Are you familiar with extract_restriction_or_clauses(), from
> orclauses.c? Comments above the function have an example of how this
> can work:
>
>   * Although a join clause must reference multiple relations overall,
>   * an OR of ANDs clause might contain sub-clauses that reference just one
>   * relation and can be used to build a restriction clause for that rel.
>   * For example consider
>   *      WHERE ((a.x = 42 AND b.y = 43) OR (a.x = 44 AND b.z = 45));
>   * We can transform this into
>   *      WHERE ((a.x = 42 AND b.y = 43) OR (a.x = 44 AND b.z = 45))
>   *          AND (a.x = 42 OR a.x = 44)
>   *          AND (b.y = 43 OR b.z = 45);
>   * which allows the latter clauses to be applied during the scans of a and b,
>   * perhaps as index qualifications, and in any case reducing the number of
>   * rows arriving at the join.  In essence this is a partial transformation to
>   * CNF (AND of ORs format).  It is not complete, however, because we do not
>   * unravel the original OR --- doing so would usually bloat the qualification
>   * expression to little gain.
This is an interesting feature. I didn't notice this function before, I 
studied many times consider_new_or_cause, which were called there. As 
far as I know, there is a selectivity calculation going on there, but as 
far as I remember, I called it earlier after my conversion, and 
unfortunately it didn't solve my problem with calculating selectivity. 
I'll reconsider it again, maybe I can find something I missed.
> Of course this immediately makes me wonder: shouldn't your patch be
> able to perform an additional transformation here? You know, by
> transforming "a.x = 42 OR a.x = 44" into "a IN (42, 44)"? Although I
> haven't checked for myself, I assume that this doesn't happen right
> now, since your patch currently performs all of its transformations
> during parsing.
>
> I also noticed that the same comment block goes on to say something
> about "clauselist_selectivity's inability to recognize redundant
> conditions". Perhaps that is relevant to the problems you were having
> with selectivity estimation, back when the code was in
> preprocess_qual_conditions() instead? I have no reason to believe that
> there should be any redundancy left behind by your transformation, so
> this is just one possibility to consider.
> Separately, the commit message of commit 25a9e54d2d says something
> about how the planner builds RestrictInfos, which seems
> possibly-relevant. That commit enhanced extended statistics for OR
> clauses, so the relevant paragraph describes a limitation of extended
> statistics with OR clauses specifically. I'm just guessing, but it
> still seems like it might be relevant to the problem you ran into with
> selectivity estimation. Another possibility to consider.

I understood what is said about AND clauses in this comment. It seems to 
me that AND clauses saved like (BoolExpr *) expr->args->(RestrictInfo *) 
clauseA->(RestrictInfo *)clauseB lists and OR clauses saved like 
(BoolExpr *) expr -> orclause->(RestrictInfo *)clause A->(RestrictInfo 
*)clause B.

As I understand it, selectivity is calculated for each expression. But 
I'll exploring it deeper, because I think this place may contain the 
answer to the question, what's wrong with selectivity calculation in my 
patch.

> BTW, I sometimes use RR to help improve my understanding of the planner:
>
>
https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Recording_Postgres_using_rr_Record_and_Replay_Framework
> The planner has particularly complicated control flow, which has
> unique challenges -- just knowing where to begin can be difficult
> (unlike most other areas). I find that setting watchpoints to see when
> and where the planner modifies state using RR is far more useful than
> it would be with regular GDB. Once I record a query, I find that I can
> "map out" what happens in the planner relatively easily.
Thank you for sharing this source! I didn't know about this before, and 
it will definitely make my life easier to understand the optimizer.

I understand what you mean, and I researched the optimizer in a similar 
way through gdb and looked at the comments and code in postgresql. This 
is a complicated way and I didn't always understand correctly what this 
variable was doing in this place, and this created some difficulties for me.

So, thank you for the link!

> Many interesting cases won't get SAOP transformation from the patch,
> simply because of the or_transform_limit GUC's default of 500. I don't
> think that that design makes too much sense. It made more sense back
> when the focus was on expression evaluation overhead. But that's only
> one of the benefits that we now expect from the patch, right? So it
> seems like something that should be revisited soon.
>
> I'm not suggesting that there is no need for some kind of limit. But
> it seems like a set of heuristics might be a better approach. Although
> I would like to get a better sense of the costs of the transformation
> to be able to say too much more.

Yes, this may be revised in the future after some transformations. 
Initially, I was solving the problem described here [0]. So, after 
testing [1], I come to the conclusion that 500 is the ideal value for 
or_transform_limit.

[0] 
https://www.postgresql.org/message-id/919bfbcb-f812-758d-d687-71f89f0d9a68%40postgrespro.ru

[1] 
https://www.postgresql.org/message-id/6b97b517-f36a-f0c6-3b3a-0cf8cfba220c%40yandex.ru

-- 
Regards,
Alena Rybakina
Postgres Professional




pgsql-hackers by date:

Previous
From: Masahiro Ikeda
Date:
Subject: Re: Support to define custom wait events for extensions
Next
From: Christoph Berg
Date:
Subject: Re: A failure in 031_recovery_conflict.pl on Debian/s390x