Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not? - Mailing list pgsql-hackers

From Andy Fan
Subject Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?
Date
Msg-id CAKU4AWpOqbhOwnxZS80DZWrc7daqAmMV7aKr7EDg3jkZcCUPOw@mail.gmail.com
Whole thread Raw
In response to Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?  (Andy Fan <zhihui.fan1213@gmail.com>)
Responses Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?  ("Finnerty, Jim" <jfinnert@amazon.com>)
List pgsql-hackers

 
So I think knowing what bad it is to have this feature is the key point to discussion now. 


I re-read the discussion at 2015 [1] and the below topic is added for the above
question.   Here is the summary for easy discussion. 

====
From planner aspect: 

> While I've only read your description of the patch not the patch itself, 
> the search methodology you propose seems pretty brute-force and 
> unlikely to solve that issue.  It's particularly important to avoid O(N^2) 
> behaviors when there are N expressions ...

The patch has 3 steps in general.  1). Gather the filter_qual_list during
the deconstruct_jointree. only unmergeable qual is gathered here. 
2).  After the root->eq_classes is built, scan each of the above quals to 
find out if there is a EC match,  if yes, add it to the EC.  There are 
some fast paths here. like ec->relids,  em->em_relids.  3).  compose 
the qual in ec_filter and members in ec_members, then distribute it to
the relations.  This step take the most cycles of this feature,   and it is 
the most important part for this feature as well.

Fortunately,  thousands of partitions of a table would not make it worse
since they are not generated at that stage.  So I'd believe the number of
ECs or EMs in an EC would be pretty small in common cases. 
 
> time would be spent on searches for matching subexpressions whether 
> or not anything was learned (and often nothing would be learned).  

This is about some cases like "SELECT * FROM t1, t2 WHERE t1.a = t2.a
and t1.b > 3".   In this case,  we still need to go through steps 1 & 2,  all the fast 
paths don't work and the equal() is unavoidable.  However step 3 can be ignored.  
If we want to improve this,  could we maintain an attr_eq_indexes in RelOptInfos 
which indicates if the given attribute appears in any one of EC members?  

=====
From executor aspects:

> The reason why the answer isn't automatically "all of them"
> is because, first of all, it's possible that enforcing the condition
> at a particular table costs more to filter out the rows that we save
> in execution time at higher levels of the plan tree.  For example,
> consider A JOIN B ON A.X = B.X WHERE A.X > 1000000.  It might be that
> the range of A.X is [0,1000001] but the range of B.X is
> [1000000,2000000]; so enforcing the inequality against A is very
> selective but enforcing it against B filters out basically nothing.

I think we can classify this as we push down / execute an qual, the
qual takes lots of cycles, but it doesn't filter many rows. 
 
> A first cut might be to enforce the inequality against the relation
> where it's believed to be most selective, equivalence-class column 
> mentioned in the inequality provided that the
> selectivity is thought to be above some threshold ... but I'm not sure
> this is very principled,

I can only input +1 after some deep thoughts. 

>> Furthermore, there are some cases involving parameterized paths where
>> enforcing the inequality multiple times is definitely bad: for
>> example, if we've got a nested loop where the outer side is a seq scan
>> that enforces the condition and the inner side is an index probe, it
>> is just a waste to retest it on the inner side.  We already know that
>> the outer row passes the inequality, so the inner row will necessarily
>> pass also.  This doesn't apply to merge or hash joins, and it also
>> doesn't apply to all nested loops: scans that aren't paramaterized by
>> the equivalence-class column can still benefit from separate
>> enforcement of the inequality.
>>
> I guess that could be fixed by somehow marking these pushed quals as
> optional and having parameterised scans ignore optional quals.

This has been done by committing 4. 

> Now, all that having been said, I think this is a pretty important
> optimization.  Lots of people have asked for it, and I think it would
> be worth expending some brainpower to try to figure out a way to be
> smarter than we are now, which is, in a nutshell, as dumb as possible.

+1.  I asked custom to add the derivable quals manually for 10+ of table
each query last year and gained great results.   

Anyone still have interest in this?  Or is a better solution really possible?  
Or is the current method  too bad to rescue? 

--
Best Regards
Andy Fan

pgsql-hackers by date:

Previous
From: Dilip Kumar
Date:
Subject: Assertion failure in WaitForWALToBecomeAvailable state machine
Next
From: Amit Kapila
Date:
Subject: Re: [BUG]Update Toast data failure in logical replication