Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not? - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?
Msg-id: CA+Tgmoax5_Yo0ggznZBLnFDhq9eh3L+JWyipzZe2905zkFm3GA@mail.gmail.com
In response to: Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not? (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On Tue, Mar 1, 2022 at 9:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > I agree. My question is: why shouldn't every case where we can deduce
> > an implied inequality be reasonably likely to show a benefit?
>
> Maybe it will be, if we can deal with the issue you already mentioned
> about not misestimating the resulting partially-redundant conditions.

OK.

> > Second, it looks to me like the patch takes the rather naive strategy
> > of enforcing the derived clauses everywhere that they can legally be
> > put, which seems certain not to be optimal.
>
> I'm not sure about that ... it's basically what we do with derived
> equalities.  However, there's enough structure in the equivalence-class
> case that we don't end up enforcing redundant quals.  It's not clear
> to me whether the same can be said here.

I mean, to go back to the example of a.x < 42 and a.x = b.x, there are
three possible choices as to where to enforce the qual (a, b, both).
That's a meaningful choice, independent of any estimation issue. I
think it is reasonably common to have cases where a.x < 42 is very
selective and b.x < 42 hardly filters out anything at all, or the
other way around. Certainly, that kind of situation came up a lot in
PostgreSQL-based applications that I wrote myself back in the day. If
we're just talking about btree operators, *maybe* we can say it's
cheap enough that we don't care, but color me a tad skeptical.
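To make that choice concrete, here is a minimal Python sketch (hypothetical data and selectivities, standing in for the planner's decision, not actual planner code) of how the same derived qual can be nearly useless on one side of the join and highly selective on the other:

```python
import random

random.seed(0)

# Hypothetical data: a.x is mostly below 42, so the written qual
# a.x < 42 keeps most of a; b.x is mostly at or above 42, so the
# *derived* qual b.x < 42 is highly selective on b.
a_x = [random.randint(0, 49) for _ in range(1000)]    # ~84% below 42
b_x = [random.randint(40, 199) for _ in range(1000)]  # ~1% below 42

a_kept = [x for x in a_x if x < 42]   # enforcing the written qual on a
b_kept = [x for x in b_x if x < 42]   # enforcing the derived qual on b

# Whether to enforce the derived qual on b (or on both sides) depends
# on how much it filters: here it pays off handsomely on b, but the
# mirror-image qual would be nearly a no-op on a.
print(len(a_kept), len(b_kept))
```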

> > I don't know whether attaching something to the equivalence class data
> > structure is the right idea or not. Presumably, we don't want to make
> > an extra pass over the query tree to gather the information needed for
> > this kind of optimization, and it feels like we need to know which
> > vars are EMs before we try to derive alternate/additional quals.
>
> Yeah, we don't want to make an additional pass over the tree, and
> we also would rather not add an additional set of per-operator
> catalog lookups.  We might be able to generalize the code that looks
> for equality operators so that it looks for "any btree operator"
> with the same number of lookups, and then have it feed the results
> down either the EquivalenceClass path or the inequality path
> as appropriate.  At the end, after we've formed all the ECs, we
> could have a go at matching up the inequality structures with the
> ECs.

Interesting idea.

> But I don't agree that ECs are a necessary prerequisite.
> Here are a couple of other patterns that might be worth looking for:
>
> * "a > b AND b > c" allows deducing "a > c", whether or not any
> of those values appears in an EC.
>
> * "a > const1 AND a > const2" can be simplified to either "a > const1"
> or "a > const2" depending on which constant is larger.  (The predicate
> proof mechanism already has a form of this, but we don't typically
> apply it in a way that would result in dropping the redundant qual.)
>
> It's entirely possible that one or both of these patterns is not
> worth looking for.  But I would say that it's equally unproven
> that deriving "a > c" from "a = b AND b > c" is worth the cycles.
> I'll grant that it's most likely going to be a win if we can use
> any of these patterns to generate a restriction clause from what
> had been join clauses.  Beyond that it's much less clear.

Pretty much all of the cases that I've run across involve an equijoin
plus an inequality, so if somebody asked me which problem we ought to
put most effort into solving, I'd say that one. Cases like "a>1 and
a>2" or a same-table case like "a=b and b>3" haven't been as common in
my experience, and haven't caused as much trouble when they do happen.
Part of that is because if you have something like "a>1 and a>2" in
your query, it may be easy for you to just tweak the query generation
to avoid it, and if "a=b and b>3" is coming up a lot, you may choose
to adjust your data model (e.g. choose to store NULL in b to indicate
same-as-a), whereas if you have something like
"orders.orderno=order_lines.orderno and order_lines.orderno<10000,"
what are you going to do to avoid that exactly? If you normalize your
order data and then want to find the old orders, this problem arises
ineluctably.
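A toy illustration of why the derived restriction matters in that case, with made-up table sizes (plain Python standing in for the scan and join, not anything PostgreSQL actually runs):

```python
# Hypothetical miniature of the orders/order_lines case. The query says:
#   orders.orderno = order_lines.orderno AND orders.orderno < 10000
# and the planner could derive order_lines.orderno < 10000 as a
# restriction clause on order_lines.
orders = list(range(1, 20001))                        # orderno 1..20000
order_lines = [o for o in orders for _ in range(3)]   # 3 lines per order

# Without the derived qual, every order_lines row survives its scan
# and has to be fed into the join.
rows_without = len(order_lines)

# With the derived qual enforced at the order_lines scan, only the
# matching half of the table reaches the join at all.
rows_with = len([l for l in order_lines if l < 10000])

print(rows_without, rows_with)
```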

But having said that, I'm not *against* doing something about those
cases if it's cheap or falls out naturally. If we could detect for
free that the user had written a>1 and a>2, it'd certainly be
beneficial to drop the former qual and keep only the latter.

If the user writes a>b and b>c and all those columns are in one table I don't
see how it helps to derive a>c, because we're still going to need to
check the other two quals anyway so we've just created more work. But
if those columns are not all in the same table then I'd say chances
are really pretty good. Like, suppose it's x.a>y.b and y.b>x.c. Well,
like I say, I don't really see people writing queries like that
myself, but if they do, it seems pretty obvious that deriving x.a>x.c
has the potential to save a LOT of trouble. If it's x.a>y.b and
y.b>z.c I don't feel quite so optimistic, but it may be that we would
like to do the x-z join first, and if we do, enforcing x.a>z.c at that
level to shrink the join product seems like a very strong idea.

It is a slight loss if we run that qual on lots of rows and it never fails,
but it is a gigantic win if it filters out a bunch of stuff. I bet a
lot of users would be VERY happy to pay the cost of testing x.a>z.c at
the x-z join level even on queries where the statistics suggest that
it will be entirely useless, because it won't cost that much to check
it, and if by some chance the statistics are misleading, it might
prevent a really bad outcome where the query runs for a super-long
time and they get paged. So the questions in my mind here are all
about whether we can detect this stuff cheaply and whether anybody
wants to do the work to make it happen, not whether we'd get a benefit
in the cases where it kicks in.
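The derivation step itself is mechanical. Here's a hypothetical sketch (toy clause tuples, nothing like the real planner data structures) of deducing x.a > z.c from x.a > y.b AND y.b > z.c; the same rule covers the equijoin-plus-inequality case, since one of the two operators is allowed to be equality:

```python
from itertools import product

# Toy clause representation: (left, op, right), with op in {'>', '='}.
def derive(clauses):
    """From chained clauses like a > b AND b > c (or a = b AND b > c),
    derive the implied strict inequality a > c."""
    derived = set()
    for (l1, op1, r1), (l2, op2, r2) in product(clauses, repeat=2):
        ops = {op1, op2}
        # Need a shared middle term, at least one '>', and no other ops.
        if r1 == l2 and '>' in ops and ops <= {'>', '='}:
            derived.add((l1, '>', r2))
    return derived

# The x.a > y.b AND y.b > z.c case from above:
print(derive([('x.a', '>', 'y.b'), ('y.b', '>', 'z.c')]))
```

Note that a = b AND b = c deliberately derives nothing here; pure equality chains are already the equivalence-class machinery's territory.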

-- 
Robert Haas
EDB: http://www.enterprisedb.com


