Re: enable_incremental_sort changes query behavior - Mailing list pgsql-hackers

From James Coleman
Subject Re: enable_incremental_sort changes query behavior
Date
Msg-id CAAaqYe9sBisTpRKXqOzgftPswVTp4-mrjmvOBZczcmOEoODCgw@mail.gmail.com
Whole thread Raw
In response to Re: enable_incremental_sort changes query behavior  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: enable_incremental_sort changes query behavior
List pgsql-hackers
On Sat, Oct 3, 2020 at 5:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Sat, Oct 03, 2020 at 10:50:06AM -0400, James Coleman wrote:
> >On Fri, Oct 2, 2020 at 11:16 PM James Coleman <jtc331@gmail.com> wrote:
> >>
> >> On Fri, Oct 2, 2020 at 7:07 PM James Coleman <jtc331@gmail.com> wrote:
> >> >
> >> > On Fri, Oct 2, 2020 at 6:28 PM Tomas Vondra
> >> > <tomas.vondra@2ndquadrant.com> wrote:
> >> > >
> >> > > On Fri, Oct 02, 2020 at 05:45:52PM -0400, James Coleman wrote:
> >> > > >On Fri, Oct 2, 2020 at 4:56 PM Tomas Vondra
> >> > > ><tomas.vondra@2ndquadrant.com> wrote:
> >> > > >>
> >> > > >> ...
> >> > > >>
> >> > > >> More importanly, it does not actually fix the issue - it does fix that
> >> > > >> particular query, but just replacing the DISTINCT with either ORDER BY
> >> > > >> or GROUP BY make it fail again :-(
> >> > > >>
> >> > > >> Attached is a simple script I used, demonstrating these issues for the
> >> > > >> three cases that expect to have ressortgroupref != 0 (per the comment
> >> > > >> before TargetEntry in plannodes.h).
> >> > > >
> >> > > >So does checking for volatile expressions (if you happened to test
> >> > > >that) solve all the cases? If you haven't tested that yet, I can try
> >> > > >to do that this evening.
> >> > > >
> >> > >
> >> > > Yes, it does fix all the three queries in the SQL script.
> >> > >
> >> > > The question however is whether this is the root issue, and whether it's
> >> > > the right way to fix it. For example - volatility is not the only reason
> >> > > that may block qual pushdown. If you look at qual_is_pushdown_safe, it
> >> > > also blocks pushdown of leaky functions in security_barrier views. So I
> >> > > wonder if that could cause failures too, somehow. But I haven't managed
> >> > > to create such example.
> >> >
> >> > I was about to say that the issue here is slightly different from qual
> >> > etc. pushdown, since we're not concerned about quals here, and so I
> >> > wonder where we determine what target list entries to put in a given
> >> > scan path, but then I realized that implies (maybe!) a simpler
> >> > solution. Instead of duplicating checks on target list entries would
> >> > be safe, why not check directly in get_useful_pathkeys_for_relation()
> >> > whether or not the pathkey has a target list entry?
> >> >
> >> > I haven't been able to try that out yet, and so maybe I'm missing
> >> > something, but I need to step out for a bit, so I'll have to look at
> >> > it later.
> >>
> >> I've started poking at this, but haven't yet found a workable
> >> solution. See the attached patch which "solves" the problem by
> >> breaking putting the sort under the gather merge, but it at least
> >> demonstrates conceptually what I think we're interested in doing.
> >>
> >> The issue currently is that the comparison of expressions fails -- in
> >> our query, the first column selected shows up as a Var (with the same
> >> contents) but is a different pointer in the em_expr and the reltarget
> >> exprs list. I don't yet see a good way to get the equivalence class
> >> for a PathTarget entry.
> >
> >Hmm, I think I was looking to do is the attached patch. I didn't
> >realize until I did a lot of reading through source that we have an
> >equal() function that can compare exprs. That (plus the realization in
> >[1] the originally reported backtrace wasn't where the error actually
> >came from) convinced me that what we need is to confirm not only that
> >the all of the vars in the ec member come from the relids in the rel
> >but also that the expr is actually represented in the target of the
> >rel.
> >
> >With the gucs changed as I mentioned earlier both of the plans (with
> >and without a volatile call in the 2nd select entry) now look like:
> >
> > Unique
> >   ->  Gather Merge
> >         Workers Planned: 2
> >         ->  Sort
> >               Sort Key: ref_0.i, (md5(ref_0.t))
> >               ->  Nested Loop
> >                     ->  Parallel Index Scan using ref_0_i_idx on ref_0
> >                     ->  Function Scan on generate_series ref_1
> >
> >Without the gucs changed the minimal repro case now doesn't error, but
> >results in this plan:
> >
> > HashAggregate
> >   Group Key: ref_0.i, CASE WHEN pg_rotate_logfile_old() THEN ref_0.t
> >ELSE ref_0.t END
> >   ->  Nested Loop
> >         ->  Seq Scan on ref_0
> >         ->  Function Scan on generate_series ref_1
> >
> >Similarly in your six queries I now only see parallel query showing up
> >in the last one.
> >
>
> OK, that seems reasonable I think.

If we proceed with this patch I'd like to tweak it to check for the
relids subset first on the theory that it will be cheaper than the
expr equality check loop; that change is attached.

> >I created an entirely new function because adding the target expr
> >lookup to the existing find_em_expr_for_rel() function broke a bunch
> >of postgres_fdw tests. That maybe raises questions about whether that
> >code also could have problems in theory/in the future, but I didn't
> >investigate further. In any case we already know it excludes
> >volatile...so maybe it's fine because in practice that's actually a
> >broader exclusion than what we're doing here.
> >
>
> I don't think postgres_fdw needs these new checks, because FDW scan's
> are not allowed in parallel part of the plan - if that changes in the
> future, I'm sure that'll require fixes in plenty other places.
>
> OTOH it's interesting that it triggers those failures - I wonder if we
> could learn something from them? AFAICS those paths can't be built by
> generate_useful_gather_paths (because of the postgres_fdw vs. parallel
> query restrictions), so how do these plans look?

Well the fdw code doesn't trigger the errors, but both
generate_useful_gather_paths() and the fdw code originally relied on
find_em_expr_for_rel(), but now they're diverging. IIRC the fdw code
also uses it for determining what path keys are valid -- in that case
which ones are safe to be sent to the foreign server (so all of the
vars have to be from rels on the foreign server). The fdw code has
additional checks too though, for example no volatile expressions may
be pushed to the remote.

> >This seems to fix the issue, but I'd like feedback on whether it's too
> >strict. We could of course just check em_has_volatile, but I'm
> >wondering if that's simultaneously too strict (by not allowing the
> >volatile expression to be computed in the gather merge supposing
> >there's no join) and too loose (maybe there are other cases we should
> >care about?). It also just strikes me as re-encoding rules that should
> >have already been applied (and thus we should be able to look up in
> >the data we have if it's safe to use the expr/pathkey). Those are
> >mostly intuitions though.
> >
>
> I don't know :-( As I mentioned before, I suspect checking just the
> volatility may not be enough in some cases (leakyness, etc.) and I think
> you're right it may be too strict in other cases.
>
> Not sure I understand which rules you think we're re-encoding, but I
> have a nagging feeling there's a piece of code somewhere earlier in
> the query planner meant to prevent such cases (sort on path without all
> the pathkeys), and that fixing it here is just a band aid :-(

I'm getting at exactly what you're saying below: there's got to be
some existing rules (and presumably code already enforcing it in some
places) at what level in the path tree a given pathkey is
safe/correct, and we don't want to reinvent those rules (or ideally
that code). Verifying the expr is in the rel seems to me to be
intuitively correct, but I wish there was another place in the code
that could validate that.

> I'm sure we're building plans with Sort on top of Index Scan paths, so
> how come those don't fail? How come the non-parallel version of these
> queries don't fail?

Yeah, exactly. Something has to be doing something similar -- I don't
think everywhere else using sort_pathkeys instead of query_pathkeys
really changes the problem. I've read through all of the places we use
generate_useful_gather_paths() as well as root->sort_pathkeys to try
to get a better understanding of this. Most of the places I convinced
myself it's already safe -- for example places like
create_orderd_paths are working with the upper rel, and the full
target list is available. A few other places I added comments (see the
"questions" patch in the attached series) where I'm not sure why we
apply projections *after* we create a sort path; given the issue we're
discussing here I would have thought you'd want to do exactly the
opposite.

I haven't yet tried to read through the code in other places where we
build an explicit sort to see if I can find any hints as to what
existing code does to ensure this situation doesn't occur, but so far
I haven't found anything else obvious.

Is there someone who might be able to point us in the right
direction/validate what we're thinking?

James

Attachment

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: terminate called after throwing an instance of 'std::bad_alloc'
Next
From: Amit Langote
Date:
Subject: Re: making update/delete of inheritance trees scale better