Re: Parallelize correlated subqueries that execute within each worker - Mailing list pgsql-hackers

From James Coleman
Subject Re: Parallelize correlated subqueries that execute within each worker
Date
Msg-id CAAaqYe8m0DHUWk7gLKb_C4abTD4nMkU26ErE=ahow4zNMZbzPQ@mail.gmail.com
Whole thread Raw
In response to Re: Parallelize correlated subqueries that execute within each worker  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Parallelize correlated subqueries that execute within each worker
List pgsql-hackers
On Fri, Jan 21, 2022 at 3:20 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Jan 14, 2022 at 2:25 PM James Coleman <jtc331@gmail.com> wrote:
> > I've been chewing on this a bit, and I was about to go re-read the
> > code and see how easy it'd be to do exactly what you're suggesting in
> > generate_gather_paths() (and verifying it doesn't need to happen in
> > other places). However there's one (I think large) gotcha with that
> > approach (assuming it otherwise makes sense): it means we do
> > unnecessary work. In the current patch series we only need to recheck
> > parallel safety if we're in a situation where we might actually
> > benefit from doing that work (namely when we have a correlated
> > subquery we might otherwise be able to execute in a parallel plan). If
> > we don't track that status we'd have to recheck the full parallel
> > safety of the path for all paths -- even without correlated
> > subqueries.
>
> I don't think there's an intrinsic problem with the idea of making a
> tentative determination about parallel safety and then refining it
> later, but I'm not sure why you think it would be a lot of work to
> figure this out at the point where we generate gather paths. I think
> it's just a matter of testing whether the set of parameters that the
> path needs as input is the empty set. It may be that neither extParam
> nor allParam are precisely that thing, but I think both are very
> close, and it seems to me that there's no theoretical reason why we
> can't know for every path the set of inputs that it requires "from the
> outside."

As I understand it now (not sure I realized this before) you're
suggesting that *even when there are required params* marking it as
parallel safe, and then checking the params for parallel safety later.
From a purely theoretical perspective that seemed reasonable, so I
took a pass at that approach.

The first, and likely most interesting, thing I discovered was that
the vast majority of what the patch accomplishes it does so not via
the delayed params safety checking but rather via the required outer
relids checks I'd added to generate_useful_gather_paths.

For that to happen I did have to mark PARAM_EXEC params as presumed
parallel safe. That means that parallel_safe now doesn't strictly mean
"parallel safe in the current context" but "parallel safe as long as
any params are provided". That's a real change, but probably
acceptable as long as a project policy decision is made in that
direction.

There are a few concerns I have (and I'm not sure what level they rise to):

1. From what I can tell we don't have access on a path to the set of
params required by that path (I believe this is what Tom was
referencing in his sister reply at this point in the thread). That
means we have to rely on checking that the required outer relids are
provided by the current query level. I'm not quite sure yet whether or
not that guarantees (or if the rest of the path construction logic
guarantees for us) that the params provided by the outer rel are used
in a correlated way that isn't shared across workers. And because we
don't have the param information available we can't add additional
checks (that I can tell) to verify that.
2. Are we excluding any paths (by having one that will always be
invalid win the cost comparisons in add_partial_path)? I suppose this
danger actually exists in the previous version of the patch as well,
and I don't actually have any examples of this being a problem. Also
maybe this can only be a problem if (1) reveals a bug.
3. The new patch series actually ends up allowing parallelization of
correlated params in a few more places than the original patch series.
From what I can tell all of the cases are in fact safe to execute in
parallel, which, if true, means this is a feature not a concern. The
changed query plans fall into two categories: a.) putting a gather
inside a subplan and b.) correlated param usages in a subquery scan
path on the inner side of a join. I've separated out those specific
changes in a separate patch to make it easier to tell which ones I'm
referencing.

On the other hand this is a dramatically simpler patch series.
Assuming the approach is sound, it should much easier to maintain than
the previous version.

The final patch in the series is a set of additional checks I could
imagine to try to be more explicit, but at least in the current test
suite there isn't anything at all they affect.

Does this look at least somewhat more like what you'd envisionsed
(granting the need to squint hard given the relids checks instead of
directly checking params)?

Thanks,
James Coleman

Attachment

pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: Warning in geqo_main.c from clang 13
Next
From: Fujii Masao
Date:
Subject: Re: [Proposal] Add foreign-server health checks infrastructure