Re: Parallelize correlated subqueries that execute within each worker - Mailing list pgsql-hackers
From | James Coleman |
---|---|
Subject | Re: Parallelize correlated subqueries that execute within each worker |
Date | |
Msg-id | CAAaqYe8m0DHUWk7gLKb_C4abTD4nMkU26ErE=ahow4zNMZbzPQ@mail.gmail.com Whole thread Raw |
In response to | Re: Parallelize correlated subqueries that execute within each worker (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Parallelize correlated subqueries that execute within each worker
|
List | pgsql-hackers |
On Fri, Jan 21, 2022 at 3:20 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Jan 14, 2022 at 2:25 PM James Coleman <jtc331@gmail.com> wrote: > > I've been chewing on this a bit, and I was about to go re-read the > > code and see how easy it'd be to do exactly what you're suggesting in > > generate_gather_paths() (and verifying it doesn't need to happen in > > other places). However there's one (I think large) gotcha with that > > approach (assuming it otherwise makes sense): it means we do > > unnecessary work. In the current patch series we only need to recheck > > parallel safety if we're in a situation where we might actually > > benefit from doing that work (namely when we have a correlated > > subquery we might otherwise be able to execute in a parallel plan). If > > we don't track that status we'd have to recheck the full parallel > > safety of the path for all paths -- even without correlated > > subqueries. > > I don't think there's an intrinsic problem with the idea of making a > tentative determination about parallel safety and then refining it > later, but I'm not sure why you think it would be a lot of work to > figure this out at the point where we generate gather paths. I think > it's just a matter of testing whether the set of parameters that the > path needs as input is the empty set. It may be that neither extParam > nor allParam are precisely that thing, but I think both are very > close, and it seems to me that there's no theoretical reason why we > can't know for every path the set of inputs that it requires "from the > outside." As I understand it now (not sure I realized this before) you're suggesting that *even when there are required params* marking it as parallel safe, and then checking the params for parallel safety later. From a purely theoretical perspective that seemed reasonable, so I took a pass at that approach. The first, and likely most interesting, thing I discovered was that the vast majority of what the patch accomplishes it does so not via the delayed params safety checking but rather via the required outer relids checks I'd added to generate_useful_gather_paths. For that to happen I did have to mark PARAM_EXEC params as presumed parallel safe. That means that parallel_safe now doesn't strictly mean "parallel safe in the current context" but "parallel safe as long as any params are provided". That's a real change, but probably acceptable as long as a project policy decision is made in that direction. There are a few concerns I have (and I'm not sure what level they rise to): 1. From what I can tell we don't have access on a path to the set of params required by that path (I believe this is what Tom was referencing in his sister reply at this point in the thread). That means we have to rely on checking that the required outer relids are provided by the current query level. I'm not quite sure yet whether or not that guarantees (or if the rest of the path construction logic guarantees for us) that the params provided by the outer rel are used in a correlated way that isn't shared across workers. And because we don't have the param information available we can't add additional checks (that I can tell) to verify that. 2. Are we excluding any paths (by having one that will always be invalid win the cost comparisons in add_partial_path)? I suppose this danger actually exists in the previous version of the patch as well, and I don't actually have any examples of this being a problem. Also maybe this can only be a problem if (1) reveals a bug. 3. The new patch series actually ends up allowing parallelization of correlated params in a few more places than the original patch series. From what I can tell all of the cases are in fact safe to execute in parallel, which, if true, means this is a feature not a concern. The changed query plans fall into two categories: a.) putting a gather inside a subplan and b.) correlated param usages in a subquery scan path on the inner side of a join. I've separated out those specific changes in a separate patch to make it easier to tell which ones I'm referencing. On the other hand this is a dramatically simpler patch series. Assuming the approach is sound, it should much easier to maintain than the previous version. The final patch in the series is a set of additional checks I could imagine to try to be more explicit, but at least in the current test suite there isn't anything at all they affect. Does this look at least somewhat more like what you'd envisionsed (granting the need to squint hard given the relids checks instead of directly checking params)? Thanks, James Coleman
Attachment
pgsql-hackers by date: