Re: Parallelize correlated subqueries that execute within each worker - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Parallelize correlated subqueries that execute within each worker
Date
Msg-id 8459b9bd-6248-4a43-a281-e40adb537242@enterprisedb.com
Whole thread Raw
In response to Re: Parallelize correlated subqueries that execute within each worker  (James Coleman <jtc331@gmail.com>)
List pgsql-hackers
Hi,

I was going through the "needs review" patches with no recent messages,
trying to figure out what is needed to move them forward, and this one
caught my eye because I commented on it before. And it's also a bit sad
example, because it started in 2021 and is moving at glacier speed :-(

I read through the thread, to understand how the design changed over
time, and I like the current approach (proposed by Robert) much more
than the initial idea of adding new flag next to parallel_safe etc.

And in general the patch looks reasonably simple and clean, but my
knowledge of PARAM intricacies is pretty much nil, so I'm hesitant to
claim the patch is correct. And I'm not sure what exactly needs to
happen to validate the approach :-(


The regression tests currently fail, due to a different plan for one of
the new queries in select_parallel. I guess it's due to some committed
patch, and it looks like a sensible change, but I haven't looked closely.

Also, I do get this warning when building with GCC 12.2.0 on Debian:

clauses.c: In function ‘max_parallel_hazard_walker’:
clauses.c:961:49: warning: ‘save_safe_param_ids’ may be used
uninitialized [-Wmaybe-uninitialized]
  961 |                         context->safe_param_ids =
save_safe_param_ids;
      |
~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
clauses.c:943:29: note: ‘save_safe_param_ids’ was declared here
  943 |                 List       *save_safe_param_ids;
      |                             ^~~~~~~~~~~~~~~~~~~

It's harmless, the compiler simply does not realize the two blocks
(where save_safe_param_ids is set and used) have exactly the same if
conditions, but it's a bit confusing for people too.


I was wondering if this could affect some queries in TPC-H, but the only
query affected seems to be Q2 - where it helps, cutting the time in
half, but Q2 is generally pretty fast / the expensive part was already
parallelized quite well (i.e. the correlated subquery is fairly cheap).

However, it's not difficult to construct a query where this helps a lot.
If the correlated subquery does something expensive (e.g. aggregation of
non-trivial amounts of data), this would help. So I wonder if e.g.
TPC-DS would benefit from this more ...


A couple review comments about the code:

1) new fields in max_parallel_hazard_context should have comments:

+    bool        check_params;
+    Bitmapset **required_params;


2) Do we need both is_parallel_safe and is_parallel_safe_with_params?
ISTM the main difference is the for() loop, so why not add an extra
parameter to is_parallel_safe() and skip that loop if it's null? Or, if
we worry about external code, keep is_parallel_safe_with_params() and
define is_parallel_safe() as is_parallel_safe_with_params(...,NULL)?

3) Isn't it a bit weird that is_parallel_safe_with_params() actually
sets check_params=false, which seems like it doesn't actually check
parameters? I'm a bit confused / unsure if this is a bug or how it
actually checks parameters. If it's correct, it may need a comment.

4) The only place setting check_params is max_parallel_hazard, which is
called only for the whole Query from standard_planner(). But it does not
set required_params - can't this be an issue if the pointer happens to
be random garbage?

5) It probably needs a pgindent run, there's a bunch of rather long
lines and those are likely to get wrapped and look weird.

6) Is the comment in max_parallel_hazard_walker() still accurate? It
talks about PARAM_EXTERN and PARAM_EXEC, but the patch removes the
PARAM_EXTERN check entirely. So maybe the comment needs updating?

7) I don't like the very complex if condition very much, it's hard to
understand. I'd split that into two separate conditions, and add a short
comment for each of them. I.e. instead of:

  if (param->paramkind != PARAM_EXEC || !(context->check_params ||
context->required_params != NULL))
    return false;

I'd do

  /* ... comment ... */
  if (param->paramkind != PARAM_EXEC)
    return false;

  /* ... comment ... */
  if (!(context->check_params || context->required_params != NULL))
    return false;

or something like that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Why do we define HAVE_GSSAPI_EXT_H?
Next
From: "Anton A. Melnikov"
Date:
Subject: Re: psql: fix variable existence tab completion