
From Thomas Munro
Subject Re: XPRS
Msg-id CA+hUKGJEMT7SSZRqt-knu_3iLkdscBCe9M2nrhC259FdE5bX7g@mail.gmail.com
In response to Re: XPRS  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Sat, Aug 24, 2019 at 3:19 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> > Although “The Wei Hong Optimizer” was designed in the context of
> > Postgres, it became the standard approach for many of the parallel
> > query optimizers in industry."
>
> I assume this quote is from 30 years ago. I wonder if the claim is still
> true, on current hardware (including e.g. distributed databases).

The quote is from 2018, and appears in the article I linked (it's a
chapter from the book Making Databases Work: The Wisdom of Michael
Stonebraker), but I'm not sure which systems it's referring to.

Speculation:  Many other systems have what we call parallel-oblivious
operators only, and then insert various exchange operators to make a
parallel plan.  That is, they don't necessarily have a direct
equivalent of our "Parallel Sequential Scan", "Parallel Index Scan",
"Parallel Foreign Scan": they just use their regular scans, possibly
with the addition of some kind of "Parallel Scatter" node (that's my
made-up name for the opposite of Gather; it goes by various names such
as "page supplier" or "block iterator") or "Parallel Repartition"
inserted in the right places.  Perhaps they create a serial plan
first, and then try to parallelise it by inserting various parallel
operators and recomputing the costs, rather than considering the
separate paths in the first phase of the optimiser, as we do?  The
cases where Hong's best-parallel-plan hypothesis isn't true for us now
might go away if we had Parallel Repartition, so that each 'stream'
would be the complete set of tuples for some known partition.
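
To make that concrete, here's a hand-drawn sketch of the kind of plan
shape I'm imagining (not output from any real system, and "Parallel
Repartition" remains hypothetical):

  Gather
    ->  Hash Join
          ->  Parallel Repartition (hash on a.x)
                ->  Parallel Seq Scan on a
          ->  Hash
                ->  Parallel Repartition (hash on b.x)
                      ->  Parallel Seq Scan on b

Each worker's stream into the join would then be the complete set of
tuples for its partition of both inputs, so the Hash Join itself could
stay parallel-oblivious.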

To be clear, I'm not suggesting we do that necessarily, just pointing
out some interesting claims about ancient POSTGRES wisdom, in a highly
speculative water cooler thread.  Actually, this isn't the first time
it's occurred to me that elements of our design simply fell out of the
order in which we chose to implement things.  Another example is the
"Shared Hash" that I had in an early version of my work on Parallel
Hash Join, where just one process would run a parallel-safe but
non-parallel-oblivious plan to build a shared hash table while the
other workers twiddled their thumbs; I dropped it because our cost
model has no penalty for running N copies of the same plan rather than
just one, so there was no way to pick that plan, and that's because we
don't have a cost model like Hong's that considers resource usage too.
Another more speculative observation: maybe no-partition/shared
Parallel Hash Join is only obvious if you already have the general
concept of parallel-aware executor nodes.  AFAIK Robert and Amit
invented those to be able to coordinate parallel scans between
processes, where thread-based systems might be able to share a single
scan state somehow under a Scatter-like operator.  If you had threads,
you might not need that concept that allows arbitrary executor nodes
to communicate with each other between workers, and then it might be
more obvious and natural to use repartitioning for parallelising hash
joins.
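
To spell out the cost model point with made-up numbers: suppose the
hash table build costs 100 units and each worker's share of the probe
costs 50.  With the parallel-oblivious plan, each of N workers builds
its own copy and is costed at roughly 100 + 50; with Shared Hash, one
worker builds while the rest wait, so each worker still sees roughly
100 + 50 of elapsed time.  The two plans look the same to a cost model
that only estimates time spent in each backend, even though the shared
version does the build work once instead of N times and needs one hash
table's worth of memory instead of N, so without a resource-usage term
like Hong's the shared plan can never win.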

> FWIW I think we'll have to do something about resource acquisition, sooner
> or later. It was always quite annoying that we don't really consider
> memory consumption of the query as a whole during planning, and parallel
> query made it a bit more painful.

Agreed.

Here's an approach I have been wondering about to cap total executor
memory usage, which is a much more down-to-earth idea than any of the
above space cadet planner problems.  Let's start at the other end of
the problem, by introducing admission control and memory quotas.  That
is, keep using work_mem with its current per-node-copy meaning at
planning time, for now, and then:

1.  Compute the peak amount of memory each plan thinks it will need.
Initially that could be done by summing estimates from all nodes and
considering the number of workers.  A later refinement could deal with
nodes that give back memory early, if we get around to doing that.
The estimate could be shown by EXPLAIN.  (Some details to work out:
worst case vs expected, etc.)

2.  Introduce a new GUC global_work_mem, which limits the total for
all plans that are allowed to run concurrently, according to their
memory estimates.  Introduce a shared memory counter of currently
allocated quota.

3.  Introduce a new GUC session_work_mem, which is the amount of quota
that every session tries to acquire when it connects or perhaps first
runs a query, and that it won't give back until the end of the
session.  Or perhaps sessions acquire less than that if they need
less, but once they've reached that amount they never give it back.
The idea is to allow queries with estimates under that limit, for
example high frequency OLTP queries, to avoid any extra communication
overhead from this scheme.

4.  To run a query whose estimate is higher than the session's
currently allocated quota, the session must acquire more quota for the
duration of the query.  If that can't be acquired right now without
exceeding global_work_mem, it has to join a queue and wait (see the
sketch after this list).  A refinement could be that you are allowed
to run with fewer workers than planned, to reduce the requirement.

5.  While executing, executor nodes could opportunistically ask for
more quota than was planned for, up to some limit, to avoid having to
spill to disk.  If the request is unsuccessful, that's OK, they can
deal with that.

6.  So long as we have nodes that have no escape mechanism in certain
edge cases (hash aggregates and joins with bad stats and extreme
skew), you could perhaps have the option of raising an error or
forcing the total to exceed global_work_mem temporarily with a warning
(which would at least prevent other large queries from running and
making it worse).

7.  Regular heap memory and DSM memory should be counted together,
since it makes no difference to the operating system: it's all memory,
and we should count it against the same quota.  You'd probably also
want to account for allocator fragmentation and other hidden
overheads, to get decent accuracy.
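
To make points 2, 4 and 5 a bit more concrete, the shared counter
could look something like the sketch below (entirely hypothetical:
the names are made up, initialisation and shared memory sizing are
omitted, and WAIT_EVENT_MEMORY_QUOTA doesn't exist):

#include "postgres.h"

#include "storage/condition_variable.h"
#include "storage/lwlock.h"

/* Hypothetical state, one instance in shared memory. */
typedef struct QuotaState
{
    LWLock      lock;           /* protects "allocated" */
    uint64      limit;          /* global_work_mem, in bytes */
    uint64      allocated;      /* quota currently granted to sessions */
    ConditionVariable cv;       /* signalled whenever quota is released */
} QuotaState;

/*
 * Acquire "need" bytes of quota before running a query (point 4),
 * sleeping until that can be done without exceeding the limit.
 */
static void
quota_acquire(QuotaState *qs, uint64 need)
{
    for (;;)
    {
        bool        granted = false;

        LWLockAcquire(&qs->lock, LW_EXCLUSIVE);
        if (qs->allocated + need <= qs->limit)
        {
            qs->allocated += need;
            granted = true;
        }
        LWLockRelease(&qs->lock);

        if (granted)
            break;

        /* Wait for quota_release() to wake us up, then recheck. */
        ConditionVariableSleep(&qs->cv, WAIT_EVENT_MEMORY_QUOTA);
    }
    ConditionVariableCancelSleep();
}

/*
 * Opportunistically try to grab "extra" bytes at execution time to
 * avoid spilling (point 5); the caller must cope with a false return.
 */
static bool
quota_try_acquire(QuotaState *qs, uint64 extra)
{
    bool        granted = false;

    LWLockAcquire(&qs->lock, LW_EXCLUSIVE);
    if (qs->allocated + extra <= qs->limit)
    {
        qs->allocated += extra;
        granted = true;
    }
    LWLockRelease(&qs->lock);

    return granted;
}

/* Give quota back at the end of the query and wake any waiters. */
static void
quota_release(QuotaState *qs, uint64 amount)
{
    LWLockAcquire(&qs->lock, LW_EXCLUSIVE);
    Assert(qs->allocated >= amount);
    qs->allocated -= amount;
    LWLockRelease(&qs->lock);

    ConditionVariableBroadcast(&qs->cv);
}

A real version would presumably want a fair wait queue rather than
broadcast-and-race wakeups, but that's the general shape.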

This is sort of a fudging together of ideas from conversations with
Kevin Grittner (who talked about admission control a few years back),
Peter Geoghegan (who mentioned opportunistically asking for more), and
things I've heard of on SQL Server ("memory grants").  I think it
would provide some relief from the problems we see today: it's hard to
set work_mem so that you never get OOM but can still use a decent
amount of your precious memory, especially with mixed parallel and
non-parallel query workloads, thanks to our current
work_mem-multiplying design.

--
Thomas Munro
https://enterprisedb.com


