Re: Treating work_mem as a shared resource (Was: Parallel Hash take II) - Mailing list pgsql-hackers

From: Tom Lane
Subject: Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)
Date:
Msg-id: 19181.1510951685@sss.pgh.pa.us
In response to: Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)  (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)  (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
Peter Geoghegan <pg@bowt.ie> writes:
> On Fri, Nov 17, 2017 at 7:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think this is basically a planning problem.  For example, say we
>> wanted to have work_mem_per_query instead of work_mem_per_node.
>> There is an obvious design: consider memory use as an independent
>> dimension of merit during path generation and comparison (less is
>> better).  Discard candidate paths whose memory use exceeds the
>> work_mem_per_query budget unless there are no other alternatives.
>> At the end of planning, pick the cheapest path that survived the
>> memory-budget filter.  Now, this has the problem that it would make
>> planning more expensive (because we'd hang on to more paths for
>> longer) but it solves a lot of other problems.  If there's no memory
>> pressure, we can use memory like mad even when it doesn't save much,
>> but when we have to pick between using more memory for one part of
>> the plan and using more memory for another part of the plan, the
>> choice that does the best job reducing overall execution time will
>> win.  Awesome.

> I'd like to hear some opinions on the feasibility of this approach.
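
As a toy illustration of the scheme quoted above (all names here are
invented for the example, not planner API): carry an estimated peak
memory need alongside cost, filter out over-budget candidates, and take
the cheapest survivor, with a fallback when nothing fits the budget.

    #include <stddef.h>

    /*
     * Toy model of the quoted scheme, with invented names (this is not
     * planner code): each candidate path carries an estimated cost and
     * an estimated peak memory need; paths that exceed the per-query
     * budget are passed over, and the cheapest survivor is chosen,
     * falling back to the least memory-hungry path when nothing fits.
     */
    typedef struct PathSketch
    {
        double      cost;           /* estimated total execution cost */
        double      mem_needed;     /* estimated peak memory use, in kB */
    } PathSketch;

    static const PathSketch *
    pick_cheapest_within_budget(const PathSketch *paths, int npaths,
                                double work_mem_per_query)
    {
        const PathSketch *best = NULL;      /* cheapest path within budget */
        const PathSketch *leanest = NULL;   /* smallest memory need overall */
        int         i;

        for (i = 0; i < npaths; i++)
        {
            const PathSketch *p = &paths[i];

            if (leanest == NULL || p->mem_needed < leanest->mem_needed)
                leanest = p;

            if (p->mem_needed <= work_mem_per_query &&
                (best == NULL || p->cost < best->cost))
                best = p;
        }

        /* "Unless there are no other alternatives": take something anyway. */
        return best ? best : leanest;
    }

In the quoted proposal, memory would also act as an independent
dimension of merit throughout path generation and comparison, not just
as a final filter; the sketch only shows the end-of-planning choice.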

There is indeed a big planning problem here, but Robert's sketch is
missing an important component of it: work_mem is not an output of cost
estimates, it is an *input*.  For example, a sort or hash join can run
in however much memory you give it, but it's going to cost different
amounts.
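
To make that concrete, a cost estimate naturally takes the memory
budget as a parameter.  The formula below is invented for illustration
(it is not cost_sort()), but it shows the shape: same operation, any
budget, different cost.

    #include <math.h>

    /*
     * Invented cost formula, for illustration only: the same sort can
     * be costed under any work_mem, it just comes out more expensive as
     * the budget shrinks, because more data spills and more merge
     * passes are needed.
     */
    static double
    toy_sort_cost(double input_kb, double ntuples, double work_mem_kb)
    {
        double      cost = ntuples * log2(ntuples + 1.0);   /* comparison work */

        if (input_kb > work_mem_kb)
        {
            double      nruns = input_kb / work_mem_kb;         /* initial runs */
            double      order = fmax(2.0, work_mem_kb / 32.0);  /* crude merge fan-in */
            double      npasses = ceil(log2(nruns) / log2(order));

            /* charge for writing and re-reading the data once per pass */
            cost += 2.0 * input_kb * fmax(1.0, npasses);
        }
        return cost;
    }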

I think what we're actually looking for is to find the breakpoints in
the cost curve where the planner thinks it can switch to a different sorting
or hashing model, and then to emit paths that assume work_mem just
above each of those breakpoints.  But the code isn't set up like that
now, not as to either planning or execution.
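
One crude way to picture that, with a made-up regime classifier rather
than the real cost model: scan candidate budgets, note where the model
changes shape, and imagine emitting one path pinned just above each
such point.

    #include <stdio.h>

    /*
     * Made-up regime classifier, for illustration only:
     * 0 = in-memory sort, 1 = external sort with one merge pass,
     * 2 = external sort needing multiple merge passes.
     */
    static int
    toy_sort_regime(double input_kb, double work_mem_kb)
    {
        if (input_kb <= work_mem_kb)
            return 0;
        if (input_kb <= work_mem_kb * (work_mem_kb / 2.0))
            return 1;
        return 2;
    }

    int
    main(void)
    {
        double      input_kb = 65536.0;     /* 64MB of data to sort */
        int         prev = -1;
        double      wm;

        for (wm = 64.0; wm <= 131072.0; wm *= 2.0)
        {
            int         regime = toy_sort_regime(input_kb, wm);

            if (regime != prev)
            {
                /* here one would emit a path assuming work_mem just above wm */
                printf("cost-curve breakpoint near work_mem = %.0fkB (regime %d)\n",
                       wm, regime);
                prev = regime;
            }
        }
        return 0;
    }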

Another problem with formulating it that way is that it suddenly puts
a much higher premium on the planner's space estimates being right,
which is something I don't have much faith in.  For instance, if the
planner thinks that 1000kB is just enough to hold a hash table, and
then when we run it we find out that we need a bit more space than that,
do we really want the executor to switch to a batched join?  Probably not,
especially not if having set the node's work_mem to 1010kB instead
would've let it run to completion without batching.  Addressing that
discrepancy might be where we need the dynamic "emergency memory request"
mechanism that Peter was postulating.  But I'm not sure exactly how that
works, because at the point where the executor realizes it's about to
exceed the original space budget, it generally has little idea how much
more it would need in order to avoid spilling the sort to disk or
adding another round of batching.
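
For what it's worth, such a mechanism might look roughly like the
hypothetical sketch below (none of these types or functions exist in
the tree); the difficulty just described shows up as the
guessed_shortfall argument, which at that point is necessarily a guess.

    #include <stdbool.h>
    #include <stddef.h>

    /*
     * Hypothetical sketch only.  A hash node about to exceed its
     * planned budget asks a shared accountant for a top-up before
     * resorting to another round of batching.  The catch is
     * 'guessed_shortfall': at the moment of overrun the node can only
     * guess how much more memory would let it finish without batching.
     */
    typedef struct MemAccountant
    {
        size_t      total_budget;   /* query- or cluster-wide budget, bytes */
        size_t      reserved;       /* memory already promised to nodes */
    } MemAccountant;

    static bool
    request_extra_memory(MemAccountant *acct, size_t needed)
    {
        /* grant the request only if the global budget has headroom left */
        if (acct->reserved + needed <= acct->total_budget)
        {
            acct->reserved += needed;
            return true;
        }
        return false;
    }

    static bool
    hash_try_to_avoid_batching(MemAccountant *acct, size_t guessed_shortfall)
    {
        return request_extra_memory(acct, guessed_shortfall);
    }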

So it's really unclear to me what either the planner or executor API
contracts for memory consumption ought to be if we're going to try to do
this differently.  I agree there's a lot of potential for improvement if
we can find a better solution, but we're going to need to put some serious
thought into it.
        regards, tom lane

