Treating work_mem as a shared resource (Was: Parallel Hash take II)

From: Peter Geoghegan
On Wed, Nov 15, 2017 at 1:06 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> In the old days, Oracle had only simple per-operation memory limits
> too, and that applied to every operation in every thread just like our
> work_mem.  It's interesting that they had separate knobs for sort and
> hash though, and defaulted to giving hash twice as much.

It makes plenty of sense to give hash twice as much IMV.

I'm interested in the *economics* of how we use memory -- I think we
could do a lot better there. Memory capacities have certainly
increased dramatically over the years since sort_mem became work_mem,
but I suspect that users are (quite reasonably) still giving most of
that over to shared_buffers/FS cache. And storage/network bandwidth has
also increased dramatically during that time, so single-pass external
sorts will continue to be a sensible thing to do frequently. Hashing
is a different story, though -- you really do want to make sure that
hash-based operations have access to more memory, where it can really
be put to use.

Though I am primarily concerned about the fact that we don't give any
weight to how sensitive hash-based operations are to having less
memory, I don't really want to talk about band-aid measures like a
hash_mem GUC (though that does have a certain appeal). I want to talk
about moving past work_mem, and towards a model where work_mem-like
memory is treated like a shared resource, under a regime that
intelligently weighs the effects of making more memory available to
one plan based on system-wide accounting, and the sensitivity of each
memory-consuming operation to the availability of memory. This thread
is intended to get the ball rolling on that.

It seems like we need something like this to get the full benefit of
our numerous enhancements to sorting and hashing.

> With a whole-plan memory target, our planner would probably begin to
> plan join order differently to minimise the number of hash tables in
> memory at once, like other RDBMSs.  Not sure how the plan-wide target
> should work though -- try many plans, giving different portions of
> budget to different subplans?  That should work fine if you like
> O(please-melt-my-computer), especially if combined with a similar
> approach to choosing worker numbers.  Some kind of feedback system?
> Seems like a different kind of planner, but I have no clue.  If you
> have ideas/papers/references, it'd be great to see a new thread on
> that subject.

My first thought is that we might implement a model where little
changes about work_mem itself -- it becomes a minimum for each
work_mem consuming operation. There could be an additional "emergency
memory fund" that individual plan nodes can avail of during execution,
if and when it looks like they'll fall underneath an allocation that
allows the work_mem-consuming operation to perform "optimally" or
"reasonably". This would happen at certain "penny-wise, pound foolish"
points. There'd be big differences in how we do this for sorts as
compared to hash joins, because the underlying cost functions for the
two look totally different. There'd probably be a maximum amount of memory
that each executor node could request from the emergency fund, such as
a smallish multiple of work_mem (work_mem being the minimum budget it
can *reliably* claim). A single hash join asking for the entire
emergency fund for itself all at once seems excessive, and likely to
create problems in other areas, so that should be impossible. And, it
should be "hard" for a node to ask for and/or receive the absolute
maximum a node can get, because we want to keep that for cases that
would otherwise truly be much slower.
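
To make this a bit more concrete, here's a minimal sketch of what the
node-facing interface could look like. Every name and number in it is
invented for illustration only; the real thing would presumably be a
shared structure set up at startup:

#include "postgres.h"
#include "miscadmin.h"          /* work_mem */
#include "storage/spin.h"

/* Invented: a single shared pool of "extra" work_mem, in kilobytes. */
typedef struct EmergencyMemFund
{
    slock_t     mutex;          /* SpinLockInit() at startup, not shown */
    int64       available_kb;   /* kilobytes currently unclaimed */
} EmergencyMemFund;

/*
 * Try to claim up to request_kb on top of work_mem.  The request is capped
 * at a smallish multiple of work_mem so that no single node can drain the
 * fund.  Returns the number of kilobytes actually granted (possibly zero);
 * the caller treats work_mem + granted as its budget, and must hand the
 * grant back via emergency_mem_release() when it shuts down.
 */
static int64
emergency_mem_request(EmergencyMemFund *fund, int64 request_kb)
{
    int64       cap = (int64) work_mem * 4;     /* arbitrary cap */
    int64       granted;

    request_kb = Min(request_kb, cap);

    SpinLockAcquire(&fund->mutex);
    granted = Min(request_kb, fund->available_kb);
    fund->available_kb -= granted;
    SpinLockRelease(&fund->mutex);

    return granted;
}

static void
emergency_mem_release(EmergencyMemFund *fund, int64 granted_kb)
{
    SpinLockAcquire(&fund->mutex);
    fund->available_kb += granted_kb;
    SpinLockRelease(&fund->mutex);
}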

All in all, this approach shouldn't be too radical a departure from
the work_mem model. I admit that there are significant risks with this
approach as a project. It seems like there is a chance that it won't
be ambitious enough in the end, because there are so many competing
considerations. At the same time, I cannot help but be concerned about
how naive we are about how sorting and hashing respond to work_mem.

Anyway, here is how I see the extra/emergency requests working for
specific operations:

* For sorts, a non-optimal sort (a sort that asks for more memory)
would ask for memory when it first looks like multiple passes will be
required. As I pointed out in my Sort vs. Hash talk, that's actually
pretty rare these days, because as work_mem doubles, the capacity to
do everything in one pass quadruples (see the rough arithmetic after
this list). You should really never get multiple passes for an
external sort -- the economics of doing it that way with any frequency
are not likely to be sensible on modern machines.

* For hash join, the "emergency request" logic could be much more
sophisticated, and would be much more likely to be used in practice. I
think that we'd probably want to worry about the difference between
performing a hash join entirely in-memory versus having to do some
amount of batching (unlike sorting). This would generally be more
"eager" than the requests that tuplesort makes, because smaller
differences matter much more, much sooner.

* For hash aggregates, we might have "overage requests" from the
emergency fund, or something along those lines. The emergency fund
might therefore be in deficit (negative) when hash aggregates
misbehave, since it cannot "say no" to these requests (hash aggregate
will not currently take no for an answer, since it has no emergency
spill mechanism). This could limit the overall impact of that
happening, and might also provide a useful choke point for new
alerting and monitoring systems that can hook into the "central memory
management" logic. Hash aggregates might go over work_mem without it
really mattering much of the time.
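
To put rough numbers on the sort case (back-of-the-envelope, assuming
something like the current 256kB of read buffer per input run during the
final merge, and initial runs on the order of work_mem in size): with
work_mem = 64MB you can merge roughly 250 runs of roughly 64MB each in a
single pass, i.e. something like 16GB of input. Double work_mem to 128MB
and it's roughly 500 runs of roughly 128MB each, so the single-pass
ceiling quadruples to around 64GB. You have to be sorting an awful lot
of data before a second merge pass enters the picture.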

ISTM that there is a similar dilemma to this "sort versus hash"
dilemma for maintenance_work_mem tasks: the "CREATE INDEX versus
VACUUM" dilemma. We should try to address that as part of this effort.
(This dilemma is one reason why I wrote the autovacuum_work_mem patch
-- that's not a million miles from the idea of a hash_mem GUC.)

To come up with a real proposal for treating local memory as a shared
resource, I think that we need:

* To hear more ideas about keeping things in balance here. How clever
should we be?

* To experimentally verify that the cost functions (latency as a
function of memory) for things like sorting, merge join, hash join,
and hash aggregate are what we think they are.

* To understand how this relates to admission control. The only
obvious difference that I can think of is that admission control
probably involves queuing when very memory constrained, and isn't
limited to caring about memory. I'm not trying to make swapping/OOM
impossible here; I'm trying to make it easier to be a Postgres DBA
sizing work_mem, and make it so that DBAs don't have to be stingy with
work_mem. The work_mem sizing formulas we sometimes promote (based on
max_connections) are probably very limiting in the real world.

* To better understand the role of the optimizer here. Can we get more
hash aggregate plans in the first place without risking swapping/OOM?
Is it okay that what I propose kind of works against that goal?

* To better understand the role of parallel query here.

* To try to find a way to assess what "better" here looks like,
overall. What benchmark might we devise to assess what a good, robust
strategy looks like?

I freely admit that my proposal is pretty hand-wavy at this point,
but, as I said, I want to at least get the ball rolling.
-- 
Peter Geoghegan


Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

From: David Rowley
On 16 November 2017 at 16:38, Peter Geoghegan <pg@bowt.ie> wrote:
> * To understand how this relates to admission control. The only
> obvious difference that I can think of is that admission control
> probably involves queuing when very memory constrained, and isn't
> limited to caring about memory. I'm not trying to make swapping/OOM
> impossible here; I'm trying to make it easier to be a Postgres DBA
> sizing work_mem, and make it so that DBAs don't have to be stingy with
> work_mem. The work_mem sizing formulas we sometimes promote (based on
> max_connections) are probably very limiting in the real world.

I had always imagined that this should be some sort of work_mem_pool.
Each plan would carry an estimate of how much it expects to consume,
which I'd thought would be N * work_mem, where N is the number of nodes
in the plan that require work_mem. Then, at the start of execution, we
atomically increment a variable in shared memory that tracks
work_mem_pool usage and check whether that variable is still <=
work_mem_pool. If it is, we start execution; if not, we add ourselves
to some waiters queue and go to sleep, to be signalled when another
plan execution completes and releases memory back into the pool. At
that point we re-check, and just go back to sleep if there's still not
enough space.
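
A very rough sketch of that check, just to make its shape concrete
(everything here is invented -- the GUC, the struct, the use of
PG_WAIT_EXTENSION as a placeholder wait event -- and it glosses over how
the pool gets sized and initialized):

#include "postgres.h"
#include "pgstat.h"
#include "storage/condition_variable.h"
#include "storage/lwlock.h"

typedef struct WorkMemPool
{
    LWLock      lock;
    int64       limit_kb;       /* hypothetical work_mem_pool GUC */
    int64       in_use_kb;
    ConditionVariable cv;
} WorkMemPool;

/*
 * Called once at the start of execution with the plan's estimate
 * (N * work_mem).  Blocks until the reservation fits, or until the pool
 * is completely empty (so that an over-sized plan can still run alone).
 */
static void
work_mem_pool_acquire(WorkMemPool *pool, int64 est_kb)
{
    for (;;)
    {
        bool        granted = false;

        LWLockAcquire(&pool->lock, LW_EXCLUSIVE);
        if (pool->in_use_kb + est_kb <= pool->limit_kb ||
            pool->in_use_kb == 0)
        {
            pool->in_use_kb += est_kb;
            granted = true;
        }
        LWLockRelease(&pool->lock);

        if (granted)
            break;

        /* Sleep until another query releases memory back into the pool. */
        ConditionVariableSleep(&pool->cv, PG_WAIT_EXTENSION);
    }
    ConditionVariableCancelSleep();
}

/* Called at the end of execution; wakes up anyone waiting on the pool. */
static void
work_mem_pool_release(WorkMemPool *pool, int64 est_kb)
{
    LWLockAcquire(&pool->lock, LW_EXCLUSIVE);
    pool->in_use_kb -= est_kb;
    LWLockRelease(&pool->lock);
    ConditionVariableBroadcast(&pool->cv);
}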

Probably simple plans with no work_mem requirement can skip all these
checks, which may well keep concurrency up.  I'm just not all that
clear on how to handle the case where the plan's memory estimate
exceeds work_mem_pool. It would never get to run. Perhaps everything
that requires any memory must wait in that case so this query can run
alone, i.e. special-case this to require the work_mem_pool usage to be
0 before we run, or maybe it should just be an ERROR?

Probably the whole feature could be disabled if work_mem_pool is -1,
which might be a better option for users who find there's some kind of
contention around memory pool checks.

> I freely admit that my proposal is pretty hand-wavy at this point,
> but, as I said, I want to at least get the ball rolling.

Me too. I might have overlooked some giant roadblock.

I think it's important that the work_mem_pool tracker is consumed at
the start of the query rather than when each work_mem-consuming node
first runs, as there'd likely be some deadlock-style waiting issue if
plans that are part-way through execution start waiting for other
plans to complete. That might not be ideal, as we'd be assuming that a
plan will always consume all of its work_mem allowances at once, but
it seems better than what we have today. Maybe we can improve on it
later.


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

From: Serge Rielau
I have been pondering how to deal with work_mem for a couple of months myself and had very similar thoughts.
As far as I can tell the problem goes beyond work_mem though:
1. There are several hash operations such as set-ops, hashed subplans, and hash aggregates which today do not spill at all.
We have partially solved these so far and, once the work is complete, we think the fixes can be pushed into community PG if there is a desire for it.
2. We also worry about large objects which can bloat a backend.
3. Other random allocations that I fear I just don’t know about.
4. OSes are chronically poor at trading memory between processes, even after the memory is freed, unless it’s returned to the OS in big contiguous chunks.

Just as you have, we have also considered holistic provisioning of work_mem across all consumers, but we find that to be too complex.
Having an “emergency fund” in shared memory is also an option, but I find it too limiting.
Also, this approach is what was done at DB2 when I was there, and it proved cumbersome.

So I’m currently pressing forward with a much more fundamental approach:
Pushing Top Transaction Context and its children into shared memory.
To avoid fragmentation and serialization on latches I have defined the concept of “a context cluster”. The root of the cluster is the sole true allocator of memory. Child contexts allocate blocks as pallocs from the cluster root. Basically memory management goes recursive and children live within the root.
The root (TopTransactionContext) allocates big blocks, e.g. 8MB at a time.
Within a transaction PG operates as usual with freelists and all turning over these same 8MB or allocating more if needed.
But at the end of every transaction big chunks of memory become available to share with other transactions again.
A few places where we reparent contexts need to detect that this can’t be done between clusters, or into/out of a cluster, and do deep copies if needed, but there are few of those. Come to think of it, all the cases I encountered so far were SFDC-specific…
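
In other words -- and this is a conceptual sketch only, not code from our tree -- a child context inside a cluster never goes to malloc() for its blocks; its block allocator is effectively just a palloc against the cluster root:

#include "postgres.h"

/*
 * Conceptual sketch: a child context inside a cluster obtains its blocks
 * from the cluster root (an ordinary MemoryContext that happens to live in
 * shared memory) rather than from malloc(), so all of the transaction's
 * memory ends up inside the root's big (e.g. 8MB) blocks.
 */
static void *
cluster_child_block_alloc(MemoryContext cluster_root, Size blksize)
{
    return MemoryContextAlloc(cluster_root, blksize);
}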

I’m also moving the e-state from the Portal Heap to the Top Transaction Context.
At the end of the day the assumption is that most transactions only need one block from shared memory, and I can probably pin it to the backend, further reducing contention. 
If there is an out-of-memory situation - which should be very rare - there are multiple ways to deal with it. If there is no deadlock, we can simply wait. If there is one, rolling back the transaction that encountered the OOM is the obvious, if not optimal, solution.
Finding the biggest consumer and sending it a signal to back off would be another way to do it.

My goal is to run a backend with 50-100MB with all local caches controlled for size. 
Transaction memory, with e-states included, should be sized at 8MB/backend plus a fixed “spill” of some GB.

Yes, this is invasive, and I’m sure I'll be debugging it for a while given my limited knowledge of the engine. I may yet fail spectacularly.
On the other hand it’s conceptually pretty straightforward.

Cheers
Serge Rielau
SFDC

Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

From: Robert Haas
On Thu, Nov 16, 2017 at 11:50 AM, Serge Rielau <serge@rielau.com> wrote:
> Just as you have, we have also considered holistic provisioning of work_mem across all consumers, but we find that to be too complex.
> Having an “emergency fund” in shared memory is also an option, but I find it too limiting.

I agree.

I think this is basically a planning problem.  For example, say we wanted to have work_mem_per_query instead of work_mem_per_node.  There is an obvious design: consider memory use as an independent dimension of merit during path generation and comparison (less is better).  Discard candidate paths whose memory use exceeds the work_mem_per_query budget unless there are no other alternatives.  At the end of planning, pick the cheapest path that survived the memory-budget filter.  Now, this has the problem that it would make planning more expensive (because we'd hang on to more paths for longer) but it solves a lot of other problems.  If there's no memory pressure, we can use memory like mad even when it doesn't save much, but when we have to pick between using more memory for one part of the plan and using more memory for another part of the plan, the choice that does the best job reducing overall execution time will win.  Awesome.
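
To sketch what "independent dimension of merit" could mean mechanically (PathWithMem and est_memory_kb are invented here; Path carries no memory estimate today):

#include "postgres.h"
#include "nodes/relation.h"     /* Path (as of v10) */

/* Invented pairing of a path with a memory estimate nothing tracks today. */
typedef struct PathWithMem
{
    Path       *path;
    double      est_memory_kb;
} PathWithMem;

/*
 * A path could only be discarded if some surviving path is at least as
 * good on *both* dimensions, so cheap-but-memory-hungry and
 * frugal-but-slower paths would both live to the end of planning, where
 * the memory-budget filter picks among them.
 */
static bool
path_dominates(const PathWithMem *a, const PathWithMem *b)
{
    return a->path->total_cost <= b->path->total_cost &&
           a->est_memory_kb <= b->est_memory_kb;
}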

We could also do more localized variants of this that don't provide hard guarantees but do tend to avoid squandering resources.  I don't think that we can directly incorporate memory use into cost because that will distort the costs of higher-level nodes in the plan tree; cost needs to mean execution time.  However, what we could do is refuse to replace a more expensive path in a relation's path list with a cheaper one when the savings are small and the cheaper path uses a lot more memory.  That way, you wouldn't replace a nested loop that costs a million units with a hash join that costs 999,999 units but uses a GB of RAM; you'd save the hash join for cases where we think it will help significantly.

Yet another thing we could do is to try to get nodes to voluntarily use less than work_mem when possible.  This is particularly an issue for sorts.  A 2-batch hash join is so much more expensive than a single-batch hash join that it's almost never going to make sense unless we have no realistic alternative, although I suppose a 64-batch hash join might be not that different from a 32-batch hash join.  But for sorts, given all Peter's work in this area, I bet there are a lot of sorts that could budget a quarter or less of work_mem and really not be hurt very much.  It depends somewhat on how fast and how contended your I/O is, though, which we don't have an especially good way to model.  I'm starting to wonder if that sort_mem GUC might be a good idea... use that for sorts, and keep work_mem for everything else.

If we really want to be able to dynamically react to changing memory conditions, what we need is a forest of plans for a given query rather than just one.  Pick plan A if memory is limited, otherwise pick B.  Or use admission control.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

From: Vladimir Rusinov
On Fri, Nov 17, 2017 at 3:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Nov 16, 2017 at 11:50 AM, Serge Rielau <serge@rielau.com> wrote:
>> Just as you have, we have also considered holistic provisioning of work_mem across all consumers, but we find that to be too complex.
>> Having an “emergency fund” in shared memory is also an option, but I find it too limiting.
>
> I agree.
>
> I think this is basically a planning problem.  For example, say we wanted to have work_mem_per_query instead of work_mem_per_node.  There is an obvious design: consider memory use as an independent dimension of merit during path generation and comparison (less is better).  Discard candidate paths whose memory use exceeds the work_mem_per_query budget unless there are no other alternatives.  At the end of planning, pick the cheapest path that survived the memory-budget filter.  Now, this has the problem that it would make planning more expensive (because we'd hang on to more paths for longer) but it solves a lot of other problems.  If there's no memory pressure, we can use memory like mad even when it doesn't save much, but when we have to pick between using more memory for one part of the plan and using more memory for another part of the plan, the choice that does the best job reducing overall execution time will win.  Awesome.
>
> We could also do more localized variants of this that don't provide hard guarantees but do tend to avoid squandering resources.  I don't think that we can directly incorporate memory use into cost because that will distort the costs of higher-level nodes in the plan tree; cost needs to mean execution time.  However, what we could do is refuse to replace a more expensive path in a relation's path list with a cheaper one when the savings are small and the cheaper path uses a lot more memory.  That way, you wouldn't replace a nested loop that costs a million units with a hash join that costs 999,999 units but uses a GB of RAM; you'd save the hash join for cases where we think it will help significantly.
>
> Yet another thing we could do is to try to get nodes to voluntarily use less than work_mem when possible.  This is particularly an issue for sorts.  A 2-batch hash join is so much more expensive than a single-batch hash join that it's almost never going to make sense unless we have no realistic alternative, although I suppose a 64-batch hash join might be not that different from a 32-batch hash join.  But for sorts, given all Peter's work in this area, I bet there are a lot of sorts that could budget a quarter or less of work_mem and really not be hurt very much.  It depends somewhat on how fast and how contended your I/O is, though, which we don't have an especially good way to model.  I'm starting to wonder if that sort_mem GUC might be a good idea... use that for sorts, and keep work_mem for everything else.
>
> If we really want to be able to dynamically react to changing memory conditions, what we need is a forest of plans for a given query rather than just one.  Pick plan A if memory is limited, otherwise pick B.  Or use admission control.

FWIW, lack of per-connection and/or global memory limit for work_mem is a major PITA when running shared and/or large-scale setups.

Currently we are doing a poor job with the work_mem parameter because we don't have a good way to let our customers increase it without also giving them the ability to shoot themselves in the foot.
Even a simple param limiting the global total number of work_mem buffers would help here.

-- 
Vladimir Rusinov
PostgreSQL SRE, Google Ireland

Google Ireland Ltd., Gordon House, Barrow Street, Dublin 4, Ireland
Registered in Dublin, Ireland
Registration Number: 368047

Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

From: Peter Geoghegan
On Fri, Nov 17, 2017 at 7:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Nov 16, 2017 at 11:50 AM, Serge Rielau <serge@rielau.com> wrote:
>>
>> Just as you have, we have also considered holistic provisioning of work_mem across all consumers, but we find that to be too complex.
>> Having an “emergency fund” in shared memory is also an option, but I find it too limiting.
>
> I agree.

Yeah. I suspect that that idea is not ambitious enough to do a lot of
what we want, and yet is too ambitious to justify working on given its
limited shelf life.

> I think this is basically a planning problem.  For example, say we wanted to have work_mem_per_query instead of
> work_mem_per_node.  There is an obvious design: consider memory use as an independent dimension of merit during path
> generation and comparison (less is better).  Discard candidate paths whose memory use exceeds the work_mem_per_query
> budget unless there are no other alternatives.  At the end of planning, pick the cheapest path that survived the
> memory-budget filter.  Now, this has the problem that it would make planning more expensive (because we'd hang on to
> more paths for longer) but it solves a lot of other problems.  If there's no memory pressure, we can use memory like mad
> even when it doesn't save much, but when we have to pick between using more memory for one part of the plan and using
> more memory for another part of the plan, the choice that does the best job reducing overall execution time will win.
> Awesome.

I'd like to hear some opinions on the feasibility of this approach.
Does David have anything to say about it, for example?

> We could also do more localized variants of this that don't provide hard guarantees but do tend to avoid squandering
> resources.

That sounds like independent work, though it could be very useful.

> Yet another thing we could do is to try to get nodes to voluntarily use less than work_mem when possible.  This is
> particularly an issue for sorts.  A 2-batch hash join is so much more expensive than a single-batch hash join that it's
> almost never going to make sense unless we have no realistic alternative, although I suppose a 64-batch hash join might
> be not that different from a 32-batch hash join.  But for sorts, given all Peter's work in this area, I bet there are a
> lot of sorts that could budget a quarter or less of work_mem and really not be hurt very much.  It depends somewhat on
> how fast and how contended your I/O is, though, which we don't have an especially good way to model.  I'm starting to
> wonder if that sort_mem GUC might be a good idea... use that for sorts, and keep work_mem for everything else.

Right. The ability for sorts to do well with less memory is really
striking these days. And though I didn't mean to seriously suggest it,
a hash_mem GUC does seem like it solves some significant problems
without much risk. I think it should be hash_mem, not sort_mem,
because hashing seems more like the special case among operations that
consume work_mem, and because sort_mem is already the old name for
work_mem that is still accepted as a work_mem alias, and because
hash_mem avoids any confusion about whether or not CREATE INDEX uses
the new GUC (it clearly does not).

Since I am primarily concerned about the difference in sensitivity to
the availability of memory that exists when comparing sorting and
hashing, and since a new GUC seems like it would noticeably improve
matters, I am beginning to take the idea of writing a hash_mem patch
for v11 seriously.

--
Peter Geoghegan


Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

From: Peter Geoghegan
On Fri, Nov 17, 2017 at 8:09 AM, Vladimir Rusinov <vrusinov@google.com> wrote:
> FWIW, lack of per-connection and/or global memory limit for work_mem is a major PITA when running shared and/or
> large-scale setups.
>
> Currently we are doing a poor job with the work_mem parameter because we don't have a good way to let our customers
> increase it without also giving them the ability to shoot themselves in the foot.
>
> Even a simple param limiting the global total number of work_mem buffers would help here.

I suspect that we can do better here just by allocating memory more
sensibly in a very simple way (something like my hash_mem proposal).
The relationship between aggregate memory usage and aggregate
throughput is very non-linear. One can imagine giving more memory to
hash joins, making each hash join much faster, having the overall
effect of *reducing* aggregate memory usage. The DBA can be more
generous with memory while actually decreasing aggregate memory usage.
This is at least possible with work_mem consuming operations that
involve hashing, like hash join and hash aggregate.

Simple benchmarking tools like pgbench reinforce the idea that meeting
throughput requirements is the most important thing, but in reality
workloads are usually very bursty. It is often more important to be
able to stay on a smaller instance size while maintaining less than
excellent (but still acceptable) performance. Again, it's about the
economics.

-- 
Peter Geoghegan


Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

From: Tom Lane
Peter Geoghegan <pg@bowt.ie> writes:
> On Fri, Nov 17, 2017 at 7:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think this is basically a planning problem.  For example, say we wanted to have work_mem_per_query instead of
>> work_mem_per_node.  There is an obvious design: consider memory use as an independent dimension of merit during path
>> generation and comparison (less is better).  Discard candidate paths whose memory use exceeds the work_mem_per_query
>> budget unless there are no other alternatives.  At the end of planning, pick the cheapest path that survived the
>> memory-budget filter.  Now, this has the problem that it would make planning more expensive (because we'd hang on to
>> more paths for longer) but it solves a lot of other problems.  If there's no memory pressure, we can use memory like mad
>> even when it doesn't save much, but when we have to pick between using more memory for one part of the plan and using
>> more memory for another part of the plan, the choice that does the best job reducing overall execution time will win.
>> Awesome.

> I'd like to hear some opinions on the feasibility of this approach.

There is indeed a big planning problem here, but Robert's sketch is
missing an important component of it: work_mem is not an output of cost
estimates, it is an *input*.  For example, we can sort or hash-join in
however much memory you want, but it's going to cost different amounts.

I think what we're actually looking for is to find the breakpoints in
the cost curve where it thinks it can switch to a different sorting
or hashing model, and then to emit paths that assume work_mem just
above each of those breakpoints.  But the code isn't set up like that
now, not as to either planning or execution.
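
For instance, for a hash join the breakpoints of interest are roughly the
points where the estimated number of batches changes, so a sketch of the
enumeration might look like this (ignoring per-batch overhead, skew, and
everything else ExecChooseHashTableSize worries about):

#include "postgres.h"

/*
 * Sketch: the work_mem "breakpoints" for a hash join over an inner side of
 * inner_bytes are roughly the points at which the number of batches would
 * change.  The planner could emit one path per breakpoint, each assuming a
 * memory budget just above it.
 */
static int
hashjoin_mem_breakpoints(double inner_bytes, Size *breakpoints, int maxn)
{
    int         n = 0;
    int         nbatch;

    for (nbatch = 1; nbatch <= 256 && n < maxn; nbatch <<= 1)
        breakpoints[n++] = (Size) (inner_bytes / nbatch);

    return n;
}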

Another problem with formulating it that way is that it suddenly puts
a much higher premium on the planner's space estimates being right,
which is something I don't have much faith in.  For instance, if the
planner thinks that 1000kB is just enough to hold a hash table, and
then when we run it we find out that we need a bit more space than that,
do we really want the executor to switch to a batched join?  Probably not,
especially not if having set the node's work_mem to 1010kB instead
would've let it run to completion without batching.  Addressing that
discrepancy might be where we need the dynamic "emergency memory request"
mechanism that Peter was postulating.  But I'm not sure exactly how that
works, because at the point where the executor realizes it's about to
exceed the original space budget, it generally has little idea how much
more it would need in order to avoid spilling the sort to disk or
adding another round of batching.

So it's really unclear to me what either the planner or executor API
contracts for memory consumption ought to be if we're going to try to do
this differently.  I agree there's a lot of potential for improvement if
we can find a better solution, but we're going to need to put some serious
thought into it.
        regards, tom lane


Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

From: Robert Haas
On Fri, Nov 17, 2017 at 1:22 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> Right. The ability for sorts to do well with less memory is really
> striking these days. And though I didn't mean to seriously suggest it,
> a hash_mem GUC does seem like it solves some significant problems
> without much risk. I think it should be hash_mem, not sort_mem,
> because hashing seems more like the special case among operations that
> consume work_mem, and because sort_mem is already the old name for
> work_mem that is still accepted as a work_mem alias, and because
> hash_mem avoids any confusion about whether or not CREATE INDEX uses
> the new GUC (it clearly does not).

Hmm.  I wonder if you are correct that hashing is the special case.
Hashing and sorting are of course the two main operations -- but
there's materialize and anything else that uses a CTE, and maybe other
stuff I'm not thinking about right now.  You might be right that hash
is the one where it really matters, but it's probably worth a bit more
reflection on where it matters most and for what reasons.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

From: Peter Geoghegan
On Fri, Nov 17, 2017 at 3:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Hmm.  I wonder if you are correct that hashing is the special case.
> Hashing and sorting are of course the two main operations -- but
> there's materialize and anything else that uses a CTE, and maybe other
> stuff I'm not thinking about right now.  You might be right that hash
> is the one where it really matters, but it's probably worth a bit more
> reflection on where it matters most and for what reasons.

I'd rather be approximately correct than precisely wrong. I think that
the current work_mem model is precisely wrong. I'm conscious of the
fact that we are loath to create new GUCs (I sometimes think that
we're a bit too averse to doing so), but maybe there is room for
adding a second work_mem-alike GUC.

For now, I admit that I am applying fuzzy criteria, and that I could
easily have missed an important subtlety. Creating hash_mem instead of
sort_mem is a direction that is entirely debatable, and should
actually be debated. OTOH, it seems like a real problem that we don't
allow hashing to take full advantage of available main memory, and
*some* interim solution seems preferable to what we have now.

-- 
Peter Geoghegan


Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

From: Peter Geoghegan
On Fri, Nov 17, 2017 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'd like to hear some opinions on the feasibility of this approach.
>
> There is indeed a big planning problem here, but Robert's sketch is
> missing an important component of it: work_mem is not an output of cost
> estimates, it is an *input*.  For example, we can sort or hash-join in
> however much memory you want, but it's going to cost different amounts.
>
> I think what we're actually looking for is to find the breakpoints in
> the cost curve where it thinks it can switch to a different sorting
> or hashing model, and then to emit paths that assume work_mem just
> above each of those breakpoints.  But the code isn't set up like that
> now, not as to either planning or execution.

It might not be that hard to come up with a model for determining
which points on the curve are of interest. It seems easy to do this
for sorting, because it's actually a very simple curve. Once you're in
single pass territory, and provided you're still using at least a few
tens of megabytes of work_mem, the availability of work_mem seems to
only make a very small difference (temp file I/O is still essentially
sequential, and the logarithmic growth in comparisons as more runs
must be merged doesn't really bite you). Perhaps this also means that
you can expect to get away with moderately bad estimates there.

> Another problem with formulating it that way is that it suddenly puts
> a much higher premium on the planner's space estimates being right,
> which is something I don't have much faith in.  For instance, if the
> planner thinks that 1000kB is just enough to hold a hash table, and
> then when we run it we find out that we need a bit more space than that,
> do we really want the executor to switch to a batched join?  Probably not,
> especially not if having set the node's work_mem to 1010kB instead
> would've let it run to completion without batching.  Addressing that
> discrepancy might be where we need the dynamic "emergency memory request"
> mechanism that Peter was postulating.  But I'm not sure exactly how that
> works, because at the point where the executor realizes it's about to
> exceed the original space budget, it generally has little idea how much
> more it would need in order to avoid spilling the sort to disk or
> adding another round of batching.

I don't know either.

I think that it's reasonable for us to make it a goal of the executor
to have operations that have a smooth cost function, in order to
manage the risk of misestimation well, and to make it a goal to have
operations that are otherwise adaptive to misestimation. To a large
degree, my abandoned "quicksort with spillover" design from a couple
of years ago was written with this in mind (it avoided a sudden
discontinuity in the cost function of sort nodes, at the point where
you must spill for the first time). Another example of an adaptive
operation is "role reversal" for hash joins, where the executor flips
the inner and outer side during execution, at a point where it becomes
clear that the optimizer had it backwards, estimation-wise. There are
probably numerous other things like this that are possible...and maybe
even worthwhile.

In summary, I agree that we're going to have big problems if the
planner needs to have very accurate estimates to see a real benefit.
It seems possible that most of the benefit of "fixing work_mem" comes
from avoiding using a woefully inadequate amount of memory where
batching was clearly always going to be necessary. There may be
limited benefit to preventing batching in the first place. So while
there could also be room for an "emergency memory request" mechanism,
it's more of a nice-to-have.

> So it's really unclear to me what either the planner or executor API
> contracts for memory consumption ought to be if we're going to try to do
> this differently.  I agree there's a lot of potential for improvement if
> we can find a better solution, but we're going to need to put some serious
> thought into it.

The devil is in the details, of course. Vladimir said something about
customer issues with sizing work_mem on Google's cloud database
service, and it reminded me of my experiences with this while working
at Heroku. I tended to hear few complaints about it, but then there'd
sometimes be serious customer issues quite suddenly.

My theory is that there can be a turning point where demands on
work_mem increase, and there are suddenly more group aggregates than
hash aggregates (to a lesser extent, there may be fewer hash joins).
Now the database is using group aggregates that are quite a bit slower
than hash aggregates, while still using approximately the same amount
of memory as before. This creates significantly more pressure quite
suddenly, because the group aggregates are quite a bit slower, and it
takes that much longer for the memory to be released.

I'm mostly concerned about avoiding instability like this. Users
greatly value predictable performance.

-- 
Peter Geoghegan


Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

From: Robert Haas
On Fri, Nov 17, 2017 at 9:22 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I think that it's reasonable for us to make it a goal of the executor
> to have operations that have a smooth cost function, in order to
> manage the risk of misestimation well, and to make it a goal to have
> operations that are otherwise adaptive to misestimation.

Hash joins are a place where we could have a smoother cost function
than we do.  When we run out of memory, instead of switching from
(say) a single batch to two batches, switch to 64 batches, but
initially keep 63 of them in memory and only write the very last one
to disk.  Every time we again run out of memory, dump another batch to
disk.  If we end up dumping more than half or so of the batches to
disk, switch to an even larger number of batches to make it even more
fine-grained.  The current system is bad because you jump from
spooling NO tuples to a tuplestore to spooling HALF of the inner AND
outer tuples to a tuplestore.  If the hash table is just a little too
big to fit, we could write 1/64 or 2/64 or 3/64 of the inner and outer
tuples to a tuplestore instead of HALF of them, which would be a huge
win.
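
Expressed as a standalone decision function, the policy might look roughly
like this (entirely invented; the real HashJoinTable bookkeeping is
structured quite differently, and this glosses over how already-spilled
batches would be renumbered when nbatch grows):

#include "postgres.h"

/*
 * Invented decision function for the incremental spilling policy sketched
 * above.  Called when the resident batches no longer fit in the memory
 * budget; returns the (possibly increased) number of batches, and sets
 * *spill_batchno to the resident batch that should be dumped to disk next,
 * working down from the highest-numbered batch.
 */
static int
choose_next_spill(int nbatch, int nspilled, int *spill_batchno)
{
    if (nbatch == 1)
        nbatch = 64;        /* repartition, but keep the batches in memory */
    else if (nspilled > nbatch / 2)
        nbatch *= 8;        /* mostly spilled already: go more fine-grained */

    *spill_batchno = nbatch - 1 - nspilled;

    return nbatch;
}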

That having been said, I think the place where our plans most commonly
go wrong is where we incorrectly estimate the number of tuples by
multiple orders of magnitude - 100x is common, 1000x is common, a
million x is not uncommon, even a billion x is not unheard-of.  And I
don't think there's any way to make a hash join happy if it thinks
it's going to need 1 batch and it ends up needing a million batches.
At that, even if the cost function is very smooth, you've moved so far
along the curve that you're probably not in a good place.  So, while I
think that smoothing out the cost functions is a good idea, I think we
also need to consider what more can be done to improve the estimates -
and especially to avoid estimates that are off by huge multiples.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

From: Peter Geoghegan
On Tue, Nov 21, 2017 at 7:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Hash joins are a place where we could have a smoother cost function
> than we do.

Yes, it definitely is.

> When we run out of memory, instead of switching from
> (say) a single batch to two batches, switch to 64 batches, but
> initially keep 63 of them in memory and only write the very last one
> to disk.  Every time we again run out of memory, dump another batch to
> disk.  If we end up dumping more than half or so of the batches to
> disk, switch to an even larger number of batches to make it even more
> fine-grained.

That could work.

> That having been said, I think the place where our plans most commonly
> go wrong is where we incorrectly estimate the number of tuples by
> multiple orders of magnitude - 100x is common, 1000x is common, a
> million x is not uncommon, even a billion x is not unheard-of.  And I
> don't think there's any way to make a hash join happy if it thinks
> it's going to need 1 batch and it ends up needing a million batches.

What about dynamic role reversal? That could make a big difference.

> At that, even if the cost function is very smooth, you've moved so far
> along the curve that you're probably not in a good place.  So, while I
> think that smoothing out the cost functions is a good idea, I think we
> also need to consider what more can be done to improve the estimates -
> and especially to avoid estimates that are off by huge multiples.

I agree that it would be enormously valuable if we could make
estimates much better, so I think that I understand why you emphasize
it. But, I don't think that there are any good ideas for improving
join selectivity that don't involve expert DBA knowledge, or
novel/risky techniques for feedback to the system about column
redundancy/correlation, etc. These do not seem like scalable
approaches, and so they don't particularly appeal to me as projects.
I'd be happy to be shown to be wrong about this.

OTOH, techniques like dynamic role reversal, for when there are many
batches and it's faster to flip the outer and inner sides, do seem
promising. It's probably possible to come up with a more or less
unambiguous improvement, without layering complexity. I suspect that
this technique is widely implemented, and will cut down on cases
leading to terrible performance to a significant degree. I should try
to talk Thomas into working on it.

-- 
Peter Geoghegan


Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

From: Robert Haas
On Tue, Nov 21, 2017 at 5:38 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> That having been said, I think the place where our plans most commonly
>> go wrong is where we incorrectly estimate the number of tuples by
>> multiple orders of magnitude - 100x is common, 1000x is common, a
>> million x is not uncommon, even a billion x is not unheard-of.  And I
>> don't think there's any way to make a hash join happy if it thinks
>> it's going to need 1 batch and it ends up needing a million batches.
>
> What about dynamic role reversal? That could make a big difference.

In the best case it's great, but it looks to me like there are a lot
of thorny problems.  For example, imagine giant_table INNER JOIN
bigger_than_we_thought.  The latter table will be chosen as the inner
table and that won't work out very well, but there's no way to know
whether switching the sides will be any better except to try reading a
bunch of rows from giant_table and seeing whether it turns out to be a
lot smaller than we thought.  To do that, we'll need to dump the hash
table we started to build on the original inner side out to disk so
that we can free up enough work_mem to try building a hash table on
the other side.  When the giant table turns out to actually be giant,
we'll need to go back to the original plan, which means dumping out
the tuples from the second hash table and reloading the tuples from
the first one.  So we end up just doing a bunch of extra work for
nothing.  I think that this scenario - wasting effort trying to switch
the sides only to give up - will happen frequently.

In the multi-batch case, there seems to be a little more hope of doing
something clever.  We're anyway writing out most of both inputs out to
tapes.  If we were willing to write ALL of both inputs out to tapes,
then we could decide - perhaps even separately for each batch - which
side to load into the hash table.  Of course, that adds a lot of
incremental I/O unless the number of batches is large (e.g. if we had
only 4 batches, writing 4/4 of the data instead of 3/4 is a 33%
increase, but if we had 64 batches, writing 64/64 of the data instead
of 63/64 doesn't matter a lot, probably).  And it leaves out a few
important details, like the fact that what fits in the hash table is
used to choose the number of batches in the first place, and that we
write the whole of one side to tapes before starting on the other
side.  I don't know how to handle those problems but it seems like it
might be possible to come up with something clever, at least for
certain cases.

> I agree that it would be enormously valuable if we could make
> estimates much better, so I think that I understand why you emphasize
> it. But, I don't think that there are any good ideas for improving
> join selectivity that don't involve expert DBA knowledge, or
> novel/risky techniques for feedback to the system about column
> redundancy/correlation, etc. These do not seem like scalable
> approaches, and so they don't particularly appeal to me as projects.
> I'd be happy to be shown to be wrong about this.

Yeah, I agree that it's a hard problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

From: Thomas Munro
On Sun, Nov 26, 2017 at 3:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Nov 21, 2017 at 5:38 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>>> That having been said, I think the place where our plans most commonly
>>> go wrong is where we incorrectly estimate the number of tuples by
>>> multiple orders of magnitude - 100x is common, 1000x is common, a
>>> million x is not uncommon, even a billion x is not unheard-of.  And I
>>> don't think there's any way to make a hash join happy if it thinks
>>> it's going to need 1 batch and it ends up needing a million batches.
>>
>> What about dynamic role reversal? That could make a big difference.
>
> In the best case it's great, but it looks to me like there are a lot
> of thorny problems.

There are loads of inter-related topics discussed in this thread,
including some operator-specific stuff like the above, and some more
general stuff, all requiring more research.  In the meantime, I wonder
if there are some simpler incremental improvements we could consider.

Since work_mem currently acts as a kind of per executor node instance
limit, the system-wide peak memory usage could be described as number
of concurrent queries * number of executor nodes * number of parallel
participants * work_mem.  In the past I think the number of executor
nodes was practically anchored to the ground by the number of
relations in the query (not necessarily linearly, but not far off it),
and the number of parallel participants was one.  With the advent of
parallel query we have this new multiplying term, and with the advent
of partitions and partition-wise join we have exciting new ways to
explode the number of executor nodes when the user only explicitly
named a few relations.
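
For example (with numbers invented purely for illustration): 20 concurrent
queries, each with 10 work_mem-consuming executor nodes and 5 participants
(a leader plus 4 workers), at work_mem = 4MB, already allows for a peak of
20 * 10 * 5 * 4MB = 4GB.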

We could imagine various levels of memory budgeting:

1.  work_mem_per_system (global budget).
2.  work_mem_per_query (work_mem somehow shared out between executor nodes).
3.  Per planned executor node budget (workers get a fraction of
work_mem for each node).
4.  What we have today: per executor node instance budget (workers get
to use work_mem for each node).

1 and 2 seem like they might be boil-the-ocean problems.  But as far
as I know moving from 4 to 3 would merely require warming up a minor
lake.  That would take out one of the multipliers, and would remove a
perverse incentive from any potential cost-based parallel degree
choosing algorithms (you can print free memory by adding new workers.)

Parallel Hash either combines the memory budgets of all participants
to make one large no-partition hash table, or partitions the inner
relation into work_mem sized batches and loads several of them into
memory at the same time (at most one per participant).  Either way the
total memory usage is participants * work_mem, consistent with policy
4 and consistent with the total budget given to equivalent
parallel-oblivious hash join, sort-merge join or any other node.

If we switched to policy 3 and (say) work_mem were somehow
automagically adjusted to be divided by number of participants at
planning and execution time, then Parallel Hash wouldn't have to
change at all to conform to the policy.  It would use at most work_mem
per Parallel Hash node, no matter how many workers and no matter which
of its strategies it picked (either it receives a budget of work_mem /
participants, and then multiplies it by participants to create a
no-partition hash table combining the participants' budgets, or it
lets each participant chew on smaller hash tables adding up to at most
work_mem).  Just the same total per-node budget as any other executor
node gets.
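
To make that concrete with invented numbers: with work_mem = 32MB and 4
participants, policy 3 would hand each participant an 8MB budget.  A
no-partition Parallel Hash would combine those into a single 32MB hash
table, while the partitioned strategy would have each participant working
on batches of up to 8MB, at most one per participant in memory at once --
either way the node as a whole stays within 32MB, just as a
parallel-oblivious hash join or sort would under the same policy.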

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

From: Robert Haas
On Fri, Feb 23, 2018 at 6:06 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> If we switched to policy 3 and (say) work_mem were somehow
> automagically adjusted to be divided by number of participants at
> planning and execution time, then Parallel Hash wouldn't have to
> change at all to conform to the policy.  It would use at most work_mem
> per Parallel Hash node, no matter how many workers and no matter which
> of its strategies it picked (either it receives a budget of work_mem /
> participants, and then multiplies it by participants to create a
> no-partition hash table combining the participants' budgets, or it
> lets each participant chew on smaller hash tables adding up to at most
> work_mem).  Just the same total per-node budget as any other executor
> node gets.

That's true, but what you'd have instead is a whole lot of additional
planning overhead.  Right now, if we choose to do a merge-join or a
parallel-oblivious hash join or a nested loop with a materialize node
on the inner side, we can join the parallel-aware path on the outer
side to the same parallel-oblivious path on the inner side that we
would use if we decided against parallel query altogether.  If you
wanted all of the copies of a node across all parallel participants
to stick to work_mem as a budget, then you'd need one set of paths for
each rel planned with the default work_mem setting and a second set
planned with less work_mem.  And if you imagine a future where we
create various paths for the same relation with various different
numbers of workers, then you'd need to have even more different sets
of paths for each relation.

If we're OK with making planning more expensive to solve this problem,
then I think we should forget about #3 and go straight to #2.  What we
would do is just teach add_path() that "amount of memory used" is
another independent dimension of merit, so that a more expensive plan
might be kept if it uses less memory. Then if at the end of planning
you want to pick the fastest plan that uses less than X amount of
memory, or if you want to pick the plan for which weight * cost +
weight * memory usage is minimal, or whatever it is you want, you can.
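
For instance, the end-of-planning selection could be as simple as this
sketch (the memory estimate attached to each path is invented -- nothing
tracks it today -- and making add_path() actually keep the right set of
paths is of course the hard part):

#include "postgres.h"
#include "nodes/relation.h"     /* Path (as of v10) */

/* Invented pairing of a path with a memory estimate nothing tracks today. */
typedef struct PathWithMem
{
    Path       *path;
    double      est_memory_kb;
} PathWithMem;

/*
 * Pick the cheapest surviving path that fits the per-query memory budget,
 * falling back to the overall cheapest path if nothing fits.
 */
static Path *
cheapest_path_within_budget(PathWithMem *paths, int npaths, double budget_kb)
{
    Path       *best = NULL;
    Path       *cheapest = NULL;
    int         i;

    for (i = 0; i < npaths; i++)
    {
        Path       *path = paths[i].path;

        if (cheapest == NULL || path->total_cost < cheapest->total_cost)
            cheapest = path;
        if (paths[i].est_memory_kb > budget_kb)
            continue;
        if (best == NULL || path->total_cost < best->total_cost)
            best = path;
    }

    return (best != NULL) ? best : cheapest;
}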

I think the only one from your list that's really boil-the-ocean is
#1.  For that one, you presumably need to create multiple plans and
switch between them based on how much memory is available right now
and maybe how much you think will be available in the near future and
I guess impose some kind of admission control when system memory gets
too low...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company