Re: Adjusting hash join memory limit to handle batch explosion - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Adjusting hash join memory limit to handle batch explosion
Date:
Msg-id: 0767ca41-7cad-4f4e-a464-e0b9681e42d5@vondra.me
In response to: Re: Adjusting hash join memory limit to handle batch explosion (James Hunter <james.hunter.pg@gmail.com>)
Responses: Re: Adjusting hash join memory limit to handle batch explosion
List: pgsql-hackers
On 2/25/25 17:30, James Hunter wrote:
> On Wed, Feb 19, 2025 at 12:22 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> I've pushed the first (and main) part of the patch series, after some
>> more cleanup and comment polishing.
>
> Two comments on your merged patch --
>
> First, it's easier to see what's going on if we overlook the logic to
> round to nearest power of two, and solve the optimization problem
> algebraically. Let T = the total memory needed to hash all input rows,
> and B = the size of per-batch metadata (= 2 * BLKSIZE, which is
> typically 16 KB). Then, solving the optimization problem, the minimum
> memory usage occurs at n = nbatches = SQRT(T / B) and w = workmem =
> SQRT(B * T).
>
> (Here I am using "workmem" for the hash table's "space_allowed.")
>
> The total working memory used, at the minimum, is always 2 * w: twice
> the optimal "workmem" ("space_allowed").
>
> This says that the maximum input size that can be (optimally) hashed
> with the default 8 MB workmem (= work_mem * hash_mem_multiplier) is
> 4 GB, and the total working memory used would actually be 16 MB.
>
> Also, to hash 64 GB, or 16x as much, requires a 32 MB workmem, with
> 64 MB of total working memory used. So "workmem" grows with the SQRT
> of T, the total hash memory needed; and total working memory is 2x
> "workmem."
>

Yes, this is a nice way to explain the issue, and how we solve it. It's
probably better than the comment in my commit, I guess.
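To make those numbers concrete, here is a back-of-the-envelope check of
the formulas, as a standalone sketch (plain C, not PostgreSQL code; B is
the 16 KB per-batch figure from the quoted text):

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double      B = 2 * 8192;           /* per-batch metadata, ~16 KB */
        double      sizes_gb[] = {4, 64};   /* total hash size T, in GB */

        for (int i = 0; i < 2; i++)
        {
            double      T = sizes_gb[i] * 1024 * 1024 * 1024;
            double      nbatch = sqrt(T / B);   /* optimal batch count */
            double      workmem = sqrt(T * B);  /* optimal hash table size */

            printf("T = %2.0f GB: nbatch = %4.0f, workmem = %2.0f MB, total = %3.0f MB\n",
                   sizes_gb[i], nbatch,
                   workmem / (1024 * 1024),
                   2 * workmem / (1024 * 1024));
        }

        return 0;
    }

For T = 4 GB this prints nbatch = 512 and workmem = 8 MB (16 MB total),
and for T = 64 GB it prints nbatch = 2048 and workmem = 32 MB (64 MB
total), matching the figures quoted above.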
> Second -- the algebraic solution illustrates the difficulty in
> tracking and restricting working memory usage for Hash Joins! Your
> patch improves the "hash 64 GB" situation, because it eliminates 96 MB
> of per-batch metadata, by reducing n = nbatches from 8192 to 2048, at
> a cost of only 24 MB of workmem. Using the default 8 MB workmem,
> *actual* total working memory used would be 8 MB + 16 KB * (64 GB /
> 8 MB) = 136 MB. By increasing workmem to 32 MB, total working memory
> is only 64 MB; so we save 72 MB overall. This is a good thing, but--
>

Agreed.

> The "but" is that the customer really should have set their workmem
> to 64 MB, in the first place; and we should have taken half of that
> for the hash table, and left the other half for per-batch metadata.
>
> -- OK, but historically we have pretended that the per-batch metadata
> used no memory. So the customer should have set their workmem to 32
> MB, with the understanding that PostgreSQL would have actually used
> 64 MB...
>

Sure, we could have considered the per-batch metadata during planning.
And if we find we can't run the hash join, we'd "disable" (penalize) it
in some way. No argument there.

But that assumes we correctly estimate the number of batches during
planning, and it's easy to get that wrong - the nbatch explosion cases
are a good example. And that's what my patch was aiming to improve. It
does not matter how the user sets the work_mem GUC, really.

> -- OK, but the customer *didn't* set their workmem to 32 MB. (If they
> had, we wouldn't need this patch -- but we *do* need this patch, which
> means the customer hasn't set their workmem high enough.) Why not?
> Well, because if they set it to 32 MB, they'd run OOM!
>

Not sure I follow the reasoning here :-( If the query completed with a
lower work_mem value, it should complete with work_mem = 32MB, because
that reduces the amount of memory needed. But yes, it's possible they
hit OOM in both cases - this is an attempt to reduce the impact.

> -- So we are (secretly!) increasing the customer's workmem to 32 MB,
> but only for this particular Hash Join. The customer can't increase it
> to 32 MB for all Hash Joins, or they'd run OOM. So we increase it just
> for this Hash Join, in the hopes that by doing so we'll avoid running
> OOM... which is good; but we don't *tell* the customer we've done
> this, and we just hope that the customer actually has 64 MB (= 2x
> workmem) free (because, if they don't, they'll run OOM anyway).
>

Right. This is meant to be a best-effort mitigation for rare cases.
Maybe we should track/report it somehow, though. I mean, if 1% of hash
joins need this, you're probably fine. If 99% of hash joins hit it, you
probably really need a higher work_mem value because the hashed
relation is just too large.

But you have a point - maybe we should track/report this somewhere. The
first step would be to make the total memory usage better visible in
EXPLAIN (it's not obvious it does not include the per-batch metadata).

> All of this is to say that this patch illustrates the need for
> something like proposal [1], which allows PostgreSQL to set workmem
> limits on individual execution nodes, based on the optimizer's memory
> estimates. In the above patch, we're blindly making things better,
> without knowing whether we've made them good enough. (The customer is
> less likely to run OOM using 64 MB instead of 136 MB, but OOM is
> still possible since their workmem limit is 8 MB!)
>
> In v.next of my patchset at [1] (should be done by end of day today)
> I will deal with the case discussed above by:
>
> 1. Doubling Plan.workmem_limit whenever we halve nbatches (so we track
> the "workmem" needed by the hash table);
> 2. Displaying Plan.workmem_limit + Hash.nbatches * (2 * BLCKSIZE),
> inside EXPLAIN (work_mem on), (so we display to the customer our best
> estimate of the effective workmem limit).
>
> Thanks,
> James
>
> [1] https://www.postgresql.org/message-id/flat/CAJVSvF6s1LgXF6KB2Cz68sHzk%2Bv%2BO_vmwEkaon%3DH8O9VcOr-tQ%40mail.gmail.com

I'm not opposed to doing something like this, but I'm not quite sure
how it could help the cases I meant to address with my patch, where we
plan with a low nbatch value and then it explodes at execution time.


regards

--
Tomas Vondra
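For completeness, here is a minimal standalone sketch of the run-time
balancing idea discussed in this thread: when the in-memory hash table
exceeds its limit, grow whichever of the two memory consumers is
currently cheaper to double. This is an illustration only, with made-up
names (prefer_larger_hash_table is hypothetical), not the committed
PostgreSQL code:

    #include <stdbool.h>
    #include <stddef.h>

    #define BLCKSZ 8192                 /* PostgreSQL default block size */

    /*
     * Decide how to react when the in-memory hash table exceeds its limit:
     * return true to double the space allowed for the hash table, false to
     * double the number of batches. Each batch needs roughly 2 * BLCKSZ of
     * buffer space (one block-sized buffer for each of the inner and outer
     * batch files).
     */
    static bool
    prefer_larger_hash_table(size_t space_allowed, int nbatch)
    {
        size_t      batch_overhead = (size_t) nbatch * 2 * BLCKSZ;

        /* total memory if we double the number of batches instead */
        size_t      mem_more_batches = space_allowed + 2 * batch_overhead;

        /* total memory if we double the hash table and keep nbatch */
        size_t      mem_bigger_table = 2 * space_allowed + batch_overhead;

        return mem_bigger_table <= mem_more_batches;
    }

The condition reduces to space_allowed <= nbatch * 2 * BLCKSZ, i.e. the
hash table is allowed to grow until it matches the per-batch overhead,
which keeps the total close to the 2 * SQRT(B * T) minimum derived
above.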