Re: Adjusting hash join memory limit to handle batch explosion
From: James Hunter
Subject: Re: Adjusting hash join memory limit to handle batch explosion
Msg-id: CAJVSvF6290rJF2MtgSx_SuT9Kn2amZ_+zecoZYMU+dn3BVVaZg@mail.gmail.com
In response to: Re: Adjusting hash join memory limit to handle batch explosion (Tomas Vondra <tomas@vondra.me>)
Responses: Re: Adjusting hash join memory limit to handle batch explosion
List: pgsql-hackers
On Wed, Feb 19, 2025 at 12:22 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> I've pushed the first (and main) part of the patch series, after some
> more cleanup and comment polishing.

Two comments on your merged patch --

First, it's easier to see what's going on if we overlook the logic to round to the nearest power of two, and solve the optimization problem algebraically. Let T = the total memory needed to hash all input rows, and B = the size of per-batch metadata (= 2 * BLCKSZ, which is typically 16 KB). Then, solving the optimization problem, the minimum memory usage occurs at n = nbatches = SQRT(T / B) and w = workmem = SQRT(B * T). (Here I am using "workmem" for the hash table's "space_allowed".)

The total working memory used, at the minimum, is always 2 * w: twice the optimal "workmem" ("space_allowed").

This says that the maximum input size that can be (optimally) hashed with the default 8 MB workmem (= work_mem * hash_mem_multiplier) is 4 GB, and the total working memory used would actually be 16 MB. To hash 64 GB, or 16x as much, requires a 32 MB workmem, with 64 MB of total working memory used.

So "workmem" grows with the SQRT of T, the total hash memory needed, and total working memory is 2x "workmem".
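To make the algebra concrete, here is a minimal standalone sketch (my own illustration, not PostgreSQL code; the function name and the simplified cost model are mine) that plugs the two data points above into the formulas:

#include <math.h>
#include <stdio.h>

/*
 * Simplified model of hash-join working memory, ignoring the
 * round-to-power-of-two logic: with n batches the hash table needs
 * roughly T / n bytes and the per-batch metadata needs n * B bytes.
 * The sum T / n + n * B is minimized at n = sqrt(T / B), where the two
 * terms are equal and the total is 2 * sqrt(B * T).
 */
static void
optimal_split(double T, double B)
{
	double		n = sqrt(T / B);	/* optimal nbatches */
	double		w = sqrt(B * T);	/* optimal workmem (hash table size) */

	printf("T = %6.0f MB: nbatches = %4.0f, workmem = %2.0f MB, total = %3.0f MB\n",
		   T / (1024 * 1024), n, w / (1024 * 1024), 2 * w / (1024 * 1024));
}

int
main(void)
{
	double		B = 2.0 * 8192;	/* per-batch metadata: 2 * BLCKSZ = 16 KB */

	optimal_split(4.0 * 1024 * 1024 * 1024, B);		/* -> 512 batches, 8 MB, 16 MB total */
	optimal_split(64.0 * 1024 * 1024 * 1024, B);	/* -> 2048 batches, 32 MB, 64 MB total */
	return 0;
}

Running it reproduces the 8 MB / 16 MB and 32 MB / 64 MB figures above.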
Second -- the algebraic solution illustrates the difficulty in tracking and restricting working memory usage for Hash Joins! Your patch improves the "hash 64 GB" situation, because it eliminates 96 MB of per-batch metadata, by reducing n = nbatches from 8192 to 2048, at a cost of only 24 MB of workmem. Using the default 8 MB workmem, *actual* total working memory used would be 8 MB + 16 KB * (64 GB / 8 MB) = 136 MB. By increasing workmem to 32 MB, total working memory is only 64 MB; so we save 72 MB overall. This is a good thing, but--

The "but" is that the customer really should have set their workmem to 64 MB in the first place; and we should have taken half of that for the hash table, and left the other half for per-batch metadata.

-- OK, but historically we have pretended that the per-batch metadata used no memory. So the customer should have set their workmem to 32 MB, with the understanding that PostgreSQL would have actually used 64 MB...

-- OK, but the customer *didn't* set their workmem to 32 MB. (If they had, we wouldn't need this patch -- but we *do* need this patch, which means the customer hasn't set their workmem high enough.) Why not? Well, because if they set it to 32 MB, they'd run OOM!

-- So we are (secretly!) increasing the customer's workmem to 32 MB, but only for this particular Hash Join. The customer can't increase it to 32 MB for all Hash Joins, or they'd run OOM. So we increase it just for this Hash Join, in the hope that by doing so we'll avoid running OOM... which is good; but we don't *tell* the customer we've done this, and we just hope that the customer actually has 64 MB (= 2x workmem) free (because, if they don't, they'll run OOM anyway).

All of this is to say that this patch illustrates the need for something like proposal [1], which allows PostgreSQL to set workmem limits on individual execution nodes, based on the optimizer's memory estimates. In the above patch, we're blindly making things better, without knowing whether we've made them good enough. (The customer is less likely to run OOM using 64 MB instead of 136 MB, but OOM is still possible since their workmem limit is 8 MB!)

In v.next of my patchset at [1] (should be done by end of day today), I will deal with the case discussed above by:

1. Doubling Plan.workmem_limit whenever we halve nbatches (so we track the "workmem" needed by the hash table);

2. Displaying Plan.workmem_limit + Hash.nbatches * (2 * BLCKSZ), inside EXPLAIN (work_mem on) (so we display to the customer our best estimate of the effective workmem limit).

(A rough sketch of both follows.)
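To illustrate (1) and (2): the following is a hypothetical sketch only. It assumes a per-node workmem_limit field as described in proposal [1]; the field and function names are my own illustration, not current PostgreSQL code.

#include <stdio.h>

/*
 * Hypothetical sketch: "workmem_limit" comes from proposal [1] and is
 * not a field of today's Plan struct; all names are illustrative.
 */

/*
 * (1) When the hash join halves nbatches, double the node's workmem
 *     limit, so the limit tracks what the in-memory hash table now needs.
 */
static void
hash_halve_nbatches(int *nbatches, int *workmem_limit_kb)
{
	*nbatches /= 2;
	*workmem_limit_kb *= 2;
}

/*
 * (2) What EXPLAIN (work_mem on) would report as the effective limit:
 *     the hash table's limit plus 2 * BLCKSZ of metadata per batch.
 */
static long
hash_effective_workmem_kb(int workmem_limit_kb, int nbatches, int blcksz)
{
	return workmem_limit_kb + (long) nbatches * (2L * blcksz) / 1024;
}

int
main(void)
{
	int			nbatches = 8192;
	int			workmem_limit_kb = 8 * 1024;	/* 8 MB default */

	/* halve nbatches twice: 8192 -> 2048, workmem limit 8 MB -> 32 MB */
	hash_halve_nbatches(&nbatches, &workmem_limit_kb);
	hash_halve_nbatches(&nbatches, &workmem_limit_kb);

	/* prints 65536 kB, i.e. the 64 MB of total working memory above */
	printf("effective workmem limit: %ld kB\n",
		   hash_effective_workmem_kb(workmem_limit_kb, nbatches, 8192));
	return 0;
}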
Thanks,
James

[1] https://www.postgresql.org/message-id/flat/CAJVSvF6s1LgXF6KB2Cz68sHzk%2Bv%2BO_vmwEkaon%3DH8O9VcOr-tQ%40mail.gmail.com