Re: Memory leak from ExecutorState context? - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Memory leak from ExecutorState context?
Date
Msg-id dbae24d7-0dda-18aa-5e08-8138ac1caef9@enterprisedb.com
In response to Re: Memory leak from ExecutorState context?  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
Responses Re: Memory leak from ExecutorState context?  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
Re: Memory leak from ExecutorState context?  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
List pgsql-hackers

On 3/2/23 19:15, Jehan-Guillaume de Rorthais wrote:
> Hi!
> 
> On Thu, 2 Mar 2023 13:44:52 +0100
> Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>> Well, yeah and no.
>>
>> In principle we could/should have allocated the BufFiles in a different
>> context (possibly hashCxt). But in practice it probably won't make any
>> difference, because the query will probably run all the hashjoins at the
>> same time. Imagine a sequence of joins - we build all the hashes, and
>> then tuples from the outer side bubble up through the plans. And then
>> you process the last tuple and release all the hashes.
>>
>> This would not fix the issue. It'd be helpful for accounting purposes
>> (we'd know it's the buffiles and perhaps for which hashjoin node). But
>> we'd still have to allocate the memory etc. (so still OOM).
> 
> Well, accounting things in the correct context would already be worth a
> patch, I suppose. At least, it helps to investigate faster. Plus, you already
> wrote a patch about that[1]:
> 
> https://www.postgresql.org/message-id/20190421114618.z3mpgmimc3rmubi4%40development
> 
> Note that I did reference the "Out of Memory errors are frustrating as heck!"
> thread in my first email, pointing to an email from Tom Lane explaining that
> ExecutorState was not supposed to be so large[2].
> 
> [2] https://www.postgresql.org/message-id/flat/12064.1555298699%40sss.pgh.pa.us#eb519865575bbc549007878a5fb7219b
> 

Ah, right, I didn't realize it's the same thread. There are far too many
threads about this sort of thing, and I probably submitted half-baked
patches to most of them :-/

>> There's only one thing I think could help - increase the work_mem enough
>> not to trigger the explosive growth in number of batches. Imagine
>> there's one very common value, accounting for ~65MB of tuples. With
>> work_mem=64MB this leads to exactly the explosive growth you're
>> observing here. With 128MB it'd probably run just fine.
>>
>> The problem is we don't know how large the work_mem would need to be :-(
>> So you'll have to try and experiment a bit.
>>
>> I remembered there was a thread [1] about *exactly* this issue in 2019.
>>
>> [1]
>> https://www.postgresql.org/message-id/flat/bc138e9f-c89e-9147-5395-61d51a757b3b%40gusw.net
>>
>> I even posted a couple patches that try to address this by accounting
>> for the BufFile memory, and increasing work_mem a bit instead of just
>> blindly increasing the number of batches (ignoring the fact that means
>> more memory will be used for the BufFile stuff).
>>
>> I don't recall why it went nowhere, TBH. But I recall there were
>> discussions about maybe doing something like "block nestloop" at the
>> time, or something. Or maybe the thread just went cold.
> 
> So I read the full thread now. I'm still not sure why we try so hard to avoid
> hash collisions, and why a similar data subset barely larger than work_mem
> makes the number of batches explode, but I think I have a better understanding
> of the discussion and the proposed solutions.
> 

I don't think this is about hash collisions (i.e. the same hash value
being computed for different values). You can construct cases like this,
of course, particularly if you only look at a subset of the bits (for 1M
batches we only look at the first 20 bits), but I'd say it's fairly
unlikely to happen unless you do that intentionally.

(I'm assuming regular data types with reasonable hash functions. If the
query joins on custom data types with some silly hash function, it may
be more likely to have conflicts.)

IMHO a much more likely explanation is that there actually is a very common
value in the data. For example there might be a value repeated 1M times,
and that'd be enough to break this.
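
To illustrate, here's a simplified standalone sketch of how the batch is
picked from the hash value (not the actual nodeHash.c code - the real
ExecHashGetBucketAndBatch mixes the bits a bit differently depending on
version, but the principle is the same):

#include <stdint.h>
#include <stdio.h>

/*
 * Simplified sketch: the bucket comes from the low bits of the hash, the
 * batch from the bits above them. nbuckets and nbatch are powers of two.
 */
static void
get_bucket_and_batch(uint32_t hashvalue, uint32_t nbuckets, uint32_t nbatch,
                     uint32_t *bucketno, uint32_t *batchno)
{
    uint32_t log2_nbuckets = 0;

    while (((uint32_t) 1 << log2_nbuckets) < nbuckets)
        log2_nbuckets++;

    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);
}

int
main(void)
{
    uint32_t bucket;
    uint32_t batch;
    uint32_t nbatch;

    /*
     * All duplicates of a common value have the same hash, hence the same
     * batch - doubling nbatch just moves the whole group into one of the
     * two new batches, it never splits it.
     */
    for (nbatch = 1024; nbatch <= 1048576; nbatch *= 2)
    {
        get_bucket_and_batch(0xDEADBEEF, 1024, nbatch, &bucket, &batch);
        printf("nbatch = %7u -> batch %u\n", (unsigned) nbatch, (unsigned) batch);
    }
    return 0;
}

So no matter how many times we double the number of batches, the overflowing
group of duplicates always ends up in a single batch.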

We do build special "skew" buckets for values from the MCV list, but maybe
the stats are not updated yet, or maybe there are too many such values to
fit into the MCV list?

I now realize there's probably another way to get into this - oversized
rows. Could there be a huge row (e.g. with a large text/bytea value)?
Imagine a row that's 65MB - that'd be game over with work_mem=64MB. Or
there might be smaller rows, but a couple hash collisions would suffice.
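
And just to put a number on why the batch explosion is what actually blows up
the memory: each batch needs a BufFile for the inner and the outer side, and
each BufFile keeps a block-sized in-memory buffer. A rough back-of-envelope
sketch (the 8kB buffer size and the factor of two are assumptions here, and
per-file bookkeeping overhead is ignored):

#include <stdio.h>

int
main(void)
{
    const double buffile_buffer = 8192.0;   /* one ~BLCKSZ buffer per BufFile */
    long    nbatch;

    for (nbatch = 1024; nbatch <= 1048576; nbatch *= 2)
    {
        /* inner + outer temp file per batch, one in-memory buffer each */
        double  mem = 2.0 * nbatch * buffile_buffer;

        printf("nbatch = %7ld -> ~%.0f MB just for BufFile buffers\n",
               nbatch, mem / (1024 * 1024));
    }
    return 0;
}

At 1M batches that's on the order of 16GB of buffers, completely dwarfing any
plausible work_mem setting.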

> There were some thoughts about how to make better use of the memory. As
> memory explodes way beyond work_mem, at least avoid wasting it on too many
> BufFile buffers. So you expand either work_mem or the number of batches,
> depending on which move is smarter. This is explained and tested here:
> 
> https://www.postgresql.org/message-id/20190421161434.4hedytsadpbnglgk%40development
> https://www.postgresql.org/message-id/20190422030927.3huxq7gghms4kmf4%40development
> 
> And then there was another patch that overflows each batch to a dedicated
> temp file, to stay inside work_mem (v4-per-slice-overflow-file.patch):
> 
> https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
> 
> Then nothing more happened in the discussion of this last patch, so I guess
> it just went cold.
> 

I think a contributing factor was that the OP did not respond for a
couple months, so the thread went cold.

> For what it's worth, these two patches seem really interesting to me. Do you
> need any help to revive them?
> 

I think another reason why that thread went nowhere was that we've been
exploring a different (and likely better) approach to fix this, namely falling
back to a nested loop for the "problematic" batches.

As proposed in this thread:

 https://www.postgresql.org/message-id/20190421161434.4hedytsadpbnglgk%40development
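
To make it concrete, the idea of that fallback is roughly this (just a toy
sketch with made-up sizes and names; the real proposal would presumably build
a hash table for each inner chunk rather than comparing tuples linearly):

#include <stdio.h>

#define INNER_NTUPLES   10
#define OUTER_NTUPLES   6
#define CHUNK_NTUPLES   4       /* stand-in for "what fits in work_mem" */

/*
 * Block nested loop over one problematic batch: load as many inner tuples
 * as fit into memory, scan the whole outer batch against that chunk, then
 * load the next chunk and rescan the outer side. Memory stays bounded no
 * matter how skewed the keys are, at the cost of rescanning the outer batch.
 */
int
main(void)
{
    /* heavily skewed inner batch - almost every tuple has the same key */
    int     inner[INNER_NTUPLES] = {42, 42, 42, 42, 42, 42, 42, 42, 7, 42};
    int     outer[OUTER_NTUPLES] = {42, 7, 42, 13, 42, 42};
    int     start;

    for (start = 0; start < INNER_NTUPLES; start += CHUNK_NTUPLES)
    {
        int     end = start + CHUNK_NTUPLES;
        int     o;

        if (end > INNER_NTUPLES)
            end = INNER_NTUPLES;

        /* rescan the entire outer batch for this inner chunk */
        for (o = 0; o < OUTER_NTUPLES; o++)
        {
            int     i;

            for (i = start; i < end; i++)
            {
                if (outer[o] == inner[i])
                    printf("match: outer[%d]=%d inner[%d]=%d\n",
                           o, outer[o], i, inner[i]);
            }
        }
    }
    return 0;
}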

So I guess the best thing would be to go through these threads, see what
the status is, restart the discussion and propose what to do. If you do
that, I'm happy to rebase the patches, and maybe see if I could improve
them in some way.

I was hoping we'd solve this with the BNL, but if we didn't get that done in 4
years, maybe we shouldn't stall any longer and should get at least an imperfect
stop-gap solution in ...

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


