accounting for memory used for BufFile during hash joins - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | accounting for memory used for BufFile during hash joins |
Date | |
Msg-id | 20190504003414.bulcbnge3rhwhcsh@development Whole thread Raw |
Responses |
Re: accounting for memory used for BufFile during hash joins
Re: accounting for memory used for BufFile during hash joins |
List | pgsql-hackers |
Hi, I'm starting this thread mostly to keep track of patches developed in response to issue [1] reported on pgsql-performance. The symptoms are very simple - query performing a hash join ends up using much more memory than expected (pretty much ignoring work_mem), and possibly ending up with OOM. The root cause is that hash join treats batches as pretty much free, but that's not really true - we do allocate two BufFile structs per batch, and each BufFile is ~8kB as it includes PGAlignedBuffer. This is not ideal even if we happen to estimate everything correctly, because for example with work_mem=4MB and nbatch=1024, it means we'll use about 16MB (2*8kB*1024) for the BufFile structures alone, plus the work_mem for hash table itself. But it can easily explode when we under-estimate the hash side. In the pgsql-performance message, the hash side (with the patches applied, allowing the query to complete) it looks like this: Hash (cost=2823846.37..2823846.37 rows=34619 width=930) (actual time=252946.367..252946.367 rows=113478127 loops=1) So it's 3277x under-estimated. It starts with 16 batches, and ends up adding more and more batches until it fails with 524288 of them (it gets to that many batches because some of the values are very common and we don't disable the growth earlier). The OOM is not very surprising, because with 524288 batches it'd need about 8GB of memory, and the system only has 8GB RAM installed. The two attached patches both account for the BufFile memory, but then use very different strategies when the work_mem limit is reached. The first patch realizes it's impossible to keep adding batches without breaking the work_mem limit, because at some point the BufFile will need more memory than that. But it does not make sense to stop adding batches entirely, because then the hash table could grow indefinitely. So the patch abandons the idea of enforcing work_mem in this situation, and instead attempts to minimize memory usage over time - it increases the spaceAllowed in a way that ensures doubling the number of batches actually reduces memory usage in the long run. The second patch tries to enforce work_mem more strictly. That would be impossible if we were to keep all the BufFile structs in memory, so instead it slices the batches into chunks that fit into work_mem, and then uses a single "overflow" file for slices currently not in memory. These extra slices can't be counted into work_mem, but we should need just very few of them. For example with work_mem=4MB the slice is 128 batches, so we need 128x less overflow files (compared to per-batch). Neither of those patches tweaks ExecChooseHashTableSize() to consider memory needed for BufFiles while deciding how many batches will be needed. That's something that probably needs to happen, but it would not help with the underestimate issue. I'm not entirely sure which of those approaches is the right one. The first one is clearly just a "damage control" for cases where the hash side turned out to be much larger than we expected. With good estimates we probably would not have picked a hash join for those (that is, we should have realized we can't keep work_mem and prohibit hash join). The second patch however makes hash join viable for some of those cases, and it seems to work pretty well (there are some numbers in the message posted to pgsql-performance thread). So I kinda like this second one. It's all just PoC quality, at this point, far from committable state. [1] https://www.postgresql.org/message-id/flat/bc138e9f-c89e-9147-5395-61d51a757b3b%40gusw.net -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
pgsql-hackers by date: