Re: Out of Memory errors are frustrating as heck! - Mailing list pgsql-performance
From:            Tomas Vondra
Subject:         Re: Out of Memory errors are frustrating as heck!
Msg-id:          20190420200134.ugwtkb2whc7oyalg@development
In response to:  Re: Out of Memory errors are frustrating as heck! (Gunther <raj@gusw.net>)
Responses:       Re: Out of Memory errors are frustrating as heck!
List:            pgsql-performance
On Fri, Apr 19, 2019 at 11:34:54PM -0400, Gunther wrote:
>
> ...
>
>It would be so nice if there was a way to force a specific plan for
>purposes of the testing. I tried giving false data in pg_class
>reltuples and relpages:
>
>foo=# analyze tmp_q;
>ANALYZE
>foo=# analyze tmp_r;
>ANALYZE
>foo=# select relname, relpages, reltuples from pg_class where relname in ('tmp_q', 'tmp_r');
> relname | relpages |  reltuples
>---------+----------+-------------
> tmp_r   |  5505039 | 1.13467e+08
> tmp_q   |        7 |         236
>(2 rows)
>
>foo=# update pg_class set (relpages, reltuples) = (5505039, 1.13467e+08) where relname = 'tmp_q';
>UPDATE 1
>foo=# update pg_class set (relpages, reltuples) = (7, 236) where relname = 'tmp_r';
>UPDATE 1
>
>but that didn't help. Somehow the planner outsmarts every such trick,
>so I can't get it to follow my right outer join plan where the big
>table is hashed. I am sure y'all know some way to force it.
>

That does not work, because the planner does not actually use these values
directly - it only computes the tuple density from them, and then multiplies
that by the current number of pages in the file. That behaves much more
nicely when the table grows or shrinks between refreshes of the pg_class
values.

So what you need to do is tweak these values to skew the density in a way
that produces the desired estimate when multiplied by the actual number of
pages. For me, this did the trick:

  update pg_class set (relpages, reltuples) = (1000000, 1) where relname = 'tmp_r';
  update pg_class set (relpages, reltuples) = (1, 1000000) where relname = 'tmp_q';

after which I get a plan like this:

  Hash Right Join
    Hash Cond: (...)
    ->  Seq Scan on tmp_q q
    ->  Hash
          ->  Seq Scan on tmp_r r

As for the issue itself, I have a theory that I think explains it. It is
related to the number of batches, as others here have speculated. It's not
a memory leak, though - it's just that each batch requires a lot of extra
memory and we don't account for that.

The trouble is that each batch is represented by a BufFile, which is a
whopping 8272 bytes, because it includes a PGAlignedBlock. With 131072
batches, that's a nice 1GB of memory right there. And we don't actually
account for this memory in the hashjoin code, so it's not counted against
work_mem and we just keep increasing the number of batches.

Attached are two patches that should help us confirm that this is actually
what's happening when running the query on the real data.

The first patch moves the BufFile stuff into a separate memory context, to
make it more obvious where the memory went. It also adds a bunch of logging
to the ExecHashIncreaseNumBatches() function.

The second patch makes sure all the BufFiles are allocated right when
increasing the number of batches - otherwise we allocate them only when we
actually find a row for that batch, and I suspect the sample data shared on
this thread are somewhat correlated (I see long runs of the same UUID
value). That might slow down the memory growth. Of course, the real data
probably don't have such correlation, resulting in faster failures.

With the patch, I see stuff like this with 256k batches:

  ExecutorState: 65536 total in 4 blocks; 28136 free (4 chunks); 37400 used
    HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used
      hash batch files: 4404002656 total in 524302 blocks; 8387928 free (20 chunks); 4395614728 used

so it's conceivable it's the root cause.
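Just to put the BufFile numbers in perspective, here's a trivial
back-of-the-envelope calculation - a standalone toy program, not PostgreSQL
code, using the 8272 bytes per BufFile mentioned above:

  #include <stdio.h>

  int main(void)
  {
      const double buffile_size = 8272;   /* approx. sizeof(BufFile), incl. PGAlignedBlock */
      long nbatch;

      /* memory needed just for the per-batch BufFile structs */
      for (nbatch = 32768; nbatch <= 262144; nbatch *= 2)
          printf("%7ld batches -> %5.2f GB in BufFile structs\n",
                 nbatch, nbatch * buffile_size / (1024 * 1024 * 1024));

      return 0;
  }

That prints about 1GB for 128k batches and about 2GB for 256k batches. The
"hash batch files" context in the dump above is roughly double the latter,
presumably because both the inner and the outer side of the join get a
BufFile per batch, plus allocation overhead.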
As for a fix, I'm not sure. I do think we need to consider the amount of
memory used by the BufFile(s) when increasing the number of batches. But we
can't just stop increasing the number of batches, because that would mean
the current batch may easily get bigger than work_mem :-(

I think we might cap the number of batches kept in memory, and at some
point start spilling data into an "overflow batch". So for example we'd
allow 32k batches, and then instead of increasing nbatch to 64k, we'd
create a single "overflow batch" representing batches 32k - 64k. After
processing the first 32k batches, we'd close those files and reuse the
memory for the next 32k batches: we'd read the overflow batch, split it
into the 32k batches, and just process them as usual. Of course, there
might be multiple rounds of this - for example we might end up with 32k
concurrent batches but 128k virtual ones, which means we'd do 4 rounds of
this dance. It's a bit inefficient, but situations like this should be
rather rare, and it's more graceful than just crashing with OOM.
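To make the "rounds" idea a bit more concrete, here's a toy program with
made-up numbers - nothing like this exists in the hashjoin code, it just
illustrates the bookkeeping:

  #include <stdio.h>

  int main(void)
  {
      int nbatch_concurrent = 32768;   /* batches with real BufFiles */
      int nbatch_virtual = 131072;     /* total number of (virtual) batches */
      int nrounds = (nbatch_virtual + nbatch_concurrent - 1) / nbatch_concurrent;
      int round;

      for (round = 0; round < nrounds; round++)
      {
          int first = round * nbatch_concurrent;
          int last = first + nbatch_concurrent - 1;

          if (round == 0)
              printf("round 0: process batches %d..%d, rows for later batches go to the overflow file\n",
                     first, last);
          else
              printf("round %d: split the overflow file into batches %d..%d, process them, reuse the files\n",
                     round, first, last);
      }

      printf("total: %d rounds, with at most %d batches' worth of files at a time\n",
             nrounds, nbatch_concurrent);
      return 0;
  }

so the BufFile memory stays capped no matter how far the number of virtual
batches grows.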
regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services