Re: Out of Memory errors are frustrating as heck! - Mailing list pgsql-performance

From Tomas Vondra
Subject Re: Out of Memory errors are frustrating as heck!
Date
Msg-id 20190420200134.ugwtkb2whc7oyalg@development
In response to Re: Out of Memory errors are frustrating as heck!  (Gunther <raj@gusw.net>)
Responses Re: Out of Memory errors are frustrating as heck!
List pgsql-performance
On Fri, Apr 19, 2019 at 11:34:54PM -0400, Gunther wrote:
>
> ...
>
>It would be so nice if there was a way to force a specific plan for 
>purposes of the testing.  I tried giving false data in pg_class 
>reltuples and relpages:
>
>foo=# analyze tmp_q;
>ANALYZE
>foo=# analyze tmp_r;
>ANALYZE
>foo=# select relname, relpages, reltuples from pg_class where relname in ('tmp_q', 'tmp_r');
> relname | relpages |  reltuples
>---------+----------+-------------
> tmp_r   |  5505039 | 1.13467e+08
> tmp_q   |        7 |         236
>(2 rows)
>
>foo=# update pg_class set (relpages, reltuples) = (5505039, 1.13467e+08) where relname = 'tmp_q';
>UPDATE 1
>foo=# update pg_class set (relpages, reltuples) = (7, 236) where relname = 'tmp_r';
>UPDATE 1
>
>but that didn't help. Somehow the planner outsmarts every such trick, 
>so I can't get it to follow my right outer join plan where the big 
>table is hashed.  I am sure y'all know some way to force it.
>

That does not work, because the planner does not actually use these values
directly - it only computes the tuple density (reltuples / relpages) from
them, and then multiplies that density by the current number of pages in the
file. That behaves much more nicely when the table grows or shrinks between
refreshes of the pg_class values.

So what you need to do is tweak these values to skew the density in a way
that results in the desired estimate when multiplied by the actual number
of pages. For me, this did the trick:

  update pg_class set (relpages, reltuples) = (1000000, 1)
  where relname = 'tmp_r';

  update pg_class set (relpages, reltuples) = (1, 1000000)
  where relname = 'tmp_q';

after which I get a plan like this:

   Hash Right Join
     Hash Cond: (...)
     ->  Seq Scan on tmp_q q
     ->  Hash
           ->  Seq Scan on tmp_r r
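
That works because the planner now computes the density of tmp_r as
1/1000000 tuples per page, so with the ~5.5M pages actually in the file it
estimates something like 5505039 * (1/1000000) ~ 6 rows, while tmp_q gets
inflated to about 7 * (1000000/1) = 7M rows - which is what makes the
planner put the big table on the hash side.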

As for the issue itself, I have a theory that I think explains it. It is
related to the number of batches, as others have speculated in this thread.
It's not a memory leak, though - it's just that each batch requires a lot
of extra memory and we don't account for that.

The trouble is that each batch is represented by a BufFile, which is a
whopping 8272 bytes, because it includes a PGAlignedBlock. Now, with 131072
batches, that's a nice 1GB of memory right there. And we don't actually
account for this memory in the hashjoin code, so it's not counted against
work_mem and we just keep increasing the number of batches.
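
Just to spell out the arithmetic, here is a trivial standalone C snippet
(the 8272 bytes is simply the sizeof(BufFile) quoted above, dominated by
the 8kB PGAlignedBlock buffer):

  #include <stdio.h>

  int main(void)
  {
      size_t buffile_size = 8272;   /* sizeof(BufFile) in this build */
      long   nbatch = 131072;       /* 128k batches */

      /* memory needed just for one array of per-batch BufFiles */
      printf("%.2f GB\n",
             (double) buffile_size * nbatch / (1024.0 * 1024 * 1024));
      return 0;
  }

which prints ~1.01 GB - and each batch may end up with both an inner and
an outer file, so in the worst case it's roughly double that.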

Attached are two patches that should help us confirm that's actually
what's happening when running the query on the actual data. The first patch
moves the BufFile stuff into a separate memory context, to make it more
obvious where the memory went. It also adds a bunch of logging to the
ExecHashIncreaseNumBatches() function.
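
Just to give an idea of the general shape of that change (this is only a
sketch, not the actual patch - the fileCxt field and the exact call sites
are assumptions on my part):

  /* e.g. in ExecHashTableCreate(): dedicated context for per-batch files */
  hashtable->fileCxt = AllocSetContextCreate(hashtable->hashCxt,
                                             "hash batch files",
                                             ALLOCSET_DEFAULT_SIZES);

  /* e.g. where ExecHashJoinSaveTuple() creates the file lazily: allocate
   * it in that context instead of whatever context is current */
  if (*fileptr == NULL)
  {
      MemoryContext oldcxt = MemoryContextSwitchTo(hashtable->fileCxt);

      *fileptr = BufFileCreateTemp(false);
      MemoryContextSwitchTo(oldcxt);
  }

  /* and in ExecHashIncreaseNumBatches(), log the growth */
  elog(LOG, "ExecHashIncreaseNumBatches: %d -> %d batches",
       oldnbatch, nbatch);

That way MemoryContextStats() shows the BufFile memory separately, as in
the dump below.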

The second patch makes sure all the BufFiles are allocated right when
increasing the number of batches - otherwise we allocate them only when we
actually find a row for that batch, and I suspect the sample data shared
on this thread are somewhat correlated (I see long runs of the same UUID
value). That might slow down the memory growth. Of course, the real data
probably don't have such correlation, resulting in faster failures.
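
Again, roughly speaking (a sketch of the idea, not the actual patch -
assuming it happens at the end of ExecHashIncreaseNumBatches(), after the
batch file arrays have been enlarged):

  /* create the temp file for every batch up front, instead of waiting for
   * the first tuple that happens to fall into it; the outer side would
   * presumably need the same treatment */
  for (int i = 1; i < nbatch; i++)
  {
      if (hashtable->innerBatchFile[i] == NULL)
          hashtable->innerBatchFile[i] = BufFileCreateTemp(false);
  }

With that, the BufFile memory appears immediately after the batch count
doubles, instead of trickling in as matching rows show up.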

With the patch, I see stuff like this with 256k batches:

  ExecutorState: 65536 total in 4 blocks; 28136 free (4 chunks); 37400 used
    HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used
      hash batch files: 4404002656 total in 524302 blocks; 8387928 free (20 chunks); 4395614728 used

so it's conceivable it's the root cause.

As for a fix, I'm not sure. I'm pretty sure we need to consider the amount
of memory for BufFile(s) when increasing the number of batches. But we
can't just stop incrementing the batches, because that would mean the
current batch may easily get bigger than work_mem :-(

I think we might cap the number of batches kept in memory, and at some
point start spilling data into an "overflow batch." So for example we'd
allow 32k batches, and then instead of increasing nbatch to 64k, we'd
create a single "overflow batch" representing batches 32k - 64k. After
processing the first 32k batches, we'd close the files and reuse the
memory for the next 32k batches. We'd read the overflow batch, split it
into the 32k batches, and just process them as usual. Of course, there
might be multiple rounds of this, for example we might end up with 32k
concurrent batches but 128k virtual ones, which means we'd do 4 rounds of
this dance.
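
In pseudocode the processing loop might look something like this (all of
these names are made up, purely to illustrate the idea):

  #define MAX_OPEN_BATCHES 32768

  for (window = 0; window < nbatch; window += MAX_OPEN_BATCHES)
  {
      /* split the overflow file from the previous round into (at most)
       * MAX_OPEN_BATCHES per-batch files for this window */
      split_overflow_into_batches(overflow_file, window, MAX_OPEN_BATCHES);

      for (b = window; b < Min(window + MAX_OPEN_BATCHES, nbatch); b++)
          process_batch(b);   /* build the hash table, probe, emit results */

      /* close the files and reuse the memory for the next window */
      release_window_files(window, MAX_OPEN_BATCHES);
  }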

It's a bit inefficient, but situations like this should be rather rare,
and it's more graceful than just crashing with OOM.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


