Re: [GENERAL] Bad planning data resulting in OOM killing of postgres - Mailing list pgsql-general

From Tom Lane
Subject Re: [GENERAL] Bad planning data resulting in OOM killing of postgres
Msg-id 2270.1487115399@sss.pgh.pa.us
In response to Re: [GENERAL] Bad planning data resulting in OOM killing of postgres  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
I wrote:
> David Hinkle <hinkle@cipafilter.com> writes:
>> Thanks guys, here's the information you requested:
>> psql:postgres@cipafilter = show work_mem;
>> work_mem
>> ──────────
>> 10MB
>> (1 row)

> [ squint... ]  It should absolutely not have tried to hash a 500M-row
> table if it thought work_mem was only 10MB.  I wonder if there's an
> integer-overflow problem or something like that.

Ah, nah, scratch that: I was thinking of the hash aggregation case,
where there's no run-time ability to spill to disk so the planner
will not risk using hash aggregation if it estimates the hash table
would exceed work_mem.  Hash joins do have the ability to restrict
memory consumption by increasing the number of batches, so the planner
doesn't worry about it in that case.

I think what must be happening is that there's some one value of
log_raw.titleid that occurs a ridiculous number of times, so that
the executor is unable to split up that particular hash bucket,
leading to OOM when it tries to load all those rows to process
the hash bucket.
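
If that's what's happening, the offending value should show up directly. A quick way to check (just a sketch, using the table and column names from earlier in this thread) is:

```sql
-- Look for a single titleid that accounts for an outsized share of the rows.
SELECT titleid, count(*) AS n
FROM log_raw
GROUP BY titleid
ORDER BY n DESC
LIMIT 10;
```

Fair warning: on a 500M-row table that's a full scan, so it may take a while; the planner's own statistics (pg_stats) are a cheaper first look.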

The planner does attempt to estimate the worst-case bucket size,
but in what now seems like brain fade, it doesn't do more with that
information than charge an appropriate number of tuple comparisons.
That would somewhat discourage use of a hash join, but a merge join
on this many tuples would be pretty expensive too, so it's not overly
surprising that it went with hashing anyway.

I am thinking we ought to fix it so it rejects (or at least heavily
penalizes) a hash join if it estimates that the rows containing
the inner side's most common value wouldn't all fit in work_mem.

What's not completely certain however is whether this diagnosis is
accurate for your case, or whether the proposed remedy would fix it.
It would be useful to see the contents of pg_stats.most_common_freqs
for log_raw.titleid.
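
Something along these lines should pull that out (a sketch; adjust for schema qualification if needed):

```sql
-- Per-column statistics gathered by ANALYZE, including the
-- most-common-values list and their estimated frequencies.
SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'log_raw'
  AND attname = 'titleid';
```

Multiplying the first entry of most_common_freqs by the table's estimated row count gives roughly the number of rows the executor would have to hold in memory for that one hash bucket.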

            regards, tom lane

