
From: Robert Haas
Subject: Re: DBT-3 with SF=20 got failed
Date:
Msg-id: CA+TgmoZPVzBrbvqf5rxBWGmaGUdGuOO836ccFrjv_gp8yhgq6A@mail.gmail.com
In response to: Re: DBT-3 with SF=20 got failed (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: DBT-3 with SF=20 got failed (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers
On Tue, Sep 8, 2015 at 5:02 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Also, I'm not sure what other places do you have in mind (could you list
> some examples?) but I'd bet we limit the allocation to 1GB because of the
> palloc() limit and not because of fear of over-estimates.

I don't really think those two things are different from each other.
The palloc() limit is a means of enforcing a general policy of
limiting all allocations to 1GB except in places where we've made a
very conscious decision to allow a specific exception.  This limit
happens to dovetail nicely with the varlena size limit, so in many
cases it is exactly the right limit for that reason alone.  But even
when, as here, that's not at issue, it's still a useful limit, because
there are many ways that some garbage value can get passed to palloc
-- bad planner estimates, corrupted tuples, bugs in other parts of our
code.  And at least on my old MacBook Pro (I haven't tested the
current one), passing a sufficiently large value to malloc() causes a
kernel panic.  That's probably a particularly bad bug, but there are
lots of systems where "accidentally" allocating an unreasonable amount
of space will have all kinds of unpleasant consequences.  So, I
believe that palloc()'s limit improves the overall stability of the
system considerably, even if it causes some occasional annoyance.
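
To illustrate, here's a minimal standalone sketch of the kind of
guard I mean (simplified, with names of my own; the real check lives
in the memory-context code):

    #include <stdio.h>
    #include <stdlib.h>

    /* Cap modeled on the 1GB palloc() limit (just under 1GB,
     * dovetailing with the varlena size limit). */
    #define MAX_ALLOC_SIZE ((size_t) 0x3fffffff)

    /* Simplified stand-in for palloc(): reject any request over the
     * cap, so a garbage size -- bad estimate, corrupted tuple, stray
     * bug -- fails cleanly here instead of reaching malloc() with an
     * absurd value. */
    static void *
    guarded_alloc(size_t size)
    {
        if (size > MAX_ALLOC_SIZE)
        {
            fprintf(stderr, "invalid memory alloc request size %zu\n",
                    size);
            abort();
        }
        return malloc(size);
    }

Any caller that genuinely needs more than that has to say so
explicitly, via the _huge variants; that's the "very conscious
decision" I'm talking about.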

Most of the time, you can just palloc() and not worry too much about
whether you're going to blow up the machine: you won't, because you
aren't going to allocate more than 1GB.  Any place that wants to
allocate more than that needs to be someplace where we can be pretty
sure that we're not going to accidentally allocate some completely
unreasonable amount of memory, like, say, 1TB.  Nothing in this
discussion convinces me that this is such a place.  Note that
tuplesort.c and tuplestore.c, the only existing callers of
repalloc_huge, only allocate such large amounts of memory when they
actually have enough tuples to justify it - it is always based on the
actual number of tuples, never an estimate.  I think that would be a
sound principle here, too.  Resizing the hash table to such a large
size based on the actual load factor is very reasonable; starting with
such a large size seems less so.  Admittedly, 512MB is an arbitrary
point; if it so happened that the limit were 256MB or 1GB or 128MB
or even 2GB, I wouldn't advocate changing it just for fun.  But
you're saying we should just remove that limit altogether, and I think
that's clearly unreasonable.  Do you really want to start out with a
TB- or even PB-sized hash table when the actual number of tuples is,
say, one?  That may sound crazy, but I've seen enough bad query plans
to know that, yes, we are sometimes off by nine orders of magnitude.
This is not a hypothetical problem.
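
To make the principle concrete, here is a hypothetical sketch (names
are mine, nothing from an actual patch): size the bucket array from
the tuples actually inserted, so a large request is only ever driven
by real data, never by an estimate:

    #include <stdlib.h>

    /* Hypothetical, simplified hash table, for illustration only. */
    typedef struct
    {
        void  **buckets;
        size_t  nbuckets;   /* current array size, a power of two */
        size_t  ntuples;    /* tuples actually inserted, never an
                             * estimate */
    } HashTable;

    /* Double the bucket array only when the observed load factor
     * demands it, the same way tuplesort.c and tuplestore.c only go
     * "huge" for data they actually hold. */
    static void
    maybe_grow(HashTable *ht)
    {
        if (ht->ntuples <= ht->nbuckets)
            return;             /* load factor <= 1, nothing to do */

        size_t  newsize = ht->nbuckets * 2;
        void  **newbuckets = calloc(newsize, sizeof(void *));

        if (newbuckets == NULL)
            abort();
        /* a real implementation would rehash the entries here */
        free(ht->buckets);
        ht->buckets = newbuckets;
        ht->nbuckets = newsize;
    }

Starting small and doubling like this means even a misestimate of
nine orders of magnitude costs a few dozen amortized resizes, not a
TB of address space up front.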

>> More importantly, removing the cap on the allocation size makes the
>> problem a lot worse.  You might be sad if a bad planner estimate
>> causes the planner to allocate 1GB when 64MB would have been enough,
>> but on modern systems it is not likely to be an enormous problem.  If
>> a similar mis-estimation causes the planner to allocate 16GB rather
>> than 1GB, the opportunity for you to be sad is magnified pretty
>> considerably.  Therefore, I don't really see the over-estimation bug
>> fix as being separate from this one.
>
> Perhaps. But if you want to absolutely prevent such sadness then maybe you
> should not set work_mem that high?

I think that's a red herring for a number of reasons.  One, the
allocation for the hash buckets is only a small portion of the total
memory.  Two, the fact that you are OK with the hash table growing to
a certain size does not mean that you want it to start out that big on
the strength of a frequently-flawed planner estimate.
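
To put rough, purely illustrative numbers on the first point: with 30
million tuples and one bucket pointer per tuple, the bucket array is
about 30M * 8 bytes = 240MB, while the tuples themselves at, say, 100
bytes apiece occupy around 3GB; the buckets are less than a tenth of
the hash table's footprint.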

> Anyway, I do see this as a rather orthogonal problem - an independent
> improvement, mostly unrelated to the bugfix. Even if we decide to redesign
> it like this (and I'm not particularly opposed to that, assuming someone
> takes the time to measure how expensive the additional resize actually is),
> we'll still have to fix the repalloc().
>
> So I still fail to see why we shouldn't apply this fix.

In all seriousness, that is fine.  I respect your opinion; I'm just
telling you mine, which happens to be different.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


