From: Tomas Vondra
Subject: Re: DBT-3 with SF=20 got failed
Msg-id: 55F0561B.3000503@2ndquadrant.com
In response to: Re: DBT-3 with SF=20 got failed (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers

On 09/09/2015 03:55 PM, Robert Haas wrote:
> On Tue, Sep 8, 2015 at 5:02 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Also, I'm not sure what other places do you have in mind (could you list
>> some examples?) but I'd bet we limit the allocation to 1GB because of the
>> palloc() limit and not because of fear of over-estimates.
>
> I don't really think those two things are different from each other.
> The palloc() limit is a means of enforcing a general policy of
> limiting all allocations to 1GB except in places where we've made a
> very conscious decision to allow a specific exception.  This limit
> happens to dovetail nicely with the varlena size limit, so in many
> cases it is the exactly correct limit just for that reason.  But even
> when, as here, that's not at issue, it's still a useful limit, because
> there are many ways that some garbage value can get passed to palloc
> -- bad planner estimates, corrupted tuples, bugs in other parts of our
> code.  And at least on my old MacBook Pro (I haven't tested the
> current one), passing a sufficiently-large value to malloc() causes a
> kernel panic.  That's probably a particularly bad bug, but there are
> lots of systems where "accidentally" allocating an unreasonable amount
> of space will have all kinds of unpleasant consequences.  So, I
> believe that  palloc()'s limit improves the overall stability of the
> system considerably even if it causes some occasional annoyance.

I'm not really buying this. The 1GB limit has nothing to do with platform 
limits; it's there exactly to make the allocations varlena-like (varlena 
has exactly the same limit), and because it allows using a 32-bit int to 
track the bytes. Neither of these is relevant here.
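
Just for reference, this is roughly what the two caps look like -- the 
constants below mirror MaxAllocSize and MaxAllocHugeSize from memutils.h, 
but this is only a standalone illustration, not the actual allocator code:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* mirrors MaxAllocSize: 1GB - 1, i.e. the varlena limit */
#define MAX_ALLOC_SIZE       ((size_t) 0x3fffffff)
/* mirrors MaxAllocHugeSize: roughly half the addressable range */
#define MAX_ALLOC_HUGE_SIZE  (SIZE_MAX / 2)

int
main(void)
{
    size_t request = (size_t) 1 << 30;      /* exactly 1GB */

    /* palloc()/repalloc() reject anything above the 1GB - 1 cap ... */
    printf("plain palloc: %s for %zu bytes\n",
           request <= MAX_ALLOC_SIZE ? "ok" : "invalid request", request);

    /* ... while the huge variants (e.g. repalloc_huge) only enforce the
     * much larger cap. */
    printf("huge variant: %s for %zu bytes\n",
           request <= MAX_ALLOC_HUGE_SIZE ? "ok" : "invalid request", request);
    return 0;
}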

It has nothing to do with malloc() limits on various platforms, and if 
there really are such limits that we think we should worry about, we 
should probably address those properly, not by band-aiding all the 
various places independently.

And most importantly, these platform limits would apply to both the 
initial allocation and to the subsequent resize. It's utterly useless to 
just "fix" the initial allocation and then allow failure when we try to 
resize the hash table.
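
To illustrate with a toy example (made-up numbers, assuming 8B bucket 
pointers and the 1GB-1 cap): an initial bucket array that fits comfortably 
under the limit trips that very same limit on the next doubling.

#include <stdio.h>
#include <stddef.h>

#define MAX_ALLOC_SIZE  ((size_t) 0x3fffffff)  /* the 1GB - 1 palloc() cap */

int
main(void)
{
    /* Say the initial estimate produced 2^26 buckets: the bucket array is
     * 512MB, so the initial allocation is fine.  One doubling later the
     * resize asks for exactly 1GB and hits the very same cap. */
    size_t nbuckets = (size_t) 1 << 26;
    int    i;

    for (i = 0; i < 3; i++, nbuckets *= 2)
    {
        size_t bytes = nbuckets * sizeof(void *);  /* one pointer per bucket */

        printf("nbuckets = %zu -> %zu bytes: %s\n",
               nbuckets, bytes,
               bytes <= MAX_ALLOC_SIZE ? "fits under the cap"
                                       : "alloc request would fail");
    }
    return 0;
}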

> Most of the time, you can just palloc() and not worry too much about
> whether you're going to blow up the machine: you won't, because you
> aren't going to allocate more than 1GB.  Any place that wants to
> allocate more than that needs to be someplace where we can be pretty
> sure that we're not going to accidentally allocate some completely
> unreasonable amount of memory, like say 1TB.  Nothing in this
> discussion convinces me that this is such a place.  Note that

We're not going to allocate a completely unreasonable amount of memory, 
because there already are some checks in place.

Firstly, you can't really get a bucket array larger than ~25% of 
work_mem, because the bucket pointer is just 8B while the HJ tuple takes 
16B plus the data (IIRC). For wider tuples the share taken by the 
buckets further decreases.

Secondly, we limit the number of buckets to INT_MAX, so about 16GB 
(because buckets are just pointers). No matter how awful an estimate you 
get (or how insanely high you set work_mem), you can't exceed this.
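
FWIW the back-of-the-envelope arithmetic behind those two bounds looks 
like this (using the 8B pointer and 16B per-tuple overhead mentioned 
above; the exact struct sizes may differ a bit):

#include <stdio.h>
#include <limits.h>
#include <stddef.h>

int
main(void)
{
    size_t ptr_size   = sizeof(void *);    /* one bucket pointer per tuple */
    size_t min_header = 16;                /* per-tuple overhead quoted above */
    size_t widths[]   = {8, 32, 100, 500}; /* bytes of actual data per tuple */
    int    i;

    /* First bound: every bucket pointer is accompanied by at least one
     * tuple, so the bucket array is only a fraction of the hash table. */
    for (i = 0; i < 4; i++)
    {
        double frac = (double) ptr_size / (ptr_size + min_header + widths[i]);

        printf("tuple data %3zu B -> bucket array is ~%2.0f%% of the table\n",
               widths[i], frac * 100.0);
    }

    /* Second bound: nbuckets is an int, so the array of 8B pointers can
     * never exceed INT_MAX * 8 bytes, i.e. roughly 16GB. */
    printf("hard cap: INT_MAX buckets * %zu B = ~%.0f GB\n",
           ptr_size,
           (double) INT_MAX * ptr_size / (1024.0 * 1024.0 * 1024.0));
    return 0;
}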

> tuplesort.c and tuplestore.c, the only existing callers of
> repalloc_huge, only allocate such large amounts of memory when they
> actually have enough tuples to justify it - it is always based on the
> actual number of tuples, never an estimate.  I think that would be a
> sound principle here, too.  Resizing the hash table to such a large
> size based on the actual load factor is very reasonable; starting with
> such a large size seems less so.  Admittedly, 512MB is an arbitrary
> point: and if it so happened that the limit was 256MB or 1GB or 128MB
> or even 2GB I wouldn't advocate for changing it just for fun. But
> you're saying we should just remove that limit altogether, and I think
> that's clearly unreasonable.  Do you really want to start out with a
> TB or even PB-sized hash table when the actual number of tuples is,
> say, one?  That may sound crazy, but I've seen enough bad query plans
> to know that, yes, we are sometimes off by nine orders of magnitude.
> This is not a hypothetical problem.

No, I'm not saying anything like that - I actually explicitly stated 
that I'm not against such a change (further restricting the initial hash 
table size), if someone takes the time to do a bit of testing and 
provides some numbers.

Moreover, as I explained, there already are limits in place (~25% of 
work_mem or 16GB, whichever is lower), so I don't really see the bugfix 
as unreasonable.

Maybe if we decide to lift this restriction (using int64 to address the 
buckets, which would remove the 16GB limit), this issue will become much 
more pressing. But I guess hash tables with 2B buckets will be enough 
for the near future.

>>> More importantly, removing the cap on the allocation size makes the
>>> problem a lot worse.  You might be sad if a bad planner estimate
>>> causes the planner to allocate 1GB when 64MB would have been enough,
>>> but on modern systems it is not likely to be an enormous problem.  If
>>> a similar mis-estimation causes the planner to allocate 16GB rather
>>> than 1GB, the opportunity for you to be sad is magnified pretty
>>> considerably.  Therefore, I don't really see the over-estimation bug
>>> fix as being separate from this one.
>>
>> Perhaps. But if you want to absolutely prevent such sadness then maybe you
>> should not set work_mem that high?
>
> I think that's a red herring for a number of reasons.  One, the
> allocation for the hash buckets is only a small portion of the total
> memory.  Two, the fact that you are OK with the hash table growing to
> a certain size does not mean that you want it to start out that big on
> the strength of a frequently-flawed planner estimate.

True, although I don't think the herring is entirely red. It might just 
as well be a mackerel ;-)

>> Anyway, I do see this as a rather orthogonal problem - an independent
>> improvement, mostly unrelated to the bugfix. Even if we decide to redesign
>> it like this (and I'm not particularly opposed to that, assuming someone
>> takes the time to measure how expensive the additional resize actually is),
>> we'll still have to fix the repalloc().
>>
>> So I still fail to see why we shouldn't apply this fix.
>
> In all seriousness, that is fine.  I respect your opinion; I'm just
> telling you mine, which happens to be different.

Likewise.

Let me repeat my previous proposal:
 1) Let's apply the proposed bugfix (and also backpatch it), because
    the current code *is* broken.

 2) Do a bunch of experiments with limiting the initial hash size, and
    decide whether the impact on well-estimated cases is acceptable.

I'm strongly opposed to just limiting the initial size without actually 
measuring how expensive the resize is, as that simply adds cost to the 
well-estimated cases (which may easily be the vast majority of all the 
plans, as we tend to notice just the poorly estimated ones).


regards

-- 
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


