
From Tomas Vondra
Subject Re: DBT-3 with SF=20 got failed
Msg-id 5603C751.1080803@2ndquadrant.com
In response to Re: DBT-3 with SF=20 got failed  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: DBT-3 with SF=20 got failed  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi,

On 09/23/2015 11:25 PM, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On 09/11/2015 07:16 PM, Robert Haas wrote:
>>> On Fri, Sep 11, 2015 at 1:12 PM, Tomas Vondra
>>> <tomas.vondra@2ndquadrant.com> wrote:
>>>> I'm arguing for fixing the existing bug, and then addressing the case of
>>>> over-estimation separately, with proper analysis.
>
>>> Well, this is part of how we're looking it differently.  I think the
>>> bug is "we're passing a value to palloc that is too large, so
>>> sometimes it fails" and the way to fix that is to properly limit the
>>> value.  You are clearly defining the bug a bit differently.
>
>> Yes, I see it differently.
>
>> I don't quite understand why limiting the value is more "proper" than
>> using a function that can handle the actual value.
>
>> The proposed bugfix addresses the issue in the most straightforward way,
>> without introducing additional considerations about possible
>> over-estimations (which the current code completely ignores, so this is
>> a new thing). I think bugfixes should not introduce such changes to
>> behavior (albeit internal), especially not without any numbers.
>
> This thread seems to have stalled out...
>
> After re-reading it, I'm going to agree with Robert that we should
> clamp the initial pointer-array size to work with palloc (ie, 512MB
> worth of pointers, which please recall is going to represent at
> least 10X that much in hashtable entries, probably more).

10x that much in entries? Do you mean NTUP_PER_BUCKET? Because that was 
reduced to 1 last year as part of the hashjoin improvements. So we do 
have more buckets than tuples (to keep the load factor below 1.0). It's 
still true that the entries will need more space than the buckets 
(because of headers and such), but it may easily end up well below 10x.
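
Just to illustrate (a minimal sketch, not the actual nodeHash.c code - 
the minimum bucket count and the rounding loop are assumptions): with 
NTUP_PER_BUCKET = 1 the number of buckets is essentially the tuple 
estimate rounded up to the next power of two, which is what keeps the 
load factor at or below 1.0:

    #include <stdint.h>

    #define NTUP_PER_BUCKET 1

    /* Round the expected tuple count up to the next power of two,
     * so the load factor (ntuples / nbuckets) stays <= 1.0. */
    static uint64_t
    choose_nbuckets(uint64_t ntuples)
    {
        uint64_t    nbuckets = 1024;    /* assumed minimum, sketch only */

        while (nbuckets < ntuples / NTUP_PER_BUCKET)
            nbuckets <<= 1;

        return nbuckets;
    }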

In the example reported by KaiGai-san, the entries are 8B wide (~24B 
with header), while the buckets are 8B. That's a 1:3 ratio. It's a bit 
of an extreme example, though, because in other cases the tuples will 
be wider.

It also seems to me that the higher the ratio, the lower the need to 
actually impose such a limit, because it increases the natural pressure 
keeping the number of buckets down (both buckets and entries need to 
share a work_mem of limited size).
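
To put rough numbers on that pressure, here's a back-of-the-envelope 
sketch using the figures from the example above (the ~24B per entry 
and the 4GB work_mem are assumptions for illustration):

    #include <stdio.h>

    int
    main(void)
    {
        double  bucket_bytes = 8.0;     /* one pointer per bucket */
        double  entry_bytes = 24.0;     /* ~8B tuple plus header */
        double  work_mem_mb = 4096.0;   /* assumed work_mem (4GB) */

        /*
         * With NTUP_PER_BUCKET = 1 there's roughly one bucket per
         * tuple, so each tuple costs (bucket + entry) bytes, and the
         * bucket array can consume at most this share of work_mem.
         */
        double  bucket_share = bucket_bytes / (bucket_bytes + entry_bytes);

        printf("bucket array share of work_mem: %.0f%%\n",
               bucket_share * 100.0);
        printf("max bucket array: %.0f MB out of %.0f MB\n",
               bucket_share * work_mem_mb, work_mem_mb);

        return 0;
    }

With wider tuples that share drops further - the bucket array simply 
can't grow very large without the entries blowing work_mem first.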

> The argument that allowing it to be larger would be a performance win
> seems entirely unbased on any evidence, while the risks of allowing
> arbitrarily large allocations based on planner estimates seem pretty
> obvious to me.

Do we agree that resizing the hash table is not free? My argument was 
that we're forcing the well-estimated cases to do an additional resize, 
so maybe we should measure the impact first.

Now, maybe it does not really matter much in this case - we probably 
get slightly inaccurate estimates all the time. Not terribly wrong, but 
enough to make the initial number of buckets too low, so we may end up 
doing the resize anyway. Also, if we're dealing with hash tables of 
this size, we're probably dealing with a much larger outer relation, 
and the additional resize is going to be just noise ...

I do, however, quite dislike the dismissal of the possible impact. It 
should be the responsibility of the person introducing the change to 
show that no such impact actually exists, not to just wave it off as 
"unbased on any evidence" when no evidence has been presented.

Had I been adding such a limit, I'd have done at least some testing and 
presented the results here. Perhaps I'm a bit over-protective of this 
piece of code, as I've spent quite a bit of time making it faster, but 
I believe the principle that the person proposing a change should 
demonstrate the (lack of) performance impact is sound.

> And the argument that such overestimates are a bug that we can easily
> fix is based on even less evidence; in fact, I would dismiss that out
> of hand as hubris.

I don't think anyone was suggesting the overestimates are easy to fix 
(or even possible to fix in general). If I ever claimed that, I was 
clearly wrong.

>
> Now there is a separate issue about whether we should allow hashtable
> resizes to exceed that limit.  There I would vote yes, as long as the
> resize is based on arrival of actual data and not estimates (following
> Robert's point that the existing uses of repalloc_huge are driven by
> actual need).
>
> So I don't like any of the proposed patches exactly as-is.
>
> BTW, just looking at the code in question, it seems to be desperately
> in need of editorial review.  A couple of examples:
>
> * Some of the code goes to the trouble of maintaining a
> log2_nbuckets_optimal value, but ExecHashIncreaseNumBuckets doesn't
> know about that and recomputes the value.

Yeah, that's a stupid bug.

>
> * ExecHashIncreaseNumBuckets starts out with a run-time test on
> something that its sole caller has just Assert()'d to not be the
> case, and which really ought to be impossible with or without that
> Assert.

Yeah, right. Excessively defensive coding on my side (I think I added 
the Assert later, or something like that).

>
> * ExecHashTableInsert believes it can only increase nbuckets_optimal
> if we're still in the first batch, but MultiExecHash() will call
> ExecHashIncreaseNumBuckets at the end of input-acquisition whether
> we've created more than one batch or not.  Meanwhile,
> ExecHashIncreaseNumBatches thinks it can change the number of buckets
> in any case.  Maybe that's all okay, but at the very least those
> tests ought to look like they'd heard of each other.  And of those
> three places, having the safety check where it is seems like the
> least reasonable place.

I don't quite understand where you see the problem. I believe the logic 
is sound, although adding a comment explaining it wouldn't hurt. Let me 
explain.

The only place where we actually mess with the number of buckets is 
ExecHashTableInsert() around line 880. All the other places are resizes 
of the hash table, rebuilding it to the optimal number of buckets.
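
For reference, the shape of that code is roughly this (a simplified 
sketch of the logic I'm describing, with a stand-in struct instead of 
HashJoinTableData - not a verbatim copy of nodeHash.c):

    #include <limits.h>

    #define NTUP_PER_BUCKET 1

    /* Stand-in for the few HashJoinTableData fields that matter here. */
    typedef struct
    {
        int     nbatch;
        double  totalTuples;
        int     nbuckets_optimal;
        int     log2_nbuckets_optimal;
    } FakeHashTable;

    /*
     * Called from the insert path: while we're still in a single batch,
     * double the *optimal* number of buckets whenever the load factor
     * would exceed NTUP_PER_BUCKET. No resize happens here - we only
     * track what the bucket count should eventually become.
     */
    static void
    maybe_grow_nbuckets_optimal(FakeHashTable *hashtable)
    {
        if (hashtable->nbatch == 1 &&
            hashtable->totalTuples >
            (double) hashtable->nbuckets_optimal * NTUP_PER_BUCKET &&
            hashtable->nbuckets_optimal <= INT_MAX / 2)
        {
            hashtable->nbuckets_optimal *= 2;
            hashtable->log2_nbuckets_optimal += 1;
        }
    }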

The rebuild can happen in two situations - either at the end of the 
input (which is what happens in MultiExecHash), or when we start 
batching (ExecHashIncreaseNumBatches).

Once we're batching, there's no point in further messing with buckets 
because we assume using more buckets would overflow work_mem.

The fact that we're doing the resize within ExecHashIncreaseNumBatches 
is merely because of efficiency - we are going to walk the hash table 
anyway (because we need to get rid of ~1/2 the tuples), so we simply 
rebuild the buckets too.
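
In a simplified form the combined pass looks like this (a sketch of the 
shape only - the types, the modulo-based batch/bucket selection and the 
function name are stand-ins, not the real nodeHash.c API):

    /* Minimal stand-in for an in-memory hash join tuple. */
    typedef struct FakeTuple
    {
        struct FakeTuple   *next;
        unsigned int        hashvalue;
    } FakeTuple;

    /*
     * One walk over the old bucket array: tuples that now belong to a
     * later batch would be dumped to that batch's temp file (the ~1/2
     * of the tuples we get rid of), the rest are re-inserted into the
     * new, larger bucket array. The re-bucketing rides along for free.
     */
    static void
    split_and_rebucket(FakeTuple **oldbuckets, int old_nbuckets,
                       FakeTuple **newbuckets, int new_nbuckets,
                       int curbatch, int nbatch)
    {
        for (int i = 0; i < old_nbuckets; i++)
        {
            FakeTuple  *tuple = oldbuckets[i];

            while (tuple != NULL)
            {
                FakeTuple  *next = tuple->next;
                int         batchno = tuple->hashvalue % nbatch;

                if (batchno == curbatch)
                {
                    int     bucketno = tuple->hashvalue % new_nbuckets;

                    tuple->next = newbuckets[bucketno];
                    newbuckets[bucketno] = tuple;
                }
                /* else: write the tuple to its batch file and free it */

                tuple = next;
            }
        }
    }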

> Tracking an "optimal" number of buckets seems like an activity that
> should not be constrained by whether we have any hope of being able
> to apply the number.

Not sure I understand?

>
> So I'm not having a lot of faith that there aren't other bugs in the
> immediate area.

Well, I surely am not infallible, but so far I haven't seen any proof 
of an actual bug, except for needlessly recomputing a value (which can 
only happen once) and an excessive check on an already-asserted value.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


