Thread: why does PostgreSQL define NTUP_PER_BUCKET as 10, not a smaller number?

From: b8flowerfire
When I read the hash join source code, I was confused about why PostgreSQL
defines the NTUP_PER_BUCKET value as 10.
Since this value is used to estimate the tuple count in one bucket, wouldn't
it be better to use a smaller value?
I have not done any experiments, but it seems we could achieve fewer hash
collisions and better performance if we decreased the value.
So could anyone explain to me why we define NTUP_PER_BUCKET as 10?
Is there a specific situation in which we would get worse performance or
run into trouble if NTUP_PER_BUCKET were set to 1 or 2?

Thanks very much. 






On Mon, Jun 9, 2014 at 4:06 AM, b8flowerfire <b8flowerfire@gmail.com> wrote:
> When I read the hash join source code, I was confused about why PostgreSQL
> defines the NTUP_PER_BUCKET value as 10.
> Since this value is used to estimate the tuple count in one bucket, wouldn't
> it be better to use a smaller value?
> I have not done any experiments, but it seems we could achieve fewer hash
> collisions and better performance if we decreased the value.
> So could anyone explain to me why we define NTUP_PER_BUCKET as 10?
> Is there a specific situation in which we would get worse performance or
> run into trouble if NTUP_PER_BUCKET were set to 1 or 2?

This has come up before.  Basically, the problem is that if you reduce
NTUP_PER_BUCKET, the bucket array gets larger, which might reduce the
amount of space available for tuples to the point where the hash join
overflows to multiple batches.  That will be more expensive than
performing the hash join with a higher NTUP_PER_BUCKET.

But having said that, I think the current situation is pretty bad,
too.  NTUP_PER_BUCKET is basically the anticipated load factor for the
hash table, and everything I've ever read about hash table design says
you want that to be something like 1, or 0.25.  So 10 seems really
high.  But I'm not sure exactly what to do to fix the problem, because
there are indeed cases where we will be worse off if we just lower the
value categorically.
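
To make the trade-off concrete, here is a simplified sketch; it is not the
actual ExecChooseHashTableSize() logic (which also rounds the bucket count to
a power of two and accounts for per-tuple overhead), and the work_mem, tuple
width, and tuple count are assumed figures:

/*
 * Simplified sketch, not the actual ExecChooseHashTableSize() code: it only
 * shows how shrinking NTUP_PER_BUCKET grows the bucket pointer array and
 * shrinks the room left for tuples inside a fixed work_mem.
 */
#include <stdio.h>

static void
show_space(double work_mem, double tuple_width, double ntuples,
           int ntup_per_bucket)
{
    double nbuckets = ntuples / ntup_per_bucket;
    double bucket_bytes = nbuckets * sizeof(void *);
    double tuple_space = work_mem - bucket_bytes;

    printf("NTUP_PER_BUCKET=%2d: buckets take %4.1f%% of work_mem, "
           "room for %.0f of %.0f tuples\n",
           ntup_per_bucket,
           100.0 * bucket_bytes / work_mem,
           tuple_space / tuple_width, ntuples);
}

int
main(void)
{
    double work_mem = 4.0 * 1024 * 1024;    /* assume work_mem = 4MB */
    double tuple_width = 48.0;              /* assumed average stored tuple size */
    double ntuples = 80000.0;               /* assumed inner-relation tuple count */

    show_space(work_mem, tuple_width, ntuples, 10);
    show_space(work_mem, tuple_width, ntuples, 1);
    return 0;
}

With those numbers the 10-tuples-per-bucket load factor leaves room for
everything in one batch, while dropping to 1 pushes the same input over
work_mem and forces a second batch - exactly the case where the lower
setting loses.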

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> This has come up before.  Basically, the problem is that if you reduce
> NTUP_PER_BUCKET, the bucket array gets larger, which might reduce the
> amount of space available for tuples to the point where the hash join
> overflows to multiple batches.  That will be more expensive than
> performing the hash join with a higher NTUP_PER_BUCKET.

> But having said that, I think the current situation is pretty bad,
> too.  NTUP_PER_BUCKET is basically the anticipated load factor for the
> hash table, and everything I've ever read about hash table design says
> you want that to be something like 1, or 0.25.  So 10 seems really
> high.  But I'm not sure exactly what to do to fix the problem, because
> there are indeed cases where we will be worse off if we just lower the
> value categorically.

Keep in mind that that standard advice is meant for all-in-memory cases,
not for cases where the alternative to running with longer hash chains
is dumping tuples out to disk and reading them back.

I'm quite prepared to believe that we should change NTUP_PER_BUCKET ...
but appealing to standard advice isn't a good basis for arguing that.
Actual performance measurements (in both batched and unbatched cases)
would be a suitable basis for proposing a change.
        regards, tom lane



Re: why does PostgreSQL define NTUP_PER_BUCKET as 10, not a smaller number?

From: b8flowerfire
Robert Haas wrote:
> On Mon, Jun 9, 2014 at 4:06 AM, b8flowerfire <b8flowerfire@gmail.com> wrote:
>
> This has come up before.  Basically, the problem is that if you reduce
> NTUP_PER_BUCKET, the bucket array gets larger, which might reduce the
> amount of space available for tuples to the point where the hash join
> overflows to multiple batches.  That will be more expensive than
> performing the hash join with a higher NTUP_PER_BUCKET.
> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Thanks for the explanation. But I don't find it entirely convincing.
Simply reducing the value of NTUP_PER_BUCKET will enlarge the pointer array
and reduce the number of tuples in one batch. But is that effect significant
to the performance?
The utilization of work_mem, I think, is determined by the ratio of the size
of a pointer to the size of a tuple.
Let's assume the size of a tuple is 28 bytes, which seems reasonable because
it's the sum of HJTUPLE_OVERHEAD (at least 8 bytes), the size of
MinimalTupleData (at least 10 bytes), and the tuple contents (assume 10
bytes). And the size of a pointer is 4 bytes.
So, if NTUP_PER_BUCKET is set to 10, about (28 * 10) / (28 * 10 + 4) of
work_mem is used to store tuples. If NTUP_PER_BUCKET is set to 1, about
28 / (28 + 4) of work_mem is used to store tuples, roughly 90% of the
original.
As a result, changing the value of NTUP_PER_BUCKET to 1 may increase the
number of batches by only about 10%. Is that enough to affect performance?
Or maybe I can't do the calculation this simply.
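
Just to spell that arithmetic out (the 28-byte tuple and 4-byte pointer are
the assumptions above, not measured values):

/*
 * Back-of-the-envelope check of the fractions above.
 */
#include <stdio.h>

int
main(void)
{
    double tuple = 28.0;    /* assumed tuple size */
    double ptr = 4.0;       /* assumed pointer size */

    double frac10 = (tuple * 10) / (tuple * 10 + ptr);  /* NTUP_PER_BUCKET = 10 */
    double frac1 = tuple / (tuple + ptr);               /* NTUP_PER_BUCKET = 1  */

    printf("tuples get %.1f%% of work_mem at 10, %.1f%% at 1 (ratio %.2f)\n",
           100.0 * frac10, 100.0 * frac1, frac1 / frac10);
    return 0;
}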

Besides, we have larger main memory now. If we set work_mem larger, the
increase in batches introduced by a smaller NTUP_PER_BUCKET value may be
reduced, couldn't it?

I have read previous discussion about NTUP_PER_BUCKET. It seems that even if
we changed NTUP_PER_BUCKET to 50 or larger, performance wouldn't be much
worse, because every tuple in a bucket's chain stores its hash value, so
having more tuples in a bucket only adds some comparisons of two integers.
Is the converse also true, that making it smaller would not gain us much?
Is that one of the reasons we define it as 10?

Incidentally, this is my first time taking part in an open source community
discussion. Thank you all for the replies and the discussion.






On Tue, Jun 10, 2014 at 1:13 AM, b8flowerfire <b8flowerfire@gmail.com> wrote:
> Thanks for the explanation. But I don't find it entirely convincing.
> Simply reducing the value of NTUP_PER_BUCKET will enlarge the pointer array
> and reduce the number of tuples in one batch. But is that effect significant
> to the performance?
> The utilization of work_mem, I think, is determined by the ratio of the size
> of a pointer to the size of a tuple.
> Let's assume the size of a tuple is 28 bytes, which seems reasonable because
> it's the sum of HJTUPLE_OVERHEAD (at least 8 bytes), the size of
> MinimalTupleData (at least 10 bytes), and the tuple contents (assume 10
> bytes). And the size of a pointer is 4 bytes.

The size of a pointer is 8 bytes on most platforms these days.  On the
flip side, we shouldn't forget that each tuple has a 2-pointer, thus
16-byte, overhead due to the way AllocSetAlloc works, and that before
adding that we will round up to the nearest power of two when
allocating.  So in fact, in your example, each tuple will require 48
bytes on a 64-bit platform, and each pointer will require 8.  So if
I'm calculating correctly, the memory allocation for the pointers
would be about 1.6% of the total with NTUP_PER_BUCKET = 10 and
about 14.3% of the total with NTUP_PER_BUCKET = 1.
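
A little sketch of that arithmetic, with the rounding and chunk overhead
included (rough figures only, not real allocator code):

/*
 * The 28-byte payload rounds up to 32 bytes and carries a 16-byte chunk
 * overhead; a pointer is 8 bytes on a 64-bit platform.  With NTUP_PER_BUCKET
 * tuples per bucket, each tuple effectively "owns" 8 / NTUP_PER_BUCKET bytes
 * of the bucket array.
 */
#include <stdio.h>

int
main(void)
{
    double tuple = 32.0 + 16.0;     /* 28 rounded up to 32, plus palloc overhead */
    double ptr = 8.0;

    double frac10 = (ptr / 10.0) / (tuple + ptr / 10.0);   /* NTUP_PER_BUCKET = 10 */
    double frac1 = ptr / (tuple + ptr);                    /* NTUP_PER_BUCKET = 1  */

    printf("pointers are %.1f%% of the total at 10, %.1f%% at 1\n",
           100.0 * frac10, 100.0 * frac1);
    return 0;
}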

> As a result, changing the value of NTUP_PER_BUCKET to 1 may increase the
> number of batches by only about 10%. Is that enough to affect performance?
> Or maybe I can't do the calculation this simply.

The problem case is when you have 1 batch and the increased memory
consumption causes you to switch to 2 batches.  That's expensive.  It
seems clear based on previous testing that *on the average*
NTUP_PER_BUCKET = 1 will be better, but in the case where it causes an
increase in the number of batches it will be much worse - particularly
because the only way we ever increase the number of batches is to
double it, which is almost always going to be a huge loss.
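
A toy illustration of that cliff (the capacity and tuple counts are made-up
numbers, and this is not the real batching code):

/*
 * The number of batches only ever doubles, so barely exceeding the in-memory
 * capacity spills roughly half of the inner input instead of just the small
 * excess.
 */
#include <stdio.h>

int
main(void)
{
    double capacity = 100000.0;               /* tuples that fit in work_mem */
    double inner[] = {99000.0, 101000.0};     /* just under / just over */

    for (int i = 0; i < 2; i++)
    {
        int nbatch = 1;

        while (inner[i] / nbatch > capacity)
            nbatch *= 2;                      /* batches only ever double */

        printf("%.0f tuples -> %d batch(es), ~%.0f%% of the inner input "
               "written to disk\n",
               inner[i], nbatch, 100.0 * (1.0 - 1.0 / nbatch));
    }
    return 0;
}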

> Besides, we have larger main memory now. If we set work_mem larger, the
> increase in batches introduced by a smaller NTUP_PER_BUCKET value may be
> reduced, couldn't it?

If work_mem is large enough that we're going to do a single batch
either way, or the same number of batches either way, then we can
reduce NTUP_PER_BUCKET and it should be a clear win.

> I have read previous discussion about NTUP_PER_BUCKET. It seems that even if
> we changed NTUP_PER_BUCKET to 50 or larger, performance wouldn't be much
> worse, because every tuple in a bucket's chain stores its hash value, so
> having more tuples in a bucket only adds some comparisons of two integers.
> Is the converse also true, that making it smaller would not gain us much?
> Is that one of the reasons we define it as 10?

I'm not sure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Mon, Jun 9, 2014 at 11:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Keep in mind that that standard advice is meant for all-in-memory cases,
> not for cases where the alternative to running with longer hash chains
> is dumping tuples out to disk and reading them back.

Sure, but that doesn't help someone who sets work_mem to some very
large value precisely to ensure that the hash join will be done in
memory.  They still don't get the benefit of a smaller NTUP_PER_BUCKET
setting.

> I'm quite prepared to believe that we should change NTUP_PER_BUCKET ...
> but appealing to standard advice isn't a good basis for arguing that.
> Actual performance measurements (in both batched and unbatched cases)
> would be a suitable basis for proposing a change.

Well, it's all in what scenario you test, right?  If you test the case
where something overflows work_mem as a result of the increased size
of the bucket array, it's always going to suck.  And if you test the
case where that doesn't happen, it's likely to win.  I think Stephen
Frost has already done quite a bit of testing in this area, on
previous threads.  But there's no one-size-fits-all solution.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jun 9, 2014 at 11:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm quite prepared to believe that we should change NTUP_PER_BUCKET ...
>> but appealing to standard advice isn't a good basis for arguing that.
>> Actual performance measurements (in both batched and unbatched cases)
>> would be a suitable basis for proposing a change.

> Well, it's all in what scenario you test, right?  If you test the case
> where something overflows work_mem as a result of the increased size
> of the bucket array, it's always going to suck.  And if you test the
> case where that doesn't happen, it's likely to win.  I think Stephen
> Frost has already done quite a bit of testing in this area, on
> previous threads.  But there's no one-size-fits-all solution.

I don't really recall any hard numbers being provided.  I think if we
looked at some results that said "here's the average gain, and here's
the worst-case loss, and here's an estimate of how often you'd hit
the worst case", then we could make a decision.

However, I notice that it's already the case that we make a
to-batch-or-not-to-batch decision on the strength of some very crude
numbers during ExecChooseHashTableSize, and we explicitly don't consider
palloc overhead there.  It would certainly be easy enough to use two
different NTUP_PER_BUCKET target load factors depending on which path
is being taken in ExecChooseHashTableSize.  So maybe part of the answer is
to not require those numbers to be the same.
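
Something along these lines, where NTUP_PER_BUCKET_INMEMORY is a made-up name
for the hypothetical second target, not an existing symbol:

/* A sketch of the idea, not a patch: use a denser load-factor target on the
 * single-batch path and keep the conservative one on the multi-batch path. */
#include <stdbool.h>

#define NTUP_PER_BUCKET           10    /* existing constant, multi-batch path */
#define NTUP_PER_BUCKET_INMEMORY   1    /* hypothetical single-batch target */

static int
choose_ntup_per_bucket(bool single_batch)
{
    return single_batch ? NTUP_PER_BUCKET_INMEMORY : NTUP_PER_BUCKET;
}
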
        regards, tom lane



On Tue, Jun 10, 2014 at 10:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Well, it's all in what scenario you test, right?  If you test the case
>> where something overflows work_mem as a result of the increased size
>> of the bucket array, it's always going to suck.  And if you test the
>> case where that doesn't happen, it's likely to win.  I think Stephen
>> Frost has already done quite a bit of testing in this area, on
>> previous threads.  But there's no one-size-fits-all solution.
>
> I don't really recall any hard numbers being provided.  I think if we
> looked at some results that said "here's the average gain, and here's
> the worst-case loss, and here's an estimate of how often you'd hit
> the worst case", then we could make a decision.

The worst case loss is that you have to rescan the entire inner
relation, so it's pretty darned bad.  I'm not sure how to construct an
optimal worst case for that being monumentally expensive, but making
the inner relation gigantic is probably a good start.

> However, I notice that it's already the case that we make a
> to-batch-or-not-to-batch decision on the strength of some very crude
> numbers during ExecChooseHashTableSize, and we explicitly don't consider
> palloc overhead there.  It would certainly be easy enough to use two
> different NTUP_PER_BUCKET target load factors depending on which path
> is being taken in ExecChooseHashTableSize.  So maybe part of the answer is
> to not require those numbers to be the same.

If we could allow NTUP_PER_BUCKET to drop when the hashtable is
expected to fit in memory either way, perhaps with some safety margin
(e.g. we expect to use less than 75% of work_mem), I bet that would
make the people who have been complaining about this issue happy.  And
probably a few people who haven't been complaining, too: I don't
recall the precise performance numbers that were posted before, but
ISTR the difference between 10 and 1 was pretty dramatic.
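
In rough outline, something like this; the 75% threshold and the names here
are placeholders, not existing PostgreSQL symbols:

/* Only drop the load factor when the estimated table size leaves a safety
 * margin inside work_mem, so the larger bucket array cannot by itself push
 * us into a second batch. */
#define HASH_MEM_SAFETY_MARGIN 0.75

static int
ntup_per_bucket_with_margin(double estimated_table_bytes, double work_mem_bytes)
{
    if (estimated_table_bytes < HASH_MEM_SAFETY_MARGIN * work_mem_bytes)
        return 1;       /* plenty of headroom: use a dense bucket array */
    return 10;          /* otherwise keep the conservative NTUP_PER_BUCKET */
}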

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jun 10, 2014 at 10:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I don't really recall any hard numbers being provided.  I think if we
>> looked at some results that said "here's the average gain, and here's
>> the worst-case loss, and here's an estimate of how often you'd hit
>> the worst case", then we could make a decision.

> The worst case loss is that you have to rescan the entire inner
> relation, so it's pretty darned bad.  I'm not sure how to construct an
> optimal worst case for that being monumentally expensive, but making
> the inner relation gigantic is probably a good start.

"Rescan"?  I'm pretty sure there are no cases where nodeHash reads the
inner relation more than once.  If you mean dumping to disk vs not dumping
to disk, yeah, that's a big increment in the cost.

> If we could allow NTUP_PER_BUCKET to drop when the hashtable is
> expected to fit in memory either way, perhaps with some safety margin
> (e.g. we expect to use less than 75% of work_mem), I bet that would
> make the people who have been complaining about this issue happy.

Could be a good plan.  We still need some test results though.
        regards, tom lane



On Tue, Jun 10, 2014 at 10:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Jun 10, 2014 at 10:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I don't really recall any hard numbers being provided.  I think if we
>>> looked at some results that said "here's the average gain, and here's
>>> the worst-case loss, and here's an estimate of how often you'd hit
>>> the worst case", then we could make a decision.
>
>> The worst case loss is that you have to rescan the entire inner
>> relation, so it's pretty darned bad.  I'm not sure how to construct an
>> optimal worst case for that being monumentally expensive, but making
>> the inner relation gigantic is probably a good start.
>
> "Rescan"?  I'm pretty sure there are no cases where nodeHash reads the
> inner relation more than once.  If you mean dumping to disk vs not dumping
> to disk, yeah, that's a big increment in the cost.

Sorry, that's what I meant.

>> If we could allow NTUP_PER_BUCKET to drop when the hashtable is
>> expected to fit in memory either way, perhaps with some safety margin
>> (e.g. we expect to use less than 75% of work_mem), I bet that would
>> make the people who have been complaining about this issue happy.
>
> Could be a good plan.  We still need some test results though.

Sounds reasonable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: why does PostgreSQL define NTUP_PER_BUCKET as 10, not a smaller number?

From: Stephen Frost
* Robert Haas (robertmhaas@gmail.com) wrote:
> If we could allow NTUP_PER_BUCKET to drop when the hashtable is
> expected to fit in memory either way, perhaps with some safety margin
> (e.g. we expect to use less than 75% of work_mem), I bet that would
> make the people who have been complaining about this issue happy.  And
> probably a few people who haven't been complaining, too: I don't
> recall the precise performance numbers that were posted before, but
> ISTR the difference between 10 and 1 was pretty dramatic.

This is more-or-less what I had been hoping to find time to do.  As Tom
points out, more testing is needed, and there's still the trade-off
between "fits-in-L2" and "doesn't" which can make a pretty big
difference while building the hash table.  Of course, once it's built,
it's faster to use it when it's larger as there are fewer collisions.

I do plan to come back to this and will hopefully find time over the
next few months.
Thanks,
    Stephen


On Tue, Jun 10, 2014 at 5:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> The problem case is when you have 1 batch and the increased memory
> consumption causes you to switch to 2 batches.  That's expensive.  It
> seems clear based on previous testing that *on the average*
> NTUP_PER_BUCKET = 1 will be better, but in the case where it causes an
> increase in the number of batches it will be much worse - particularly
> because the only way we ever increase the number of batches is to
> double it, which is almost always going to be a huge loss.

Is there a reason we don't do hybrid hashing, where if 80% fits in memory
then we write out only the 20% that doesn't? And then when probing the table
with the other input, the 80% that land in in-memory buckets get handled
immediately, and only the 20% that land in the on-disk buckets get written
out for the next step?

Obviously no one has implemented it yet, but is there a fundamental reason
for that, or just a round-tuit problem?
 
Cheers,

Jeff


On Tue, Jun 10, 2014 at 1:43 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Jun 10, 2014 at 5:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> The problem case is when you have 1 batch and the increased memory
>> consumption causes you to switch to 2 batches.  That's expensive.  It
>> seems clear based on previous testing that *on the average*
>> NTUP_PER_BUCKET = 1 will be better, but in the case where it causes an
>> increase in the number of batches it will be much worse - particularly
>> because the only way we ever increase the number of batches is to
>> double it, which is almost always going to be a huge loss.
>
> Is there a reason we don't do hybrid hashing, where if 80% fits in memory
> then we write out only the 20% that doesn't? And then when probing the table
> with the other input, the 80% that land in in-memory buckets get handled
> immediately, and only the 20% that land in the on-disk buckets get written
> out for the next step?

We have an optimization that is a little bit like that.  The "skew"
hash join stuff tries to (essentially) ensure that the MCVs are in the
first batch.

But more could probably be done.  For example, suppose we have 256
buckets.  If the hash table overflows work_mem, we could write the
contents of *one bucket* out to disk, rather than (as we currently do)
half of the table.  If we overflow again, we write another bucket.
When the number of buckets written reaches half the total, we split
all of the remaining buckets so that all 256 slots are once again
active.  Repeat as needed.

If something like that worked out, it would drastically reduce the
penalty for slightly overrunning work_mem.
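
Here is a toy simulation of that idea, just to show how little spills when we
write out one bucket at a time instead of doubling the batch count.  The 256
buckets, the per-bucket size, and work_mem are made-up numbers, and the real
scheme would also re-split the surviving buckets once half of them had been
written out:

/*
 * Toy simulation, not PostgreSQL code: spill one bucket at a time until the
 * table fits again, instead of doubling nbatch.
 */
#include <stdio.h>

#define NBUCKETS 256

int
main(void)
{
    double bucket_bytes[NBUCKETS];
    int on_disk[NBUCKETS] = {0};
    int nspilled = 0;
    double work_mem = 4.0 * 1024 * 1024;
    double in_memory = 0.0;

    /* pretend each bucket holds ~17kB, slightly overflowing work_mem */
    for (int i = 0; i < NBUCKETS; i++)
    {
        bucket_bytes[i] = 17000.0;
        in_memory += bucket_bytes[i];
    }

    /* write out the largest remaining bucket until we fit again */
    while (in_memory > work_mem)
    {
        int victim = -1;

        for (int i = 0; i < NBUCKETS; i++)
            if (!on_disk[i] &&
                (victim < 0 || bucket_bytes[i] > bucket_bytes[victim]))
                victim = i;

        on_disk[victim] = 1;
        in_memory -= bucket_bytes[victim];
        nspilled++;
    }

    printf("spilled %d of %d buckets (%.1f%% of the data) instead of half\n",
           nspilled, NBUCKETS, 100.0 * nspilled / NBUCKETS);
    return 0;
}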

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company