Thread: A better way than tweaking NTUP_PER_BUCKET

A better way than tweaking NTUP_PER_BUCKET

From
Simon Riggs
Date:
Previous discussions of Hash Joins have noted that the performance
decreases when the average number of tuples per bucket increases.
O(N^2) effects are seen.

We've argued this about many different ways, yet all of those
discussions have centred around the constant NTUP_PER_BUCKET. I
believe that was a subtle mistake and I have a proposal.

The code in ExecChooseHashTableSize() line 460 says
 /*
  * Set nbuckets to achieve an average bucket load of NTUP_PER_BUCKET when
  * memory is filled.
...

but the calculation then sets the number of buckets like this

 dbuckets = ceil(ntuples / NTUP_PER_BUCKET);

This doesn't match the comment. If we underestimate the number
of tuples and go on to fill the available memory, we then end up with
an average number of tuples per bucket higher than NTUP_PER_BUCKET. A
notational confusion that has been skewing the discussion.

The correct calculation that would match the objective set out in the
comment would be

 dbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;

Which leads us to a much more useful value of dbuckets in the case
where using ntuples occupies much less space than is available. This
value is always the same as or higher than previously because of the if-test
that surrounds it.
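To make that concrete, here is the shape of the two calculations, simplified
from ExecChooseHashTableSize() (the Min()/clamping that follows in the real
code is omitted). Both sit in the no-batching branch, where inner_rel_bytes
(= ntuples * tupsize) <= hash_table_bytes, which is why the proposed value
can only be the same or larger:

  /* current code (simplified): buckets follow the row-count estimate */
  dbuckets = ceil(ntuples / NTUP_PER_BUCKET);

  /* proposed: buckets follow the memory we are allowed to fill */
  dbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;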

Given my experience that on larger tables we end up underestimating
ndistinct by 10-100-1000 times, I don't think this change is
unwarranted.

This solves the problem in earlier discussions since we get a lower
average number of tuples per bucket and yet we also get to keep the
current NTUP_PER_BUCKET value. Everybody wins.

Comments?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: A better way than tweaking NTUP_PER_BUCKET

From
Stephen Frost
Date:
Simon,

* Simon Riggs (simon@2ndQuadrant.com) wrote:
> Previous discussions of Hash Joins have noted that the performance
> decreases when the average number of tuples per bucket increases.

Having the hash table so small that we have hash bucket collisions with
different 32bit hash values is extremely painful, however..

> The correct calculation that would match the objective set out in the
> comment would be
>
>  dbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;

This looks to be driving the size of the hash table size off of "how
many of this size tuple can I fit into memory?" and ignoring how many
actual rows we have to hash.  Consider a work_mem of 1GB with a small
number of rows to actually hash- say 50.  With a tupsize of 8 bytes,
we'd be creating a hash table sized for some 13M buckets.

> This solves the problem in earlier discussions since we get a lower
> average number of tuples per bucket and yet we also get to keep the
> current NTUP_PER_BUCKET value. Everybody wins.

I agree that this should reduce the number of tuples per bucket, but I
don't think it's doing what you expect and based on what it's doing, it
seems wasteful.

> Comments?

Based on your argument that we want to have a bucket load which is, on
average, the size of NTUP_PER_BUCKET, I have to '-1' on this.  What we
want is to size a table large enough that we never have any true
collisions (cases where different 32-bit hash values end up in the same
bucket) *for all tuples being hashed*, that includes both the side
building the hash table and the side doing the scan.  This should be
done when we can avoid batching- if we don't have enough to avoid
batching while having such a large table, we should consider adjusting
the hash table size down some because, in those cases, we're memory
constrained.

When we have enough memory to avoid batching, we never want to have to
check down through a bucket for a tuple which doesn't exist- we should
simply be able to look in the hash table itself, determine (with a
single fetch) that there are no entries for that hash value, and throw
the tuple away and move on to the next.
Thanks,
    Stephen

Re: A better way than tweaking NTUP_PER_BUCKET

From
Simon Riggs
Date:
On 22 June 2013 14:48, Stephen Frost <sfrost@snowman.net> wrote:

> Based on your argument that we want to have a bucket load which is, on
> average, the size of NTUP_PER_BUCKET, I have to '-1' on this.

That is not my argument. I am pointing out that the comments claim the
code does that, and yet the code does not.

So either we apply the patch to make the code fit the comment, or we
change the comment.

Tom's wish for an average tuples per bucket of 10 may be connected to
that error; I can't say. But we definitely don't end up with an
average of 10 in many cases, which is the basis for previous poor
performance reports.

> What we
> want is to size a table large enough that we never have any true
> collisions (cases where different 32-bit hash values end up in the same
> bucket) *for all tuples being hashed*, that includes both the side
> building the hash table and the side doing the scan.  This should be
> done when we can avoid batching- if we don't have enough to avoid
> batching while having such a large table, we should consider adjusting
> the hash table size down some because, in those cases, we're memory
> constrained.
>
> When we have enough memory to avoid batching, we never want to have to
> check down through a bucket for a tuple which doesn't exist- we should
> simply be able to look in the hash table itself, determine (with a
> single fetch) that there are no entries for that hash value, and throw
> the tuple away and move on to the next.

The bottom line is that you're hoping for perfection. If we knew
exactly how many tuples were in the hash table then we could size it
precisely for the hash join we are about to perform. We don't know
that, so we can't size it precisely. Worse than that is that we know
we badly estimate the number of buckets on a regular basis.

That gives us three choices: if the hash table size is fixed without
prior knowledge, then we either (0) continue to underallocate buckets
in many cases or (1) potentially overallocate buckets. Or (2) we learn
to cope better with the variation in the number of entries. If we rule
out the first two we have only the last one remaining.

(1) will always be a heuristic, and as you point out could itself be
an extreme setting.

So I think that (2) is the best route: Given that we know with much
better certainty the number of rows in the scanned-relation, we should
be able to examine our hash table after it has been built and decide
whether it would be cheaper to rebuild the hash table with the right
number of buckets, or continue processing with what we have now. Which
is roughly what Heikki proposed already, in January.
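Purely as a sketch of the shape of such a check (hashtable->totalTuples and
hashtable->nbuckets are real fields of the hash table; every other name here
is made up, not existing code):

  /* after the build side has been loaded */
  double      avg_load = hashtable->totalTuples / hashtable->nbuckets;

  if (avg_load > NTUP_PER_BUCKET &&
      cost_of_rebuild(hashtable) < cost_of_deep_buckets(avg_load, outer_rows_est))
      ExecHashRebuild(hashtable);   /* re-bucket using the now-known tuple count */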

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: A better way than tweaking NTUP_PER_BUCKET

From
Robert Haas
Date:
On Sat, Jun 22, 2013 at 9:15 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Previous discussions of Hash Joins have noted that the performance
> decreases when the average number of tuples per bucket increases.
> O(N^2) effects are seen.
>
> We've argued this about many different ways, yet all of those
> discussions have centred around the constant NTUP_PER_BUCKET. I
> believe that was a subtle mistake and I have a proposal.
>
> The code in ExecChooseHashTableSize() line 460 says
>  /*
>   * Set nbuckets to achieve an average bucket load of NTUP_PER_BUCKET when
>   * memory is filled.
> ...
>
> but the calculation then sets the number of buckets like this
>
>  dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
>
> This doesn't match the comment. If we underestimate the number
> of tuples and go on to fill the available memory, we then end up with
> an average number of tuples per bucket higher than NTUP_PER_BUCKET. A
> notational confusion that has been skewing the discussion.
>
> The correct calculation that would match the objective set out in the
> comment would be
>
>  dbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
>
> Which leads us to a much more useful value of dbuckets in the case
> where using ntuples occupies much less space than is available. This
> value is always the same as or higher than previously because of the if-test
> that surrounds it.

+1.  I think this is very clever.

The other thing that seems to be distorting things here is that this
function doesn't account for the memory consumed by the bucket array.
So as things stand today, changing NTUP_PER_BUCKET can't ever increase
the required number of batches.  But that's really nonsensical.
Setting NTUP_PER_BUCKET to a smaller value like 1 or even, in effect,
a value less than 1 would probably improve performance for large hash
tables, but there's no way to decide what value is too expensive
because the memory cost of changing it isn't accounted for in the
first place.
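For illustration, accounting for it could be as simple as something like this
inside ExecChooseHashTableSize() (a sketch, not a patch; variable names follow
the actual function, and each bucket is one HashJoinTuple pointer):

  long        bucket_bytes = nbuckets * sizeof(HashJoinTuple);

  if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
  {
      /*
       * The bucket array itself pushes us past work_mem: either increase
       * nbatch or settle for a higher load factor (fewer buckets).
       */
  }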

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: A better way than tweaking NTUP_PER_BUCKET

From
Robert Haas
Date:
On Sat, Jun 22, 2013 at 9:48 AM, Stephen Frost <sfrost@snowman.net> wrote:
>> The correct calculation that would match the objective set out in the
>> comment would be
>>
>>  dbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
>
> This looks to be driving the size of the hash table size off of "how
> many of this size tuple can I fit into memory?" and ignoring how many
> actual rows we have to hash.  Consider a work_mem of 1GB with a small
> number of rows to actually hash- say 50.  With a tupsize of 8 bytes,
> we'd be creating a hash table sized for some 13M buckets.

This is a fair point, but I still think Simon's got a good point, too.
Letting the number of buckets ramp up when there's ample memory seems
like a broadly sensible strategy.  We might need to put a floor on the
effective load factor, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: A better way than tweaking NTUP_PER_BUCKET

From
Stephen Frost
Date:
On Saturday, June 22, 2013, Simon Riggs wrote:
On 22 June 2013 14:48, Stephen Frost <sfrost@snowman.net> wrote:

> Based on your argument that we want to have a bucket load which is, on
> average, the size of NTUP_PER_BUCKET, I have to '-1' on this.

That is not my argument. I am pointing out that the comments claim the
code does that, and yet the code does not.

To be honest, I don't read that comment quite the same way as you do. 

Tom's wish for an average tuples per bucket of 10 may be connected to
that error; I can't say. But we definitely don't end up with an
average of 10 in many cases, which is the basis for previous poor
performance reports.

I don't believe Tom wishes for an average of 10..  That's just what NTUP_PER_BUCKET has always been set to. 

The bottom line is that you're hoping for perfection. If we knew
exactly how many tuples were in the hash table then we could size it
precisely for the hash join we are about to perform. We don't know
that, so we can't size it precisely. Worse than that is that we know
we badly estimate the number of buckets on a regular basis.

I'm actually not trying for perfection. What I think we should start with is to at least not shoot ourselves in the foot by cutting the bucket size to one-tenth of our distinct tuple estimate and virtually guaranteeing lots of true collisions which are expensive. 
 
That gives us three choices: if the hash table size is fixed without
prior knowledge, then we either (0) continue to underallocate buckets
in many cases or (1) potentially overallocate buckets. Or (2) we learn
to cope better with the variation in the number of entries. If we rule
out the first two we have only the last one remaining.

(1) will always be a heuristic, and as you point out could itself be
an extreme setting

A heuristic based on our estimates is a good choice, imv.  I do think we can do better than we are now though. 

So I think that (2) is the best route: Given that we know with much
better certainty the number of rows in the scanned-relation, we should
be able to examine our hash table after it has been built and decide
whether it would be cheaper to rebuild the hash table with the right
number of buckets, or continue processing with what we have now. Which
is roughly what Heikki proposed already, in January.

I'm actually not a huge fan of this as it's certainly not cheap to do. If it can be shown to be better than an improved heuristic then perhaps it would work but I'm not convinced. 

One other way to address having a smaller actual hash table is to come up with another way to detect empty parts of the hash space, perhaps by using a bitmap to represent the hash space (obviously the individual bits would represent some chunk of the 32-bit hash space, perhaps as much as 1/64th, allowing our bitmap to fit into a 64-bit int; that may end up being too small to actually help though). This would mean that when we did get to scanning a bucket we would at least be much more likely to find a match instead of scanning it and finding nothing. 
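A minimal sketch of that, with the 1/64th granularity mentioned (whether 64
bits is too coarse to leave any empty space is exactly the open question):

  uint64      filter = 0;     /* one bit per 1/64th of the 32-bit hash space */

  /* build side: mark the slice each hashed inner tuple falls into */
  filter |= UINT64CONST(1) << (hashvalue >> 26);

  /* probe side: a clear bit proves no inner tuple can match this hash value */
  if ((filter & (UINT64CONST(1) << (hashvalue >> 26))) == 0)
  {
      /* skip the bucket scan entirely */
  }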

Thanks,

Stephen 

Re: A better way than tweaking NTUP_PER_BUCKET

From
Heikki Linnakangas
Date:
On 22.06.2013 19:19, Simon Riggs wrote:
> So I think that (2) is the best route: Given that we know with much
> better certainty the number of rows in the scanned-relation, we should
> be able to examine our hash table after it has been built and decide
> whether it would be cheaper to rebuild the hash table with the right
> number of buckets, or continue processing with what we have now. Which
> is roughly what Heikki proposed already, in January.

Back in January, I wrote a quick patch to experiment with rehashing when
the hash table becomes too full. It was too late to make it into 9.3 so
I didn't pursue it further back then, but IIRC it worked. If we have the
capability to rehash, the accuracy of the initial guess becomes much
less important.
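(For readers without the attachment: the core of such a rehash is roughly the
loop below — a sketch, not the actual patch. The stored hash values are
reused, so tuples are re-linked, never re-hashed.)

  for (i = 0; i < nbuckets_old; i++)
  {
      HashJoinTuple tuple = oldbuckets[i];

      while (tuple != NULL)
      {
          HashJoinTuple next = tuple->next;
          int         bucketno = tuple->hashvalue % nbuckets_new;

          tuple->next = newbuckets[bucketno];
          newbuckets[bucketno] = tuple;
          tuple = next;
      }
  }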

- Heikki

Attachment

Re: A better way than tweaking NTUP_PER_BUCKET

From
Simon Riggs
Date:
On 22 June 2013 21:40, Stephen Frost <sfrost@snowman.net> wrote:

> I'm actually not a huge fan of this as it's certainly not cheap to do. If it
> can be shown to be better than an improved heuristic then perhaps it would
> work but I'm not convinced.

We need two heuristics, it would seem:

* an initial heuristic to overestimate the number of buckets when we
have sufficient memory to do so

* a heuristic to determine whether it is cheaper to rebuild a dense
hash table into a better one.

Although I like Heikki's rebuild approach we can't do this every x2
overstretch. Given large underestimates exist we'll end up rehashing
5-12 times, which seems bad. Better to let the hash table build and
then re-hash once, if we can see it will be useful.

OK?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: A better way than tweaking NTUP_PER_BUCKET

From
Stephen Frost
Date:
On Saturday, June 22, 2013, Heikki Linnakangas wrote:
On 22.06.2013 19:19, Simon Riggs wrote:
So I think that (2) is the best route: Given that we know with much
better certainty the number of rows in the scanned-relation, we should
be able to examine our hash table after it has been built and decide
whether it would be cheaper to rebuild the hash table with the right
number of buckets, or continue processing with what we have now. Which
is roughly what Heikki proposed already, in January.

Back in January, I wrote a quick patch to experiment with rehashing when the hash table becomes too full. It was too late to make it into 9.3 so I didn't pursue it further back then, but IIRC it worked. If we have the capability to rehash, the accuracy of the initial guess becomes much less important.

What we're hashing isn't going to change mid-way through or be updated after we've started doing lookups against it. 

Why not simply scan and queue the data and then build the hash table right the first time?  Also, this patch doesn't appear to address dups and therefore would rehash unnecessarily. There's no point rehashing into more buckets if the buckets are only deep due to lots of dups. Figuring out how many distinct values there are, in order to build the best hash table, is actually pretty expensive compared to how quickly we can build the table today. Lastly, this still encourages collisions due to too few buckets. If we would simply start with more buckets outright we'd reduce the need to rehash..

Thanks,

Stephen

Re: A better way than tweaking NTUP_PER_BUCKET

From
Stephen Frost
Date:
On Saturday, June 22, 2013, Simon Riggs wrote:
On 22 June 2013 21:40, Stephen Frost <sfrost@snowman.net> wrote:

> I'm actually not a huge fan of this as it's certainly not cheap to do. If it
> can be shown to be better than an improved heuristic then perhaps it would
> work but I'm not convinced.

We need two heuristics, it would seem:

* an initial heuristic to overestimate the number of buckets when we
have sufficient memory to do so

* a heuristic to determine whether it is cheaper to rebuild a dense
hash table into a better one.

Although I like Heikki's rebuild approach we can't do this every x2
overstretch. Given large underestimates exist we'll end up rehashing
5-12 times, which seems bad. Better to let the hash table build and
then re-hash once, if we can see it will be useful.

OK?

I've been thinking a bit more on your notion of simply using as much memory as we're permitted, but maybe adjust it down based on how big the input to the hash table is (which I think we have better stats on, and even if we don't, we could easily keep track of how many tuples we've seen and consider rehashing as we go).  Still doesn't really address the issue of dups though. It may still be much larger than it should be if there's a lot of duplicates in the input that hash into a much smaller set of buckets.

Will think on it more. 

Thanks,

Stephen

Re: A better way than tweaking NTUP_PER_BUCKET

From
Simon Riggs
Date:
On 23 June 2013 03:16, Stephen Frost <sfrost@snowman.net> wrote:

>  Still doesn't really address the issue of dups though.

Checking for duplicates in all cases would be wasteful, since often we
are joining to the PK of a smaller table.

If duplicates are possible at all for a join, then it would make sense
to build the hash table more carefully to remove dupes. I think we
should treat that as a separate issue.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: A better way than tweaking NTUP_PER_BUCKET

From
Simon Riggs
Date:
On 23 June 2013 03:16, Stephen Frost <sfrost@snowman.net> wrote:

> Will think on it more.

Some other thoughts related to this...

* Why are we building a special kind of hash table? Why don't we just
use the hash table code that we use in every other place in the backend.
If that code is so bad why do we use it everywhere else? That is
extensible, so we could try just using that. (Has anyone actually
tried?)

* We're not thinking about cache locality and set correspondence
either. If the join is expected to hardly ever match, then we should
be using a bitmap as a bloom filter rather than assuming that a very
large hash table is easily accessible.

* The skew hash table will be hit frequently and would show good L2
cache usage. I think I'll try adding the skew table always to see if
that improves the speed of the hash join.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: A better way than tweaking NTUP_PER_BUCKET

From
Stephen Frost
Date:
On Sunday, June 23, 2013, Simon Riggs wrote:
On 23 June 2013 03:16, Stephen Frost <sfrost@snowman.net> wrote:

>  Still doesn't really address the issue of dups though.

Checking for duplicates in all cases would be wasteful, since often we
are joining to the PK of a smaller table.

Well, that's what ndistinct is there to help us figure out. If we don't trust that though...
 
If duplicates are possible at all for a join, then it would make sense
to build the hash table more carefully to remove dupes. I think we
should treat that as a separate issue.

We can't simply remove the dups... We have to return all the matching dups in the join.  I did write a patch which created a 2-level list structure where the first level was uniques and the 2nd was dups, but it was extremely expensive to create the hash table then and scanning it wasn't much cheaper. 
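(Roughly, reconstructed from the description above — these names are made up, not the actual patch:)

  typedef struct HashDupEntry
  {
      struct HashDupEntry *next;      /* next duplicate of the same key */
      MinimalTuple tuple;
  } HashDupEntry;

  typedef struct HashDistinctEntry
  {
      struct HashDistinctEntry *next; /* next distinct value in this bucket */
      uint32      hashvalue;
      MinimalTuple tuple;             /* first tuple with this value */
      HashDupEntry *dups;             /* remaining duplicates, or NULL */
  } HashDistinctEntry;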

Thanks,

Stephen

Re: A better way than tweaking NTUP_PER_BUCKET

From
Stephen Frost
Date:
On Sunday, June 23, 2013, Simon Riggs wrote:
On 23 June 2013 03:16, Stephen Frost <sfrost@snowman.net> wrote:

> Will think on it more.

Some other thoughts related to this...

* Why are we building a special kind of hash table? Why don't we just
use the hash table code that we use in every other place in the backend.
If that code is so bad why do we use it everywhere else? That is
extensible, so we could try just using that. (Has anyone actually
tried?)

I've not looked at the hash table in the rest of the backend. 
 
* We're not thinking about cache locality and set correspondence
either. If the join is expected to hardly ever match, then we should
be using a bitmap as a bloom filter rather than assuming that a very
large hash table is easily accessible.

That's what I was suggesting earlier, though I don't think it's technically a bloom filter - doesn't that require multiple hash functions? I don't think we want to require every data type to provide multiple hash functions.
 
* The skew hash table will be hit frequently and would show good L2
cache usage. I think I'll try adding the skew table always to see if
that improves the speed of the hash join.

The skew table is just for common values though...   To be honest, I have some doubts about that structure really being a terribly good approach for anything which is completely in memory. 

Thanks,

Stephen 

Re: A better way than tweaking NTUP_PER_BUCKET

From
Heikki Linnakangas
Date:
On 23.06.2013 01:48, Simon Riggs wrote:
> On 22 June 2013 21:40, Stephen Frost<sfrost@snowman.net>  wrote:
>
>> I'm actually not a huge fan of this as it's certainly not cheap to do. If it
>> can be shown to be better than an improved heuristic then perhaps it would
>> work but I'm not convinced.
>
> We need two heuristics, it would seem:
>
> * an initial heuristic to overestimate the number of buckets when we
> have sufficient memory to do so
>
> * a heuristic to determine whether it is cheaper to rebuild a dense
> hash table into a better one.
>
> Although I like Heikki's rebuild approach we can't do this every x2
> overstretch. Given large underestimates exist we'll end up rehashing
> 5-12 times, which seems bad.

It's not very expensive. The hash values of all tuples have already been 
calculated, so rebuilding just means moving the tuples to the right bins.

> Better to let the hash table build and
> then re-hash once, if we can see it will be useful.

That sounds even less expensive, though.

- Heikki



Re: A better way than tweaking NTUP_PER_BUCKET

From
Atri Sharma
Date:
On Sun, Jun 23, 2013 at 8:41 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 23.06.2013 01:48, Simon Riggs wrote:
>>
>> On 22 June 2013 21:40, Stephen Frost<sfrost@snowman.net>  wrote:
>>
>>> I'm actually not a huge fan of this as it's certainly not cheap to do. If
>>> it
>>> can be shown to be better than an improved heuristic then perhaps it
>>> would
>>> work but I'm not convinced.
>>
>>
>> We need two heuristics, it would seem:
>>
>> * an initial heuristic to overestimate the number of buckets when we
>> have sufficient memory to do so
>>
>> * a heuristic to determine whether it is cheaper to rebuild a dense
>> hash table into a better one.
>>
>> Although I like Heikki's rebuild approach we can't do this every x2
>> overstretch. Given large underestimates exist we'll end up rehashing
>> 5-12 times, which seems bad.
>
>
> It's not very expensive. The hash values of all tuples have already been
> calculated, so rebuilding just means moving the tuples to the right bins.
>
>
>> Better to let the hash table build and
>> then re-hash once, if we can see it will be useful.
>
>
> That sounds even less expensive, though.

Hi all,

I just popped in here on Simon's advice to put an idea I had about
optimizing hash joins on this thread.

Essentially, I was thinking of using bloom filters in the part where
we match the actual hash values of the outer relation's tuple and the
hash table. We could do a bloom filter lookup first, and if it gives a
positive, then we can proceed like we do currently, since we could
have a false positive. However, if we have a negative from the bloom
filter, then, we can skip that tuple since bloom filters never give
false negatives.
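In probe-side terms the shape would be roughly this (bloom_might_contain() and
the filter itself are hypothetical; only ExecScanHashBucket() is the existing
routine):

  if (!bloom_might_contain(filter, hashvalue))
      return false;               /* guaranteed miss: never touch the hash table */

  /* possible match (may be a false positive): do the normal bucket scan */
  return ExecScanHashBucket(hjstate, econtext);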

Another thing we could potentially look at is that in the case when
the hash table doesn't fit into memory, and we have to do disk lookups,
then we could do bloom filter lookups first in order to select the
tuples to be read from disk (only the positive ones).

Thoughts/Comments please?

Regards,

Atri



--
Regards,

Atri
l'apprenant



Re: A better way than tweaking NTUP_PER_BUCKET

From
Stephen Frost
Date:
Atri,

* Atri Sharma (atri.jiit@gmail.com) wrote:
> I just popped in here on Simon's advice to put an idea I had about
> optimizing hash joins on this thread.

I'd encourage reading the thread a bit first, in the future.. :)

> Essentially, I was thinking of using bloom filters in the part where
> we match the actual hash values of the outer relation's tuple and the
> hash table. We could do a bloom filter lookup first, and if it gives a
> positive, then we can proceed like we do currently, since we could
> have a false positive. However, if we have a negative from the bloom
> filter, then, we can skip that tuple since bloom filters never give
> false negatives.

I suggested this up-thread already, but it's not really a bloom filter
as there's only one hash function available- I can't see us requiring
every data type to provide multiple hash functions.  Still, I do think
breaking the single 32-bit hash key space up into fixed-sized chunks and
then having a bitfield array which we test against (very similar to how
the visibility map works) to see if there's any chance that a given hash
key exists might be valuable.  The problem is that, because we don't
have multiple hash functions, it's not clear how much "empty" space we'd
actually end up with.

This needs to be tested with different data sets and different sizes for
the bitfield (leading to correspondingly different sizes for the amount
of key space covered  by a single bit), along with performance metrics
which determine how expensive it is to build and use the bitfield.
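As a purely illustrative starting point for such tests, something like the
following, where FILTER_LOG2_BITS is the knob to vary (16 here means 65536
bits, i.e. one bit per 65536 consecutive hash values, and an 8 kB filter):

  #define FILTER_LOG2_BITS    16

  static uint8 filter[1 << (FILTER_LOG2_BITS - 3)];

  static inline void
  filter_add(uint32 hashvalue)
  {
      uint32      bit = hashvalue >> (32 - FILTER_LOG2_BITS);

      filter[bit >> 3] |= 1 << (bit & 7);
  }

  static inline bool
  filter_might_contain(uint32 hashvalue)
  {
      uint32      bit = hashvalue >> (32 - FILTER_LOG2_BITS);

      return (filter[bit >> 3] & (1 << (bit & 7))) != 0;
  }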

> Another thing we could potentially look at is that in the case when
> the hash table doesn't fit into memory, and we have to do disk lookups,
> then we could do bloom filter lookups first in order to select the
> tuples to be read from disk (only the positive ones).

We could have a bitfield filter (as I described above) created for each
bucket and then test against that before considering if we actually have
to go look in that bucket, yes.  I'm not sure if that's quite what you
were thinking, but I can see how a bitfield per bucket might work.  If
you were suggesting something else, please clarify.
Thanks,
    Stephen

Re: A better way than tweaking NTUP_PER_BUCKET

From
Atri Sharma
Date:
On Wed, Jun 26, 2013 at 7:12 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Atri,
>
> * Atri Sharma (atri.jiit@gmail.com) wrote:
>> I just popped in here on Simon's advice to put an idea I had about
>> optimizing hash joins on this thread.
>
> I'd encourage reading the thread a bit first, in the future.. :)
>
Yeah, I actually read a bit (admittedly, not much) of the above thread. I
was following it a bit as well.

> I suggested this up-thread already, but it's not really a bloom filter
> as there's only one hash function available- I can't see us requiring
> every data type to provide multiple hash functions.  Still, I do think
> breaking the single 32-bit hash key space up into fixed-sized chunks and
> then having a bitfield array which we test against (very similar to how
> the visibility map works) to see if there's any chance that a given hash
> key exists might be valuable.  The problem is that, because we don't
> have multiple hash functions, it's not clear how much "empty" space we'd
> actually end up with.

Agreed.

> We could have a bitfield filter (as I described above) created for each
> bucket and then test against that before considering if we actually have
> to go look in that bucket, yes.  I'm not sure if that's quite what you
> were thinking, but I can see how a bitfield per bucket might work.  If
> you were suggesting something else, please clarify.

Yeah, this is what I wanted.

My point is that I would like to help in the implementation, if possible. :)

Regards,

Atri


--
Regards,

Atri
l'apprenant



Re: A better way than tweaking NTUP_PER_BUCKET

From
Stephen Frost
Date:
* Atri Sharma (atri.jiit@gmail.com) wrote:
> My point is that I would like to help in the implementation, if possible. :)

Feel free to go ahead and implement it..  I'm not sure when I'll have a
chance to (probably not in the next week or two anyway).  Unfortunately,
the bigger issue here is really about testing the results and
determining if it's actually faster/better with various data sets
(including ones which have duplicates).  I've got one test data set
which has some interesting characteristics (for one thing, hashing the
"large" side and then seq-scanning the "small" side is actually faster
than going the other way, which is quite 'odd' imv for a hashing
system): http://snowman.net/~sfrost/test_case2.sql

You might also look at the other emails that I sent regarding this
subject and NTUP_PER_BUCKET.  Having someone confirm what I saw wrt
changing that parameter would be nice and it would be a good comparison
point against any kind of pre-filtering that we're doing.

One thing that re-reading the bloom filter description reminded me of is
that it's at least conceivable that we could take the existing hash
functions for each data type and do double-hashing or perhaps seed the
value to be hashed with additional data to produce an "independent" hash
result to use.  Again, a lot of things that need to be tested and
measured to see if they improve overall performance.
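For example (hash_uint32() does exist in the backend; whether a value derived
like this behaves independently enough is precisely what would need measuring):

  uint32      h1 = hashvalue;     /* the hash we already have */
  uint32      h2 = DatumGetUInt32(hash_uint32(hashvalue ^ 0x9E3779B9));

  /* a two-function filter would then set/test both h1 % nbits and h2 % nbits */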
Thanks,
    Stephen

Re: A better way than tweaking NTUP_PER_BUCKET

From
Atri Sharma
Date:
On Wed, Jun 26, 2013 at 9:20 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Atri Sharma (atri.jiit@gmail.com) wrote:
>> My point is that I would like to help in the implementation, if possible. :)
>
> Feel free to go ahead and implement it..  I'm not sure when I'll have a
> chance to (probably not in the next week or two anyway).  Unfortunately,
> the bigger issue here is really about testing the results and
> determining if it's actually faster/better with various data sets
> (including ones which have duplicates).  I've got one test data set
> which has some interesting characteristics (for one thing, hashing the
> "large" side and then seq-scanning the "small" side is actually faster
> than going the other way, which is quite 'odd' imv for a hashing
> system): http://snowman.net/~sfrost/test_case2.sql
>
> You might also look at the other emails that I sent regarding this
> subject and NTUP_PER_BUCKET.  Having someone confirm what I saw wrt
> changing that parameter would be nice and it would be a good comparison
> point against any kind of pre-filtering that we're doing.
>
> One thing that re-reading the bloom filter description reminded me of is
> that it's at least conceivable that we could take the existing hash
> functions for each data type and do double-hashing or perhaps seed the
> value to be hashed with additional data to produce an "independent" hash
> result to use.  Again, a lot of things that need to be tested and
> measured to see if they improve overall performance.

Right, let me look. Although I am pretty busy atm with ordered set
functions, so I will get it done maybe the last week of this month.

Another thing I believe in is that we should have multiple hashing
functions for bloom filters, which generate different probability
values so that the coverage is good.

Regards,

Atri

--
Regards,

Atri
l'apprenant



Re: A better way than tweaking NTUP_PER_BUCKET

From
Stephen Frost
Date:
* Atri Sharma (atri.jiit@gmail.com) wrote:
> Right, let me look. Although I am pretty busy atm with ordered set
> functions, so I will get it done maybe the last week of this month.

Isn't it currently the last week of this month? :)  I'm guessing you
mean July.

> Another thing I believe in is that we should have multiple hashing
> functions for bloom filters, which generate different probability
> values so that the coverage is good.

I really don't see that happening, to be honest..  I think it would be
interesting to try some of the surrogate-additional-hashing that I
described.
Thanks,
    Stephen

Re: A better way than tweaking NTUP_PER_BUCKET

From
Atri Sharma
Date:
>
> Isn't it currently the last week of this month? :)  I'm guessing you
> mean July.

Heh, no. I lose track of time these days. Alright, second week of July then.


> I really don't see that happening, to be honest..  I think it would be
> interesting to try some of the surrogate-additional-hashing that I
> described.

Yes, I agree to that.

Let me think more and get back on this.

Regards,

Atri



--
Regards,

Atri
l'apprenant



Re: A better way than tweaking NTUP_PER_BUCKET

From
Jeff Janes
Date:
On Sun, Jun 23, 2013 at 2:45 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 23 June 2013 03:16, Stephen Frost <sfrost@snowman.net> wrote:

> Will think on it more.

Some other thoughts related to this...

* Why are we building a special kind of hash table? Why don't we just
use the hash table code that we use in every other place in the backend.
If that code is so bad why do we use it everywhere else? That is
extensible, so we could try just using that. (Has anyone actually
tried?)

Do you mean dynahash.c?

Dynahash has no provision to spill to disk gracefully, which seems like a huge limit for this purpose, but not one for the purposes it currently serves.  Also, dynahash is built around use of C structs, not tuples, as the things being stored.  Perhaps that is not the mis-match I think it is?
 

* We're not thinking about cache locality and set correspondence
either. If the join is expected to hardly ever match, then we should
be using a bitmap as a bloom filter rather than assuming that a very
large hash table is easily accessible.

* The skew hash table will be hit frequently and would show good L2
cache usage. I think I'll try adding the skew table always to see if
that improves the speed of the hash join.

Is there a good description of what the skew table is; and of the hash join code in general?  Last time I dug into nodeHash.c, I realized it really needs a README file (unless there is already one I missed), or at least some high level comments.  Most of the comments in there seem to just restate what the code itself already says, without explaining why it is being done or how it fits into the big picture.
 
Cheers,

Jeff

Re: A better way than tweaking NTUP_PER_BUCKET

From
Atri Sharma
Date:
Hi all,

I have been working on a patch for the above discussed
functionalities. I made an array of int32s, one for each bucket in a
hash table(the array is per hash table).

I am using the hash value directly to set the corresponding bit in the
bit field. Specifically:

bitfield_orvalue = 1<<hashvalue;
hashtable->bitFields[bucketNumber] =
(hashtable->bitFields[bucketNumber]) |bitfield_orvalue;

But, the hash values are way beyond this, and I admit that my choice of
int32 as bitfield isn't correct here.

The hash values are like:

1359910425
1359910425
-1662820951
-1662820951

What should I be using for the bit map?
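(One possible shape, as a sketch only: rather than shifting by the whole hash
value, pick a fixed handful of its bits to index the word, and treat the field
as unsigned so negative-looking values are not a problem — which bits to use,
and how wide the per-bucket word should be, is a tuning choice:)

  uint32      bit = hashvalue >> 27;  /* top 5 bits select one of 32 filter bits */

  hashtable->bitFields[bucketNumber] |= (uint32) 1 << bit;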

Regards,

Atri

--
Regards,

Atri
l'apprenant



Re: A better way than tweaking NTUP_PER_BUCKET

From
Bruce Momjian
Date:
Uh, where are we on this?  Is it a TODO?

---------------------------------------------------------------------------

On Wed, Jul  3, 2013 at 01:39:28PM +0530, Atri Sharma wrote:
> Hi all,
> 
> I have been working on a patch for the above discussed
> functionalities. I made an array of int32s, one for each bucket in a
> hash table(the array is per hash table).
> 
> I am using the hash value directly to set the corresponding bit in the
> bit field. Specifically:
> 
> bitfield_orvalue = 1<<hashvalue;
> hashtable->bitFields[bucketNumber] =
> (hashtable->bitFields[bucketNumber]) |bitfield_orvalue;
> 
> But, the hash values are way beyond this, and I admit that my choice of
> int32 as bitfield isn't correct here.
> 
> The hash values are like:
> 
> 1359910425
>  1359910425
>  -1662820951
>  -1662820951
> 
> What should I be using for the bit map?
> 
> Regards,
> 
> Atri
> 
> --
> Regards,
> 
> Atri
> l'apprenant
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: A better way than tweaking NTUP_PER_BUCKET

From
Stephen Frost
Date:
* Bruce Momjian (bruce@momjian.us) wrote:
> Uh, where are we on this?  Is it a TODO?

I've been strongly considering my previous patch which tweaked
NTUP_PER_BUCKET to '1' (instead of the default '10') when there's
sufficient work_mem for it.  There was recently another complaint on IRC
about our tendency to hash the larger partition rather than the smaller
one which I believe would be resolved by doing so.

The main thing holding me back has been concern that there may be cases
which perform worse with the change, either because hashing the larger
partition actually ended up being faster or due to the increase in
memory usage.

In the end, I believe we absolutely should do something about this.
Hashing a 64M-row table (requiring upwards of 8G) instead of hashing
a 2M-row table is really bad of us.
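For reference, the tweak amounts to something like this in
ExecChooseHashTableSize() terms (a sketch of the idea, not the actual patch;
the "sufficient work_mem" test is the part that needs tuning):

  int         ntup_per_bucket = NTUP_PER_BUCKET;

  /* if the build side fits comfortably in work_mem, aim for ~1 tuple per bucket */
  if (inner_rel_bytes * 2 < hash_table_bytes)
      ntup_per_bucket = 1;

  dbuckets = ceil(ntuples / ntup_per_bucket);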

Thoughts?
Thanks,
    Stephen

Re: A better way than tweaking NTUP_PER_BUCKET

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> * Bruce Momjian (bruce@momjian.us) wrote:
>> Uh, were are we on this?  Is it a TODO?

> I've been strongly considering my previous patch which tweaked
> NTUP_PER_BUCKET to '1' (instead of the default '10') when there's
> sufficient work_mem for it.  There was recently another complaint on IRC
> about our tendency to hash the larger partition rather than the smaller
> one which I believe would be resolved by doing so.

> The main thing holding me back has been concern that there may be cases
> which perform worse with the change, either because hashing the larger
> partition actually ended up being faster or due to the increase in
> memory usage.

> In the end, I believe we absolutely should do something about this.
> Hashing a 64M-row table (requiring upwards of 8G) instead of hashing
> a 2M-row table is really bad of us.

Huh?  I don't see anything in the thread suggesting that we're doing that,
or that changing NTUP_PER_BUCKET would fix it if we do.  Are you thinking
of some other discussion?

AFAICT, there was no consensus in this thread on what to do, which
probably has something to do with the lack of concrete performance
tests presented to back up any particular proposal.
        regards, tom lane



Re: A better way than tweaking NTUP_PER_BUCKET

From
Stephen Frost
Date:
Tom,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > In the end, I believe we absolutely should do something about this.
> > Hashing a 64M-row table (requiring upwards of 8G) instead of hashing
> > a 2M-row table is really bad of us.
>
> Huh?  I don't see anything in the thread suggesting that we're doing that,
> or that changing NTUP_PER_BUCKET would fix it if we do.  Are you thinking
> of some other discussion?

This thread sprang from, or was at least related to, as I recall, my
issues with NTUP_PER_BUCKET over the summer.  Perhaps I'm wrong, but
here's the thread I was referring to:

http://www.postgresql.org/message-id/20130404201612.GM4361@tamriel.snowman.net

That's where I demonstrated that we decide to hash a much larger table,
rather than the smaller one, based on the estimated depth of the buckets
and the costing derived from it, which in turn depends on how big we
decide to make the hash table; there we use NTUP_PER_BUCKET to pick a
table size much smaller than it really should be.

> AFAICT, there was no consensus in this thread on what to do, which
> probably has something to do with the lack of concrete performance
> tests presented to back up any particular proposal.

This I entirely agree with- more testing and more information on how
such a change impacts other workloads would be great.  Unfortunately,
while I've provided a couple of test cases and seen similar situations
on IRC, this is very data-dependent which makes it difficult to have
concrete answers for every workload.

Still, I'll try and spend some time w/ pg_bench's schema definition and
writing up some larger queries to run through it (aiui, the default set
of queries won't typically result in a hashjoin) and see what happens
there.
Thanks,
    Stephen

Re: A better way than tweaking NTUP_PER_BUCKET

From
Simon Riggs
Date:
On 25 January 2014 22:33, Stephen Frost <sfrost@snowman.net> wrote:

> * Tom Lane (tgl@sss.pgh.pa.us) wrote:

>> AFAICT, there was no consensus in this thread on what to do, which
>> probably has something to do with the lack of concrete performance
>> tests presented to back up any particular proposal.
>
> This I entirely agree with- more testing and more information on how
> such a change impacts other workloads would be great.  Unfortunately,
> while I've provided a couple of test cases and seen similar situations
> on IRC, this is very data-dependent which makes it difficult to have
> concrete answers for every workload.
>
> Still, I'll try and spend some time w/ pg_bench's schema definition and
> writing up some larger queries to run through it (aiui, the default set
> of queries won't typically result in a hashjoin) and see what happens
> there.

The case that action of some kind was needed was clear, for me.
Hopefully some small improvement can be found from that investigation,
even if the greatest gain is in some way under dispute.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: A better way than tweaking NTUP_PER_BUCKET

From
Atri Sharma
Date:

Sent from my iPad

> On 26-Jan-2014, at 4:38, Simon Riggs <simon@2ndQuadrant.com> wrote:
>
>> On 25 January 2014 22:33, Stephen Frost <sfrost@snowman.net> wrote:
>>
>> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>
>>> AFAICT, there was no consensus in this thread on what to do, which
>>> probably has something to do with the lack of concrete performance
>>> tests presented to back up any particular proposal.
>>
>> This I entirely agree with- more testing and more information on how
>> such a change impacts other workloads would be great.  Unfortunately,
>> while I've provided a couple of test cases and seen similar situations
>> on IRC, this is very data-dependent which makes it difficult to have
>> concrete answers for every workload.
>>
>> Still, I'll try and spend some time w/ pg_bench's schema definition and
>> writing up some larger queries to run through it (aiui, the default set
>> of queries won't typically result in a hashjoin) and see what happens
>> there.
>
> The case that action of some kind was needed was clear, for me.
> Hopefully some small improvement can be found from that investigation,
> even if the greatest gain is in some way

I really don't see a way to improve performance in this field except for introducing completely new logic (if we don't hack NTUP_PER_BUCKET).

My previous patch added bloom filters as an additional layer which was checked before the actual hash join lookup, hence reducing the number of hash lookups for negatives.

The size of bloom filters required for any significant performance gain, weighed against the overhead added by the additional bloom filter lookup, was quite big, hence it did not make sense to add them.

Upon my discussion with Stephen recently, I thought of redoing the patch based on some new tests, but I haven't really been able to work on that.

Regards,

Atri


Re: A better way than tweaking NTUP_PER_BUCKET

From
Simon Riggs
Date:
On 25 January 2014 23:08, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 25 January 2014 22:33, Stephen Frost <sfrost@snowman.net> wrote:
>
>> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>
>>> AFAICT, there was no consensus in this thread on what to do, which
>>> probably has something to do with the lack of concrete performance
>>> tests presented to back up any particular proposal.
>>
>> This I entirely agree with- more testing and more information on how
>> such a change impacts other workloads would be great.  Unfortunately,
>> while I've provided a couple of test cases and seen similar situations
>> on IRC, this is very data-dependent which makes it difficult to have
>> concrete answers for every workload.
>>
>> Still, I'll try and spend some time w/ pg_bench's schema definition and
>> writing up some larger queries to run through it (aiui, the default set
>> of queries won't typically result in a hashjoin) and see what happens
>> there.
>
> The case that action of some kind was needed was clear, for me.
> Hopefully some small improvement can be found from that investigation,
> even if the greatest gain is in some way under dispute.

I don't see anything for 9.4 in here now.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: A better way than tweaking NTUP_PER_BUCKET

From
Stephen Frost
Date:
* Simon Riggs (simon@2ndQuadrant.com) wrote:
> I don't see anything for 9.4 in here now.

Attached is what I was toying with (thought I had attached it previously
somewhere..  perhaps not), but in re-testing, it doesn't appear to do
enough to move things in the right direction in all cases.  I did play
with this a fair bit yesterday and while it improved some cases by 20%
(eg: a simple join between pgbench_accounts and pgbench_history), when
we decide to *still* hash the larger side (as in my 'test_case2.sql'),
it can cause a similarly-sized decrease in performance.  Of course, if
we can push that case to hash the smaller side (which I did by hand with
cpu_tuple_cost), then it goes back to being a win to use a larger number
of buckets.

I definitely feel that there's room for improvement here but it's not an
easily done thing, unfortunately.  To be honest, I was pretty surprised
when I saw that the larger number of buckets performed worse, even if it
was when we picked the "wrong" side to hash and I plan to look into that
more closely to try and understand what's happening.  My first guess
would be what Tom had mentioned over the summer- if the size of the
bucket array ends up being larger than the CPU cache, we can end up
paying a great deal more to build the hash table than it costs to scan
through the deeper buckets that we end up with as a result (particularly
when we're scanning a smaller table).  Of course, choosing to hash the
larger table makes that more likely..
Thanks,
    Stephen

Re: A better way than tweaking NTUP_PER_BUCKET

From
Stephen Frost
Date:
* Stephen Frost (sfrost@snowman.net) wrote:
> * Simon Riggs (simon@2ndQuadrant.com) wrote:
> > I don't see anything for 9.4 in here now.
>
> Attached [...]

I'm apparently bad at this 'attaching' thing, particularly on this
subject.

Here it is.

    Thanks,

        Stephen

Attachment

Re: A better way than tweaking NTUP_PER_BUCKET

From
Pavel Stehule
Date:



2014-01-27 Stephen Frost <sfrost@snowman.net>
* Simon Riggs (simon@2ndQuadrant.com) wrote:
> I don't see anything for 9.4 in here now.

Attached is what I was toying with (thought I had attached it previously
somewhere..  perhaps not), but in re-testing, it doesn't appear to do
enough to move things in the right direction in all cases.  I did play
with this a fair bit yesterday and while it improved some cases by 20%
(eg: a simple join between pgbench_accounts and pgbench_history), when
we decide to *still* hash the larger side (as in my 'test_case2.sql'),
it can cause a similarly-sized decrease in performance.  Of course, if
we can push that case to hash the smaller side (which I did by hand with
cpu_tuple_cost), then it goes back to being a win to use a larger number
of buckets.

I definitely feel that there's room for improvement here but it's not an
easily done thing, unfortunately.  To be honest, I was pretty surprised
when I saw that the larger number of buckets performed worse, even if it
was when we picked the "wrong" side to hash and I plan to look into that
more closely to try and understand what's happening.  My first guess
would be what Tom had mentioned over the summer- if the size of the
bucket array ends up being larger than the CPU cache, we can end up
paying a great deal more to build the hash table than it costs to scan
through the deeper buckets that we end up with as a result (particularly
when we're scanning a smaller table).  Of course, choosing to hash the
larger table makes that more likely..

This topic is interesting - we found very bad performance with hashing large tables with high work_mem. MergeJoin with quicksort was significantly faster.

I didn't do deeper research - there is a possibility of virtualization overhead.

Regards

Pavel
 

        Thanks,

                Stephen

Re: A better way than tweaking NTUP_PER_BUCKET

From
Simon Riggs
Date:
On 27 January 2014 17:44, Pavel Stehule <pavel.stehule@gmail.com> wrote:

> This topic is interesting - we found very bad performance with hashing large
> tables with high work_mem. MergeJoin with quicksort was significantly
> faster.

I've seen this also.

> I didn't deeper research - there is a possibility of virtualization
> overhead.

I took measurements and the effect was repeatable and happened for all
sizes of work_mem, but nothing more to add.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: A better way than tweaking NTUP_PER_BUCKET

From
Jeremy Harris
Date:
On 27/01/14 18:00, Simon Riggs wrote:
> On 27 January 2014 17:44, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
>> This topic is interesting - we found very bad performance with hashing large
>> tables with high work_mem. MergeJoin with quicksort was significantly
>> faster.
>
> I've seen this also.
>
>> I didn't deeper research - there is a possibility of virtualization
>> overhead.
>
> I took measurements and the effect was repeatable and happened for all
> sizes of work_mem, but nothing more to add.

FWIW my current list-based internal merge seems to perform worse at
larger work-mem, compared to quicksort.   I've been starting to wonder
if the rate of new ram-chip page opens is an issue (along with the
more-usually considered cache effects).  Any random-access workload
would be affected by this, if it really exists.
-- 
Cheers,   Jeremy




Re: A better way than tweaking NTUP_PER_BUCKET

From
Jeff Janes
Date:
On Mon, Jan 27, 2014 at 10:00 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 27 January 2014 17:44, Pavel Stehule <pavel.stehule@gmail.com> wrote:

> This topic is interesting - we found very bad performance with hashing large
> tables with high work_mem. MergeJoin with quicksort was significantly
> faster.

I've seen this also.

> I didn't deeper research - there is a possibility of virtualization
> overhead.

I took measurements and the effect was repeatable and happened for all
sizes of work_mem, but nothing more to add.

I get similar results if I join on integers.  But joining on text, the hash wins by a mile.

I use this as a simple test bed:

alter table pgbench_accounts drop CONSTRAINT pgbench_accounts_pkey;
update pgbench_accounts set filler = md5(aid::text);

set work_mem to whatever keeps the join off of disk for the given scale;
set enable_hashjoin to whatever;

select sum(a1.abalance*a2.abalance) from pgbench_accounts a1 join pgbench_accounts a2 using (aid);
select sum(a1.abalance*a2.abalance) from pgbench_accounts a1 join pgbench_accounts a2 using (filler);

hash integer:      1832.695 ms
merge integer:     1462.913 ms
hash text:         2353.115 ms
merge text:      11,218.628 ms

The cost estimates do not depend on the column used in the join despite a 6-fold difference in run time, so the planner is perhaps missing a trick there.

Cheers,

Jeff