Re: A better way than tweaking NTUP_PER_BUCKET - Mailing list pgsql-hackers

From:            Stephen Frost
Subject:         Re: A better way than tweaking NTUP_PER_BUCKET
Date:
Msg-id:          CAOuzzgo5F-t0R=9YoDGZN5=-PaQM9pUHEm3HvN0A-oX_MPSXFw@mail.gmail.com
In response to:  Re: A better way than tweaking NTUP_PER_BUCKET (Simon Riggs <simon@2ndQuadrant.com>)
Responses:       Re: A better way than tweaking NTUP_PER_BUCKET
List:            pgsql-hackers
On Saturday, June 22, 2013, Simon Riggs wrote:
> On 22 June 2013 14:48, Stephen Frost <sfrost@snowman.net> wrote:
>> Based on your argument that we want to have a bucket load which is, on
>> average, the size of NTUP_PER_BUCKET, I have to '-1' on this.
>
> That is not my argument. I am pointing out that the comments claim the
> code does that, and yet the code does not.
To be honest, I don't read that comment quite the same way as you do.
> Tom's wish for an average tuples per bucket of 10 may be connected to
> that error; I can't say. But we definitely don't end up with an
> average of 10 in many cases, which is the basis for previous poor
> performance reports.
I don't believe Tom wishes for an average of 10. That's just what NTUP_PER_BUCKET has always been set to.
> The bottom line is that you're hoping for perfection. If we knew
> exactly how many tuples were in the hash table then we could size it
> precisely for the hash join we are about to perform. We don't know
> that, so we can't size it precisely. Worse than that is that we know
> we badly estimate the number of buckets on a regular basis.
I'm actually not trying for perfection. What I think we should start with is to at least not shoot ourselves in the foot by cutting the number of buckets to one-tenth of our distinct-tuple estimate, virtually guaranteeing lots of real collisions, which are expensive.
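To make the arithmetic concrete, here's a rough sketch of the sizing logic in question (the real code lives in ExecChooseHashTableSize(); the rounding here is simplified for illustration):

    #define NTUP_PER_BUCKET 10

    /*
     * Simplified illustration: pick a power-of-2 bucket count for an
     * estimated number of build-side tuples, targeting NTUP_PER_BUCKET
     * tuples per bucket.  With ntuples = 1,000,000 this picks 131,072
     * buckets, so even a perfectly uniform hash leaves ~7.6 tuples
     * chained in every bucket for each probe to walk.
     */
    static int
    choose_nbuckets(double ntuples)
    {
        int     nbuckets = 1;

        while (nbuckets < ntuples / NTUP_PER_BUCKET)
            nbuckets <<= 1;

        return nbuckets;
    }

Shrinking that divisor toward 1 is the kind of heuristic improvement I have in mind.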
> That gives us three choices: if we have a hash table fixed without
> prior knowledge then we either (0) continue to underallocate them
> in many cases or (1) potentially overallocate buckets. Or (2) we learn
> to cope better with the variation in the number of entries. If we rule
> out the first two we have only the last one remaining.
>
> (1) will always be a heuristic, and as you point out could itself be
> an extreme setting.
A heuristic based on our estimates is a good choice, imv. I do think we can do better than we do now, though.
> So I think that (2) is the best route: Given that we know with much
> better certainty the number of rows in the scanned-relation, we should
> be able to examine our hash table after it has been built and decide
> whether it would be cheaper to rebuild the hash table with the right
> number of buckets, or continue processing with what we have now. Which
> is roughly what Heikki proposed already, in January.
I'm actually not a huge fan of this, as it's certainly not cheap to do. If it can be shown to be better than an improved heuristic then perhaps it would work, but I'm not convinced.
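To illustrate why I'm skeptical, the decision would presumably come down to arithmetic along these lines (purely a sketch; the names and cost model are invented here, not taken from Heikki's proposal):

    #include <stdbool.h>

    /*
     * Invented cost model, for illustration only: rebuilding re-links
     * every stored tuple once, while keeping the undersized table costs
     * extra chain-walking on every outer-side probe.
     */
    static bool
    rebuild_is_cheaper(double ntuples_built, double nbuckets_now,
                       double nbuckets_ideal, double outer_rows)
    {
        double  chain_now = ntuples_built / nbuckets_now;
        double  chain_ideal = ntuples_built / nbuckets_ideal;
        double  probe_penalty = outer_rows * (chain_now - chain_ideal);
        double  rebuild_cost = ntuples_built;

        return rebuild_cost < probe_penalty;
    }

The comparison itself is cheap; the problem is that the rebuild it triggers is a full pass over everything we just built, and outer_rows is still just an estimate.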
One other way to address having a smaller actual hash table would be to come up with another way to detect empty parts of the hash space, perhaps by using a bitmap to represent the hash space. Each bit would represent some chunk of the 32-bit hash space, perhaps as much as 1/64th, allowing the bitmap to fit into a 64-bit int; that may end up being too small to actually help, though. This would mean that when we did get around to scanning a bucket, we would at least be much more likely to find a match instead of scanning it and finding nothing.
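A rough sketch of what I mean, with the chunk size fixed at 1/64th of the hash space so the map fits in a single 64-bit integer (the names here are made up):

    #include <stdint.h>
    #include <stdbool.h>

    /*
     * One bit per 1/64th of the 32-bit hash space: the top six bits
     * of the hash value select the bit.  Set bits while building the
     * hash table, then test before walking a bucket on probe.
     */
    static uint64_t hash_space_map = 0;

    static inline void
    note_build_hash(uint32_t hashvalue)
    {
        hash_space_map |= UINT64_C(1) << (hashvalue >> 26);
    }

    static inline bool
    probe_may_match(uint32_t hashvalue)
    {
        return (hash_space_map >> (hashvalue >> 26)) & 1;
    }

Of course, with a well-distributed hash function and more than a few hundred build-side tuples, every bit ends up set, which is exactly the "too small to actually help" worry above.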
Thanks,
Stephen