Home > mailing lists

Thread: Hash index todo list item

Hash index todo list item

From

Kenneth Marshall

Date:

02 September 2007, 15:04:11

Dear PostgreSQL Hackers:

After following the hackers mailing list for quite a while,
I am going to start investigating what will need to be done
to improve hash index performance. Below are the pieces of
this project that I am currently considering:

1. Characterize the current hash index implementation against the BTree index, with a focus on space utilization and
lookupperformance against a collection of test data. This will give a baseline performance test to evaluate the impact
of changes. I initially do not plan to bench the hash creation process since my initial focus will be on lookup
performance.

2. Evaluate the performance of different hash index implementations and/or changes to the current implementation. My
currentplan is to keep the implementation as simple as possible and still provide the desired performance. Several
hashindex suggestions deal with changing the layout of the keys on a page to improve lookup performance, including
reducingthe bucket size to a fraction of a page or only storing the hash value on the page, instead of the index
valueitself. My goal in this phase is to produce one or more versions with better performance than the current BTree.

3. Look at build time and concurrency issues with the addition of some additional tests to the test bed. (1)

4. Repeat as needed.

This is the rough plan. Does anyone see anything critical that
is missing at this point? Please send me any suggestions for test
data and various performance test ideas, since I will be working
on that first.

Regards,
Ken Marshall

Re: Hash index todo list item

From

Tom Lane

Date:

02 September 2007, 23:41:28

Kenneth Marshall <ktm@rice.edu> writes:
> ... This is the rough plan. Does anyone see anything critical that
> is missing at this point?

Sounds pretty good.  Let me brain-dump one item on you: one thing that
hash currently has over btree is the ability to handle index items up
to a full page.  Now, if you go with a scheme that only stores hash
codes and not the underlying data, you can not only handle that but
improve on it; but if you reduce the bucket size and don't remove the
data, it'd be a step backward.  The idea I had about dealing with that
was to only reduce the size of primary buckets --- if it's necessary to
add overflow space to a bucket, the overflow units are still full pages.
So an index tuple up to a page in size can always be accommodated by
adding an overflow page to the bucket.

Just a thought, but AFAIR it's not in the archives anywhere.
        regards, tom lane

Re: Hash index todo list item

From

Simon Riggs

Date:

03 September 2007, 06:32:21

On Sun, 2007-09-02 at 13:04 -0500, Kenneth Marshall wrote:
> Dear PostgreSQL Hackers:
> 
> After following the hackers mailing list for quite a while,
> I am going to start investigating what will need to be done
> to improve hash index performance. Below are the pieces of
> this project that I am currently considering:
> 
> 1. Characterize the current hash index implementation against
>    the BTree index, with a focus on space utilization and
>    lookup performance against a collection of test data. This
>    will give a baseline performance test to evaluate the impact
>    of changes. I initially do not plan to bench the hash creation
>    process since my initial focus will be on lookup performance.
> 
> 2. Evaluate the performance of different hash index implementations
>    and/or changes to the current implementation. My current plan is
>    to keep the implementation as simple as possible and still provide
>    the desired performance. Several hash index suggestions deal with
>    changing the layout of the keys on a page to improve lookup
>    performance, including reducing the bucket size to a fraction of
>    a page or only storing the hash value on the page, instead of
>    the index value itself. My goal in this phase is to produce one
>    or more versions with better performance than the current BTree.
>    
> 3. Look at build time and concurrency issues with the addition of
>    some additional tests to the test bed. (1)
> 
> 4. Repeat as needed.
> 
> This is the rough plan. Does anyone see anything critical that
> is missing at this point? Please send me any suggestions for test
> data and various performance test ideas, since I will be working
> on that first.

Sounds good.

I'd be particularly interested in large indexes, say ~ 0.5 - 2GB. There
are likely to be various effects apparent as the indexes grow. It would
be too easy to do all the tests with smaller indexes and miss things.

Other factors are:
- volatility
- concurrency

My general experience is that hash-based indexes are better when the
range of inputs is relatively well-known, allowing a fast lookup. If
that is the only benefit of hash indexes, a flexible hashing scheme may
simply weaken the benefit-case for using them. If that's true, should
the index build process examine the key values in the data to determine
the best parameters to use? Kind of ANALYZE before build.

My current feeling is that they ought to be very good at handling
read-mostly situations such as privilege checking or UPDATE-intensive
situations such as Customer-Current-Credit tracking, when the number of
customers is large.

It might also be worth looking at lossy hash indexes, i.e. the index
stores only the block numbers. That would need to be part of the
discussion around how lossy we will allow indexes to be.

We currently have two kinds of full text index with different
concurrency use cases, so it should be acceptable to have hash indexes
have a clear benefit in one use case but a clear loss in another.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com

Re: Hash index todo list item

From

Kenneth Marshall

Date:

03 September 2007, 14:18:20

On Sun, Sep 02, 2007 at 10:41:22PM -0400, Tom Lane wrote:
> Kenneth Marshall <ktm@rice.edu> writes:
> > ... This is the rough plan. Does anyone see anything critical that
> > is missing at this point?
> 
> Sounds pretty good.  Let me brain-dump one item on you: one thing that
> hash currently has over btree is the ability to handle index items up
> to a full page.  Now, if you go with a scheme that only stores hash
> codes and not the underlying data, you can not only handle that but
> improve on it; but if you reduce the bucket size and don't remove the
> data, it'd be a step backward.  The idea I had about dealing with that
> was to only reduce the size of primary buckets --- if it's necessary to
> add overflow space to a bucket, the overflow units are still full pages.
> So an index tuple up to a page in size can always be accommodated by
> adding an overflow page to the bucket.
> 
> Just a thought, but AFAIR it's not in the archives anywhere.
> 
>             regards, tom lane
> 
Tom,

Thank you for the input. I agree that keeping the ability to accomodate
an index tuple up to a page is size worth keeping. I think that your
goal in reducing the bucket size is to improve the packing efficiency
of the index. Since the on disk page size remains the same, it may be
possible to use a different structure overlayed on the current bucket
size and still improve the packing efficiency of the index. After some
more mulling, here are some further thoughts on the improved hash table
implementation:

- Hash lookup is O(1) while btree is O(logN). Is there a value in optimizing the NOT case, i.e. the entry is not in the
table?

- Space versus performance trade-off. This may tie into cache efficiency and use of L2/L3, shared buffers, main memory.
Denserlayouts with a higher load facter may be a bit slower in lookups but play much nicer in a multi-user system. Look
atthe possibility of a lossy mapping?
 

- Build time versus update time. How does concurrency enter into the picture regarding simultaneous updates, inserts,
deletes,and lookups?
 

- Could a hybrid structure with some type of prefix compression give a more efficient layout and possibly improve
performance?

- Index larger fields. Btree is limited to blocksize/3, the current hash implementation can go up to a full block.

- What about multi-column indexes? The current implementation only supports 1 column.

More ideas are welcome and I will add them to the list for
investigation.

Regards,
Ken

Re: Hash index todo list item

From

Kenneth Marshall

Date:

03 September 2007, 14:36:53

On Mon, Sep 03, 2007 at 10:33:54AM +0100, Simon Riggs wrote:
> > 
> > This is the rough plan. Does anyone see anything critical that
> > is missing at this point? Please send me any suggestions for test
> > data and various performance test ideas, since I will be working
> > on that first.
> 
> Sounds good.
> 
> I'd be particularly interested in large indexes, say ~ 0.5 - 2GB. There
> are likely to be various effects apparent as the indexes grow. It would
> be too easy to do all the tests with smaller indexes and miss things.
> 
> Other factors are:
> - volatility
> - concurrency
> 
> My general experience is that hash-based indexes are better when the
> range of inputs is relatively well-known, allowing a fast lookup. If
> that is the only benefit of hash indexes, a flexible hashing scheme may
> simply weaken the benefit-case for using them. If that's true, should
> the index build process examine the key values in the data to determine
> the best parameters to use? Kind of ANALYZE before build.
> 
> My current feeling is that they ought to be very good at handling
> read-mostly situations such as privilege checking or UPDATE-intensive
> situations such as Customer-Current-Credit tracking, when the number of
> customers is large.
> 
> It might also be worth looking at lossy hash indexes, i.e. the index
> stores only the block numbers. That would need to be part of the
> discussion around how lossy we will allow indexes to be.
> 
> We currently have two kinds of full text index with different
> concurrency use cases, so it should be acceptable to have hash indexes
> have a clear benefit in one use case but a clear loss in another.
> 
> -- 
>   Simon Riggs
>   2ndQuadrant  http://www.2ndQuadrant.com
> 

Simon,

Thank you for your input. I would like to include some tests with large
indexes too. Do you have any ideas for a test corpus or should we try
and generate the test data programatically? Many people in the literature
of text retrieval use the TREC* data for at least some of their runs. I
am going to check at work to see if the campus has access to the data,
otherwise I will do some web crawling to generate some sample data. I
have just posted a reply to Tom Lane with some further ideas for consideration
in the new hash index support. Like you, I suspect that volatile data that
results in many index changes may not work well with hash indexes, in general.
PostgreSQL has the additional burden of needing to access both the index and
the data heap. Obviously, the less I/O that is needed the better the
performance is likely to be. The new HOT functionality plus clustering the
table data on the hash index would effectively organize the table into the
"hash buckets" which could help with reducing both the churn in the index
as well as in the tables.

Regards,
Ken

Re: Hash index todo list item

From

Gregory Stark

Date:

03 September 2007, 19:24:33

"Kenneth Marshall" <ktm@rice.edu> writes:

> On Sun, Sep 02, 2007 at 10:41:22PM -0400, Tom Lane wrote:
>> Kenneth Marshall <ktm@rice.edu> writes:
>> > ... This is the rough plan. Does anyone see anything critical that
>> > is missing at this point?
>> 
>> Sounds pretty good.  Let me brain-dump one item on you: one thing that
>> hash currently has over btree is the ability to handle index items up
>> to a full page.  Now, if you go with a scheme that only stores hash
>> codes and not the underlying data, you can not only handle that but
>> improve on it; 

I think that would be a big selling point for hash indexes. It would let you
index even toasted data which are larger than a page. I'm not sure whether you
can make it work for unique indexes though. But for non-unique indexes I think
it would be a solid win and mean you cover a set of use cases quite distinct
from btree indexes.

> - Hash lookup is O(1) while btree is O(logN). 

That's not really true. There's a tradeoff between insertion time and lookup
time. In order to get O(1) lookups you need to work pretty hard to maintain
the hash table including spending a lot of time reorganizing it when you grow
it. If you don't want to spend the time on inserts then you end up with
buckets and the hash table is basically just a linear speedup to whatever
algorithm you use to scan the buckets.

> - What about multi-column indexes? The current implementation
>   only supports 1 column.

That seems kind of weird. It seems obvious that you mix the three hashes
together which reduces it to the solved problem. 

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com

Re: Hash index todo list item

From

Tom Lane

Date:

03 September 2007, 20:57:54

Gregory Stark <stark@enterprisedb.com> writes:
> "Kenneth Marshall" <ktm@rice.edu> writes:
>> - What about multi-column indexes? The current implementation
>> only supports 1 column.

> That seems kind of weird. It seems obvious that you mix the three hashes
> together which reduces it to the solved problem. 

No, because part of the deal is that you can do lookups using only the
leading index columns.  At least, all the existing multicolumn index
types can do that.
        regards, tom lane

Re: Hash index todo list item

From

"Ben Tilly"

Date:

03 September 2007, 21:21:20

On 9/3/07, Gregory Stark <stark@enterprisedb.com> wrote:
>
> "Kenneth Marshall" <ktm@rice.edu> writes:
>
> > On Sun, Sep 02, 2007 at 10:41:22PM -0400, Tom Lane wrote:
> >> Kenneth Marshall <ktm@rice.edu> writes:
> >> > ... This is the rough plan. Does anyone see anything critical that
> >> > is missing at this point?
> >>
> >> Sounds pretty good.  Let me brain-dump one item on you: one thing that
> >> hash currently has over btree is the ability to handle index items up
> >> to a full page.  Now, if you go with a scheme that only stores hash
> >> codes and not the underlying data, you can not only handle that but
> >> improve on it;
>
> I think that would be a big selling point for hash indexes. It would let you
> index even toasted data which are larger than a page. I'm not sure whether you
> can make it work for unique indexes though. But for non-unique indexes I think
> it would be a solid win and mean you cover a set of use cases quite distinct
> from btree indexes.
>
> > - Hash lookup is O(1) while btree is O(logN).
>
> That's not really true. There's a tradeoff between insertion time and lookup
> time. In order to get O(1) lookups you need to work pretty hard to maintain
> the hash table including spending a lot of time reorganizing it when you grow
> it. If you don't want to spend the time on inserts then you end up with
> buckets and the hash table is basically just a linear speedup to whatever
> algorithm you use to scan the buckets.

These facts notwithstanding, average insert performance remains O(1)
if you grow the hash exponentially every time it needs to be grown.
Suppose, for example, that you use a power of 2 arrangement.  Then the
worst case scenario, right after a split, is that all of your keys had
to be inserted, all had to be moved once, half had to be moved twice,
a quarter 3 times, etc.  So the ratio of moves to keys is 1 + 1/2 +
1/4 + ... which is a well-known geometric series converging on 2.

True, when you cross the threshold a lot of work needs to be done.
Life would be simpler if you could just put up a lock while you split
the hash.  You can't do that for a busy transactional database though.But if you want to be clever about it, you build
intoyour hash

implementation the intelligence to be able to have 1 or 2 hash
locations to search.  When they are both present, all inserts go into
one of them, all deletes and updates are performed against both.  Then
you're able to have a background job reorganize your hash while the
database continues to use it.

> > - What about multi-column indexes? The current implementation
> >   only supports 1 column.
>
> That seems kind of weird. It seems obvious that you mix the three hashes
> together which reduces it to the solved problem.

That raises a very random thought.  One of the nicer features of
Oracle is the ability to have function-based indexes.  So you could
index, say, trim(lower(person.name)).  There are a *lot* of practical
situations where that comes in handy.  The best workaround that I can
think of for not having that is to have a column defined to hold the
result of the function, maintain that column with a trigger, then
index that column.  Which works, but is inelegant.  (It also requires
storing completely redundant data.)

Is there any prospect of postgres aquiring that functionality?

Ben

Re: Hash index todo list item

From

Kenneth Marshall

Date:

03 September 2007, 21:27:41

On Mon, Sep 03, 2007 at 05:20:34PM -0700, Ben Tilly wrote:
> 
> That raises a very random thought.  One of the nicer features of
> Oracle is the ability to have function-based indexes.  So you could
> index, say, trim(lower(person.name)).  There are a *lot* of practical
> situations where that comes in handy.  The best workaround that I can
> think of for not having that is to have a column defined to hold the
> result of the function, maintain that column with a trigger, then
> index that column.  Which works, but is inelegant.  (It also requires
> storing completely redundant data.)
> 
> Is there any prospect of postgres aquiring that functionality?
> 
> Ben
> 
I believe that PostgreSQL already supports functional indexes. In fact,
one suggestion to address the egregiously poor performance of the current
hash index was to replace it with a functional index.

Regards,
Ken

Re: Hash index todo list item

From

Tom Lane

Date:

03 September 2007, 21:34:25

"Ben Tilly" <btilly@gmail.com> writes:
> That raises a very random thought.  One of the nicer features of
> Oracle is the ability to have function-based indexes.  So you could
> index, say, trim(lower(person.name)).

> Is there any prospect of postgres aquiring that functionality?

Uh, no, since it's already there; has been since Berkeley days ...
        regards, tom lane

Re: Hash index todo list item

From

"Ben Tilly"

Date:

03 September 2007, 23:06:44

On 9/3/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Ben Tilly" <btilly@gmail.com> writes:
> > That raises a very random thought.  One of the nicer features of
> > Oracle is the ability to have function-based indexes.  So you could
> > index, say, trim(lower(person.name)).
>
> > Is there any prospect of postgres aquiring that functionality?
>
> Uh, no, since it's already there; has been since Berkeley days ...

Nice!

I know of at least one DBA who is moving from Oracle to postgres who
will be *very* happy to hear that.

Ben

Re: Hash index todo list item

From

Kenneth Marshall

Date:

05 September 2007, 17:07:11

On Sun, Sep 02, 2007 at 01:04:04PM -0500, Kenneth Marshall wrote:
> Dear PostgreSQL Hackers:
> 
> After following the hackers mailing list for quite a while,
> I am going to start investigating what will need to be done
> to improve hash index performance. Below are the pieces of
> this project that I am currently considering:
> 
> 1. Characterize the current hash index implementation against
>    the BTree index, with a focus on space utilization and
>    lookup performance against a collection of test data. This
>    will give a baseline performance test to evaluate the impact
>    of changes. I initially do not plan to bench the hash creation
>    process since my initial focus will be on lookup performance.
> 

Here are very basic results for a table with 1.6m entries:

postgres=# CREATE TABLE dict (word varchar(100));
CREATE TABLE

postgres=# COPY dict FROM '/tmp/words';
COPY 1648379
postgres=# select count(*) from dict; count  
---------1648379
(1 row)

Time: 11187.418 ms
postgres=# select count(*) from dict; count  
---------1648379
(1 row)

Time: 6040.912 ms
postgres=# CREATE INDEX wordhash ON dict USING hash (word);
CREATE INDEX
Time: 11108707.160 ms

postgres=# select * from dict where word = 'avatar'; word  
--------avatar
(1 row)

Time: 79.823 ms
postgres=# select * from dict where word = 'zebra';word  
-------zebra
(1 row)

Time: 9.864 ms
postgres=# select * from dict where word = 'turkey';  word  
--------turkey
(1 row)

Time: 18.418 ms
Time: 1.045 ms
Time: 1.257 ms
Time: 1.080 ms

postgres=# CREATE INDEX wordbtree ON dict USING btree (word);
CREATE INDEX

Time: 25438.884 ms

postgres=# select * from dict where word = 'avatar'; word  
--------avatar
(1 row)

Time: 13.400 ms
postgres=# select * from dict where word = 'zebra';word  
-------zebra
(1 row)

Time: 1.173 ms
postgres=# select * from dict where word = 'turkey'; word  
--------turkey
(1 row)

Time: 1.186 ms
Time: 1.103 ms
Time: 1.099 ms
Time: 1.108 ms

------------------------------
Size of table =       87556096

Size of hash index = 268451840
Size of btree index = 53510144

From my very small sample on an unloaded machine, a hash index lookup
took the least amount of time. It had a much larger initial time which
could be attributable to cache population effects. The size is 5X that
of the Btree index. I will continue to improve the test suite as more
granularity is needed. If anyone has a good data generator, please let
me know. Otherwise I will just roll my own.

Regards,
Ken

Re: Hash index todo list item

From

Hannu Krosing

Date:

06 September 2007, 09:18:03

Ühel kenal päeval, E, 2007-09-03 kell 19:55, kirjutas Tom Lane:
> Gregory Stark <stark@enterprisedb.com> writes:
> > "Kenneth Marshall" <ktm@rice.edu> writes:
> >> - What about multi-column indexes? The current implementation
> >> only supports 1 column.
> 
> > That seems kind of weird. It seems obvious that you mix the three hashes
> > together which reduces it to the solved problem. 
> 
> No, because part of the deal is that you can do lookups using only the
> leading index columns.  At least, all the existing multicolumn index
> types can do that.

One approahc is not to mix hashes, but to partition the hash, so that
each column gets its N bits in the hash. 

If you do it smartly you can use any column for index lookups, not just
the leading one.

>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match

Re: Hash index todo list item

From

Tom Lane

Date:

06 September 2007, 10:38:25

Hannu Krosing <hannu@skype.net> writes:
> Ühel kenal päeval, E, 2007-09-03 kell 19:55, kirjutas Tom Lane:
>> No, because part of the deal is that you can do lookups using only the
>> leading index columns.  At least, all the existing multicolumn index
>> types can do that.

> One approahc is not to mix hashes, but to partition the hash, so that
> each column gets its N bits in the hash. 

How does that help?  You still need all the keys to find out which
bucket to look in.
        regards, tom lane

Re: Hash index todo list item

From

Hannu Krosing

Date:

06 September 2007, 11:01:00

Ühel kenal päeval, N, 2007-09-06 kell 09:38, kirjutas Tom Lane:
> Hannu Krosing <hannu@skype.net> writes:
> > Ühel kenal päeval, E, 2007-09-03 kell 19:55, kirjutas Tom Lane:
> >> No, because part of the deal is that you can do lookups using only the
> >> leading index columns.  At least, all the existing multicolumn index
> >> types can do that.
> 
> > One approahc is not to mix hashes, but to partition the hash, so that
> > each column gets its N bits in the hash. 
> 
> How does that help?  You still need all the keys to find out which
> bucket to look in.

no. you need to look at only the buckets where that part of hash matches

say you allocate bits 4-7 for column 2 and then need to look up column 2
value with hash 3 . here you need to look at only buckets N*16 + 3, that
is, you need to examine only each 16th bucket

---------
Hannu

Re: Hash index todo list item

From

Mark Mielke

Date:

06 September 2007, 13:37:00

Hannu Krosing wrote: <blockquote cite="mid:1189087192.7470.16.camel@hannu-laptop" type="cite"><blockquote
type="cite"><blockquotetype="cite"><pre wrap="">One approahc is not to mix hashes, but to partition the hash, so that
 
each column gets its N bits in the hash.      </pre></blockquote><pre wrap="">How does that help?  You still need all
thekeys to find out which
 
bucket to look in.   </pre></blockquote><pre wrap="">
no. you need to look at only the buckets where that part of hash matches

say you allocate bits 4-7 for column 2 and then need to look up column 2
value with hash 3 . here you need to look at only buckets N*16 + 3, that
is, you need to examine only each 16th bucket
 </pre></blockquote><br /> I don't like the truncating hash suggestion because it limits the ability of a hash code to
uniquelyidentify a key.<br /><br /> If a user requires the ability to search on both (column1) and (column1, column2),
theycan create two hash indexes and the planner can decide which to use.<br /> Or, they can use a btree. I think hash
hasa subset of uses where it would be a significant gain, and focus should be spent on this subset.<br /><br />
Cheers,<br/> mark<br /><br /><pre class="moz-signature" cols="72">-- 
 
Mark Mielke <a class="moz-txt-link-rfc2396E" href="mailto:mark@mielke.cc"><mark@mielke.cc></a>
</pre>

Re: Hash index todo list item

From

Michael Glaesemann

Date:

06 September 2007, 14:07:47

On Sep 6, 2007, at 10:53 , Mark Mielke wrote:

> I don't like the truncating hash suggestion because it limits the  
> ability of a hash code to uniquely identify a key.

AIUI, a hash can't be used as a unique identifier: it always needs to  
be rechecked due to the chance of collisions. There might be other  
issues with truncation, but preventing hashes from being unique isn't  
one of them.

Michael Glaesemann
grzm seespotcode net

Re: Hash index todo list item

From

Kenneth Marshall

Date:

06 September 2007, 15:09:09

On Thu, Sep 06, 2007 at 11:53:45AM -0400, Mark Mielke wrote:
> Hannu Krosing wrote:
>>>> One approahc is not to mix hashes, but to partition the hash, so that
>>>> each column gets its N bits in the hash.       
>>> How does that help?  You still need all the keys to find out which
>>> bucket to look in.
>>>     
>>
>> no. you need to look at only the buckets where that part of hash matches
>>
>> say you allocate bits 4-7 for column 2 and then need to look up column 2
>> value with hash 3 . here you need to look at only buckets N*16 + 3, that
>> is, you need to examine only each 16th bucket
>>
>>   
>
> I don't like the truncating hash suggestion because it limits the ability 
> of a hash code to uniquely identify a key.
>
> If a user requires the ability to search on both (column1) and (column1, 
> column2), they can create two hash indexes and the planner can decide which 
> to use.
> Or, they can use a btree. I think hash has a subset of uses where it would 
> be a significant gain, and focus should be spent on this subset.
>
> Cheers,
> mark
>
I agree that we should focus primarily on the subset of uses for hash
indexes where there would be a significant gain. I do think that being
able to use a single O(1) hash lookup against all the values specified
in a pseudo-multi-column index could be very beneficial in reducing
access time and I/O. 

Since we already have to check the actual tuple values for any index
lookup in postgresql, we could only store the full hash value and the
corresponding TIDs in the bucket. Then when we lookup an item by
calculating its hash, if the exact hash is not present in the bucket,
then we know that the item is not in the index. If the value exists,
then we would check the heap tuples before returning the results. Thus
a negative lookup only needs to check the index and if the hash function
is "good" there will be optimally only 1 possibly valid heap tuple if
there is a match. One very big win for this change is to allow a much
smaller index size (hash value + relevant TIDs) and the large column
values are only stored in the actual data tuples.

Regards,
Ken
> -- 
> Mark Mielke <mark@mielke.cc>
>

Re: Hash index todo list item

From

Mark Mielke

Date:

06 September 2007, 19:51:33

Michael Glaesemann wrote:
> On Sep 6, 2007, at 10:53 , Mark Mielke wrote:
>> I don't like the truncating hash suggestion because it limits the 
>> ability of a hash code to uniquely identify a key.
> AIUI, a hash can't be used as a unique identifier: it always needs to 
> be rechecked due to the chance of collisions. There might be other 
> issues with truncation, but preventing hashes from being unique isn't 
> one of them.

Of course - that's why I used the word "limit".

Hash works best, when the key is unique, however. A 32-bit hash will be 
many powers of 2 more unique than a 8-bit hash.

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>

Re: Hash index todo list item

From

Martijn van Oosterhout

Date:

07 September 2007, 03:26:11

On Thu, Sep 06, 2007 at 01:08:59PM -0500, Kenneth Marshall wrote:
> Since we already have to check the actual tuple values for any index
> lookup in postgresql, we could only store the full hash value and the
> corresponding TIDs in the bucket. Then when we lookup an item by
> calculating its hash, if the exact hash is not present in the bucket,
> then we know that the item is not in the index.

Sounds like you'd be returning a bitmap for use with a bitmap scan.
That's a different take on other suggestions I've heard and would allow
a hash index to have an almost unlimited key size yet flexible
matching... (combined with other index, or even just the same index).

Neat.

Have a nice day,
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: Hash index todo list item

From

Neil Conway

Date:

07 September 2007, 03:56:27

On Sun, 2007-02-09 at 13:04 -0500, Kenneth Marshall wrote:
> 2. Evaluate the performance of different hash index implementations
>    and/or changes to the current implementation. My current plan is
>    to keep the implementation as simple as possible and still provide
>    the desired performance. Several hash index suggestions deal with
>    changing the layout of the keys on a page to improve lookup
>    performance, including reducing the bucket size to a fraction of
>    a page or only storing the hash value on the page, instead of
>    the index value itself.

You might find this patch useful:
   http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php

It implements the "just store the hash in the index" idea; it also sorts
the entries in a bucket by the hash value, which allows binary search to
be used to locate candidate matches.

I was surprised that this didn't result in a performance improvement for
the benchmarks that I ran, but I never got around to investigating
further (either running more benchmarks or checking whether there was a
bug in the implementation).

Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge
it up to HEAD if you'd like.

-Neil

Re: Hash index todo list item

From

"Heikki Linnakangas"

Date:

07 September 2007, 08:56:56

Neil Conway wrote:
> You might find this patch useful:
> 
>     http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php

Oh, I had forgot about that.

> It implements the "just store the hash in the index" idea; it also sorts
> the entries in a bucket by the hash value, which allows binary search to
> be used to locate candidate matches.
> 
> I was surprised that this didn't result in a performance improvement for
> the benchmarks that I ran, but I never got around to investigating
> further (either running more benchmarks or checking whether there was a
> bug in the implementation).

You did get a smaller index, which has cache benefits with larger
tables. You didn't compare the size comparison against b-tree, but a
smaller index is a good thing.

I think you left some money on the table on that front. Since all the
HashItems are fixed size, you can get rid of the line pointers
altogether. Given that a sizeof(HashItemData) is 8 bytes, and a line
pointer is 4 bytes, that should give a further 1/3 reduction in index
size. If you're willing to give up on the alignment of HashItemData, you
can get it down to 6 bytes.

Even from a CPU point of view, as the table becomes bigger and the
b-tree becomes deeper, the difference between O(1) and O(log n) lookup
starts to become more significant.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com

Re: Hash index todo list item

From

Kenneth Marshall

Date:

07 September 2007, 10:29:36

On Thu, Sep 06, 2007 at 11:56:25PM -0700, Neil Conway wrote:
> On Sun, 2007-02-09 at 13:04 -0500, Kenneth Marshall wrote:
> > 2. Evaluate the performance of different hash index implementations
> >    and/or changes to the current implementation. My current plan is
> >    to keep the implementation as simple as possible and still provide
> >    the desired performance. Several hash index suggestions deal with
> >    changing the layout of the keys on a page to improve lookup
> >    performance, including reducing the bucket size to a fraction of
> >    a page or only storing the hash value on the page, instead of
> >    the index value itself.
> 
> You might find this patch useful:
> 
>     http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php
> 
> It implements the "just store the hash in the index" idea; it also sorts
> the entries in a bucket by the hash value, which allows binary search to
> be used to locate candidate matches.
> 
> I was surprised that this didn't result in a performance improvement for
> the benchmarks that I ran, but I never got around to investigating
> further (either running more benchmarks or checking whether there was a
> bug in the implementation).
> 
> Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge
> it up to HEAD if you'd like.
> 
> -Neil
> 
This is a great starting point. I would appreciate it if you have the
time and could make it apply cleanly to HEAD. I remember when you first
posted it but had forgotten, probably because of the lack-luster results.
Just a quick glance at the patch and from what I can tell, tagging the
index as lossy causes a lot more work to be done than should be needed
in theory. Currently the index-scan machinery will recheck the value
against the original value for lossy indexes. However, given that we
are using a good hash function with a low chance of collision, if we
mark the unique items in the index then they do not actually have to
be rechecked during the scan. Do you have any suggestions for implementing
that optimization or is there any option to tell the scan machinery to
only re-check a certain list of tuples? Thank you again for pointing
this patch out and please let me know when you have a version against
HEAD.

Regards,
Ken
> 
>

Re: Hash index todo list item

From

Kenneth Marshall

Date:

07 September 2007, 10:39:07

On Fri, Sep 07, 2007 at 12:55:37PM +0100, Heikki Linnakangas wrote:
> Neil Conway wrote:
> > You might find this patch useful:
> > 
> >     http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php
> 
> Oh, I had forgot about that.
> 
> > It implements the "just store the hash in the index" idea; it also sorts
> > the entries in a bucket by the hash value, which allows binary search to
> > be used to locate candidate matches.
> > 
> > I was surprised that this didn't result in a performance improvement for
> > the benchmarks that I ran, but I never got around to investigating
> > further (either running more benchmarks or checking whether there was a
> > bug in the implementation).
> 
> You did get a smaller index, which has cache benefits with larger
> tables. You didn't compare the size comparison against b-tree, but a
> smaller index is a good thing.
> 
> I think you left some money on the table on that front. Since all the
> HashItems are fixed size, you can get rid of the line pointers
> altogether. Given that a sizeof(HashItemData) is 8 bytes, and a line
> pointer is 4 bytes, that should give a further 1/3 reduction in index
> size. If you're willing to give up on the alignment of HashItemData, you
> can get it down to 6 bytes.
> 
> Even from a CPU point of view, as the table becomes bigger and the
> b-tree becomes deeper, the difference between O(1) and O(log n) lookup
> starts to become more significant.
> 
If you use the size values from my initial tests, the hash index is
down to 3X the btree size instead of 5X. If we can make these additional
changes and add a unique flags to the index entry, we would have a much
smaller index. I had not thought of it at the time, but the addition of
the unique/sole item flag would allow the use of the hash index to
support unique indexes and be used for primary keys. In some usage
cases, the O(1) would be very advantageous.

Regards,
Ken

Re: Hash index todo list item

From

Mark Mielke

Date:

07 September 2007, 10:50:21

Kenneth Marshall wrote: <blockquote cite="mid:20070907132928.GC19403@it.is.rice.edu" type="cite"><pre wrap="">On Thu,
Sep06, 2007 at 11:56:25PM -0700, Neil Conway wrote: </pre><blockquote type="cite"><pre wrap="">You might find this
patchuseful:
 
   <a class="moz-txt-link-freetext"
href="http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php">http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php</a>
...

Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge
it up to HEAD if you'd like.   </pre></blockquote><pre wrap="">This is a great starting point. I would appreciate it if
youhave the
 
time and could make it apply cleanly to HEAD. I remember when you first
posted it but had forgotten, probably because of the lack-luster results.
Just a quick glance at the patch and from what I can tell, tagging the
index as lossy causes a lot more work to be done than should be needed
in theory. Currently the index-scan machinery will recheck the value
against the original value for lossy indexes. However, given that we
are using a good hash function with a low chance of collision, if we
mark the unique items in the index then they do not actually have to
be rechecked during the scan. Do you have any suggestions for implementing
that optimization or is there any option to tell the scan machinery to
only re-check a certain list of tuples? Thank you again for pointing
this patch out and please let me know when you have a version against
HEAD. </pre></blockquote> What do you mean by "mark the unique items in the index then they do not actually have to be
recheckedduring the scan." Even if there is a unique hash value mapping to a unique key, there is no guarantee that a
newvalue won't result in that same hash value. Such is the nature of hashes. A hash key map does not mean a value
match.The value must be checked. The opposite, however, may be true. If the hash key is not found, then we know the row
forthe value does not exist.<br /><br /> Have you measured the performance of re-checking? I have always assumed the
performanceof re-checking was near free when compared to the cost of looking up the tuples in the table to determine
whetheror not they are "live". This is why I have not been upset that bitmap index scans often re-check.<br /><br />
Cheers,<br/> mark<br /><br /><pre class="moz-signature" cols="72">-- 
 
Mark Mielke <a class="moz-txt-link-rfc2396E" href="mailto:mark@mielke.cc"><mark@mielke.cc></a>
</pre>

Re: Hash index todo list item

From

Kenneth Marshall

Date:

07 September 2007, 11:01:12

On Fri, Sep 07, 2007 at 09:50:07AM -0400, Mark Mielke wrote:
> Kenneth Marshall wrote:
>> On Thu, Sep 06, 2007 at 11:56:25PM -0700, Neil Conway wrote:
>>   
>>> You might find this patch useful:
>>>
>>>     http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php
>>> ...
>>>
>>> Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge
>>> it up to HEAD if you'd like.
>>>     
>> This is a great starting point. I would appreciate it if you have the
>> time and could make it apply cleanly to HEAD. I remember when you first
>> posted it but had forgotten, probably because of the lack-luster results.
>> Just a quick glance at the patch and from what I can tell, tagging the
>> index as lossy causes a lot more work to be done than should be needed
>> in theory. Currently the index-scan machinery will recheck the value
>> against the original value for lossy indexes. However, given that we
>> are using a good hash function with a low chance of collision, if we
>> mark the unique items in the index then they do not actually have to
>> be rechecked during the scan. Do you have any suggestions for implementing
>> that optimization or is there any option to tell the scan machinery to
>> only re-check a certain list of tuples? Thank you again for pointing
>> this patch out and please let me know when you have a version against
>> HEAD.
>>   
> What do you mean by "mark the unique items in the index then they do not 
> actually have to be rechecked during the scan." Even if there is a unique 
> hash value mapping to a unique key, there is no guarantee that a new value 
> won't result in that same hash value. Such is the nature of hashes. A hash 
> key map does not mean a value match. The value must be checked. The 
> opposite, however, may be true. If the hash key is not found, then we know 
> the row for the value does not exist.
>
> Have you measured the performance of re-checking? I have always assumed the 
> performance of re-checking was near free when compared to the cost of 
> looking up the tuples in the table to determine whether or not they are 
> "live". This is why I have not been upset that bitmap index scans often 
> re-check.
>
> Cheers,
> mark

I understand that a hash value is a many-to-one mapping. That is the
point of the flag in the index. The flag means that there is only one
item in the heap corresponding to that hash value. In this case we
know that the value in the heap is the correct one and a possibly
very expensive string comparison can be skipped. Given that the hash
function is doing its job, almost every string comparison can be skipped.
How long would it take to compare 1-32K of data? How much CPU usage?
With this field in place, you only need to check tuple visibility.

Regards,
Ken

Re: Hash index todo list item

From

Kenneth Marshall

Date:

07 September 2007, 11:17:03

On Thu, Sep 06, 2007 at 11:56:25PM -0700, Neil Conway wrote:
> On Sun, 2007-02-09 at 13:04 -0500, Kenneth Marshall wrote:
> > 2. Evaluate the performance of different hash index implementations
> >    and/or changes to the current implementation. My current plan is
> >    to keep the implementation as simple as possible and still provide
> >    the desired performance. Several hash index suggestions deal with
> >    changing the layout of the keys on a page to improve lookup
> >    performance, including reducing the bucket size to a fraction of
> >    a page or only storing the hash value on the page, instead of
> >    the index value itself.
> 
> You might find this patch useful:
> 
>     http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php
> 
> It implements the "just store the hash in the index" idea; it also sorts
> the entries in a bucket by the hash value, which allows binary search to
> be used to locate candidate matches.
> 
> I was surprised that this didn't result in a performance improvement for
> the benchmarks that I ran, but I never got around to investigating
> further (either running more benchmarks or checking whether there was a
> bug in the implementation).
> 
> Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge
> it up to HEAD if you'd like.
> 
> -Neil
> 
I have another question. Did the scan code at this time use the
heap-order scanning? Could that have had an impact on the patch
performance?

Ken

Re: Hash index todo list item

From

Mark Mielke

Date:

07 September 2007, 11:30:53

Kenneth Marshall wrote:
> I understand that a hash value is a many-to-one mapping. That is the
> point of the flag in the index. The flag means that there is only one
> item in the heap corresponding to that hash value. In this case we
> know that the value in the heap is the correct one and a possibly
> very expensive string comparison can be skipped. Given that the hash
> function is doing its job, almost every string comparison can be skipped.
> How long would it take to compare 1-32K of data? How much CPU usage?
> With this field in place, you only need to check tuple visibility
The value comparison cannot be skipped. I do not think you understand 
the many-to-one mapping in its entirety.

Example:
   Table has: a(1), b(2), c(3)   Index has: 1 => 1 (unique), 2 => 2 (unique), 3 => 3 (unique)

Query:
   select * from table where key = 'z';

If 'z' hashes to '3' (completely possible), then the index record 3 
points to tuple 3, and it "exists". Only, it doesn't because 'a' <> 'z'. 
You MUST check the value.

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>

Re: Hash index todo list item

From

Brian Hurt

Date:

07 September 2007, 11:41:48

Kenneth Marshall wrote:

>I understand that a hash value is a many-to-one mapping. That is the
>point of the flag in the index. The flag means that there is only one
>item in the heap corresponding to that hash value. In this case we
>know that the value in the heap is the correct one and a possibly
>very expensive string comparison can be skipped. Given that the hash
>function is doing its job, almost every string comparison can be skipped.
>How long would it take to compare 1-32K of data? How much CPU usage?
>With this field in place, you only need to check tuple visibility.
>  
>

How likely is it that you will get a hash collision, two strings that 
are different that will hash to the same value?  To avoid this requires 
a very large hash key (128 bits, minimum)- otherwise you get into 
birthday attack problems.  With a 32-bit hash, the likelyhood is greater 
than 50% that two strings in a collection of 100,000 will hash to the 
same value.  With a 64-bit hash, the likelyhood is greater than 50% that 
two strings in a collection of 10 billion will has to same value.  10 
billion is a large number, but not an unreasonable number, of strings to 
want to put into a hash table- and it's exactly this case where the O(1) 
cost of hashtables starts being a real win.

Brian

Re: Hash index todo list item

From

Kenneth Marshall

Date:

07 September 2007, 11:44:20

On Fri, Sep 07, 2007 at 10:30:30AM -0400, Mark Mielke wrote:
> Kenneth Marshall wrote:
>> I understand that a hash value is a many-to-one mapping. That is the
>> point of the flag in the index. The flag means that there is only one
>> item in the heap corresponding to that hash value. In this case we
>> know that the value in the heap is the correct one and a possibly
>> very expensive string comparison can be skipped. Given that the hash
>> function is doing its job, almost every string comparison can be skipped.
>> How long would it take to compare 1-32K of data? How much CPU usage?
>> With this field in place, you only need to check tuple visibility
> The value comparison cannot be skipped. I do not think you understand the 
> many-to-one mapping in its entirety.
>
> Example:
>
>    Table has: a(1), b(2), c(3)
>    Index has: 1 => 1 (unique), 2 => 2 (unique), 3 => 3 (unique)
>
> Query:
>
>    select * from table where key = 'z';
>
> If 'z' hashes to '3' (completely possible), then the index record 3 points 
> to tuple 3, and it "exists". Only, it doesn't because 'a' <> 'z'. You MUST 
> check the value.
>
> Cheers,
> mark
>
Yes, you are completely correct. Thank you for the reminder.

Regards,
Ken

Re: Hash index todo list item

From

Kenneth Marshall

Date:

07 September 2007, 11:55:18

On Fri, Sep 07, 2007 at 10:36:41AM -0400, Brian Hurt wrote:
> Kenneth Marshall wrote:
>
>> I understand that a hash value is a many-to-one mapping. That is the
>> point of the flag in the index. The flag means that there is only one
>> item in the heap corresponding to that hash value. In this case we
>> know that the value in the heap is the correct one and a possibly
>> very expensive string comparison can be skipped. Given that the hash
>> function is doing its job, almost every string comparison can be skipped.
>> How long would it take to compare 1-32K of data? How much CPU usage?
>> With this field in place, you only need to check tuple visibility.
>>  
>
> How likely is it that you will get a hash collision, two strings that are 
> different that will hash to the same value?  To avoid this requires a very 
> large hash key (128 bits, minimum)- otherwise you get into birthday attack 
> problems.  With a 32-bit hash, the likelyhood is greater than 50% that two 
> strings in a collection of 100,000 will hash to the same value.  With a 
> 64-bit hash, the likelyhood is greater than 50% that two strings in a 
> collection of 10 billion will has to same value.  10 billion is a large 
> number, but not an unreasonable number, of strings to want to put into a 
> hash table- and it's exactly this case where the O(1) cost of hashtables 
> starts being a real win.
>
> Brian
>
Yes, there is a non-negligible chance of collision (In a DB is there
any chance that is non-negligible? :) ) and the values must be checked
against the actual. The win is the collapse of the index size and only
needed to check a small fraction of the actual tuples.

Regards,
Ken

Re: Hash index todo list item

From

Brian Hurt

Date:

07 September 2007, 12:10:50

Kenneth Marshall wrote:<br /><blockquote cite="mid20070907145507.GJ19403@it.is.rice.edu" type="cite"><blockquote
type="cite"><blockquotetype="cite"><pre wrap="">      </pre></blockquote><pre wrap="">How likely is it that you will
geta hash collision, two strings that are 
 
different that will hash to the same value?  To avoid this requires a very 
large hash key (128 bits, minimum)- otherwise you get into birthday attack 
problems.  With a 32-bit hash, the likelyhood is greater than 50% that two 
strings in a collection of 100,000 will hash to the same value.  With a 
64-bit hash, the likelyhood is greater than 50% that two strings in a 
collection of 10 billion will has to same value.  10 billion is a large 
number, but not an unreasonable number, of strings to want to put into a 
hash table- and it's exactly this case where the O(1) cost of hashtables 
starts being a real win.

Brian
   </pre></blockquote><pre wrap="">Yes, there is a non-negligible chance of collision (In a DB is there
any chance that is non-negligible? :) ) and the values must be checked
against the actual. The win is the collapse of the index size and only
needed to check a small fraction of the actual tuples.

 </pre></blockquote><br /> Ah, OK- I misunderstood you.  I thought you were saying that the hash values would need to
beunique, and you wouldn't check the original values at all.  My bad.<br /><br /> Brian<br /><br />

Re: Hash index todo list item

From

Kenneth Marshall

Date:

07 September 2007, 12:17:34

On Fri, Sep 07, 2007 at 11:08:13AM -0400, Brian Hurt wrote:
> Kenneth Marshall wrote:
>
>>>>      
>>> How likely is it that you will get a hash collision, two strings that are 
>>> different that will hash to the same value?  To avoid this requires a 
>>> very large hash key (128 bits, minimum)- otherwise you get into birthday 
>>> attack problems.  With a 32-bit hash, the likelyhood is greater than 50% 
>>> that two strings in a collection of 100,000 will hash to the same value.  
>>> With a 64-bit hash, the likelyhood is greater than 50% that two strings 
>>> in a collection of 10 billion will has to same value.  10 billion is a 
>>> large number, but not an unreasonable number, of strings to want to put 
>>> into a hash table- and it's exactly this case where the O(1) cost of 
>>> hashtables starts being a real win.
>>>
>>> Brian
>>>
>>>    
>> Yes, there is a non-negligible chance of collision (In a DB is there
>> any chance that is non-negligible? :) ) and the values must be checked
>> against the actual. The win is the collapse of the index size and only
>> needed to check a small fraction of the actual tuples.
>>
>>
>>  
>
> Ah, OK- I misunderstood you.  I thought you were saying that the hash 
> values would need to be unique, and you wouldn't check the original values 
> at all.  My bad.
>
> Brian
>
No, you were correct. I misstated originally and you and Mark both pointed
out my mistake.

Regards,
Ken

Re: Hash index todo list item

From

Kenneth Marshall

Date:

08 September 2007, 17:21:34

On Fri, Sep 07, 2007 at 10:36:41AM -0400, Brian Hurt wrote:
> Kenneth Marshall wrote:
>
>> I understand that a hash value is a many-to-one mapping. That is the
>> point of the flag in the index. The flag means that there is only one
>> item in the heap corresponding to that hash value. In this case we
>> know that the value in the heap is the correct one and a possibly
>> very expensive string comparison can be skipped. Given that the hash
>> function is doing its job, almost every string comparison can be skipped.
>> How long would it take to compare 1-32K of data? How much CPU usage?
>> With this field in place, you only need to check tuple visibility.
>>  
>
> How likely is it that you will get a hash collision, two strings that are 
> different that will hash to the same value?  To avoid this requires a very 
> large hash key (128 bits, minimum)- otherwise you get into birthday attack 
> problems.  With a 32-bit hash, the likelyhood is greater than 50% that two 
> strings in a collection of 100,000 will hash to the same value.  With a 
> 64-bit hash, the likelyhood is greater than 50% that two strings in a 
> collection of 10 billion will has to same value.  10 billion is a large 
> number, but not an unreasonable number, of strings to want to put into a 
> hash table- and it's exactly this case where the O(1) cost of hashtables 
> starts being a real win.
>
> Brian
>
Continuing this train of thought.... While it would make sense for larger
keys to store the hash in the index, if the key is smaller, particularly
if it is of fixed size, it would make sense to store the key in the index
instead. This would have the benefit of allowing use of the hash index
in non-lossy mode albeit with a slight increase in complexity.

Ken

Re: Hash index todo list item

From

Mark Mielke

Date:

08 September 2007, 18:14:18

Kenneth Marshall wrote:<br /><blockquote cite="mid:20070908202122.GA5679@it.is.rice.edu" type="cite"><pre
wrap="">Continuingthis train of thought.... While it would make sense for larger
 
keys to store the hash in the index, if the key is smaller, particularly
if it is of fixed size, it would make sense to store the key in the index
instead. This would have the benefit of allowing use of the hash index
in non-lossy mode albeit with a slight increase in complexity. </pre></blockquote> I suspect there is no value in
designinga hash implementation to work well for a context where a btree index would already perform equally well.<br
/><br/> If there are too few hash buckets, performance is not O(1). For a hash index to function better than btree, I
believefocus should be spent on the O(1) case, which means ensuring that enough hash buckets are used to provide
O(1).<br/><br /> All of these must match: 1) Hash value, 2) Key value, 3) Tuple visibility.<br /><br /> In the optimum
O(1)scenario, each existing key will map to a hash bucket that contains ~1 entry. For this case, there is no value to
havingthe key stored in the index row, as 3) Tuple visibility, will still require access to the table row. In this
optimumscenario, I do not believe anything of value is saved by storing the key in the index row. The loss, however, is
thatthe hash index data structures become more complex, and would likely require support for variable length data. The
resultingincrease in hash index size and code complexity would reduce performance.<br /><br /> Just an opinion.<br
/><br/> Cheers,<br /> mark<br /><pre class="moz-signature" cols="72">
 
-- 
Mark Mielke <a class="moz-txt-link-rfc2396E" href="mailto:mark@mielke.cc"><mark@mielke.cc></a>
</pre>

Re: Hash index todo list item

From

Kenneth Marshall

Date:

08 September 2007, 18:49:07

On Sat, Sep 08, 2007 at 05:14:09PM -0400, Mark Mielke wrote:
> Kenneth Marshall wrote:
>> Continuing this train of thought.... While it would make sense for larger
>> keys to store the hash in the index, if the key is smaller, particularly
>> if it is of fixed size, it would make sense to store the key in the index
>> instead. This would have the benefit of allowing use of the hash index
>> in non-lossy mode albeit with a slight increase in complexity.
>>   
> I suspect there is no value in designing a hash implementation to work well 
> for a context where a btree index would already perform equally well.
>
> If there are too few hash buckets, performance is not O(1). For a hash 
> index to function better than btree, I believe focus should be spent on the 
> O(1) case, which means ensuring that enough hash buckets are used to 
> provide O(1).
>
> All of these must match: 1) Hash value, 2) Key value, 3) Tuple visibility.
>
> In the optimum O(1) scenario, each existing key will map to a hash bucket 
> that contains ~1 entry. For this case, there is no value to having the key 
> stored in the index row, as 3) Tuple visibility, will still require access 
> to the table row. In this optimum scenario, I do not believe anything of 
> value is saved by storing the key in the index row. The loss, however, is 
> that the hash index data structures become more complex, and would likely 
> require support for variable length data. The resulting increase in hash 
> index size and code complexity would reduce performance.
>
> Just an opinion.
>
> Cheers,
> mark
>

I agree that we should focus on the O(1) area. The value of storing the
actual value, possibly as an option, is that the key check can be done
in the index without requiring a heap lookup to check the actual value
which would be a benefit for a unique index. I agree that supporting
variable length data would complicate the index and reduce performance.
I am not willing to assume that ~1 entry per hash bucket is necessarily
what we want, at least without some testing. It seems reasonable that
with the performance differences between L1/L2/L3 cache, main memory,
and the disk subsystem a higher load factor would result in better
overall system performance by reducing cache line misses and improving
hardware pre-fetch efficiency. Along with the hypothetical performance
wins, the hash index space efficiency would be improved by a similar
factor. Obviously, all of these ideas would need to be tested in
various workload environments. In the large index arena, 10^6 to 10^9
keys and more, space efficiency will help keep the index manageable
in todays system memories.

Please keep the ideas and comments coming. I am certain that a synthesis
of them will provide an implementation with the performance characteristics
that we are seeking.

Regards,
Ken

> -- 
> Mark Mielke <mark@mielke.cc>
>

Re: Hash index todo list item

From

Mark Mielke

Date:

08 September 2007, 19:56:29

Kenneth Marshall wrote:<br /><blockquote cite="mid:20070908214900.GB5679@it.is.rice.edu" type="cite"><pre wrap="">The
valueof storing the
 
actual value, possibly as an option, is that the key check can be done
in the index without requiring a heap lookup to check the actual value
which would be a benefit for a unique index. I agree that supporting
variable length data would complicate the index and reduce performance.
I am not willing to assume that ~1 entry per hash bucket is necessarily
what we want, at least without some testing.</pre></blockquote> I think that if the case of >1 entry per hash
becomescommon enough to be significant, and the key is stored in the hash, that a btree will perform equal or better,
andthere is no point in pursuing  such a hash index model. This is where we are today.<br /><br /><blockquote
cite="mid:20070908214900.GB5679@it.is.rice.edu"type="cite"><pre wrap="">It seems reasonable that
 
with the performance differences between L1/L2/L3 cache, main memory,
and the disk subsystem a higher load factor would result in better
overall system performance by reducing cache line misses and improving
hardware pre-fetch efficiency.</pre></blockquote> If the key is stored, all of these factors likely favor the btree
formatover the hash format. Again, this is where we are today.<br /><br /><blockquote
cite="mid:20070908214900.GB5679@it.is.rice.edu"type="cite"><pre wrap="">Along with the hypothetical performance
 
wins, the hash index space efficiency would be improved by a similar
factor. Obviously, all of these ideas would need to be tested in
various workload environments. In the large index arena, 10^6 to 10^9
keys and more, space efficiency will help keep the index manageable
in todays system memories. </pre></blockquote> Space efficiency is provided by not storing the key, nor the header data
required(length prefix?).<br /><blockquote cite="mid:20070908214900.GB5679@it.is.rice.edu" type="cite"><pre
wrap="">Pleasekeep the ideas and comments coming. I am certain that a synthesis
 
of them will provide an implementation with the performance characteristics
that we are seeking. </pre></blockquote> The subject interests me. :-)<br /><br /> Cheers,<br /> mark<br /><br /><pre
class="moz-signature"cols="72">-- 
 
Mark Mielke <a class="moz-txt-link-rfc2396E" href="mailto:mark@mielke.cc"><mark@mielke.cc></a>
</pre>

Re: Hash index todo list item

From

Kenneth Marshall

Date:

09 September 2007, 23:43:07

On Sun, Sep 02, 2007 at 10:41:22PM -0400, Tom Lane wrote:
> Kenneth Marshall <ktm@rice.edu> writes:
> > ... This is the rough plan. Does anyone see anything critical that
> > is missing at this point?
> 
> Sounds pretty good.  Let me brain-dump one item on you: one thing that
> hash currently has over btree is the ability to handle index items up
> to a full page.  Now, if you go with a scheme that only stores hash
> codes and not the underlying data, you can not only handle that but
> improve on it; but if you reduce the bucket size and don't remove the
> data, it'd be a step backward.  The idea I had about dealing with that
> was to only reduce the size of primary buckets --- if it's necessary to
> add overflow space to a bucket, the overflow units are still full pages.
> So an index tuple up to a page in size can always be accommodated by
> adding an overflow page to the bucket.
> 
> Just a thought, but AFAIR it's not in the archives anywhere.
> 
>             regards, tom lane
> 
I was thinking about this some more, and it strikes me that we can
keep the page size = bucket size = overflow size in the new scheme
of storing the hash value in the index and still reduce the effective
bucket size. Let's make the new size the (page size / 2^n) where n
is chosen appropriately. Then we we store the value in the page we
simply use n bits of the hash to determine which sub-piece of the
page to use to actually store the value. We may need to add a multiplier
to adjust the decision to split the page based on the mini-page. This
should allow us to much more densely pack the pages and approach the
1 item per bucket. This could easily shrink the size of the index by
a factor of 2^n. Let me know what you think.

Regards,
Ken

Re: Hash index todo list item

From

Jens-Wolfhard Schicke

Date:

10 September 2007, 06:02:44

--On Samstag, September 08, 2007 18:56:23 -0400 Mark Mielke 
<mark@mark.mielke.cc> wrote:
> Kenneth Marshall wrote:
> Along with the hypothetical performance
> wins, the hash index space efficiency would be improved by a similar
> factor. Obviously, all of these ideas would need to be tested in
> various workload environments. In the large index arena, 10^6 to 10^9
> keys and more, space efficiency will help keep the index manageable
> in todays system memories.
>
>
> Space efficiency is provided by not storing the key, nor the header data
> required (length prefix?).
Space efficiency at ~1 entry per bucket: How about using closed hashing, 
saving in each page a bitmask in front which specifies which entries hold 
valid entries and in the rest of the page row-pointers (is this the correct 
expression? I don't know...) without further data. Should provide 
reasonably simple data structure and alignment for the pointers.

> Please keep the ideas and comments coming. I am certain that a synthesis
> of them will provide an implementation with the performance
> characteristics
> that we are seeking.

One should look into new plan nodes for "!= ANY()", "NOT EXISTS" and 
similar. A node like "look into hash and true if bucket is empty" would 
work without checking tuple visibility when the bucket is empty and could 
be a win in some situations.

Do we want special cases for short keys like INT4? In those cases the 
implementation might use hash == key and put that knowledge to use in 
plans. Even a unique constraint might then be doable. Does the 
postgresql-storage backend on linux support sparse files? Might be a win 
when holes in the sequence turn up.

Re: Hash index todo list item

From

Jens-Wolfhard Schicke

Date:

10 September 2007, 08:28:06

More random thoughts:

- Hash-Indices are best for unique keys, but every table needs a new hash 
key, which means one more random page access. Is there any way to build 
multi-_table_ indices? A join might then fetch all table rows with a given 
unique key after one page fetch for the combined index.

- Hashes with trivial hash-functions (like identity) can also return rows 
in a desired order.

- Is there a case where a sequentially scanning a hash-index is useful? I 
can't find any, but maybe somebody else has a use-case.

- What about HashJoins when the referenced tables have hash-indices?

- What about hash-indices where entries are inserted for multiple columns. 
Assume a table like this:

CREATE TABLE link (obj_id1 INT4, obj_id2 INT4);

and a query like

SELECT * FROM link WHERE ? IN (obj_id1, obj_id2);

or some join using a similar condition. It might be a nice thing to insert 
entries at both HASH(obj_id1) and HASH(obj_id2), which would eliminate the 
need to check in two indices and do a bitmap OR. OTOH it might not be 
faster in any significant use cases because who'd need a link table with 
nearly unique linked objects?

- In cases where the distribution of the hash-function is good, but a small 
and relatively even number of rows exist for each key (like it might be the 
case in the above example), it might be nice to reserve a given amount of 
same-key row entries in each bucket, and hold a fill-count at the front of 
it. That would avoid costly page fetches after each collision. You'd create 
a hash-index with n-buckets, each m-elements large. When the bucket is 
full, the usual collision handling continues.

- About hash enlargement: What about always using only the first k bits of 
each hash value. When you find that the hash is "quite-full" (however that 
is defined and detected), k is increased by one, effectively doubling the 
hash size. New entries are then written as usual, while retrieving the old 
entries needs to test at the k-bit-position first and if there is a miss, 
also at the k-1-position and so forth. To limit this search, some 
background process could after analyzing the index move old entries to the 
now correct k-bit-position and increment some "min-k"-value once all old 
entries have been moved. After the hash has been increased, the index would 
approximately half it's speed for some time then. Additionally one could 
also insert the entry at the new position if it has been found at the old 
one only while using the index. A special "miss"-entry at the new position 
doesn't help if nothing could be found because the old positions will 
usually hold some data which resides there even if it uses k bits.

Re: Hash index todo list item

From

Mark Mielke

Date:

10 September 2007, 11:36:35

Kenneth Marshall wrote: <blockquote cite="mid:20070910024258.GC5679@it.is.rice.edu" type="cite"><pre wrap="">On Sun,
Sep02, 2007 at 10:41:22PM -0400, Tom Lane wrote: </pre><blockquote type="cite"><pre wrap="">Kenneth Marshall <a
class="moz-txt-link-rfc2396E"href="mailto:ktm@rice.edu"><ktm@rice.edu></a> writes:   </pre><blockquote
type="cite"><prewrap="">... This is the rough plan. Does anyone see anything critical that
 
is missing at this point?     </pre></blockquote><pre wrap="">Sounds pretty good.  Let me brain-dump one item on you:
onething that
 
hash currently has over btree is the ability to handle index items up
to a full page.  Now, if you go with a scheme that only stores hash
codes and not the underlying data, you can not only handle that but
improve on it; but if you reduce the bucket size and don't remove the
data, it'd be a step backward.  The idea I had about dealing with that
was to only reduce the size of primary buckets --- if it's necessary to
add overflow space to a bucket, the overflow units are still full pages.
So an index tuple up to a page in size can always be accommodated by
adding an overflow page to the bucket.

Just a thought, but AFAIR it's not in the archives anywhere.
        regards, tom lane
   </pre></blockquote><pre wrap="">I was thinking about this some more, and it strikes me that we can
keep the page size = bucket size = overflow size in the new scheme
of storing the hash value in the index and still reduce the effective
bucket size. Let's make the new size the (page size / 2^n) where n
is chosen appropriately. Then we we store the value in the page we
simply use n bits of the hash to determine which sub-piece of the
page to use to actually store the value. We may need to add a multiplier
to adjust the decision to split the page based on the mini-page. This
should allow us to much more densely pack the pages and approach the
1 item per bucket. This could easily shrink the size of the index by
a factor of 2^n. Let me know what you think. </pre></blockquote> My personal opinion is that something like this is
requiredto take best advantage of hashes. I didn't respond immediately because I don't have advice on what the best
implementationwould look like.<br /><br /> I have also been wondering about the effect of a hash index that includes
multiplevalues to the same key (either a non-unique key, or different tuples from the same key due to table updates). I
startedby thinking that the maximum number of hash entries per hash bucket should be kept to 2 or 3 to prevent
reductionin performance to that of btree, but I think this doesn't work if multiple tuples can have the same key.
Unless- the maps is hash value =1:1> index page =1:1> hash bucket =1:N> hash value =1:M=> tuples. Guarantee
thanN is small (either <= 2 or <=4 depending on performance evaluation) by re-hashing if N ever becomes > 2 or
>4. Choose a sparse harsh. Let 1:M be indefinite? Also, optimize the 1:M=1:1 case, such that the value can be
inline?<br/><br /> For most cases, I would think the above model would make it cheap to determine if the key does not
exist,as well as the 1:1 (hash value:key) case requiring a single page lookup. Update performance would suffer, but
lookupshould be faster than btree in these cases, as btree often requires > 1 index page scan.<br /><br />
Cheers,<br/> mark<br /><br /><pre class="moz-signature" cols="72">-- 
 
Mark Mielke <a class="moz-txt-link-rfc2396E" href="mailto:mark@mielke.cc"><mark@mielke.cc></a>
</pre>

Re: Hash index todo list item

From

Martijn van Oosterhout

Date:

10 September 2007, 12:05:25

On Sat, Sep 08, 2007 at 06:56:23PM -0400, Mark Mielke wrote:
> I think that if the case of >1 entry per hash becomes common enough to
> be significant, and the key is stored in the hash, that a btree will
> perform equal or better, and there is no point in pursuing  such a hash
> index model. This is where we are today.

There is the point that if a user does an UPDATE of a row without
changing the key your index will have to store entries with the same
hash. If your goal is mostly write-once tables, then that's cool, but
otherwise you're going to have to find a way of dealing with that. It's
not just going to happen a lot, it is going to be common, even for
unique indexes.

Presumably HOT will help with this, but that relies on all index columns
not to change.

The major benenfits will mostly come from not storing the key at all. I
think.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: Hash index todo list item

From

Tom Raney

Date:

20 September 2007, 21:12:40

We are pleased to announce an upcoming patch to the hash index code
which improves build time and index size, based on this item in the
TODO list:
During index creation, pre-sort the tuples to improve build speed
http://archives.postgresql.org/pgsql-hackers/2007-03/msg01199.php

Using our implementation, build times and index sizes are
comparable with btree index build times and index sizes.
For example, for a particular 12 million row relation, the
8.2.4 release requires over 2.8 hours to build the hash index. 
With our patch, the index is built in 80 seconds.
Here is the link for a graph, showing a comparative analysis of
btree and hash index builds and describing the benchmark data.
http://web.cecs.pdx.edu/~raneyt/pg/

We are currently cleaning up the patch and will submit it asap.

Regards,
Shreya Bhargava <shreya_bhargav@yahoo.com>
Tom Raney <twraney@comcast.net>


Kenneth Marshall wrote:
> Dear PostgreSQL Hackers:
>
> After following the hackers mailing list for quite a while,
> I am going to start investigating what will need to be done
> to improve hash index performance. Below are the pieces of
> this project that I am currently considering:
>
> 1. Characterize the current hash index implementation against
>    the BTree index, with a focus on space utilization and
>    lookup performance against a collection of test data. This
>    will give a baseline performance test to evaluate the impact
>    of changes. I initially do not plan to bench the hash creation
>    process since my initial focus will be on lookup performance.
>
> 2. Evaluate the performance of different hash index implementations
>    and/or changes to the current implementation. My current plan is
>    to keep the implementation as simple as possible and still provide
>    the desired performance. Several hash index suggestions deal with
>    changing the layout of the keys on a page to improve lookup
>    performance, including reducing the bucket size to a fraction of
>    a page or only storing the hash value on the page, instead of
>    the index value itself. My goal in this phase is to produce one
>    or more versions with better performance than the current BTree.
>    
> 3. Look at build time and concurrency issues with the addition of
>    some additional tests to the test bed. (1)
>
> 4. Repeat as needed.
>
> This is the rough plan. Does anyone see anything critical that
> is missing at this point? Please send me any suggestions for test
> data and various performance test ideas, since I will be working
> on that first.
>
> Regards,
> Ken Marshall 
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings
>

Re: Hash index todo list item

From

Kenneth Marshall

Date:

25 September 2007, 10:17:28

On Thu, Sep 20, 2007 at 05:12:45PM -0700, Tom Raney wrote:
> We are pleased to announce an upcoming patch to the hash index code
> which improves build time and index size, based on this item in the
> TODO list:
> During index creation, pre-sort the tuples to improve build speed
> http://archives.postgresql.org/pgsql-hackers/2007-03/msg01199.php
>
> Using our implementation, build times and index sizes are
> comparable with btree index build times and index sizes.
> For example, for a particular 12 million row relation, the
> 8.2.4 release requires over 2.8 hours to build the hash index. With our 
> patch, the index is built in 80 seconds.
> Here is the link for a graph, showing a comparative analysis of
> btree and hash index builds and describing the benchmark data.
> http://web.cecs.pdx.edu/~raneyt/pg/
>
> We are currently cleaning up the patch and will submit it asap.
>
> Regards,
> Shreya Bhargava <shreya_bhargav@yahoo.com>
> Tom Raney <twraney@comcast.net>
>

That is super! (and timely)

I was just looking at that some myself since the large index build
times make testing a very laborious process. I am looking forward to
seeing the details. 

Regards,
Ken

Re: Hash index todo list item

From

Gregory Stark

Date:

25 September 2007, 11:36:02

"Kenneth Marshall" <ktm@rice.edu> writes:

> On Thu, Sep 20, 2007 at 05:12:45PM -0700, Tom Raney wrote:
>
>> Using our implementation, build times and index sizes are
>> comparable with btree index build times and index sizes.
>...
> That is super! (and timely)

It is pretty super. I have a few comments to raise but don't take it to be too
negative, it sounds like this is a big step forward towards making hash
indexes valuable.

Firstly in the graphs it seems the index size graph has an exponential x-axis
but a linear y-axis. This makes it hard to see what I would expect to be
pretty much linear growth. The curves look exponential which would mean linear
growth but of course it's hard to tell.

Also, the growth in the time chart looks pretty much linear. That seems weird
since I would expect there would be a visible extra component since sort times
are n-log(n). Perhaps you need to test still larger tables to see that though.

In any case it's clear from the results you have there that the change is a
positive one and fixes a fundamental problem with the hash index build code.

Something else you should perhaps test is indexing something which is
substantially larger than hash function output. A hash function is going to
make the most sense when indexing something like strings for which you want to
avoid the long comparison costs. Especially consider testing this on a UTF8
locale with expensive comparisons (like a CJK locale for example).

Note that the bottom line for the problems with hash indexes is that the
current implementation doesn't offer any advantages over btree indexes. Hash
indexes need to be not only as good of a choice as btree indexes but
significantly better a significantly better choice at least for some specific
circumstances.

Also, if you're going to submit a patch you should check out a copy of the CVS
HEAD and work from that. I don't think there are huge differences in the area
of hash indexes though. But in most other areas you would be spending quite a
bit of time dealing details which have changed since.

Finally note that we're in the final throes of the 8.3 feature freeze.
Normally any patch submitted now would be held until 8.3 is released and
development on 8.4 is started. I could imagine an exception being made for
hash indexes since they're so moribund currently but probably not. The flip
side of that argument is that there's not much point in making an exception
for something which will only be really useful once further work is done in
the same area.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com

Re: Hash index todo list item

From

Kenneth Marshall

Date:

25 September 2007, 13:26:14

On Tue, Sep 25, 2007 at 03:35:47PM +0100, Gregory Stark wrote:
> "Kenneth Marshall" <ktm@rice.edu> writes:
> 
> > On Thu, Sep 20, 2007 at 05:12:45PM -0700, Tom Raney wrote:
> >
> >> Using our implementation, build times and index sizes are
> >> comparable with btree index build times and index sizes.
> >...
> > That is super! (and timely)
> 
> It is pretty super. I have a few comments to raise but don't take it to be too
> negative, it sounds like this is a big step forward towards making hash
> indexes valuable.
> 
> Firstly in the graphs it seems the index size graph has an exponential x-axis
> but a linear y-axis. This makes it hard to see what I would expect to be
> pretty much linear growth. The curves look exponential which would mean linear
> growth but of course it's hard to tell.
> 
> Also, the growth in the time chart looks pretty much linear. That seems weird
> since I would expect there would be a visible extra component since sort times
> are n-log(n). Perhaps you need to test still larger tables to see that though.
> 
> In any case it's clear from the results you have there that the change is a
> positive one and fixes a fundamental problem with the hash index build code.
> 
> Something else you should perhaps test is indexing something which is
> substantially larger than hash function output. A hash function is going to
> make the most sense when indexing something like strings for which you want to
> avoid the long comparison costs. Especially consider testing this on a UTF8
> locale with expensive comparisons (like a CJK locale for example).
> 
> Note that the bottom line for the problems with hash indexes is that the
> current implementation doesn't offer any advantages over btree indexes. Hash
> indexes need to be not only as good of a choice as btree indexes but
> significantly better a significantly better choice at least for some specific
> circumstances.
> 
> Also, if you're going to submit a patch you should check out a copy of the CVS
> HEAD and work from that. I don't think there are huge differences in the area
> of hash indexes though. But in most other areas you would be spending quite a
> bit of time dealing details which have changed since.
> 
> Finally note that we're in the final throes of the 8.3 feature freeze.
> Normally any patch submitted now would be held until 8.3 is released and
> development on 8.4 is started. I could imagine an exception being made for
> hash indexes since they're so moribund currently but probably not. The flip
> side of that argument is that there's not much point in making an exception
> for something which will only be really useful once further work is done in
> the same area.
> 

Although I am very excited about this patch, I do not see any real value
in including it in 8.3. As you mention, we need to to have a hash index
implementation that outperforms btree in some problem regime and that is
currently not the case. I have just recently started the process of
gathering ideas and having discussions on various approaches to making
hash indexes more performant and we have a number of ideas on which to
start our investigation. I do think that this patch will make the testing
and evaluation, that will be needed to truly improve the hash index, much
much easier.

Regards,
Ken

Re: Hash index todo list item

From

Michael Glaesemann

Date:

25 September 2007, 13:38:52

On Sep 25, 2007, at 11:26 , Kenneth Marshall wrote:

> Although I am very excited about this patch, I do not see any real  
> value
> in including it in 8.3.

I don't think you have to worry about it being in 8.3. Feature freeze  
was months ago.

Michael Glaesemann
grzm seespotcode net

Re: Hash index todo list item

From

Tom Raney

Date:

26 September 2007, 02:43:06

Kenneth Marshall wrote:
> On Tue, Sep 25, 2007 at 03:35:47PM +0100, Gregory Stark wrote:
>   
>> "Kenneth Marshall" <ktm@rice.edu> writes:
>>
>>     
>>> On Thu, Sep 20, 2007 at 05:12:45PM -0700, Tom Raney wrote:
>>>
>>>       
>>>> Using our implementation, build times and index sizes are
>>>> comparable with btree index build times and index sizes.
>>>>         
>>> ...
>>> That is super! (and timely)
>>>       
>> It is pretty super. I have a few comments to raise but don't take it to be too
>> negative, it sounds like this is a big step forward towards making hash
>> indexes valuable.
>>
>> Firstly in the graphs it seems the index size graph has an exponential x-axis
>> but a linear y-axis. This makes it hard to see what I would expect to be
>> pretty much linear growth. The curves look exponential which would mean linear
>> growth but of course it's hard to tell.
>>
>> Also, the growth in the time chart looks pretty much linear. That seems weird
>> since I would expect there would be a visible extra component since sort times
>> are n-log(n). Perhaps you need to test still larger tables to see that though.
>>
>> In any case it's clear from the results you have there that the change is a
>> positive one and fixes a fundamental problem with the hash index build code.
>>
>> Something else you should perhaps test is indexing something which is
>> substantially larger than hash function output. A hash function is going to
>> make the most sense when indexing something like strings for which you want to
>> avoid the long comparison costs. Especially consider testing this on a UTF8
>> locale with expensive comparisons (like a CJK locale for example).
>>
>> Note that the bottom line for the problems with hash indexes is that the
>> current implementation doesn't offer any advantages over btree indexes. Hash
>> indexes need to be not only as good of a choice as btree indexes but
>> significantly better a significantly better choice at least for some specific
>> circumstances.
>>
>> Also, if you're going to submit a patch you should check out a copy of the CVS
>> HEAD and work from that. I don't think there are huge differences in the area
>> of hash indexes though. But in most other areas you would be spending quite a
>> bit of time dealing details which have changed since.
>>
>> Finally note that we're in the final throes of the 8.3 feature freeze.
>> Normally any patch submitted now would be held until 8.3 is released and
>> development on 8.4 is started. I could imagine an exception being made for
>> hash indexes since they're so moribund currently but probably not. The flip
>> side of that argument is that there's not much point in making an exception
>> for something which will only be really useful once further work is done in
>> the same area.
>>
>>     
>
> Although I am very excited about this patch, I do not see any real value
> in including it in 8.3. As you mention, we need to to have a hash index
> implementation that outperforms btree in some problem regime and that is
> currently not the case. I have just recently started the process of
> gathering ideas and having discussions on various approaches to making
> hash indexes more performant and we have a number of ideas on which to
> start our investigation. I do think that this patch will make the testing
> and evaluation, that will be needed to truly improve the hash index, much
> much easier.
>
> Regards,
> Ken
>
>   
We're glad to contribute and be a part of Postgres.  The patch has been 
posted to pgsql-patches@postgresql.org.

Speeding up the creation time of hash indexes on non-trivial relations 
was our goal.  This will allow some interesting performance tests of the 
hash index on very large relations.  It may be that the near constant 
lookup time of the hash index outperforms the Btree index for some large 
data sets and for certain types of data and distributions.

Sincerely,
Tom Raney

Re: Hash index todo list item

From

Bruce Momjian

Date:

28 September 2007, 15:48:21

This has been saved for the 8.4 release:
http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---------------------------------------------------------------------------

Tom Raney wrote:
> We are pleased to announce an upcoming patch to the hash index code
> which improves build time and index size, based on this item in the
> TODO list:
> During index creation, pre-sort the tuples to improve build speed
> http://archives.postgresql.org/pgsql-hackers/2007-03/msg01199.php
> 
> Using our implementation, build times and index sizes are
> comparable with btree index build times and index sizes.
> For example, for a particular 12 million row relation, the
> 8.2.4 release requires over 2.8 hours to build the hash index. 
> With our patch, the index is built in 80 seconds.
> Here is the link for a graph, showing a comparative analysis of
> btree and hash index builds and describing the benchmark data.
> http://web.cecs.pdx.edu/~raneyt/pg/
> 
> We are currently cleaning up the patch and will submit it asap.
> 
> Regards,
> Shreya Bhargava <shreya_bhargav@yahoo.com>
> Tom Raney <twraney@comcast.net>
> 
> 
> Kenneth Marshall wrote:
> > Dear PostgreSQL Hackers:
> >
> > After following the hackers mailing list for quite a while,
> > I am going to start investigating what will need to be done
> > to improve hash index performance. Below are the pieces of
> > this project that I am currently considering:
> >
> > 1. Characterize the current hash index implementation against
> >    the BTree index, with a focus on space utilization and
> >    lookup performance against a collection of test data. This
> >    will give a baseline performance test to evaluate the impact
> >    of changes. I initially do not plan to bench the hash creation
> >    process since my initial focus will be on lookup performance.
> >
> > 2. Evaluate the performance of different hash index implementations
> >    and/or changes to the current implementation. My current plan is
> >    to keep the implementation as simple as possible and still provide
> >    the desired performance. Several hash index suggestions deal with
> >    changing the layout of the keys on a page to improve lookup
> >    performance, including reducing the bucket size to a fraction of
> >    a page or only storing the hash value on the page, instead of
> >    the index value itself. My goal in this phase is to produce one
> >    or more versions with better performance than the current BTree.
> >    
> > 3. Look at build time and concurrency issues with the addition of
> >    some additional tests to the test bed. (1)
> >
> > 4. Repeat as needed.
> >
> > This is the rough plan. Does anyone see anything critical that
> > is missing at this point? Please send me any suggestions for test
> > data and various performance test ideas, since I will be working
> > on that first.
> >
> > Regards,
> > Ken Marshall 
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 5: don't forget to increase your free space map settings
> >
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://postgres.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

Re: Hash index todo list item

From

Kenneth Marshall

Date:

12 October 2007, 20:50:49

On Wed, Sep 05, 2007 at 03:07:03PM -0500, Kenneth Marshall wrote:
> On Sun, Sep 02, 2007 at 01:04:04PM -0500, Kenneth Marshall wrote:
> > Dear PostgreSQL Hackers:
> > 
> > After following the hackers mailing list for quite a while,
> > I am going to start investigating what will need to be done
> > to improve hash index performance. Below are the pieces of
> > this project that I am currently considering:
> > 
> > 1. Characterize the current hash index implementation against
> >    the BTree index, with a focus on space utilization and
> >    lookup performance against a collection of test data. This
> >    will give a baseline performance test to evaluate the impact
> >    of changes. I initially do not plan to bench the hash creation
> >    process since my initial focus will be on lookup performance.
> > 
> 
> Here are very basic results for a table with 1.6m entries:
> 
> postgres=# CREATE TABLE dict (word varchar(100));
> CREATE TABLE
> 
> postgres=# COPY dict FROM '/tmp/words';
> COPY 1648379
> postgres=# select count(*) from dict;
>   count  
> ---------
>  1648379
> (1 row)
> 
> Time: 11187.418 ms
> postgres=# select count(*) from dict;
>   count  
> ---------
>  1648379
> (1 row)
> 
> Time: 6040.912 ms
> postgres=# CREATE INDEX wordhash ON dict USING hash (word);
> CREATE INDEX
> Time: 11108707.160 ms
> 
> postgres=# select * from dict where word = 'avatar';
>   word  
> --------
>  avatar
> (1 row)
> 
> Time: 79.823 ms
> postgres=# select * from dict where word = 'zebra';
>  word  
> -------
>  zebra
> (1 row)
> 
> Time: 9.864 ms
> postgres=# select * from dict where word = 'turkey'; 
>   word  
> --------
>  turkey
> (1 row)
> 
> Time: 18.418 ms
> Time: 1.045 ms
> Time: 1.257 ms
> Time: 1.080 ms
> 
> postgres=# CREATE INDEX wordbtree ON dict USING btree (word);
> CREATE INDEX
> 
> Time: 25438.884 ms
> 
> postgres=# select * from dict where word = 'avatar';
>   word  
> --------
>  avatar
> (1 row)
> 
> Time: 13.400 ms
> postgres=# select * from dict where word = 'zebra';
>  word  
> -------
>  zebra
> (1 row)
> 
> Time: 1.173 ms
> postgres=# select * from dict where word = 'turkey';
>   word  
> --------
>  turkey
> (1 row)
> 
> Time: 1.186 ms
> Time: 1.103 ms
> Time: 1.099 ms
> Time: 1.108 ms
> 
> ------------------------------
> Size of table =       87556096
> 
> Size of hash index = 268451840
> Size of btree index = 53510144
> 
> From my very small sample on an unloaded machine, a hash index lookup
> took the least amount of time. It had a much larger initial time which
> could be attributable to cache population effects. The size is 5X that
> of the Btree index. I will continue to improve the test suite as more
> granularity is needed. If anyone has a good data generator, please let
> me know. Otherwise I will just roll my own.
> 
> Regards,
> Ken
> 
I have just re-ran the previous benchmark against 8.3beta1 (versus 8.2)
including the hash index sorted build patch and the results are below:

test=# CREATE TABLE dict (word varchar(100));
CREATE TABLE
Time: 24.463 ms
test=# COPY dict FROM '/tmp/cracklib-words';
LOG:  checkpoints are occurring too frequently (21 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
COPY 1648379
Time: 44470.235 ms
test=# select count(*) from dict; count  
---------1648379
(1 row)

Time: 4553.924 ms
test=# CREATE INDEX wordhash ON dict USING hash (word);
CREATE INDEX
Time: 85905.960 ms
test=# select * from dict where word = 'avatar'; word  
--------avatar
(1 row)

Time: 592.906 ms
test=# select * from dict where word = 'zebra';word  
-------zebra
(1 row)

Time: 21.458 ms
test=# select * from dict where word = 'turkey'; word  
--------turkey
(1 row)

Time: 22.532 ms
Time: 1.224 ms
Time: 1.200 ms
Time: 1.264 ms
test=# CREATE INDEX wordbtree ON dict USING btree (word);
CREATE INDEX
Time: 33714.874 ms
test=# select * from dict where word = 'avatar'; word  
--------avatar
(1 row)

Time: 24.296 ms
test=# select * from dict where word = 'zebra';word  
-------zebra
(1 row)

Time: 1.400 ms
test=# select * from dict where word = 'turkey'; word  
--------turkey
(1 row)

Time: 28.302 ms
Time: 1.284 ms
Time: 1.313 ms

---------------------------------------
Size of table =         69877760

Size of hash index =   268451840
Size of btree index =   48570368

I just wanted to have this baseline for future patches since
8.3 is just around the bend. As expected, the sorted build
process is a huge improvement from the unsorted original.

Regards,
Ken Marshall

Re: Hash index todo list item

From

Kenneth Marshall

Date:

17 October 2007, 21:43:39

Tom,

That is great. I am looking forward to your patch. After the
issues that you needed to address, I think that it would be
reasonable to add a few more user settings for the hash index.
Fill-factor is too course a knob. The others that I have been
considering are:

maxtuples - Not really the maximum, but a target value to use for setting up the initial buckets. This would allow you
toset it for data loads and avoid the "split-n-copy" trauma that you are trying to avoid with your new hash build
process.

multiplicity - Try to capture use cases that would require many overflow pages. In particular, if we discard the normal
indexpage layout we can skip the space overhead of the page pointer and generate a more compact index. Then you could
usea few more hash bits to lookup the index entry in the page. How many bits would be determined by this factor. 8-bits
wouldgive you 256 sub-pieces that could each hold about 3 entries using the current 4-byte hash, or 2 entries using an
8-bytehash.

What do you think?

Cheers,
Ken

On Wed, Oct 17, 2007 at 03:31:58PM -0700, Tom Raney wrote:
> Kenneth,
>
> Great!
>
> Yes, we did update the code to use the estimate.  I will post the patch 
> with this update.  We only saw a very small difference in index build time, 
> but you may when you add many columns to the base relation. 
> With 1 billion tuples, you should start to see the hash index outperform 
> the btree for some equality probes, I would imagine.  With a 90% fill 
> factor, the btree would require 4 levels to index that many tuples.  If the 
> top two were in memory, there would be 3 IOs needed.  I don't think PG 
> supports index only scans, so it will take the extra IO to probe the base 
> relation.  The hash may take up to 2 IOs and maybe even less (or maybe more 
> depending on how many overflow buckets there are).  It might be interesting 
> to fiddle with the fill factors of each index - hash pages (buckets) 
> default to 75% full. 
> -Tom
>> Tom,
>>
>> I am getting ready to stitch in an updated, simplified version
>> of Neil Conway's hash-only hash index patch. Did you have a
>> chance to update your sizing function to use the planner-like
>> estimate and not a full table scan? I would like to be able
>> to use that when my test table start to have 10^9 entries.
>> If you have not had a chance, I will try and add it myself.
>>
>> Regards,
>> Ken
>>
>>   
>
>

Re: Hash index todo list item

From

Tom Raney

Date:

21 October 2007, 17:13:59

Kenneth, I just pushed the revised patch (v2!).  The revised approach 
samples the parent relation to estimate the number of tuples rather than 
performing a complete scan.  In my tests, the estimate appears to be 
accurate, erring on the larger side, which is fine.

> Tom,
>
> That is great. I am looking forward to your patch. After the
> issues that you needed to address, I think that it would be
> reasonable to add a few more user settings for the hash index.
> Fill-factor is too course a knob. The others that I have been
> considering are:
>   

> maxtuples - Not really the maximum, but a target value to use
>   for setting up the initial buckets. This would allow you to
>   set it for data loads and avoid the "split-n-copy" trauma
>   that you are trying to avoid with your new hash build process.
>   
If I understand you correctly, I believe we already do this with our 
current build process, there should not be any splits of the index if we 
estimated the tuple count correctly.  However, what gets you is 
collisions where lots of overflow pages occur when distinct keys map to 
the same bucket, or if you have lots of duplicate keys.  Because your 
overall tuple count doesn't exceed the fill factor, no splits occur, but 
lengthy bucket chains lead to lots of IOs.  You touch on this below.
> multiplicity - Try to capture use cases that would require many
>   overflow pages. In particular, if we discard the normal index
>   page layout we can skip the space overhead of the page pointer
>   and generate a more compact index. Then you could use a few
>   more hash bits to lookup the index entry in the page. How many
>   bits would be determined by this factor. 8-bits would give
>   you 256 sub-pieces that could each hold about 3 entries using
>   the current 4-byte hash, or 2 entries using an 8-byte hash.
>
> What do you think?
>   
Yes, this is a good direction.  If we can increase the number of buckets 
and reduce the bucket size (either physically or virtually) to allow 
more direct access without creating a huge index on disk, that would be 
ideal.  But, then if you do have collisions, overflows occur more 
frequently.  I spoke with Neil Conway yesterday at the PG conference 
here in Portland and he piqued my interest in examining his hash code 
more closely to see what he has already done in this area.

-Tom
> Cheers,
> Ken
>
> On Wed, Oct 17, 2007 at 03:31:58PM -0700, Tom Raney wrote:
>   
>> Kenneth,
>>
>> Great!
>>
>> Yes, we did update the code to use the estimate.  I will post the patch 
>> with this update.  We only saw a very small difference in index build time, 
>> but you may when you add many columns to the base relation. 
>> With 1 billion tuples, you should start to see the hash index outperform 
>> the btree for some equality probes, I would imagine.  With a 90% fill 
>> factor, the btree would require 4 levels to index that many tuples.  If the 
>> top two were in memory, there would be 3 IOs needed.  I don't think PG 
>> supports index only scans, so it will take the extra IO to probe the base 
>> relation.  The hash may take up to 2 IOs and maybe even less (or maybe more 
>> depending on how many overflow buckets there are).  It might be interesting 
>> to fiddle with the fill factors of each index - hash pages (buckets) 
>> default to 75% full. 
>> -Tom
>>     
>>> Tom,
>>>
>>> I am getting ready to stitch in an updated, simplified version
>>> of Neil Conway's hash-only hash index patch. Did you have a
>>> chance to update your sizing function to use the planner-like
>>> estimate and not a full table scan? I would like to be able
>>> to use that when my test table start to have 10^9 entries.
>>> If you have not had a chance, I will try and add it myself.
>>>
>>> Regards,
>>> Ken
>>>
>>>   
>>>       
>>     
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>
>

Re: Hash index todo list item

From

Kenneth Marshall

Date:

21 October 2007, 17:45:17

Tom,

Thank you for the update. I am currently working on updating the
patch Neil Conway sent in against 8.0-ish that stores only the hash
in the index and locates the entries within the page using a binary
search. Then I will fold in your recent update.

On Sun, Oct 21, 2007 at 01:13:48PM -0700, Tom Raney wrote:
> Kenneth, I just pushed the revised patch (v2!).  The revised approach 
> samples the parent relation to estimate the number of tuples rather than 
> performing a complete scan.  In my tests, the estimate appears to be 
> accurate, erring on the larger side, which is fine.
>

Yes, larger is great and what we need to avoid all the index tuple
suffling between pages.

>> Tom,
>>
>> That is great. I am looking forward to your patch. After the
>> issues that you needed to address, I think that it would be
>> reasonable to add a few more user settings for the hash index.
>> Fill-factor is too course a knob. The others that I have been
>> considering are:
>>   
>
>> maxtuples - Not really the maximum, but a target value to use
>>   for setting up the initial buckets. This would allow you to
>>   set it for data loads and avoid the "split-n-copy" trauma
>>   that you are trying to avoid with your new hash build process.
>>   
> If I understand you correctly, I believe we already do this with our 
> current build process, there should not be any splits of the index if we 
> estimated the tuple count correctly.  However, what gets you is collisions 
> where lots of overflow pages occur when distinct keys map to the same 
> bucket, or if you have lots of duplicate keys.  Because your overall tuple 
> count doesn't exceed the fill factor, no splits occur, but lengthy bucket 
> chains lead to lots of IOs.  You touch on this below.

Yes, you do address this in your patches and it works well for an
existing heap. My idea was to minimize the shuffling problem when we
are doing a data load and do not have a heap to get a count from
because it has not been loaded yet.
>> multiplicity - Try to capture use cases that would require many
>>   overflow pages. In particular, if we discard the normal index
>>   page layout we can skip the space overhead of the page pointer
>>   and generate a more compact index. Then you could use a few
>>   more hash bits to lookup the index entry in the page. How many
>>   bits would be determined by this factor. 8-bits would give
>>   you 256 sub-pieces that could each hold about 3 entries using
>>   the current 4-byte hash, or 2 entries using an 8-byte hash.
>>
>> What do you think?
>>   
> Yes, this is a good direction.  If we can increase the number of buckets 
> and reduce the bucket size (either physically or virtually) to allow more 
> direct access without creating a huge index on disk, that would be ideal.  
> But, then if you do have collisions, overflows occur more frequently.  I 
> spoke with Neil Conway yesterday at the PG conference here in Portland and 
> he piqued my interest in examining his hash code more closely to see what 
> he has already done in this area.

Right, overflows would occur more frequently and any overflow would
allocate a full page. It may be possible to estimate the multiplicity
and minimize the use of overflow pages. If we know that on average that
there are no more than 10 items in a bucket, we can size the virtual
buckets on the first page to support 10 items and minimize the rollover
to an overflow page.

Other ideas, once we hit the overflow page, back-off to the current
fullpage use to maximize the fill-factor, possibly using Neil's
binary search to speed lookups within the overflow pages. Also, with
the smaller virtual buckets use the remaining space on the page as
the first overflow page to hopefully prevent needing to allocate a
full new overflow page. This would be most useful for the smaller
virtual bucket sizes and improve the overall packing efficiency of
the index. If we intend to take advantage of the O(1) behavior, it
will be most useful for large numbers of tuples, more than 10^9. A
32-bit hash is not good enough and we will need to use a 64-bit hash
to reduce collisions in the hash table and consequent need for overflow
pages. I already have the hash function updated to support 64-bit and
will look at getting that functional once I have the hash-only version
working with the 32-bit hashes. Also, without the need to store the
value in the index, we can boost the size of indexable fields to the
page size, and larger. I was tentatively targeting 32k as the implementation
max, because then hashing time becomes the issue. Of course, all of these
changes and options will need to be benchmarked and discussed.

Thank you again for posting the updated patch.

Regards,
Ken

>
> -Tom
>> Cheers,
>> Ken
>>
>> On Wed, Oct 17, 2007 at 03:31:58PM -0700, Tom Raney wrote:
>>   
>>> Kenneth,
>>>
>>> Great!
>>>
>>> Yes, we did update the code to use the estimate.  I will post the patch 
>>> with this update.  We only saw a very small difference in index build 
>>> time, but you may when you add many columns to the base relation. With 1 
>>> billion tuples, you should start to see the hash index outperform the 
>>> btree for some equality probes, I would imagine.  With a 90% fill factor, 
>>> the btree would require 4 levels to index that many tuples.  If the top 
>>> two were in memory, there would be 3 IOs needed.  I don't think PG 
>>> supports index only scans, so it will take the extra IO to probe the base 
>>> relation.  The hash may take up to 2 IOs and maybe even less (or maybe 
>>> more depending on how many overflow buckets there are).  It might be 
>>> interesting to fiddle with the fill factors of each index - hash pages 
>>> (buckets) default to 75% full. -Tom
>>>     
>>>> Tom,
>>>>
>>>> I am getting ready to stitch in an updated, simplified version
>>>> of Neil Conway's hash-only hash index patch. Did you have a
>>>> chance to update your sizing function to use the planner-like
>>>> estimate and not a full table scan? I would like to be able
>>>> to use that when my test table start to have 10^9 entries.
>>>> If you have not had a chance, I will try and add it myself.
>>>>
>>>> Regards,
>>>> Ken
>>>>
>>>>         
>>>     
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 2: Don't 'kill -9' the postmaster
>>
>>   
>
>

Re: Hash index todo list item

From

Kenneth Marshall

Date:

22 October 2007, 16:13:23

On Thu, Sep 13, 2007 at 02:02:14PM -0700, Neil Conway wrote:
> On Fri, 2007-09-07 at 08:29 -0500, Kenneth Marshall wrote:
> > This is a great starting point. I would appreciate it if you have the
> > time and could make it apply cleanly to HEAD.
> 
> Just to give you an update on this, I'll try to find the time to get it
> done soon, but my day job is keeping me really busy these days, so I'm
> not sure when I'll be able to get to it...
> 
> -Neil
> 
Neil,

I am currently working on integrating your previous patch with the
8.3beta1 hash code + the sorted build patch by Tom Raney. I have
been adding back pieces that were removed and I had a question
about index tuples, in particular what does the size field indicate?
Is it the size in bytes of the data after the IndexTupleData or the
total size in bytes including the IndexTupleData?

In order to use the hitem with the sorted build process, I think
that the flags should provide the size info so that the tuplesort
code can be used. This will also allow the use of >32-bit hash
values. Am I completely off base?

Regards,
Ken

Re: Hash index todo list item

From

Shreya Bhargava

Date:

06 November 2007, 02:47:25

"Note that the bottom line for the problems with hash indexes is that the<br />current implementation doesn't offer any
advantagesover btree indexes. Hash<br />indexes need to be not only as good of a choice as btree indexes but<br
/>significantlybetter a significantly better choice at least for some specific<br />circumstances."<br /><br
/><pre><tt><tt>Weperformed some probe tests on our patch on <br />hash index and the original btree index to compare
the<br />performance between the two. We used a 80 million tuple table.<br />The table contained single integer
attributeand the data<br />range was 0 - 100000000. (generated by a random generator).<br />We did the following:<br
/><br/>1. Populate the table with 80 million tuples.<br />2. Create HASH index on the table.<br />3. clear both linux
cache& psql buffers.<br />   (exiting psql and restarting it cleared the psql buffers;<br />    to clear linux
cache,we used drop_cache command)<br />4. start psql<br />5. </tt></tt><tt><tt>select on aninteger in the range of
valuesin the table</tt></tt><tt><tt>.<br />   (all test numbers were big ones, like 98934599)<br />6. record the
time.<br/>7. exit psql.<br />8. drop caches.(as described above)<br />9. repeat 4-8 for different numbers.<br />10.
DropHash index.<br />11. Create Btree index and repeat 3-9.<br /> <br />The results are as shown below. All trials are
inms. <br />Count is the number of such tuples in the table.<br /> <br /> HASH INDEX:<br /> Number       Count   Trial1
    Trial2      Trial3<br /> 21367898      2         152.0        129.5      131.1<br /> 95678699      2         140.6
     145.6      137.5<br /> 99899799      2         148.0        147.4      152.6<br /> 97677899      2         142.0
    153.5      112.0<br /> 94311123      2         137.6        146.3      141.3<br /> 67544455      2         141.6
   144.6      152.7<br /> 88877677      2         122.1        123.1      130.8<br /> 88788988      2         150.2
  150.0      171.7<br /> 44333344      1        119.9        116.3      117.6<br /> 34267878      1          97.5
 99.9      114.8<br /> 55489781      2         115.7        117.2      145.3 <br /> 97676767      1         169.5
168.5      181.7 <br /> 99767564      1         142.7        133.6      132.7<br /> 21367989      4         198.0
193.2      186.7<br /> <br />BTREE INDEX<br /> <br /> Number      Trial1    Trial2      Trial3<br /> 21367898   145.5
 145.6       150.6<br /> 95678699   177.5     170.0       176.4<br /> 99899799   175.4     181.2       185.4<br />
97677899  136.9     149.0       130.8<br /> 94311123   179.0     175.3       166.3<br /> 67544455   161.7     162.2
 170.4<br /> 88877677   138.7     135.2       149.9<br /> 88788988   165.7     179.3       186.3<br /> 44333344   166.0
   172.8       184.3<br /> 34267878   159.1     168.8       147.8<br /> 55489781   183.2     195.4       185.1<br />
97676767  147.2     143.6       132.6<br /> 99767564   167.8     178.3       162.1<br /> 21367989   232.3     238.1
216.9</tt></tt></pre>From the results obtained, the average of all the hash probes is 141.8ms, the average for btree is
168.5,a difference of about 27.The standard deviations are about 23, so this is a statistically significant
difference.<br/><pre><tt><tt>Our prediction that the hash index would take on the average one<br />probe for about 10ms
andthe btree would take three probes for about 30 ms<br />or a difference of about 20ms was pretty well shown by the
differencewe<br />got of about 27. Hope these data points will help with some questions<br />about the performance
differencesbetween Hash and Btree index.<br /><br />Regards,<br />Shreya</tt></tt></pre><br /><br /><span
style="font-family:monospace;"></span><tt><tt><br /><br /></tt></tt><b><i>Gregory Stark
<stark@enterprisedb.com></i></b>wrote:<blockquote class="replbq" style="border-left: 2px solid rgb(16, 16, 255);
margin-left:5px; padding-left: 5px;"> "Kenneth Marshall"  writes:<br /><br />> On Thu, Sep 20, 2007 at 05:12:45PM
-0700,Tom Raney wrote:<br />><br />>> Using our implementation, build times and index sizes are<br />>>
comparablewith btree index build times and index sizes.<br />>...<br />> That is super! (and timely)<br /><br
/>Itis pretty super. I have a few comments to raise but don't take it to be too<br />negative, it sounds like this is a
bigstep forward towards making hash<br />indexes valuable.<br /><br />Firstly in the graphs it seems the index size
graphhas an exponential x-axis<br />but a linear y-axis. This makes it hard to see what I would expect to be<br
/>prettymuch linear growth. The curves look exponential which would mean linear<br />growth but of course it's hard to
tell.<br/><br />Also, the growth in the time chart looks pretty much linear. That seems weird<br />since I would expect
therewould be a visible extra component since sort times<br />are n-log(n). Perhaps you need to test still larger
tablesto see that though.<br /><br />In any case it's clear from the results you have there that the change is a<br
/>positiveone and fixes a fundamental problem with the hash index build code.<br /><br />Something else you should
perhapstest is indexing something which is<br />substantially larger than hash function output. A hash function is
goingto<br />make the most sense when indexing something like strings for which you want to<br />avoid the long
comparisoncosts. Especially consider testing this on a UTF8<br />locale with expensive comparisons (like a CJK locale
forexample).<br /><br />Note that the bottom line for the problems with hash indexes is that the<br />current
implementationdoesn't offer any advantages over btree indexes. Hash<br />indexes need to be not only as good of a
choiceas btree indexes but<br />significantly better a significantly better choice at least for some specific<br
/>circumstances.<br/><br />Also, if you're going to submit a patch you should check out a copy of the CVS<br />HEAD and
workfrom that. I don't think there are huge differences in the area<br />of hash indexes though. But in most other
areasyou would be spending quite a<br />bit of time dealing details which have changed since.<br /><br />Finally note
thatwe're in the final throes of the 8.3 feature freeze.<br />Normally any patch submitted now would be held until 8.3
isreleased and<br />development on 8.4 is started. I could imagine an exception being made for<br />hash indexes since
they'reso moribund currently but probably not. The flip<br />side of that argument is that there's not much point in
makingan exception<br />for something which will only be really useful once further work is done in<br />the same
area.<br/><br />-- <br /> Gregory Stark<br /> EnterpriseDB http://www.enterprisedb.com<br /><br
/>---------------------------(endof broadcast)---------------------------<br />TIP 2: Don't 'kill -9' the postmaster<br
/></blockquote><br/><p> __________________________________________________<br />Do You Yahoo!?<br />Tired of spam?
Yahoo!Mail has the best spam protection around <br />http://mail.yahoo.com

Re: Hash index todo list item

From

"Gokulakannan Somasundaram"

Date:

06 November 2007, 04:27:49

I have not followed this thread very closely. But just wanted to give my inputs.


> From the results obtained, the average of all the hash probes is 141.8ms,
> the average for btree is 168.5, a difference of about 27.The standard
> deviations are about 23, so this is a statistically significant difference.
> Our prediction that the hash index would take on the average one
> probe for about 10ms and the btree would take three probes for about 30 ms
> or a difference of about 20ms was pretty well shown by the difference we
> got of about 27. Hope these data points will help with some questions
> about the performance differences between Hash and Btree index.
>

We all know that Hash indexes are good for equality queries and Binary
indexes are good for both equality queries and Range queries. So for
equality queries, Hash index should do only one Logical I/O (if no
hash collisions) and Binary index should do atleast 3 (I don't know
the level of B-tree that came out as a result of this). You can enable
the Logical I/O count by applying this patch.

*** postgresql-8.3beta1/src/backend/storage/buffer/bufmgr.c    Tue Sep 25
18:11:48 2007
--- postgresql-8.3patch/src/backend/storage/buffer/bufmgr.c    Fri Oct 19
23:18:36 2007
***************
*** 1470,1477 ****         localhitrate = (float) LocalBufferHitCount *100.0 / ReadLocalBufferCount;
     appendStringInfo(&str,
!     "!\tShared blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
!                 ReadBufferCount - BufferHitCount, BufferFlushCount, hitrate);     appendStringInfo(&str,
"!\tLocal blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",                      ReadLocalBufferCount -
LocalBufferHitCount,
LocalBufferFlushCount, localhitrate);
--- 1470,1477 ----         localhitrate = (float) LocalBufferHitCount *100.0 / ReadLocalBufferCount;
     appendStringInfo(&str,
!     "!\tShared blocks: %10ld Logical Reads, %10ld Physical Reads, %10ld
written, buffer hit rate = %.2f%%\n",
!                 ReadBufferCount, ReadBufferCount - BufferHitCount,
BufferFlushCount, hitrate);     appendStringInfo(&str,     "!\tLocal  blocks: %10ld read, %10ld written, buffer hit
rate= %.2f%%\n",                      ReadLocalBufferCount - LocalBufferHitCount,
 
LocalBufferFlushCount, localhitrate);


If possible, it would be useful for you to present the reduction in
Logical I/O count, as it very well might translate to Physical I/Os
for simple index scans. Atleast check whether it is in the ratio 1:3.

Hope my suggestion helps.


Thanks,
Gokul.
CertoSQL Project,
Allied Solution Group.
(www.alliedgroups.com)

Re: Hash index todo list item

From

Heikki Linnakangas

Date:

06 November 2007, 07:00:11

Shreya Bhargava wrote:
> 1. Populate the table with 80 million tuples.
> 2. Create HASH index on the table.
> 3. clear both linux cache & psql buffers.
>    (exiting psql and restarting it cleared the psql buffers;
>     to clear linux cache, we used drop_cache command)
> 4. start psql
> 5. select on an integer in the range of values in the table.
>    (all test numbers were big ones, like 98934599)
> 6. record the time.
> 7. exit psql.
> 8. drop caches.(as described above)
> 9. repeat 4-8 for different numbers.
> 10. Drop Hash index.
> 11. Create Btree index and repeat 3-9.

It seems you're mostly measuring the overhead of starting a backend, 
populating the relcache etc.

Restarting psql doesn't clear the postgres shared buffer cache. Or did 
you mean that you restarted postgres?

Anyway, I don't think it's interesting to test with cleared caches. 
Surely the metapage and first 1-2 levels of the b-tree would stay cached 
all the time in real life.

> From the results obtained, the average of all the hash probes is 141.8ms, the average for btree is 168.5, a
differenceof about 27.The standard deviations are about 23, so this is a statistically significant difference.

I don't trust those numbers much, but in any case I don't think that 
edge is big enough to justify the existence of hash indexes.

If you're looking for a use case where hash index is faster, I'd suggest 
using a data type with an expensive comparison function. Like long 
multi-byte strings in UTF-8 encoding.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

Re: Hash index todo list item

From

Jens-Wolfhard Schicke

Date:

06 November 2007, 07:07:14

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Shreya Bhargava wrote:
> "Note that the bottom line for the problems with hash indexes is that the
> current implementation doesn't offer any advantages over btree indexes. Hash
> indexes need to be not only as good of a choice as btree indexes but
> significantly better a significantly better choice at least for some
> specific circumstances."
Oh it does. I recently used a hash index to speed up my database. Namely
I found it improved performance when indexing a non-integer column
containing english words.

I don't know how much of that data was cached, according to the sound
of my harddrive it wasn't all of it. Consider this anecdotical
evidence, but the speedup was noticeable.

> We performed some probe tests on our patch on 
> hash index and the original btree index to compare the 
> performance between the two. We used a 80 million tuple table.
> The table contained single integer attribute and the data
> range was 0 - 100000000. (generated by a random generator).

I'd be interested how much difference is there between non-integer
index behaviour. I at least had the impression that in my case
the sorted strings in the btree pages didn't compare too well.

> */Gregory Stark <stark@enterprisedb.com>/* wrote:
>     "Kenneth Marshall" writes:
>     > On Thu, Sep 20, 2007 at 05:12:45PM -0700, Tom Raney wrote:
>     >> Using our implementation, build times and index sizes are
>     >> comparable with btree index build times and index sizes.
Way to go! Currently building hash indices is no fun.

Regards, Jens-W. Schicke
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHMErXzhchXT4RR5ARAh7pAKCZIZFJFa7Oq25GvwDhiZJFsrtwgACbBC1F
otwhIZVlNgUGlroePIafi1c=
=N1f7
-----END PGP SIGNATURE-----

Re: Hash index todo list item

From

Tom Lane

Date:

06 November 2007, 10:46:40

Shreya Bhargava <shreya_bhargav@yahoo.com> writes:
> We performed some probe tests on our patch on 
> hash index and the original btree index to compare the 
> performance between the two.

Interesting, but ...

> From the results obtained, the average of all the hash probes is 141.8ms, the average for btree is 168.5, a
differenceof about 27.The standard deviations are about 23, so this is a statistically significant difference.
 

> Our prediction that the hash index would take on the average one
> probe for about 10ms and the btree would take three probes for about 30 ms
> or a difference of about 20ms was pretty well shown by the difference we
> got of about 27.

... unfortunately, a zero-cache situation isn't very reflective of the
real world.  In reality, for an index that is getting used often enough
to impact application performance at all, the root page will stay in
cache and likely so will some of the upper tree levels.  So I don't
think this experiment proves anything at all about real-world
performance.

What I'd like to see is an aggregate time measurement for millions of
random probes into an index that's noticeably larger than memory.
It would also be interesting to measure the speed of the fully-cached
case (same test, but index smaller than memory).
        regards, tom lane

Re: Hash index todo list item

From

Kenneth Marshall

Date:

14 November 2007, 20:13:03

On Thu, Sep 13, 2007 at 02:02:14PM -0700, Neil Conway wrote:
> On Fri, 2007-09-07 at 08:29 -0500, Kenneth Marshall wrote:
> > This is a great starting point. I would appreciate it if you have the
> > time and could make it apply cleanly to HEAD.
> 
> Just to give you an update on this, I'll try to find the time to get it
> done soon, but my day job is keeping me really busy these days, so I'm
> not sure when I'll be able to get to it...
> 
> -Neil
> 
Neil,

I have been working on putting an updated version of your
patch into the current source. My first try was to try and
put your patch in directly, but it differed so much from the
current build that it was not obvious how to address things
like the current hash_index sorted build patch, which I need
to be able to test with indexes of any size at all. My current
try is to replace the _hash_formitem() calls with a function
called _hash_form_tuple() that actually returns an IndexTuple
and not an HashItem. This will allow it to be used quite
naturally with the current sorted build patch. Here is what
it looks like now:

/** _hash_form_tuple -- construct index tuple using hash(value) not value*/
IndexTuple
_hash_form_tuple(IndexTuple itup, Relation rel)
{       IndexTuple      result;       Size            size;       uint32          hashkey;       Datum           datum;
     bool            isnull;

       /* disallow nulls in hash keys */       if (IndexTupleHasNulls(itup))               ereport(ERROR,
               (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),                                errmsg("hash indexes cannot
containnull keys")

));
       if (rel->rd_rel->relnatts != 1)               elog(ERROR, "hash indexes support only one index key");
       /* hash the tuple; we only store the hash value in the index */       datum = index_getattr(itup, 1,
RelationGetDescr(rel),&isnull);       Assert(!isnull);       hashkey = _hash_datum2hashkey(rel, datum);

       size = IndexTupleSize(itup);       result = (IndexTuple) palloc(size);       memcpy(result, itup, size);
returnresult;

}

I am not currently doing anything other than returning the current
IndexTuple that was created with index_form_tuple(). Am I daft, or
can I just memcpy() the 6 bytes of TID, add the 2 bytes of t_info
(everything 0 and the size set to 6 + 2 + sizeof(hash) = 10), and
the 4 bytes of hash. This will allow me to handle 8-byte hashes
in the future. If you see a problem with this approach, please
let me know. I would appreciate any feedback you can give.

Regards,
Ken
> 
>