Thread: Hash index todo list item
Dear PostgreSQL Hackers: After following the hackers mailing list for quite a while, I am going to start investigating what will need to be done to improve hash index performance. Below are the pieces of this project that I am currently considering: 1. Characterize the current hash index implementation against the BTree index, with a focus on space utilization and lookupperformance against a collection of test data. This will give a baseline performance test to evaluate the impact of changes. I initially do not plan to bench the hash creation process since my initial focus will be on lookup performance. 2. Evaluate the performance of different hash index implementations and/or changes to the current implementation. My currentplan is to keep the implementation as simple as possible and still provide the desired performance. Several hashindex suggestions deal with changing the layout of the keys on a page to improve lookup performance, including reducingthe bucket size to a fraction of a page or only storing the hash value on the page, instead of the index valueitself. My goal in this phase is to produce one or more versions with better performance than the current BTree. 3. Look at build time and concurrency issues with the addition of some additional tests to the test bed. (1) 4. Repeat as needed. This is the rough plan. Does anyone see anything critical that is missing at this point? Please send me any suggestions for test data and various performance test ideas, since I will be working on that first. Regards, Ken Marshall
Kenneth Marshall <ktm@rice.edu> writes: > ... This is the rough plan. Does anyone see anything critical that > is missing at this point? Sounds pretty good. Let me brain-dump one item on you: one thing that hash currently has over btree is the ability to handle index items up to a full page. Now, if you go with a scheme that only stores hash codes and not the underlying data, you can not only handle that but improve on it; but if you reduce the bucket size and don't remove the data, it'd be a step backward. The idea I had about dealing with that was to only reduce the size of primary buckets --- if it's necessary to add overflow space to a bucket, the overflow units are still full pages. So an index tuple up to a page in size can always be accommodated by adding an overflow page to the bucket. Just a thought, but AFAIR it's not in the archives anywhere. regards, tom lane
On Sun, 2007-09-02 at 13:04 -0500, Kenneth Marshall wrote: > Dear PostgreSQL Hackers: > > After following the hackers mailing list for quite a while, > I am going to start investigating what will need to be done > to improve hash index performance. Below are the pieces of > this project that I am currently considering: > > 1. Characterize the current hash index implementation against > the BTree index, with a focus on space utilization and > lookup performance against a collection of test data. This > will give a baseline performance test to evaluate the impact > of changes. I initially do not plan to bench the hash creation > process since my initial focus will be on lookup performance. > > 2. Evaluate the performance of different hash index implementations > and/or changes to the current implementation. My current plan is > to keep the implementation as simple as possible and still provide > the desired performance. Several hash index suggestions deal with > changing the layout of the keys on a page to improve lookup > performance, including reducing the bucket size to a fraction of > a page or only storing the hash value on the page, instead of > the index value itself. My goal in this phase is to produce one > or more versions with better performance than the current BTree. > > 3. Look at build time and concurrency issues with the addition of > some additional tests to the test bed. (1) > > 4. Repeat as needed. > > This is the rough plan. Does anyone see anything critical that > is missing at this point? Please send me any suggestions for test > data and various performance test ideas, since I will be working > on that first. Sounds good. I'd be particularly interested in large indexes, say ~ 0.5 - 2GB. There are likely to be various effects apparent as the indexes grow. It would be too easy to do all the tests with smaller indexes and miss things. Other factors are: - volatility - concurrency My general experience is that hash-based indexes are better when the range of inputs is relatively well-known, allowing a fast lookup. If that is the only benefit of hash indexes, a flexible hashing scheme may simply weaken the benefit-case for using them. If that's true, should the index build process examine the key values in the data to determine the best parameters to use? Kind of ANALYZE before build. My current feeling is that they ought to be very good at handling read-mostly situations such as privilege checking or UPDATE-intensive situations such as Customer-Current-Credit tracking, when the number of customers is large. It might also be worth looking at lossy hash indexes, i.e. the index stores only the block numbers. That would need to be part of the discussion around how lossy we will allow indexes to be. We currently have two kinds of full text index with different concurrency use cases, so it should be acceptable to have hash indexes have a clear benefit in one use case but a clear loss in another. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
On Sun, Sep 02, 2007 at 10:41:22PM -0400, Tom Lane wrote: > Kenneth Marshall <ktm@rice.edu> writes: > > ... This is the rough plan. Does anyone see anything critical that > > is missing at this point? > > Sounds pretty good. Let me brain-dump one item on you: one thing that > hash currently has over btree is the ability to handle index items up > to a full page. Now, if you go with a scheme that only stores hash > codes and not the underlying data, you can not only handle that but > improve on it; but if you reduce the bucket size and don't remove the > data, it'd be a step backward. The idea I had about dealing with that > was to only reduce the size of primary buckets --- if it's necessary to > add overflow space to a bucket, the overflow units are still full pages. > So an index tuple up to a page in size can always be accommodated by > adding an overflow page to the bucket. > > Just a thought, but AFAIR it's not in the archives anywhere. > > regards, tom lane > Tom, Thank you for the input. I agree that keeping the ability to accomodate an index tuple up to a page is size worth keeping. I think that your goal in reducing the bucket size is to improve the packing efficiency of the index. Since the on disk page size remains the same, it may be possible to use a different structure overlayed on the current bucket size and still improve the packing efficiency of the index. After some more mulling, here are some further thoughts on the improved hash table implementation: - Hash lookup is O(1) while btree is O(logN). Is there a value in optimizing the NOT case, i.e. the entry is not in the table? - Space versus performance trade-off. This may tie into cache efficiency and use of L2/L3, shared buffers, main memory. Denserlayouts with a higher load facter may be a bit slower in lookups but play much nicer in a multi-user system. Look atthe possibility of a lossy mapping? - Build time versus update time. How does concurrency enter into the picture regarding simultaneous updates, inserts, deletes,and lookups? - Could a hybrid structure with some type of prefix compression give a more efficient layout and possibly improve performance? - Index larger fields. Btree is limited to blocksize/3, the current hash implementation can go up to a full block. - What about multi-column indexes? The current implementation only supports 1 column. More ideas are welcome and I will add them to the list for investigation. Regards, Ken
On Mon, Sep 03, 2007 at 10:33:54AM +0100, Simon Riggs wrote: > > > > This is the rough plan. Does anyone see anything critical that > > is missing at this point? Please send me any suggestions for test > > data and various performance test ideas, since I will be working > > on that first. > > Sounds good. > > I'd be particularly interested in large indexes, say ~ 0.5 - 2GB. There > are likely to be various effects apparent as the indexes grow. It would > be too easy to do all the tests with smaller indexes and miss things. > > Other factors are: > - volatility > - concurrency > > My general experience is that hash-based indexes are better when the > range of inputs is relatively well-known, allowing a fast lookup. If > that is the only benefit of hash indexes, a flexible hashing scheme may > simply weaken the benefit-case for using them. If that's true, should > the index build process examine the key values in the data to determine > the best parameters to use? Kind of ANALYZE before build. > > My current feeling is that they ought to be very good at handling > read-mostly situations such as privilege checking or UPDATE-intensive > situations such as Customer-Current-Credit tracking, when the number of > customers is large. > > It might also be worth looking at lossy hash indexes, i.e. the index > stores only the block numbers. That would need to be part of the > discussion around how lossy we will allow indexes to be. > > We currently have two kinds of full text index with different > concurrency use cases, so it should be acceptable to have hash indexes > have a clear benefit in one use case but a clear loss in another. > > -- > Simon Riggs > 2ndQuadrant http://www.2ndQuadrant.com > Simon, Thank you for your input. I would like to include some tests with large indexes too. Do you have any ideas for a test corpus or should we try and generate the test data programatically? Many people in the literature of text retrieval use the TREC* data for at least some of their runs. I am going to check at work to see if the campus has access to the data, otherwise I will do some web crawling to generate some sample data. I have just posted a reply to Tom Lane with some further ideas for consideration in the new hash index support. Like you, I suspect that volatile data that results in many index changes may not work well with hash indexes, in general. PostgreSQL has the additional burden of needing to access both the index and the data heap. Obviously, the less I/O that is needed the better the performance is likely to be. The new HOT functionality plus clustering the table data on the hash index would effectively organize the table into the "hash buckets" which could help with reducing both the churn in the index as well as in the tables. Regards, Ken
"Kenneth Marshall" <ktm@rice.edu> writes: > On Sun, Sep 02, 2007 at 10:41:22PM -0400, Tom Lane wrote: >> Kenneth Marshall <ktm@rice.edu> writes: >> > ... This is the rough plan. Does anyone see anything critical that >> > is missing at this point? >> >> Sounds pretty good. Let me brain-dump one item on you: one thing that >> hash currently has over btree is the ability to handle index items up >> to a full page. Now, if you go with a scheme that only stores hash >> codes and not the underlying data, you can not only handle that but >> improve on it; I think that would be a big selling point for hash indexes. It would let you index even toasted data which are larger than a page. I'm not sure whether you can make it work for unique indexes though. But for non-unique indexes I think it would be a solid win and mean you cover a set of use cases quite distinct from btree indexes. > - Hash lookup is O(1) while btree is O(logN). That's not really true. There's a tradeoff between insertion time and lookup time. In order to get O(1) lookups you need to work pretty hard to maintain the hash table including spending a lot of time reorganizing it when you grow it. If you don't want to spend the time on inserts then you end up with buckets and the hash table is basically just a linear speedup to whatever algorithm you use to scan the buckets. > - What about multi-column indexes? The current implementation > only supports 1 column. That seems kind of weird. It seems obvious that you mix the three hashes together which reduces it to the solved problem. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes: > "Kenneth Marshall" <ktm@rice.edu> writes: >> - What about multi-column indexes? The current implementation >> only supports 1 column. > That seems kind of weird. It seems obvious that you mix the three hashes > together which reduces it to the solved problem. No, because part of the deal is that you can do lookups using only the leading index columns. At least, all the existing multicolumn index types can do that. regards, tom lane
On 9/3/07, Gregory Stark <stark@enterprisedb.com> wrote: > > "Kenneth Marshall" <ktm@rice.edu> writes: > > > On Sun, Sep 02, 2007 at 10:41:22PM -0400, Tom Lane wrote: > >> Kenneth Marshall <ktm@rice.edu> writes: > >> > ... This is the rough plan. Does anyone see anything critical that > >> > is missing at this point? > >> > >> Sounds pretty good. Let me brain-dump one item on you: one thing that > >> hash currently has over btree is the ability to handle index items up > >> to a full page. Now, if you go with a scheme that only stores hash > >> codes and not the underlying data, you can not only handle that but > >> improve on it; > > I think that would be a big selling point for hash indexes. It would let you > index even toasted data which are larger than a page. I'm not sure whether you > can make it work for unique indexes though. But for non-unique indexes I think > it would be a solid win and mean you cover a set of use cases quite distinct > from btree indexes. > > > - Hash lookup is O(1) while btree is O(logN). > > That's not really true. There's a tradeoff between insertion time and lookup > time. In order to get O(1) lookups you need to work pretty hard to maintain > the hash table including spending a lot of time reorganizing it when you grow > it. If you don't want to spend the time on inserts then you end up with > buckets and the hash table is basically just a linear speedup to whatever > algorithm you use to scan the buckets. These facts notwithstanding, average insert performance remains O(1) if you grow the hash exponentially every time it needs to be grown. Suppose, for example, that you use a power of 2 arrangement. Then the worst case scenario, right after a split, is that all of your keys had to be inserted, all had to be moved once, half had to be moved twice, a quarter 3 times, etc. So the ratio of moves to keys is 1 + 1/2 + 1/4 + ... which is a well-known geometric series converging on 2. True, when you cross the threshold a lot of work needs to be done. Life would be simpler if you could just put up a lock while you split the hash. You can't do that for a busy transactional database though.But if you want to be clever about it, you build intoyour hash implementation the intelligence to be able to have 1 or 2 hash locations to search. When they are both present, all inserts go into one of them, all deletes and updates are performed against both. Then you're able to have a background job reorganize your hash while the database continues to use it. > > - What about multi-column indexes? The current implementation > > only supports 1 column. > > That seems kind of weird. It seems obvious that you mix the three hashes > together which reduces it to the solved problem. That raises a very random thought. One of the nicer features of Oracle is the ability to have function-based indexes. So you could index, say, trim(lower(person.name)). There are a *lot* of practical situations where that comes in handy. The best workaround that I can think of for not having that is to have a column defined to hold the result of the function, maintain that column with a trigger, then index that column. Which works, but is inelegant. (It also requires storing completely redundant data.) Is there any prospect of postgres aquiring that functionality? Ben
On Mon, Sep 03, 2007 at 05:20:34PM -0700, Ben Tilly wrote: > > That raises a very random thought. One of the nicer features of > Oracle is the ability to have function-based indexes. So you could > index, say, trim(lower(person.name)). There are a *lot* of practical > situations where that comes in handy. The best workaround that I can > think of for not having that is to have a column defined to hold the > result of the function, maintain that column with a trigger, then > index that column. Which works, but is inelegant. (It also requires > storing completely redundant data.) > > Is there any prospect of postgres aquiring that functionality? > > Ben > I believe that PostgreSQL already supports functional indexes. In fact, one suggestion to address the egregiously poor performance of the current hash index was to replace it with a functional index. Regards, Ken
"Ben Tilly" <btilly@gmail.com> writes: > That raises a very random thought. One of the nicer features of > Oracle is the ability to have function-based indexes. So you could > index, say, trim(lower(person.name)). > Is there any prospect of postgres aquiring that functionality? Uh, no, since it's already there; has been since Berkeley days ... regards, tom lane
On 9/3/07, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Ben Tilly" <btilly@gmail.com> writes: > > That raises a very random thought. One of the nicer features of > > Oracle is the ability to have function-based indexes. So you could > > index, say, trim(lower(person.name)). > > > Is there any prospect of postgres aquiring that functionality? > > Uh, no, since it's already there; has been since Berkeley days ... Nice! I know of at least one DBA who is moving from Oracle to postgres who will be *very* happy to hear that. Ben
On Sun, Sep 02, 2007 at 01:04:04PM -0500, Kenneth Marshall wrote: > Dear PostgreSQL Hackers: > > After following the hackers mailing list for quite a while, > I am going to start investigating what will need to be done > to improve hash index performance. Below are the pieces of > this project that I am currently considering: > > 1. Characterize the current hash index implementation against > the BTree index, with a focus on space utilization and > lookup performance against a collection of test data. This > will give a baseline performance test to evaluate the impact > of changes. I initially do not plan to bench the hash creation > process since my initial focus will be on lookup performance. > Here are very basic results for a table with 1.6m entries: postgres=# CREATE TABLE dict (word varchar(100)); CREATE TABLE postgres=# COPY dict FROM '/tmp/words'; COPY 1648379 postgres=# select count(*) from dict; count ---------1648379 (1 row) Time: 11187.418 ms postgres=# select count(*) from dict; count ---------1648379 (1 row) Time: 6040.912 ms postgres=# CREATE INDEX wordhash ON dict USING hash (word); CREATE INDEX Time: 11108707.160 ms postgres=# select * from dict where word = 'avatar'; word --------avatar (1 row) Time: 79.823 ms postgres=# select * from dict where word = 'zebra';word -------zebra (1 row) Time: 9.864 ms postgres=# select * from dict where word = 'turkey'; word --------turkey (1 row) Time: 18.418 ms Time: 1.045 ms Time: 1.257 ms Time: 1.080 ms postgres=# CREATE INDEX wordbtree ON dict USING btree (word); CREATE INDEX Time: 25438.884 ms postgres=# select * from dict where word = 'avatar'; word --------avatar (1 row) Time: 13.400 ms postgres=# select * from dict where word = 'zebra';word -------zebra (1 row) Time: 1.173 ms postgres=# select * from dict where word = 'turkey'; word --------turkey (1 row) Time: 1.186 ms Time: 1.103 ms Time: 1.099 ms Time: 1.108 ms ------------------------------ Size of table = 87556096 Size of hash index = 268451840 Size of btree index = 53510144 From my very small sample on an unloaded machine, a hash index lookup took the least amount of time. It had a much larger initial time which could be attributable to cache population effects. The size is 5X that of the Btree index. I will continue to improve the test suite as more granularity is needed. If anyone has a good data generator, please let me know. Otherwise I will just roll my own. Regards, Ken
Ühel kenal päeval, E, 2007-09-03 kell 19:55, kirjutas Tom Lane: > Gregory Stark <stark@enterprisedb.com> writes: > > "Kenneth Marshall" <ktm@rice.edu> writes: > >> - What about multi-column indexes? The current implementation > >> only supports 1 column. > > > That seems kind of weird. It seems obvious that you mix the three hashes > > together which reduces it to the solved problem. > > No, because part of the deal is that you can do lookups using only the > leading index columns. At least, all the existing multicolumn index > types can do that. One approahc is not to mix hashes, but to partition the hash, so that each column gets its N bits in the hash. If you do it smartly you can use any column for index lookups, not just the leading one. > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match
Hannu Krosing <hannu@skype.net> writes: > Ühel kenal päeval, E, 2007-09-03 kell 19:55, kirjutas Tom Lane: >> No, because part of the deal is that you can do lookups using only the >> leading index columns. At least, all the existing multicolumn index >> types can do that. > One approahc is not to mix hashes, but to partition the hash, so that > each column gets its N bits in the hash. How does that help? You still need all the keys to find out which bucket to look in. regards, tom lane
Ühel kenal päeval, N, 2007-09-06 kell 09:38, kirjutas Tom Lane: > Hannu Krosing <hannu@skype.net> writes: > > Ühel kenal päeval, E, 2007-09-03 kell 19:55, kirjutas Tom Lane: > >> No, because part of the deal is that you can do lookups using only the > >> leading index columns. At least, all the existing multicolumn index > >> types can do that. > > > One approahc is not to mix hashes, but to partition the hash, so that > > each column gets its N bits in the hash. > > How does that help? You still need all the keys to find out which > bucket to look in. no. you need to look at only the buckets where that part of hash matches say you allocate bits 4-7 for column 2 and then need to look up column 2 value with hash 3 . here you need to look at only buckets N*16 + 3, that is, you need to examine only each 16th bucket --------- Hannu
Hannu Krosing wrote: <blockquote cite="mid:1189087192.7470.16.camel@hannu-laptop" type="cite"><blockquote type="cite"><blockquotetype="cite"><pre wrap="">One approahc is not to mix hashes, but to partition the hash, so that each column gets its N bits in the hash. </pre></blockquote><pre wrap="">How does that help? You still need all thekeys to find out which bucket to look in. </pre></blockquote><pre wrap=""> no. you need to look at only the buckets where that part of hash matches say you allocate bits 4-7 for column 2 and then need to look up column 2 value with hash 3 . here you need to look at only buckets N*16 + 3, that is, you need to examine only each 16th bucket </pre></blockquote><br /> I don't like the truncating hash suggestion because it limits the ability of a hash code to uniquelyidentify a key.<br /><br /> If a user requires the ability to search on both (column1) and (column1, column2), theycan create two hash indexes and the planner can decide which to use.<br /> Or, they can use a btree. I think hash hasa subset of uses where it would be a significant gain, and focus should be spent on this subset.<br /><br /> Cheers,<br/> mark<br /><br /><pre class="moz-signature" cols="72">-- Mark Mielke <a class="moz-txt-link-rfc2396E" href="mailto:mark@mielke.cc"><mark@mielke.cc></a> </pre>
On Sep 6, 2007, at 10:53 , Mark Mielke wrote: > I don't like the truncating hash suggestion because it limits the > ability of a hash code to uniquely identify a key. AIUI, a hash can't be used as a unique identifier: it always needs to be rechecked due to the chance of collisions. There might be other issues with truncation, but preventing hashes from being unique isn't one of them. Michael Glaesemann grzm seespotcode net
On Thu, Sep 06, 2007 at 11:53:45AM -0400, Mark Mielke wrote: > Hannu Krosing wrote: >>>> One approahc is not to mix hashes, but to partition the hash, so that >>>> each column gets its N bits in the hash. >>> How does that help? You still need all the keys to find out which >>> bucket to look in. >>> >> >> no. you need to look at only the buckets where that part of hash matches >> >> say you allocate bits 4-7 for column 2 and then need to look up column 2 >> value with hash 3 . here you need to look at only buckets N*16 + 3, that >> is, you need to examine only each 16th bucket >> >> > > I don't like the truncating hash suggestion because it limits the ability > of a hash code to uniquely identify a key. > > If a user requires the ability to search on both (column1) and (column1, > column2), they can create two hash indexes and the planner can decide which > to use. > Or, they can use a btree. I think hash has a subset of uses where it would > be a significant gain, and focus should be spent on this subset. > > Cheers, > mark > I agree that we should focus primarily on the subset of uses for hash indexes where there would be a significant gain. I do think that being able to use a single O(1) hash lookup against all the values specified in a pseudo-multi-column index could be very beneficial in reducing access time and I/O. Since we already have to check the actual tuple values for any index lookup in postgresql, we could only store the full hash value and the corresponding TIDs in the bucket. Then when we lookup an item by calculating its hash, if the exact hash is not present in the bucket, then we know that the item is not in the index. If the value exists, then we would check the heap tuples before returning the results. Thus a negative lookup only needs to check the index and if the hash function is "good" there will be optimally only 1 possibly valid heap tuple if there is a match. One very big win for this change is to allow a much smaller index size (hash value + relevant TIDs) and the large column values are only stored in the actual data tuples. Regards, Ken > -- > Mark Mielke <mark@mielke.cc> >
Michael Glaesemann wrote: > On Sep 6, 2007, at 10:53 , Mark Mielke wrote: >> I don't like the truncating hash suggestion because it limits the >> ability of a hash code to uniquely identify a key. > AIUI, a hash can't be used as a unique identifier: it always needs to > be rechecked due to the chance of collisions. There might be other > issues with truncation, but preventing hashes from being unique isn't > one of them. Of course - that's why I used the word "limit". Hash works best, when the key is unique, however. A 32-bit hash will be many powers of 2 more unique than a 8-bit hash. Cheers, mark -- Mark Mielke <mark@mielke.cc>
On Thu, Sep 06, 2007 at 01:08:59PM -0500, Kenneth Marshall wrote: > Since we already have to check the actual tuple values for any index > lookup in postgresql, we could only store the full hash value and the > corresponding TIDs in the bucket. Then when we lookup an item by > calculating its hash, if the exact hash is not present in the bucket, > then we know that the item is not in the index. Sounds like you'd be returning a bitmap for use with a bitmap scan. That's a different take on other suggestions I've heard and would allow a hash index to have an almost unlimited key size yet flexible matching... (combined with other index, or even just the same index). Neat. Have a nice day, Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
On Sun, 2007-02-09 at 13:04 -0500, Kenneth Marshall wrote: > 2. Evaluate the performance of different hash index implementations > and/or changes to the current implementation. My current plan is > to keep the implementation as simple as possible and still provide > the desired performance. Several hash index suggestions deal with > changing the layout of the keys on a page to improve lookup > performance, including reducing the bucket size to a fraction of > a page or only storing the hash value on the page, instead of > the index value itself. You might find this patch useful: http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php It implements the "just store the hash in the index" idea; it also sorts the entries in a bucket by the hash value, which allows binary search to be used to locate candidate matches. I was surprised that this didn't result in a performance improvement for the benchmarks that I ran, but I never got around to investigating further (either running more benchmarks or checking whether there was a bug in the implementation). Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge it up to HEAD if you'd like. -Neil
Neil Conway wrote: > You might find this patch useful: > > http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php Oh, I had forgot about that. > It implements the "just store the hash in the index" idea; it also sorts > the entries in a bucket by the hash value, which allows binary search to > be used to locate candidate matches. > > I was surprised that this didn't result in a performance improvement for > the benchmarks that I ran, but I never got around to investigating > further (either running more benchmarks or checking whether there was a > bug in the implementation). You did get a smaller index, which has cache benefits with larger tables. You didn't compare the size comparison against b-tree, but a smaller index is a good thing. I think you left some money on the table on that front. Since all the HashItems are fixed size, you can get rid of the line pointers altogether. Given that a sizeof(HashItemData) is 8 bytes, and a line pointer is 4 bytes, that should give a further 1/3 reduction in index size. If you're willing to give up on the alignment of HashItemData, you can get it down to 6 bytes. Even from a CPU point of view, as the table becomes bigger and the b-tree becomes deeper, the difference between O(1) and O(log n) lookup starts to become more significant. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, Sep 06, 2007 at 11:56:25PM -0700, Neil Conway wrote: > On Sun, 2007-02-09 at 13:04 -0500, Kenneth Marshall wrote: > > 2. Evaluate the performance of different hash index implementations > > and/or changes to the current implementation. My current plan is > > to keep the implementation as simple as possible and still provide > > the desired performance. Several hash index suggestions deal with > > changing the layout of the keys on a page to improve lookup > > performance, including reducing the bucket size to a fraction of > > a page or only storing the hash value on the page, instead of > > the index value itself. > > You might find this patch useful: > > http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php > > It implements the "just store the hash in the index" idea; it also sorts > the entries in a bucket by the hash value, which allows binary search to > be used to locate candidate matches. > > I was surprised that this didn't result in a performance improvement for > the benchmarks that I ran, but I never got around to investigating > further (either running more benchmarks or checking whether there was a > bug in the implementation). > > Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge > it up to HEAD if you'd like. > > -Neil > This is a great starting point. I would appreciate it if you have the time and could make it apply cleanly to HEAD. I remember when you first posted it but had forgotten, probably because of the lack-luster results. Just a quick glance at the patch and from what I can tell, tagging the index as lossy causes a lot more work to be done than should be needed in theory. Currently the index-scan machinery will recheck the value against the original value for lossy indexes. However, given that we are using a good hash function with a low chance of collision, if we mark the unique items in the index then they do not actually have to be rechecked during the scan. Do you have any suggestions for implementing that optimization or is there any option to tell the scan machinery to only re-check a certain list of tuples? Thank you again for pointing this patch out and please let me know when you have a version against HEAD. Regards, Ken > >
On Fri, Sep 07, 2007 at 12:55:37PM +0100, Heikki Linnakangas wrote: > Neil Conway wrote: > > You might find this patch useful: > > > > http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php > > Oh, I had forgot about that. > > > It implements the "just store the hash in the index" idea; it also sorts > > the entries in a bucket by the hash value, which allows binary search to > > be used to locate candidate matches. > > > > I was surprised that this didn't result in a performance improvement for > > the benchmarks that I ran, but I never got around to investigating > > further (either running more benchmarks or checking whether there was a > > bug in the implementation). > > You did get a smaller index, which has cache benefits with larger > tables. You didn't compare the size comparison against b-tree, but a > smaller index is a good thing. > > I think you left some money on the table on that front. Since all the > HashItems are fixed size, you can get rid of the line pointers > altogether. Given that a sizeof(HashItemData) is 8 bytes, and a line > pointer is 4 bytes, that should give a further 1/3 reduction in index > size. If you're willing to give up on the alignment of HashItemData, you > can get it down to 6 bytes. > > Even from a CPU point of view, as the table becomes bigger and the > b-tree becomes deeper, the difference between O(1) and O(log n) lookup > starts to become more significant. > If you use the size values from my initial tests, the hash index is down to 3X the btree size instead of 5X. If we can make these additional changes and add a unique flags to the index entry, we would have a much smaller index. I had not thought of it at the time, but the addition of the unique/sole item flag would allow the use of the hash index to support unique indexes and be used for primary keys. In some usage cases, the O(1) would be very advantageous. Regards, Ken
Kenneth Marshall wrote: <blockquote cite="mid:20070907132928.GC19403@it.is.rice.edu" type="cite"><pre wrap="">On Thu, Sep06, 2007 at 11:56:25PM -0700, Neil Conway wrote: </pre><blockquote type="cite"><pre wrap="">You might find this patchuseful: <a class="moz-txt-link-freetext" href="http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php">http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php</a> ... Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge it up to HEAD if you'd like. </pre></blockquote><pre wrap="">This is a great starting point. I would appreciate it if youhave the time and could make it apply cleanly to HEAD. I remember when you first posted it but had forgotten, probably because of the lack-luster results. Just a quick glance at the patch and from what I can tell, tagging the index as lossy causes a lot more work to be done than should be needed in theory. Currently the index-scan machinery will recheck the value against the original value for lossy indexes. However, given that we are using a good hash function with a low chance of collision, if we mark the unique items in the index then they do not actually have to be rechecked during the scan. Do you have any suggestions for implementing that optimization or is there any option to tell the scan machinery to only re-check a certain list of tuples? Thank you again for pointing this patch out and please let me know when you have a version against HEAD. </pre></blockquote> What do you mean by "mark the unique items in the index then they do not actually have to be recheckedduring the scan." Even if there is a unique hash value mapping to a unique key, there is no guarantee that a newvalue won't result in that same hash value. Such is the nature of hashes. A hash key map does not mean a value match.The value must be checked. The opposite, however, may be true. If the hash key is not found, then we know the row forthe value does not exist.<br /><br /> Have you measured the performance of re-checking? I have always assumed the performanceof re-checking was near free when compared to the cost of looking up the tuples in the table to determine whetheror not they are "live". This is why I have not been upset that bitmap index scans often re-check.<br /><br /> Cheers,<br/> mark<br /><br /><pre class="moz-signature" cols="72">-- Mark Mielke <a class="moz-txt-link-rfc2396E" href="mailto:mark@mielke.cc"><mark@mielke.cc></a> </pre>
On Fri, Sep 07, 2007 at 09:50:07AM -0400, Mark Mielke wrote: > Kenneth Marshall wrote: >> On Thu, Sep 06, 2007 at 11:56:25PM -0700, Neil Conway wrote: >> >>> You might find this patch useful: >>> >>> http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php >>> ... >>> >>> Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge >>> it up to HEAD if you'd like. >>> >> This is a great starting point. I would appreciate it if you have the >> time and could make it apply cleanly to HEAD. I remember when you first >> posted it but had forgotten, probably because of the lack-luster results. >> Just a quick glance at the patch and from what I can tell, tagging the >> index as lossy causes a lot more work to be done than should be needed >> in theory. Currently the index-scan machinery will recheck the value >> against the original value for lossy indexes. However, given that we >> are using a good hash function with a low chance of collision, if we >> mark the unique items in the index then they do not actually have to >> be rechecked during the scan. Do you have any suggestions for implementing >> that optimization or is there any option to tell the scan machinery to >> only re-check a certain list of tuples? Thank you again for pointing >> this patch out and please let me know when you have a version against >> HEAD. >> > What do you mean by "mark the unique items in the index then they do not > actually have to be rechecked during the scan." Even if there is a unique > hash value mapping to a unique key, there is no guarantee that a new value > won't result in that same hash value. Such is the nature of hashes. A hash > key map does not mean a value match. The value must be checked. The > opposite, however, may be true. If the hash key is not found, then we know > the row for the value does not exist. > > Have you measured the performance of re-checking? I have always assumed the > performance of re-checking was near free when compared to the cost of > looking up the tuples in the table to determine whether or not they are > "live". This is why I have not been upset that bitmap index scans often > re-check. > > Cheers, > mark I understand that a hash value is a many-to-one mapping. That is the point of the flag in the index. The flag means that there is only one item in the heap corresponding to that hash value. In this case we know that the value in the heap is the correct one and a possibly very expensive string comparison can be skipped. Given that the hash function is doing its job, almost every string comparison can be skipped. How long would it take to compare 1-32K of data? How much CPU usage? With this field in place, you only need to check tuple visibility. Regards, Ken
On Thu, Sep 06, 2007 at 11:56:25PM -0700, Neil Conway wrote: > On Sun, 2007-02-09 at 13:04 -0500, Kenneth Marshall wrote: > > 2. Evaluate the performance of different hash index implementations > > and/or changes to the current implementation. My current plan is > > to keep the implementation as simple as possible and still provide > > the desired performance. Several hash index suggestions deal with > > changing the layout of the keys on a page to improve lookup > > performance, including reducing the bucket size to a fraction of > > a page or only storing the hash value on the page, instead of > > the index value itself. > > You might find this patch useful: > > http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php > > It implements the "just store the hash in the index" idea; it also sorts > the entries in a bucket by the hash value, which allows binary search to > be used to locate candidate matches. > > I was surprised that this didn't result in a performance improvement for > the benchmarks that I ran, but I never got around to investigating > further (either running more benchmarks or checking whether there was a > bug in the implementation). > > Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge > it up to HEAD if you'd like. > > -Neil > I have another question. Did the scan code at this time use the heap-order scanning? Could that have had an impact on the patch performance? Ken
Kenneth Marshall wrote: > I understand that a hash value is a many-to-one mapping. That is the > point of the flag in the index. The flag means that there is only one > item in the heap corresponding to that hash value. In this case we > know that the value in the heap is the correct one and a possibly > very expensive string comparison can be skipped. Given that the hash > function is doing its job, almost every string comparison can be skipped. > How long would it take to compare 1-32K of data? How much CPU usage? > With this field in place, you only need to check tuple visibility The value comparison cannot be skipped. I do not think you understand the many-to-one mapping in its entirety. Example: Table has: a(1), b(2), c(3) Index has: 1 => 1 (unique), 2 => 2 (unique), 3 => 3 (unique) Query: select * from table where key = 'z'; If 'z' hashes to '3' (completely possible), then the index record 3 points to tuple 3, and it "exists". Only, it doesn't because 'a' <> 'z'. You MUST check the value. Cheers, mark -- Mark Mielke <mark@mielke.cc>
Kenneth Marshall wrote: >I understand that a hash value is a many-to-one mapping. That is the >point of the flag in the index. The flag means that there is only one >item in the heap corresponding to that hash value. In this case we >know that the value in the heap is the correct one and a possibly >very expensive string comparison can be skipped. Given that the hash >function is doing its job, almost every string comparison can be skipped. >How long would it take to compare 1-32K of data? How much CPU usage? >With this field in place, you only need to check tuple visibility. > > How likely is it that you will get a hash collision, two strings that are different that will hash to the same value? To avoid this requires a very large hash key (128 bits, minimum)- otherwise you get into birthday attack problems. With a 32-bit hash, the likelyhood is greater than 50% that two strings in a collection of 100,000 will hash to the same value. With a 64-bit hash, the likelyhood is greater than 50% that two strings in a collection of 10 billion will has to same value. 10 billion is a large number, but not an unreasonable number, of strings to want to put into a hash table- and it's exactly this case where the O(1) cost of hashtables starts being a real win. Brian
On Fri, Sep 07, 2007 at 10:30:30AM -0400, Mark Mielke wrote: > Kenneth Marshall wrote: >> I understand that a hash value is a many-to-one mapping. That is the >> point of the flag in the index. The flag means that there is only one >> item in the heap corresponding to that hash value. In this case we >> know that the value in the heap is the correct one and a possibly >> very expensive string comparison can be skipped. Given that the hash >> function is doing its job, almost every string comparison can be skipped. >> How long would it take to compare 1-32K of data? How much CPU usage? >> With this field in place, you only need to check tuple visibility > The value comparison cannot be skipped. I do not think you understand the > many-to-one mapping in its entirety. > > Example: > > Table has: a(1), b(2), c(3) > Index has: 1 => 1 (unique), 2 => 2 (unique), 3 => 3 (unique) > > Query: > > select * from table where key = 'z'; > > If 'z' hashes to '3' (completely possible), then the index record 3 points > to tuple 3, and it "exists". Only, it doesn't because 'a' <> 'z'. You MUST > check the value. > > Cheers, > mark > Yes, you are completely correct. Thank you for the reminder. Regards, Ken
On Fri, Sep 07, 2007 at 10:36:41AM -0400, Brian Hurt wrote: > Kenneth Marshall wrote: > >> I understand that a hash value is a many-to-one mapping. That is the >> point of the flag in the index. The flag means that there is only one >> item in the heap corresponding to that hash value. In this case we >> know that the value in the heap is the correct one and a possibly >> very expensive string comparison can be skipped. Given that the hash >> function is doing its job, almost every string comparison can be skipped. >> How long would it take to compare 1-32K of data? How much CPU usage? >> With this field in place, you only need to check tuple visibility. >> > > How likely is it that you will get a hash collision, two strings that are > different that will hash to the same value? To avoid this requires a very > large hash key (128 bits, minimum)- otherwise you get into birthday attack > problems. With a 32-bit hash, the likelyhood is greater than 50% that two > strings in a collection of 100,000 will hash to the same value. With a > 64-bit hash, the likelyhood is greater than 50% that two strings in a > collection of 10 billion will has to same value. 10 billion is a large > number, but not an unreasonable number, of strings to want to put into a > hash table- and it's exactly this case where the O(1) cost of hashtables > starts being a real win. > > Brian > Yes, there is a non-negligible chance of collision (In a DB is there any chance that is non-negligible? :) ) and the values must be checked against the actual. The win is the collapse of the index size and only needed to check a small fraction of the actual tuples. Regards, Ken
Kenneth Marshall wrote:<br /><blockquote cite="mid20070907145507.GJ19403@it.is.rice.edu" type="cite"><blockquote type="cite"><blockquotetype="cite"><pre wrap=""> </pre></blockquote><pre wrap="">How likely is it that you will geta hash collision, two strings that are different that will hash to the same value? To avoid this requires a very large hash key (128 bits, minimum)- otherwise you get into birthday attack problems. With a 32-bit hash, the likelyhood is greater than 50% that two strings in a collection of 100,000 will hash to the same value. With a 64-bit hash, the likelyhood is greater than 50% that two strings in a collection of 10 billion will has to same value. 10 billion is a large number, but not an unreasonable number, of strings to want to put into a hash table- and it's exactly this case where the O(1) cost of hashtables starts being a real win. Brian </pre></blockquote><pre wrap="">Yes, there is a non-negligible chance of collision (In a DB is there any chance that is non-negligible? :) ) and the values must be checked against the actual. The win is the collapse of the index size and only needed to check a small fraction of the actual tuples. </pre></blockquote><br /> Ah, OK- I misunderstood you. I thought you were saying that the hash values would need to beunique, and you wouldn't check the original values at all. My bad.<br /><br /> Brian<br /><br />
On Fri, Sep 07, 2007 at 11:08:13AM -0400, Brian Hurt wrote: > Kenneth Marshall wrote: > >>>> >>> How likely is it that you will get a hash collision, two strings that are >>> different that will hash to the same value? To avoid this requires a >>> very large hash key (128 bits, minimum)- otherwise you get into birthday >>> attack problems. With a 32-bit hash, the likelyhood is greater than 50% >>> that two strings in a collection of 100,000 will hash to the same value. >>> With a 64-bit hash, the likelyhood is greater than 50% that two strings >>> in a collection of 10 billion will has to same value. 10 billion is a >>> large number, but not an unreasonable number, of strings to want to put >>> into a hash table- and it's exactly this case where the O(1) cost of >>> hashtables starts being a real win. >>> >>> Brian >>> >>> >> Yes, there is a non-negligible chance of collision (In a DB is there >> any chance that is non-negligible? :) ) and the values must be checked >> against the actual. The win is the collapse of the index size and only >> needed to check a small fraction of the actual tuples. >> >> >> > > Ah, OK- I misunderstood you. I thought you were saying that the hash > values would need to be unique, and you wouldn't check the original values > at all. My bad. > > Brian > No, you were correct. I misstated originally and you and Mark both pointed out my mistake. Regards, Ken
On Fri, Sep 07, 2007 at 10:36:41AM -0400, Brian Hurt wrote: > Kenneth Marshall wrote: > >> I understand that a hash value is a many-to-one mapping. That is the >> point of the flag in the index. The flag means that there is only one >> item in the heap corresponding to that hash value. In this case we >> know that the value in the heap is the correct one and a possibly >> very expensive string comparison can be skipped. Given that the hash >> function is doing its job, almost every string comparison can be skipped. >> How long would it take to compare 1-32K of data? How much CPU usage? >> With this field in place, you only need to check tuple visibility. >> > > How likely is it that you will get a hash collision, two strings that are > different that will hash to the same value? To avoid this requires a very > large hash key (128 bits, minimum)- otherwise you get into birthday attack > problems. With a 32-bit hash, the likelyhood is greater than 50% that two > strings in a collection of 100,000 will hash to the same value. With a > 64-bit hash, the likelyhood is greater than 50% that two strings in a > collection of 10 billion will has to same value. 10 billion is a large > number, but not an unreasonable number, of strings to want to put into a > hash table- and it's exactly this case where the O(1) cost of hashtables > starts being a real win. > > Brian > Continuing this train of thought.... While it would make sense for larger keys to store the hash in the index, if the key is smaller, particularly if it is of fixed size, it would make sense to store the key in the index instead. This would have the benefit of allowing use of the hash index in non-lossy mode albeit with a slight increase in complexity. Ken
Kenneth Marshall wrote:<br /><blockquote cite="mid:20070908202122.GA5679@it.is.rice.edu" type="cite"><pre wrap="">Continuingthis train of thought.... While it would make sense for larger keys to store the hash in the index, if the key is smaller, particularly if it is of fixed size, it would make sense to store the key in the index instead. This would have the benefit of allowing use of the hash index in non-lossy mode albeit with a slight increase in complexity. </pre></blockquote> I suspect there is no value in designinga hash implementation to work well for a context where a btree index would already perform equally well.<br /><br/> If there are too few hash buckets, performance is not O(1). For a hash index to function better than btree, I believefocus should be spent on the O(1) case, which means ensuring that enough hash buckets are used to provide O(1).<br/><br /> All of these must match: 1) Hash value, 2) Key value, 3) Tuple visibility.<br /><br /> In the optimum O(1)scenario, each existing key will map to a hash bucket that contains ~1 entry. For this case, there is no value to havingthe key stored in the index row, as 3) Tuple visibility, will still require access to the table row. In this optimumscenario, I do not believe anything of value is saved by storing the key in the index row. The loss, however, is thatthe hash index data structures become more complex, and would likely require support for variable length data. The resultingincrease in hash index size and code complexity would reduce performance.<br /><br /> Just an opinion.<br /><br/> Cheers,<br /> mark<br /><pre class="moz-signature" cols="72"> -- Mark Mielke <a class="moz-txt-link-rfc2396E" href="mailto:mark@mielke.cc"><mark@mielke.cc></a> </pre>
On Sat, Sep 08, 2007 at 05:14:09PM -0400, Mark Mielke wrote: > Kenneth Marshall wrote: >> Continuing this train of thought.... While it would make sense for larger >> keys to store the hash in the index, if the key is smaller, particularly >> if it is of fixed size, it would make sense to store the key in the index >> instead. This would have the benefit of allowing use of the hash index >> in non-lossy mode albeit with a slight increase in complexity. >> > I suspect there is no value in designing a hash implementation to work well > for a context where a btree index would already perform equally well. > > If there are too few hash buckets, performance is not O(1). For a hash > index to function better than btree, I believe focus should be spent on the > O(1) case, which means ensuring that enough hash buckets are used to > provide O(1). > > All of these must match: 1) Hash value, 2) Key value, 3) Tuple visibility. > > In the optimum O(1) scenario, each existing key will map to a hash bucket > that contains ~1 entry. For this case, there is no value to having the key > stored in the index row, as 3) Tuple visibility, will still require access > to the table row. In this optimum scenario, I do not believe anything of > value is saved by storing the key in the index row. The loss, however, is > that the hash index data structures become more complex, and would likely > require support for variable length data. The resulting increase in hash > index size and code complexity would reduce performance. > > Just an opinion. > > Cheers, > mark > I agree that we should focus on the O(1) area. The value of storing the actual value, possibly as an option, is that the key check can be done in the index without requiring a heap lookup to check the actual value which would be a benefit for a unique index. I agree that supporting variable length data would complicate the index and reduce performance. I am not willing to assume that ~1 entry per hash bucket is necessarily what we want, at least without some testing. It seems reasonable that with the performance differences between L1/L2/L3 cache, main memory, and the disk subsystem a higher load factor would result in better overall system performance by reducing cache line misses and improving hardware pre-fetch efficiency. Along with the hypothetical performance wins, the hash index space efficiency would be improved by a similar factor. Obviously, all of these ideas would need to be tested in various workload environments. In the large index arena, 10^6 to 10^9 keys and more, space efficiency will help keep the index manageable in todays system memories. Please keep the ideas and comments coming. I am certain that a synthesis of them will provide an implementation with the performance characteristics that we are seeking. Regards, Ken > -- > Mark Mielke <mark@mielke.cc> >
Kenneth Marshall wrote:<br /><blockquote cite="mid:20070908214900.GB5679@it.is.rice.edu" type="cite"><pre wrap="">The valueof storing the actual value, possibly as an option, is that the key check can be done in the index without requiring a heap lookup to check the actual value which would be a benefit for a unique index. I agree that supporting variable length data would complicate the index and reduce performance. I am not willing to assume that ~1 entry per hash bucket is necessarily what we want, at least without some testing.</pre></blockquote> I think that if the case of >1 entry per hash becomescommon enough to be significant, and the key is stored in the hash, that a btree will perform equal or better, andthere is no point in pursuing such a hash index model. This is where we are today.<br /><br /><blockquote cite="mid:20070908214900.GB5679@it.is.rice.edu"type="cite"><pre wrap="">It seems reasonable that with the performance differences between L1/L2/L3 cache, main memory, and the disk subsystem a higher load factor would result in better overall system performance by reducing cache line misses and improving hardware pre-fetch efficiency.</pre></blockquote> If the key is stored, all of these factors likely favor the btree formatover the hash format. Again, this is where we are today.<br /><br /><blockquote cite="mid:20070908214900.GB5679@it.is.rice.edu"type="cite"><pre wrap="">Along with the hypothetical performance wins, the hash index space efficiency would be improved by a similar factor. Obviously, all of these ideas would need to be tested in various workload environments. In the large index arena, 10^6 to 10^9 keys and more, space efficiency will help keep the index manageable in todays system memories. </pre></blockquote> Space efficiency is provided by not storing the key, nor the header data required(length prefix?).<br /><blockquote cite="mid:20070908214900.GB5679@it.is.rice.edu" type="cite"><pre wrap="">Pleasekeep the ideas and comments coming. I am certain that a synthesis of them will provide an implementation with the performance characteristics that we are seeking. </pre></blockquote> The subject interests me. :-)<br /><br /> Cheers,<br /> mark<br /><br /><pre class="moz-signature"cols="72">-- Mark Mielke <a class="moz-txt-link-rfc2396E" href="mailto:mark@mielke.cc"><mark@mielke.cc></a> </pre>
On Sun, Sep 02, 2007 at 10:41:22PM -0400, Tom Lane wrote: > Kenneth Marshall <ktm@rice.edu> writes: > > ... This is the rough plan. Does anyone see anything critical that > > is missing at this point? > > Sounds pretty good. Let me brain-dump one item on you: one thing that > hash currently has over btree is the ability to handle index items up > to a full page. Now, if you go with a scheme that only stores hash > codes and not the underlying data, you can not only handle that but > improve on it; but if you reduce the bucket size and don't remove the > data, it'd be a step backward. The idea I had about dealing with that > was to only reduce the size of primary buckets --- if it's necessary to > add overflow space to a bucket, the overflow units are still full pages. > So an index tuple up to a page in size can always be accommodated by > adding an overflow page to the bucket. > > Just a thought, but AFAIR it's not in the archives anywhere. > > regards, tom lane > I was thinking about this some more, and it strikes me that we can keep the page size = bucket size = overflow size in the new scheme of storing the hash value in the index and still reduce the effective bucket size. Let's make the new size the (page size / 2^n) where n is chosen appropriately. Then we we store the value in the page we simply use n bits of the hash to determine which sub-piece of the page to use to actually store the value. We may need to add a multiplier to adjust the decision to split the page based on the mini-page. This should allow us to much more densely pack the pages and approach the 1 item per bucket. This could easily shrink the size of the index by a factor of 2^n. Let me know what you think. Regards, Ken
--On Samstag, September 08, 2007 18:56:23 -0400 Mark Mielke <mark@mark.mielke.cc> wrote: > Kenneth Marshall wrote: > Along with the hypothetical performance > wins, the hash index space efficiency would be improved by a similar > factor. Obviously, all of these ideas would need to be tested in > various workload environments. In the large index arena, 10^6 to 10^9 > keys and more, space efficiency will help keep the index manageable > in todays system memories. > > > Space efficiency is provided by not storing the key, nor the header data > required (length prefix?). Space efficiency at ~1 entry per bucket: How about using closed hashing, saving in each page a bitmask in front which specifies which entries hold valid entries and in the rest of the page row-pointers (is this the correct expression? I don't know...) without further data. Should provide reasonably simple data structure and alignment for the pointers. > Please keep the ideas and comments coming. I am certain that a synthesis > of them will provide an implementation with the performance > characteristics > that we are seeking. One should look into new plan nodes for "!= ANY()", "NOT EXISTS" and similar. A node like "look into hash and true if bucket is empty" would work without checking tuple visibility when the bucket is empty and could be a win in some situations. Do we want special cases for short keys like INT4? In those cases the implementation might use hash == key and put that knowledge to use in plans. Even a unique constraint might then be doable. Does the postgresql-storage backend on linux support sparse files? Might be a win when holes in the sequence turn up.
More random thoughts: - Hash-Indices are best for unique keys, but every table needs a new hash key, which means one more random page access. Is there any way to build multi-_table_ indices? A join might then fetch all table rows with a given unique key after one page fetch for the combined index. - Hashes with trivial hash-functions (like identity) can also return rows in a desired order. - Is there a case where a sequentially scanning a hash-index is useful? I can't find any, but maybe somebody else has a use-case. - What about HashJoins when the referenced tables have hash-indices? - What about hash-indices where entries are inserted for multiple columns. Assume a table like this: CREATE TABLE link (obj_id1 INT4, obj_id2 INT4); and a query like SELECT * FROM link WHERE ? IN (obj_id1, obj_id2); or some join using a similar condition. It might be a nice thing to insert entries at both HASH(obj_id1) and HASH(obj_id2), which would eliminate the need to check in two indices and do a bitmap OR. OTOH it might not be faster in any significant use cases because who'd need a link table with nearly unique linked objects? - In cases where the distribution of the hash-function is good, but a small and relatively even number of rows exist for each key (like it might be the case in the above example), it might be nice to reserve a given amount of same-key row entries in each bucket, and hold a fill-count at the front of it. That would avoid costly page fetches after each collision. You'd create a hash-index with n-buckets, each m-elements large. When the bucket is full, the usual collision handling continues. - About hash enlargement: What about always using only the first k bits of each hash value. When you find that the hash is "quite-full" (however that is defined and detected), k is increased by one, effectively doubling the hash size. New entries are then written as usual, while retrieving the old entries needs to test at the k-bit-position first and if there is a miss, also at the k-1-position and so forth. To limit this search, some background process could after analyzing the index move old entries to the now correct k-bit-position and increment some "min-k"-value once all old entries have been moved. After the hash has been increased, the index would approximately half it's speed for some time then. Additionally one could also insert the entry at the new position if it has been found at the old one only while using the index. A special "miss"-entry at the new position doesn't help if nothing could be found because the old positions will usually hold some data which resides there even if it uses k bits.
Kenneth Marshall wrote: <blockquote cite="mid:20070910024258.GC5679@it.is.rice.edu" type="cite"><pre wrap="">On Sun, Sep02, 2007 at 10:41:22PM -0400, Tom Lane wrote: </pre><blockquote type="cite"><pre wrap="">Kenneth Marshall <a class="moz-txt-link-rfc2396E"href="mailto:ktm@rice.edu"><ktm@rice.edu></a> writes: </pre><blockquote type="cite"><prewrap="">... This is the rough plan. Does anyone see anything critical that is missing at this point? </pre></blockquote><pre wrap="">Sounds pretty good. Let me brain-dump one item on you: onething that hash currently has over btree is the ability to handle index items up to a full page. Now, if you go with a scheme that only stores hash codes and not the underlying data, you can not only handle that but improve on it; but if you reduce the bucket size and don't remove the data, it'd be a step backward. The idea I had about dealing with that was to only reduce the size of primary buckets --- if it's necessary to add overflow space to a bucket, the overflow units are still full pages. So an index tuple up to a page in size can always be accommodated by adding an overflow page to the bucket. Just a thought, but AFAIR it's not in the archives anywhere. regards, tom lane </pre></blockquote><pre wrap="">I was thinking about this some more, and it strikes me that we can keep the page size = bucket size = overflow size in the new scheme of storing the hash value in the index and still reduce the effective bucket size. Let's make the new size the (page size / 2^n) where n is chosen appropriately. Then we we store the value in the page we simply use n bits of the hash to determine which sub-piece of the page to use to actually store the value. We may need to add a multiplier to adjust the decision to split the page based on the mini-page. This should allow us to much more densely pack the pages and approach the 1 item per bucket. This could easily shrink the size of the index by a factor of 2^n. Let me know what you think. </pre></blockquote> My personal opinion is that something like this is requiredto take best advantage of hashes. I didn't respond immediately because I don't have advice on what the best implementationwould look like.<br /><br /> I have also been wondering about the effect of a hash index that includes multiplevalues to the same key (either a non-unique key, or different tuples from the same key due to table updates). I startedby thinking that the maximum number of hash entries per hash bucket should be kept to 2 or 3 to prevent reductionin performance to that of btree, but I think this doesn't work if multiple tuples can have the same key. Unless- the maps is hash value =1:1> index page =1:1> hash bucket =1:N> hash value =1:M=> tuples. Guarantee thanN is small (either <= 2 or <=4 depending on performance evaluation) by re-hashing if N ever becomes > 2 or >4. Choose a sparse harsh. Let 1:M be indefinite? Also, optimize the 1:M=1:1 case, such that the value can be inline?<br/><br /> For most cases, I would think the above model would make it cheap to determine if the key does not exist,as well as the 1:1 (hash value:key) case requiring a single page lookup. Update performance would suffer, but lookupshould be faster than btree in these cases, as btree often requires > 1 index page scan.<br /><br /> Cheers,<br/> mark<br /><br /><pre class="moz-signature" cols="72">-- Mark Mielke <a class="moz-txt-link-rfc2396E" href="mailto:mark@mielke.cc"><mark@mielke.cc></a> </pre>
On Sat, Sep 08, 2007 at 06:56:23PM -0400, Mark Mielke wrote: > I think that if the case of >1 entry per hash becomes common enough to > be significant, and the key is stored in the hash, that a btree will > perform equal or better, and there is no point in pursuing such a hash > index model. This is where we are today. There is the point that if a user does an UPDATE of a row without changing the key your index will have to store entries with the same hash. If your goal is mostly write-once tables, then that's cool, but otherwise you're going to have to find a way of dealing with that. It's not just going to happen a lot, it is going to be common, even for unique indexes. Presumably HOT will help with this, but that relies on all index columns not to change. The major benenfits will mostly come from not storing the key at all. I think. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
We are pleased to announce an upcoming patch to the hash index code which improves build time and index size, based on this item in the TODO list: During index creation, pre-sort the tuples to improve build speed http://archives.postgresql.org/pgsql-hackers/2007-03/msg01199.php Using our implementation, build times and index sizes are comparable with btree index build times and index sizes. For example, for a particular 12 million row relation, the 8.2.4 release requires over 2.8 hours to build the hash index. With our patch, the index is built in 80 seconds. Here is the link for a graph, showing a comparative analysis of btree and hash index builds and describing the benchmark data. http://web.cecs.pdx.edu/~raneyt/pg/ We are currently cleaning up the patch and will submit it asap. Regards, Shreya Bhargava <shreya_bhargav@yahoo.com> Tom Raney <twraney@comcast.net> Kenneth Marshall wrote: > Dear PostgreSQL Hackers: > > After following the hackers mailing list for quite a while, > I am going to start investigating what will need to be done > to improve hash index performance. Below are the pieces of > this project that I am currently considering: > > 1. Characterize the current hash index implementation against > the BTree index, with a focus on space utilization and > lookup performance against a collection of test data. This > will give a baseline performance test to evaluate the impact > of changes. I initially do not plan to bench the hash creation > process since my initial focus will be on lookup performance. > > 2. Evaluate the performance of different hash index implementations > and/or changes to the current implementation. My current plan is > to keep the implementation as simple as possible and still provide > the desired performance. Several hash index suggestions deal with > changing the layout of the keys on a page to improve lookup > performance, including reducing the bucket size to a fraction of > a page or only storing the hash value on the page, instead of > the index value itself. My goal in this phase is to produce one > or more versions with better performance than the current BTree. > > 3. Look at build time and concurrency issues with the addition of > some additional tests to the test bed. (1) > > 4. Repeat as needed. > > This is the rough plan. Does anyone see anything critical that > is missing at this point? Please send me any suggestions for test > data and various performance test ideas, since I will be working > on that first. > > Regards, > Ken Marshall > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings >
On Thu, Sep 20, 2007 at 05:12:45PM -0700, Tom Raney wrote: > We are pleased to announce an upcoming patch to the hash index code > which improves build time and index size, based on this item in the > TODO list: > During index creation, pre-sort the tuples to improve build speed > http://archives.postgresql.org/pgsql-hackers/2007-03/msg01199.php > > Using our implementation, build times and index sizes are > comparable with btree index build times and index sizes. > For example, for a particular 12 million row relation, the > 8.2.4 release requires over 2.8 hours to build the hash index. With our > patch, the index is built in 80 seconds. > Here is the link for a graph, showing a comparative analysis of > btree and hash index builds and describing the benchmark data. > http://web.cecs.pdx.edu/~raneyt/pg/ > > We are currently cleaning up the patch and will submit it asap. > > Regards, > Shreya Bhargava <shreya_bhargav@yahoo.com> > Tom Raney <twraney@comcast.net> > That is super! (and timely) I was just looking at that some myself since the large index build times make testing a very laborious process. I am looking forward to seeing the details. Regards, Ken
"Kenneth Marshall" <ktm@rice.edu> writes: > On Thu, Sep 20, 2007 at 05:12:45PM -0700, Tom Raney wrote: > >> Using our implementation, build times and index sizes are >> comparable with btree index build times and index sizes. >... > That is super! (and timely) It is pretty super. I have a few comments to raise but don't take it to be too negative, it sounds like this is a big step forward towards making hash indexes valuable. Firstly in the graphs it seems the index size graph has an exponential x-axis but a linear y-axis. This makes it hard to see what I would expect to be pretty much linear growth. The curves look exponential which would mean linear growth but of course it's hard to tell. Also, the growth in the time chart looks pretty much linear. That seems weird since I would expect there would be a visible extra component since sort times are n-log(n). Perhaps you need to test still larger tables to see that though. In any case it's clear from the results you have there that the change is a positive one and fixes a fundamental problem with the hash index build code. Something else you should perhaps test is indexing something which is substantially larger than hash function output. A hash function is going to make the most sense when indexing something like strings for which you want to avoid the long comparison costs. Especially consider testing this on a UTF8 locale with expensive comparisons (like a CJK locale for example). Note that the bottom line for the problems with hash indexes is that the current implementation doesn't offer any advantages over btree indexes. Hash indexes need to be not only as good of a choice as btree indexes but significantly better a significantly better choice at least for some specific circumstances. Also, if you're going to submit a patch you should check out a copy of the CVS HEAD and work from that. I don't think there are huge differences in the area of hash indexes though. But in most other areas you would be spending quite a bit of time dealing details which have changed since. Finally note that we're in the final throes of the 8.3 feature freeze. Normally any patch submitted now would be held until 8.3 is released and development on 8.4 is started. I could imagine an exception being made for hash indexes since they're so moribund currently but probably not. The flip side of that argument is that there's not much point in making an exception for something which will only be really useful once further work is done in the same area. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Tue, Sep 25, 2007 at 03:35:47PM +0100, Gregory Stark wrote: > "Kenneth Marshall" <ktm@rice.edu> writes: > > > On Thu, Sep 20, 2007 at 05:12:45PM -0700, Tom Raney wrote: > > > >> Using our implementation, build times and index sizes are > >> comparable with btree index build times and index sizes. > >... > > That is super! (and timely) > > It is pretty super. I have a few comments to raise but don't take it to be too > negative, it sounds like this is a big step forward towards making hash > indexes valuable. > > Firstly in the graphs it seems the index size graph has an exponential x-axis > but a linear y-axis. This makes it hard to see what I would expect to be > pretty much linear growth. The curves look exponential which would mean linear > growth but of course it's hard to tell. > > Also, the growth in the time chart looks pretty much linear. That seems weird > since I would expect there would be a visible extra component since sort times > are n-log(n). Perhaps you need to test still larger tables to see that though. > > In any case it's clear from the results you have there that the change is a > positive one and fixes a fundamental problem with the hash index build code. > > Something else you should perhaps test is indexing something which is > substantially larger than hash function output. A hash function is going to > make the most sense when indexing something like strings for which you want to > avoid the long comparison costs. Especially consider testing this on a UTF8 > locale with expensive comparisons (like a CJK locale for example). > > Note that the bottom line for the problems with hash indexes is that the > current implementation doesn't offer any advantages over btree indexes. Hash > indexes need to be not only as good of a choice as btree indexes but > significantly better a significantly better choice at least for some specific > circumstances. > > Also, if you're going to submit a patch you should check out a copy of the CVS > HEAD and work from that. I don't think there are huge differences in the area > of hash indexes though. But in most other areas you would be spending quite a > bit of time dealing details which have changed since. > > Finally note that we're in the final throes of the 8.3 feature freeze. > Normally any patch submitted now would be held until 8.3 is released and > development on 8.4 is started. I could imagine an exception being made for > hash indexes since they're so moribund currently but probably not. The flip > side of that argument is that there's not much point in making an exception > for something which will only be really useful once further work is done in > the same area. > Although I am very excited about this patch, I do not see any real value in including it in 8.3. As you mention, we need to to have a hash index implementation that outperforms btree in some problem regime and that is currently not the case. I have just recently started the process of gathering ideas and having discussions on various approaches to making hash indexes more performant and we have a number of ideas on which to start our investigation. I do think that this patch will make the testing and evaluation, that will be needed to truly improve the hash index, much much easier. Regards, Ken
On Sep 25, 2007, at 11:26 , Kenneth Marshall wrote: > Although I am very excited about this patch, I do not see any real > value > in including it in 8.3. I don't think you have to worry about it being in 8.3. Feature freeze was months ago. Michael Glaesemann grzm seespotcode net
Kenneth Marshall wrote: > On Tue, Sep 25, 2007 at 03:35:47PM +0100, Gregory Stark wrote: > >> "Kenneth Marshall" <ktm@rice.edu> writes: >> >> >>> On Thu, Sep 20, 2007 at 05:12:45PM -0700, Tom Raney wrote: >>> >>> >>>> Using our implementation, build times and index sizes are >>>> comparable with btree index build times and index sizes. >>>> >>> ... >>> That is super! (and timely) >>> >> It is pretty super. I have a few comments to raise but don't take it to be too >> negative, it sounds like this is a big step forward towards making hash >> indexes valuable. >> >> Firstly in the graphs it seems the index size graph has an exponential x-axis >> but a linear y-axis. This makes it hard to see what I would expect to be >> pretty much linear growth. The curves look exponential which would mean linear >> growth but of course it's hard to tell. >> >> Also, the growth in the time chart looks pretty much linear. That seems weird >> since I would expect there would be a visible extra component since sort times >> are n-log(n). Perhaps you need to test still larger tables to see that though. >> >> In any case it's clear from the results you have there that the change is a >> positive one and fixes a fundamental problem with the hash index build code. >> >> Something else you should perhaps test is indexing something which is >> substantially larger than hash function output. A hash function is going to >> make the most sense when indexing something like strings for which you want to >> avoid the long comparison costs. Especially consider testing this on a UTF8 >> locale with expensive comparisons (like a CJK locale for example). >> >> Note that the bottom line for the problems with hash indexes is that the >> current implementation doesn't offer any advantages over btree indexes. Hash >> indexes need to be not only as good of a choice as btree indexes but >> significantly better a significantly better choice at least for some specific >> circumstances. >> >> Also, if you're going to submit a patch you should check out a copy of the CVS >> HEAD and work from that. I don't think there are huge differences in the area >> of hash indexes though. But in most other areas you would be spending quite a >> bit of time dealing details which have changed since. >> >> Finally note that we're in the final throes of the 8.3 feature freeze. >> Normally any patch submitted now would be held until 8.3 is released and >> development on 8.4 is started. I could imagine an exception being made for >> hash indexes since they're so moribund currently but probably not. The flip >> side of that argument is that there's not much point in making an exception >> for something which will only be really useful once further work is done in >> the same area. >> >> > > Although I am very excited about this patch, I do not see any real value > in including it in 8.3. As you mention, we need to to have a hash index > implementation that outperforms btree in some problem regime and that is > currently not the case. I have just recently started the process of > gathering ideas and having discussions on various approaches to making > hash indexes more performant and we have a number of ideas on which to > start our investigation. I do think that this patch will make the testing > and evaluation, that will be needed to truly improve the hash index, much > much easier. > > Regards, > Ken > > We're glad to contribute and be a part of Postgres. The patch has been posted to pgsql-patches@postgresql.org. Speeding up the creation time of hash indexes on non-trivial relations was our goal. This will allow some interesting performance tests of the hash index on very large relations. It may be that the near constant lookup time of the hash index outperforms the Btree index for some large data sets and for certain types of data and distributions. Sincerely, Tom Raney
This has been saved for the 8.4 release: http://momjian.postgresql.org/cgi-bin/pgpatches_hold --------------------------------------------------------------------------- Tom Raney wrote: > We are pleased to announce an upcoming patch to the hash index code > which improves build time and index size, based on this item in the > TODO list: > During index creation, pre-sort the tuples to improve build speed > http://archives.postgresql.org/pgsql-hackers/2007-03/msg01199.php > > Using our implementation, build times and index sizes are > comparable with btree index build times and index sizes. > For example, for a particular 12 million row relation, the > 8.2.4 release requires over 2.8 hours to build the hash index. > With our patch, the index is built in 80 seconds. > Here is the link for a graph, showing a comparative analysis of > btree and hash index builds and describing the benchmark data. > http://web.cecs.pdx.edu/~raneyt/pg/ > > We are currently cleaning up the patch and will submit it asap. > > Regards, > Shreya Bhargava <shreya_bhargav@yahoo.com> > Tom Raney <twraney@comcast.net> > > > Kenneth Marshall wrote: > > Dear PostgreSQL Hackers: > > > > After following the hackers mailing list for quite a while, > > I am going to start investigating what will need to be done > > to improve hash index performance. Below are the pieces of > > this project that I am currently considering: > > > > 1. Characterize the current hash index implementation against > > the BTree index, with a focus on space utilization and > > lookup performance against a collection of test data. This > > will give a baseline performance test to evaluate the impact > > of changes. I initially do not plan to bench the hash creation > > process since my initial focus will be on lookup performance. > > > > 2. Evaluate the performance of different hash index implementations > > and/or changes to the current implementation. My current plan is > > to keep the implementation as simple as possible and still provide > > the desired performance. Several hash index suggestions deal with > > changing the layout of the keys on a page to improve lookup > > performance, including reducing the bucket size to a fraction of > > a page or only storing the hash value on the page, instead of > > the index value itself. My goal in this phase is to produce one > > or more versions with better performance than the current BTree. > > > > 3. Look at build time and concurrency issues with the addition of > > some additional tests to the test bed. (1) > > > > 4. Repeat as needed. > > > > This is the rough plan. Does anyone see anything critical that > > is missing at this point? Please send me any suggestions for test > > data and various performance test ideas, since I will be working > > on that first. > > > > Regards, > > Ken Marshall > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 5: don't forget to increase your free space map settings > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://postgres.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Wed, Sep 05, 2007 at 03:07:03PM -0500, Kenneth Marshall wrote: > On Sun, Sep 02, 2007 at 01:04:04PM -0500, Kenneth Marshall wrote: > > Dear PostgreSQL Hackers: > > > > After following the hackers mailing list for quite a while, > > I am going to start investigating what will need to be done > > to improve hash index performance. Below are the pieces of > > this project that I am currently considering: > > > > 1. Characterize the current hash index implementation against > > the BTree index, with a focus on space utilization and > > lookup performance against a collection of test data. This > > will give a baseline performance test to evaluate the impact > > of changes. I initially do not plan to bench the hash creation > > process since my initial focus will be on lookup performance. > > > > Here are very basic results for a table with 1.6m entries: > > postgres=# CREATE TABLE dict (word varchar(100)); > CREATE TABLE > > postgres=# COPY dict FROM '/tmp/words'; > COPY 1648379 > postgres=# select count(*) from dict; > count > --------- > 1648379 > (1 row) > > Time: 11187.418 ms > postgres=# select count(*) from dict; > count > --------- > 1648379 > (1 row) > > Time: 6040.912 ms > postgres=# CREATE INDEX wordhash ON dict USING hash (word); > CREATE INDEX > Time: 11108707.160 ms > > postgres=# select * from dict where word = 'avatar'; > word > -------- > avatar > (1 row) > > Time: 79.823 ms > postgres=# select * from dict where word = 'zebra'; > word > ------- > zebra > (1 row) > > Time: 9.864 ms > postgres=# select * from dict where word = 'turkey'; > word > -------- > turkey > (1 row) > > Time: 18.418 ms > Time: 1.045 ms > Time: 1.257 ms > Time: 1.080 ms > > postgres=# CREATE INDEX wordbtree ON dict USING btree (word); > CREATE INDEX > > Time: 25438.884 ms > > postgres=# select * from dict where word = 'avatar'; > word > -------- > avatar > (1 row) > > Time: 13.400 ms > postgres=# select * from dict where word = 'zebra'; > word > ------- > zebra > (1 row) > > Time: 1.173 ms > postgres=# select * from dict where word = 'turkey'; > word > -------- > turkey > (1 row) > > Time: 1.186 ms > Time: 1.103 ms > Time: 1.099 ms > Time: 1.108 ms > > ------------------------------ > Size of table = 87556096 > > Size of hash index = 268451840 > Size of btree index = 53510144 > > From my very small sample on an unloaded machine, a hash index lookup > took the least amount of time. It had a much larger initial time which > could be attributable to cache population effects. The size is 5X that > of the Btree index. I will continue to improve the test suite as more > granularity is needed. If anyone has a good data generator, please let > me know. Otherwise I will just roll my own. > > Regards, > Ken > I have just re-ran the previous benchmark against 8.3beta1 (versus 8.2) including the hash index sorted build patch and the results are below: test=# CREATE TABLE dict (word varchar(100)); CREATE TABLE Time: 24.463 ms test=# COPY dict FROM '/tmp/cracklib-words'; LOG: checkpoints are occurring too frequently (21 seconds apart) HINT: Consider increasing the configuration parameter "checkpoint_segments". COPY 1648379 Time: 44470.235 ms test=# select count(*) from dict; count ---------1648379 (1 row) Time: 4553.924 ms test=# CREATE INDEX wordhash ON dict USING hash (word); CREATE INDEX Time: 85905.960 ms test=# select * from dict where word = 'avatar'; word --------avatar (1 row) Time: 592.906 ms test=# select * from dict where word = 'zebra';word -------zebra (1 row) Time: 21.458 ms test=# select * from dict where word = 'turkey'; word --------turkey (1 row) Time: 22.532 ms Time: 1.224 ms Time: 1.200 ms Time: 1.264 ms test=# CREATE INDEX wordbtree ON dict USING btree (word); CREATE INDEX Time: 33714.874 ms test=# select * from dict where word = 'avatar'; word --------avatar (1 row) Time: 24.296 ms test=# select * from dict where word = 'zebra';word -------zebra (1 row) Time: 1.400 ms test=# select * from dict where word = 'turkey'; word --------turkey (1 row) Time: 28.302 ms Time: 1.284 ms Time: 1.313 ms --------------------------------------- Size of table = 69877760 Size of hash index = 268451840 Size of btree index = 48570368 I just wanted to have this baseline for future patches since 8.3 is just around the bend. As expected, the sorted build process is a huge improvement from the unsorted original. Regards, Ken Marshall
Tom, That is great. I am looking forward to your patch. After the issues that you needed to address, I think that it would be reasonable to add a few more user settings for the hash index. Fill-factor is too course a knob. The others that I have been considering are: maxtuples - Not really the maximum, but a target value to use for setting up the initial buckets. This would allow you toset it for data loads and avoid the "split-n-copy" trauma that you are trying to avoid with your new hash build process. multiplicity - Try to capture use cases that would require many overflow pages. In particular, if we discard the normal indexpage layout we can skip the space overhead of the page pointer and generate a more compact index. Then you could usea few more hash bits to lookup the index entry in the page. How many bits would be determined by this factor. 8-bits wouldgive you 256 sub-pieces that could each hold about 3 entries using the current 4-byte hash, or 2 entries using an 8-bytehash. What do you think? Cheers, Ken On Wed, Oct 17, 2007 at 03:31:58PM -0700, Tom Raney wrote: > Kenneth, > > Great! > > Yes, we did update the code to use the estimate. I will post the patch > with this update. We only saw a very small difference in index build time, > but you may when you add many columns to the base relation. > With 1 billion tuples, you should start to see the hash index outperform > the btree for some equality probes, I would imagine. With a 90% fill > factor, the btree would require 4 levels to index that many tuples. If the > top two were in memory, there would be 3 IOs needed. I don't think PG > supports index only scans, so it will take the extra IO to probe the base > relation. The hash may take up to 2 IOs and maybe even less (or maybe more > depending on how many overflow buckets there are). It might be interesting > to fiddle with the fill factors of each index - hash pages (buckets) > default to 75% full. > -Tom >> Tom, >> >> I am getting ready to stitch in an updated, simplified version >> of Neil Conway's hash-only hash index patch. Did you have a >> chance to update your sizing function to use the planner-like >> estimate and not a full table scan? I would like to be able >> to use that when my test table start to have 10^9 entries. >> If you have not had a chance, I will try and add it myself. >> >> Regards, >> Ken >> >> > >
Kenneth, I just pushed the revised patch (v2!). The revised approach samples the parent relation to estimate the number of tuples rather than performing a complete scan. In my tests, the estimate appears to be accurate, erring on the larger side, which is fine. > Tom, > > That is great. I am looking forward to your patch. After the > issues that you needed to address, I think that it would be > reasonable to add a few more user settings for the hash index. > Fill-factor is too course a knob. The others that I have been > considering are: > > maxtuples - Not really the maximum, but a target value to use > for setting up the initial buckets. This would allow you to > set it for data loads and avoid the "split-n-copy" trauma > that you are trying to avoid with your new hash build process. > If I understand you correctly, I believe we already do this with our current build process, there should not be any splits of the index if we estimated the tuple count correctly. However, what gets you is collisions where lots of overflow pages occur when distinct keys map to the same bucket, or if you have lots of duplicate keys. Because your overall tuple count doesn't exceed the fill factor, no splits occur, but lengthy bucket chains lead to lots of IOs. You touch on this below. > multiplicity - Try to capture use cases that would require many > overflow pages. In particular, if we discard the normal index > page layout we can skip the space overhead of the page pointer > and generate a more compact index. Then you could use a few > more hash bits to lookup the index entry in the page. How many > bits would be determined by this factor. 8-bits would give > you 256 sub-pieces that could each hold about 3 entries using > the current 4-byte hash, or 2 entries using an 8-byte hash. > > What do you think? > Yes, this is a good direction. If we can increase the number of buckets and reduce the bucket size (either physically or virtually) to allow more direct access without creating a huge index on disk, that would be ideal. But, then if you do have collisions, overflows occur more frequently. I spoke with Neil Conway yesterday at the PG conference here in Portland and he piqued my interest in examining his hash code more closely to see what he has already done in this area. -Tom > Cheers, > Ken > > On Wed, Oct 17, 2007 at 03:31:58PM -0700, Tom Raney wrote: > >> Kenneth, >> >> Great! >> >> Yes, we did update the code to use the estimate. I will post the patch >> with this update. We only saw a very small difference in index build time, >> but you may when you add many columns to the base relation. >> With 1 billion tuples, you should start to see the hash index outperform >> the btree for some equality probes, I would imagine. With a 90% fill >> factor, the btree would require 4 levels to index that many tuples. If the >> top two were in memory, there would be 3 IOs needed. I don't think PG >> supports index only scans, so it will take the extra IO to probe the base >> relation. The hash may take up to 2 IOs and maybe even less (or maybe more >> depending on how many overflow buckets there are). It might be interesting >> to fiddle with the fill factors of each index - hash pages (buckets) >> default to 75% full. >> -Tom >> >>> Tom, >>> >>> I am getting ready to stitch in an updated, simplified version >>> of Neil Conway's hash-only hash index patch. Did you have a >>> chance to update your sizing function to use the planner-like >>> estimate and not a full table scan? I would like to be able >>> to use that when my test table start to have 10^9 entries. >>> If you have not had a chance, I will try and add it myself. >>> >>> Regards, >>> Ken >>> >>> >>> >> > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster > >
Tom, Thank you for the update. I am currently working on updating the patch Neil Conway sent in against 8.0-ish that stores only the hash in the index and locates the entries within the page using a binary search. Then I will fold in your recent update. On Sun, Oct 21, 2007 at 01:13:48PM -0700, Tom Raney wrote: > Kenneth, I just pushed the revised patch (v2!). The revised approach > samples the parent relation to estimate the number of tuples rather than > performing a complete scan. In my tests, the estimate appears to be > accurate, erring on the larger side, which is fine. > Yes, larger is great and what we need to avoid all the index tuple suffling between pages. >> Tom, >> >> That is great. I am looking forward to your patch. After the >> issues that you needed to address, I think that it would be >> reasonable to add a few more user settings for the hash index. >> Fill-factor is too course a knob. The others that I have been >> considering are: >> > >> maxtuples - Not really the maximum, but a target value to use >> for setting up the initial buckets. This would allow you to >> set it for data loads and avoid the "split-n-copy" trauma >> that you are trying to avoid with your new hash build process. >> > If I understand you correctly, I believe we already do this with our > current build process, there should not be any splits of the index if we > estimated the tuple count correctly. However, what gets you is collisions > where lots of overflow pages occur when distinct keys map to the same > bucket, or if you have lots of duplicate keys. Because your overall tuple > count doesn't exceed the fill factor, no splits occur, but lengthy bucket > chains lead to lots of IOs. You touch on this below. Yes, you do address this in your patches and it works well for an existing heap. My idea was to minimize the shuffling problem when we are doing a data load and do not have a heap to get a count from because it has not been loaded yet. >> multiplicity - Try to capture use cases that would require many >> overflow pages. In particular, if we discard the normal index >> page layout we can skip the space overhead of the page pointer >> and generate a more compact index. Then you could use a few >> more hash bits to lookup the index entry in the page. How many >> bits would be determined by this factor. 8-bits would give >> you 256 sub-pieces that could each hold about 3 entries using >> the current 4-byte hash, or 2 entries using an 8-byte hash. >> >> What do you think? >> > Yes, this is a good direction. If we can increase the number of buckets > and reduce the bucket size (either physically or virtually) to allow more > direct access without creating a huge index on disk, that would be ideal. > But, then if you do have collisions, overflows occur more frequently. I > spoke with Neil Conway yesterday at the PG conference here in Portland and > he piqued my interest in examining his hash code more closely to see what > he has already done in this area. Right, overflows would occur more frequently and any overflow would allocate a full page. It may be possible to estimate the multiplicity and minimize the use of overflow pages. If we know that on average that there are no more than 10 items in a bucket, we can size the virtual buckets on the first page to support 10 items and minimize the rollover to an overflow page. Other ideas, once we hit the overflow page, back-off to the current fullpage use to maximize the fill-factor, possibly using Neil's binary search to speed lookups within the overflow pages. Also, with the smaller virtual buckets use the remaining space on the page as the first overflow page to hopefully prevent needing to allocate a full new overflow page. This would be most useful for the smaller virtual bucket sizes and improve the overall packing efficiency of the index. If we intend to take advantage of the O(1) behavior, it will be most useful for large numbers of tuples, more than 10^9. A 32-bit hash is not good enough and we will need to use a 64-bit hash to reduce collisions in the hash table and consequent need for overflow pages. I already have the hash function updated to support 64-bit and will look at getting that functional once I have the hash-only version working with the 32-bit hashes. Also, without the need to store the value in the index, we can boost the size of indexable fields to the page size, and larger. I was tentatively targeting 32k as the implementation max, because then hashing time becomes the issue. Of course, all of these changes and options will need to be benchmarked and discussed. Thank you again for posting the updated patch. Regards, Ken > > -Tom >> Cheers, >> Ken >> >> On Wed, Oct 17, 2007 at 03:31:58PM -0700, Tom Raney wrote: >> >>> Kenneth, >>> >>> Great! >>> >>> Yes, we did update the code to use the estimate. I will post the patch >>> with this update. We only saw a very small difference in index build >>> time, but you may when you add many columns to the base relation. With 1 >>> billion tuples, you should start to see the hash index outperform the >>> btree for some equality probes, I would imagine. With a 90% fill factor, >>> the btree would require 4 levels to index that many tuples. If the top >>> two were in memory, there would be 3 IOs needed. I don't think PG >>> supports index only scans, so it will take the extra IO to probe the base >>> relation. The hash may take up to 2 IOs and maybe even less (or maybe >>> more depending on how many overflow buckets there are). It might be >>> interesting to fiddle with the fill factors of each index - hash pages >>> (buckets) default to 75% full. -Tom >>> >>>> Tom, >>>> >>>> I am getting ready to stitch in an updated, simplified version >>>> of Neil Conway's hash-only hash index patch. Did you have a >>>> chance to update your sizing function to use the planner-like >>>> estimate and not a full table scan? I would like to be able >>>> to use that when my test table start to have 10^9 entries. >>>> If you have not had a chance, I will try and add it myself. >>>> >>>> Regards, >>>> Ken >>>> >>>> >>> >> >> ---------------------------(end of broadcast)--------------------------- >> TIP 2: Don't 'kill -9' the postmaster >> >> > >
On Thu, Sep 13, 2007 at 02:02:14PM -0700, Neil Conway wrote: > On Fri, 2007-09-07 at 08:29 -0500, Kenneth Marshall wrote: > > This is a great starting point. I would appreciate it if you have the > > time and could make it apply cleanly to HEAD. > > Just to give you an update on this, I'll try to find the time to get it > done soon, but my day job is keeping me really busy these days, so I'm > not sure when I'll be able to get to it... > > -Neil > Neil, I am currently working on integrating your previous patch with the 8.3beta1 hash code + the sorted build patch by Tom Raney. I have been adding back pieces that were removed and I had a question about index tuples, in particular what does the size field indicate? Is it the size in bytes of the data after the IndexTupleData or the total size in bytes including the IndexTupleData? In order to use the hitem with the sorted build process, I think that the flags should provide the size info so that the tuplesort code can be used. This will also allow the use of >32-bit hash values. Am I completely off base? Regards, Ken
"Note that the bottom line for the problems with hash indexes is that the<br />current implementation doesn't offer any advantagesover btree indexes. Hash<br />indexes need to be not only as good of a choice as btree indexes but<br />significantlybetter a significantly better choice at least for some specific<br />circumstances."<br /><br /><pre><tt><tt>Weperformed some probe tests on our patch on <br />hash index and the original btree index to compare the<br />performance between the two. We used a 80 million tuple table.<br />The table contained single integer attributeand the data<br />range was 0 - 100000000. (generated by a random generator).<br />We did the following:<br /><br/>1. Populate the table with 80 million tuples.<br />2. Create HASH index on the table.<br />3. clear both linux cache& psql buffers.<br /> (exiting psql and restarting it cleared the psql buffers;<br /> to clear linux cache,we used drop_cache command)<br />4. start psql<br />5. </tt></tt><tt><tt>select on aninteger in the range of valuesin the table</tt></tt><tt><tt>.<br /> (all test numbers were big ones, like 98934599)<br />6. record the time.<br/>7. exit psql.<br />8. drop caches.(as described above)<br />9. repeat 4-8 for different numbers.<br />10. DropHash index.<br />11. Create Btree index and repeat 3-9.<br /> <br />The results are as shown below. All trials are inms. <br />Count is the number of such tuples in the table.<br /> <br /> HASH INDEX:<br /> Number Count Trial1 Trial2 Trial3<br /> 21367898 2 152.0 129.5 131.1<br /> 95678699 2 140.6 145.6 137.5<br /> 99899799 2 148.0 147.4 152.6<br /> 97677899 2 142.0 153.5 112.0<br /> 94311123 2 137.6 146.3 141.3<br /> 67544455 2 141.6 144.6 152.7<br /> 88877677 2 122.1 123.1 130.8<br /> 88788988 2 150.2 150.0 171.7<br /> 44333344 1 119.9 116.3 117.6<br /> 34267878 1 97.5 99.9 114.8<br /> 55489781 2 115.7 117.2 145.3 <br /> 97676767 1 169.5 168.5 181.7 <br /> 99767564 1 142.7 133.6 132.7<br /> 21367989 4 198.0 193.2 186.7<br /> <br />BTREE INDEX<br /> <br /> Number Trial1 Trial2 Trial3<br /> 21367898 145.5 145.6 150.6<br /> 95678699 177.5 170.0 176.4<br /> 99899799 175.4 181.2 185.4<br /> 97677899 136.9 149.0 130.8<br /> 94311123 179.0 175.3 166.3<br /> 67544455 161.7 162.2 170.4<br /> 88877677 138.7 135.2 149.9<br /> 88788988 165.7 179.3 186.3<br /> 44333344 166.0 172.8 184.3<br /> 34267878 159.1 168.8 147.8<br /> 55489781 183.2 195.4 185.1<br /> 97676767 147.2 143.6 132.6<br /> 99767564 167.8 178.3 162.1<br /> 21367989 232.3 238.1 216.9</tt></tt></pre>From the results obtained, the average of all the hash probes is 141.8ms, the average for btree is 168.5,a difference of about 27.The standard deviations are about 23, so this is a statistically significant difference.<br/><pre><tt><tt>Our prediction that the hash index would take on the average one<br />probe for about 10ms andthe btree would take three probes for about 30 ms<br />or a difference of about 20ms was pretty well shown by the differencewe<br />got of about 27. Hope these data points will help with some questions<br />about the performance differencesbetween Hash and Btree index.<br /><br />Regards,<br />Shreya</tt></tt></pre><br /><br /><span style="font-family:monospace;"></span><tt><tt><br /><br /></tt></tt><b><i>Gregory Stark <stark@enterprisedb.com></i></b>wrote:<blockquote class="replbq" style="border-left: 2px solid rgb(16, 16, 255); margin-left:5px; padding-left: 5px;"> "Kenneth Marshall" writes:<br /><br />> On Thu, Sep 20, 2007 at 05:12:45PM -0700,Tom Raney wrote:<br />><br />>> Using our implementation, build times and index sizes are<br />>> comparablewith btree index build times and index sizes.<br />>...<br />> That is super! (and timely)<br /><br />Itis pretty super. I have a few comments to raise but don't take it to be too<br />negative, it sounds like this is a bigstep forward towards making hash<br />indexes valuable.<br /><br />Firstly in the graphs it seems the index size graphhas an exponential x-axis<br />but a linear y-axis. This makes it hard to see what I would expect to be<br />prettymuch linear growth. The curves look exponential which would mean linear<br />growth but of course it's hard to tell.<br/><br />Also, the growth in the time chart looks pretty much linear. That seems weird<br />since I would expect therewould be a visible extra component since sort times<br />are n-log(n). Perhaps you need to test still larger tablesto see that though.<br /><br />In any case it's clear from the results you have there that the change is a<br />positiveone and fixes a fundamental problem with the hash index build code.<br /><br />Something else you should perhapstest is indexing something which is<br />substantially larger than hash function output. A hash function is goingto<br />make the most sense when indexing something like strings for which you want to<br />avoid the long comparisoncosts. Especially consider testing this on a UTF8<br />locale with expensive comparisons (like a CJK locale forexample).<br /><br />Note that the bottom line for the problems with hash indexes is that the<br />current implementationdoesn't offer any advantages over btree indexes. Hash<br />indexes need to be not only as good of a choiceas btree indexes but<br />significantly better a significantly better choice at least for some specific<br />circumstances.<br/><br />Also, if you're going to submit a patch you should check out a copy of the CVS<br />HEAD and workfrom that. I don't think there are huge differences in the area<br />of hash indexes though. But in most other areasyou would be spending quite a<br />bit of time dealing details which have changed since.<br /><br />Finally note thatwe're in the final throes of the 8.3 feature freeze.<br />Normally any patch submitted now would be held until 8.3 isreleased and<br />development on 8.4 is started. I could imagine an exception being made for<br />hash indexes since they'reso moribund currently but probably not. The flip<br />side of that argument is that there's not much point in makingan exception<br />for something which will only be really useful once further work is done in<br />the same area.<br/><br />-- <br /> Gregory Stark<br /> EnterpriseDB http://www.enterprisedb.com<br /><br />---------------------------(endof broadcast)---------------------------<br />TIP 2: Don't 'kill -9' the postmaster<br /></blockquote><br/><p> __________________________________________________<br />Do You Yahoo!?<br />Tired of spam? Yahoo!Mail has the best spam protection around <br />http://mail.yahoo.com
I have not followed this thread very closely. But just wanted to give my inputs. > From the results obtained, the average of all the hash probes is 141.8ms, > the average for btree is 168.5, a difference of about 27.The standard > deviations are about 23, so this is a statistically significant difference. > Our prediction that the hash index would take on the average one > probe for about 10ms and the btree would take three probes for about 30 ms > or a difference of about 20ms was pretty well shown by the difference we > got of about 27. Hope these data points will help with some questions > about the performance differences between Hash and Btree index. > We all know that Hash indexes are good for equality queries and Binary indexes are good for both equality queries and Range queries. So for equality queries, Hash index should do only one Logical I/O (if no hash collisions) and Binary index should do atleast 3 (I don't know the level of B-tree that came out as a result of this). You can enable the Logical I/O count by applying this patch. *** postgresql-8.3beta1/src/backend/storage/buffer/bufmgr.c Tue Sep 25 18:11:48 2007 --- postgresql-8.3patch/src/backend/storage/buffer/bufmgr.c Fri Oct 19 23:18:36 2007 *************** *** 1470,1477 **** localhitrate = (float) LocalBufferHitCount *100.0 / ReadLocalBufferCount; appendStringInfo(&str, ! "!\tShared blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n", ! ReadBufferCount - BufferHitCount, BufferFlushCount, hitrate); appendStringInfo(&str, "!\tLocal blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n", ReadLocalBufferCount - LocalBufferHitCount, LocalBufferFlushCount, localhitrate); --- 1470,1477 ---- localhitrate = (float) LocalBufferHitCount *100.0 / ReadLocalBufferCount; appendStringInfo(&str, ! "!\tShared blocks: %10ld Logical Reads, %10ld Physical Reads, %10ld written, buffer hit rate = %.2f%%\n", ! ReadBufferCount, ReadBufferCount - BufferHitCount, BufferFlushCount, hitrate); appendStringInfo(&str, "!\tLocal blocks: %10ld read, %10ld written, buffer hit rate= %.2f%%\n", ReadLocalBufferCount - LocalBufferHitCount, LocalBufferFlushCount, localhitrate); If possible, it would be useful for you to present the reduction in Logical I/O count, as it very well might translate to Physical I/Os for simple index scans. Atleast check whether it is in the ratio 1:3. Hope my suggestion helps. Thanks, Gokul. CertoSQL Project, Allied Solution Group. (www.alliedgroups.com)
Shreya Bhargava wrote: > 1. Populate the table with 80 million tuples. > 2. Create HASH index on the table. > 3. clear both linux cache & psql buffers. > (exiting psql and restarting it cleared the psql buffers; > to clear linux cache, we used drop_cache command) > 4. start psql > 5. select on an integer in the range of values in the table. > (all test numbers were big ones, like 98934599) > 6. record the time. > 7. exit psql. > 8. drop caches.(as described above) > 9. repeat 4-8 for different numbers. > 10. Drop Hash index. > 11. Create Btree index and repeat 3-9. It seems you're mostly measuring the overhead of starting a backend, populating the relcache etc. Restarting psql doesn't clear the postgres shared buffer cache. Or did you mean that you restarted postgres? Anyway, I don't think it's interesting to test with cleared caches. Surely the metapage and first 1-2 levels of the b-tree would stay cached all the time in real life. > From the results obtained, the average of all the hash probes is 141.8ms, the average for btree is 168.5, a differenceof about 27.The standard deviations are about 23, so this is a statistically significant difference. I don't trust those numbers much, but in any case I don't think that edge is big enough to justify the existence of hash indexes. If you're looking for a use case where hash index is faster, I'd suggest using a data type with an expensive comparison function. Like long multi-byte strings in UTF-8 encoding. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Shreya Bhargava wrote: > "Note that the bottom line for the problems with hash indexes is that the > current implementation doesn't offer any advantages over btree indexes. Hash > indexes need to be not only as good of a choice as btree indexes but > significantly better a significantly better choice at least for some > specific circumstances." Oh it does. I recently used a hash index to speed up my database. Namely I found it improved performance when indexing a non-integer column containing english words. I don't know how much of that data was cached, according to the sound of my harddrive it wasn't all of it. Consider this anecdotical evidence, but the speedup was noticeable. > We performed some probe tests on our patch on > hash index and the original btree index to compare the > performance between the two. We used a 80 million tuple table. > The table contained single integer attribute and the data > range was 0 - 100000000. (generated by a random generator). I'd be interested how much difference is there between non-integer index behaviour. I at least had the impression that in my case the sorted strings in the btree pages didn't compare too well. > */Gregory Stark <stark@enterprisedb.com>/* wrote: > "Kenneth Marshall" writes: > > On Thu, Sep 20, 2007 at 05:12:45PM -0700, Tom Raney wrote: > >> Using our implementation, build times and index sizes are > >> comparable with btree index build times and index sizes. Way to go! Currently building hash indices is no fun. Regards, Jens-W. Schicke -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFHMErXzhchXT4RR5ARAh7pAKCZIZFJFa7Oq25GvwDhiZJFsrtwgACbBC1F otwhIZVlNgUGlroePIafi1c= =N1f7 -----END PGP SIGNATURE-----
Shreya Bhargava <shreya_bhargav@yahoo.com> writes: > We performed some probe tests on our patch on > hash index and the original btree index to compare the > performance between the two. Interesting, but ... > From the results obtained, the average of all the hash probes is 141.8ms, the average for btree is 168.5, a differenceof about 27.The standard deviations are about 23, so this is a statistically significant difference. > Our prediction that the hash index would take on the average one > probe for about 10ms and the btree would take three probes for about 30 ms > or a difference of about 20ms was pretty well shown by the difference we > got of about 27. ... unfortunately, a zero-cache situation isn't very reflective of the real world. In reality, for an index that is getting used often enough to impact application performance at all, the root page will stay in cache and likely so will some of the upper tree levels. So I don't think this experiment proves anything at all about real-world performance. What I'd like to see is an aggregate time measurement for millions of random probes into an index that's noticeably larger than memory. It would also be interesting to measure the speed of the fully-cached case (same test, but index smaller than memory). regards, tom lane
On Thu, Sep 13, 2007 at 02:02:14PM -0700, Neil Conway wrote: > On Fri, 2007-09-07 at 08:29 -0500, Kenneth Marshall wrote: > > This is a great starting point. I would appreciate it if you have the > > time and could make it apply cleanly to HEAD. > > Just to give you an update on this, I'll try to find the time to get it > done soon, but my day job is keeping me really busy these days, so I'm > not sure when I'll be able to get to it... > > -Neil > Neil, I have been working on putting an updated version of your patch into the current source. My first try was to try and put your patch in directly, but it differed so much from the current build that it was not obvious how to address things like the current hash_index sorted build patch, which I need to be able to test with indexes of any size at all. My current try is to replace the _hash_formitem() calls with a function called _hash_form_tuple() that actually returns an IndexTuple and not an HashItem. This will allow it to be used quite naturally with the current sorted build patch. Here is what it looks like now: /** _hash_form_tuple -- construct index tuple using hash(value) not value*/ IndexTuple _hash_form_tuple(IndexTuple itup, Relation rel) { IndexTuple result; Size size; uint32 hashkey; Datum datum; bool isnull; /* disallow nulls in hash keys */ if (IndexTupleHasNulls(itup)) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("hash indexes cannot containnull keys") )); if (rel->rd_rel->relnatts != 1) elog(ERROR, "hash indexes support only one index key"); /* hash the tuple; we only store the hash value in the index */ datum = index_getattr(itup, 1, RelationGetDescr(rel),&isnull); Assert(!isnull); hashkey = _hash_datum2hashkey(rel, datum); size = IndexTupleSize(itup); result = (IndexTuple) palloc(size); memcpy(result, itup, size); returnresult; } I am not currently doing anything other than returning the current IndexTuple that was created with index_form_tuple(). Am I daft, or can I just memcpy() the 6 bytes of TID, add the 2 bytes of t_info (everything 0 and the size set to 6 + 2 + sizeof(hash) = 10), and the 4 bytes of hash. This will allow me to handle 8-byte hashes in the future. If you see a problem with this approach, please let me know. I would appreciate any feedback you can give. Regards, Ken > >