Thread: Re: [PATCHES] updated hash functions for postgresql v1
Dear PostgreSQL developers,

I am re-sending this to keep this last change to the internal hash function on the radar.

Ken

............

Sorry about the delay for this update to the new hash index implementation. I was trying to get the WAL logging in place and forgot to post the actual patch. The WAL for hash indexes will need to wait for 8.5, but I did want to add back in the piece of the Bob Jenkins 2006 hash function that was stripped out of the initial patch on application, due to concerns about the randomness of the resulting hash values. Here is a re-post of my initial findings comparing the old/new Jenkins hash from lookup2 and lookup3. I have added a third column containing the results for the hash_any() resulting from the attached patch, as well as a simple timing test for a DB reindex both before and after patching. Also attached is a simple documentation patch updating the note attached to the hash index description.

Regards,
Ken

----------------------------------------------------

Hi,

I have finally had a chance to do some investigation on the performance of the old hash mix() function versus the updated mix()/final() in the new hash function. Here is a table of my current results for both the old and the new hash functions. In this case cracklib refers to the cracklib-dict containing 1648379 unique words massaged in various ways to generate input strings for the hash functions. The result is the number of collisions in the hash values generated:

hash input                              old  new  newv2
----------                              ---  ---  -----
cracklib                                338  316  338
cracklib x 2 (i.e. clibclib)            305  319  300
cracklib x 3 (clibclibclib)             323  329  315
cracklib x 10                           302  310  329
cracklib x 100                          350  335  298
cracklib x 1000                         314  309  315
cracklib x 100 truncated to char(100)   311  327  320
uint32 from 1-1648379                   309  319  347
(uint32 1-1948379)*256                  309  314  304
(uint32 1-1948379)*16                   310  314  324
"a"uint32 (i.e. a00001,a0002...)        320  321  312
uint32uint32 (i.e. uint64)              321  287  309

The different result columns are: old = Jenkins 1996 hash function (lookup2.c), new = Jenkins 2006 hash function (lookup3.c), and newv2 = adaptation of the current hash_any() to incorporate the separate mix()/final() functions. As you can see from the results, splitting mix() and final() apart does not result in any perceptible loss of randomness in the hash assignment.

I also ran a crude timing for a reindex of the following database:

CREATE TABLE dict (word text);
CREATE INDEX wordhash ON dict USING hash (word);
INSERT INTO dict (word) VALUES ('s;lhkjdpyoijxfg;lktjgh;sdlhkjo');
INSERT INTO dict (SELECT MAX(word)||MAX(word) FROM dict);
... (21 times)
REINDEX TABLE ...

The average time to reindex the table using our current hash_any() without the separate mix()/final() was 1696ms, versus 1482ms with the separate mix()/final() stages, giving almost 13% better performance for this stupid metric.
Attachment
Kenneth Marshall wrote:
> Dear PostgreSQL developers,
>
> I am re-sending this to keep this last change to the
> internal hash function on the radar.

Please add it to the commitfest wiki page,
http://wiki.postgresql.org/wiki/CommitFest_2008-11

Thanks

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On Mon, 2008-12-22 at 13:47 -0600, Kenneth Marshall wrote:
> Dear PostgreSQL developers,
>
> I am re-sending this to keep this last change to the
> internal hash function on the radar.

Hi Ken,

A few comments:

1. New patch with very minor changes attached.

2. I reverted the change you made to indices.sgml. We still don't use WAL for hash indexes, and in my opinion we should continue to discourage their use until we do use WAL. We can add back in the comment that hash indexes are suitable for large keys if we have some results to show that.

3. There was a regression test failure in union.sql because the ordering of the results was different. I updated the regression test.

4. Hash functions affect a lot more than hash indexes, so I ran through a variety of tests that use a HashAggregate plan. Test setup and results are attached. These results show no difference between the old and the new code (about 0.1% better).

5. The hash index build time shows some improvement. The new code won in every instance in which there were a lot of duplicates in the table (100 distinct values, 50K of each) by around 5%. The new code appeared to be the same or slightly worse in the case of hash index builds with few duplicates (1000000 distinct values, 5 of each). The difference was about 1% worse, which is probably just noise.

Note: I'm no expert on hash functions. Take all of my tests with a grain of salt.

I would feel a little better if I saw at least one test that showed better performance of the new code on a reasonable-looking distribution of data. The hash index build that you showed only took a second or two -- it would be nice to see a test that lasted at least a minute.

Regards,
Jeff Davis
Attachment
On Fri, Jan 09, 2009 at 12:04:15PM -0800, Jeff Davis wrote: > On Mon, 2008-12-22 at 13:47 -0600, Kenneth Marshall wrote: > > Dear PostgreSQL developers, > > > > I am re-sending this to keep this last change to the > > internal hash function on the radar. > > > > Hi Ken, > > A few comments: > > 1. New patch with very minor changes attached. > > 2. I reverted the change you made to indices.sgml. We still don't use > WAL for hash indexes, and in my opinion we should continue to discourage > their use until we do use WAL. We can add back in the comment that hash > indexes are suitable for large keys if we have some results to show > that. > > 3. There was a regression test failure in union.sql because the ordering > of the results was different. I updated the regression test. > > 4. Hash functions affect a lot more than hash indexes, so I ran through > a variety of tests that use a HashAggregate plan. Test setup and results > are attached. These results show no difference between the old and the > new code (about 0.1% better). > > 5. The hash index build time shows some improvement. The new code won in > every instance in which a there were a lot of duplicates in the table > (100 distinct values, 50K of each) by around 5%. > > The new code appeared to be the same or slightly worse in the case of > hash index builds with few duplicates (1000000 distinct values, 5 of > each). The difference was about 1% worse, which is probably just noise. > > Note: I'm no expert on hash functions. Take all of my tests with a grain > of salt. > > I would feel a little better if I saw at least one test that showed > better performance of the new code on a reasonable-looking distribution > of data. The hash index build that you showed only took a second or two > -- it would be nice to see a test that lasted at least a minute. > > Regards, > Jeff Davis > > Jeff, Thanks for the review. I would not really expect any differences in hash index build times other than normal noise variances. 
The most definitive benchmark that I have seen was done with my original patch submission in March, posted by Luke of Greenplum:

"We just applied this and saw a 5 percent speedup on a hash aggregation query with four columns in a 'group by' clause run against a single TPC-H table."

I wonder if they would be willing to re-run their test? Thanks again.

Regards,
Ken
On Fri, 2009-01-09 at 14:29 -0600, Kenneth Marshall wrote:
> Jeff,
>
> Thanks for the review. I would not really expect any differences in hash
> index build times other than normal noise variances. The most definitive
> benchmark that I have seen was done with my original patch submission
> in March posted by Luke of Greenplum:
>
> "We just applied this and saw a 5 percent speedup on a hash aggregation
> query with four columns in a 'group by' clause run against a single
> TPC-H table."
>
> I wonder if they would be willing to re-run their test? Thanks again.

Separating mix() and final() should have some performance benefit, right?

Regards,
Jeff Davis
On Fri, Jan 09, 2009 at 02:00:39PM -0800, Jeff Davis wrote:
> On Fri, 2009-01-09 at 14:29 -0600, Kenneth Marshall wrote:
> > Jeff,
> >
> > Thanks for the review. I would not really expect any differences in hash
> > index build times other than normal noise variances. The most definitive
> > benchmark that I have seen was done with my original patch submission
> > in March posted by Luke of Greenplum:
> >
> > "We just applied this and saw a 5 percent speedup on a hash aggregation
> > query with four columns in a 'group by' clause run against a single
> > TPC-H table."
> >
> > I wonder if they would be willing to re-run their test? Thanks again.
>
> Separating mix() and final() should have some performance benefit,
> right?
>
> Regards,
> Jeff Davis

Yes, it does, but the results can be swamped by other latencies in the code path. Tests such as Tom's benchmark of the underlying functions are needed to isolate the timings effectively, or a benchmark like Greenplum's that will benefit from a more efficient function.

Ken
On Sat, 2009-01-10 at 11:06 -0600, Kenneth Marshall wrote:
> > Separating mix() and final() should have some performance benefit,
> > right?
>
> Yes, it does but the results can be swamped by other latencies in the
> code path. Tests such as Tom's benchmark of the underlying functions is
> needed to isolate the timings effectively or a benchmark like Greenplum's
> that will benefit from a more efficient function.
> Ken

Ok. I isolated the function itself by just doing:

-- 10 million rows of random()::text
EXPLAIN ANALYZE SELECT hashtext(t) FROM randomtext;

I ran 5 times on both old and new code, eliminating the top and bottom and taking the average of the remaining 3, and I got a 6.9% performance improvement with the new code. I tried quickly with a few other data types and got similar results. It's obviously a small microbenchmark, but that's good enough for me. Thanks!

Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > I ran 5 times on both old and new code, eliminating the top and bottom > and taking the average of the remaining 3, and I got a 6.9% performance > improvement with the new code. The question that has been carefully evaded throughout the discussion of this patch is whether the randomness of the hash result is decreased, and if so what is that likely to cost us in performance of the various hash-dependent algorithms. I would still like to have an answer to that before we make a change to gain marginal performance improvement in the hash function itself (which is already shown to be barely measurable in the total context of a hash-dependent operation...) regards, tom lane
On Sat, 2009-01-10 at 13:36 -0500, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > I ran 5 times on both old and new code, eliminating the top and bottom
> > and taking the average of the remaining 3, and I got a 6.9% performance
> > improvement with the new code.
>
> The question that has been carefully evaded throughout the discussion
> of this patch is whether the randomness of the hash result is decreased,
> and if so what is that likely to cost us in performance of the various
> hash-dependent algorithms. I would still like to have an answer to that
> before we make a change to gain marginal performance improvement in
> the hash function itself (which is already shown to be barely measurable
> in the total context of a hash-dependent operation...)

In:
http://archives.postgresql.org/message-id/20081104202655.GP18362@it.is.rice.edu

Ken showed that the number of hash collisions is the same in the old and the new code for his test data. Not a direct measurement of randomness, but it's some kind of indication.

I'm not an authority on either hash functions or statistics, so I'll have to defer to Ken or someone else to actually prove that the randomness is maintained.

Regards,
Jeff Davis
Tom Lane <tgl@sss.pgh.pa.us> writes: > Jeff Davis <pgsql@j-davis.com> writes: >> I ran 5 times on both old and new code, eliminating the top and bottom >> and taking the average of the remaining 3, and I got a 6.9% performance >> improvement with the new code. > > The question that has been carefully evaded throughout the discussion > of this patch is whether the randomness of the hash result is decreased, In fairness that doesn't seem to be the case. The original patch was posted with such an analysis using cracklib and raw binary data: http://article.gmane.org/gmane.comp.db.postgresql.devel.general/105675 > marginal performance improvement in the hash function itself (which is > already shown to be barely measurable in the total context of a > hash-dependent operation...) If it's a 6% gain in the speed of Hash Join or HashAggregate it would be very interesting. But I gather it's a 6% speedup in the time spent actually in the hash function. Is that really where much of our time is going? If it's 10% of the total time to execute one of these nodes then we're talking about a 0.6% optimization... -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!
On Sat, 2009-01-10 at 13:57 -0500, Gregory Stark wrote:
> But I gather it's a 6% speedup in the time spent actually in the
> hash function.

That's correct. I was not able to detect any difference at all between the old and new code using HashAggregate, which is where most of my testing effort went:

http://archives.postgresql.org/message-id/1231531455.25019.75.camel@jdavis

(see test results attached to that message, if you're interested)

I tested with two data distributions and 6 data types. The 6-10% speedup I saw was a super-micro benchmark testing the hash function, and I didn't spend much time with that.

Regards,
Jeff Davis
On Sat, Jan 10, 2009 at 10:56:15AM -0800, Jeff Davis wrote:
> On Sat, 2009-01-10 at 13:36 -0500, Tom Lane wrote:
> > Jeff Davis <pgsql@j-davis.com> writes:
> > > I ran 5 times on both old and new code, eliminating the top and bottom
> > > and taking the average of the remaining 3, and I got a 6.9% performance
> > > improvement with the new code.
> >
> > The question that has been carefully evaded throughout the discussion
> > of this patch is whether the randomness of the hash result is decreased,
> > and if so what is that likely to cost us in performance of the various
> > hash-dependent algorithms. I would still like to have an answer to that
> > before we make a change to gain marginal performance improvement in
> > the hash function itself (which is already shown to be barely measurable
> > in the total context of a hash-dependent operation...)
>
> In:
> http://archives.postgresql.org/message-id/20081104202655.GP18362@it.is.rice.edu
>
> Ken showed that the number of hash collisions is the same in the old and
> the new code for his test data. Not a direct measurement of randomness,
> but it's some kind of indication.
>
> I'm not an authority on either hash functions or statistics, so I'll
> have to defer to Ken or someone else to actually prove that the
> randomness is maintained.
>
> Regards,
> Jeff Davis

First, while I am not an expert by any means, basic statistics will give you the probability of a collision when packing N items into M buckets chosen at random. In all of my tests, I used 1.6M items into 2^32 buckets. If the bit mixing is complete and random, the number of collisions should be close to the same no matter what arrangement of bits is used for the inputs. I think my test cases indicate quite clearly that the new hash function is as independent of bit order as the older functions, in that the number of bucket collisions is basically the same for all layouts. I have the test harness and can test any other input that you would like me to.
Second, the author of the basic hash function (old and new) has performed more extensive testing and has shown that the new functions are better at randomizing bits than his original function. I will try to run a micro-benchmark tomorrow comparing my original March patch with the incremental approach of my November patch.

Cheers,
Ken
On Sat, Jan 10, 2009 at 01:57:27PM -0500, Gregory Stark wrote: > Tom Lane <tgl@sss.pgh.pa.us> writes: > > > Jeff Davis <pgsql@j-davis.com> writes: > >> I ran 5 times on both old and new code, eliminating the top and bottom > >> and taking the average of the remaining 3, and I got a 6.9% performance > >> improvement with the new code. > > > > The question that has been carefully evaded throughout the discussion > > of this patch is whether the randomness of the hash result is decreased, > > In fairness that doesn't seem to be the case. The original patch was posted > with such an analysis using cracklib and raw binary data: > > http://article.gmane.org/gmane.comp.db.postgresql.devel.general/105675 > > > marginal performance improvement in the hash function itself (which is > > already shown to be barely measurable in the total context of a > > hash-dependent operation...) > > If it's a 6% gain in the speed of Hash Join or HashAggregate it would be very > interesting. But I gather it's a 6% speedup in the time spent actually in the > hash function. Is that really where much of our time is going? If it's 10% of > the total time to execute one of these nodes then we're talking about a 0.6% > optimization... > The Greenplum test did show a 5% increase in performance with the replacement functions in March. Regards, Ken
On Sat, Jan 10, 2009 at 01:36:25PM -0500, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > I ran 5 times on both old and new code, eliminating the top and bottom
> > and taking the average of the remaining 3, and I got a 6.9% performance
> > improvement with the new code.
>
> The question that has been carefully evaded throughout the discussion
> of this patch is whether the randomness of the hash result is decreased,
> and if so what is that likely to cost us in performance of the various
> hash-dependent algorithms. I would still like to have an answer to that
> before we make a change to gain marginal performance improvement in
> the hash function itself (which is already shown to be barely measurable
> in the total context of a hash-dependent operation...)
>
> regards, tom lane

Dear Hackers and Reviewers,

In response to Tom's questions about the randomness of hash_any() when the mixing function is split into two (the first used while sequentially processing the input, the second for the final mix), I have generated a more detailed analysis of the two hash_any() functions.

First, one change to the 11/2008 patch: the keylen is added to a, b and c initially, so we do not need to add it later on. This is the micro-diff:

-----------------------------

--- hashfunc.c_TWOMIX	2009-01-22 14:07:34.000000000 -0600
+++ hashfunc.c_TWOMIX2	2009-01-22 14:17:32.000000000 -0600
@@ -336,7 +336,6 @@

 	/* handle the last 11 bytes */
 	k = (const unsigned char *) ka;
-	c += keylen;
 #ifdef WORDS_BIGENDIAN
 	switch (len)
 	{
@@ -439,7 +438,6 @@
 	}

 	/* handle the last 11 bytes */
-	c += keylen;
 #ifdef WORDS_BIGENDIAN
 	switch (len)	/* all the case statements fall through */

-----------------------------

The remainder of this document will use the runs from my initial results broken out using various power-of-two bucket sizes to simulate our actual use in PostgreSQL as the number of items hashed increases and we use more and more bits of our hash to identify the appropriate bucket.
I have run each test twice, once with our current hash_any() with the single mix() function, and a second time using my patch from the November commitfest plus the patch above to produce a new hash_any() with two separate mixing functions, mix() and final(). For each run I have generated a sequence of unique inputs, up to the number of buckets, hashed them with the hash functions (both old and new), then calculated the expected number of collisions p(n) using the Poisson formula for each number of buckets, where the number of buckets is 2**16, 2**18, 2**20, 2**22, 2**24, and 2**26. For my initial run, I used a string consisting of the letter 'a' followed by the integer representation of the numbers from 0 to (number of buckets - 1):

1) "a"uint32 (i.e. a00001,a0002...)

Number of buckets: 65536
Total number of items: 65536
Expected number with n items: 24109 24109 12054 4018 1004 200 33 4
Actual number mix():          24044 24172 12078 4036 980 186 30 10
Actual number mix()/final():  24027 24232 12060 3972 1001 207 31 5 1

Number of buckets: 262144
Total number of items: 262144
Expected number with n items: 96437 96437 48218 16072 4018 803 133 19 2
Actual number mix():          96224 96730 48240 15951 4094 744 143 17 1
Actual number mix()/final():  96335 96646 48071 16128 4018 801 122 21 2

Number of buckets: 1048576
Total number of items: 1048576
Expected number with n items: 385749 385749 192874 64291 16072 3214 535 76 9
Actual number mix():          385716 385596 193243 64115 16053 3285 478 77 12 1
Actual number mix()/final():  385955 385016 193789 63768 16259 3190 511 79 8 1

Number of buckets: 4194304
Total number of items: 4194304
Expected number with n items: 1542998 1542998 771499 257166 64291 12858 2143 306 38
Actual number mix():          1542536 1543489 771351 257777 63830 12847 2123 326 19 5 1
Actual number mix()/final():  1541828 1544429 772072 256178 64579 12774 2129 288 22 5

Number of buckets: 16777216
Total number of items: 16777216
Expected number with n items: 6171992 6171992 3085996 1028665 257166 51433 8572 1224 153
Actual number mix():          6170866 6174079 3085912 1027140 257808 51385 8638 1219 146 23
Actual number mix()/final():  6172058 6171991 3086279 1027916 257535 51465 8554 1243 149 23 3

Number of buckets: 67108864
Total number of items: 67108864
Expected number with n items: 24687971 24687971 12343985 4114661 1028665 205733 34288 4898 612
Actual number mix():          24686110 24690897 12344232 4113515 1028232 205682 34546 4942 629 72 7
Actual number mix()/final():  24708515 24669248 12333034 4114796 1034256 208424 34888 5023 598 77 5

Here is a second run with number of items = (number of buckets)/2:

Number of buckets: 65536
Total number of items: 32768
Expected number with n items: 39749 19874 4968 828 103 10
Actual number mix():          39704 19978 4898 842 103 10 1
Actual number mix()/final():  39715 19922 4991 779 119 9 1

Number of buckets: 262144
Total number of items: 131072
Expected number with n items: 158998 79499 19874 3312 414 41 3
Actual number mix():          158753 79972 19703 3216 458 38 4
Actual number mix()/final():  158869 79688 19853 3303 393 33 3 2

Number of buckets: 1048576
Total number of items: 524288
Expected number with n items: 635993 317996 79499 13249 1656 165 13
Actual number mix():          636118 317839 79388 13435 1628 152 16
Actual number mix()/final():  636102 317824 79527 13267 1683 161 12

Number of buckets: 4194304
Total number of items: 2097152
Expected number with n items: 2543973 1271986 317996 52999 6624 662 55 3
Actual number mix():          2544529 1270639 318904 53005 6516 645 61 5
Actual number mix()/final():  2544014 1271483 318720 52849 6559 630 47 2

Number of buckets: 16777216
Total number of items: 8388608
Expected number with n items: 10175895 5087947 1271986 211997 26499 2649 220 15
Actual number mix():          10174997 5089736 1271355 211528 26679 2683 220 17 1
Actual number mix()/final():  10176567 5087207 1271789 212014 26684 2703 235 16 1

Number of buckets: 67108864
Total number of items: 33554432
Expected number with n items: 40703583 20351791 5087947 847991 105998 10599 883 63 3
Actual number mix():          40703033 20353345 5086252 848860 105885 10536 891 59 3
Actual number mix()/final():  40770874 20250828 5096890 865448 111835 11890 1013 76 10

2) uint32uint32 (i.e. uint64)

Number of buckets: 65536
Total number of items: 32768
Expected number with n items: 39749 19874 4968 828 103 10
Actual number mix():          39770 19863 4947 822 125 9
Actual number mix()/final():  39754 19852 4990 837 91 11 1

Number of buckets: 262144
Total number of items: 131072
Expected number with n items: 158998 79499 19874 3312 414 41 3
Actual number mix():          159046 79464 19823 3334 430 43 3 1
Actual number mix()/final():  158879 79732 19774 3301 406 47 5

Number of buckets: 1048576
Total number of items: 524288
Expected number with n items: 635993 317996 79499 13249 1656 165 13
Actual number mix():          636275 317580 79514 13357 1661 172 14 3
Actual number mix()/final():  635578 318574 79504 13162 1585 159 13 1

Number of buckets: 4194304
Total number of items: 2097152
Expected number with n items: 2543973 1271986 317996 52999 6624 662 55 3
Actual number mix():          2543769 1272196 318067 53050 6497 672 48 4 1
Actual number mix()/final():  2543014 1273485 317812 52703 6589 635 59 7

Number of buckets: 16777216
Total number of items: 8388608
Expected number with n items: 10175895 5087947 1271986 211997 26499 2649 220 15
Actual number mix():          10174587 5090467 1270909 211836 26513 2679 207 18
Actual number mix()/final():  10175142 5089481 1270919 212555 26234 2643 224 15 3

Number of buckets: 67108864
Total number of items: 33554432
Expected number with n items: 40703583 20351791 5087947 847991 105998 10599 883 63 3
Actual number mix():          40706281 20347746 5088639 848133 106388 10669 946 60 2
Actual number mix()/final():  40701438 20354677 5088549 846570 106157 10578 838 55 2

3) uint32 from 1-1648379

Number of buckets: 65536
Total number of items: 32768
Expected number with n items: 39749 19874 4968 828 103 10
Actual number mix():          39758 19844 4993 835 97 9
Actual number mix()/final():  39842 19712 5016 849 109 7 1

Number of buckets: 262144
Total number of items: 131072
Expected number with n items: 158998 79499 19874 3312 414 41 3
Actual number mix():          158973 79523 19900 3283 426 38 1
Actual number mix()/final():  159136 79293 19896 3346 421 48 3 1

Number of buckets: 1048576
Total number of items: 524288
Expected number with n items: 635993 317996 79499 13249 1656 165 13
Actual number mix():          636083 317917 79484 13183 1711 180 16 2
Actual number mix()/final():  636183 317772 79438 13305 1685 173 20

Number of buckets: 4194304
Total number of items: 2097152
Expected number with n items: 2543973 1271986 317996 52999 6624 662 55 3
Actual number mix():          2544325 1271367 318260 52950 6660 683 54 4 1
Actual number mix()/final():  2544022 1272059 317778 53060 6646 665 70 4

Number of buckets: 16777216
Total number of items: 8388608
Expected number with n items: 10175895 5087947 1271986 211997 26499 2649 220 15
Actual number mix():          10176360 5087249 1272083 212010 26655 2629 212 18
Actual number mix()/final():  10175315 5088617 1272068 212144 26224 2589 234 23 1 1

Number of buckets: 67108864
Total number of items: 33554432
Expected number with n items: 40703583 20351791 5087947 847991 105998 10599 883 63 3
Actual number mix():          40704034 20350542 5088760 848309 105721 10520 894 77 7
Actual number mix()/final():  40774785 20243786 5098862 866801 111865 11641 1019 100 5

4) cracklib

Number of buckets: 65536
Total number of items: 32767
Expected number with n items: 39750 19874 4968 828 103 10
Actual number mix():          39731 19858 5049 794 92 11 1
Actual number mix()/final():  39785 19801 5011 823 106 9 1

Number of buckets: 262144
Total number of items: 131071
Expected number with n items: 158998 79498 19874 3312 414 41 3
Actual number mix():          159077 79420 19806 3380 409 49 3
Actual number mix()/final():  159020 79475 19845 3366 383 54 1

Number of buckets: 1048576
Total number of items: 524287
Expected number with n items: 635994 317996 79498 13249 1656 165 13
Actual number mix():          635979 318066 79420 13281 1641 163 23 3
Actual number mix()/final():  635694 318637 79122 13271 1686 150 13 3

Number of buckets: 4194304
Total number of items: 1648379
Expected number with n items: 2831263 1112698 218647 28643 2814 221 14
Actual number mix():          2830769 1113458 218633 28396 2786 250 11 1
Actual number mix()/final():  2831304 1113004 218002 28843 2926 212 13

Number of buckets: 16777216
Total number of items: 1648379
Expected number with n items: 15207226 1494125 73399 2403 59 1
Actual number mix():          15206923 1494718 73129 2384 59 3
Actual number mix()/final():  15207183 1494267 73250 2452 64

Number of buckets: 67108864
Total number of items: 1648379
Expected number with n items: 65480564 1608383 19753 161 1.47989202515749e-08
Actual number mix():          65480389 1608727 19593 154 1
Actual number mix()/final():  65480455 1608609 19631 168 1

5) cracklib x 100

Number of buckets: 65536
Total number of items: 32767
Expected number with n items: 39750 19874 4968 828 103 10
Actual number mix():          39829 19742 5003 847 99 14 2
Actual number mix()/final():  39783 19794 5023 828 97 11

Number of buckets: 262144
Total number of items: 131071
Expected number with n items: 158998 79498 19874 3312 414 41 3
Actual number mix():          159136 79242 20007 3289 405 62 3
Actual number mix()/final():  158961 79546 19905 3270 408 51 3

Number of buckets: 1048576
Total number of items: 524287
Expected number with n items: 635994 317996 79498 13249 1656 165 13
Actual number mix():          635815 318331 79372 13218 1658 168 12 2
Actual number mix()/final():  635986 318152 79309 13231 1687 190 21

Number of buckets: 4194304
Total number of items: 1648379
Expected number with n items: 2831263 1112698 218647 28643 2814 221 14
Actual number mix():          2830891 1113468 218202 28712 2804 208 18 1
Actual number mix()/final():  2831786 1111616 219209 28661 2816 199 16 1

Number of buckets: 16777216
Total number of items: 1648379
Expected number with n items: 15207226 1494125 73399 2403 59 1
Actual number mix():          15207428 1493777 73492 2459 59 1
Actual number mix()/final():  15207777 1493063 73877 2435 63 1

Number of buckets: 67108864
Total number of items: 1648379
Expected number with n items: 65480564 1608383 19753 161 1.47989202515749e-08
Actual number mix():          65480463 1608595 19635 170 1
Actual number mix()/final():  65480650 1608222 19820 171 1

As you can see, all of the probabilities for both the current single mix() function and the split mix()/final() functions match the theoretical results as calculated by the Poisson formula. Here is a nice reference that I found online describing the testing procedure that I am using:

http://rabbit.eng.miami.edu/class/een318/poisson.pdf

I can find no situations where the current version of hash_any() performs statistically better than the version resulting from my split mix()/final() function patch from 11/2008. I would be happy to share the test harness with anyone who wishes to test further. Please let me know if there is anything I can do. I still recommend applying the split mixing function patch that I submitted for the 11/2008 commitfest, and I think these tests should allay Tom's concerns about the randomness of the new hash function.

Regards,
Ken Marshall
On Sun, Jan 25, 2009 at 10:27:03PM -0600, Kenneth Marshall wrote: > On Sat, Jan 10, 2009 at 01:36:25PM -0500, Tom Lane wrote: > > Jeff Davis <pgsql@j-davis.com> writes: > > > I ran 5 times on both old and new code, eliminating the top and bottom > > > and taking the average of the remaining 3, and I got a 6.9% performance > > > improvement with the new code. > > > > The question that has been carefully evaded throughout the discussion > > of this patch is whether the randomness of the hash result is decreased, > > and if so what is that likely to cost us in performance of the various > > hash-dependent algorithms. I would still like to have an answer to that > > before we make a change to gain marginal performance improvement in > > the hash function itself (which is already shown to be barely measurable > > in the total context of a hash-dependent operation...) > > > > regards, tom lane > > > > Dear Hackers and Reviewers, > > In response to Tom's questioning the randomness of the hash_any > when the mixing functions are split into two, the first used when > sequentially processing the input and the second for the final > mix, I have generated a more detailed analysis of the two hash_any > functions. > > First, one change to the 11/2008 patch, the keylen is added to a, b and > c initially so we do not need to add it later on. 
The is the > micro-diff: > ----------------------------- > > --- hashfunc.c_TWOMIX 2009-01-22 14:07:34.000000000 -0600 > +++ hashfunc.c_TWOMIX2 2009-01-22 14:17:32.000000000 -0600 > @@ -336,7 +336,6 @@ > > /* handle the last 11 bytes */ > k = (const unsigned char *) ka; > - c += keylen; > #ifdef WORDS_BIGENDIAN > switch (len) > { > @@ -439,7 +438,6 @@ > } > > /* handle the last 11 bytes */ > - c += keylen; > #ifdef WORDS_BIGENDIAN > switch (len) /* all the case statements fall through */ > > ----------------------------- > > The remainder of this document will use the runs from my initial results > broken out using various power-of-two bucket sizes to simulate our actual > use in PostgreSQL as the number of items hashed increases and we use more > and more bits of our hash to identify the appropriate bucket. I have run > each test twice, once with our current hash_any() with the single mix() > function and then a second time using my patch from the November commitfest > plus the patch above to produce a new hash_any() with two separate mixing > functions mix() and final(). For each run I have generated a sequence of > unique inputs, up to the number of buckets, hashed them with the hash > functions (both old and new), then I calculate the expected number of > collision p(n) using the poisson formula for each number of buckets, > where the number of buckets are 2**16, 2**18, 2**20, 2**22, 2**24, and > 2**26. For my initial run, I used a string consisting of the letter 'a' > followed by the integer representation of the numbers from 0 to the > (number of buckets - 1): > > 1) "a"uint32 ((i.e. a00001,a0002...) 
> Number of buckets: 65536
> Total number of items: 65536
> Expected number with n items: 24109 24109 12054 4018 1004 200 33 4
> Actual number mix(): 24044 24172 12078 4036 980 186 30 10
> Actual number mix()/final(): 24027 24232 12060 3972 1001 207 31 5 1
>
> Number of buckets: 262144
> Total number of items: 262144
> Expected number with n items: 96437 96437 48218 16072 4018 803 133 19 2
> Actual number mix(): 96224 96730 48240 15951 4094 744 143 17 1
> Actual number mix()/final(): 96335 96646 48071 16128 4018 801 122 21 2
>
> Number of buckets: 1048576
> Total number of items: 1048576
> Expected number with n items: 385749 385749 192874 64291 16072 3214 535 76 9
> Actual number mix(): 385716 385596 193243 64115 16053 3285 478 77 12 1
> Actual number mix()/final(): 385955 385016 193789 63768 16259 3190 511 79 8 1
>
> Number of buckets: 4194304
> Total number of items: 4194304
> Expected number with n items: 1542998 1542998 771499 257166 64291 12858 2143 306 38
> Actual number mix(): 1542536 1543489 771351 257777 63830 12847 2123 326 19 5 1
> Actual number mix()/final(): 1541828 1544429 772072 256178 64579 12774 2129 288 22 5
>
> Number of buckets: 16777216
> Total number of items: 16777216
> Expected number with n items: 6171992 6171992 3085996 1028665 257166 51433 8572 1224 153
> Actual number mix(): 6170866 6174079 3085912 1027140 257808 51385 8638 1219 146 23
> Actual number mix()/final(): 6172058 6171991 3086279 1027916 257535 51465 8554 1243 149 23 3
>
> Number of buckets: 67108864
> Total number of items: 67108864
> Expected number with n items: 24687971 24687971 12343985 4114661 1028665 205733 34288 4898 612
> Actual number mix(): 24686110 24690897 12344232 4113515 1028232 205682 34546 4942 629 72 7
> Actual number mix()/final(): 24708515 24669248 12333034 4114796 1034256 208424 34888 5023 598 77 5
>
> Here is a second run with number of items = (number of buckets)/2:
>
> Number of buckets: 65536
> Total number of items: 32768
> Expected number with n items: 39749 19874 4968 828 103 10
> Actual number mix(): 39704 19978 4898 842 103 10 1
> Actual number mix()/final(): 39715 19922 4991 779 119 9 1
>
> Number of buckets: 262144
> Total number of items: 131072
> Expected number with n items: 158998 79499 19874 3312 414 41 3
> Actual number mix(): 158753 79972 19703 3216 458 38 4
> Actual number mix()/final(): 158869 79688 19853 3303 393 33 3 2
>
> Number of buckets: 1048576
> Total number of items: 524288
> Expected number with n items: 635993 317996 79499 13249 1656 165 13
> Actual number mix(): 636118 317839 79388 13435 1628 152 16
> Actual number mix()/final(): 636102 317824 79527 13267 1683 161 12
>
> Number of buckets: 4194304
> Total number of items: 2097152
> Expected number with n items: 2543973 1271986 317996 52999 6624 662 55 3
> Actual number mix(): 2544529 1270639 318904 53005 6516 645 61 5
> Actual number mix()/final(): 2544014 1271483 318720 52849 6559 630 47 2
>
> Number of buckets: 16777216
> Total number of items: 8388608
> Expected number with n items: 10175895 5087947 1271986 211997 26499 2649 220 15
> Actual number mix(): 10174997 5089736 1271355 211528 26679 2683 220 17 1
> Actual number mix()/final(): 10176567 5087207 1271789 212014 26684 2703 235 16 1
>
> Number of buckets: 67108864
> Total number of items: 33554432
> Expected number with n items: 40703583 20351791 5087947 847991 105998 10599 883 63 3
> Actual number mix(): 40703033 20353345 5086252 848860 105885 10536 891 59 3
> Actual number mix()/final(): 40770874 20250828 5096890 865448 111835 11890 1013 76 10
>
> 2) uint32uint32 (i.e. uint64)
>
> Number of buckets: 65536
> Total number of items: 32768
> Expected number with n items: 39749 19874 4968 828 103 10
> Actual number mix(): 39770 19863 4947 822 125 9
> Actual number mix()/final(): 39754 19852 4990 837 91 11 1
>
> Number of buckets: 262144
> Total number of items: 131072
> Expected number with n items: 158998 79499 19874 3312 414 41 3
> Actual number mix(): 159046 79464 19823 3334 430 43 3 1
> Actual number mix()/final(): 158879 79732 19774 3301 406 47 5
>
> Number of buckets: 1048576
> Total number of items: 524288
> Expected number with n items: 635993 317996 79499 13249 1656 165 13
> Actual number mix(): 636275 317580 79514 13357 1661 172 14 3
> Actual number mix()/final(): 635578 318574 79504 13162 1585 159 13 1
>
> Number of buckets: 4194304
> Total number of items: 2097152
> Expected number with n items: 2543973 1271986 317996 52999 6624 662 55 3
> Actual number mix(): 2543769 1272196 318067 53050 6497 672 48 4 1
> Actual number mix()/final(): 2543014 1273485 317812 52703 6589 635 59 7
>
> Number of buckets: 16777216
> Total number of items: 8388608
> Expected number with n items: 10175895 5087947 1271986 211997 26499 2649 220 15
> Actual number mix(): 10174587 5090467 1270909 211836 26513 2679 207 18
> Actual number mix()/final(): 10175142 5089481 1270919 212555 26234 2643 224 15 3
>
> Number of buckets: 67108864
> Total number of items: 33554432
> Expected number with n items: 40703583 20351791 5087947 847991 105998 10599 883 63 3
> Actual number mix(): 40706281 20347746 5088639 848133 106388 10669 946 60 2
> Actual number mix()/final(): 40701438 20354677 5088549 846570 106157 10578 838 55 2
>
> 3) uint32 from 1-1648379
>
> Number of buckets: 65536
> Total number of items: 32768
> Expected number with n items: 39749 19874 4968 828 103 10
> Actual number mix(): 39758 19844 4993 835 97 9
> Actual number mix()/final(): 39842 19712 5016 849 109 7 1
>
> Number of buckets: 262144
> Total number of items: 131072
> Expected number with n items: 158998 79499 19874 3312 414 41 3
> Actual number mix(): 158973 79523 19900 3283 426 38 1
> Actual number mix()/final(): 159136 79293 19896 3346 421 48 3 1
>
> Number of buckets: 1048576
> Total number of items: 524288
> Expected number with n items: 635993 317996 79499 13249 1656 165 13
> Actual number mix(): 636083 317917 79484 13183 1711 180 16 2
> Actual number mix()/final(): 636183 317772 79438 13305 1685 173 20
>
> Number of buckets: 4194304
> Total number of items: 2097152
> Expected number with n items: 2543973 1271986 317996 52999 6624 662 55 3
> Actual number mix(): 2544325 1271367 318260 52950 6660 683 54 4 1
> Actual number mix()/final(): 2544022 1272059 317778 53060 6646 665 70 4
>
> Number of buckets: 16777216
> Total number of items: 8388608
> Expected number with n items: 10175895 5087947 1271986 211997 26499 2649 220 15
> Actual number mix(): 10176360 5087249 1272083 212010 26655 2629 212 18
> Actual number mix()/final(): 10175315 5088617 1272068 212144 26224 2589 234 23 1 1
>
> Number of buckets: 67108864
> Total number of items: 33554432
> Expected number with n items: 40703583 20351791 5087947 847991 105998 10599 883 63 3
> Actual number mix(): 40704034 20350542 5088760 848309 105721 10520 894 77 7
> Actual number mix()/final(): 40774785 20243786 5098862 866801 111865 11641 1019 100 5
>
> 4) cracklib
>
> Number of buckets: 65536
> Total number of items: 32767
> Expected number with n items: 39750 19874 4968 828 103 10
> Actual number mix(): 39731 19858 5049 794 92 11 1
> Actual number mix()/final(): 39785 19801 5011 823 106 9 1
>
> Number of buckets: 262144
> Total number of items: 131071
> Expected number with n items: 158998 79498 19874 3312 414 41 3
> Actual number mix(): 159077 79420 19806 3380 409 49 3
> Actual number mix()/final(): 159020 79475 19845 3366 383 54 1
>
> Number of buckets: 1048576
> Total number of items: 524287
> Expected number with n items: 635994 317996 79498 13249 1656 165 13
> Actual number mix(): 635979 318066 79420 13281 1641 163 23 3
> Actual number mix()/final(): 635694 318637 79122 13271 1686 150 13 3
>
> Number of buckets: 4194304
> Total number of items: 1648379
> Expected number with n items: 2831263 1112698 218647 28643 2814 221 14
> Actual number mix(): 2830769 1113458 218633 28396 2786 250 11 1
> Actual number mix()/final(): 2831304 1113004 218002 28843 2926 212 13
>
> Number of buckets: 16777216
> Total number of items: 1648379
> Expected number with n items: 15207226 1494125 73399 2403 59 1
> Actual number mix(): 15206923 1494718 73129 2384 59 3
> Actual number mix()/final(): 15207183 1494267 73250 2452 64
>
> Number of buckets: 67108864
> Total number of items: 1648379
> Expected number with n items: 65480564 1608383 19753 161 1.47989202515749e-08
> Actual number mix(): 65480389 1608727 19593 154 1
> Actual number mix()/final(): 65480455 1608609 19631 168 1
>
> 5) cracklib x 100
>
> Number of buckets: 65536
> Total number of items: 32767
> Expected number with n items: 39750 19874 4968 828 103 10
> Actual number mix(): 39829 19742 5003 847 99 14 2
> Actual number mix()/final(): 39783 19794 5023 828 97 11
>
> Number of buckets: 262144
> Total number of items: 131071
> Expected number with n items: 158998 79498 19874 3312 414 41 3
> Actual number mix(): 159136 79242 20007 3289 405 62 3
> Actual number mix()/final(): 158961 79546 19905 3270 408 51 3
>
> Number of buckets: 1048576
> Total number of items: 524287
> Expected number with n items: 635994 317996 79498 13249 1656 165 13
> Actual number mix(): 635815 318331 79372 13218 1658 168 12 2
> Actual number mix()/final(): 635986 318152 79309 13231 1687 190 21
>
> Number of buckets: 4194304
> Total number of items: 1648379
> Expected number with n items: 2831263 1112698 218647 28643 2814 221 14
> Actual number mix(): 2830891 1113468 218202 28712 2804 208 18 1
> Actual number mix()/final(): 2831786 1111616 219209 28661 2816 199 16 1
>
> Number of buckets: 16777216
> Total number of items: 1648379
> Expected number with n items: 15207226 1494125 73399 2403 59 1
> Actual number mix(): 15207428 1493777 73492 2459 59 1
> Actual number mix()/final(): 15207777 1493063 73877 2435 63 1
>
> Number of buckets: 67108864
> Total number of items: 1648379
> Expected number with n items: 65480564 1608383 19753 161 1.47989202515749e-08
> Actual number mix(): 65480463 1608595 19635 170 1
> Actual number mix()/final(): 65480650 1608222 19820 171 1
>
> As you can see, all of the probabilities for both the current single
> mix() function and the split mix()/final() functions match the
> theoretical results as calculated by the Poisson formula. Here is a nice
> reference that I found online describing the testing procedure that I am
> using:
>
> http://rabbit.eng.miami.edu/class/een318/poisson.pdf
>
> I can find no situations where the current version of hash_any() performs
> statistically better than the version resulting from my split
> mix()/final() function patch from 11/2008. I would be happy to share the
> test harness with anyone who wishes to test further. Please let me know
> if there is anything I can do. I still recommend applying the split
> mixing function patch that I submitted for the 11/2008 commitfest, and I
> think these tests should allay Tom's concerns about the randomness of
> the new hash function.
>
> Regards,
> Ken Marshall

Dear Hackers,

I have updated the patch posted by Jeff Davis on January 9th to include
the micro-patch above, as well as updated polymorphism regression tests.
This applies cleanly to the latest CVS pull.

Regards,
Ken
Kenneth Marshall <ktm@rice.edu> writes:
> I have updated the patch posted by Jeff Davis on January 9th
> to include the micro-patch above as well as updated the polymorphism
> regressions tests. This applies cleanly to the latest CVS pull.

Applied --- thanks for being persistent about resolving the doubts on
this. One thing that apparently neither of you realized was that the
polymorphism results were varying between bigendian and littleendian
machines; I suppose you are using different hardware and that's why you
didn't agree on what the results should be. Since we already agreed we
were going to tolerate endianness dependence in the hash functions, I
fixed that by adding some ORDER BYs.

regards, tom lane
Did this change hashtext() in a way visible to users? We have been using
it quite widely for partitioning our databases. If so, then it should be
marked quite visibly in the release notes, as there might be others who
will be hit by this.
regards
Asko
On Mon, Feb 9, 2009 at 11:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Asko Oja <ascoja@gmail.com> writes:
> Did this change hashtext() in a way visible to users? We have been using
> it quite widely for partitioning our databases. If so, then it should be
> marked quite visibly in the release notes, as there might be others who
> will be hit by this.

The hash functions are undocumented, have changed in the past, and are
likely to change again in the future. If you are using them in a way
that depends on them to give the same answers across versions, you'd
better stop.

regards, tom lane
On Wed, 2009-02-11 at 11:22 -0500, Tom Lane wrote:
> Asko Oja <ascoja@gmail.com> writes:
> > Did this change hashtext() in a way visible to users? We have been using
> > it quite widely for partitioning our databases. If so, then it should be
> > marked quite visibly in the release notes, as there might be others who
> > will be hit by this.
>
> The hash functions are undocumented, have changed in the past, and are
> likely to change again in the future. If you are using them in a way
> that depends on them to give the same answers across versions, you'd
> better stop.

Is at least the fact that they "are undocumented, have changed in the
past, and are likely to change again in the future" documented? I'm sure
this is something that has hit unwary users in the past and will hit
again in the future, so some words about it in the docs would be
appropriate. A search for "hashtext" on
http://www.postgresql.org/docs/8.4/interactive/index.html returned no
results, so I guess even their "undocumented, will surprise you" status
is not documented.

Hashing is a quite fundamental thing in computing, so I was quite
surprised to find out it had silently changed. I had never checked the
docs for hash functions, but I had assumed that internal functions are
prefixed by pg_ and anything else is public, free-to-use functionality.

Changing hash functions also makes in-place upgrades a lot harder, as
they can't be done incrementally anymore for tables which use hash
indexes.

> regards, tom lane

--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability Services, Consulting and Training
Hannu Krosing <hannu@2ndQuadrant.com> writes:
> I had never checked the docs for hash functions, but I had assumed that
> internal functions are prefixed by pg_ and anything else is public, free
> to use functionality.

Sure, it's free to use. It's not free to assume that we promise never
to change it.

> Changing hash functions also makes in-place upgrades a lot harder, as
> they can't be done incrementally anymore for tables which use hash
> indexes.

Hash indexes are so far from being production-grade that this argument
is not significant.

regards, tom lane
On Wed, Oct 28, 2009 at 03:31:17PM -0400, Tom Lane wrote:
> Hannu Krosing <hannu@2ndQuadrant.com> writes:
> > I had never checked the docs for hash functions, but I had assumed that
> > internal functions are prefixed by pg_ and anything else is public, free
> > to use functionality.
>
> Sure, it's free to use. It's not free to assume that we promise never
> to change it.
>
> > Changing hash functions also makes in-place upgrades a lot harder, as
> > they can't be done incrementally anymore for tables which use hash
> > indexes.
>
> Hash indexes are so far from being production-grade that this argument
> is not significant.
>
> regards, tom lane

In addition, the change from 8.3 -> 8.4 to store only the hash and not
the value in the index means that a reindex would be required in any
event.

Cheers,
Ken
Kenneth Marshall <ktm@rice.edu> writes:
> On Wed, Oct 28, 2009 at 03:31:17PM -0400, Tom Lane wrote:
>> Hash indexes are so far from being production-grade that this argument
>> is not significant.

> In addition, the change from 8.3 -> 8.4 to store only the hash and not
> the value in the index means that a reindex would be required in any event.

Indeed, and I fully expect there will be some more on-disk format
changes required before we get to the point where hash indexes are
actually interesting for production. If we start insisting that they be
in-place-upgradable now, we will pretty much guarantee that they never
become useful enough to justify the restriction :-(

(As examples, the hash bucket size probably needs revisiting, and we
ought to think very hard about whether we shouldn't switch to 64-bit
hash values. And that's not even considering some of the more advanced
suggestions that have been made.)

regards, tom lane
On Wed, 2009-10-28 at 21:09 +0200, Hannu Krosing wrote:
> Is at least the fact that they "are undocumented, have changed in the
> past, and are likely to change again in the future" documented ?

That's a little confusing to me: how do we document that something is
undocumented? And where do we stop?

> Hashing is a quite fundamental thing in computing, so I was quite
> surprised to find out it had silently changed.

There are many reasons to use a hash, and we don't want people to use
these functions for the wrong purpose. I have seen people use a
performance hash for security purposes before, and I had to demonstrate
some hash collisions to show why that was a bad idea. So, if we do
provide documented functions, it should be done carefully.

Trying to develop and document a set of standardized, stable hash
functions covering a wide range of possible use cases sounds like it may
be better served by an extension.

Regards,
Jeff Davis
On Wed, 2009-10-28 at 12:51 -0700, Jeff Davis wrote:
> On Wed, 2009-10-28 at 21:09 +0200, Hannu Krosing wrote:
> > Is at least the fact that they "are undocumented, have changed in the
> > past, and are likely to change again in the future" documented ?
>
> That's a little confusing to me: how do we document that something is
> undocumented? And where do we stop?

My previous e-mail message documents that the undocumentedness is
undocumented, so no need to go any further here ;)

Though undocumented, the hash functions are easily discoverable by doing
\df *hash* in psql.

> > Hashing is a quite fundamental thing in computing, so I was quite
> > surprised to find out it had silently changed.
>
> There are many reasons to use a hash, and we don't want people to use
> these functions for the wrong purpose.

I don't think that not documenting a hash function helps here at all.

> I have seen people use a
> performance hash for security purposes before, and I had to demonstrate
> some hash collisions to show why that was a bad idea.

I've seen people use CRC32 as a hash and then hit a collision in 15
tries with quite large keys.

> So, if we do provide documented functions, it should be done carefully.

Any user-visible behavior change should be done carefully, even if the
original behavior is not documented.

A careful upgrade of the hashxxx functions would have kept the old
functions and introduced the new ones with a _v2 suffix, then used those
in the appropriate places, kept the old ones for a few versions (maybe
with a deprecation warning), and then moved them to contrib for a few
more versions. Doing it this way could leave them "undocumented" and
still not break people's applications in mysterious ways.

> Trying to develop and document a set of standardized, stable hash
> functions covering a wide range of possible use cases sounds like it may
> be better served by an extension.
I guess there are enough security/crypt/strong hashes in the pgcrypto
package, so that should not be a problem.

--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability Services, Consulting and Training
On Wed, 2009-10-28 at 15:31 -0400, Tom Lane wrote:
> Hannu Krosing <hannu@2ndQuadrant.com> writes:
> > I had never checked the docs for hash functions, but I had assumed that
> > internal functions are prefixed by pg_ and anything else is public, free
> > to use functionality.
>
> Sure, it's free to use. It's not free to assume that we promise never
> to change it.
>
> > Changing hash functions also makes in-place upgrades a lot harder, as
> > they can't be done incrementally anymore for tables which use hash
> > indexes.
>
> Hash indexes are so far from being production-grade that this argument
> is not significant.

AFAIK in-place upgrade is also not quite production-grade, so this was
meant as a forward-looking note for the next time the hashxxx functions
change.

> regards, tom lane

--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability Services, Consulting and Training
On Wed, 2009-10-28 at 12:51 -0700, Jeff Davis wrote: > Trying to develop and document a set of standardized, stable hash > functions covering a wide range of possible use cases sounds like it may > be better served by an extension. I suspect that some of the participants in this thread have PL/Proxy in mind. PL/Proxy should probably ship its own set of hash functions. If we ever get built-in partitioning by hash (see thread nearby and most previous ones like it), we should also think about providing a hash function that doesn't change output over versions and is independent of hash index implementation concerns.
On Thu, 2009-10-29 at 09:47 +0200, Peter Eisentraut wrote:
> On Wed, 2009-10-28 at 12:51 -0700, Jeff Davis wrote:
> > Trying to develop and document a set of standardized, stable hash
> > functions covering a wide range of possible use cases sounds like it may
> > be better served by an extension.
>
> I suspect that some of the participants in this thread have PL/Proxy in
> mind.

Yes, I think that PL/Proxy is the prime example of such use.

> PL/Proxy should probably ship its own set of hash functions.

Or maybe we could just extract the hashes from some version of stock
PostgreSQL (say 8.3) and then make those available in contrib under the
name "stable_hashes"?

> If we ever get built-in partitioning by hash (see thread nearby and most
> previous ones like it), we should also think about providing a hash
> function that doesn't change output over versions and is independent of
> hash index implementation concerns.

And maybe even documented ;)

--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability Services, Consulting and Training
Hannu Krosing <hannu@2ndQuadrant.com> writes:
> Or maybe we could just extract the hashes from some version of stock
> postgresql (say 8.3) and then make those available in contrib under the
> name "stable_hashes"?

A better name would be "wishful_thinking" ... contrib does not have
control over some of the main risk factors, such as changes to the
internal representation of datatypes.

regards, tom lane
On Thu, 2009-10-29 at 09:32 -0400, Tom Lane wrote:
> Hannu Krosing <hannu@2ndQuadrant.com> writes:
> > Or maybe we could just extract the hashes from some version of stock
> > postgresql (say 8.3) and then make those available in contrib under the
> > name "stable_hashes"?
>
> A better name would be "wishful_thinking" ... contrib does not have
> control over some of the main risk factors, such as changes to the
> internal representation of datatypes.

Good point. But the internal layout of the data types popular for use as
hash keys changes rarely. A hash function change, however, is a
showstopper for everyone concerned.
On Thu, 2009-10-29 at 09:32 -0400, Tom Lane wrote:
> Hannu Krosing <hannu@2ndQuadrant.com> writes:
> > Or maybe we could just extract the hashes from some version of stock
> > postgresql (say 8.3) and then make those available in contrib under the
> > name "stable_hashes"?
>
> A better name would be "wishful_thinking" ... contrib does not have
> control over some of the main risk factors, such as changes to the
> internal representation of datatypes.

My understanding was that contrib is actually the only possibility to
have something maintained in sync with core PostgreSQL.

> regards, tom lane

--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability Services, Consulting and Training