Thread: Re: [PATCHES] updated hash functions for postgresql v1

Re: [PATCHES] updated hash functions for postgresql v1

From
Kenneth Marshall
Date:
Dear PostgreSQL developers,

I am re-sending this to keep this last change to the
internal hash function on the radar.

Ken
............
Sorry about the delay for this update to the new hash
index implementation. I was trying to get the WAL logging
in place and forgot to post the actual patch. The WAL
for hash indexes will need to wait for 8.5, but I did
want to add back in the piece of the Bob Jenkins 2006
hash function that was stripped out of the initial
patch on application due to concerns about the randomness
of the resulting hash values. Here is a re-post of my
initial findings comparing the old/new Jenkins hash
from lookup2 and lookup3. I have added a third column
containing the results for the hash_any() resulting
from the attached patch as well as simple timing test
for a DB reindex both before and after patching.

Also attached is a simple documentation patch updating
the note attached to the hash index description.

Regards,
Ken
----------------------------------------------------
Hi,

I have finally had a chance to do some investigation on
the performance of the old hash mix() function versus
the updated mix()/final() in the new hash function. Here
is a table of my current results for both the old and the
new hash function. In this case cracklib refers to the
cracklib-dict containing 1648379 unique words massaged
in various ways to generate input strings for the hash
functions. The result is the number of collisions in the
hash values generated.

hash input                            old    new  newv2
----------                            ---    ---  -----
cracklib                              338    316  338
cracklib x 2 (i.e. clibclib)          305    319  300
cracklib x 3 (clibclibclib)           323    329  315
cracklib x 10                         302    310  329
cracklib x 100                        350    335  298
cracklib x 1000                       314    309  315
cracklib x 100 truncated to char(100) 311    327  320

uint32 from 1-1648379                 309    319  347
(uint32 1-1948379)*256                309    314  304
(uint32 1-1948379)*16                 310    314  324
"a"uint32 (i.e. a00001,a0002...)      320    321  312

uint32uint32 (i.e. uint64)            321    287  309

The different result columns are old = Jenkins 1996 hash
function(lookup2.c), new = Jenkins 2006 hash function
(lookup3.c), and newv2 = adaptation of current hash_any()
to incorporate the separate mix()/final() functions. As
you can see from the results, spliting the mix() and final()
apart does not result in any perceptible loss of randomness
in the hash assignment. I also ran a crude timing for a
reindex of the following database:

CREATE TABLE dict (word text);
CREATE INDEX wordhash ON dict USING hash (word);
INSERT INTO dict (word) VALUES('s;lhkjdpyoijxfg;lktjgh;sdlhkjo');
INSERT INTO dict (SELECT MAX(word)||MAX(word) FROM dict);
... (21 times)

REINDEX TABLE
...

The average time to reindex the table using our current
hash_any() without the separate mix()/final() was 1696ms
and 1482ms with the separate mix()/final() stages giving
almost 13% better performance for this stupid metric.

Attachment

Re: [PATCHES] updated hash functions for postgresql v1

From
Alvaro Herrera
Date:
Kenneth Marshall wrote:
> Dear PostgreSQL developers,
> 
> I am re-sending this to keep this last change to the
> internal hash function on the radar.

Please add it to the commitfest wiki page,
http://wiki.postgresql.org/wiki/CommitFest_2008-11

Thanks

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: [PATCHES] updated hash functions for postgresql v1

From
Jeff Davis
Date:
On Mon, 2008-12-22 at 13:47 -0600, Kenneth Marshall wrote:
> Dear PostgreSQL developers,
>
> I am re-sending this to keep this last change to the
> internal hash function on the radar.
>

Hi Ken,

A few comments:

1. New patch with very minor changes attached.

2. I reverted the change you made to indices.sgml. We still don't use
WAL for hash indexes, and in my opinion we should continue to discourage
their use until we do use WAL. We can add back in the comment that hash
indexes are suitable for large keys if we have some results to show
that.

3. There was a regression test failure in union.sql because the ordering
of the results was different. I updated the regression test.

4. Hash functions affect a lot more than hash indexes, so I ran through
a variety of tests that use a HashAggregate plan. Test setup and results
are attached. These results show no difference between the old and the
new code (about 0.1% better).

5. The hash index build time shows some improvement. The new code won in
every instance in which a there were a lot of duplicates in the table
(100 distinct values, 50K of each) by around 5%.

The new code appeared to be the same or slightly worse in the case of
hash index builds with few duplicates (1000000 distinct values, 5 of
each). The difference was about 1% worse, which is probably just noise.

Note: I'm no expert on hash functions. Take all of my tests with a grain
of salt.

I would feel a little better if I saw at least one test that showed
better performance of the new code on a reasonable-looking distribution
of data. The hash index build that you showed only took a second or two
-- it would be nice to see a test that lasted at least a minute.

Regards,
    Jeff Davis



Attachment

Re: [PATCHES] updated hash functions for postgresql v1

From
Kenneth Marshall
Date:
On Fri, Jan 09, 2009 at 12:04:15PM -0800, Jeff Davis wrote:
> On Mon, 2008-12-22 at 13:47 -0600, Kenneth Marshall wrote: 
> > Dear PostgreSQL developers,
> > 
> > I am re-sending this to keep this last change to the
> > internal hash function on the radar.
> > 
> 
> Hi Ken,
> 
> A few comments:
> 
> 1. New patch with very minor changes attached.
> 
> 2. I reverted the change you made to indices.sgml. We still don't use
> WAL for hash indexes, and in my opinion we should continue to discourage
> their use until we do use WAL. We can add back in the comment that hash
> indexes are suitable for large keys if we have some results to show
> that.
> 
> 3. There was a regression test failure in union.sql because the ordering
> of the results was different. I updated the regression test.
> 
> 4. Hash functions affect a lot more than hash indexes, so I ran through
> a variety of tests that use a HashAggregate plan. Test setup and results
> are attached. These results show no difference between the old and the
> new code (about 0.1% better).
> 
> 5. The hash index build time shows some improvement. The new code won in
> every instance in which a there were a lot of duplicates in the table
> (100 distinct values, 50K of each) by around 5%.
> 
> The new code appeared to be the same or slightly worse in the case of
> hash index builds with few duplicates (1000000 distinct values, 5 of
> each). The difference was about 1% worse, which is probably just noise.
> 
> Note: I'm no expert on hash functions. Take all of my tests with a grain
> of salt.
> 
> I would feel a little better if I saw at least one test that showed
> better performance of the new code on a reasonable-looking distribution
> of data. The hash index build that you showed only took a second or two
> -- it would be nice to see a test that lasted at least a minute.
> 
> Regards,
>     Jeff Davis
> 
> 

Jeff,

Thanks for the review. I would not really expect any differences in hash
index build times other than normal noise variances. The most definitive
benchmark that I have seen was done with my original patch submission
in March posted by Luke of Greenplum:
 "We just applied this and saw a 5 percent speedup on a hash aggregation  query with four columns in a 'group by'
clauserun against a single  TPC-H table."
 

I wonder if they would be willing to re-run their test? Thanks again.

Regards,
Ken



Re: [PATCHES] updated hash functions for postgresql v1

From
Jeff Davis
Date:
On Fri, 2009-01-09 at 14:29 -0600, Kenneth Marshall wrote:
> Jeff,
> 
> Thanks for the review. I would not really expect any differences in hash
> index build times other than normal noise variances. The most definitive
> benchmark that I have seen was done with my original patch submission
> in March posted by Luke of Greenplum:
> 
>   "We just applied this and saw a 5 percent speedup on a hash aggregation
>    query with four columns in a 'group by' clause run against a single
>    TPC-H table."
> 
> I wonder if they would be willing to re-run their test? Thanks again.

Separating mix() and final() should have some performance benefit,
right?

Regards,Jeff Davis



Re: [PATCHES] updated hash functions for postgresql v1

From
Kenneth Marshall
Date:
On Fri, Jan 09, 2009 at 02:00:39PM -0800, Jeff Davis wrote:
> On Fri, 2009-01-09 at 14:29 -0600, Kenneth Marshall wrote:
> > Jeff,
> > 
> > Thanks for the review. I would not really expect any differences in hash
> > index build times other than normal noise variances. The most definitive
> > benchmark that I have seen was done with my original patch submission
> > in March posted by Luke of Greenplum:
> > 
> >   "We just applied this and saw a 5 percent speedup on a hash aggregation
> >    query with four columns in a 'group by' clause run against a single
> >    TPC-H table."
> > 
> > I wonder if they would be willing to re-run their test? Thanks again.
> 
> Separating mix() and final() should have some performance benefit,
> right?
> 
> Regards,
>     Jeff Davis
> 
> 
Yes, it does but the results can be swamped by other latencies in the
code path. Tests such as Tom's benchmark of the underlying functions is
needed to isolate the timings effectively or a benchmark like Greenplum's
that will benefit from a more efficient function.

Ken


Re: [PATCHES] updated hash functions for postgresql v1

From
Jeff Davis
Date:
On Sat, 2009-01-10 at 11:06 -0600, Kenneth Marshall wrote:
> > Separating mix() and final() should have some performance benefit,
> > right?
> > 
> Yes, it does but the results can be swamped by other latencies in the
> code path. Tests such as Tom's benchmark of the underlying functions is
> needed to isolate the timings effectively or a benchmark like Greenplum's
> that will benefit from a more efficient function.
> 

Ok. I isolated the function itself by just doing:

-- 10 million rows of random()::text
EXPLAIN ANALYZE SELECT hashtext(t) FROM randomtext;

I ran 5 times on both old and new code, eliminating the top and bottom
and taking the average of the remaining 3, and I got a 6.9% performance
improvement with the new code.

I tried quickly with a few other data types and got similar results.
It's obviously a small microbenchmark, but that's good enough for me. 

Thanks!

Regards,Jeff Davis



Re: [PATCHES] updated hash functions for postgresql v1

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> I ran 5 times on both old and new code, eliminating the top and bottom
> and taking the average of the remaining 3, and I got a 6.9% performance
> improvement with the new code.

The question that has been carefully evaded throughout the discussion
of this patch is whether the randomness of the hash result is decreased,
and if so what is that likely to cost us in performance of the various
hash-dependent algorithms.  I would still like to have an answer to that
before we make a change to gain marginal performance improvement in
the hash function itself (which is already shown to be barely measurable
in the total context of a hash-dependent operation...)
        regards, tom lane


Re: [PATCHES] updated hash functions for postgresql v1

From
Jeff Davis
Date:
On Sat, 2009-01-10 at 13:36 -0500, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > I ran 5 times on both old and new code, eliminating the top and bottom
> > and taking the average of the remaining 3, and I got a 6.9% performance
> > improvement with the new code.
> 
> The question that has been carefully evaded throughout the discussion
> of this patch is whether the randomness of the hash result is decreased,
> and if so what is that likely to cost us in performance of the various
> hash-dependent algorithms.  I would still like to have an answer to that
> before we make a change to gain marginal performance improvement in
> the hash function itself (which is already shown to be barely measurable
> in the total context of a hash-dependent operation...)
> 

In:
http://archives.postgresql.org/message-id/20081104202655.GP18362@it.is.rice.edu

Ken showed that the number of hash collisions is the same in the old and
the new code for his test data. Not a direct measurement of randomness,
but it's some kind of indication.

I'm not an authority on either hash functions or statistics, so I'll
have to defer to Ken or someone else to actually prove that the
randomness is maintained.

Regards,Jeff Davis



Re: [PATCHES] updated hash functions for postgresql v1

From
Gregory Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Jeff Davis <pgsql@j-davis.com> writes:
>> I ran 5 times on both old and new code, eliminating the top and bottom
>> and taking the average of the remaining 3, and I got a 6.9% performance
>> improvement with the new code.
>
> The question that has been carefully evaded throughout the discussion
> of this patch is whether the randomness of the hash result is decreased,

In fairness that doesn't seem to be the case. The original patch was posted
with such an analysis using cracklib and raw binary data:

http://article.gmane.org/gmane.comp.db.postgresql.devel.general/105675

>  marginal performance improvement in the hash function itself (which is
> already shown to be barely measurable in the total context of a
> hash-dependent operation...)

If it's a 6% gain in the speed of Hash Join or HashAggregate it would be very
interesting. But I gather it's a 6% speedup in the time spent actually in the
hash function. Is that really where much of our time is going? If it's 10% of
the total time to execute one of these nodes then we're talking about a 0.6%
optimization...

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!


Re: [PATCHES] updated hash functions for postgresql v1

From
Jeff Davis
Date:
On Sat, 2009-01-10 at 13:57 -0500, Gregory Stark wrote:
> But I gather it's a 6% speedup in the time spent actually in the
> hash function.

That's correct. I was not able to detect any difference at all between
the old and new code using HashAggregate, which is where most of my
testing effort went:

http://archives.postgresql.org/message-id/1231531455.25019.75.camel@jdavis

(see test results attached to that message, if you're interested)

I tested with two data distributions and 6 data types.

The 6-10% speedup I saw was a super-micro benchmark testing the hash
function, and I didn't spend much time with that.

Regards,Jeff Davis



Re: [PATCHES] updated hash functions for postgresql v1

From
Kenneth Marshall
Date:
On Sat, Jan 10, 2009 at 10:56:15AM -0800, Jeff Davis wrote:
> On Sat, 2009-01-10 at 13:36 -0500, Tom Lane wrote:
> > Jeff Davis <pgsql@j-davis.com> writes:
> > > I ran 5 times on both old and new code, eliminating the top and bottom
> > > and taking the average of the remaining 3, and I got a 6.9% performance
> > > improvement with the new code.
> > 
> > The question that has been carefully evaded throughout the discussion
> > of this patch is whether the randomness of the hash result is decreased,
> > and if so what is that likely to cost us in performance of the various
> > hash-dependent algorithms.  I would still like to have an answer to that
> > before we make a change to gain marginal performance improvement in
> > the hash function itself (which is already shown to be barely measurable
> > in the total context of a hash-dependent operation...)
> > 
> 
> In:
> http://archives.postgresql.org/message-id/20081104202655.GP18362@it.is.rice.edu
> 
> Ken showed that the number of hash collisions is the same in the old and
> the new code for his test data. Not a direct measurement of randomness,
> but it's some kind of indication.
> 
> I'm not an authority on either hash functions or statistics, so I'll
> have to defer to Ken or someone else to actually prove that the
> randomness is maintained.
> 
> Regards,
>     Jeff Davis
> 
First, while I am not an expert by any means, basic statistics will give you the
probability of a collision when packing N items into M buckets chosen at random.
In all of my tests, I used 1.6M items into 2^32 buckets. If the bits mixing is
complete and random, the number of collisions should be close to the same no
matter what arrangement of bits are used for inputs. I think my test cases
indicate quite clearly that the new hash function is as independent of bit-
order as the older functions, in that the number of bucket collisions is
basically the same for all layouts. I have the test harness and can test
any other input that you would like me to. Second, the author of the basic
hash function (old and new) has performed more extensive testing and has
shown that the new functions are better in randomizing bits than his original
function. I will try and run a micro-benchmark of my original patch in March
and the result of the incremental approach that is the result of my Novermber
patch tomorrow.

Cheers,
Ken





Re: [PATCHES] updated hash functions for postgresql v1

From
Kenneth Marshall
Date:
On Sat, Jan 10, 2009 at 01:57:27PM -0500, Gregory Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> 
> > Jeff Davis <pgsql@j-davis.com> writes:
> >> I ran 5 times on both old and new code, eliminating the top and bottom
> >> and taking the average of the remaining 3, and I got a 6.9% performance
> >> improvement with the new code.
> >
> > The question that has been carefully evaded throughout the discussion
> > of this patch is whether the randomness of the hash result is decreased,
> 
> In fairness that doesn't seem to be the case. The original patch was posted
> with such an analysis using cracklib and raw binary data:
> 
> http://article.gmane.org/gmane.comp.db.postgresql.devel.general/105675
> 
> >  marginal performance improvement in the hash function itself (which is
> > already shown to be barely measurable in the total context of a
> > hash-dependent operation...)
> 
> If it's a 6% gain in the speed of Hash Join or HashAggregate it would be very
> interesting. But I gather it's a 6% speedup in the time spent actually in the
> hash function. Is that really where much of our time is going? If it's 10% of
> the total time to execute one of these nodes then we're talking about a 0.6%
> optimization...
> 

The Greenplum test did show a 5% increase in performance with the replacement
functions in March.

Regards,
Ken


Re: [PATCHES] updated hash functions for postgresql v1

From
Kenneth Marshall
Date:
On Sat, Jan 10, 2009 at 01:36:25PM -0500, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > I ran 5 times on both old and new code, eliminating the top and bottom
> > and taking the average of the remaining 3, and I got a 6.9% performance
> > improvement with the new code.
> 
> The question that has been carefully evaded throughout the discussion
> of this patch is whether the randomness of the hash result is decreased,
> and if so what is that likely to cost us in performance of the various
> hash-dependent algorithms.  I would still like to have an answer to that
> before we make a change to gain marginal performance improvement in
> the hash function itself (which is already shown to be barely measurable
> in the total context of a hash-dependent operation...)
> 
>             regards, tom lane
> 

Dear Hackers and Reviewers,

In response to Tom's questioning the randomness of the hash_any
when the mixing functions are split into two, the first used when
sequentially processing the input and the second for the final
mix, I have generated a more detailed analysis of the two hash_any
functions.

First, one change to the 11/2008 patch, the keylen is added to a, b and
c initially so we do not need to add it later on. The is the
micro-diff:
-----------------------------

--- hashfunc.c_TWOMIX   2009-01-22 14:07:34.000000000 -0600
+++ hashfunc.c_TWOMIX2  2009-01-22 14:17:32.000000000 -0600
@@ -336,7 +336,6 @@               /* handle the last 11 bytes */               k = (const unsigned char *) ka;
-               c += keylen;#ifdef WORDS_BIGENDIAN               switch (len)               {
@@ -439,7 +438,6 @@               }               /* handle the last 11 bytes */
-               c += keylen;#ifdef WORDS_BIGENDIAN               switch (len)                    /* all the case
statementsfall through */
 
-----------------------------

The remainder of this document will use the runs from my initial results
broken out using various power-of-two bucket sizes to simulate our actual
use in PostgreSQL as the number of items hashed increases and we use more
and more bits of our hash to identify the appropriate bucket. I have run
each test twice, once with our current hash_any() with the single mix()
function and then a second time using my patch from the November commitfest
plus the patch above to produce a new hash_any() with two separate mixing
functions mix() and final(). For each run I have generated a sequence of
unique inputs, up to the number of buckets, hashed them with the hash
functions (both old and new), then I calculate the expected number of
collision p(n) using the poisson formula for each number of buckets,
where the number of buckets are 2**16, 2**18, 2**20, 2**22, 2**24, and
2**26. For my initial run, I used a string consisting of the letter 'a'
followed by the integer representation of the numbers from 0 to the
(number of buckets - 1):

1) "a"uint32 ((i.e. a00001,a0002...)
Number of buckets: 65536
Total number of items: 65536
Expected number with n items: 24109 24109 12054 4018 1004 200 33 4
Actual number mix():          24044 24172 12078 4036 980 186 30 10
Actual number mix()/final():  24027 24232 12060 3972 1001 207 31 5 1

Number of buckets: 262144
Total number of items: 262144
Expected number with n items: 96437 96437 48218 16072 4018 803 133 19 2
Actual number mix():          96224 96730 48240 15951 4094 744 143 17 1
Actual number mix()/final():  96335 96646 48071 16128 4018 801 122 21 2

Number of buckets: 1048576
Total number of items: 1048576
Expected number with n items: 385749 385749 192874 64291 16072 3214 535 76 9
Actual number mix():          385716 385596 193243 64115 16053 3285 478 77 12 1
Actual number mix()/final():  385955 385016 193789 63768 16259 3190 511 79 8 1

Number of buckets: 4194304
Total number of items: 4194304
Expected number with n items: 1542998 1542998 771499 257166 64291 12858 2143 306 38
Actual number mix():          1542536 1543489 771351 257777 63830 12847 2123 326 19 5 1
Actual number mix()/final():  1541828 1544429 772072 256178 64579 12774 2129 288 22 5

Number of buckets: 16777216
Total number of items: 16777216
Expected number with n items: 6171992 6171992 3085996 1028665 257166 51433 8572 1224 153
Actual number mix():          6170866 6174079 3085912 1027140 257808 51385 8638 1219 146 23
Actual number mix()/final():  6172058 6171991 3086279 1027916 257535 51465 8554 1243 149 23 3

Number of buckets: 67108864
Total number of items: 67108864
Expected number with n items: 24687971 24687971 12343985 4114661 1028665 205733 34288 4898 612
Actual number mix():          24686110 24690897 12344232 4113515 1028232 205682 34546 4942 629 72 7
Actual number mix()/final():  24708515 24669248 12333034 4114796 1034256 208424 34888 5023 598 77 5

Here is a second run with number of items = (number of buckets)/2:
Number of buckets: 65536
Total number of items: 32768
Expected number with n items: 39749 19874 4968 828 103 10
Actual number mix():          39704 19978 4898 842 103 10 1
Actual number mix()/final():  39715 19922 4991 779 119 9 1

Number of buckets: 262144
Total number of items: 131072
Expected number with n items: 158998 79499 19874 3312 414 41 3
Actual number mix():          158753 79972 19703 3216 458 38 4
Actual number mix()/final():  158869 79688 19853 3303 393 33 3 2

Number of buckets: 1048576
Total number of items: 524288
Expected number with n items: 635993 317996 79499 13249 1656 165 13
Actual number mix():          636118 317839 79388 13435 1628 152 16
Actual number mix()/final():  636102 317824 79527 13267 1683 161 12

Number of buckets: 4194304
Total number of items: 2097152
Expected number with n items: 2543973 1271986 317996 52999 6624 662 55 3
Actual number mix():          2544529 1270639 318904 53005 6516 645 61 5
Actual number mix()/final():  2544014 1271483 318720 52849 6559 630 47 2

Number of buckets: 16777216
Total number of items: 8388608
Expected number with n items: 10175895 5087947 1271986 211997 26499 2649 220 15
Actual number mix():          10174997 5089736 1271355 211528 26679 2683 220 17 1
Actual number mix()/final():  10176567 5087207 1271789 212014 26684 2703 235 16 1

Number of buckets: 67108864
Total number of items: 33554432
Expected number with n items: 40703583 20351791 5087947 847991 105998 10599 883 63 3
Actual number mix():          40703033 20353345 5086252 848860 105885 10536 891 59 3
Actual number mix()/final():  40770874 20250828 5096890 865448 111835 11890 1013 76 10

2) uint32uint32 (i.e. uint64)
Number of buckets: 65536
Total number of items: 32768
Expected number with n items: 39749 19874 4968 828 103 10
Actual number mix():          39770 19863 4947 822 125 9
Actual number mix()/final():  39754 19852 4990 837 91 11 1

Number of buckets: 262144
Total number of items: 131072
Expected number with n items: 158998 79499 19874 3312 414 41 3
Actual number mix():          159046 79464 19823 3334 430 43 3 1
Actual number mix()/final():  158879 79732 19774 3301 406 47 5

Number of buckets: 1048576
Total number of items: 524288
Expected number with n items: 635993 317996 79499 13249 1656 165 13
Actual number mix():          636275 317580 79514 13357 1661 172 14 3
Actual number mix()/final():  635578 318574 79504 13162 1585 159 13 1

Number of buckets: 4194304
Total number of items: 2097152
Expected number with n items: 2543973 1271986 317996 52999 6624 662 55 3
Actual number mix():          2543769 1272196 318067 53050 6497 672 48 4 1
Actual number mix()/final():  2543014 1273485 317812 52703 6589 635 59 7

Number of buckets: 16777216
Total number of items: 8388608
Expected number with n items: 10175895 5087947 1271986 211997 26499 2649 220 15
Actual number mix():          10174587 5090467 1270909 211836 26513 2679 207 18
Actual number mix()/final():  10175142 5089481 1270919 212555 26234 2643 224 15 3

Number of buckets: 67108864
Total number of items: 33554432
Expected number with n items: 40703583 20351791 5087947 847991 105998 10599 883 63 3
Actual number mix():          40706281 20347746 5088639 848133 106388 10669 946 60 2
Actual number mix()/final():  40701438 20354677 5088549 846570 106157 10578 838 55 2

3) uint32 from 1-1648379
Number of buckets: 65536
Total number of items: 32768
Expected number with n items: 39749 19874 4968 828 103 10
Actual number mix():          39758 19844 4993 835 97 9
Actual number mix()/final():  39842 19712 5016 849 109 7 1

Number of buckets: 262144
Total number of items: 131072
Expected number with n items: 158998 79499 19874 3312 414 41 3
Actual number mix():          158973 79523 19900 3283 426 38 1
Actual number mix()/final():  159136 79293 19896 3346 421 48 3 1

Number of buckets: 1048576
Total number of items: 524288
Expected number with n items: 635993 317996 79499 13249 1656 165 13
Actual number mix():          636083 317917 79484 13183 1711 180 16 2
Actual number mix()/final():  636183 317772 79438 13305 1685 173 20

Number of buckets: 4194304
Total number of items: 2097152
Expected number with n items: 2543973 1271986 317996 52999 6624 662 55 3
Actual number mix():          2544325 1271367 318260 52950 6660 683 54 4 1
Actual number mix()/final():  2544022 1272059 317778 53060 6646 665 70 4

Number of buckets: 16777216
Total number of items: 8388608
Expected number with n items: 10175895 5087947 1271986 211997 26499 2649 220 15
Actual number mix():          10176360 5087249 1272083 212010 26655 2629 212 18
Actual number mix()/final():  10175315 5088617 1272068 212144 26224 2589 234 23 1 1

Number of buckets: 67108864
Total number of items: 33554432
Expected number with n items: 40703583 20351791 5087947 847991 105998 10599 883 63 3
Actual number mix():          40704034 20350542 5088760 848309 105721 10520 894 77 7
Actual number mix()/final():  40774785 20243786 5098862 866801 111865 11641 1019 100 5

4) cracklib
Number of buckets: 65536
Total number of items: 32767
Expected number with n items: 39750 19874 4968 828 103 10
Actual number mix():          39731 19858 5049 794 92 11 1
Actual number mix()/final():  39785 19801 5011 823 106 9 1

Number of buckets: 262144
Total number of items: 131071
Expected number with n items: 158998 79498 19874 3312 414 41 3
Actual number mix():          159077 79420 19806 3380 409 49 3
Actual number mix()/final():  159020 79475 19845 3366 383 54 1

Number of buckets: 1048576
Total number of items: 524287
Expected number with n items: 635994 317996 79498 13249 1656 165 13
Actual number mix():          635979 318066 79420 13281 1641 163 23 3
Actual number mix()/final():  635694 318637 79122 13271 1686 150 13 3

Number of buckets: 4194304
Total number of items: 1648379
Expected number with n items: 2831263 1112698 218647 28643 2814 221 14
Actual number mix():          2830769 1113458 218633 28396 2786 250 11 1
Actual number mix()/final():  2831304 1113004 218002 28843 2926 212 13

Number of buckets: 16777216
Total number of items: 1648379
Expected number with n items: 15207226 1494125 73399 2403 59 1
Actual number mix():          15206923 1494718 73129 2384 59 3
Actual number mix()/final():  15207183 1494267 73250 2452 64

Number of buckets: 67108864
Total number of items: 1648379
Expected number with n items: 65480564 1608383 19753 161 1.47989202515749e-08
Actual number mix():          65480389 1608727 19593 154 1
Actual number mix()/final():  65480455 1608609 19631 168 1

5) cracklib x 100
Number of buckets: 65536
Total number of items: 32767
Expected number with n items: 39750 19874 4968 828 103 10
Actual number mix():          39829 19742 5003 847 99 14 2
Actual number mix()/final():  39783 19794 5023 828 97 11

Number of buckets: 262144
Total number of items: 131071
Expected number with n items: 158998 79498 19874 3312 414 41 3
Actual number mix():          159136 79242 20007 3289 405 62 3
Actual number mix()/final():  158961 79546 19905 3270 408 51 3

Number of buckets: 1048576
Total number of items: 524287
Expected number with n items: 635994 317996 79498 13249 1656 165 13
Actual number mix():          635815 318331 79372 13218 1658 168 12 2
Actual number mix()/final():  635986 318152 79309 13231 1687 190 21

Number of buckets: 4194304
Total number of items: 1648379
Expected number with n items: 2831263 1112698 218647 28643 2814 221 14
Actual number mix():          2830891 1113468 218202 28712 2804 208 18 1
Actual number mix()/final():  2831786 1111616 219209 28661 2816 199 16 1

Number of buckets: 16777216
Total number of items: 1648379
Expected number with n items: 15207226 1494125 73399 2403 59 1
Actual number mix():          15207428 1493777 73492 2459 59 1
Actual number mix()/final():  15207777 1493063 73877 2435 63 1

Number of buckets: 67108864
Total number of items: 1648379
Expected number with n items: 65480564 1608383 19753 161 1.47989202515749e-08
Actual number mix():          65480463 1608595 19635 170 1
Actual number mix()/final():  65480650 1608222 19820 171 1

As you can see, all of the probabilities for both the current single mix()
function and the split mix()/final() functions match the theoretical results
as calculated by the poisson formula. Here is a nice reference that I found
online describing the testing procedure that I am using:

http://rabbit.eng.miami.edu/class/een318/poisson.pdf

I can find no situations where the current version of hash_any() performs
statistically better than the version resulting from my split mix()/final()
function patch from 11/2008. I would be happy to share the test harness with
anyone who wishes to test further. Please let me know if there is anything
I can do. I still recommend applying the split mixing function patch that
I submitted for the 11/2008 commitfest and I think these tests should allay
Tom's concerns about the randomness of the new hash function.

Regards,
Ken Marshall



Re: [PATCHES] updated hash functions for postgresql v1

From
Kenneth Marshall
Date:
On Sun, Jan 25, 2009 at 10:27:03PM -0600, Kenneth Marshall wrote:
> On Sat, Jan 10, 2009 at 01:36:25PM -0500, Tom Lane wrote:
> > Jeff Davis <pgsql@j-davis.com> writes:
> > > I ran 5 times on both old and new code, eliminating the top and bottom
> > > and taking the average of the remaining 3, and I got a 6.9% performance
> > > improvement with the new code.
> >
> > The question that has been carefully evaded throughout the discussion
> > of this patch is whether the randomness of the hash result is decreased,
> > and if so what is that likely to cost us in performance of the various
> > hash-dependent algorithms.  I would still like to have an answer to that
> > before we make a change to gain marginal performance improvement in
> > the hash function itself (which is already shown to be barely measurable
> > in the total context of a hash-dependent operation...)
> >
> >             regards, tom lane
> >
>
> Dear Hackers and Reviewers,
>
> In response to Tom's questioning the randomness of the hash_any
> when the mixing functions are split into two, the first used when
> sequentially processing the input and the second for the final
> mix, I have generated a more detailed analysis of the two hash_any
> functions.
>
> First, one change to the 11/2008 patch, the keylen is added to a, b and
> c initially so we do not need to add it later on. The is the
> micro-diff:
> -----------------------------
>
> --- hashfunc.c_TWOMIX   2009-01-22 14:07:34.000000000 -0600
> +++ hashfunc.c_TWOMIX2  2009-01-22 14:17:32.000000000 -0600
> @@ -336,7 +336,6 @@
>
>                 /* handle the last 11 bytes */
>                 k = (const unsigned char *) ka;
> -               c += keylen;
>  #ifdef WORDS_BIGENDIAN
>                 switch (len)
>                 {
> @@ -439,7 +438,6 @@
>                 }
>
>                 /* handle the last 11 bytes */
> -               c += keylen;
>  #ifdef WORDS_BIGENDIAN
>                 switch (len)                    /* all the case statements fall through */
>
> -----------------------------
>
> The remainder of this document will use the runs from my initial results
> broken out using various power-of-two bucket sizes to simulate our actual
> use in PostgreSQL as the number of items hashed increases and we use more
> and more bits of our hash to identify the appropriate bucket. I have run
> each test twice, once with our current hash_any() with the single mix()
> function and then a second time using my patch from the November commitfest
> plus the patch above to produce a new hash_any() with two separate mixing
> functions mix() and final(). For each run I have generated a sequence of
> unique inputs, up to the number of buckets, hashed them with the hash
> functions (both old and new), then I calculate the expected number of
> collision p(n) using the poisson formula for each number of buckets,
> where the number of buckets are 2**16, 2**18, 2**20, 2**22, 2**24, and
> 2**26. For my initial run, I used a string consisting of the letter 'a'
> followed by the integer representation of the numbers from 0 to the
> (number of buckets - 1):
>
> 1) "a"uint32 ((i.e. a00001,a0002...)
> Number of buckets: 65536
> Total number of items: 65536
> Expected number with n items: 24109 24109 12054 4018 1004 200 33 4
> Actual number mix():          24044 24172 12078 4036 980 186 30 10
> Actual number mix()/final():  24027 24232 12060 3972 1001 207 31 5 1
>
> Number of buckets: 262144
> Total number of items: 262144
> Expected number with n items: 96437 96437 48218 16072 4018 803 133 19 2
> Actual number mix():          96224 96730 48240 15951 4094 744 143 17 1
> Actual number mix()/final():  96335 96646 48071 16128 4018 801 122 21 2
>
> Number of buckets: 1048576
> Total number of items: 1048576
> Expected number with n items: 385749 385749 192874 64291 16072 3214 535 76 9
> Actual number mix():          385716 385596 193243 64115 16053 3285 478 77 12 1
> Actual number mix()/final():  385955 385016 193789 63768 16259 3190 511 79 8 1
>
> Number of buckets: 4194304
> Total number of items: 4194304
> Expected number with n items: 1542998 1542998 771499 257166 64291 12858 2143 306 38
> Actual number mix():          1542536 1543489 771351 257777 63830 12847 2123 326 19 5 1
> Actual number mix()/final():  1541828 1544429 772072 256178 64579 12774 2129 288 22 5
>
> Number of buckets: 16777216
> Total number of items: 16777216
> Expected number with n items: 6171992 6171992 3085996 1028665 257166 51433 8572 1224 153
> Actual number mix():          6170866 6174079 3085912 1027140 257808 51385 8638 1219 146 23
> Actual number mix()/final():  6172058 6171991 3086279 1027916 257535 51465 8554 1243 149 23 3
>
> Number of buckets: 67108864
> Total number of items: 67108864
> Expected number with n items: 24687971 24687971 12343985 4114661 1028665 205733 34288 4898 612
> Actual number mix():          24686110 24690897 12344232 4113515 1028232 205682 34546 4942 629 72 7
> Actual number mix()/final():  24708515 24669248 12333034 4114796 1034256 208424 34888 5023 598 77 5
>
> Here is a second run with number of items = (number of buckets)/2:
> Number of buckets: 65536
> Total number of items: 32768
> Expected number with n items: 39749 19874 4968 828 103 10
> Actual number mix():          39704 19978 4898 842 103 10 1
> Actual number mix()/final():  39715 19922 4991 779 119 9 1
>
> Number of buckets: 262144
> Total number of items: 131072
> Expected number with n items: 158998 79499 19874 3312 414 41 3
> Actual number mix():          158753 79972 19703 3216 458 38 4
> Actual number mix()/final():  158869 79688 19853 3303 393 33 3 2
>
> Number of buckets: 1048576
> Total number of items: 524288
> Expected number with n items: 635993 317996 79499 13249 1656 165 13
> Actual number mix():          636118 317839 79388 13435 1628 152 16
> Actual number mix()/final():  636102 317824 79527 13267 1683 161 12
>
> Number of buckets: 4194304
> Total number of items: 2097152
> Expected number with n items: 2543973 1271986 317996 52999 6624 662 55 3
> Actual number mix():          2544529 1270639 318904 53005 6516 645 61 5
> Actual number mix()/final():  2544014 1271483 318720 52849 6559 630 47 2
>
> Number of buckets: 16777216
> Total number of items: 8388608
> Expected number with n items: 10175895 5087947 1271986 211997 26499 2649 220 15
> Actual number mix():          10174997 5089736 1271355 211528 26679 2683 220 17 1
> Actual number mix()/final():  10176567 5087207 1271789 212014 26684 2703 235 16 1
>
> Number of buckets: 67108864
> Total number of items: 33554432
> Expected number with n items: 40703583 20351791 5087947 847991 105998 10599 883 63 3
> Actual number mix():          40703033 20353345 5086252 848860 105885 10536 891 59 3
> Actual number mix()/final():  40770874 20250828 5096890 865448 111835 11890 1013 76 10
>
> 2) uint32uint32 (i.e. uint64)
> Number of buckets: 65536
> Total number of items: 32768
> Expected number with n items: 39749 19874 4968 828 103 10
> Actual number mix():          39770 19863 4947 822 125 9
> Actual number mix()/final():  39754 19852 4990 837 91 11 1
>
> Number of buckets: 262144
> Total number of items: 131072
> Expected number with n items: 158998 79499 19874 3312 414 41 3
> Actual number mix():          159046 79464 19823 3334 430 43 3 1
> Actual number mix()/final():  158879 79732 19774 3301 406 47 5
>
> Number of buckets: 1048576
> Total number of items: 524288
> Expected number with n items: 635993 317996 79499 13249 1656 165 13
> Actual number mix():          636275 317580 79514 13357 1661 172 14 3
> Actual number mix()/final():  635578 318574 79504 13162 1585 159 13 1
>
> Number of buckets: 4194304
> Total number of items: 2097152
> Expected number with n items: 2543973 1271986 317996 52999 6624 662 55 3
> Actual number mix():          2543769 1272196 318067 53050 6497 672 48 4 1
> Actual number mix()/final():  2543014 1273485 317812 52703 6589 635 59 7
>
> Number of buckets: 16777216
> Total number of items: 8388608
> Expected number with n items: 10175895 5087947 1271986 211997 26499 2649 220 15
> Actual number mix():          10174587 5090467 1270909 211836 26513 2679 207 18
> Actual number mix()/final():  10175142 5089481 1270919 212555 26234 2643 224 15 3
>
> Number of buckets: 67108864
> Total number of items: 33554432
> Expected number with n items: 40703583 20351791 5087947 847991 105998 10599 883 63 3
> Actual number mix():          40706281 20347746 5088639 848133 106388 10669 946 60 2
> Actual number mix()/final():  40701438 20354677 5088549 846570 106157 10578 838 55 2
>
> 3) uint32 from 1-1648379
> Number of buckets: 65536
> Total number of items: 32768
> Expected number with n items: 39749 19874 4968 828 103 10
> Actual number mix():          39758 19844 4993 835 97 9
> Actual number mix()/final():  39842 19712 5016 849 109 7 1
>
> Number of buckets: 262144
> Total number of items: 131072
> Expected number with n items: 158998 79499 19874 3312 414 41 3
> Actual number mix():          158973 79523 19900 3283 426 38 1
> Actual number mix()/final():  159136 79293 19896 3346 421 48 3 1
>
> Number of buckets: 1048576
> Total number of items: 524288
> Expected number with n items: 635993 317996 79499 13249 1656 165 13
> Actual number mix():          636083 317917 79484 13183 1711 180 16 2
> Actual number mix()/final():  636183 317772 79438 13305 1685 173 20
>
> Number of buckets: 4194304
> Total number of items: 2097152
> Expected number with n items: 2543973 1271986 317996 52999 6624 662 55 3
> Actual number mix():          2544325 1271367 318260 52950 6660 683 54 4 1
> Actual number mix()/final():  2544022 1272059 317778 53060 6646 665 70 4
>
> Number of buckets: 16777216
> Total number of items: 8388608
> Expected number with n items: 10175895 5087947 1271986 211997 26499 2649 220 15
> Actual number mix():          10176360 5087249 1272083 212010 26655 2629 212 18
> Actual number mix()/final():  10175315 5088617 1272068 212144 26224 2589 234 23 1 1
>
> Number of buckets: 67108864
> Total number of items: 33554432
> Expected number with n items: 40703583 20351791 5087947 847991 105998 10599 883 63 3
> Actual number mix():          40704034 20350542 5088760 848309 105721 10520 894 77 7
> Actual number mix()/final():  40774785 20243786 5098862 866801 111865 11641 1019 100 5
>
> 4) cracklib
> Number of buckets: 65536
> Total number of items: 32767
> Expected number with n items: 39750 19874 4968 828 103 10
> Actual number mix():          39731 19858 5049 794 92 11 1
> Actual number mix()/final():  39785 19801 5011 823 106 9 1
>
> Number of buckets: 262144
> Total number of items: 131071
> Expected number with n items: 158998 79498 19874 3312 414 41 3
> Actual number mix():          159077 79420 19806 3380 409 49 3
> Actual number mix()/final():  159020 79475 19845 3366 383 54 1
>
> Number of buckets: 1048576
> Total number of items: 524287
> Expected number with n items: 635994 317996 79498 13249 1656 165 13
> Actual number mix():          635979 318066 79420 13281 1641 163 23 3
> Actual number mix()/final():  635694 318637 79122 13271 1686 150 13 3
>
> Number of buckets: 4194304
> Total number of items: 1648379
> Expected number with n items: 2831263 1112698 218647 28643 2814 221 14
> Actual number mix():          2830769 1113458 218633 28396 2786 250 11 1
> Actual number mix()/final():  2831304 1113004 218002 28843 2926 212 13
>
> Number of buckets: 16777216
> Total number of items: 1648379
> Expected number with n items: 15207226 1494125 73399 2403 59 1
> Actual number mix():          15206923 1494718 73129 2384 59 3
> Actual number mix()/final():  15207183 1494267 73250 2452 64
>
> Number of buckets: 67108864
> Total number of items: 1648379
> Expected number with n items: 65480564 1608383 19753 161 1.47989202515749e-08
> Actual number mix():          65480389 1608727 19593 154 1
> Actual number mix()/final():  65480455 1608609 19631 168 1
>
> 5) cracklib x 100
> Number of buckets: 65536
> Total number of items: 32767
> Expected number with n items: 39750 19874 4968 828 103 10
> Actual number mix():          39829 19742 5003 847 99 14 2
> Actual number mix()/final():  39783 19794 5023 828 97 11
>
> Number of buckets: 262144
> Total number of items: 131071
> Expected number with n items: 158998 79498 19874 3312 414 41 3
> Actual number mix():          159136 79242 20007 3289 405 62 3
> Actual number mix()/final():  158961 79546 19905 3270 408 51 3
>
> Number of buckets: 1048576
> Total number of items: 524287
> Expected number with n items: 635994 317996 79498 13249 1656 165 13
> Actual number mix():          635815 318331 79372 13218 1658 168 12 2
> Actual number mix()/final():  635986 318152 79309 13231 1687 190 21
>
> Number of buckets: 4194304
> Total number of items: 1648379
> Expected number with n items: 2831263 1112698 218647 28643 2814 221 14
> Actual number mix():          2830891 1113468 218202 28712 2804 208 18 1
> Actual number mix()/final():  2831786 1111616 219209 28661 2816 199 16 1
>
> Number of buckets: 16777216
> Total number of items: 1648379
> Expected number with n items: 15207226 1494125 73399 2403 59 1
> Actual number mix():          15207428 1493777 73492 2459 59 1
> Actual number mix()/final():  15207777 1493063 73877 2435 63 1
>
> Number of buckets: 67108864
> Total number of items: 1648379
> Expected number with n items: 65480564 1608383 19753 161 1.47989202515749e-08
> Actual number mix():          65480463 1608595 19635 170 1
> Actual number mix()/final():  65480650 1608222 19820 171 1
>
> As you can see, all of the probabilities for both the current single mix()
> function and the split mix()/final() functions match the theoretical results
> as calculated by the poisson formula. Here is a nice reference that I found
> online describing the testing procedure that I am using:
>
> http://rabbit.eng.miami.edu/class/een318/poisson.pdf
>
> I can find no situations where the current version of hash_any() performs
> statistically better than the version resulting from my split mix()/final()
> function patch from 11/2008. I would be happy to share the test harness with
> anyone who wishes to test further. Please let me know if there is anything
> I can do. I still recommend applying the split mixing function patch that
> I submitted for the 11/2008 commitfest and I think these tests should allay
> Tom's concerns about the randomness of the new hash function.
>
> Regards,
> Ken Marshall
>
Dear Hackers,

I have updated the patch posted by Jeff Davis on January 9th
to include the micro-patch above as well as updated the polymorphism
regressions tests. This applies cleanly to the latest CVS pull.

Regards,
Ken


Attachment

Re: [PATCHES] updated hash functions for postgresql v1

From
Tom Lane
Date:
Kenneth Marshall <ktm@rice.edu> writes:
> I have updated the patch posted by Jeff Davis on January 9th
> to include the micro-patch above as well as updated the polymorphism
> regressions tests. This applies cleanly to the latest CVS pull.

Applied --- thanks for being persistent about resolving the doubts on this.

One thing that apparently neither of you realized was that the
polymorphism results were varying between bigendian and littleendian
machines; I suppose you are using different hardware and that's why you
didn't agree on what the results should be.

Since we already agreed we were going to tolerate endianness dependence
in the hash functions, I fixed that by adding some ORDER BYs.
        regards, tom lane


Re: [PATCHES] updated hash functions for postgresql v1

From
Asko Oja
Date:
Did this change hashtext() visible to users? We have been using it quite widely for partitioning our databases. If so then it should be marked quite visibly in release notes as there might be others who will be hit by this.

regards
Asko

On Mon, Feb 9, 2009 at 11:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Kenneth Marshall <ktm@rice.edu> writes:
> I have updated the patch posted by Jeff Davis on January 9th
> to include the micro-patch above as well as updated the polymorphism
> regressions tests. This applies cleanly to the latest CVS pull.

Applied --- thanks for being persistent about resolving the doubts on this.

One thing that apparently neither of you realized was that the
polymorphism results were varying between bigendian and littleendian
machines; I suppose you are using different hardware and that's why you
didn't agree on what the results should be.

Since we already agreed we were going to tolerate endianness dependence
in the hash functions, I fixed that by adding some ORDER BYs.

                       regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [PATCHES] updated hash functions for postgresql v1

From
Tom Lane
Date:
Asko Oja <ascoja@gmail.com> writes:
> Did this change hashtext() visible to users? We have been using it quite
> widely for partitioning our databases. If so then it should be marked quite
> visibly in release notes as there might be others who will be hit by this.

The hash functions are undocumented, have changed in the past, and are
likely to change again in the future.  If you are using them in a way
that depends on them to give the same answers across versions, you'd
better stop.
        regards, tom lane


Re: [PATCHES] updated hash functions for postgresql v1

From
Hannu Krosing
Date:
On Wed, 2009-02-11 at 11:22 -0500, Tom Lane wrote:
> Asko Oja <ascoja@gmail.com> writes:
> > Did this change hashtext() visible to users? We have been using it quite
> > widely for partitioning our databases. If so then it should be marked quite
> > visibly in release notes as there might be others who will be hit by this.
> 
> The hash functions are undocumented, have changed in the past, and are
> likely to change again in the future.  If you are using them in a way
> that depends on them to give the same answers across versions, you'd
> better stop.

Is at least the fact that they "are undocumented, have changed in the
past, and are likely to change again in the future" documented ?

I'm sure this is something that has hit unwary users in the past and
will hit again in the future, so some words about it in the doc's would
be appropriate.

search for "hashtext" on
http://www.postgresql.org/docs/8.4/interactive/index.html returned no
results, so I guess even theyr "undocumented, will surprise you" status
is not documented.

Hashing is a quite fundamental thing in computing, so I was quite
surprised to find out it had silently changed. 

I had never checked the docs for hash functions, but I had assumed, that
internal functions are prefixed by pg_ and anything else is public, free
to use functionality.

Changing hash functions also makes in-place upgrades a lot harder, as
they can't be done incrementally anymore for tables which use hash
indexes.


>             regards, tom lane
> 
-- 
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability   Services, Consulting and Training




Re: [PATCHES] updated hash functions for postgresql v1

From
Tom Lane
Date:
Hannu Krosing <hannu@2ndQuadrant.com> writes:
> I had never checked the docs for hash functions, but I had assumed, that
> internal functions are prefixed by pg_ and anything else is public, free
> to use functionality.

Sure, it's free to use.  It's not free to assume that we promise never
to change it.

> Changing hash functions also makes in-place upgrades a lot harder, as
> they can't be done incrementally anymore for tables which use hash
> indexes.

Hash indexes are so far from being production-grade that this argument
is not significant.
        regards, tom lane


Re: [PATCHES] updated hash functions for postgresql v1

From
Kenneth Marshall
Date:
On Wed, Oct 28, 2009 at 03:31:17PM -0400, Tom Lane wrote:
> Hannu Krosing <hannu@2ndQuadrant.com> writes:
> > I had never checked the docs for hash functions, but I had assumed, that
> > internal functions are prefixed by pg_ and anything else is public, free
> > to use functionality.
> 
> Sure, it's free to use.  It's not free to assume that we promise never
> to change it.
> 
> > Changing hash functions also makes in-place upgrades a lot harder, as
> > they can't be done incrementally anymore for tables which use hash
> > indexes.
> 
> Hash indexes are so far from being production-grade that this argument
> is not significant.
> 
>             regards, tom lane

In addition that change from 8.3 -> 8.4 to store only the hash and not
the value in the index means that a reindex would be required in any event.

Cheers,
Ken


Re: [PATCHES] updated hash functions for postgresql v1

From
Tom Lane
Date:
Kenneth Marshall <ktm@rice.edu> writes:
> On Wed, Oct 28, 2009 at 03:31:17PM -0400, Tom Lane wrote:
>> Hash indexes are so far from being production-grade that this argument
>> is not significant.

> In addition that change from 8.3 -> 8.4 to store only the hash and not
> the value in the index means that a reindex would be required in any event.

Indeed, and I fully expect there will be some more on-disk format
changes required before we get to the point where hash indexes are
actually interesting for production.  If we start insisting that they
be in-place-upgradable now, we will pretty much guarantee that they
never become useful enough to justify the restriction :-(

(As examples, the hash bucket size probably needs revisiting,
and we ought to think very hard about whether we shouldn't switch
to 64-bit hash values.  And that's not even considering some of the
more advanced suggestions that have been made.)
        regards, tom lane


Re: [PATCHES] updated hash functions for postgresql v1

From
Jeff Davis
Date:
On Wed, 2009-10-28 at 21:09 +0200, Hannu Krosing wrote:
> Is at least the fact that they "are undocumented, have changed in the
> past, and are likely to change again in the future" documented ?

That's a little confusing to me: how do we document that something is
undocumented? And where do we stop?

> Hashing is a quite fundamental thing in computing, so I was quite
> surprised to find out it had silently changed. 

There are many reasons to use a hash, and we don't want people to use
these functions for the wrong purpose. I have seen people use a
performance hash for security purposes before, and I had to demonstrate
some hash collisions to show why that was a bad idea. So, if we do
provide documented functions, it should be done carefully.

Trying to develop and document a set of standardized, stable hash
functions covering a wide range of possible use cases sounds like it may
be better served by an extension.

Regards,Jeff Davis 



Re: [PATCHES] updated hash functions for postgresql v1

From
Hannu Krosing
Date:
On Wed, 2009-10-28 at 12:51 -0700, Jeff Davis wrote:
> On Wed, 2009-10-28 at 21:09 +0200, Hannu Krosing wrote:
> > Is at least the fact that they "are undocumented, have changed in the
> > past, and are likely to change again in the future" documented ?
> 
> That's a little confusing to me: how do we document that something is
> undocumented? And where do we stop?

My previous e-mail message documents that the undocumentedness is
undocumented, so no need to go any further here ;)

Though undocumented, the hash functions are easily discoverable by doing

\df *hash*

in psql

> > Hashing is a quite fundamental thing in computing, so I was quite
> > surprised to find out it had silently changed. 
> 
> There are many reasons to use a hash, and we don't want people to use
> these functions for the wrong purpose. 

I don't think that not documenting a hash function helps here at all.

> I have seen people use a
> performance hash for security purposes before, and I had to demonstrate
> some hash collisions to show why that was a bad idea. 

I've seen people use CRC32 as hash and then hit a collisions in 15 tries
with quite large keys.

> So, if we do provide documented functions, it should be done carefully.

Any user-visible behavior change should be done carefully, even if the
original behavior is not documented.

Careful upgrade of hasxxx functions would have kept the old functions,
and introduced the new ones with _v2 suffix, and then used these in
appropriate places. then kept the old ones for a few versions, with
maybe a deprecation warning and then moved them to contrib for a few
more versions.

Doing it this way could leave them "undocumented" and still not break
peoples applications in mysterious ways.

> Trying to develop and document a set of standardized, stable hash
> functions covering a wide range of possible use cases sounds like it may
> be better served by an extension.

I guess there are enough security/crypt/strong hashes in pgcrypto
package so that should not be a problem.

-- 
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability   Services, Consulting and Training




Re: [PATCHES] updated hash functions for postgresql v1

From
Hannu Krosing
Date:
On Wed, 2009-10-28 at 15:31 -0400, Tom Lane wrote:
> Hannu Krosing <hannu@2ndQuadrant.com> writes:
> > I had never checked the docs for hash functions, but I had assumed, that
> > internal functions are prefixed by pg_ and anything else is public, free
> > to use functionality.
> 
> Sure, it's free to use.  It's not free to assume that we promise never
> to change it.
>
> > Changing hash functions also makes in-place upgrades a lot harder, as
> > they can't be done incrementally anymore for tables which use hash
> > indexes.
> 
> Hash indexes are so far from being production-grade that this argument
> is not significant.

AFAIK in-place upgrade is also not quite production-grade, so this was
meant as a forward-looking note for next time the hashxxx functions will
change.

>             regards, tom lane
-- 
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability   Services, Consulting and Training




Re: [PATCHES] updated hash functions for postgresql v1

From
Peter Eisentraut
Date:
On Wed, 2009-10-28 at 12:51 -0700, Jeff Davis wrote:
> Trying to develop and document a set of standardized, stable hash
> functions covering a wide range of possible use cases sounds like it may
> be better served by an extension.

I suspect that some of the participants in this thread have PL/Proxy in
mind.  PL/Proxy should probably ship its own set of hash functions.

If we ever get built-in partitioning by hash (see thread nearby and most
previous ones like it), we should also think about providing a hash
function that doesn't change output over versions and is independent of
hash index implementation concerns.



Re: [PATCHES] updated hash functions for postgresql v1

From
Hannu Krosing
Date:
On Thu, 2009-10-29 at 09:47 +0200, Peter Eisentraut wrote:
> On Wed, 2009-10-28 at 12:51 -0700, Jeff Davis wrote:
> > Trying to develop and document a set of standardized, stable hash
> > functions covering a wide range of possible use cases sounds like it may
> > be better served by an extension.
> 
> I suspect that some of the participants in this thread have PL/Proxy in
> mind.  

Yes, I think that pl/proxy is the prime example of such use.

> PL/Proxy should probably ship its own set of hash functions.

Or maybe we could just extract the hashes form some version of stock
postgresql (say 8.3) and then make those available in contrib under the
name "stable_hashes" ?

> If we ever get built-in partitioning by hash (see thread nearby and most
> previous ones like it), we should also think about providing a hash
> function that doesn't change output over versions and is independent of
> hash index implementation concerns.

And maybe even documented ;)

-- 
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability   Services, Consulting and Training




Re: [PATCHES] updated hash functions for postgresql v1

From
Tom Lane
Date:
Hannu Krosing <hannu@2ndQuadrant.com> writes:
> Or maybe we could just extract the hashes form some version of stock
> postgresql (say 8.3) and then make those available in contrib under the
> name "stable_hashes" ?

A better name would be "wishful_thinking" ... contrib does not have
control over some of the main risk factors, such as changes to the
internal representation of datatypes.
        regards, tom lane


Re: [PATCHES] updated hash functions for postgresql v1

From
Peter Eisentraut
Date:
On Thu, 2009-10-29 at 09:32 -0400, Tom Lane wrote:
> Hannu Krosing <hannu@2ndQuadrant.com> writes:
> > Or maybe we could just extract the hashes form some version of stock
> > postgresql (say 8.3) and then make those available in contrib under the
> > name "stable_hashes" ?
> 
> A better name would be "wishful_thinking" ... contrib does not have
> control over some of the main risk factors, such as changes to the
> internal representation of datatypes.

Good point.  But the internal layout of data types popular for use as
hash key changes rarely.  The hash function change, however, is a
showstopper for everyone concerned.



Re: [PATCHES] updated hash functions for postgresql v1

From
Hannu Krosing
Date:
On Thu, 2009-10-29 at 09:32 -0400, Tom Lane wrote:
> Hannu Krosing <hannu@2ndQuadrant.com> writes:
> > Or maybe we could just extract the hashes form some version of stock
> > postgresql (say 8.3) and then make those available in contrib under the
> > name "stable_hashes" ?
> 
> A better name would be "wishful_thinking" ... contrib does not have
> control over some of the main risk factors, such as changes to the
> internal representation of datatypes.

My understanding was, that contrib is actually the only possibility to
have something maintained in sync with core postgreSQL.

>             regards, tom lane
-- 
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability   Services, Consulting and Training