On Fri, Jan 09, 2009 at 12:04:15PM -0800, Jeff Davis wrote:
> On Mon, 2008-12-22 at 13:47 -0600, Kenneth Marshall wrote:
> > Dear PostgreSQL developers,
> >
> > I am re-sending this to keep this last change to the
> > internal hash function on the radar.
> >
>
> Hi Ken,
>
> A few comments:
>
> 1. New patch with very minor changes attached.
>
> 2. I reverted the change you made to indices.sgml. We still don't use
> WAL for hash indexes, and in my opinion we should continue to discourage
> their use until we do use WAL. We can add back in the comment that hash
> indexes are suitable for large keys if we have some results to show
> that.
>
> 3. There was a regression test failure in union.sql because the ordering
> of the results was different. I updated the regression test.
>
> 4. Hash functions affect a lot more than hash indexes, so I ran through
> a variety of tests that use a HashAggregate plan. Test setup and results
> are attached. These results show no difference between the old and the
> new code (about 0.1% better).
>
> 5. The hash index build time shows some improvement. The new code won in
> every instance in which a there were a lot of duplicates in the table
> (100 distinct values, 50K of each) by around 5%.
>
> The new code appeared to be the same or slightly worse in the case of
> hash index builds with few duplicates (1000000 distinct values, 5 of
> each). The difference was about 1% worse, which is probably just noise.
>
> Note: I'm no expert on hash functions. Take all of my tests with a grain
> of salt.
>
> I would feel a little better if I saw at least one test that showed
> better performance of the new code on a reasonable-looking distribution
> of data. The hash index build that you showed only took a second or two
> -- it would be nice to see a test that lasted at least a minute.
>
> Regards,
> Jeff Davis
>
>
Jeff,
Thanks for the review. I would not really expect any differences in hash
index build times other than normal noise variances. The most definitive
benchmark that I have seen was done with my original patch submission
in March posted by Luke of Greenplum:
"We just applied this and saw a 5 percent speedup on a hash aggregation query with four columns in a 'group by'
clauserun against a single TPC-H table."
I wonder if they would be willing to re-run their test? Thanks again.
Regards,
Ken