Re: Hash index todo list item - Mailing list pgsql-hackers

From Shreya Bhargava
Subject Re: Hash index todo list item
Date
Msg-id 148682.2744.qm@web53408.mail.re2.yahoo.com
Whole thread Raw
In response to Re: Hash index todo list item  (Gregory Stark <stark@enterprisedb.com>)
Responses Re: Hash index todo list item  ("Gokulakannan Somasundaram" <gokul007@gmail.com>)
Re: Hash index todo list item  (Heikki Linnakangas <heikki@enterprisedb.com>)
Re: Hash index todo list item  (Jens-Wolfhard Schicke <drahflow@gmx.de>)
Re: Hash index todo list item  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
"Note that the bottom line for the problems with hash indexes is that the<br />current implementation doesn't offer any
advantagesover btree indexes. Hash<br />indexes need to be not only as good of a choice as btree indexes but<br
/>significantlybetter a significantly better choice at least for some specific<br />circumstances."<br /><br
/><pre><tt><tt>Weperformed some probe tests on our patch on <br />hash index and the original btree index to compare
the<br />performance between the two. We used a 80 million tuple table.<br />The table contained single integer
attributeand the data<br />range was 0 - 100000000. (generated by a random generator).<br />We did the following:<br
/><br/>1. Populate the table with 80 million tuples.<br />2. Create HASH index on the table.<br />3. clear both linux
cache& psql buffers.<br />   (exiting psql and restarting it cleared the psql buffers;<br />    to clear linux
cache,we used drop_cache command)<br />4. start psql<br />5. </tt></tt><tt><tt>select on aninteger in the range of
valuesin the table</tt></tt><tt><tt>.<br />   (all test numbers were big ones, like 98934599)<br />6. record the
time.<br/>7. exit psql.<br />8. drop caches.(as described above)<br />9. repeat 4-8 for different numbers.<br />10.
DropHash index.<br />11. Create Btree index and repeat 3-9.<br /> <br />The results are as shown below. All trials are
inms. <br />Count is the number of such tuples in the table.<br /> <br /> HASH INDEX:<br /> Number       Count   Trial1
    Trial2      Trial3<br /> 21367898      2         152.0        129.5      131.1<br /> 95678699      2         140.6
     145.6      137.5<br /> 99899799      2         148.0        147.4      152.6<br /> 97677899      2         142.0
    153.5      112.0<br /> 94311123      2         137.6        146.3      141.3<br /> 67544455      2         141.6
   144.6      152.7<br /> 88877677      2         122.1        123.1      130.8<br /> 88788988      2         150.2
  150.0      171.7<br /> 44333344      1        119.9        116.3      117.6<br /> 34267878      1          97.5
 99.9      114.8<br /> 55489781      2         115.7        117.2      145.3 <br /> 97676767      1         169.5
168.5      181.7 <br /> 99767564      1         142.7        133.6      132.7<br /> 21367989      4         198.0
193.2      186.7<br /> <br />BTREE INDEX<br /> <br /> Number      Trial1    Trial2      Trial3<br /> 21367898   145.5
 145.6       150.6<br /> 95678699   177.5     170.0       176.4<br /> 99899799   175.4     181.2       185.4<br />
97677899  136.9     149.0       130.8<br /> 94311123   179.0     175.3       166.3<br /> 67544455   161.7     162.2
 170.4<br /> 88877677   138.7     135.2       149.9<br /> 88788988   165.7     179.3       186.3<br /> 44333344   166.0
   172.8       184.3<br /> 34267878   159.1     168.8       147.8<br /> 55489781   183.2     195.4       185.1<br />
97676767  147.2     143.6       132.6<br /> 99767564   167.8     178.3       162.1<br /> 21367989   232.3     238.1
216.9</tt></tt></pre>From the results obtained, the average of all the hash probes is 141.8ms, the average for btree is
168.5,a difference of about 27.The standard deviations are about 23, so this is a statistically significant
difference.<br/><pre><tt><tt>Our prediction that the hash index would take on the average one<br />probe for about 10ms
andthe btree would take three probes for about 30 ms<br />or a difference of about 20ms was pretty well shown by the
differencewe<br />got of about 27. Hope these data points will help with some questions<br />about the performance
differencesbetween Hash and Btree index.<br /><br />Regards,<br />Shreya</tt></tt></pre><br /><br /><span
style="font-family:monospace;"></span><tt><tt><br /><br /></tt></tt><b><i>Gregory Stark
<stark@enterprisedb.com></i></b>wrote:<blockquote class="replbq" style="border-left: 2px solid rgb(16, 16, 255);
margin-left:5px; padding-left: 5px;"> "Kenneth Marshall"  writes:<br /><br />> On Thu, Sep 20, 2007 at 05:12:45PM
-0700,Tom Raney wrote:<br />><br />>> Using our implementation, build times and index sizes are<br />>>
comparablewith btree index build times and index sizes.<br />>...<br />> That is super! (and timely)<br /><br
/>Itis pretty super. I have a few comments to raise but don't take it to be too<br />negative, it sounds like this is a
bigstep forward towards making hash<br />indexes valuable.<br /><br />Firstly in the graphs it seems the index size
graphhas an exponential x-axis<br />but a linear y-axis. This makes it hard to see what I would expect to be<br
/>prettymuch linear growth. The curves look exponential which would mean linear<br />growth but of course it's hard to
tell.<br/><br />Also, the growth in the time chart looks pretty much linear. That seems weird<br />since I would expect
therewould be a visible extra component since sort times<br />are n-log(n). Perhaps you need to test still larger
tablesto see that though.<br /><br />In any case it's clear from the results you have there that the change is a<br
/>positiveone and fixes a fundamental problem with the hash index build code.<br /><br />Something else you should
perhapstest is indexing something which is<br />substantially larger than hash function output. A hash function is
goingto<br />make the most sense when indexing something like strings for which you want to<br />avoid the long
comparisoncosts. Especially consider testing this on a UTF8<br />locale with expensive comparisons (like a CJK locale
forexample).<br /><br />Note that the bottom line for the problems with hash indexes is that the<br />current
implementationdoesn't offer any advantages over btree indexes. Hash<br />indexes need to be not only as good of a
choiceas btree indexes but<br />significantly better a significantly better choice at least for some specific<br
/>circumstances.<br/><br />Also, if you're going to submit a patch you should check out a copy of the CVS<br />HEAD and
workfrom that. I don't think there are huge differences in the area<br />of hash indexes though. But in most other
areasyou would be spending quite a<br />bit of time dealing details which have changed since.<br /><br />Finally note
thatwe're in the final throes of the 8.3 feature freeze.<br />Normally any patch submitted now would be held until 8.3
isreleased and<br />development on 8.4 is started. I could imagine an exception being made for<br />hash indexes since
they'reso moribund currently but probably not. The flip<br />side of that argument is that there's not much point in
makingan exception<br />for something which will only be really useful once further work is done in<br />the same
area.<br/><br />-- <br /> Gregory Stark<br /> EnterpriseDB http://www.enterprisedb.com<br /><br
/>---------------------------(endof broadcast)---------------------------<br />TIP 2: Don't 'kill -9' the postmaster<br
/></blockquote><br/><p> __________________________________________________<br />Do You Yahoo!?<br />Tired of spam?
Yahoo!Mail has the best spam protection around <br />http://mail.yahoo.com  

pgsql-hackers by date:

Previous
From: Jeremy Drake
Date:
Subject: Re: Open items for 8.3
Next
From: Gregory Stark
Date:
Subject: Re: Visibility map thoughts