Re: Hash index build performance tweak from sorting - Mailing list pgsql-hackers
| From | Simon Riggs |
|---|---|
| Subject | Re: Hash index build performance tweak from sorting |
| Date | |
| Msg-id | CANbhV-FmHF+rky-N4Lr+rfA4547KVrv6MjT-0jfZbyKmBHCXUw@mail.gmail.com |
| In response to | Re: Hash index build performance tweak from sorting (Tom Lane <tgl@sss.pgh.pa.us>) |
| Responses | Re: Hash index build performance tweak from sorting |
| List | pgsql-hackers |
On Thu, 28 Jul 2022 at 19:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Simon Riggs <simon.riggs@enterprisedb.com> writes:
> > Thanks for the nudge. New version attached.
>
> I also see a speed improvement from this, so pushed (after minor comment
> editing).

Thanks

> I notice though that if I feed it random data,
>
> ---
> DROP TABLE IF EXISTS hash_speed;
> CREATE unlogged TABLE hash_speed (x integer);
> INSERT INTO hash_speed SELECT random()*10000000 FROM
> generate_series(1,10000000) x;
> vacuum hash_speed;
> \timing on
> CREATE INDEX ON hash_speed USING hash (x);
> ---
>
> then the speed improvement is only about 5% not the 7.5% I see
> with your original test case. I don't have an explanation
> for that, do you?

No, sorry. It could be a data-based effect or a physical effect.

> Also, it seems like we've left some money on the table by not
> exploiting downstream the knowledge that this sorting happened.
> During an index build, it's no longer necessary for
> _hash_pgaddtup to do _hash_binsearch, and therefore also not
> _hash_get_indextuple_hashkey: we could just always append the new
> tuple at the end. Perhaps checking it against the last existing
> tuple is worth the trouble as a bug guard, but for sure we don't
> need the log2(N) comparisons that _hash_binsearch will do.

Hmm, I had that in an earlier version of the patch; I'm not sure why it
dropped out, since I wrote it last year, but then I've got lots of future
WIP patches in the area of hash indexes.

> Another point that I noticed is that it's not really desirable to
> use the same _hash_binsearch logic for insertions and searches.
> _hash_binsearch finds the first entry with hash >= target, which
> is necessary for searches, but for insertions we'd really rather
> find the first entry with hash > target. As things stand, to
> the extent that there are duplicate hash values we are still
> performing unnecessary data motion within PageAddItem.

That thought is new to me, and I will investigate.
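To make the search-vs-insertion distinction concrete, here is a minimal standalone sketch (hypothetical illustration, not the actual PostgreSQL code) of the two binary-search variants: a "lower bound" that finds the first slot with key >= target, which is what a search like _hash_binsearch needs, and an "upper bound" that finds the first slot with key > target. Inserting at the upper-bound position places a new tuple after the existing run of equal hash keys, so none of those duplicates has to be shifted within the page:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First index i such that keys[i] >= target (search semantics). */
static size_t
lower_bound(const uint32_t *keys, size_t n, uint32_t target)
{
    size_t lo = 0, hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (keys[mid] < target)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* First index i such that keys[i] > target (preferred insertion point:
 * lands after any run of duplicate keys, so nothing gets shifted). */
static size_t
upper_bound(const uint32_t *keys, size_t n, uint32_t target)
{
    size_t lo = 0, hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (keys[mid] <= target)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```

With keys {10, 20, 20, 20, 30} and target 20, lower_bound returns 1 while upper_bound returns 4: inserting at slot 1 would shift four entries, inserting at slot 4 shifts only one.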
> I've not looked into how messy these things would be to implement,
> nor whether we get any noticeable speed gain thereby. But since
> you've proven that cutting the PageAddItem data motion cost
> yields visible savings, these things might be visible too.

It's a clear follow-on thought, so I will pursue it. Thanks for the nudge.

> At this point the cfbot will start to bleat that the patch of
> record doesn't apply, so I'm going to mark the CF entry committed.
> If anyone wants to produce a follow-on patch, please make a
> new entry.

Will do. Thanks.

--
Simon Riggs
http://www.EnterpriseDB.com/
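The append-at-end idea from earlier in the thread can also be sketched in isolation (again a hypothetical toy, not the real _hash_pgaddtup): when the build feeds tuples in hash order, each new key is >= the last key already on the page, so instead of a log2(N) binary search the page can append in O(1), keeping one comparison against the last entry as the bug guard Tom suggests:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_CAPACITY 64        /* toy fixed-size page */

typedef struct SortedPage
{
    uint32_t keys[PAGE_CAPACITY];
    size_t   nkeys;
} SortedPage;

/* Append 'key' at the end of the page.  Returns 0 on success, -1 if the
 * page is full (caller would move to an overflow page) or if the key
 * would break the sorted invariant (the bug-guard check: during a sorted
 * build this should never fire). */
static int
page_append_sorted(SortedPage *page, uint32_t key)
{
    if (page->nkeys >= PAGE_CAPACITY)
        return -1;              /* page full */
    if (page->nkeys > 0 && key < page->keys[page->nkeys - 1])
        return -1;              /* ordering violated */
    page->keys[page->nkeys++] = key;    /* O(1), no data motion */
    return 0;
}
```

The single comparison against keys[nkeys - 1] replaces both the binary search and the per-probe key extraction that _hash_get_indextuple_hashkey would otherwise do.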