Re: Hash index build performance tweak from sorting - Mailing list pgsql-hackers

From: Simon Riggs
Subject: Re: Hash index build performance tweak from sorting
Msg-id: CANbhV-FmHF+rky-N4Lr+rfA4547KVrv6MjT-0jfZbyKmBHCXUw@mail.gmail.com
In response to: Re: Hash index build performance tweak from sorting (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On Thu, 28 Jul 2022 at 19:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Simon Riggs <simon.riggs@enterprisedb.com> writes:
> > Thanks for the nudge. New version attached.
>
> I also see a speed improvement from this, so pushed (after minor comment
> editing).

Thanks

> I notice though that if I feed it random data,
>
> ---
> DROP TABLE IF EXISTS hash_speed;
> CREATE unlogged TABLE hash_speed (x integer);
> INSERT INTO hash_speed SELECT random()*10000000 FROM
> generate_series(1,10000000) x;
> vacuum hash_speed;
> \timing on
> CREATE INDEX ON hash_speed USING hash (x);
> ---
>
> then the speed improvement is only about 5% not the 7.5% I see
> with your original test case.  I don't have an explanation
> for that, do you?

No, sorry, I don't. It could be an effect of the data distribution or of the physical layout.

> Also, it seems like we've left some money on the table by not
> exploiting downstream the knowledge that this sorting happened.
> During an index build, it's no longer necessary for
> _hash_pgaddtup to do _hash_binsearch, and therefore also not
> _hash_get_indextuple_hashkey: we could just always append the new
> tuple at the end.  Perhaps checking it against the last existing
> tuple is worth the trouble as a bug guard, but for sure we don't
> need the log2(N) comparisons that _hash_binsearch will do.

Hmm, I had that in an earlier version of the patch; I'm not sure why it
dropped out, since I wrote it last year. But then, I have lots of future
WIP patches in the area of hash indexes. The idea is roughly like the
sketch below.
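An untested sketch, not the committed code (sorted_build is a
hypothetical flag meaning tuples arrive in hash-key order; this uses
the usual access/hash.h and storage/bufpage.h APIs):

/*
 * Sketch: pick the offset for an incoming tuple on a bucket page.
 * During a sorted build we can skip the binary search and append at
 * the end; comparing against the current last tuple is a cheap bug
 * guard.
 */
static OffsetNumber
_hash_choose_offset(Page page, IndexTuple itup, bool sorted_build)
{
    uint32       hashkey = _hash_get_indextuple_hashkey(itup);
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

    if (sorted_build)
    {
        if (maxoff >= FirstOffsetNumber)
        {
            IndexTuple  lasttup = (IndexTuple)
                PageGetItem(page, PageGetItemId(page, maxoff));

            /* bug guard: input must really be sorted by hash key */
            Assert(hashkey >= _hash_get_indextuple_hashkey(lasttup));
        }
        return OffsetNumberNext(maxoff);    /* append at the end */
    }

    /* normal path: log2(N) comparisons to find first hash >= target */
    return _hash_binsearch(page, hashkey);
}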

> Another point that I noticed is that it's not really desirable to
> use the same _hash_binsearch logic for insertions and searches.
> _hash_binsearch finds the first entry with hash >= target, which
> is necessary for searches, but for insertions we'd really rather
> find the first entry with hash > target.  As things stand, to
> the extent that there are duplicate hash values we are still
> performing unnecessary data motion within PageAddItem.

That thought is new to me; I will investigate. If I understand
correctly, the insert path wants the equivalent of an upper-bound
search, sketched below.
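A minimal, untested sketch of such a variant, mirroring
_hash_binsearch but returning the first entry with hash > target, so
that PageAddItem never has to shift a run of equal-hash entries:

static OffsetNumber
_hash_binsearch_insertpos(Page page, uint32 hash_value)
{
    /* loop invariant: lower <= insertion point <= upper */
    OffsetNumber lower = FirstOffsetNumber;
    OffsetNumber upper = PageGetMaxOffsetNumber(page) + 1;

    while (upper > lower)
    {
        OffsetNumber off = (upper + lower) / 2;
        IndexTuple   itup = (IndexTuple)
            PageGetItem(page, PageGetItemId(page, off));

        /* "<=" here, where the search version uses "<" */
        if (_hash_get_indextuple_hashkey(itup) <= hash_value)
            lower = off + 1;
        else
            upper = off;
    }

    return lower;       /* first offset with hash > hash_value */
}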

> I've not looked into how messy these things would be to implement,
> nor whether we get any noticeable speed gain thereby.  But since
> you've proven that cutting the PageAddItem data motion cost
> yields visible savings, these things might be visible too.

It's a clear follow-on thought, so I will pursue it. Thanks for the nudge.

> At this point the cfbot will start to bleat that the patch of
> record doesn't apply, so I'm going to mark the CF entry committed.
> If anyone wants to produce a follow-on patch, please make a
> new entry.

Will do. Thanks.

--
Simon Riggs                http://www.EnterpriseDB.com/


