Re: Hash index build performance tweak from sorting - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Hash index build performance tweak from sorting
Date
Msg-id CANbhV-Hv-R6RTiQ9rrKzdezMMAx3=E+=2_pCnMXTRw99dfveXQ@mail.gmail.com
Whole thread Raw
In response to Re: Hash index build performance tweak from sorting  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Hash index build performance tweak from sorting
List pgsql-hackers
On Sat, 30 Apr 2022 at 12:12, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Apr 19, 2022 at 3:05 AM Simon Riggs
> <simon.riggs@enterprisedb.com> wrote:
> >
> > Hash index pages are stored in sorted order, but we don't prepare the
> > data correctly.
> >
> > We sort the data as the first step of a hash index build, but we
> > forget to sort the data by hash as well as by hash bucket.
> >
>
> I was looking into the nearby comments (Fetch hash keys and mask off
> bits we don't want to sort by.) and it sounds like we purposefully
> don't want to sort by the hash key. I see that this comment was
> originally introduced in the below commit:
>
> commit 4adc2f72a4ccd6e55e594aca837f09130a6af62b
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   Mon Sep 15 18:43:41 2008 +0000
>
>     Change hash indexes to store only the hash code rather than the
> whole indexed
>     value.
>
> But even before that, we seem to mask off the bits before comparison.
> Is it that we are doing so because we want to keep the order of hash
> keys in a particular bucket so such masking was required?

We need to sort by both hash bucket and hash value.

Hash bucket id so we can identify the correct hash bucket to insert into.

But then on each bucket/overflow page we store it sorted by hash value
to make lookup faster, so inserts go faster if they are also sorted.

The pages are identical with/without this patch, its just the
difference between quicksort and insertion sort.

> Few comments on the patch:
> 1. I think it is better to use DatumGetUInt32 to fetch the hash key as
> the nearby code is using.
> 2. You may want to change the below comment in HSpool
> /*
> * We sort the hash keys based on the buckets they belong to. Below masks
> * are used in _hash_hashkey2bucket to determine the bucket of given hash
> * key.
> */

Many thanks, will do.

> >
> > Aside:

...

> It is not clear to me what exactly you want to do here but maybe it is
> a separate topic and we should discuss this separately.

Agreed, will open another thread.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



pgsql-hackers by date:

Previous
From: Jacob Champion
Date:
Subject: Re: testclient.exe installed under MSVC
Next
From: Alvaro Herrera
Date:
Subject: Re: bogus: logical replication rows/cols combinations