Re: Next Steps with Hash Indexes - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Next Steps with Hash Indexes
Date
Msg-id CAA4eK1+xft7++BoyiFVbT70pjzYxPvvZfKdk+mfej6tV9sRR-Q@mail.gmail.com
Whole thread Raw
In response to Re: Next Steps with Hash Indexes  (Dilip Kumar <dilipbalaut@gmail.com>)
List pgsql-hackers
On Thu, Aug 12, 2021 at 9:09 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Aug 11, 2021 at 8:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > As far as the specific point at hand is concerned, I think storing
> > a hash value per index column, while using only the first column's
> > hash for bucket selection, is what to do for multicol indexes.
> > We still couldn't set amoptionalkey=true for hash indexes, because
> > without a hash for the first column we don't know which bucket to
> > look in.  But storing hashes for the additional columns would allow
> > us to check additional conditions in the index, and usually save
> > trips to the heap on queries that provide additional column
> > conditions.

Yeah, this sounds reasonable but I think the alternative proposal by
Dilip (see below) and me [1] also has merits.

> >  You could also imagine sorting the contents of a bucket
> > on all the hashes, which would ease uniqueness checks.
>

Yeah, we can do that but the current design also seems to have merits
for uniqueness checks. For sorting all the hashes in the bucket, we
need to read all the overflow pages and then do sort, which could lead
to additional I/O in some cases. The other possibility is to keep all
the bucket pages sorted during insertion but that would also require
additional I/O. OTOH, in the current design, if the value is not found
in the current bucket page (which has hash values in sorted order),
only then we move to the next page.

> Earlier, I was thinking that we have two hashes, one for the first key
> column that is for identifying the bucket, and one for the remaining
> key columns which will further help with heap lookup and ordering for
> uniqueness checking.
>

I have also mentioned an almost similar idea yesterday [1]. If we go
with a specification similar to MySQL and SQLServer then probably it
would be better than storing the hashes for all the columns.


  But yeah if we have a hash value for each column
> then it will make it really flexible.
>
> I was also looking into other databases that how they support hash
> indexes, then I see at least in MySQL[1] the multiple column index has
> a limitation that you have to give all key columns in search for
> selecting the index scan.

I see that SQLServer also has the same specification for multi-column
hash index [2]. See the "Multi-column index" section. So it might not
be a bad idea to have a similar specification.

[1] -
https://www.postgresql.org/message-id/CAA4eK1JD1%3DnPDi0kDPGLC%2BJDGEYP8DgTanobvgve%2B%2BKniQ68TA%40mail.gmail.com
[2] -
https://docs.microsoft.com/en-us/sql/relational-databases/in-memory-oltp/indexes-for-memory-optimized-tables?view=sql-server-ver15

-- 
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Dilip Kumar
Date:
Subject: Re: Next Steps with Hash Indexes
Next
From: Greg Nancarrow
Date:
Subject: Re: Small documentation improvement for ALTER SUBSCRIPTION