Re: Next Steps with Hash Indexes - Mailing list pgsql-hackers

From Sadhuprasad Patro
Subject Re: Next Steps with Hash Indexes
Date
Msg-id CAFF0-CHqhQKv7F_Yo8QwyKkbCjCxWyJusRN+jvb_sC3ompwfEQ@mail.gmail.com
Whole thread Raw
In response to Re: Next Steps with Hash Indexes  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: Next Steps with Hash Indexes
List pgsql-hackers
On Fri, Aug 13, 2021 at 11:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Aug 13, 2021 at 9:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Aug 12, 2021 at 8:30 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > On Thu, Aug 12, 2021 at 12:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > The design of the patch has changed since the initial proposal. It
> > > > tries to perform unique inserts by holding a write lock on the bucket
> > > > page to avoid duplicate inserts.
> > >
> > > Do you mean that you're holding a buffer lwlock while you search the
> > > whole bucket? If so, that's surely not OK.
> > >
> >
> > I think here you are worried that after holding lwlock we might
> > perform reads of overflow pages which is not a good idea. I think
> > there are fewer chances of having overflow pages for unique indexes so
> > we don't expect such cases in common
>
> I think if we identify the bucket based on the hash value of all the
> columns then there should be a fewer overflow bucket, but IIUC, in
> this patch bucket, is identified based on the hash value of the first
> column only so there could be a lot of duplicates on the first column.


IMHO, as discussed above, since other databases also have the
limitation that if you create a multi-column hash index then the hash
index can not be used until all the key columns are used in the search
condition. So my point is that users might be using the hash index
with this limitation and their use case might be that they want to
gain the best performance when they use this particular case and they
might not be looking for much flexibility like we provide in BTREE.

For reference:
https://dev.mysql.com/doc/refman/8.0/en/index-btree-hash.html

https://docs.microsoft.com/en-us/sql/relational-databases/in-memory-oltp/indexes-for-memory-optimized-tables?view=sql-server-ver15

We already know that performance will be better with a single hash for
multiple columns, but still I just wanted to check the performance
difference in PG. This might help us to decide the approach we need to
go for. With a quick POC of both the ideas, I have observed there is a
major performance advantage with single combined hash for multi-key
columns.

Performance Test Details: (Used PGBENCH Tool)
            Initialize cmd: “./pgbench -i -s 100 -d postgres"

postgres=# \d+ pgbench_accounts

                                         Table "public.pgbench_accounts"

  Column  |     Type      | Collation | Nullable | Default | Storage
| Compression | Stats target | Description

----------+---------------+-----------+----------+---------+----------+-------------+--------------+-------------

 aid      | integer       |           | not null |         | plain
|             |              |
 bid      | integer       |           |          |         | plain
|             |              |
 abalance | integer       |           |          |         | plain
|             |              |
 filler   | character(84) |           |          |         | extended
|             |              |

Indexes:
    "pgbench_accounts_idx" hash (aid, bid)
Access method: heap
Options: fillfactor=100

Test Command: “./pgbench -j 1 postgres -C -M prepared -S -T 300”

Performance Test Results:
Idea-1: Single Hash value for multiple key columns
                 TPS = ~292

Idea-2: Separate Hash values for each key column. But use only the
first one to search the bucket. Other hash values are used as payload
to get to the matching tuple before going to the heap.
                 TPS = ~212

Note: Here we got near to 25% better performance in a single combine
hash approach with only TWO key columns. If we go for separate Hash
values for all key columns mentioned then there will be a performance
dip and storage also will be relatively higher when we have more key
columns.

I have just done separate POC patches to get the performance results
as mentioned above, there are many other scenarios like split case, to
be taken care further.
Attaching the POC patches here just for reference…


Thanks & Regards
SadhuPrasad
EnterpriseDB: http://www.enterprisedb.com

Attachment

pgsql-hackers by date:

Previous
From: Dagfinn Ilmari Mannsåker
Date:
Subject: Re: [PATCH] Tab completion for ALTER TABLE … ADD …
Next
From: Masahiko Sawada
Date:
Subject: Re: Skipping logical replication transactions on subscriber side