Re: Next Steps with Hash Indexes - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Next Steps with Hash Indexes
Date
Msg-id CAA4eK1KM+vFv9PBAkE9QKgMsuHKY-kzRdaaodSN3suDo8kYCYg@mail.gmail.com
Whole thread Raw
In response to Re: Next Steps with Hash Indexes  (Sadhuprasad Patro <b.sadhu@gmail.com>)
Responses Re: Next Steps with Hash Indexes
List pgsql-hackers
On Fri, Aug 27, 2021 at 4:27 PM Sadhuprasad Patro <b.sadhu@gmail.com> wrote:
>
> IMHO, as discussed above, since other databases also have the
> limitation that if you create a multi-column hash index then the hash
> index can not be used until all the key columns are used in the search
> condition. So my point is that users might be using the hash index
> with this limitation and their use case might be that they want to
> gain the best performance when they use this particular case and they
> might not be looking for much flexibility like we provide in BTREE.
>
> For reference:
> https://dev.mysql.com/doc/refman/8.0/en/index-btree-hash.html
>
https://docs.microsoft.com/en-us/sql/relational-databases/in-memory-oltp/indexes-for-memory-optimized-tables?view=sql-server-ver15
>
> We already know that performance will be better with a single hash for
> multiple columns, but still I just wanted to check the performance
> difference in PG. This might help us to decide the approach we need to
> go for. With a quick POC of both the ideas, I have observed there is a
> major performance advantage with single combined hash for multi-key
> columns.
>
> Performance Test Details: (Used PGBENCH Tool)
>             Initialize cmd: “./pgbench -i -s 100 -d postgres"
>
> postgres=# \d+ pgbench_accounts
>
>                                          Table "public.pgbench_accounts"
>
>   Column  |     Type      | Collation | Nullable | Default | Storage
> | Compression | Stats target | Description
>
> ----------+---------------+-----------+----------+---------+----------+-------------+--------------+-------------
>
>  aid      | integer       |           | not null |         | plain
> |             |              |
>  bid      | integer       |           |          |         | plain
> |             |              |
>  abalance | integer       |           |          |         | plain
> |             |              |
>  filler   | character(84) |           |          |         | extended
> |             |              |
>
> Indexes:
>     "pgbench_accounts_idx" hash (aid, bid)
> Access method: heap
> Options: fillfactor=100
>
> Test Command: “./pgbench -j 1 postgres -C -M prepared -S -T 300”
>
> Performance Test Results:
> Idea-1: Single Hash value for multiple key columns
>                  TPS = ~292
>
> Idea-2: Separate Hash values for each key column. But use only the
> first one to search the bucket. Other hash values are used as payload
> to get to the matching tuple before going to the heap.
>                  TPS = ~212
>
> Note: Here we got near to 25% better performance in a single combine
> hash approach with only TWO key columns.
>

That's a significant difference. Have you checked via perf or some
other way what causes this difference? I have seen that sometimes
single client performance with pgbench is not stable, so can you
please once check with 4 clients or so and possibly with a larger
dataset as well.

One more thing to consider is that it seems that the planner requires
a condition for the first column of an index before considering an
indexscan plan. See Tom's email [1] in this regard. I think it would
be better to see what kind of work is involved there if you want to
explore a single hash value for all columns idea.

[1] - https://www.postgresql.org/message-id/29263.1506483172%40sss.pgh.pa.us

--
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Gilles Darold
Date:
Subject: Re: Schema variables - new implementation for Postgres 15
Next
From: Zhihong Yu
Date:
Subject: Re: Gather performance analysis