Thread: Next Steps with Hash Indexes

Next Steps with Hash Indexes

From
Simon Riggs
Date:
Hi,

I've been investigating hash indexes and have what I think is a clear
picture in my head, so time for discussion.

It would be very desirable to allow Hash Indexes to become Primary Key
Indexes, which requires both
  amroutine->amcanunique = true;
  amroutine->amcanmulticol = true;
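For context, both flags live in hashhandler() in
src/backend/access/hash/hash.c and are currently false. A trimmed sketch of
the relevant assignments (everything else in the function unchanged and
elided here):

Datum
hashhandler(PG_FUNCTION_ARGS)
{
    IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);

    amroutine->amstrategies = HTMaxStrategyNumber;
    /* ... other fields exactly as today ... */
    amroutine->amcanunique = true;      /* proposed; currently false */
    amroutine->amcanmulticol = true;    /* proposed; currently false */

    PG_RETURN_POINTER(amroutine);
}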

Every other hash index TODO seems like performance tuning, so can wait
awhile, even if it is tempting to do that first.

1. Multi-column Indexes
seems to have floundered because of the thread "Multicolumn Hash Indexes",
https://www.postgresql.org/message-id/29263.1506483172%40sss.pgh.pa.us,
but those issues don't apply in the most common cases, so they seem like
acceptable restrictions, especially since some of them already apply to
btrees etc.
(Also note that hash indexes already assume strict hash operators, so
that is not an issue.)
For me, this separates into two sub-concerns:
1.1 Allow multiple columns to be defined for hash indexes
Enabling this is a simple one-line patch,
     amroutine->amcanmulticol = true;
which works just fine on current HEAD without further changes (manual
testing, as yet).
If we do this first, then any work on uniqueness checking can take
into account multiple columns.

1.2 Combining multi-column hashes into one hash value
Trivially, this is already how it works, in the sense that we just use
the first column, however many columns there are in the index! Doing
more is an already solved problem in Postgres,
[TupleHashTableHash_internal() in src/backend/executor/execGrouping.c]
as pointed out here: "Combining hash values"
https://www.postgresql.org/message-id/CAEepm%3D3rdgjfxW4cKvJ0OEmya2-34B0qHNG1xV0vK7TGPJGMUQ%40mail.gmail.com
though noting there was no discussion on that point [1]. This just
needs a little refactoring to improve things, but it seems more like a
nice-to-have than an essential aspect of hash indexes, so it need not
block us from enabling multi-column hash indexes.
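To make that concrete, here is a rough sketch (not a patch) of how
per-column hash values could be folded together, in the spirit of
TupleHashTableHash_internal(); hash_combine(), index_getprocinfo() and
HASHSTANDARD_PROC are existing backend facilities, while the function itself
and its name are purely illustrative:

#include "postgres.h"
#include "access/genam.h"       /* index_getprocinfo() */
#include "access/hash.h"        /* HASHSTANDARD_PROC */
#include "common/hashfn.h"      /* hash_combine() */
#include "fmgr.h"
#include "utils/rel.h"

/* Illustrative only: fold per-column hashes into one bucket hash. */
static uint32
hash_multicol_value(Relation index, Datum *values, bool *isnull, int nkeys)
{
    uint32      hashval = 0;

    for (int attno = 1; attno <= nkeys; attno++)
    {
        uint32      h = 0;

        /* hash support functions are strict, so treat a NULL as hash 0 */
        if (!isnull[attno - 1])
        {
            FmgrInfo   *procinfo = index_getprocinfo(index, attno,
                                                     HASHSTANDARD_PROC);

            h = DatumGetUInt32(FunctionCall1Coll(procinfo,
                                                 index->rd_indcollation[attno - 1],
                                                 values[attno - 1]));
        }
        hashval = hash_combine(hashval, h);
    }
    return hashval;
}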

2. Unique Hash Indexes have been summarized here:
https://www.postgresql.org/message-id/CAA4eK1KATC1TA5bR5eobYQVO3RWsnH6djNpk3P376em4V8MuUA%40mail.gmail.com
which also seems to have two parts to it.

2.1 Uniqueness Check
Amit: "to ensure that there is no duplicate entry we need to traverse
the whole bucket chain"
Agreed. That seems straightforward and can also be improved later.
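As a rough illustration of what that traversal looks like (a sketch, not
the patch; the helper name is made up), the insert path would walk the
bucket's overflow chain comparing stored hash codes, treating a match only
as a candidate duplicate to be confirmed against the heap. A real
implementation would also exploit the within-page hash ordering via
_hash_binsearch() rather than scanning every offset:

/*
 * Sketch only: does any entry in this bucket chain carry the same hash
 * code as 'hashkey'?  A match is only a *candidate* duplicate; the heap
 * must still be consulted to compare the real key values.
 */
static bool
hash_chain_has_hashkey(Relation rel, Buffer bucketbuf, uint32 hashkey)
{
    Buffer      buf = bucketbuf;

    for (;;)
    {
        Page        page = BufferGetPage(buf);
        HashPageOpaque opaque = (HashPageOpaque) PageGetSpecialPointer(page);
        BlockNumber nextblkno = opaque->hasho_nextblkno;
        OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
        bool        found = false;

        for (OffsetNumber off = FirstOffsetNumber; off <= maxoff; off++)
        {
            IndexTuple  itup = (IndexTuple) PageGetItem(page,
                                                        PageGetItemId(page, off));

            if (_hash_get_indextuple_hashkey(itup) == hashkey)
            {
                found = true;   /* candidate duplicate: recheck the heap */
                break;
            }
        }

        /* release overflow pages as we go; the caller keeps the bucket page */
        if (buf != bucketbuf)
            _hash_relbuf(rel, buf);

        if (found)
            return true;
        if (!BlockNumberIsValid(nextblkno))
            return false;

        buf = _hash_getbuf(rel, nextblkno, HASH_READ, LH_OVERFLOW_PAGE);
    }
}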

2.2 Locking
Amit's idea of holding ExclusiveLock on the bucket page works for me,
but there was some doubt about splitting.

Splitting seems to be an awful behavior that users would wish to avoid
if they knew about the impact and duration. In my understanding,
splitting involves moving 50% of rows and likely touches all blocks
eventually. If the existing index is the wrong shape then just run
REINDEX. If we tune the index build, it looks like REINDEX would be a
quicker and easier way of growing an index than trying to split an
existing index, i.e. rely on ecdysis rather than internal growth. This is
much more viable now because of the CIC changes in PG14.

(I would argue that removing splitting completely is a good idea,
similar to the way we removed the original VACUUM FULL algorithm, but
that thought will take a while to swallow.) Instead, I suggest we
introduce a new indexam option for hash indexes, autosplit=on (default)
| off, so that users can turn splitting off. That probably means we also
need another option, initial_buckets, where 0 (the default) means size
from the number of existing rows and N means use that specific size.
Note that turning off splitting does not limit the size of the index; it
just stops the index from refreshing its number of buckets. B-trees are
the default for PKs, so hash indexes are an option for larger tables
only, which means there is less need for hash indexes to cope with
tables of unknown size - we wouldn't even be using hash unless we
already knew it was a big table.
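A sketch of how those options might hang off the existing HashOptions
struct and hashoptions() reloptions parser; the "autosplit" and
"initial_buckets" fields are proposals from this mail, not existing options,
and the new names would also need registering in reloptions.c, which is
omitted here:

typedef struct HashOptions
{
    int32       vl_len_;            /* varlena header (do not touch directly!) */
    int         fillfactor;         /* page fill factor, as today */
    bool        autosplit;          /* proposed: allow bucket splits? */
    int         initial_buckets;    /* proposed: 0 = size from existing rows */
} HashOptions;

bytea *
hashoptions(Datum reloptions, bool validate)
{
    static const relopt_parse_elt tab[] = {
        {"fillfactor", RELOPT_TYPE_INT, offsetof(HashOptions, fillfactor)},
        {"autosplit", RELOPT_TYPE_BOOL, offsetof(HashOptions, autosplit)},
        {"initial_buckets", RELOPT_TYPE_INT,
         offsetof(HashOptions, initial_buckets)},
    };

    return (bytea *) build_reloptions(reloptions, validate,
                                      RELOPT_KIND_HASH,
                                      sizeof(HashOptions),
                                      tab, lengthof(tab));
}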

If splitting causes any difficulty at all, then we should simply say
that unique hash indexes should initially force autosplit=off, so we
don't have to worry about the correctness of the locking. I suggest we
implement that first and then decide if we really care about splitting,
because I'd bet we don't. Yes, I consider uniqueness much more desirable
than splitting.

I've written a patch that refactors index build so we *never* need to
perform a split during index build, allowing us to more credibly skip
index splitting completely. (Incidentally, it avoids the need to
update the metapage for each row during the build, allowing us to
consider writing in batches to the index as a next step). So there
need be no *requirement* for splitting to be supported with
uniqueness, while build/reindex looks like it can be much faster. I
can post it if anyone wants to see it, but I don't want to distract us
from discussion of the main requirements.
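For reference, hashbuild() already sizes the bucket array from an estimate
of the heap's row count; the refactoring is essentially about making that
pre-sizing sufficient so no split can be triggered mid-build. A hedged
sketch only (the helper name and the 10% slack are illustrative, not taken
from the patch):

#include "postgres.h"
#include "access/hash.h"
#include "optimizer/plancat.h"  /* estimate_rel_size() */

/* Illustrative only: size the bucket array so the build never splits. */
static uint32
hash_presize_for_build(Relation heap, Relation index)
{
    BlockNumber relpages;
    double      reltuples;
    double      allvisfrac;

    estimate_rel_size(heap, NULL, &relpages, &reltuples, &allvisfrac);

    /* 10% slack is an arbitrary safety margin for estimation error */
    return _hash_init(index, reltuples * 1.10, MAIN_FORKNUM);
}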

I have other performance tuning ideas, but they can wait.

Anyway, that's what I think at present. Thoughts?

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: Next Steps with Hash Indexes

From
Amit Kapila
Date:
On Thu, Jul 15, 2021 at 10:11 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
>
> 2. Unique Hash Indexes have been summarized here:
> https://www.postgresql.org/message-id/CAA4eK1KATC1TA5bR5eobYQVO3RWsnH6djNpk3P376em4V8MuUA%40mail.gmail.com
> which also seems to have two parts to it.
>
> 2.1 Uniqueness Check
> Amit: "to ensure that there is no duplicate entry we need to traverse
> the whole bucket chain"
> Agreed. That seems straightforward and can also be improved later.
>
> 2.2 Locking
> Amit's idea of holding ExclusiveLock on the bucket page works for me,
> but there was some doubt about splitting.
>

I think the main thing to think about for the uniqueness check during a
split (where we scan both the old and new buckets) was whether we need
to lock both the old (bucket_being_split) and new
(bucket_being_populated) buckets, or whether holding a lock on just one
of them (the current bucket in which we are inserting) is sufficient.
During a scan of the new bucket, we just retain pins on both buckets
(see comments in _hash_first()), but if we need to retain locks on both
buckets then we need to do something different than what we do for
scans. However, I think it is sufficient to hold an exclusive lock on
the primary bucket page of the bucket we are trying to insert into, plus
a pin on the other bucket (the old bucket, as we do for scans). That is
because no concurrent inserter should try to insert the same tuple into
both the old and the new bucket: before starting the split we always
update the metapage's hashm_lowmask and hashm_highmask, which decide the
routing of the tuples.

Now, I think the other problem we need to think about here is that for
a hash index, after finding the tuple in the index we always need to
recheck in the heap, as we don't store the actual value in the hash
index. For a scan, we get the tuple(s) from the index (releasing locks)
and then match the qual after fetching the tuple from the heap. But we
can't do that for a uniqueness check, because if we release the locks on
the index bucket page then another inserter could come in before we
match it in the heap. I think we need some mechanism whereby, after
fetching the TID from the index, we recheck the actual value in the heap
before releasing the lock on the index bucket page.
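A sketch of that mechanism (mirroring what _bt_check_unique() does on the
btree side with a dirty snapshot); table_index_fetch_tuple_check() is an
existing API, while hash_unique_conflict() and hash_keys_equal_in_heap()
are hypothetical names used only for illustration:

#include "postgres.h"
#include "access/tableam.h"     /* table_index_fetch_tuple_check() */
#include "utils/snapmgr.h"      /* InitDirtySnapshot() */

/*
 * Sketch only: recheck the heap while the bucket page lock is still held,
 * so a concurrent inserter of the same key cannot slip in between the
 * index probe and the heap check.  Waiting for in-progress inserters
 * (as _bt_check_unique() does via the dirty snapshot's xmin/xmax) is
 * omitted here.
 */
static bool
hash_unique_conflict(Relation heapRel, Relation indexRel,
                     ItemPointer candidate_tid,
                     Datum *values, bool *isnull)
{
    SnapshotData SnapshotDirty;
    bool        all_dead;

    /* caller still holds the exclusive lock on the bucket's primary page */
    InitDirtySnapshot(SnapshotDirty);

    /* Is the heap tuple behind this index entry still live to anyone? */
    if (!table_index_fetch_tuple_check(heapRel, candidate_tid,
                                       &SnapshotDirty, &all_dead))
        return false;           /* no live tuple, so no conflict */

    /*
     * Equal hash codes do not prove equal keys, so compare the real key
     * values fetched from the heap (hypothetical helper) before deciding.
     */
    return hash_keys_equal_in_heap(heapRel, indexRel, candidate_tid,
                                   values, isnull);
}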

The other thing could be that if we have unique support for hash
indexes then we can probably allow INSERT ... ON CONFLICT if the user
specifies a unique index column as the conflict_target.

I am not sure if multicol index support is mandatory to allow
uniqueness for hash indexes; sure, it would be good, but I feel that can
be done as a separate patch as well.

-- 
With Regards,
Amit Kapila.



Re: Next Steps with Hash Indexes

From
Amit Kapila
Date:
On Tue, Jul 20, 2021 at 5:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 15, 2021 at 10:11 PM Simon Riggs
> <simon.riggs@enterprisedb.com> wrote:
> >
> > 2. Unique Hash Indexes have been summarized here:
> > https://www.postgresql.org/message-id/CAA4eK1KATC1TA5bR5eobYQVO3RWsnH6djNpk3P376em4V8MuUA%40mail.gmail.com
> > which also seems to have two parts to it.
> >
> > 2.1 Uniqueness Check
> > Amit: "to ensure that there is no duplicate entry we need to traverse
> > the whole bucket chain"
> > Agreed. That seems straightforward and can also be improved later.
> >
> > 2.2 Locking
> > Amit's idea of holding ExclusiveLock on the bucket page works for me,
> > but there was some doubt about splitting.
> >
>
> I think the main thing to think about for uniqueness check during
> split (where we scan both the old and new buckets) was whether we need
> to lock both the old (bucket_being_split) and new
> (bucket_being_populated) buckets or just holding locks on one of them
> (the current bucket in which we are inserting) is sufficient? During a
> scan of the new bucket, we just retain pins on both the buckets (see
> comments in _hash_first()) but if we need to retain locks on both
> buckets then we need to do something different then we do it for
> scans. But, I think it is sufficient to just hold an exclusive lock on
> the primary bucket page in the bucket we are trying to insert and pin
> on the other bucket (old bucket as we do for scans). Because no
> concurrent inserter should try to insert into the old bucket and new
> bucket the same tuple as before starting the split we always update
> the metapage for hashm_lowmask and hashm_highmask which decides the
> routing of the tuples.
>
> Now, I think here the other problem we need to think about is that for
> the hash index after finding the tuple in the index, we need to always
> recheck in the heap as we don't store the actual value in the hash
> index. For that in the scan, we get the tuple(s) from the index
> (release locks) and then match qual after fetching tuple from the
> heap. But we can't do that for uniqueness check because if we release
> the locks on the index bucket page then another inserter could come
> before we match it in heap. I think we need some mechanism that after
> fetching TID from the index, we recheck the actual value in heap
> before releasing the lock on the index bucket page.
>

One more thing we need to think about here is when to find the right
bucket page in the chain where we can insert the new tuple. Do we first
complete the uniqueness check (which needs to scan through the entire
bucket chain) and then scan the bucket again for a page with space to
insert, or do we want to do it along with the uniqueness-check scan and
remember it?

-- 
With Regards,
Amit Kapila.



Re: Next Steps with Hash Indexes

From
Simon Riggs
Date:
On Tue, Jul 20, 2021 at 1:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 15, 2021 at 10:11 PM Simon Riggs
> <simon.riggs@enterprisedb.com> wrote:
> >
> > 2. Unique Hash Indexes have been summarized here:
> > https://www.postgresql.org/message-id/CAA4eK1KATC1TA5bR5eobYQVO3RWsnH6djNpk3P376em4V8MuUA%40mail.gmail.com
> > which also seems to have two parts to it.
> >
> > 2.1 Uniqueness Check
> > Amit: "to ensure that there is no duplicate entry we need to traverse
> > the whole bucket chain"
> > Agreed. That seems straightforward and can also be improved later.
> >
> > 2.2 Locking
> > Amit's idea of holding ExclusiveLock on the bucket page works for me,
> > but there was some doubt about splitting.
> >
>
> I think the main thing to think about for uniqueness check during
> split (where we scan both the old and new buckets) was whether we need
> to lock both the old (bucket_being_split) and new
> (bucket_being_populated) buckets or just holding locks on one of them
> (the current bucket in which we are inserting) is sufficient? During a
> scan of the new bucket, we just retain pins on both the buckets (see
> comments in _hash_first()) but if we need to retain locks on both
> buckets then we need to do something different then we do it for
> scans. But, I think it is sufficient to just hold an exclusive lock on
> the primary bucket page in the bucket we are trying to insert and pin
> on the other bucket (old bucket as we do for scans). Because no
> concurrent inserter should try to insert into the old bucket and new
> bucket the same tuple as before starting the split we always update
> the metapage for hashm_lowmask and hashm_highmask which decides the
> routing of the tuples.

During an incomplete split, we need to scan both old and new. So
during insert, we need to scan both old and new, while holding
exclusive locks on both bucket pages. I've spent a few days looking at
the split behavior and this seems like a complete approach. I'm working
on a patch now, still at the hacking stage.

(BTW, my opinion of the split mechanism has now changed from bad to
very good. It works really well for unique data, but can be completely
ineffective for badly skewed data).

> Now, I think here the other problem we need to think about is that for
> the hash index after finding the tuple in the index, we need to always
> recheck in the heap as we don't store the actual value in the hash
> index. For that in the scan, we get the tuple(s) from the index
> (release locks) and then match qual after fetching tuple from the
> heap. But we can't do that for uniqueness check because if we release
> the locks on the index bucket page then another inserter could come
> before we match it in heap. I think we need some mechanism that after
> fetching TID from the index, we recheck the actual value in heap
> before releasing the lock on the index bucket page.

I don't think btree does that, so I'm not sure we do need that for
hash. Yes, there is a race condition, but someone will win. Do we care
who? Do we care enough to take the concurrency hit? Duplicate inserts
would be very rare in a declared unique index, so it would be a poor
trade-off.

> The other thing could be that if we have unique support for hash index
> then probably we can allow Insert ... ON Conflict if the user
> specifies unique index column as conflict_target.

Yes, that looks doable.

> I am not sure if multicol index support is mandatory to allow
> uniqueness for hash indexes, sure it would be good but I feel that can
> be done as a separate patch as well.

I have a patch for multicol support, attached.

-- 
Simon Riggs                http://www.EnterpriseDB.com/

Attachment

Re: Next Steps with Hash Indexes

From
Simon Riggs
Date:
On Tue, Jul 20, 2021 at 1:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> One more thing we need to think about here is when to find the right
> bucket page in the chain where we can insert the new tuple. Do we
> first try to complete the uniqueness check (which needs to scan
> through the entire bucket chain) and then again scan the bucket with
> space to insert or do we want to do it along with uniqueness check
> scan and remember it?

The latter approach, but that is just a performance tweak for later.

On a unique hash index, regular splitting means there are almost no
bucket chains more than 2 long (bucket plus overflow), so it seems
like mostly wasted effort at this point.
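For later reference, a sketch of that "remember it" tweak (illustrative
only; no such helper exists): each step of the duplicate-check loop could
note the first chain page with enough room for the new tuple, with itemsz
computed as MAXALIGN(IndexTupleSize(itup)) the way _hash_doinsert() does:

/* Illustrative only: remember the first chain page with room for the tuple. */
static void
remember_insert_page(Buffer buf, Size itemsz, BlockNumber *insert_blkno)
{
    Page        page = BufferGetPage(buf);

    if (*insert_blkno == InvalidBlockNumber &&
        PageGetFreeSpace(page) >= itemsz)
        *insert_blkno = BufferGetBlockNumber(buf);
}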

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: Next Steps with Hash Indexes

From
Amit Kapila
Date:
On Tue, Jul 20, 2021 at 6:32 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
>
> On Tue, Jul 20, 2021 at 1:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Jul 15, 2021 at 10:11 PM Simon Riggs
> > <simon.riggs@enterprisedb.com> wrote:
> >
> > I think the main thing to think about for uniqueness check during
> > split (where we scan both the old and new buckets) was whether we need
> > to lock both the old (bucket_being_split) and new
> > (bucket_being_populated) buckets or just holding locks on one of them
> > (the current bucket in which we are inserting) is sufficient? During a
> > scan of the new bucket, we just retain pins on both the buckets (see
> > comments in _hash_first()) but if we need to retain locks on both
> > buckets then we need to do something different then we do it for
> > scans. But, I think it is sufficient to just hold an exclusive lock on
> > the primary bucket page in the bucket we are trying to insert and pin
> > on the other bucket (old bucket as we do for scans). Because no
> > concurrent inserter should try to insert into the old bucket and new
> > bucket the same tuple as before starting the split we always update
> > the metapage for hashm_lowmask and hashm_highmask which decides the
> > routing of the tuples.
>
> During an incomplete split, we need to scan both old and new. So
> during insert, we need to scan both old and new, while holding
> exclusive locks on both bucket pages.
>

It will surely work if we have an exclusive lock on both buckets (old
and new) in this case, but I think it is better if we can avoid taking
an exclusive lock on the old bucket (bucket_being_split) unless it is
really required. We need an exclusive lock on the primary bucket page
where we are trying to insert, to keep out any other inserter with the
same key, but I think we don't need it for the old bucket, because no
inserter with the same key can try to insert the key into the old bucket
when it would belong to the new bucket.

> I've spent a few days looking at
> the split behavior and this seems a complete approach. I'm working on
> a patch now, still at hacking stage.
>
> (BTW, my opinion of the split mechanism has now changed from bad to
> very good. It works really well for unique data, but can be completely
> ineffective for badly skewed data).
>
> > Now, I think here the other problem we need to think about is that for
> > the hash index after finding the tuple in the index, we need to always
> > recheck in the heap as we don't store the actual value in the hash
> > index. For that in the scan, we get the tuple(s) from the index
> > (release locks) and then match qual after fetching tuple from the
> > heap. But we can't do that for uniqueness check because if we release
> > the locks on the index bucket page then another inserter could come
> > before we match it in heap. I think we need some mechanism that after
> > fetching TID from the index, we recheck the actual value in heap
> > before releasing the lock on the index bucket page.
>
> I don't think btree does that, so I'm not sure we do need that for
> hash. Yes, there is a race condition, but someone will win. Do we care
> who? Do we care enough to take the concurrency hit?
>

I think if we don't care we might end up with duplicates. I might be
missing something here but let me explain the problem I see. Say, while
doing a unique check, we find the same hash value in the bucket we are
trying to insert into; we can't declare a unique-key violation and error
out at this stage without checking the actual value in the heap, because
there is always a chance that two different key values map to the same
hash value. Now suppose that, after checking in the heap, we find that
the actual value doesn't match, so we decide to insert the value into
the hash index; in the meantime, another insert of the same key value
has already performed these checks and ends up inserting its value into
the hash index, and that leads to a duplicate value in the hash index. I
think btree doesn't have a similar risk, so we don't need such a
mechanism for btree.


-- 
With Regards,
Amit Kapila.



Re: Next Steps with Hash Indexes

From
Simon Riggs
Date:
On Thu, 22 Jul 2021 at 06:10, Amit Kapila <amit.kapila16@gmail.com> wrote:

> It will surely work if we have an exclusive lock on both the buckets
> (old and new) in this case but I think it is better if we can avoid
> exclusive locking the old bucket (bucket_being_split) unless it is
> really required. We need an exclusive lock on the primary bucket where
> we are trying to insert to avoid any other inserter with the same key
> but I think we don't need it for the old bucket because no inserter
> with the same key can try to insert the key in an old bucket which
> would belong to the new bucket.

Agreed.

> > I don't think btree does that, so I'm not sure we do need that for
> > hash. Yes, there is a race condition, but someone will win. Do we care
> > who? Do we care enough to take the concurrency hit?
> >
>
> I think if we don't care we might end up with duplicates. I might be
> missing something here but let me explain the problem I see. Say,
> while doing a unique check we found the same hash value in the bucket
> we are trying to insert, we can't say unique key violation at this
> stage and error out without checking the actual value in heap. This is
> because there is always a chance that two different key values can map
> to the same hash value. Now, after checking in the heap if we found
> that the actual value doesn't match so we decide to insert the value
> in the hash index, and in the meantime, another insert of the same key
> value already performed these checks and ends up inserting the value
> in hash index and that would lead to a duplicate value in the hash
> index. I think btree doesn't have a similar risk so we don't need such
> a mechanism for btree.

Agreed, after thinking about it more while coding.

All of the above is implemented in the patches below:

Complete patch for hash_multicol.v3.patch attached, slightly updated
from earlier patch.
Docs, tests, passes make check.

WIP for hash_unique.v4.patch attached, patch-on-patch, to allow us to
discuss flow of logic and shape of code.
The actual comparison is not implemented yet. Not trivial, but can
wait until we decide main logic.
Passes make check and executes attached tests.

Tests in separate file also attached, will eventually be merged into
src/test/regress/sql/hash_index.sql

No tests yet for splitting or deferred uniqueness checks. The latter
is because there are parse analysis changes needed to undo the
assumption that only btrees support uniqueness, but nothing there
looks like an issue.

Thanks for your input, have a good weekend.

-- 
Simon Riggs                http://www.EnterpriseDB.com/

Attachment

Re: Next Steps with Hash Indexes

From
Dilip Kumar
Date:
On Fri, Jul 23, 2021 at 6:16 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
>
> On Thu, 22 Jul 2021 at 06:10, Amit Kapila <amit.kapila16@gmail.com> wrote:

> Complete patch for hash_multicol.v3.patch attached, slightly updated
> from earlier patch.
> Docs, tests, passes make check.

I was looking into hash_multicol.v3.patch, and I have a question:

  <para>
-  Hash indexes support only single-column indexes and do not allow
-  uniqueness checking.
+  Hash indexes support uniqueness checking.
+  Hash indexes support multi-column indexes, but only store the hash value
+  for the first column, so multiple columns are useful only for uniquness
+  checking.
  </para>

The above says that we store the hash value only for the first column;
my question is why don't we store hashes for the other columns as well?
I mean, we can search the bucket based on the first column's hash, but
the hashes for the other columns could be payload data, and we could use
them to match the hash values for the other key columns before accessing
the heap, as discussed here [1]. IMHO, this would further reduce heap
access.

[1] https://www.postgresql.org/message-id/7192.1506527843%40sss.pgh.pa.us


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Next Steps with Hash Indexes

From
Peter Geoghegan
Date:
On Thu, Jul 15, 2021 at 9:41 AM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
> It would be very desirable to allow Hash Indexes to become Primary Key
> Indexes, which requires both
>   amroutine->amcanunique = true;
>   amroutine->amcanmulticol = true;

Why do you say that? I don't think it's self-evident that it's desirable.

In general I don't think that hash indexes are all that compelling
compared to B-Trees. In practice the flexibility of B-Trees tends to
win out, even if B-Trees are slightly slower than hash indexes with
certain kinds of benchmarks that are heavy on point lookups and have
no locality.

I have no reason to object to any of this, and I don't object. I'm just asking.

-- 
Peter Geoghegan



Re: Next Steps with Hash Indexes

From
Amit Kapila
Date:
On Tue, Aug 10, 2021 at 6:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Jul 23, 2021 at 6:16 PM Simon Riggs
> <simon.riggs@enterprisedb.com> wrote:
> >
> > On Thu, 22 Jul 2021 at 06:10, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > Complete patch for hash_multicol.v3.patch attached, slightly updated
> > from earlier patch.
> > Docs, tests, passes make check.
>
> I was looking into the hash_multicoul.v3.patch, I have a question
>
>   <para>
> -  Hash indexes support only single-column indexes and do not allow
> -  uniqueness checking.
> +  Hash indexes support uniqueness checking.
> +  Hash indexes support multi-column indexes, but only store the hash value
> +  for the first column, so multiple columns are useful only for uniquness
> +  checking.
>   </para>
>
> The above comments say that we store hash value only for the first
> column,  my question is why don't we store for other columns as well?
> I mean we can search the bucket based on the first column hash but the
> hashes for the other column could be payload data and we can use that
> to match the hash value for other key columns before accessing the
> heap, as discussed here[1].  IMHO, this will further reduce the heap
> access.
>

True, the other idea could be that in the payload we store the value
obtained by combining the multi-column hashes into one hash value. This
would allow us to satisfy queries where the search is on all columns of
the index efficiently, provided the planner doesn't remove some of them,
in which case we would need to do more work.

One more thing we need to consider is the 'hashm_procid' stored in the
metapage: currently it works for a single-column index, but for a
multi-column index we might want to set it to InvalidOid.

-- 
With Regards,
Amit Kapila.



Re: Next Steps with Hash Indexes

From
Robert Haas
Date:
On Tue, Aug 10, 2021 at 8:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I was looking into the hash_multicoul.v3.patch, I have a question
>
>   <para>
> -  Hash indexes support only single-column indexes and do not allow
> -  uniqueness checking.
> +  Hash indexes support uniqueness checking.
> +  Hash indexes support multi-column indexes, but only store the hash value
> +  for the first column, so multiple columns are useful only for uniquness
> +  checking.
>   </para>
>
> The above comments say that we store hash value only for the first
> column,  my question is why don't we store for other columns as well?

I suspect it would be hard to store multiple hash values, one per
column. It seems to me that what we ought to do is combine the hash
values for the individual columns using hash_combine(64) and store the
combined value. I can't really imagine why we would NOT do that. It
seems like it would be easy to do and make the behavior of the feature
way less surprising.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Next Steps with Hash Indexes

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I suspect it would be hard to store multiple hash values, one per
> column. It seems to me that what we ought to do is combine the hash
> values for the individual columns using hash_combine(64) and store the
> combined value. I can't really imagine why we would NOT do that.

That would make it impossible to use the index except with queries
that provide equality conditions on all the index columns.  Maybe
that's fine, but it seems less flexible than other possible definitions.
It really makes me wonder why anyone would bother with a multicol
hash index.

            regards, tom lane



Re: Next Steps with Hash Indexes

From
Robert Haas
Date:
On Wed, Aug 11, 2021 at 10:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > I suspect it would be hard to store multiple hash values, one per
> > column. It seems to me that what we ought to do is combine the hash
> > values for the individual columns using hash_combine(64) and store the
> > combined value. I can't really imagine why we would NOT do that.
>
> That would make it impossible to use the index except with queries
> that provide equality conditions on all the index columns.  Maybe
> that's fine, but it seems less flexible than other possible definitions.
> It really makes me wonder why anyone would bother with a multicol
> hash index.

Hmm. That is a point I hadn't considered.

I have to admit that after working with Amit on all the work to make
hash indexes WAL-logged a few years ago, I was somewhat disillusioned
with the whole AM. It seems like a cool idea to me but it's just not
that well-implemented. For example, the strategy of just doubling the
number of buckets in one shot seems pretty terrible for large indexes,
and ea69a0dead5128c421140dc53fac165ba4af8520 will buy only a limited
amount of relief. Likewise, the fact that keys are stored in hash
value order within pages but that the bucket as a whole is not kept in
order seems like it's bad for search performance and really bad for
implementing unique indexes with reasonable amounts of locking. (I
don't know how the present patch tries to solve that problem.) It's
tempting to think that we should think about creating something
altogether new instead of hacking on the existing implementation, but
that's a lot of work and I'm not sure what specific design would be
best.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Next Steps with Hash Indexes

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I have to admit that after working with Amit on all the work to make
> hash indexes WAL-logged a few years ago, I was somewhat disillusioned
> with the whole AM. It seems like a cool idea to me but it's just not
> that well-implemented.

Yeah, agreed.  The whole buckets-are-integral-numbers-of-pages scheme
is pretty well designed to ensure bloat, but trying to ameliorate that
by reducing the number of buckets creates its own problems (since, as
you mention, we have no scheme whatever for searching within a bucket).
I'm quite unimpressed with Simon's upthread proposal to turn off bucket
splitting without doing anything about the latter issue.

I feel like we'd be best off to burn the AM to the ground and start
over.  I do not know what a better design would look like exactly,
but I feel like it's got to decouple buckets from pages somehow.
Along the way, I'd want to store 64-bit hash values (we still haven't
done that have we?).

As far as the specific point at hand is concerned, I think storing
a hash value per index column, while using only the first column's
hash for bucket selection, is what to do for multicol indexes.
We still couldn't set amoptionalkey=true for hash indexes, because
without a hash for the first column we don't know which bucket to
look in.  But storing hashes for the additional columns would allow
us to check additional conditions in the index, and usually save
trips to the heap on queries that provide additional column
conditions.  You could also imagine sorting the contents of a bucket
on all the hashes, which would ease uniqueness checks.

            regards, tom lane



Re: Next Steps with Hash Indexes

From
John Naylor
Date:

On Wed, Aug 11, 2021 at 10:54 AM Robert Haas <robertmhaas@gmail.com> wrote:
> don't know how the present patch tries to solve that problem.) It's
> tempting to think that we should think about creating something
> altogether new instead of hacking on the existing implementation, but
> that's a lot of work and I'm not sure what specific design would be
> best.

(Standard disclaimer that I'm not qualified to design index AMs) I've seen one mention in the literature about the possibility of simply having a btree index over the hash values. That would require faster search within pages, in particular using abbreviated keys in the ItemId array of internal pages [1] and interpolated search rather than pure binary search (which should work reliably with high-entropy keys like hash values), but doing that would speed up all btree indexes, so that much is worth doing regardless of how hash indexes are implemented. In that scheme, the hash index AM would just be around for backward compatibility.


--
John Naylor
EDB: http://www.enterprisedb.com

Re: Next Steps with Hash Indexes

From
Tom Lane
Date:
John Naylor <john.naylor@enterprisedb.com> writes:
> (Standard disclaimer that I'm not qualified to design index AMs) I've seen
> one mention in the literature about the possibility of simply having a
> btree index over the hash values.

Yeah, that's been talked about in the past --- we considered it
moderately seriously back when the hash AM was really only a toy
for lack of WAL support.  The main knock on it is that searching
a btree is necessarily O(log N), while in principle a hash probe
could be O(1).  Of course, a badly-implemented hash AM could be
worse than O(1), but we'd basically be giving up on ever getting
to O(1).

There's a separate discussion to be had about whether there should
be an easier way for users to build indexes that are btrees of
hashes.  You can do it today but the indexes aren't pleasant to
use, requiring query adjustment.

            regards, tom lane



Re: Next Steps with Hash Indexes

From
Robert Haas
Date:
On Wed, Aug 11, 2021 at 11:17 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, agreed.  The whole buckets-are-integral-numbers-of-pages scheme
> is pretty well designed to ensure bloat, but trying to ameliorate that
> by reducing the number of buckets creates its own problems (since, as
> you mention, we have no scheme whatever for searching within a bucket).
> I'm quite unimpressed with Simon's upthread proposal to turn off bucket
> splitting without doing anything about the latter issue.

Maybe. I don't think that it should be a huge problem to decide that
an occupied bucket has to consume an entire page; if that's a big
issue, you should just have fewer buckets. I do think it's a problem
that a bucket containing no tuples at all still consumes an entire
page, because the data is often going to be skewed so that many
buckets are entirely empty. I also think it's a problem that expanding
the directory from 2^{N} buckets to 2^{N+1} buckets requires 4
allocations of 2^{N-2} *consecutive* pages. That's a lot of
consecutive pages for even fairly modest values of N.

Imagine a design where we have a single-page directory with 1024
slots, each corresponding to one bucket. Each slot stores a block
number, which might be InvalidBlockNumber if there are no tuples in
that bucket. Given a tuple with hash value H, check slot H%1024 and
then go to that page to look further. If there are more tuples in that
bucket than can fit on the page, then it can link to another page. If
we assume for the sake of argument that 1024 is the right number of
buckets, this is going to use about as much space as the current
system when the data distribution is uniform, but considerably less
when it's skewed. The larger you make the number of buckets, the
better this kind of thing looks on skewed data.

Now, you can't just always have 1024 buckets, so we'd actually have to
do something a bit more clever, probably involving multiple levels of
directories. For example, suppose a directory page contains only 32
slots. That will leave a lot of empty space in the page, which can be
used to store tuples. An index search has to scan all tuples that are
stored directly in the page, and also use the first 5 bits of the hash
key to search the appropriate bucket. But the bucket is itself a
directory: it can have some tuples stored directly in the page, and
then it has 32 more slots and you use the next 5 bits of the hash key
to decide which one of those to search. Then it becomes possible to
incrementally expand the hash index: when the space available in a
directory page fills up, you can either create a sub-directory and
move as many tuples as you can into that page, or add an overflow page
that contains only tuples.
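To make the shape concrete, here is a rough sketch of what such a directory
page's fixed part might look like; this is entirely hypothetical, nothing
like it exists in the tree today:

#define HASH_DIR_FANOUT 32      /* 5 bits of the hash key per directory level */

/*
 * Hypothetical layout for the scheme described above: each directory
 * page routes on the next 5 bits of the hash, and the remaining space
 * on the page holds index tuples stored directly at this level.
 */
typedef struct HashDirPageData
{
    uint16      level;                  /* 5-bit digits consumed so far */
    BlockNumber child[HASH_DIR_FANOUT]; /* bucket or sub-directory page per
                                         * digit; InvalidBlockNumber if empty */
} HashDirPageData;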

It's important to be able to do either one, because sometimes a bucket
fills up with tuples that have identical hash values, and sometimes a
bucket fills up with tuples that have a variety of hash values. The
current implementation tends to massively increase the number of
buckets even when it does very little to spread the index entries out.
("Hey, I doubled the number of buckets and the keys are still almost
all in one bucket ... let me double the number of buckets again and
see if it works better this time!") If we were going to create a
replacement, we'd want the index to respond differently to a bunch of
dups vs. a bunch of non-dups.

> As far as the specific point at hand is concerned, I think storing
> a hash value per index column, while using only the first column's
> hash for bucket selection, is what to do for multicol indexes.
> We still couldn't set amoptionalkey=true for hash indexes, because
> without a hash for the first column we don't know which bucket to
> look in.  But storing hashes for the additional columns would allow
> us to check additional conditions in the index, and usually save
> trips to the heap on queries that provide additional column
> conditions.  You could also imagine sorting the contents of a bucket
> on all the hashes, which would ease uniqueness checks.

That sounds reasonable I guess, but I don't know how hard it is to implement.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Next Steps with Hash Indexes

From
Dilip Kumar
Date:
On Wed, Aug 11, 2021 at 8:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> As far as the specific point at hand is concerned, I think storing
> a hash value per index column, while using only the first column's
> hash for bucket selection, is what to do for multicol indexes.
> We still couldn't set amoptionalkey=true for hash indexes, because
> without a hash for the first column we don't know which bucket to
> look in.  But storing hashes for the additional columns would allow
> us to check additional conditions in the index, and usually save
> trips to the heap on queries that provide additional column
> conditions.  You could also imagine sorting the contents of a bucket
> on all the hashes, which would ease uniqueness checks.

Earlier, I was thinking that we would have two hashes: one for the
first key column, used to identify the bucket, and one for the remaining
key columns, which would further help with heap lookups and with
ordering for uniqueness checking. But yeah, if we have a hash value for
each column then it will be really flexible.

I was also looking into how other databases support hash indexes, and I
see that at least in MySQL [1] a multiple-column hash index has the
limitation that you have to supply all key columns in the search
condition for the index scan to be selected. IMHO, that limitation might
be there because they store just one hash value based on all the key
columns and also select the bucket based on that same hash value.

[1] https://dev.mysql.com/doc/refman/8.0/en/index-btree-hash.html

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Next Steps with Hash Indexes

From
Amit Kapila
Date:
On Thu, Aug 12, 2021 at 9:09 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Aug 11, 2021 at 8:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > As far as the specific point at hand is concerned, I think storing
> > a hash value per index column, while using only the first column's
> > hash for bucket selection, is what to do for multicol indexes.
> > We still couldn't set amoptionalkey=true for hash indexes, because
> > without a hash for the first column we don't know which bucket to
> > look in.  But storing hashes for the additional columns would allow
> > us to check additional conditions in the index, and usually save
> > trips to the heap on queries that provide additional column
> > conditions.

Yeah, this sounds reasonable, but I think the alternative proposal by
Dilip (see below) and by me [1] also has merits.

> >  You could also imagine sorting the contents of a bucket
> > on all the hashes, which would ease uniqueness checks.
>

Yeah, we can do that, but the current design also seems to have merits
for uniqueness checks. To sort all the hashes in the bucket, we would
need to read all the overflow pages and then sort, which could lead to
additional I/O in some cases. The other possibility is to keep all the
bucket pages sorted during insertion, but that would also require
additional I/O. OTOH, in the current design, we only move to the next
page if the value is not found in the current bucket page (which keeps
its hash values in sorted order).

> Earlier, I was thinking that we have two hashes, one for the first key
> column that is for identifying the bucket, and one for the remaining
> key columns which will further help with heap lookup and ordering for
> uniqueness checking.
>

I also mentioned an almost identical idea yesterday [1]. If we go with
a specification similar to MySQL and SQL Server then it would probably
be better than storing the hashes for all the columns.


> But yeah if we have a hash value for each column
> then it will make it really flexible.
>
> I was also looking into other databases that how they support hash
> indexes, then I see at least in MySQL[1] the multiple column index has
> a limitation that you have to give all key columns in search for
> selecting the index scan.

I see that SQLServer also has the same specification for multi-column
hash index [2]. See the "Multi-column index" section. So it might not
be a bad idea to have a similar specification.

[1] -
https://www.postgresql.org/message-id/CAA4eK1JD1%3DnPDi0kDPGLC%2BJDGEYP8DgTanobvgve%2B%2BKniQ68TA%40mail.gmail.com
[2] -
https://docs.microsoft.com/en-us/sql/relational-databases/in-memory-oltp/indexes-for-memory-optimized-tables?view=sql-server-ver15

-- 
With Regards,
Amit Kapila.



Re: Next Steps with Hash Indexes

From
Amit Kapila
Date:
On Wed, Aug 11, 2021 at 8:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Robert Haas <robertmhaas@gmail.com> writes:
> > I have to admit that after working with Amit on all the work to make
> > hash indexes WAL-logged a few years ago, I was somewhat disillusioned
> > with the whole AM. It seems like a cool idea to me but it's just not
> > that well-implemented.
>
> Yeah, agreed.  The whole buckets-are-integral-numbers-of-pages scheme
> is pretty well designed to ensure bloat, but trying to ameliorate that
> by reducing the number of buckets creates its own problems (since, as
> you mention, we have no scheme whatever for searching within a bucket).
> I'm quite unimpressed with Simon's upthread proposal to turn off bucket
> splitting without doing anything about the latter issue.
>

The design of the patch has changed since the initial proposal. It
tries to perform unique inserts by holding a write lock on the bucket
page to avoid duplicate inserts.

-- 
With Regards,
Amit Kapila.



Re: Next Steps with Hash Indexes

From
Robert Haas
Date:
On Thu, Aug 12, 2021 at 12:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> The design of the patch has changed since the initial proposal. It
> tries to perform unique inserts by holding a write lock on the bucket
> page to avoid duplicate inserts.

Do you mean that you're holding a buffer lwlock while you search the
whole bucket? If so, that's surely not OK.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Next Steps with Hash Indexes

From
Peter Geoghegan
Date:
On Wed, Aug 11, 2021 at 8:51 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
> (Standard disclaimer that I'm not qualified to design index AMs) I've seen one mention in the literature about the
> possibility of simply having a btree index over the hash values. That would require faster search within pages, in
> particular using abbreviated keys in the ItemId array of internal pages [1] and interpolated search rather than pure
> binary search (which should work reliably with high-entropy keys like hash values), but doing that would speed up all
> btree indexes, so that much is worth doing regardless of how hash indexes are implemented. In that scheme, the hash
> index AM would just be around for backward compatibility.

I think that it's possible (though hard) to do that without involving
hashing, even for datatypes like text. Having some kind of prefix
compression that makes the final abbreviated keys have high entropy
would be essential, though. I agree that it would probably be
significantly easier when you knew you were dealing with hash values,
but even there you need some kind of prefix compression.

In any case I suspect that it would make sense to reimplement hash
indexes as a translation layer between hash index opclasses and
nbtree. Robert said "Likewise, the fact that keys are stored in hash
value order within pages but that the bucket as a whole is not kept in
order seems like it's bad for search performance". Obviously we've
already done a lot of work on an index AM that deals with a fully
ordered keyspace very well. This includes dealing with large groups of
duplicates gracefully, since in a certain sense there are no duplicate
B-Tree index tuples -- the heap TID tiebreaker ensures that. And it
ensures that you have heap-wise locality within these large groups,
which is a key enabler of things like opportunistic index deletion.

When hash indexes have been used in database systems, it tends to be
in-memory database systems where the recovery path doesn't recover
indexes -- they're just rebuilt from scratch instead. If that's
already a baked-in assumption then hash indexes make more sense. To me
it seems like the problem with true hash indexes is that they're
constructed in a top-down fashion, which is approximately the opposite
of the bottom-up, incremental approach used by B-Tree indexing. This
seems to be where all the skew problems arise from. This approach
cannot be robust to changes in the keyspace over time, really.

--
Peter Geoghegan



Re: Next Steps with Hash Indexes

From
Amit Kapila
Date:
On Thu, Aug 12, 2021 at 8:30 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Aug 12, 2021 at 12:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > The design of the patch has changed since the initial proposal. It
> > tries to perform unique inserts by holding a write lock on the bucket
> > page to avoid duplicate inserts.
>
> Do you mean that you're holding a buffer lwlock while you search the
> whole bucket? If so, that's surely not OK.
>

I think here you are worried that, while holding the lwlock, we might
perform reads of overflow pages, which is not a good idea. I think there
is less chance of having overflow pages for unique indexes, so we don't
expect such cases to be common, and as far as I can understand this can
happen in btree as well during a uniqueness check. The other possibility
could be that we do some sort of lock chaining, where we grab the lock
on the next bucket page before releasing the lock on the current one, as
we do during bucket cleanup, but I am not sure about that.
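For what it's worth, a sketch of that lock-chaining step (hand-over-hand
along the overflow chain); illustrative only, and whether the lock ordering
is actually safe against a concurrent squeeze or split would still need
analysis:

/*
 * Illustrative only: take the next overflow page's lock before letting
 * go of the current one, so no inserter can slip into the gap between
 * the two pages.
 */
static Buffer
hash_lock_chain_step(Relation rel, Buffer curbuf, BlockNumber nextblkno)
{
    Buffer      nextbuf;

    /* lock the next page first ... */
    nextbuf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);

    /* ... and only then release the current page */
    _hash_relbuf(rel, curbuf);

    return nextbuf;
}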

I haven't studied the patch in detail so it is better for Simon to
pitch in here to avoid any incorrect information or if he has a
different understanding/idea.

-- 
With Regards,
Amit Kapila.



Re: Next Steps with Hash Indexes

From
Dilip Kumar
Date:
On Fri, Aug 13, 2021 at 9:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 12, 2021 at 8:30 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Thu, Aug 12, 2021 at 12:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > The design of the patch has changed since the initial proposal. It
> > > tries to perform unique inserts by holding a write lock on the bucket
> > > page to avoid duplicate inserts.
> >
> > Do you mean that you're holding a buffer lwlock while you search the
> > whole bucket? If so, that's surely not OK.
> >
>
> I think here you are worried that after holding lwlock we might
> perform reads of overflow pages which is not a good idea. I think
> there are fewer chances of having overflow pages for unique indexes so
> we don't expect such cases in common

I think if we identify the bucket based on the hash value of all the
columns then there should be fewer overflow pages, but IIUC, in this
patch the bucket is identified based on the hash value of the first
column only, so there could be a lot of duplicates on the first column.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Next Steps with Hash Indexes

From
Sadhuprasad Patro
Date:
On Fri, Aug 13, 2021 at 11:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Aug 13, 2021 at 9:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Aug 12, 2021 at 8:30 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > On Thu, Aug 12, 2021 at 12:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > The design of the patch has changed since the initial proposal. It
> > > > tries to perform unique inserts by holding a write lock on the bucket
> > > > page to avoid duplicate inserts.
> > >
> > > Do you mean that you're holding a buffer lwlock while you search the
> > > whole bucket? If so, that's surely not OK.
> > >
> >
> > I think here you are worried that after holding lwlock we might
> > perform reads of overflow pages which is not a good idea. I think
> > there are fewer chances of having overflow pages for unique indexes so
> > we don't expect such cases in common
>
> I think if we identify the bucket based on the hash value of all the
> columns then there should be a fewer overflow bucket, but IIUC, in
> this patch bucket, is identified based on the hash value of the first
> column only so there could be a lot of duplicates on the first column.


IMHO, as discussed above, other databases also have the limitation
that if you create a multi-column hash index then the index cannot be
used unless all the key columns appear in the search condition. So my
point is that users might already be using hash indexes with this
limitation; their use case might be that they want the best performance
in this particular case, and they might not be looking for as much
flexibility as we provide in BTREE.

For reference:
https://dev.mysql.com/doc/refman/8.0/en/index-btree-hash.html

https://docs.microsoft.com/en-us/sql/relational-databases/in-memory-oltp/indexes-for-memory-optimized-tables?view=sql-server-ver15

We already know that performance will be better with a single hash for
multiple columns, but I still wanted to check the performance difference
in PG, as this might help us decide which approach to go for. With a
quick POC of both ideas, I have observed that there is a major
performance advantage with a single combined hash for multiple key
columns.

Performance Test Details: (Used PGBENCH Tool)
            Initialize cmd: “./pgbench -i -s 100 -d postgres"

postgres=# \d+ pgbench_accounts
                                        Table "public.pgbench_accounts"
  Column  |     Type      | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
----------+---------------+-----------+----------+---------+----------+-------------+--------------+-------------
 aid      | integer       |           | not null |         | plain    |             |              |
 bid      | integer       |           |          |         | plain    |             |              |
 abalance | integer       |           |          |         | plain    |             |              |
 filler   | character(84) |           |          |         | extended |             |              |
Indexes:
    "pgbench_accounts_idx" hash (aid, bid)
Access method: heap
Options: fillfactor=100

Test Command: “./pgbench -j 1 postgres -C -M prepared -S -T 300”

Performance Test Results:
Idea-1: Single Hash value for multiple key columns
                 TPS = ~292

Idea-2: Separate Hash values for each key column. But use only the
first one to search the bucket. Other hash values are used as payload
to get to the matching tuple before going to the heap.
                 TPS = ~212

Note: Here we get nearly 25% better performance with the single
combined hash approach, with only TWO key columns. If we go for separate
hash values for all the key columns then there will be a performance
dip, and storage will also be relatively higher when we have more key
columns.

I have just done separate POC patches to get the performance results
mentioned above; there are many other scenarios, like the split case,
still to be taken care of. I am attaching the POC patches here just for
reference…


Thanks & Regards
SadhuPrasad
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Next Steps with Hash Indexes

From
Amit Kapila
Date:
On Fri, Aug 27, 2021 at 4:27 PM Sadhuprasad Patro <b.sadhu@gmail.com> wrote:
>
> IMHO, as discussed above, since other databases also have the
> limitation that if you create a multi-column hash index then the hash
> index can not be used until all the key columns are used in the search
> condition. So my point is that users might be using the hash index
> with this limitation and their use case might be that they want to
> gain the best performance when they use this particular case and they
> might not be looking for much flexibility like we provide in BTREE.
>
> For reference:
> https://dev.mysql.com/doc/refman/8.0/en/index-btree-hash.html
>
https://docs.microsoft.com/en-us/sql/relational-databases/in-memory-oltp/indexes-for-memory-optimized-tables?view=sql-server-ver15
>
> We already know that performance will be better with a single hash for
> multiple columns, but still I just wanted to check the performance
> difference in PG. This might help us to decide the approach we need to
> go for. With a quick POC of both the ideas, I have observed there is a
> major performance advantage with single combined hash for multi-key
> columns.
>
> Performance Test Details: (Used PGBENCH Tool)
>             Initialize cmd: “./pgbench -i -s 100 -d postgres"
>
> postgres=# \d+ pgbench_accounts
>                                         Table "public.pgbench_accounts"
>   Column  |     Type      | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
> ----------+---------------+-----------+----------+---------+----------+-------------+--------------+-------------
>  aid      | integer       |           | not null |         | plain    |             |              |
>  bid      | integer       |           |          |         | plain    |             |              |
>  abalance | integer       |           |          |         | plain    |             |              |
>  filler   | character(84) |           |          |         | extended |             |              |
> Indexes:
>     "pgbench_accounts_idx" hash (aid, bid)
> Access method: heap
> Options: fillfactor=100
>
> Test Command: “./pgbench -j 1 postgres -C -M prepared -S -T 300”
>
> Performance Test Results:
> Idea-1: Single Hash value for multiple key columns
>                  TPS = ~292
>
> Idea-2: Separate Hash values for each key column. But use only the
> first one to search the bucket. Other hash values are used as payload
> to get to the matching tuple before going to the heap.
>                  TPS = ~212
>
> Note: Here we got near to 25% better performance in a single combine
> hash approach with only TWO key columns.
>

That's a significant difference. Have you checked via perf or some
other way what causes this difference? I have seen that sometimes
single client performance with pgbench is not stable, so can you
please once check with 4 clients or so and possibly with a larger
dataset as well.

One more thing to consider is that it seems that the planner requires
a condition for the first column of an index before considering an
indexscan plan. See Tom's email [1] in this regard. I think it would
be better to see what kind of work is involved there if you want to
explore a single hash value for all columns idea.

[1] - https://www.postgresql.org/message-id/29263.1506483172%40sss.pgh.pa.us

--
With Regards,
Amit Kapila.



Re: Next Steps with Hash Indexes

From
Sadhuprasad Patro
Date:
>
> That's a significant difference. Have you checked via perf or some
> other way what causes this difference? I have seen that sometimes
> single client performance with pgbench is not stable, so can you
> please once check with 4 clients or so and possibly with a larger
> dataset as well.

I have also verified this manually, without the pgbench tool. I can see
a significant difference for each query fired against both versions of
the patch. As shown below, I have run the SAME queries on the SAME
dataset with both patches, and there is a significant performance impact
with separate hash values for multiple key columns.

SingleHash_MultiColumn:
postgres=# create table perftest(a int, b int, c int, d int, e int, f int);
CREATE TABLE

postgres=# insert into perftest values (generate_series(1, 10000000),
generate_series(1, 10000000), generate_series(1, 10000000), 9, 7);
INSERT 0 10000000

postgres=# create index idx on perftest using hash(a, b, c);
CREATE INDEX

postgres=# select * from perftest where a=5999 and b=5999 and c=5999;
  a   |  b   |  c   | d | e | f
------+------+------+---+---+---
 5999 | 5999 | 5999 | 9 | 7 |
(1 row)
Time: 2.022 ms

postgres=# select * from perftest where a=597989 and b=597989 and c=597989;
   a    |   b    |   c    | d | e | f
--------+--------+--------+---+---+---
 597989 | 597989 | 597989 | 9 | 7 |
(1 row)
Time: 0.867 ms

postgres=# select * from perftest where a=6297989 and b=6297989 and c=6297989;
    a    |    b    |    c    | d | e | f
---------+---------+---------+---+---+---
 6297989 | 6297989 | 6297989 | 9 | 7 |
(1 row)
Time: 1.439 ms

postgres=# select * from perftest where a=6290798 and b=6290798 and c=6290798;
    a    |    b    |    c    | d | e | f
---------+---------+---------+---+---+---
 6290798 | 6290798 | 6290798 | 9 | 7 |
(1 row)
Time: 1.013 ms

postgres=# select * from perftest where a=6290791 and b=6290791 and c=6290791;
    a    |    b    |    c    | d | e | f
---------+---------+---------+---+---+---
 6290791 | 6290791 | 6290791 | 9 | 7 |
(1 row)
Time: 0.903 ms

postgres=# select * from perftest where a=62907 and b=62907 and c=62907;
   a   |   b   |   c   | d | e | f
-------+-------+-------+---+---+---
 62907 | 62907 | 62907 | 9 | 7 |
(1 row)
Time: 0.894 ms

SeparateHash_MultiColumn:
postgres=# create table perftest(a int, b int, c int, d int, e int, f int);
CREATE TABLE

postgres=# insert into perftest values (generate_series(1, 10000000),
generate_series(1, 10000000), generate_series(1, 10000000), 9, 7);
INSERT 0 10000000

postgres=# create index idx on perftest using hash(a, b, c);
CREATE INDEX

postgres=# select * from perftest where a=5999 and b=5999 and c=5999;
  a   |  b   |  c   | d | e | f
------+------+------+---+---+---
 5999 | 5999 | 5999 | 9 | 7 |
(1 row)
Time: 2.915 ms

postgres=# select * from perftest where a=597989 and b=597989 and c=597989;
   a    |   b    |   c    | d | e | f
--------+--------+--------+---+---+---
 597989 | 597989 | 597989 | 9 | 7 |
(1 row)
Time: 1.129 ms

postgres=# select * from perftest where a=6297989 and b=6297989 and c=6297989;
    a    |    b    |    c    | d | e | f
---------+---------+---------+---+---+---
 6297989 | 6297989 | 6297989 | 9 | 7 |
(1 row)
Time: 2.454 ms

postgres=# select * from perftest where a=6290798 and b=6290798 and c=6290798;
    a    |    b    |    c    | d | e | f
---------+---------+---------+---+---+---
 6290798 | 6290798 | 6290798 | 9 | 7 |
(1 row)
Time: 2.327 ms

postgres=# select * from perftest where a=6290791 and b=6290791 and c=6290791;
    a    |    b    |    c    | d | e | f
---------+---------+---------+---+---+---
 6290791 | 6290791 | 6290791 | 9 | 7 |
(1 row)
Time: 1.676 ms

postgres=# select * from perftest where a=62907 and b=62907 and c=62907;
   a   |   b   |   c   | d | e | f
-------+-------+-------+---+---+---
 62907 | 62907 | 62907 | 9 | 7 |
(1 row)
Time: 2.614 ms

If I run the test with 4 clients, there is not much visible
difference; I think that is because of contention. And here our focus
is single-thread, single-operation performance.

>
> One more thing to consider is that it seems that the planner requires
> a condition for the first column of an index before considering an
> indexscan plan. See Tom's email [1] in this regard. I think it would
> be better to see what kind of work is involved there if you want to
> explore a single hash value for all columns idea.
>
> [1] - https://www.postgresql.org/message-id/29263.1506483172%40sss.pgh.pa.us

About this point, I will analyze further and update.

Thanks & Regards
SadhuPrasad
EnterpriseDB: http://www.enterprisedb.com



Re: Next Steps with Hash Indexes

From
Sadhuprasad Patro
Date:
> > One more thing to consider is that it seems that the planner requires
> > a condition for the first column of an index before considering an
> > indexscan plan. See Tom's email [1] in this regard. I think it would
> > be better to see what kind of work is involved there if you want to
> > explore a single hash value for all columns idea.
> >
> > [1] - https://www.postgresql.org/message-id/29263.1506483172%40sss.pgh.pa.us
>
> About this point, I will analyze further and update.
>

I have checked the planner code; there do not seem to be any
complicated changes needed if we take up a single hash value for all
columns... Below are the major changes needed:

In build_index_paths(), there is a check, "if (index_clauses ==
NIL && !index->amoptionalkey)", which detects whether the leading
column has any clause. This needs to be moved out of the loop and
changed to check for clauses on all key columns. Along with this we
need to add an "amallcolumncluse" field to the index structure, which
will be set to TRUE for hash indexes and FALSE in other cases.

And to get the multi-column hash index selected, we may set
enable_hashjoin = off to avoid any condition becoming a join
condition; I have seen similar behavior in other DBs as well...

Thanks & Regards
SadhuPrasad
EnterpriseDB: http://www.enterprisedb.com



Re: Next Steps with Hash Indexes

From
Dilip Kumar
Date:
On Thu, Sep 23, 2021 at 10:04 AM Sadhuprasad Patro <b.sadhu@gmail.com> wrote:
>
> > > One more thing to consider is that it seems that the planner requires
> > > a condition for the first column of an index before considering an
> > > indexscan plan. See Tom's email [1] in this regard. I think it would
> > > be better to see what kind of work is involved there if you want to
> > > explore a single hash value for all columns idea.
> > >
> > > [1] - https://www.postgresql.org/message-id/29263.1506483172%40sss.pgh.pa.us
> >
> > About this point, I will analyze further and update.
> >
>
> I have checked the planner code, there does not seem to be any
> complicated changes needed to cover if we take up a single hash value
> for all columns... Below are the major part of changes needed:
>
> In build_index_paths(), there is a check like, "if (index_clauses ==
> NIL && !index->amoptionalkey)", which helps to figure out if the
> leading column has any clause or not. This needs to be moved out of
> the loop and check for clauses on all key columns.
> With this we need to add a "amallcolumncluse" field to Index
> structure, which will be set to TRUE for HASH index and FALSE in other
> cases.

Right, we can add an AM-level option and, based on that, decide
whether to select an index scan when conditions are not given for all
the key columns.  And the changes don't look that complicated.

>
> And to get the multi-column hash index selected, we may set
> enable_hashjoin =off, to avoid any condition become join condition,
> saw similar behaviors in other DBs as well...

This may be related to Tom's point: if some of the quals are removed
due to optimization or converted to join quals, then even if the user
has given quals on all the key columns, the index scan will not be
selected, because we would be enforcing that the hash index can only
be chosen when it has quals on all the key attributes.

I don't think suggesting enable_hashjoin = off is a solution; the same
thing can happen with a merge join, or with a nested loop join over a
materialize node. In all such cases the join filter cannot be pushed
down to the inner node, because the outer node will not start to scan
until we materialize/sort/hash the inner node.  But if we test this
behavior in other databases as well, and it appears that this is how
hash indexes are generally used, then maybe this behavior can be
documented.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Next Steps with Hash Indexes

From
Amit Kapila
Date:
On Thu, Sep 23, 2021 at 11:11 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Sep 23, 2021 at 10:04 AM Sadhuprasad Patro <b.sadhu@gmail.com> wrote:
> >
> > And to get the multi-column hash index selected, we may set
> > enable_hashjoin =off, to avoid any condition become join condition,
> > saw similar behaviors in other DBs as well...
>
> This may be related to Tom's point that, if some of the quals are
> removed due to optimization or converted to join quals, then now, even
> if the user has given qual on all the key columns the index scan will
> not be selected because we will be forcing that the hash index can
> only be selected if it has quals on all the key attributes?
>
> I don't think suggesting enable_hashjoin =off is a solution,
>

Yeah, this doesn't sound like a good idea. How about instead exploring
the idea where the hash (bucket assignment and search) is based on the
first index key and the other columns are stored as payload? I think
this might pose some difficulty in the follow-on patch to enable a
unique index because it will increase the chance of traversing more
buckets for uniqueness checks. If we see such problems, then I have
another idea to minimize the number of buckets that we need to lock
during the uniqueness check: lock chaining, as used during hash bucket
cleanup, where we never need to lock more than two buckets at a time.

-- 
With Regards,
Amit Kapila.



Re: Next Steps with Hash Indexes

From
Simon Riggs
Date:
On Mon, 27 Sept 2021 at 06:52, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Sep 23, 2021 at 11:11 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Sep 23, 2021 at 10:04 AM Sadhuprasad Patro <b.sadhu@gmail.com> wrote:
> > >
> > > And to get the multi-column hash index selected, we may set
> > > enable_hashjoin =off, to avoid any condition become join condition,
> > > saw similar behaviors in other DBs as well...
> >
> > This may be related to Tom's point that, if some of the quals are
> > removed due to optimization or converted to join quals, then now, even
> > if the user has given qual on all the key columns the index scan will
> > not be selected because we will be forcing that the hash index can
> > only be selected if it has quals on all the key attributes?
> >
> > I don't think suggesting enable_hashjoin =off is a solution,
> >
>
> Yeah, this doesn't sound like a good idea. How about instead try to
> explore the idea where the hash (bucket assignment and search) will be
> based on the first index key and the other columns will be stored as
> payload? I think this might pose some difficulty in the consecutive
> patch to enable a unique index because it will increase the chance of
> traversing more buckets for uniqueness checks. If we see such
> problems, then I have another idea to minimize the number of buckets
> that we need to lock during uniqueness check which is by lock chaining
> as is used during hash bucket clean up where at a time we don't need
> to lock more than two buckets at a time.

I have presented a simple, almost trivial, patch to allow multi-col
hash indexes. It hashes the first column only, which can be a downside
in *some* cases. If that is clearly documented, it would not cause
many issues, IMHO. However, it does not have any optimization issues
or complexities, which is surely a very good thing.

Trying to involve *all* columns in the hash index is a secondary
optimization. It requires subtle changes in optimizer code, as Tom
points out. It also needs fine tuning to make the all-column approach
beneficial for the additional cases without losing against what the
"first column" approach gives.

I did consider both approaches and after this discussion I am still in
favour of committing the very simple "first column" approach to
multi-col hash indexes now.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: Next Steps with Hash Indexes

From
Simon Riggs
Date:
On Fri, 13 Aug 2021 at 05:01, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 12, 2021 at 8:30 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Thu, Aug 12, 2021 at 12:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > The design of the patch has changed since the initial proposal. It
> > > tries to perform unique inserts by holding a write lock on the bucket
> > > page to avoid duplicate inserts.
> >
> > Do you mean that you're holding a buffer lwlock while you search the
> > whole bucket? If so, that's surely not OK.
> >
>
> I think here you are worried that after holding lwlock we might
> perform reads of overflow pages which is not a good idea. I think
> there are fewer chances of having overflow pages for unique indexes so
> we don't expect such cases in common and as far as I can understand
> this can happen in btree as well during uniqueness check. Now, I think
> the other possibility could be that we do some sort of lock chaining
> where we grab the lock of the next bucket before releasing the lock of
> the current bucket as we do during bucket clean up but not sure about
> the same.
>
> I haven't studied the patch in detail so it is better for Simon to
> pitch in here to avoid any incorrect information or if he has a
> different understanding/idea.

That is correct. After analysis of their behavior, I think further
simple work on hash indexes is worthwhile.

With unique data, starting at 1 and monotonically ascending, hash
indexes will grow very nicely from 0 to 10E7 rows without causing >1
overflow block to be allocated for any bucket. This keeps the search
time for such data to just 2 blocks (bucket plus, if present, 1
overflow block). The small number of overflow blocks is because of the
regular and smooth way that splits occur, which works very nicely
without significant extra latency.
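
For anyone who wants to check overflow page counts on their own data,
the pgstattuple extension gives a quick summary (a sketch, assuming a
hash index named "hashidx"):

    CREATE EXTENSION IF NOT EXISTS pgstattuple;

    SELECT bucket_pages, overflow_pages, bitmap_pages, live_items
    FROM pgstathashindex('hashidx');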

The probability of bucket collision while we hold the lock is fairly
low. This is because even with adjacent data values the hash values
would be spread across multiple buckets, so we would expect the
contention to be less than we would get on a monotonically increasing
btree.

So I don't now see any problem from holding the buffer lwlock on the
bucket while we do multi-buffer operations.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: Next Steps with Hash Indexes

From
Dilip Kumar
Date:
On Tue, Oct 5, 2021 at 4:08 PM Simon Riggs <simon.riggs@enterprisedb.com> wrote:
>
> On Mon, 27 Sept 2021 at 06:52, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Sep 23, 2021 at 11:11 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Thu, Sep 23, 2021 at 10:04 AM Sadhuprasad Patro <b.sadhu@gmail.com> wrote:
> > > >
> > > > And to get the multi-column hash index selected, we may set
> > > > enable_hashjoin =off, to avoid any condition become join condition,
> > > > saw similar behaviors in other DBs as well...
> > >
> > > This may be related to Tom's point that, if some of the quals are
> > > removed due to optimization or converted to join quals, then now, even
> > > if the user has given qual on all the key columns the index scan will
> > > not be selected because we will be forcing that the hash index can
> > > only be selected if it has quals on all the key attributes?
> > >
> > > I don't think suggesting enable_hashjoin =off is a solution,
> > >
> >
> > Yeah, this doesn't sound like a good idea. How about instead try to
> > explore the idea where the hash (bucket assignment and search) will be
> > based on the first index key and the other columns will be stored as
> > payload? I think this might pose some difficulty in the consecutive
> > patch to enable a unique index because it will increase the chance of
> > traversing more buckets for uniqueness checks. If we see such
> > problems, then I have another idea to minimize the number of buckets
> > that we need to lock during uniqueness check which is by lock chaining
> > as is used during hash bucket clean up where at a time we don't need
> > to lock more than two buckets at a time.
>
> I have presented a simple, almost trivial, patch to allow multi-col
> hash indexes. It hashes the first column only, which can be a downside
> in *some* cases. If that is clearly documented, it would not cause
> many issues, IMHO. However, it does not have any optimization issues
> or complexities, which is surely a very good thing.
>
> Trying to involve *all* columns in the hash index is a secondary
> optimization. It requires subtle changes in optimizer code, as Tom
> points out. It also needs fine tuning to make the all-column approach
> beneficial for the additional cases without losing against what the
> "first column" approach gives.
>
> I did consider both approaches and after this discussion I am still in
> favour of committing the very simple "first column" approach to
> multi-col hash indexes now.

But what about the other approach suggested by Tom: we hash only the
first column to identify the bucket, but we also store the hash values
of the other columns?  With that, we don't need changes in the
optimizer, and we can also avoid a lot of disk fetches, because after
finding the bucket we can match the secondary columns before fetching
the heap tuple.  I agree that one downside of this approach is that it
increases the index size.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Next Steps with Hash Indexes

From
Simon Riggs
Date:
On Tue, 5 Oct 2021 at 12:24, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Oct 5, 2021 at 4:08 PM Simon Riggs <simon.riggs@enterprisedb.com> wrote:
> >
> > On Mon, 27 Sept 2021 at 06:52, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Sep 23, 2021 at 11:11 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Thu, Sep 23, 2021 at 10:04 AM Sadhuprasad Patro <b.sadhu@gmail.com> wrote:
> > > > >
> > > > > And to get the multi-column hash index selected, we may set
> > > > > enable_hashjoin =off, to avoid any condition become join condition,
> > > > > saw similar behaviors in other DBs as well...
> > > >
> > > > This may be related to Tom's point that, if some of the quals are
> > > > removed due to optimization or converted to join quals, then now, even
> > > > if the user has given qual on all the key columns the index scan will
> > > > not be selected because we will be forcing that the hash index can
> > > > only be selected if it has quals on all the key attributes?
> > > >
> > > > I don't think suggesting enable_hashjoin =off is a solution,
> > > >
> > >
> > > Yeah, this doesn't sound like a good idea. How about instead try to
> > > explore the idea where the hash (bucket assignment and search) will be
> > > based on the first index key and the other columns will be stored as
> > > payload? I think this might pose some difficulty in the consecutive
> > > patch to enable a unique index because it will increase the chance of
> > > traversing more buckets for uniqueness checks. If we see such
> > > problems, then I have another idea to minimize the number of buckets
> > > that we need to lock during uniqueness check which is by lock chaining
> > > as is used during hash bucket clean up where at a time we don't need
> > > to lock more than two buckets at a time.
> >
> > I have presented a simple, almost trivial, patch to allow multi-col
> > hash indexes. It hashes the first column only, which can be a downside
> > in *some* cases. If that is clearly documented, it would not cause
> > many issues, IMHO. However, it does not have any optimization issues
> > or complexities, which is surely a very good thing.
> >
> > Trying to involve *all* columns in the hash index is a secondary
> > optimization. It requires subtle changes in optimizer code, as Tom
> > points out. It also needs fine tuning to make the all-column approach
> > beneficial for the additional cases without losing against what the
> > "first column" approach gives.
> >
> > I did consider both approaches and after this discussion I am still in
> > favour of committing the very simple "first column" approach to
> > multi-col hash indexes now.
>
> But what about the other approach suggested by Tom, basically we hash
> only based on the first column for identifying the bucket, but we also
> store the hash value for other columns?  With that, we don't need
> changes in the optimizer and we can also avoid a lot of disk fetches
> because after finding the bucket we can match the secondary columns
> before fetching the disk tuple.  I agree, one downside with this
> approach is we will increase the index size.

Identifying the bucket is the main part of a hash index's work, so
that part would be identical.

Once we have identified the bucket, we sort the bucket page by hash,
so having an all-col hash would help de-duplicate multi-col hash
collisions, but not enough to be worth it, IMHO, given that storing an
extra 4 bytes per index tuple is a significant size increase which
would cause extra overflow pages, etc. The same thought applies to
8-byte hashes.

IMHO, multi-column hash collisions are a secondary issue, given that
we can already select the column order for an index and hash indexes
would only be used by explicit user choice. If there are some minor
sub-optimal aspects of using hash indexes, btree is still the default
and a great choice for many cases.

If btree didn't already exist I would care more about making hash
indexes perfect. I just want to make them usable.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: Next Steps with Hash Indexes

From
Tomas Vondra
Date:

On 10/5/21 18:28, Simon Riggs wrote:
> On Tue, 5 Oct 2021 at 12:24, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> On Tue, Oct 5, 2021 at 4:08 PM Simon Riggs <simon.riggs@enterprisedb.com> wrote:
>>>
>>> On Mon, 27 Sept 2021 at 06:52, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>>
>>>> On Thu, Sep 23, 2021 at 11:11 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>>>>
>>>>> On Thu, Sep 23, 2021 at 10:04 AM Sadhuprasad Patro <b.sadhu@gmail.com> wrote:
>>>>>>
>>>>>> And to get the multi-column hash index selected, we may set
>>>>>> enable_hashjoin =off, to avoid any condition become join condition,
>>>>>> saw similar behaviors in other DBs as well...
>>>>>
>>>>> This may be related to Tom's point that, if some of the quals are
>>>>> removed due to optimization or converted to join quals, then now, even
>>>>> if the user has given qual on all the key columns the index scan will
>>>>> not be selected because we will be forcing that the hash index can
>>>>> only be selected if it has quals on all the key attributes?
>>>>>
>>>>> I don't think suggesting enable_hashjoin =off is a solution,
>>>>>
>>>>
>>>> Yeah, this doesn't sound like a good idea. How about instead try to
>>>> explore the idea where the hash (bucket assignment and search) will be
>>>> based on the first index key and the other columns will be stored as
>>>> payload? I think this might pose some difficulty in the consecutive
>>>> patch to enable a unique index because it will increase the chance of
>>>> traversing more buckets for uniqueness checks. If we see such
>>>> problems, then I have another idea to minimize the number of buckets
>>>> that we need to lock during uniqueness check which is by lock chaining
>>>> as is used during hash bucket clean up where at a time we don't need
>>>> to lock more than two buckets at a time.
>>>
>>> I have presented a simple, almost trivial, patch to allow multi-col
>>> hash indexes. It hashes the first column only, which can be a downside
>>> in *some* cases. If that is clearly documented, it would not cause
>>> many issues, IMHO. However, it does not have any optimization issues
>>> or complexities, which is surely a very good thing.
>>>
>>> Trying to involve *all* columns in the hash index is a secondary
>>> optimization. It requires subtle changes in optimizer code, as Tom
>>> points out. It also needs fine tuning to make the all-column approach
>>> beneficial for the additional cases without losing against what the
>>> "first column" approach gives.
>>>
>>> I did consider both approaches and after this discussion I am still in
>>> favour of committing the very simple "first column" approach to
>>> multi-col hash indexes now.
>>
>> But what about the other approach suggested by Tom, basically we hash
>> only based on the first column for identifying the bucket, but we also
>> store the hash value for other columns?  With that, we don't need
>> changes in the optimizer and we can also avoid a lot of disk fetches
>> because after finding the bucket we can match the secondary columns
>> before fetching the disk tuple.  I agree, one downside with this
>> approach is we will increase the index size.
> 
> Identifying the bucket is the main part of a hash index's work, so
> that part would be identical.
> 
> Once we have identified the bucket, we sort the bucket page by hash,
> so having an all-col hash would help de-duplicate multi-col hash
> collisions, but not enough to be worth it, IMHO, given that storing an
> extra 4 bytes per index tuple is a significant size increase which
> would cause extra overflow pages etc.. The same thought applies to
> 8-byte hashes.
> 

IMO it'd be nice to show some numbers to support the claims that storing 
the extra hashes and/or 8B hashes is not worth it ...

I'm sure there are cases where it'd be a net loss, but imagine for 
example a case when the first column has a lot of duplicate values. 
Which is not all that unlikely - duplicates seem like one of the natural 
reasons why people want multi-column hash indexes. And those duplicates 
are quite expensive, due to having to access the heap. Being able to 
eliminate those extra accesses cheaply might be a clear win, even if it 
makes the index a bit larger (shorter hashes might be enough).


> IMHO, multi-column hash collisions are a secondary issue, given that
> we can already select the column order for an index and hash indexes
> would only be used by explicit user choice. If there are some minor
> sub-optimal aspects of using hash indexes, then btree was already the
> default and a great choice for many cases.
> 

But we can't select arbitrary column order, because the first column is 
used to select the bucket. Which means it's also about what columns are 
used by the queries. If the query is not using the first column, it 
can't use the index.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Next Steps with Hash Indexes

From
Simon Riggs
Date:
On Tue, 5 Oct 2021 at 20:06, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

> >>> I have presented a simple, almost trivial, patch to allow multi-col
> >>> hash indexes. It hashes the first column only, which can be a downside
> >>> in *some* cases. If that is clearly documented, it would not cause
> >>> many issues, IMHO. However, it does not have any optimization issues
> >>> or complexities, which is surely a very good thing.
> >>>
> >>> Trying to involve *all* columns in the hash index is a secondary
> >>> optimization. It requires subtle changes in optimizer code, as Tom
> >>> points out. It also needs fine tuning to make the all-column approach
> >>> beneficial for the additional cases without losing against what the
> >>> "first column" approach gives.
> >>>
> >>> I did consider both approaches and after this discussion I am still in
> >>> favour of committing the very simple "first column" approach to
> >>> multi-col hash indexes now.
> >>
> >> But what about the other approach suggested by Tom, basically we hash
> >> only based on the first column for identifying the bucket, but we also
> >> store the hash value for other columns?  With that, we don't need
> >> changes in the optimizer and we can also avoid a lot of disk fetches
> >> because after finding the bucket we can match the secondary columns
> >> before fetching the disk tuple.  I agree, one downside with this
> >> approach is we will increase the index size.
> >
> > Identifying the bucket is the main part of a hash index's work, so
> > that part would be identical.
> >
> > Once we have identified the bucket, we sort the bucket page by hash,
> > so having an all-col hash would help de-duplicate multi-col hash
> > collisions, but not enough to be worth it, IMHO, given that storing an
> > extra 4 bytes per index tuple is a significant size increase which
> > would cause extra overflow pages etc.. The same thought applies to
> > 8-byte hashes.
> >
>
> IMO it'd be nice to show some numbers to support the claims that storing
> the extra hashes and/or 8B hashes is not worth it ...

Using an 8-byte hash is possible, but it only becomes effective when
4-byte hash collisions get hard to manage. An 8-byte hash also makes
the index 20% bigger, so it is not a good default.

Let's look at the distribution of values:

In a table with 100 million rows, with consecutive monotonic values,
starting at 1
No Collisions - 98.8%
1 Collision   - 1.15%
2+ Collisions - 0.009% (8979 values colliding)
Max=4

In a table with 1 billion rows (2^30), with consecutive monotonic
values, starting at 1
No Collisions - 89.3%
1 Collision   - 9.8%
2 Collisions - 0.837%
3+ Collisions - 0.0573% (615523 values colliding)
Max=9

At 100 million rows, the collisions from a 4-byte hash are not
painful, but by a billion rows they are starting to become a problem,
and by 2 billion rows we have a noticeable issue (>18% collisions).

Clearly, 8-byte hash values would be appropriate for tables larger
than this. However, we expect users to know about and to use
partitioning, with reasonable limits somewhere in the 100 million row
(at 100 bytes per row, 10GB) to 1 billion row (at 100 bytes per row,
100GB) range.

The change from 4 to 8 byte hashes seems simple, so I am not against
it for that reason. IMHO there is no use case for 8-byte hashes since
reasonable users would not have tables big enough to care.

That is my reasoning, YMMV.

> I'm sure there are cases where it'd be a net loss, but imagine for
> example a case when the first column has a lot of duplicate values.
> Which is not all that unlikely - duplicates seem like one of the natural
> reasons why people want multi-column hash indexes. And those duplicates
> are quite expensive, due to having to access the heap. Being able to
> eliminate those extra accesses cheaply might be a clear win, even if it
> makes the index a bit larger (shorter hashes might be enough).

I agree, eliminating duplicates would be a good thing, if that is possible.

However, hashing on multiple columns doesn't eliminate duplicates; we
can still get them from different combinations of values.

With a first-column hash then (1,1) and (1,2) collide.
But with an all-column hash then (1,2) and (2,1) collide.
So we can still end up with collisions and this depends on the data
values and types.
We can all come up with pessimal use cases.

Perhaps it would be possible to specify a parameter that says how many
columns in the index are part of the hash? Not sure how easy that is.

If you have a situation with lots of duplicates, then use btrees
instead. We shouldn't have to make hash indexes work well for *all*
cases before we allow multiple columns for some cases. The user will
already get to compare btree-vs-hash before they use them in a
particular use case. The perfect should not be the enemy of the good.

Storing multiple hashes uses more space and is more complex. It
doesn't feel like a good solution to me, given the purpose of an index
is not completeness but optimality.
Storing 2 4-byte hashes uses 20% more space than one 4-byte hash.
Storing 2 8-byte hashes uses 40% more space than one 4-byte hash.

> > IMHO, multi-column hash collisions are a secondary issue, given that
> > we can already select the column order for an index and hash indexes
> > would only be used by explicit user choice. If there are some minor
> > sub-optimal aspects of using hash indexes, then btree was already the
> > default and a great choice for many cases.
> >
>
> But we can't select arbitrary column order, because the first column is
> used to select the bucket. Which means it's also about what columns are
> used by the queries. If the query is not using the first column, it
> can't use the index.

Neither approach works in that case, so it is moot: you cannot use a
first-column hash, nor an all-column hash.


--
Simon Riggs                http://www.EnterpriseDB.com/



Re: Next Steps with Hash Indexes

From
Peter Geoghegan
Date:
On Wed, Oct 13, 2021 at 3:44 AM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
> > IMO it'd be nice to show some numbers to support the claims that storing
> > the extra hashes and/or 8B hashes is not worth it ...
>
> Using an 8-byte hash is possible, but only becomes effective when
> 4-byte hash collisions get hard to manage. 8-byte hash also makes the
> index 20% bigger, so it is not a good default.

Are you sure? I know that nbtree index tuples for a single-column int8
index are exactly the same size as those from a single column int4
index, due to alignment overhead at the tuple level. So my guess is
that hash index tuples (which use the same basic IndexTuple
representation) work in the same way.

-- 
Peter Geoghegan



Re: Next Steps with Hash Indexes

From
Peter Geoghegan
Date:
On Wed, Oct 13, 2021 at 12:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Are you sure? I know that nbtree index tuples for a single-column int8
> index are exactly the same size as those from a single column int4
> index, due to alignment overhead at the tuple level. So my guess is
> that hash index tuples (which use the same basic IndexTuple
> representation) work in the same way.

I'm assuming a 64-bit architecture here, by the way. That assumption
shouldn't matter, since of course approximately 100% of all computers
that run Postgres are 64-bit these days.

-- 
Peter Geoghegan



Re: Next Steps with Hash Indexes

From
Simon Riggs
Date:
On Wed, 13 Oct 2021 at 20:16, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Oct 13, 2021 at 3:44 AM Simon Riggs
> <simon.riggs@enterprisedb.com> wrote:
> > > IMO it'd be nice to show some numbers to support the claims that storing
> > > the extra hashes and/or 8B hashes is not worth it ...
> >
> > Using an 8-byte hash is possible, but only becomes effective when
> > 4-byte hash collisions get hard to manage. 8-byte hash also makes the
> > index 20% bigger, so it is not a good default.
>
> Are you sure? I know that nbtree index tuples for a single-column int8
> index are exactly the same size as those from a single column int4
> index, due to alignment overhead at the tuple level. So my guess is
> that hash index tuples (which use the same basic IndexTuple
> representation) work in the same way.

The hash index tuples are 20 bytes each. If that were rounded up to
8-byte alignment, it would be 24 bytes.

Using pageinspect, the max(live_items) on any data page (bucket or
overflow) is 407 items, so they can't be 24 bytes long.
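
Something along these lines can be used to check that (a sketch,
assuming the pageinspect extension, an index named "idx" and the
default 8kB block size):

    CREATE EXTENSION IF NOT EXISTS pageinspect;

    WITH pages AS (
        SELECT get_raw_page('idx', blkno) AS page
        FROM generate_series(1, (pg_relation_size('idx') / 8192 - 1)::int) AS blkno
    )
    SELECT max((hash_page_stats(page)).live_items) AS max_live_items
    FROM pages
    WHERE hash_page_type(page) IN ('bucket', 'overflow');
    -- block 0 (the metapage) is skipped above; bitmap pages are filtered out here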


Another stat of interest: the current bucket design / page splitting
is very effective at maintaining the distribution. On a hash index for
a table with 2 billion rows, with integer values from 1 to 2 billion,
there are 3670016 bucket pages and 524286 overflow pages, distributed
so that 87.5% of buckets have no overflow pages and 12.5% of buckets
have only one overflow page; there are no buckets with >1 overflow
page. The most heavily populated overflow page has 209 items.

The CREATE INDEX time is fairly poor at present. That can be optimized
easily enough, but I expect to do it after uniqueness is added, since
doing that work in a different order would complicate the code.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: Next Steps with Hash Indexes

From
Peter Geoghegan
Date:
On Thu, Oct 14, 2021 at 12:48 AM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
> The hash index tuples are 20-bytes each. If that were rounded up to
> 8-byte alignment, then that would be 24 bytes.
>
> Using pageinspect, the max(live_items) on any data page (bucket or
> overflow) is 407 items, so they can't be 24 bytes long.

That's the same as an nbtree page, which confirms my suspicion. The 20
bytes consist of a 16-byte tuple plus a 4-byte line pointer. The
tuple-level alignment overhead gets you from 12 bytes to 16 bytes with
a single int4 column. So the padding is there for the taking.
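
A quick way to see that for the nbtree case with pageinspect (a
sketch; i4/i8 are throwaway single-column tables):

    CREATE EXTENSION IF NOT EXISTS pageinspect;

    CREATE TABLE i4 (a int4);
    CREATE TABLE i8 (a int8);
    INSERT INTO i4 VALUES (1);
    INSERT INTO i8 VALUES (1);
    CREATE INDEX i4_idx ON i4 (a);
    CREATE INDEX i8_idx ON i8 (a);

    SELECT itemlen FROM bt_page_items('i4_idx', 1);  -- 16 on a 64-bit build
    SELECT itemlen FROM bt_page_items('i8_idx', 1);  -- also 16: the wider key fits in the padding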

-- 
Peter Geoghegan



Re: Next Steps with Hash Indexes

From
Simon Riggs
Date:
On Thu, 14 Oct 2021 at 16:09, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Thu, Oct 14, 2021 at 12:48 AM Simon Riggs
> <simon.riggs@enterprisedb.com> wrote:
> > The hash index tuples are 20-bytes each. If that were rounded up to
> > 8-byte alignment, then that would be 24 bytes.
> >
> > Using pageinspect, the max(live_items) on any data page (bucket or
> > overflow) is 407 items, so they can't be 24 bytes long.
>
> That's the same as an nbtree page, which confirms my suspicion. The 20
> bytes consists of a 16 byte tuple, plus a 4 byte line pointer. The
> tuple-level alignment overhead gets you from 12 bytes to 16 bytes with
> a single int4 column. So the padding is there for the taking.

Thank you for nudging me to review the tuple length.

Since hash indexes never store Nulls, and the hash is always fixed
length, ISTM that we can compress the hash index entries down to
ItemPointerData (6 bytes) plus any hashes.

That doesn't change any arguments about size differences between
approaches, but we can significantly reduce index size (by up to 50%).

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: Next Steps with Hash Indexes

From
Amit Kapila
Date:
On Wed, Oct 13, 2021 at 4:13 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
>
> On Tue, 5 Oct 2021 at 20:06, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> > >>> I have presented a simple, almost trivial, patch to allow multi-col
> > >>> hash indexes. It hashes the first column only, which can be a downside
> > >>> in *some* cases. If that is clearly documented, it would not cause
> > >>> many issues, IMHO. However, it does not have any optimization issues
> > >>> or complexities, which is surely a very good thing.
> > >>>
> > >>> Trying to involve *all* columns in the hash index is a secondary
> > >>> optimization. It requires subtle changes in optimizer code, as Tom
> > >>> points out. It also needs fine tuning to make the all-column approach
> > >>> beneficial for the additional cases without losing against what the
> > >>> "first column" approach gives.
> > >>>
> > >>> I did consider both approaches and after this discussion I am still in
> > >>> favour of committing the very simple "first column" approach to
> > >>> multi-col hash indexes now.
> > >>
> > >> But what about the other approach suggested by Tom, basically we hash
> > >> only based on the first column for identifying the bucket, but we also
> > >> store the hash value for other columns?  With that, we don't need
> > >> changes in the optimizer and we can also avoid a lot of disk fetches
> > >> because after finding the bucket we can match the secondary columns
> > >> before fetching the disk tuple.  I agree, one downside with this
> > >> approach is we will increase the index size.
> > >
> > > Identifying the bucket is the main part of a hash index's work, so
> > > that part would be identical.
> > >
> > > Once we have identified the bucket, we sort the bucket page by hash,
> > > so having an all-col hash would help de-duplicate multi-col hash
> > > collisions, but not enough to be worth it, IMHO, given that storing an
> > > extra 4 bytes per index tuple is a significant size increase which
> > > would cause extra overflow pages etc.. The same thought applies to
> > > 8-byte hashes.
> > >
> >
> > IMO it'd be nice to show some numbers to support the claims that storing
> > the extra hashes and/or 8B hashes is not worth it ...
>
> Using an 8-byte hash is possible, but only becomes effective when
> 4-byte hash collisions get hard to manage. 8-byte hash also makes the
> index 20% bigger, so it is not a good default.
>
> Let's look at the distribution of values:
>
> In a table with 100 million rows, with consecutive monotonic values,
> starting at 1
> No Collisions - 98.8%
> 1 Collision   - 1.15%
> 2+ Collisions - 0.009% (8979 values colliding)
> Max=4
>
> In a table with 1 billion rows (2^30), with consecutive monotonic
> values, starting at 1
> No Collisions - 89.3%
> 1 Collision   - 9.8%
> 2 Collisions - 0.837%
> 3+ Collisions - 0.0573% (615523 values colliding)
> Max=9
>
> At 100 million rows, the collisions from a 4-byte hash are not
> painful, but by a billion rows they are starting to become a problem,
> and by 2 billion rows we have a noticeable issue (>18% collisions).
>
> Clearly, 8-byte hash values would be appropriate for tables larger
> than this. However, we expect users to know about and to use
> partitioning, with reasonable limits somewhere in the 100 million  row
> (with 100 byte rows, 10GB) to 1 billion row (with 100 byte rows,
> 100GB) range.
>
> The change from 4 to 8 byte hashes seems simple, so I am not against
> it for that reason. IMHO there is no use case for 8-byte hashes since
> reasonable users would not have tables big enough to care.
>
> That is my reasoning, YMMV.
>
> > I'm sure there are cases where it'd be a net loss, but imagine for
> > example a case when the first column has a lot of duplicate values.
> > Which is not all that unlikely - duplicates seem like one of the natural
> > reasons why people want multi-column hash indexes. And those duplicates
> > are quite expensive, due to having to access the heap. Being able to
> > eliminate those extra accesses cheaply might be a clear win, even if it
> > makes the index a bit larger (shorter hashes might be enough).
>
> I agree, eliminating duplicates would be a good thing, if that is possible.
>
> However, hashing on multiple columns doesn't eliminate duplicates, we
> can still get them from different combinations of rows.
>
> With a first-column hash then (1,1) and (1,2) collide.
> But with an all-column hash then (1,2) and (2,1) collide.
> So we can still end up with collisions and this depends on the data
> values and types.
>

I don't think this will happen if we use the first column as the
bucket identifier and store the other columns as payload.

> We can all come up with pessimal use cases.
>
> Perhaps it would be possible to specify a parameter that says how many
> columns in the index are part of the hash? Not sure how easy that is.
>
> If you have a situation with lots of duplicates, then use btrees
> instead. We shouldn't have to make hash indexes work well for *all*
> cases before we allow multiple columns for some cases. The user will
> already get to compare btree-vs-hash before they use them in a
> particular use case. The perfect should not be the enemy of the good.
>
> Storing multiple hashes uses more space and is more complex.
>

I agree that storing the trailing columns (all but the first) as
payload uses more space, but it will save heap fetches in many cases.
Apart from searches, even for unique-key insertion we need to do heap
fetches, as we can only verify the other values after fetching the row
from the heap.

Now, the question here is: do we want to save random heap I/O, or save
extra space in the hash index? Both approaches have pros and cons, but
saving heap I/O appears to be the more important, especially for
unique-index checks, where we also need to hold the bucket lock.

-- 
With Regards,
Amit Kapila.



Re: Next Steps with Hash Indexes

From
Amit Kapila
Date:
On Sun, Oct 17, 2021 at 4:30 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
>
> On Thu, 14 Oct 2021 at 16:09, Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Thu, Oct 14, 2021 at 12:48 AM Simon Riggs
> > <simon.riggs@enterprisedb.com> wrote:
> > > The hash index tuples are 20-bytes each. If that were rounded up to
> > > 8-byte alignment, then that would be 24 bytes.
> > >
> > > Using pageinspect, the max(live_items) on any data page (bucket or
> > > overflow) is 407 items, so they can't be 24 bytes long.
> >
> > That's the same as an nbtree page, which confirms my suspicion. The 20
> > bytes consists of a 16 byte tuple, plus a 4 byte line pointer. The
> > tuple-level alignment overhead gets you from 12 bytes to 16 bytes with
> > a single int4 column. So the padding is there for the taking.
>
> Thank you for nudging me to review the tuple length.
>
> Since hash indexes never store Nulls, and the hash is always fixed
> length, ISTM that we can compress the hash index entries down to
> ItemPointerData (6 bytes) plus any hashes.
>

Nice observation but we use INDEX_AM_RESERVED_BIT (as
INDEX_MOVED_BY_SPLIT_MASK) for hash indexes, so we need to take care
of that in some way.

-- 
With Regards,
Amit Kapila.



Re: Next Steps with Hash Indexes

From
Robert Haas
Date:
On Tue, Oct 5, 2021 at 6:50 AM Simon Riggs <simon.riggs@enterprisedb.com> wrote:
> With unique data, starting at 1 and monotonically ascending, hash
> indexes will grow very nicely from 0 to 10E7 rows without causing >1
> overflow block to be allocated for any bucket. This keeps the search
> time for such data to just 2 blocks (bucket plus, if present, 1
> overflow block). The small number of overflow blocks is because of the
> regular and smooth way that splits occur, which works very nicely
> without significant extra latency.

It is my impression that with non-unique data things degrade rather
badly. There's no way to split the buckets that are overflowing
without also splitting the buckets that are completely empty or, in
any event, not full enough to need any overflow pages. I think that's
rather awful.

> The probability of bucket collision while we hold the lock is fairly
> low. This is because even with adjacent data values the hash values
> would be spread across multiple buckets, so we would expect the
> contention to be less than we would get on a monotonically increasing
> btree.
>
> So I don't now see any problem from holding the buffer lwlock on the
> bucket while we do multi-buffer operations.

I don't think that contention is the primary concern here. I think one
major concern is interruptibility: a process must be careful not to
hold lwlocks across long stretches of code, because it cannot be
cancelled while it does. Even if the code is bug-free and the database
has no corruption, long pauses before cancels take effect can be
inconvenient. But as soon as you add in those considerations things
get much worse. Imagine a hash index that is corrupted so that there
is a loop in the list of overflow pages. No matter what, we're going
to go into an infinite loop scanning that bucket, but if we're holding
a buffer lock while we do it, there's no way to escape short of
bouncing the entire server. That's pretty bad.

Undetected deadlock is something to think about, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Next Steps with Hash Indexes

From
Amit Kapila
Date:
On Wed, Oct 27, 2021 at 2:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Oct 5, 2021 at 6:50 AM Simon Riggs <simon.riggs@enterprisedb.com> wrote:
> > With unique data, starting at 1 and monotonically ascending, hash
> > indexes will grow very nicely from 0 to 10E7 rows without causing >1
> > overflow block to be allocated for any bucket. This keeps the search
> > time for such data to just 2 blocks (bucket plus, if present, 1
> > overflow block). The small number of overflow blocks is because of the
> > regular and smooth way that splits occur, which works very nicely
> > without significant extra latency.
>
> It is my impression that with non-unique data things degrade rather
> badly.
>

But we will hold the bucket lock only for unique indexes, in which
case there shouldn't be non-unique data in the index. The non-unique
case should work as it does today. I guess that is why Simon used
unique data in his example.

-- 
With Regards,
Amit Kapila.



Re: Next Steps with Hash Indexes

From
Matthias van de Meent
Date:
On Wed, 27 Oct 2021 at 12:58, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Oct 27, 2021 at 2:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Tue, Oct 5, 2021 at 6:50 AM Simon Riggs <simon.riggs@enterprisedb.com> wrote:
> > > With unique data, starting at 1 and monotonically ascending, hash
> > > indexes will grow very nicely from 0 to 10E7 rows without causing >1
> > > overflow block to be allocated for any bucket. This keeps the search
> > > time for such data to just 2 blocks (bucket plus, if present, 1
> > > overflow block). The small number of overflow blocks is because of the
> > > regular and smooth way that splits occur, which works very nicely
> > > without significant extra latency.
> >
> > It is my impression that with non-unique data things degrade rather
> > badly.
> >
>
> But we will hold the bucket lock only for unique-index in which case
> there shouldn't be non-unique data in the index.

Even in unique indexes there might be many duplicate index entries: A
frequently updated row, to which HOT cannot apply, whose row versions
are waiting for vacuum (which is waiting for that one long-running
transaction to commit) will have many entries in each index.

Sure, it generally won't hit 10E7 duplicates, but we can hit large
numbers of duplicates fast on a frequently updated row. Updating one
row 1000 times between two runs of VACUUM is not at all impossible,
and although I don't think it happens all the time, I do think it can
happen often enough on e.g. an HTAP system to make it a noteworthy
test case.
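
For example, something like this (a sketch with made-up names, using a
plain hash index since unique hash indexes are only being built now)
piles up index entries for a single key between vacuums:

    CREATE TABLE t (id int, val int);
    CREATE INDEX t_id_hash ON t USING hash (id);
    CREATE INDEX t_val_btree ON t (val);  -- an index on val makes these updates non-HOT
    INSERT INTO t SELECT g, 0 FROM generate_series(1, 100000) g;

    -- with a long-running transaction elsewhere holding back VACUUM:
    UPDATE t SET val = val + 1 WHERE id = 42;  -- repeat a few hundred times
    -- every non-HOT update adds another hash index entry for id = 42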

Kind regards,

Matthias van de Meent



Re: Next Steps with Hash Indexes

From
Amit Kapila
Date:
On Wed, Oct 27, 2021 at 4:55 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> On Wed, 27 Oct 2021 at 12:58, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Oct 27, 2021 at 2:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > On Tue, Oct 5, 2021 at 6:50 AM Simon Riggs <simon.riggs@enterprisedb.com> wrote:
> > > > With unique data, starting at 1 and monotonically ascending, hash
> > > > indexes will grow very nicely from 0 to 10E7 rows without causing >1
> > > > overflow block to be allocated for any bucket. This keeps the search
> > > > time for such data to just 2 blocks (bucket plus, if present, 1
> > > > overflow block). The small number of overflow blocks is because of the
> > > > regular and smooth way that splits occur, which works very nicely
> > > > without significant extra latency.
> > >
> > > It is my impression that with non-unique data things degrade rather
> > > badly.
> > >
> >
> > But we will hold the bucket lock only for unique-index in which case
> > there shouldn't be non-unique data in the index.
>
> Even in unique indexes there might be many duplicate index entries: A
> frequently updated row, to which HOT cannot apply, whose row versions
> are waiting for vacuum (which is waiting for that one long-running
> transaction to commit) will have many entries in each index.
>
> Sure, it generally won't hit 10E7 duplicates, but we can hit large
> numbers of duplicates fast on a frequently updated row. Updating one
> row 1000 times between two runs of VACUUM is not at all impossible,
> and although I don't think it happens all the time, I do think it can
> happen often enough on e.g. an HTAP system to make it a noteworthy
> test case.
>

I think it makes sense to test such cases and see the behavior w.r.t.
overflow buckets.

-- 
With Regards,
Amit Kapila.