From: Amit Kapila
Subject: Re: [POC] A better way to expand hash indexes.
Msg-id: CAA4eK1LugOfyBkSOr8dr+OGT0hqVWxRjC_Q7dJbUOxWrWZ5HyA@mail.gmail.com
In response to: Re: [POC] A better way to expand hash indexes. (Mithun Cy <mithun.cy@enterprisedb.com>)
Responses: Re: [POC] A better way to expand hash indexes. (Mithun Cy <mithun.cy@enterprisedb.com>)
List: pgsql-hackers
On Tue, Mar 28, 2017 at 10:43 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> On Mon, Mar 27, 2017 at 11:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>
> As you have said, we can solve it if we always allocate buckets in
> powers of 2 when we do the hash index meta-page init. But on other
> occasions, when we try to double the existing buckets, we can do the
> allocation in 4 equal phases.
>
> But I think there are 2 more ways to solve the same problem:
>
> A. Why not pass all 3 parameters high_mask, low_mask, and max_buckets
> to tuplesort and let it use _hash_hashkey2bucket to figure out which
> key belongs to which bucket, and then sort them? I think this way we
> make sorting and insertion into the hash index consistent with each
> other.
>
> B. In tuplesort we can use the hash function bucket = hash_key %
> num_buckets instead of the existing one, which does a bitwise "and" to
> determine the bucket of a hash key. This way we will not wrongly
> assign buckets beyond max_buckets, and the sorted hash keys will be in
> sync with the actual insertion order of _hash_doinsert.
>
>
> I am adding both the patches, Patch_A and Patch_B. My preference is
> Patch_A, and I am open to suggestions.
>
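
[For anyone following along, here is a small standalone sketch, illustrative
only and not patch code, contrasting the two bucket computations under
discussion: option A reuses the mask-then-wrap mapping of
_hash_hashkey2bucket(), while option B is a plain modulo. The
maxbucket/highmask/lowmask values below are assumed for a 6-bucket index.]

#include <stdint.h>
#include <stdio.h>

/* Option A: mirrors the mask-then-wrap mapping of _hash_hashkey2bucket(). */
static uint32_t
bucket_by_mask(uint32_t hashkey, uint32_t maxbucket,
               uint32_t highmask, uint32_t lowmask)
{
    uint32_t bucket = hashkey & highmask;

    if (bucket > maxbucket)
        bucket = bucket & lowmask;
    return bucket;
}

/* Option B: plain modulo on the current number of buckets. */
static uint32_t
bucket_by_mod(uint32_t hashkey, uint32_t num_buckets)
{
    return hashkey % num_buckets;
}

int
main(void)
{
    /* Assumed values for a 6-bucket index: maxbucket = 5, highmask = 7,
     * lowmask = 3. */
    uint32_t keys[] = {5, 6, 14, 22};

    for (int i = 0; i < 4; i++)
        printf("key %2u: mask -> %u, mod -> %u\n",
               (unsigned) keys[i],
               (unsigned) bucket_by_mask(keys[i], 5, 7, 3),
               (unsigned) bucket_by_mod(keys[i], 6));
    return 0;
}

[With a power-of-two bucket count the two mappings coincide; for other
counts they can pick different buckets for the same key, which only affects
how closely the sorted build order tracks _hash_doinsert.]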

I think it can work both ways.  I feel there is no pressing need to
make the computation in sorting the same as what you do in
_hash_doinsert.  In Patch_B, I don't think the new naming for the
variables is good.
 Assert(!a->isnull1);
- hash1 = DatumGetUInt32(a->datum1) & state->hash_mask;
+ bucket1 = DatumGetUInt32(a->datum1) % state->num_buckets;

Can we use hash_mod instead of num_buckets and retain hash1 in the
above code and in other similar places?
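
[Something along these lines, say; an illustrative fragment only, with
made-up struct and function names rather than the actual tuplesort.c
change.]

#include <stdint.h>

/* Keep the existing hash1/hash2 names and rename the divisor to hash_mod,
 * so the diff against comparetup_index_hash() stays minimal. */
typedef struct SortSketchState
{
    uint32_t hash_mod;      /* bucket count used only for sort ordering */
} SortSketchState;

static int
compare_hash_datums(uint32_t datum1, uint32_t datum2,
                    const SortSketchState *state)
{
    uint32_t hash1 = datum1 % state->hash_mod;
    uint32_t hash2 = datum2 % state->hash_mod;

    if (hash1 > hash2)
        return 1;
    else if (hash1 < hash2)
        return -1;
    return 0;
}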

>>+#define SPLITPOINT_PHASES_PER_GRP 4
>>+#define SPLITPOINT_PHASE_MASK (SPLITPOINT_PHASES_PER_GRP - 1)
>>+#define Buckets_First_Split_Group 4
> Fixed.
>
>>In the above computation, the +2 and -2 still bother me.  I think you
>>need to do this because you have defined split group zero to have four
>>buckets; how about if you don't force that, and rather define split
>>point phases only from the split point which has four or more buckets?
>
> Okay, as suggested, instead of group zero having 4 phases of 1 bucket
> each, I have recalculated the spare mapping as below.
>

A few comments:
1.
+#define BUCKETS_WITHIN_SP_GRP(sp_g, nphase) \
+ ((sp_g < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) ? \
+ (1 << (sp_g - 1)) : \
+ ((nphase) * ((1 << (sp_g - 1)) >> 2)))

This will go wrong for split point group zero.  In general, I feel
that if you handle the computation for split groups less than
SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE in the caller, then all your
macros will become much simpler and less error-prone.
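
[To illustrate the kind of simplification being suggested, a sketch only;
the threshold value and the early-group bucket counts below are
assumptions, not the patch's actual numbers.]

#include <stdint.h>

#define SPLITPOINT_PHASES_PER_GRP             4
/* Assumed threshold: groups below this are allocated in a single phase. */
#define SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE 10

/* Only valid for sp_g >= SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE, so the
 * macro needs no branch and cannot be misused for group zero. */
#define BUCKETS_WITHIN_SP_GRP(sp_g, nphase) \
    ((nphase) * (((uint32_t) 1 << ((sp_g) - 1)) / SPLITPOINT_PHASES_PER_GRP))

static uint32_t
buckets_within_group(uint32_t sp_g, uint32_t nphase)
{
    if (sp_g < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE)
    {
        /* Early groups are allocated whole: assume group 0 adds one
         * bucket and group g adds 2^(g-1) buckets. */
        return (sp_g == 0) ? 1 : ((uint32_t) 1 << (sp_g - 1));
    }
    return BUCKETS_WITHIN_SP_GRP(sp_g, nphase);
}

[Once the caller filters out the single-phase groups, the macro body
reduces to a single expression with no negative shift to worry about.]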

2.
+#define BUCKET_TO_SPLITPOINT_GRP(num_bucket) (_hash_log2(num_bucket))

What is the use of such a define?  Can't we directly use _hash_log2
in the caller?
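
[For reference, a standalone sketch of the direct call; hash_log2() below
is meant to mirror the ceil(log2) behaviour of _hash_log2() in hashutil.c,
and the wrapper function name is made up for illustration.]

#include <stdint.h>

/* Smallest i such that 2^i >= num, as _hash_log2() computes it. */
static uint32_t
hash_log2(uint32_t num)
{
    uint32_t i, limit;

    for (i = 0, limit = 1; limit < num; i++, limit <<= 1)
        ;
    return i;
}

static uint32_t
splitpoint_group_of(uint32_t num_bucket)
{
    /* The suggestion: call _hash_log2() directly at the call site
     * instead of wrapping it in BUCKET_TO_SPLITPOINT_GRP(). */
    return hash_log2(num_bucket);
}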


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


