Home > mailing lists
Re: Hash Indexes - Mailing list pgsql-hackers

From	Amit Kapila
Subject	Re: Hash Indexes
Date	November 9, 2016 17:04:09
Msg-id	CAA4eK1LWNzOzkOL+wiSeb-84v4hzVtYrRL=QYp=v5j+DGtzTng@mail.gmail.com Whole thread Raw
In response to	Re: Hash Indexes (Robert Haas <robertmhaas@gmail.com>)
Responses	Re: Hash Indexes
List	pgsql-hackers
Tree view
On Wed, Nov 9, 2016 at 1:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Nov 7, 2016 at 9:51 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> [ new patches ]
>
> Attached is yet another incremental patch with some suggested changes.
>
> + * This expects that the caller has acquired a cleanup lock on the target
> + * bucket (primary page of a bucket) and it is reponsibility of caller to
> + * release that lock.
>
> This is confusing, because it makes it sound like we retain the lock
> through the entire execution of the function, which isn't always true.
> I would say that caller must acquire a cleanup lock on the target
> primary bucket page before calling this function, and that on return
> that page will again be write-locked.  However, the lock might be
> temporarily released in the meantime, which visiting overflow pages.
> (Attached patch has a suggested rewrite.)
>

+ * This function expects that the caller has acquired a cleanup lock on the
+ * primary bucket page, and will with a write lock again held on the primary
+ * bucket page.  The lock won't necessarily be held continuously, though,
+ * because we'll release it when visiting overflow pages.

Looks like typo in above comment.   /will with a write lock/will
return with a write lock


> + * During scan of overflow pages, first we need to lock the next bucket and
> + * then release the lock on current bucket.  This ensures that any concurrent
> + * scan started after we start cleaning the bucket will always be behind the
> + * cleanup.  Allowing scans to cross vacuum will allow it to remove tuples
> + * required for sanctity of scan.
>
> This comment says that it's bad if other scans can pass our cleanup
> scan, but it doesn't explain why.  I think it's because we don't have
> page-at-a-time mode yet,
>

Right.

> and cleanup might decrease the TIDs for
> existing index entries.
>

I think the reason is that cleanup might move tuples around during
which it might move previously returned TID to a position earlier than
its current position.  This is a problem because it restarts the scan
from previously returned offset and try to find previously returned
tuples TID.  This has been mentioned in README as below:

+ It is must to
+keep scans behind cleanup, else vacuum could remove tuples that are required
+to complete the scan as the scan that returns multiple tuples from the same
+bucket page always restart the scan from the previous offset number from which
+it has returned last tuple.

We might want to slightly improve the README so that the reason is
more clear and then mention in comments to refer README, but I am open
either way, let me know which way you prefer?

>
> +        if (delay)
> +            vacuum_delay_point();
>
> You don't really need "delay".  If we're not in a cost-accounted
> VACUUM, vacuum_delay_point() just turns into CHECK_FOR_INTERRUPTS(),
> which should be safe (and a good idea) regardless.  (Fixed in
> attached.)
>

Okay, that makes sense.

> +            if (callback && callback(htup, callback_state))
> +            {
> +                /* mark the item for deletion */
> +                deletable[ndeletable++] = offno;
> +                if (tuples_removed)
> +                    *tuples_removed += 1;
> +            }
> +            else if (bucket_has_garbage)
> +            {
> +                /* delete the tuples that are moved by split. */
> +                bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup
> ),
> +                                              maxbucket,
> +                                              highmask,
> +                                              lowmask);
> +                /* mark the item for deletion */
> +                if (bucket != cur_bucket)
> +                {
> +                    /*
> +                     * We expect tuples to either belong to curent bucket or
> +                     * new_bucket.  This is ensured because we don't allow
> +                     * further splits from bucket that contains garbage. See
> +                     * comments in _hash_expandtable.
> +                     */
> +                    Assert(bucket == new_bucket);
> +                    deletable[ndeletable++] = offno;
> +                }
> +                else if (num_index_tuples)
> +                    *num_index_tuples += 1;
> +            }
> +            else if (num_index_tuples)
> +                *num_index_tuples += 1;
> +        }
>
> OK, a couple things here.  First, it seems like we could also delete
> any tuples where ItemIdIsDead, and that seems worth doing.

I think we can't do that because here we want to strictly rely on
callback function for vacuum similar to btree. The reason is explained
as below comment in function btvacuumpage().

/*
* During Hot Standby we currently assume that
* XLOG_BTREE_VACUUM records do not produce conflicts. That is
* only true as long as the callback function depends only
* upon whether the index tuple refers to heap tuples removed
* in the initial heap scan. ...
..

> In fact, I
> think we should check it prior to invoking the callback, because it's
> probably quite substantially cheaper than the callback.  Second,
> repeating deletable[ndeletable++] = offno and *num_index_tuples += 1
> doesn't seem very clean to me; I think we should introduce a new bool
> to track whether we're keeping the tuple or killing it, and then use
> that to drive which of those things we do.  (Fixed in attached.)
>

This looks okay to me. So if you agree with my reasoning for not
including first part, then I can take that out and keep this part in
next patch.

> +        if (H_HAS_GARBAGE(bucket_opaque) &&
> +            !H_INCOMPLETE_SPLIT(bucket_opaque))
>
> This is the only place in the entire patch that use
> H_INCOMPLETE_SPLIT(), and I'm wondering if that's really correct even
> here. Don't you really want H_OLD_INCOMPLETE_SPLIT() here?  (And
> couldn't we then remove H_INCOMPLETE_SPLIT() itself?)

You are right.  Will remove it in next version.

>
> I think it would be a good idea to change this so that
> LH_BUCKET_PAGE_HAS_GARBAGE doesn't get set until
> LH_BUCKET_OLD_PAGE_SPLIT is cleared.  The current way is confusing,
> because those tuples are NOT garbage until the split is completed!
> Moreover, both of the places that care about
> LH_BUCKET_PAGE_HAS_GARBAGE need to make sure that
> LH_BUCKET_OLD_PAGE_SPLIT is clear before they do anything about
> LH_BUCKET_PAGE_HAS_GARBAGE, so the change I am proposing would
> actually simplify the code very slightly.
>

Not an issue.  We can do that way.

> +#define H_OLD_INCOMPLETE_SPLIT(opaque)  ((opaque)->hasho_flag &
> LH_BUCKET_OLD_PAGE_SPLIT)
> +#define H_NEW_INCOMPLETE_SPLIT(opaque)  ((opaque)->hasho_flag &
> LH_BUCKET_NEW_PAGE_SPLIT)
>
> The code isn't consistent about the use of these macros, or at least
> not in a good way.  When you care about LH_BUCKET_OLD_PAGE_SPLIT, you
> test it using the macro; when you care about H_NEW_INCOMPLETE_SPLIT,
> you ignore the macro and test it directly.  Either get rid of both
> macros and always test directly, or keep both macros and use both of
> them. Using a macro for one but not the other is strange.
>

I will like to use a macro at both places.

> I wonder if we should rename these flags and macros.  Maybe
> LH_BUCKET_OLD_PAGE_SPLIT -> LH_BEING_SPLIT and
> LH_BUCKET_NEW_PAGE_SPLIT -> LH_BEING_POPULATED.
>

I think keeping BUCKET (LH_BUCKET_*) in define indicates in some way
about the type of page being split. Hash index has multiple type of
pages. That seems to be taken care in existing defines as below.
#define LH_OVERFLOW_PAGE (1 << 0)
#define LH_BUCKET_PAGE (1 << 1)
#define LH_BITMAP_PAGE (1 << 2)
#define LH_META_PAGE (1 << 3)


>  I think that might be
> clearer.  When LH_BEING_POPULATED is set, the bucket is being filled -
> that is, populated - from the old bucket.
>

How about LH_BUCKET_BEING_POPULATED or may LH_BP_BEING_SPLIT where BP
indicates Bucket page?

I think keeping Split work in these defines might make more sense like
LH_BP_SPLIT_OLD/LH_BP_SPLIT_FORM_NEW.

>  And maybe
> LH_BUCKET_PAGE_HAS_GARBAGE -> LH_NEEDS_SPLIT_CLEANUP, too.
>

How about LH_BUCKET_NEEDS_SPLIT_CLEANUP or LH_BP_NEEDS_SPLIT_CLEANUP?
I am slightly inclined to keep Bucket word, but let me know if you
think it will make the length longer.

> +         * Copy bucket mapping info now;  The comment in _hash_expandtable
> +         * where we copy this information and calls _hash_splitbucket explains
> +         * why this is OK.
>
> After a semicolon, the next word should not be capitalized.  There
> shouldn't be two spaces after a semicolon, either.
>

Will fix.

>  Also,
> _hash_splitbucket appears to have a verb before it (calls) and a verb
> after it (explains) and I have no idea what that means.
>

I think conjuction is required there. Let me try to rewrite as below:
refer the comment in _hash_expandtable where we copy this information
before calling _hash_splitbucket to see why this is ok.

If you have better words to explain it, then let me know.

> +    for (;;)
> +    {
> +        mask = lowmask + 1;
> +        new_bucket = old_bucket | mask;
> +        if (new_bucket > metap->hashm_maxbucket)
> +        {
> +            lowmask = lowmask >> 1;
> +            continue;
> +        }
> +        blkno = BUCKET_TO_BLKNO(metap, new_bucket);
> +        break;
> +    }
>
> I can't help feeling that it should be possible to do this without
> looping.  Can we ever loop more than once?
>

No.

>  How?  Can we just use an
> if-then instead of a for-loop?
>

I could see below two possibilities:
First way -

retry:
mask = lowmask + 1;
new_bucket = old_bucket | mask;
if (new_bucket > maxbucket)
{
lowmask = lowmask >> 1;
goto retry;
}

Second way -
new_bucket = CALC_NEW_BUCKET(old_bucket,lowmask);
if (new_bucket > maxbucket)
{
lowmask = lowmask >> 1;
new_bucket = CALC_NEW_BUCKET(old_bucket, lowmask);
}

#define CALC_NEW_BUCKET(old_bucket, lowmask) \
new_bucket = old_bucket | (lowmask + 1)

Do you have something else in mind?


> Can't _hash_get_oldbucket_newblock call _hash_get_oldbucket_newbucket
> instead of duplicating the logic?
>

Will change in next version of patch.

> I still don't like the names of these functions very much.  If you
> said "get X from Y", it would be clear that you put in Y and you get
> out X.  If you say "X 2 Y", it would be clear that you put in X and
> you get out Y.  As it is, it's not very clear which is the input and
> which is the output.
>

Whatever exists earlier is input and the later one is output. For
example in existing function _hash_get_indextuple_hashkey().  However,
feel free to suggest better names here.  How about
_hash_get_oldbucket2newblock() or _hash_get_newblock_from_oldbucket()
or simply _hash_get_newblock()?

> +    /*
> +     * Acquiring cleanup lock to clear the split-in-progress flag ensures that
> +     * there is no pending scan that has seen the flag after it is cleared.
> +     */
> +    _hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
> +    opage = BufferGetPage(bucket_obuf);
> +    oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
>
> I don't understand the comment, because the code *isn't* acquiring a
> cleanup lock.
>

Oops, this is ramnant from one of the design approach to clear these
flags which was later discarded due to issues. I will change this to
indicate Exclusive lock.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
pgsql-hackers by date:
From: Peter Eisentraut
Date: 09 November 2016, 17:02:54
Subject: Re: IPv6 link-local addresses and init data type
From: Amit Kapila
Date: 09 November 2016, 17:10:39
Subject: Re: Write Ahead Logging for Hash Indexes
Re: Hash Indexes - Mailing list pgsql-hackers

Previous

Next