Re: Hash Indexes - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: Hash Indexes |
Date | |
Msg-id | CAA4eK1LWNzOzkOL+wiSeb-84v4hzVtYrRL=QYp=v5j+DGtzTng@mail.gmail.com Whole thread Raw |
In response to | Re: Hash Indexes (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Hash Indexes
|
List | pgsql-hackers |
On Wed, Nov 9, 2016 at 1:23 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Nov 7, 2016 at 9:51 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> [ new patches ] > > Attached is yet another incremental patch with some suggested changes. > > + * This expects that the caller has acquired a cleanup lock on the target > + * bucket (primary page of a bucket) and it is reponsibility of caller to > + * release that lock. > > This is confusing, because it makes it sound like we retain the lock > through the entire execution of the function, which isn't always true. > I would say that caller must acquire a cleanup lock on the target > primary bucket page before calling this function, and that on return > that page will again be write-locked. However, the lock might be > temporarily released in the meantime, which visiting overflow pages. > (Attached patch has a suggested rewrite.) > + * This function expects that the caller has acquired a cleanup lock on the + * primary bucket page, and will with a write lock again held on the primary + * bucket page. The lock won't necessarily be held continuously, though, + * because we'll release it when visiting overflow pages. Looks like typo in above comment. /will with a write lock/will return with a write lock > + * During scan of overflow pages, first we need to lock the next bucket and > + * then release the lock on current bucket. This ensures that any concurrent > + * scan started after we start cleaning the bucket will always be behind the > + * cleanup. Allowing scans to cross vacuum will allow it to remove tuples > + * required for sanctity of scan. > > This comment says that it's bad if other scans can pass our cleanup > scan, but it doesn't explain why. I think it's because we don't have > page-at-a-time mode yet, > Right. > and cleanup might decrease the TIDs for > existing index entries. > I think the reason is that cleanup might move tuples around during which it might move previously returned TID to a position earlier than its current position. This is a problem because it restarts the scan from previously returned offset and try to find previously returned tuples TID. This has been mentioned in README as below: + It is must to +keep scans behind cleanup, else vacuum could remove tuples that are required +to complete the scan as the scan that returns multiple tuples from the same +bucket page always restart the scan from the previous offset number from which +it has returned last tuple. We might want to slightly improve the README so that the reason is more clear and then mention in comments to refer README, but I am open either way, let me know which way you prefer? > > + if (delay) > + vacuum_delay_point(); > > You don't really need "delay". If we're not in a cost-accounted > VACUUM, vacuum_delay_point() just turns into CHECK_FOR_INTERRUPTS(), > which should be safe (and a good idea) regardless. (Fixed in > attached.) > Okay, that makes sense. > + if (callback && callback(htup, callback_state)) > + { > + /* mark the item for deletion */ > + deletable[ndeletable++] = offno; > + if (tuples_removed) > + *tuples_removed += 1; > + } > + else if (bucket_has_garbage) > + { > + /* delete the tuples that are moved by split. */ > + bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup > ), > + maxbucket, > + highmask, > + lowmask); > + /* mark the item for deletion */ > + if (bucket != cur_bucket) > + { > + /* > + * We expect tuples to either belong to curent bucket or > + * new_bucket. This is ensured because we don't allow > + * further splits from bucket that contains garbage. See > + * comments in _hash_expandtable. > + */ > + Assert(bucket == new_bucket); > + deletable[ndeletable++] = offno; > + } > + else if (num_index_tuples) > + *num_index_tuples += 1; > + } > + else if (num_index_tuples) > + *num_index_tuples += 1; > + } > > OK, a couple things here. First, it seems like we could also delete > any tuples where ItemIdIsDead, and that seems worth doing. I think we can't do that because here we want to strictly rely on callback function for vacuum similar to btree. The reason is explained as below comment in function btvacuumpage(). /* * During Hot Standby we currently assume that * XLOG_BTREE_VACUUM records do not produce conflicts. That is * only true as long as the callback function depends only * upon whether the index tuple refers to heap tuples removed * in the initial heap scan. ... .. > In fact, I > think we should check it prior to invoking the callback, because it's > probably quite substantially cheaper than the callback. Second, > repeating deletable[ndeletable++] = offno and *num_index_tuples += 1 > doesn't seem very clean to me; I think we should introduce a new bool > to track whether we're keeping the tuple or killing it, and then use > that to drive which of those things we do. (Fixed in attached.) > This looks okay to me. So if you agree with my reasoning for not including first part, then I can take that out and keep this part in next patch. > + if (H_HAS_GARBAGE(bucket_opaque) && > + !H_INCOMPLETE_SPLIT(bucket_opaque)) > > This is the only place in the entire patch that use > H_INCOMPLETE_SPLIT(), and I'm wondering if that's really correct even > here. Don't you really want H_OLD_INCOMPLETE_SPLIT() here? (And > couldn't we then remove H_INCOMPLETE_SPLIT() itself?) You are right. Will remove it in next version. > > I think it would be a good idea to change this so that > LH_BUCKET_PAGE_HAS_GARBAGE doesn't get set until > LH_BUCKET_OLD_PAGE_SPLIT is cleared. The current way is confusing, > because those tuples are NOT garbage until the split is completed! > Moreover, both of the places that care about > LH_BUCKET_PAGE_HAS_GARBAGE need to make sure that > LH_BUCKET_OLD_PAGE_SPLIT is clear before they do anything about > LH_BUCKET_PAGE_HAS_GARBAGE, so the change I am proposing would > actually simplify the code very slightly. > Not an issue. We can do that way. > +#define H_OLD_INCOMPLETE_SPLIT(opaque) ((opaque)->hasho_flag & > LH_BUCKET_OLD_PAGE_SPLIT) > +#define H_NEW_INCOMPLETE_SPLIT(opaque) ((opaque)->hasho_flag & > LH_BUCKET_NEW_PAGE_SPLIT) > > The code isn't consistent about the use of these macros, or at least > not in a good way. When you care about LH_BUCKET_OLD_PAGE_SPLIT, you > test it using the macro; when you care about H_NEW_INCOMPLETE_SPLIT, > you ignore the macro and test it directly. Either get rid of both > macros and always test directly, or keep both macros and use both of > them. Using a macro for one but not the other is strange. > I will like to use a macro at both places. > I wonder if we should rename these flags and macros. Maybe > LH_BUCKET_OLD_PAGE_SPLIT -> LH_BEING_SPLIT and > LH_BUCKET_NEW_PAGE_SPLIT -> LH_BEING_POPULATED. > I think keeping BUCKET (LH_BUCKET_*) in define indicates in some way about the type of page being split. Hash index has multiple type of pages. That seems to be taken care in existing defines as below. #define LH_OVERFLOW_PAGE (1 << 0) #define LH_BUCKET_PAGE (1 << 1) #define LH_BITMAP_PAGE (1 << 2) #define LH_META_PAGE (1 << 3) > I think that might be > clearer. When LH_BEING_POPULATED is set, the bucket is being filled - > that is, populated - from the old bucket. > How about LH_BUCKET_BEING_POPULATED or may LH_BP_BEING_SPLIT where BP indicates Bucket page? I think keeping Split work in these defines might make more sense like LH_BP_SPLIT_OLD/LH_BP_SPLIT_FORM_NEW. > And maybe > LH_BUCKET_PAGE_HAS_GARBAGE -> LH_NEEDS_SPLIT_CLEANUP, too. > How about LH_BUCKET_NEEDS_SPLIT_CLEANUP or LH_BP_NEEDS_SPLIT_CLEANUP? I am slightly inclined to keep Bucket word, but let me know if you think it will make the length longer. > + * Copy bucket mapping info now; The comment in _hash_expandtable > + * where we copy this information and calls _hash_splitbucket explains > + * why this is OK. > > After a semicolon, the next word should not be capitalized. There > shouldn't be two spaces after a semicolon, either. > Will fix. > Also, > _hash_splitbucket appears to have a verb before it (calls) and a verb > after it (explains) and I have no idea what that means. > I think conjuction is required there. Let me try to rewrite as below: refer the comment in _hash_expandtable where we copy this information before calling _hash_splitbucket to see why this is ok. If you have better words to explain it, then let me know. > + for (;;) > + { > + mask = lowmask + 1; > + new_bucket = old_bucket | mask; > + if (new_bucket > metap->hashm_maxbucket) > + { > + lowmask = lowmask >> 1; > + continue; > + } > + blkno = BUCKET_TO_BLKNO(metap, new_bucket); > + break; > + } > > I can't help feeling that it should be possible to do this without > looping. Can we ever loop more than once? > No. > How? Can we just use an > if-then instead of a for-loop? > I could see below two possibilities: First way - retry: mask = lowmask + 1; new_bucket = old_bucket | mask; if (new_bucket > maxbucket) { lowmask = lowmask >> 1; goto retry; } Second way - new_bucket = CALC_NEW_BUCKET(old_bucket,lowmask); if (new_bucket > maxbucket) { lowmask = lowmask >> 1; new_bucket = CALC_NEW_BUCKET(old_bucket, lowmask); } #define CALC_NEW_BUCKET(old_bucket, lowmask) \ new_bucket = old_bucket | (lowmask + 1) Do you have something else in mind? > Can't _hash_get_oldbucket_newblock call _hash_get_oldbucket_newbucket > instead of duplicating the logic? > Will change in next version of patch. > I still don't like the names of these functions very much. If you > said "get X from Y", it would be clear that you put in Y and you get > out X. If you say "X 2 Y", it would be clear that you put in X and > you get out Y. As it is, it's not very clear which is the input and > which is the output. > Whatever exists earlier is input and the later one is output. For example in existing function _hash_get_indextuple_hashkey(). However, feel free to suggest better names here. How about _hash_get_oldbucket2newblock() or _hash_get_newblock_from_oldbucket() or simply _hash_get_newblock()? > + /* > + * Acquiring cleanup lock to clear the split-in-progress flag ensures that > + * there is no pending scan that has seen the flag after it is cleared. > + */ > + _hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE); > + opage = BufferGetPage(bucket_obuf); > + oopaque = (HashPageOpaque) PageGetSpecialPointer(opage); > > I don't understand the comment, because the code *isn't* acquiring a > cleanup lock. > Oops, this is ramnant from one of the design approach to clear these flags which was later discarded due to issues. I will change this to indicate Exclusive lock. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
pgsql-hackers by date: