Thread: [HACKERS] Page Scan Mode in Hash Index
Hi All,

Currently, a hash index scan works tuple-at-a-time, i.e. for every qualifying tuple in a page, it acquires and releases the lock, which eventually increases the lock/unlock traffic. For example, if an index page contains 100 qualifying tuples, the current hash index scan has to acquire and release the lock 100 times to read those tuples, which is not good from a performance perspective and also impacts concurrency with VACUUM.

Considering the above points, I would like to propose a patch that allows a hash index scan to work in page-at-a-time mode. In page-at-a-time mode, once the lock is acquired on a target bucket page, the entire page is scanned and all the qualifying tuples are saved into the backend's local memory. This reduces the lock/unlock calls for retrieving tuples from a page. Moreover, it also eliminates the problem of re-finding the position of the last returned index tuple and, more importantly, it allows VACUUM to release the lock on the current page before moving to the next page, which eventually improves its concurrency with scans.

The attached patch modifies the hash index scan code for page-at-a-time mode. For better readability, I have split it into 3 parts:

1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this patch rewrites the hash index scan module to work in page-at-a-time mode. It basically introduces two new functions -- _hash_readpage() and _hash_saveitem(). The former is used to load all the qualifying tuples from a target bucket or overflow page into an items array. The latter is used by _hash_readpage() to save all the qualifying tuples found in a page into an items array. Apart from that, this patch basically cleans up _hash_first(), _hash_next() and hashgettuple().

2) 0002-Remove-redundant-function-_hash_step-and-some-of-the.patch: this patch basically removes the redundant function _hash_step() and some of the unused members of the HashScanOpaqueData structure.

3) 0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patch: this patch basically improves the locking strategy for VACUUM in hash index. As the new hash index scan works page-at-a-time, vacuum can release the lock on the previous page before acquiring a lock on the next page, hence improving hash index concurrency.

Please note that the above patches have to be applied on top of the following patches -- 'WAL in hash index' [1] and 'Microvacuum support for hash index' [2]. Note that in the current HEAD, marking dead tuples requires a lock on the page. Now, even if the hash index scan is done page-at-a-time, it would still require a lock on the page just to mark dead tuples, hence losing the advantage of page-at-a-time mode. Therefore, I developed this patch over 'Microvacuum support for hash index' [2].

I have also done the benchmarking of this patch and would like to share the results. Firstly, I did the benchmarking with non-unique values and I could see a performance improvement of 4-7%. For the detailed results please find the attached file 'results-non-unique values-70ff'; ddl.sql and test.sql are the test scripts used in this experiment. The details of the non-default GUC params and the pgbench command are mentioned in the result sheet. I also did the benchmarking with unique values at 300 and 1000 scale factor, and its results are provided in 'results-unique-values-default-ff'.
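To illustrate the page-at-a-time idea described above, a rough sketch of the per-page loop (this is only a simplified fragment, not the patch code itself; the _hash_saveitem() call order follows the description above and the loop bounds are an assumption):

    /* Sketch only: with the bucket/overflow page locked, remember every
     * qualifying tuple in a single pass over the page. */
    itemIndex = 0;
    maxoff = PageGetMaxOffsetNumber(page);
    for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum = OffsetNumberNext(offnum))
    {
        IndexTuple  itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));

        if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
            _hash_checkqual(scan, itup))
        {
            /* qualifying tuple: copy its TID and offset into the items array */
            _hash_saveitem(so, itemIndex, offnum, itup);
            itemIndex++;
        }
    }
    /* all matches are now in backend-local memory, so the lock (and, for an
     * overflow page, the pin) can be released before returning tuples */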
[1]- https://www.postgresql.org/message-id/CAA4eK1LTyDHyCmj3pf5KxWgPb1DgNae9ivsB5jX0X_Kt7iLTUA%40mail.gmail.com
[2]- https://www.postgresql.org/message-id/CAA4eK1JfrJoa15XStmRKy6mGsVjKh_aa-EXZY%2BUZQOV6mGM0QQ%40mail.gmail.com
Attachment
Hi, Ashutosh!
I've been assigned to review this patch.
At first, I'd like to note that I like the idea and the general design.
Secondly, the patch set doesn't apply cleanly to master. Please rebase it.
On Tue, Feb 14, 2017 at 8:27 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
patch rewrites the hash index scan module to work in page-at-a-time
mode. It basically introduces two new functions-- _hash_readpage() and
_hash_saveitem(). The former is used to load all the qualifying tuples
from a target bucket or overflow page into an items array. The latter
one is used by _hash_readpage to save all the qualifying tuples found
in a page into an items array. Apart from that, this patch bascially
cleans _hash_first(), _hash_next and hashgettuple().
I see that the forward and backward scan cases of the function _hash_readpage contain a lot of code duplication.
Could you please refactor this function to have less code duplication?
Also, I wonder if you have a special idea behind inserting data in test.sql using 1002 separate SQL statements.
INSERT INTO con_hash_index_table (keycol) SELECT a FROM GENERATE_SERIES(1, 1000) a;
You can achieve the same result by execution of single SQL statement.
INSERT INTO con_hash_index_table (keycol) SELECT (a - 1) % 1000 + 1 FROM GENERATE_SERIES(1, 1002000) a;
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi,

> I've assigned to review this patch.
> At first, I'd like to notice that I like idea and general design.
> Secondly, patch set don't apply cleanly to master. Please, rebase it.

Thanks for showing your interest in this patch. I would like to inform you that this patch has a dependency on the patches for 'Write Ahead Logging in hash index' [1] and 'Microvacuum support in hash index' [2]. Hence, until the above two patches become stable, I may have to keep rebasing this patch. However, I will try to share the updated patch with you asap.

> On Tue, Feb 14, 2017 at 8:27 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>>
>> 1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
>> patch rewrites the hash index scan module to work in page-at-a-time
>> mode. It basically introduces two new functions-- _hash_readpage() and
>> _hash_saveitem(). The former is used to load all the qualifying tuples
>> from a target bucket or overflow page into an items array. The latter
>> one is used by _hash_readpage to save all the qualifying tuples found
>> in a page into an items array. Apart from that, this patch bascially
>> cleans _hash_first(), _hash_next and hashgettuple().
>
> I see that forward and backward scan cases of function _hash_readpage contain a lot of code duplication
> Could you please refactor this function to have less code duplication?

Sure, I will try to avoid the code duplication as much as possible.

> Also, I wonder if you have a special idea behind inserting data in test.sql by 1002 separate SQL statements.
> INSERT INTO con_hash_index_table (keycol) SELECT a FROM GENERATE_SERIES(1, 1000) a;
>
> You can achieve the same result by execution of single SQL statement.
> INSERT INTO con_hash_index_table (keycol) SELECT (a - 1) % 1000 + 1 FROM GENERATE_SERIES(1, 1002000) a;
>
> Unless you have some special idea of doing this in 1002 separate transactions.

There is no reason for having so many INSERT statements in the test.sql file. I think it would be better to replace them with a single SQL statement. Thanks.

[1]- https://www.postgresql.org/message-id/CAA4eK1KibVzgVETVay0%2BsiVEgzaXnP5R21BdWiK9kg9wx2E40Q%40mail.gmail.com
[2]- https://www.postgresql.org/message-id/CAE9k0PkRSyzx8dOnokEpUi2A-RFZK72WN0h9DEMv_ut9q6bPRw%40mail.gmail.com

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Hi,

>> I've assigned to review this patch.
>> At first, I'd like to notice that I like idea and general design.
>> Secondly, patch set don't apply cleanly to master. Please, rebase it.
>
> Thanks for showing your interest towards this patch. I would like to
> inform that this patch has got dependency on patch for 'Write Ahead
> Logging in hash index - [1]' and 'Microvacuum support in hash index
> [2]'. Hence, until above two patches becomes stable I may have to keep
> on rebasing this patch. However, I will try to share you the updated
> patch asap.
>
>> On Tue, Feb 14, 2017 at 8:27 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>>>
>>> 1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
>>> patch rewrites the hash index scan module to work in page-at-a-time
>>> mode. It basically introduces two new functions-- _hash_readpage() and
>>> _hash_saveitem(). The former is used to load all the qualifying tuples
>>> from a target bucket or overflow page into an items array. The latter
>>> one is used by _hash_readpage to save all the qualifying tuples found
>>> in a page into an items array. Apart from that, this patch bascially
>>> cleans _hash_first(), _hash_next and hashgettuple().
>>
>> I see that forward and backward scan cases of function _hash_readpage contain a lot of code duplication
>> Could you please refactor this function to have less code duplication?
>
> Sure, I will try to avoid the code duplication as much as possible.

I had a close look at the _hash_readpage() function and could see that there are only a few if-else conditions which are similar for both the forward and backward scan cases and can't be optimised further. However, from a cursory look at the function definition it may seem as if the code for the forward and backward scans is exactly the same, but that's not the case. Attached is the diff report (hash_readpage.html) for the forward and backward scan code used in _hash_readpage(). It shows which lines in _hash_readpage() are the same and which are different.

Please note that before applying the patches for page scan mode in hash index you may need to first apply the following patches on HEAD:

1) v10 patch for WAL in hash index - [1]
2) v1 patch for wal consistency check for hash index - [2]
3) v6 patch for microvacuum support in hash index - [3]

[1] - https://www.postgresql.org/message-id/CAA4eK1%2Bk5wR4-kAjPqLoKemuHayQd6RkQQT9gheTfpn%2B72o1UA%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAGz5QCKPU2qX75B1bB_LuEC88xWZa5L55J0TLvYMVD8noSH3pA%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAE9k0PkYpAPDJBfgia08o7XhO8nypH9WoO9M8%3DdqLrwwObXKcw%40mail.gmail.com

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attachment
Hi,

On 02/14/2017 12:27 AM, Ashutosh Sharma wrote:
> Currently, Hash Index scan works tuple-at-a-time, i.e. for every
> qualifying tuple in a page, it acquires and releases the lock which
> eventually increases the lock/unlock traffic. For example, if an index
> page contains 100 qualified tuples, the current hash index scan has to
> acquire and release the lock 100 times to read those qualified tuples
> which is not good from performance perspective and it also impacts
> concurency with VACUUM.
>
> Considering above points, I would like to propose a patch that allows
> hash index scan to work in page-at-a-time mode. In page-at-a-time
> mode, once lock is acquired on a target bucket page, the entire page
> is scanned and all the qualified tuples are saved into backend's local
> memory. This reduces the lock/unlock calls for retrieving tuples from
> a page. Moreover, it also eliminates the problem of re-finding the
> position of the last returned index tuple and more importanly it
> allows VACUUM to release lock on current page before moving to the
> next page which eventually improves it's concurrency with scan.
>
> Attached patch modifies hash index scan code for page-at-a-time mode.
> For better readability, I have splitted it into 3 parts,

Due to the commits on master these patches apply with hunks.

The README should be updated to mention the use of page scan.

hash.h needs pg_indent.

> 1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
> patch rewrites the hash index scan module to work in page-at-a-time
> mode. It basically introduces two new functions-- _hash_readpage() and
> _hash_saveitem(). The former is used to load all the qualifying tuples
> from a target bucket or overflow page into an items array. The latter
> one is used by _hash_readpage to save all the qualifying tuples found
> in a page into an items array. Apart from that, this patch bascially
> cleans _hash_first(), _hash_next and hashgettuple().

For _hash_next I don't see this - can you explain?

+ *
+ * On failure exit (no more tuples), we release pin and set
+ * so->currPos.buf to InvalidBuffer.

+ * Returns true if any matching items are found else returns false.

s/Returns/Return/g

> 2) 0002-Remove-redundant-function-_hash_step-and-some-of-the.patch:
> this patch basically removes the redundant function _hash_step() and
> some of the unused members of HashScanOpaqueData structure.

Looks good.

> 3) 0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patch:
> this patch basically improves the locking strategy for VACUUM in hash
> index. As the new hash index scan works in page-at-a-time, vacuum can
> release the lock on previous page before acquiring a lock on the next
> page, hence, improving hash index concurrency.

+ * As the new hash index scan work in page at a time mode,

Remove 'new'.

> I have also done the benchmarking of this patch and would like to
> share the results for the same,
>
> Firstly, I have done the benchmarking with non-unique values and i
> could see a performance improvement of 4-7%. For the detailed results
> please find the attached file 'results-non-unique values-70ff', and
> ddl.sql, test.sql are test scripts used in this experimentation. The
> detail of non-default GUC params and pgbench command are mentioned in
> the result sheet. I also did the benchmarking with unique values at
> 300 and 1000 scale factor and its results are provided in
> 'results-unique-values-default-ff'.

I'm seeing similar results, especially with write-heavy scenarios.
Best regards, Jesper
Hi,

>> Attached patch modifies hash index scan code for page-at-a-time mode.
>> For better readability, I have splitted it into 3 parts,
>
> Due to the commits on master these patches applies with hunks.
>
> The README should be updated to mention the use of page scan.

Done. Please refer to the attached v2 version of the patch.

> hash.h needs pg_indent.

Fixed.

>> 1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
>> patch rewrites the hash index scan module to work in page-at-a-time
>> mode. It basically introduces two new functions-- _hash_readpage() and
>> _hash_saveitem(). The former is used to load all the qualifying tuples
>> from a target bucket or overflow page into an items array. The latter
>> one is used by _hash_readpage to save all the qualifying tuples found
>> in a page into an items array. Apart from that, this patch bascially
>> cleans _hash_first(), _hash_next and hashgettuple().
>
> For _hash_next I don't see this - can you explain ?

Sorry, it was wrongly copied from the btree code. I have corrected it now. Please check the attached v2 version of the patch.

> + *
> + * On failure exit (no more tuples), we release pin and set
> + * so->currPos.buf to InvalidBuffer.
>
> + * Returns true if any matching items are found else returns false.
>
> s/Returns/Return/g

Done.

>> 2) 0002-Remove-redundant-function-_hash_step-and-some-of-the.patch:
>> this patch basically removes the redundant function _hash_step() and
>> some of the unused members of HashScanOpaqueData structure.
>
> Looks good.
>
>> 3) 0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patch:
>> this patch basically improves the locking strategy for VACUUM in hash
>> index. As the new hash index scan works in page-at-a-time, vacuum can
>> release the lock on previous page before acquiring a lock on the next
>> page, hence, improving hash index concurrency.
>
> + * As the new hash index scan work in page at a time mode,
>
> Remove 'new'.

Done.

>> I have also done the benchmarking of this patch and would like to
>> share the results for the same,
>>
>> Firstly, I have done the benchmarking with non-unique values and i
>> could see a performance improvement of 4-7%. For the detailed results
>> please find the attached file 'results-non-unique values-70ff', and
>> ddl.sql, test.sql are test scripts used in this experimentation. The
>> detail of non-default GUC params and pgbench command are mentioned in
>> the result sheet. I also did the benchmarking with unique values at
>> 300 and 1000 scale factor and its results are provided in
>> 'results-unique-values-default-ff'.
>
> I'm seeing similar results, and especially with write heavy scenarios.

Great..!!

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attachment
Hi,

On 03/22/2017 09:32 AM, Ashutosh Sharma wrote:
> Done. Please refer to the attached v2 version of patch.

Thanks.

>>> 1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
>>> patch rewrites the hash index scan module to work in page-at-a-time
>>> mode. It basically introduces two new functions-- _hash_readpage() and
>>> _hash_saveitem(). The former is used to load all the qualifying tuples
>>> from a target bucket or overflow page into an items array. The latter
>>> one is used by _hash_readpage to save all the qualifying tuples found
>>> in a page into an items array. Apart from that, this patch bascially
>>> cleans _hash_first(), _hash_next and hashgettuple().

0001v2:

In hashgettuple() you can remove the 'currItem' and 'offnum' from the 'else' part, and do the assignment inside

if (so->numKilled < MaxIndexTuplesPerPage)

instead.

No new comments for 0002 and 0003.

Best regards,
Jesper
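To make the suggested restructuring concrete, it would look roughly like this (a sketch only; the HashScanPosItem field names and the killedItems layout are assumptions about the patch, not quoted from it):

    /* Sketch: compute the item details only in the branch that records a kill. */
    if (so->numKilled < MaxIndexTuplesPerPage)
    {
        HashScanPosItem *currItem = &so->currPos.items[so->currPos.itemIndex];

        so->killedItems[so->numKilled].heapTid = currItem->heapTid;
        so->killedItems[so->numKilled].indexOffset = currItem->indexOffset;
        so->numKilled++;
    }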
On Thu, Mar 23, 2017 at 8:29 PM, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> Hi,
>
> On 03/22/2017 09:32 AM, Ashutosh Sharma wrote:
>> Done. Please refer to the attached v2 version of patch.
>
> Thanks.
>
>>>> 1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
>>>> patch rewrites the hash index scan module to work in page-at-a-time
>>>> mode. It basically introduces two new functions-- _hash_readpage() and
>>>> _hash_saveitem(). The former is used to load all the qualifying tuples
>>>> from a target bucket or overflow page into an items array. The latter
>>>> one is used by _hash_readpage to save all the qualifying tuples found
>>>> in a page into an items array. Apart from that, this patch bascially
>>>> cleans _hash_first(), _hash_next and hashgettuple().
>
> 0001v2:
>
> In hashgettuple() you can remove the 'currItem' and 'offnum' from the 'else'
> part, and do the assignment inside
>
> if (so->numKilled < MaxIndexTuplesPerPage)
>
> instead.

Done. Please have a look at the attached v3 patch.

> No new comments for 0002 and 0003.

okay. Thanks.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attachment
Hi, On 03/23/2017 02:11 PM, Ashutosh Sharma wrote: > On Thu, Mar 23, 2017 at 8:29 PM, Jesper Pedersen > <jesper.pedersen@redhat.com> wrote: >> 0001v2: >> >> In hashgettuple() you can remove the 'currItem' and 'offnum' from the 'else' >> part, and do the assignment inside >> >> if (so->numKilled < MaxIndexTuplesPerPage) >> >> instead. >> > > Done. Please have a look into the attached v3 patch. > >> >> No new comments for 0002 and 0003. > > okay. Thanks. > I'll keep the entry in 'Needs Review' if Alexander, or others, want to add their feedback. (Best to post the entire patch series each time) Best regards, Jesper
> Hi,
>
> On 03/23/2017 02:11 PM, Ashutosh Sharma wrote:
>>
>> On Thu, Mar 23, 2017 at 8:29 PM, Jesper Pedersen
>> <jesper.pedersen@redhat.com> wrote:
>>>
>>> 0001v2:
>>>
>>> In hashgettuple() you can remove the 'currItem' and 'offnum' from the
>>> 'else' part, and do the assignment inside
>>>
>>> if (so->numKilled < MaxIndexTuplesPerPage)
>>>
>>> instead.
>>
>> Done. Please have a look into the attached v3 patch.
>>
>>> No new comments for 0002 and 0003.
>>
>> okay. Thanks.
>
> I'll keep the entry in 'Needs Review' if Alexander, or others, want to add
> their feedback.

okay. Thanks.

> (Best to post the entire patch series each time)

I take your suggestion. Please find the attachments.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attachment
On Thu, Mar 23, 2017 at 2:35 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> I take your suggestion. Please find the attachments.

I think you should consider refactoring this so that it doesn't need to use goto. Maybe move the while (offnum <= maxoff) logic into a helper function and have it return itemIndex. If itemIndex == 0, you can call it again. Notice that the code in the "else" branch of the if (itemIndex == 0) stuff could actually be removed from the else block without changing anything, because the "if" block always either returns or does a goto. That's worthy of a little more work to try to make things simple and clear.

+ * We find the first item(or, if backward scan, the last item) in

Missing space.

In _hash_dropscanbuf(), the handling of hashso_bucket_buf is now inconsistent with the handling of hashso_split_bucket_buf, which seems suspicious. Suppose we enter this function with so->hashso_bucket_buf and so->currPos.buf both being valid buffers, but not the same one. It looks to me as if we'll release the first pin but not the second one. My guess (which could be wrong) is that so->hashso_bucket_buf = InvalidBuffer should be moved back up higher in the function where it was before, just after the first if statement, and that the new condition so->hashso_bucket_buf == so->currPos.buf on the last _hash_dropbuf() should be removed. If that's not right, then I think I need somebody to explain why not.

I am suspicious that _hash_kill_items() is going to have problems if the overflow page is freed before it reacquires the lock. _bt_killitems() contains safeguards against similar cases.

This is not a full review, but I'm out of time for the moment.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
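The goto-free structure being suggested would look roughly like this (a sketch only; the helper name matches the one later versions of the patch introduce, and the page-advancing step is elided):

    /* Sketch: loop over pages, letting a helper collect the matches per page. */
    for (;;)
    {
        itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);

        if (itemIndex != 0)
            break;              /* some matching tuples were saved from this page */

        /* no matches here: stop if there is no further page in the bucket chain */
        if (!BlockNumberIsValid(opaque->hasho_nextblkno))
            return false;

        /* ... otherwise read and lock the next overflow page and recompute
         * page, opaque and the starting offnum before trying again ... */
    }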
Hi,

> I think you should consider refactoring this so that it doesn't need
> to use goto. Maybe move the while (offnum <= maxoff) logic into a
> helper function and have it return itemIndex. If itemIndex == 0, you
> can call it again.

Okay, I have added a helper function for _hash_readpage(). Please check the v4 patch attached with this mail.

> Notice that the code in the "else" branch of the
> if (itemIndex == 0) stuff could actually be removed from the else
> block without changing anything, because the "if" block always either
> returns or does a goto. That's worthy of a little more work to try to
> make things simple and clear.

Corrected.

> + * We find the first item(or, if backward scan, the last item) in
>
> Missing space.

Corrected.

> In _hash_dropscanbuf(), the handling of hashso_bucket_buf is now
> inconsistent with the handling of hashso_split_bucket_buf, which seems
> suspicious.

I have corrected it.

> Suppose we enter this function with so->hashso_bucket_buf
> and so->currPos.buf both being valid buffers, but not the same one.
> It looks to me as if we'll release the first pin but not the second
> one.

Yes, that is because we do not need to release the pin on currPos.buf if it is an overflow page. In page scan mode, once we have saved all the qualifying tuples from the current page (which could be an overflow page) into the items array, we release both the pin and the lock on an overflow page. This was not true earlier: consider a case where we are supposed to fetch only a fixed number of tuples from a page using a cursor. In such a case, once the required number of tuples has been fetched, we simply return without releasing the pin on the page being scanned. If this page is an overflow page, then by the end of the scan we would end up with pins held on two buffers, i.e. the bucket buf and the current buf, which is basically the overflow buf.

> My guess (which could be wrong) is that so->hashso_bucket_buf =
> InvalidBuffer should be moved back up higher in the function where it
> was before, just after the first if statement, and that the new
> condition so->hashso_bucket_buf == so->currPos.buf on the last
> _hash_dropbuf() should be removed. If that's not right, then I think
> I need somebody to explain why not.

Okay, as I mentioned above, in page scan mode we only keep a pin on the bucket buf. There won't be any case where we are holding a pin on an overflow buf at the end of the scan. So, basically _hash_dropscanbuf() should only attempt to release the pin on the bucket buf, and an attempt to release the pin on so->currPos.buf should only happen when 'so->hashso_bucket_buf == so->currPos.buf' or when 'so->hashso_split_bucket_buf == so->currPos.buf'.

> I am suspicious that _hash_kill_items() is going to have problems if
> the overflow page is freed before it reacquires the lock.
> _btkillitems() contains safeguards against similar cases.

I have added a similar check for _hash_kill_items() as well.

> This is not a full review, but I'm out of time for the moment.

No worries. I will be ready for your further review comments any time. Thanks for the review.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attachment
Hi, On 03/27/2017 09:34 AM, Ashutosh Sharma wrote: > Hi, > >> I think you should consider refactoring this so that it doesn't need >> to use goto. Maybe move the while (offnum <= maxoff) logic into a >> helper function and have it return itemIndex. If itemIndex == 0, you >> can call it again. > > okay, Added a helper function for _hash_readpage(). Please check v4 > patch attached with this mail. >>> >> This is not a full review, but I'm out of time for the moment. > > No worries. I will be ready for your further review comments any time. > Thanks for the review. > This patch needs a rebase. Best regards, Jesper
Please try applying these patches on top of [1]. I think you should be able to apply it cleanly. Sorry, I think I forgot to mention this point in my earlier mail.
Hi Ashutosh,

On 03/29/2017 09:16 PM, Ashutosh Sharma wrote:
>> This patch needs a rebase.
>
> Please try applying these patches on top of [1]. I think you should be able
> to apply it cleanly. Sorry, I think I forgot to mention this point in my
> earlier mail.
>
> [1] -
> https://www.postgresql.org/message-id/CAE9k0P%3DV2LhtyeMXd295fhisp%3DNWUhRVJ9EZQCDowWiY9rSohQ%40mail.gmail.com

Thanks, that works!

As you have provided a patch for Robert's comments, and no other reviews have been posted, I'm moving this patch to "Ready for Committer" for additional committer feedback.

Best regards,
Jesper
Hi,

> On 03/29/2017 09:16 PM, Ashutosh Sharma wrote:
>>> This patch needs a rebase.
>>
>> Please try applying these patches on top of [1]. I think you should be
>> able to apply it cleanly. Sorry, I think I forgot to mention this point in my
>> earlier mail.
>>
>> [1] -
>> https://www.postgresql.org/message-id/CAE9k0P%3DV2LhtyeMXd295fhisp%3DNWUhRVJ9EZQCDowWiY9rSohQ%40mail.gmail.com
>
> Thanks, that works !
>
> As you have provided a patch for Robert's comments, and no other review have
> been posted I'm moving this patch to "Ready for Committer" for additional
> committer feedback.

Please find the attached new version of the patches for page scan mode in hash index, rebased on top of [1].

[1] - https://www.postgresql.org/message-id/CAE9k0P%3D3rtgUtxopG%2BXQVxgASFzAnGd9sNmYUaj_%3DKeVsKGUdA%40mail.gmail.com
Attachment
On Mon, Mar 27, 2017 at 7:04 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>> I am suspicious that _hash_kill_items() is going to have problems if
>> the overflow page is freed before it reacquires the lock.
>> _btkillitems() contains safeguards against similar cases.
>
> I have added similar check for hash_kill_items() as well.

I think _hash_kill_items has a much bigger problem, which is that we can't kill the items if the page has been modified after re-reading it. Consider a case where VACUUM starts before the scan on the bucket, and the scan then goes ahead (which is possible after your 0003 patch) and notes the killed items in the killed-items array, and before it can kill all the items, VACUUM removes those items. Now it is quite possible that before the scan tries to kill those items, the corresponding itemids have been reused by different tuples. I think what we need to do here is to store the LSN of the page the first time we read it and then compare it with the current page LSN after re-reading the page in _hash_kill_items.

*
+ HashScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+} HashScanPosData;
..
 HashScanPosItem *killedItems; /* tids and offset numbers of killed items */
 int numKilled; /* number of currently stored items */
+
+ /*
+  * Identify all the matching items on a page and save them
+  * in HashScanPosData
+  */
+ HashScanPosData currPos; /* current position data */
} HashScanOpaqueData;

After having the array of HashScanPosItems as currPos.items, I think you don't need an array of HashScanPosItem for killedItems; just an integer array of indexes into currPos.items should be sufficient.

*
I think the below para in README is not valid after this patch.

"To keep concurrency reasonably good, we require readers to cope with concurrent insertions, which means that they have to be able to re-find their current scan position after re-acquiring the buffer content lock on page. Since deletion is not possible while a reader holds the pin on bucket, and we assume that heap tuple TIDs are unique, this can be implemented by searching for the same heap tuple TID previously returned. Insertion does not move index entries across pages, so the previously-returned index entry should always be on the same page, at the same or higher offset number, as it was before."

*
- next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-                                       LH_OVERFLOW_PAGE,
-                                       bstrategy);
-
 /*
- * release the lock on previous page after acquiring the lock on next
- * page
+ * As the hash index scan work in page at a time mode,
+ * vacuum can release the lock on previous page before
+ * acquiring lock on the next page.
  */
..
+ next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+                                       LH_OVERFLOW_PAGE,
+                                       bstrategy);
+

After this change, you need to modify the comments on top of the functions hashbucketcleanup() and _hash_squeezebucket().

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
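For illustration, the LSN-based safeguard being described would look roughly like this (a sketch only, assuming a currPos.lsn field saved when the page was first read; this is not the actual patch code):

    /* When the page is first read and matches are saved: */
    so->currPos.lsn = PageGetLSN(page);

    /* Later, in _hash_kill_items(), after re-locking the page: */
    LockBuffer(buf, BUFFER_LOCK_SHARE);
    page = BufferGetPage(buf);
    if (PageGetLSN(page) != so->currPos.lsn)
    {
        /* The page changed since we read it (e.g. VACUUM removed tuples and
         * the itemids may have been reused), so it is not safe to set
         * LP_DEAD hints; just give up. */
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        return;
    }
    /* ... otherwise proceed to mark the remembered items as dead ... */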
Hi, Thanks for the review. >>> I am suspicious that _hash_kill_items() is going to have problems if >>> the overflow page is freed before it reacquires the lock. >>> _btkillitems() contains safeguards against similar cases. >> >> I have added similar check for hash_kill_items() as well. >> > > I think hash_kill_items has a much bigger problem which is that we > can't kill the items if the page has been modified after re-reading > it. Consider a case where Vacuum starts before the Scan on the bucket > and then Scan went ahead (which is possible after your 0003 patch) and > noted the killed items in killed item array and before it could kill > all the items, Vacuum remove those items. Now it is quite possible > that before scan tries to kill those items, the corresponding itemids > have been used by different tuples. I think what we need to do here > is to store the LSN of page first time when we have read the page and > then compare it with current page lsn after re-reading the page in > hash_kill_tems. Okay, understood. I have done the changes to handle this type of scenario. Please refer to the attached patches. Thanks. > > * > + HashScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */ > +} HashScanPosData; > .. > > HashScanPosItem *killedItems; /* tids and offset numbers of killed items */ > int numKilled; /* number of currently stored items */ > + > + /* > + * Identify all the matching items on a page and save them > + * in HashScanPosData > + */ > + HashScanPosData currPos; /* current position data */ > } HashScanOpaqueData; > > > After having array of HashScanPosItems as currPos.items, I think you > don't need array of HashScanPosItem for killedItems, just an integer > array of index in currPos.items should be sufficient. Corrected. > > > * > I think below para in README is not valid after this patch. > > "To keep concurrency reasonably good, we require readers to cope with > concurrent insertions, which means that they have to be able to > re-find > their current scan position after re-acquiring the buffer content lock > on page. Since deletion is not possible while a reader holds the pin > on bucket, and we assume that heap tuple TIDs are unique, this can be > implemented by searching for the same heap tuple TID previously > returned. Insertion does not move index entries across pages, so the > previously-returned index entry should always be on the same page, at > the same or higher offset number, > as it was before." I have modified above paragraph in README file. > > * > - next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE, > - LH_OVERFLOW_PAGE, > - bstrategy); > - > > /* > - * release the lock on previous page after acquiring the lock on next > - * page > + * As the hash index scan work in page at a time mode, > + * vacuum can release the lock on previous page before > + * acquiring lock on the next page. > */ > .. > + next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE, > + LH_OVERFLOW_PAGE, > + bstrategy); > + > > After this change, you need to modify comments on top of function > hashbucketcleanup() and _hash_squeezebucket(). > Done. Please note that these patches needs to be applied on top of [1]. [1] - https://www.postgresql.org/message-id/CAE9k0P%3D3rtgUtxopG%2BXQVxgASFzAnGd9sNmYUaj_%3DKeVsKGUdA%40mail.gmail.com -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Attachment
On Sun, Apr 2, 2017 at 4:14 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Please note that these patches needs to be applied on top of [1].

Few more review comments:

1.
- page = BufferGetPage(so->hashso_curbuf);
+ blkno = so->currPos.currPage;
+ if (so->hashso_bucket_buf == so->currPos.buf)
+ {
+     buf = so->currPos.buf;
+     LockBuffer(buf, BUFFER_LOCK_SHARE);
+     page = BufferGetPage(buf);
+ }

Here, you are assuming that only the bucket page can be pinned, but I think that assumption is not right. When _hash_kill_items() is called before moving to the next page, there could be a pin on the overflow page. You need some logic to check whether the buffer is pinned, and then just lock it. I think once you do that, it might not be convenient to release the pin at the end of this function.

Add some comments on top of _hash_kill_items to explain the new changes, or say something like "See _bt_killitems for details".

2.
+ /*
+  * We save the LSN of the page as we read it, so that we know whether it
+  * safe to apply LP_DEAD hints to the page later. This allows us to drop
+  * the pin for MVCC scans, which allows vacuum to avoid blocking.
+  */
+ so->currPos.lsn = PageGetLSN(page);
+

The second part of the above comment doesn't make sense, because vacuum will still be blocked since we hold the pin on the primary bucket page.

3.
+ {
+     /*
+      * No more matching tuples were found. return FALSE
+      * indicating the same. Also, remember the prev and
+      * next block number so that if fetching tuples using
+      * cursor we remember the page from where to start the
+      * scan.
+      */
+     so->currPos.prevPage = (opaque)->hasho_prevblkno;
+     so->currPos.nextPage = (opaque)->hasho_nextblkno;

You can't read opaque without having the lock, and by this time it has already been released. Also, I think if you want to save the position for cursor movement, then you need to save the position of the last bucket page when the scan completes the overflow chain; however, as you have written it, it will always be an invalid block number. I think there is a similar problem with the prev block number.

4.
+_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
+                           OffsetNumber maxoff, ScanDirection dir)
+{
+     HashScanOpaque so = (HashScanOpaque) scan->opaque;
+     IndexTuple itup;
+     int itemIndex;
+
+     if (ScanDirectionIsForward(dir))
+     {
+         /* load items[] in ascending order */
+         itemIndex = 0;
+
+         /* new page, relocate starting position by binary search */
+         offnum = _hash_binsearch(page, so->hashso_sk_hash);

What is the need to find the offset number when this function already has it as an input parameter?

5.
+_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
+                           OffsetNumber maxoff, ScanDirection dir)

I think maxoff is not required to be passed as a parameter to this function, as you need it only for the forward scan. You can compute it locally.

6.
+_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
+                           OffsetNumber maxoff, ScanDirection dir)
{
..
+     if (ScanDirectionIsForward(dir))
+     {
..
+         while (offnum <= maxoff)
          {
..
+             if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+                 _hash_checkqual(scan, itup))
+             {
+                 /* tuple is qualified, so remember it */
+                 _hash_saveitem(so, itemIndex, offnum, itup);
+                 itemIndex++;
+             }
+
+             offnum = OffsetNumberNext(offnum);
..

Why are you traversing the whole page even when there is no match? There is a similar problem in the backward scan. I think this is a blunder.

7.
+ if (so->currPos.buf == so->hashso_bucket_buf ||
+     so->currPos.buf == so->hashso_split_bucket_buf)
+ {
+     so->currPos.prevPage = InvalidBlockNumber;
+     LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ }
+ else
+ {
+     so->currPos.prevPage = (opaque)->hasho_prevblkno;
+     _hash_relbuf(rel, so->currPos.buf);
+ }
+
+ so->currPos.nextPage = (opaque)->hasho_nextblkno;

What makes you think it is safe to read opaque after releasing the lock?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
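A safe ordering would read the page's special space before giving up the lock; roughly (a sketch only, mirroring the snippet quoted above rather than the actual patch code):

    /* Read the special space while the content lock is still held. */
    opaque = (HashPageOpaque) PageGetSpecialPointer(page);
    so->currPos.nextPage = opaque->hasho_nextblkno;

    if (so->currPos.buf == so->hashso_bucket_buf ||
        so->currPos.buf == so->hashso_split_bucket_buf)
    {
        so->currPos.prevPage = InvalidBlockNumber;
        /* keep the pin on the primary bucket page, just drop the lock */
        LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
    }
    else
    {
        so->currPos.prevPage = opaque->hasho_prevblkno;
        _hash_relbuf(rel, so->currPos.buf);     /* drops both lock and pin */
    }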
On Mon, Mar 27, 2017 at 7:04 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>> My guess (which could be wrong) is that so->hashso_bucket_buf =
>> InvalidBuffer should be moved back up higher in the function where it
>> was before, just after the first if statement, and that the new
>> condition so->hashso_bucket_buf == so->currPos.buf on the last
>> _hash_dropbuf() should be removed. If that's not right, then I think
>> I need somebody to explain why not.
>
> Okay, as i mentioned above, in case of page scan mode we only keep pin
> on a bucket buf. There won't be any case where we will be having pin
> on overflow buf at the end of scan.

What if we mark the buffer as invalid after releasing the pin? We are already doing it that way in the btree code; refer to _bt_drop_lock_and_maybe_pin(). I think if we do it that way, then we can do what Robert is suggesting.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
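A rough sketch of what that could look like (illustrative only; the real _bt_drop_lock_and_maybe_pin() has additional conditions around the snapshot type, and the field names here follow the earlier snippets):

    /* After the lock is already released: drop the pin and forget the
     * buffer, so the rest of the scan code treats it as no longer held. */
    if (BufferIsValid(so->currPos.buf))
    {
        ReleaseBuffer(so->currPos.buf);
        so->currPos.buf = InvalidBuffer;
    }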
On Tue, Apr 4, 2017 at 6:29 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Mar 27, 2017 at 7:04 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: >> My guess (which could be wrong) is that so->hashso_bucket_buf = >>> InvalidBuffer should be moved back up higher in the function where it >>> was before, just after the first if statement, and that the new >>> condition so->hashso_bucket_buf == so->currPos.buf on the last >>> _hash_dropbuf() should be removed. If that's not right, then I think >>> I need somebody to explain why not. >> >> Okay, as i mentioned above, in case of page scan mode we only keep pin >> on a bucket buf. There won't be any case where we will be having pin >> on overflow buf at the end of scan. >> > > What if mark the buffer as invalid after releasing the pin? We are > already doing it that way in btree code, refer > _bt_drop_lock_and_maybe_pin(). I think if we do that way, then we can > do what Robert is suggesting. Please continue reviewing, but I think we're out of time to get this patch into v10. This patch seems to be still under fairly heavy revision, and we're only a couple of days from feature freeze, and the patch upon which it depends (page-at-a-time vacuum) has had no fewer than four follow-up commits repairing various problems with the logic, with no guarantee that we've found all the bugs yet. In view of those facts, I don't think it would be wise for me to commit this to v10, so I'm instead going to move it to the next CommitFest. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On Tue, Apr 4, 2017 at 3:56 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, Apr 2, 2017 at 4:14 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>>
>> Please note that these patches needs to be applied on top of [1].
>
> Few more review comments:
>
> 1.
> - page = BufferGetPage(so->hashso_curbuf);
> + blkno = so->currPos.currPage;
> + if (so->hashso_bucket_buf == so->currPos.buf)
> + {
> +     buf = so->currPos.buf;
> +     LockBuffer(buf, BUFFER_LOCK_SHARE);
> +     page = BufferGetPage(buf);
> + }
>
> Here, you are assuming that only bucket page can be pinned, but I
> think that assumption is not right. When _hash_kill_items() is called
> before moving to next page, there could be a pin on the overflow page.
> You need some logic to check if the buffer is pinned, then just lock
> it. I think once you do that, it might not be convinient to release
> the pin at the end of this function.

Yes, there are a few cases where we might have a pin on overflow pages too, and we need to handle such cases in _hash_kill_items. I have taken care of this in the attached v7 patch. Thanks.

> Add some comments on top of _hash_kill_items to explain the new
> changes or say some thing like "See _bt_killitems for details"

Added some more comments on the new changes.

> 2.
> + /*
> +  * We save the LSN of the page as we read it, so that we know whether it
> +  * safe to apply LP_DEAD hints to the page later. This allows us to drop
> +  * the pin for MVCC scans, which allows vacuum to avoid blocking.
> +  */
> + so->currPos.lsn = PageGetLSN(page);
> +
>
> The second part of above comment doesn't make sense because vacuum's
> will still be blocked because we hold the pin on primary bucket page.

That's right. It doesn't make sense because we won't allow vacuum to start. I have removed it.

> 3.
> + {
> +     /*
> +      * No more matching tuples were found. return FALSE
> +      * indicating the same. Also, remember the prev and
> +      * next block number so that if fetching tuples using
> +      * cursor we remember the page from where to start the
> +      * scan.
> +      */
> +     so->currPos.prevPage = (opaque)->hasho_prevblkno;
> +     so->currPos.nextPage = (opaque)->hasho_nextblkno;
>
> You can't read opaque without having lock and by this time it has
> already been released.

I have corrected it. Please refer to the attached v7 patch.

> Also, I think if you want to save position for
> cursor movement, then you need to save the position of last bucket
> when scan completes the overflow chain, however as you have written it
> will be always invalid block number. I think there is similar problem
> with prevblock number.

Did you mean the last bucket or the last page? If it is the last page, then I am already storing it.

> 4.
> +_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
> +                           OffsetNumber maxoff, ScanDirection dir)
> +{
> +     HashScanOpaque so = (HashScanOpaque) scan->opaque;
> +     IndexTuple itup;
> +     int itemIndex;
> +
> +     if (ScanDirectionIsForward(dir))
> +     {
> +         /* load items[] in ascending order */
> +         itemIndex = 0;
> +
> +         /* new page, relocate starting position by binary search */
> +         offnum = _hash_binsearch(page, so->hashso_sk_hash);
>
> What is the need to find offset number when this function already has
> that as an input parameter?

It's not required. I have removed it.

> 5.
> +_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
> +                           OffsetNumber maxoff, ScanDirection dir)
>
> I think maxoff is not required to be passed a parameter to this
> function as you need it only for forward scan. You can compute it
> locally.

It is required for both the forward and backward scans. However, I am not passing it to _hash_load_qualified_items() now.

> 6.
> +_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
> +                           OffsetNumber maxoff, ScanDirection dir)
> {
> ..
> +     if (ScanDirectionIsForward(dir))
> +     {
> ..
> +         while (offnum <= maxoff)
>           {
> ..
> +             if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
> +                 _hash_checkqual(scan, itup))
> +             {
> +                 /* tuple is qualified, so remember it */
> +                 _hash_saveitem(so, itemIndex, offnum, itup);
> +                 itemIndex++;
> +             }
> +
> +             offnum = OffsetNumberNext(offnum);
> ..
>
> Why are you traversing the whole page even when there is no match?
> There is a similar problem in backward scan. I think this is blunder.

Fixed. Please check the attached '0001-Rewrite-hash-index-scans-to-work-a-page-at-a-timev7.patch'.

> 7.
> + if (so->currPos.buf == so->hashso_bucket_buf ||
> +     so->currPos.buf == so->hashso_split_bucket_buf)
> + {
> +     so->currPos.prevPage = InvalidBlockNumber;
> +     LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
> + }
> + else
> + {
> +     so->currPos.prevPage = (opaque)->hasho_prevblkno;
> +     _hash_relbuf(rel, so->currPos.buf);
> + }
> +
> + so->currPos.nextPage = (opaque)->hasho_nextblkno;
>
> What makes you think it is safe read opaque after releasing the lock?

Nothing makes me think it is safe to read opaque after releasing the lock. It's a mistake; I have corrected it. Please check the attached v7 patch.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attachment
While doing the code coverage testing of the v7 patch shared with [1], I found that there are a few lines of code in _hash_next() that are redundant and need to be removed. I basically came to know this while testing the scenario where a hash index scan starts when a split is in progress. I have removed those lines, and attached is the newer version of the patch.

[1] - https://www.postgresql.org/message-id/CAE9k0Pmn92Le0iOTU5b6087eRXzUnkq1PKcihxtokHJ9boXCBg%40mail.gmail.com
Attachment
Hi,

On Wed, May 10, 2017 at 2:28 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> While doing the code coverage testing of v7 patch shared with - [1], I
> found that there are few lines of code in _hash_next() that are
> redundant and needs to be removed. I basically came to know this while
> testing the scenario where a hash index scan starts when a split is in
> progress. I have removed those lines and attached is the newer version
> of patch.
>
> [1] - https://www.postgresql.org/message-id/CAE9k0Pmn92Le0iOTU5b6087eRXzUnkq1PKcihxtokHJ9boXCBg%40mail.gmail.com

Please find the new version of the patches for page-at-a-time scan in hash index, rebased on top of the latest commit in the master branch. Also, I have run the pgindent script with pg_bsd_indent version 2.0 on all the modified files. Thanks.
Attachment
On Sun, Jul 30, 2017 at 2:07 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > Hi, > > On Wed, May 10, 2017 at 2:28 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: >> While doing the code coverage testing of v7 patch shared with - [1], I >> found that there are few lines of code in _hash_next() that are >> redundant and needs to be removed. I basically came to know this while >> testing the scenario where a hash index scan starts when a split is in >> progress. I have removed those lines and attached is the newer version >> of patch. >> > > Please find the new version of patches for page at a time scan in hash > index rebased on top of latest commit in master branch. Also, i have > ran pgindent script with pg_bsd_indent version 2.0 on all the modified > files. Thanks. > Few comments: 1. +_hash_kill_items(IndexScanDesc scan, bool havePin) I think you can do without the second parameter. Can't we detect inside _hash_kill_items whether the page is pinned or not as we do for btree? 2. + /* + * We remember prev and next block number along with current block + * number so that if fetching the tup- les using cursor we know + * the page from where to startwith. This is for the case where we + * have re- ached the end of bucket chain without finding any + * matching tuples. The spelling of tuples and reached contain some unwanted symbol. Have space between "startwith" or just use "begin" 3. + if (!BlockNumberIsValid((opaque)->hasho_nextblkno)) + { + so->currPos.prevPage = (opaque)->hasho_prevblkno; + so->currPos.nextPage = InvalidBlockNumber; + } There is no need to use Parentheses around opaque. I mean there is no problem with that, but it is redundant and makes code less readable. Also, there is similar usage at other places in the code, please change all another place as well. I think you can save the value of prevblkno in a local variable and use it in else part. 4. + if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) && + _hash_checkqual(scan, itup)) + { + /* tuple is qualified, so remember it */ + _hash_saveitem(so, itemIndex, offnum, itup); + itemIndex++; + } + else + + /* + * No more matching tuples exist in this page. so, exit while + * loop. + */ + break; It is better to have braces for the else part. It makes code look better. The type of code exists few line down as well, change that as well. 5. + /* + * We save the LSN of the page as we read it, so that we know whether it + * safe to apply LP_DEAD hints to the page later. + */ "whether it safe"/"whether it is safe" -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Aug 4, 2017 at 9:44 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > There is no need to use Parentheses around opaque. I mean there is no > problem with that, but it is redundant and makes code less readable. Amit, I'm sure you know this, but just for the benefit of anyone who doesn't: We often include these kinds of extra parentheses in macros, for good reason. Suppose you have: #define mul(x,y) x * y If the user says mul(2+3,5), it will expand to 2 + 3 * 5 = 17, which is wrong. If you instead do this: #define mul(x,y) (x) * (y) ...then mul(2+3,5) expands to (2 + 3) * (5) = 25, which is what the user of the macro is expecting to get. Outside of macro definitions, as you say, there's no point and we should avoid it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
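For anyone who wants to see that expansion behaviour in a compilable form, here is a tiny standalone C example; the MUL_BAD/MUL_GOOD names are made up purely for illustration and are not PostgreSQL macros:

    #include <stdio.h>

    /* Unparenthesized macro: arguments expand textually, so operator
     * precedence can regroup the expression in surprising ways. */
    #define MUL_BAD(x, y)   x * y
    /* Parenthesized macro: each argument is grouped before use. */
    #define MUL_GOOD(x, y)  ((x) * (y))

    int
    main(void)
    {
        printf("MUL_BAD(2 + 3, 5)  = %d\n", MUL_BAD(2 + 3, 5));   /* 2 + 3 * 5 = 17 */
        printf("MUL_GOOD(2 + 3, 5) = %d\n", MUL_GOOD(2 + 3, 5));  /* (2 + 3) * (5) = 25 */
        return 0;
    }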
On Fri, Aug 4, 2017 at 7:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Sun, Jul 30, 2017 at 2:07 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: >> Hi, >> >> On Wed, May 10, 2017 at 2:28 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: >>> While doing the code coverage testing of v7 patch shared with - [1], I >>> found that there are few lines of code in _hash_next() that are >>> redundant and needs to be removed. I basically came to know this while >>> testing the scenario where a hash index scan starts when a split is in >>> progress. I have removed those lines and attached is the newer version >>> of patch. >>> >> >> Please find the new version of patches for page at a time scan in hash >> index rebased on top of latest commit in master branch. Also, i have >> ran pgindent script with pg_bsd_indent version 2.0 on all the modified >> files. Thanks. >> > > Few comments: Thanks for reviewing the patch. > 1. > +_hash_kill_items(IndexScanDesc scan, bool havePin) > > I think you can do without the second parameter. Can't we detect > inside _hash_kill_items whether the page is pinned or not as we do for > btree? Okay, done that way. Please check attached v10 patch. > > 2. > + /* > + * We remember prev and next block number along with current block > + * number so that if fetching the tup- les using cursor we know > + * the page from where to startwith. This is for the case where we > + * have re- ached the end of bucket chain without finding any > + * matching tuples. > > The spelling of tuples and reached contain some unwanted symbol. Have > space between "startwith" or just use "begin" Corrected. > > 3. > + if (!BlockNumberIsValid((opaque)->hasho_nextblkno)) > + { > + so->currPos.prevPage = (opaque)->hasho_prevblkno; > + so->currPos.nextPage = InvalidBlockNumber; > + } > > There is no need to use Parentheses around opaque. I mean there is no > problem with that, but it is redundant and makes code less readable. > Also, there is similar usage at other places in the code, please > change all another place as well. Removed parenthesis around opaque. I think you can save the value of > prevblkno in a local variable and use it in else part. Okay, done that way. > > 4. > + if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) && > + _hash_checkqual(scan, itup)) > + { > + /* tuple is qualified, so remember it */ > + _hash_saveitem(so, itemIndex, offnum, itup); > + itemIndex++; > + } > + else > + > + /* > + * No more matching tuples exist in this page. so, exit while > + * loop. > + */ > + break; > > It is better to have braces for the else part. It makes code look > better. The type of code exists few line down as well, change that as > well. Added braces in the else part. > > 5. > + /* > + * We save the LSN of the page as we read it, so that we know whether it > + * safe to apply LP_DEAD hints to the page later. > + */ > > "whether it safe"/"whether it is safe" Corrected. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Mon, Aug 7, 2017 at 5:50 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > On Fri, Aug 4, 2017 at 7:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Sun, Jul 30, 2017 at 2:07 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: >>> Hi, >>> >>> On Wed, May 10, 2017 at 2:28 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: >>>> While doing the code coverage testing of v7 patch shared with - [1], I >>>> found that there are few lines of code in _hash_next() that are >>>> redundant and needs to be removed. I basically came to know this while >>>> testing the scenario where a hash index scan starts when a split is in >>>> progress. I have removed those lines and attached is the newer version >>>> of patch. >>>> >>> >>> Please find the new version of patches for page at a time scan in hash >>> index rebased on top of latest commit in master branch. Also, i have >>> ran pgindent script with pg_bsd_indent version 2.0 on all the modified >>> files. Thanks. >>> >> >> Few comments: > > Thanks for reviewing the patch. > Comments on the latest patch: 1. /* + * We remember prev and next block number along with current block + * number so that if fetching the tuples using cursor we know the + * page from where to begin. This is for the case where we have + * reached the end of bucket chain without finding any matching + * tuples. + */ + if (!BlockNumberIsValid(opaque->hasho_nextblkno)) + prev_blkno = opaque->hasho_prevblkno; This doesn't seem to be the right place for this comment as you are not saving next or current block number here. I am not sure whether you really need this comment at all. Will it be better if you move this to else part of the loop and reword it as: "Remember next and previous block numbers for scrollable cursors to know the start position." 2. The code in _hash_readpage looks quite bloated. I think we can make it better if we do something like below. a. +_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir) { .. + if (so->currPos.buf == so->hashso_bucket_buf || + so->currPos.buf == so->hashso_split_bucket_buf) + { + so->currPos.prevPage = InvalidBlockNumber; + so->currPos.nextPage = opaque->hasho_nextblkno; + LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK); + } + else + { + so->currPos.prevPage = opaque->hasho_prevblkno; + so->currPos.nextPage = opaque->hasho_nextblkno; + _hash_relbuf(rel, so->currPos.buf); + so->currPos.buf = InvalidBuffer; + } .. } This code chunk is same for both forward and backward scans. I think you can have single copy of this code by moving it out of if-else loop. b. +_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir) { .. + /* new page, locate starting position by binary search */ + offnum = _hash_binsearch(page, so->hashso_sk_hash); + + itemIndex = _hash_load_qualified_items(scan, page, offnum, dir); + + while (itemIndex == 0) + { + /* + * Could not find any matching tuples in the current page, move to + * the next page. Before leaving the current page, also deal with + * any killed items. + */ + if (so->numKilled > 0) + _hash_kill_items(scan); + + /* + * We remember prev and next block number along with current block + * number so that if fetching the tuples using cursor we know the + * page from where to begin. This is for the case where we have + * reached the end of bucket chain without finding any matching + * tuples. 
+ */ + if (!BlockNumberIsValid(opaque->hasho_nextblkno)) + prev_blkno = opaque->hasho_prevblkno; + + _hash_readnext(scan, &buf, &page, &opaque); + if (BufferIsValid(buf)) + { + so->currPos.buf = buf; + so->currPos.currPage = BufferGetBlockNumber(buf); + so->currPos.lsn = PageGetLSN(page); + offnum = _hash_binsearch(page, so->hashso_sk_hash); + itemIndex = _hash_load_qualified_items(scan, page, + offnum, dir); .. } Have just one copy of code search the offset and load qualified items. Change the above while loop to do..while loop and have a check in between break the loop when item index is not zero. 3. - * Find the first item in the index that - * satisfies the qualification associated with the scan descriptor. On - * success, the page containing the current index tuple is read locked - * and pinned, and the scan's opaque data entry is updated to - * include the buffer. + * We find the first item (or, if backward scan, the last item) in + * the index that satisfies the qualification associated with the + * scan descriptor. On success, the page containing the current + * index tuple is read locked and pinned, and data about the + * matching tuple(s) on the page has been loaded into so->currPos, + * scan->xs_ctup.t_self is set to the heap TID of the current tuple. + * + * If there are no matching items in the index, we return FALSE, + * with no pins or locks held. */bool_hash_first(IndexScanDesc scan, ScanDirection dir) @@ -229,15 +297,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir) I don't think on success, the lock or pin is held anymore, after this patch the only pin will be retained and that too for bucket page. Also, there is no need to modify the part of the comment which is not related to change in this patch. I don't see scan->xs_ctup.t_self getting set anywhere in this function. I think you are setting it in hashgettuple. It is better to move that assignment from hashgettuple to _hash_first so as to be consistent with _hash_next. 4. + * On successful exit, scan->xs_ctup.t_self is set to the TID + * of the next heap tuple, and if requested, scan->xs_itup + * points to a copy of the index tuple. so->currPos is updated + * as needed. + * + * On failure exit (no more tuples), we return FALSE with no + * pins or locks held. */bool_hash_next(IndexScanDesc scan, ScanDirection dir) I don't see the usage of scan->xs_itup in this function. It seems to me that you have copied it from btree code and forgot to remove part of the comment which is not relevant to a hash index. 5. @@ -514,19 +422,14 @@ hashendscan(IndexScanDesc scan) { .. + if (HashScanPosIsValid(so->currPos)) { - LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE); - _hash_kill_items(scan); - LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK); + /* Before leaving current page, deal with any killed items */ + if (so->numKilled > 0) + _hash_kill_items(scan); + _hash_dropscanbuf(rel, so); } - _hash_dropscanbuf(rel, so); - .. } I don't think it is a good idea to move _hash_dropscanbuf as that check just ensures if the current buffer is valid. It doesn't say anything about other buffers saved in HashScanOpaque. A similar change is required in hashrescan. 6. _hash_kill_items(IndexScanDesc scan) { .. + if (so->hashso_bucket_buf == so->currPos.buf || + HashScanPosIsPinned(so->currPos)) .. } Isn't second check enough? It should indirectly cover the first test. 7. _hash_kill_items(IndexScanDesc scan) { .. + /* + * If page LSN differs it means that the page was modified since the + * last read. 
killedItems could be not valid so LP_DEAD hints apply- + * ing is not safe. + */ + page = BufferGetPage(buf); + if (PageGetLSN(page) != so->currPos.lsn) + { + _hash_relbuf(rel, buf); + return; + } .. } How does this check cover the case of unlogged tables? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
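To make the shape of the restructuring suggested in comment 2b a bit more concrete, a rough sketch of the forward-scan case could look like the following. This is an illustration of the suggested do..while structure only, not the actual patch code; variable setup, the currPos bookkeeping and the backward-scan start position are omitted.

    static int
    sketch_readpage_loop(IndexScanDesc scan, Buffer buf, Page page,
                         HashPageOpaque opaque, ScanDirection dir)
    {
        HashScanOpaque so = (HashScanOpaque) scan->opaque;
        OffsetNumber   offnum;
        int            itemIndex;

        /* one copy of the "locate start position and load items" code */
        do
        {
            offnum = _hash_binsearch(page, so->hashso_sk_hash);
            itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
            if (itemIndex != 0)
                break;              /* found something on this page */

            /* nothing matched here; deal with killed items, then advance */
            if (so->numKilled > 0)
                _hash_kill_items(scan);

            _hash_readnext(scan, &buf, &page, &opaque);
        } while (BufferIsValid(buf));

        return itemIndex;           /* 0 means the bucket chain is exhausted */
    }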
On Wed, Aug 9, 2017 at 2:58 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Aug 7, 2017 at 5:50 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > 7. > _hash_kill_items(IndexScanDesc scan) > { > .. > + /* > + * If page LSN differs it means that the page was modified since the > + * last read. killedItems could be not valid so LP_DEAD hints apply- > + * ing is not safe. > + */ > + page = BufferGetPage(buf); > + if (PageGetLSN(page) != so->currPos.lsn) > + { > + _hash_relbuf(rel, buf); > + return; > + } > .. > } > > How does this check cover the case of unlogged tables? > I have thought about it and I think we can't use the technique btree is using (not to release the pin on the page) to save unlogged or temporary relations. It works for btree because it takes a cleanup lock on each page before removing items from each page whereas in hash index we take cleanup lock only on primary bucket page. Now, one thing we could do is to start taking a cleanup lock on each bucket page (which includes primary bucket page and overflow pages), but I think it can turn out to be worse than the current locking strategy. Another idea could be to preserve the current locking strategy (take the lock on next bucket page and then release the lock on current bucket page) during vacuum of the unlogged hash index. That will ensure vacuum won't be able to remove the TIDs which we are going to mark as dead. Thoughts? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
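To illustrate the second idea (vacuum keeps the current page locked until the next page in the bucket chain is locked, so a scan can never overtake it), a simplified sketch of the locking order is below. It is only a sketch, not the actual hashbucketcleanup() code, and it glosses over cleanup locks, WAL logging and the actual tuple removal.

    static void
    sketch_vacuum_bucket_chain(Relation rel, Buffer bucket_buf)
    {
        Buffer      buf = bucket_buf;   /* caller holds lock and pin on it */

        for (;;)
        {
            Page            page = BufferGetPage(buf);
            HashPageOpaque  opaque = (HashPageOpaque) PageGetSpecialPointer(page);
            BlockNumber     next_blkno = opaque->hasho_nextblkno;
            Buffer          next_buf = InvalidBuffer;

            /* ... delete dead index entries from "page" here ... */

            /* lock the next page in the chain before letting go of this one */
            if (BlockNumberIsValid(next_blkno))
                next_buf = _hash_getbuf(rel, next_blkno, HASH_WRITE,
                                        LH_OVERFLOW_PAGE);

            if (buf == bucket_buf)
                LockBuffer(buf, BUFFER_LOCK_UNLOCK);    /* keep pin on primary */
            else
                _hash_relbuf(rel, buf);

            if (!BufferIsValid(next_buf))
                break;
            buf = next_buf;
        }
    }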
On Wed, Aug 9, 2017 at 2:58 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Comments on the latest patch:
Thank you once again for reviewing my patches.
>
> 1.
> /*
> + * We remember prev and next block number along with current block
> + * number so that if fetching the tuples using cursor we know the
> + * page from where to begin. This is for the case where we have
> + * reached the end of bucket chain without finding any matching
> + * tuples.
> + */
> + if (!BlockNumberIsValid(opaque->hasho_nextblkno))
> + prev_blkno = opaque->hasho_prevblkno;
>
> This doesn't seem to be the right place for this comment as you are
> not saving next or current block number here. I am not sure whether
> you really need this comment at all. Will it be better if you move
> this to else part of the loop and reword it as:
>
> "Remember next and previous block numbers for scrollable cursors to
> know the start position."
Shifted the comments to the else part of the loop.
>
> 2. The code in _hash_readpage looks quite bloated. I think we can
> make it better if we do something like below.
> a.
> +_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
> {
> ..
> + if (so->currPos.buf == so->hashso_bucket_buf ||
> + so->currPos.buf == so->hashso_split_bucket_buf)
> + {
> + so->currPos.prevPage = InvalidBlockNumber;
> + so->currPos.nextPage = opaque->hasho_nextblkno;
> + LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
> + }
> + else
> + {
> + so->currPos.prevPage = opaque->hasho_prevblkno;
> + so->currPos.nextPage = opaque->hasho_nextblkno;
> + _hash_relbuf(rel, so->currPos.buf);
> + so->currPos.buf = InvalidBuffer;
> + }
> ..
> }
>
> This code chunk is same for both forward and backward scans. I think
> you can have single copy of this code by moving it out of if-else
> loop.
Done.
>
> b.
> +_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
> {
> ..
> + /* new page, locate starting position by binary search */
> + offnum = _hash_binsearch(page, so->hashso_sk_hash);
> +
> + itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
> +
> + while (itemIndex == 0)
> + {
> + /*
> + * Could not find any matching tuples in the current page, move to
> + * the next page. Before leaving the current page, also deal with
> + * any killed items.
> + */
> + if (so->numKilled > 0)
> + _hash_kill_items(scan);
> +
> + /*
> + * We remember prev and next block number along with current block
> + * number so that if fetching the tuples using cursor we know the
> + * page from where to begin. This is for the case where we have
> + * reached the end of bucket chain without finding any matching
> + * tuples.
> + */
> + if (!BlockNumberIsValid(opaque->hasho_nextblkno))
> + prev_blkno = opaque->hasho_prevblkno;
> +
> + _hash_readnext(scan, &buf, &page, &opaque);
> + if (BufferIsValid(buf))
> + {
> + so->currPos.buf = buf;
> + so->currPos.currPage = BufferGetBlockNumber(buf);
> + so->currPos.lsn = PageGetLSN(page);
> + offnum = _hash_binsearch(page, so->hashso_sk_hash);
> + itemIndex = _hash_load_qualified_items(scan, page,
> + offnum, dir);
> ..
> }
>
> Have just one copy of code search the offset and load qualified items.
> Change the above while loop to do..while loop and have a check in
> between break the loop when item index is not zero.
Done that way.
>
> 3.
> - * Find the first item in the index that
> - * satisfies the qualification associated with the scan descriptor. On
> - * success, the page containing the current index tuple is read locked
> - * and pinned, and the scan's opaque data entry is updated to
> - * include the buffer.
> + * We find the first item (or, if backward scan, the last item) in
> + * the index that satisfies the qualification associated with the
> + * scan descriptor. On success, the page containing the current
> + * index tuple is read locked and pinned, and data about the
> + * matching tuple(s) on the page has been loaded into so->currPos,
> + * scan->xs_ctup.t_self is set to the heap TID of the current tuple.
> + *
> + * If there are no matching items in the index, we return FALSE,
> + * with no pins or locks held.
> */
> bool
> _hash_first(IndexScanDesc scan, ScanDirection dir)
> @@ -229,15 +297,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
>
> I don't think on success, the lock or pin is held anymore, after this
> patch the only pin will be retained and that too for bucket page.
> Also, there is no need to modify the part of the comment which is not
> related to change in this patch.
Corrected.
>
> I don't see scan->xs_ctup.t_self getting set anywhere in this
> function. I think you are setting it in hashgettuple. It is better
> to move that assignment from hashgettuple to _hash_first so as to be
> consistent with _hash_next.
Done that way.
>
> 4.
> + * On successful exit, scan->xs_ctup.t_self is set to the TID
> + * of the next heap tuple, and if requested, scan->xs_itup
> + * points to a copy of the index tuple. so->currPos is updated
> + * as needed.
> + *
> + * On failure exit (no more tuples), we return FALSE with no
> + * pins or locks held.
> */
> bool
> _hash_next(IndexScanDesc scan, ScanDirection dir)
>
> I don't see the usage of scan->xs_itup in this function. It seems to
> me that you have copied it from btree code and forgot to remove part
> of the comment which is not relevant to a hash index.
Corrected.
>
> 5.
> @@ -514,19 +422,14 @@ hashendscan(IndexScanDesc scan)
> {
> ..
> + if (HashScanPosIsValid(so->currPos))
> {
> - LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
> - _hash_kill_items(scan);
> - LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
> + /* Before leaving current page, deal with any killed items */
> + if (so->numKilled > 0)
> + _hash_kill_items(scan);
> + _hash_dropscanbuf(rel, so);
> }
>
> - _hash_dropscanbuf(rel, so);
> -
> ..
> }
>
> I don't think it is a good idea to move _hash_dropscanbuf as that
> check just ensures if the current buffer is valid. It doesn't say
> anything about other buffers saved in HashScanOpaque. A similar
> change is required in hashrescan.
Done.
>
> 6.
> _hash_kill_items(IndexScanDesc scan)
> {
> ..
> + if (so->hashso_bucket_buf == so->currPos.buf ||
> + HashScanPosIsPinned(so->currPos))
> ..
> }
>
> Isn't second check enough? It should indirectly cover the first test.
Yes, one check should be fine. Corrected it.
>
> 7.
> _hash_kill_items(IndexScanDesc scan)
> {
> ..
> + /*
> + * If page LSN differs it means that the page was modified since the
> + * last read. killedItems could be not valid so LP_DEAD hints apply-
> + * ing is not safe.
> + */
> + page = BufferGetPage(buf);
> + if (PageGetLSN(page) != so->currPos.lsn)
> + {
> + _hash_relbuf(rel, buf);
> + return;
> + }
> ..
> }
>
> How does this check cover the case of unlogged tables?
Thanks for putting that point. It doesn't cover the case for unlogged tables. As suggested by you in one of your email in this mailing list, i am now not allowing vacuum to release lock on current page before acquiring lock on next page for unlogged tables. This will ensure that scan is always behind vacuum if they are running on the same bucket simultaneously. Therefore, there is danger in marking tuples as dead for unlogged pages even if they are not having any lsn.
Attachment
On Fri, Aug 11, 2017 at 6:51 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: >> >> 7. >> _hash_kill_items(IndexScanDesc scan) >> { >> .. >> + /* >> + * If page LSN differs it means that the page was modified since the >> + * last read. killedItems could be not valid so LP_DEAD hints apply- >> + * ing is not safe. >> + */ >> + page = BufferGetPage(buf); >> + if (PageGetLSN(page) != so->currPos.lsn) >> + { >> + _hash_relbuf(rel, buf); >> + return; >> + } >> .. >> } >> >> How does this check cover the case of unlogged tables? > > Thanks for putting that point. It doesn't cover the case for unlogged > tables. As suggested by you in one of your email in this mailing list, i am > now not allowing vacuum to release lock on current page before acquiring > lock on next page for unlogged tables. This will ensure that scan is always > behind vacuum if they are running on the same bucket simultaneously. > Therefore, there is danger in marking tuples as dead for unlogged pages even > if they are not having any lsn. > In the last line, I guess you wanted to say "there is *no* danger ..."? Today, while again thinking about this part of the patch (_hash_kill_items) it occurred to me that we can't rely on a pin on an overflow page to guarantee that it is not modified by Vacuum. Consider a case where vacuum started vacuuming the bucket before the scan and then in-between scan overtakes it. Now, it is possible that even if a scan has a pin on a page (and no lock), vacuum might clean that page, if that happens, then we can't prevent the reuse of TID. What do you think? Few other comments: 1. + * On failure exit (no more tuples), we return FALSE with pin + * pin held on bucket page but no pins or locks held on overflow + * page. */bool_hash_next(IndexScanDesc scan, ScanDirection dir) In the above part of comment 'pin' is used twice. 0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v5 2. - * not at all by the rearrangement we are performing here. To prevent - * any concurrent scan to cross the squeeze scan we use lock chaining - * similar to hasbucketcleanup. Refer comments atop hashbucketcleanup. + * not at all by the rearrangement we are performing here. In _hash_squeezebucket, we still use lock chaining, so removing the above comment doesn't seem like a good idea. I think you should copy part of a comment from hasbucketcleanup starting from "There can't be any concurrent .." 3. _hash_freeovflpage() { .. * Concurrency issues are avoided by using lock chaining as * described atop hashbucketcleanup. .. } After fixing #2, you need to change the function name in above comment. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Aug 19, 2017 at 11:37 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Aug 11, 2017 at 6:51 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: >>> >>> 7. >>> _hash_kill_items(IndexScanDesc scan) >>> { >>> .. >>> + /* >>> + * If page LSN differs it means that the page was modified since the >>> + * last read. killedItems could be not valid so LP_DEAD hints apply- >>> + * ing is not safe. >>> + */ >>> + page = BufferGetPage(buf); >>> + if (PageGetLSN(page) != so->currPos.lsn) >>> + { >>> + _hash_relbuf(rel, buf); >>> + return; >>> + } >>> .. >>> } >>> >>> How does this check cover the case of unlogged tables? >> >> Thanks for putting that point. It doesn't cover the case for unlogged >> tables. As suggested by you in one of your email in this mailing list, i am >> now not allowing vacuum to release lock on current page before acquiring >> lock on next page for unlogged tables. This will ensure that scan is always >> behind vacuum if they are running on the same bucket simultaneously. >> Therefore, there is danger in marking tuples as dead for unlogged pages even >> if they are not having any lsn. >> > Once again, Thank you for reviewing my patches. > In the last line, I guess you wanted to say "there is *no* danger > ..."? Yes, i meant that because, it ensures that scan will always be following VACUUM. Today, while again thinking about this part of the patch > (_hash_kill_items) it occurred to me that we can't rely on a pin on an > overflow page to guarantee that it is not modified by Vacuum. > Consider a case where vacuum started vacuuming the bucket before the > scan and then in-between scan overtakes it. Now, it is possible that > even if a scan has a pin on a page (and no lock), vacuum might clean > that page, if that happens, then we can't prevent the reuse of TID. > What do you think? > I think, you are talking about non-mvcc scan case, because in case of mvcc scans, even if we have released both pin and lock on a page, VACUUM can't remove tuples from a page if it is visible to some concurrently running transactions (mvcc scan in our case). So, i don't think it can happen in case of MVCC scans however it can happen for non-mvcc scans and for that to handle i think, it is better that we drop an idea of allowing scan to overtake VACUUM (done by 0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v5.patch). However, B-Tree has handled this in _bt_drop_lock_and_maybe_pin() where it releases both pin and lock on a page if it is MVCC snapshot else just releases lock on the page. > Few other comments: > > 1. > + * On failure exit (no more tuples), we return FALSE with pin > + * pin held on bucket page but no pins or locks held on overflow > + * page. > */ > bool > _hash_next(IndexScanDesc scan, ScanDirection dir) > > In the above part of comment 'pin' is used twice. OKay, I will remove one extra pin (from comment) in my next version of patch. > > 0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v5 > > 2. > - * not at all by the rearrangement we are performing here. To prevent > - * any concurrent scan to cross the squeeze scan we use lock chaining > - * similar to hasbucketcleanup. Refer comments atop hashbucketcleanup. > + * not at all by the rearrangement we are performing here. > > In _hash_squeezebucket, we still use lock chaining, so removing the > above comment doesn't seem like a good idea. I think you should copy > part of a comment from hasbucketcleanup starting from "There can't be > any concurrent .." 
Okay, I will correct it in my next version of patch. > > 3. > _hash_freeovflpage() > { > .. > > * Concurrency issues are avoided by using lock chaining as > * described atop hashbucketcleanup. > .. > } > > After fixing #2, you need to change the function name in above comment. > Sure, I will correct that in my next version of patch. With Regards, Ashutosh Sharma. EnterpriseDB: http://www.enterprisedb.com
On Tue, Aug 22, 2017 at 2:28 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > On Sat, Aug 19, 2017 at 11:37 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Fri, Aug 11, 2017 at 6:51 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: >>>> >>>> 7. >>>> _hash_kill_items(IndexScanDesc scan) >>>> { >>>> .. >>>> + /* >>>> + * If page LSN differs it means that the page was modified since the >>>> + * last read. killedItems could be not valid so LP_DEAD hints apply- >>>> + * ing is not safe. >>>> + */ >>>> + page = BufferGetPage(buf); >>>> + if (PageGetLSN(page) != so->currPos.lsn) >>>> + { >>>> + _hash_relbuf(rel, buf); >>>> + return; >>>> + } >>>> .. >>>> } >>>> >>>> How does this check cover the case of unlogged tables? >>> >>> Thanks for putting that point. It doesn't cover the case for unlogged >>> tables. As suggested by you in one of your email in this mailing list, i am >>> now not allowing vacuum to release lock on current page before acquiring >>> lock on next page for unlogged tables. This will ensure that scan is always >>> behind vacuum if they are running on the same bucket simultaneously. >>> Therefore, there is danger in marking tuples as dead for unlogged pages even >>> if they are not having any lsn. >>> >> > > Once again, Thank you for reviewing my patches. > >> In the last line, I guess you wanted to say "there is *no* danger >> ..."? > > Yes, i meant that because, it ensures that scan will always be following VACUUM. > > Today, while again thinking about this part of the patch >> (_hash_kill_items) it occurred to me that we can't rely on a pin on an >> overflow page to guarantee that it is not modified by Vacuum. >> Consider a case where vacuum started vacuuming the bucket before the >> scan and then in-between scan overtakes it. Now, it is possible that >> even if a scan has a pin on a page (and no lock), vacuum might clean >> that page, if that happens, then we can't prevent the reuse of TID. >> What do you think? >> > > I think, you are talking about non-mvcc scan case, because in case of > mvcc scans, even if we have released both pin and lock on a page, > VACUUM can't remove tuples from a page if it is visible to some > concurrently running transactions (mvcc scan in our case). > I am talking about tuples that are marked as dead in heap. It has nothing to do with the visibility of tuple or type of scan (mvcc or non-mvcc). -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Aug 22, 2017 at 3:55 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Aug 22, 2017 at 2:28 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: >> On Sat, Aug 19, 2017 at 11:37 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> On Fri, Aug 11, 2017 at 6:51 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: >>>>> >>>>> 7. >>>>> _hash_kill_items(IndexScanDesc scan) >>>>> { >>>>> .. >>>>> + /* >>>>> + * If page LSN differs it means that the page was modified since the >>>>> + * last read. killedItems could be not valid so LP_DEAD hints apply- >>>>> + * ing is not safe. >>>>> + */ >>>>> + page = BufferGetPage(buf); >>>>> + if (PageGetLSN(page) != so->currPos.lsn) >>>>> + { >>>>> + _hash_relbuf(rel, buf); >>>>> + return; >>>>> + } >>>>> .. >>>>> } >>>>> >>>>> How does this check cover the case of unlogged tables? >>>> >>>> Thanks for putting that point. It doesn't cover the case for unlogged >>>> tables. As suggested by you in one of your email in this mailing list, i am >>>> now not allowing vacuum to release lock on current page before acquiring >>>> lock on next page for unlogged tables. This will ensure that scan is always >>>> behind vacuum if they are running on the same bucket simultaneously. >>>> Therefore, there is danger in marking tuples as dead for unlogged pages even >>>> if they are not having any lsn. >>>> >>> >> >> Once again, Thank you for reviewing my patches. >> >>> In the last line, I guess you wanted to say "there is *no* danger >>> ..."? >> >> Yes, i meant that because, it ensures that scan will always be following VACUUM. >> >> Today, while again thinking about this part of the patch >>> (_hash_kill_items) it occurred to me that we can't rely on a pin on an >>> overflow page to guarantee that it is not modified by Vacuum. >>> Consider a case where vacuum started vacuuming the bucket before the >>> scan and then in-between scan overtakes it. Now, it is possible that >>> even if a scan has a pin on a page (and no lock), vacuum might clean >>> that page, if that happens, then we can't prevent the reuse of TID. >>> What do you think? >>> >> >> I think, you are talking about non-mvcc scan case, because in case of >> mvcc scans, even if we have released both pin and lock on a page, >> VACUUM can't remove tuples from a page if it is visible to some >> concurrently running transactions (mvcc scan in our case). >> > > I am talking about tuples that are marked as dead in heap. It has > nothing to do with the visibility of tuple or type of scan (mvcc or > non-mvcc). > Okay, I got your point now. I think, currently in _hash_kill_items(), if an overflow page is pinned we do not check if it got modified since the last read or not. Hence, if the vacuum runs on an overflow page that is pinned and also has some dead tuples in it then it could create a problem for scan basically, when scan would attempt to mark the killed items as dead. To get rid of such problem, i think, even if an overflow page is pinned we should check if it got modified or not since the last read was performed on the page. If yes, then do not allow scan to mark killed items as dead. Attached is the newer version with these changes along with some other cosmetic changes mentioned in your earlier email. Thanks. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Tue, Aug 22, 2017 at 7:24 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > On Tue, Aug 22, 2017 at 3:55 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > Okay, I got your point now. I think, currently in _hash_kill_items(), > if an overflow page is pinned we do not check if it got modified since > the last read or > not. Hence, if the vacuum runs on an overflow page that is pinned and > also has some dead tuples in it then it could create a problem for > scan basically, > when scan would attempt to mark the killed items as dead. To get rid > of such problem, i think, even if an overflow page is pinned we should > check if it got > modified or not since the last read was performed on the page. If yes, > then do not allow scan to mark killed items as dead. Attached is the > newer version with these changes along with some other cosmetic > changes mentioned in your earlier email. Thanks. > Thanks for the new version. I again looked at the patches and fixed quite a few comments in the code and ReadMe. You have forgotten to update README for the changes in vacuum patch (0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v7). I don't have anything more to add. If you are okay with changes, then we can move it to Ready For Committer unless someone else has some more comments. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On 08/23/2017 07:38 AM, Amit Kapila wrote: > Thanks for the new version. I again looked at the patches and fixed > quite a few comments in the code and ReadMe. You have forgotten to > update README for the changes in vacuum patch > (0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v7). I > don't have anything more to add. If you are okay with changes, then > we can move it to Ready For Committer unless someone else has some > more comments. > Just some minor comments. README: + it's pin till the end of scan) its pin till the end of the scan) +To minimize lock/unlock traffic, hash index scan always searches entire hash To minimize lock/unlock traffic, hash index scan always searches the entire hash hashsearch.c: +static inline void _hash_saveitem(HashScanOpaque so, int itemIndex, + OffsetNumber offnum, IndexTuple itup); There are other instances of "inline" in the code base, so I guess that this is ok. + * Advance to next tuple on current page; or if there's no more, try to Advance to the next tuple on the current page; or if done, try to Best regards, Jesper
Hi Amit, On Wed, Aug 23, 2017 at 5:08 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Aug 22, 2017 at 7:24 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: >> On Tue, Aug 22, 2017 at 3:55 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> Okay, I got your point now. I think, currently in _hash_kill_items(), >> if an overflow page is pinned we do not check if it got modified since >> the last read or >> not. Hence, if the vacuum runs on an overflow page that is pinned and >> also has some dead tuples in it then it could create a problem for >> scan basically, >> when scan would attempt to mark the killed items as dead. To get rid >> of such problem, i think, even if an overflow page is pinned we should >> check if it got >> modified or not since the last read was performed on the page. If yes, >> then do not allow scan to mark killed items as dead. Attached is the >> newer version with these changes along with some other cosmetic >> changes mentioned in your earlier email. Thanks. >> > > Thanks for the new version. I again looked at the patches and fixed > quite a few comments in the code and ReadMe. You have forgotten to > update README for the changes in vacuum patch > (0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v7). I > don't have anything more to add. If you are okay with changes, then > we can move it to Ready For Committer unless someone else has some > more comments. > Thanks for reviewing my patches. I've gone through the changes done by you in the README file and few changes in code comments. The changes looks valid to me. But, it seems like there are some more minor review comments from Jesper which i will fix and share the new set of patches shortly. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
On Wed, Aug 23, 2017 at 7:39 PM, Jesper Pedersen <jesper.pedersen@redhat.com> wrote: > On 08/23/2017 07:38 AM, Amit Kapila wrote: >> >> Thanks for the new version. I again looked at the patches and fixed >> quite a few comments in the code and ReadMe. You have forgotten to >> update README for the changes in vacuum patch >> (0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v7). I >> don't have anything more to add. If you are okay with changes, then >> we can move it to Ready For Committer unless someone else has some >> more comments. >> > > Just some minor comments. Thanks for the review. > > README: > + it's pin till the end of scan) > > its pin till the end of the scan) Corrected. > > +To minimize lock/unlock traffic, hash index scan always searches entire > hash > > To minimize lock/unlock traffic, hash index scan always searches the entire > hash Done. > > hashsearch.c: > > +static inline void _hash_saveitem(HashScanOpaque so, int itemIndex, > + OffsetNumber offnum, IndexTuple itup); > > There are other instances of "inline" in the code base, so I guess that this > is ok. > > + * Advance to next tuple on current page; or if there's no more, try > to > > Advance to the next tuple on the current page; or if done, try to > Done. Attached are the patches with above changes. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On 08/24/2017 01:21 AM, Ashutosh Sharma wrote: > Done. > > Attached are the patches with above changes. > Thanks ! Based on the feedback in this thread, I have moved the patch to "Ready for Committer". Best regards, Jesper
On Thu, Aug 24, 2017 at 11:26 AM, Jesper Pedersen <jesper.pedersen@redhat.com> wrote: > Based on the feedback in this thread, I have moved the patch to "Ready for > Committer". Reviewing 0001: _hash_readpage gets the page LSN to see if we can apply LP_DEAD hints, but if the table is unlogged or temporary, the LSN will never change, so the test in _hash_kill_items will always think that it's OK to apply the hints. (This seems like it might be a pretty serious problem, because I'm not sure what would be a viable workaround.) The logic that tries to ensure that so->currPos.{buf,currPage,lsn} get updated is not, to my eyes, obviously correct. Generally, the logic for this stuff seems unnaturally spread out to me. For example, _hash_next() updates currPos.buf, but leaves it to _hash_readpage to set currPage and lsn. That function also sets all three fields when it advances to the next page by calling _hash_readnext(), but when it tries to advance to the next page and doesn't find one it sets buf but not currPage or lsn. It seems to me that this logic should be more centralized somehow. Can we arrange things so that at least buf, currPage, and lsn, and maybe also nextPage and prevPage, get updated at the same time and as soon after reading the buffer as possible? It would be bad if a primary bucket page's hasho_prevblkno field got copied into so->currPos.prevpage, because the value that appears for the primary bucket is not a real block number. But _hash_readpage seems like it can bring about this exact situation, because of this code: + if (!BlockNumberIsValid(opaque->hasho_nextblkno)) + prev_blkno = opaque->hasho_prevblkno; ... + so->currPos.prevPage = prev_blkno; If we're reading the primary bucket page and there are no overflow pages, hasho_nextblkno will not be valid and hasho_prevblkno won't be a real block number. Incidentally, the "if" statement in the above block of code is probably not saving anything; I would suggest for clarity that you do the assignment unconditionally (but this is just a matter of style, so I don't feel super-strongly about it). + return (so->currPos.firstItem <= so->currPos.lastItem); Can this ever return false? It looks to me like we should never reach this code unless that condition holds, and that it would be a bug if we did. If that's correct, maybe this should be an Assert() and the return statement should just return true. + buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE); + + /* It might not exist anymore; in which case we can't hint it. */ + if (!BufferIsValid(buf)) + return; This is dead code, because _hash_getbuf always returns a valid buffer. If there's actually a risk of the buffer disappearing, then some other handling is needed for this case. But I suspect that, since a scan always holds a pin on the primary bucket, there's actually no way for this to happen and this is just dead code. The comment in hashgetbitmap claims that _hash_first() or _hash_next() never returns dead tuples. If that were true, it would be a bug, because then scans started during recovery would return wrong answers. A more accurate statement would be something like: _hash_first and _hash_next handle eliminate dead index entries whenever scan->ignored_killed_tuples is true. Therefore, there's nothing to do here except add the results to the TIDBitmap. _hash_readpage contains unnecessary "continue" statements inside the loops. 
The reason that they're unnecessary is that there's no code below that in the loop anyway, so the loop is already going to go back around to the top. Whether to change this is a matter of style, so I won't complain too much if you want to leave it the way it is, but personally I'd favor removing them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
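To spell out the suggestion about the firstItem/lastItem return value, the idea is simply the following (illustrative wording only, not the committed code):

    /* We only get here after loading at least one matching item,
     * so this can be an assertion rather than a runtime check. */
    Assert(so->currPos.firstItem <= so->currPos.lastItem);
    return true;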
On Tue, Sep 19, 2017 at 9:49 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Aug 24, 2017 at 11:26 AM, Jesper Pedersen > <jesper.pedersen@redhat.com> wrote: >> Based on the feedback in this thread, I have moved the patch to "Ready for >> Committer". > > Reviewing 0001: > > _hash_readpage gets the page LSN to see if we can apply LP_DEAD hints, > but if the table is unlogged or temporary, the LSN will never change, > so the test in _hash_kill_items will always think that it's OK to > apply the hints. (This seems like it might be a pretty serious > problem, because I'm not sure what would be a viable workaround.) > This point has been discussed above [1] and to avoid this problem we are keeping the scan always behind vacuum for unlogged and temporary tables as we are doing without this patch. That will ensure vacuum won't be able to remove the TIDs which we are going to mark as dead. This has been taken care in patch 0003. I understand that this is slightly ugly, but the other alternative (as mentioned in the email [1]) is much worse. [1] - https://www.postgresql.org/message-id/CAA4eK1J6xiJUOidBaOt0iPsAdS0%2Bp5PoKFf1R2yVjTwrY_4snA%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Thanks for all your review comments. Please find my comments in-line. On Tue, Sep 19, 2017 at 9:49 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Aug 24, 2017 at 11:26 AM, Jesper Pedersen > <jesper.pedersen@redhat.com> wrote: >> Based on the feedback in this thread, I have moved the patch to "Ready for >> Committer". > > Reviewing 0001: > > _hash_readpage gets the page LSN to see if we can apply LP_DEAD hints, > but if the table is unlogged or temporary, the LSN will never change, > so the test in _hash_kill_items will always think that it's OK to > apply the hints. (This seems like it might be a pretty serious > problem, because I'm not sure what would be a viable workaround.) > Amit has already replied to this query up-thread. > The logic that tries to ensure that so->currPos.{buf,currPage,lsn} get > updated is not, to my eyes, obviously correct. Generally, the logic > for this stuff seems unnaturally spread out to me. For example, > _hash_next() updates currPos.buf, but leaves it to _hash_readpage to > set currPage and lsn. That function also sets all three fields when > it advances to the next page by calling _hash_readnext(), but when it > tries to advance to the next page and doesn't find one it sets buf but > not currPage or lsn. It seems to me that this logic should be more > centralized somehow. Can we arrange things so that at least buf, > currPage, and lsn, and maybe also nextPage and prevPage, get updated > at the same time and as soon after reading the buffer as possible? > Okay, I have tried to update currPos.{buf, currPage, lsn} in _hash_readpage() at the same time. Please have a look into the attached 0001*.patch. When _hash_readpage() doesn't find any qualifying tuples i.e. when _hash_readnext() returns Invalid buffer, we just update prevPage, nextPage and buf in currPos (not currPage or lsn) as currPage and lsn should point to last page in the hash bucket so that we can mark the killed items as dead at the end of scan (with the help of _hash_kill_items). Hence, we keep the currpage and lsn as it is if no more valid hash pages are found. > It would be bad if a primary bucket page's hasho_prevblkno field got > copied into so->currPos.prevpage, because the value that appears for > the primary bucket is not a real block number. But _hash_readpage > seems like it can bring about this exact situation, because of this > code: > > + if (!BlockNumberIsValid(opaque->hasho_nextblkno)) > + prev_blkno = opaque->hasho_prevblkno; > ... > + so->currPos.prevPage = prev_blkno; > > If we're reading the primary bucket page and there are no overflow > pages, hasho_nextblkno will not be valid and hasho_prevblkno won't be > a real block number. > Fixed. Thanks for putting that point. > Incidentally, the "if" statement in the above block of code is > probably not saving anything; I would suggest for clarity that you do > the assignment unconditionally (but this is just a matter of style, so > I don't feel super-strongly about it). > > + return (so->currPos.firstItem <= so->currPos.lastItem); > > Can this ever return false? It looks to me like we should never reach > this code unless that condition holds, and that it would be a bug if > we did. If that's correct, maybe this should be an Assert() and the > return statement should just return true. > No, it will never return FALSE. I have changed it to Assert statement. > + buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE); > + > + /* It might not exist anymore; in which case we can't hint it. 
*/ > + if (!BufferIsValid(buf)) > + return; > > This is dead code, because _hash_getbuf always returns a valid buffer. > If there's actually a risk of the buffer disappearing, then some other > handling is needed for this case. But I suspect that, since a scan > always holds a pin on the primary bucket, there's actually no way for > this to happen and this is just dead code. > Removed the redundant code. > The comment in hashgetbitmap claims that _hash_first() or _hash_next() > never returns dead tuples. If that were true, it would be a bug, > because then scans started during recovery would return wrong answers. > A more accurate statement would be something like: _hash_first and > _hash_next handle eliminate dead index entries whenever > scan->ignored_killed_tuples is true. Therefore, there's nothing to do > here except add the results to the TIDBitmap. > Corrected. > _hash_readpage contains unnecessary "continue" statements inside the > loops. The reason that they're unnecessary is that there's no code > below that in the loop anyway, so the loop is already going to go back > around to the top. Whether to change this is a matter of style, so I > won't complain too much if you want to leave it the way it is, but > personally I'd favor removing them. > Ohh no, that's a silly mistake. I have corrected it. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Tue, Sep 19, 2017 at 11:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > This point has been discussed above [1] and to avoid this problem we > are keeping the scan always behind vacuum for unlogged and temporary > tables as we are doing without this patch. That will ensure vacuum > won't be able to remove the TIDs which we are going to mark as dead. > This has been taken care in patch 0003. I understand that this is > slightly ugly, but the other alternative (as mentioned in the email > [1]) is much worse. Hmm. So if I understand correctly, you're saying that the LSN check in patch 0001 is actually completely unnecessary if we only apply 0001, but is needed in preparation for 0003, after which it will really be doing something? In more detail, I suppose the idea is: a TID cannot be reused until a VACUUM has intervened; VACUUM always visits every data page in the index; we won't allow a scan to pass VACUUM (and thus possibly have one of its TIDs get reused) except when the LSN check is actually sufficient to guarantee no TID reuse (i.e. table is not unlogged). Page-at-a-time index vacuum as in _hash_vacuum_one_page doesn't matter because such an operation doesn't allow TIDs to be reused. Did I get it right? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Sep 20, 2017 at 4:04 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Sep 19, 2017 at 11:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> This point has been discussed above [1] and to avoid this problem we >> are keeping the scan always behind vacuum for unlogged and temporary >> tables as we are doing without this patch. That will ensure vacuum >> won't be able to remove the TIDs which we are going to mark as dead. >> This has been taken care in patch 0003. I understand that this is >> slightly ugly, but the other alternative (as mentioned in the email >> [1]) is much worse. > > Hmm. So if I understand correctly, you're saying that the LSN check > in patch 0001 is actually completely unnecessary if we only apply > 0001, but is needed in preparation for 0003, after which it will > really be doing something? > Right. > In more detail, I suppose the idea is: a TID cannot be reused until a > VACUUM has intervened; VACUUM always visits every data page in the > index; we won't allow a scan to pass VACUUM (and thus possibly have > one of its TIDs get reused) except when the LSN check is actually > sufficient to guarantee no TID reuse (i.e. table is not unlogged). Right. > Page-at-a-time index vacuum as in _hash_vacuum_one_page doesn't matter > because such an operation doesn't allow TIDs to be reused. > Page-at-a-time index vacuum also allows TIDs to be reused but this is done only for already marked dead items whereas vacuum can make the non-dead entries to be removed. We don't have a problem with page-at-a-time vacuum as it won't remove any items which the scan is going to mark as dead. > Did I get it right? > I think so. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Sep 20, 2017 at 7:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Page-at-a-time index vacuum as in _hash_vacuum_one_page doesn't matter >> because such an operation doesn't allow TIDs to be reused. > > Page-at-a-time index vacuum also allows TIDs to be reused but this is > done only for already marked dead items whereas vacuum can make the > non-dead entries to be removed. We don't have a problem with > page-at-a-time vacuum as it won't remove any items which the scan is > going to mark as dead. I don't think page-at-a-time index vacuum allows heap TIDs to be reused. To reuse a heap TID, we have to know that there are no index entries pointing to it. There's no way for the heap to know that a page-at-a-time index vacuum has even happened, let alone which TIDs were affected and that all other indexes have also removed all index entries for those TIDs. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Sep 20, 2017 at 4:56 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Sep 20, 2017 at 7:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> Page-at-a-time index vacuum as in _hash_vacuum_one_page doesn't matter >>> because such an operation doesn't allow TIDs to be reused. >> >> Page-at-a-time index vacuum also allows TIDs to be reused, but only for >> items already marked dead, whereas vacuum can remove non-dead entries as >> well. We don't have a problem with >> page-at-a-time vacuum as it won't remove any items which the scan is >> going to mark as dead. > > I don't think page-at-a-time index vacuum allows heap TIDs to be > reused. > Right, I was thinking from the perspective of the index entry. Before marking an index entry as dead, we do check its heaptid. So, as a heaptid can't be reused via page-at-a-time index vacuum, the scan won't mark the index entry as dead. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
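The check being referred to here is the usual kill-items pattern of comparing an index tuple's heap TID against the TID the scan remembered before setting LP_DEAD. A small illustrative fragment follows; it is not the exact hash-index code, and page, offnum and currItem are assumed local variables (currItem using the HashScanPosItem naming from the page-at-a-time patch).

    /*
     * Only mark this index entry dead if it still points at the same heap
     * TID the scan remembered for it; if the entry now points elsewhere,
     * leave it alone.
     */
    ItemId      iid = PageGetItemId(page, offnum);
    IndexTuple  ituple = (IndexTuple) PageGetItem(page, iid);

    if (ItemPointerEquals(&ituple->t_tid, &currItem->heapTid))
        ItemIdMarkDead(iid);    /* caller would also set the dirty hint on the buffer */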
On Wed, Sep 20, 2017 at 7:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Right, I was thinking from the perspective of the index entry. Before > marking an index entry as dead, we do check its heaptid. So, as a heaptid > can't be reused via page-at-a-time index vacuum, the scan won't mark the > index entry as dead. It can mark index entries dead, but if it does, they correspond to heap TIDs that are still dead, as opposed to heap TIDs that have been resurrected by being reused for an unrelated tuple. In other words, the danger scenario is this: 1. A page-at-a-time scan records all the TIDs on a page. 2. VACUUM processes the page, removing some of those TIDs. 3. VACUUM finishes, changing the heap TIDs from dead to unused. 4. Somebody inserts a new tuple at one of the existing TIDs, and the index tuple gets put on the page scanned in step 1. 5. The page-at-a-time scan resumes and kills the tuple added in step 4 by mistake, when it really only intended to kill a tuple removed in step 2. What prevents this is: A. To begin scanning a bucket, VACUUM needs a cleanup lock on the primary bucket page. Therefore, there are no scans in progress at the time that VACUUM begins scanning the bucket. B. If a scan begins scanning the bucket, it can't pass VACUUM, because VACUUM doesn't release the page lock on one page before taking the one for the next page. C. After 0003, it becomes possible for a scan to pass VACUUM if the table is permanent, but it won't be a problem because of the LSN check. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
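Concretely, the LSN check relied on in (C) has roughly the shape sketched below. This is only an illustration using the HashScanPosData fields discussed in this thread (currPos.buf and currPos.lsn), not the exact committed code, and it assumes the scan still holds a pin on the page it remembered.

    /*
     * If the page has been modified since the scan copied its tuples (for
     * example by a vacuum that ran in between), the remembered offsets may
     * now point at recycled entries, so don't mark anything dead.
     */
    Buffer      buf = so->currPos.buf;

    LockBuffer(buf, BUFFER_LOCK_SHARE);
    if (BufferGetLSNAtomic(buf) != so->currPos.lsn)
    {
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        return;                 /* page changed under us; skip killing items */
    }

    /* ... otherwise walk the remembered items and set LP_DEAD as usual ... */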
On Wed, Sep 20, 2017 at 6:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Sep 20, 2017 at 7:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Right, I was thinking from the perspective of the index entry. Before >> marking an index entry as dead, we do check its heaptid. So, as a heaptid >> can't be reused via page-at-a-time index vacuum, the scan won't mark the >> index entry as dead. > > It can mark index entries dead, but if it does, they correspond to > heap TIDs that are still dead, as opposed to heap TIDs that have been > resurrected by being reused for an unrelated tuple. > > In other words, the danger scenario is this: > > 1. A page-at-a-time scan records all the TIDs on a page. > 2. VACUUM processes the page, removing some of those TIDs. > 3. VACUUM finishes, changing the heap TIDs from dead to unused. > 4. Somebody inserts a new tuple at one of the existing TIDs, and the > index tuple gets put on the page scanned in step 1. > 5. The page-at-a-time scan resumes and kills the tuple added in step 4 > by mistake, when it really only intended to kill a tuple removed in > step 2. > > What prevents this is: > > A. To begin scanning a bucket, VACUUM needs a cleanup lock on the > primary bucket page. Therefore, there are no scans in progress at the > time that VACUUM begins scanning the bucket. > > B. If a scan begins scanning the bucket, it can't pass VACUUM, because > VACUUM doesn't release the page lock on one page before taking the one > for the next page. > > C. After 0003, it becomes possible for a scan to pass VACUUM if the > table is permanent, but it won't be a problem because of the LSN > check. > That's right. So, in short, this patch handles the problematic scenario. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Sep 20, 2017 at 5:37 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > Thanks for all your review comments. Please find my comments in-line. + if (!BlockNumberIsValid(opaque->hasho_nextblkno)) + { + if (so->currPos.buf == so->hashso_bucket_buf || + so->currPos.buf == so->hashso_split_bucket_buf) + prev_blkno = InvalidBlockNumber; + else + prev_blkno = opaque->hasho_prevblkno; + } 1. Why not remove the outer "if" statement? 2. How about adding a comment, like /* If this is a primary bucket page, hasho_prevblkno is not a real block number. */ > When _hash_readpage() doesn't find any qualifying tuples i.e. when > _hash_readnext() returns Invalid buffer, we just update prevPage, > nextPage and buf in > currPos (not currPage or lsn) as currPage and lsn should point to last > page in the hash bucket so that we can mark the killed items as dead > at the end of scan (with the help of _hash_kill_items). Hence, we keep > the currpage and lsn as it is if no more valid hash pages are found. How about adding a comment about this, by extending this comment: + * Remember next and previous block numbers for scrollable + * cursors to know the start position and return FALSE + * indicating that no more matching tuples were found. e.g. (Don't reset currPage or lsn, because we expect _hash_kill_items to be called for the old page after this function returns.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Sep 20, 2017 at 8:05 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Sep 20, 2017 at 5:37 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: >> Thanks for all your review comments. Please find my comments in-line. > > + if (!BlockNumberIsValid(opaque->hasho_nextblkno)) > + { > + if (so->currPos.buf == so->hashso_bucket_buf || > + so->currPos.buf == so->hashso_split_bucket_buf) > + prev_blkno = InvalidBlockNumber; > + else > + prev_blkno = opaque->hasho_prevblkno; > + } > > 1. Why not remove the outer "if" statement? > Yes, the outer if statement is not required. I just missed removing it in my earlier patch. > 2. How about adding a comment, like /* If this is a primary bucket > page, hasho_prevblkno is not a real block number. */ > Added. >> When _hash_readpage() doesn't find any qualifying tuples i.e. when >> _hash_readnext() returns Invalid buffer, we just update prevPage, >> nextPage and buf in >> currPos (not currPage or lsn) as currPage and lsn should point to last >> page in the hash bucket so that we can mark the killed items as dead >> at the end of scan (with the help of _hash_kill_items). Hence, we keep >> the currpage and lsn as it is if no more valid hash pages are found. > > How about adding a comment about this, by extending this comment: > > + * Remember next and previous block numbers for scrollable > + * cursors to know the start position and return FALSE > + * indicating that no more matching tuples were found. > > e.g. (Don't reset currPage or lsn, because we expect _hash_kill_items > to be called for the old page after this function returns.) > > Added. Attached are the patches with above changes. Thanks. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
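For reference, with both of Robert's suggestions applied, the hunk presumably ends up looking something like the fragment below. This is a reconstruction from the review exchange above, not a copy of the attached patch; so, opaque and prev_blkno are the same locals as in the quoted hunk.

    /*
     * If this is a primary bucket page, hasho_prevblkno is not a real
     * block number.
     */
    if (so->currPos.buf == so->hashso_bucket_buf ||
        so->currPos.buf == so->hashso_split_bucket_buf)
        prev_blkno = InvalidBlockNumber;
    else
        prev_blkno = opaque->hasho_prevblkno;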
On Wed, Sep 20, 2017 at 11:43 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > Attached are the patches with above changes. Thanks. Thanks. I think that the comments and README changes in 0003 need significantly more work. In several places, they fail to note the unlogged vs. logged differences, and the header comment for hashbucketcleanup still says that scans depend on increasing-TID order (really, 0001 should change that text somehow). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Sep 21, 2017 at 9:30 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Sep 20, 2017 at 11:43 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: >> Attached are the patches with above changes. Thanks. > > Thanks. I think that the comments and README changes in 0003 need > significantly more work. In several places, they fail to note the > unlogged vs. logged differences, and the header comment for > hashbucketcleanup still says that scans depend on increasing-TID order > (really, 0001 should change that text somehow). > Thanks for pointing that out. I will correct the comments in hashbucketcleanup(), mention the handling of logged and unlogged tables in the README, and submit the updated patch asap. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Sep 21, 2017 at 9:30 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Sep 20, 2017 at 11:43 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: >> Attached are the patches with above changes. Thanks. > > Thanks. I think that the comments and README changes in 0003 need > significantly more work. In several places, they fail to note the > unlogged vs. logged differences, and the header comment for > hashbucketcleanup still says that scans depend on increasing-TID order > (really, 0001 should change that text somehow). > I have added a note for handling of logged and unlogged tables in README file and also corrected the header comment for hashbucketcleanup(). Please find the attached 0003*.patch having these changes. Thanks. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Thu, Sep 21, 2017 at 3:08 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > I have added a note for handling of logged and unlogged tables in > README file and also corrected the header comment for > hashbucketcleanup(). Please find the attached 0003*.patch having these > changes. Thanks. I committed 0001 and 0002 with some additional edits as 7c75ef571579a3ad7a1d3ee909f11dba5e0b9440. I also rebased 0003 and edited it a bit; see attached hash-cleanup-changes.patch. I'm not entirely sold on 0003. An alternative would be to rip the lsn stuff back out of HashScanPosData, and I think we ought to consider that. Basically, 0003 is betting that getting rid of the lock-chaining in hash index vacuum is more valuable than being able to kill dead items more aggressively. I bet that's a bad bet. In the case of btree indexes, since 2ed5b87f96d473962ec5230fd820abfeaccb2069, page-at-a-time scanning allows most btree index scans to avoid holding buffer pins when the scan is suspended, but we gain no such advantage here. We always have to hold a pin on the primary bucket page anyway, so even with this patch cleanup is going to block when it hits a bucket containing a suspended scan. 0003 helps if (a) the relation is permanent, (b) the bucket has overflow pages, and (c) the scan is moving faster than vacuum and can overtake it instead of waiting. But that doesn't seem like it will happen very often at all, whereas the LSN check will probably fail frequently and cause us to skip cleanup that we could usefully have done. So I propose the attached hashscan-no-lsn.patch as an alternative. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Fri, Sep 22, 2017 at 11:56 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Sep 21, 2017 at 3:08 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: >> I have added a note for handling of logged and unlogged tables in >> README file and also corrected the header comment for >> hashbucketcleanup(). Please find the attached 0003*.patch having these >> changes. Thanks. > > I committed 0001 and 0002 with some additional edits as > 7c75ef571579a3ad7a1d3ee909f11dba5e0b9440. I also rebased 0003 and > edited it a bit; see attached hash-cleanup-changes.patch. > Thanks for the commit. I put a lot of effort into this and am very happy that it got committed. Thanks to Amit too for the detailed review. > I'm not entirely sold on 0003. An alternative would be to rip the lsn > stuff back out of HashScanPosData, and I think we ought to consider > that. Basically, 0003 is betting that getting rid of the > lock-chaining in hash index vacuum is more valuable than being able to > kill dead items more aggressively. I bet that's a bad bet. > > In the case of btree indexes, since > 2ed5b87f96d473962ec5230fd820abfeaccb2069, page-at-a-time scanning > allows most btree index scans to avoid holding buffer pins when the > scan is suspended, but we gain no such advantage here. We always have > to hold a pin on the primary bucket page anyway, so even with this > patch cleanup is going to block when it hits a bucket containing a > suspended scan. 0003 helps if (a) the relation is permanent, (b) the > bucket has overflow pages, and (c) the scan is moving faster than > vacuum and can overtake it instead of waiting. But that doesn't seem > like it will happen very often at all, whereas the LSN check will > probably fail frequently and cause us to skip cleanup that we could > usefully have done. So I propose the attached hashscan-no-lsn.patch > as an alternative. > > Thoughts? > > -- Yes, I too feel that the 0003 patch won't help much. The reason being, the chances of a scan overtaking vacuum are very low, and a hash index is normally meant for unique values (that is when a hash index has a clear edge over other index types), which means the chances of overflow pages in a hash index won't be high. Therefore, I feel the 0003 patch won't be very beneficial. Honestly speaking, the code changes in 0003 look a bit ugly as well. So, yes, hashscan-no-lsn.patch looks like a better option to me. Thanks. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Sep 22, 2017 at 11:56 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Sep 21, 2017 at 3:08 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: >> I have added a note for handling of logged and unlogged tables in >> README file and also corrected the header comment for >> hashbucketcleanup(). Please find the attached 0003*.patch having these >> changes. Thanks. > > I committed 0001 and 0002 with some additional edits as > 7c75ef571579a3ad7a1d3ee909f11dba5e0b9440. > I have noticed a typo in that commit (in the README) and a patch for the same is attached. > I also rebased 0003 and > edited it a bit; see attached hash-cleanup-changes.patch. > > I'm not entirely sold on 0003. An alternative would be to rip the lsn > stuff back out of HashScanPosData, and I think we ought to consider > that. Basically, 0003 is betting that getting rid of the > lock-chaining in hash index vacuum is more valuable than being able to > kill dead items more aggressively. I bet that's a bad bet. > > In the case of btree indexes, since > 2ed5b87f96d473962ec5230fd820abfeaccb2069, page-at-a-time scanning > allows most btree index scans to avoid holding buffer pins when the > scan is suspended, but we gain no such advantage here. We always have > to hold a pin on the primary bucket page anyway, so even with this > patch cleanup is going to block when it hits a bucket containing a > suspended scan. 0003 helps if (a) the relation is permanent, (b) the > bucket has overflow pages, and (c) the scan is moving faster than > vacuum and can overtake it instead of waiting. But that doesn't seem > like it will happen very often at all, whereas the LSN check will > probably fail frequently and cause us to skip cleanup that we could > usefully have done. So I propose the attached hashscan-no-lsn.patch > as an alternative. > I think your proposal makes sense. Your patch looks good but you might want to tweak the comments atop _hash_kill_items ("However, having pin on the overflow page doesn't guarantee that vacuum won't delete any items."). That part of the comment has been written to indicate that we have to check LSN in this function unconditionally. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Mon, Sep 25, 2017 at 12:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I think your proposal makes sense. Your patch looks good but you > might want to tweak the comments atop _hash_kill_items ("However, > having pin on the overflow page doesn't guarantee that vacuum won't > delete any items."). That part of the comment has been written to > indicate that we have to check LSN in this function unconditionally. OK, committed. And I think that's the last of the hash index work. Thanks to Amit Kapila, Ashutosh Sharma, Mithun Cy, Kuntal Ghosh, Jesper Pedersen, and Dilip Kumar for all the patches and reviews and to Jeff Janes and others for additional code review and testing! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers