Thread: [HACKERS] Page Scan Mode in Hash Index

[HACKERS] Page Scan Mode in Hash Index

From

Ashutosh Sharma

Date:

14 February 2017, 08:27:50

Hi All,

Currently, Hash Index scan works tuple-at-a-time, i.e. for every
qualifying tuple in a page, it acquires and releases the lock which
eventually increases the lock/unlock traffic. For example, if an index
page contains 100 qualified tuples, the current hash index scan has to
acquire and release the lock 100 times to read those qualified tuples
which is not good from performance perspective and it also impacts
concurency with VACUUM.

Considering above points, I would like to propose a patch that allows
hash index scan to work in page-at-a-time mode. In page-at-a-time
mode, once lock is acquired on a target bucket page, the entire page
is scanned and all the qualified tuples are saved into backend's local
memory. This reduces the lock/unlock calls for retrieving tuples from
a page. Moreover, it also eliminates the problem of re-finding the
position of the last returned index tuple and more importanly it
allows VACUUM to release lock on current page before moving to the
next page which eventually improves it's concurrency with scan.

Attached patch modifies hash index scan code for page-at-a-time mode.
For better readability, I have splitted it into 3 parts,

1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
patch rewrites the hash index scan module to work in page-at-a-time
mode. It basically introduces two new functions-- _hash_readpage() and
_hash_saveitem(). The former is used to load all the qualifying tuples
from a target bucket or overflow page into an items array. The latter
one is used by _hash_readpage to save all the qualifying tuples found
in a page into an items array. Apart from that, this patch bascially
cleans _hash_first(), _hash_next and hashgettuple().

2) 0002-Remove-redundant-function-_hash_step-and-some-of-the.patch:
this patch basically removes the redundant function _hash_step() and
some of the unused members of HashScanOpaqueData structure.

3) 0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patch:
this patch basically improves the locking strategy for VACUUM in hash
index. As the new hash index scan works in page-at-a-time, vacuum can
release the lock on previous page before acquiring a lock on the next
page, hence, improving hash index concurrency.

Please note that, above patches has to be applied on top of following
patches-- 'WAL in hash index - [1]' and 'Microvacuum support for hash
index [2]'. Note that in current head, marking of dead tuples requires
lock on the page. Now, even if hash index scan is done page-at-a-time,
it would still require a a lock on the page just to mark dead tuples.
Hence, loosing the advantage of page-at-a-time mode. Therefore, I
developed this patch over Microvacuum support for hash index [2].

I have also done the benchmarking of this patch and would like to
share the results for the same,

Firstly, I have done the benchmarking with non-unique values and i
could see a performance improvement of 4-7%. For the detailed results
please find the attached file 'results-non-unique values-70ff', and
ddl.sql, test.sql are test scripts used in this experimentation. The
detail of non-default GUC params and pgbench command are mentioned in
the result sheet. I also did the benchmarking with unique values at
300 and 1000 scale factor and its results are provided in
'results-unique-values-default-ff'.

[1]- https://www.postgresql.org/message-id/CAA4eK1LTyDHyCmj3pf5KxWgPb1DgNae9ivsB5jX0X_Kt7iLTUA%40mail.gmail.com
[2]- https://www.postgresql.org/message-id/CAA4eK1JfrJoa15XStmRKy6mGsVjKh_aa-EXZY%2BUZQOV6mGM0QQ%40mail.gmail.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Page Scan Mode in Hash Index

From

Alexander Korotkov

Date:

10 March 2017, 13:25:21

Hi, Ashutosh!

I've assigned to review this patch.

At first, I'd like to notice that I like idea and general design.

Secondly, patch set don't apply cleanly to master. Please, rebase it.

On Tue, Feb 14, 2017 at 8:27 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
patch rewrites the hash index scan module to work in page-at-a-time
mode. It basically introduces two new functions-- _hash_readpage() and
_hash_saveitem(). The former is used to load all the qualifying tuples
from a target bucket or overflow page into an items array. The latter
one is used by _hash_readpage to save all the qualifying tuples found
in a page into an items array. Apart from that, this patch bascially
cleans _hash_first(), _hash_next and hashgettuple().

I see that forward and backward scan cases of function _hash_readpage contain a lot of code duplication

Could you please refactor this function to have less code duplication?

Also, I wonder if you have a special idea behind inserting data in test.sql by 1002 separate SQL statements.

INSERT INTO con_hash_index_table (keycol) SELECT a FROM GENERATE_SERIES(1, 1000) a;

You can achieve the same result by execution of single SQL statement.

INSERT INTO con_hash_index_table (keycol) SELECT (a - 1) % 1000 + 1 FROM GENERATE_SERIES(1, 1002000) a;

Unless you have some special idea of doing this in 1002 separate transactions.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] Page Scan Mode in Hash Index

From

Ashutosh Sharma

Date:

13 March 2017, 08:46:03

Hi,

>
> I've assigned to review this patch.
> At first, I'd like to notice that I like idea and general design.
> Secondly, patch set don't apply cleanly to master.  Please, rebase it.


Thanks for showing your interest towards this patch. I would like to
inform that this patch has got dependency on patch for  'Write Ahead
Logging in hash index - [1]' and 'Microvacuum support in hash index
[1]'. Hence, until above two patches becomes stable I may have to keep
on rebasing this patch. However, I will try to share you the updated
patch asap.

>
>
> On Tue, Feb 14, 2017 at 8:27 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>>
>> 1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
>> patch rewrites the hash index scan module to work in page-at-a-time
>> mode. It basically introduces two new functions-- _hash_readpage() and
>> _hash_saveitem(). The former is used to load all the qualifying tuples
>> from a target bucket or overflow page into an items array. The latter
>> one is used by _hash_readpage to save all the qualifying tuples found
>> in a page into an items array. Apart from that, this patch bascially
>> cleans _hash_first(), _hash_next and hashgettuple().
>
>
> I see that forward and backward scan cases of function _hash_readpage contain a lot of code duplication
>  Could you please refactor this function to have less code duplication?

Sure, I will try to avoid the code duplication as much as possible.

>
> Also, I wonder if you have a special idea behind inserting data in test.sql by 1002 separate SQL statements.
> INSERT INTO con_hash_index_table (keycol) SELECT a FROM GENERATE_SERIES(1, 1000) a;
>
> You can achieve the same result by execution of single SQL statement.
> INSERT INTO con_hash_index_table (keycol) SELECT (a - 1) % 1000 + 1 FROM GENERATE_SERIES(1, 1002000) a;
> Unless you have some special idea of doing this in 1002 separate transactions.


There is no reason for having so many INSERT statements in test.sql
file. I think it would be better to replace it with single SQL
statement. Thanks.

[1]- https://www.postgresql.org/message-id/CAA4eK1KibVzgVETVay0%2BsiVEgzaXnP5R21BdWiK9kg9wx2E40Q%40mail.gmail.com
[2]- https://www.postgresql.org/message-id/CAE9k0PkRSyzx8dOnokEpUi2A-RFZK72WN0h9DEMv_ut9q6bPRw%40mail.gmail.com


--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Re: [HACKERS] Page Scan Mode in Hash Index

From

Ashutosh Sharma

Date:

14 March 2017, 15:16:12

Hi,

>> I've assigned to review this patch.
>> At first, I'd like to notice that I like idea and general design.
>> Secondly, patch set don't apply cleanly to master.  Please, rebase it.
>
>
> Thanks for showing your interest towards this patch. I would like to
> inform that this patch has got dependency on patch for  'Write Ahead
> Logging in hash index - [1]' and 'Microvacuum support in hash index
> [1]'. Hence, until above two patches becomes stable I may have to keep
> on rebasing this patch. However, I will try to share you the updated
> patch asap.
>
>>
>>
>> On Tue, Feb 14, 2017 at 8:27 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>>>
>>> 1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
>>> patch rewrites the hash index scan module to work in page-at-a-time
>>> mode. It basically introduces two new functions-- _hash_readpage() and
>>> _hash_saveitem(). The former is used to load all the qualifying tuples
>>> from a target bucket or overflow page into an items array. The latter
>>> one is used by _hash_readpage to save all the qualifying tuples found
>>> in a page into an items array. Apart from that, this patch bascially
>>> cleans _hash_first(), _hash_next and hashgettuple().
>>
>>
>> I see that forward and backward scan cases of function _hash_readpage contain a lot of code duplication
>>  Could you please refactor this function to have less code duplication?
>
> Sure, I will try to avoid the code duplication as much as possible.

I had close look into hash_readpage() function and could see that
there are only few if-else conditions which are similar for both
forward and backward scan cases and can't be optimised further.
However, If you have a cursory look into this function definition it
looks like the code for forward and backward scan are exactly the same
but that's not the case. Attached is the diff report
(hash_readpage.html) for forward and backward scan code used in
hash_readpage(). This shows what all lines in the hash_readpage() are
same or different.

Please note that before applying the patches for page scan mode in
hash index you may need to first apply the following patches on HEAD,

1) v10 patch for WAL in hash index - [1]
2) v1 patch for wal consistency check for hash index - [2]
3) v6 patch for microvacuum support in hash index - [3]

[1] - https://www.postgresql.org/message-id/CAA4eK1%2Bk5wR4-kAjPqLoKemuHayQd6RkQQT9gheTfpn%2B72o1UA%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAGz5QCKPU2qX75B1bB_LuEC88xWZa5L55J0TLvYMVD8noSH3pA%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAE9k0PkYpAPDJBfgia08o7XhO8nypH9WoO9M8%3DdqLrwwObXKcw%40mail.gmail.com

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

hash_readpage.html

Re: [HACKERS] Page Scan Mode in Hash Index

From

Jesper Pedersen

Date:

21 March 2017, 21:18:10

Hi,

On 02/14/2017 12:27 AM, Ashutosh Sharma wrote:
> Currently, Hash Index scan works tuple-at-a-time, i.e. for every
> qualifying tuple in a page, it acquires and releases the lock which
> eventually increases the lock/unlock traffic. For example, if an index
> page contains 100 qualified tuples, the current hash index scan has to
> acquire and release the lock 100 times to read those qualified tuples
> which is not good from performance perspective and it also impacts
> concurency with VACUUM.
>
> Considering above points, I would like to propose a patch that allows
> hash index scan to work in page-at-a-time mode. In page-at-a-time
> mode, once lock is acquired on a target bucket page, the entire page
> is scanned and all the qualified tuples are saved into backend's local
> memory. This reduces the lock/unlock calls for retrieving tuples from
> a page. Moreover, it also eliminates the problem of re-finding the
> position of the last returned index tuple and more importanly it
> allows VACUUM to release lock on current page before moving to the
> next page which eventually improves it's concurrency with scan.
>
> Attached patch modifies hash index scan code for page-at-a-time mode.
> For better readability, I have splitted it into 3 parts,
>

Due to the commits on master these patches applies with hunks.

The README should be updated to mention the use of page scan.

hash.h needs pg_indent.

> 1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
> patch rewrites the hash index scan module to work in page-at-a-time
> mode. It basically introduces two new functions-- _hash_readpage() and
> _hash_saveitem(). The former is used to load all the qualifying tuples
> from a target bucket or overflow page into an items array. The latter
> one is used by _hash_readpage to save all the qualifying tuples found
> in a page into an items array. Apart from that, this patch bascially
> cleans _hash_first(), _hash_next and hashgettuple().
>

For _hash_next I don't see this - can you explain ?

+ *
+ *             On failure exit (no more tuples), we release pin and set
+ *             so->currPos.buf to InvalidBuffer.


+ *     Returns true if any matching items are found else returns false.

s/Returns/Return/g

> 2) 0002-Remove-redundant-function-_hash_step-and-some-of-the.patch:
> this patch basically removes the redundant function _hash_step() and
> some of the unused members of HashScanOpaqueData structure.
>

Looks good.

> 3) 0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patch:
> this patch basically improves the locking strategy for VACUUM in hash
> index. As the new hash index scan works in page-at-a-time, vacuum can
> release the lock on previous page before acquiring a lock on the next
> page, hence, improving hash index concurrency.
>

+                * As the new hash index scan work in page at a time mode,

Remove 'new'.

> I have also done the benchmarking of this patch and would like to
> share the results for the same,
>
> Firstly, I have done the benchmarking with non-unique values and i
> could see a performance improvement of 4-7%. For the detailed results
> please find the attached file 'results-non-unique values-70ff', and
> ddl.sql, test.sql are test scripts used in this experimentation. The
> detail of non-default GUC params and pgbench command are mentioned in
> the result sheet. I also did the benchmarking with unique values at
> 300 and 1000 scale factor and its results are provided in
> 'results-unique-values-default-ff'.
>

I'm seeing similar results, and especially with write heavy scenarios.

Best regards, Jesper

Re: [HACKERS] Page Scan Mode in Hash Index

From

Ashutosh Sharma

Date:

22 March 2017, 16:32:11

Hi,

>> Attached patch modifies hash index scan code for page-at-a-time mode.
>> For better readability, I have splitted it into 3 parts,
>>
>
> Due to the commits on master these patches applies with hunks.
>
> The README should be updated to mention the use of page scan.

Done. Please refer to the attached v2 version of patch.

>
> hash.h needs pg_indent.

Fixed.

>
>> 1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
>> patch rewrites the hash index scan module to work in page-at-a-time
>> mode. It basically introduces two new functions-- _hash_readpage() and
>> _hash_saveitem(). The former is used to load all the qualifying tuples
>> from a target bucket or overflow page into an items array. The latter
>> one is used by _hash_readpage to save all the qualifying tuples found
>> in a page into an items array. Apart from that, this patch bascially
>> cleans _hash_first(), _hash_next and hashgettuple().
>>
>
> For _hash_next I don't see this - can you explain ?

Sorry, It was wrongly copied from btree code. I have corrected it now. Please
check the attached v2 verison of patch.

>
> + *
> + *             On failure exit (no more tuples), we release pin and set
> + *             so->currPos.buf to InvalidBuffer.
>
>
> + *     Returns true if any matching items are found else returns false.
>
> s/Returns/Return/g

Done.

>
>> 2) 0002-Remove-redundant-function-_hash_step-and-some-of-the.patch:
>> this patch basically removes the redundant function _hash_step() and
>> some of the unused members of HashScanOpaqueData structure.
>>
>
> Looks good.
>
>> 3) 0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patch:
>> this patch basically improves the locking strategy for VACUUM in hash
>> index. As the new hash index scan works in page-at-a-time, vacuum can
>> release the lock on previous page before acquiring a lock on the next
>> page, hence, improving hash index concurrency.
>>
>
> +                * As the new hash index scan work in page at a time mode,
>
> Remove 'new'.

Done.

>
>> I have also done the benchmarking of this patch and would like to
>> share the results for the same,
>>
>> Firstly, I have done the benchmarking with non-unique values and i
>> could see a performance improvement of 4-7%. For the detailed results
>> please find the attached file 'results-non-unique values-70ff', and
>> ddl.sql, test.sql are test scripts used in this experimentation. The
>> detail of non-default GUC params and pgbench command are mentioned in
>> the result sheet. I also did the benchmarking with unique values at
>> 300 and 1000 scale factor and its results are provided in
>> 'results-unique-values-default-ff'.
>>
>
> I'm seeing similar results, and especially with write heavy scenarios.

Great..!!

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Page Scan Mode in Hash Index

From

Jesper Pedersen

Date:

23 March 2017, 17:59:59

Hi,

On 03/22/2017 09:32 AM, Ashutosh Sharma wrote:
> Done. Please refer to the attached v2 version of patch.
>

Thanks.

>>> 1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
>>> patch rewrites the hash index scan module to work in page-at-a-time
>>> mode. It basically introduces two new functions-- _hash_readpage() and
>>> _hash_saveitem(). The former is used to load all the qualifying tuples
>>> from a target bucket or overflow page into an items array. The latter
>>> one is used by _hash_readpage to save all the qualifying tuples found
>>> in a page into an items array. Apart from that, this patch bascially
>>> cleans _hash_first(), _hash_next and hashgettuple().
>>>

0001v2:

In hashgettuple() you can remove the 'currItem' and 'offnum' from the 
'else' part, and do the assignment inside
  if (so->numKilled < MaxIndexTuplesPerPage)

instead.


No new comments for 0002 and 0003.

Best regards, Jesper

Re: [HACKERS] Page Scan Mode in Hash Index

From

Ashutosh Sharma

Date:

23 March 2017, 21:11:29

On Thu, Mar 23, 2017 at 8:29 PM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> Hi,
>
> On 03/22/2017 09:32 AM, Ashutosh Sharma wrote:
>>
>> Done. Please refer to the attached v2 version of patch.
>>
>
> Thanks.
>
>>>> 1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
>>>> patch rewrites the hash index scan module to work in page-at-a-time
>>>> mode. It basically introduces two new functions-- _hash_readpage() and
>>>> _hash_saveitem(). The former is used to load all the qualifying tuples
>>>> from a target bucket or overflow page into an items array. The latter
>>>> one is used by _hash_readpage to save all the qualifying tuples found
>>>> in a page into an items array. Apart from that, this patch bascially
>>>> cleans _hash_first(), _hash_next and hashgettuple().
>>>>
>
> 0001v2:
>
> In hashgettuple() you can remove the 'currItem' and 'offnum' from the 'else'
> part, and do the assignment inside
>
>   if (so->numKilled < MaxIndexTuplesPerPage)
>
> instead.
>

Done. Please have a look into the attached v3 patch.

>
> No new comments for 0002 and 0003.

okay. Thanks.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

0001-Rewrite-hash-index-scans-to-work-a-page-at-a-timev3.patch

Re: [HACKERS] Page Scan Mode in Hash Index

From

Jesper Pedersen

Date:

23 March 2017, 21:19:12

Hi,

On 03/23/2017 02:11 PM, Ashutosh Sharma wrote:
> On Thu, Mar 23, 2017 at 8:29 PM, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> 0001v2:
>>
>> In hashgettuple() you can remove the 'currItem' and 'offnum' from the 'else'
>> part, and do the assignment inside
>>
>>   if (so->numKilled < MaxIndexTuplesPerPage)
>>
>> instead.
>>
>
> Done. Please have a look into the attached v3 patch.
>
>>
>> No new comments for 0002 and 0003.
>
> okay. Thanks.
>

I'll keep the entry in 'Needs Review' if Alexander, or others, want to 
add their feedback.

(Best to post the entire patch series each time)

Best regards, Jesper

Re: [HACKERS] Page Scan Mode in Hash Index

From

Ashutosh Sharma

Date:

23 March 2017, 21:35:49

> Hi,
>
> On 03/23/2017 02:11 PM, Ashutosh Sharma wrote:
>>
>> On Thu, Mar 23, 2017 at 8:29 PM, Jesper Pedersen
>> <jesper.pedersen@redhat.com> wrote:
>>>
>>> 0001v2:
>>>
>>> In hashgettuple() you can remove the 'currItem' and 'offnum' from the
>>> 'else'
>>> part, and do the assignment inside
>>>
>>>   if (so->numKilled < MaxIndexTuplesPerPage)
>>>
>>> instead.
>>>
>>
>> Done. Please have a look into the attached v3 patch.
>>
>>>
>>> No new comments for 0002 and 0003.
>>
>>
>> okay. Thanks.
>>
>
> I'll keep the entry in 'Needs Review' if Alexander, or others, want to add
> their feedback.

okay. Thanks.

>
> (Best to post the entire patch series each time)

I take your suggestion. Please find the attachments.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: Page Scan Mode in Hash Index

From

Robert Haas

Date:

24 March 2017, 23:21:49

On Thu, Mar 23, 2017 at 2:35 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> I take your suggestion. Please find the attachments.

I think you should consider refactoring this so that it doesn't need
to use goto.  Maybe move the while (offnum <= maxoff) logic into a
helper function and have it return itemIndex.  If itemIndex == 0, you
can call it again.  Notice that the code in the "else" branch of the
if (itemIndex == 0) stuff could actually be removed from the else
block without changing anything, because the "if" block always either
returns or does a goto.  That's worthy of a little more work to try to
make things simple and clear.

+ *      We find the first item(or, if backward scan, the last item) in

Missing space.

In _hash_dropscanbuf(), the handling of hashso_bucket_buf is now
inconsistent with the handling of hashso_split_bucket_buf, which seems
suspicious. Suppose we enter this function with so->hashso_bucket_buf
and so->currPos.buf both being valid buffers, but not the same one.
It looks to me as if we'll release the first pin but not the second
one.  My guess (which could be wrong) is that so->hashso_bucket_buf =
InvalidBuffer should be moved back up higher in the function where it
was before, just after the first if statement, and that the new
condition so->hashso_bucket_buf == so->currPos.buf on the last
_hash_dropbuf() should be removed.  If that's not right, then I think
I need somebody to explain why not.

I am suspicious that _hash_kill_items() is going to have problems if
the overflow page is freed before it reacquires the lock.
_btkillitems() contains safeguards against similar cases.

This is not a full review, but I'm out of time for the moment.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Page Scan Mode in Hash Index

From

Ashutosh Sharma

Date:

27 March 2017, 16:34:54

Hi,

> I think you should consider refactoring this so that it doesn't need
> to use goto.  Maybe move the while (offnum <= maxoff) logic into a
> helper function and have it return itemIndex.  If itemIndex == 0, you
> can call it again.

okay, Added a helper function for _hash_readpage(). Please check v4
patch attached with this mail.

Notice that the code in the "else" branch of the
> if (itemIndex == 0) stuff could actually be removed from the else
> block without changing anything, because the "if" block always either
> returns or does a goto.  That's worthy of a little more work to try to
> make things simple and clear.

Corrected.

>
> + *      We find the first item(or, if backward scan, the last item) in
>
> Missing space.
>

Corrected.

> In _hash_dropscanbuf(), the handling of hashso_bucket_buf is now
> inconsistent with the handling of hashso_split_bucket_buf, which seems
> suspicious.

I have corrected it.

Suppose we enter this function with so->hashso_bucket_buf
> and so->currPos.buf both being valid buffers, but not the same one.
> It looks to me as if we'll release the first pin but not the second
> one.

Yes, that is because we no need to release pin on currPos.buf if it is
an overflow page. In page scan mode once we have saved all the
qualified tuples from a current page (could be an overflow page) into
items array we do release both pin and lock on a overflow page. This
was not true earlier, let us assume a case where we are supposed to
fetch only fixed number of tuples from a page using cursor and in such
a case once the number of tuples to be fetched is completed we simply
return with out releasing pin on a page being scanned. If this page is
an overflow page then by end of scan we will end up with pin held on
two buffers i.e. bucket buf and current buf which is basically
overflow buf.

My guess (which could be wrong) is that so->hashso_bucket_buf =
> InvalidBuffer should be moved back up higher in the function where it
> was before, just after the first if statement, and that the new
> condition so->hashso_bucket_buf == so->currPos.buf on the last
> _hash_dropbuf() should be removed.  If that's not right, then I think
> I need somebody to explain why not.

Okay, as i mentioned above, in case of page scan mode we only keep pin
on a bucket buf. There won't be any case where we will be having pin
on overflow buf at the end of scan. So, basically _hash_dropscan buf()
should only attempt to release pin on a bucket buf. And an attempt to
release pin on so->currPos.buf should only happen when
'so->hashso_bucket_buf == so->currPos.buf' or when
'so->hashso_split_bucket_buf == so->currPos.buf'

>
> I am suspicious that _hash_kill_items() is going to have problems if
> the overflow page is freed before it reacquires the lock.
> _btkillitems() contains safeguards against similar cases.

I have added similar check for hash_kill_items() as well.

>
> This is not a full review, but I'm out of time for the moment.

No worries. I will be ready for your further review comments any time.
Thanks for the review.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Attachment

Re: Page Scan Mode in Hash Index

From

Jesper Pedersen

Date:

30 March 2017, 00:06:13

Hi,

On 03/27/2017 09:34 AM, Ashutosh Sharma wrote:
> Hi,
>
>> I think you should consider refactoring this so that it doesn't need
>> to use goto.  Maybe move the while (offnum <= maxoff) logic into a
>> helper function and have it return itemIndex.  If itemIndex == 0, you
>> can call it again.
>
> okay, Added a helper function for _hash_readpage(). Please check v4
> patch attached with this mail.
>>>
>> This is not a full review, but I'm out of time for the moment.
>
> No worries. I will be ready for your further review comments any time.
> Thanks for the review.
>

This patch needs a rebase.

Best regards, Jesper

Re: Page Scan Mode in Hash Index

From

Ashutosh Sharma

Date:

30 March 2017, 04:16:53

I think you should consider refactoring this so that it doesn't need
to use goto. Maybe move the while (offnum <= maxoff) logic into a
helper function and have it return itemIndex. If itemIndex == 0, you
can call it again.

okay, Added a helper function for _hash_readpage(). Please check v4
patch attached with this mail.

>>
This is not a full review, but I'm out of time for the moment.

No worries. I will be ready for your further review comments any time.
Thanks for the review.

This patch needs a rebase.

Please try applying these patches on top of [1]. I think you should be able to apply it cleanly. Sorry, I think I forgot to mention this point in my earlier mail.

[1] - https://www.postgresql.org/message-id/CAE9k0P%3DV2LhtyeMXd295fhisp%3DNWUhRVJ9EZQCDowWiY9rSohQ%40mail.gmail.com

Re: Page Scan Mode in Hash Index

From

Jesper Pedersen

Date:

30 March 2017, 15:18:12

Hi Ashutosh,

On 03/29/2017 09:16 PM, Ashutosh Sharma wrote:
>> This patch needs a rebase.
>
> Please try applying these patches on top of [1]. I think you should be able
> to apply it cleanly. Sorry, I think I forgot to mention this point in my
> earlier mail.
>
> [1] -
> https://www.postgresql.org/message-id/CAE9k0P%3DV2LhtyeMXd295fhisp%3DNWUhRVJ9EZQCDowWiY9rSohQ%40mail.gmail.com
>

Thanks, that works !

As you have provided a patch for Robert's comments, and no other review 
have been posted I'm moving this patch to "Ready for Committer" for 
additional committer feedback.

Best regards, Jesper

Re: Page Scan Mode in Hash Index

From

Ashutosh Sharma

Date:

01 April 2017, 10:07:24

Hi,

>
> On 03/29/2017 09:16 PM, Ashutosh Sharma wrote:
>>>
>>> This patch needs a rebase.
>>
>>
>> Please try applying these patches on top of [1]. I think you should be
>> able
>> to apply it cleanly. Sorry, I think I forgot to mention this point in my
>> earlier mail.
>>
>> [1] -
>>
>> https://www.postgresql.org/message-id/CAE9k0P%3DV2LhtyeMXd295fhisp%3DNWUhRVJ9EZQCDowWiY9rSohQ%40mail.gmail.com
>>
>
> Thanks, that works !
>
> As you have provided a patch for Robert's comments, and no other review have
> been posted I'm moving this patch to "Ready for Committer" for additional
> committer feedback.

Please find the attached new version of patches for page scan mode in
hash index rebased on top of - [1].

[1] - https://www.postgresql.org/message-id/CAE9k0P%3D3rtgUtxopG%2BXQVxgASFzAnGd9sNmYUaj_%3DKeVsKGUdA%40mail.gmail.com

Attachment

Re: Page Scan Mode in Hash Index

From

Amit Kapila

Date:

01 April 2017, 15:23:51

On Mon, Mar 27, 2017 at 7:04 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>>
>> I am suspicious that _hash_kill_items() is going to have problems if
>> the overflow page is freed before it reacquires the lock.
>> _btkillitems() contains safeguards against similar cases.
>
> I have added similar check for hash_kill_items() as well.
>

I think hash_kill_items has a much bigger problem which is that we
can't kill the items if the page has been modified after re-reading
it.  Consider a case where Vacuum starts before the Scan on the bucket
and then Scan went ahead (which is possible after your 0003 patch) and
noted the killed items in killed item array and before it could kill
all the items, Vacuum remove those items.  Now it is quite possible
that before scan tries to kill those items, the corresponding itemids
have been used by different tuples.  I think what we need to do here
is to store the LSN of page first time when we have read the page and
then compare it with current page lsn after re-reading the page in
hash_kill_tems.

*
+ HashScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+} HashScanPosData;
..

HashScanPosItem *killedItems; /* tids and offset numbers of killed items */ int numKilled; /* number of currently
storeditems */

+
+ /*
+ * Identify all the matching items on a page and save them
+ * in HashScanPosData
+ */
+ HashScanPosData currPos; /* current position data */} HashScanOpaqueData;

After having array of HashScanPosItems as currPos.items, I think you
don't need array of HashScanPosItem for killedItems, just an integer
array of index in currPos.items should be sufficient.

*
I think below para in README is not valid after this patch.

"To keep concurrency reasonably good, we require readers to cope with
concurrent insertions, which means that they have to be able to
re-find
their current scan position after re-acquiring the buffer content lock
on page.  Since deletion is not possible while a reader holds the pin
on bucket, and we assume that heap tuple TIDs are unique, this can be
implemented by searching for the same heap tuple TID previously
returned.  Insertion does not move index entries across pages, so the
previously-returned index entry should always be on the same page, at
the same or higher offset number,
as it was before."

*
- next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-   LH_OVERFLOW_PAGE,
-   bstrategy);
-
 /*
- * release the lock on previous page after acquiring the lock on next
- * page
+ * As the hash index scan work in page at a time mode,
+ * vacuum can release the lock on previous page before
+ * acquiring lock on the next page. */
..
+ next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+   LH_OVERFLOW_PAGE,
+   bstrategy);
+

After this change, you need to modify comments on top of function
hashbucketcleanup() and _hash_squeezebucket().

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Page Scan Mode in Hash Index

From

Ashutosh Sharma

Date:

02 April 2017, 01:44:41

Hi,

Thanks for the review.

>>> I am suspicious that _hash_kill_items() is going to have problems if
>>> the overflow page is freed before it reacquires the lock.
>>> _btkillitems() contains safeguards against similar cases.
>>
>> I have added similar check for hash_kill_items() as well.
>>
>
> I think hash_kill_items has a much bigger problem which is that we
> can't kill the items if the page has been modified after re-reading
> it.  Consider a case where Vacuum starts before the Scan on the bucket
> and then Scan went ahead (which is possible after your 0003 patch) and
> noted the killed items in killed item array and before it could kill
> all the items, Vacuum remove those items.  Now it is quite possible
> that before scan tries to kill those items, the corresponding itemids
> have been used by different tuples.  I think what we need to do here
> is to store the LSN of page first time when we have read the page and
> then compare it with current page lsn after re-reading the page in
> hash_kill_tems.

Okay, understood. I have done the changes to handle this type of
scenario. Please refer to the attached patches. Thanks.

>
> *
> + HashScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
> +} HashScanPosData;
> ..
>
> HashScanPosItem *killedItems; /* tids and offset numbers of killed items */
>   int numKilled; /* number of currently stored items */
> +
> + /*
> + * Identify all the matching items on a page and save them
> + * in HashScanPosData
> + */
> + HashScanPosData currPos; /* current position data */
>  } HashScanOpaqueData;
>
>
> After having array of HashScanPosItems as currPos.items, I think you
> don't need array of HashScanPosItem for killedItems, just an integer
> array of index in currPos.items should be sufficient.

Corrected.

>
>
> *
> I think below para in README is not valid after this patch.
>
> "To keep concurrency reasonably good, we require readers to cope with
> concurrent insertions, which means that they have to be able to
> re-find
> their current scan position after re-acquiring the buffer content lock
> on page.  Since deletion is not possible while a reader holds the pin
> on bucket, and we assume that heap tuple TIDs are unique, this can be
> implemented by searching for the same heap tuple TID previously
> returned.  Insertion does not move index entries across pages, so the
> previously-returned index entry should always be on the same page, at
> the same or higher offset number,
> as it was before."

I have modified above paragraph in README file.

>
> *
> - next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
> -   LH_OVERFLOW_PAGE,
> -   bstrategy);
> -
>
>   /*
> - * release the lock on previous page after acquiring the lock on next
> - * page
> + * As the hash index scan work in page at a time mode,
> + * vacuum can release the lock on previous page before
> + * acquiring lock on the next page.
>   */
> ..
> + next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
> +   LH_OVERFLOW_PAGE,
> +   bstrategy);
> +
>
> After this change, you need to modify comments on top of function
> hashbucketcleanup() and _hash_squeezebucket().
>

Done.

Please note that these patches needs to be applied on top of  [1].

[1] - https://www.postgresql.org/message-id/CAE9k0P%3D3rtgUtxopG%2BXQVxgASFzAnGd9sNmYUaj_%3DKeVsKGUdA%40mail.gmail.com

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Hi,

On Tue, Apr 4, 2017 at 3:56 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, Apr 2, 2017 at 4:14 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>>
>> Please note that these patches needs to be applied on top of  [1].
>>
>
> Few more review comments:
>
> 1.
> - page = BufferGetPage(so->hashso_curbuf);
> + blkno = so->currPos.currPage;
> + if (so->hashso_bucket_buf == so->currPos.buf)
> + {
> + buf = so->currPos.buf;
> + LockBuffer(buf, BUFFER_LOCK_SHARE);
> + page = BufferGetPage(buf);
> + }
>
> Here, you are assuming that only bucket page can be pinned, but I
> think that assumption is not right.  When _hash_kill_items() is called
> before moving to next page, there could be a pin on the overflow page.
> You need some logic to check if the buffer is pinned, then just lock
> it.  I think once you do that, it might not be convinient to release
> the pin at the end of this function.

Yes, there are few cases where we might have pin on overflow pages too
and we need to handle such cases in _hash_kill_items. I have taken
care of this in the attached v7 patch. Thanks.

>
> Add some comments on top of _hash_kill_items to explain the new
> changes or say some thing like "See _bt_killitems for details"

Added some more comments on the new changes.

>
> 2.
> + /*
> + * We save the LSN of the page as we read it, so that we know whether it
> + * safe to apply LP_DEAD hints to the page later.  This allows us to drop
> + * the pin for MVCC scans, which allows vacuum to avoid blocking.
> + */
> + so->currPos.lsn = PageGetLSN(page);
> +
>
> The second part of above comment doesn't make sense because vacuum's
> will still be blocked because we hold the pin on primary bucket page.

That's right. It doesn't make sense because we won't allow vacuum to
start. I have removed it.

>
> 3.
> + {
> + /*
> + * No more matching tuples were found. return FALSE
> + * indicating the same. Also, remember the prev and
> + * next block number so that if fetching tuples using
> + * cursor we remember the page from where to start the
> + * scan.
> + */
> + so->currPos.prevPage = (opaque)->hasho_prevblkno;
> + so->currPos.nextPage = (opaque)->hasho_nextblkno;
>
> You can't read opaque without having lock and by this time it has
> already been released.

I have corrected it. Please refer to the attached v7 patch.

Also, I think if you want to save position for
> cursor movement, then you need to save the position of last bucket
> when scan completes the overflow chain, however as you have written it
> will be always invalid block number. I think there is similar problem
> with prevblock number.

Did you mean last bucket or last page. If it is last page, then I am
already storing it.

>
> 4.
> +_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
> +   OffsetNumber maxoff, ScanDirection dir)
> +{
> + HashScanOpaque so = (HashScanOpaque) scan->opaque;
> + IndexTuple      itup;
> + int itemIndex;
> +
> + if (ScanDirectionIsForward(dir))
> + {
> + /* load items[] in ascending order */
> + itemIndex = 0;
> +
> + /* new page, relocate starting position by binary search */
> + offnum = _hash_binsearch(page, so->hashso_sk_hash);
>
> What is the need to find offset number when this function already has
> that as an input parameter?

It's not required. I have removed it.

>
> 5.
> +_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
> +   OffsetNumber maxoff, ScanDirection dir)
>
> I think maxoff is not required to be passed a parameter to this
> function as you need it only for forward scan. You can compute it
> locally.

It is required for both forward and backward scan. However, I am not
passing it to _hash_load_qualified_items() now.

>
> 6.
> +_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
> +   OffsetNumber maxoff, ScanDirection dir)
> {
> ..
> + if (ScanDirectionIsForward(dir))
> + {
> ..
> + while (offnum <= maxoff)
> {
> ..
> + if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
> + _hash_checkqual(scan, itup))
> + {
> + /* tuple is qualified, so remember it */
> + _hash_saveitem(so, itemIndex, offnum, itup);
> + itemIndex++;
> + }
> +
> + offnum = OffsetNumberNext(offnum);
> ..
>
> Why are you traversing the whole page even when there is no match?
> There is a similar problem in backward scan. I think this is blunder.

Fixed. Please check the attached
'0001-Rewrite-hash-index-scans-to-work-a-page-at-a-timev7.patch'

>
> 7.
> + if (so->currPos.buf == so->hashso_bucket_buf ||
> + so->currPos.buf == so->hashso_split_bucket_buf)
> + {
> + so->currPos.prevPage = InvalidBlockNumber;
> + LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
> + }
> + else
> + {
> + so->currPos.prevPage = (opaque)->hasho_prevblkno;
> + _hash_relbuf(rel, so->currPos.buf);
> + }
> +
> + so->currPos.nextPage = (opaque)->hasho_nextblkno;
>
> What makes you think it is safe read opaque after releasing the lock?

Nothing makes me think to read opaque after releasing lock. It's a
mistake. I have corrected it. Please check attached v7 patch.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Page Scan Mode in Hash Index

From

Ashutosh Sharma

Date:

10 May 2017, 11:58:55

While doing the code coverage testing of v7 patch shared with - [1], I
found that there are few lines of code in _hash_next() that are
redundant and needs to be removed. I basically came to know this while
testing the scenario where a hash index scan starts when a split is in
progress. I have removed those lines and attached is the newer version
of patch.

[1] - https://www.postgresql.org/message-id/CAE9k0Pmn92Le0iOTU5b6087eRXzUnkq1PKcihxtokHJ9boXCBg%40mail.gmail.com

On Mon, May 8, 2017 at 6:55 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Hi,
>
> On Tue, Apr 4, 2017 at 3:56 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Sun, Apr 2, 2017 at 4:14 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>>>
>>> Please note that these patches needs to be applied on top of  [1].
>>>
>>
>> Few more review comments:
>>
>> 1.
>> - page = BufferGetPage(so->hashso_curbuf);
>> + blkno = so->currPos.currPage;
>> + if (so->hashso_bucket_buf == so->currPos.buf)
>> + {
>> + buf = so->currPos.buf;
>> + LockBuffer(buf, BUFFER_LOCK_SHARE);
>> + page = BufferGetPage(buf);
>> + }
>>
>> Here, you are assuming that only bucket page can be pinned, but I
>> think that assumption is not right.  When _hash_kill_items() is called
>> before moving to next page, there could be a pin on the overflow page.
>> You need some logic to check if the buffer is pinned, then just lock
>> it.  I think once you do that, it might not be convinient to release
>> the pin at the end of this function.
>
> Yes, there are few cases where we might have pin on overflow pages too
> and we need to handle such cases in _hash_kill_items. I have taken
> care of this in the attached v7 patch. Thanks.
>
>>
>> Add some comments on top of _hash_kill_items to explain the new
>> changes or say some thing like "See _bt_killitems for details"
>
> Added some more comments on the new changes.
>
>>
>> 2.
>> + /*
>> + * We save the LSN of the page as we read it, so that we know whether it
>> + * safe to apply LP_DEAD hints to the page later.  This allows us to drop
>> + * the pin for MVCC scans, which allows vacuum to avoid blocking.
>> + */
>> + so->currPos.lsn = PageGetLSN(page);
>> +
>>
>> The second part of above comment doesn't make sense because vacuum's
>> will still be blocked because we hold the pin on primary bucket page.
>
> That's right. It doesn't make sense because we won't allow vacuum to
> start. I have removed it.
>
>>
>> 3.
>> + {
>> + /*
>> + * No more matching tuples were found. return FALSE
>> + * indicating the same. Also, remember the prev and
>> + * next block number so that if fetching tuples using
>> + * cursor we remember the page from where to start the
>> + * scan.
>> + */
>> + so->currPos.prevPage = (opaque)->hasho_prevblkno;
>> + so->currPos.nextPage = (opaque)->hasho_nextblkno;
>>
>> You can't read opaque without having lock and by this time it has
>> already been released.
>
> I have corrected it. Please refer to the attached v7 patch.
>
> Also, I think if you want to save position for
>> cursor movement, then you need to save the position of last bucket
>> when scan completes the overflow chain, however as you have written it
>> will be always invalid block number. I think there is similar problem
>> with prevblock number.
>
> Did you mean last bucket or last page. If it is last page, then I am
> already storing it.
>
>>
>> 4.
>> +_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
>> +   OffsetNumber maxoff, ScanDirection dir)
>> +{
>> + HashScanOpaque so = (HashScanOpaque) scan->opaque;
>> + IndexTuple      itup;
>> + int itemIndex;
>> +
>> + if (ScanDirectionIsForward(dir))
>> + {
>> + /* load items[] in ascending order */
>> + itemIndex = 0;
>> +
>> + /* new page, relocate starting position by binary search */
>> + offnum = _hash_binsearch(page, so->hashso_sk_hash);
>>
>> What is the need to find offset number when this function already has
>> that as an input parameter?
>
> It's not required. I have removed it.
>
>>
>> 5.
>> +_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
>> +   OffsetNumber maxoff, ScanDirection dir)
>>
>> I think maxoff is not required to be passed a parameter to this
>> function as you need it only for forward scan. You can compute it
>> locally.
>
> It is required for both forward and backward scan. However, I am not
> passing it to _hash_load_qualified_items() now.
>
>>
>> 6.
>> +_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
>> +   OffsetNumber maxoff, ScanDirection dir)
>> {
>> ..
>> + if (ScanDirectionIsForward(dir))
>> + {
>> ..
>> + while (offnum <= maxoff)
>> {
>> ..
>> + if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
>> + _hash_checkqual(scan, itup))
>> + {
>> + /* tuple is qualified, so remember it */
>> + _hash_saveitem(so, itemIndex, offnum, itup);
>> + itemIndex++;
>> + }
>> +
>> + offnum = OffsetNumberNext(offnum);
>> ..
>>
>> Why are you traversing the whole page even when there is no match?
>> There is a similar problem in backward scan. I think this is blunder.
>
> Fixed. Please check the attached
> '0001-Rewrite-hash-index-scans-to-work-a-page-at-a-timev7.patch'
>
>>
>> 7.
>> + if (so->currPos.buf == so->hashso_bucket_buf ||
>> + so->currPos.buf == so->hashso_split_bucket_buf)
>> + {
>> + so->currPos.prevPage = InvalidBlockNumber;
>> + LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
>> + }
>> + else
>> + {
>> + so->currPos.prevPage = (opaque)->hasho_prevblkno;
>> + _hash_relbuf(rel, so->currPos.buf);
>> + }
>> +
>> + so->currPos.nextPage = (opaque)->hasho_nextblkno;
>>
>> What makes you think it is safe read opaque after releasing the lock?
>
> Nothing makes me think to read opaque after releasing lock. It's a
> mistake. I have corrected it. Please check attached v7 patch.
>
> --
> With Regards,
> Ashutosh Sharma
> EnterpriseDB:http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Page Scan Mode in Hash Index

From

Ashutosh Sharma

Date:

30 July 2017, 11:37:00

Hi,

On Wed, May 10, 2017 at 2:28 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> While doing the code coverage testing of v7 patch shared with - [1], I
> found that there are few lines of code in _hash_next() that are
> redundant and needs to be removed. I basically came to know this while
> testing the scenario where a hash index scan starts when a split is in
> progress. I have removed those lines and attached is the newer version
> of patch.
>

Please find the new version of patches for page at a time scan in hash
index rebased on top of latest commit in master branch. Also, i have
ran pgindent script with pg_bsd_indent version 2.0 on all the modified
files. Thanks.

> [1] - https://www.postgresql.org/message-id/CAE9k0Pmn92Le0iOTU5b6087eRXzUnkq1PKcihxtokHJ9boXCBg%40mail.gmail.com
>
> On Mon, May 8, 2017 at 6:55 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>> Hi,
>>
>> On Tue, Apr 4, 2017 at 3:56 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> On Sun, Apr 2, 2017 at 4:14 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>>>>
>>>> Please note that these patches needs to be applied on top of  [1].
>>>>
>>>
>>> Few more review comments:
>>>
>>> 1.
>>> - page = BufferGetPage(so->hashso_curbuf);
>>> + blkno = so->currPos.currPage;
>>> + if (so->hashso_bucket_buf == so->currPos.buf)
>>> + {
>>> + buf = so->currPos.buf;
>>> + LockBuffer(buf, BUFFER_LOCK_SHARE);
>>> + page = BufferGetPage(buf);
>>> + }
>>>
>>> Here, you are assuming that only bucket page can be pinned, but I
>>> think that assumption is not right.  When _hash_kill_items() is called
>>> before moving to next page, there could be a pin on the overflow page.
>>> You need some logic to check if the buffer is pinned, then just lock
>>> it.  I think once you do that, it might not be convinient to release
>>> the pin at the end of this function.
>>
>> Yes, there are few cases where we might have pin on overflow pages too
>> and we need to handle such cases in _hash_kill_items. I have taken
>> care of this in the attached v7 patch. Thanks.
>>
>>>
>>> Add some comments on top of _hash_kill_items to explain the new
>>> changes or say some thing like "See _bt_killitems for details"
>>
>> Added some more comments on the new changes.
>>
>>>
>>> 2.
>>> + /*
>>> + * We save the LSN of the page as we read it, so that we know whether it
>>> + * safe to apply LP_DEAD hints to the page later.  This allows us to drop
>>> + * the pin for MVCC scans, which allows vacuum to avoid blocking.
>>> + */
>>> + so->currPos.lsn = PageGetLSN(page);
>>> +
>>>
>>> The second part of above comment doesn't make sense because vacuum's
>>> will still be blocked because we hold the pin on primary bucket page.
>>
>> That's right. It doesn't make sense because we won't allow vacuum to
>> start. I have removed it.
>>
>>>
>>> 3.
>>> + {
>>> + /*
>>> + * No more matching tuples were found. return FALSE
>>> + * indicating the same. Also, remember the prev and
>>> + * next block number so that if fetching tuples using
>>> + * cursor we remember the page from where to start the
>>> + * scan.
>>> + */
>>> + so->currPos.prevPage = (opaque)->hasho_prevblkno;
>>> + so->currPos.nextPage = (opaque)->hasho_nextblkno;
>>>
>>> You can't read opaque without having lock and by this time it has
>>> already been released.
>>
>> I have corrected it. Please refer to the attached v7 patch.
>>
>> Also, I think if you want to save position for
>>> cursor movement, then you need to save the position of last bucket
>>> when scan completes the overflow chain, however as you have written it
>>> will be always invalid block number. I think there is similar problem
>>> with prevblock number.
>>
>> Did you mean last bucket or last page. If it is last page, then I am
>> already storing it.
>>
>>>
>>> 4.
>>> +_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
>>> +   OffsetNumber maxoff, ScanDirection dir)
>>> +{
>>> + HashScanOpaque so = (HashScanOpaque) scan->opaque;
>>> + IndexTuple      itup;
>>> + int itemIndex;
>>> +
>>> + if (ScanDirectionIsForward(dir))
>>> + {
>>> + /* load items[] in ascending order */
>>> + itemIndex = 0;
>>> +
>>> + /* new page, relocate starting position by binary search */
>>> + offnum = _hash_binsearch(page, so->hashso_sk_hash);
>>>
>>> What is the need to find offset number when this function already has
>>> that as an input parameter?
>>
>> It's not required. I have removed it.
>>
>>>
>>> 5.
>>> +_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
>>> +   OffsetNumber maxoff, ScanDirection dir)
>>>
>>> I think maxoff is not required to be passed a parameter to this
>>> function as you need it only for forward scan. You can compute it
>>> locally.
>>
>> It is required for both forward and backward scan. However, I am not
>> passing it to _hash_load_qualified_items() now.
>>
>>>
>>> 6.
>>> +_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
>>> +   OffsetNumber maxoff, ScanDirection dir)
>>> {
>>> ..
>>> + if (ScanDirectionIsForward(dir))
>>> + {
>>> ..
>>> + while (offnum <= maxoff)
>>> {
>>> ..
>>> + if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
>>> + _hash_checkqual(scan, itup))
>>> + {
>>> + /* tuple is qualified, so remember it */
>>> + _hash_saveitem(so, itemIndex, offnum, itup);
>>> + itemIndex++;
>>> + }
>>> +
>>> + offnum = OffsetNumberNext(offnum);
>>> ..
>>>
>>> Why are you traversing the whole page even when there is no match?
>>> There is a similar problem in backward scan. I think this is blunder.
>>
>> Fixed. Please check the attached
>> '0001-Rewrite-hash-index-scans-to-work-a-page-at-a-timev7.patch'
>>
>>>
>>> 7.
>>> + if (so->currPos.buf == so->hashso_bucket_buf ||
>>> + so->currPos.buf == so->hashso_split_bucket_buf)
>>> + {
>>> + so->currPos.prevPage = InvalidBlockNumber;
>>> + LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
>>> + }
>>> + else
>>> + {
>>> + so->currPos.prevPage = (opaque)->hasho_prevblkno;
>>> + _hash_relbuf(rel, so->currPos.buf);
>>> + }
>>> +
>>> + so->currPos.nextPage = (opaque)->hasho_nextblkno;
>>>
>>> What makes you think it is safe read opaque after releasing the lock?
>>
>> Nothing makes me think to read opaque after releasing lock. It's a
>> mistake. I have corrected it. Please check attached v7 patch.
>>
>> --
>> With Regards,
>> Ashutosh Sharma
>> EnterpriseDB:http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Fri, Aug 4, 2017 at 7:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, Jul 30, 2017 at 2:07 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>> Hi,
>>
>> On Wed, May 10, 2017 at 2:28 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>>> While doing the code coverage testing of v7 patch shared with - [1], I
>>> found that there are few lines of code in _hash_next() that are
>>> redundant and needs to be removed. I basically came to know this while
>>> testing the scenario where a hash index scan starts when a split is in
>>> progress. I have removed those lines and attached is the newer version
>>> of patch.
>>>
>>
>> Please find the new version of patches for page at a time scan in hash
>> index rebased on top of latest commit in master branch. Also, i have
>> ran pgindent script with pg_bsd_indent version 2.0 on all the modified
>> files. Thanks.
>>
>
> Few comments:

Thanks for reviewing the patch.

> 1.
> +_hash_kill_items(IndexScanDesc scan, bool havePin)
>
> I think you can do without the second parameter.  Can't we detect
> inside _hash_kill_items whether the page is pinned or not as we do for
> btree?

Okay, done that way. Please check attached v10 patch.

>
> 2.
> + /*
> + * We remember prev and next block number along with current block
> + * number so that if fetching the tup- les using cursor we know
> + * the page from where to startwith. This is for the case where we
> + * have re- ached the end of bucket chain without finding any
> + * matching tuples.
>
> The spelling of tuples and reached contain some unwanted symbol.  Have
> space between "startwith" or just use "begin"

Corrected.

>
> 3.
> + if (!BlockNumberIsValid((opaque)->hasho_nextblkno))
> + {
> + so->currPos.prevPage = (opaque)->hasho_prevblkno;
> + so->currPos.nextPage = InvalidBlockNumber;
> + }
>
> There is no need to use Parentheses around opaque.  I mean there is no
> problem with that, but it is redundant and makes code less readable.
> Also, there is similar usage at other places in the code, please
> change all another place as well.

Removed parenthesis around opaque.

 I think you can save the value of
> prevblkno in a local variable and use it in else part.

Okay, done that way.

>
> 4.
> + if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
> + _hash_checkqual(scan, itup))
> + {
> + /* tuple is qualified, so remember it */
> + _hash_saveitem(so, itemIndex, offnum, itup);
> + itemIndex++;
> + }
> + else
> +
> + /*
> + * No more matching tuples exist in this page. so, exit while
> + * loop.
> + */
> + break;
>
> It is better to have braces for the else part. It makes code look
> better.  The type of code exists few line down as well, change that as
> well.

Added braces in the else part.

>
> 5.
> + /*
> + * We save the LSN of the page as we read it, so that we know whether it
> + * safe to apply LP_DEAD hints to the page later.
> + */
>
> "whether it safe"/"whether it is safe"

Corrected.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

>
> Comments on the latest patch:

Thank you once again for reviewing my patches.

>
> 1.
> /*
> + * We remember prev and next block number along with current block
> + * number so that if fetching the tuples using cursor we know the
> + * page from where to begin. This is for the case where we have
> + * reached the end of bucket chain without finding any matching
> + * tuples.
> + */
> + if (!BlockNumberIsValid(opaque->hasho_nextblkno))
> + prev_blkno = opaque->hasho_prevblkno;
>
> This doesn't seem to be the right place for this comment as you are
> not saving next or current block number here. I am not sure whether
> you really need this comment at all. Will it be better if you move
> this to else part of the loop and reword it as:
>
> "Remember next and previous block numbers for scrollable cursors to
> know the start position."

Shifted the comments to else part of the loop.

>
> 2. The code in _hash_readpage looks quite bloated. I think we can
> make it better if we do something like below.
> a.
> +_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
> {
> ..
> + if (so->currPos.buf == so->hashso_bucket_buf ||
> + so->currPos.buf == so->hashso_split_bucket_buf)
> + {
> + so->currPos.prevPage = InvalidBlockNumber;
> + so->currPos.nextPage = opaque->hasho_nextblkno;
> + LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
> + }
> + else
> + {
> + so->currPos.prevPage = opaque->hasho_prevblkno;
> + so->currPos.nextPage = opaque->hasho_nextblkno;
> + _hash_relbuf(rel, so->currPos.buf);
> + so->currPos.buf = InvalidBuffer;
> + }
> ..
> }
>
> This code chunk is same for both forward and backward scans. I think
> you can have single copy of this code by moving it out of if-else
> loop.

Done.

>
> b.
> +_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
> {
> ..
> + /* new page, locate starting position by binary search */
> + offnum = _hash_binsearch(page, so->hashso_sk_hash);
> +
> + itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
> +
> + while (itemIndex == 0)
> + {
> + /*
> + * Could not find any matching tuples in the current page, move to
> + * the next page. Before leaving the current page, also deal with
> + * any killed items.
> + */
> + if (so->numKilled > 0)
> + _hash_kill_items(scan);
> +
> + /*
> + * We remember prev and next block number along with current block
> + * number so that if fetching the tuples using cursor we know the
> + * page from where to begin. This is for the case where we have
> + * reached the end of bucket chain without finding any matching
> + * tuples.
> + */
> + if (!BlockNumberIsValid(opaque->hasho_nextblkno))
> + prev_blkno = opaque->hasho_prevblkno;
> +
> + _hash_readnext(scan, &buf, &page, &opaque);
> + if (BufferIsValid(buf))
> + {
> + so->currPos.buf = buf;
> + so->currPos.currPage = BufferGetBlockNumber(buf);
> + so->currPos.lsn = PageGetLSN(page);
> + offnum = _hash_binsearch(page, so->hashso_sk_hash);
> + itemIndex = _hash_load_qualified_items(scan, page,
> + offnum, dir);
> ..
> }
>
> Have just one copy of code search the offset and load qualified items.
> Change the above while loop to do..while loop and have a check in
> between break the loop when item index is not zero.

Done that way.

>
> 3.
> - * Find the first item in the index that
> - * satisfies the qualification associated with the scan descriptor. On
> - * success, the page containing the current index tuple is read locked
> - * and pinned, and the scan's opaque data entry is updated to
> - * include the buffer.
> + * We find the first item (or, if backward scan, the last item) in
> + * the index that satisfies the qualification associated with the
> + * scan descriptor. On success, the page containing the current
> + * index tuple is read locked and pinned, and data about the
> + * matching tuple(s) on the page has been loaded into so->currPos,
> + * scan->xs_ctup.t_self is set to the heap TID of the current tuple.
> + *
> + * If there are no matching items in the index, we return FALSE,
> + * with no pins or locks held.
> */
> bool
> _hash_first(IndexScanDesc scan, ScanDirection dir)
> @@ -229,15 +297,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
>
> I don't think on success, the lock or pin is held anymore, after this
> patch the only pin will be retained and that too for bucket page.
> Also, there is no need to modify the part of the comment which is not
> related to change in this patch.

Corrected.

>
> I don't see scan->xs_ctup.t_self getting set anywhere in this
> function. I think you are setting it in hashgettuple. It is better
> to move that assignment from hashgettuple to _hash_first so as to be
> consistent with _hash_next.

Done that way.

>
> 4.
> + * On successful exit, scan->xs_ctup.t_self is set to the TID
> + * of the next heap tuple, and if requested, scan->xs_itup
> + * points to a copy of the index tuple. so->currPos is updated
> + * as needed.
> + *
> + * On failure exit (no more tuples), we return FALSE with no
> + * pins or locks held.
> */
> bool
> _hash_next(IndexScanDesc scan, ScanDirection dir)
>
> I don't see the usage of scan->xs_itup in this function. It seems to
> me that you have copied it from btree code and forgot to remove part
> of the comment which is not relevant to a hash index.

Corrected.

>
> 5.
> @@ -514,19 +422,14 @@ hashendscan(IndexScanDesc scan)
> {
> ..
> + if (HashScanPosIsValid(so->currPos))
> {
> - LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
> - _hash_kill_items(scan);
> - LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
> + /* Before leaving current page, deal with any killed items */
> + if (so->numKilled > 0)
> + _hash_kill_items(scan);
> + _hash_dropscanbuf(rel, so);
> }
>
> - _hash_dropscanbuf(rel, so);
> -
> ..
> }
>
> I don't think it is a good idea to move _hash_dropscanbuf as that
> check just ensures if the current buffer is valid. It doesn't say
> anything about other buffers saved in HashScanOpaque. A similar
> change is required in hashrescan.

Done.

>
> 6.
> _hash_kill_items(IndexScanDesc scan)
> {
> ..
> + if (so->hashso_bucket_buf == so->currPos.buf ||
> + HashScanPosIsPinned(so->currPos))
> ..
> }
>
> Isn't second check enough? It should indirectly cover the first test.

Re: [HACKERS] Page Scan Mode in Hash Index

From

Amit Kapila

Date:

23 August 2017, 14:38:10

On Tue, Aug 22, 2017 at 7:24 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> On Tue, Aug 22, 2017 at 3:55 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Okay, I got your point now. I think, currently in _hash_kill_items(),
> if an overflow page is pinned we do not check if it got modified since
> the last read or
> not. Hence, if the vacuum runs on an overflow page that is pinned and
> also has some dead tuples in it then it could create a problem for
> scan basically,
> when scan would attempt to mark the killed items as dead. To get rid
> of such problem, i think, even if an overflow page is pinned we should
> check if it got
> modified or not since the last read was performed on the page. If yes,
> then do not allow scan to mark killed items as dead. Attached is the
> newer version with these changes along with some other cosmetic
> changes mentioned in your earlier email. Thanks.
>

Thanks for the new version.  I again looked at the patches and fixed
quite a few comments in the code and ReadMe.  You have forgotten to
update README for the changes in vacuum patch
(0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v7).  I
don't have anything more to add.  If you are okay with changes, then
we can move it to Ready For Committer unless someone else has some
more comments.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Wed, Aug 23, 2017 at 7:39 PM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> On 08/23/2017 07:38 AM, Amit Kapila wrote:
>>
>> Thanks for the new version.  I again looked at the patches and fixed
>> quite a few comments in the code and ReadMe.  You have forgotten to
>> update README for the changes in vacuum patch
>> (0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v7).  I
>> don't have anything more to add.  If you are okay with changes, then
>> we can move it to Ready For Committer unless someone else has some
>> more comments.
>>
>
> Just some minor comments.

Thanks for the review.

>
> README:
> +       it's pin till the end of scan)
>
> its pin till the end of the scan)

Corrected.

>
> +To minimize lock/unlock traffic, hash index scan always searches entire
> hash
>
> To minimize lock/unlock traffic, hash index scan always searches the entire
> hash

Done.

>
> hashsearch.c:
>
> +static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
> +                          OffsetNumber offnum, IndexTuple itup);
>
> There are other instances of "inline" in the code base, so I guess that this
> is ok.
>
> +        * Advance to next tuple on current page; or if there's no more, try
> to
>
> Advance to the next tuple on the current page; or if done, try to
>

Done.

Attached are the patches with above changes.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Thanks for all your review comments. Please find my comments in-line.

On Tue, Sep 19, 2017 at 9:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Aug 24, 2017 at 11:26 AM, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> Based on the feedback in this thread, I have moved the patch to "Ready for
>> Committer".
>
> Reviewing 0001:
>
> _hash_readpage gets the page LSN to see if we can apply LP_DEAD hints,
> but if the table is unlogged or temporary, the LSN will never change,
> so the test in _hash_kill_items will always think that it's OK to
> apply the hints.  (This seems like it might be a pretty serious
> problem, because I'm not sure what would be a viable workaround.)
>

Amit has already replied to this query up-thread.

> The logic that tries to ensure that so->currPos.{buf,currPage,lsn} get
> updated is not, to my eyes, obviously correct. Generally, the logic
> for this stuff seems unnaturally spread out to me.  For example,
> _hash_next() updates currPos.buf, but leaves it to _hash_readpage to
> set currPage and lsn.  That function also sets all three fields when
> it advances to the next page by calling _hash_readnext(), but when it
> tries to advance to the next page and doesn't find one it sets buf but
> not currPage or lsn.  It seems to me that this logic should be more
> centralized somehow.  Can we arrange things so that at least buf,
> currPage, and lsn, and maybe also nextPage and prevPage, get updated
> at the same time and as soon after reading the buffer as possible?
>

Okay, I have tried to update currPos.{buf, currPage, lsn} in
_hash_readpage() at the same time. Please have a look into the
attached 0001*.patch.

When _hash_readpage() doesn't find any qualifying tuples i.e. when
_hash_readnext() returns Invalid buffer, we just update prevPage,
nextPage and buf in
currPos (not currPage or lsn) as currPage and lsn should point to last
page in the hash bucket so that we can mark the killed items as dead
at the end of scan (with the help of _hash_kill_items). Hence, we keep
the currpage and lsn as it is if no more valid hash pages are found.

> It would be bad if a primary bucket page's hasho_prevblkno field got
> copied into so->currPos.prevpage, because the value that appears for
> the primary bucket is not a real block number.  But _hash_readpage
> seems like it can bring about this exact situation, because of this
> code:
>
> +            if (!BlockNumberIsValid(opaque->hasho_nextblkno))
> +                prev_blkno = opaque->hasho_prevblkno;
> ...
> +                so->currPos.prevPage = prev_blkno;
>
> If we're reading the primary bucket page and there are no overflow
> pages, hasho_nextblkno will not be valid and hasho_prevblkno won't be
> a real block number.
>

Fixed. Thanks for putting that point.

> Incidentally, the "if" statement in the above block of code is
> probably not saving anything; I would suggest for clarity that you do
> the assignment unconditionally (but this is just a matter of style, so
> I don't feel super-strongly about it).
>
> +    return (so->currPos.firstItem <= so->currPos.lastItem);
>
> Can this ever return false?  It looks to me like we should never reach
> this code unless that condition holds, and that it would be a bug if
> we did.  If that's correct, maybe this should be an Assert() and the
> return statement should just return true.
>

No, it will never return FALSE. I have changed it to Assert statement.

> +        buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
> +
> +        /* It might not exist anymore; in which case we can't hint it. */
> +        if (!BufferIsValid(buf))
> +            return;
>
> This is dead code, because _hash_getbuf always returns a valid buffer.
> If there's actually a risk of the buffer disappearing, then some other
> handling is needed for this case.  But I suspect that, since a scan
> always holds a pin on the primary bucket, there's actually no way for
> this to happen and this is just dead code.
>

Removed the redundant code.

> The comment in hashgetbitmap claims that _hash_first() or _hash_next()
> never returns dead tuples.  If that were true, it would be a bug,
> because then scans started during recovery would return wrong answers.
> A more accurate statement would be something like: _hash_first and
> _hash_next handle eliminate dead index entries whenever
> scan->ignored_killed_tuples is true.  Therefore, there's nothing to do
> here except add the results to the TIDBitmap.
>

Corrected.

> _hash_readpage contains unnecessary "continue" statements inside the
> loops.  The reason that they're unnecessary is that there's no code
> below that in the loop anyway, so the loop is already going to go back
> around to the top. Whether to change this is a matter of style, so I
> won't complain too much if you want to leave it the way it is, but
> personally I'd favor removing them.
>

Ohh no, that's a silly mistake. I have corrected it.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Wed, Sep 20, 2017 at 8:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Sep 20, 2017 at 5:37 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>> Thanks for all your review comments. Please find my comments in-line.
>
> +            if (!BlockNumberIsValid(opaque->hasho_nextblkno))
> +            {
> +                if (so->currPos.buf == so->hashso_bucket_buf ||
> +                    so->currPos.buf == so->hashso_split_bucket_buf)
> +                    prev_blkno = InvalidBlockNumber;
> +                else
> +                    prev_blkno = opaque->hasho_prevblkno;
> +            }
>
> 1. Why not remove the outer "if" statement?
>

Yes, the outer if statement is not required. I just missed to remove
that in my earlier patch.

> 2. How about adding a comment, like /* If this is a primary bucket
> page, hasho_prevblkno is not a real block number. */
>

Added.

>> When _hash_readpage() doesn't find any qualifying tuples i.e. when
>> _hash_readnext() returns Invalid buffer, we just update prevPage,
>> nextPage and buf in
>> currPos (not currPage or lsn) as currPage and lsn should point to last
>> page in the hash bucket so that we can mark the killed items as dead
>> at the end of scan (with the help of _hash_kill_items). Hence, we keep
>> the currpage and lsn as it is if no more valid hash pages are found.
>
> How about adding a comment about this, by extending this comment:
>
> +                 * Remember next and previous block numbers for scrollable
> +                 * cursors to know the start position and return FALSE
> +                 * indicating that no more matching tuples were found.
>
> e.g. (Don't reset currPage or lsn, because we expect _hash_kill_items
> to be called for the old page after this function returns.)
>
>

Added.

Attached are the patches with above changes. Thanks.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Thu, Sep 21, 2017 at 9:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Sep 20, 2017 at 11:43 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>> Attached are the patches with above changes. Thanks.
>
> Thanks.  I think that the comments and README changes in 0003 need
> significantly more work.  In several places, they fail to note the
> unlogged vs. logged differences, and the header comment for
> hashbucketcleanup still says that scans depend on increasing-TID order
> (really, 0001 should change that text somehow).
>

I have added a note for handling of logged and unlogged tables in
README file and also corrected the header comment for
hashbucketcleanup(). Please find the attached 0003*.patch having these
changes. Thanks.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Page Scan Mode in Hash Index

From

Robert Haas

Date:

22 September 2017, 21:26:14

On Thu, Sep 21, 2017 at 3:08 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> I have added a note for handling of logged and unlogged tables in
> README file and also corrected the header comment for
> hashbucketcleanup(). Please find the attached 0003*.patch having these
> changes. Thanks.

I committed 0001 and 0002 with some additional edits as
7c75ef571579a3ad7a1d3ee909f11dba5e0b9440.  I also rebased 0003 and
edited it a bit; see attached hash-cleanup-changes.patch.

I'm not entirely sold on 0003.  An alternative would be to rip the lsn
stuff back out of HashScanPosData, and I think we ought to consider
that.  Basically, 0003 is betting that getting rid of the
lock-chaining in hash index vacuum is more valuable than being able to
kill dead items more aggressively.  I bet that's a bad bet.

In the case of btree indexes, since
2ed5b87f96d473962ec5230fd820abfeaccb2069, page-at-a-time scanning
allows most btree index scans to avoid holding buffer pins when the
scan is suspended, but we gain no such advantage here.  We always have
to hold a pin on the primary bucket page anyway, so even with this
patch cleanup is going to block when it hits a bucket containing a
suspended scan.  0003 helps if (a) the relation is permanent, (b) the
bucket has overflow pages, and (c) the scan is moving faster than
vacuum and can overtake it instead of waiting.  But that doesn't seem
like it will happen very often at all, whereas the LSN check will
probably fail frequently and cause us to skip cleanup that we could
usefully have done.  So I propose the attached hashscan-no-lsn.patch
as an alternative.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Page Scan Mode in Hash Index

From

Ashutosh Sharma

Date:

22 September 2017, 22:44:50

On Fri, Sep 22, 2017 at 11:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Sep 21, 2017 at 3:08 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>> I have added a note for handling of logged and unlogged tables in
>> README file and also corrected the header comment for
>> hashbucketcleanup(). Please find the attached 0003*.patch having these
>> changes. Thanks.
>
> I committed 0001 and 0002 with some additional edits as
> 7c75ef571579a3ad7a1d3ee909f11dba5e0b9440.  I also rebased 0003 and
> edited it a bit; see attached hash-cleanup-changes.patch.
>

Thanks for the commit. I had put lot of efforts for this and very
happy that it got committed. Thanks to Amit too for the detail review.

> I'm not entirely sold on 0003.  An alternative would be to rip the lsn
> stuff back out of HashScanPosData, and I think we ought to consider
> that.  Basically, 0003 is betting that getting rid of the
> lock-chaining in hash index vacuum is more valuable than being able to
> kill dead items more aggressively.  I bet that's a bad bet.
>
> In the case of btree indexes, since
> 2ed5b87f96d473962ec5230fd820abfeaccb2069, page-at-a-time scanning
> allows most btree index scans to avoid holding buffer pins when the
> scan is suspended, but we gain no such advantage here.  We always have
> to hold a pin on the primary bucket page anyway, so even with this
> patch cleanup is going to block when it hits a bucket containing a
> suspended scan.  0003 helps if (a) the relation is permanent, (b) the
> bucket has overflow pages, and (c) the scan is moving faster than
> vacuum and can overtake it instead of waiting.  But that doesn't seem
> like it will happen very often at all, whereas the LSN check will
> probably fail frequently and cause us to skip cleanup that we could
> usefully have done.  So I propose the attached hashscan-no-lsn.patch
> as an alternative.
>
> Thoughts?
>
> --

Yes, I too feel that 0003 patch won't help much. The reason being, the
chances of scan overtaking vacuum would be very rare and also
considering the fact that hash index is normally meant for unique
values (I mean that is when hash index is quite dominant over other
indexes) which means the chances of overflow pages in hash index won't
be high. Therefore, i feel, 0003 patch won't be much beneficial.
Honestly speaking, the code changes in 0003 looks a bit ugly as well.
So, yes, hashscan no-lsn.patch looks like a better option to me.
Thanks.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Page Scan Mode in Hash Index

From

Amit Kapila

Date:

25 September 2017, 07:25:24

On Fri, Sep 22, 2017 at 11:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Sep 21, 2017 at 3:08 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>> I have added a note for handling of logged and unlogged tables in
>> README file and also corrected the header comment for
>> hashbucketcleanup(). Please find the attached 0003*.patch having these
>> changes. Thanks.
>
> I committed 0001 and 0002 with some additional edits as
> 7c75ef571579a3ad7a1d3ee909f11dba5e0b9440.
>

I have noticed a typo in that commit (in Readme) and patch for the
same is attached.

>  I also rebased 0003 and
> edited it a bit; see attached hash-cleanup-changes.patch.
>
> I'm not entirely sold on 0003.  An alternative would be to rip the lsn
> stuff back out of HashScanPosData, and I think we ought to consider
> that.  Basically, 0003 is betting that getting rid of the
> lock-chaining in hash index vacuum is more valuable than being able to
> kill dead items more aggressively.  I bet that's a bad bet.
>
> In the case of btree indexes, since
> 2ed5b87f96d473962ec5230fd820abfeaccb2069, page-at-a-time scanning
> allows most btree index scans to avoid holding buffer pins when the
> scan is suspended, but we gain no such advantage here.  We always have
> to hold a pin on the primary bucket page anyway, so even with this
> patch cleanup is going to block when it hits a bucket containing a
> suspended scan.  0003 helps if (a) the relation is permanent, (b) the
> bucket has overflow pages, and (c) the scan is moving faster than
> vacuum and can overtake it instead of waiting.  But that doesn't seem
> like it will happen very often at all, whereas the LSN check will
> probably fail frequently and cause us to skip cleanup that we could
> usefully have done.  So I propose the attached hashscan-no-lsn.patch
> as an alternative.
>

I think your proposal makes sense.  Your patch looks good but you
might want to tweak the comments atop _hash_kill_items ("However,
having pin on the overflow page doesn't guarantee that vacuum won't
delete any items.).  That part of the comment has been written to
indicate that we have to check LSN in this function unconditonally.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

typo_hash_readme_v1.patch

Re: [HACKERS] Page Scan Mode in Hash Index

From

Robert Haas

Date:

26 September 2017, 16:40:10

On Mon, Sep 25, 2017 at 12:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think your proposal makes sense.  Your patch looks good but you
> might want to tweak the comments atop _hash_kill_items ("However,
> having pin on the overflow page doesn't guarantee that vacuum won't
> delete any items.).  That part of the comment has been written to
> indicate that we have to check LSN in this function unconditonally.

OK, committed.

And I think that's the last of the hash index work.  Thanks to Amit
Kapila, Ashutosh Sharma, Mithun Cy, Kuntal Ghosh, Jesper Pedersen, and
Dilip Kumar for all the patches and reviews and to Jeff Janes and
others for additional code review and testing!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers